Introduction to Xarray
In data science settings, it is common to work with datasets that are tabular by nature and therefore easy to manage. However, how do these same ideas translate to designing datasets and data structures that take specific domain knowledge into account? For example, in the earth and climate sciences it is important to manage remote sensing data, which usually comes in the form of large multidimensional arrays: two-dimensional spatial fields that can also carry time series and measurements in more than one spectral band.
Resources:
Acknowledgment: A large part of the content of this notebook was prepared by Dr. Chelle Gentemann.
1. Motivation
Every library makes assumptions about how data is structured. For example, the two basic types of data structures we work with in Python are:
Lists, matrices, multidimensional arrays: These are very common structures in the physical sciences. The main library for manipulating these tensorial structures in Python is numpy.
Tabular data: The main library is pandas, which assumes that rows are observations and columns are features. This is a natural data structure that we often find in data science projects. However, it does not cover all the types of data we want to work with.
However, when we start dealing with multidimensional data (e.g., three-dimensional data involving latitude, longitude and time) we run into problems, including:
How do we keep track of which variables are our coordinates? Latitude, longitude and time are quite special physical quantities. If we just work with numpy arrays, we have no way of knowing which dimension corresponds to each coordinate.
How do we store multiple datasets that use the same coordinate system? Even worse, how do we keep track of datasets with different dimensions? For example, for a dataset that collects temperature measurements, we can imagine using a three-dimensional array (lat, lon, time). However, we may also want to include a dataset of surface elevations, for which time is irrelevant and the array has dimensions (lat, lon).
How can we include information about the dataset in our array? This includes the metadata, units and product specifications. Notice that numpy arrays don't carry units.
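The problems listed above can be illustrated with a small sketch. It uses synthetic values on a hypothetical 2 x 3 x 4 grid (none of these numbers come from ERA5) and compares a bare numpy array with the same values wrapped in an xarray.DataArray that carries dimension names, coordinates and units.

```python
import numpy as np
import xarray as xr

# Synthetic values on a hypothetical (time, latitude, longitude) grid
values = np.arange(24, dtype="float32").reshape(2, 3, 4)

# With plain numpy, nothing records which axis is which, or the units
print(values.shape)  # (2, 3, 4)

# The same values with named dimensions, coordinates and metadata
temperature = xr.DataArray(
    values,
    dims=("time", "latitude", "longitude"),
    coords={
        "time": np.array(["2000-01-01", "2000-02-01"], dtype="datetime64[ns]"),
        "latitude": [10.0, 20.0, 30.0],
        "longitude": [0.0, 90.0, 180.0, 270.0],
    },
    attrs={"units": "K", "long_name": "synthetic air temperature"},
)

print(temperature.dims)            # ('time', 'latitude', 'longitude')
print(temperature.attrs["units"])  # K

# Selecting by coordinate value instead of by axis position
series = temperature.sel(latitude=20.0, longitude=90.0)
print(series.dims)                 # ('time',)
```

The array itself is unchanged; the labels and attributes simply travel with it.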
2. Xarray
# Stdlib imports
from pathlib import Path
# Third-party imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
# Small style adjustments for more readable plots
plt.style.use("seaborn-v0_8-whitegrid")
plt.rcParams["figure.figsize"] = (8, 6)
plt.rcParams["font.size"] = 14
For the purposes of this tutorial, we are going to work with the ERA5 reanalysis data product:
Atmospheric global climate reanalyses
From 1979-2019, hourly estimates of atmospheric, land and oceanic climate variables.
30km global grid with 137 vertical grid points.
DATA_DIR = Path.home()/Path('shared/climate-data')
monthly_2deg_path = DATA_DIR / "era5_monthly_2deg_aws_v20210920.nc"
ds = xr.open_dataset(monthly_2deg_path)
We can open the .nc (netCDF) file directly as an xarray.Dataset. An xarray.Dataset consists of a collection of objects, including:
Dimensions
Coordinates
Data Variables
You can directly visualize all these objects by displaying the Dataset:
ds
There are a few things we can observe here:
Types of each data object and their respective dimensions.
Metadata
We can inspect the attributes of each variable by clicking on the icons next to it.
2.1. Basic exploration
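To make the three ingredients concrete, here is a minimal sketch that builds a toy Dataset by hand. The variable names, grid and values are invented for illustration and are unrelated to the ERA5 file. It shows two data variables with different dimensions sharing one set of coordinates.

```python
import numpy as np
import xarray as xr

# Hypothetical coordinates for a tiny grid
lat = [10.0, 20.0, 30.0]
lon = [0.0, 90.0, 180.0, 270.0]
time = np.array(["2000-01-01", "2000-02-01"], dtype="datetime64[ns]")

toy = xr.Dataset(
    data_vars={
        # Depends on time and space
        "temperature": (("time", "latitude", "longitude"), np.full((2, 3, 4), 280.0)),
        # Depends on space only
        "elevation": (("latitude", "longitude"), np.zeros((3, 4))),
    },
    coords={"time": time, "latitude": lat, "longitude": lon},
    attrs={"title": "toy dataset for illustration"},
)

print(list(toy.data_vars))     # ['temperature', 'elevation']
print(toy["elevation"].dims)   # ('latitude', 'longitude')
print(toy.sizes["time"])       # 2
```

Displaying `toy` in a notebook gives the same kind of summary as displaying `ds`.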
The data variables can have different contents, but they all share the same coordinates. To access the dimensions and coordinates, together with their names, we can use .dims and .coords:
ds.dims
FrozenMappingWarningOnValuesAccess({'time': 504, 'latitude': 90, 'longitude': 180})
ds.coords
Coordinates:
  * time (time) datetime64[ns] 4kB 1979-01-16T11:30:00 ... 2020-12-16T1...
  * latitude (latitude) float32 360B -88.88 -86.88 -84.88 ... 87.12 89.12
  * longitude (longitude) float32 720B 0.875 2.875 4.875 ... 354.9 356.9 358.9
Just as in pandas, we can read the data variables using two different syntaxes.
temp = ds["air_temperature_at_2_metres"]
temp
ds.air_temperature_at_2_metres
Notice that it is easier to keep track of what our code is doing, since the variable names tell us what the data are.
2.2. Subsetting data
There are three different ways of accessing subsets of the data:
[]: Simple and flexible, but confusing. Also notice that positional indexing like this only works on DataArrays, while the next two methods can be applied directly to the whole Dataset.
isel: Integer, positional, end-point exclusive. Still error-prone.
sel: Label-based, end-point inclusive.
In general, when working with these datasets it is better to use sel.
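The end-point behaviour is easiest to see on a tiny example. The sketch below uses a made-up one-dimensional array with five latitude labels (not the ERA5 grid) and slices it with each of the three approaches.

```python
import numpy as np
import xarray as xr

# Hypothetical 1-D array: values 0, 10, ..., 40 at latitudes 0, 2, ..., 8
da = xr.DataArray(
    np.arange(5) * 10.0,
    dims="latitude",
    coords={"latitude": [0.0, 2.0, 4.0, 6.0, 8.0]},
)

by_brackets = da[1:3]                         # positional, like numpy
by_position = da.isel(latitude=slice(1, 3))   # positions 1 and 2 (end excluded)
by_label = da.sel(latitude=slice(2.0, 6.0))   # labels 2.0 to 6.0 (end included)

print(by_brackets.latitude.values)  # [2. 4.]
print(by_position.latitude.values)  # [2. 4.]
print(by_label.latitude.values)     # [2. 4. 6.]
```

Note that `isel` and `sel` name the dimension explicitly, so the code still reads correctly if the order of the dimensions changes.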
What is the difference between slicing first and then calling .data, and doing it the other way around?
temp[0, 63, 119].data
array(280.93103, dtype=float32)
temp.data[0, 63, 119]
280.93103
temp.isel(time=0,
          latitude=63,
          longitude=119).data
array(280.93103, dtype=float32)
temp.sel(time="1979-01",
         latitude=37.125,
         longitude=238.875).data
array([280.93103], dtype=float32)
What happens if we want to access the closest point in latitude and longitude? If the key we use for latitude, longitude or time is not in the dataset, an exact-match sel raises a KeyError. Passing method='nearest' selects the closest grid point instead:
temp.sel(time="1979-01", latitude=37.126, longitude=238.875, method='nearest')
# Manual alternative: find the grid point that minimizes the squared distance
lat, lon = 37.126, 238.875
abslat = np.abs(ds.latitude - lat)
abslon = np.abs(ds.longitude - lon)
# Broadcasts to a 2-D (latitude, longitude) array of squared distances
distance2 = abslat ** 2 + abslon ** 2
([xloc, yloc]) = np.where(distance2 == np.min(distance2))
temp.sel(time="1979-01", latitude=ds.latitude.data[xloc], longitude=ds.longitude.data[yloc])
We can also access slices of data:
temp.isel(time=0, latitude=slice(63, 65), longitude=slice(119, 125))
temp.sel(time="1979-01",
         latitude=slice(temp.latitude[63], temp.latitude[65]),
         longitude=slice(temp.longitude[119], temp.longitude[125]))
2.3. Making plots
Making plots with xarray is extremely easy. One of the advantages of using xarray is that the plot automatically includes axis information, since every numerical value in the array is labeled, either by its coordinates or by the name of the data variable.
ds.air_temperature_at_2_metres.sel(time="1979-01").plot();
This is just a single line of code... quite impressive.
Also, depending on what we want to plot, xarray will work out which type of plot to make. For example, if we select a single latitude and longitude, only the time dimension remains and xarray plots a time series:
ds.air_temperature_at_2_metres.sel(latitude=37.125, longitude=238.875).plot();
2.4. Operations with Xarray
In many respects, you can manipulate xarray objects as if you were working with pandas DataFrames. For example, you can compute the mean over time with
ds.air_temperature_at_2_metres.mean("time")
ds.mean("time").air_temperature_at_2_metres
ds.mean("time").air_temperature_at_2_metres.plot();
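As a small check of what a reduction by name does, the sketch below uses a made-up 2 x 2 x 2 array (not ERA5 data). It shows that averaging over "time" removes that dimension, that it matches the numpy reduction over the corresponding axis, and that reducing the Dataset first or the variable first gives the same result.

```python
import numpy as np
import xarray as xr

# Hypothetical values with shape (time=2, latitude=2, longitude=2)
values = np.array([
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
])
da = xr.DataArray(values, dims=("time", "latitude", "longitude"), name="temperature")
toy = da.to_dataset()

# Reduce over a named dimension; no need to remember the axis number
time_mean = da.mean("time")
print(time_mean.dims)  # ('latitude', 'longitude')

# Same numbers as the numpy reduction over axis 0
print(np.allclose(time_mean.values, values.mean(axis=0)))  # True

# Variable-then-mean and mean-then-variable agree
print(time_mean.equals(toy.mean("time")["temperature"]))   # True
```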
2.5. Groupby
Just as in pandas, you can use groupby with xarray to group dataset variables as a function of their coordinates. The most important argument of groupby() is group. When grouping by time attributes, you can use the .dt accessor of the time coordinate to extract datetime components such as the year or the month.
ds.groupby(ds.time.dt.year).mean()
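The grouping step can be checked on a synthetic monthly series where the answer is known in advance. In this sketch (invented values, two years of monthly data) the first year is all 1.0 and the second all 3.0, so the annual means and the monthly climatology are easy to verify by eye.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical monthly series: 2000 is all 1.0, 2001 is all 3.0
time = pd.date_range("2000-01-01", periods=24, freq="MS")
values = np.concatenate([np.full(12, 1.0), np.full(12, 3.0)])
da = xr.DataArray(values, dims="time", coords={"time": time}, name="temperature")

# One mean per calendar year
annual = da.groupby(da.time.dt.year).mean()
print(annual.year.values)  # [2000 2001]
print(annual.values)       # [1. 3.]

# One mean per calendar month, averaged across years (a climatology)
monthly_clim = da.groupby(da.time.dt.month).mean()
print(monthly_clim.sizes["month"])  # 12
```

The grouped result gets a new dimension named after the datetime component, here `year` or `month`, replacing the original `time` dimension.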