Multi-D arrays and datasets with xarray
The xarray library is built on top of numpy and pandas, and it brings the power of pandas to multidimensional arrays. There are two main data structures in xarray:
- xarray.DataArray is a fancy, labelled version of numpy.ndarray
- xarray.Dataset is a collection of multiple xarray.DataArray’s that share dimensions
Data array: simple example from scratch
import xarray as xr
import numpy as np
data = xr.DataArray(
    np.random.random(size=(4,3)),
    dims=("y","x"),   # dimension names (row,col); we want `y` to represent rows and `x` columns
    coords={"x": [10,11,12], "y": [10,20,30,40]}   # coordinate labels/values
)
type(data)   # <class 'xarray.core.dataarray.DataArray'>
We can access various attributes of this array:
data.values               # the 2D numpy array
data.values[0,0] = 0.53   # can modify in-place
data.dims                 # ('y', 'x')
data.coords               # all coordinates
data.coords['x']          # one coordinate
data.coords['x'][1]       # a number
data.x[1]                 # the same
Let’s add some arbitrary metadata:
= {"author": "Alex", "date": "2020-08-26"}
data.attrs "name"] = "density"
data.attrs["units"] = "g/cm^3"
data.attrs["units"] = "cm"
data.x.attrs["units"] = "cm"
data.y.attrs[# global attributes
data.attrs # global attributes show here as well
data # only `x` attributes data.x
Subsetting arrays
We can subset using the usual Python square brackets:
data[0,:]    # first row
data[:,-1]   # last column
In addition, xarray provides these functions:
- isel() selects by coordinate index; could be replaced by [index1] or [index1,…]
- sel() selects by coordinate value
- interp() interpolates by coordinate value
data.isel()      # same as `data`
data.isel(y=1)   # second row
data.isel(x=2)   # third column
data.isel(y=0, x=[-2,-1])   # first row, last two columns

data.x.dtype     # it is integer
data.sel(x=10)   # certain value of `x`
data.y           # array([10, 20, 30, 40])
data.sel(y=slice(15,30))    # only values with 15<=y<=30 (two rows)
There are aggregate functions, e.g.:
meanOfEachColumn = data.mean(dim='y')    # apply mean over y
spatialMean = data.mean()
spatialMean = data.mean(dim=['x','y'])   # same
Finally, we can interpolate. However, this requires the scipy library and might throw some warnings, so use at your own risk:
data.interp(x=10.5, y=10)   # first row, between 1st and 2nd columns
data.interp(x=10.5, y=15)   # between 1st and 2nd rows, between 1st and 2nd columns
?data.interp                # can use different interpolation methods
Plotting
Matplotlib is integrated directly into xarray:
data.plot(size=5)                         # 2D heatmap
data.isel(x=0).plot(marker="o", size=5)   # 1D line plot
Vectorized operations
You can perform element-wise operations on xarray.DataArray like with numpy.ndarray:
data + 100                          # element-wise like numpy arrays
(data - data.mean()) / data.std()   # normalize the data
data - data[0,:]                    # use numpy broadcasting → subtract first row from all rows
Split your data into multiple independent groups
"x") # 3 groups with labels 10, 11, 12; each column becomes a group
data.groupby("x")[10] # specific group (first column)
data.groupby("x").map(lambda v: v-v.min()) # apply separately to each group
data.groupby(# from each column (fixed x) subtract the smallest value in that column
Dataset: simple example from scratch
Let’s initialize two 2D arrays with identical dimensions:
= {"x": np.linspace(0,1,5), "y": np.linspace(0,1,5)}
coords = xr.DataArray( # first 2D array
temp 20 + np.random.randn(5,5),
=("y","x"),
dims=coords
coords
)= xr.DataArray( # second 2D array
pres 100 + 10*np.random.randn(5,5),
=("y","x"),
dims=coords
coords )
From these we can form a dataset:
= xr.Dataset({"temperature": temp, "pressure": pres,
ds "bar": ("x", 200+np.arange(5)), "pi": np.pi})
ds
As you can see, ds includes two 2D arrays on the same grid, one 1D array on x, and one number:
ds.temperature   # 2D array
ds.bar           # 1D array
ds.pi            # one element
Subsetting works the usual way:
ds.sel(x=0)               # each 2D array becomes a 1D array, the 1D array becomes a number, plus a number
ds.temperature.sel(x=0)   # 'temperature' is now a 1D array
ds.temperature.sel(x=0.25, y=0.5)   # one element of `temperature`
We can save this dataset to a file:
%pip install netcdf4
"test.nc")
ds.to_netcdf(= xr.open_dataset("test.nc") # try reading it new
We can even try opening this 2D dataset in ParaView - select (y,x) and deselect Spherical.
Recall the 2D function we plotted when we were talking about numpy’s array broadcasting. Let’s scale it to a unit square x,y∈[0,1]:
x = np.linspace(0, 1, 50)
y = np.linspace(0, 1, 50).reshape(50,1)
z = np.sin(5*x)**8 + np.cos(5+25*x*y)*np.cos(5*x)
This will be our image at z=0. Then rotate this image 90 degrees (i.e. flip x and y), and this will be our function at z=1. Now interpolate linearly between z=0 and z=1 to build a 3D function in the unit cube x,y,z∈[0,1]. Check what the function looks like at intermediate z. Write out a NetCDF file with the 3D function.
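One possible solution sketch, reusing x, y, and z from the snippet above; the names bottom, top, nz, and the output file name cube.nc are my own choices, not from the original lesson:
nz = 50                   # vertical resolution, an arbitrary choice
bottom = z                # the image at z=0
top = z.T                 # flip x and y: the image at z=1
f = np.zeros((nz, 50, 50))
for k in range(nz):
    s = k / (nz - 1)                  # dimensionless height from 0 to 1
    f[k,:,:] = (1-s)*bottom + s*top   # linear interpolation between the two images
cube = xr.Dataset(
    {"f": (("z","y","x"), f)},
    coords={"x": x, "y": y.flatten(), "z": np.linspace(0,1,nz)}
)
cube.f.isel(z=25).plot(size=8)   # check the function at an intermediate height
cube.to_netcdf("cube.nc")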
Time series data
In xarray you can work with time-dependent data. Xarray accepts pandas time formatting, e.g. pd.to_datetime("2020-09-10")
would produce a timestamp. To produce a time range, we can use:
import pandas as pd
= pd.date_range("2000-01-01", freq="D", periods=365*3+1) # 2000-Jan-01 to 2002-Dec-31 (3 full years)
time
time# 1096 days
time.shape # same length (1096), but each element is replaced by the month number
time.month # same length (1096), but each element is replaced by the day-of-the-month
time.day ?pd.date_range
Using this time
construct, let’s initialize a time-dependent dataset that contains a scalar temperature variable (no space) mimicking seasonal change. We can do this directly without initializing an xarray.DataArray first – we just need to specify what this temperature variable depends on:
import xarray as xr
import numpy as np
ntime = len(time)
temp = 10 + 5*np.sin((250+np.arange(ntime))/365.25*2*np.pi) + 2*np.random.randn(ntime)
ds = xr.Dataset({ "temperature": ("time", temp),   # it's a 1D function of time
                  "time": time })
ds.temperature.plot(size=8)
We can do the usual subsetting:
ds.isel(time=100)           # 101st timestep
ds.sel(time="2002-12-22")
Time dependency in xarray allows resampling with a different timestep:
ds.resample(time='7D')                   # 1096 times -> 157 time groups
weekly = ds.resample(time='7D').mean()   # compute mean for each group
weekly.dims
weekly.temperature.plot(size=8)
Now, let’s combine spatial and time dependency and construct a dataset containing two 2D variables (temperature and pressure) varying in time. The time dependency is baked into the coordinates of these xarray.DataArray’s and should come before the spatial coordinates:
= pd.date_range("2020-01-01", freq="D", periods=91) # January - March 2020
time = len(time)
ntime = 100 # spatial resolution in each dimension
n = np.linspace(0,1,n)
axis = np.meshgrid(axis,axis) # 2D Cartesian meshes of x,y coordinates
X, Y = (1-Y)*np.sin(np.pi*X) + Y*(np.sin(2*np.pi*X))**2
initialState = (1-X)*np.sin(np.pi*Y) + X*(np.sin(2*np.pi*Y))**2
finalState = np.zeros((ntime,n,n))
f for t in range(ntime):
= (t+0.5) / ntime # dimensionless time from 0 to 1
z = (1-z)*initialState + z*finalState
f[t,:,:]
= {"time": time, "x": axis, "y": axis}
coords = xr.DataArray(
temp 20 + f, # this 2D array varies in time from initialState to finalState
=("time","y","x"),
dims=coords
coords
)= xr.DataArray( # random 2D array
pres 100 + 10*np.random.randn(ntime,n,n),
=("time","y","x"),
dims=coords
coords
)= xr.Dataset({"temperature": temp, "pressure": pres})
ds ="2020-03-15").temperature.plot(size=8) # temperature distribution on a specific date
ds.sel(time"evolution.nc") ds.to_netcdf(
The file evolution.nc should be 100² points × 2 variables × 8 bytes × 91 steps ≈ 14.6 MB. We can load it into ParaView and play back the pressure and temperature!
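As a quick sanity check (my own addition, not part of the original lesson), you can compare this estimate with the actual size on disk:
import os
2 * 8 * 91 * 100**2               # 14,560,000 bytes of raw data
os.path.getsize("evolution.nc")   # actual file size in bytes; expect it to be close, plus NetCDF metadata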
Climate and forecast (CF) NetCDF convention in spherical geometry
So far we’ve been working with datasets in Cartesian coordinates. How about spherical geometry – how do we initialize and store a dataset in spherical coordinates (lon - lat - elevation)? It turns out this is very easy:
1. define these coordinates and your data arrays on top of these coordinates,
2. put everything into an xarray dataset, and
3. finally specify the following units:
"units"] = "degrees_north" # this line is important to adhere to CF convention
ds.lat.attrs["units"] = "degrees_east" # this line is important to adhere to CF convention ds.lon.attrs[
Let’s do it! Create a small (one-degree horizontal + some vertical resolution), stationary (no time dependency) dataset in spherical geometry with one 3D variable and write it to spherical.nc. Load it into ParaView to make sure the geometry is spherical.
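Here is a minimal sketch of one possible solution; the variable name density, the random values, and the 500 m vertical spacing are all arbitrary assumptions:
import numpy as np
import xarray as xr

lat = np.arange(-90, 91, 1.0)      # one-degree latitudes
lon = np.arange(0, 360, 1.0)       # one-degree longitudes
elev = np.arange(0, 5001, 500.0)   # hypothetical elevation levels in metres
ds = xr.Dataset(
    {"density": (("elev","lat","lon"),
                 np.random.randn(len(elev), len(lat), len(lon)))},   # one 3D variable
    coords={"elev": elev, "lat": lat, "lon": lon}
)
ds.lat.attrs["units"] = "degrees_north"   # adhere to CF convention
ds.lon.attrs["units"] = "degrees_east"    # adhere to CF convention
ds.to_netcdf("spherical.nc")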
Working with atmospheric data
Let’s take a look at some real (but very low-resolution) data stored in the NetCDF-CF convention. Preparing for this workshop, I took one of the ECCC (Environment and Climate Change Canada) historical model datasets, which contains only the near-surface air temperature and was published in the CMIP6 data archive. I reduced its size by picking only a subset of timesteps:
# FYI - here is how I created a smaller dataset
import xarray as xr
data = xr.open_dataset('/Users/razoumov/tmp/xarray/atmosphere/tas_Amon_CanESM5_historical_r1i1p2f1_gn_185001-201412.nc')
data.sel(time=slice('2001', '2020')).to_netcdf("tasReduced.nc")   # last 168 steps
Let’s download the file tasReduced.nc
in the terminal:
wget http://bit.ly/atmosdata -O tasReduced.nc
First, quickly check this dataset in ParaView (use Dimensions = (lat,lon)).
data = xr.open_dataset('tasReduced.nc')
data              # this is a time-dependent 2D dataset: print out the metadata, coordinates, data variables
data.time         # time goes monthly from 2001-01-16 to 2014-12-16
data.time.shape   # there are 168 timesteps
data.tas          # metadata for the data variable (time: 168, lat: 64, lon: 128)
data.tas.shape    # (168, 64, 128) = (time, lat, lon)
data.height       # at the fixed height=2m
These five lines all produce the same result:
data.tas[0] - 273.15       # take all values at data.time[0], convert to Celsius
data.tas[0,:] - 273.15
data.tas[0,:,:] - 273.15
data.tas.isel(time=0) - 273.15
air = data.tas.sel(time='2001-01-16') - 273.15
These two lines produce the same result (1D vector of temperatures as a function of longitude):
data.tas[0,5]
data.tas.isel(time=0, lat=5)
Check temperature variation in the last timestep:
air = data.tas.isel(time=-1) - 273.15   # last timestep, to Celsius
air.shape              # (64, 128)
air.min(), air.max()   # -43.550903, 36.82956
Selecting data is slightly more difficult with approximate floating-point coordinates:
data.tas.lat
data.tas.lat.dtype
data.tas.isel(lat=0)             # the first value lat=-87.86
data.lat[0]                      # print the first latitude and try to use it below
data.tas.sel(lat=-87.86379884)   # does not work due to floating precision
data.tas.sel(lat=data.lat[0])    # this works
latSlice = data.tas.sel(lat=slice(-90,-80))   # only select data in a slice lat=[-90,-80]
latSlice.shape                   # (168, 3, 128) - 3 latitudes in this slice
Multiple ways to select time:
data.time[-10:]   # last ten times
air = data.tas.sel(time='2014-12-16') - 273.15   # last date
air = data.tas.sel(time='2014') - 273.15         # select everything in 2014
air.shape         # 12 steps
air.time
air = data.tas.sel(time='2014-01') - 273.15      # select everything in January 2014
Aggregate functions:
meanOverTime = data.tas.mean(dim='time') - 273.15
meanOverSpace = data.tas.mean(dim=['lat','lon']) - 273.15   # mean over space for each timestep
meanOverSpace.shape   # time series (168,)
meanOverSpace.plot(marker="o", size=8)   # calls matplotlib.pyplot.plot
Interpolate to a specific location:
victoria = data.tas.interp(lat=48.43, lon=360-123.37) - 273.15
victoria.shape   # (168,) only time
victoria.plot(marker="o", size=8)   # simple 1D plot
victoria.sel(time=slice('2010','2020')).plot(marker="o", size=8)   # zoom in on the 2010s points
Let’s plot in 2D:
air = data.tas.isel(time=-1) - 273.15   # last timestep
air.time
air.plot(size=8)                     # 2D plot, very poor resolution (lat: 64, lon: 128)
air.plot(size=8, y="lon", x="lat")   # can specify which axis is which
What if we have time-dependency in the plot? We put each frame into a separate panel:
a = data.tas[-6:] - 273.15   # last 6 timesteps => 3D dataset => which coords to use for what?
a.plot(x="lon", y="lat", col="time", col_wrap=3)
Breaking into groups and applying a function to each group:
len(data.time)             # 168 steps
data.tas.groupby("time")   # 168 groups
def standardize(x):
    return (x - x.mean()) / x.std()
standard = data.tas.groupby("time").map(standardize)   # apply this function to each group
standard.shape   # (168, 64, 128) same shape as the original but now normalized over each group