Moana thredds server connector

This notebook describes how to get data from the thredds server set up on top of the Moana 25-year hydrodynamic hindcast of New Zealand waters. The data is freely available from that server under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Sebastien Delaux
Kaitiakitanga License, Creative Commons Attribution-ShareAlike 4.0 International


The Moana backbone hindcast comes as four sources of data stored in four subfolders on the thredds server.

Connector description

For now the connector only provides a way to load the data in the form of an xarray dataset.

The motivations for using xarray are the following:

  • xarray is a popular Python library that allows working with NetCDF files stored on disk or on OpenDAP servers
  • xarray is built on top of dask, which allows a number of operations, in particular sub-sampling, to be done lazily (data do not get loaded until actually needed)
  • xarray is highly compatible with the numpy library
  • xarray offers a number of useful export/plotting functions

First we install xarray, which we will use to handle the data, and siphon, which we will use to talk to the thredds server catalog. We also install matplotlib to generate plots later on.

In [1]:

import sys
!{sys.executable} -m pip install xarray siphon matplotlib
Successfully installed beautifulsoup4-4.8.2 certifi-2019.11.28 chardet-3.0.4 cycler-0.10.0 idna-2.9 kiwisolver-1.1.0 matplotlib-3.2.1 numpy-1.18.2 pandas-1.0.3 protobuf-3.11.3 pyparsing-2.4.6 python-dateutil-2.8.1 pytz-2019.3 requests-2.23.0 setuptools-46.1.3 siphon-0.8.0 six-1.14.0 soupsieve-2.0 urllib3-1.25.8 xarray-0.15.1

Now we load xarray and siphon.

In [2]:

import xarray as xr
from siphon.catalog import TDSCatalog

Then we load a couple of reasonably standard Python libraries.

In [3]:

import numpy
from datetime import datetime
from dateutil.relativedelta import relativedelta
import itertools
import re

We define a couple of helper methods to identify whether a filename contains data relevant to a given time range. These help avoid loading unnecessary metadata.

In [4]:

def datetime_from_filename(filename):
    """
    Parse a Moana filename and return the datetime corresponding
    to the first timestamp in the monthly file

    Parameters
    ----------
    filename : str
        The name of one of the Moana netCDF files from the thredds server.

    Returns
    -------
    datetime
        A datetime object corresponding to the first timestamp in the file
    """
    datestring = re.search(r'(20\d{2}\d{2}|19\d{2}\d{2})', filename).group()
    return datetime.strptime(datestring, '%Y%m')


def monthly_file_in_interval(filename, tmin=None, tmax=None):
    """
    Check whether a monthly Moana file contains data relevant
    to the time range [tmin:tmax]

    Parameters
    ----------
    filename : str
        The name of one of the Moana netCDF files from the thredds server.
    tmin : datetime
        The start of the time range (None for an open interval).
        The start datetime is inclusive
    tmax : datetime
        The end of the time range (None for an open interval).
        The end datetime is exclusive

    Returns
    -------
    boolean
        Whether the file contains data relevant to the time range [tmin:tmax]
    """
    if tmin is None and tmax is None:
        return True
    t1 = datetime_from_filename(filename)
    if tmin is None:
        return t1 < tmax
    if tmax is None:
        return tmin < t1 + relativedelta(months=1)
    return t1 < tmax and tmin < t1 + relativedelta(months=1)
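A quick sanity check of these helpers, as a stdlib-only sketch (dateutil's relativedelta is replaced by a tiny month-increment helper, and the filename below is a hypothetical one following the YYYYMM convention the regex expects):

```python
import re
from datetime import datetime

def datetime_from_filename(filename):
    # Extract the YYYYMM stamp (e.g. 201701) from a Moana-style filename.
    datestring = re.search(r'(20\d{2}\d{2}|19\d{2}\d{2})', filename).group()
    return datetime.strptime(datestring, '%Y%m')

def add_month(t):
    # Stdlib-only stand-in for dateutil's relativedelta(months=1).
    return datetime(t.year + (t.month == 12), t.month % 12 + 1, 1)

def monthly_file_in_interval(filename, tmin=None, tmax=None):
    # Same logic as above: a monthly file covering [t1, t1 + 1 month)
    # overlaps [tmin, tmax) iff t1 < tmax and tmin < t1 + 1 month.
    if tmin is None and tmax is None:
        return True
    t1 = datetime_from_filename(filename)
    if tmin is None:
        return t1 < tmax
    if tmax is None:
        return tmin < add_month(t1)
    return t1 < tmax and tmin < add_month(t1)

fname = 'nz5km_avg_201701.nc'  # hypothetical filename
print(datetime_from_filename(fname))                                # 2017-01-01 00:00:00
print(monthly_file_in_interval(fname, tmin=datetime(2017, 1, 15)))  # True: the month overlaps
print(monthly_file_in_interval(fname, tmax=datetime(2017, 1, 1)))   # False: tmax is exclusive
```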
We define the connector.

In [5]:

class MoanaThreddsServerConnector(object):
    """
    A class used to connect to the thredds server put on top of the
    Moana hydrodynamic hindcast of New Zealand waters

    Attributes
    ----------
    catalog_url : str
        The url to use to interrogate the data catalog
    dods_base_url : str
        The base url to use to access the data via OpenDAP
    tmin : datetime
        The lower end of the time range for which data are required (inclusive).
        None means an open ended interval
    tmax : datetime
        The upper end of the time range for which data are required (exclusive).
        None means an open ended interval

    Methods
    -------
    get_xarray_dset()
        Return an xarray dataset object that allows efficient
        access/subsampling of the data
    """

    def __init__(self, catalog_url='', dods_base_url='', tmin=None, tmax=None):
        # Store the urls
        self.catalog_url = catalog_url
        self.dods_base_url = dods_base_url

        # Get catalog
        self.catalog = TDSCatalog(catalog_url)

        # Build list of all available files
        self.dods_url = [dods_base_url + filename
                         for filename in sorted(self.catalog.datasets)
                         if monthly_file_in_interval(filename, tmin=tmin, tmax=tmax)]

        # Load base dataset object
        if 'processed' in catalog_url:
            concat_dim = 'time'
        else:
            concat_dim = 'ocean_time'
        self.dset = xr.open_mfdataset(self.dods_url,
                                      parallel=True,
                                      coords="minimal",
                                      data_vars="minimal",
                                      compat='override',
                                      combine='nested',
                                      concat_dim=concat_dim)

    def get_xarray_dset(self):
        return self.dset
Example usage of connector

Initialise the connector

In [6]:

cat_url = ""
base_dods_url = ''
connector = MoanaThreddsServerConnector(cat_url,
                                        base_dods_url,
                                        tmin=datetime(2017, 1, 1),
                                        tmax=datetime(2017, 4, 1))

For now, get the full xarray dataset object.

In [7]:

dset = connector.get_xarray_dset()

Find the names of all the variables in the dataset that are time dependent.

In [8]:

for (var_name, var) in dset.data_vars.items():
    if 'ocean_time' in var.dims and len(var.dims) > 1:
        print(var_name, '-->', var.attrs['long_name'], var.dims)
zeta --> time-averaged free-surface ('ocean_time', 'eta_rho', 'xi_rho')
ubar_eastward --> time-averaged eastward vertically integrated momentum component at RHO-points ('ocean_time', 'eta_rho', 'xi_rho')
vbar_northward --> time-averaged northward vertically integrated momentum component at RHO-points ('ocean_time', 'eta_rho', 'xi_rho')
u_eastward --> time-averaged eastward momentum component at RHO-points ('ocean_time', 's_rho', 'eta_rho', 'xi_rho')
v_northward --> time-averaged northward momentum component at RHO-points ('ocean_time', 's_rho', 'eta_rho', 'xi_rho')
temp --> time-averaged potential temperature ('ocean_time', 's_rho', 'eta_rho', 'xi_rho')
salt --> time-averaged salinity ('ocean_time', 's_rho', 'eta_rho', 'xi_rho')

In [9]:

print("The dataset contains data from", dset.variables['ocean_time'].data[0], "to", dset.variables['ocean_time'].data[-1])
The dataset contains data from 2017-01-01T12:00:00.000000000 to 2017-03-31T12:00:00.000000000

Now let's load some data. Let's say we want temperature data. The corresponding variable for time-averaged potential temperature is named 'temp'. The dimensions and shape of the array are the following.

In [10]:

print("Shape of 'temp' variable is ", dset.variables['temp'].shape)
print("Dimensions are ", dset.variables['temp'].dims)
Shape of 'temp' variable is  (90, 40, 467, 397)
Dimensions are  ('ocean_time', 's_rho', 'eta_rho', 'xi_rho')

The variable has four dimensions, in order: time, vertical level, and two dimensions of space. We load the data at every 10th time stamp, for the level closest to the surface (the last one), over the whole simulation domain.

In [11]:

temp_data = dset.variables['temp'][::10,-1,:,:].load()
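On the full variable this selection yields 9 frames of shape (467, 397); the indexing pattern itself can be illustrated on a toy numpy array of the same rank:

```python
import numpy as np

# Toy array whose axes mirror (ocean_time, s_rho, eta_rho, xi_rho)
fake = np.arange(90 * 4 * 3 * 2).reshape(90, 4, 3, 2)

# Every 10th time stamp, surface (last) level, full horizontal domain
sub = fake[::10, -1, :, :]
print(sub.shape)  # (9, 3, 2)
```

Because xarray variables support the same indexing, the dask-backed selection above stays lazy until `.load()` is called.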

Now let's plot the data. We plot the first six frames in a 2x3 grid.

In [12]:

%matplotlib inline
from matplotlib import pyplot as plt

fig, axs = plt.subplots(2, 3)
for ax, data in zip(axs.flat, temp_data):
    ax.pcolor(data)

Now let's get a time series of time-averaged elevation at a specific location. The corresponding variable is named 'zeta' and has three dimensions (one of time and two of space).

In [13]:

zeta_data = dset.variables['zeta'][:,300,100].load()

In [14]:

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.dates import date2num

plt.xlabel('time')
plt.ylabel('Time-averaged surface elevation (m)')
plt.plot_date(date2num(dset.ocean_time), zeta_data, linestyle='-')


[<matplotlib.lines.Line2D at 0x145e92433358>]

More to come on how to get data for a specific bounding box or point defined using WGS84 longitudes and latitudes. Also, more data will be uploaded to the server and described here.
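In the meantime, a nearest-grid-point lookup over the 2D coordinate arrays can serve as a stopgap. This is a sketch that assumes the ROMS naming convention (`lon_rho`/`lat_rho` 2D arrays on the curvilinear grid, as in the dataset above) and uses a simple squared-degree metric that ignores the Earth's curvature:

```python
import numpy as np

def nearest_rho_index(lon_rho, lat_rho, lon, lat):
    """Return the (eta, xi) index of the grid cell whose centre is
    closest (in squared degrees) to the requested (lon, lat)."""
    dist2 = (lon_rho - lon) ** 2 + (lat_rho - lat) ** 2
    return np.unravel_index(np.argmin(dist2), dist2.shape)

# Tiny synthetic grid standing in for dset.lon_rho / dset.lat_rho
lon_rho, lat_rho = np.meshgrid(np.linspace(170.0, 175.0, 6),
                               np.linspace(-45.0, -40.0, 6))
eta, xi = nearest_rho_index(lon_rho, lat_rho, 173.1, -42.2)
print(eta, xi)  # (3, 3): the cell at lon 173, lat -42
```

On the real dataset the returned indices could then be passed to `isel(eta_rho=eta, xi_rho=xi)`.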

Now let's plot a cross-section of the data. First we select a cross-section of the 3D grid for the first timestamp of the dataset.

In [15]:

ds = dset.isel(ocean_time=0,xi_rho=200)

It can be plotted using the s coordinate as the vertical dimension, which does not show the actual depth of the cross-section.

In [16]:

ds.temp.plot()



<matplotlib.collections.QuadMesh at 0x145e91f39128>

Following the ROMS vertical coordinate transformations, we can write the equations to calculate the vertical coordinate.

In [17]:

if ds.Vtransform == 1:
    Zo_rho = ds.hc * (ds.s_rho - ds.Cs_r) + ds.Cs_r * ds.h
    z_rho = Zo_rho + ds.zeta * (1 + Zo_rho/ds.h)
elif ds.Vtransform == 2:
    Zo_rho = (ds.hc * ds.s_rho + ds.Cs_r * ds.h) / (ds.hc + ds.h)
    z_rho = ds.zeta + (ds.zeta + ds.h) * Zo_rho
ds.coords['z_rho'] = z_rho.fillna(0).transpose()  # needing transpose seems to be an xarray bug
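A quick scalar sanity check of the Vtransform == 2 formula, using made-up values (hc = 20 m critical depth, one s level at s = -0.5 with stretching value C = -0.4, water depth h = 100 m, zero free surface):

```python
# Stand-in scalar values; on the real dataset these are xarray variables.
hc, s, C, h, zeta = 20.0, -0.5, -0.4, 100.0, 0.0

Zo = (hc * s + C * h) / (hc + h)   # = -50/120
z = zeta + (zeta + h) * Zo         # depth of that s level in metres
print(round(z, 3))  # -41.667
```

As expected the result lies between the surface (0 m) and the bottom (-100 m), with the exact position controlled by the stretching value.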

Now the plot can be redone using the depth and the latitude as coordinates.

In [18]:

ds.temp.plot(x='lat_rho', y='z_rho')


<matplotlib.collections.QuadMesh at 0x145e91e2b400>
