Lazy evaluation on Dask arrays¶
If you are unfamiliar with Dask, read Parallel computing with Dask in Xarray documentation first. The current version only supports dask arrays on a single machine. Support of Dask.distributed is in roadmap.
xESMF’s Dask support is mostly for lazy evaluation and out-of-core computing, to allow processing large volumes of data with limited memory. You might also get moderate speed-up on a multi-core machine by choosing proper chunk sizes, but that generally won’t help your entire pipeline too much, because the read-regrid-write pipeline is severely I/O limited (see this issue for more discussions). On a single machine, the disk bandwidth is typically limited to ~500 MB/s, and you cannot process data faster than such rate. If you need much faster data processing rate, you should resort to parallel file systems on HPC clusters or distributed storage on public cloud platforms. Please refer to the Pangeo project for more information.
[1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import dask.array as da # need to have dask.array installed, although not directly using it here.
import xarray as xr
import xesmf as xe
A simple example¶
Prepare input data¶
[2]:
ds = xr.tutorial.open_dataset("air_temperature", chunks={"time": 500})
ds
[2]:
- lat: 25
- lon: 53
- time: 2920
- lat(lat)float3275.0 72.5 70.0 ... 20.0 17.5 15.0
- standard_name :
- latitude
- long_name :
- Latitude
- units :
- degrees_north
- axis :
- Y
array([75. , 72.5, 70. , 67.5, 65. , 62.5, 60. , 57.5, 55. , 52.5, 50. , 47.5, 45. , 42.5, 40. , 37.5, 35. , 32.5, 30. , 27.5, 25. , 22.5, 20. , 17.5, 15. ], dtype=float32)
- lon(lon)float32200.0 202.5 205.0 ... 327.5 330.0
- standard_name :
- longitude
- long_name :
- Longitude
- units :
- degrees_east
- axis :
- X
array([200. , 202.5, 205. , 207.5, 210. , 212.5, 215. , 217.5, 220. , 222.5, 225. , 227.5, 230. , 232.5, 235. , 237.5, 240. , 242.5, 245. , 247.5, 250. , 252.5, 255. , 257.5, 260. , 262.5, 265. , 267.5, 270. , 272.5, 275. , 277.5, 280. , 282.5, 285. , 287.5, 290. , 292.5, 295. , 297.5, 300. , 302.5, 305. , 307.5, 310. , 312.5, 315. , 317.5, 320. , 322.5, 325. , 327.5, 330. ], dtype=float32)
- time(time)datetime64[ns]2013-01-01 ... 2014-12-31T18:00:00
- standard_name :
- time
- long_name :
- Time
array(['2013-01-01T00:00:00.000000000', '2013-01-01T06:00:00.000000000', '2013-01-01T12:00:00.000000000', ..., '2014-12-31T06:00:00.000000000', '2014-12-31T12:00:00.000000000', '2014-12-31T18:00:00.000000000'], dtype='datetime64[ns]')
- air(time, lat, lon)float32dask.array<chunksize=(500, 25, 53), meta=np.ndarray>
- long_name :
- 4xDaily Air temperature at sigma level 995
- units :
- degK
- precision :
- 2
- GRIB_id :
- 11
- GRIB_name :
- TMP
- var_desc :
- Air temperature
- dataset :
- NMC Reanalysis
- level_desc :
- Surface
- statistic :
- Individual Obs
- parent_stat :
- Other
- actual_range :
- [185.16 322.1 ]
Array Chunk Bytes 15.48 MB 2.65 MB Shape (2920, 25, 53) (500, 25, 53) Count 7 Tasks 6 Chunks Type float32 numpy.ndarray
- Conventions :
- COARDS
- title :
- 4x daily NMC reanalysis (1948)
- description :
- Data is from NMC initialized reanalysis (4x/day). These are the 0.9950 sigma level values.
- platform :
- Model
- references :
- http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html
[3]:
ds.chunks
[3]:
Frozen(SortedKeysDict({'time': (500, 500, 500, 500, 500, 420), 'lat': (25,), 'lon': (53,)}))
[4]:
ds["air"].data
[4]:
|
Build regridder¶
[5]:
ds_out = xr.Dataset(
{
"lat": (["lat"], np.arange(16, 75, 1.0)),
"lon": (["lon"], np.arange(200, 330, 1.5)),
}
)
regridder = xe.Regridder(ds, ds_out, "bilinear")
regridder
[5]:
xESMF Regridder
Regridding algorithm: bilinear
Input grid shape: (25, 53)
Output grid shape: (59, 87)
Output grid dimension name: ('lat', 'lon')
Periodic in longitude? False
Apply to xarray Dataset/DataArray¶
[6]:
# only build the dask graph; actual computation happens later when calling compute()
%time ds_out = regridder(ds)
ds_out
using dimensions ('lat', 'lon') from data variable air as the horizontal dimensions for this dataset.
CPU times: user 2.09 ms, sys: 0 ns, total: 2.09 ms
Wall time: 2.06 ms
[6]:
- lat: 59
- lon: 87
- time: 2920
- time(time)datetime64[ns]2013-01-01 ... 2014-12-31T18:00:00
- standard_name :
- time
- long_name :
- Time
array(['2013-01-01T00:00:00.000000000', '2013-01-01T06:00:00.000000000', '2013-01-01T12:00:00.000000000', ..., '2014-12-31T06:00:00.000000000', '2014-12-31T12:00:00.000000000', '2014-12-31T18:00:00.000000000'], dtype='datetime64[ns]')
- lon(lon)float64200.0 201.5 203.0 ... 327.5 329.0
array([200. , 201.5, 203. , 204.5, 206. , 207.5, 209. , 210.5, 212. , 213.5, 215. , 216.5, 218. , 219.5, 221. , 222.5, 224. , 225.5, 227. , 228.5, 230. , 231.5, 233. , 234.5, 236. , 237.5, 239. , 240.5, 242. , 243.5, 245. , 246.5, 248. , 249.5, 251. , 252.5, 254. , 255.5, 257. , 258.5, 260. , 261.5, 263. , 264.5, 266. , 267.5, 269. , 270.5, 272. , 273.5, 275. , 276.5, 278. , 279.5, 281. , 282.5, 284. , 285.5, 287. , 288.5, 290. , 291.5, 293. , 294.5, 296. , 297.5, 299. , 300.5, 302. , 303.5, 305. , 306.5, 308. , 309.5, 311. , 312.5, 314. , 315.5, 317. , 318.5, 320. , 321.5, 323. , 324.5, 326. , 327.5, 329. ])
- lat(lat)float6416.0 17.0 18.0 ... 72.0 73.0 74.0
array([16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43., 44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55., 56., 57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68., 69., 70., 71., 72., 73., 74.])
- air(time, lat, lon)float64dask.array<chunksize=(500, 59, 87), meta=np.ndarray>
Array Chunk Bytes 119.91 MB 20.53 MB Shape (2920, 59, 87) (500, 59, 87) Count 13 Tasks 6 Chunks Type float64 numpy.ndarray
- regrid_method :
- bilinear
[7]:
ds_out["air"].data # chunks are preserved
[7]:
|
[8]:
%time result = ds_out['air'].compute() # actually applies regridding
CPU times: user 298 ms, sys: 212 ms, total: 510 ms
Wall time: 154 ms
[9]:
type(result.data), result.data.shape
[9]:
(numpy.ndarray, (2920, 59, 87))
Invalid chunk sizes to avoid¶
Dask support relies on xarray.apply_ufunc
, which requires only chunking over extra/broadcasting dimensions (like time
and lev
), not horizontal/core dimensions (lat
, lon
, or x
, y
).
[10]:
# xESMF doesn't like chunking over horizontal dimensions
ds_bad = ds.chunk({"lat": 10, "lon": 10, "time": None})
ds_bad
[10]:
- lat: 25
- lon: 53
- time: 2920
- lat(lat)float3275.0 72.5 70.0 ... 20.0 17.5 15.0
- standard_name :
- latitude
- long_name :
- Latitude
- units :
- degrees_north
- axis :
- Y
array([75. , 72.5, 70. , 67.5, 65. , 62.5, 60. , 57.5, 55. , 52.5, 50. , 47.5, 45. , 42.5, 40. , 37.5, 35. , 32.5, 30. , 27.5, 25. , 22.5, 20. , 17.5, 15. ], dtype=float32)
- lon(lon)float32200.0 202.5 205.0 ... 327.5 330.0
- standard_name :
- longitude
- long_name :
- Longitude
- units :
- degrees_east
- axis :
- X
array([200. , 202.5, 205. , 207.5, 210. , 212.5, 215. , 217.5, 220. , 222.5, 225. , 227.5, 230. , 232.5, 235. , 237.5, 240. , 242.5, 245. , 247.5, 250. , 252.5, 255. , 257.5, 260. , 262.5, 265. , 267.5, 270. , 272.5, 275. , 277.5, 280. , 282.5, 285. , 287.5, 290. , 292.5, 295. , 297.5, 300. , 302.5, 305. , 307.5, 310. , 312.5, 315. , 317.5, 320. , 322.5, 325. , 327.5, 330. ], dtype=float32)
- time(time)datetime64[ns]2013-01-01 ... 2014-12-31T18:00:00
- standard_name :
- time
- long_name :
- Time
array(['2013-01-01T00:00:00.000000000', '2013-01-01T06:00:00.000000000', '2013-01-01T12:00:00.000000000', ..., '2014-12-31T06:00:00.000000000', '2014-12-31T12:00:00.000000000', '2014-12-31T18:00:00.000000000'], dtype='datetime64[ns]')
- air(time, lat, lon)float32dask.array<chunksize=(2920, 10, 10), meta=np.ndarray>
- long_name :
- 4xDaily Air temperature at sigma level 995
- units :
- degK
- precision :
- 2
- GRIB_id :
- 11
- GRIB_name :
- TMP
- var_desc :
- Air temperature
- dataset :
- NMC Reanalysis
- level_desc :
- Surface
- statistic :
- Individual Obs
- parent_stat :
- Other
- actual_range :
- [185.16 322.1 ]
Array Chunk Bytes 15.48 MB 1.17 MB Shape (2920, 25, 53) (2920, 10, 10) Count 44 Tasks 18 Chunks Type float32 numpy.ndarray
- Conventions :
- COARDS
- title :
- 4x daily NMC reanalysis (1948)
- description :
- Data is from NMC initialized reanalysis (4x/day). These are the 0.9950 sigma level values.
- platform :
- Model
- references :
- http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html
[11]:
# regridder(ds_bad) # uncomment this line to see the error message
[12]:
# besides rechunking data properly, another simple fix is to convert to numpy array without chunking
regridder(ds_bad.load())
using dimensions ('lat', 'lon') from data variable air as the horizontal dimensions for this dataset.
[12]:
- lat: 59
- lon: 87
- time: 2920
- time(time)datetime64[ns]2013-01-01 ... 2014-12-31T18:00:00
- standard_name :
- time
- long_name :
- Time
array(['2013-01-01T00:00:00.000000000', '2013-01-01T06:00:00.000000000', '2013-01-01T12:00:00.000000000', ..., '2014-12-31T06:00:00.000000000', '2014-12-31T12:00:00.000000000', '2014-12-31T18:00:00.000000000'], dtype='datetime64[ns]')
- lon(lon)float64200.0 201.5 203.0 ... 327.5 329.0
array([200. , 201.5, 203. , 204.5, 206. , 207.5, 209. , 210.5, 212. , 213.5, 215. , 216.5, 218. , 219.5, 221. , 222.5, 224. , 225.5, 227. , 228.5, 230. , 231.5, 233. , 234.5, 236. , 237.5, 239. , 240.5, 242. , 243.5, 245. , 246.5, 248. , 249.5, 251. , 252.5, 254. , 255.5, 257. , 258.5, 260. , 261.5, 263. , 264.5, 266. , 267.5, 269. , 270.5, 272. , 273.5, 275. , 276.5, 278. , 279.5, 281. , 282.5, 284. , 285.5, 287. , 288.5, 290. , 291.5, 293. , 294.5, 296. , 297.5, 299. , 300.5, 302. , 303.5, 305. , 306.5, 308. , 309.5, 311. , 312.5, 314. , 315.5, 317. , 318.5, 320. , 321.5, 323. , 324.5, 326. , 327.5, 329. ])
- lat(lat)float6416.0 17.0 18.0 ... 72.0 73.0 74.0
array([16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43., 44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55., 56., 57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68., 69., 70., 71., 72., 73., 74.])
- air(time, lat, lon)float64296.1 296.4 296.6 ... 241.0 241.5
array([[[296.13399675, 296.38669304, 296.63889823, ..., 296.47490793, 296.43398913, 296.19924566], [295.97800871, 296.18274797, 296.42534501, ..., 296.09262341, 296.07802394, 295.72098714], [296.04001766, 296.13556275, 296.30247974, ..., 295.77692914, 295.73997197, 295.35693248], ..., [245.04017912, 245.36087049, 245.56096188, ..., 233.93629106, 235.51802332, 238.0780694 ], [243.27991042, 243.77519503, 244.17375053, ..., 233.81591274, 235.33999633, 237.63241841], [242.24003289, 242.87912303, 243.43775032, ..., 233.84791841, 235.41999207, 237.49641598]], [[296.25399643, 296.70203773, 297.03166485, ..., 296.06514956, 296.03998263, 296.01773136], [296.2179898 , 296.56767711, 296.82291528, ..., 295.7292558 , 295.6800262 , 295.5138904 ], [296.23999022, 296.42058286, 296.56714652, ..., 295.50442291, 295.41998903, 295.19133215], ..., [245.52028453, 245.73709231, 245.85148963, ..., 231.64759509, 232.67802699, 234.83033953], [243.29994515, 243.61404829, 243.85326489, ..., 231.80653129, 232.72003168, 234.51923375], [242.70001369, 243.03800427, 243.31726258, ..., 232.22256285, 233.15997775, 234.71925176]], [[296.31998597, 296.35233477, 296.37027072, ..., 296.69703874, 296.59998477, 296.42993717], [296.23999022, 296.37072264, 296.42865111, ..., 296.39312798, 296.20003046, 295.98447426], [296.07996829, 296.20134835, 296.24744824, ..., 296.17051878, 295.85798008, 295.63218076], ..., [246.92034241, 246.75294557, 246.50912779, ..., 231.18562131, 232.24003595, 234.61904532], [244.13992017, 244.03040136, 243.89556811, ..., 231.78222568, 232.82012307, 234.90301222], [243.22002405, 243.13672999, 243.05876072, ..., 233.39835074, 234.45993206, 236.27912467]], ..., [[297.62998356, 298.25582152, 298.65503226, ..., 295.7786526 , 295.66999665, 295.50299763], [297.07004997, 297.7199817 , 298.19914665, ..., 295.58670885, 295.55000304, 295.2150872 ], [296.38994762, 296.98154159, 297.52424514, ..., 295.48204978, 295.46998597, 295.05016151], ..., [251.81041188, 251.72296686, 251.55990593, ..., 240.75713161, 241.37000425, 242.46456685], [247.96982451, 247.87036819, 247.69574495, ..., 241.55307104, 241.93009016, 242.64624471], [245.73007797, 245.53418764, 245.25570503, ..., 242.92917313, 243.20994271, 243.68633428]], [[297.1899857 , 297.6237982 , 297.95503528, ..., 295.4505457 , 295.32998658, 295.05487915], [296.59005424, 297.09596253, 297.49915579, ..., 295.29059716, 295.1700073 , 294.7989566 ], [295.76992812, 296.23054587, 296.65708274, ..., 295.15442138, 295.00998537, 294.626334 ], ..., [252.39037715, 252.04962511, 251.65524335, ..., 241.72163807, 242.61000973, 243.93278497], [249.06987326, 248.66822819, 248.21431545, ..., 242.12742803, 242.79003593, 243.74867727], [247.43005818, 246.93209439, 246.37428385, ..., 242.84748108, 243.38996739, 244.10871142]], [[297.04997563, 297.38785212, 297.63503348, ..., 296.09892817, 295.98999483, 295.690696 ], [296.4100463 , 296.84401245, 297.17914422, ..., 295.82701626, 295.79001767, 295.49075727], [295.62992871, 296.00626644, 296.35697279, ..., 295.59472609, 295.56998294, 295.28282378], ..., [252.51039665, 252.14637887, 251.70761893, ..., 240.09173906, 240.75006397, 241.87173824], [248.92985255, 248.55305942, 248.11071361, ..., 240.21550695, 240.53002923, 241.23719787], [247.01007067, 246.58490681, 246.10268406, ..., 240.91155301, 241.00997318, 241.45322238]]])
- regrid_method :
- bilinear