html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/1751#issuecomment-352330986,https://api.github.com/repos/pydata/xarray/issues/1751,352330986,MDEyOklzc3VlQ29tbWVudDM1MjMzMDk4Ng==,1217238,2017-12-18T05:39:18Z,2017-12-18T05:39:18Z,MEMBER,I decided to merge in the current state rather than let this get stale. We can add the public API later....,"{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,278325492
https://github.com/pydata/xarray/pull/1751#issuecomment-351452403,https://api.github.com/repos/pydata/xarray/issues/1751,351452403,MDEyOklzc3VlQ29tbWVudDM1MTQ1MjQwMw==,1217238,2017-12-13T16:53:43Z,2017-12-13T16:53:43Z,MEMBER,@fujiisoup could you kindly take another look?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,278325492
https://github.com/pydata/xarray/pull/1751#issuecomment-350148425,https://api.github.com/repos/pydata/xarray/issues/1751,350148425,MDEyOklzc3VlQ29tbWVudDM1MDE0ODQyNQ==,1217238,2017-12-08T01:47:51Z,2017-12-08T01:47:51Z,MEMBER,"I pushed some additional tests, which turned up the fact that dask's vectorized indexing does not support negative indices (fixed by https://github.com/dask/dask/pull/2967).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,278325492
https://github.com/pydata/xarray/pull/1751#issuecomment-348870677,https://api.github.com/repos/pydata/xarray/issues/1751,348870677,MDEyOklzc3VlQ29tbWVudDM0ODg3MDY3Nw==,1217238,2017-12-04T06:22:57Z,2017-12-04T06:22:57Z,MEMBER,"Okay, I'll come up with a few more tests to make sure this maintains 100% coverage... Let me know if you have any ideas for other edge cases.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,278325492
https://github.com/pydata/xarray/pull/1751#issuecomment-348843824,https://api.github.com/repos/pydata/xarray/issues/1751,348843824,MDEyOklzc3VlQ29tbWVudDM0ODg0MzgyNA==,1217238,2017-12-04T02:19:53Z,2017-12-04T02:19:53Z,MEMBER,"OK, I'm going to merge this (just the first commit), and leave the second part (actually changing reindex) to another PR.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,278325492
https://github.com/pydata/xarray/pull/1751#issuecomment-348380756,https://api.github.com/repos/pydata/xarray/issues/1751,348380756,MDEyOklzc3VlQ29tbWVudDM0ODM4MDc1Ng==,1217238,2017-12-01T02:07:21Z,2017-12-01T02:07:32Z,MEMBER,"I pushed another commit (mostly but not entirely working) to port `reindex()` to use `getitem_with_mask` rather than its current implementation. This allows a nice simplification of the `reindex_variables()` code: the diffstat is `1 file changed, 33 insertions(+), 64 deletions(-)`.
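For context, the masking approach boils down to building an integer indexer with `-1` at positions that have no match, indexing with it, and then overwriting those positions with a fill value. A minimal NumPy sketch of the idea (the function name here is mine, not the actual implementation):
```python
import numpy as np

def getitem_with_mask_sketch(data, indexer, fill_value=np.nan):
    # -1 marks positions with no match in the old index; as a real index it
    # (harmlessly) selects the last element, which we then overwrite. This is
    # also why dask's vectorized indexing needs to support negative indices.
    mask = indexer == -1
    return np.where(mask, fill_value, data[indexer])

data = np.array([10.0, 20.0, 30.0])
indexer = np.array([2, -1, 0])  # the middle position has no match
print(getitem_with_mask_sketch(data, indexer))  # [30. nan 10.]
```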
To get a sense of how this affects performance, I made a small benchmarking script using our tutorial dataset:
```python
import xarray
import numpy as np
ds_numpy = xarray.tutorial.load_dataset('air_temperature').load()  # in-memory (NumPy-backed)
ds_chunked = ds_numpy.chunk({'time': 100})  # dask-backed copy

# Target coordinates for reindexing: a finer 100x100 lat/lon grid.
lat = np.linspace(ds_numpy.lat.min(), ds_numpy.lat.max(), num=100)
lon = np.linspace(ds_numpy.lon.min(), ds_numpy.lon.max(), num=100)

def do_reindex(ds):
    return ds.reindex(lat=lat, lon=lon, method='nearest', tolerance=0.5)

%timeit do_reindex(ds_numpy)    # NumPy: indexing happens eagerly
%timeit do_reindex(ds_chunked)  # dask: only builds the task graph
result = do_reindex(ds_chunked)
%timeit result.compute()        # dask: actually executes the graph
```
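A sanity check one could add here (not part of the original script) to confirm the dask path produces the same values as the NumPy path:
```python
import xarray.testing

# Both paths should yield identical results, NaNs included.
xarray.testing.assert_identical(do_reindex(ds_numpy),
                                do_reindex(ds_chunked).compute())
```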
Our tutorial dataset is pretty small, but it can still give a flavor of how this scales. I intentionally chose the new coordinates and a small tolerance so that many positions have no match, creating lots of all-missing chunks to mask:
```
In [2]: ds_numpy
Out[2]:
<xarray.Dataset>
Dimensions: (lat: 25, lon: 53, time: 2920)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
* lon (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
* time (time) datetime64[ns] 2013-01-01T00:02:06.757437440 ...
Data variables:
air (time, lat, lon) float64 241.2 242.5 243.5 244.0 244.1 243.9 ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
In [3]: do_reindex(ds_numpy)
Out[3]:
<xarray.Dataset>
Dimensions: (lat: 100, lon: 100, time: 2920)
Coordinates:
* lat (lat) float64 15.0 15.61 16.21 16.82 17.42 18.03 18.64 19.24 ...
* lon (lon) float64 200.0 201.3 202.6 203.9 205.3 206.6 207.9 209.2 ...
* time (time) datetime64[ns] 2013-01-01T00:02:06.757437440 ...
Data variables:
air (time, lat, lon) float64 296.3 nan 296.8 nan 297.1 nan 297.0 ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
```
Here are the benchmarking results.
Before (current `reindex()` implementation):
```
NumPy: 201 ms ± 3.48 ms per loop
Dask build graph: 303 ms ± 5.48 ms per loop
Dask compute: 30.7 s ± 35.9 ms per loop
```
After (ported to `getitem_with_mask`):
```
NumPy: 546 ms ± 26.3 ms per loop
Dask build graph: 6.9 ms ± 464 µs per loop
Dask compute: 411 ms ± 17.9 ms per loop
```
So NumPy is somewhat slower (about 2.7x: 546 ms vs. 201 ms), but reindexing with dask is roughly 75x faster (411 ms vs. 30.7 s)! The dask compute time even beats pure NumPy, which suggests it parallelizes the masking work to some extent.
This is encouraging. We should try to close the performance gap with NumPy (the old code was cleverly optimized to make minimal copies of the data), but the existing reindex code is so slow with dask whenever masking is needed that it is almost unusable.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,278325492