issue_comments: 348380756
html_url | issue_url | id | node_id | user | created_at | updated_at | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
https://github.com/pydata/xarray/pull/1751#issuecomment-348380756 | https://api.github.com/repos/pydata/xarray/issues/1751 | 348380756 | MDEyOklzc3VlQ29tbWVudDM0ODM4MDc1Ng== | 1217238 | 2017-12-01T02:07:21Z | 2017-12-01T02:07:32Z | MEMBER | I pushed another commit (mostly but not entirely working) to port To get a sense of how this affects performance, I made a small benchmarking script with our tutorial dataset:
Our tutorial dataset is pretty small, but it can still give a flavor of how this scales. I chose new chunks intentionally with a small tolerance to create lots of empty chunks to mask:

```
In [2]: ds_numpy
Out[2]:
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
  * lon      (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
  * time     (time) datetime64[ns] 2013-01-01T00:02:06.757437440 ...
Data variables:
    air      (time, lat, lon) float64 241.2 242.5 243.5 244.0 244.1 243.9 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

In [3]: do_reindex(ds_numpy)
Out[3]:
<xarray.Dataset>
Dimensions:  (lat: 100, lon: 100, time: 2920)
Coordinates:
  * lat      (lat) float64 15.0 15.61 16.21 16.82 17.42 18.03 18.64 19.24 ...
  * lon      (lon) float64 200.0 201.3 202.6 203.9 205.3 206.6 207.9 209.2 ...
  * time     (time) datetime64[ns] 2013-01-01T00:02:06.757437440 ...
Data variables:
    air      (time, lat, lon) float64 296.3 nan 296.8 nan 297.1 nan 297.0 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
```

Here are the benchmarking results: Before:
So NumPy is somewhat slower (about 2.5x), but reindexing with dask is 75x faster! It even shows some ability to parallelize better than pure NumPy. This is encouraging. We should try to close the performance gap with NumPy (it was cleverly optimized before to use minimal copies of the data), but the existing reindex code with dask when doing masking is so slow that it is almost unusable. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
278325492 |
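
To illustrate why reindexing onto a finer grid with a small tolerance leaves so many `nan` values to mask, here is a minimal pure-Python sketch of nearest-neighbor reindexing along one axis. It is not the benchmarking script from the comment and not xarray's implementation. The function name `reindex_nearest`, the tolerance of `0.1`, and the ascending latitude axis are assumptions made for illustration (the tutorial dataset stores latitude in descending order).

```python
import bisect
import math


def reindex_nearest(old_index, values, new_index, tolerance):
    """Nearest-neighbor reindex of a 1-D series (hypothetical helper).

    `old_index` must be sorted ascending. New labels with no source label
    within `tolerance` are filled with NaN, which is the masking step
    discussed in the comment.
    """
    out = []
    for label in new_index:
        pos = bisect.bisect_left(old_index, label)
        # Only the immediate left and right neighbors can be the nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(old_index)]
        best = min(candidates, key=lambda i: abs(old_index[i] - label))
        if abs(old_index[best] - label) <= tolerance:
            out.append(values[best])
        else:
            out.append(math.nan)
    return out


# Source axis: 25 latitudes spaced 2.5 degrees apart (15.0 to 75.0).
old_lat = [15.0 + 2.5 * i for i in range(25)]
values = [float(i) for i in range(25)]

# Target axis: 100 evenly spaced latitudes over the same range.
new_lat = [15.0 + 60.0 * i / 99 for i in range(100)]

result = reindex_nearest(old_lat, values, new_lat, tolerance=0.1)
n_missing = sum(math.isnan(v) for v in result)
print(f"{n_missing} of {len(result)} target points have no match within tolerance")
```

With a tight tolerance most target points fall between source points and are masked, so the cost of the masking step dominates the reindex.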