issues: 304589831

  • id: 304589831
  • node_id: MDExOlB1bGxSZXF1ZXN0MTc0NTMxNTcy
  • number: 1983
  • title: Parallel open_mfdataset
  • user: 2443309
  • state: closed
  • locked: 0
  • assignee: (empty)
  • milestone: (empty)
  • comments: 18
  • created_at: 2018-03-13T00:44:35Z
  • updated_at: 2018-04-20T12:04:31Z
  • closed_at: 2018-04-20T12:04:23Z
  • author_association: MEMBER
  • active_lock_reason: (empty)
  • draft: 0
  • pull_request: pydata/xarray/pulls/1983

body:

  • [x] Closes #1981
  • [x] Tests added
  • [x] Tests passed
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API

I'm sharing this in the hopes of getting comments from @mrocklin and @pydata/xarray.

What this does:

  • implements a dask.bag map/apply on the xarray open_dataset and preprocess steps in open_mfdataset
  • adds a new parallel option to open_mfdataset
  • provides about a 40% speedup when opening a multifile dataset with the distributed scheduler (tested on 1000 netCDF files, which took about 9 seconds to open and concatenate in the default configuration)
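
The pattern described in the list above (map the per-file open and preprocess steps in parallel, then combine the results serially) can be sketched roughly as follows. This is only an illustration under loud assumptions: it swaps dask.bag for the standard library's `ThreadPoolExecutor` so that it runs without dask or xarray, and `open_one`, `preprocess`, and `open_many` are hypothetical stand-ins, not functions from this PR or from xarray.

```python
from concurrent.futures import ThreadPoolExecutor


def open_one(path):
    # Hypothetical stand-in for xr.open_dataset: returns a small dict,
    # not a real Dataset, so the sketch needs no files on disk.
    return {"path": path, "values": [len(path)]}


def preprocess(ds):
    # Hypothetical per-file preprocess step (here: double every value).
    return {"path": ds["path"], "values": [v * 2 for v in ds["values"]]}


def open_and_preprocess(path):
    # The unit of work that gets mapped over the file list.
    return preprocess(open_one(path))


def open_many(paths, parallel=False):
    if parallel:
        # Map the open + preprocess step over all paths concurrently.
        # pool.map returns results in the same order as the inputs.
        with ThreadPoolExecutor() as pool:
            datasets = list(pool.map(open_and_preprocess, paths))
    else:
        datasets = [open_and_preprocess(p) for p in paths]
    # The combine step stays serial in this sketch.
    combined = []
    for ds in datasets:
        combined.extend(ds["values"])
    return combined


paths = ["a.nc", "bb.nc", "ccc.nc"]
serial = open_many(paths)
parallel = open_many(paths, parallel=True)
```

Both calls return the same combined result; only the per-file step is parallelized, which mirrors where this PR applies the map.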

What it does not do (yet):

  • check that autoclose=True when multiple processes are being used (multiprocessing/distributed scheduler)
  • provide any speedup with the multiprocessing backend (I do not understand why this is)

Benchmark Example

```Python
In [1]: import xarray as xr
   ...: import dask
   ...: import dask.threaded
   ...: import dask.multiprocessing
   ...: from dask.distributed import Client
   ...:

In [2]: c = Client()
   ...: c
   ...:
Out[2]: <Client: scheduler='tcp://127.0.0.1:59576' processes=4 cores=4>

In [4]: %%time
   ...: with dask.set_options(get=dask.multiprocessing.get):
   ...:     ds = xr.open_mfdataset('../test_files/test_netcdf_*nc', autoclose=True, parallel=True)
   ...:
CPU times: user 4.76 s, sys: 201 ms, total: 4.96 s
Wall time: 7.74 s

In [5]: %%time
   ...: with dask.set_options(get=c.get):
   ...:     ds = xr.open_mfdataset('../test_files/test_netcdf_*nc', autoclose=True, parallel=True)
   ...:
   ...:
CPU times: user 1.88 s, sys: 60.6 ms, total: 1.94 s
Wall time: 4.41 s

In [6]: %%time
   ...: with dask.set_options(get=dask.threaded.get):
   ...:     ds = xr.open_mfdataset('../test_files/test_netcdf_*nc')
   ...:
CPU times: user 7.77 s, sys: 247 ms, total: 8.02 s
Wall time: 8.17 s

In [7]: %%time
   ...: with dask.set_options(get=dask.threaded.get):
   ...:     ds = xr.open_mfdataset('../test_files/test_netcdf_*nc', autoclose=True)
   ...:
   ...:
CPU times: user 7.89 s, sys: 202 ms, total: 8.09 s
Wall time: 8.21 s

In [8]: ds
Out[8]:
<xarray.Dataset>
Dimensions:  (lat: 45, lon: 90, time: 1000)
Coordinates:
  * lon      (lon) float64 0.0 4.045 8.09 12.13 16.18 20.22 24.27 28.31 ...
  * lat      (lat) float64 -90.0 -85.91 -81.82 -77.73 -73.64 -69.55 -65.45 ...
  * time     (time) datetime64[ns] 1970-01-01 1970-01-02 1970-01-11 ...
Data variables:
    foo      (time, lon, lat) float64 dask.array<shape=(1000, 90, 45), chunksize=(1, 90, 45)>
    bar      (time, lon, lat) float64 dask.array<shape=(1000, 90, 45), chunksize=(1, 90, 45)>
    baz      (time, lon, lat) float32 dask.array<shape=(1000, 90, 45), chunksize=(1, 90, 45)>
Attributes:
    history:  created for xarray benchmarking
```
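
For reference, the wall times in the session above can be turned into relative figures. The sketch below assumes the In [6] run (threaded scheduler, no `parallel`, no `autoclose`) is the baseline; the variable and function names are made up for this illustration.

```python
# Wall times in seconds, copied from the benchmark session.
baseline_threaded = 8.17         # In [6]: threaded, default options
parallel_multiprocessing = 7.74  # In [4]: multiprocessing, parallel=True
parallel_distributed = 4.41      # In [5]: distributed, parallel=True


def reduction_percent(baseline, measured):
    # Percentage of wall time saved relative to the baseline run.
    return (1 - measured / baseline) * 100


def speedup_factor(baseline, measured):
    # How many times faster the measured run is than the baseline.
    return baseline / measured


distributed_reduction = reduction_percent(baseline_threaded, parallel_distributed)
multiprocessing_reduction = reduction_percent(baseline_threaded, parallel_multiprocessing)
distributed_speedup = speedup_factor(baseline_threaded, parallel_distributed)

print(f"distributed: {distributed_reduction:.1f}% less wall time "
      f"({distributed_speedup:.2f}x)")
print(f"multiprocessing: {multiprocessing_reduction:.1f}% less wall time")
```

With these numbers the distributed run takes roughly 46% less wall time than the In [6] baseline, while the multiprocessing run saves only about 5%, consistent with the note above that the multiprocessing backend shows little benefit.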

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1983/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: 13221727 · type: pull
