issues: 1581046647
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1581046647 | I_kwDOAMm_X85ePNt3 | 7522 | Differences in `to_netcdf` for dask and numpy backed arrays | 39069044 | open | 0 | 7 | 2023-02-11T23:06:37Z | 2023-03-01T23:12:11Z | CONTRIBUTOR | What is your issue?

I make use of This works great, in that a many GB file can be lazy-loaded as a dataset in a few hundred milliseconds, by only parsing the netcdf headers with under-the-hood byte range requests. But, only if the netcdf is written from dask-backed arrays. Somehow, writing from numpy-backed arrays produces a different netcdf that requires reading deeper into the file to parse as a dataset.

I spent some time digging into the backends and see xarray is ultimately passing off the store write to This should work as an MCVE:

```python
import os
import string

import fsspec
import numpy as np
import xarray as xr

fs = fsspec.filesystem("gs")
bucket = "gs://<your-bucket>"

# create a ~160MB dataset with 20 variables
variables = {v: (["x", "y"], np.random.random(size=(1000, 1000))) for v in string.ascii_letters[:20]}
ds = xr.Dataset(variables)

# Save one version from numpy backed arrays and one from dask backed arrays
ds.compute().to_netcdf("numpy.nc")
ds.chunk().to_netcdf("dask.nc")

# Copy these to a bucket of your choice
fs.put("numpy.nc", bucket)
fs.put("dask.nc", bucket)
```

Then time reading in these files as datasets with fsspec:

```python
%timeit xr.open_dataset(fs.open(os.path.join(bucket, "numpy.nc")))
# 2.15 s ± 40.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
%timeit xr.open_dataset(fs.open(os.path.join(bucket, "dask.nc")))
# 187 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/7522/reactions", "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 1 } |
13221727 | issue |
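
The issue body reports that the numpy-written file "requires reading deeper into the file" but gives only timings. One way to check that directly is to count how many bytes are requested when a file is opened, and how far into the file the reads go. The sketch below is an illustration only, not part of the original report: `CountingReader` is a hypothetical helper built on the standard library's `io.RawIOBase`, and it is demonstrated on an in-memory buffer rather than a real netCDF file.

```python
import io


class CountingReader(io.RawIOBase):
    """Hypothetical helper: wrap a binary file object and record what is read."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw
        self.bytes_read = 0   # total bytes handed back to the caller
        self.max_offset = 0   # furthest byte offset reached by any read

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)

    def tell(self):
        return self._raw.tell()

    def readinto(self, buffer):
        # RawIOBase.read() is implemented on top of readinto().
        data = self._raw.read(len(buffer))
        n = len(data)
        buffer[:n] = data
        self.bytes_read += n
        self.max_offset = max(self.max_offset, self._raw.tell())
        return n


# Demonstration on an in-memory buffer standing in for a file.
payload = bytes(1000)
reader = CountingReader(io.BytesIO(payload))
reader.seek(100)
header = reader.read(16)
print(reader.bytes_read, reader.max_offset)
```

If the backend accepts the wrapper like any other file-like object (an assumption, not verified here), wrapping `fs.open(...)` for `numpy.nc` and `dask.nc` before passing it to `xr.open_dataset` and comparing `bytes_read` and `max_offset` afterwards would show whether the two files really differ in how much has to be read. The counts reflect what the reader requests, not what fsspec fetches over the network, since fsspec may cache and read ahead in larger blocks.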