# Differences in `to_netcdf` for dask and numpy backed arrays

Issue #7522 · state: open · 7 comments · author association: CONTRIBUTOR
Created 2023-02-11T23:06:37Z · Updated 2023-03-01T23:12:11Z

What is your issue?

I make use of fsspec to quickly open netcdf files in the cloud and pull out slices of data without needing to read the entire file. Quick and dirty is just `ds = xr.open_dataset(fs.open("gs://..."))`.

This works great: a many-GB file can be lazy-loaded as a dataset in a few hundred milliseconds, since only the netcdf headers are parsed via under-the-hood byte-range requests. But that only holds if the netcdf was written from dask-backed arrays. Somehow, writing from numpy-backed arrays produces a different netcdf file that requires reading much deeper into the file before it can be parsed as a dataset.

I spent some time digging into the backends and see that xarray is ultimately passing off the store write to dask.array here. A look at `ncdump` and `Dataset.encoding` didn't reveal any obvious differences between these files, but there is clearly something. Does anyone know why the plain xarray store methods would produce a differently structured netcdf, despite the underlying data and encoding being identical?
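One thing I have not verified, but suspect matters, is write order: when each variable is created and immediately filled, HDF5 object headers end up interleaved with raw data, whereas creating all variables first and filling them afterwards keeps the metadata together. A minimal h5py sketch of that idea (the file names and the interleaving hypothesis are mine, not taken from xarray's code):

```python
# Hypothetical illustration: HDF5 file layout depends on whether datasets
# are written as they are created, or created first and filled afterwards.
import numpy as np
import h5py

data = np.random.random((100, 100))

# Interleaved: create and immediately write each dataset in turn.
with h5py.File("interleaved.h5", "w") as f:
    for name in "abc":
        f.create_dataset(name, data=data)

# Metadata-first: create all datasets empty, then fill them.
with h5py.File("metadata_first.h5", "w") as f:
    for name in "abc":
        f.create_dataset(name, shape=data.shape, dtype=data.dtype)
    for name in "abc":
        f[name][...] = data

# get_offset() reports where each dataset's raw data starts in the file;
# comparing the two files shows how write order changes the on-disk layout.
for path in ("interleaved.h5", "metadata_first.h5"):
    with h5py.File(path, "r") as f:
        print(path, {name: f[name].id.get_offset() for name in f})
```

If the numpy-backed write path interleaves like the first file, a lazy reader would need more scattered byte-range requests to collect all the headers, which could explain the timing gap below.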

This should work as an MCVE:

```python
import os
import string

import fsspec
import numpy as np
import xarray as xr

fs = fsspec.filesystem("gs")
bucket = "gs://<your-bucket>"

# Create a ~160MB dataset with 20 variables
variables = {
    v: (["x", "y"], np.random.random(size=(1000, 1000)))
    for v in string.ascii_letters[:20]
}
ds = xr.Dataset(variables)

# Save one version from numpy-backed arrays and one from dask-backed arrays
ds.compute().to_netcdf("numpy.nc")
ds.chunk().to_netcdf("dask.nc")

# Copy these to a bucket of your choice
fs.put("numpy.nc", bucket)
fs.put("dask.nc", bucket)
```

Then time reading these files back in as datasets with fsspec:

```python
%timeit xr.open_dataset(fs.open(os.path.join(bucket, "numpy.nc")))
# 2.15 s ± 40.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
%timeit xr.open_dataset(fs.open(os.path.join(bucket, "dask.nc")))
# 187 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
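For anyone hitting this in the meantime, a workaround consistent with the timings above (my own suggestion, not a confirmed fix) is to route in-memory data through dask before writing, so the file is produced by the fast-to-open dask write path:

```python
import numpy as np
import xarray as xr

# A small in-memory (numpy-backed) dataset, stand-in for the MCVE above.
ds = xr.Dataset({"a": (("x", "y"), np.random.random((10, 10)))})

# .chunk() converts the variables to dask arrays, so to_netcdf takes the
# dask store path that produced the quick-to-open "dask.nc" above.
ds.chunk().to_netcdf("fast_open.nc")
```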

