issues: 955029073

id: 955029073
node_id: MDU6SXNzdWU5NTUwMjkwNzM=
number: 5643
title: open_mfdataset from zarr store, consolidated=None warns, consolidated=False is slow, consolidated=True fails
user: 5509356
state: open
locked: 0
comments: 0
created_at: 2021-07-28T16:23:53Z
updated_at: 2021-08-14T17:41:40Z
author_association: NONE

What happened:

With xarray 0.19.0, using open_mfdataset to read from a zarr store written with a previous version of xarray (with consolidated=True), I get the following results depending on the consolidated parameter:

  1. None or not provided: The call takes 3 seconds and spits out a warning saying to set the consolidated flag.
  2. True: The call fails
  3. False: The call takes 15 seconds (5x longer than None) but doesn't warn

Hopefully it's okay if I include the actual code rather than trying to create a test zarr store that reproduces the situation:

```python
import s3fs
import xarray as xr

top_group_url = 's3://hrrrzarr/sfc/20200801/20200801_00z_anl.zarr'
group_url = f'{top_group_url}/surface/GUST'
subgroup_url = f"{group_url}/surface"

fs = s3fs.S3FileSystem(anon=True)
ds = xr.open_mfdataset(
    [s3fs.S3Map(url, s3=fs) for url in [top_group_url, group_url, subgroup_url]],
    engine='zarr',
    consolidated=None,
)
```

Note that if I remove top_group_url, consolidated=False returns as quickly as consolidated=None and gives the same results (so it's just the top_group_url which is slowing things down, even though xarray doesn't load any data from it). However, if I only pass top_group_url, xarray loads the dataset in about 300 ms, but the dataset is empty.
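For anyone who wants to compare the three settings side by side, here is a rough sketch of a timing helper. It is not what I ran to get the numbers above; `compare_consolidated` is a hypothetical helper that uses only the standard library and expects any callable that accepts a `consolidated` keyword.

```python
import time
import warnings


def compare_consolidated(opener, values=(None, True, False)):
    """Call opener(consolidated=value) for each value and record the outcome."""
    results = {}
    for value in values:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")  # make sure no warning is suppressed
            start = time.perf_counter()
            try:
                opener(consolidated=value)
                error = None
            except Exception as exc:  # record the failure instead of stopping
                error = repr(exc)
            elapsed = time.perf_counter() - start
        results[value] = {
            "seconds": elapsed,
            "warnings": [str(w.message) for w in caught],
            "error": error,
        }
    return results
```

With the stores from the example above, the opener could be something like `functools.partial(xr.open_mfdataset, stores, engine='zarr')`, where `stores` is the list of `s3fs.S3Map` objects. Timings will of course depend on the network.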

What I expected to happen:

  1. I would expect that providing top_group_url (where the .zmetadata file is located) would speed up the call and prevent the warning when consolidated=None, since the metadata is in fact consolidated.
  2. I wouldn't expect it to be possible to construct a call where every value of consolidated either warns, fails, or is really slow. On top of that, the warning directs me either to switch to one of the calls that fails or is slow, or to update the data store, which I don't have write permission for.
  3. I would expect the data store to be efficiently and easily read using xarray since it was written by xarray.
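To make item 1 concrete: consolidated metadata in a zarr v2 store lives under the `.zmetadata` key, so a per-store check is cheap. Below is a minimal sketch, assuming each store behaves like a mapping of keys (which I believe `s3fs.S3Map` does). `consolidated_flag` and `flags_for_stores` are hypothetical helpers, not part of xarray or zarr, and the dict stores are stand-ins for illustration. That the sub-groups lack `.zmetadata` is my assumption here, not something I have verified for this store.

```python
from collections.abc import Mapping

CONSOLIDATED_KEY = ".zmetadata"  # key zarr v2 uses for consolidated metadata


def consolidated_flag(store: Mapping) -> bool:
    """Return True if the store has a consolidated-metadata key at its root."""
    return CONSOLIDATED_KEY in store


def flags_for_stores(stores: dict) -> dict:
    """Map each store name to the consolidated flag that matches its contents."""
    return {name: consolidated_flag(store) for name, store in stores.items()}


# Stand-in stores: plain dicts instead of real S3 mappings.
example_stores = {
    "top": {".zmetadata": b"{}", ".zgroup": b"{}"},
    "group": {".zgroup": b"{}"},
}
flags = flags_for_stores(example_stores)
```

Each flag could then be passed to a per-store call such as `xr.open_zarr(store, consolidated=flag)`; whether that avoids the warning or the slowdown is untested.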

Anything else we need to know?:

This zarr store cannot be (usefully) opened by xarray without using open_mfdataset due to an issue I brought up in discussion #5584 which no one has replied to so far. Basically, the person creating it assumed that if they used xarray to write it, xarray would have no problem reading it, but since there's a slash in the variable names, xarray created it as a deeply-nested zarr store instead of a store with each variable as a single-level (sub)group that xarray would have been able to handle.

Each variable was written like this:

```python
ds.to_zarr(store=store, group=i, mode='w', encoding=encoding, consolidated=True)
```

In the example, "surface/GUST" is the variable name.
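Because of the slash, each nesting level ends up as its own group URL that has to be handed to open_mfdataset. A small sketch of how that list can be built; `nested_group_urls` is a hypothetical helper, and the group parts are the ones from my example above.

```python
import posixpath


def nested_group_urls(top_url, *group_parts):
    """Return top_url followed by each progressively deeper group URL."""
    urls = [top_url]
    for part in group_parts:
        # Each part is appended to the previously built URL.
        urls.append(posixpath.join(urls[-1], part))
    return urls


urls = nested_group_urls(
    "s3://hrrrzarr/sfc/20200801/20200801_00z_anl.zarr",
    "surface/GUST",
    "surface",
)
```

The resulting three entries are the same top_group_url, group_url and subgroup_url used in the open_mfdataset call.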

Environment:

Output of xr.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:36:15) [Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.0
xarray: 0.19.0
pandas: 1.3.1
numpy: 1.21.1
scipy: 1.7.0
netCDF4: 1.5.7
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.07.1
distributed: 2021.07.1
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 21.2.1
conda: None
pytest: None
IPython: 7.25.0
sphinx: None

repo: 13221727
type: issue
