home / github

Menu
  • GraphQL API
  • Search all tables

issues

Table actions
  • GraphQL API for issues

3 rows where repo = 13221727, type = "issue" and user = 15016780 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

Suggested facets: comments, created_at (date), updated_at (date), closed_at (date)

type 1

  • issue · 3 ✖

state 1

  • closed 3

repo 1

  • xarray · 3 ✖
id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
2111051033 I_kwDOAMm_X8591BUZ 8691 xarray.open_dataset with chunks={} returns a single chunk and not engine (h5netcdf) preferred chunks abarciauskas-bgse 15016780 closed 0     4 2024-01-31T22:04:02Z 2024-01-31T22:56:17Z 2024-01-31T22:56:17Z NONE      

What happened?

When opening MUR SST netcdfs from S3, xarray.open_dataset(file, engine="h5netcdf", chunks={}) returns a single chunk (whereas the h5netcdf library returns a chunk shape of (1, 1023, 2047).

A notebook version of the code below includes the output: https://gist.github.com/abarciauskas-bgse/9366e04d2af09b79c9de466f6c1d3b90

What did you expect to happen?

I thought the chunks={} option would return the same chunks (1, 1023, 2047) exposed by the h5netcdf engine.

Minimal Complete Verifiable Example

```Python

!/usr/bin/env python

coding: utf-8

This notebook looks at how xarray and h5netcdf return different chunks.

import pandas as pd import h5netcdf import s3fs import xarray as xr

dates = [ d.to_pydatetime().strftime('%Y%m%d') for d in pd.date_range('2023-02-01', '2023-03-01', freq='D') ]

SHORT_NAME = 'MUR-JPL-L4-GLOB-v4.1' s3_fs = s3fs.S3FileSystem(anon=False) var = 'analysed_sst'

def make_filename(time): base_url = f's3://podaac-ops-cumulus-protected/{SHORT_NAME}/' # example file: "/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc" return f'{base_url}{time}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'

s3_urls = [make_filename(d) for d in dates]

def print_chunk_shape(s3_url): try: # Open the dataset using xarray file = s3_fs.open(s3_url) dataset = xr.open_dataset(file, engine='h5netcdf', chunks={})

    # Print chunk shapes for each variable in the dataset
    print(f"\nChunk shapes for {s3_url}:")
    if dataset[var].chunks is not None:
        print(f"xarray open_dataset chunks for {var}: {dataset[var].chunks}")
    else:
        print(f"xarray open_dataset chunks for {var}: Not chunked")

    with h5netcdf.File(file, 'r') as file:
        dataset = file[var]

        # Check if the dataset is chunked
        if dataset.chunks:
            print(f"h5netcdf chunks for {var}:", dataset.chunks)
        else:
            print(f"h5netcdf dataset is not chunked.")

except Exception as e:
    print(f"Failed to process {s3_url}: {e}")

[print_chunk_shape(s3_url) for s3_url in s3_urls] ```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 5.10.198-187.748.amzn2.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.1 libnetcdf: 4.9.2 xarray: 2023.6.0 pandas: 2.0.3 numpy: 1.24.4 scipy: 1.11.1 netCDF4: 1.6.4 pydap: installed h5netcdf: 1.2.0 h5py: 3.9.0 Nio: None zarr: 2.15.0 cftime: 1.6.2 nc_time_axis: 1.4.1 PseudoNetCDF: None iris: None bottleneck: 1.3.7 dask: 2023.6.1 distributed: 2023.6.1 matplotlib: 3.7.1 cartopy: 0.21.1 seaborn: 0.12.2 numbagg: None fsspec: 2023.6.0 cupy: None pint: 0.22 sparse: 0.14.0 flox: 0.7.2 numpy_groupies: 0.9.22 setuptools: 68.0.0 pip: 23.1.2 conda: None pytest: 7.4.0 mypy: None IPython: 8.14.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8691/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
548475127 MDU6SXNzdWU1NDg0NzUxMjc= 3686 Different data values from xarray open_mfdataset when using chunks abarciauskas-bgse 15016780 closed 0     7 2020-01-11T20:15:12Z 2020-01-20T20:35:48Z 2020-01-20T20:35:47Z NONE      

MCVE Code Sample

You will first need to download or (mount podaac's drive) from PO.DAAC, including credentials: ```bash curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/152/

curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/153/

curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/154/

curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/155/

curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/156/ ```

Then run the following code: ```python from datetime import datetime import xarray as xr import glob

def generate_file_list(start_doy, end_doy):
""" Given a start day and end end day, generate a list of file locations. Assumes a 'prefix' and 'year' variables have already been defined. 'Prefix' should be a local directory or http url and path. 'Year' should be a 4 digit year. """ days_of_year = list(range(start_doy, end_doy)) fileObjs = [] for doy in days_of_year: if doy < 10: doy = f"00{doy}" elif doy >= 10 and doy < 100: doy = f"0{doy}"
file = glob.glob(f"{prefix}/{doy}/*.nc")[0] fileObjs.append(file) return fileObjs

Invariants - but could be made configurable

year = 2002 prefix = f"data/mursst_netcdf" chunks = {'time': 1, 'lat': 1799, 'lon': 3600}

Create a list of files

start_doy = 152 num_days = 5 end_doy = start_doy + num_days fileObjs = generate_file_list(start_doy, end_doy)

will use this timeslice in query later on

time_slice = slice(datetime.strptime(f"{year}-06-02", '%Y-%m-%d'), datetime.strptime(f"{year}-06-04", '%Y-%m-%d'))

print("results from unchunked dataset") ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords') print(ds_unchunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values) print(ds_unchunked.analysed_sst.sel(time=time_slice).mean().values)

print(f"results from chunked dataset using {chunks}") ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks=chunks) print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values) print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values)

print("results from chunked dataset using 'auto'") ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks={'time': 'auto', 'lat': 'auto', 'lon': 'auto'}) print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values) print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values) ``` Note, these are just a few examples but I tried a variety of other chunk options and got similar discrepancies between the unchunked and chunked datasets.

Output: results from unchunked dataset 290.13754 286.7869 results from chunked dataset using {'time': 1, 'lat': 1799, 'lon': 3600} 290.13757 286.81107 results from chunked dataset using 'auto' 290.1377 286.8118

Expected Output

Values output from queries of chunked and unchunked xarray dataset are equal.

Problem Description

I want to understand how to chunk or query data to verify data opened using chunks will have the same output as data opened without chunking. Would like to store data ultimately in Zarr but verifying data integrity is critical.

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.8.1 | packaged by conda-forge | (default, Jan 5 2020, 20:58:18) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.14.154-99.181.amzn1.x86_64 machine: x86_64 processor: byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.7.3 xarray: 0.14.1 pandas: 0.25.3 numpy: 1.17.3 scipy: None netCDF4: 1.5.3 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.3.2 cftime: 1.0.4.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.9.1 distributed: 2.9.1 matplotlib: None cartopy: None seaborn: None numbagg: None setuptools: 44.0.0.post20200102 pip: 19.3.1 conda: None pytest: None IPython: 7.11.1 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3686/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
493058488 MDU6SXNzdWU0OTMwNTg0ODg= 3306 `ds.load()` with local files stalls and fails, and `to_zarr` does not include `store` in the dask graph abarciauskas-bgse 15016780 closed 0     7 2019-09-12T22:29:04Z 2019-09-16T01:22:09Z 2019-09-16T01:22:09Z NONE      

MCVE Code Sample

Below details a scenario where reading local netcdf files (shared via EFS) to create a zarr store is not calling store as part of the dask graph. I discovered it looks like this may actually be related to concatenate

I include a commented option where I try using files over https and this works (does store data on S3), but of course the open dataset calls are slower.

ds.to_zarr and ds.load() will both stall and eventually returning many instances of:

distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://192.168.62.40:37233'], ('concatenate-2babafa03313bcf979ae6ca3a8e16aad', 1, 10, 13) NoneType: None distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://192.168.62.40:37233'], ('concatenate-2babafa03313bcf979ae6ca3a8e16aad', 0, 6, 30)

```python

!/usr/bin/env python

coding: utf-8

In[1]:

import xarray as xr from dask.distributed import Client, progress import s3fs import zarr import datetime

In[16]:

import datetime

chunks = {'lat': 1000, 'lon': 1000} base = 2018 year = base ending = '090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc' days_of_year = list(range(152, 154)) file_urls = []

for doy in days_of_year: date = datetime.datetime(year, 1, 1) + datetime.timedelta(doy - 1) date = date.strftime('%Y%m%d') file_urls.append('./{}/{}/{}{}'.format(year, doy, date, ending))

print(file_urls) ds = xr.open_mfdataset(file_urls, chunks=chunks, combine='by_coords', parallel=True) ds

In[21]:

This works fine

base_url = 'https://podaac-opendap.jpl.nasa.gov:443/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/'

url_ending = '090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc?time[0:1:0],lat[0:1:17998],lon[0:1:35999],analysed_sst[0:1:0][0:1:17998][0:1:35999]'

year = 2018

days_of_year = list(range(152, 154))

file_urls = []

for doy in days_of_year:

date = datetime.datetime(year, 1, 1) + datetime.timedelta(doy - 1)

date = date.strftime('%Y%m%d')

file_urls.append('{}/{}/{}/{}{}'.format(base_url, year, doy, date, url_ending))

#file_urls

ds = xr.open_mfdataset(file_urls, chunks=chunks, parallel=True, combine='by_coords')

ds

In[ ]:

Write zarr to s3

myS3fs = s3fs.S3FileSystem(anon=False) zarr_s3 = 'aimeeb-datasets-private/mur_sst_zarr14' d = s3fs.S3Map(zarr_s3, s3=myS3fs) compressor = zarr.Blosc(cname='zstd', clevel=5, shuffle=zarr.Blosc.AUTOSHUFFLE) encoding = {v: {'compressor': compressor} for v in ds.data_vars} ds.to_zarr(d, mode='w', encoding=encoding)

```

Expected Output

Expect the call to_zarr to produce a graph with store

Problem Description

The end result should be a zarr store on S3

Output of xr.show_versions()

``` INSTALLED VERSIONS


commit: None python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.14.128-112.105.amzn2.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2

xarray: 0.12.3 pandas: 0.25.0 numpy: 1.16.4 scipy: 1.3.0 netCDF4: 1.5.1.2 pydap: None h5netcdf: None h5py: 2.9.0 Nio: None zarr: 2.3.2 cftime: 1.0.3.4 nc_time_axis: None PseudoNetCDF: None rasterio: 1.0.24 cfgrib: None iris: None bottleneck: None dask: 2.2.0 distributed: 2.2.0 matplotlib: 3.1.1 cartopy: 0.17.0 seaborn: 0.9.0 numbagg: None setuptools: 41.0.1 pip: 19.2.1 conda: None pytest: None IPython: 7.7.0 sphinx: None ```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3306/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
Powered by Datasette · Queries took 72.558ms · About: xarray-datasette