id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
2111051033,I_kwDOAMm_X8591BUZ,8691,xarray.open_dataset with chunks={} returns a single chunk and not engine (h5netcdf) preferred chunks,15016780,closed,0,,,4,2024-01-31T22:04:02Z,2024-01-31T22:56:17Z,2024-01-31T22:56:17Z,NONE,,,,"### What happened?

When opening MUR SST netCDF files from S3, xarray.open_dataset(file, engine=""h5netcdf"", chunks={}) returns a single chunk, whereas the h5netcdf library reports a chunk shape of (1, 1023, 2047). A notebook version of the code below, including its output, is here: https://gist.github.com/abarciauskas-bgse/9366e04d2af09b79c9de466f6c1d3b90

### What did you expect to happen?

I expected the chunks={} option to return the same chunks (1, 1023, 2047) exposed by the h5netcdf engine.

### Minimal Complete Verifiable Example

```Python
#!/usr/bin/env python
# coding: utf-8

# This notebook looks at how xarray and h5netcdf return different chunks.

import pandas as pd
import h5netcdf
import s3fs
import xarray as xr

dates = [
    d.to_pydatetime().strftime('%Y%m%d')
    for d in pd.date_range('2023-02-01', '2023-03-01', freq='D')
]

SHORT_NAME = 'MUR-JPL-L4-GLOB-v4.1'
s3_fs = s3fs.S3FileSystem(anon=False)
var = 'analysed_sst'


def make_filename(time):
    base_url = f's3://podaac-ops-cumulus-protected/{SHORT_NAME}/'
    # example file: ""/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc""
    return f'{base_url}{time}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'


s3_urls = [make_filename(d) for d in dates]


def print_chunk_shape(s3_url):
    try:
        # Open the dataset using xarray
        file = s3_fs.open(s3_url)
        dataset = xr.open_dataset(file, engine='h5netcdf', chunks={})

        # Print the chunk shape xarray reports for the target variable
        print(f""\nChunk shapes for {s3_url}:"")
        if dataset[var].chunks is not None:
            print(f""xarray open_dataset chunks for {var}: {dataset[var].chunks}"")
        else:
            print(f""xarray open_dataset chunks for {var}: Not chunked"")

        # Open the same file with h5netcdf and print the on-disk chunk shape
        with h5netcdf.File(file, 'r') as nc_file:
            variable = nc_file[var]
            if variable.chunks:
                print(f""h5netcdf chunks for {var}:"", variable.chunks)
            else:
                print(f""h5netcdf dataset is not chunked."")
    except Exception as e:
        print(f""Failed to process {s3_url}: {e}"")


for s3_url in s3_urls:
    print_chunk_shape(s3_url)
```

### MVCE confirmation

- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [x] Complete example — the example is self-contained, including all data and the text of any traceback.
- [x] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

### Relevant log output

_No response_

### Anything else we need to know?

_No response_

### Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.198-187.748.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.1
libnetcdf: 4.9.2

xarray: 2023.6.0
pandas: 2.0.3
numpy: 1.24.4
scipy: 1.11.1
netCDF4: 1.6.4
pydap: installed
h5netcdf: 1.2.0
h5py: 3.9.0
Nio: None
zarr: 2.15.0
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.6.1
distributed: 2023.6.1
matplotlib: 3.7.1
cartopy: 0.21.1
seaborn: 0.12.2
numbagg: None
fsspec: 2023.6.0
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.7.2
numpy_groupies: 0.9.22
setuptools: 68.0.0
pip: 23.1.2
conda: None
pytest: 7.4.0
mypy: None
IPython: 8.14.0
sphinx: None
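
As a possible workaround for the behavior described above, here is a minimal sketch (not part of the original report) that re-chunks to the on-disk chunk shape explicitly. It assumes the h5netcdf backend records the HDF5 chunk shape under encoding['chunksizes'] and that dask is installed; the exact encoding key may vary across xarray versions.

```python
# Sketch: derive dask chunks from the on-disk HDF5 chunk shape instead of chunks={}.
# Assumes encoding['chunksizes'] is populated by the h5netcdf backend.
import s3fs
import xarray as xr

s3_fs = s3fs.S3FileSystem(anon=False)
url = 's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20230201090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'

with s3_fs.open(url) as f:
    ds = xr.open_dataset(f, engine='h5netcdf')                # lazy, no dask chunking yet
    on_disk = ds['analysed_sst'].encoding.get('chunksizes')   # e.g. (1, 1023, 2047)
    if on_disk is not None:
        ds = ds.chunk(dict(zip(ds['analysed_sst'].dims, on_disk)))
    print(ds['analysed_sst'].chunks)                          # dask chunks now match disk chunks
```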
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8691/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 548475127,MDU6SXNzdWU1NDg0NzUxMjc=,3686,Different data values from xarray open_mfdataset when using chunks ,15016780,closed,0,,,7,2020-01-11T20:15:12Z,2020-01-20T20:35:48Z,2020-01-20T20:35:47Z,NONE,,,,"#### MCVE Code Sample You will first need to download or (mount podaac's drive) from PO.DAAC, including credentials: ```bash curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/152/ curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/153/ curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/154/ curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/155/ curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/156/ ``` Then run the following code: ```python from datetime import datetime import xarray as xr import glob def generate_file_list(start_doy, end_doy): """""" Given a start day and end end day, generate a list of file locations. Assumes a 'prefix' and 'year' variables have already been defined. 'Prefix' should be a local directory or http url and path. 'Year' should be a 4 digit year. 
"""""" days_of_year = list(range(start_doy, end_doy)) fileObjs = [] for doy in days_of_year: if doy < 10: doy = f""00{doy}"" elif doy >= 10 and doy < 100: doy = f""0{doy}"" file = glob.glob(f""{prefix}/{doy}/*.nc"")[0] fileObjs.append(file) return fileObjs # Invariants - but could be made configurable year = 2002 prefix = f""data/mursst_netcdf"" chunks = {'time': 1, 'lat': 1799, 'lon': 3600} # Create a list of files start_doy = 152 num_days = 5 end_doy = start_doy + num_days fileObjs = generate_file_list(start_doy, end_doy) # will use this timeslice in query later on time_slice = slice(datetime.strptime(f""{year}-06-02"", '%Y-%m-%d'), datetime.strptime(f""{year}-06-04"", '%Y-%m-%d')) print(""results from unchunked dataset"") ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords') print(ds_unchunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values) print(ds_unchunked.analysed_sst.sel(time=time_slice).mean().values) print(f""results from chunked dataset using {chunks}"") ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks=chunks) print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values) print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values) print(""results from chunked dataset using 'auto'"") ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks={'time': 'auto', 'lat': 'auto', 'lon': 'auto'}) print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values) print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values) ``` Note, these are just a few examples but I tried a variety of other chunk options and got similar discrepancies between the unchunked and chunked datasets. Output: ``` results from unchunked dataset 290.13754 286.7869 results from chunked dataset using {'time': 1, 'lat': 1799, 'lon': 3600} 290.13757 286.81107 results from chunked dataset using 'auto' 290.1377 286.8118 ``` #### Expected Output Values output from queries of chunked and unchunked xarray dataset are equal. #### Problem Description I want to understand how to chunk or query data to verify data opened using chunks will have the same output as data opened without chunking. Would like to store data ultimately in Zarr but verifying data integrity is critical. #### Output of ``xr.show_versions()``
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.1 | packaged by conda-forge | (default, Jan 5 2020, 20:58:18) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.14.154-99.181.amzn1.x86_64
machine: x86_64
processor:
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.3
scipy: None
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.9.1
distributed: 2.9.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 44.0.0.post20200102
pip: 19.3.1
conda: None
pytest: None
IPython: 7.11.1
sphinx: None
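
One way to check whether these small differences come from float32 accumulation order rather than from the data itself is to repeat the reductions in float64. The following is a sketch, not part of the original report; it reuses fileObjs, chunks, and time_slice from the example above.

```python
# Sketch: if the discrepancy is floating-point accumulation order, the float64
# means of the unchunked and chunked datasets should agree to far more digits.
import xarray as xr

ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords')
ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks=chunks)

a = ds_unchunked.analysed_sst.astype('float64').sel(time=time_slice).mean().values
b = ds_chunked.analysed_sst.astype('float64').sel(time=time_slice).mean().values
print(a, b, abs(a - b))   # expect the difference to shrink dramatically in float64
```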
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3686/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 493058488,MDU6SXNzdWU0OTMwNTg0ODg=,3306,"`ds.load()` with local files stalls and fails, and `to_zarr` does not include `store` in the dask graph",15016780,closed,0,,,7,2019-09-12T22:29:04Z,2019-09-16T01:22:09Z,2019-09-16T01:22:09Z,NONE,,,,"#### MCVE Code Sample Below details a scenario where reading local netcdf files (shared via EFS) to create a zarr store is not calling `store` as part of the dask graph. I discovered it looks like this may actually be related to `concatenate` I include a commented option where I try using files over https and this works (does store data on S3), but of course the open dataset calls are slower. `ds.to_zarr` and `ds.load()` will both stall and eventually returning many instances of: ``` distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://192.168.62.40:37233'], ('concatenate-2babafa03313bcf979ae6ca3a8e16aad', 1, 10, 13) NoneType: None distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://192.168.62.40:37233'], ('concatenate-2babafa03313bcf979ae6ca3a8e16aad', 0, 6, 30) ``` ```python #!/usr/bin/env python # coding: utf-8 # In[1]: import xarray as xr from dask.distributed import Client, progress import s3fs import zarr import datetime # In[16]: import datetime chunks = {'lat': 1000, 'lon': 1000} base = 2018 year = base ending = '090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc' days_of_year = list(range(152, 154)) file_urls = [] for doy in days_of_year: date = datetime.datetime(year, 1, 1) + datetime.timedelta(doy - 1) date = date.strftime('%Y%m%d') file_urls.append('./{}/{}/{}{}'.format(year, doy, date, ending)) print(file_urls) ds = xr.open_mfdataset(file_urls, chunks=chunks, combine='by_coords', parallel=True) ds # In[21]: # This works fine # base_url = 'https://podaac-opendap.jpl.nasa.gov:443/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/' # url_ending = '090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc?time[0:1:0],lat[0:1:17998],lon[0:1:35999],analysed_sst[0:1:0][0:1:17998][0:1:35999]' # year = 2018 # days_of_year = list(range(152, 154)) # file_urls = [] # for doy in days_of_year: # date = datetime.datetime(year, 1, 1) + datetime.timedelta(doy - 1) # date = date.strftime('%Y%m%d') # file_urls.append('{}/{}/{}/{}{}'.format(base_url, year, doy, date, url_ending)) # #file_urls # ds = xr.open_mfdataset(file_urls, chunks=chunks, parallel=True, combine='by_coords') # ds # In[ ]: # Write zarr to s3 myS3fs = s3fs.S3FileSystem(anon=False) zarr_s3 = 'aimeeb-datasets-private/mur_sst_zarr14' d = s3fs.S3Map(zarr_s3, s3=myS3fs) compressor = zarr.Blosc(cname='zstd', clevel=5, shuffle=zarr.Blosc.AUTOSHUFFLE) encoding = {v: {'compressor': compressor} for v in ds.data_vars} ds.to_zarr(d, mode='w', encoding=encoding) ``` #### Expected Output Expect the call `to_zarr` to produce a graph with `store` #### Problem Description The end result should be a zarr store on S3 #### Output of ``xr.show_versions()`` ``` INSTALLED VERSIONS ------------------ commit: None python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.14.128-112.105.amzn2.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2 xarray: 0.12.3 pandas: 0.25.0 numpy: 
1.16.4 scipy: 1.3.0 netCDF4: 1.5.1.2 pydap: None h5netcdf: None h5py: 2.9.0 Nio: None zarr: 2.3.2 cftime: 1.0.3.4 nc_time_axis: None PseudoNetCDF: None rasterio: 1.0.24 cfgrib: None iris: None bottleneck: None dask: 2.2.0 distributed: 2.2.0 matplotlib: 3.1.1 cartopy: 0.17.0 seaborn: 0.9.0 numbagg: None setuptools: 41.0.1 pip: 19.2.1 conda: None pytest: None IPython: 7.7.0 sphinx: None ``` ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3306/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
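
A minimal sketch of making the store step explicit, assuming the same ds, d (the S3Map), and encoding defined in the script above, and an xarray version in which to_zarr accepts compute=False: request a delayed write and then compute it, so the store tasks must appear in the graph that gets executed.

```python
# Sketch: build the zarr write lazily, then trigger the store step explicitly.
# Assumes ds, d, and encoding from the script above, and to_zarr(..., compute=False).
delayed_store = ds.to_zarr(d, mode='w', encoding=encoding, compute=False)
print(delayed_store)        # a dask Delayed object wrapping the zarr store step
delayed_store.compute()     # executes the full graph, including store
```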