id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
2111051033,I_kwDOAMm_X8591BUZ,8691,xarray.open_dataset with chunks={} returns a single chunk and not engine (h5netcdf) preferred chunks,15016780,closed,0,,,4,2024-01-31T22:04:02Z,2024-01-31T22:56:17Z,2024-01-31T22:56:17Z,NONE,,,,"### What happened?
When opening MUR SST netCDF files from S3, xarray.open_dataset(file, engine=""h5netcdf"", chunks={}) returns a single chunk, whereas the h5netcdf library reports a chunk shape of (1, 1023, 2047).
A notebook version of the code below includes the output: https://gist.github.com/abarciauskas-bgse/9366e04d2af09b79c9de466f6c1d3b90
### What did you expect to happen?
I expected the chunks={} option to produce the same chunks, (1, 1023, 2047), that the h5netcdf engine exposes.
### Minimal Complete Verifiable Example
```Python
#!/usr/bin/env python
# coding: utf-8
# This notebook looks at how xarray and h5netcdf return different chunks.
import pandas as pd
import h5netcdf
import s3fs
import xarray as xr
dates = [
    d.to_pydatetime().strftime('%Y%m%d')
    for d in pd.date_range('2023-02-01', '2023-03-01', freq='D')
]
SHORT_NAME = 'MUR-JPL-L4-GLOB-v4.1'
s3_fs = s3fs.S3FileSystem(anon=False)
var = 'analysed_sst'
def make_filename(time):
    base_url = f's3://podaac-ops-cumulus-protected/{SHORT_NAME}/'
    # example file: ""/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc""
    return f'{base_url}{time}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'
s3_urls = [make_filename(d) for d in dates]
def print_chunk_shape(s3_url):
    try:
        # Open the dataset using xarray
        file = s3_fs.open(s3_url)
        dataset = xr.open_dataset(file, engine='h5netcdf', chunks={})
        # Print chunk shapes for each variable in the dataset
        print(f""\nChunk shapes for {s3_url}:"")
        if dataset[var].chunks is not None:
            print(f""xarray open_dataset chunks for {var}: {dataset[var].chunks}"")
        else:
            print(f""xarray open_dataset chunks for {var}: Not chunked"")
        with h5netcdf.File(file, 'r') as file:
            dataset = file[var]
            # Check if the dataset is chunked
            if dataset.chunks:
                print(f""h5netcdf chunks for {var}:"", dataset.chunks)
            else:
                print(f""h5netcdf dataset is not chunked."")
    except Exception as e:
        print(f""Failed to process {s3_url}: {e}"")
[print_chunk_shape(s3_url) for s3_url in s3_urls]
```
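For comparison, the chunking recorded by the backend can also be read from the variable encoding when the file is opened without any `chunks` argument. This is a minimal sketch, not part of the original report; it reuses `s3_fs`, `s3_urls`, and `var` from the example above and assumes the `chunksizes`/`preferred_chunks` encoding keys set by the h5netcdf backend:
```python
# Minimal sketch (assumption: reuses s3_fs, s3_urls and var from the MVCE above).
# The h5netcdf/netCDF4 backends record the on-disk chunk shape in the encoding;
# chunks={} is expected to pick up the 'preferred_chunks' entry.
with s3_fs.open(s3_urls[0]) as f:
    ds = xr.open_dataset(f, engine='h5netcdf')  # no dask chunking requested
    print(ds[var].encoding.get('chunksizes'))        # e.g. (1, 1023, 2047)
    print(ds[var].encoding.get('preferred_chunks'))  # dict keyed by dimension name
```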
### MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [x] Complete example — the example is self-contained, including all data and the text of any traceback.
- [x] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
### Relevant log output
_No response_
### Anything else we need to know?
_No response_
### Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.198-187.748.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.1
libnetcdf: 4.9.2
xarray: 2023.6.0
pandas: 2.0.3
numpy: 1.24.4
scipy: 1.11.1
netCDF4: 1.6.4
pydap: installed
h5netcdf: 1.2.0
h5py: 3.9.0
Nio: None
zarr: 2.15.0
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.6.1
distributed: 2023.6.1
matplotlib: 3.7.1
cartopy: 0.21.1
seaborn: 0.12.2
numbagg: None
fsspec: 2023.6.0
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.7.2
numpy_groupies: 0.9.22
setuptools: 68.0.0
pip: 23.1.2
conda: None
pytest: 7.4.0
mypy: None
IPython: 8.14.0
sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8691/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
548475127,MDU6SXNzdWU1NDg0NzUxMjc=,3686,Different data values from xarray open_mfdataset when using chunks ,15016780,closed,0,,,7,2020-01-11T20:15:12Z,2020-01-20T20:35:48Z,2020-01-20T20:35:47Z,NONE,,,,"#### MCVE Code Sample
You will first need to download the files from PO.DAAC (or mount PO.DAAC's drive), which requires credentials:
```bash
curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
```
Then run the following code:
```python
from datetime import datetime
import xarray as xr
import glob
def generate_file_list(start_doy, end_doy):
    """"""
    Given a start day and an end day, generate a list of file locations.
    Assumes 'prefix' and 'year' variables have already been defined.
    'prefix' should be a local directory or an http url and path.
    'year' should be a 4 digit year.
    """"""
    days_of_year = list(range(start_doy, end_doy))
    fileObjs = []
    for doy in days_of_year:
        if doy < 10:
            doy = f""00{doy}""
        elif doy >= 10 and doy < 100:
            doy = f""0{doy}""
        file = glob.glob(f""{prefix}/{doy}/*.nc"")[0]
        fileObjs.append(file)
    return fileObjs
# Invariants - but could be made configurable
year = 2002
prefix = f""data/mursst_netcdf""
chunks = {'time': 1, 'lat': 1799, 'lon': 3600}
# Create a list of files
start_doy = 152
num_days = 5
end_doy = start_doy + num_days
fileObjs = generate_file_list(start_doy, end_doy)
# will use this timeslice in query later on
time_slice = slice(datetime.strptime(f""{year}-06-02"", '%Y-%m-%d'), datetime.strptime(f""{year}-06-04"", '%Y-%m-%d'))
print(""results from unchunked dataset"")
ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords')
print(ds_unchunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
print(ds_unchunked.analysed_sst.sel(time=time_slice).mean().values)
print(f""results from chunked dataset using {chunks}"")
ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks=chunks)
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values)
print(""results from chunked dataset using 'auto'"")
ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks={'time': 'auto', 'lat': 'auto', 'lon': 'auto'})
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values)
```
Note: these are just a few examples, but I tried a variety of other chunk options and got similar discrepancies between the unchunked and chunked datasets.
Output:
```
results from unchunked dataset
290.13754
286.7869
results from chunked dataset using {'time': 1, 'lat': 1799, 'lon': 3600}
290.13757
286.81107
results from chunked dataset using 'auto'
290.1377
286.8118
```
#### Expected Output
Values returned by queries of the chunked and unchunked xarray datasets should be equal.
#### Problem Description
I want to understand how to chunk or query the data so that a dataset opened with chunks gives the same output as one opened without chunking. I would ultimately like to store the data in Zarr, but verifying data integrity is critical.
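One way to sanity-check this (a hedged sketch, not part of the original report; it assumes the small differences come from float32 reduction order, which is not confirmed here) is to compare the chunked and unchunked reductions with a tolerance and repeat them in float64, reusing `ds_unchunked`, `ds_chunked`, and `time_slice` from the example above:
```python
# Hedged sketch (assumption: reuses ds_unchunked, ds_chunked and time_slice
# from the example above). Checks agreement within float32 round-off and
# repeats the reduction in float64, where ordering effects should shrink.
import numpy as np

a = ds_unchunked.analysed_sst.sel(time=time_slice)
b = ds_chunked.analysed_sst.sel(time=time_slice)

print(np.allclose(a.mean().values, b.mean().values, rtol=1e-4))
print(float(a.astype('float64').mean()), float(b.astype('float64').mean()))
```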
#### Output of ``xr.show_versions()``
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.1 | packaged by conda-forge | (default, Jan 5 2020, 20:58:18)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.14.154-99.181.amzn1.x86_64
machine: x86_64
processor:
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3
xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.3
scipy: None
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.9.1
distributed: 2.9.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 44.0.0.post20200102
pip: 19.3.1
conda: None
pytest: None
IPython: 7.11.1
sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3686/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
493058488,MDU6SXNzdWU0OTMwNTg0ODg=,3306,"`ds.load()` with local files stalls and fails, and `to_zarr` does not include `store` in the dask graph",15016780,closed,0,,,7,2019-09-12T22:29:04Z,2019-09-16T01:22:09Z,2019-09-16T01:22:09Z,NONE,,,,"#### MCVE Code Sample
The scenario below reads local netCDF files (shared via EFS) to create a Zarr store, but `store` is never called as part of the dask graph. It looks like this may actually be related to `concatenate`.
I include a commented-out option that reads the files over HTTPS instead; this works (it does store data on S3), but of course the open-dataset calls are slower.
`ds.to_zarr` and `ds.load()` will both stall and eventually return many instances of:
```
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://192.168.62.40:37233'], ('concatenate-2babafa03313bcf979ae6ca3a8e16aad', 1, 10, 13)
NoneType: None
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://192.168.62.40:37233'], ('concatenate-2babafa03313bcf979ae6ca3a8e16aad', 0, 6, 30)
```
```python
#!/usr/bin/env python
# coding: utf-8
# In[1]:
import xarray as xr
from dask.distributed import Client, progress
import s3fs
import zarr
import datetime
# In[16]:
import datetime
chunks = {'lat': 1000, 'lon': 1000}
base = 2018
year = base
ending = '090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'
days_of_year = list(range(152, 154))
file_urls = []
for doy in days_of_year:
    date = datetime.datetime(year, 1, 1) + datetime.timedelta(doy - 1)
    date = date.strftime('%Y%m%d')
    file_urls.append('./{}/{}/{}{}'.format(year, doy, date, ending))
print(file_urls)
ds = xr.open_mfdataset(file_urls, chunks=chunks, combine='by_coords', parallel=True)
ds
# In[21]:
# This works fine
# base_url = 'https://podaac-opendap.jpl.nasa.gov:443/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/'
# url_ending = '090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc?time[0:1:0],lat[0:1:17998],lon[0:1:35999],analysed_sst[0:1:0][0:1:17998][0:1:35999]'
# year = 2018
# days_of_year = list(range(152, 154))
# file_urls = []
# for doy in days_of_year:
#     date = datetime.datetime(year, 1, 1) + datetime.timedelta(doy - 1)
#     date = date.strftime('%Y%m%d')
#     file_urls.append('{}/{}/{}/{}{}'.format(base_url, year, doy, date, url_ending))
# #file_urls
# ds = xr.open_mfdataset(file_urls, chunks=chunks, parallel=True, combine='by_coords')
# ds
# In[ ]:
# Write zarr to s3
myS3fs = s3fs.S3FileSystem(anon=False)
zarr_s3 = 'aimeeb-datasets-private/mur_sst_zarr14'
d = s3fs.S3Map(zarr_s3, s3=myS3fs)
compressor = zarr.Blosc(cname='zstd', clevel=5, shuffle=zarr.Blosc.AUTOSHUFFLE)
encoding = {v: {'compressor': compressor} for v in ds.data_vars}
ds.to_zarr(d, mode='w', encoding=encoding)
```
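One hedged way to check whether `store` tasks actually make it into the graph (a sketch, not part of the original report; it reuses `ds`, `d`, and `encoding` from above and assumes an xarray version where `to_zarr(..., compute=False)` is available) is to build the delayed write and inspect its graph keys before computing:
```python
# Hedged sketch (assumptions: ds, d and encoding from the example above, and
# that this xarray version supports to_zarr(..., compute=False)).
delayed_write = ds.to_zarr(d, mode='w', encoding=encoding, compute=False)
graph_keys = list(delayed_write.__dask_graph__().keys())
print(any('store' in str(key) for key in graph_keys))  # True if store tasks are present
delayed_write.compute()  # actually perform the write
```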
#### Expected Output
I expect the call to `to_zarr` to produce a graph that includes `store`.
#### Problem Description
The end result should be a Zarr store on S3, but the write stalls as described above.
#### Output of ``xr.show_versions()``
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.14.128-112.105.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2
xarray: 0.12.3
pandas: 0.25.0
numpy: 1.16.4
scipy: 1.3.0
netCDF4: 1.5.1.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: 2.3.2
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.0.24
cfgrib: None
iris: None
bottleneck: None
dask: 2.2.0
distributed: 2.2.0
matplotlib: 3.1.1
cartopy: 0.17.0
seaborn: 0.9.0
numbagg: None
setuptools: 41.0.1
pip: 19.2.1
conda: None
pytest: None
IPython: 7.7.0
sphinx: None
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3306/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue