github: issues: 3 rows where repo = 13221727, type = "issue" and user = 15016780 sorted by updated

3 rows where repo = 13221727, type = "issue" and user = 15016780 sorted by updated_at descending

Search:

descending

id	node_id	number	title	user	state	comments	created_at	updated_at ▲	closed_at	author_association	body	reactions	state_reason	repo	type
2111051033	I_kwDOAMm_X8591BUZ	8691	xarray.open_dataset with chunks={} returns a single chunk and not engine (h5netcdf) preferred chunks	abarciauskas-bgse 15016780	closed	4	2024-01-31T22:04:02Z	2024-01-31T22:56:17Z	2024-01-31T22:56:17Z	NONE	What happened? When opening MUR SST netcdfs from S3, xarray.open_dataset(file, engine="h5netcdf", chunks={}) returns a single chunk (whereas the h5netcdf library returns a chunk shape of (1, 1023, 2047). A notebook version of the code below includes the output: https://gist.github.com/abarciauskas-bgse/9366e04d2af09b79c9de466f6c1d3b90 What did you expect to happen? I thought the chunks={} option would return the same chunks (1, 1023, 2047) exposed by the h5netcdf engine. Minimal Complete Verifiable Example ```Python !/usr/bin/env python coding: utf-8 This notebook looks at how xarray and h5netcdf return different chunks. import pandas as pd import h5netcdf import s3fs import xarray as xr dates = [ d.to_pydatetime().strftime('%Y%m%d') for d in pd.date_range('2023-02-01', '2023-03-01', freq='D') ] SHORT_NAME = 'MUR-JPL-L4-GLOB-v4.1' s3_fs = s3fs.S3FileSystem(anon=False) var = 'analysed_sst' def make_filename(time): base_url = f's3://podaac-ops-cumulus-protected/{SHORT_NAME}/' # example file: "/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc" return f'{base_url}{time}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc' s3_urls = [make_filename(d) for d in dates] def print_chunk_shape(s3_url): try: # Open the dataset using xarray file = s3_fs.open(s3_url) dataset = xr.open_dataset(file, engine='h5netcdf', chunks={}) # Print chunk shapes for each variable in the dataset print(f"\nChunk shapes for {s3_url}:") if dataset[var].chunks is not None: print(f"xarray open_dataset chunks for {var}: {dataset[var].chunks}") else: print(f"xarray open_dataset chunks for {var}: Not chunked") with h5netcdf.File(file, 'r') as file: dataset = file[var] # Check if the dataset is chunked if dataset.chunks: print(f"h5netcdf chunks for {var}:", dataset.chunks) else: print(f"h5netcdf dataset is not chunked.") except Exception as e: print(f"Failed to process {s3_url}: {e}") [print_chunk_shape(s3_url) for s3_url in s3_urls] ``` MVCE confirmation [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray. [x] Complete example — the example is self-contained, including all data and the text of any traceback. [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result. [x] New issue — a search of GitHub Issues suggests this is not a duplicate. [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies. Relevant log output No response Anything else we need to know? No response Environment INSTALLED VERSIONS ------------------ commit: None python: 3.10.12 \| packaged by conda-forge \| (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 5.10.198-187.748.amzn2.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.1 libnetcdf: 4.9.2 xarray: 2023.6.0 pandas: 2.0.3 numpy: 1.24.4 scipy: 1.11.1 netCDF4: 1.6.4 pydap: installed h5netcdf: 1.2.0 h5py: 3.9.0 Nio: None zarr: 2.15.0 cftime: 1.6.2 nc_time_axis: 1.4.1 PseudoNetCDF: None iris: None bottleneck: 1.3.7 dask: 2023.6.1 distributed: 2023.6.1 matplotlib: 3.7.1 cartopy: 0.21.1 seaborn: 0.12.2 numbagg: None fsspec: 2023.6.0 cupy: None pint: 0.22 sparse: 0.14.0 flox: 0.7.2 numpy_groupies: 0.9.22 setuptools: 68.0.0 pip: 23.1.2 conda: None pytest: 7.4.0 mypy: None IPython: 8.14.0 sphinx: None	{ "url": "https://api.github.com/repos/pydata/xarray/issues/8691/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
548475127	MDU6SXNzdWU1NDg0NzUxMjc=	3686	Different data values from xarray open_mfdataset when using chunks	abarciauskas-bgse 15016780	closed	7	2020-01-11T20:15:12Z	2020-01-20T20:35:48Z	2020-01-20T20:35:47Z	NONE	MCVE Code Sample You will first need to download or (mount podaac's drive) from PO.DAAC, including credentials: ```bash curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/152/ curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/153/ curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/154/ curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/155/ curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc -O data/mursst_netcdf/156/ ``` Then run the following code: ```python from datetime import datetime import xarray as xr import glob def generate_file_list(start_doy, end_doy): """ Given a start day and end end day, generate a list of file locations. Assumes a 'prefix' and 'year' variables have already been defined. 'Prefix' should be a local directory or http url and path. 'Year' should be a 4 digit year. """ days_of_year = list(range(start_doy, end_doy)) fileObjs = [] for doy in days_of_year: if doy < 10: doy = f"00{doy}" elif doy >= 10 and doy < 100: doy = f"0{doy}" file = glob.glob(f"{prefix}/{doy}/*.nc")[0] fileObjs.append(file) return fileObjs Invariants - but could be made configurable year = 2002 prefix = f"data/mursst_netcdf" chunks = {'time': 1, 'lat': 1799, 'lon': 3600} Create a list of files start_doy = 152 num_days = 5 end_doy = start_doy + num_days fileObjs = generate_file_list(start_doy, end_doy) will use this timeslice in query later on time_slice = slice(datetime.strptime(f"{year}-06-02", '%Y-%m-%d'), datetime.strptime(f"{year}-06-04", '%Y-%m-%d')) print("results from unchunked dataset") ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords') print(ds_unchunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values) print(ds_unchunked.analysed_sst.sel(time=time_slice).mean().values) print(f"results from chunked dataset using {chunks}") ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks=chunks) print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values) print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values) print("results from chunked dataset using 'auto'") ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks={'time': 'auto', 'lat': 'auto', 'lon': 'auto'}) print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values) print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values) ``` Note, these are just a few examples but I tried a variety of other chunk options and got similar discrepancies between the unchunked and chunked datasets. Output: `results from unchunked dataset 290.13754 286.7869 results from chunked dataset using {'time': 1, 'lat': 1799, 'lon': 3600} 290.13757 286.81107 results from chunked dataset using 'auto' 290.1377 286.8118` Expected Output Values output from queries of chunked and unchunked xarray dataset are equal. Problem Description I want to understand how to chunk or query data to verify data opened using chunks will have the same output as data opened without chunking. Would like to store data ultimately in Zarr but verifying data integrity is critical. Output of `xr.show_versions()` INSTALLED VERSIONS ------------------ commit: None python: 3.8.1 \| packaged by conda-forge \| (default, Jan 5 2020, 20:58:18) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.14.154-99.181.amzn1.x86_64 machine: x86_64 processor: byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.7.3 xarray: 0.14.1 pandas: 0.25.3 numpy: 1.17.3 scipy: None netCDF4: 1.5.3 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.3.2 cftime: 1.0.4.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.9.1 distributed: 2.9.1 matplotlib: None cartopy: None seaborn: None numbagg: None setuptools: 44.0.0.post20200102 pip: 19.3.1 conda: None pytest: None IPython: 7.11.1 sphinx: None	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3686/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
493058488	MDU6SXNzdWU0OTMwNTg0ODg=	3306	`ds.load()` with local files stalls and fails, and `to_zarr` does not include `store` in the dask graph	abarciauskas-bgse 15016780	closed	7	2019-09-12T22:29:04Z	2019-09-16T01:22:09Z	2019-09-16T01:22:09Z	NONE	MCVE Code Sample Below details a scenario where reading local netcdf files (shared via EFS) to create a zarr store is not calling `store` as part of the dask graph. I discovered it looks like this may actually be related to `concatenate` I include a commented option where I try using files over https and this works (does store data on S3), but of course the open dataset calls are slower. `ds.to_zarr` and `ds.load()` will both stall and eventually returning many instances of: `distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://192.168.62.40:37233'], ('concatenate-2babafa03313bcf979ae6ca3a8e16aad', 1, 10, 13) NoneType: None distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://192.168.62.40:37233'], ('concatenate-2babafa03313bcf979ae6ca3a8e16aad', 0, 6, 30)` ```python !/usr/bin/env python coding: utf-8 In[1]: import xarray as xr from dask.distributed import Client, progress import s3fs import zarr import datetime In[16]: import datetime chunks = {'lat': 1000, 'lon': 1000} base = 2018 year = base ending = '090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc' days_of_year = list(range(152, 154)) file_urls = [] for doy in days_of_year: date = datetime.datetime(year, 1, 1) + datetime.timedelta(doy - 1) date = date.strftime('%Y%m%d') file_urls.append('./{}/{}/{}{}'.format(year, doy, date, ending)) print(file_urls) ds = xr.open_mfdataset(file_urls, chunks=chunks, combine='by_coords', parallel=True) ds In[21]: This works fine base_url = 'https://podaac-opendap.jpl.nasa.gov:443/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/' url_ending = '090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc?time[0:1:0],lat[0:1:17998],lon[0:1:35999],analysed_sst[0:1:0][0:1:17998][0:1:35999]' year = 2018 days_of_year = list(range(152, 154)) file_urls = [] for doy in days_of_year: date = datetime.datetime(year, 1, 1) + datetime.timedelta(doy - 1) date = date.strftime('%Y%m%d') file_urls.append('{}/{}/{}/{}{}'.format(base_url, year, doy, date, url_ending)) #file_urls ds = xr.open_mfdataset(file_urls, chunks=chunks, parallel=True, combine='by_coords') ds In[ ]: Write zarr to s3 myS3fs = s3fs.S3FileSystem(anon=False) zarr_s3 = 'aimeeb-datasets-private/mur_sst_zarr14' d = s3fs.S3Map(zarr_s3, s3=myS3fs) compressor = zarr.Blosc(cname='zstd', clevel=5, shuffle=zarr.Blosc.AUTOSHUFFLE) encoding = {v: {'compressor': compressor} for v in ds.data_vars} ds.to_zarr(d, mode='w', encoding=encoding) ``` Expected Output Expect the call `to_zarr` to produce a graph with `store` Problem Description The end result should be a zarr store on S3 Output of `xr.show_versions()` ``` INSTALLED VERSIONS commit: None python: 3.7.3 \| packaged by conda-forge \| (default, Jul 1 2019, 21:52:21) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.14.128-112.105.amzn2.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2 xarray: 0.12.3 pandas: 0.25.0 numpy: 1.16.4 scipy: 1.3.0 netCDF4: 1.5.1.2 pydap: None h5netcdf: None h5py: 2.9.0 Nio: None zarr: 2.3.2 cftime: 1.0.3.4 nc_time_axis: None PseudoNetCDF: None rasterio: 1.0.24 cfgrib: None iris: None bottleneck: None dask: 2.2.0 distributed: 2.2.0 matplotlib: 3.1.1 cartopy: 0.17.0 seaborn: 0.9.0 numbagg: None setuptools: 41.0.1 pip: 19.2.1 conda: None pytest: None IPython: 7.7.0 sphinx: None ```	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3306/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue

Advanced export

JSON shape: default, array, newline-delimited, object

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);

issues

3 rows where repo = 13221727, type = "issue" and user = 15016780 sorted by updated_at descending

What happened?

What did you expect to happen?

Minimal Complete Verifiable Example

!/usr/bin/env python

coding: utf-8

This notebook looks at how xarray and h5netcdf return different chunks.

MVCE confirmation

Relevant log output

Anything else we need to know?

Environment

MCVE Code Sample

Invariants - but could be made configurable

Create a list of files

will use this timeslice in query later on

Expected Output

Problem Description

Output of xr.show_versions()

MCVE Code Sample

!/usr/bin/env python

coding: utf-8

In[1]:

In[16]:

In[21]:

This works fine

base_url = 'https://podaac-opendap.jpl.nasa.gov:443/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/'

url_ending = '090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc?time[0:1:0],lat[0:1:17998],lon[0:1:35999],analysed_sst[0:1:0][0:1:17998][0:1:35999]'

year = 2018

days_of_year = list(range(152, 154))

file_urls = []

for doy in days_of_year:

date = datetime.datetime(year, 1, 1) + datetime.timedelta(doy - 1)

date = date.strftime('%Y%m%d')

file_urls.append('{}/{}/{}/{}{}'.format(base_url, year, doy, date, url_ending))

#file_urls

ds = xr.open_mfdataset(file_urls, chunks=chunks, parallel=True, combine='by_coords')

ds

In[ ]:

Write zarr to s3

Expected Output

Problem Description

Output of xr.show_versions()

Advanced export

Output of `xr.show_versions()`

Output of `xr.show_versions()`