issues
3 rows where repo = 13221727, type = "issue" and user = 15016780, sorted by updated_at descending
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2111051033 | I_kwDOAMm_X8591BUZ | 8691 | xarray.open_dataset with chunks={} returns a single chunk and not engine (h5netcdf) preferred chunks | abarciauskas-bgse 15016780 | closed | 0 | 4 | 2024-01-31T22:04:02Z | 2024-01-31T22:56:17Z | 2024-01-31T22:56:17Z | NONE |

**What happened?**

When opening MUR SST netCDFs from S3, `xarray.open_dataset(file, engine="h5netcdf", chunks={})` returns a single chunk, whereas the h5netcdf library returns a chunk shape of (1, 1023, 2047). A notebook version of the code below, including the output: https://gist.github.com/abarciauskas-bgse/9366e04d2af09b79c9de466f6c1d3b90

**What did you expect to happen?**

I expected the `chunks={}` option to return the same chunks (1, 1023, 2047) exposed by the h5netcdf engine.

**Minimal Complete Verifiable Example**

```python
#!/usr/bin/env python
# coding: utf-8

# This notebook looks at how xarray and h5netcdf return different chunks.
import pandas as pd
import h5netcdf
import s3fs
import xarray as xr

dates = [
    d.to_pydatetime().strftime('%Y%m%d')
    for d in pd.date_range('2023-02-01', '2023-03-01', freq='D')
]
SHORT_NAME = 'MUR-JPL-L4-GLOB-v4.1'
s3_fs = s3fs.S3FileSystem(anon=False)
var = 'analysed_sst'

def make_filename(time):
    base_url = f's3://podaac-ops-cumulus-protected/{SHORT_NAME}/'
    # example file: "/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
    return f'{base_url}{time}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'

s3_urls = [make_filename(d) for d in dates]

def print_chunk_shape(s3_url):
    try:
        # Open the dataset using xarray
        file = s3_fs.open(s3_url)
        dataset = xr.open_dataset(file, engine='h5netcdf', chunks={})
        # Print the chunk shape xarray reports. (The rest of this function was
        # truncated in the original report; this is a minimal completion.)
        print(dataset[var].chunks)
    except Exception as e:
        print(e)

[print_chunk_shape(s3_url) for s3_url in s3_urls]
```
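As a hedged aside, not part of the original report: the mismatch can also be checked against a locally downloaded granule by comparing the chunk layout h5py sees on disk with what xarray reports. `LOCAL_PATH` is a hypothetical filename here.

```python
# A minimal sketch, assuming a locally downloaded MUR granule; LOCAL_PATH is a
# hypothetical filename, not from the original report.
import h5py
import xarray as xr

LOCAL_PATH = "20230201090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"

# h5py exposes the chunk layout stored in the HDF5 file itself.
with h5py.File(LOCAL_PATH, "r") as f:
    print("on-disk HDF5 chunks:", f["analysed_sst"].chunks)  # reported as (1, 1023, 2047)

# chunks={} asks xarray for the backend's preferred chunking; the report is
# that this comes back as one whole-array chunk instead.
ds = xr.open_dataset(LOCAL_PATH, engine="h5netcdf", chunks={})
print("xarray dask chunks:", ds["analysed_sst"].chunks)
print("chunksizes recorded in encoding:", ds["analysed_sst"].encoding.get("chunksizes"))
```

If the last two prints disagree, the backend's preferred chunks are being read but not propagated into the dask chunking.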
**MVCE confirmation**

**Relevant log output**

No response

**Anything else we need to know?**

No response

**Environment**
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.198-187.748.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.1
libnetcdf: 4.9.2
xarray: 2023.6.0
pandas: 2.0.3
numpy: 1.24.4
scipy: 1.11.1
netCDF4: 1.6.4
pydap: installed
h5netcdf: 1.2.0
h5py: 3.9.0
Nio: None
zarr: 2.15.0
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.6.1
distributed: 2023.6.1
matplotlib: 3.7.1
cartopy: 0.21.1
seaborn: 0.12.2
numbagg: None
fsspec: 2023.6.0
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.7.2
numpy_groupies: 0.9.22
setuptools: 68.0.0
pip: 23.1.2
conda: None
pytest: 7.4.0
mypy: None
IPython: 8.14.0
sphinx: None
{ "url": "https://api.github.com/repos/pydata/xarray/issues/8691/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | xarray 13221727 | issue
548475127 | MDU6SXNzdWU1NDg0NzUxMjc= | 3686 | Different data values from xarray open_mfdataset when using chunks | abarciauskas-bgse 15016780 | closed | 0 | 7 | 2020-01-11T20:15:12Z | 2020-01-20T20:35:48Z | 2020-01-20T20:35:47Z | NONE |

**MCVE Code Sample**

You will first need to download the files from PO.DAAC (or mount PO.DAAC's drive); substitute your own credentials:

```bash
curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc --create-dirs -o data/mursst_netcdf/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc --create-dirs -o data/mursst_netcdf/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc --create-dirs -o data/mursst_netcdf/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc --create-dirs -o data/mursst_netcdf/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
curl -u USERNAME:PASSWORD https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc --create-dirs -o data/mursst_netcdf/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
```

Then run the following code:

```python
from datetime import datetime
import xarray as xr
import glob

# Invariants - but could be made configurable
year = 2002
prefix = f"data/mursst_netcdf"
chunks = {'time': 1, 'lat': 1799, 'lon': 3600}

def generate_file_list(start_doy, end_doy):
    # (The function body was truncated in the original report; presumably it
    # collects the per-day-of-year netCDF files downloaded above.)
    fileObjs = []
    for doy in range(start_doy, end_doy):
        fileObjs += sorted(glob.glob(f"{prefix}/{doy}/*.nc"))
    return fileObjs

# Create a list of files
start_doy = 152
num_days = 5
end_doy = start_doy + num_days
fileObjs = generate_file_list(start_doy, end_doy)

# will use this timeslice in query later on
time_slice = slice(datetime.strptime(f"{year}-06-02", '%Y-%m-%d'),
                   datetime.strptime(f"{year}-06-04", '%Y-%m-%d'))

print("results from unchunked dataset")
ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords')
print(ds_unchunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
print(ds_unchunked.analysed_sst.sel(time=time_slice).mean().values)

print(f"results from chunked dataset using {chunks}")
ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks=chunks)
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values)

print("results from chunked dataset using 'auto'")
ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords',
                               chunks={'time': 'auto', 'lat': 'auto', 'lon': 'auto'})
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values)
```

Note, these are just a few examples: I tried a variety of other chunk options and got similar discrepancies between the unchunked and chunked datasets.

Output:
**Expected Output**

Values output from queries of the chunked and unchunked xarray datasets should be equal.

**Problem Description**

I want to understand how to chunk or query the data so as to verify that data opened using chunks has the same output as data opened without chunking. I would ultimately like to store the data in Zarr, but verifying data integrity is critical.
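As a hedged aside, not part of the original report: the integrity check the report asks for can be rehearsed without the MUR files by round-tripping a small synthetic dataset; the file name and variable layout below are illustrative stand-ins.

```python
# A minimal sketch with synthetic data; "synthetic_granule.nc" and the variable
# layout are illustrative stand-ins for one MUR granule.
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"analysed_sst": (("time", "lat", "lon"),
                      np.random.rand(1, 10, 20).astype("float32"))},
    coords={"time": [0],
            "lat": np.linspace(20, 50, 10),
            "lon": np.linspace(-170, -110, 20)},
)
ds.to_netcdf("synthetic_granule.nc")

unchunked = xr.open_dataset("synthetic_granule.nc")
chunked = xr.open_dataset("synthetic_granule.nc", chunks={"lat": 5, "lon": 5})

# assert_identical compares values, dims, coords and attrs; .compute() first
# materializes the dask-backed variant.
xr.testing.assert_identical(unchunked, chunked.compute())
```

One caveat when comparing aggregates rather than raw values: a dask `.mean()` over different chunkings can differ from the unchunked result at the level of floating-point roundoff, because the reduction order changes; discrepancies larger than roundoff point at a real bug.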
**Output of `xr.show_versions()`**
{ "url": "https://api.github.com/repos/pydata/xarray/issues/3686/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | xarray 13221727 | issue
493058488 | MDU6SXNzdWU0OTMwNTg0ODg= | 3306 | `ds.load()` with local files stalls and fails, and `to_zarr` does not include `store` in the dask graph | abarciauskas-bgse 15016780 | closed | 0 | 7 | 2019-09-12T22:29:04Z | 2019-09-16T01:22:09Z | 2019-09-16T01:22:09Z | NONE |

**MCVE Code Sample**

Below is a scenario where reading local netCDF files (shared via EFS) to create a zarr store does not call `store`. I include a commented-out option where I try reading the files over HTTPS; this works (it does store data on S3), but of course the open-dataset calls are slower.
```python
#!/usr/bin/env python
# coding: utf-8

# In[1]:
import xarray as xr
from dask.distributed import Client, progress
import s3fs
import zarr
import datetime

# In[16]:
import datetime

chunks = {'lat': 1000, 'lon': 1000}
base = 2018
year = base
ending = '090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'
days_of_year = list(range(152, 154))
file_urls = []
for doy in days_of_year:
    date = datetime.datetime(year, 1, 1) + datetime.timedelta(doy - 1)
    date = date.strftime('%Y%m%d')
    file_urls.append('./{}/{}/{}{}'.format(year, doy, date, ending))
print(file_urls)

ds = xr.open_mfdataset(file_urls, chunks=chunks, combine='by_coords', parallel=True)
ds

# In[21]:
# This works fine
# base_url = 'https://podaac-opendap.jpl.nasa.gov:443/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/'
# url_ending = '090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc?time[0:1:0],lat[0:1:17998],lon[0:1:35999],analysed_sst[0:1:0][0:1:17998][0:1:35999]'
# year = 2018
# days_of_year = list(range(152, 154))
# file_urls = []
# for doy in days_of_year:
#     date = datetime.datetime(year, 1, 1) + datetime.timedelta(doy - 1)
#     date = date.strftime('%Y%m%d')
#     file_urls.append('{}/{}/{}/{}{}'.format(base_url, year, doy, date, url_ending))
# # file_urls
# ds = xr.open_mfdataset(file_urls, chunks=chunks, parallel=True, combine='by_coords')
# ds

# In[ ]:
# Write zarr to s3
myS3fs = s3fs.S3FileSystem(anon=False)
zarr_s3 = 'aimeeb-datasets-private/mur_sst_zarr14'
d = s3fs.S3Map(zarr_s3, s3=myS3fs)
compressor = zarr.Blosc(cname='zstd', clevel=5, shuffle=zarr.Blosc.AUTOSHUFFLE)
encoding = {v: {'compressor': compressor} for v in ds.data_vars}
ds.to_zarr(d, mode='w', encoding=encoding)
```

**Expected Output**

Expect the `ds.to_zarr(...)` call to write the dataset to the zarr store on S3.

**Problem Description**

The end result should be a zarr store on S3.
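As a hedged aside, not part of the original report: one way to inspect whether the `store` step makes it into the dask graph is to build the write lazily with `compute=False` and look at the resulting graph keys; the local path and chunking below are illustrative.

```python
# A minimal sketch; "local_test.zarr" and the synthetic data are illustrative,
# not from the original report.
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"analysed_sst": (("lat", "lon"), np.random.rand(100, 100))}
).chunk({"lat": 50, "lon": 50})

# With compute=False, to_zarr returns a dask Delayed wrapping the whole write.
delayed_store = ds.to_zarr("local_test.zarr", mode="w", compute=False)

# The graph should contain store-related keys; if it only held open/getitem
# tasks, the write would never execute.
print([k for k in delayed_store.__dask_graph__() if "store" in str(k)])

delayed_store.compute()  # executes the write
```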
**Output of `xr.show_versions()`**
{ "url": "https://api.github.com/repos/pydata/xarray/issues/3306/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | xarray 13221727 | issue |
CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo] ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone] ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee] ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user] ON [issues] ([user]);
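As a hedged sketch of reproducing this page's query in Python against the schema above: "github.db" is a hypothetical SQLite export of this Datasette instance, and the filter values come from the query description at the top of the page.

```python
# A minimal sketch; "github.db" is a hypothetical SQLite file containing the
# [issues] table defined above.
import sqlite3

conn = sqlite3.connect("github.db")
rows = conn.execute(
    """
    SELECT id, number, title, state, updated_at
    FROM issues
    WHERE repo = ? AND type = ? AND user = ?
    ORDER BY updated_at DESC
    """,
    (13221727, "issue", 15016780),
).fetchall()
for row in rows:
    print(row)
```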