#3686: Different data values from xarray open_mfdataset when using chunks

State: closed · Opened: 2020-01-11T20:15:12Z · Closed: 2020-01-20T20:35:47Z · Comments: 7 · Author association: NONE

MCVE Code Sample

You will first need to download the files from PO.DAAC (or mount PO.DAAC's drive), which requires credentials:

```bash
curl -u USERNAME:PASSWORD --create-dirs \
  -o data/mursst_netcdf/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc \
  https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

curl -u USERNAME:PASSWORD --create-dirs \
  -o data/mursst_netcdf/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc \
  https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

curl -u USERNAME:PASSWORD --create-dirs \
  -o data/mursst_netcdf/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc \
  https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

curl -u USERNAME:PASSWORD --create-dirs \
  -o data/mursst_netcdf/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc \
  https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

curl -u USERNAME:PASSWORD --create-dirs \
  -o data/mursst_netcdf/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc \
  https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
```

Then run the following code:

```python
from datetime import datetime
import glob

import xarray as xr


def generate_file_list(start_doy, end_doy):
    """Given a start day and an end day, generate a list of file locations.

    Assumes 'prefix' and 'year' variables have already been defined.
    'prefix' should be a local directory or an http url and path.
    'year' should be a 4 digit year.
    """
    days_of_year = list(range(start_doy, end_doy))
    fileObjs = []
    for doy in days_of_year:
        if doy < 10:
            doy = f"00{doy}"
        elif 10 <= doy < 100:
            doy = f"0{doy}"
        file = glob.glob(f"{prefix}/{doy}/*.nc")[0]
        fileObjs.append(file)
    return fileObjs


# Invariants - but could be made configurable
year = 2002
prefix = "data/mursst_netcdf"
chunks = {'time': 1, 'lat': 1799, 'lon': 3600}

# Create a list of files
start_doy = 152
num_days = 5
end_doy = start_doy + num_days
fileObjs = generate_file_list(start_doy, end_doy)

# will use this time slice in a query later on
time_slice = slice(datetime.strptime(f"{year}-06-02", '%Y-%m-%d'),
                   datetime.strptime(f"{year}-06-04", '%Y-%m-%d'))

print("results from unchunked dataset")
ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords')
print(ds_unchunked.analysed_sst[1, :, :].sel(lat=slice(20, 50), lon=slice(-170, -110)).mean().values)
print(ds_unchunked.analysed_sst.sel(time=time_slice).mean().values)

print(f"results from chunked dataset using {chunks}")
ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks=chunks)
print(ds_chunked.analysed_sst[1, :, :].sel(lat=slice(20, 50), lon=slice(-170, -110)).mean().values)
print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values)

print("results from chunked dataset using 'auto'")
ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords',
                               chunks={'time': 'auto', 'lat': 'auto', 'lon': 'auto'})
print(ds_chunked.analysed_sst[1, :, :].sel(lat=slice(20, 50), lon=slice(-170, -110)).mean().values)
print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values)
```

Note: these are just a few examples; I tried a variety of other chunk options and got similar discrepancies between the unchunked and chunked datasets.

Output:

```
results from unchunked dataset
290.13754
286.7869
results from chunked dataset using {'time': 1, 'lat': 1799, 'lon': 3600}
290.13757
286.81107
results from chunked dataset using 'auto'
290.1377
286.8118
```

Expected Output

Values output from queries of the chunked and unchunked xarray datasets should be equal.
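A programmatic version of this check might look like the following (a minimal sketch reusing `ds_unchunked`, `ds_chunked`, and `time_slice` from above; `xr.testing.assert_allclose` allows a small relative tolerance, rtol=1e-05 by default, whereas `assert_equal` would demand exact equality):

```python
# Sketch: the chunked and unchunked reads should agree. assert_allclose
# tolerates tiny floating-point differences; assert_equal would require
# exact equality and fails on the output shown above.
xr.testing.assert_allclose(
    ds_unchunked.analysed_sst.sel(time=time_slice).mean(),
    ds_chunked.analysed_sst.sel(time=time_slice).mean(),
)
```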

Problem Description

I want to understand how to chunk or query the data so that I can verify that data opened with chunks produces the same output as data opened without chunking. I would ultimately like to store the data in Zarr, so verifying data integrity is critical.
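Since the eventual target is Zarr, the kind of integrity check I have in mind looks roughly like this (a minimal sketch; the store path `data/mursst.zarr` is a placeholder, and `assert_allclose` could be tightened to `assert_equal` for a strict round-trip check):

```python
# Sketch: round-trip the chunked dataset through a Zarr store and confirm
# the values survive intact. 'data/mursst.zarr' is a placeholder path.
ds_chunked.to_zarr('data/mursst.zarr', mode='w')
ds_roundtrip = xr.open_zarr('data/mursst.zarr')
xr.testing.assert_allclose(
    ds_chunked.analysed_sst,
    ds_roundtrip.analysed_sst,
)
```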

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.1 | packaged by conda-forge | (default, Jan 5 2020, 20:58:18) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.14.154-99.181.amzn1.x86_64
machine: x86_64
processor:
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.3
scipy: None
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.9.1
distributed: 2.9.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 44.0.0.post20200102
pip: 19.3.1
conda: None
pytest: None
IPython: 7.11.1
sphinx: None
```
