issues: 2117245042
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2117245042 | I_kwDOAMm_X85-Mphy | 8703 | calling to_zarr inside map_blocks function results in missing values | 23472459 | closed | 0 | 8 | 2024-02-04T18:21:40Z | 2024-04-11T06:53:45Z | 2024-04-11T06:53:45Z | NONE |

What happened?

I want to work with a huge dataset stored in HDF5 and loaded in chunks. Each chunk contains a part of my data that should be saved to a specific region of a zarr store, and the original order of the chunks must be preserved.
I found it convenient to call `to_zarr` from inside a `map_blocks` function. I used a simplified scenario in the code documenting this behavior: a zarr store initialized with zeros is supposed to be completely overwritten with ones, but some parts are always left as zeros.

What did you expect to happen?

No response

Minimal Complete Verifiable Example

```Python
import os
import shutil
import xarray as xr
import numpy as np
import dask.array as da

xr.show_versions()

zarr_file = "file.zarr"
if os.path.exists(zarr_file):
    shutil.rmtree(zarr_file)

chunk_size = 5
shape = (50, 32, 1000)

ones_dataset = xr.Dataset({"data": xr.ones_like(xr.DataArray(np.empty(shape)))})
ones_dataset = ones_dataset.chunk({'dim_0': chunk_size})

chunk_indices = np.arange(len(ones_dataset.chunks['dim_0']))
chunk_ids = np.repeat(np.arange(ones_dataset.sizes["dim_0"] // chunk_size), chunk_size)
chunk_ids_dask_array = da.from_array(chunk_ids, chunks=(chunk_size,))

# Append the chunk IDs Dask array as a new variable to the existing dataset
ones_dataset['chunk_id'] = (('dim_0',), chunk_ids_dask_array)

# Create a new dataset filled with zeros and write only its metadata
zeros_dataset = xr.Dataset({"data": xr.zeros_like(xr.DataArray(np.empty(shape)))})
zeros_dataset.to_zarr(zarr_file, compute=False)


def process_chunk(chunk_dataset):
    chunk_id = int(chunk_dataset["chunk_id"][0])
    chunk_dataset_to_store = chunk_dataset.drop_vars("chunk_id")
    # Write this chunk to its region of the zarr store (reconstructed from the
    # issue title; the original call was truncated in this export)
    region = slice(chunk_id * chunk_size, (chunk_id + 1) * chunk_size)
    chunk_dataset_to_store.to_zarr(zarr_file, region={"dim_0": region})
    return chunk_dataset


ones_dataset.map_blocks(process_chunk, template=ones_dataset).compute()

# Load the data stored in zarr
zarr_data = xr.open_zarr(zarr_file, chunks={'dim_0': chunk_size})

# Find differences
for var_name in zarr_data.variables:
    try:
        xr.testing.assert_equal(zarr_data[var_name], ones_dataset[var_name])
    except AssertionError:
        print(f"Differences in {var_name}:")
        print(zarr_data[var_name].values)
        print(ones_dataset[var_name].values)
```

MVCE confirmation
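For reference, the same region writes can be issued sequentially from the driving process rather than from inside `map_blocks`. The snippet below is a minimal sketch, not part of the original report; it assumes the `zarr_file`, `chunk_size`, `shape`, and `ones_dataset` defined in the MVCE above, with the store's metadata already written via `compute=False`.

```Python
# Minimal sketch (assumes zarr_file, chunk_size, shape and ones_dataset from
# the MVCE above, and that zeros_dataset.to_zarr(zarr_file, compute=False)
# has already created the store's metadata).
for chunk_id in range(shape[0] // chunk_size):
    region = slice(chunk_id * chunk_size, (chunk_id + 1) * chunk_size)
    # Drop the helper chunk_id variable and write this slab into the matching
    # region of the store; with region= set, to_zarr opens the existing store
    # in "r+" mode by default.
    ones_dataset.drop_vars("chunk_id").isel(dim_0=region).to_zarr(
        zarr_file, region={"dim_0": region}
    )
```

Because each slab is aligned with the on-disk chunking along `dim_0`, every region write touches a distinct set of zarr chunks.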
Relevant log output

No response

Anything else we need to know?

No response

Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 6.5.0-15-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development
xarray: 2024.1.1
pandas: 2.1.4
numpy: 1.26.3
scipy: 1.11.4
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.3.7
dask: 2024.1.1
distributed: 2024.1.0
matplotlib: 3.8.2
cartopy: 0.22.0
seaborn: 0.13.1
numbagg: 0.6.8
fsspec: 2023.12.2
cupy: None
pint: None
sparse: None
flox: 0.8.9
numpy_groupies: 0.10.2
setuptools: 69.0.2
pip: 23.3.1
conda: None
pytest: 7.4.4
mypy: None
IPython: None
sphinx: None
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/8703/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | 13221727 | issue |