issues: 1033142897

id node_id number title user state locked assignee milestone comments created_at updated_at closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1033142897 I_kwDOAMm_X849lIJx 5883 Failing parallel writes to_zarr with regions parameter? 4666753 closed 0     1 2021-10-22T03:33:02Z 2021-10-22T18:37:06Z 2021-10-22T18:37:06Z CONTRIBUTOR      

What happened: Following the guidance on how to use the region keyword in xr.Dataset.to_zarr(), I wrote a multithreaded program that makes independent writes to each index along an axis. However, when I use more than one thread, some of these writes fail.

What you expected to happen: I expect all the writes to take place safely so long as the regions I write to do not overlap (they do not).
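The non-overlap claim can be checked mechanically. The following is a minimal sketch, not part of the original report: the helper name `regions_are_disjoint` is hypothetical, and it assumes each region is a plain half-open slice along a single dimension, as in the example in this report.

```python
def regions_are_disjoint(regions):
    """Return True if no two half-open [start, stop) slices overlap.

    Hypothetical helper for illustration; assumes every slice has
    integer start/stop and a step of None or 1.
    """
    ordered = sorted(regions, key=lambda s: s.start)
    # After sorting by start, it is enough to compare each slice with its successor.
    return all(prev.stop <= nxt.start for prev, nxt in zip(ordered, ordered[1:]))


# Regions as used in this report: one index of "a" per write.
regions = [slice(idx, idx + 1) for idx in range(10)]
disjoint = regions_are_disjoint(regions)
```

With the ten single-index regions, `disjoint` is `True`, which matches the statement that the regions do not overlap in index space.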

Minimal Complete Verifiable Example:

```python
path = "tmp.zarr"
NTHREADS = 4  # when 1, things work as expected
import multiprocessing.dummy as mp  # threads, instead of processes

import numpy as np
import dask.array as da
import xarray as xr

# dummy values for metadata
xr.Dataset(
    {"x": (("a", "b"), -da.ones((10, 7), chunks=(None, 1)))},
    {"apple": ("a", -da.ones(10, dtype=int, chunks=(1,)))},
).to_zarr(path, mode="w", compute=False)

# actual values to save
ds = xr.Dataset(
    {"x": (("a", "b"), np.random.uniform(size=(10, 7)))},
    {"apple": ("a", np.arange(10))},
)

# save them using NTHREADS
with mp.Pool(NTHREADS) as p:
    p.map(
        lambda idx: ds.isel(a=slice(idx, 1 + idx)).to_zarr(
            path, mode="r+", region=dict(a=slice(idx, 1 + idx))
        ),
        range(10),
    )

ds_roundtrip = xr.open_zarr(path).load()  # open what we just saved over multiple threads

# perfect match for x on some slices of a, but when NTHREADS > 1,
# x has very different value or NaN on other slices of a
xr.testing.assert_allclose(ds, ds_roundtrip)  # fails when NTHREADS > 1.
```

Anything else we need to know?:

  • this behavior is the same if the coordinate "apple" (over a) is replaced by a coordinate named "a" (the index over the dimension)
  • if the dummy dataset defined "apple" using dask, I observed ds_roundtrip having all correct values of "apple" (but not of "x"). If it was instead defined as a numpy array, I observed ds_roundtrip having incorrect values of "apple" (in addition to "x").
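The report itself does not establish a cause. One hypothesis consistent with the two observations above is that regions which are disjoint in index space can still fall inside the same stored chunk, so concurrent writers would each rewrite that whole chunk. The sketch below only does the index arithmetic. The helper names `chunks_touched` and `shared_chunks` are hypothetical, and it assumes the on-disk chunk lengths along a follow the dask chunking in the example (a single chunk of length 10 for "x", chunks of length 1 for "apple").

```python
from collections import defaultdict


def chunks_touched(region, chunk_size):
    """Indices of the chunks that a half-open slice [start, stop) falls into."""
    first = region.start // chunk_size
    last = (region.stop - 1) // chunk_size
    return range(first, last + 1)


def shared_chunks(regions, chunk_size):
    """Map chunk index -> regions touching it, keeping chunks hit by 2+ regions."""
    owners = defaultdict(list)
    for region in regions:
        for chunk in chunks_touched(region, chunk_size):
            owners[chunk].append(region)
    return {chunk: regs for chunk, regs in owners.items() if len(regs) > 1}


regions = [slice(idx, idx + 1) for idx in range(10)]

# "x" was created with chunks=(None, 1): assumed to be one chunk of length 10 along a.
conflicts_x = shared_chunks(regions, chunk_size=10)

# "apple" was created with chunks=(1,): assumed to be ten chunks of length 1 along a.
conflicts_apple = shared_chunks(regions, chunk_size=1)
```

Under these assumptions, all ten regions land in chunk 0 of "x", while no chunk of "apple" is shared. That pattern would line up with "x" being corrupted and the dask-defined "apple" surviving, but it is only a consistency check, not a confirmed diagnosis.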

Environment:

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.72-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.0
xarray: 0.19.0
pandas: 1.3.3
numpy: 1.21.2
scipy: 1.7.1
netCDF4: 1.5.7
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.10.1
cftime: 1.5.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.08.1
distributed: 2021.08.1
matplotlib: 3.4.1
cartopy: None
seaborn: 0.11.2
numbagg: None
pint: None
setuptools: 58.2.0
pip: 21.3
conda: None
pytest: None
IPython: 7.28.0
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5883/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed 13221727 issue
