
pydata/xarray issue #7726: open_zarr: PermissionError with multiple processes despite use of ProcessSynchronizer

State: open · Author association: CONTRIBUTOR · Comments: 0 · Created: 2023-04-05T18:55:12Z · Updated: 2023-04-06T01:37:32Z

What happened?

Several processes read and write to an xarray Dataset stored in Zarr format on a network drive. The write operations write to existing regions. Because these regions are not aligned to chunk boundaries, I use a ProcessSynchronizer. The ProcessSynchronizer points to a local folder on an SSD, separate from the stored array itself.
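For background on why the synchronizer matters here: a region write that is not aligned to chunk boundaries turns into a read-modify-write of the boundary chunks, so two processes can end up rewriting the same chunk file. A small helper (hypothetical, not part of zarr) shows which chunk indices a slice touches:

```python
def chunks_touched(start: int, stop: int, chunk: int) -> range:
    """Chunk indices overlapped by the half-open slice [start, stop)."""
    return range(start // chunk, (stop - 1) // chunk + 1)

# A write to elements 3..12 of an axis chunked by 5 touches chunks 0, 1 and 2,
# so two processes writing adjacent unaligned regions can share a chunk file.
print(list(chunks_touched(3, 13, 5)))  # [0, 1, 2]
```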

After several hundred read/write operations I get permission errors like the one below. So far I have failed to reproduce the error with an MCVE.

The file `0` that gave a permission error is the chunk holding the coordinates of a certain dimension, inside that dimension's folder `dim_yyy`:

```
dim_yyy
|-- .zarray
|-- .zattrs
`-- 0
```
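For completeness, the layout above can be listed with a short stdlib walk (the store path is hypothetical):

```python
import os


def store_entries(root: str) -> list[str]:
    """Sorted relative paths of all files under a zarr store directory."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            entries.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(entries)

# e.g. store_entries("xxx.zarr/dim_yyy") would list ['.zarray', '.zattrs', '0']
```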

What did you expect to happen?

No permission error.

Minimal Complete Verifiable Example

I have failed so far to reproduce the error with an MCVE. Here is my attempt.

```python
from pathlib import Path

import dask.array as da
import pandas as pd
import xarray as xr
from dask.distributed import Client
from zarr.sync import ProcessSynchronizer

if __name__ == "__main__":
    path_store = Path("aaa")  # placeholder
    path_synchronizer = Path("bbb")  # placeholder; must exist, and not be the same location as the store

    # create and save a dataset to zarr
    s0, s1, s2 = 10, 10, 10
    temperature = da.random.random((s0, s1, s2), chunks=[s0, s1, s2])
    precipitation = da.random.random((s0, s1, s2), chunks=[s0, s1, s2])
    lon = da.random.random((s0, s1))
    lat = da.random.random((s0, s1))
    time = pd.date_range("2014-09-06", periods=s2)
    reference_time = pd.Timestamp("2014-09-05")
    ds = xr.Dataset(
        data_vars=dict(
            temperature=(["x", "y", "time"], temperature),
            precipitation=(["x", "y", "time"], precipitation),
        ),
        coords=dict(
            lon=(["x", "y"], lon),
            lat=(["x", "y"], lat),
            time=time,
            reference_time=reference_time,
        ),
        attrs=dict(description="Weather related data."),
    )
    print(f"{ds=}")
    ds.to_zarr(path_store, mode="w")

    def read_write(path_store: Path):
        """Lazily open the dataset, then write into a region.
        Comment/uncomment to use the synchronizer."""
        synchronizer = ProcessSynchronizer(path_synchronizer)
        for b in range(100):
            # open the saved dataset
            # ds = xr.open_zarr(path_store, synchronizer=synchronizer)
            ds = xr.open_zarr(path_store)

            # process a region
            dst = (
                ds.temperature.isel(x=slice(0, 5), y=slice(0, 5), time=slice(0, 5))
                .to_dataset()
                .load()
            )
            dst["temperature"] = -dst["temperature"]
            dst = dst.drop_vars(["time", "reference_time"])

            # save the region to the zarr store
            dst.to_zarr(
                path_store,
                region={
                    "x": slice(0, 5),
                    "y": slice(0, 5),
                    "time": slice(0, 5),
                },
                # synchronizer=synchronizer,
            )

    # independent processes that perform read and write operations
    with Client(processes=True) as client:
        futures = [client.submit(read_write, path_store) for a in range(1000)]
        client.gather(futures)
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```python
  return xr.open_zarr(path, synchronizer=synchronizer)
  File "C:\anaconda3\lib\site-packages\xarray\backends\zarr.py", line 787, in open_zarr
    ds = open_dataset(
  File "C:\anaconda3\lib\site-packages\xarray\backends\api.py", line 539, in open_dataset
    backend_ds = backend.open_dataset(
  File "C:\anaconda3\lib\site-packages\xarray\backends\zarr.py", line 862, in open_dataset
    ds = store_entrypoint.open_dataset(
  File "C:\anaconda3\lib\site-packages\xarray\backends\store.py", line 43, in open_dataset
    ds = Dataset(vars, attrs=attrs)
  File "C:\anaconda3\lib\site-packages\xarray\core\dataset.py", line 604, in __init__
    variables, coord_names, dims, indexes, _ = merge_data_and_coords(
  File "C:\anaconda3\lib\site-packages\xarray\core\merge.py", line 575, in merge_data_and_coords
    return merge_core(
  File "C:\anaconda3\lib\site-packages\xarray\core\merge.py", line 755, in merge_core
    collected = collect_variables_and_indexes(aligned, indexes=indexes)
  File "C:\anaconda3\lib\site-packages\xarray\core\merge.py", line 365, in collect_variables_and_indexes
    variable = as_variable(variable, name=name)
  File "C:\anaconda3\lib\site-packages\xarray\core\variable.py", line 168, in as_variable
    obj = obj.to_index_variable()
  File "C:\anaconda3\lib\site-packages\xarray\core\variable.py", line 624, in to_index_variable
    return IndexVariable(
  File "C:\anaconda3\lib\site-packages\xarray\core\variable.py", line 2844, in __init__
    self._data = PandasIndexingAdapter(self._data)
  File "C:\anaconda3\lib\site-packages\xarray\core\indexing.py", line 1420, in __init__
    self.array = safe_cast_to_index(array)
  File "C:\anaconda3\lib\site-packages\xarray\core\indexes.py", line 177, in safe_cast_to_index
    index = pd.Index(np.asarray(array), **kwargs)
  File "C:\anaconda3\lib\site-packages\xarray\core\indexing.py", line 524, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "C:\anaconda3\lib\site-packages\xarray\backends\zarr.py", line 68, in __getitem__
    return array[key.tuple]
  File "C:\anaconda3\lib\site-packages\zarr\core.py", line 821, in __getitem__
    result = self.get_basic_selection(pure_selection, fields=fields)
  File "C:\anaconda3\lib\site-packages\zarr\core.py", line 947, in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out,
  File "C:\anaconda3\lib\site-packages\zarr\core.py", line 990, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "C:\anaconda3\lib\site-packages\zarr\core.py", line 1285, in _get_selection
    self._chunk_getitem(chunk_coords, chunk_selection, out, out_selection,
  File "C:\anaconda3\lib\site-packages\zarr\core.py", line 1994, in _chunk_getitem
    cdata = self.chunk_store[ckey]
  File "C:\anaconda3\lib\site-packages\zarr\storage.py", line 1085, in __getitem__
    return self._fromfile(filepath)
  File "C:\anaconda3\lib\site-packages\zarr\storage.py", line 1059, in _fromfile
    with open(fn, 'rb') as f:
PermissionError: [Errno 13] Permission denied: 'xxx.zarr\\dim_yyy/0'
```
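As a workaround sketch only (it does not fix the underlying race): on Windows, `open()` raises `EACCES` while another process still holds the chunk file, so a transient PermissionError can be retried. The helper below is hypothetical and not part of xarray or zarr:

```python
import time


def retry_on_permission_error(fn, attempts=5, delay=0.5):
    """Call fn(); retry a few times if it raises PermissionError, which on
    Windows can signal a chunk file briefly held open by another process."""
    for i in range(attempts):
        try:
            return fn()
        except PermissionError:
            if i == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay)

# usage sketch: ds = retry_on_permission_error(lambda: xr.open_zarr(path_store))
```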

Anything else we need to know?

No response

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.9 | packaged by Anaconda, Inc. | (main, Mar 1 2023, 18:18:15) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('English_United States', '1252')
libhdf5: 1.10.6
libnetcdf: None

xarray: 2022.11.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.10.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: 2.14.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: 2022.7.0
distributed: None
matplotlib: 3.7.0
cartopy: None
seaborn: 0.12.2
numbagg: None
fsspec: 2022.11.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.6.3
pip: 23.0.1
conda: 23.1.0
pytest: 7.1.2
IPython: 8.10.0
sphinx: 5.0.2
```
