Issue 7952: Tasks hang when operating on writing Zarr-backed Dataset

id: 1781667275 · node_id: I_kwDOAMm_X85qMhXL · user: ljstrnadiii (3171991) · state: closed · locked: 0 · comments: 5 · created_at: 2023-06-30T00:31:54Z · updated_at: 2023-11-03T04:51:41Z · closed_at: 2023-11-03T04:51:40Z · author_association: NONE

What happened?

When writing a dataset to zarr, we sometimes see that the last few tasks hang indefinitely, with no CPU activity or data-transfer activity visible in the dask UI. Inspecting the dask UI always shows we are waiting on a task with a name like `('store-map-34659153bd4dc964b4e5f380dacebdbe', 0, 1)`.
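
Roughly, the kind of client-side check we run when it hangs (a sketch only; it assumes the usual `distributed.Client` introspection methods and the elided scheduler address from the example below):

```Python
# Sketch: inspect what workers are doing while the Zarr write hangs.
from distributed import Client

client = Client("...")  # scheduler address elided, as in the MVCE

print(client.processing())  # task keys currently assigned to each worker
print(client.call_stack())  # call stacks of tasks that are actually running
```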

What did you expect to happen?

I expected all tasks to finish, each taking approximately the same amount of time. I also expected worker saturation to take effect and queue tasks when using xarray's map_blocks with to_zarr, but I don't see that behavior. I do see it with the dask counterpart (see the example in the extra section below).
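
By worker saturation I mean the scheduler-side queuing knob; a minimal sketch of setting and checking it, assuming a distributed version that exposes the `distributed.scheduler.worker-saturation` config key:

```Python
# Sketch: enable/inspect scheduler-side task queuing ("worker saturation").
# The key name and the 1.1 value assume a recent distributed release.
import dask

dask.config.set({"distributed.scheduler.worker-saturation": 1.1})
print(dask.config.get("distributed.scheduler.worker-saturation"))
```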

Minimal Complete Verifiable Example

```Python
import os

import dask.array as da
import xarray as xr
from distributed import Client


def main():
    client = Client("...")

    # task 1: make dataset
    variables = [str(i) for i in range(36)]
    dset = xr.Dataset(
        {
            v: (
                ("x", "y"),
                da.ones((50000, 50000), chunks=(4096, 4096), dtype="float32"),
            )
            for v in variables
        },
        coords={"x": range(50000), "y": range(50000)},
    )

    bucket = "..."
    path = f"gs://{bucket}/tmp/test/0"
    dset.to_zarr(os.path.join(path, "1"))

    # task 2: "preprocess" the dataset with map_blocks
    rt = xr.open_zarr(os.path.join(path, "1"))

    def f(block):
        # I have found copying helps reproduce the issue
        # and is often happening anyway within functions we apply
        # with map_blocks, e.g. .rio.clip, .fillna, etc.
        block = block.copy().copy()
        return block

    # first set up a target store
    target_path = os.path.join(path, "2")
    rt.to_zarr(target_path, compute=False)

    # I could not reproduce without writing a region
    region = {"x": slice(0, 20000), "y": slice(0, 10000)}
    subset = rt.isel(**region)
    subset.map_blocks(f, template=subset).to_zarr(target_path, region=region)


if __name__ == "__main__":
    main()
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

My dask cluster is deployed with Helm and has 32 workers, each with 1 CPU and 4 GB of memory.

We often mitigate this issue (poorly) by using environment variables to trigger periodic worker restarts, so that long-hanging tasks get rescheduled and the program can complete, at the cost of the restart overhead, which slows things down:

  • "DASK_DISTRIBUTED__WORKER__LIFETIME__DURATION": "8 minutes"
  • "DASK_DISTRIBUTED__WORKER__LIFETIME__STAGGER": "4 minutes"
  • "DASK_DISTRIBUTED__WORKER__LIFETIME__RESTART": "True"
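
The same settings expressed through dask's config system (a sketch; it assumes the standard mapping from `DASK_DISTRIBUTED__*` environment variables to config keys):

```Python
# Sketch: worker-lifetime mitigation via dask config instead of env vars.
import dask

dask.config.set({
    "distributed.worker.lifetime.duration": "8 minutes",
    "distributed.worker.lifetime.stagger": "4 minutes",
    "distributed.worker.lifetime.restart": True,
})
```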

In the MVCE above, we write to_zarr with region because we often have to throttle the number of tasks submitted to dask, roughly as sketched below.
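
For illustration, the kind of region loop we use to throttle submissions (a sketch only; the 10000-row stride is arbitrary, and `rt`, `f`, and `target_path` are as in the MVCE above):

```Python
# Hypothetical throttling sketch: write one region at a time so only a bounded
# number of tasks is submitted to the scheduler at once.
for start in range(0, rt.sizes["x"], 10000):
    region = {
        "x": slice(start, min(start + 10000, rt.sizes["x"])),
        "y": slice(0, rt.sizes["y"]),
    }
    subset = rt.isel(**region)
    subset.map_blocks(f, template=subset).to_zarr(target_path, region=region)
```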

Possibly relevant links:

  • https://github.com/dask/distributed/issues/391
  • https://github.com/fsspec/gcsfs/issues/379
  • https://github.com/pydata/xarray/issues/4406

Note: I have tried to minimize the example even further with the pure-dask counterpart below, but I am not seeing the issue there. I am seeing worker saturation take some effect, which might be throttling file operations to GCS; that makes me wonder why xarray's `map_blocks().to_zarr()` does not see the worker-saturation effects, and whether that would help alleviate the issues gcsfs is probably seeing.
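
```Python
import dask.array as da

# dask-only counterpart of the MVCE above; the gs:// destination is elided
array = da.random.random((250000, 250000), chunks=(4096, 4096)).astype("float32")

def f(block):
    block = block.copy().copy()
    return block

array.map_blocks(f).to_zarr("gs://...")
```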

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-1036-gcp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

xarray: 2023.5.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.8.0
Nio: None
zarr: 2.15.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.6.0
distributed: 2023.6.0
matplotlib: 3.7.1
cartopy: None
seaborn: 0.12.2
numbagg: None
fsspec: 2023.6.0  # and gcsfs.__version__ == '2023.6.0'
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.7.2
pip: 22.3.1
conda: None
pytest: 7.3.2
mypy: 1.3.0
IPython: 8.14.0
sphinx: None
reactions: 0 · state_reason: completed · repo: xarray (13221727) · type: issue
