issues


1 row where state = "open", type = "issue" and user = 10595679 sorted by updated_at descending




id: 425320466
node_id: MDU6SXNzdWU0MjUzMjA0NjY=
number: 2852
title: Allow grouping by dask variables
user: jmichel-otb (10595679)
state: open
locked: 0
assignee: (none)
milestone: (none)
comments: 10
created_at: 2019-03-26T09:55:19Z
updated_at: 2022-04-18T15:45:41Z
closed_at: (none)
author_association: CONTRIBUTOR
active_lock_reason: (none)
draft: (none)
pull_request: (none)
body:

Code Sample, a copy-pastable example if possible

I am using xarray in combination with dask distributed on a cluster, so a minimal code sample demonstrating my problem is not easy to come up with.
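As context, here is a minimal sketch of the kind of environment described (dask distributed with dask_jobqueue on PBS, as suggested by the worker log further down). The queue-less constructor call and the resource numbers are placeholders, not the reporter's actual configuration.

```python
# Hedged sketch of the assumed cluster setup: dask_jobqueue on PBS with
# adaptive (auto-scaling) workers. Resource figures are placeholders.
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=4, memory="16GB")  # placeholder per-job resources
cluster.adapt(minimum=1, maximum=20)          # auto-scaling, as described
client = Client(cluster)                      # xarray/dask work now runs on the cluster
```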

Problem description

Here is what I observe:

1. In my environment, dask distributed is correctly set up with auto-scaling. I can verify this by loading data into xarray and using aggregation functions like mean(). This triggers auto-scaling, and the dask dashboard shows that the processing is spread across the worker nodes.

2. I have the following xarray dataset, called geoms_ds:

```
<xarray.Dataset>
Dimensions:  (x: 10980, y: 10980)
Coordinates:
  * y        (y) float64 4.9e+06 4.9e+06 4.9e+06 ... 4.79e+06 4.79e+06 4.79e+06
  * x        (x) float64 3e+05 3e+05 3e+05 ... 4.098e+05 4.098e+05 4.098e+05
Data variables:
    label    (y, x) uint16 dask.array<shape=(10980, 10980), chunksize=(200, 10980)>
```

which I load with the following code sample:

```python
import xarray as xr

geoms = xr.open_rasterio('test_rasterization_T31TCJ_uint16.tif',
                         chunks={'band': 1, 'x': 10980, 'y': 200})
geoms_squeez = geoms.isel(band=0).squeeze().drop(labels='band')
geoms_ds = geoms_squeez.to_dataset(name='label')
```

This array holds a finite number of integer values denoting groups (or classes, if you like). I would like to perform statistics on these groups (against additional variables), such as the mean value of a given variable for each group.

3. I can do this perfectly well for a single group using .where(label=xxx).mean('variable'): this behaves as expected, triggering auto-scaling and a dask task graph.

4. The problem is that I have a lot of groups (or classes), and looping through all of them to apply where() is not very efficient (a reconstruction of that loop is sketched after the log below). From my reading of the xarray documentation, groupby is what I need to perform statistics on all groups at once.

5. When I try to use geoms_ds.groupby('label').size() for instance, here is what I observe:

   * Grouping is not lazy; it is evaluated immediately.
   * Grouping is not performed through dask distributed: only the master node is working, on a single thread.
   * The grouping operation takes a long time and eats a large amount of memory (nearly 30 GB, which is a lot more than what is required to store the full dataset in memory).
   * Most of the time, the grouping fails with the following errors and warnings:

```
distributed.utils_perf - WARNING - full garbage collections took 52% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 48% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 50% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 53% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 56% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 56% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 57% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 57% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 57% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 57% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 58% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 58% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 59% CPU time recently (threshold: 10%)
WARNING:dask_jobqueue.core:Worker tcp://10.135.39.92:51747 restart in Job 2758934. This can be due to memory issue.
distributed.utils - ERROR - 'tcp://10.135.39.92:51747'
Traceback (most recent call last):
  File "/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/distributed/utils.py", line 648, in log_errors
    yield
  File "/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/distributed/scheduler.py", line 1360, in add_worker
    yield self.handle_worker(comm=comm, worker=address)
  File "/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/distributed/scheduler.py", line 2220, in handle_worker
    worker_comm = self.stream_comms[worker]
KeyError: ...
```

I assume this comes from the fact that the process is killed by PBS for excessive memory usage.
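For concreteness, here is a hedged reconstruction of the per-class loop described in points 3 and 4 above. The report does not name the variable being averaged or the set of label values, so `other_var` and `class_values` below are hypothetical stand-ins.

```python
# Hedged reconstruction of the per-class approach described above.
# `other_var` (the DataArray to average) and `class_values` (the known set
# of label ids) are hypothetical placeholders, not names from the report.
per_class_mean = {}
for c in class_values:
    # One where()/mean() round-trip per class: lazy and distributed, but
    # inefficient with many classes, since each class scans the whole raster.
    per_class_mean[c] = other_var.where(geoms_ds['label'] == c).mean()
```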

Expected Output

I would expect the following:

* A single call to groupby, lazily evaluated,
* Evaluation of the aggregation function performed through dask distributed,
* The dataset is not that large; even on a single master thread the computation should finish in a reasonable time.
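As a minimal sketch of the call pattern the reporter expects (not a statement that xarray 0.11.3 behaves this way; the report above says it does not), the grouped reduction would only build a task graph, and the work would be triggered explicitly on the cluster:

```python
# Expected behaviour (sketch): building the grouped reduction stays lazy...
counts = geoms_ds.groupby('label').size()   # expected: graph construction only
means = geoms_ds.groupby('label').mean()    # expected: graph construction only

# ...and evaluation is triggered explicitly, running on the dask workers.
means = means.compute()
```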

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 03:09:43) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2
xarray: 0.11.3
pandas: 0.24.1
numpy: 1.16.1
scipy: 1.2.0
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: 1.0.15
cfgrib: None
iris: None
bottleneck: None
cyordereddict: None
dask: 1.1.1
distributed: 1.25.3
matplotlib: 3.0.2
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 40.7.1
pip: 19.0.1
conda: None
pytest: None
IPython: 7.1.1
sphinx: None
```
reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2852/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
performed_via_github_app: (none)
state_reason: (none)
repo: xarray (13221727)
type: issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
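For reference, the row above can be reproduced against this schema with a query equivalent to the filter described at the top of the page. A minimal sketch, assuming the table lives in a local SQLite file (the file name `github.db` is a placeholder):

```python
import sqlite3

# Placeholder database file name; the real Datasette instance serves this
# table from its own SQLite database.
conn = sqlite3.connect("github.db")

# Equivalent of: 1 row where state = "open", type = "issue" and
# user = 10595679, sorted by updated_at descending.
rows = conn.execute(
    """
    SELECT id, number, title, state, created_at, updated_at
    FROM issues
    WHERE state = 'open' AND type = 'issue' AND user = 10595679
    ORDER BY updated_at DESC
    """
).fetchall()

for row in rows:
    print(row)
```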