

issues


6 rows where user = 56583917 sorted by updated_at descending


type (2 values)

  • issue 4
  • pull 2

state (2 values)

  • closed 4
  • open 2

repo (1 value)

  • xarray 6
Columns: id, node_id, number, title, user, state, locked, assignee, milestone, comments, created_at, updated_at (sorted descending), closed_at, author_association, active_lock_reason, draft, pull_request, body, reactions, performed_via_github_app, state_reason, repo, type
2108557477 · PR_kwDOAMm_X85lfTRf · #8684 · Enable `numbagg` in calculation of quantiles · user: maawoo (56583917) · state: closed · locked: 0 · comments: 5 · created_at: 2024-01-30T18:59:55Z · updated_at: 2024-02-11T22:31:26Z · closed_at: 2024-02-07T16:28:04Z · author_association: CONTRIBUTOR · draft: 0 · pull_request: pydata/xarray/pulls/8684
  • [x] Closes #7377
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst

I just saw your message in the related issue, @max-sixty. This is what I came up with earlier. I also did a quick test comparing the calculation with and without numbagg for a dummy 3D DataArray. I was only wondering whether the default use of numbagg (given that it's available and method='linear') should be noted somewhere in the docstrings and/or the docs in general.
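As a rough sketch of the comparison described above (not the PR's actual test; it assumes numbagg is installed and uses xarray's `use_numbagg` option to toggle the accelerated path), one could cross-check the two code paths like this:

```python
import numpy as np
import xarray as xr

# Dummy 3D DataArray with some NaNs, in the spirit of the quick test mentioned above.
data = np.random.rand(20, 500, 500)
data[data < 0.2] = np.nan
da = xr.DataArray(data, dims=['time', 'x', 'y'])

# method='linear' (the default) is the case the numbagg path covers.
with xr.set_options(use_numbagg=False):
    q_ref = da.quantile(0.95, dim='time', skipna=True)

with xr.set_options(use_numbagg=True):
    q_fast = da.quantile(0.95, dim='time', skipna=True)

# Both paths should agree to floating-point tolerance.
np.testing.assert_allclose(q_ref.values, q_fast.values)
```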

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8684/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: pull
1497031605 · I_kwDOAMm_X85ZOuO1 · #7377 · Aggregating a dimension using the Quantiles method with `skipna=True` is very slow · user: maawoo (56583917) · state: closed · locked: 0 · comments: 17 · created_at: 2022-12-14T16:52:35Z · updated_at: 2024-02-07T16:28:05Z · closed_at: 2024-02-07T16:28:05Z · author_association: CONTRIBUTOR

What happened?

Hi all, as the title already summarizes, I'm running into performance issues when aggregating over the time dimension of a 3D DataArray using the quantile method with skipna=True. See the section below for some dummy data that represents what I'm working with (e.g., similar to this). Aggregating over the time dimension of this dummy data, I get the following wall times:

| # | Expression | Wall time |
| --------------- | --------------- | --------------- |
| 1 | da.median(dim='time', skipna=True) | 1.35 s |
| 2 | da.quantile(0.95, dim='time', skipna=False) | 5.95 s |
| 3 | da.quantile(0.95, dim='time', skipna=True) | 6 min 6 s |

I'm currently using a compute node with 40 CPUs and 180 GB RAM. Here is what the resource utilization looks like: the first small bump corresponds to 1 and 2; the second, longer peak is 3.

In this small example, the process at least finishes after a few minutes. With my actual dataset the quantile calculation takes hours...

I guess the following issue is relevant and should be revived: https://github.com/numpy/numpy/issues/16575

Are there any possible work-arounds?

What did you expect to happen?

No response

Minimal Complete Verifiable Example

```python
import pandas as pd
import numpy as np
import xarray as xr

# Create dummy data with 20% random NaNs
size_spatial = 2000
size_temporal = 20
n_nan = int(size_temporal * size_spatial**2 * 0.2)  # 20% of all elements

time = pd.date_range("2000-01-01", periods=size_temporal)
lat = np.random.uniform(low=-90, high=90, size=size_spatial)
lon = np.random.uniform(low=-180, high=180, size=size_spatial)
data = np.random.rand(size_temporal, size_spatial, size_spatial)
index_nan = np.random.choice(data.size, n_nan, replace=False)
data.ravel()[index_nan] = np.nan

# Create DataArray
da = xr.DataArray(data=data,
                  dims=['time', 'x', 'y'],
                  coords={'time': time, 'x': lon, 'y': lat},
                  attrs={'nodata': np.nan})

# Calculate 95th quantile over the time dimension
da.quantile(0.95, dim='time', skipna=True)
```
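For reference, a minimal sketch of how wall times like those in the table above could be measured on this dummy data (plain `time.perf_counter`; in IPython, `%time` on each line works just as well):

```python
import time

# Reuses `da` from the example above; each call is timed once.
for label, call in [
    ("median, skipna=True", lambda: da.median(dim='time', skipna=True)),
    ("quantile 0.95, skipna=False", lambda: da.quantile(0.95, dim='time', skipna=False)),
    ("quantile 0.95, skipna=True", lambda: da.quantile(0.95, dim='time', skipna=True)),
]:
    start = time.perf_counter()
    call()
    print(f"{label}: {time.perf_counter() - start:.2f} s")
```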

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-125-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.0
xarray: 2022.12.0
pandas: 1.5.0
numpy: 1.23.3
scipy: 1.9.1
netCDF4: 1.6.1
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.3
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: 2022.10.0
distributed: 2022.10.0
matplotlib: 3.6.1
cartopy: 0.21.0
seaborn: 0.12.0
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.0
pip: 22.3
conda: 4.12.0
pytest: None
mypy: None
IPython: 8.5.0
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7377/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue
2108212331 · PR_kwDOAMm_X85leIWu · #8683 · Docs: Fix url in "Contribute to xarray" guide · user: maawoo (56583917) · state: closed · locked: 0 · comments: 3 · created_at: 2024-01-30T15:59:37Z · updated_at: 2024-01-30T18:13:36Z · closed_at: 2024-01-30T18:13:26Z · author_association: CONTRIBUTOR · draft: 0 · pull_request: pydata/xarray/pulls/8683

The URL in the section about creating a local development environment was pointing to itself. The new URL points to the (I assume) correct section further down in the same guide.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8683/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: pull
2098488235 · I_kwDOAMm_X859FGOr · #8654 · Inconsistent preservation of chunk alignment for groupby-/resample-reduce operations w/o using flox · user: maawoo (56583917) · state: closed · locked: 0 · comments: 2 · created_at: 2024-01-24T15:12:38Z · updated_at: 2024-01-24T16:23:20Z · closed_at: 2024-01-24T15:58:22Z · author_association: CONTRIBUTOR

What happened?

When performing groupby-/resample-reduce operations (e.g., ds.resample(time="6h").mean(), as shown here), the alignment of chunks is not preserved when flox is disabled:

...whereas the alignment is preserved when flox is enabled:

What did you expect to happen?

The alignment of chunks is preserved whether using flox or not.

Minimal Complete Verifiable Example

```python
import pandas as pd
import numpy as np
import xarray as xr

size_spatial = 1000
size_temporal = 200
time = pd.date_range("2000-01-01", periods=size_temporal, freq='h')
lat = np.random.uniform(low=-90, high=90, size=size_spatial)
lon = np.random.uniform(low=-180, high=180, size=size_spatial)
data = np.random.rand(size_temporal, size_spatial, size_spatial)

da = xr.DataArray(data=data,
                  dims=['time', 'x', 'y'],
                  coords={'time': time, 'x': lon, 'y': lat}).chunk({'time': -1, 'x': 'auto', 'y': 'auto'})

# Chunk alignment not preserved
with xr.set_options(use_flox=False):
    da_1 = da.copy(deep=True)
    da_1 = da_1.resample(time="6h").mean()

# Chunk alignment preserved
with xr.set_options(use_flox=True):
    da_2 = da.copy(deep=True)
    da_2 = da_2.resample(time="6h").mean()
```
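To make the mismatch visible, the resulting chunk layouts can be compared directly (a small addition for illustration, not part of the original report):

```python
# Compare the spatial ('x', 'y') chunk sizes of the input and the two results;
# in the reported behaviour they differ for the use_flox=False case.
print("input:         ", da.chunks)
print("use_flox=False:", da_1.chunks)
print("use_flox=True: ", da_2.chunks)
```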

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:38:07) [Clang 16.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.4.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2024.1.1
pandas: 2.2.0
numpy: 1.26.3
scipy: 1.12.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.1.0
distributed: 2024.1.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: 0.7.1
fsspec: 2023.12.2
cupy: None
pint: None
sparse: None
flox: 0.9.0
numpy_groupies: 0.10.2
setuptools: 69.0.3
pip: 23.3.2
conda: None
pytest: None
mypy: None
IPython: 8.20.0
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8654/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: not_planned · repo: xarray (13221727) · type: issue
1963071630 · I_kwDOAMm_X851AhiO · #8378 · Extend DatetimeAccessor with `snap`-method · user: maawoo (56583917) · state: open · locked: 0 · comments: 2 · created_at: 2023-10-26T09:16:24Z · updated_at: 2023-10-27T08:08:58Z · author_association: CONTRIBUTOR

Is your feature request related to a problem?

With satellite remote sensing data, you sometimes end up with a blown-up DataArray/Dataset because individual acquisitions have been saved in slices:

One could then aggregate these slices with something like this:

```python
ds.coords['time'] = ds.time.dt.floor('1H')  # or .ceil
ds = ds.groupby('time').mean()
```

However, this would miss cases where one slice has been acquired before and the other after a specific hour. The pandas.DatetimeIndex.snap method could be a good alternative for such cases.
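Until such a method exists on the accessor, one rough workaround (a sketch only; `ds` stands for the Dataset from the example above and the hourly frequency is illustrative) is to apply pandas.DatetimeIndex.snap to the time index directly:

```python
import pandas as pd

# Snap each timestamp to the nearest hourly boundary instead of always flooring or ceiling.
snapped = pd.DatetimeIndex(ds.indexes['time']).snap('H')
ds = ds.assign_coords(time=snapped).groupby('time').mean()
```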

Describe the solution you'd like

In addition to the floor, ceil and round methods, it would be great to also implement pandas.DatetimeIndex.snap.

Describe alternatives you've considered

No response

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8378/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: issue
1497131525 · I_kwDOAMm_X85ZPGoF · #7378 · Improve docstrings for better discoverability · user: maawoo (56583917) · state: open · locked: 0 · comments: 9 · created_at: 2022-12-14T17:59:20Z · updated_at: 2023-04-02T04:26:57Z · author_association: CONTRIBUTOR

What is your issue?

I noticed that the docstrings of the aggregation methods are mostly written in the same style, e.g. "Reduce this Dataset's data by applying xy along some dimension(s).". Let's say a user is interested in calculating the variance and searches for the appropriate method. Neither xarray.DataArray.var nor xarray.Dataset.var will be returned (see here), because "variance" is not mentioned at all in the docstrings. The same problem exists for other methods such as .std, .prod, .cumsum, .cumprod, and probably others.
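As an illustration of the kind of change this would entail (a hypothetical docstring fragment, not xarray's current wording), mentioning the plain-English term in the summary line is enough to make the method searchable:

```python
# Hypothetical sketch of a more discoverable docstring; signature simplified.
def var(self, dim=None, **kwargs):
    """Reduce this DataArray's data by applying ``var`` (variance) along some dimension(s).

    Including the word "variance" lets documentation search find this method.
    """
    ...
```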

https://github.com/pydata/xarray/issues/6793 is related, but I guess it already has enough tasks.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7378/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: issue


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);