issues

12 rows where state = "closed" and user = 1312546 sorted by updated_at descending

Columns: id, node_id, number, title, user, state, locked, assignee, milestone, comments, created_at, updated_at (sort column, descending), closed_at, author_association, active_lock_reason, draft, pull_request, body, reactions, performed_via_github_app, state_reason, repo, type
1038531231 PR_kwDOAMm_X84tzEEk 5906 Avoid accessing slow .data in unstack TomAugspurger 1312546 closed 0     4 2021-10-28T13:39:36Z 2021-10-29T15:29:39Z 2021-10-29T15:14:43Z MEMBER   0 pydata/xarray/pulls/5906
  • [x] Closes https://github.com/pydata/xarray/issues/5902
  • [x] Passes pre-commit run --all-files
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [x] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5906/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1037894157 I_kwDOAMm_X8493QIN 5902 Slow performance of `DataArray.unstack()` from checking `variable.data` TomAugspurger 1312546 closed 0     4 2021-10-27T21:54:48Z 2021-10-29T15:21:24Z 2021-10-29T15:21:24Z MEMBER      

What happened:

Calling DataArray.unstack() spends time allocating an object-dtype NumPy array from values of the pandas MultiIndex.

What you expected to happen:

Faster unstack.

Minimal Complete Verifiable Example:

```python
import pandas as pd
import numpy as np
import xarray as xr

t = pd.date_range("2000", periods=2)
x = np.arange(1000)
y = np.arange(1000)
component = np.arange(4)

idx = pd.MultiIndex.from_product([t, y, x], names=["time", "y", "x"])

data = np.random.uniform(size=(len(idx), len(component)))
arr = xr.DataArray(
    data,
    coords={
        "pixel": xr.DataArray(idx, name="pixel", dims="pixel"),
        "component": xr.DataArray(component, name="component", dims="component"),
    },
    dims=("pixel", "component"),
)

%time _ = arr.unstack()
CPU times: user 6.33 s, sys: 295 ms, total: 6.62 s
Wall time: 6.62 s
```

Anything else we need to know?:

For this example, >99% of the time is spent on this line: https://github.com/pydata/xarray/blob/df7646182b17d829fe9b2199aebf649ddb2ed480/xarray/core/dataset.py#L4162, specifically on the call to v.data for the pixel array, which is a pandas MultiIndex.

Just going by the comments, it does seem like accessing v.data is necessary to perform the check. I'm wondering if we could make the is_duck_dask_array check a bit smarter, to avoid unnecessarily allocating data?
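A minimal sketch, not xarray's actual implementation, of what a cheaper check could look like; the helper name and the use of the private ._data attribute are assumptions for illustration only:

```python
import xarray as xr


def looks_dask_backed(variable: xr.Variable) -> bool:
    # Index coordinates wrap a pandas Index and are never dask-backed, so we
    # can answer without materializing .data (the expensive step described above).
    if isinstance(variable, xr.IndexVariable):
        return False
    wrapped = getattr(variable, "_data", None)  # private attribute, shown for illustration
    return hasattr(wrapped, "dask")  # duck-typed stand-in for is_duck_dask_array
```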

Alternatively, if that's too difficult, perhaps we could add a flag to unstack to disable those checks and just take the "slow" path. In my actual use-case, the slow _unstack_full_reindex is necessary since I have large Dask Arrays. But even then, the unstack completes in less than 3s, while I was getting OOM killed on the v.data checks.

Environment:

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.12 | packaged by conda-forge | (default, Sep 29 2021, 19:52:28) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-1040-azure
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 0.19.0
pandas: 1.3.3
numpy: 1.20.0
scipy: 1.7.1
netCDF4: 1.5.7
pydap: installed
h5netcdf: 0.11.0
h5py: 3.4.0
Nio: None
zarr: 2.10.1
cftime: 1.5.1
nc_time_axis: 1.3.1
PseudoNetCDF: None
rasterio: 1.2.9
cfgrib: 0.9.9.0
iris: None
bottleneck: 1.3.2
dask: 2021.08.1
distributed: 2021.08.1
matplotlib: 3.4.3
cartopy: 0.20.0
seaborn: 0.11.2
numbagg: None
pint: 0.17
setuptools: 58.0.4
pip: 20.3.4
conda: None
pytest: None
IPython: 7.28.0
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5902/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
857301324 MDU6SXNzdWU4NTczMDEzMjQ= 5151 DataArray.mean() emits warning with Dask, not NumPy TomAugspurger 1312546 closed 0     3 2021-04-13T20:34:56Z 2021-09-15T16:41:43Z 2021-09-15T16:41:43Z MEMBER      

What happened:

When calling DataArray.mean on an all-NaN dataset, a warning is emitted if and only if a Dask array is used.

What you expected to happen:

Identical behavior between the two, probably no warning.

Minimal Complete Verifiable Example:

```python
In [7]: import xarray as xr

In [8]: import numpy as np

In [9]: import dask.array as da

In [10]: import xarray as xr

In [11]: a = xr.DataArray(da.from_array(np.full((10, 10), np.nan)))

In [12]: a.mean(dim="dim_0").compute()
/home/taugspurger/miniconda3/envs/tmp-adlfs/lib/python3.8/site-packages/dask/array/numpy_compat.py:39: RuntimeWarning: invalid value encountered in true_divide
  x = np.divide(x1, x2, out)
Out[12]:
<xarray.DataArray 'array-395d894c4e4d4ca165a189736da1f52d' (dim_1: 10)>
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
Dimensions without coordinates: dim_1

In [13]: a.compute().mean(dim="dim_0")
Out[13]:
<xarray.DataArray 'array-395d894c4e4d4ca165a189736da1f52d' (dim_1: 10)>
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
Dimensions without coordinates: dim_1
```

Anything else we need to know?:

I haven't looked closely at why this is happening (I couldn't immediately find where .mean is reduced). I know that Dask has had some issues in the past where NumPy warnings filters are set during graph construction time, but aren't set when the graph is actually computed.
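A small sketch of that timing mismatch (the filter placement here is illustrative, not a proposed fix): a filter installed while the graph is being built has no effect when the division actually runs inside compute().

```python
import warnings

import dask.array as da
import numpy as np
import xarray as xr

arr = xr.DataArray(da.from_array(np.full((10, 10), np.nan)))

with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    lazy = arr.mean(dim="dim_0")  # only builds the task graph; nothing runs here

# The divide that triggers the RuntimeWarning only runs now, outside the
# filter that was active at graph-construction time.
result = lazy.compute()
```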

Environment:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.72-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.17.0
pandas: 1.2.4
numpy: 1.20.2
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.7.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.04.0
distributed: 2021.04.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 52.0.0.post20210125
pip: 21.0.1
conda: None
pytest: None
IPython: 7.22.0
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5151/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
770937642 MDU6SXNzdWU3NzA5Mzc2NDI= 4708 Potentially spurious warning in rechunk TomAugspurger 1312546 closed 0     0 2020-12-18T14:37:32Z 2020-12-24T11:32:43Z 2020-12-24T11:32:43Z MEMBER      

What happened:

When reading a zarr dataset where the last chunk is smaller than the chunk size, users see a UserWarning that this may be inefficient, since the chunking differs from the chunking on disk. In general that's a good warning, but it shouldn't appear when the only difference between the on-disk chunking and the Dataset chunking is the last chunk.

What you expected to happen:

No warning.

Minimal Complete Verifiable Example:

```python
# Create and write the data
import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 2, 3)
precipitation = 10 * np.random.rand(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
time = pd.date_range("2014-09-06", periods=3)
reference_time = pd.Timestamp("2014-09-05")
ds = xr.Dataset(
    data_vars=dict(
        temperature=(["x", "y", "time"], temperature),
        precipitation=(["x", "y", "time"], precipitation),
    ),
    coords=dict(
        lon=(["x", "y"], lon),
        lat=(["x", "y"], lat),
        time=time,
        reference_time=reference_time,
    ),
    attrs=dict(description="Weather related data."),
)
ds2 = ds.chunk(chunks=dict(time=(2, 1)))
ds2['temperature'].chunks

ds2.to_zarr("/tmp/test.zarr", mode="w")
```

Reading it produces a warning:

```python
xr.open_zarr("/tmp/test.zarr")
/mnt/c/Users/taugspurger/src/xarray/xarray/core/dataset.py:408: UserWarning: Specified Dask chunks (2, 1) would separate on disks chunk shape 2 for dimension time. This could degrade performance. Consider rechunking after loading instead.
  _check_chunks_compatibility(var, output_chunks, preferred_chunks)
```

Anything else we need to know?:

The check around https://github.com/pydata/xarray/blob/91318d2ee63149669404489be9198f230d877642/xarray/core/dataset.py#L371-L378 should probably ignore the very last chunk, since Zarr allows it to be different?
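A sketch of the kind of comparison suggested above; the helper name and arguments are hypothetical, not the xarray internals:

```python
def chunks_compatible(preferred, requested):
    """Hypothetical check: `preferred` is the on-disk (zarr) chunk size along one
    dimension, `requested` are the chunk sizes asked for at open time."""
    # Every chunk except the last must match the on-disk chunk size;
    # zarr always allows the trailing chunk to be smaller.
    return all(c == preferred for c in requested[:-1]) and requested[-1] <= preferred


assert chunks_compatible(2, (2, 1))       # the example above: no warning warranted
assert not chunks_compatible(2, (1, 2))   # genuinely mismatched chunking
```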

Environment:

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.19.128-microsoft-standard
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.16.3.dev21+g96e1aea0
pandas: 1.1.4
numpy: 1.19.4
scipy: 1.5.4
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.6.2.dev9+dirty
cftime: 1.3.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.2.4
conda: None
pytest: 5.4.3
IPython: 7.19.0
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4708/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
704668670 MDExOlB1bGxSZXF1ZXN0NDg5NTQ5MzIx 4438 Fixed dask.optimize on datasets TomAugspurger 1312546 closed 0     3 2020-09-18T21:30:17Z 2020-09-20T05:21:58Z 2020-09-20T05:21:58Z MEMBER   0 pydata/xarray/pulls/4438

Another attempt to fix #3698. The issue with my earlier fix (#4432) is that we hit Variable._dask_finalize in both dask.optimize and dask.persist. We want to cull unnecessary tasks (test_persist_Dataset), but only in the persist case, not in optimize (test_optimize).
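For context, a minimal illustration (not part of this PR) of the two entry points described above, using only public dask and xarray API; the toy dataset is made up:

```python
import dask
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", np.arange(10))}).chunk({"x": 5})

(optimized,) = dask.optimize(ds)  # should return an equivalent, still-lazy Dataset
persisted = ds.persist()          # materializes the chunks, culling unused tasks
```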

  • [x] Closes #3698
  • [x] Tests added
  • [x] Passes isort . && black . && mypy . && flake8
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4438/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
703881154 MDExOlB1bGxSZXF1ZXN0NDg4OTA4MTI5 4432 Fix optimize for chunked DataArray TomAugspurger 1312546 closed 0     8 2020-09-17T20:16:08Z 2020-09-18T13:20:45Z 2020-09-17T23:19:23Z MEMBER   0 pydata/xarray/pulls/4432

Previously we generated an invalid Dask task graph, because the lines removed here dropped keys that were referenced elsewhere in the task graph. The original implementation had a comment indicating that this was done to cull: https://github.com/pydata/xarray/blob/502a988ad5b87b9f3aeec3033bf55c71272e1053/xarray/core/variable.py#L384

Just spot-checking things, I think we're OK here though: something like dask.visualize(arr[[0]], optimize_graph=True) indicates that the optimized graph is still valid.
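As a sketch of that spot check (the toy array below is made up, and dask.visualize additionally needs graphviz installed, so this uses optimize-then-compute instead):

```python
import dask
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(10), dims="x").chunk(5)

# Indexing creates a graph that references a subset of the original keys;
# optimizing and then computing it exercises the code path touched here.
(optimized,) = dask.optimize(arr[[0]])
optimized.compute()
```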

  • [x] Closes #3698
  • [x] Tests added
  • [x] Passes isort . && black . && mypy . && flake8
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [x] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4432/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
672281867 MDExOlB1bGxSZXF1ZXN0NDYyMzQ2NzE4 4305 Fix map_blocks examples TomAugspurger 1312546 closed 0     5 2020-08-03T19:06:58Z 2020-08-04T07:27:08Z 2020-08-04T03:38:51Z MEMBER   0 pydata/xarray/pulls/4305

The examples on master raised with:

```pytb
ValueError: Result from applying user function has unexpected coordinate variables {'month'}.
```

This PR updates the example to include the month coordinate. pytest --doctest-modules passes on these three now.
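A sketch in the spirit of that fix (the dates, chunk size, and anomaly function below are illustrative, not the exact docstring example): the applied function's output gains a month coordinate, so the object used as the template has to carry that coordinate too.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", freq="D", periods=120)
da = xr.DataArray(
    np.random.rand(time.size),
    dims="time",
    coords={"time": time, "month": ("time", time.month)},
).chunk({"time": 30})


def anomaly(block):
    gb = block.groupby("time.month")
    return gb - gb.mean("time")  # the result carries a "month" coordinate


# A template without the "month" coordinate triggers the ValueError quoted
# above; with it, map_blocks round-trips cleanly.
result = xr.map_blocks(anomaly, da, template=da).compute()
```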

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4305/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
672195744 MDExOlB1bGxSZXF1ZXN0NDYyMjc2NDEw 4303 Update map_blocks and map_overlap docstrings TomAugspurger 1312546 closed 0     1 2020-08-03T16:27:45Z 2020-08-03T18:35:43Z 2020-08-03T18:06:10Z MEMBER   0 pydata/xarray/pulls/4303

These docstrings reference an obj argument that only exists in the parallel module; the object being referenced is actually self.
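A small illustration of the distinction behind the fix (the array and lambda here are made up): the module-level xr.map_blocks takes the object as an explicit argument, while the method form operates on the DataArray itself, so its docstring shouldn't talk about an obj parameter.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6), dims="x").chunk(3)

via_function = xr.map_blocks(lambda b: b + 1, da)  # free function: pass the object
via_method = da.map_blocks(lambda b: b + 1)        # method: operates on da itself
xr.testing.assert_identical(via_function.compute(), via_method.compute())
```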

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4303/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
533555794 MDExOlB1bGxSZXF1ZXN0MzQ5NjA5NDM3 3598 Fix map_blocks HLG layering TomAugspurger 1312546 closed 0     2 2019-12-05T19:41:23Z 2019-12-07T04:30:19Z 2019-12-07T04:30:19Z MEMBER   0 pydata/xarray/pulls/3598

[x] closes #3599

This fixes an issue with the HighLevelGraph noted in https://github.com/pydata/xarray/pull/3584, and exposed by a recent change in Dask to do more HLG fusion.

cc @dcherian.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3598/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
400997415 MDExOlB1bGxSZXF1ZXN0MjQ2MDQ4MDcx 2693 Update asv.conf.json TomAugspurger 1312546 closed 0     1 2019-01-19T13:45:51Z 2019-01-19T19:42:48Z 2019-01-19T17:45:20Z MEMBER   0 pydata/xarray/pulls/2693

Is xarray Python 3.5+ only now? Congrats, I didn't realize that.

This started failing on the benchmark machine, which I was tending to last night.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2693/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
279161550 MDU6SXNzdWUyNzkxNjE1NTA= 1759 dask compute on reduction fails with ValueError TomAugspurger 1312546 closed 0     17 2017-12-04T21:45:41Z 2017-12-07T22:09:18Z 2017-12-07T22:09:18Z MEMBER

I'm doing a reduction like mean on a dask-backed DataArray, and passing it to dask.compute

```python
In [3]: from dask import compute
   ...: import numpy as np
   ...: import xarray as xr
   ...:

In [4]: data = xr.DataArray(np.random.random(size=(10, 2)),
   ...:                     dims=['samples', 'features']).chunk((5, 2))
   ...:

In [5]: compute(data.mean(axis=0))
```

```pytb
ValueError                                Traceback (most recent call last)
<ipython-input-5-47605102585c> in <module>()
----> 1 compute(data.mean(axis=0))

~/Envs/dask-dev/lib/python3.6/site-packages/dask/dask/base.py in compute(*args, **kwargs)
    334     results_iter = iter(results)
    335     return tuple(a if f is None else f(next(results_iter), *a)
--> 336                  for f, a in postcomputes)
    337
    338

~/Envs/dask-dev/lib/python3.6/site-packages/dask/dask/base.py in <genexpr>(.0)
    334     results_iter = iter(results)
    335     return tuple(a if f is None else f(next(results_iter), *a)
--> 336                  for f, a in postcomputes)
    337
    338

~/Envs/dask-dev/lib/python3.6/site-packages/xarray/xarray/core/dataarray.py in _dask_finalize(results, func, args, name)
    607     @staticmethod
    608     def _dask_finalize(results, func, args, name):
--> 609         ds = func(results, *args)
    610         variable = ds._variables.pop(_THIS_ARRAY)
    611         coords = ds._variables

~/Envs/dask-dev/lib/python3.6/site-packages/xarray/xarray/core/dataset.py in _dask_postcompute(results, info, *args)
    551                 func, args2 = v
    552                 r = results2.pop()
--> 553                 result = func(r, *args2)
    554             else:
    555                 result = v

~/Envs/dask-dev/lib/python3.6/site-packages/xarray/xarray/core/variable.py in _dask_finalize(results, array_func, array_args, dims, attrs, encoding)
    389         results = {k: v for k, v in results.items() if k[0] == name}  # cull
    390         data = array_func(results, *array_args)
--> 391         return Variable(dims, data, attrs=attrs, encoding=encoding)
    392
    393     @property

~/Envs/dask-dev/lib/python3.6/site-packages/xarray/xarray/core/variable.py in __init__(self, dims, data, attrs, encoding, fastpath)
    267         """
    268         self._data = as_compatible_data(data, fastpath=fastpath)
--> 269         self._dims = self._parse_dimensions(dims)
    270         self._attrs = None
    271         self._encoding = None

~/Envs/dask-dev/lib/python3.6/site-packages/xarray/xarray/core/variable.py in _parse_dimensions(self, dims)
    431             raise ValueError('dimensions %s must have the same length as the '
    432                              'number of data dimensions, ndim=%s'
--> 433                              % (dims, self.ndim))
    434         return dims
    435

ValueError: dimensions ('features',) must have the same length as the number of data dimensions, ndim=0
```

The expected output is the .compute version, which works correctly:

```python
In [7]: data.mean(axis=0).compute()
Out[7]:
<xarray.DataArray (features: 2)>
array([0.535643, 0.459406])
Dimensions without coordinates: features
```

```
In [6]: xr.show_versions()
INSTALLED VERSIONS
------------------
commit: c2b205f29467a4431baa80b5c07fe31bda67fbef
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
xarray: 0.10.0-5-gc2b205f
pandas: 0.22.0.dev0+118.g4c6387520
numpy: 1.14.0.dev0+2995e6a
scipy: 1.1.0.dev0+b6fd544
netCDF4: 1.3.1
h5netcdf: None
Nio: None
bottleneck: None
cyordereddict: None
dask: 0.16.0+15.gcbc62fbef
matplotlib: 2.1.0
cartopy: None
seaborn: 0.8.1
setuptools: 36.7.2
pip: 10.0.0.dev0
conda: None
pytest: 3.2.3
IPython: 6.2.1
sphinx: 1.6.5
```

Apologies if I'm doing something silly here, I don't know xarray :)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1759/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
251773472 MDExOlB1bGxSZXF1ZXN0MTM2ODQ1MjE2 1515 Added show_commit_url to asv.conf TomAugspurger 1312546 closed 0     0 2017-08-21T21:17:10Z 2017-08-23T16:01:50Z 2017-08-23T16:01:50Z MEMBER   0 pydata/xarray/pulls/1515

This should set up the proper links from the published output to the commit on GitHub.

FYI the benchmarks should be running stably now, and posted to http://pandas.pydata.org/speed/xarray. http://pandas.pydata.org/speed/xarray/regressions.xml has an RSS feed to the regressions.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1515/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);