id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1975574237,I_kwDOAMm_X851wN7d,8409,Task graphs on `.map_blocks` with many chunks can be huge,5635139,closed,0,,,6,2023-11-03T07:14:45Z,2024-01-03T04:10:16Z,2024-01-03T04:10:16Z,MEMBER,,,,"### What happened?

I'm getting task graphs > 1GB, I think possibly because the full indexes are being included in every task?

### What did you expect to happen?

Only the relevant sections of the index would be included.

### Minimal Complete Verifiable Example

```Python
da = xr.tutorial.load_dataset('air_temperature')

# Dropping the index doesn't generally matter that much...
len(cloudpickle.dumps(da.chunk(lat=1, lon=1)))  # 15569320
len(cloudpickle.dumps(da.chunk().drop_vars(da.indexes)))  # 15477313

# But with `.map_blocks`, it really matters — it's really big with the indexes, and the same size without:
len(cloudpickle.dumps(da.chunk(lat=1, lon=1).map_blocks(lambda x: x)))  # 79307120
len(cloudpickle.dumps(da.chunk(lat=1, lon=1).drop_vars(da.indexes).map_blocks(lambda x: x)))  # 16016173
```

### MVCE confirmation

- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

### Relevant log output

_No response_

### Anything else we need to know?

_No response_

### Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.18 (main, Aug 24 2023, 21:19:58) [Clang 14.0.3 (clang-1403.0.22.14.1)]
python-bits: 64
OS: Darwin
OS-release: 22.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: en_US.UTF-8
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None
xarray: 2023.10.1
pandas: 2.1.1
numpy: 1.26.1
scipy: 1.11.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.16.0
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.5.0
distributed: 2023.5.0
matplotlib: 3.6.0
cartopy: None
seaborn: 0.12.2
numbagg: 0.6.0
fsspec: 2022.8.2
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.7.2
numpy_groupies: 0.9.22
setuptools: 68.1.2
pip: 23.2.1
conda: None
pytest: 7.4.0
mypy: 1.6.1
IPython: 8.14.0
sphinx: 5.2.1","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8409/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
988158051,MDU6SXNzdWU5ODgxNTgwNTE=,5764,Implement __sizeof__ on objects?,5635139,open,0,,,6,2021-09-03T23:36:53Z,2023-12-19T18:23:08Z,,MEMBER,,,,"**Is your feature request related to a problem? Please describe.**

Currently `ds.nbytes` returns the size of the data. But `sys.getsizeof(ds)` returns a very small number.

**Describe the solution you'd like**

If we implement `__sizeof__` on DataArrays & Datasets, this would work. I think that would be something like `ds.nbytes` plus the size of the `ds` container, plus maybe attrs if those aren't handled by `.nbytes`?","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5764/reactions"", ""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,reopened,13221727,issue
866826033,MDU6SXNzdWU4NjY4MjYwMzM=,5215,"Add a Cumulative aggregation, similar to Rolling",5635139,closed,0,,,6,2021-04-24T19:59:49Z,2023-12-08T22:06:53Z,2023-12-08T22:06:53Z,MEMBER,,,,"**Is your feature request related to a problem? Please describe.**

Pandas has a [`.expanding` aggregation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.expanding.html), which is basically rolling with a full lookback. I often end up supplying rolling with the length of the dimension, and this is some nice sugar for that.

**Describe the solution you'd like**

Basically the same as pandas — a `.expanding` method that returns an `Expanding` class, which implements the same methods as a `Rolling` class.

**Describe alternatives you've considered**

Some options:
– This
– Don't add anything; the sugar isn't worth the additional API.
– Go all out and write specialized expanding algos — which will be faster since they don't have to keep track of the window. But not that much faster, and likely not worth the effort.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5215/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
1878288525,PR_kwDOAMm_X85ZYos5,8139,Fix pandas' `interpolate(fill_value=)` error,5635139,closed,0,,,6,2023-09-02T02:41:45Z,2023-09-28T16:48:51Z,2023-09-04T18:05:14Z,MEMBER,,0,pydata/xarray/pulls/8139,"Pandas no longer has a `fill_value` parameter for `interpolate`. Weirdly I wasn't getting this locally on pandas 2.1.0, only in CI on https://github.com/pydata/xarray/actions/runs/6054400455/job/16431747966?pr=8138.

Removing it passes locally; let's see whether this works in CI.

Would close #8125","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8139/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
967854972,MDExOlB1bGxSZXF1ZXN0NzEwMDA1NzY4,5694,Ask PRs to annotate tests,5635139,closed,0,,,6,2021-08-12T02:19:28Z,2023-09-28T16:46:19Z,2023-06-19T05:46:36Z,MEMBER,,0,pydata/xarray/pulls/5694,"- [x] Passes `pre-commit run --all-files`
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`

As discussed in https://github.com/pydata/xarray/pull/5690#issuecomment-897280353","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5694/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
1874148181,I_kwDOAMm_X85vtTtV,8123,`.rolling_exp` arguments could be clearer,5635139,open,0,,,6,2023-08-30T18:09:04Z,2023-09-01T00:25:08Z,,MEMBER,,,,"### Is your feature request related to a problem?

Currently we call `.rolling_exp` like:

```
da.rolling_exp(date=20).mean()
```

`20` refers to a ""standard"" window type — broadly ""the same average distance as a simple rolling window"". That works well, and matches the `.rolling(date=20).mean()` format.

But we also have different window types, and this makes it a bit incongruent:

```
da.rolling_exp(date=0.5, window_type=""alpha"").mean()
```

...since the `window_type` completely changes the meaning of the value we pass to the dimension argument. A bit like someone asking ""how many apples would you like to buy"", and replying ""5"", and then separately saying ""when I said 5, I meant 5 _tonnes_"".

### Describe the solution you'd like

One option would be:

```
.rolling_exp(date={""alpha"": 0.5})
```

We pass a dict if we want a non-standard window type — so the value is attached to its type. We could still have the original form for `da.rolling_exp(date=20).mean()`.

### Describe alternatives you've considered

_No response_

### Additional context

(I realize I wrote this originally; all criticism directed at me! This is based on feedback from a colleague, which on reflection I agree with.)

Unless anyone disagrees, I'll try to do this soon-ish™","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8123/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
729208432,MDExOlB1bGxSZXF1ZXN0NTA5NzM0NTM2,4540,numpy_groupies,5635139,closed,0,,,6,2020-10-26T03:37:19Z,2022-02-05T22:24:12Z,2021-10-24T00:18:52Z,MEMBER,,0,pydata/xarray/pulls/4540,"- [x] Closes https://github.com/pydata/xarray/issues/4473
- [ ] Tests added
- [x] Passes `isort . && black . && mypy . && flake8`
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [ ] New functions/methods are listed in `api.rst`

Very early effort — I found this harder than I expected. I was trying to use the existing groupby infra, but think I maybe should start afresh: the result of the `numpy_groupies` operation is a fully formed array, whereas we're used to handling an iterable of results which need to be concatenated.

I also added some type signatures / notes as I was going through the existing code, mostly for my own understanding.

If anyone has any thoughts, feel free to comment — otherwise I'll resume this soon.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4540/reactions"", ""total_count"": 4, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 2, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
399164733,MDExOlB1bGxSZXF1ZXN0MjQ0NjU3NTk5,2674,Skipping variables in datasets that don't have the core dim,5635139,closed,0,,,6,2019-01-15T02:43:11Z,2021-05-13T22:02:19Z,2021-05-13T22:02:19Z,MEMBER,,0,pydata/xarray/pulls/2674,"ref https://github.com/pydata/xarray/pull/2650#issuecomment-454164295

This seems an ugly way of accomplishing the goal; any ideas for a better way of doing this? And stepping back, do others think a) it's helpful to skip variables in a dataset, and b) `apply_ufunc` should do this?","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2674/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
298421965,MDU6SXNzdWUyOTg0MjE5NjU=,1923,Local test failure in test_backends,5635139,closed,0,,,6,2018-02-19T22:53:37Z,2020-09-05T20:32:17Z,2020-09-05T20:32:17Z,MEMBER,,,,"I'm happy to debug this further, but before I do: is this an issue people have seen before? I'm running tests on master and hit an issue very early on. FWIW I don't use netCDF, and don't think I've got that installed.

#### Code Sample, a copy-pastable example if possible

```python
========================================================================== FAILURES ==========================================================================
_________________________________________________________ ScipyInMemoryDataTest.test_bytesio_pickle __________________________________________________________

self = 

    @pytest.mark.skipif(PY2, reason='cannot pickle BytesIO on Python 2')
    def test_bytesio_pickle(self):
        data = Dataset({'foo': ('x', [1, 2, 3])})
        fobj = BytesIO(data.to_netcdf())
        with open_dataset(fobj, autoclose=self.autoclose) as ds:
>           unpickled = pickle.loads(pickle.dumps(ds))
E           TypeError: can't pickle _thread.lock objects

xarray/tests/test_backends.py:1384: TypeError
```

#### Problem description

[this should explain **why** the current behavior is a problem and why the expected output is a better solution.]

#### Expected Output

Skip or pass backends tests

#### Output of ``xr.show_versions()``
INSTALLED VERSIONS
------------------
commit: d00721a3560f57a1b9226c5dbf5bf3af0356619d
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
xarray: 0.7.0-38-g1005a9e  # not sure why this is tagged so early. I'm running on latest master
pandas: 0.22.0
numpy: 1.14.0
scipy: 1.0.0
netCDF4: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.1.2
cartopy: None
seaborn: 0.8.1
setuptools: 38.5.1
pip: 9.0.1
conda: None
pytest: 3.4.0
IPython: 6.2.1
sphinx: None","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1923/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
575088962,MDExOlB1bGxSZXF1ZXN0MzgzMzAwMjgw,3826,Allow ellipsis to be used in stack,5635139,closed,0,,,6,2020-03-04T02:21:21Z,2020-03-20T01:20:54Z,2020-03-19T22:55:09Z,MEMBER,,0,pydata/xarray/pulls/3826,"- [x] Closes https://github.com/pydata/xarray/issues/3814
- [x] Tests added
- [x] Passes `isort -rc . && black . && mypy . && flake8`
- [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3826/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
577283480,MDExOlB1bGxSZXF1ZXN0Mzg1MTA3OTU4,3846,Doctests fixes,5635139,closed,0,,,6,2020-03-07T05:44:27Z,2020-03-10T14:03:05Z,2020-03-10T14:03:00Z,MEMBER,,0,pydata/xarray/pulls/3846,"- [ ] Closes #xxxx
- [ ] Tests added
- [x] Passes `isort -rc . && black . && mypy . && flake8`
- [ ] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API

Starting to get some fixes in. It's going to be a long journey though. I think maybe we whitelist some files and move gradually through before whitelisting the whole library.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3846/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
485437811,MDU6SXNzdWU0ODU0Mzc4MTE=,3265,Sparse tests failing on master,5635139,closed,0,,,6,2019-08-26T20:34:21Z,2019-08-27T00:01:18Z,2019-08-27T00:01:07Z,MEMBER,,,,"https://dev.azure.com/xarray/xarray/_build/results?buildId=695

```python
=================================== FAILURES ===================================
_______________________ TestSparseVariable.test_unary_op _______________________

self = 

    def test_unary_op(self):
>       sparse.utils.assert_eq(-self.var.data, -self.data)
E       AttributeError: module 'sparse' has no attribute 'utils'

xarray/tests/test_sparse.py:285: AttributeError
___________________ TestSparseVariable.test_univariate_ufunc ___________________

self = 

    def test_univariate_ufunc(self):
>       sparse.utils.assert_eq(np.sin(self.data), xu.sin(self.var).data)
E       AttributeError: module 'sparse' has no attribute 'utils'

xarray/tests/test_sparse.py:290: AttributeError
___________________ TestSparseVariable.test_bivariate_ufunc ____________________

self = 

    def test_bivariate_ufunc(self):
>       sparse.utils.assert_eq(np.maximum(self.data, 0), xu.maximum(self.var, 0).data)
E       AttributeError: module 'sparse' has no attribute 'utils'

xarray/tests/test_sparse.py:293: AttributeError
________________________ TestSparseVariable.test_pickle ________________________

self = 

    def test_pickle(self):
        v1 = self.var
        v2 = pickle.loads(pickle.dumps(v1))
>       sparse.utils.assert_eq(v1.data, v2.data)
E       AttributeError: module 'sparse' has no attribute 'utils'

xarray/tests/test_sparse.py:307: AttributeError
```

Any ideas?","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3265/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
457080809,MDExOlB1bGxSZXF1ZXN0Mjg4OTY1MzQ4,3029,Fix pandas-dev tests,5635139,closed,0,,,6,2019-06-17T18:15:16Z,2019-06-28T15:31:33Z,2019-06-28T15:31:28Z,MEMBER,,0,pydata/xarray/pulls/3029,"Currently pandas-dev tests get 'stuck' on the conda install. The last instruction to run is the standard install:

```
$ if [[ ""$CONDA_ENV"" == ""docs"" ]]; then
    conda env create -n test_env --file doc/environment.yml;
  elif [[ ""$CONDA_ENV"" == ""lint"" ]]; then
    conda env create -n test_env --file ci/requirements-py37.yml;
  else
    conda env create -n test_env --file ci/requirements-$CONDA_ENV.yml;
  fi
```

And after installing the libraries, [it prints this and then stops](https://travis-ci.org/max-sixty/xarray/jobs/546491330):

```
Preparing transaction: - - done
Verifying transaction: | / \ | / - \ | / / done
Executing transaction: \ | / - \ | / - \ | / - \ | / - \ | / - \ | / / - \ | / - \ done
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
```

I'm not that familiar with conda. Anyone have any ideas as to why this would fail while the other builds would succeed?","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3029/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
168901028,MDU6SXNzdWUxNjg5MDEwMjg=,934,"Should indexing be possible on 1D coords, even if not dims?",5635139,closed,0,,,6,2016-08-02T14:33:43Z,2019-01-27T06:49:52Z,2019-01-27T06:49:52Z,MEMBER,,,,"``` python
In [1]: arr = xr.DataArray(np.random.rand(4, 3),
   ...:                    [('time', pd.date_range('2000-01-01', periods=4)),
   ...:                     ('space', ['IA', 'IL', 'IN'])])

In [17]: arr.coords['space2'] = ('space', ['A','B','C'])

In [18]: arr
Out[18]:
array([[ 0.05187049,  0.04743067,  0.90329666],
       [ 0.59482538,  0.71014366,  0.86588207],
       [ 0.51893157,  0.49442107,  0.10697737],
       [ 0.16068189,  0.60756757,  0.31935279]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'
    space2   (space) |S1 'A' 'B' 'C'
```

Now try to select on the space2 coord:

``` python
In [19]: arr.sel(space2='A')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in ()
----> 1 arr.sel(space2='A')

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xarray/core/dataarray.pyc in sel(self, method, tolerance, **indexers)
    601         """"""
    602         return self.isel(**indexing.remap_label_indexers(
--> 603             self, indexers, method=method, tolerance=tolerance))
    604
    605     def isel_points(self, dim='points', **indexers):

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xarray/core/dataarray.pyc in isel(self, **indexers)
    588         DataArray.sel
    589         """"""
--> 590         ds = self._to_temp_dataset().isel(**indexers)
    591         return self._from_temp_dataset(ds)
    592

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xarray/core/dataset.pyc in isel(self, **indexers)
    908         invalid = [k for k in indexers if k not in self.dims]
    909         if invalid:
--> 910             raise ValueError(""dimensions %r do not exist"" % invalid)
    911
    912         # all indexers should be int, slice or np.ndarrays

ValueError: dimensions ['space2'] do not exist
```

Is there an easier way to do this? I couldn't think of anything...

CC @justinkuosixty ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/934/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue