issues


896 rows where user = 1217238 sorted by updated_at descending




type

  • pull 572
  • issue 324

state

  • closed 848
  • open 48

repo

  • xarray 896
id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
2266174558 I_kwDOAMm_X86HExRe 8975 Xarray sponsorship guidelines shoyer 1217238 open 0     3 2024-04-26T17:05:01Z 2024-04-30T20:52:33Z   MEMBER      

At what level of support should Xarray acknowledge sponsors on our website?

I would like to surface this for open discussion because there are potential sponsoring organizations with conflicts of interest with members of Xarray's leadership team (e.g., Earthmover, which employs @jhamman, @rabernat and @dcherian).

My suggestion is to use NumPy's guidelines, with an adjustment down to 1/3 of the thresholds to account for the smaller size of the project:

  • $10,000/yr for unrestricted financial contributions (e.g., donations)
  • $20,000/yr for financial contributions for a particular purpose (e.g., grants)
  • $30,000/yr for in-kind contributions (e.g., time for employees to contribute)
  • 2 person-months/yr of paid work time for one or more Xarray maintainers or regular contributors to any Xarray team or activity

The NumPy guidelines also include a grace period of a minimum of 6 months for acknowledging support. I would suggest increasing this to a minimum of 1 year for Xarray.

I would greatly appreciate any feedback from members of the community, either in this issue or at the next team meeting.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8975/reactions",
    "total_count": 6,
    "+1": 5,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
271043420 MDU6SXNzdWUyNzEwNDM0MjA= 1689 Roundtrip serialization of coordinate variables with spaces in their names shoyer 1217238 open 0     5 2017-11-03T16:43:20Z 2024-03-22T14:02:48Z   MEMBER      

If coordinates have spaces in their names, they get restored from netCDF files as data variables instead:

```
>>> xarray.open_dataset(xarray.Dataset(coords={'name with spaces': 1}).to_netcdf())
<xarray.Dataset>
Dimensions:           ()
Data variables:
    name with spaces  int32 1
```

This happens because the CF convention is to indicate coordinates as a space separated string, e.g., coordinates='latitude longitude'.

Even though these aren't CF-compliant variable names (which cannot contain spaces), it would be nice to have an ad-hoc convention for xarray that allows us to serialize/deserialize coordinates in all/most cases. Maybe we could use escape characters for spaces (e.g., coordinates='name\ with\ spaces') or quote names if they have spaces (e.g., coordinates='"name\ with\ spaces"')?

At the very least, we should issue a warning in these cases.
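To make the escaping idea concrete, here is a rough sketch (the helper names are made up, not existing xarray API):

```python
import re

# Hypothetical helpers sketching the escaping convention; not part of xarray.

def encode_coordinates_attr(names):
    # Escape spaces so the CF-style space-separated list stays unambiguous.
    return " ".join(name.replace(" ", r"\ ") for name in names)

def decode_coordinates_attr(value):
    # Split on unescaped spaces, then undo the escaping.
    parts = re.split(r"(?<!\\) ", value)
    return [part.replace(r"\ ", " ") for part in parts]

print(encode_coordinates_attr(["name with spaces", "time"]))
# name\ with\ spaces time
print(decode_coordinates_attr(r"name\ with\ spaces time"))
# ['name with spaces', 'time']
```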

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1689/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
267542085 MDU6SXNzdWUyNjc1NDIwODU= 1647 Representing missing values in string arrays on disk shoyer 1217238 closed 0     3 2017-10-23T05:01:10Z 2024-02-06T13:03:40Z 2024-02-06T13:03:40Z MEMBER      

This came up as part of my clean-up of serializing unicode strings in https://github.com/pydata/xarray/pull/1648.

There are two ways to represent strings in netCDF files.

  • As character arrays (NC_CHAR), supported by both netCDF3 and netCDF4
  • As variable length unicode strings (NC_STRING), only supported by netCDF4/HDF5.

Currently, by default (if no _FillValue is set) we replace missing values (NaN) with an empty string when writing data to disk.

For character arrays, we could use the normal _FillValue mechanism to set a fill value and decode when data is read back from disk. In fact, this already currently works for dtype=bytes (though it isn't documented):

```
In [10]: ds = xr.Dataset({'foo': ('x', np.array([b'bar', np.nan], dtype=object), {}, {'_FillValue': b''})})

In [11]: ds
Out[11]:
<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    foo      (x) object b'bar' nan

In [12]: ds.to_netcdf('foobar.nc')

In [13]: xr.open_dataset('foobar.nc').load()
Out[13]:
<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    foo      (x) object b'bar' nan
```

For variable length strings, it currently isn't possible to set a fill-value. So there's no good way to indicate missing values, though this may change in the future depending on the resolution of the netCDF4-python issue.

It would obviously be nice to always automatically round-trip missing values, both for strings and bytes. I see two possible ways to do this:

1. Require setting an explicit _FillValue when a string contains missing values, by raising an error if this isn't done. We need an explicit choice because there aren't any extra unused characters left over, at least for character arrays. (NetCDF explicitly allows arbitrary bytes to be stored in NC_CHAR, even though this maps to an HDF5 fixed-width string with ASCII encoding.) For variable length strings, we could potentially set a non-character unicode symbol like U+FFFF, but again that isn't supported yet.
2. Treat empty strings as equivalent to a missing value (NaN). This has the advantage of not requiring an explicit choice of _FillValue, so we don't need to wait for any netCDF4 issues to be resolved. However, this does mean that empty strings would not round-trip. Still, given the relative prevalence of missing values vs. empty strings in xarray/pandas, it's probably the lesser evil not to preserve empty strings.

The default option is to adopt neither of these, and keep the current behavior where missing values are written as empty strings and not decoded at all.

Any opinions? I am leaning towards option (2).
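For illustration, option (2) would boil down to a decode step roughly like this (a sketch only, not the actual conventions/coding machinery):

```python
import numpy as np

def decode_empty_strings_as_missing(values):
    # Sketch of option (2): treat empty strings read from disk as missing values.
    values = np.asarray(values, dtype=object)
    return np.where(values == "", np.nan, values)

print(decode_empty_strings_as_missing(np.array(["bar", ""], dtype=object)))
# ['bar' nan]
```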

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1647/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
842436143 MDU6SXNzdWU4NDI0MzYxNDM= 5081 Lazy indexing arrays as a stand-alone package shoyer 1217238 open 0     6 2021-03-27T07:06:03Z 2023-12-15T13:20:03Z   MEMBER      

From @rabernat on Twitter:

"Xarray has some secret private classes for lazily indexing / wrapping arrays that are so useful I think they should be broken out into a standalone package. https://github.com/pydata/xarray/blob/master/xarray/core/indexing.py#L516"

The idea here is to create a first-class "duck array" library for lazy indexing that could replace xarray's internal classes for lazy indexing. This would be in some ways similar to dask.array, but much simpler, because it doesn't have to worry about parallel computing.

Desired features:

  • Lazy indexing
  • Lazy transposes
  • Lazy concatenation (#4628) and stacking
  • Lazy vectorized operations (e.g., unary and binary arithmetic)
    • needed for decoding variables from disk (xarray.encoding) and
    • building lazy multi-dimensional coordinate arrays corresponding to map projections (#3620)
  • Maybe: lazy reshapes (#4113)

A common feature of these operations is that they can (and almost always should) be fused with indexing: if N elements are selected via indexing, only O(N) compute and memory is required to produce them, regardless of the size of the original arrays, as long as the number of applied operations can be treated as a constant. Memory access is significantly slower than compute on modern hardware, so recomputing these operations on the fly is almost always a good idea.
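To make the fusion idea concrete, here is a very rough sketch of the kind of wrapper such a package could provide (the class name and details are made up for illustration):

```python
import numpy as np

class LazyElementwiseArray:
    """Sketch: apply `func` element-wise, deferring work until indexing."""

    def __init__(self, func, array):
        self.func = func
        self.array = array
        self.shape = array.shape

    def __getitem__(self, key):
        # Index the underlying array first, then compute only the selected
        # elements: O(N) work for N selected elements.
        return self.func(self.array[key])

    def __array__(self, dtype=None):
        # Materializing the whole array is still possible, just not required.
        return np.asarray(self.func(self.array), dtype=dtype)

scaled = LazyElementwiseArray(lambda x: x * 0.5 + 10, np.arange(1_000_000))
print(scaled[:3])  # only three elements are ever computed here
```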

Out of scope: lazy computation when indexing could require access to many more elements to compute the desired value than are returned. For example, mean() probably should not be lazy, because that could involve computation of a very large number of elements that one might want to cache.

This is valuable functionality for Xarray for two reasons:

  1. It allows for "previewing" small bits of data loaded from disk or remote storage, even if that data needs some form of cheap "decoding" from its form on disk.
  2. It allows for xarray to decode data in a lazy fashion that is compatible with full-featured systems for lazy computation (e.g., Dask), without requiring the user to choose dask when reading the data.

Related issues:

  • [Proposal] Expose Variable without Pandas dependency #3981
  • Lazy concatenation of arrays #4628
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5081/reactions",
    "total_count": 6,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 6,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
197939448 MDU6SXNzdWUxOTc5Mzk0NDg= 1189 Document using a spawning multiprocessing pool for multiprocessing with dask shoyer 1217238 closed 0     3 2016-12-29T01:21:50Z 2023-12-05T21:51:04Z 2023-12-05T21:51:04Z MEMBER      

This is a nice option for working with in-file HDF5/netCDF4 compression: https://github.com/pydata/xarray/pull/1128#issuecomment-261936849
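With current dask, the spawning-pool setup would presumably look something like this (the config keys and file/variable names here are assumptions to check against the dask docs for your version):

```python
import dask
import xarray as xr

# Ask dask's multiprocessing scheduler to start workers with "spawn" rather
# than "fork", so HDF5/netCDF4 library state isn't inherited by forked workers.
with dask.config.set({"multiprocessing.context": "spawn"}, scheduler="processes"):
    ds = xr.open_dataset("compressed.nc", chunks={"time": 100})  # hypothetical file
    result = ds["air"].mean("time").compute()                    # hypothetical variable
```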

Mixed multi-threading/multi-processing could also be interesting, if anyone wants to revive that: https://github.com/dask/dask/pull/457 (I think it would work now that xarray data stores are pickle-able)

CC @mrocklin

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1189/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
430188626 MDU6SXNzdWU0MzAxODg2MjY= 2873 Dask distributed tests fail locally shoyer 1217238 closed 0     3 2019-04-07T20:26:53Z 2023-12-05T21:43:02Z 2023-12-05T21:43:02Z MEMBER      

I'm not sure why, but when I run the integration tests with dask-distributed locally (on my MacBook Pro), they fail:

```
$ pytest xarray/tests/test_distributed.py --maxfail 1
================================================ test session starts =================================================
platform darwin -- Python 3.7.2, pytest-4.0.1, py-1.7.0, pluggy-0.8.0
rootdir: /Users/shoyer/dev/xarray, inifile: setup.cfg
plugins: repeat-0.7.0
collected 19 items

xarray/tests/test_distributed.py F

====================================================== FAILURES ======================================================
_________________________ test_dask_distributed_netcdf_roundtrip[netcdf4-NETCDF3_CLASSIC] _________________________

loop = <tornado.platform.asyncio.AsyncIOLoop object at 0x1c182da1d0>
tmp_netcdf_filename = '/private/var/folders/15/qdcz0wqj1t9dg40m_ld0fjkh00b4kd/T/pytest-of-shoyer/pytest-3/test_dask_distributed_netcdf_r0/testfile.nc'
engine = 'netcdf4', nc_format = 'NETCDF3_CLASSIC'

@pytest.mark.parametrize('engine,nc_format', ENGINES_AND_FORMATS)  # noqa
def test_dask_distributed_netcdf_roundtrip(
        loop, tmp_netcdf_filename, engine, nc_format):

    if engine not in ENGINES:
        pytest.skip('engine not available')

    chunks = {'dim1': 4, 'dim2': 3, 'dim3': 6}

    with cluster() as (s, [a, b]):
        with Client(s['address'], loop=loop):

            original = create_test_data().chunk(chunks)

            if engine == 'scipy':
                with pytest.raises(NotImplementedError):
                    original.to_netcdf(tmp_netcdf_filename,
                                       engine=engine, format=nc_format)
                return

            original.to_netcdf(tmp_netcdf_filename,
                               engine=engine, format=nc_format)

            with xr.open_dataset(tmp_netcdf_filename,
                                 chunks=chunks, engine=engine) as restored:
                assert isinstance(restored.var1.data, da.Array)
                computed = restored.compute()
              assert_allclose(original, computed)

xarray/tests/test_distributed.py:87:


../../miniconda3/envs/xarray-py37/lib/python3.7/contextlib.py:119: in exit next(self.gen)


nworkers = 2, nanny = False, worker_kwargs = {}, active_rpc_timeout = 1, scheduler_kwargs = {}

@contextmanager
def cluster(nworkers=2, nanny=False, worker_kwargs={}, active_rpc_timeout=1,
            scheduler_kwargs={}):
    ...  # trimmed
    start = time()
    while list(ws):
        sleep(0.01)
      assert time() < start + 1, 'Workers still around after one second'

E AssertionError: Workers still around after one second

../../miniconda3/envs/xarray-py37/lib/python3.7/site-packages/distributed/utils_test.py:721: AssertionError ------------------------------------------------ Captured stderr call ------------------------------------------------ distributed.scheduler - INFO - Clear task state distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:51715 distributed.worker - INFO - Start worker at: tcp://127.0.0.1:51718 distributed.worker - INFO - Listening to: tcp://127.0.0.1:51718 distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:51715 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Threads: 1 distributed.worker - INFO - Memory: 17.18 GB distributed.worker - INFO - Local Directory: /Users/shoyer/dev/xarray/_test_worker-5cabd1b7-4d9c-49eb-a79e-205c588f5dae/worker-n8uv72yx distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Start worker at: tcp://127.0.0.1:51720 distributed.worker - INFO - Listening to: tcp://127.0.0.1:51720 distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:51715 distributed.scheduler - INFO - Register tcp://127.0.0.1:51718 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Threads: 1 distributed.worker - INFO - Memory: 17.18 GB distributed.worker - INFO - Local Directory: /Users/shoyer/dev/xarray/_test_worker-71a426d4-bd34-4808-9d33-79cac2bb4801/worker-a70rlf4r distributed.worker - INFO - ------------------------------------------------- distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:51718 distributed.core - INFO - Starting established connection distributed.worker - INFO - Registered to: tcp://127.0.0.1:51715 distributed.worker - INFO - ------------------------------------------------- distributed.core - INFO - Starting established connection distributed.scheduler - INFO - Register tcp://127.0.0.1:51720 distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:51720 distributed.core - INFO - Starting established connection distributed.worker - INFO - Registered to: tcp://127.0.0.1:51715 distributed.worker - INFO - ------------------------------------------------- distributed.core - INFO - Starting established connection distributed.scheduler - INFO - Receive client connection: Client-59a7918c-5972-11e9-912a-8c85907bce57 distributed.core - INFO - Starting established connection distributed.core - INFO - Event loop was unresponsive in Worker for 1.05s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. distributed.scheduler - INFO - Receive client connection: Client-worker-5a5c81de-5972-11e9-9136-8c85907bce57 distributed.core - INFO - Starting established connection distributed.core - INFO - Event loop was unresponsive in Worker for 1.33s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability. 
distributed.scheduler - INFO - Receive client connection: Client-worker-5b2496d8-5972-11e9-9137-8c85907bce57 distributed.core - INFO - Starting established connection distributed.scheduler - INFO - Remove client Client-59a7918c-5972-11e9-912a-8c85907bce57 distributed.scheduler - INFO - Remove client Client-59a7918c-5972-11e9-912a-8c85907bce57 distributed.scheduler - INFO - Close client connection: Client-59a7918c-5972-11e9-912a-8c85907bce57 distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:51720 distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:51718 distributed.scheduler - INFO - Remove worker tcp://127.0.0.1:51720 distributed.core - INFO - Removing comms to tcp://127.0.0.1:51720 distributed.scheduler - INFO - Remove worker tcp://127.0.0.1:51718 distributed.core - INFO - Removing comms to tcp://127.0.0.1:51718 distributed.scheduler - INFO - Lost all workers distributed.scheduler - INFO - Remove client Client-worker-5b2496d8-5972-11e9-9137-8c85907bce57 distributed.scheduler - INFO - Remove client Client-worker-5a5c81de-5972-11e9-9136-8c85907bce57 distributed.scheduler - INFO - Close client connection: Client-worker-5b2496d8-5972-11e9-9137-8c85907bce57 distributed.scheduler - INFO - Close client connection: Client-worker-5a5c81de-5972-11e9-9136-8c85907bce57 distributed.scheduler - INFO - Scheduler closing... distributed.scheduler - INFO - Scheduler closing all comms ```

Version info: ``` In [2]: xarray.show_versions()

INSTALLED VERSIONS

commit: 2ce0639ee2ba9c7b1503356965f77d847d6cfcdf python: 3.7.2 (default, Dec 29 2018, 00:00:04) [Clang 4.0.1 (tags/RELEASE_401/final)] python-bits: 64 OS: Darwin OS-release: 18.2.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2

xarray: 0.12.1+4.g2ce0639e pandas: 0.24.0 numpy: 1.15.4 scipy: 1.1.0 netCDF4: 1.4.3.2 pydap: None h5netcdf: 0.7.0 h5py: 2.9.0 Nio: None zarr: 2.2.0 cftime: 1.0.3.4 nc_time_axis: None PseudonetCDF: None rasterio: None cfgrib: None iris: None bottleneck: 1.2.1 dask: 1.1.5 distributed: 1.26.1 matplotlib: 3.0.2 cartopy: 0.17.0 seaborn: 0.9.0 setuptools: 40.0.0 pip: 18.0 conda: None pytest: 4.0.1 IPython: 6.5.0 sphinx: 1.8.2 ```

@mrocklin does this sort of error look familiar to you?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2873/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  not_planned xarray 13221727 issue
707647715 MDExOlB1bGxSZXF1ZXN0NDkyMDEzODg4 4453 Simplify and restore old behavior for deep-copies shoyer 1217238 closed 0     3 2020-09-23T20:10:33Z 2023-09-14T03:06:34Z 2023-09-14T03:06:33Z MEMBER   1 pydata/xarray/pulls/4453

Intended to fix https://github.com/pydata/xarray/issues/4449

The goal is to restore behavior to match what we had prior to https://github.com/pydata/xarray/pull/4379 for all types of data other than np.ndarray objects

Needs tests!

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] Passes isort . && black . && mypy . && flake8
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4453/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
588105641 MDU6SXNzdWU1ODgxMDU2NDE= 3893 HTML repr in the online docs shoyer 1217238 open 0     3 2020-03-26T02:17:51Z 2023-09-11T17:41:59Z   MEMBER      

I noticed two minor issues in our online docs, now that we've switched to the hip new HTML repr by default.

  1. Most doc pages still show text, not HTML. I suspect this is a limitation of the IPython sphinx directive we use for our snippets. We might be able to fix that by switching to jupyter-sphinx (a rough configuration sketch is below)?

  2. The "attributes" part of the HTML repr in our notebook examples looks a little funny, with strange blue formatting around each attribute name. It looks like part of the outer style of our docs is leaking into the HTML repr:

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3893/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1376109308 I_kwDOAMm_X85SBcL8 7045 Should Xarray stop doing automatic index-based alignment? shoyer 1217238 open 0     13 2022-09-16T15:31:03Z 2023-08-23T07:42:34Z   MEMBER      

What is your issue?

I am increasingly thinking that automatic index-based alignment in Xarray (copied from pandas) may have been a design mistake. Almost every time I work with datasets with different indexes, I find myself writing code to explicitly align them:

  1. Automatic alignment is hard to predict. The implementation is complicated, and the exact mode of automatic alignment (outer vs inner vs left join) depends on the specific operation. It's also no longer possible to predict the shape (or even the dtype) resulting from most Xarray operations purely from input shape/dtype.
  2. Automatic alignment brings an unexpected performance penalty. In some domains (analytics) this is OK, but in others (e.g., numerical modeling or deep learning) it is a complete deal-breaker.
  3. Automatic alignment is not useful for float indexes, because exact matches are rare. In practice, this makes it less useful in Xarray's usual domains than it is for pandas.

Would it be insane to consider changing Xarray's behavior to stop doing automatic alignment? I imagine we could roll this out slowly, first with warnings and then with an option for disabling it.
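For context, the explicit alternative that users would write instead already exists today:

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(3), dims="x", coords={"x": [0, 1, 2]})
b = xr.DataArray(np.arange(3), dims="x", coords={"x": [1, 2, 3]})

# Opt in to alignment explicitly instead of relying on the automatic join:
a2, b2 = xr.align(a, b, join="inner")
print((a2 + b2).sizes)  # Frozen({'x': 2})

# With join="exact", mismatched indexes raise instead of silently aligning:
# xr.align(a, b, join="exact")  # -> ValueError
```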

If you think this is a good or bad idea, consider responding to this issue with a 👍 or 👎 reaction.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7045/reactions",
    "total_count": 13,
    "+1": 9,
    "-1": 2,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 2
}
    xarray 13221727 issue
342928718 MDExOlB1bGxSZXF1ZXN0MjAyNzE0MjUx 2302 WIP: lazy=True in apply_ufunc() shoyer 1217238 open 0     1 2018-07-20T00:01:21Z 2023-07-18T04:19:17Z   MEMBER   0 pydata/xarray/pulls/2302
  • [x] Closes https://github.com/pydata/xarray/issues/2298
  • [ ] Tests added
  • [ ] Tests passed
  • [ ] Fully documented, including whats-new.rst for all changes and api.rst for new API

Still needs more tests and documentation.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2302/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1767947798 PR_kwDOAMm_X85TkPzV 7933 Update calendar for developers meeting shoyer 1217238 closed 0     0 2023-06-21T16:09:44Z 2023-06-21T17:56:22Z 2023-06-21T17:56:22Z MEMBER   0 pydata/xarray/pulls/7933

The old calendar was on @jhamman's UCAR account, which he no longer has access to!

xref https://github.com/pydata/xarray/issues/4001

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7933/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
479942077 MDU6SXNzdWU0Nzk5NDIwNzc= 3213 How should xarray use/support sparse arrays? shoyer 1217238 open 0     55 2019-08-13T03:29:42Z 2023-06-07T15:43:55Z   MEMBER      

I'm looking forward to being easily able to create sparse xarray objects from pandas: https://github.com/pydata/xarray/issues/3206

Are there other xarray APIs that could make good use of sparse arrays, or could make sparse arrays easier to use?

Some ideas:

  • to_sparse()/to_dense() methods for converting to/from sparse without requiring using .data
  • to_dataframe()/to_series() could grow options for skipping the fill-value in sparse arrays, so they can round-trip MultiIndex data back to pandas
  • Serialization to/from netCDF files, using some custom convention (see https://github.com/pydata/xarray/issues/1375#issuecomment-402699810)
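For instance, a hand-rolled to_sparse()/to_dense() pair is already possible via DataArray.copy (a sketch; these methods don't exist yet, and sparse duck-array support has caveats):

```python
import numpy as np
import sparse
import xarray as xr

dense = xr.DataArray(np.eye(1000), dims=("x", "y"))

# Roughly what a to_sparse() method might do: wrap the data in a sparse.COO array.
as_sparse = dense.copy(data=sparse.COO.from_numpy(dense.data))

# ...and to_dense() would be the inverse:
back_to_dense = as_sparse.copy(data=as_sparse.data.todense())
```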

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3213/reactions",
    "total_count": 14,
    "+1": 14,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1465287257 I_kwDOAMm_X85XVoJZ 7325 Support reading Zarr data via TensorStore shoyer 1217238 open 0     1 2022-11-27T00:12:17Z 2023-05-11T01:24:27Z   MEMBER      

What is your issue?

TensorStore is another high performance API for reading distributed arrays in formats such as Zarr, written in C++.

It could be interesting to write an Xarray storage backend using TensorStore as an alternative way to read Zarr files.

As an exercise, I made a little demo of doing this: https://gist.github.com/shoyer/5b0c485979cc9c36a9685d8cf8e94565

I have not tested it for performance. The main annoyance is that TensorStore doesn't understand Zarr groups or Zarr array attributes, so I needed to write my own helpers for reading this metadata.

Also, there's a bit of an impedance mis-match between TensorStore (where everything returns futures) and Xarray (which assumes that indexing results in numpy arrays). This could likely be improved with some amount of effort -- in particular https://github.com/pydata/xarray/pull/6874/files should help.

CC @jbms who may have better ideas about how to use the TensorStore API.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7325/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
253395960 MDU6SXNzdWUyNTMzOTU5NjA= 1533 Index variables loaded from dask can be computed twice shoyer 1217238 closed 0     6 2017-08-28T17:18:27Z 2023-04-06T04:15:46Z 2023-04-06T04:15:46Z MEMBER      

as reported by @crusaderky in #1522

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1533/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
209653741 MDU6SXNzdWUyMDk2NTM3NDE= 1285 FAQ page could use some updating shoyer 1217238 open 0     1 2017-02-23T03:29:16Z 2023-03-26T16:32:44Z   MEMBER      

Along the same lines as https://github.com/pydata/xarray/issues/1282, we haven't done much updating for frequently asked questions -- it's mostly still the original handful of FAQ entries I wrote in the first version of the docs.

Topics worth addressing:

  • [ ] How xarray handles missing values
  • [x] File formats -- how can I read format X in xarray? (Maybe we should make a table with links to other packages?)

(please add suggestions for this list!)

StackOverflow may be a helpful reference here: http://stackoverflow.com/questions/tagged/python-xarray?sort=votes&pageSize=50

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1285/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
176805500 MDU6SXNzdWUxNzY4MDU1MDA= 1004 Remove IndexVariable.name shoyer 1217238 open 0     3 2016-09-14T03:27:43Z 2023-03-11T19:57:40Z   MEMBER      

As discussed in #947, we should remove the IndexVariable.name attribute. It should be fine to use an IndexVariable anywhere, regardless of whether or not it labels ticks along a dimension.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1004/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
98587746 MDU6SXNzdWU5ODU4Nzc0Ng== 508 Ignore missing variables when concatenating datasets? shoyer 1217238 closed 0     8 2015-08-02T06:03:57Z 2023-01-20T16:04:28Z 2023-01-20T16:04:28Z MEMBER      

Several users (@raj-kesavan, @richardotis, now myself) have wondered about how to concatenate xray Datasets with different variables.

With the current xray.concat, you need to awkwardly create dummy variables filled with NaN in datasets that don't have them (or drop mismatched variables entirely). Neither of these are great options -- concat should have an option (the default?) to take care of this for the user.

This would also be more consistent with pd.concat, which takes a more relaxed approach to matching dataframes with different variables (it does an outer join).
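For concreteness, the awkward workaround currently looks something like this (a sketch with made-up data):

```python
import numpy as np
import xarray as xr

ds1 = xr.Dataset({"temperature": ("x", [11.0, 12.0])})
ds2 = xr.Dataset({"temperature": ("x", [13.0, 14.0]),
                  "pressure": ("x", [1.0, 2.0])})

# Workaround: add a NaN-filled dummy variable so both datasets have the same variables.
ds1 = ds1.assign(pressure=("x", np.full(2, np.nan)))
combined = xr.concat([ds1, ds2], dim="x")
print(combined)
```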

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/508/reactions",
    "total_count": 6,
    "+1": 6,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
895983112 MDExOlB1bGxSZXF1ZXN0NjQ4MTM1NTcy 5351 Add xarray.backends.NoMatchingEngineError shoyer 1217238 open 0     4 2021-05-19T22:09:21Z 2022-11-16T15:19:54Z   MEMBER   0 pydata/xarray/pulls/5351
  • [x] Closes #5329
  • [x] Tests added
  • [x] Passes pre-commit run --all-files
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [x] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5351/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
803068773 MDExOlB1bGxSZXF1ZXN0NTY5MDU5MTEz 4879 Cache files for different CachingFileManager objects separately shoyer 1217238 closed 0     10 2021-02-07T21:48:06Z 2022-10-18T16:40:41Z 2022-10-18T16:40:40Z MEMBER   0 pydata/xarray/pulls/4879

This means that explicitly opening a file multiple times with open_dataset (e.g., after modifying it on disk) now reopens the file from scratch, rather than reusing a cached version.

If users want to reuse the cached file, they can reuse the same xarray object. We don't need this for handling many files in Dask (the original motivation for caching), because in those cases only a single CachingFileManager is created.

I think this should fix some long-standing usability issues: #4240, #4862

Conveniently, this also obviates the need for some messy reference counting logic.

  • [x] Closes #4240, #4862
  • [x] Tests added
  • [x] Passes pre-commit run --all-files
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4879/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
623804131 MDU6SXNzdWU2MjM4MDQxMzE= 4090 Error with indexing 2D lat/lon coordinates shoyer 1217238 closed 0     2 2020-05-24T06:19:45Z 2022-09-28T12:06:03Z 2022-09-28T12:06:03Z MEMBER      

```
filslp = "ChonghuaYinData/prmsl.mon.mean.nc"
filtmp = "ChonghuaYinData/air.sig995.mon.mean.nc"
filprc = "ChonghuaYinData/precip.mon.mean.nc"

ds_slp = xr.open_dataset(filslp).sel(time=slice(str(yrStrt)+'-01-01', str(yrLast)+'-12-31'))

ds_slp outputs:
<xarray.Dataset>
Dimensions:            (nbnds: 2, time: 480, x: 349, y: 277)
Coordinates:
  * time               (time) datetime64[ns] 1979-01-01 ... 2018-12-01
    lat                (y, x) float32 ...
    lon                (y, x) float32 ...
  * y                  (y) float32 0.0 32463.0 64926.0 ... 8927325.0 8959788.0
  * x                  (x) float32 0.0 32463.0 64926.0 ... 11264660.0 11297120.0
Dimensions without coordinates: nbnds
Data variables:
    Lambert_Conformal  int32 ...
    prmsl              (time, y, x) float32 ...
    time_bnds          (time, nbnds) float64 ...
Attributes:
    Conventions:    CF-1.2
    centerlat:      50.0
    centerlon:      -107.0
    comments:
    institution:    National Centers for Environmental Prediction
    latcorners:     [ 1.000001  0.897945 46.3544  46.63433 ]
    loncorners:     [-145.5  -68.32005  -2.569891  148.6418 ]
    platform:       Model
    standardpar1:   50.0
    standardpar2:   50.000001
    title:          NARR Monthly Means
    dataset_title:  NCEP North American Regional Reanalysis (NARR)
    history:        created 2016/04/12 by NOAA/ESRL/PSD
    references:     https://www.esrl.noaa.gov/psd/data/gridded/data.narr.html
    source:         http://www.emc.ncep.noaa.gov/mmb/rreanl/index.html
    References:
```

```
yrStrt = 1950    # manually specify for convenience
yrLast = 2018    # 20th century ends 2018

clStrt = 1950    # reference climatology for SOI
clLast = 1979

yrStrtP = 1979   # 1st year GPCP
yrLastP = yrLast # match 20th century

latT = -17.6     # Tahiti
lonT = 210.75
latD = -12.5     # Darwin
lonD = 130.83

# select grids of T and D
T = ds_slp.sel(lat=latT, lon=lonT, method='nearest')
D = ds_slp.sel(lat=latD, lon=lonD, method='nearest')

outputs:


ValueError Traceback (most recent call last) <ipython-input-27-6702b30f473f> in <module> 1 # select grids of T and D ----> 2 T = ds_slp.sel(lat=latT, lon=lonT, method='nearest') 3 D = ds_slp.sel(lat=latD, lon=lonD, method='nearest')

~\Anaconda3\lib\site-packages\xarray\core\dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs) 2004 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel") 2005 pos_indexers, new_indexes = remap_label_indexers( -> 2006 self, indexers=indexers, method=method, tolerance=tolerance 2007 ) 2008 result = self.isel(indexers=pos_indexers, drop=drop)

~\Anaconda3\lib\site-packages\xarray\core\coordinates.py in remap_label_indexers(obj, indexers, method, tolerance, **indexers_kwargs) 378 379 pos_indexers, new_indexes = indexing.remap_label_indexers( --> 380 obj, v_indexers, method=method, tolerance=tolerance 381 ) 382 # attach indexer's coordinate to pos_indexers

~\Anaconda3\lib\site-packages\xarray\core\indexing.py in remap_label_indexers(data_obj, indexers, method, tolerance) 257 new_indexes = {} 258 --> 259 dim_indexers = get_dim_indexers(data_obj, indexers) 260 for dim, label in dim_indexers.items(): 261 try:

~\Anaconda3\lib\site-packages\xarray\core\indexing.py in get_dim_indexers(data_obj, indexers) 223 ] 224 if invalid: --> 225 raise ValueError("dimensions or multi-index levels %r do not exist" % invalid) 226 227 level_indexers = defaultdict(dict)

ValueError: dimensions or multi-index levels ['lat', 'lon'] do not exist ```

Does anyone know how to fix this problem? Thank you very much.

Originally posted by @JimmyGao0204 in https://github.com/pydata/xarray/issues/475#issuecomment-633172787
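For what it's worth, with two-dimensional lat/lon coordinates the usual workaround is to find the nearest grid point by hand and then index positionally, roughly like this (a sketch only, not tested against the dataset above; longitude conventions may need adjusting):

```python
import numpy as np

def sel_nearest_2d(ds, lat0, lon0):
    # Squared distance in degrees to every grid cell; good enough for nearest-neighbour.
    dist = (ds["lat"] - lat0) ** 2 + (ds["lon"] - lon0) ** 2
    iy, ix = np.unravel_index(int(dist.argmin()), dist.shape)
    # lat/lon are defined on dims (y, x), so select by position instead of label.
    return ds.isel(y=iy, x=ix)

T = sel_nearest_2d(ds_slp, latT, lonT)
D = sel_nearest_2d(ds_slp, latD, lonD)
```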

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4090/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1210147360 I_kwDOAMm_X85IIWIg 6504 test_weighted.test_weighted_operations_nonequal_coords should avoid depending on random number seed shoyer 1217238 closed 0 shoyer 1217238   0 2022-04-20T19:56:19Z 2022-08-29T20:42:30Z 2022-08-29T20:42:30Z MEMBER      

What happened?

In testing an upgrade to the latest version of xarray in our systems, I noticed this test failing:

```
def test_weighted_operations_nonequal_coords():
    # There are no weights for a == 4, so that data point is ignored.
    weights = DataArray(np.random.randn(4), dims=("a",), coords=dict(a=[0, 1, 2, 3]))
    data = DataArray(np.random.randn(4), dims=("a",), coords=dict(a=[1, 2, 3, 4]))
    check_weighted_operations(data, weights, dim="a", skipna=None)

    q = 0.5
    result = data.weighted(weights).quantile(q, dim="a")
    # Expected value computed using code from https://aakinshin.net/posts/weighted-quantiles/ with values at a=1,2,3
    expected = DataArray([0.9308707], coords={"quantile": [q]}).squeeze()
  assert_allclose(result, expected)

E       AssertionError: Left and right DataArray objects are not close
E
E       Differing values:
E       L
E         array(0.919569)
E       R
E         array(0.930871)
```

It appears that this test is hard-coded to match a particular random number seed, which in turn would fix the results of np.random.randn().

What did you expect to happen?

Whenever possible, Xarray's own tests should avoid relying on particular random number generators, e.g., in this case we could specify random numbers instead.

A back-up option would be to explicitly set random seed locally inside the tests, e.g., by creating a np.random.RandomState() with a fixed seed and using that. The global random state used by np.random.randn() is sensitive to implementation details like the order in which tests are run.
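A sketch of that back-up option applied to the failing test (check_weighted_operations is the existing helper from the test module; the hard-coded expected quantile would of course need recomputing for whatever seed is chosen):

```python
import numpy as np
from xarray import DataArray

def test_weighted_operations_nonequal_coords():
    # Use a local, explicitly seeded generator, independent of the global
    # np.random state and of the order in which tests run.
    rng = np.random.RandomState(seed=12345)
    weights = DataArray(rng.randn(4), dims=("a",), coords=dict(a=[0, 1, 2, 3]))
    data = DataArray(rng.randn(4), dims=("a",), coords=dict(a=[1, 2, 3, 4]))
    check_weighted_operations(data, weights, dim="a", skipna=None)
```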

Minimal Complete Verifiable Example

No response

Relevant log output

No response

Anything else we need to know?

No response

Environment

...

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6504/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1210267320 I_kwDOAMm_X85IIza4 6505 Dropping a MultiIndex variable raises an error after explicit indexes refactor shoyer 1217238 closed 0     3 2022-04-20T22:07:26Z 2022-07-21T14:46:58Z 2022-07-21T14:46:58Z MEMBER      

What happened?

With the latest released version of Xarray, it is possible to delete all variables corresponding to a MultiIndex by simply deleting the name of the MultiIndex.

After the explicit indexes refactor (i.e., using the "main" development branch) this now raises an error about how this would "corrupt" index state. This comes up when using drop() and assign_coords() and possibly some other methods.

This is not hard to work around, but we may want to consider this bug a blocker for the next Xarray release. I found the issue surfaced in several projects when attempting to use the new version of Xarray inside Google's codebase.

CC @benbovy in case you have any thoughts to share.

What did you expect to happen?

For now, we should preserve the behavior of deleting the variables corresponding to MultiIndex levels, but should issue a deprecation warning encouraging users to explicitly delete everything.

Minimal Complete Verifiable Example

```python
import xarray

array = xarray.DataArray(
    [[1, 2], [3, 4]],
    dims=['x', 'y'],
    coords={'x': ['a', 'b']},
)
stacked = array.stack(z=['x', 'y'])
print(stacked.drop('z'))
print()
print(stacked.assign_coords(z=[1, 2, 3, 4]))
```

Relevant log output

```Python ValueError Traceback (most recent call last) Input In [1], in <cell line: 9>() 3 array = xarray.DataArray( 4 [[1, 2], [3, 4]], 5 dims=['x', 'y'], 6 coords={'x': ['a', 'b']}, 7 ) 8 stacked = array.stack(z=['x', 'y']) ----> 9 print(stacked.drop('z')) 10 print() 11 print(stacked.assign_coords(z=[1, 2, 3, 4]))

File ~/dev/xarray/xarray/core/dataarray.py:2425, in DataArray.drop(self, labels, dim, errors, labels_kwargs) 2408 def drop( 2409 self, 2410 labels: Mapping = None, (...) 2414 labels_kwargs, 2415 ) -> DataArray: 2416 """Backward compatible method based on drop_vars and drop_sel 2417 2418 Using either drop_vars or drop_sel is encouraged (...) 2423 DataArray.drop_sel 2424 """ -> 2425 ds = self._to_temp_dataset().drop(labels, dim, errors=errors) 2426 return self._from_temp_dataset(ds)

File ~/dev/xarray/xarray/core/dataset.py:4590, in Dataset.drop(self, labels, dim, errors, **labels_kwargs) 4584 if dim is None and (is_scalar(labels) or isinstance(labels, Iterable)): 4585 warnings.warn( 4586 "dropping variables using drop will be deprecated; using drop_vars is encouraged.", 4587 PendingDeprecationWarning, 4588 stacklevel=2, 4589 ) -> 4590 return self.drop_vars(labels, errors=errors) 4591 if dim is not None: 4592 warnings.warn( 4593 "dropping labels using list-like labels is deprecated; using " 4594 "dict-like arguments with drop_sel, e.g. `ds.drop_sel(dim=[labels]).", 4595 DeprecationWarning, 4596 stacklevel=2, 4597 )

File ~/dev/xarray/xarray/core/dataset.py:4549, in Dataset.drop_vars(self, names, errors) 4546 if errors == "raise": 4547 self._assert_all_in_dataset(names) -> 4549 assert_no_index_corrupted(self.xindexes, names) 4551 variables = {k: v for k, v in self._variables.items() if k not in names} 4552 coord_names = {k for k in self._coord_names if k in variables}

File ~/dev/xarray/xarray/core/indexes.py:1394, in assert_no_index_corrupted(indexes, coord_names) 1392 common_names_str = ", ".join(f"{k!r}" for k in common_names) 1393 index_names_str = ", ".join(f"{k!r}" for k in index_coords) -> 1394 raise ValueError( 1395 f"cannot remove coordinate(s) {common_names_str}, which would corrupt " 1396 f"the following index built from coordinates {index_names_str}:\n" 1397 f"{index}" 1398 )

ValueError: cannot remove coordinate(s) 'z', which would corrupt the following index built from coordinates 'z', 'x', 'y': <xarray.core.indexes.PandasMultiIndex object at 0x148c95150> ```

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: 33cdabd261b5725ac357c2823bd0f33684d3a954 python: 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:42:03) [Clang 12.0.1 ] python-bits: 64 OS: Darwin OS-release: 21.4.0 machine: arm64 processor: arm byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.8.1 xarray: 0.18.3.dev137+g96c56836 pandas: 1.4.2 numpy: 1.22.3 scipy: 1.8.0 netCDF4: 1.5.8 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.11.3 cftime: 1.6.0 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2022.04.1 distributed: 2022.4.1 matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: 2022.3.0 cupy: None pint: None sparse: None setuptools: 62.1.0 pip: 22.0.4 conda: None pytest: 7.1.1 IPython: 8.2.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6505/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
168272291 MDExOlB1bGxSZXF1ZXN0NzkzMjE2NTc= 924 WIP: progress toward making groupby work with multiple arguments shoyer 1217238 open 0     16 2016-07-29T08:07:57Z 2022-06-09T14:50:17Z   MEMBER   0 pydata/xarray/pulls/924

Fixes #324

It definitely doesn't work properly yet, totally mixing up coordinates, data variables and multi-indexes (as shown by the failing tests).

A simple example:

```
In [4]: coords = {'a': ('x', [0, 0, 1, 1]), 'b': ('y', [0, 0, 1, 1])}

In [5]: square = xr.DataArray(np.arange(16).reshape(4, 4), coords=coords, dims=['x', 'y'])

In [6]: square
Out[6]:
<xarray.DataArray (x: 4, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
Coordinates:
    b        (y) int64 0 0 1 1
    a        (x) int64 0 0 1 1
  * x        (x) int64 0 1 2 3
  * y        (y) int64 0 1 2 3

In [7]: square.groupby(['a', 'b']).mean()
Out[7]:
<xarray.DataArray (a: 2, b: 2)>
array([[  2.5,   4.5],
       [ 10.5,  12.5]])
Coordinates:
  * a        (a) int64 0 1
  * b        (b) int64 0 1

In [8]: square.groupby(['x', 'y']).mean()
Out[8]:
<xarray.DataArray (x: 4, y: 4)>
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.],
       [ 12.,  13.,  14.,  15.]])
Coordinates:
  * x        (x) int64 0 1 2 3
  * y        (y) int64 0 1 2 3
```

More examples: https://gist.github.com/shoyer/5cfa4d5751e8a78a14af25f8442ad8d5

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/924/reactions",
    "total_count": 4,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 3,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
711626733 MDU6SXNzdWU3MTE2MjY3MzM= 4473 Wrap numpy-groupies to speed up Xarray's groupby aggregations shoyer 1217238 closed 0     8 2020-09-30T04:43:04Z 2022-05-15T02:38:29Z 2022-05-15T02:38:29Z MEMBER      

Is your feature request related to a problem? Please describe.

Xarray's groupby aggregations (e.g., groupby(..).sum()) are very slow compared to pandas, as described in https://github.com/pydata/xarray/issues/659.

Describe the solution you'd like

We could speed things up considerably (easily 100x) by wrapping the numpy-groupies package.

Additional context

One challenge is how to handle dask arrays (and other duck arrays). In some cases it might make sense to apply the numpy-groupies function (using apply_ufunc), but in other cases it might be better to stick with the current indexing + concatenate solution. We could either pick some simple heuristics for choosing the algorithm to use on dask arrays, or could just stick with the current algorithm for now.

In particular, it might make sense to stick with the current algorithm if there are many chunks in the arrays to be aggregated along the "grouped" dimension (depending on the number of unique group values).
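A rough sketch of what the numpy-only path could look like (treating numpy_groupies' aggregate signature and the exact apply_ufunc wiring as assumptions to double-check):

```python
import numpy as np
import numpy_groupies as npg
import xarray as xr

def fast_groupby_sum(da, group, dim):
    # `group` is a 1D DataArray of labels along `dim`; factorize into integer codes.
    labels, codes = np.unique(group.values, return_inverse=True)
    codes = xr.DataArray(codes, dims=dim)
    result = xr.apply_ufunc(
        npg.aggregate,          # npg.aggregate(group_idx, a, ...)
        codes,
        da,
        input_core_dims=[[dim], [dim]],
        output_core_dims=[["group"]],
        kwargs={"func": "sum", "size": len(labels), "axis": -1},
    )
    return result.assign_coords(group=("group", labels))

# e.g. fast_groupby_sum(da, da["labels"], "time") as a rough stand-in for
# da.groupby("labels").sum() -- "labels"/"time" are hypothetical names.
```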

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4473/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
326205036 MDU6SXNzdWUzMjYyMDUwMzY= 2180 How should Dataset.update() handle conflicting coordinates? shoyer 1217238 open 0     16 2018-05-24T16:46:23Z 2022-04-30T13:40:28Z   MEMBER      

Recently, we updated Dataset.__setitem__ to drop conflicting coordinates from DataArray values being assigned if they conflict with existing coordinates (https://github.com/pydata/xarray/pull/2087). Because update and __setitem__ share the same code path, this inadvertently updated update as well. Is this something we want?

In v0.10.3, both __setitem__ and update prioritize coordinates from the assigned objects (e.g., value in dataset[key] = value).

In v0.10.4, both __setitem__ and update prioritize coordinates from the original object (e.g., dataset).

I'm not sure this is the right behavior. In particular, in the case of dataset.update(other) where other is also an xarray.Dataset, it seems like coordinates from other should take priority.

Note that one advantage of the current logic (which is violated by my current fix in https://github.com/pydata/xarray/pull/2162), is that we maintain the invariant that dataset[key] = value is equivalent to dataset.update({key: value}).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2180/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
612918997 MDU6SXNzdWU2MTI5MTg5OTc= 4034 Fix tight_layout warning on cartopy facetgrid docs example shoyer 1217238 open 0     1 2020-05-05T21:54:46Z 2022-04-30T12:37:50Z   MEMBER      

Per the fix in https://github.com/pydata/xarray/pull/4032, I'm pretty sure we will soon start seeing a warning message printed on ReadTheDocs in Cartopy FacetGrid example: http://xarray.pydata.org/en/stable/plotting.html#maps

This would be nice to fix for users, especially because it's likely users will see this warning when running code outside of our documentation, too.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4034/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
621123222 MDU6SXNzdWU2MjExMjMyMjI= 4081 Wrap "Dimensions" onto multiple lines in xarray.Dataset repr? shoyer 1217238 closed 0     4 2020-05-19T16:31:59Z 2022-04-29T19:59:24Z 2022-04-29T19:59:24Z MEMBER      

Here's an example dataset of a large dataset from @alimanfoo: https://nbviewer.jupyter.org/gist/alimanfoo/b74b08465727894538d5b161b3ced764 <xarray.Dataset> Dimensions: (__variants/BaseCounts_dim1: 4, __variants/MLEAC_dim1: 3, __variants/MLEAF_dim1: 3, alt_alleles: 3, ploidy: 2, samples: 1142, variants: 21442865) Coordinates: samples/ID (samples) object dask.array<chunksize=(1142,), meta=np.ndarray> variants/CHROM (variants) object dask.array<chunksize=(21442865,), meta=np.ndarray> variants/POS (variants) int32 dask.array<chunksize=(4194304,), meta=np.ndarray> Dimensions without coordinates: __variants/BaseCounts_dim1, __variants/MLEAC_dim1, __variants/MLEAF_dim1, alt_alleles, ploidy, samples, variants Data variables: variants/ABHet (variants) float32 dask.array<chunksize=(4194304,), meta=np.ndarray> variants/ABHom (variants) float32 dask.array<chunksize=(4194304,), meta=np.ndarray> variants/AC (variants, alt_alleles) int32 dask.array<chunksize=(4194304, 3), meta=np.ndarray> variants/AF (variants, alt_alleles) float32 dask.array<chunksize=(4194304, 3), meta=np.ndarray> ...

I know similarly large datasets with lots of dimensions come up in other contexts as well, e.g., with geophysical model output.

That's a very long first line! This would be easier to read as: <xarray.Dataset> Dimensions: (__variants/BaseCounts_dim1: 4, __variants/MLEAC_dim1: 3, __variants/MLEAF_dim1: 3, alt_alleles: 3, ploidy: 2, samples: 1142, variants: 21442865) Coordinates: samples/ID (samples) object dask.array<chunksize=(1142,), meta=np.ndarray> variants/CHROM (variants) object dask.array<chunksize=(21442865,), meta=np.ndarray> variants/POS (variants) int32 dask.array<chunksize=(4194304,), meta=np.ndarray> Dimensions without coordinates: __variants/BaseCounts_dim1, __variants/MLEAC_dim1, __variants/MLEAF_dim1, alt_alleles, ploidy, samples, variants Data variables: variants/ABHet (variants) float32 dask.array<chunksize=(4194304,), meta=np.ndarray> variants/ABHom (variants) float32 dask.array<chunksize=(4194304,), meta=np.ndarray> variants/AC (variants, alt_alleles) int32 dask.array<chunksize=(4194304, 3), meta=np.ndarray> variants/AF (variants, alt_alleles) float32 dask.array<chunksize=(4194304, 3), meta=np.ndarray> ...

or maybe: <xarray.Dataset> Dimensions: __variants/BaseCounts_dim1: 4 __variants/MLEAC_dim1: 3 __variants/MLEAF_dim1: 3 alt_alleles: 3 ploidy: 2 samples: 1142 variants: 21442865 Coordinates: samples/ID (samples) object dask.array<chunksize=(1142,), meta=np.ndarray> variants/CHROM (variants) object dask.array<chunksize=(21442865,), meta=np.ndarray> variants/POS (variants) int32 dask.array<chunksize=(4194304,), meta=np.ndarray> Dimensions without coordinates: __variants/BaseCounts_dim1, __variants/MLEAC_dim1, __variants/MLEAF_dim1, alt_alleles, ploidy, samples, variants Data variables: variants/ABHet (variants) float32 dask.array<chunksize=(4194304,), meta=np.ndarray> variants/ABHom (variants) float32 dask.array<chunksize=(4194304,), meta=np.ndarray> variants/AC (variants, alt_alleles) int32 dask.array<chunksize=(4194304, 3), meta=np.ndarray> variants/AF (variants, alt_alleles) float32 dask.array<chunksize=(4194304, 3), meta=np.ndarray> ...

Dimensions without coordinates could probably use some wrapping, too.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4081/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
205455788 MDU6SXNzdWUyMDU0NTU3ODg= 1251 Consistent naming for xarray's methods that apply functions shoyer 1217238 closed 0     13 2017-02-05T21:27:24Z 2022-04-27T20:06:25Z 2022-04-27T20:06:25Z MEMBER      

We currently have two types of methods that take a function to apply to xarray objects:

  • pipe (on DataArray and Dataset): apply a function to this entire object (array.pipe(func) -> func(array))
  • apply (on Dataset and GroupBy): apply a function to each labeled object in this object (e.g., ds.apply(func) -> ds({k: func(v) for k, v in ds.data_vars.items()})).

And one more method that we want to add but isn't finalized yet -- currently named apply_ufunc:

  • Apply a function that acts on unlabeled (i.e., numpy) arrays to each array in the object

I'd like to have three distinct names that makes it clear what these methods do and how they are different. This has come up a few times recently, e.g., https://github.com/pydata/xarray/issues/1130

One proposal: rename apply to map, and then use apply only for methods that act on unlabeled arrays. This would require a deprecation cycle, but eventually it would let us add .apply methods for handling raw arrays to both Dataset and DataArray. (We could use a separate apply method from apply_ufunc to convert dim arguments to axis and not do automatic broadcasting.)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1251/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
342180429 MDU6SXNzdWUzNDIxODA0Mjk= 2298 Making xarray math lazy shoyer 1217238 open 0     7 2018-07-18T05:18:53Z 2022-04-19T15:38:59Z   MEMBER      

At SciPy, I had the realization that it would be relatively straightforward to make element-wise math between xarray objects lazy. This would let us support lazy coordinate arrays, a feature that has quite a few use-cases, e.g., for both geoscience and astronomy.

The trick would be to write a lazy array class that holds an element-wise vectorized function and passes indexers on to its arguments. I haven't thought too hard about this yet for vectorized indexing, but it could be quite efficient for outer indexing. I have some prototype code but no tests yet.
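Roughly, the prototype idea looks like this (a sketch with made-up names, limited to same-shaped arguments and basic/outer indexing):

```python
import numpy as np

class LazyBinaryOp:
    """Sketch: hold a vectorized function and its arguments, and forward
    indexers to the arguments instead of computing the result up front."""

    def __init__(self, func, left, right):
        self.func = func
        self.left = left
        self.right = right
        self.shape = np.broadcast_shapes(left.shape, right.shape)

    def __getitem__(self, key):
        # Forwarding the same key works for same-shaped arguments and
        # basic/outer indexing; vectorized indexing needs more care.
        return self.func(self.left[key], self.right[key])

lons = np.linspace(-180, 180, 1_000_000)
offsets = np.full(1_000_000, 0.5)
shifted = LazyBinaryOp(np.add, lons, offsets)
print(shifted[:3])  # only three elements are computed here
```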

The question is how to hook this into xarray operations. In particular, supposing that the inputs to a function do not hold dask arrays:

  • Should we try to make every element-wise operation with vectorized functions (ufuncs) lazy by default? This might have negative performance implications and would be a little tricky to implement with xarray's current code, since we still implement binary operations like + with separate logic from apply_ufunc.
  • Should we make every element-wise operation that explicitly uses apply_ufunc() lazy by default?
  • Or should we only make element-wise operations lazy with apply_ufunc() if you use some special flag, e.g., apply_ufunc(..., lazy=True)?

I am leaning towards the last option for now but would welcome other opinions.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2298/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
902622057 MDU6SXNzdWU5MDI2MjIwNTc= 5381 concat() with compat='no_conflicts' on dask arrays has accidentally quadratic runtime shoyer 1217238 open 0     0 2021-05-26T16:12:06Z 2022-04-19T03:48:27Z   MEMBER      

This ends up calling fillna() in a loop inside xarray.core.merge.unique_variable(), something like:

```python
out = variables[0]
for var in variables[1:]:
    out = out.fillna(var)
```

https://github.com/pydata/xarray/blob/55e5b5aaa6d9c27adcf9a7cb1f6ac3bf71c10dea/xarray/core/merge.py#L147-L149

This has quadratic behavior if the variables are stored in dask arrays (the dask graph gets one element larger after each loop iteration). This is OK for merge() (which typically only has two arguments) but is problematic for dealing with variables that shouldn't be concatenated inside concat(), which should be able to handle very long lists of arguments.

I encountered this because compat='no_conflicts' is the default for xarray.combine_nested().

I guess there's also the related issue which is that even if we produced the output dask graph by hand without a loop, it still wouldn't be easy to evaluate for a large number of elements. Ideally we would use some sort of tree-reduction to ensure the operation can be parallelized.
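A sketch of the tree-reduction idea, which keeps the graph depth at O(log n) instead of O(n):

```python
def tree_fillna(variables):
    # Pairwise (tree) reduction: combine neighbours until one variable is left,
    # instead of folding left-to-right. fillna is associative, so the result
    # matches the sequential loop, but the dask graph depth is O(log n).
    variables = list(variables)
    while len(variables) > 1:
        paired = []
        for i in range(0, len(variables) - 1, 2):
            paired.append(variables[i].fillna(variables[i + 1]))
        if len(variables) % 2:
            paired.append(variables[-1])
        variables = paired
    return variables[0]
```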

xref https://github.com/google/xarray-beam/pull/13

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5381/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
325439138 MDU6SXNzdWUzMjU0MzkxMzg= 2171 Support alignment/broadcasting with unlabeled dimensions of size 1 shoyer 1217238 open 0     5 2018-05-22T19:52:21Z 2022-04-19T03:15:24Z   MEMBER      

Sometimes, it's convenient to include placeholder dimensions of size 1, which allows for removing any ambiguity related to the order of output dimensions.

Currently, this is not supported with xarray:

```
>>> xr.DataArray([1], dims='x') + xr.DataArray([1, 2, 3], dims='x')
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension sizes: {1, 3}

>>> xr.Variable(('x',), [1]) + xr.Variable(('x',), [1, 2, 3])
ValueError: operands cannot be broadcast together with mismatched lengths for dimension 'x': (1, 3)
```

However, these operations aren't really ambiguous. With size-1 dimensions, we could logically do broadcasting like NumPy arrays, e.g.,

```
>>> np.array([1]) + np.array([1, 2, 3])
array([2, 3, 4])
```

This would be particularly convenient if we add keepdims=True to xarray operations (#2170).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2171/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
665488672 MDU6SXNzdWU2NjU0ODg2NzI= 4267 CachingFileManager should not use __del__ shoyer 1217238 open 0     2 2020-07-25T01:20:52Z 2022-04-17T21:42:39Z   MEMBER      

__del__ is sometimes called after modules have been deallocated, which results in errors printed to stderr when Python exits. This manifests itself in the following bug: https://github.com/shoyer/h5netcdf/issues/50

Per https://github.com/shoyer/h5netcdf/issues/50#issuecomment-572191867, the right solution is probably to use weakref.finalize.
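A minimal sketch of the weakref.finalize approach, using a simplified stand-in for the real file manager (the names here are illustrative, not xarray's actual implementation):

```python
import weakref

def _close(file_holder):
    # module-level function: the callback must not hold a reference to the manager
    if file_holder[0] is not None:
        file_holder[0].close()
        file_holder[0] = None

class FileManagerSketch:
    def __init__(self, opener, *args, **kwargs):
        self._opener = opener
        self._args = args
        self._kwargs = kwargs
        self._file_holder = [None]  # mutable cell shared with the finalizer
        # unlike __del__, finalizers run reliably before interpreter shutdown
        # tears down module globals, and they run at most once
        self._finalizer = weakref.finalize(self, _close, self._file_holder)

    def acquire(self):
        if self._file_holder[0] is None:
            self._file_holder[0] = self._opener(*self._args, **self._kwargs)
        return self._file_holder[0]

    def close(self):
        self._finalizer()  # calls _close and marks the finalizer as dead
```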

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4267/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
469440752 MDU6SXNzdWU0Njk0NDA3NTI= 3139 Change the signature of DataArray to DataArray(data, dims, coords, ...)? shoyer 1217238 open 0     1 2019-07-17T20:54:57Z 2022-04-09T15:28:51Z   MEMBER      

Currently, the signature of DataArray is DataArray(data, coords, dims, ...): http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html

In the long term, I think DataArray(data, dims, coords, ...) would be more intuitive: dimensions are a more fundamental part of xarray's data model than coordinates. Certainly I find it much more common to omit coords than to omit dims when I create a DataArray.

My original reasoning for this argument order was that dims could be copied from coords, e.g., DataArray(new_data, old_dataarray.coords), and it was nice to be able to pass this sole argument by position instead of by name. But a cleaner way to write this now is old_dataarray.copy(data=new_data).

The challenge in making any change here would be to have a smooth deprecation process, and that ideally avoids requiring users to rewrite all of their code and avoids loads of pointless/extraneous warnings. I'm not entirely sure this is possible. We could likely use heuristics to distinguish between dims and coords arguments regardless of their order, but this probably isn't something we would want to preserve in the long term.

An alternative that might achieve some of the convenience of this change would be to allow for passing lists of strings in the coords argument by position, which are interpreted as dimensions, e.g., DataArray(data, ['x', 'y']). The downside of this alternative is that it would add even more special cases to the DataArray constructor, which would make it harder to understand.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3139/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
327166000 MDExOlB1bGxSZXF1ZXN0MTkxMDMwMjA4 2195 WIP: explicit indexes shoyer 1217238 closed 0     3 2018-05-29T04:25:15Z 2022-03-21T14:59:52Z 2022-03-21T14:59:52Z MEMBER   0 pydata/xarray/pulls/2195

Some utility functions that should be useful for https://github.com/pydata/xarray/issues/1603

Still very much a work in progress -- it would be great if someone has time to finish writing any of these in another PR!

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2195/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
864249974 MDU6SXNzdWU4NjQyNDk5NzQ= 5202 Make creating a MultiIndex in stack optional shoyer 1217238 closed 0     7 2021-04-21T20:21:03Z 2022-03-17T17:11:42Z 2022-03-17T17:11:42Z MEMBER      

As @Hoeze notes in https://github.com/pydata/xarray/issues/5179, calling stack() can be "incredibly slow and memory-demanding, since it creates a MultiIndex of every possible coordinate in the array."

This is true with how stack() works currently, but I'm not sure this is necessary. I suspect it's a vestigial design choice from copying pandas, back from before Xarray had optional indexes. One benefit is that it's convenient for making unstack() the inverse of stack(), but that isn't always required.

Regardless of how we define the semantics for boolean indexing (https://github.com/pydata/xarray/issues/1887), it seems like it could be a good idea to allow stack to skip creating a MultiIndex for the new dimension, via a new keyword argument such as ds.stack(index=False). This would be equivalent to calling reset_index() after stack() but would be cheaper because the MultiIndex is never created in the first place.
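To illustrate the proposed equivalence (the index keyword is hypothetical), the result would match what you get today by stacking and then immediately discarding the MultiIndex:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"v": (("x", "y"), np.arange(6).reshape(2, 3))},
    coords={"x": [10, 20], "y": ["a", "b", "c"]},
)

# today: build the MultiIndex, then drop it again
stacked = ds.stack(z=("x", "y")).reset_index("z")

# hypothetical: ds.stack(z=("x", "y"), index=False) would skip building it entirely
```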

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5202/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
237008177 MDU6SXNzdWUyMzcwMDgxNzc= 1460 groupby should still squeeze for non-monotonic inputs shoyer 1217238 open 0     5 2017-06-19T20:05:14Z 2022-03-04T21:31:41Z   MEMBER      

We can simply use argsort() to determine group_indices instead of np.arange(): https://github.com/pydata/xarray/blob/22ff955d53e253071f6e4fa849e5291d0005282a/xarray/core/groupby.py#L256
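A small sketch of the idea with plain NumPy (not the actual groupby code): a stable argsort of the group labels recovers per-group positions even when the labels are not monotonic.

```python
import numpy as np

labels = np.array([2, 0, 1, 0, 2])          # non-monotonic group labels
order = np.argsort(labels, kind="stable")   # positions sorted by group
_, counts = np.unique(labels, return_counts=True)
group_indices = np.split(order, np.cumsum(counts)[:-1])
# [array([1, 3]), array([2]), array([0, 4])]
```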

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1460/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
58117200 MDU6SXNzdWU1ODExNzIwMA== 324 Support multi-dimensional grouped operations and group_over shoyer 1217238 open 0   1.0 741199 12 2015-02-18T19:42:20Z 2022-02-28T19:03:17Z   MEMBER      

Multi-dimensional grouped operations should be relatively straightforward -- the main complexity will be writing an N-dimensional concat that doesn't involve repetitively copying data.

The idea with group_over would be to support groupby operations that act on a single element from each of the given groups, rather than the unique values. For example, ds.group_over(['lat', 'lon']) would let you iterate over or apply to 2D slices of ds, no matter how many dimensions it has.

Roughly speaking (it's a little more complex for the case of non-dimension variables), ds.group_over(dims) would get translated into ds.groupby([d for d in ds.dims if d not in dims]).

Related: #266

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/324/reactions",
    "total_count": 18,
    "+1": 18,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1090700695 I_kwDOAMm_X85BAsWX 6125 [Bug]: HTML repr does not display well in notebooks hosted on GitHub shoyer 1217238 open 0     0 2021-12-29T19:05:49Z 2021-12-29T19:36:25Z   MEMBER      

What happened?

We see both the raw text and a malformed version of the HTML (without CSS formatting).

Example (https://github.com/microsoft/PlanetaryComputerExamples/blob/main/quickstarts/reading-zarr-data.ipynb):

What did you expect to happen?

Either:

  1. Ideally, we only see the HTML repr, with CSS formatting applied.
  2. Or, if that isn't possible, we should figure out how to only show the raw text.

nbviewer gets this right:

Minimal Complete Verifiable Example

No response

Relevant log output

No response

Anything else we need to know?

No response

Environment

NA

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6125/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1062709354 PR_kwDOAMm_X84u-sO9 6025 Simplify missing value handling in xarray.corr shoyer 1217238 closed 0     1 2021-11-24T17:48:03Z 2021-11-28T04:39:22Z 2021-11-28T04:39:22Z MEMBER   0 pydata/xarray/pulls/6025

This PR simplifies the fix from https://github.com/pydata/xarray/pull/5731, specifically for the benefit of xarray.corr. There is no need to use map_blocks instead of using where directly.

It is basically an alternative version of https://github.com/pydata/xarray/pull/5284. It is potentially slightly less efficient to do this masking step when unnecessary, but I doubt this makes a noticeable performance difference in practice (and I doubt this optimization is useful inside map_blocks, anyway).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6025/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1044151556 PR_kwDOAMm_X84uELYB 5935 Docs: fix URL for PTSA shoyer 1217238 closed 0     1 2021-11-03T21:56:44Z 2021-11-05T09:36:04Z 2021-11-05T09:36:04Z MEMBER   0 pydata/xarray/pulls/5935

One of the PTSA authors told me about the new URL by email.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5935/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
874292512 MDU6SXNzdWU4NzQyOTI1MTI= 5251 Switch default for Zarr reading/writing to consolidated=True? shoyer 1217238 closed 0     4 2021-05-03T06:59:42Z 2021-08-30T15:21:11Z 2021-08-30T15:21:11Z MEMBER      

Consolidated metadata was a new feature in Zarr v2.3, which was released over two years ago (March 22, 2019).

Since then, I have used consolidated=True every time I've written or opened a Zarr store. As far as I can tell, this is almost always a good idea:

  • With local storage, it usually doesn't really matter. You spend a bit of time writing the consolidated metadata and have one extra file on disk, but the overhead is typically negligible.
  • With cloud object stores or network filesystems, it can matter quite a large amount. Without consolidated metadata, these systems can be unusably slow for opening datasets. Cloud storage is of course the main use-case for Zarr. If you're using a local disk, you might as well stick with single files such as netCDF.

I wonder if consolidated metadata is mature enough now that we could consider switching the default behavior in Xarray. From my perspective, this is a big "gotcha" for getting good performance with Zarr. More than one of my colleagues has been unimpressed with the performance of Zarr until they learned to set consolidated=True.

I would suggest doing this in a way that is almost entirely backwards compatible, with only a minor performance cost for reading non-consolidated datasets:

  • to_zarr() switches the default to consolidated=True. consolidate_metadata() will thus happen by default.
  • open_zarr() switches the default to consolidated=None, which means "try reading consolidated metadata, and fall back to non-consolidated if that fails." This will be slightly slower for non-consolidated metadata due to the extra file lookup, but given that opening with non-consolidated metadata already requires a moderately large number of file lookups, I doubt anyone will notice the difference.

CC @rabernat

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5251/reactions",
    "total_count": 11,
    "+1": 11,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
928402742 MDU6SXNzdWU5Mjg0MDI3NDI= 5516 Rename master branch -> main shoyer 1217238 closed 0     4 2021-06-23T15:45:57Z 2021-07-23T21:58:39Z 2021-07-23T21:58:39Z MEMBER      

This is a best practice for inclusive projects.

See https://github.com/github/renaming for guidance.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5516/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
948890466 MDExOlB1bGxSZXF1ZXN0NjkzNjY1NDEy 5624 Make typing-extensions optional shoyer 1217238 closed 0     6 2021-07-20T17:43:22Z 2021-07-22T23:30:49Z 2021-07-22T23:02:03Z MEMBER   0 pydata/xarray/pulls/5624

Type checking may be a little worse if typing-extensions is not installed, but I don't think it's worth the trouble of adding another hard dependency just for one use of TypeGuard.

Note: sadly this doesn't work yet. Mypy (and pylance) don't like the type alias defined with try/except. Any ideas? In the worst case, we could revert the TypeGuard entirely, but that would be a shame...

  • [x] Closes #5495
  • [x] Passes pre-commit run --all-files
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5624/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
890534794 MDU6SXNzdWU4OTA1MzQ3OTQ= 5295 Engine is no longer inferred for filenames not ending in ".nc" shoyer 1217238 closed 0     0 2021-05-12T22:28:46Z 2021-07-15T14:57:54Z 2021-05-14T22:40:14Z MEMBER      

This works with xarray=0.17.0:

```python
import xarray
xarray.Dataset({'x': [1, 2, 3]}).to_netcdf('tmp')
xarray.open_dataset('tmp')
```

On xarray 0.18.0, it fails:

```python-traceback
ValueError                                Traceback (most recent call last)
<ipython-input-1-20e128a730aa> in <module>()
      2
      3 xarray.Dataset({'x': [1, 2, 3]}).to_netcdf('tmp')
----> 4 xarray.open_dataset('tmp')

/usr/local/lib/python3.7/dist-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    483
    484     if engine is None:
--> 485         engine = plugins.guess_engine(filename_or_obj)
    486
    487     backend = plugins.get_backend(engine)

/usr/local/lib/python3.7/dist-packages/xarray/backends/plugins.py in guess_engine(store_spec)
    110             warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
    111
--> 112     raise ValueError("cannot guess the engine, try passing one explicitly")
    113
    114

ValueError: cannot guess the engine, try passing one explicitly
```

I'm not entirely sure what changed. My guess is that we used to fall-back to trying to use SciPy, but don't do that anymore. A potential fix would be reading strings as filenames in xarray.backends.utils.read_magic_number.

Related: https://github.com/pydata/xarray/issues/5291

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5295/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
252707680 MDU6SXNzdWUyNTI3MDc2ODA= 1525 Consider setting name=False in Variable.chunk() shoyer 1217238 open 0     4 2017-08-24T19:34:28Z 2021-07-13T01:50:16Z   MEMBER      

@mrocklin writes:

The following will be slower:

```python
b = (a.chunk(...) + 1) + (a.chunk(...) + 1)
```

In current operation this will be optimized to

```python
tmp = a.chunk(...) + 1
b = tmp + tmp
```

So you'll lose that, but I suspect that in your case chunking the same dataset many times is somewhat rare.

See here for discussion: https://github.com/pydata/xarray/pull/1517#issuecomment-324722153

Whether this is worth doing really depends on on what people would find most useful -- and what is the most intuitive behavior.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1525/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
254888879 MDU6SXNzdWUyNTQ4ODg4Nzk= 1552 Flow chart for choosing indexing operations shoyer 1217238 open 0     2 2017-09-03T17:33:30Z 2021-07-11T22:26:17Z   MEMBER      

We have a lot of indexing operations, even though sel_points and isel_points are about to be deprecated (#1473).

A flow chart / decision tree to help users pick the right indexing operation might be helpful (e.g., like this skimage FlowChart). It would ask various questions (e.g., do you have labels or integer positions? do you want to select or impose coordinates?) and then suggest the appropriate indexer methods.

cc @fujiisoup

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1552/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
891281614 MDU6SXNzdWU4OTEyODE2MTQ= 5302 Suggesting specific IO backends to install when open_dataset() fails shoyer 1217238 closed 0     3 2021-05-13T18:45:28Z 2021-06-23T08:18:07Z 2021-06-23T08:18:07Z MEMBER      

Currently, Xarray's internal backends don't get registered unless the necessary dependencies are installed: https://github.com/pydata/xarray/blob/1305d9b624723b86050ca5b2d854e5326bbaa8e6/xarray/backends/netCDF4_.py#L567-L568

In order to facilitate suggesting a specific backend to install (e.g., to improve error messages from opening tutorial datasets https://github.com/pydata/xarray/issues/5291), I would suggest that Xarray always registers its own backend entrypoints. Then we make the following changes to the plugin protocol:

  • guess_can_open() should work regardless of whether the underlying backend is installed
  • installed() returns a boolean reporting whether backend is installed. The default method in the base class would return True, for backwards compatibility.
  • open_dataset() of course should error if the backend is not installed.

This will let us leverage the existing guess_can_open() functionality to suggest specific optional dependencies to install. E.g., if you supply a netCDF3 file:

```
Xarray cannot find a matching installed backend for this file in the installed backends ["h5netcdf"].
Consider installing one of the following backends which reports a match: ["scipy", "netcdf4"]
```

Does this seem reasonable and worthwhile?

CC @aurghs @alexamici

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5302/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
874331538 MDExOlB1bGxSZXF1ZXN0NjI4OTE0NDQz 5252 Add mode="r+" for to_zarr and use consolidated writes/reads by default shoyer 1217238 closed 0     14 2021-05-03T07:57:16Z 2021-06-22T06:51:35Z 2021-06-17T17:19:26Z MEMBER   0 pydata/xarray/pulls/5252

mode="r+" only allows for modifying pre-existing array values in a Zarr store. This makes it a safer default mode when doing a limited region write. It also offers a nice performance bonus when using consolidated metadata, because the store to modify can be opened in "consolidated" mode -- rather than painfully slow non-consolidated mode.

This PR includes several related changes to to_zarr():

  1. It adds support for the new mode="r+".
  2. consolidated=True in to_zarr() now means "open in consolidated mode" if using mode="r+", instead of "write in consolidated mode" (which would not make sense for r+).
  3. It allows setting consolidated=True when using region, mostly for the sake of fast store opening with r+.
  4. Validation in to_zarr() has been reorganized to always use the existing Zarr group, rather than re-opening zarr stores from scratch, which could require additional network requests.
  5. Incidentally, I've renamed the ZarrStore.ds attribute to ZarrStore.zarr_group, which is a much more descriptive name.

These changes gave me a ~5x boost in write performance in a large parallel job making use of to_zarr with region.
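For illustration, a small end-to-end sketch of the new mode (the path and sizes are made up): create a store up front, then rewrite just one region of it in place.

```python
import numpy as np
import xarray as xr

path = "example-store.zarr"
xr.Dataset({"u": ("x", np.zeros(10))}).to_zarr(path, mode="w")

# later, possibly from another worker: overwrite only x=0..4 in the existing
# store, without touching metadata or the rest of the array
update = xr.Dataset({"u": ("x", np.arange(5.0))})
update.to_zarr(path, mode="r+", region={"x": slice(0, 5)})
```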

  • [x] Tests added
  • [x] Passes pre-commit run --all-files
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5252/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
340733448 MDU6SXNzdWUzNDA3MzM0NDg= 2283 Exact alignment should allow missing dimension coordinates shoyer 1217238 open 0     2 2018-07-12T17:40:24Z 2021-06-15T09:52:29Z   MEMBER      

Code Sample, a copy-pastable example if possible

```python
import xarray as xr
xr.align(xr.DataArray([1, 2, 3], dims='x'),
         xr.DataArray([1, 2, 3], dims='x', coords=[[0, 1, 2]]),
         join='exact')
```

Problem description

This currently results in an error, but a missing index of size 3 does not actually conflict:

```python-traceback
ValueError                                Traceback (most recent call last)
<ipython-input-15-1d63d3512fb6> in <module>()
      1 xr.align(xr.DataArray([1, 2, 3], dims='x'),
      2          xr.DataArray([1, 2, 3], dims='x', coords=[[0, 1, 2]]),
----> 3          join='exact')

/usr/local/lib/python3.6/dist-packages/xarray/core/alignment.py in align(*objects, **kwargs)
    129                 raise ValueError(
    130                     'indexes along dimension {!r} are not equal'
--> 131                     .format(dim))
    132             index = joiner(matching_indexes)
    133             joined_indexes[dim] = index

ValueError: indexes along dimension 'x' are not equal
```

This surfaced as an issue on StackOverflow: https://stackoverflow.com/questions/51308962/computing-matrix-vector-multiplication-for-each-time-point-in-two-dataarrays

Expected Output

Both output arrays should end up with the x coordinate from the input that has it, like the output of the above expression if join='inner':

```
(<xarray.DataArray (x: 3)>
 array([1, 2, 3])
 Coordinates:
   * x        (x) int64 0 1 2,
 <xarray.DataArray (x: 3)>
 array([1, 2, 3])
 Coordinates:
   * x        (x) int64 0 1 2)
```

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.14.33+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.7 pandas: 0.22.0 numpy: 1.14.5 scipy: 0.19.1 netCDF4: None h5netcdf: None h5py: 2.8.0 Nio: None zarr: None bottleneck: None cyordereddict: None dask: None distributed: None matplotlib: 2.1.2 cartopy: None seaborn: 0.7.1 setuptools: 39.1.0 pip: 10.0.1 conda: None pytest: None IPython: 5.5.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2283/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
842438533 MDU6SXNzdWU4NDI0Mzg1MzM= 5082 Move encoding from xarray.Variable to duck arrays? shoyer 1217238 open 0     2 2021-03-27T07:21:55Z 2021-06-13T01:34:00Z   MEMBER      

The encoding property on Variable has always been an awkward part of Xarray's API, and an example of poor separation of concerns. It adds conceptual overhead to all uses of xarray.Variable, but exists only for the (somewhat niche) benefit of Xarray's backend IO functionality. This is particularly problematic if we consider the possible separation of xarray.Variable into a separate package to remove the pandas dependency (https://github.com/pydata/xarray/issues/3981).

I think a cleaner way to handle encoding would be to move it from Variable onto array objects, specifically duck array objects that Xarray creates when loading data from disk. As long as these duck arrays don't "propagate" themselves under array operations but rather turn into raw numpy arrays (or whatever is wrapped), this would automatically resolve all issues around propagating encoding attributes (e.g., https://github.com/pydata/xarray/pull/5065, https://github.com/pydata/xarray/issues/1614). And users who don't care about encoding because they don't use Xarray's IO functionality would never need to think about it.
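A rough sketch of the idea, assuming a hypothetical wrapper class: encoding would live on the array object produced by the backend, and would disappear as soon as any computation coerces the wrapper into a plain array.

```python
import numpy as np

class EncodedArray:
    """Hypothetical duck array that carries encoding loaded from disk."""

    def __init__(self, data, encoding):
        self._data = np.asarray(data)
        self.encoding = dict(encoding)

    @property
    def shape(self):
        return self._data.shape

    @property
    def dtype(self):
        return self._data.dtype

    def __array__(self, dtype=None):
        # any operation that coerces to ndarray drops the wrapper, so results
        # of computation never propagate (possibly stale) encoding
        return np.asarray(self._data, dtype=dtype)

data = EncodedArray([1.0, 2.0], encoding={"dtype": "int16", "scale_factor": 0.1})
print(data.encoding)    # {'dtype': 'int16', 'scale_factor': 0.1}
print(np.add(data, 1))  # plain ndarray, no encoding attached
```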

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5082/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
416554477 MDU6SXNzdWU0MTY1NTQ0Nzc= 2797 Stalebot is being overly aggressive shoyer 1217238 closed 0     7 2019-03-03T19:37:37Z 2021-06-03T21:31:46Z 2021-06-03T21:22:48Z MEMBER      

E.g., see https://github.com/pydata/xarray/issues/1151 where stalebot closed an issue even after another comment.

Is this something we need to reconfigure or just a bug?

cc @pydata/xarray

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2797/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
276241764 MDU6SXNzdWUyNzYyNDE3NjQ= 1739 Utility to restore original dimension order after apply_ufunc shoyer 1217238 open 0     11 2017-11-23T00:47:57Z 2021-05-29T07:39:33Z   MEMBER      

This seems to be coming up quite a bit for wrapping functions that apply an operation along an axis, e.g., for interpolate in #1640 or rank in #1733.

We should either write a utility function to do this or consider adding an option to apply_ufunc.
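One possible shape such a utility could take (a sketch, not an existing xarray function): transpose the result back to the order of dimensions on the original object, appending any newly created dimensions at the end.

```python
def restore_dim_order(result, original):
    # dimensions from the original object first, in their original order...
    order = [dim for dim in original.dims if dim in result.dims]
    # ...then any dimensions the operation introduced
    order += [dim for dim in result.dims if dim not in original.dims]
    return result.transpose(*order)
```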

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1739/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
901047466 MDU6SXNzdWU5MDEwNDc0NjY= 5372 Consider revising the _repr_inline_ protocol shoyer 1217238 open 0     0 2021-05-25T16:18:31Z 2021-05-25T16:18:31Z   MEMBER      

_repr_inline_ looks like an IPython special method but actually includes some xarray-specific details: the result should not include shape or dtype.

As I wrote in https://github.com/pydata/xarray/pull/5352, I would suggest revising it in one of two ways:

  1. Giving it a name like _xarray_repr_inline_ to make it clearer that it's Xarray specific
  2. Include some more generic way of indicating that shape/dtype is redundant, e.g,. call it like obj._repr_ndarray_inline_(dtype=False, shape=False)
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5372/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
891253662 MDExOlB1bGxSZXF1ZXN0NjQ0MTQ5Mzc2 5300 Better error message when no backend engine is found. shoyer 1217238 closed 0     4 2021-05-13T18:10:04Z 2021-05-18T21:23:00Z 2021-05-18T21:23:00Z MEMBER   0 pydata/xarray/pulls/5300

Also includes a better error message when loading a tutorial dataset but an underlying IO dependency is not found.

  • [x] Fixes #5291
  • [x] Tests added
  • [x] Passes pre-commit run --all-files
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5300/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
890573049 MDExOlB1bGxSZXF1ZXN0NjQzNTc1Mjc5 5296 More robust guess_can_open for netCDF4/scipy/h5netcdf entrypoints shoyer 1217238 closed 0     1 2021-05-12T23:53:32Z 2021-05-14T22:40:14Z 2021-05-14T22:40:14Z MEMBER   0 pydata/xarray/pulls/5296

The new version checks magic numbers in files on disk, not just already open file objects.

I've also added a bunch of unit-tests.

Fixes GH5295

  • [x] Closes #5295
  • [x] Tests added
  • [x] Passes pre-commit run --all-files
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5296/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
46049691 MDU6SXNzdWU0NjA0OTY5MQ== 255 Add Dataset.to_pandas() method shoyer 1217238 closed 0   0.5 987654 2 2014-10-17T00:01:36Z 2021-05-04T13:56:00Z 2021-05-04T13:56:00Z MEMBER      

This would be the complement of the DataArray constructor, converting an xray.DataArray into a 1D series, 2D DataFrame or 3D panel, whichever is appropriate.

to_pandas would also make sense for Dataset, if it could convert 0d datasets to series, e.g., pd.Series({k: v.item() for k, v in ds.items()}) (there is currently no direct way to do this), and revert to to_dataframe for higher dimensional input.

  • [x] DataArray method
  • [ ] Dataset method
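A minimal sketch of the proposed dispatch (dataset_to_pandas is a hypothetical helper, not existing API):

```python
import pandas as pd

def dataset_to_pandas(ds):
    if len(ds.dims) == 0:
        # 0d datasets become a Series of scalars, one entry per variable
        return pd.Series({k: v.item() for k, v in ds.items()})
    # otherwise fall back to the existing tabular conversion
    return ds.to_dataframe()
```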

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/255/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
294241734 MDU6SXNzdWUyOTQyNDE3MzQ= 1887 Boolean indexing with multi-dimensional key arrays shoyer 1217238 open 0     13 2018-02-04T23:28:45Z 2021-04-22T21:06:47Z   MEMBER      

Originally from https://github.com/pydata/xarray/issues/974

For boolean indexing:

  • da[key] where key is a boolean labelled array (with any number of dimensions) is made equivalent to da.where(key.reindex_like(ds), drop=True). This matches the existing behavior if key is a 1D boolean array. For multi-dimensional arrays, even though the result is now multi-dimensional, this coupled with automatic skipping of NaNs means that da[key].mean() gives the same result as in NumPy.
  • da[key] = value where key is a boolean labelled array can be made equivalent to da = da.where(*align(key.reindex_like(da), value.reindex_like(da))) (that is, the three argument form of where).
  • da[key_0, ..., key_n] where all of key_i are boolean arrays gets handled in the usual way. It is an IndexingError to supply multiple labelled keys if any of them are not already aligned with the corresponding index coordinates (and share the same dimension name). If they want alignment, we suggest users simply write da[key_0 & ... & key_n].
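For the 1D case, the equivalence already holds today, which this small example illustrates:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(4.0), dims="x", coords={"x": [0, 1, 2, 3]})
key = da > 1.5

# 1D boolean indexing and where(..., drop=True) select the same elements
assert (da[key] == da.where(key, drop=True)).all()
```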

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1887/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
346822633 MDU6SXNzdWUzNDY4MjI2MzM= 2336 test_88_character_filename_segmentation_fault should not try to write to the current working directory shoyer 1217238 closed 0     2 2018-08-02T01:06:41Z 2021-04-20T23:38:53Z 2021-04-20T23:38:53Z MEMBER      

This fails in cases where the current working directory does not support writes, e.g., as seen here:

```
    def test_88_character_filename_segmentation_fault(self):
        # should be fixed in netcdf4 v1.3.1
        with mock.patch('netCDF4.__version__', '1.2.4'):
            with warnings.catch_warnings():
                message = ('A segmentation fault may occur when the '
                           'file path has exactly 88 characters')
                warnings.filterwarnings('error', message)
                with pytest.raises(Warning):
                    # Need to construct 88 character filepath
                    xr.Dataset().to_netcdf('a' * (88 - len(os.getcwd()) - 1))

tests/test_backends.py:1234:

core/dataset.py:1150: in to_netcdf
    compute=compute)
backends/api.py:715: in to_netcdf
    autoclose=autoclose, lock=lock)
backends/netCDF4_.py:332: in open
    ds = opener()
backends/netCDF4_.py:231: in _open_netcdf4_group
    ds = nc4.Dataset(filename, mode=mode, **kwargs)
third_party/py/netCDF4/_netCDF4.pyx:2111: in netCDF4._netCDF4.Dataset.__init__
    ???

???
E   IOError: [Errno 13] Permission denied
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2336/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
843996137 MDU6SXNzdWU4NDM5OTYxMzc= 5092 Concurrent loading of coordinate arrays from Zarr shoyer 1217238 open 0     0 2021-03-30T02:19:50Z 2021-04-19T02:43:31Z   MEMBER      

When you open a dataset with Zarr, xarray loads coordinate arrays corresponding to indexes in serial. This can be slow (multiple seconds) even with only a handful of such arrays if they are stored in a remote filesystem (e.g., cloud object stores). This is similar to the use-cases for consolidated metadata.

In principle, we could speed up loading datasets from Zarr into Xarray significantly by reading the data corresponding to these arrays in parallel (e.g., in multiple threads).
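A minimal sketch of the idea (not current xarray behavior), assuming we simply want to load all index coordinates of an already-opened dataset with a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def load_index_coords_concurrently(ds, max_workers=8):
    # index coordinates are the ones that share a name with a dimension
    names = [name for name in ds.coords if name in ds.dims]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # each .load() triggers the read for one coordinate array
        list(executor.map(lambda name: ds[name].load(), names))
    return ds
```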

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5092/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
621082480 MDU6SXNzdWU2MjEwODI0ODA= 4080 Most arguments to open_dataset should be keyword only shoyer 1217238 closed 0     1 2020-05-19T15:38:51Z 2021-03-16T10:56:09Z 2021-03-16T10:56:09Z MEMBER      

open_dataset has a long list of arguments:

```python
xarray.open_dataset(filename_or_obj, group=None, decode_cf=True, mask_and_scale=None,
                    decode_times=True, autoclose=None, concat_characters=True,
                    decode_coords=True, engine=None, chunks=None, lock=None,
                    cache=None, drop_variables=None, backend_kwargs=None,
                    use_cftime=None)
```

Similarly to the case for pandas (https://github.com/pandas-dev/pandas/issues/27544), it would be nice to make most of these arguments keyword-only, e.g., def open_dataset(filename_or_obj, group, *, ...). For consistency, this would also apply to open_dataarray, decode_cf, open_mfdataset, etc.

This would encourage writing readable code when calling open_dataset() and would allow us to use better organization when adding new arguments (e.g., decode_timedelta in https://github.com/pydata/xarray/pull/4071).

To make this change, we could make use of the deprecate_nonkeyword_arguments decorator from https://github.com/pandas-dev/pandas/pull/27573

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4080/reactions",
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
645062817 MDExOlB1bGxSZXF1ZXN0NDM5NTg4OTU1 4178 Fix min_deps_check; revert to support numpy=1.14 and pandas=0.24 shoyer 1217238 closed 0     5 2020-06-25T00:37:19Z 2021-02-27T21:46:43Z 2021-02-27T21:46:42Z MEMBER   1 pydata/xarray/pulls/4178

Fixes the issue noticed in: https://github.com/pydata/xarray/pull/4175#issuecomment-649135372

Let's see if this passes CI...

  • [x] Passes isort -rc . && black . && mypy . && flake8
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4178/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
645154872 MDU6SXNzdWU2NDUxNTQ4NzI= 4179 Consider revising our minimum dependency version policy shoyer 1217238 closed 0     7 2020-06-25T05:04:38Z 2021-02-22T05:02:25Z 2021-02-22T05:02:25Z MEMBER      

Our current policy is that xarray supports "the minor version (X.Y) initially published no more than N months ago" where N is:

  • Python: 42 months (NEP 29)
  • numpy: 24 months (NEP 29)
  • pandas: 12 months
  • scipy: 12 months
  • sparse, pint and other libraries that rely on NEP-18 for integration: very latest available versions only,
  • all other libraries: 6 months

I think this policy is too aggressive, particularly for pandas, SciPy and other libraries. Some of these projects can go 6+ months between minor releases. For example, version 2.3 of zarr is currently more than 6 months old. So if zarr released 2.4 today and xarray issued a new release tomorrow, our policy would dictate that we could ask users to upgrade to the new version.

In https://github.com/pydata/xarray/pull/4178, I misinterpreted our policy as supporting "the most recent minor version (X.Y) initially published more than N months ago". This version makes a bit more sense to me: users only need to upgrade dependencies at least every N months to use the latest xarray release.

I understand that NEP-29 chose its language intentionally, so that distributors know ahead of time when they can drop support for a Python or NumPy version. But this seems like a (very) poor fit for projects without regular releases. At the very least we should adjust the specific time windows.

I'll see if I can gain some understanding of the motivation for this particular language over on the NumPy tracker...

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4179/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
267927402 MDU6SXNzdWUyNjc5Mjc0MDI= 1652 Resolve warnings issued in the xarray test suite shoyer 1217238 closed 0     10 2017-10-24T07:36:55Z 2021-02-21T23:06:35Z 2021-02-21T23:06:34Z MEMBER      

82 warnings are currently issued in the process of running our test suite: https://gist.github.com/shoyer/db0b2c82efd76b254453216e957c4345

Some of these can probably be safely ignored, but others are likely noticed by users, e.g., https://stackoverflow.com/questions/41130138/why-is-invalid-value-encountered-in-greater-warning-thrown-in-python-xarray-fo/41147570#41147570

It would be nice to clean up all of these, either by catching the appropriate upstream warning (if irrelevant) or changing our usage to avoid the warning. There may very well be a lurking FutureWarning in there somewhere that could cause issues when another library updates.

Probably the easiest way to get started here is to get the test suite running locally, and use py.test -W error to turn all warnings into errors.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1652/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
777327298 MDU6SXNzdWU3NzczMjcyOTg= 4749 Option for combine_attrs with conflicting values silently dropped shoyer 1217238 closed 0     0 2021-01-01T18:04:49Z 2021-02-10T19:50:17Z 2021-02-10T19:50:17Z MEMBER      

merge() currently supports four options for merging attrs:

```
combine_attrs : {"drop", "identical", "no_conflicts", "override"}, \
                default: "drop"
    String indicating how to combine attrs of the objects being merged:

    - "drop": empty attrs on returned Dataset.
    - "identical": all attrs must be the same on every object.
    - "no_conflicts": attrs from all objects are combined, any that have the
      same name must also have the same value.
    - "override": skip comparing and copy attrs from the first dataset to
      the result.
```

It would be nice to have an option to combine attrs from all objects like "no_conflicts", but that drops attributes with conflicting values rather than raising an error. We might call this combine_attrs="drop_conflicts" or combine_attrs="matching".

This is similar to how xarray currently handles conflicting values for DataArray.name and would be more suitable to consider for the default behavior of merge and other functions/methods that merge coordinates (e.g., apply_ufunc, concat, where, binary arithmetic).
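A minimal sketch of the proposed combination rule on plain dictionaries (the name and behavior are what this issue proposes, not existing API):

```python
def combine_attrs_drop_conflicts(all_attrs):
    result, dropped = {}, set()
    for attrs in all_attrs:
        for key, value in attrs.items():
            if key in dropped:
                continue
            if key in result and result[key] != value:
                del result[key]        # conflicting values: silently drop the key
                dropped.add(key)
            else:
                result[key] = value
    return result

combine_attrs_drop_conflicts([{"units": "m", "title": "a"}, {"units": "m", "title": "b"}])
# -> {'units': 'm'}
```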

cc @keewis

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4749/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
264098632 MDU6SXNzdWUyNjQwOTg2MzI= 1618 apply_raw() for a simpler version of apply_ufunc() shoyer 1217238 open 0     4 2017-10-10T04:51:38Z 2021-01-01T17:14:43Z   MEMBER      

apply_raw() would work like apply_ufunc(), but without the hard to understand broadcasting behavior and core dimensions.

The rule for apply_raw() would be that it directly unwraps its arguments and passes them on to the wrapped function, without any broadcasting. We would also include a dim argument that is automatically converted into the appropriate axis argument when calling the wrapped function.

Output dimensions would be determined from a simple rule of some sort:

  • Default output dimensions would either be copied from the first argument, or would take on the ordered union of all input dimensions.
  • Custom dimensions could either be set by adding a drop_dims argument (like dask.array.map_blocks), or require an explicit override output_dims.

This also could be suitable for defining as a method instead of a separate function. See https://github.com/pydata/xarray/issues/1251 and https://github.com/pydata/xarray/issues/1130 for related issues.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1618/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
269700511 MDU6SXNzdWUyNjk3MDA1MTE= 1672 Append along an unlimited dimension to an existing netCDF file shoyer 1217238 open 0     8 2017-10-30T18:09:54Z 2020-11-29T17:35:04Z   MEMBER      

This would be a nice feature to have for some use cases, e.g., for writing simulation time-steps: https://stackoverflow.com/questions/46951981/create-and-write-xarray-dataarray-to-netcdf-in-chunks

It should be relatively straightforward to add, too, building on support for writing files with unlimited dimensions. User facing API would probably be a new keyword argument to to_netcdf(), e.g., extend='time' to indicate the extended dimension.
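The proposed call might look like ds_step.to_netcdf('out.nc', mode='a', extend='time') (the extend keyword does not exist today). For comparison, a sketch of how appending along an unlimited dimension can be done now by dropping to the netCDF4 library:

```python
import netCDF4
import numpy as np
import xarray as xr

# write the initial time steps, marking "time" as unlimited
xr.Dataset({"t": ("time", np.arange(3.0))}).to_netcdf(
    "out.nc", unlimited_dims=["time"])

# append three more steps by assigning past the current end of the dimension
with netCDF4.Dataset("out.nc", "a") as nc:
    n = nc.dimensions["time"].size
    nc.variables["t"][n:n + 3] = np.arange(3.0, 6.0)
```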

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1672/reactions",
    "total_count": 21,
    "+1": 21,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
314444743 MDU6SXNzdWUzMTQ0NDQ3NDM= 2059 How should xarray serialize bytes/unicode strings across Python/netCDF versions? shoyer 1217238 open 0     5 2018-04-15T19:36:55Z 2020-11-19T10:08:16Z   MEMBER      

netCDF string types

We have several options for storing strings in netCDF files:

  • NC_CHAR: netCDF's legacy character type. The closest match is NumPy's 'S1' dtype. In principle, it's supposed to be able to store arbitrary bytes. On HDF5, it uses a UTF-8 encoded string with a fixed size of 1 (but note that HDF5 does not complain about storing arbitrary bytes).
  • NC_STRING: netCDF's newer variable-length string type. It's only available on netCDF4 (not netCDF3). It corresponds to an HDF5 variable-length string with UTF-8 encoding.
  • NC_CHAR with an _Encoding attribute: xarray and netCDF4-Python support an ad-hoc convention for storing unicode strings in NC_CHAR data-types, by adding an attribute {'_Encoding': 'UTF-8'}. The data is still stored as fixed-width strings, but xarray (and netCDF4-Python) can decode them as unicode.

NC_STRING would seem like a clear win in cases where it's supported, but as @crusaderky points out in https://github.com/pydata/xarray/issues/2040, it actually results in much larger netCDF files in many cases than using character arrays, which are more easily compressed. Nonetheless, we currently default to storing unicode strings in NC_STRING, because it's the most portable option -- every tool that handles HDF5 and netCDF4 should be able to read it properly as unicode strings.

NumPy/Python string types

On the Python side, our options are perhaps even more confusing:

  • NumPy's dtype=np.string_ corresponds to fixed-length bytes. This is the default dtype for strings on Python 2, because on Python 2 strings are the same as bytes.
  • NumPy's dtype=np.unicode_ corresponds to fixed-length unicode. This is the default dtype for strings on Python 3, because on Python 3 strings are the same as unicode.
  • Strings are also commonly stored in numpy arrays with dtype=np.object_, as arrays of either bytes or unicode objects. This is a pragmatic choice, because otherwise NumPy has no support for variable length strings. We also use this (like pandas) to mark missing values with np.nan.

Like pandas, we are pretty liberal with converting back and forth between fixed-length (np.string_/np.unicode_) and variable-length (object dtype) representations of strings as necessary. This works pretty well, though converting from object arrays in particular has downsides, since it cannot be done lazily with dask.

Current behavior of xarray

Currently, xarray uses the same behavior on Python 2/3. The priority was faithfully round-tripping data from a particular version of Python to netCDF and back, which the current serialization behavior achieves:

| Python version | NetCDF version | NumPy datatype | NetCDF datatype |
| --------- | ---------- | -------------- | ------------ |
| Python 2 | NETCDF3 | np.string_ / str | NC_CHAR |
| Python 2 | NETCDF4 | np.string_ / str | NC_CHAR |
| Python 3 | NETCDF3 | np.string_ / bytes | NC_CHAR |
| Python 3 | NETCDF4 | np.string_ / bytes | NC_CHAR |
| Python 2 | NETCDF3 | np.unicode_ / unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | np.unicode_ / unicode | NC_STRING |
| Python 3 | NETCDF3 | np.unicode_ / str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | np.unicode_ / str | NC_STRING |
| Python 2 | NETCDF3 | object bytes/str | NC_CHAR |
| Python 2 | NETCDF4 | object bytes/str | NC_CHAR |
| Python 3 | NETCDF3 | object bytes | NC_CHAR |
| Python 3 | NETCDF4 | object bytes | NC_CHAR |
| Python 2 | NETCDF3 | object unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | object unicode | NC_STRING |
| Python 3 | NETCDF3 | object unicode/str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | object unicode/str | NC_STRING |

This can also be selected explicitly for most data-types by setting dtype in encoding:

  • 'S1' for NC_CHAR (with or without encoding)
  • str for NC_STRING (though I'm not 100% sure it works properly currently when given bytes)

Script for generating table:

```python
from __future__ import print_function
import xarray as xr
import uuid
import netCDF4
import numpy as np
import sys

for dtype_name, value in [
    ('np.string_ / ' + type(b'').__name__, np.array([b'abc'])),
    ('np.unicode_ / ' + type(u'').__name__, np.array([u'abc'])),
    ('object bytes/' + type(b'').__name__, np.array([b'abc'], dtype=object)),
    ('object unicode/' + type(u'').__name__, np.array([u'abc'], dtype=object)),
]:
    for format in ['NETCDF3_64BIT', 'NETCDF4']:
        filename = str(uuid.uuid4()) + '.nc'
        xr.Dataset({'data': value}).to_netcdf(filename, format=format)
        with netCDF4.Dataset(filename) as f:
            var = f.variables['data']
            disk_dtype = var.dtype
            has_encoding = hasattr(var, '_Encoding')
        disk_dtype_name = (('NC_CHAR' if disk_dtype == 'S1' else 'NC_STRING')
                           + (' with UTF-8 encoding' if has_encoding else ''))
        print('|', 'Python %i' % sys.version_info[0], '|', format[:7], '|',
              dtype_name, '|', disk_dtype_name, '|')
```

Potential alternatives

The main option I'm considering is switching to default to NC_CHAR with UTF-8 encoding for np.string_ / str and object bytes/str on Python 2. The current behavior could be explicitly toggled by setting an encoding of {'_Encoding': None}.

This would imply two changes:

  1. Attempting to serialize arbitrary bytes (on Python 2) would start raising an error -- anything that isn't ASCII would require explicitly disabling _Encoding.
  2. Strings read back from disk on Python 2 would come back as unicode instead of bytes.

This implicit conversion would be consistent with Python 2's general handling of bytes/unicode, and facilitate reading netCDF files on Python 3 that were written with Python 2.

The counter-argument is that it may not be worth changing this at this late point, given that we will be sunsetting Python 2 support by year's end.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2059/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
613012939 MDExOlB1bGxSZXF1ZXN0NDEzODQ3NzU0 4035 Support parallel writes to regions of zarr stores shoyer 1217238 closed 0     17 2020-05-06T02:40:19Z 2020-11-04T06:19:01Z 2020-11-04T06:19:01Z MEMBER   0 pydata/xarray/pulls/4035

This PR adds support for a region keyword argument to to_zarr(), to support parallel writes to different parts of arrays in a zarr stores, e.g., ds.to_zarr(..., region={'x': slice(1000, 2000)}) to write a dataset over the range 1000:2000 along the x dimension.

This is useful for creating large Zarr datasets without requiring dask. For example, the separate workers in a simulation job might each write a single non-overlapping chunk of a Zarr file. The standard way to handle such datasets today is to first write netCDF files in each process, and then consolidate them afterwards with dask (see #3096).

Creating empty Zarr stores

In order to do so, the Zarr file must be pre-existing with desired variables in the right shapes/chunks. It is desirable to be able to create such stores without actually writing data, because datasets that we want to write in parallel may be very large.

In the example below, I achieve this by filling a Dataset with dask arrays and passing compute=False to to_zarr(). This works, but it relies on an undocumented implementation detail of the compute argument. We should either:

  1. Officially document that the compute argument only controls writing array values, not metadata (at least for zarr).
  2. Add a new keyword argument or entire new method for creating an unfilled Zarr store, e.g., write_values=False.

I think (1) is maybe the cleanest option (no extra API endpoints).

Unchunked variables

One potential gotcha concerns coordinate arrays that are not chunked, e.g., consider parallel writing of a dataset divided along time with 2D latitude and longitude arrays that are fixed over all chunks. With the current PR, such coordinate arrays would get rewritten by each separate writer.

If a Zarr store does not have atomic writes, then conceivably this could result in corrupted data. The default DirectoryStore has atomic writes and cloud based object stores should also be atomic, so perhaps this doesn't matter in practice, but at the very least it's inefficient and could cause issues for large-scale jobs due to resource contention.

Options include:

  1. Current behavior. Variables whose dimensions do not overlap with region are written by to_zarr(). This is likely the most intuitive behavior for writing from a single process at a time.
  2. Exclude variables whose dimensions do not overlap with region from being written. This is likely the most convenient behavior for writing from multiple processes at once.
  3. Like (2), but issue a warning if any such variables exist instead of silently dropping them.
  4. Like (2), but raise an error instead of a warning. Require the user to explicitly drop them with .drop(). This is probably the safest behavior.

I think (4) would be my preferred option. Some users would undoubtedly find this annoying, but the power-users for whom we are adding this feature would likely appreciate it.

Usage example

```python
import xarray
import dask.array as da

ds = xarray.Dataset({'u': (('x',), da.arange(1000, chunks=100))})

# create the new zarr store, but don't write data
path = 'my-data.zarr'
ds.to_zarr(path, compute=False)

# look at the unwritten data
ds_opened = xarray.open_zarr(path)
print('Data before writing:', ds_opened.u.data[::100].compute())
# Data before writing: [ 1 100 1 100 100 1 1 1 1 1]

# write out each slice (could be in separate processes)
for start in range(0, 1000, 100):
    selection = {'x': slice(start, start + 100)}
    ds.isel(selection).to_zarr(path, region=selection)

print('Data after writing:', ds_opened.u.data[::100].compute())
# Data after writing: [ 0 100 200 300 400 500 600 700 800 900]
```

  • [x] Closes https://github.com/pydata/xarray/issues/3096
  • [x] Integration test
  • [x] Unit tests
  • [x] Passes isort -rc . && black . && mypy . && flake8
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4035/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
124809636 MDU6SXNzdWUxMjQ4MDk2MzY= 703 Document xray internals / advanced API shoyer 1217238 closed 0     2 2016-01-04T18:12:30Z 2020-11-03T17:33:32Z 2020-11-03T17:33:32Z MEMBER      

It would be useful to document the internal Variable class and the internal structure of Dataset and DataArray. This would be helpful for both new contributors and expert users, who might find Variable helpful as an advanced API.

I had some notes in an earlier version of the docs that could be adapted. Note, however, that the internal structure of DataArray changed in #648: http://xray.readthedocs.org/en/v0.2/tutorial.html#notes-on-xray-s-internals

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/703/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
715374721 MDU6SXNzdWU3MTUzNzQ3MjE= 4490 Group together decoding options into a single argument shoyer 1217238 open 0     6 2020-10-06T06:15:18Z 2020-10-29T04:07:46Z   MEMBER      

Is your feature request related to a problem? Please describe.

open_dataset() currently has a very long function signature. This makes it hard to keep track of everything it can do, and is particularly problematic for the authors of new backends (e.g., see https://github.com/pydata/xarray/pull/4477), which might need to know how to handle all these arguments.

Describe the solution you'd like

To simplify the interface, I propose to group together all the decoding options into a new DecodingOptions class. I'm thinking something like:

```python
from dataclasses import dataclass, field, asdict
from typing import Optional, List

@dataclass(frozen=True)
class DecodingOptions:
    mask: Optional[bool] = None
    scale: Optional[bool] = None
    datetime: Optional[bool] = None
    timedelta: Optional[bool] = None
    use_cftime: Optional[bool] = None
    concat_characters: Optional[bool] = None
    coords: Optional[bool] = None
    drop_variables: Optional[List[str]] = None

    @classmethod
    def disabled(cls):
        return cls(mask=False, scale=False, datetime=False, timedelta=False,
                   concat_characters=False, coords=False)

    def non_defaults(self):
        return {k: v for k, v in asdict(self).items() if v is not None}

    # add another method for creating default Variable Coder() objects,
    # e.g., those listed in encode_cf_variable()
```

The signature of open_dataset would then become:

```python
def open_dataset(
    filename_or_obj,
    group=None,
    *,
    engine=None,
    chunks=None,
    lock=None,
    cache=None,
    backend_kwargs=None,
    decode: Union[DecodingOptions, bool] = None,
    **deprecated_kwargs
):
    if decode is None:
        decode = DecodingOptions()
    if decode is False:
        decode = DecodingOptions.disabled()
    # handle deprecated_kwargs...
    ...
```

Question: are decode and DecodingOptions the right names? Maybe these should still include the name "CF", e.g., decode_cf and CFDecodingOptions, given that these are specific to CF conventions?

Note: the current signature is open_dataset(filename_or_obj, group=None, decode_cf=True, mask_and_scale=None, decode_times=True, autoclose=None, concat_characters=True, decode_coords=True, engine=None, chunks=None, lock=None, cache=None, drop_variables=None, backend_kwargs=None, use_cftime=None, decode_timedelta=None)

Usage with the new interface would look like xr.open_dataset(filename, decode=False) or xr.open_dataset(filename, decode=xr.DecodingOptions(mask=False, scale=False)).

This requires a little bit more typing than what we currently have, but it has a few advantages:

  1. It's easier to understand the role of different arguments. Now there is a function with ~8 arguments and a class with ~8 arguments rather than a function with ~15 arguments.
  2. It's easier to add new decoding arguments (e.g., for more advanced CF conventions), because they don't clutter the open_dataset interface. For example, I separated out mask and scale arguments, versus the current mask_and_scale argument.
  3. If a new backend plugin for open_dataset() needs to handle every option supported by open_dataset(), this makes that task significantly easier. The only decoding options they need to worry about are non-default options that were explicitly set, i.e., those exposed by the non_defaults() method. If another decoding option wasn't explicitly set and isn't recognized by the backend, they can just ignore it.

Describe alternatives you've considered

For the overall approach:

  1. We could keep the current design, with separate keyword arguments for decoding options, and just be very careful about passing around these arguments. This seems pretty painful for the backend refactor, though.
  2. We could keep the current design only for the user facing open_dataset() interface, and then internally convert into the DecodingOptions() struct for passing to backend constructors. This would provide much needed flexibility for backend authors, but most users wouldn't benefit from the new interface. Perhaps this would make sense as an intermediate step?
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4490/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
718492237 MDExOlB1bGxSZXF1ZXN0NTAwODc5MTY3 4500 Add variable/attribute names to netCDF validation errors shoyer 1217238 closed 0     1 2020-10-10T00:47:18Z 2020-10-10T05:28:08Z 2020-10-10T05:28:08Z MEMBER   0 pydata/xarray/pulls/4500

This should result in a better user experience, e.g., specifically pointing out the attribute with an invalid value.

  • [x] Tests added
  • [x] Passes isort . && black . && mypy . && flake8
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4500/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
169274464 MDU6SXNzdWUxNjkyNzQ0NjQ= 939 Consider how to deal with the proliferation of decoder options on open_dataset shoyer 1217238 closed 0     8 2016-08-04T01:57:26Z 2020-10-06T15:39:11Z 2020-10-06T15:39:11Z MEMBER      

There are already lots of keyword arguments, and users want even more! (#843)

Maybe we should use some sort of object to encapsulate desired options?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/939/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
253107677 MDU6SXNzdWUyNTMxMDc2Nzc= 1527 Binary operations with ds.groupby('time.dayofyear') errors out, but ds.groupby('time.month') works shoyer 1217238 open 0     10 2017-08-26T16:54:53Z 2020-09-29T10:05:42Z   MEMBER      

Reported on the mailing list:

Original datasets:

```
ds_xr
<xarray.DataArray (time: 12775)>
array([-0.01, -0.01, -0.01, ..., -0.27, -0.27, -0.27])
Coordinates:
  * time     (time) datetime64[ns] 1979-01-01 1979-01-02 1979-01-03 ...

slope_itcp_ds
<xarray.Dataset>
Dimensions:                    (lat: 73, level: 2, lon: 144, time: 366)
Coordinates:
  * lon                        (lon) float32 0.0 2.5 5.0 7.5 10.0 12.5 ...
  * lat                        (lat) float32 90.0 87.5 85.0 82.5 80.0 ...
  * level                      (level) float64 0.0 1.0
  * time                       (time) datetime64[ns] 2010-01-01 ...
Data variables:
    xarray_dataarray_variable  (time, level, lat, lon) float64 -0.8795 ...
Attributes:
    CDI:          Climate Data Interface version 1.7.1 (http://mpimet.mpg.de/...
    Conventions:  CF-1.4
    history:      Fri Aug 25 18:55:50 2017: cdo -inttime,2010-01-01,00:00:00,...
    CDO:          Climate Data Operators version 1.7.1 (http://mpimet.mpg.de/...
```

Issue: Grouping by month works and outputs this:

```
ds_xr.groupby('time.month') - slope_itcp_ds.groupby('time.month').mean('time')
<xarray.Dataset>
Dimensions:                    (lat: 73, level: 2, lon: 144, time: 12775)
Coordinates:
  * lon                        (lon) float32 0.0 2.5 5.0 7.5 10.0 12.5 ...
  * lat                        (lat) float32 90.0 87.5 85.0 82.5 80.0 ...
  * level                      (level) float64 0.0 1.0
    month                      (time) int64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
  * time                       (time) datetime64[ns] 1979-01-01 ...
Data variables:
    xarray_dataarray_variable  (time, level, lat, lon) float64 1.015 ...
```

Grouping by dayofyear doesn't work and gives this traceback:

```
ds_xr.groupby('time.dayofyear') - slope_itcp_ds.groupby('time.dayofyear').mean('time')

KeyError                                  Traceback (most recent call last)
<ipython-input-10-01c0cf4c980a> in <module>()
----> 1 ds_xr.groupby('time.dayofyear') - slope_itcp_ds.groupby('time.dayofyear').mean('time')

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/groupby.py in func(self, other)
    316             g = f if not reflexive else lambda x, y: f(y, x)
    317             applied = self._yield_binary_applied(g, other)
--> 318             combined = self._combine(applied)
    319             return combined
    320         return func

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/groupby.py in _combine(self, applied, shortcut)
    532             combined = self._concat_shortcut(applied, dim, positions)
    533         else:
--> 534             combined = concat(applied, dim)
    535             combined = _maybe_reorder(combined, dim, positions)
    536

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in concat(objs, dim, data_vars, coords, compat, positions, indexers, mode, concat_over)
    118         raise TypeError('can only concatenate xarray Dataset and DataArray '
    119                         'objects, got %s' % type(first_obj))
--> 120     return f(objs, dim, data_vars, coords, compat, positions)
    121
    122

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in _dataset_concat(datasets, dim, data_vars, coords, compat, positions)
    210     datasets = align(*datasets, join='outer', copy=False, exclude=[dim])
    211
--> 212     concat_over = _calc_concat_over(datasets, dim, data_vars, coords)
    213
    214     def insert_result_variable(k, v):

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in _calc_concat_over(datasets, dim, data_vars, coords)
    190                            if dim in v.dims)
    191     concat_over.update(process_subset_opt(data_vars, 'data_vars'))
--> 192     concat_over.update(process_subset_opt(coords, 'coords'))
    193     if dim in datasets[0]:
    194         concat_over.add(dim)

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in process_subset_opt(opt, subset)
    165                            for ds in datasets[1:])
    166             # all nonindexes that are not the same in each dataset
--> 167             concat_new = set(k for k in getattr(datasets[0], subset)
    168                              if k not in concat_over and differs(k))
    169         elif opt == 'all':

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in <genexpr>(.0)
    166             # all nonindexes that are not the same in each dataset
    167             concat_new = set(k for k in getattr(datasets[0], subset)
--> 168                              if k not in concat_over and differs(k))
    169         elif opt == 'all':
    170             concat_new = (set(getattr(datasets[0], subset)) -

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in differs(vname)
    163                 v = datasets[0].variables[vname]
    164                 return any(not ds.variables[vname].equals(v)
--> 165                            for ds in datasets[1:])
    166             # all nonindexes that are not the same in each dataset
    167             concat_new = set(k for k in getattr(datasets[0], subset)

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in <genexpr>(.0)
    163                 v = datasets[0].variables[vname]
    164                 return any(not ds.variables[vname].equals(v)
--> 165                            for ds in datasets[1:])
    166             # all nonindexes that are not the same in each dataset
    167             concat_new = set(k for k in getattr(datasets[0], subset)

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/utils.py in __getitem__(self, key)
    288
    289     def __getitem__(self, key):
--> 290         return self.mapping[key]
    291
    292     def __iter__(self):

KeyError: 'lon'
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1527/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
644821435 MDU6SXNzdWU2NDQ4MjE0MzU= 4176 Pre-expand data and attributes in DataArray/Variable HTML repr? shoyer 1217238 closed 0     7 2020-06-24T18:22:35Z 2020-09-21T20:10:26Z 2020-06-28T17:03:40Z MEMBER      

Proposal

Given that a major purpose of displaying an array is to look at its data or attributes, I wonder if we should expand these sections by default?

  • I worry that clicking on icons to expand sections may not be easy to discover.
  • This would also be consistent with the text repr, which shows these sections by default (the Dataset repr is already consistent between text and HTML).

Context

Currently the HTML repr for DataArray/Variable looks like this:

To see array data, you have to click on the icon:

(thanks to @max-sixty for making this a little bit more manageably sized in https://github.com/pydata/xarray/pull/3905!)

There's also a really nice repr for nested dask arrays:

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4176/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
702372014 MDExOlB1bGxSZXF1ZXN0NDg3NjYxMzIz 4426 Fix for h5py deepcopy issues shoyer 1217238 closed 0     6 2020-09-16T01:11:00Z 2020-09-18T22:31:13Z 2020-09-18T22:31:09Z MEMBER   0 pydata/xarray/pulls/4426
  • [x] Closes #4425
  • [x] Tests added
  • [x] Passes isort . && black . && mypy . && flake8
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4426/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
669307837 MDExOlB1bGxSZXF1ZXN0NDU5Njk1NDA5 4292 Fix indexing with datetime64[ns] with pandas=1.1 shoyer 1217238 closed 0     11 2020-07-31T00:48:50Z 2020-09-16T03:11:48Z 2020-09-16T01:33:30Z MEMBER   0 pydata/xarray/pulls/4292

Fixes #4283

The underlying issue is that calling .item() on a NumPy array with dtype=datetime64[ns] returns an integer, rather than an np.datetime64 scalar. This is somewhat baffling but works this way because .item() returns native Python types, but datetime.datetime doesn't support nanosecond precision.

pandas.Index.get_loc used to support these integers, but now is more strict. Hence we get errors.

We can fix this by using array[()] to convert 0d arrays into NumPy scalars instead of calling array.item().
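
For illustration, a quick sketch of the difference (this is standard NumPy behavior, not code from this PR):

```python
import numpy as np

value = np.array("2000-01-01", dtype="datetime64[ns]")
print(type(value.item()))  # <class 'int'>: nanoseconds since the epoch
print(type(value[()]))     # <class 'numpy.datetime64'>: keeps the scalar type
```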

I've added a crude regression test. There may well be a better way to test this but I haven't figured it out yet.

  • [x] Tests added
  • [x] Passes isort . && black . && mypy . && flake8
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4292/reactions",
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
417542619 MDU6SXNzdWU0MTc1NDI2MTk= 2803 Test failure with TestValidateAttrs.test_validating_attrs shoyer 1217238 closed 0     6 2019-03-05T23:03:02Z 2020-08-25T14:29:19Z 2019-03-14T15:59:13Z MEMBER      

This is due to setting multi-dimensional attributes being an error, as of the latest netCDF4-Python release: https://github.com/Unidata/netcdf4-python/blob/master/Changelog
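
A minimal sketch of the failure mode (reconstructed from the test below, not copied from the original report):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'data': ('y', np.arange(10.0))}, {'y': np.arange(10)})
ds.attrs['test'] = np.arange(12).reshape(3, 4)  # multi-dimensional attribute
ds.to_netcdf('test.nc')  # recent netCDF4-Python raises ValueError here
```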

E.g., as seen on Appveyor: https://ci.appveyor.com/project/shoyer/xray/builds/22834250/job/9q0ip6i3cchlbkw2

```
================================== FAILURES ===================================
_________________ TestValidateAttrs.test_validating_attrs _____________________
self = <xarray.tests.test_backends.TestValidateAttrs object at 0x00000096BE5FAFD0>
    def test_validating_attrs(self):
        def new_dataset():
            return Dataset({'data': ('y', np.arange(10.0))}, {'y': np.arange(10)})

    def new_dataset_and_dataset_attrs():
        ds = new_dataset()
        return ds, ds.attrs

    def new_dataset_and_data_attrs():
        ds = new_dataset()
        return ds, ds.data.attrs

    def new_dataset_and_coord_attrs():
        ds = new_dataset()
        return ds, ds.coords['y'].attrs

    for new_dataset_and_attrs in [new_dataset_and_dataset_attrs,
                                  new_dataset_and_data_attrs,
                                  new_dataset_and_coord_attrs]:
        ds, attrs = new_dataset_and_attrs()

        attrs[123] = 'test'
        with raises_regex(TypeError, 'Invalid name for attr'):
            ds.to_netcdf('test.nc')

        ds, attrs = new_dataset_and_attrs()
        attrs[MiscObject()] = 'test'
        with raises_regex(TypeError, 'Invalid name for attr'):
            ds.to_netcdf('test.nc')

        ds, attrs = new_dataset_and_attrs()
        attrs[''] = 'test'
        with raises_regex(ValueError, 'Invalid name for attr'):
            ds.to_netcdf('test.nc')

        # This one should work
        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = 'test'
        with create_tmp_file() as tmp_file:
            ds.to_netcdf(tmp_file)

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = {'a': 5}
        with raises_regex(TypeError, 'Invalid value for attr'):
            ds.to_netcdf('test.nc')

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = MiscObject()
        with raises_regex(TypeError, 'Invalid value for attr'):
            ds.to_netcdf('test.nc')

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = 5
        with create_tmp_file() as tmp_file:
            ds.to_netcdf(tmp_file)

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = 3.14
        with create_tmp_file() as tmp_file:
            ds.to_netcdf(tmp_file)

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = [1, 2, 3, 4]
        with create_tmp_file() as tmp_file:
            ds.to_netcdf(tmp_file)

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = (1.9, 2.5)
        with create_tmp_file() as tmp_file:
            ds.to_netcdf(tmp_file)

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = np.arange(5)
        with create_tmp_file() as tmp_file:
            ds.to_netcdf(tmp_file)

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = np.arange(12).reshape(3, 4)
        with create_tmp_file() as tmp_file:
          ds.to_netcdf(tmp_file)

xarray\tests\test_backends.py:3450:


xarray\core\dataset.py:1323: in to_netcdf
    compute=compute)
xarray\backends\api.py:767: in to_netcdf
    unlimited_dims=unlimited_dims)
xarray\backends\api.py:810: in dump_to_store
    unlimited_dims=unlimited_dims)
xarray\backends\common.py:262: in store
    self.set_attributes(attributes)
xarray\backends\common.py:278: in set_attributes
    self.set_attribute(k, v)
xarray\backends\netCDF4_.py:418: in set_attribute
    _set_nc_attribute(self.ds, key, value)
xarray\backends\netCDF4_.py:294: in _set_nc_attribute
    obj.setncattr(key, value)
netCDF4\_netCDF4.pyx:2781: in netCDF4._netCDF4.Dataset.setncattr
    ???

    ???
E   ValueError: multi-dimensional array attributes not supported

netCDF4\_netCDF4.pyx:1514: ValueError
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2803/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
676306518 MDU6SXNzdWU2NzYzMDY1MTg= 4331 Support explicitly setting a dimension order with to_dataframe() shoyer 1217238 closed 0     0 2020-08-10T17:45:17Z 2020-08-14T18:28:26Z 2020-08-14T18:28:26Z MEMBER      

As discussed in https://github.com/pydata/xarray/issues/2346, it would be nice to support explicitly setting the desired order of dimensions when calling Dataset.to_dataframe() or DataArray.to_dataframe().

There is nice precedent for this in the to_dask_dataframe method: http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_dask_dataframe.html

I imagine we could copy the exact same API for `to_dataframe()`.
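
A sketch of what that could look like, assuming we mirror to_dask_dataframe's dim_order argument (dim_order on to_dataframe is the proposal here, not an existing parameter):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"z": (("x", "y"), np.arange(6).reshape(2, 3))})
# proposed: control which dimension varies fastest in the resulting MultiIndex
df = ds.to_dataframe(dim_order=["y", "x"])
```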

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4331/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
671019427 MDU6SXNzdWU2NzEwMTk0Mjc= 4295 We shouldn't require a recent version of setuptools to install xarray shoyer 1217238 closed 0     33 2020-08-01T16:49:57Z 2020-08-14T09:52:42Z 2020-08-14T09:52:42Z MEMBER      

@canol reports on our mailing list that our setuptools 41.2 (released 21 August 2019) install requirement is making it hard to install recent versions of xarray at his company: https://groups.google.com/g/xarray/c/HS_xcZDEEtA/m/GGmW-3eMCAAJ

Hello, this is just a feedback about an issue we experienced which caused our internal tools stack to stay with xarray 0.15 version instead of a newer versions.

We are a company using xarray in our internal frameworks and at the beginning we didn't have any restrictions on xarray version in our requirements file, so that new installations of our framework were using the latest version of xarray. But a few months ago we started to hear complaints from users who were having problems with installing our framework and the installation was failing because of xarray's requirement to use at least setuptools 41.2 which is released on 21th of August last year. So it hasn't been a year since it got released which might be considered relatively new.

During the installation of our framework, pip was failing to update setuptools by saying that some other process is already using setuptools files so it cannot update setuptools. The people who are using our framework are not software developers so they didn't know how to solve this problem and it became so overwhelming for us maintainers that we set the xarray requirement to version >=0.15 <0.16. We also share our internal framework with customers of our company so we didn't want to bother the customers with any potential problems.

You can see some other people having having similar problem when trying to update setuptools here (although not related to xarray): https://stackoverflow.com/questions/49338652/pip-install-u-setuptools-fail-windows-10

It is not a big deal but I just wanted to give this as a feedback. I don't know how much xarray depends on setuptools' 41.2 version.

I was surprised to see this in our setup.cfg file, added by @crusaderky in #3628. The version requirement is not documented in our docs.

Given that setuptools may be challenging to upgrade, would it be possible to relax this version requirement?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4295/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
638597800 MDExOlB1bGxSZXF1ZXN0NDM0MzMxNzQ3 4154 Update issue templates inspired/based on dask shoyer 1217238 closed 0     1 2020-06-15T07:00:53Z 2020-08-05T13:05:33Z 2020-06-17T16:50:57Z MEMBER   0 pydata/xarray/pulls/4154

See https://github.com/dask/dask/issues/new/choose for an approximate example of what this looks like.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4154/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
290593053 MDU6SXNzdWUyOTA1OTMwNTM= 1850 xarray contrib module shoyer 1217238 closed 0     25 2018-01-22T19:50:08Z 2020-07-23T16:34:10Z 2020-07-23T16:34:10Z MEMBER      

Over in #1288 @nbren12 wrote:

Overall, I think the xarray community could really benefit from some kind of centralized contrib package which has a low barrier to entry for these kinds of functions.

Yes, I agree that we should explore this. There are a lot of interesting projects building on xarray now but not great ways to discover them.

Are there other open source projects with a good model we should copy here?

  • Scikit-Learn has a separate GitHub org/repositories for contrib projects: https://github.com/scikit-learn-contrib.
  • TensorFlow has a contrib module within the TensorFlow namespace: tensorflow.contrib

This gives us two different models to consider. The first "separate repository" model might be easier/more flexible from a maintenance perspective. Any preferences/thoughts?

There's also some nice overlap with the Pangeo project.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1850/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
646073396 MDExOlB1bGxSZXF1ZXN0NDQwNDMxNjk5 4184 Improve the speed of from_dataframe with a MultiIndex (by 40x!) shoyer 1217238 closed 0     1 2020-06-26T07:39:14Z 2020-07-02T20:39:02Z 2020-07-02T20:39:02Z MEMBER   0 pydata/xarray/pulls/4184

Before:

pandas.MultiIndexSeries.time_to_xarray
======= ========= ==========
--             subset
------- --------------------
dtype     True     False
======= ========= ==========
  int    505±0ms   37.1±0ms
 float   485±0ms   38.3±0ms
======= ========= ==========

After:

pandas.MultiIndexSeries.time_to_xarray
======= ============ ==========
--               subset
------- -----------------------
dtype      True       False
======= ============ ==========
  int    10.7±0.4ms   22.6±1ms
 float   10.0±0.8ms   21.1±1ms
======= ============ ==========

~~There are still some cases where we have to fall back to the existing slow implementation, but hopefully they should now be relatively rare.~~ Edit: now we always use the new implementation

  • [x] Closes #2459, closes #4186
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [x] Passes isort -rc . && black . && mypy . && flake8
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4184/reactions",
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 1,
    "eyes": 0
}
    xarray 13221727 pull
645961347 MDExOlB1bGxSZXF1ZXN0NDQwMzQ2NTQz 4182 Show data by default in HTML repr for DataArray shoyer 1217238 closed 0     0 2020-06-26T02:25:08Z 2020-06-28T17:03:41Z 2020-06-28T17:03:41Z MEMBER   0 pydata/xarray/pulls/4182
  • [x] Closes #4176
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4182/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
644170008 MDExOlB1bGxSZXF1ZXN0NDM4ODQxMjk2 4171 Remove <pre> from nested HTML repr shoyer 1217238 closed 0     0 2020-06-23T21:51:14Z 2020-06-24T15:45:20Z 2020-06-24T15:45:00Z MEMBER   0 pydata/xarray/pulls/4171

Using <pre> messes up the display of nested HTML reprs, e.g., from dask. Now we only use the <pre> tag when displaying raw text reprs.

Before (Jupyter notebook):

After:

  • [x] Tests added
  • [x] Passes isort -rc . && black . && mypy . && flake8
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4171/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
613546626 MDExOlB1bGxSZXF1ZXN0NDE0MjgwMDEz 4039 Revise pull request template shoyer 1217238 closed 0     5 2020-05-06T19:08:19Z 2020-06-18T05:45:11Z 2020-06-18T05:45:10Z MEMBER   0 pydata/xarray/pulls/4039

See below for the new language, to clarify that documentation is only necessary for "user visible changes."

I added "including notable bug fixes" to indicate that minor bug fixes may not be worth noting (I was thinking of test-suite only fixes in this category) but perhaps that is too confusing.

cc @pydata/xarray for opinions!

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] Passes isort -rc . && black . && mypy . && flake8
  • [ ] Fully documented, including whats-new.rst for user visible changes (including notable bug fixes) and api.rst for new API
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4039/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
639334065 MDExOlB1bGxSZXF1ZXN0NDM0OTQ0NTc4 4159 Test RTD's new pull request builder shoyer 1217238 closed 0     1 2020-06-16T03:06:32Z 2020-06-17T16:54:02Z 2020-06-17T16:54:02Z MEMBER   1 pydata/xarray/pulls/4159

https://docs.readthedocs.io/en/latest/guides/autobuild-docs-for-pull-requests.html

Don't merge this!

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4159/reactions",
    "total_count": 3,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 3,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
639397110 MDExOlB1bGxSZXF1ZXN0NDM0OTk1NzQz 4160 Fix failing upstream-dev build & remove docs build shoyer 1217238 closed 0     0 2020-06-16T06:08:55Z 2020-06-16T06:35:49Z 2020-06-16T06:35:44Z MEMBER   0 pydata/xarray/pulls/4160

We'll use RTD's new doc builder instead. For an example, click on "docs/readthedocs.org:xray" below or look at GH4159

  • [x] Closes https://github.com/pydata/xarray/issues/4146
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4160/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
35682274 MDU6SXNzdWUzNTY4MjI3NA== 158 groupby should work with name=None shoyer 1217238 closed 0     2 2014-06-13T15:38:00Z 2020-05-30T13:15:56Z 2020-05-30T13:15:56Z MEMBER      
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/158/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
612214951 MDExOlB1bGxSZXF1ZXN0NDEzMjIyOTEx 4028 Remove broken test for Panel with to_pandas() shoyer 1217238 closed 0     5 2020-05-04T22:41:42Z 2020-05-06T01:50:21Z 2020-05-06T01:50:21Z MEMBER   0 pydata/xarray/pulls/4028

We don't support creating a Panel with to_pandas() with any version of pandas at present, so this test was previously broken if pandas < 0.25 was installed.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4028/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
612772669 MDU6SXNzdWU2MTI3NzI2Njk= 4030 Doc build on Azure is timing out on master shoyer 1217238 closed 0     1 2020-05-05T17:30:16Z 2020-05-05T21:49:26Z 2020-05-05T21:49:26Z MEMBER      

I don't know what's going on, but it currently times out after 1 hour: https://dev.azure.com/xarray/xarray/_build/results?buildId=2767&view=logs&j=7e620c85-24a8-5ffa-8b1f-642bc9b1fc36&t=68484831-0a19-5145-bfe9-6309e5f7691d

Is it possible to login to Azure to debug this stuff?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4030/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
612838635 MDExOlB1bGxSZXF1ZXN0NDEzNzA3Mzgy 4032 Allow warning with cartopy in docs plotting build shoyer 1217238 closed 0     1 2020-05-05T19:25:11Z 2020-05-05T21:49:26Z 2020-05-05T21:49:26Z MEMBER   0 pydata/xarray/pulls/4032

Fixes https://github.com/pydata/xarray/issues/4030

It looks like this is triggered by the new cartopy version now being installed on RTD (version 0.17.0 -> 0.18.0).

Long term we should fix this, but for now it's better just to disable the warning.

Here's the message from RTD:

```
Exception occurred:
  File "/home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.8/site-packages/IPython/sphinxext/ipython_directive.py", line 586, in process_input
    raise RuntimeError('Non Expected warning in `{}` line {}'.format(filename, lineno))
RuntimeError: Non Expected warning in `/home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/latest/doc/plotting.rst` line 732
The full traceback has been saved in /tmp/sphinx-err-qav6jjmm.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at https://github.com/sphinx-doc/sphinx/issues. Thanks!

Warning in /home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/latest/doc/plotting.rst at block ending on line 732
Specify :okwarning: as an option in the ipython:: block to suppress this message

/home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/latest/xarray/plot/facetgrid.py:373: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations.
  self.fig.tight_layout() <<<-------------------------------------------------------------------------
```

https://readthedocs.org/projects/xray/builds/10969146/

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4032/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
612262200 MDExOlB1bGxSZXF1ZXN0NDEzMjYwNTY2 4029 Support overriding existing variables in to_zarr() without appending shoyer 1217238 closed 0     2 2020-05-05T01:06:40Z 2020-05-05T19:28:02Z 2020-05-05T19:28:02Z MEMBER   0 pydata/xarray/pulls/4029

This is nice for consistency with to_netcdf. It should be useful for cases where users want to update values in existing Zarr datasets.
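
A sketch of the intended workflow, assuming (as in this change) that mode="a" without an append_dim overwrites variables that already exist in the store:

```python
import xarray as xr

ds = xr.Dataset({"temperature": ("x", [1.0, 2.0, 3.0])})
ds.to_zarr("example.zarr", mode="w")         # initial write

(ds + 10).to_zarr("example.zarr", mode="a")  # update values of an existing variable
```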

  • [x] Tests added
  • [x] Passes isort -rc . && black . && mypy . && flake8
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4029/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
187625917 MDExOlB1bGxSZXF1ZXN0OTI1MjQzMjg= 1087 WIP: New DataStore / Encoder / Decoder API for review shoyer 1217238 closed 0     8 2016-11-07T05:02:04Z 2020-04-17T18:37:45Z 2020-04-17T18:37:45Z MEMBER   0 pydata/xarray/pulls/1087

The goal here is to make something extensible that we can live with for quite some time, and to clean up the internals of xarray's backend interface.

Most of these are analogues of existing xarray classes with a cleaned up interface. I have not yet worried about backwards compatibility or tests -- I would appreciate feedback on the approach here.

Several parts of the logic exist for the sake of dask. I've included the word "dask" in comments to facilitate inspection by mrocklin.

CC @rabernat, @pwolfram, @jhamman, @mrocklin -- for review

CC @mcgibbon, @JoyMonteiro -- this is relevant to our discussion today about adding support for appending to netCDF files. Don't let this stop you from getting started on that with the existing interface, though.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1087/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
598567792 MDU6SXNzdWU1OTg1Njc3OTI= 3966 HTML repr is slightly broken in Google Colab shoyer 1217238 closed 0     1 2020-04-12T20:44:51Z 2020-04-16T20:14:37Z 2020-04-16T20:14:32Z MEMBER      

The "data" toggles are pre-expanded and don't work.

See https://github.com/googlecolab/colabtools/issues/1145 for a full description.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3966/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
479434052 MDU6SXNzdWU0Nzk0MzQwNTI= 3206 DataFrame with MultiIndex -> xarray with sparse array shoyer 1217238 closed 0     1 2019-08-12T00:46:16Z 2020-04-06T20:41:26Z 2019-08-27T08:54:26Z MEMBER      

Now that we have preliminary support for sparse arrays in xarray, one really cool feature we could explore is creating sparse arrays from MultiIndexed pandas DataFrames.

Right now, xarray's methods for creating objects from pandas always create dense arrays, but the size of these dense arrays can get big really quickly if the MultiIndex is sparsely populated, e.g.,

```python
import pandas as pd
import numpy as np
import xarray

df = pd.DataFrame({
    'w': range(10),
    'x': list('abcdefghij'),
    'y': np.arange(0, 100, 10),
    'z': np.ones(10),
}).set_index(['w', 'x', 'y'])
print(xarray.Dataset.from_dataframe(df))
```

This length-10 DataFrame turned into a dense array with 1000 elements (only 10 of which are not NaN):

```
<xarray.Dataset>
Dimensions:  (w: 10, x: 10, y: 10)
Coordinates:
  * w        (w) int64 0 1 2 3 4 5 6 7 8 9
  * x        (x) object 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j'
  * y        (y) int64 0 10 20 30 40 50 60 70 80 90
Data variables:
    z        (w, x, y) float64 1.0 nan nan nan nan nan ... nan nan nan nan 1.0
```

We can imagine xarray.Dataset.from_dataframe(df, sparse=True) would make the same Dataset, but with sparse array (with a NaN fill value) instead of dense arrays.
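
Continuing the example above, a sketch of the proposed call (sparse=True is the proposal here, and would rely on the sparse package being installed):

```python
ds_sparse = xarray.Dataset.from_dataframe(df, sparse=True)
# The 'z' variable would then be backed by a sparse.COO array of shape
# (10, 10, 10) storing only the 10 observed values, rather than a dense
# array of 1000 floats that is mostly NaN.
```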

Once sparse arrays work pretty well, this could actually obviate most of the use cases for MultiIndex in arrays. Arguably the model is quite a bit cleaner.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3206/reactions",
    "total_count": 3,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 3,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
479940669 MDU6SXNzdWU0Nzk5NDA2Njk= 3212 Custom fill_value for from_dataframe/from_series shoyer 1217238 open 0     0 2019-08-13T03:22:46Z 2020-04-06T20:40:26Z   MEMBER      

It would be nice to have the option to customize the fill value when creating xarray objects from pandas, instead of requiring it to always be NaN.

This would probably be especially useful when creating sparse arrays (https://github.com/pydata/xarray/issues/3206), for which it often makes sense to use a fill value of zero. If your data has integer values (e.g., it represents counts), you probably don't want to let it be cast to float first.
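
A sketch of the kind of interface this could take (fill_value here is hypothetical, not an existing from_dataframe argument):

```python
import pandas as pd
import xarray

index = pd.MultiIndex.from_tuples([(0, "a"), (1, "b"), (2, "c")], names=["x", "y"])
df = pd.DataFrame({"counts": [1, 2, 3]}, index=index)

# hypothetical: fill unobserved (x, y) combinations with 0 instead of NaN,
# so integer counts stay integers instead of being cast to float
ds = xarray.Dataset.from_dataframe(df, fill_value=0)
```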

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3212/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
314482923 MDU6SXNzdWUzMTQ0ODI5MjM= 2061 Backend specific conventions decoding shoyer 1217238 open 0     1 2018-04-16T02:45:46Z 2020-04-05T23:42:34Z   MEMBER      

Currently, we have a single function xarray.decode_cf() that we apply to data loaded from all xarray backends.

This is appropriate for netCDF data, but it's not appropriate for backends with different implementations. For example, it doesn't work for zarr (which is why we have the separate open_zarr), and is also a poor fit for PseudoNetCDF (https://github.com/pydata/xarray/pull/1905). In the worst cases (e.g., for PseudoNetCDF) it can actually result in data being decoded twice, which can result in incorrectly scaled data.

Instead, we should declare default decoders as part of the backend API, and use those decoders as the defaults for open_dataset().
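
A purely hypothetical sketch of what "default decoders as part of the backend API" might look like; none of these names exist in xarray:

```python
# illustrative only: each backend declares which decoding steps make sense for it
DEFAULT_DECODERS = {
    "netcdf4": {"mask_and_scale", "decode_times", "decode_coords"},
    "zarr": {"mask_and_scale", "decode_times", "decode_coords"},
    "pseudonetcdf": set(),  # the underlying reader already returns decoded data
}

def decoders_for(engine, overrides=None):
    """Decoding steps open_dataset() would apply for a given backend."""
    decoders = set(DEFAULT_DECODERS[engine])
    for name, enabled in (overrides or {}).items():
        if enabled:
            decoders.add(name)
        else:
            decoders.discard(name)
    return decoders

print(decoders_for("pseudonetcdf"))                      # set()
print(decoders_for("netcdf4", {"decode_times": False}))  # no time decoding
```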

This should probably be tackled as part of the broader backends refactor: https://github.com/pydata/xarray/issues/1970

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2061/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
28376794 MDU6SXNzdWUyODM3Njc5NA== 25 Consistent rules for handling merges between variables with different attributes shoyer 1217238 closed 0     13 2014-02-26T22:37:01Z 2020-04-05T19:13:13Z 2014-09-04T06:50:49Z MEMBER      

Currently, variable attributes are checked for equality before allowing for a merge via a call to xarray_equal. It should be possible to merge datasets even if some of the variable metadata disagrees (conflicting attributes should be dropped). This is already the behavior for global attributes.

The right design of this feature should probably include some optional argument to Dataset.merge indicating how strict we want the merge to be. I can see at least three versions that could be useful:

  1. Drop conflicting metadata silently.
  2. Don't allow for conflicting values, but drop non-matching keys.
  3. Require all keys and values to match.

We can argue about which of these should be the default option. My inclination is to be as flexible as possible by using 1 or 2 in most cases.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/25/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
173612265 MDU6SXNzdWUxNzM2MTIyNjU= 988 Hooks for custom attribute handling in xarray operations shoyer 1217238 open 0     24 2016-08-27T19:48:22Z 2020-04-05T18:19:11Z   MEMBER      

Over in #964, I am working on a rewrite/unification of the guts of xarray's logic for computation with labelled data. The goal is to get all of xarray's internal logic for working with labelled data going through a minimal set of flexible functions which we can also expose as part of the API.

Because we will finally have all (or at least nearly all) xarray operations using the same code path, I think it will also finally become feasible to open up hooks allowing extensions how xarray handles metadata.

Two obvious use cases here are units (#525) and automatic maintenance of metadata (e.g., cell_methods or history fields). Both of these are out of scope for xarray itself, mostly because the specific logic tends to be domain specific. This could also subsume options like the existing keep_attrs on many operations.

I like the idea of supporting something like NumPy's __array_wrap__ to allow third-party code to finalize xarray objects in some way before they are returned. However, it's not obvious to me what the right design is.

  • Should we look up a custom attribute on subclasses like __array_wrap__ (or __numpy_ufunc__) in NumPy, or should we have a system (e.g., unilaterally or with a context manager and xarray.set_options) for registering hooks that are then checked on all xarray objects? I am inclined toward the latter, even though it's a little slower, just because it will be simpler and easier to get right.
  • Should these methods be able to control the full result objects, or only set attrs and/or name?
  • To be useful, do we need to allow extensions to take control of the full operation, to support things like automatic unit conversion? This would suggest something closer to __numpy_ufunc__, which is a little more ambitious than what I had in mind here.

Feedback would be greatly appreciated.

CC @darothen @rabernat @jhamman @pwolfram

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/988/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
29136905 MDU6SXNzdWUyOTEzNjkwNQ== 60 Implement DataArray.idxmax() shoyer 1217238 closed 0   1.0 741199 14 2014-03-10T22:03:06Z 2020-03-29T01:54:25Z 2020-03-29T01:54:25Z MEMBER      

Should match the pandas function: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.idxmax.html
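
For reference, a small sketch of the pandas behavior to mirror; the DataArray call shown in the comment is the proposed API, not something that exists yet:

```python
import pandas as pd
import xarray as xr

df = pd.DataFrame({"a": [1, 5, 3]}, index=["x", "y", "z"])
print(df.idxmax())  # a    y   -- index label of the maximum in each column

da = xr.DataArray([1, 5, 3], dims="time", coords={"time": [10, 20, 30]})
# proposed: da.idxmax("time") would return the coordinate label of the
# maximum along the given dimension, i.e. 20 here
```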

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/60/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);