
issues


48 rows where state = "open" and user = 1217238 sorted by updated_at descending

id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
2266174558 I_kwDOAMm_X86HExRe 8975 Xarray sponsorship guidelines shoyer 1217238 open 0     3 2024-04-26T17:05:01Z 2024-04-30T20:52:33Z   MEMBER      

At what level of support should Xarray acknowledge sponsors on our website?

I would like to surface this for open discussion because there are potential sponsoring organizations with conflicts of interest with members of Xarray's leadership team (e.g., Earthmover, which employs @jhamman, @rabernat and @dcherian).

My suggestion is to use NumPy's guidelines, with an adjustment down to 1/3 of the thresholds to account for the smaller size of the project:

  • $10,000/yr for unrestricted financial contributions (e.g., donations)
  • $20,000/yr for financial contributions for a particular purpose (e.g., grants)
  • $30,000/yr for in-kind contributions (e.g., time for employees to contribute)
  • 2 person-months/yr of paid work time for one or more Xarray maintainers or regular contributors to any Xarray team or activity

The NumPy guidelines also include a grace period of a minimum of 6 months for acknowledging support. I would suggest increasing this to a minimum of 1 year for Xarray.

I would greatly appreciate any feedback from members of the community, either in this issue or at the next team meeting.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8975/reactions",
    "total_count": 6,
    "+1": 5,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
271043420 MDU6SXNzdWUyNzEwNDM0MjA= 1689 Roundtrip serialization of coordinate variables with spaces in their names shoyer 1217238 open 0     5 2017-11-03T16:43:20Z 2024-03-22T14:02:48Z   MEMBER      

If coordinates have spaces in their names, they get restored from netCDF files as data variables instead:

```
xarray.open_dataset(xarray.Dataset(coords={'name with spaces': 1}).to_netcdf())
<xarray.Dataset>
Dimensions:           ()
Data variables:
    name with spaces  int32 1
```

This happens because the CF convention is to indicate coordinates as a space separated string, e.g., coordinates='latitude longitude'.

Even though these aren't CF-compliant variable names (which cannot contain spaces), it would be nice to have an ad-hoc convention for xarray that allows us to serialize/deserialize coordinates in all/most cases. Maybe we could use escape characters for spaces (e.g., coordinates='name\ with\ spaces') or quote names if they have spaces (e.g., coordinates='"name\ with\ spaces"')?

At the very least, we should issue a warning in these cases.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1689/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
842436143 MDU6SXNzdWU4NDI0MzYxNDM= 5081 Lazy indexing arrays as a stand-alone package shoyer 1217238 open 0     6 2021-03-27T07:06:03Z 2023-12-15T13:20:03Z   MEMBER      

From @rabernat on Twitter:

"Xarray has some secret private classes for lazily indexing / wrapping arrays that are so useful I think they should be broken out into a standalone package. https://github.com/pydata/xarray/blob/master/xarray/core/indexing.py#L516"

The idea here is to create a first-class "duck array" library for lazy indexing that could replace xarray's internal classes for lazy indexing. This would be in some ways similar to dask.array, but much simpler, because it doesn't have to worry about parallel computing.

Desired features:

  • Lazy indexing
  • Lazy transposes
  • Lazy concatenation (#4628) and stacking
  • Lazy vectorized operations (e.g., unary and binary arithmetic)
    • needed for decoding variables from disk (xarray.encoding) and
    • building lazy multi-dimensional coordinate arrays corresponding to map projections (#3620)
  • Maybe: lazy reshapes (#4113)

A common feature of these operations is that they can (and almost always should) be fused with indexing: if N elements are selected via indexing, only O(N) compute and memory is required to produce them, regardless of the size of the original arrays, as long as the number of applied operations can be treated as a constant. Memory access is significantly slower than compute on modern hardware, so recomputing these operations on the fly is almost always a good idea.

Out of scope: lazy computation when indexing could require access to many more elements to compute the desired value than are returned. For example, mean() probably should not be lazy, because that could involve computation of a very large number of elements that one might want to cache.

This is valuable functionality for Xarray for two reasons:

  1. It allows for "previewing" small bits of data loaded from disk or remote storage, even if that data needs some form of cheap "decoding" from its form on disk.
  2. It allows for xarray to decode data in a lazy fashion that is compatible with full-featured systems for lazy computation (e.g., Dask), without requiring the user to choose dask when reading the data.

Related issues:

  • [Proposal] Expose Variable without Pandas dependency #3981
  • Lazy concatenation of arrays #4628
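
As a rough illustration of the fused-indexing idea described above, here is a minimal sketch (the class name and details are hypothetical, not xarray's actual indexing.py machinery):

```python
import numpy as np

class LazyElementwiseArray:
    """Hypothetical duck array: wraps a base array and a cheap element-wise
    function, deferring the function until the data is indexed."""

    def __init__(self, array, func):
        self.array = array      # anything supporting __getitem__ and .shape
        self.func = func        # element-wise function, e.g. a decoder
        self.shape = array.shape

    def __getitem__(self, key):
        # Fuse the deferred operation with indexing: select first, then
        # compute, so cost is proportional to the selected elements only.
        return self.func(self.array[key])

raw = np.arange(1_000_000, dtype="int16").reshape(1000, 1000)
lazy = LazyElementwiseArray(raw, lambda x: x * 0.01 + 273.15)
print(lazy[:2, :3])  # only these 6 elements are ever decoded
```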
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5081/reactions",
    "total_count": 6,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 6,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
588105641 MDU6SXNzdWU1ODgxMDU2NDE= 3893 HTML repr in the online docs shoyer 1217238 open 0     3 2020-03-26T02:17:51Z 2023-09-11T17:41:59Z   MEMBER      

I noticed two minor issues in our online docs, now that we've switched to the hip new HTML repr by default.

  1. Most doc pages still show text, not HTML. I suspect this is a limitation of the IPython sphinx directive we use for our snippets. We might be able to fix that by switching to jupyter-sphinx?

  2. The "attributes" part of the HTML repr in our notebook examples looks a little funny, with strange blue formatting around each attribute name. It looks like part of the outer style of our docs is leaking into the HTML repr:

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3893/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1376109308 I_kwDOAMm_X85SBcL8 7045 Should Xarray stop doing automatic index-based alignment? shoyer 1217238 open 0     13 2022-09-16T15:31:03Z 2023-08-23T07:42:34Z   MEMBER      

What is your issue?

I am increasingly thinking that automatic index-based alignment in Xarray (copied from pandas) may have been a design mistake. Almost every time I work with datasets with different indexes, I find myself writing code to explicitly align them:

  1. Automatic alignment is hard to predict. The implementation is complicated, and the exact mode of automatic alignment (outer vs inner vs left join) depends on the specific operation. It's also no longer possible to predict the shape (or even the dtype) resulting from most Xarray operations purely from input shape/dtype (see the example after this list).
  2. Automatic alignment brings an unexpected performance penalty. In some domains (analytics) this is OK, but in others (e.g., numerical modeling or deep learning) this is a complete deal-breaker.
  3. Automatic alignment is not useful for float indexes, because exact matches are rare. In practice, this makes it less useful in Xarray's usual domains than it is for pandas.
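
A small illustration of the first point, with made-up values:

```python
import xarray as xr

a = xr.DataArray([1, 2, 3], dims="x", coords={"x": [0, 1, 2]})
b = xr.DataArray([10, 20, 30], dims="x", coords={"x": [1, 2, 3]})

# Arithmetic silently performs an inner join on the "x" index, so the result
# only contains the overlapping labels 1 and 2 -- the output shape cannot be
# predicted from the input shapes alone.
print((a + b).values)  # [12 23]
```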

Would it be insane to consider changing Xarray's behavior to stop doing automatic alignment? I imagine we could roll this out slowly, first with warnings and then with an option for disabling it.

If you think this is a good or bad idea, consider responding to this issue with a 👍 or 👎 reaction.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7045/reactions",
    "total_count": 13,
    "+1": 9,
    "-1": 2,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 2
}
    xarray 13221727 issue
342928718 MDExOlB1bGxSZXF1ZXN0MjAyNzE0MjUx 2302 WIP: lazy=True in apply_ufunc() shoyer 1217238 open 0     1 2018-07-20T00:01:21Z 2023-07-18T04:19:17Z   MEMBER   0 pydata/xarray/pulls/2302
  • [x] Closes https://github.com/pydata/xarray/issues/2298
  • [ ] Tests added
  • [ ] Tests passed
  • [ ] Fully documented, including whats-new.rst for all changes and api.rst for new API

Still needs more tests and documentation.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2302/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
479942077 MDU6SXNzdWU0Nzk5NDIwNzc= 3213 How should xarray use/support sparse arrays? shoyer 1217238 open 0     55 2019-08-13T03:29:42Z 2023-06-07T15:43:55Z   MEMBER      

I'm looking forward to being easily able to create sparse xarray objects from pandas: https://github.com/pydata/xarray/issues/3206

Are there other xarray APIs that could make good use of sparse arrays, or could make sparse arrays easier to use?

Some ideas:

  • to_sparse()/to_dense() methods for converting to/from sparse without requiring using .data
  • to_dataframe()/to_series() could grow options for skipping the fill-value in sparse arrays, so they can round-trip MultiIndex data back to pandas (see the small example below)
  • Serialization to/from netCDF files, using some custom convention (see https://github.com/pydata/xarray/issues/1375#issuecomment-402699810)
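
For reference, a small illustration of the pandas round-trip case (made-up data, and assuming the sparse package is installed):

```python
import pandas as pd
import xarray as xr

index = pd.MultiIndex.from_product([[0, 1], ["a", "b", "c"]], names=["x", "y"])
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=index, name="data")

# sparse=True keeps the result backed by a sparse.COO array instead of
# densifying the MultiIndex into a full (x, y) grid.
ds = xr.Dataset.from_dataframe(series.to_frame(), sparse=True)
print(type(ds["data"].data))
```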

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3213/reactions",
    "total_count": 14,
    "+1": 14,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1465287257 I_kwDOAMm_X85XVoJZ 7325 Support reading Zarr data via TensorStore shoyer 1217238 open 0     1 2022-11-27T00:12:17Z 2023-05-11T01:24:27Z   MEMBER      

What is your issue?

TensorStore is another high performance API for reading distributed arrays in formats such as Zarr, written in C++.

It could be interesting to write an Xarray storage backend using TensorStore as an alternative way to read Zarr files.

As an exercise, I made a little demo of doing this: https://gist.github.com/shoyer/5b0c485979cc9c36a9685d8cf8e94565

I have not tested it for performance. The main annoyance is that TensorStore doesn't understand Zarr groups or Zarr array attributes, so I needed to write my own helpers for reading this metadata.

Also, there's a bit of an impedance mis-match between TensorStore (where everything returns futures) and Xarray (which assumes that indexing results in numpy arrays). This could likely be improved with some amount of effort -- in particular https://github.com/pydata/xarray/pull/6874/files should help.

CC @jbms who may have better ideas about how to use the TensorStore API.
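
For reference, opening a single Zarr array with TensorStore looks roughly like the following sketch (the path and array layout are made up; this is not the backend from the gist above):

```python
import tensorstore as ts

store = ts.open({
    "driver": "zarr",
    "kvstore": {"driver": "file", "path": "example.zarr/temperature"},
}).result()

# Indexing produces a lazy view; .read() returns a future and .result()
# blocks until the underlying chunks are fetched as a numpy array.
block = store[:10, :10].read().result()
print(block.shape)
```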

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7325/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
209653741 MDU6SXNzdWUyMDk2NTM3NDE= 1285 FAQ page could use some updating shoyer 1217238 open 0     1 2017-02-23T03:29:16Z 2023-03-26T16:32:44Z   MEMBER      

Along the same lines as https://github.com/pydata/xarray/issues/1282, we haven't done much updating for frequently asked questions -- it's mostly still the original handful of FAQ entries I wrote in the first version of the docs.

Topics worth addressing:

  • [ ] How xarray handles missing values
  • [x] File formats -- how can I read format X in xarray? (Maybe we should make a table with links to other packages?)

(please add suggestions for this list!)

StackOverflow may be a helpful reference here: http://stackoverflow.com/questions/tagged/python-xarray?sort=votes&pageSize=50

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1285/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
176805500 MDU6SXNzdWUxNzY4MDU1MDA= 1004 Remove IndexVariable.name shoyer 1217238 open 0     3 2016-09-14T03:27:43Z 2023-03-11T19:57:40Z   MEMBER      

As discussed in #947, we should remove the IndexVariable.name attribute. It should be fine to use an IndexVariable anywhere, regardless of whether or not it labels ticks along a dimension.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1004/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
895983112 MDExOlB1bGxSZXF1ZXN0NjQ4MTM1NTcy 5351 Add xarray.backends.NoMatchingEngineError shoyer 1217238 open 0     4 2021-05-19T22:09:21Z 2022-11-16T15:19:54Z   MEMBER   0 pydata/xarray/pulls/5351
  • [x] Closes #5329
  • [x] Tests added
  • [x] Passes pre-commit run --all-files
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [x] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5351/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
168272291 MDExOlB1bGxSZXF1ZXN0NzkzMjE2NTc= 924 WIP: progress toward making groupby work with multiple arguments shoyer 1217238 open 0     16 2016-07-29T08:07:57Z 2022-06-09T14:50:17Z   MEMBER   0 pydata/xarray/pulls/924

Fixes #324

It definitely doesn't work properly yet, totally mixing up coordinates, data variables and multi-indexes (as shown by the failing tests).

A simple example:

```
In [4]: coords = {'a': ('x', [0, 0, 1, 1]), 'b': ('y', [0, 0, 1, 1])}

In [5]: square = xr.DataArray(np.arange(16).reshape(4, 4), coords=coords, dims=['x', 'y'])

In [6]: square
Out[6]:
<xarray.DataArray (x: 4, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
Coordinates:
    b        (y) int64 0 0 1 1
    a        (x) int64 0 0 1 1
  * x        (x) int64 0 1 2 3
  * y        (y) int64 0 1 2 3

In [7]: square.groupby(['a', 'b']).mean()
Out[7]:
<xarray.DataArray (a: 2, b: 2)>
array([[  2.5,   4.5],
       [ 10.5,  12.5]])
Coordinates:
  * a        (a) int64 0 1
  * b        (b) int64 0 1

In [8]: square.groupby(['x', 'y']).mean()
Out[8]:
<xarray.DataArray (x: 4, y: 4)>
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.],
       [ 12.,  13.,  14.,  15.]])
Coordinates:
  * x        (x) int64 0 1 2 3
  * y        (y) int64 0 1 2 3
```

More examples: https://gist.github.com/shoyer/5cfa4d5751e8a78a14af25f8442ad8d5

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/924/reactions",
    "total_count": 4,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 3,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
326205036 MDU6SXNzdWUzMjYyMDUwMzY= 2180 How should Dataset.update() handle conflicting coordinates? shoyer 1217238 open 0     16 2018-05-24T16:46:23Z 2022-04-30T13:40:28Z   MEMBER      

Recently, we updated Dataset.__setitem__ to drop conflicting coordinates from DataArray values being assigned if they conflict with existing coordinates (https://github.com/pydata/xarray/pull/2087). Because update and __setitem__ share the same code path, this inadvertently updated update as well. Is this something we want?

In v0.10.3, both __setitem__ and update prioritize coordinates from the assigned objects (e.g., value in dataset[key] = value).

In v0.10.4, both __setitem__ and update prioritize coordinates from the original object (e.g., dataset).

I'm not sure this is the right behavior. In particular, in the case of dataset.update(other) where other is also an xarray.Dataset, it seems like coordinates from other should take priority.

Note that one advantage of the current logic (which is violated by my current fix in https://github.com/pydata/xarray/pull/2162), is that we maintain the invariant that dataset[key] = value is equivalent to dataset.update({key: value}).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2180/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
612918997 MDU6SXNzdWU2MTI5MTg5OTc= 4034 Fix tight_layout warning on cartopy facetgrid docs example shoyer 1217238 open 0     1 2020-05-05T21:54:46Z 2022-04-30T12:37:50Z   MEMBER      

Per the fix in https://github.com/pydata/xarray/pull/4032, I'm pretty sure we will soon start seeing a warning message printed on ReadTheDocs in the Cartopy FacetGrid example: http://xarray.pydata.org/en/stable/plotting.html#maps

This would be nice to fix for users, especially because it's likely users will see this warning when running code outside of our documentation, too.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4034/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
342180429 MDU6SXNzdWUzNDIxODA0Mjk= 2298 Making xarray math lazy shoyer 1217238 open 0     7 2018-07-18T05:18:53Z 2022-04-19T15:38:59Z   MEMBER      

At SciPy, I had the realization that it would be relatively straightforward to make element-wise math between xarray objects lazy. This would let us support lazy coordinate arrays, a feature that has quite a few use-cases, e.g., for both geoscience and astronomy.

The trick would be to write a lazy array class that holds an element-wise vectorized function and passes indexers on to its arguments. I haven't thought too hard about this yet for vectorized indexing, but it could be quite efficient for outer indexing. I have some prototype code but no tests yet.

The question is how to hook this into xarray operations. In particular, supposing that the inputs to a function do not hold dask arrays:

  • Should we try to make every element-wise operation with vectorized functions (ufuncs) lazy by default? This might have negative performance implications and would be a little tricky to implement with xarray's current code, since we still implement binary operations like + with separate logic from apply_ufunc.
  • Should we make every element-wise operation that explicitly uses apply_ufunc() lazy by default?
  • Or should we only make element-wise operations lazy with apply_ufunc() if you use some special flag, e.g., apply_ufunc(..., lazy=True)?

I am leaning towards the last option for now but would welcome other opinions.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2298/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
902622057 MDU6SXNzdWU5MDI2MjIwNTc= 5381 concat() with compat='no_conflicts' on dask arrays has accidentally quadratic runtime shoyer 1217238 open 0     0 2021-05-26T16:12:06Z 2022-04-19T03:48:27Z   MEMBER      

This ends up calling fillna() in a loop inside xarray.core.merge.unique_variable(), something like:

```python
out = variables[0]
for var in variables[1:]:
    out = out.fillna(var)
```

https://github.com/pydata/xarray/blob/55e5b5aaa6d9c27adcf9a7cb1f6ac3bf71c10dea/xarray/core/merge.py#L147-L149

This has quadratic behavior if the variables are stored in dask arrays (the dask graph gets one element larger after each loop iteration). This is OK for merge() (which typically only has two arguments) but is problematic for dealing with variables that shouldn't be concatenated inside concat(), which should be able to handle very long lists of arguments.

I encountered this because compat='no_conflicts' is the default for xarray.combine_nested().

I guess there's also the related issue which is that even if we produced the output dask graph by hand without a loop, it still wouldn't be easy to evaluate for a large number of elements. Ideally we would use some sort of tree-reduction to ensure the operation can be parallelized.
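
A minimal sketch of such a tree reduction (illustrative only, not xarray code):

```python
def tree_fillna(variables):
    # Combine variables pairwise so the resulting dask graph has depth
    # O(log n) instead of the O(n) chain produced by the loop above.
    variables = list(variables)
    while len(variables) > 1:
        paired = [variables[i].fillna(variables[i + 1])
                  for i in range(0, len(variables) - 1, 2)]
        if len(variables) % 2:
            paired.append(variables[-1])
        variables = paired
    return variables[0]
```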

xref https://github.com/google/xarray-beam/pull/13

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5381/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
325439138 MDU6SXNzdWUzMjU0MzkxMzg= 2171 Support alignment/broadcasting with unlabeled dimensions of size 1 shoyer 1217238 open 0     5 2018-05-22T19:52:21Z 2022-04-19T03:15:24Z   MEMBER      

Sometimes, it's convenient to include placeholder dimensions of size 1, which allows for removing any ambiguity related to the order of output dimensions.

Currently, this is not supported with xarray:

```
xr.DataArray([1], dims='x') + xr.DataArray([1, 2, 3], dims='x')
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension sizes: {1, 3}

xr.Variable(('x',), [1]) + xr.Variable(('x',), [1, 2, 3])
ValueError: operands cannot be broadcast together with mismatched lengths for dimension 'x': (1, 3)
```

However, these operations aren't really ambiguous. With size 1 dimensions, we could logically do broadcasting like NumPy arrays, e.g.,

```
np.array([1]) + np.array([1, 2, 3])
array([2, 3, 4])
```

This would be particularly convenient if we add keepdims=True to xarray operations (#2170).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2171/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
665488672 MDU6SXNzdWU2NjU0ODg2NzI= 4267 CachingFileManager should not use __del__ shoyer 1217238 open 0     2 2020-07-25T01:20:52Z 2022-04-17T21:42:39Z   MEMBER      

__del__ is sometimes called after modules have been deallocated, which results in errors printed to stderr when Python exits. This manifests itself in the following bug: https://github.com/shoyer/h5netcdf/issues/50

Per https://github.com/shoyer/h5netcdf/issues/50#issuecomment-572191867, the right solution is probably to use weakref.finalize.
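
A minimal sketch of the weakref.finalize approach (hypothetical class, not the real CachingFileManager):

```python
import weakref

class FileManagerSketch:
    def __init__(self, opener, *args):
        self._file = opener(*args)
        # The callback references only the file object, not self, so it does
        # not keep the manager alive and still runs reliably at interpreter
        # exit without touching torn-down module globals (unlike __del__).
        self._finalizer = weakref.finalize(self, self._file.close)

    def close(self):
        self._finalizer()  # idempotent: the callback runs at most once
```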

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4267/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
469440752 MDU6SXNzdWU0Njk0NDA3NTI= 3139 Change the signature of DataArray to DataArray(data, dims, coords, ...)? shoyer 1217238 open 0     1 2019-07-17T20:54:57Z 2022-04-09T15:28:51Z   MEMBER      

Currently, the signature of DataArray is DataArray(data, coords, dims, ...): http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html

In the long term, I think DataArray(data, dims, coords, ...) would be more intuitive: dimensions are a more fundamental part of xarray's data model than coordinates. Certainly I find it much more common to omit coords than to omit dims when I create a DataArray.

My original reasoning for this argument order was that dims could be copied from coords, e.g., DataArray(new_data, old_dataarray.coords), and it was nice to be able to pass this sole argument by position instead of by name. But a cleaner way to write this now is old_dataarray.copy(data=new_data).

The challenge in making any change here would be to have a smooth deprecation process, and that ideally avoids requiring users to rewrite all of their code and avoids loads of pointless/extraneous warnings. I'm not entirely sure this is possible. We could likely use heuristics to distinguish between dims and coords arguments regardless of their order, but this probably isn't something we would want to preserve in the long term.

An alternative that might achieve some of the convenience of this change would be to allow for passing lists of strings in the coords argument by position, which are interpreted as dimensions, e.g., DataArray(data, ['x', 'y']). The downside of this alternative is that it would add even more special cases to the DataArray constructor, which would make it harder to understand.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3139/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
237008177 MDU6SXNzdWUyMzcwMDgxNzc= 1460 groupby should still squeeze for non-monotonic inputs shoyer 1217238 open 0     5 2017-06-19T20:05:14Z 2022-03-04T21:31:41Z   MEMBER      

We can simply use argsort() to determine group_indices instead of np.arange(): https://github.com/pydata/xarray/blob/22ff955d53e253071f6e4fa849e5291d0005282a/xarray/core/groupby.py#L256
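
For example (a standalone numpy sketch, not the actual groupby code):

```python
import numpy as np

labels = np.array([2, 0, 1, 0, 2, 1])            # non-monotonic group labels
order = np.argsort(labels, kind="stable")         # positions ordered by group
_, counts = np.unique(labels, return_counts=True)
group_indices = np.split(order, np.cumsum(counts)[:-1])
print(group_indices)  # [array([1, 3]), array([2, 5]), array([0, 4])]
```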

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1460/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
58117200 MDU6SXNzdWU1ODExNzIwMA== 324 Support multi-dimensional grouped operations and group_over shoyer 1217238 open 0   1.0 741199 12 2015-02-18T19:42:20Z 2022-02-28T19:03:17Z   MEMBER      

Multi-dimensional grouped operations should be relatively straightforward -- the main complexity will be writing an N-dimensional concat that doesn't involve repetitively copying data.

The idea with group_over would be to support groupby operations that act on a single element from each of the given groups, rather than the unique values. For example, ds.group_over(['lat', 'lon']) would let you iterate over or apply to 2D slices of ds, no matter how many dimensions it has.

Roughly speaking (it's a little more complex for the case of non-dimension variables), ds.group_over(dims) would get translated into ds.groupby([d for d in ds.dims if d not in dims]).

Related: #266

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/324/reactions",
    "total_count": 18,
    "+1": 18,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1090700695 I_kwDOAMm_X85BAsWX 6125 [Bug]: HTML repr does not display well in notebooks hosted on GitHub shoyer 1217238 open 0     0 2021-12-29T19:05:49Z 2021-12-29T19:36:25Z   MEMBER      

What happened?

We see both the raw text and a malformed version of the HTML (without CSS formatting).

Example (https://github.com/microsoft/PlanetaryComputerExamples/blob/main/quickstarts/reading-zarr-data.ipynb):

What did you expect to happen?

Either:

  1. Ideally, we only see the HTML repr, with CSS formatting applied.
  2. Or, if that isn't possible, we should figure out how to only show the raw text.

nbviewer gets this right:

Minimal Complete Verifiable Example

No response

Relevant log output

No response

Anything else we need to know?

No response

Environment

NA

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6125/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
252707680 MDU6SXNzdWUyNTI3MDc2ODA= 1525 Consider setting name=False in Variable.chunk() shoyer 1217238 open 0     4 2017-08-24T19:34:28Z 2021-07-13T01:50:16Z   MEMBER      

@mrocklin writes:

The following will be slower:

```python
b = (a.chunk(...) + 1) + (a.chunk(...) + 1)
```

In current operation this will be optimized to

```python
tmp = a.chunk(...) + 1
b = tmp + tmp
```

So you'll lose that, but I suspect that in your case chunking the same dataset many times is somewhat rare.

See here for discussion: https://github.com/pydata/xarray/pull/1517#issuecomment-324722153

Whether this is worth doing really depends on what people would find most useful -- and what is the most intuitive behavior.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1525/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
254888879 MDU6SXNzdWUyNTQ4ODg4Nzk= 1552 Flow chart for choosing indexing operations shoyer 1217238 open 0     2 2017-09-03T17:33:30Z 2021-07-11T22:26:17Z   MEMBER      

We have a lot of indexing operations, even though sel_points and isel_points are about to be deprecated (#1473).

A flow chart / decision tree to help users pick the right indexing operation might be helpful (e.g., like this skimage FlowChart). It would ask various questions (e.g., do you have labels or integer positions? do you want to select or impose coordinates?) and then suggest the appropriate indexer methods.

cc @fujiisoup

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1552/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
340733448 MDU6SXNzdWUzNDA3MzM0NDg= 2283 Exact alignment should allow missing dimension coordinates shoyer 1217238 open 0     2 2018-07-12T17:40:24Z 2021-06-15T09:52:29Z   MEMBER      

Code Sample, a copy-pastable example if possible

```python
import xarray as xr
xr.align(xr.DataArray([1, 2, 3], dims='x'),
         xr.DataArray([1, 2, 3], dims='x', coords=[[0, 1, 2]]),
         join='exact')
```

Problem description

This currently results in an error, but a missing index of size 3 does not actually conflict:

```python-traceback
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-1d63d3512fb6> in <module>()
      1 xr.align(xr.DataArray([1, 2, 3], dims='x'),
      2          xr.DataArray([1, 2, 3], dims='x', coords=[[0, 1, 2]]),
----> 3          join='exact')

/usr/local/lib/python3.6/dist-packages/xarray/core/alignment.py in align(*objects, **kwargs)
    129                 raise ValueError(
    130                     'indexes along dimension {!r} are not equal'
--> 131                     .format(dim))
    132             index = joiner(matching_indexes)
    133             joined_indexes[dim] = index

ValueError: indexes along dimension 'x' are not equal
```

This surfaced as an issue on StackOverflow: https://stackoverflow.com/questions/51308962/computing-matrix-vector-multiplication-for-each-time-point-in-two-dataarrays

Expected Output

Both output arrays should end up with the x coordinate from the input that has it, like the output of the above expression if join='inner':

```
(<xarray.DataArray (x: 3)>
 array([1, 2, 3])
 Coordinates:
   * x        (x) int64 0 1 2,
 <xarray.DataArray (x: 3)>
 array([1, 2, 3])
 Coordinates:
   * x        (x) int64 0 1 2)
```

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.33+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.7
pandas: 0.22.0
numpy: 1.14.5
scipy: 0.19.1
netCDF4: None
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.1.2
cartopy: None
seaborn: 0.7.1
setuptools: 39.1.0
pip: 10.0.1
conda: None
pytest: None
IPython: 5.5.0
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2283/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
842438533 MDU6SXNzdWU4NDI0Mzg1MzM= 5082 Move encoding from xarray.Variable to duck arrays? shoyer 1217238 open 0     2 2021-03-27T07:21:55Z 2021-06-13T01:34:00Z   MEMBER      

The encoding property on Variable has always been an awkward part of Xarray's API, and an example of poor separation of concerns. It adds conceptual overhead to all uses of xarray.Variable, but exists only for the (somewhat niche) benefit of Xarray's backend IO functionality. This is particularly problematic if we consider the possible separation of xarray.Variable into a separate package to remove the pandas dependency (https://github.com/pydata/xarray/issues/3981).

I think a cleaner way to handle encoding would be to move it from Variable onto array objects, specifically duck array objects that Xarray creates when loading data from disk. As long as these duck arrays don't "propagate" themselves under array operations but rather turn into raw numpy arrays (or whatever is wrapped), this would automatically resolve all issues around propagating encoding attributes (e.g., https://github.com/pydata/xarray/pull/5065, https://github.com/pydata/xarray/issues/1614). And users who don't care about encoding because they don't use Xarray's IO functionality would never need to think about it.
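
A minimal sketch of what such a duck array might look like (the class name and details are hypothetical):

```python
import numpy as np

class EncodedArray:
    """Thin wrapper created by a backend: carries encoding, but degrades to a
    plain numpy array under any operation, so encoding never propagates."""

    def __init__(self, data, encoding):
        self.data = np.asarray(data)
        self.encoding = encoding
        self.shape = self.data.shape
        self.dtype = self.data.dtype

    def __getitem__(self, key):
        return self.data[key]                      # indexing yields plain numpy

    def __array__(self, dtype=None):
        return np.asarray(self.data, dtype=dtype)  # so does any computation

arr = EncodedArray([1.0, 2.0], encoding={"dtype": "int16", "scale_factor": 0.01})
print(type(np.add(arr, 1)))  # <class 'numpy.ndarray'> -- the encoding is dropped
```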

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5082/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
276241764 MDU6SXNzdWUyNzYyNDE3NjQ= 1739 Utility to restore original dimension order after apply_ufunc shoyer 1217238 open 0     11 2017-11-23T00:47:57Z 2021-05-29T07:39:33Z   MEMBER      

This seems to be coming up quite a bit for wrapping functions that apply an operation along an axis, e.g., for interpolate in #1640 or rank in #1733.

We should either write a utility function to do this or consider adding an option to apply_ufunc.
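
Something along these lines might work (the helper name is hypothetical):

```python
def restore_dim_order(result, original):
    # Reorder the dims that survived the operation to match the original
    # object, keeping any newly added dims at the end.
    kept = [d for d in original.dims if d in result.dims]
    new = [d for d in result.dims if d not in original.dims]
    return result.transpose(*kept, *new)
```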

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1739/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
901047466 MDU6SXNzdWU5MDEwNDc0NjY= 5372 Consider revising the _repr_inline_ protocol shoyer 1217238 open 0     0 2021-05-25T16:18:31Z 2021-05-25T16:18:31Z   MEMBER      

_repr_inline_ looks like an IPython special method but actually includes some xarray-specific details: the result should not include shape or dtype.

As I wrote in https://github.com/pydata/xarray/pull/5352, I would suggest revising it in one of two ways:

  1. Giving it a name like _xarray_repr_inline_ to make it clearer that it's Xarray specific
  2. Include some more generic way of indicating that shape/dtype is redundant, e.g., call it like obj._repr_ndarray_inline_(dtype=False, shape=False)
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5372/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
294241734 MDU6SXNzdWUyOTQyNDE3MzQ= 1887 Boolean indexing with multi-dimensional key arrays shoyer 1217238 open 0     13 2018-02-04T23:28:45Z 2021-04-22T21:06:47Z   MEMBER      

Originally from https://github.com/pydata/xarray/issues/974

For boolean indexing:

  • da[key] where key is a boolean labelled array (with any number of dimensions) is made equivalent to da.where(key.reindex_like(ds), drop=True). This matches the existing behavior if key is a 1D boolean array. For multi-dimensional arrays, even though the result is now multi-dimensional, this coupled with automatic skipping of NaNs means that da[key].mean() gives the same result as in NumPy (see the small example below).
  • da[key] = value where key is a boolean labelled array can be made equivalent to da = da.where(*align(key.reindex_like(da), value.reindex_like(da))) (that is, the three argument form of where).
  • da[key_0, ..., key_n] where all of key_i are boolean arrays gets handled in the usual way. It is an IndexingError to supply multiple labelled keys if any of them are not already aligned with the corresponding index coordinates (and share the same dimension name). If they want alignment, we suggest users simply write da[key_0 & ... & key_n].
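
A small illustration of the first equivalence using today's where (made-up data):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6).reshape(2, 3), dims=("x", "y"),
                  coords={"x": [0, 1], "y": [10, 20, 30]})
key = da > 2   # multi-dimensional boolean key

masked = da.where(key, drop=True)
# NaNs are skipped by default, so the labelled result matches plain NumPy:
print(masked.mean().item(), da.values[key.values].mean())  # 4.0 4.0
```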

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1887/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
843996137 MDU6SXNzdWU4NDM5OTYxMzc= 5092 Concurrent loading of coordinate arrays from Zarr shoyer 1217238 open 0     0 2021-03-30T02:19:50Z 2021-04-19T02:43:31Z   MEMBER      

When you open a dataset with Zarr, xarray loads coordinate arrays corresponding to indexes in serial. This can be slow (multiple seconds) even with only a handful of such arrays if they are stored in a remote filesystem (e.g., cloud object stores). This is similar to the use-cases for consolidated metadata.

In principle, we could speed up loading datasets from Zarr into Xarray significantly by reading the data corresponding to these arrays in parallel (e.g., in multiple threads).
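
A rough sketch of the idea (not xarray internals; the group and variable names are assumptions for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def load_index_variables(zarr_group, names):
    # Fetch each coordinate array in its own thread; with a remote store the
    # total wall time is roughly one round trip instead of one per variable.
    with ThreadPoolExecutor() as executor:
        arrays = list(executor.map(lambda name: zarr_group[name][:], names))
    return dict(zip(names, arrays))

# e.g. load_index_variables(zarr.open_group("gs://bucket/store"), ["time", "lat", "lon"])
```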

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5092/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
264098632 MDU6SXNzdWUyNjQwOTg2MzI= 1618 apply_raw() for a simpler version of apply_ufunc() shoyer 1217238 open 0     4 2017-10-10T04:51:38Z 2021-01-01T17:14:43Z   MEMBER      

apply_raw() would work like apply_ufunc(), but without the hard to understand broadcasting behavior and core dimensions.

The rule for apply_raw() would be that it directly unwraps its arguments and passes them on to the wrapped function, without any broadcasting. We would also include a dim argument that is automatically converted into the appropriate axis argument when calling the wrapped function.

Output dimensions would be determined from a simple rule of some sort:

  • Default output dimensions would either be copied from the first argument, or would take on the ordered union of all input dimensions.
  • Custom dimensions could either be set by adding a drop_dims argument (like dask.array.map_blocks), or require an explicit override output_dims.

This also could be suitable for defining as a method instead of a separate function. See https://github.com/pydata/xarray/issues/1251 and https://github.com/pydata/xarray/issues/1130 for related issues.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1618/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
269700511 MDU6SXNzdWUyNjk3MDA1MTE= 1672 Append along an unlimited dimension to an existing netCDF file shoyer 1217238 open 0     8 2017-10-30T18:09:54Z 2020-11-29T17:35:04Z   MEMBER      

This would be a nice feature to have for some use cases, e.g., for writing simulation time-steps: https://stackoverflow.com/questions/46951981/create-and-write-xarray-dataarray-to-netcdf-in-chunks

It should be relatively straightforward to add, too, building on support for writing files with unlimited dimensions. User facing API would probably be a new keyword argument to to_netcdf(), e.g., extend='time' to indicate the extended dimension.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1672/reactions",
    "total_count": 21,
    "+1": 21,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
314444743 MDU6SXNzdWUzMTQ0NDQ3NDM= 2059 How should xarray serialize bytes/unicode strings across Python/netCDF versions? shoyer 1217238 open 0     5 2018-04-15T19:36:55Z 2020-11-19T10:08:16Z   MEMBER      

netCDF string types

We have several options for storing strings in netCDF files:

  • NC_CHAR: netCDF's legacy character type. The closest match is NumPy 'S1' dtype. In principle, it's supposed to be able to store arbitrary bytes. On HDF5, it uses an UTF-8 encoded string with a fixed-size of 1 (but note that HDF5 does not complain about storing arbitrary bytes).
  • NC_STRING: netCDF's newer variable length string type. It's only available on netCDF4 (not netCDF3). It corresponds to an HDF5 variable-length string with UTF-8 encoding.
  • NC_CHAR with an _Encoding attribute: xarray and netCDF4-Python support an ad-hoc convention for storing unicode strings in NC_CHAR data-types, by adding an attribute {'_Encoding': 'UTF-8'}. The data is still stored as fixed width strings, but xarray (and netCDF4-Python) can decode them as unicode.

NC_STRING would seem like a clear win in cases where it's supported, but as @crusaderky points out in https://github.com/pydata/xarray/issues/2040, it actually results in much larger netCDF files in many cases than using character arrays, which are more easily compressed. Nonetheless, we currently default to storing unicode strings in NC_STRING, because it's the most portable option -- every tool that handles HDF5 and netCDF4 should be able to read it properly as unicode strings.

NumPy/Python string types

On the Python side, our options are perhaps even more confusing:

  • NumPy's dtype=np.string_ corresponds to fixed-length bytes. This is the default dtype for strings on Python 2, because on Python 2 strings are the same as bytes.
  • NumPy's dtype=np.unicode_ corresponds to fixed-length unicode. This is the default dtype for strings on Python 3, because on Python 3 strings are the same as unicode.
  • Strings are also commonly stored in numpy arrays with dtype=np.object_, as arrays of either bytes or unicode objects. This is a pragmatic choice, because otherwise NumPy has no support for variable length strings. We also use this (like pandas) to mark missing values with np.nan.

Like pandas, we are pretty liberal with converting back and forth between fixed-length (np.string/np.unicode_) and variable-length (object dtype) representations of strings as necessary. This works pretty well, though converting from object arrays in particular has downsides, since it cannot be done lazily with dask.

Current behavior of xarray

Currently, xarray uses the same behavior on Python 2/3. The priority was faithfully round-tripping data from a particular version of Python to netCDF and back, which the current serialization behavior achieves:

| Python version | NetCDF version | NumPy datatype | NetCDF datatype |
| -------------- | -------------- | -------------- | --------------- |
| Python 2 | NETCDF3 | np.string_ / str | NC_CHAR |
| Python 2 | NETCDF4 | np.string_ / str | NC_CHAR |
| Python 3 | NETCDF3 | np.string_ / bytes | NC_CHAR |
| Python 3 | NETCDF4 | np.string_ / bytes | NC_CHAR |
| Python 2 | NETCDF3 | np.unicode_ / unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | np.unicode_ / unicode | NC_STRING |
| Python 3 | NETCDF3 | np.unicode_ / str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | np.unicode_ / str | NC_STRING |
| Python 2 | NETCDF3 | object bytes/str | NC_CHAR |
| Python 2 | NETCDF4 | object bytes/str | NC_CHAR |
| Python 3 | NETCDF3 | object bytes | NC_CHAR |
| Python 3 | NETCDF4 | object bytes | NC_CHAR |
| Python 2 | NETCDF3 | object unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | object unicode | NC_STRING |
| Python 3 | NETCDF3 | object unicode/str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | object unicode/str | NC_STRING |

This can also be selected explicitly for most data-types by setting dtype in encoding:

  • 'S1' for NC_CHAR (with or without encoding)
  • str for NC_STRING (though I'm not 100% sure it works properly currently when given bytes)

Script for generating table:

```python
from __future__ import print_function
import xarray as xr
import uuid
import netCDF4
import numpy as np
import sys

for dtype_name, value in [
        ('np.string_ / ' + type(b'').__name__, np.array([b'abc'])),
        ('np.unicode_ / ' + type(u'').__name__, np.array([u'abc'])),
        ('object bytes/' + type(b'').__name__, np.array([b'abc'], dtype=object)),
        ('object unicode/' + type(u'').__name__, np.array([u'abc'], dtype=object)),
]:
    for format in ['NETCDF3_64BIT', 'NETCDF4']:
        filename = str(uuid.uuid4()) + '.nc'
        xr.Dataset({'data': value}).to_netcdf(filename, format=format)
        with netCDF4.Dataset(filename) as f:
            var = f.variables['data']
            disk_dtype = var.dtype
            has_encoding = hasattr(var, '_Encoding')
            disk_dtype_name = (('NC_CHAR' if disk_dtype == 'S1' else 'NC_STRING')
                               + (' with UTF-8 encoding' if has_encoding else ''))
        print('|', 'Python %i' % sys.version_info[0], '|', format[:7],
              '|', dtype_name, '|', disk_dtype_name, '|')
```

Potential alternatives

The main option I'm considering is switching to default to NC_CHAR with UTF-8 encoding for np.string_ / str and object bytes/str on Python 2. The current behavior could be explicitly toggled by setting an encoding of {'_Encoding': None}.

This would imply two changes:

  1. Attempting to serialize arbitrary bytes (on Python 2) would start raising an error -- anything that isn't ASCII would require explicitly disabling _Encoding.
  2. Strings read back from disk on Python 2 would come back as unicode instead of bytes.

This implicit conversion would be consistent with Python 2's general handling of bytes/unicode, and facilitate reading netCDF files on Python 3 that were written with Python 2.

The counter-argument is that it may not be worth changing this at this late point, given that we will be sunsetting Python 2 support by year's end.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2059/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
715374721 MDU6SXNzdWU3MTUzNzQ3MjE= 4490 Group together decoding options into a single argument shoyer 1217238 open 0     6 2020-10-06T06:15:18Z 2020-10-29T04:07:46Z   MEMBER      

Is your feature request related to a problem? Please describe.

open_dataset() currently has a very long function signature. This makes it hard to keep track of everything it can do, and is particularly problematic for the authors of new backends (e.g., see https://github.com/pydata/xarray/pull/4477), which might need to know how to handle all these arguments.

Describe the solution you'd like

To simplify the interface, I propose to group together all the decoding options into a new DecodingOptions class. I'm thinking something like:

```python
from dataclasses import dataclass, field, asdict
from typing import Optional, List

@dataclass(frozen=True)
class DecodingOptions:
    mask: Optional[bool] = None
    scale: Optional[bool] = None
    datetime: Optional[bool] = None
    timedelta: Optional[bool] = None
    use_cftime: Optional[bool] = None
    concat_characters: Optional[bool] = None
    coords: Optional[bool] = None
    drop_variables: Optional[List[str]] = None

    @classmethod
    def disabled(cls):
        return cls(mask=False, scale=False, datetime=False, timedelta=False,
                   concat_characters=False, coords=False)

    def non_defaults(self):
        return {k: v for k, v in asdict(self).items() if v is not None}

    # add another method for creating default Variable Coder() objects,
    # e.g., those listed in encode_cf_variable()
```

The signature of open_dataset would then become:

```python
def open_dataset(
    filename_or_obj,
    group=None,
    *,
    engine=None,
    chunks=None,
    lock=None,
    cache=None,
    backend_kwargs=None,
    decode: Union[DecodingOptions, bool] = None,
    **deprecated_kwargs
):
    if decode is None:
        decode = DecodingOptions()
    if decode is False:
        decode = DecodingOptions.disabled()
    # handle deprecated_kwargs...
    ...
```

Question: are decode and DecodingOptions the right names? Maybe these should still include the name "CF", e.g., decode_cf and CFDecodingOptions, given that these are specific to CF conventions?

Note: the current signature is open_dataset(filename_or_obj, group=None, decode_cf=True, mask_and_scale=None, decode_times=True, autoclose=None, concat_characters=True, decode_coords=True, engine=None, chunks=None, lock=None, cache=None, drop_variables=None, backend_kwargs=None, use_cftime=None, decode_timedelta=None)

Usage with the new interface would look like xr.open_dataset(filename, decode=False) or xr.open_dataset(filename, decode=xr.DecodingOptions(mask=False, scale=False)).

This requires a little bit more typing than what we currently have, but it has a few advantages:

  1. It's easier to understand the role of different arguments. Now there is a function with ~8 arguments and a class with ~8 arguments rather than a function with ~15 arguments.
  2. It's easier to add new decoding arguments (e.g., for more advanced CF conventions), because they don't clutter the open_dataset interface. For example, I separated out mask and scale arguments, versus the current mask_and_scale argument.
  3. If a new backend plugin for open_dataset() needs to handle every option supported by open_dataset(), this makes that task significantly easier. The only decoding options they need to worry about are non-default options that were explicitly set, i.e., those exposed by the non_defaults() method. If another decoding option wasn't explicitly set and isn't recognized by the backend, they can just ignore it.

Describe alternatives you've considered

For the overall approach:

  1. We could keep the current design, with separate keyword arguments for decoding options, and just be very careful about passing around these arguments. This seems pretty painful for the backend refactor, though.
  2. We could keep the current design only for the user facing open_dataset() interface, and then internally convert into the DecodingOptions() struct for passing to backend constructors. This would provide much needed flexibility for backend authors, but most users wouldn't benefit from the new interface. Perhaps this would make sense as an intermediate step?
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4490/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
253107677 MDU6SXNzdWUyNTMxMDc2Nzc= 1527 Binary operations with ds.groupby('time.dayofyear') errors out, but ds.groupby('time.month') works shoyer 1217238 open 0     10 2017-08-26T16:54:53Z 2020-09-29T10:05:42Z   MEMBER      

Reported on the mailing list:

Original datasets:

```
ds_xr
<xarray.DataArray (time: 12775)>
array([-0.01, -0.01, -0.01, ..., -0.27, -0.27, -0.27])
Coordinates:
  * time     (time) datetime64[ns] 1979-01-01 1979-01-02 1979-01-03 ...

slope_itcp_ds
<xarray.Dataset>
Dimensions:                    (lat: 73, level: 2, lon: 144, time: 366)
Coordinates:
  * lon                        (lon) float32 0.0 2.5 5.0 7.5 10.0 12.5 ...
  * lat                        (lat) float32 90.0 87.5 85.0 82.5 80.0 ...
  * level                      (level) float64 0.0 1.0
  * time                       (time) datetime64[ns] 2010-01-01 ...
Data variables:
    xarray_dataarray_variable  (time, level, lat, lon) float64 -0.8795 ...
Attributes:
    CDI:          Climate Data Interface version 1.7.1 (http://mpimet.mpg.de/...
    Conventions:  CF-1.4
    history:      Fri Aug 25 18:55:50 2017: cdo -inttime,2010-01-01,00:00:00,...
    CDO:          Climate Data Operators version 1.7.1 (http://mpimet.mpg.de/...
```

Issue: Grouping by month works and outputs this:

```
ds_xr.groupby('time.month') - slope_itcp_ds.groupby('time.month').mean('time')
<xarray.Dataset>
Dimensions:                    (lat: 73, level: 2, lon: 144, time: 12775)
Coordinates:
  * lon                        (lon) float32 0.0 2.5 5.0 7.5 10.0 12.5 ...
  * lat                        (lat) float32 90.0 87.5 85.0 82.5 80.0 ...
  * level                      (level) float64 0.0 1.0
    month                      (time) int64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
  * time                       (time) datetime64[ns] 1979-01-01 ...
Data variables:
    xarray_dataarray_variable  (time, level, lat, lon) float64 1.015 ...
```

Grouping by dayofyear doesn't work and gives this traceback:

```
ds_xr.groupby('time.dayofyear') - slope_itcp_ds.groupby('time.dayofyear').mean('time')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-10-01c0cf4c980a> in <module>()
----> 1 ds_xr.groupby('time.dayofyear') - slope_itcp_ds.groupby('time.dayofyear').mean('time')

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/groupby.py in func(self, other)
    316             g = f if not reflexive else lambda x, y: f(y, x)
    317             applied = self._yield_binary_applied(g, other)
--> 318             combined = self._combine(applied)
    319             return combined
    320         return func

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/groupby.py in _combine(self, applied, shortcut)
    532             combined = self._concat_shortcut(applied, dim, positions)
    533         else:
--> 534             combined = concat(applied, dim)
    535             combined = _maybe_reorder(combined, dim, positions)
    536

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in concat(objs, dim, data_vars, coords, compat, positions, indexers, mode, concat_over)
    118         raise TypeError('can only concatenate xarray Dataset and DataArray '
    119                         'objects, got %s' % type(first_obj))
--> 120     return f(objs, dim, data_vars, coords, compat, positions)
    121
    122

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in _dataset_concat(datasets, dim, data_vars, coords, compat, positions)
    210     datasets = align(*datasets, join='outer', copy=False, exclude=[dim])
    211
--> 212     concat_over = _calc_concat_over(datasets, dim, data_vars, coords)
    213
    214     def insert_result_variable(k, v):

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in _calc_concat_over(datasets, dim, data_vars, coords)
    190                            if dim in v.dims)
    191     concat_over.update(process_subset_opt(data_vars, 'data_vars'))
--> 192     concat_over.update(process_subset_opt(coords, 'coords'))
    193     if dim in datasets[0]:
    194         concat_over.add(dim)

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in process_subset_opt(opt, subset)
    165                            for ds in datasets[1:])
    166             # all nonindexes that are not the same in each dataset
--> 167             concat_new = set(k for k in getattr(datasets[0], subset)
    168                              if k not in concat_over and differs(k))
    169         elif opt == 'all':

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in <genexpr>(.0)
    166             # all nonindexes that are not the same in each dataset
    167             concat_new = set(k for k in getattr(datasets[0], subset)
--> 168                              if k not in concat_over and differs(k))
    169         elif opt == 'all':
    170             concat_new = (set(getattr(datasets[0], subset)) -

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in differs(vname)
    163             v = datasets[0].variables[vname]
    164             return any(not ds.variables[vname].equals(v)
--> 165                        for ds in datasets[1:])
    166             # all nonindexes that are not the same in each dataset
    167             concat_new = set(k for k in getattr(datasets[0], subset)

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/combine.py in <genexpr>(.0)
    163             v = datasets[0].variables[vname]
    164             return any(not ds.variables[vname].equals(v)
--> 165                        for ds in datasets[1:])
    166             # all nonindexes that are not the same in each dataset
    167             concat_new = set(k for k in getattr(datasets[0], subset)

/data/keeling/a/ahuang11/anaconda3/lib/python3.6/site-packages/xarray/core/utils.py in __getitem__(self, key)
    288
    289     def __getitem__(self, key):
--> 290         return self.mapping[key]
    291
    292     def __iter__(self):

KeyError: 'lon'
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1527/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
479940669 MDU6SXNzdWU0Nzk5NDA2Njk= 3212 Custom fill_value for from_dataframe/from_series shoyer 1217238 open 0     0 2019-08-13T03:22:46Z 2020-04-06T20:40:26Z   MEMBER      

It would be nice to have the option to customize the fill value when creating xarray objects from pandas, instead of requiring it to always be NaN.

This would probably be especially useful when creating sparse arrays (https://github.com/pydata/xarray/issues/3206), for which it often makes sense to use a fill value of zero. If your data has integer values (e.g., it represents counts), you probably don't want to let it be cast to float first.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3212/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
314482923 MDU6SXNzdWUzMTQ0ODI5MjM= 2061 Backend specific conventions decoding shoyer 1217238 open 0     1 2018-04-16T02:45:46Z 2020-04-05T23:42:34Z   MEMBER      

Currently, we have a single function xarray.decode_cf() that we apply to data loaded from all xarray backends.

This is appropriate for netCDF data, but not for backends with different conventions. For example, it doesn't work for zarr (which is why we have the separate open_zarr), and it is also a poor fit for PseudoNetCDF (https://github.com/pydata/xarray/pull/1905). In the worst cases (e.g., for PseudoNetCDF) it can actually result in data being decoded twice, which can produce incorrectly scaled data.

Instead, we should declare default decoders as part of the backend API, and use those decoders as the defaults for open_dataset().
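
As a rough illustration of the status quo (file and store names below are placeholders), the only way to opt out of the one-size-fits-all decoding today is to disable it and decode explicitly:

```python
import xarray as xr

# Default path: CF decoding is applied regardless of which backend loaded the data.
# ds = xr.open_dataset("data.nc")       # fine for netCDF
# ds = xr.open_zarr("store.zarr")       # zarr needs its own entry point today

# Workaround when a backend's data should not go through CF decoding (or risks
# being decoded twice): turn decoding off and apply it explicitly, exactly once.
raw = xr.open_dataset("data.nc", decode_cf=False)
decoded = xr.decode_cf(raw)
```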

This should probably be tackled as part of the broader backends refactor: https://github.com/pydata/xarray/issues/1970

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2061/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
173612265 MDU6SXNzdWUxNzM2MTIyNjU= 988 Hooks for custom attribute handling in xarray operations shoyer 1217238 open 0     24 2016-08-27T19:48:22Z 2020-04-05T18:19:11Z   MEMBER      

Over in #964, I am working on a rewrite/unification of the guts of xarray's logic for computation with labelled data. The goal is to get all of xarray's internal logic for working with labelled data going through a minimal set of flexible functions which we can also expose as part of the API.

Because we will finally have all (or at least nearly all) xarray operations using the same code path, I think it will also finally become feasible to open up hooks that let extensions control how xarray handles metadata.

Two obvious use cases here are units (#525) and automatic maintenance of metadata (e.g., cell_methods or history fields). Both of these are out of scope for xarray itself, mostly because the required logic tends to be domain specific. This could also subsume options like the existing keep_attrs on many operations.

I like the idea of supporting something like NumPy's __array_wrap__ to allow third-party code to finalize xarray objects in some way before they are returned. However, it's not obvious to me what the right design is (a rough sketch of the registration option follows this list):
- Should we look up a custom attribute on subclasses, like __array_wrap__ (or __numpy_ufunc__) in NumPy, or should we have a system (e.g., unilaterally or with a context manager and xarray.set_options) for registering hooks that are then checked on all xarray objects? I am inclined toward the latter, even though it's a little slower, just because it will be simpler and easier to get right.
- Should these methods be able to control the full result objects, or only set attrs and/or name?
- To be useful, do we need to allow extensions to take control of the full operation, to support things like automatic unit conversion? This would suggest something closer to __numpy_ufunc__, which is a little more ambitious than what I had in mind here.
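
Purely as a sketch of the registration option, every name below is hypothetical and nothing like it exists in xarray today:

```python
import xarray as xr

def record_history(result, context):
    """Example hook: attach a CF-style `history` entry to every returned object."""
    result.attrs["history"] = getattr(context, "func_name", "unknown")
    return result

# Registration could be global or scoped; both spellings below are invented:
# xr.register_attr_hook(record_history)
# with xr.set_options(attr_hooks=[record_history]):
#     result = ds_a + ds_b   # record_history would run on `result` before it is returned
```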

Feedback would be greatly appreciated.

CC @darothen @rabernat @jhamman @pwolfram

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/988/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
296120524 MDU6SXNzdWUyOTYxMjA1MjQ= 1901 Update assign to preserve order for **kwargs shoyer 1217238 open 0     1 2018-02-10T18:05:45Z 2020-02-10T19:44:20Z   MEMBER      

In Python 3.6+, keyword arguments preserve the order in which they are written. We should update assign and assign_coords to rely on this in the next major release, as has been done in pandas: https://github.com/pandas-dev/pandas/issues/14207
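
A small sketch of the behavior in question (the expected ordering is the proposal here, not a statement about current behavior):

```python
import xarray as xr

ds = xr.Dataset()
# In Python 3.6+ the keyword arguments arrive in the order they were written,
# so assign/assign_coords can preserve it:
ds2 = ds.assign(a=1, b=2, c=3)
list(ds2.data_vars)  # with order preserved, this would be ['a', 'b', 'c']
```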

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1901/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
398107776 MDU6SXNzdWUzOTgxMDc3NzY= 2666 Dataset.from_dataframe will produce a FutureWarning for DatetimeTZ data shoyer 1217238 open 0     6 2019-01-11T02:45:49Z 2019-12-30T22:58:23Z   MEMBER      

This appears with the development version of pandas; see https://github.com/pandas-dev/pandas/issues/24716 for details.

Example:
```
In [16]: df = pd.DataFrame({"A": pd.date_range('2000', periods=12, tz='US/Central')})

In [17]: df.to_xarray()
/Users/taugspurger/Envs/pandas-dev/lib/python3.7/site-packages/xarray/core/dataset.py:3111: FutureWarning: Converting timezone-aware DatetimeArray to timezone-naive ndarray with 'datetime64[ns]' dtype. In the future, this will return an ndarray with 'object' dtype where each element is a 'pandas.Timestamp' with the correct 'tz'. To accept the future behavior, pass 'dtype=object'. To keep the old behavior, pass 'dtype="datetime64[ns]"'.
  data = np.asarray(series).reshape(shape)
Out[17]:
<xarray.Dataset>
Dimensions:  (index: 12)
Coordinates:
  * index    (index) int64 0 1 2 3 4 5 6 7 8 9 10 11
Data variables:
    A        (index) datetime64[ns] 2000-01-01T06:00:00 ... 2000-01-12T06:00:00
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2666/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
96211612 MDU6SXNzdWU5NjIxMTYxMg== 486 API for multi-dimensional resampling/regridding shoyer 1217238 open 0     32 2015-07-21T02:38:29Z 2019-11-06T18:00:52Z   MEMBER      

This notebook by @kegl shows a nice example of how to use pyresample with xray: https://www.lri.fr/~kegl/Ramps/edaElNino.html#Downsampling

It would be nice to build a wrapper for this machinery directly into xray in some way.

xref #475

cc @jhamman @rabernat

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/486/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
269348789 MDU6SXNzdWUyNjkzNDg3ODk= 1668 Remove use of allow_cleanup_failure in test_backends.py shoyer 1217238 open 0     6 2017-10-28T20:47:31Z 2019-09-29T20:07:03Z   MEMBER      

This exists for the benefit of Windows, on which trying to delete an open file results in an error. But really, it would be nice to have a test suite that doesn't leave any temporary files hanging around.

The main culprit is tests like this, where opening a file triggers an error:
```python
with raises_regex(TypeError, 'pip install netcdf4'):
    open_dataset(tmp_file, engine='scipy')
```

The way to fix this is to use mocking of some sort, to intercept calls to backend file objects and close them afterwards.
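
A rough sketch of that approach, using the scipy backend as an example; the helper names are invented and this is not xarray's actual test code:

```python
from unittest import mock
import scipy.io

opened_files = []
real_netcdf_file = scipy.io.netcdf_file

def tracking_netcdf_file(*args, **kwargs):
    # Record every file object the backend opens so it can be closed in teardown.
    f = real_netcdf_file(*args, **kwargs)
    opened_files.append(f)
    return f

with mock.patch("scipy.io.netcdf_file", new=tracking_netcdf_file):
    pass  # run the test that is expected to raise

for f in opened_files:
    f.close()  # clean up even though the test body errored out
```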

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1668/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
317362786 MDU6SXNzdWUzMTczNjI3ODY= 2078 apply_ufunc should include variable names in error messages shoyer 1217238 open 0     4 2018-04-24T19:26:13Z 2019-08-26T18:10:23Z   MEMBER      

This would make it easier to debug issues with dimensions.

For example, in this case from StackOverflow, the error message was: ValueError: operand to apply_ufunc has required core dimensions ['time', 'lat', 'lon'], but some of these are missing on the input variable: ['lat', 'lon'].

A better error message would be: ValueError: operand to apply_ufunc has required core dimensions ['time', 'lat', 'lon'], but some of these are missing on input variable 'status': ['lat', 'lon']
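
For reference, a minimal sketch of the kind of call that triggers this error (the dataset and variable names are made up to mirror the example above):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "temp": (("time", "lat", "lon"), np.zeros((2, 3, 4))),
        "status": (("time",), np.zeros(2)),  # missing the 'lat'/'lon' core dims
    }
)

# Raises ValueError because 'status' lacks 'lat' and 'lon'; the current message
# does not say which variable is at fault.
xr.apply_ufunc(np.mean, ds, input_core_dims=[["time", "lat", "lon"]])
```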

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2078/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
188113943 MDU6SXNzdWUxODgxMTM5NDM= 1097 Better support for subclasses: tests, docs and API shoyer 1217238 open 0     14 2016-11-08T21:54:00Z 2019-08-22T13:07:44Z   MEMBER      

Given that people do currently subclass xarray objects, it's worth considering making a subclass API like pandas: http://pandas.pydata.org/pandas-docs/stable/internals.html#subclassing-pandas-data-structures

At the very least, it would be nice to have docs that describe how/when it's safe to subclass, and tests that verify our support for such subclasses.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1097/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
292000828 MDU6SXNzdWUyOTIwMDA4Mjg= 1861 Add an example page to the docs on geospatial filtering/indexing shoyer 1217238 open 0     0 2018-01-26T19:07:11Z 2019-07-12T02:53:53Z   MEMBER      

We cover standard time-series stuff pretty well in the "Toy weather data" example, but geospatial filtering/indexing questions come up all the time and aren't well covered.

Topics could include:
- How to filter out a region of interest (sel() with slice and where(..., drop=True))
- How to align two gridded datasets in space.
- How to sample a gridded dataset at a list of station locations
- How to resample a dataset to a new resolution (possibly referencing xESMF)

Not all of these are as smooth as they could be, but hopefully that will clearly point to where we have room for improvement in our APIs :).
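
For instance, the first topic could be sketched roughly like this (made-up data, not an official example):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"t2m": (("lat", "lon"), np.random.rand(180, 360))},
    coords={"lat": np.arange(-89.5, 90), "lon": np.arange(0.5, 360)},
)

# Rectangular region with monotonic coordinates: label-based slicing.
box = ds.sel(lat=slice(10, 50), lon=slice(100, 160))

# Irregular region: mask and drop what falls outside.
masked = ds.where((ds.lat > 10) & (ds.lat < 50), drop=True)
```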

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1861/reactions",
    "total_count": 6,
    "+1": 6,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
35633124 MDU6SXNzdWUzNTYzMzEyNA== 155 Expose a public interface for CF encoding/decoding functions shoyer 1217238 open 0     3 2014-06-12T23:33:42Z 2019-02-04T04:17:40Z   MEMBER      

Relevant discussion: #153

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/155/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
403504120 MDU6SXNzdWU0MDM1MDQxMjA= 2719 Should xarray.align sort indexes in alignment? shoyer 1217238 open 0     1 2019-01-27T01:51:29Z 2019-01-28T18:03:53Z   MEMBER      

I noticed in https://github.com/pandas-dev/pandas/issues/24959 (which turned up as a failure in our test suite) that pandas sorts by default in Index.union and now Index.intersection, unless the indexes are the same or either index has duplicates. (These aspects are probably bugs.)

It occurs to me that we should make an intentional choice about sorting in xarray.align(), rather than merely following the whims of changed upstream behavior. Note that align() is called internally by all xarray operations that combine multiple objects (e.g., in arithmetic).

My proposal is to use "order of appearance" and not sort by default, but add a sort keyword argument to allow users to control this. Reasons for the default behavior of not sorting:
1. Sorting can't be undone if the original order is lost, so this preserves maximum flexibility for users.
2. This matches how we handle the ordering of dimensions in broadcasting.
3. Pandas is quite inconsistent with how it applies sorting and we don't want to copy that in xarray. We definitely don't want to sort in all cases by default (e.g., if objects have the same index), so we should avoid sorting in others.
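
A small sketch of what the proposed default and keyword might look like (the sort argument is the proposal here, not an existing option):

```python
import xarray as xr

a = xr.DataArray([1, 2], coords={"x": ["b", "a"]}, dims="x")
b = xr.DataArray([3, 4], coords={"x": ["a", "c"]}, dims="x")

# Proposed default: keep "order of appearance" for the union of the 'x' labels.
a2, b2 = xr.align(a, b, join="outer")

# Proposed opt-in sorting (hypothetical keyword):
# a2, b2 = xr.align(a, b, join="outer", sort=True)
```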

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2719/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
316448044 MDU6SXNzdWUzMTY0NDgwNDQ= 2069 to_netcdf() should not implicitly load dask arrays of strings into memory shoyer 1217238 open 0     0 2018-04-21T00:57:23Z 2019-01-13T01:41:20Z   MEMBER      

As discussed in https://github.com/pydata/xarray/pull/2058#discussion_r181606513, we should have an explicit interface of some sort, either via encoding or some new keyword argument to to_netcdf().

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2069/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
Powered by Datasette · Queries took 1655.625ms · About: xarray-datasette