issues

19 rows where comments = 6 and user = 1217238 sorted by updated_at descending

type (2 values)

  • issue 14
  • pull 5

state (2 values)

  • closed 15
  • open 4

repo (1 value)

  • xarray 19
Columns: id, node_id, number, title, user, state, locked, assignee, milestone, comments, created_at, updated_at ▲, closed_at, author_association, active_lock_reason, draft, pull_request, body, reactions, performed_via_github_app, state_reason, repo, type
842436143 MDU6SXNzdWU4NDI0MzYxNDM= 5081 Lazy indexing arrays as a stand-alone package shoyer 1217238 open 0     6 2021-03-27T07:06:03Z 2023-12-15T13:20:03Z   MEMBER      

From @rabernat on Twitter:

"Xarray has some secret private classes for lazily indexing / wrapping arrays that are so useful I think they should be broken out into a standalone package. https://github.com/pydata/xarray/blob/master/xarray/core/indexing.py#L516"

The idea here is to create a first-class "duck array" library for lazy indexing that could replace xarray's internal classes for lazy indexing. This would be in some ways similar to dask.array, but much simpler, because it doesn't have to worry about parallel computing.

Desired features:

  • Lazy indexing
  • Lazy transposes
  • Lazy concatenation (#4628) and stacking
  • Lazy vectorized operations (e.g., unary and binary arithmetic)
    • needed for decoding variables from disk (xarray.encoding) and
    • building lazy multi-dimensional coordinate arrays corresponding to map projections (#3620)
  • Maybe: lazy reshapes (#4113)

A common feature of these operations is that they can (and almost always should) be fused with indexing: if N elements are selected via indexing, only O(N) compute and memory is required to produce them, regardless of the size of the original arrays, as long as the number of applied operations can be treated as a constant. Memory access is significantly slower than compute on modern hardware, so recomputing these operations on the fly is almost always a good idea.

Out of scope: lazy computation when indexing could require access to many more elements to compute the desired value than are returned. For example, mean() probably should not be lazy, because that could involve computation of a very large number of elements that one might want to cache.
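As a rough illustration of the fused lazy indexing described above (not xarray's actual implementation, just a minimal sketch), a wrapper can defer elementwise operations and replay them only on the elements actually selected:

```python
import numpy as np

class LazyArray:
    """Minimal sketch: defer elementwise ops until values are materialized."""

    def __init__(self, source, ops=None):
        self.source = source          # underlying on-disk/duck array
        self.ops = ops or []          # queued elementwise functions

    def __getitem__(self, key):
        # Indexing is applied to the source first, so only O(N) elements
        # are read; queued ops are then replayed on that small selection.
        selected = self.source[key]
        for op in self.ops:
            selected = op(selected)
        return selected

    def map(self, func):
        # Queue a lazy elementwise operation (e.g., mask/scale decoding).
        return LazyArray(self.source, self.ops + [func])

# usage: lazily "decode" a scaled variable, then pull out one small slice
raw = np.arange(12).reshape(3, 4)              # stand-in for an on-disk array
scaled = LazyArray(raw).map(lambda x: 0.5 * x + 10.0)
print(scaled[0, :2])                           # only 2 elements are computed
```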

This is valuable functionality for Xarray for two reasons:

  1. It allows for "previewing" small bits of data loaded from disk or remote storage, even if that data needs some form of cheap "decoding" from its form on disk.
  2. It allows for xarray to decode data in a lazy fashion that is compatible with full-featured systems for lazy computation (e.g., Dask), without requiring the user to choose dask when reading the data.

Related issues:

  • [Proposal] Expose Variable without Pandas dependency #3981
  • Lazy concatenation of arrays #4628
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5081/reactions",
    "total_count": 6,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 6,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
253395960 MDU6SXNzdWUyNTMzOTU5NjA= 1533 Index variables loaded from dask can be computed twice shoyer 1217238 closed 0     6 2017-08-28T17:18:27Z 2023-04-06T04:15:46Z 2023-04-06T04:15:46Z MEMBER      

as reported by @crusaderky in #1522

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1533/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
948890466 MDExOlB1bGxSZXF1ZXN0NjkzNjY1NDEy 5624 Make typing-extensions optional shoyer 1217238 closed 0     6 2021-07-20T17:43:22Z 2021-07-22T23:30:49Z 2021-07-22T23:02:03Z MEMBER   0 pydata/xarray/pulls/5624

Type checking may be a little worse if typing-extensions is not installed, but I don't think it's worth the trouble of adding another hard dependency just for one use of TypeGuard.

Note: sadly this doesn't work yet. Mypy (and pylance) don't like the type alias defined with try/except. Any ideas? In the worst case, we could revert the TypeGuard entirely, but that would be a shame...
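For context, the pattern under discussion is roughly the following optional-import fallback (a sketch, not the PR's actual code; the names DictGuard and is_dict_like are illustrative). The sticking point is that type checkers may reject a type alias defined conditionally under try/except:

```python
from typing import Any

try:
    from typing_extensions import TypeGuard
    DictGuard = TypeGuard[dict]       # conditional type alias (upsets mypy/pylance)
except ImportError:
    DictGuard = bool                  # degrade gracefully without typing-extensions

def is_dict_like(value: Any) -> "DictGuard":
    # With TypeGuard available, type checkers can narrow `value` to dict at
    # call sites; without it, the annotation is effectively just a bool.
    return hasattr(value, "keys") and hasattr(value, "__getitem__")

print(is_dict_like({"a": 1}), is_dict_like([1, 2]))   # True False
```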

  • [x] Closes #5495
  • [x] Passes pre-commit run --all-files
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5624/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
715374721 MDU6SXNzdWU3MTUzNzQ3MjE= 4490 Group together decoding options into a single argument shoyer 1217238 open 0     6 2020-10-06T06:15:18Z 2020-10-29T04:07:46Z   MEMBER      

Is your feature request related to a problem? Please describe.

open_dataset() currently has a very long function signature. This makes it hard to keep track of everything it can do, and is particularly problematic for the authors of new backends (e.g., see https://github.com/pydata/xarray/pull/4477), who might need to know how to handle all these arguments.

Describe the solution you'd like

To simplify the interface, I propose to group together all the decoding options into a new DecodingOptions class. I'm thinking of something like:

```python
from dataclasses import dataclass, field, asdict
from typing import Optional, List

@dataclass(frozen=True)
class DecodingOptions:
    mask: Optional[bool] = None
    scale: Optional[bool] = None
    datetime: Optional[bool] = None
    timedelta: Optional[bool] = None
    use_cftime: Optional[bool] = None
    concat_characters: Optional[bool] = None
    coords: Optional[bool] = None
    drop_variables: Optional[List[str]] = None

    @classmethod
    def disabled(cls):
        return cls(mask=False, scale=False, datetime=False, timedelta=False,
                   concat_characters=False, coords=False)

    def non_defaults(self):
        return {k: v for k, v in asdict(self).items() if v is not None}

    # add another method for creating default VariableCoder() objects,
    # e.g., those listed in encode_cf_variable()
```

The signature of open_dataset would then become:

```python
def open_dataset(
    filename_or_obj,
    group=None,
    *,
    engine=None,
    chunks=None,
    lock=None,
    cache=None,
    backend_kwargs=None,
    decode: Union[DecodingOptions, bool] = None,
    **deprecated_kwargs
):
    if decode is None:
        decode = DecodingOptions()
    if decode is False:
        decode = DecodingOptions.disabled()
    # handle deprecated_kwargs...
    ...
```

Question: are decode and DecodingOptions the right names? Maybe these should still include the name "CF", e.g., decode_cf and CFDecodingOptions, given that these are specific to CF conventions?

Note: the current signature is open_dataset(filename_or_obj, group=None, decode_cf=True, mask_and_scale=None, decode_times=True, autoclose=None, concat_characters=True, decode_coords=True, engine=None, chunks=None, lock=None, cache=None, drop_variables=None, backend_kwargs=None, use_cftime=None, decode_timedelta=None)

Usage with the new interface would look like xr.open_dataset(filename, decode=False) or xr.open_dataset(filename, decode=xr.DecodingOptions(mask=False, scale=False)).

This requires a little bit more typing than what we currently have, but it has a few advantages:

  1. It's easier to understand the role of different arguments. Now there is a function with ~8 arguments and a class with ~8 arguments rather than a function with ~15 arguments.
  2. It's easier to add new decoding arguments (e.g., for more advanced CF conventions), because they don't clutter the open_dataset interface. For example, I separated out mask and scale arguments, versus the current mask_and_scale argument.
  3. If a new backend plugin for open_dataset() needs to handle every option supported by open_dataset(), this makes that task significantly easier. The only decoding options they need to worry about are non-default options that were explicitly set, i.e., those exposed by the non_defaults() method. If another decoding option wasn't explicitly set and isn't recognized by the backend, they can just ignore it.
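To illustrate the third point, a backend could consume only the explicitly-set options along these lines (a hypothetical sketch building on the DecodingOptions class above; my_backend_open is not a real function):

```python
def my_backend_open(filename, decode):
    # Hypothetical backend entry point: only decoding options that were
    # explicitly set by the user need to be considered.
    requested = decode.non_defaults()            # e.g. {'mask': False}
    supported = {"mask", "scale", "datetime"}    # whatever this backend handles
    unsupported = set(requested) - supported
    if unsupported:
        raise ValueError(f"backend cannot handle decoding options: {unsupported}")
    print(f"opening {filename!r} with {requested}")
    # ... open the file and apply the requested decoding ...

my_backend_open("example.nc", DecodingOptions(mask=False))
```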

Describe alternatives you've considered

For the overall approach:

  1. We could keep the current design, with separate keyword arguments for decoding options, and just be very careful about passing around these arguments. This seems pretty painful for the backend refactor, though.
  2. We could keep the current design only for the user facing open_dataset() interface, and then internally convert into the DecodingOptions() struct for passing to backend constructors. This would provide much needed flexibility for backend authors, but most users wouldn't benefit from the new interface. Perhaps this would make sense as an intermediate step?
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4490/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
702372014 MDExOlB1bGxSZXF1ZXN0NDg3NjYxMzIz 4426 Fix for h5py deepcopy issues shoyer 1217238 closed 0     6 2020-09-16T01:11:00Z 2020-09-18T22:31:13Z 2020-09-18T22:31:09Z MEMBER   0 pydata/xarray/pulls/4426
  • [x] Closes #4425
  • [x] Tests added
  • [x] Passes isort . && black . && mypy . && flake8
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4426/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
417542619 MDU6SXNzdWU0MTc1NDI2MTk= 2803 Test failure with TestValidateAttrs.test_validating_attrs shoyer 1217238 closed 0     6 2019-03-05T23:03:02Z 2020-08-25T14:29:19Z 2019-03-14T15:59:13Z MEMBER      

This is due to setting multi-dimensional attributes being an error, as of the latest netCDF4-Python release: https://github.com/Unidata/netcdf4-python/blob/master/Changelog

E.g., as seen on Appveyor: https://ci.appveyor.com/project/shoyer/xray/builds/22834250/job/9q0ip6i3cchlbkw2

```
=================================== FAILURES ===================================
_________________ TestValidateAttrs.test_validating_attrs ______________________
self = <xarray.tests.test_backends.TestValidateAttrs object at 0x00000096BE5FAFD0>
def test_validating_attrs(self):
    def new_dataset():
        return Dataset({'data': ('y', np.arange(10.0))}, {'y': np.arange(10)})

    def new_dataset_and_dataset_attrs():
        ds = new_dataset()
        return ds, ds.attrs

    def new_dataset_and_data_attrs():
        ds = new_dataset()
        return ds, ds.data.attrs

    def new_dataset_and_coord_attrs():
        ds = new_dataset()
        return ds, ds.coords['y'].attrs

    for new_dataset_and_attrs in [new_dataset_and_dataset_attrs,
                                  new_dataset_and_data_attrs,
                                  new_dataset_and_coord_attrs]:
        ds, attrs = new_dataset_and_attrs()

        attrs[123] = 'test'
        with raises_regex(TypeError, 'Invalid name for attr'):
            ds.to_netcdf('test.nc')

        ds, attrs = new_dataset_and_attrs()
        attrs[MiscObject()] = 'test'
        with raises_regex(TypeError, 'Invalid name for attr'):
            ds.to_netcdf('test.nc')

        ds, attrs = new_dataset_and_attrs()
        attrs[''] = 'test'
        with raises_regex(ValueError, 'Invalid name for attr'):
            ds.to_netcdf('test.nc')

        # This one should work
        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = 'test'
        with create_tmp_file() as tmp_file:
            ds.to_netcdf(tmp_file)

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = {'a': 5}
        with raises_regex(TypeError, 'Invalid value for attr'):
            ds.to_netcdf('test.nc')

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = MiscObject()
        with raises_regex(TypeError, 'Invalid value for attr'):
            ds.to_netcdf('test.nc')

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = 5
        with create_tmp_file() as tmp_file:
            ds.to_netcdf(tmp_file)

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = 3.14
        with create_tmp_file() as tmp_file:
            ds.to_netcdf(tmp_file)

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = [1, 2, 3, 4]
        with create_tmp_file() as tmp_file:
            ds.to_netcdf(tmp_file)

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = (1.9, 2.5)
        with create_tmp_file() as tmp_file:
            ds.to_netcdf(tmp_file)

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = np.arange(5)
        with create_tmp_file() as tmp_file:
            ds.to_netcdf(tmp_file)

        ds, attrs = new_dataset_and_attrs()
        attrs['test'] = np.arange(12).reshape(3, 4)
        with create_tmp_file() as tmp_file:
          ds.to_netcdf(tmp_file)

xarray\tests\test_backends.py:3450:

xarray\core\dataset.py:1323: in to_netcdf
    compute=compute)
xarray\backends\api.py:767: in to_netcdf
    unlimited_dims=unlimited_dims)
xarray\backends\api.py:810: in dump_to_store
    unlimited_dims=unlimited_dims)
xarray\backends\common.py:262: in store
    self.set_attributes(attributes)
xarray\backends\common.py:278: in set_attributes
    self.set_attribute(k, v)
xarray\backends\netCDF4_.py:418: in set_attribute
    _set_nc_attribute(self.ds, key, value)
xarray\backends\netCDF4_.py:294: in _set_nc_attribute
    obj.setncattr(key, value)
netCDF4\_netCDF4.pyx:2781: in netCDF4._netCDF4.Dataset.setncattr
    ???

???
E   ValueError: multi-dimensional array attributes not supported

netCDF4\_netCDF4.pyx:1514: ValueError
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2803/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
398107776 MDU6SXNzdWUzOTgxMDc3NzY= 2666 Dataset.from_dataframe will produce a FutureWarning for DatetimeTZ data shoyer 1217238 open 0     6 2019-01-11T02:45:49Z 2019-12-30T22:58:23Z   MEMBER      

This appears with the development version of pandas; see https://github.com/pandas-dev/pandas/issues/24716 for details.

Example:

```
In [16]: df = pd.DataFrame({"A": pd.date_range('2000', periods=12, tz='US/Central')})

In [17]: df.to_xarray()
/Users/taugspurger/Envs/pandas-dev/lib/python3.7/site-packages/xarray/core/dataset.py:3111: FutureWarning:
Converting timezone-aware DatetimeArray to timezone-naive ndarray with 'datetime64[ns]' dtype. In the future,
this will return an ndarray with 'object' dtype where each element is a 'pandas.Timestamp' with the correct 'tz'.
To accept the future behavior, pass 'dtype=object'. To keep the old behavior, pass 'dtype="datetime64[ns]"'.
  data = np.asarray(series).reshape(shape)
Out[17]:
<xarray.Dataset>
Dimensions:  (index: 12)
Coordinates:
  * index    (index) int64 0 1 2 3 4 5 6 7 8 9 10 11
Data variables:
    A        (index) datetime64[ns] 2000-01-01T06:00:00 ... 2000-01-12T06:00:00
```
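Not part of the issue, but one user-side way to sidestep the warning is to strip the timezone (here by converting to UTC) before calling to_xarray:

```python
import pandas as pd

df = pd.DataFrame({"A": pd.date_range("2000", periods=12, tz="US/Central")})

# Convert to UTC and drop the timezone so the column is plain datetime64[ns];
# this avoids the FutureWarning at the cost of losing the tz information.
naive = df.assign(A=df["A"].dt.tz_convert("UTC").dt.tz_localize(None))
ds = naive.to_xarray()
print(ds["A"].dtype)    # datetime64[ns]
```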

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2666/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
269348789 MDU6SXNzdWUyNjkzNDg3ODk= 1668 Remove use of allow_cleanup_failure in test_backends.py shoyer 1217238 open 0     6 2017-10-28T20:47:31Z 2019-09-29T20:07:03Z   MEMBER      

This exists for the benefit of Windows, on which trying to delete an open file results in an error. But really, it would be nice to have a test suite that doesn't leave any temporary files hanging around.

The main culprit is tests like this, where opening a file triggers an error:

```python
with raises_regex(TypeError, 'pip install netcdf4'):
    open_dataset(tmp_file, engine='scipy')
```

The way to fix this is to use mocking of some sort, to intercept calls to backend file objects and close them afterwards.
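A rough sketch of that mocking approach (hypothetical, not the actual test-suite code) could wrap whatever callable opens the backend file, so that any file objects created during a failing test are recorded and closed afterwards:

```python
import contextlib
from unittest import mock

@contextlib.contextmanager
def track_and_close(module, name):
    """Wrap `module.name` (a callable that opens file-like objects) so that
    everything it returns is closed when the context exits, even if the test
    body raised before it could clean up."""
    original = getattr(module, name)
    opened = []

    def wrapper(*args, **kwargs):
        obj = original(*args, **kwargs)
        opened.append(obj)
        return obj

    with mock.patch.object(module, name, wrapper):
        try:
            yield opened
        finally:
            for obj in opened:
                obj.close()

# hypothetical usage in a test:
# import scipy.io
# with track_and_close(scipy.io, "netcdf_file"):
#     with raises_regex(TypeError, 'pip install netcdf4'):
#         open_dataset(tmp_file, engine='scipy')
```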

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1668/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
278713328 MDU6SXNzdWUyNzg3MTMzMjg= 1756 Deprecate inplace methods shoyer 1217238 closed 0   0.11 2856429 6 2017-12-02T20:09:00Z 2019-03-25T19:19:10Z 2018-11-03T21:24:13Z MEMBER      

The following methods have an inplace argument:

  • DataArray.reset_coords
  • DataArray.set_index
  • DataArray.reset_index
  • DataArray.reorder_levels
  • Dataset.set_coords
  • Dataset.reset_coords
  • Dataset.rename
  • Dataset.swap_dims
  • Dataset.set_index
  • Dataset.reset_index
  • Dataset.reorder_levels
  • Dataset.update
  • Dataset.merge

As proposed in https://github.com/pydata/xarray/issues/1755#issuecomment-348682403, let's deprecate all of these at the next major release (v0.11). They add unnecessary complexity to methods and promote confusion about xarray's data model.

Practically, we would change all of the default values to inplace=None and issue either a DeprecationWarning or FutureWarning (see PEP 565 for more details on that choice).
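Concretely, the deprecation shim could look something like this sketch (illustrative only; _check_inplace is not necessarily the name used in xarray):

```python
import warnings

def _check_inplace(inplace, default=False):
    # `inplace=None` means the user did not pass the argument; any explicit
    # value triggers the deprecation warning before the argument is removed.
    if inplace is None:
        return default
    warnings.warn(
        "the `inplace` argument is deprecated and will be removed in a "
        "future version of xarray",
        FutureWarning,
        stacklevel=3,
    )
    return inplace

# inside a method such as Dataset.set_coords:
# def set_coords(self, names, inplace=None):
#     inplace = _check_inplace(inplace)
#     ...
```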

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1756/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
395332265 MDExOlB1bGxSZXF1ZXN0MjQxODExMjc4 2642 Use pycodestyle for lint checks. shoyer 1217238 closed 0     6 2019-01-02T18:11:38Z 2019-03-14T06:27:20Z 2019-01-03T18:10:13Z MEMBER   0 pydata/xarray/pulls/2642

flake8 includes a few more useful checks, but it's annoying to only see its output in Travis-CI results.

This keeps Travis-CI and pep8speaks in sync.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2642/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
369310993 MDU6SXNzdWUzNjkzMTA5OTM= 2480 test_apply_dask_new_output_dimension is broken on master with dask-dev shoyer 1217238 closed 0     6 2018-10-11T21:24:33Z 2018-10-12T16:26:17Z 2018-10-12T16:26:17Z MEMBER      

Example build failure: https://travis-ci.org/pydata/xarray/jobs/439949937

```
=================================== FAILURES ===================================
____________________ test_apply_dask_new_output_dimension _____________________
@requires_dask
def test_apply_dask_new_output_dimension():
    import dask.array as da

    array = da.ones((2, 2), chunks=(1, 1))
    data_array = xr.DataArray(array, dims=('x', 'y'))

    def stack_negative(obj):
        def func(x):
            return np.stack([x, -x], axis=-1)
        return apply_ufunc(func, obj, output_core_dims=[['sign']],
                           dask='parallelized', output_dtypes=[obj.dtype],
                           output_sizes={'sign': 2})

    expected = stack_negative(data_array.compute())

    actual = stack_negative(data_array)
    assert actual.dims == ('x', 'y', 'sign')
    assert actual.shape == (2, 2, 2)
    assert isinstance(actual.data, da.Array)
  assert_identical(expected, actual)

xarray/tests/test_computation.py:737:


xarray/tests/test_computation.py:24: in assert_identical
    assert a.identical(b), msg
xarray/core/dataarray.py:1923: in identical
    self._all_compat(other, 'identical'))
xarray/core/dataarray.py:1875: in _all_compat
    compat(self, other))
xarray/core/dataarray.py:1872: in compat
    return getattr(x.variable, compat_str)(y.variable)
xarray/core/variable.py:1461: in identical
    self.equals(other))
xarray/core/variable.py:1439: in equals
    equiv(self.data, other.data)))
xarray/core/duck_array_ops.py:144: in array_equiv
    arr1, arr2 = as_like_arrays(arr1, arr2)
xarray/core/duck_array_ops.py:128: in as_like_arrays
    return tuple(np.asarray(d) for d in data)
xarray/core/duck_array_ops.py:128: in <genexpr>
    return tuple(np.asarray(d) for d in data)
../../../miniconda/envs/test_env/lib/python3.6/site-packages/numpy/core/numeric.py:501: in asarray
    return array(a, dtype, copy=False, order=order)
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/array/core.py:1118: in __array__
    x = self.compute()
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/base.py:156: in compute
    (result,) = compute(self, traverse=False, **kwargs)
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/base.py:390: in compute
    dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/base.py:194: in collections_to_dsk
    for opt, (dsk, keys) in groups.items()]))
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/base.py:194: in <listcomp>
    for opt, (dsk, keys) in groups.items()]))
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/array/optimization.py:41: in optimize
    dsk = ensure_dict(dsk)
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/utils.py:830: in ensure_dict
    result.update(dd)
../../../miniconda/envs/test_env/lib/python3.6/_collections_abc.py:720: in __iter__
    yield from self._mapping
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/array/top.py:168: in __iter__
    return iter(self._dict)
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/array/top.py:160: in _dict
    concatenate=self.concatenate
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/array/top.py:305: in top
    keytups = list(itertools.product(*[range(dims[i]) for i in out_indices]))

.0 = <tuple_iterator object at 0x7f606ba84fd0>

    keytups = list(itertools.product(*[range(dims[i]) for i in out_indices]))
E   KeyError: '.0'

../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/array/top.py:305: KeyError
```

My guess is that this is somehow related to @mrocklin's recent refactor of dask.array.atop: https://github.com/dask/dask/pull/3998

If the cause isn't obvious, I'll try to come up with a simple dask only example that reproduces it.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2480/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
361915770 MDU6SXNzdWUzNjE5MTU3NzA= 2424 0.10.9 release shoyer 1217238 closed 0     6 2018-09-19T20:31:29Z 2018-09-26T01:05:09Z 2018-09-22T15:14:48Z MEMBER      

It's now been two months since the 0.10.8 release, so we really ought to issue a new minor release.

I was initially thinking of skipping straight to 0.11.0 if we include https://github.com/pydata/xarray/pull/2261 (xarray.backends refactor), but it seems that will take a bit longer to review/test so it's probably worth issuing a 0.10.9 release first.

@pydata/xarray -- are there any PRs / bug-fixes in particular we should wait for before issuing the release?

I suppose it would be good to sort out https://github.com/pydata/xarray/issues/2422 (Plot2D no longer sorts coordinates before plotting)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2424/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
302153432 MDExOlB1bGxSZXF1ZXN0MTcyNzYxNTAw 1962 Support __array_ufunc__ for xarray objects. shoyer 1217238 closed 0     6 2018-03-05T02:36:20Z 2018-03-12T20:31:07Z 2018-03-12T20:31:07Z MEMBER   0 pydata/xarray/pulls/1962

This means NumPy ufuncs are now supported directly on xarray.Dataset objects, and opens the door to supporting computation on new data types, such as sparse arrays or arrays with units.
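For example, with this change a NumPy ufunc applied to an xarray object dispatches back to xarray (a small usage sketch):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", [0.0, np.pi / 2, np.pi])})

# np.sin dispatches through __array_ufunc__ and returns an xarray.Dataset
result = np.sin(ds)
print(type(result).__name__)    # Dataset
print(result["a"].values)       # [0.  1.  1.2246468e-16]
```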

  • [x] Closes #1617 (remove if there is no corresponding issue, which should only be the case for minor changes)
  • [x] Tests added (for all bug fixes or enhancements)
  • [x] Tests passed (for all non-documentation changes)
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1962/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
195579837 MDU6SXNzdWUxOTU1Nzk4Mzc= 1164 Don't warn when doing comparisons or arithmetic with NaN shoyer 1217238 closed 0     6 2016-12-14T16:33:05Z 2018-02-27T19:35:25Z 2018-02-27T16:03:43Z MEMBER      

Pandas used to unilaterally disable NumPy's warnings for doing comparisons with NaN, but now it doesn't: https://github.com/pandas-dev/pandas/issues/13109

See also http://stackoverflow.com/questions/41130138/why-is-invalid-value-encountered-in-greater-warning-thrown-in-python-xarray-fo/41147570#41147570
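For reference, the warning in question comes straight from NumPy and can be silenced locally with np.errstate (a minimal sketch, not xarray code):

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0])

# On NumPy versions from around the time of this issue, the comparison below
# emitted "RuntimeWarning: invalid value encountered in greater".  Wrapping
# it in np.errstate silences the warning for just this block:
with np.errstate(invalid="ignore"):
    print(arr > 2.0)    # [False False  True]
```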

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1164/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
171828347 MDU6SXNzdWUxNzE4MjgzNDc= 974 Indexing with alignment and broadcasting shoyer 1217238 closed 0   1.0 741199 6 2016-08-18T06:39:27Z 2018-02-04T23:30:12Z 2018-02-04T23:30:11Z MEMBER      

I think we can bring all of NumPy's advanced indexing to xarray in a very consistent way, with only very minor breaks in backwards compatibility.

For boolean indexing:

  • da[key] where key is a boolean labelled array (with any number of dimensions) is made equivalent to da.where(key.reindex_like(da), drop=True). This matches the existing behavior if key is a 1D boolean array. For multi-dimensional arrays, even though the result is now multi-dimensional, this coupled with automatic skipping of NaNs means that da[key].mean() gives the same result as in NumPy.
  • da[key] = value where key is a boolean labelled array can be made equivalent to da = da.where(*align(key.reindex_like(da), value.reindex_like(da))) (that is, the three-argument form of where).
  • da[key_0, ..., key_n] where all of the key_i are boolean arrays gets handled in the usual way. It is an IndexingError to supply multiple labelled keys if any of them are not already aligned with the corresponding index coordinates (and share the same dimension name). If they want alignment, we suggest users simply write da[key_0 & ... & key_n].

For vectorized indexing (by integer or index value):

  • da[key_0, ..., key_n] where all of the key_i are integer labelled arrays with any number of dimensions gets handled like NumPy, except instead of broadcasting numpy-style we do broadcasting xarray-style:
    • If any of the key_i are unlabelled, 1D arrays (e.g., numpy arrays), we convert them into an xarray.Variable along the respective dimension. 0D arrays remain scalars. This ensures that the result of broadcasting them (in the next step) will be consistent with our current "outer indexing" behavior. Unlabelled higher dimensional arrays trigger an IndexingError.
    • We ensure all keys have the same dimensions/coordinates by mapping to da[*broadcast(key_0, ..., key_n)] (note that broadcast now includes automatic alignment).
    • The result's dimensions and coordinates are copied from the broadcast keys.
    • The result's values are taken by mapping each set of integer locations specified by the broadcast version of key_i to the integer position on the corresponding ith axis on da.
  • Labeled indexing like ds.loc[key_0, ..., key_n] works exactly as above, except instead of doing integer lookup, we look up label values in the corresponding index instead.
  • Indexing with .isel and .sel/.reindex works like the two previous cases, except we look up axes by dimension name instead of axis position.
  • I haven't fully thought through the implications for assignment (da[key] = value or da.loc[key] = value), but I think it works in a straightforwardly similar fashion.

All of these methods should also work for indexing on Dataset by looping over Dataset variables in the usual way.

This framework neatly subsumes most of the major limitations with xarray's existing indexing:

  • Boolean indexing on multi-dimensional arrays works in an intuitive way, for both selection and assignment.
  • No more need for specialized methods (sel_points/isel_points) for pointwise indexing. If you want to select along the diagonal of an array, you simply need to supply indexers that use a new dimension. Instead of arr.sel_points(lat=stations.lat, lon=stations.lon, dim='station'), you would simply write arr.sel(lat=stations.lat, lon=stations.lon) -- the station dimension is taken automatically from the indexer.
  • Other use cases for NumPy's advanced indexing that currently are impossible in xarray also automatically work. For example, nearest neighbor interpolation to a completely different grid is now as simple as ds.reindex(lon=grid.lon, lat=grid.lat, method='nearest', tolerance=0.5) or ds.reindex_like(grid, method='nearest', tolerance=0.5).
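To make the pointwise-indexing case concrete, a small sketch of the proposed behavior (this is essentially what later became xarray's vectorized indexing):

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("lat", "lon"),
    coords={"lat": [10, 20, 30], "lon": [100, 110, 120, 130]},
)

# Indexers sharing a new "station" dimension select pointwise, not outer-product:
stations_lat = xr.DataArray([10, 30], dims="station")
stations_lon = xr.DataArray([110, 130], dims="station")
picked = arr.sel(lat=stations_lat, lon=stations_lon)
print(picked.dims, picked.values)   # ('station',) [ 1 11]
```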

Questions to consider:

  • How does this interact with @benbovy's enhancements for MultiIndex indexing? (#802 and #947)
  • How do we handle mixed slice and array indexing? In NumPy, this is a major source of confusion, because slicing is done before broadcasting and the order of slices in the result is handled separately from broadcast indices. I think we may be able to resolve this by mapping slices in this case to 1D arrays along their respective axes, and using our normal broadcasting rules.
  • Should we deprecate non-boolean indexing with [] and .loc[] and non-labelled arrays when some but not all dimensions are provided? Instead, we would require explicitly indexing like [key, ...] (yes, writing ...), which indicates "all trailing axes" like NumPy. This behavior has been suggested for new indexers in NumPy because it precludes a class of bugs where the array has an unexpected number of dimensions. On the other hand, it's not so necessary for us when we have explicit indexing by dimension name with .sel.

xref these comments from @MaximilianR and myself

Note: I would certainly welcome help making this happen from a contributor other than myself, though you should probably wait until I finish #964, first, which lays important groundwork.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/974/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
274308380 MDU6SXNzdWUyNzQzMDgzODA= 1720 Possible regression with PyNIO data not being lazily loaded shoyer 1217238 closed 0     6 2017-11-15T21:20:41Z 2017-11-17T17:33:13Z 2017-11-17T16:44:40Z MEMBER      

@weathergod reports on the mailing list:

I just tried [0.10.0 rc2] out in combination with the pynio engine (v1.5.0 from conda-forge), and doing a print on a dataset object causes all of the data to get loaded into memory.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1720/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
207587161 MDU6SXNzdWUyMDc1ODcxNjE= 1269 GroupBy like API for resample shoyer 1217238 closed 0     6 2017-02-14T17:46:02Z 2017-09-22T16:27:35Z 2017-09-22T16:27:35Z MEMBER      

Since we wrote resample in xarray, pandas updated resample to have a groupby-like API (e.g., df.resample('24H').mean() vs. the old df.resample('24H') that uses the mean by default).

It would be nice to redo the xarray resample API to match, e.g., ds.resample(time='24H').mean() vs ds.resample('time', '24H'). This would solve a few use cases, including grouped-resample arithmetic and iterating over groups, and would (mostly) take care of the need for pd.TimeGrouper support (https://github.com/pydata/xarray/issues/364). If we use **kwargs for matching dimension names, this could be done with a minimally painful deprecation cycle.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1269/reactions",
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
88075523 MDU6SXNzdWU4ODA3NTUyMw== 432 Tools for converting between xray.Dataset and nested dictionaries/JSON shoyer 1217238 closed 0     6 2015-06-13T22:25:28Z 2016-08-11T21:54:51Z 2016-08-11T21:54:51Z MEMBER      

This came up in discussion with @freeman-lab -- xray does not have direct support for converting datasets to or from nested dictionaries (i.e., as could be serialized in JSON).

This is quite straightforward to implement oneself, of course, but there's something to be said for making this more obvious. I'm thinking of a serialization format that looks something like this:

{
    'variables': {
        'temperature': {
            'dimensions': ['x'],
            'data': [1, 2, 3],
            'attributes': {}
        },
        ...
    },
    'attributes': {
        'title': 'My example dataset',
        ...
    }
}

The solution here would be to either:

  1. add a few examples to the IO documentation of how to roll this oneself, or
  2. create a few helper methods/functions to make this even easier: xray.Dataset.to_dict/xray.read_dict.
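A roll-your-own version of option 1 might look roughly like the following sketch (using the modern xarray import; dataset_to_dict is an illustrative name, predating the to_dict/from_dict helpers that were eventually added):

```python
import xarray as xr

def dataset_to_dict(ds):
    # Serialize a Dataset into plain Python containers, roughly following
    # the format sketched above (variables + attributes).
    return {
        "variables": {
            name: {
                "dimensions": list(var.dims),
                "data": var.values.tolist(),
                "attributes": dict(var.attrs),
            }
            for name, var in ds.variables.items()
        },
        "attributes": dict(ds.attrs),
    }

ds = xr.Dataset({"temperature": ("x", [1, 2, 3])}, attrs={"title": "My example dataset"})
print(dataset_to_dict(ds))
```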

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/432/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
104768781 MDExOlB1bGxSZXF1ZXN0NDQxNDkyOTQ= 559 Fix pcolormesh plots with cartopy shoyer 1217238 closed 0   0.6.1 1307323 6 2015-09-03T19:50:22Z 2015-11-15T21:49:11Z 2015-09-14T20:33:36Z MEMBER   0 pydata/xarray/pulls/559

```python
proj = ccrs.Orthographic(central_longitude=230, central_latitude=5)
fig, ax = plt.subplots(figsize=(20, 8), subplot_kw=dict(projection=proj))
x.plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree())
ax.coastlines()
```

Before and after screenshots of the pcolormesh plot were attached (images not included in this export).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/559/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);