
issues

25 rows where comments = 5 and user = 1217238 sorted by updated_at descending

type 2

  • pull 14
  • issue 11

state 2

  • closed 21
  • open 4

repo 1

  • xarray 25
id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
271043420 MDU6SXNzdWUyNzEwNDM0MjA= 1689 Roundtrip serialization of coordinate variables with spaces in their names shoyer 1217238 open 0     5 2017-11-03T16:43:20Z 2024-03-22T14:02:48Z   MEMBER      

If coordinates have spaces in their names, they get restored from netCDF files as data variables instead:

```
>>> xarray.open_dataset(xarray.Dataset(coords={'name with spaces': 1}).to_netcdf())
<xarray.Dataset>
Dimensions:           ()
Data variables:
    name with spaces  int32 1
```

This happens because the CF convention is to indicate coordinates as a space-separated string, e.g., `coordinates='latitude longitude'`.

Even though these aren't CF-compliant variable names (which cannot have spaces), it would be nice to have an ad-hoc convention for xarray that allows us to serialize/deserialize coordinates in all/most cases. Maybe we could use escape characters for spaces (e.g., `coordinates='name\ with\ spaces'`) or quote names if they have spaces (e.g., `coordinates='"name with spaces"'`)?

At the very least, we should issue a warning in these cases.
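For concreteness, here is a minimal sketch (hypothetical helpers, not part of xarray) of what the backslash-escaping convention could look like:

```python
import re

def encode_coordinates(names):
    # Escape literal spaces before joining names into a single
    # CF-style space-separated 'coordinates' attribute.
    return " ".join(name.replace(" ", r"\ ") for name in names)

def decode_coordinates(attr):
    # Split on spaces NOT preceded by a backslash, then unescape.
    return [part.replace(r"\ ", " ") for part in re.split(r"(?<!\\) ", attr)]

encoded = encode_coordinates(["name with spaces", "time"])
assert encoded == r"name\ with\ spaces time"
assert decode_coordinates(encoded) == ["name with spaces", "time"]
```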

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1689/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
325439138 MDU6SXNzdWUzMjU0MzkxMzg= 2171 Support alignment/broadcasting with unlabeled dimensions of size 1 shoyer 1217238 open 0     5 2018-05-22T19:52:21Z 2022-04-19T03:15:24Z   MEMBER      

Sometimes, it's convenient to include placeholder dimensions of size 1, which allows for removing any ambiguity related to the order of output dimensions.

Currently, this is not supported with xarray:

```
>>> xr.DataArray([1], dims='x') + xr.DataArray([1, 2, 3], dims='x')
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension sizes: {1, 3}

>>> xr.Variable(('x',), [1]) + xr.Variable(('x',), [1, 2, 3])
ValueError: operands cannot be broadcast together with mismatched lengths for dimension 'x': (1, 3)
```

However, these operations aren't really ambiguous. With size-1 dimensions, we could logically do broadcasting like NumPy arrays, e.g.,

```
>>> np.array([1]) + np.array([1, 2, 3])
array([2, 3, 4])
```

This would be particularly convenient if we add keepdims=True to xarray operations (#2170).
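In the meantime, one workaround with the current API is to drop the size-1 dimension explicitly so that scalar broadcasting applies (a sketch, not a general solution):

```python
import xarray as xr

a = xr.DataArray([1], dims='x')
b = xr.DataArray([1, 2, 3], dims='x')

# Squeezing 'x' turns `a` into a 0-d array, which broadcasts against `b`.
result = a.squeeze('x', drop=True) + b
print(result.values)  # [2 3 4]
```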

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2171/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
237008177 MDU6SXNzdWUyMzcwMDgxNzc= 1460 groupby should still squeeze for non-monotonic inputs shoyer 1217238 open 0     5 2017-06-19T20:05:14Z 2022-03-04T21:31:41Z   MEMBER      

We can simply use argsort() to determine group_indices instead of np.arange(): https://github.com/pydata/xarray/blob/22ff955d53e253071f6e4fa849e5291d0005282a/xarray/core/groupby.py#L256
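For illustration, a minimal sketch (my reading of the suggestion, not the actual patch) of recovering per-group indices from non-monotonic labels with argsort():

```python
import numpy as np

labels = np.array([2, 0, 1, 0, 2])           # non-monotonic group labels
order = labels.argsort(kind="stable")        # positions, sorted by group
boundaries = np.flatnonzero(np.diff(labels[order])) + 1
group_indices = np.split(order, boundaries)  # one index array per group
# -> [array([1, 3]), array([2]), array([0, 4])]
```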

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1460/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
645062817 MDExOlB1bGxSZXF1ZXN0NDM5NTg4OTU1 4178 Fix min_deps_check; revert to support numpy=1.14 and pandas=0.24 shoyer 1217238 closed 0     5 2020-06-25T00:37:19Z 2021-02-27T21:46:43Z 2021-02-27T21:46:42Z MEMBER   1 pydata/xarray/pulls/4178

Fixes the issue noticed in: https://github.com/pydata/xarray/pull/4175#issuecomment-649135372

Let's see if this passes CI...

  • [x] Passes `isort -rc . && black . && mypy . && flake8`
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4178/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
314444743 MDU6SXNzdWUzMTQ0NDQ3NDM= 2059 How should xarray serialize bytes/unicode strings across Python/netCDF versions? shoyer 1217238 open 0     5 2018-04-15T19:36:55Z 2020-11-19T10:08:16Z   MEMBER      

netCDF string types

We have several options for storing strings in netCDF files:

  • NC_CHAR: netCDF's legacy character type. The closest match is NumPy's 'S1' dtype. In principle, it's supposed to be able to store arbitrary bytes. On HDF5, it uses a UTF-8 encoded string with a fixed size of 1 (but note that HDF5 does not complain about storing arbitrary bytes).
  • NC_STRING: netCDF's newer variable-length string type. It's only available on netCDF4 (not netCDF3). It corresponds to an HDF5 variable-length string with UTF-8 encoding.
  • NC_CHAR with an _Encoding attribute: xarray and netCDF4-Python support an ad-hoc convention for storing unicode strings in NC_CHAR data-types, by adding an attribute {'_Encoding': 'UTF-8'}. The data is still stored as fixed-width strings, but xarray (and netCDF4-Python) can decode them as unicode.

NC_STRING would seem like a clear win in cases where it's supported, but as @crusaderky points out in https://github.com/pydata/xarray/issues/2040, it actually results in much larger netCDF files in many cases than using character arrays, which are more easily compressed. Nonetheless, we currently default to storing unicode strings in NC_STRING, because it's the most portable option -- every tool that handles HDF5 and netCDF4 should be able to read it properly as unicode strings.

NumPy/Python string types

On the Python side, our options are perhaps even more confusing:

  • NumPy's dtype=np.string_ corresponds to fixed-length bytes. This is the default dtype for strings on Python 2, because on Python 2 strings are the same as bytes.
  • NumPy's dtype=np.unicode_ corresponds to fixed-length unicode. This is the default dtype for strings on Python 3, because on Python 3 strings are the same as unicode.
  • Strings are also commonly stored in numpy arrays with dtype=np.object_, as arrays of either bytes or unicode objects. This is a pragmatic choice, because otherwise NumPy has no support for variable-length strings. We also use this (like pandas) to mark missing values with np.nan.

Like pandas, we are pretty liberal with converting back and forth between fixed-length (np.string_/np.unicode_) and variable-length (object dtype) representations of strings as necessary. This works pretty well, though converting from object arrays in particular has downsides, since it cannot be done lazily with dask.

Current behavior of xarray

Currently, xarray uses the same behavior on Python 2/3. The priority was faithfully round-tripping data from a particular version of Python to netCDF and back, which the current serialization behavior achieves:

| Python version | NetCDF version | NumPy datatype | NetCDF datatype |
| -------------- | -------------- | -------------- | --------------- |
| Python 2 | NETCDF3 | np.string_ / str | NC_CHAR |
| Python 2 | NETCDF4 | np.string_ / str | NC_CHAR |
| Python 3 | NETCDF3 | np.string_ / bytes | NC_CHAR |
| Python 3 | NETCDF4 | np.string_ / bytes | NC_CHAR |
| Python 2 | NETCDF3 | np.unicode_ / unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | np.unicode_ / unicode | NC_STRING |
| Python 3 | NETCDF3 | np.unicode_ / str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | np.unicode_ / str | NC_STRING |
| Python 2 | NETCDF3 | object bytes/str | NC_CHAR |
| Python 2 | NETCDF4 | object bytes/str | NC_CHAR |
| Python 3 | NETCDF3 | object bytes | NC_CHAR |
| Python 3 | NETCDF4 | object bytes | NC_CHAR |
| Python 2 | NETCDF3 | object unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | object unicode | NC_STRING |
| Python 3 | NETCDF3 | object unicode/str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | object unicode/str | NC_STRING |

This can also be selected explicitly for most data-types by setting dtype in encoding:

  • 'S1' for NC_CHAR (with or without encoding)
  • str for NC_STRING (though I'm not 100% sure it works properly currently when given bytes)
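For example, a short sketch of forcing the fixed-width representation through the encoding argument (xarray adds the _Encoding attribute for unicode data so the values survive the round trip):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'name': ('x', np.array([u'foo', u'bar']))})
# dtype='S1' requests NC_CHAR storage instead of the NC_STRING default.
ds.to_netcdf('chars.nc', encoding={'name': {'dtype': 'S1'}})
```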

Script for generating table:

```python
from __future__ import print_function
import xarray as xr
import uuid
import netCDF4
import numpy as np
import sys

for dtype_name, value in [
        ('np.string_ / ' + type(b'').__name__, np.array([b'abc'])),
        ('np.unicode_ / ' + type(u'').__name__, np.array([u'abc'])),
        ('object bytes/' + type(b'').__name__, np.array([b'abc'], dtype=object)),
        ('object unicode/' + type(u'').__name__, np.array([u'abc'], dtype=object)),
]:
    for format in ['NETCDF3_64BIT', 'NETCDF4']:
        filename = str(uuid.uuid4()) + '.nc'
        xr.Dataset({'data': value}).to_netcdf(filename, format=format)
        with netCDF4.Dataset(filename) as f:
            var = f.variables['data']
            disk_dtype = var.dtype
            has_encoding = hasattr(var, '_Encoding')
            disk_dtype_name = (('NC_CHAR' if disk_dtype == 'S1' else 'NC_STRING')
                               + (' with UTF-8 encoding' if has_encoding else ''))
            print('|', 'Python %i' % sys.version_info[0], '|', format[:7],
                  '|', dtype_name, '|', disk_dtype_name, '|')
```

Potential alternatives

The main option I'm considering is switching to default to NC_CHAR with UTF-8 encoding for np.string_ / str and object bytes/str on Python 2. The current behavior could be explicitly toggled by setting an encoding of {'_Encoding': None}.

This would imply two changes:

  1. Attempting to serialize arbitrary bytes (on Python 2) would start raising an error -- anything that isn't ASCII would require explicitly disabling _Encoding.
  2. Strings read back from disk on Python 2 would come back as unicode instead of bytes.

This implicit conversion would be consistent with Python 2's general handling of bytes/unicode, and facilitate reading netCDF files on Python 3 that were written with Python 2.

The counter-argument is that it may not be worth changing this at this late point, given that we will be sunsetting Python 2 support by year's end.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2059/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
613546626 MDExOlB1bGxSZXF1ZXN0NDE0MjgwMDEz 4039 Revise pull request template shoyer 1217238 closed 0     5 2020-05-06T19:08:19Z 2020-06-18T05:45:11Z 2020-06-18T05:45:10Z MEMBER   0 pydata/xarray/pulls/4039

See below for the new language, to clarify that documentation is only necessary for "user visible changes."

I added "including notable bug fixes" to indicate that minor bug fixes may not be worth noting (I was thinking of test-suite only fixes in this category) but perhaps that is too confusing.

cc @pydata/xarray for opinions!

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] Passes `isort -rc . && black . && mypy . && flake8`
  • [ ] Fully documented, including whats-new.rst for user visible changes (including notable bug fixes) and api.rst for new API
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4039/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
612214951 MDExOlB1bGxSZXF1ZXN0NDEzMjIyOTEx 4028 Remove broken test for Panel with to_pandas() shoyer 1217238 closed 0     5 2020-05-04T22:41:42Z 2020-05-06T01:50:21Z 2020-05-06T01:50:21Z MEMBER   0 pydata/xarray/pulls/4028

We don't support creating a Panel with to_pandas() with any version of pandas at present, so this test was previously broken if pandas < 0.25 was installed.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4028/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
309136602 MDU6SXNzdWUzMDkxMzY2MDI= 2019 Appending to an existing netCDF file fails with scipy==1.0.1 shoyer 1217238 closed 0     5 2018-03-27T21:15:05Z 2020-03-09T07:18:07Z 2020-03-09T07:18:07Z MEMBER      

https://travis-ci.org/pydata/xarray/builds/359093748

Example failure:

```
_____________________ ScipyFilePathTest.test_append_write _____________________
self = <xarray.tests.test_backends.ScipyFilePathTest testMethod=test_append_write>

    def test_append_write(self):
        # regression for GH1215
        data = create_test_data()
>       with self.roundtrip_append(data) as actual:

xarray/tests/test_backends.py:786:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../miniconda/envs/test_env/lib/python3.6/contextlib.py:81: in __enter__
    return next(self.gen)
xarray/tests/test_backends.py:155: in roundtrip_append
    self.save(data[[key]], path, mode=mode, **save_kwargs)
xarray/tests/test_backends.py:162: in save
    **kwargs)
xarray/core/dataset.py:1131: in to_netcdf
    unlimited_dims=unlimited_dims)
xarray/backends/api.py:657: in to_netcdf
    unlimited_dims=unlimited_dims)
xarray/core/dataset.py:1068: in dump_to_store
    unlimited_dims=unlimited_dims)
xarray/backends/common.py:363: in store
    unlimited_dims=unlimited_dims)
xarray/backends/common.py:402: in set_variables
    self.writer.add(source, target)
xarray/backends/common.py:265: in add
    target[...] = source
xarray/backends/scipy_.py:61: in __setitem__
    data[key] = value
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <scipy.io.netcdf.netcdf_variable object at 0x7fe3eb3ec6a0>
index = Ellipsis, data = array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

    def __setitem__(self, index, data):
        if self.maskandscale:
            missing_value = (
                self._get_missing_value() or
                getattr(data, 'fill_value', 999999))
            self._attributes.setdefault('missing_value', missing_value)
            self._attributes.setdefault('_FillValue', missing_value)
            data = ((data - self._attributes.get('add_offset', 0.0)) /
                    self._attributes.get('scale_factor', 1.0))
            data = np.ma.asarray(data).filled(missing_value)
            if self._typecode not in 'fd' and data.dtype.kind == 'f':
                data = np.round(data)

        # Expand data for record vars?
        if self.isrec:
            if isinstance(index, tuple):
                rec_index = index[0]
            else:
                rec_index = index
            if isinstance(rec_index, slice):
                recs = (rec_index.start or 0) + len(data)
            else:
                recs = rec_index + 1
            if recs > len(self.data):
                shape = (recs,) + self._shape[1:]
                # Resize in-place does not always work since
                # the array might not be single-segment
                try:
                    self.data.resize(shape)
                except ValueError:
                    self.__dict__['data'] = np.resize(self.data, shape).astype(self.data.dtype)
>       self.data[index] = data
E       ValueError: assignment destination is read-only
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2019/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
479914290 MDExOlB1bGxSZXF1ZXN0MzA2NzExNDYx 3210 sparse=True option for from_dataframe and from_series shoyer 1217238 closed 0     5 2019-08-13T01:09:19Z 2019-08-27T16:04:13Z 2019-08-27T08:54:26Z MEMBER   0 pydata/xarray/pulls/3210

Fixes https://github.com/pydata/xarray/issues/3206

Example usage:

In [3]: import pandas as pd
   ...: import numpy as np
   ...: import xarray
   ...: df = pd.DataFrame({
   ...:     'w': range(10),
   ...:     'x': list('abcdefghij'),
   ...:     'y': np.arange(0, 100, 10),
   ...:     'z': np.ones(10),
   ...: }).set_index(['w', 'x', 'y'])
   ...:

In [4]: ds = xarray.Dataset.from_dataframe(df, sparse=True)

In [5]: ds.z.data
Out[5]: <COO: shape=(10, 10, 10), dtype=float64, nnz=10, fill_value=nan>
  • [x] Closes #3206, Closes #2139
  • [x] Tests added
  • [x] Passes `black . && mypy . && flake8`
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3210/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
440233667 MDU6SXNzdWU0NDAyMzM2Njc= 2940 test_rolling_wrapped_dask is failing with dask-master shoyer 1217238 closed 0     5 2019-05-03T21:44:23Z 2019-06-28T16:49:04Z 2019-06-28T16:49:04Z MEMBER      

The test_rolling_wrapped_dask tests in test_dataarray.py are failing with dask master, e.g., as seen here: https://travis-ci.org/pydata/xarray/jobs/527936531

I reproduced this locally. git bisect identified the culprit as https://github.com/dask/dask/pull/4756.

The source of this issue on the xarray side appears to be these lines: https://github.com/pydata/xarray/blob/dd99b7d7d8576eefcef4507ae9eb36a144b60adf/xarray/core/rolling.py#L287-L291

In particular, we are currently passing `padded` as an xarray.DataArray object, not a dask array. Changing this to `padded.data` shows that passing an actual dask array to dask_array_ops.rolling_window results in failing tests.

@fujiisoup @jhamman any idea what's going on here?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2940/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
427451138 MDExOlB1bGxSZXF1ZXN0MjY2MDQ4MzEw 2858 Various fixes for explicit Dataset.indexes shoyer 1217238 closed 0     5 2019-03-31T21:48:47Z 2019-04-04T22:59:48Z 2019-04-04T21:58:24Z MEMBER   0 pydata/xarray/pulls/2858

I've added internal consistency checks to the uses of assert_equal in our test suite, so this shouldn't happen again.

  • [x] Closes #2856, closes #2854
  • [x] Tests added
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2858/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
365961291 MDExOlB1bGxSZXF1ZXN0MjE5NzUyOTE3 2458 WIP: sketch of resample support for CFTimeIndex shoyer 1217238 closed 0     5 2018-10-02T15:44:36Z 2019-02-03T03:21:52Z 2019-02-03T03:21:52Z MEMBER   0 pydata/xarray/pulls/2458

Example usage:

```
>>> import xarray
>>> times = xarray.cftime_range('2000', periods=30, freq='MS')
>>> da = xarray.DataArray(range(30), [('time', times)])
>>> da.resample(time='1AS').mean()
<xarray.DataArray (time: 3)>
array([ 5.5, 17.5, 26.5])
Coordinates:
  * time     (time) object 2001-01-01 00:00:00 ... 2003-01-01 00:00:00
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2458/reactions",
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 1,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
388977754 MDExOlB1bGxSZXF1ZXN0MjM3MTAyNjYz 2595 Close files when CachingFileManager is garbage collected shoyer 1217238 closed 0     5 2018-12-09T01:53:50Z 2018-12-23T20:11:35Z 2018-12-23T20:11:32Z MEMBER   0 pydata/xarray/pulls/2595

This frees users from needing to worry about this.

Using `__del__` turned out to be easier than using weak references.
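A minimal sketch of the idea (hypothetical class, not the actual CachingFileManager code):

```python
class FileManager:
    def __init__(self, path):
        self._file = open(path)

    def __del__(self):
        # Close the underlying file when the manager is garbage collected,
        # so users no longer need to call close() themselves.
        try:
            self._file.close()
        except Exception:
            pass  # already closed, or interpreter is shutting down
```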

  • [x] Closes #2560
  • [x] Closes #2614
  • [x] Tests added
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2595/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
293345254 MDU6SXNzdWUyOTMzNDUyNTQ= 1875 roll doesn't handle periodic boundary conditions well shoyer 1217238 closed 0     5 2018-01-31T23:07:42Z 2018-08-15T08:11:29Z 2018-08-15T08:11:29Z MEMBER      

DataArray.roll() currently rolls both data variables and coordinates:

```
>>> arr = xr.DataArray(range(4), [('x', range(0, 360, 90))])
>>> arr.roll(x=2)
<xarray.DataArray (x: 4)>
array([2, 3, 0, 1])
Coordinates:
  * x        (x) int64 180 270 0 90
```

This sort of makes sense, but the labels are now all non-monotonic, so you can't even plot the data with xarray. In my experience, you probably want coordinate labels that either look like:

  1. The unrolled original coordinates: [0, 90, 180, 270]
  2. Shifted coordinates: [-180, -90, 0, 90]

It should be easier to accomplish this in xarray. I currently resort to using roll and manually fixing up coordinates after the fact.

I'm actually not sure if there are any use-cases for the current behavior. Choice (1) would have the virtue of being consistent with shift():

```
>>> arr.shift(x=2)
<xarray.DataArray (x: 4)>
array([nan, nan, 0., 1.])
Coordinates:
  * x        (x) int64 0 90 180 270
```

We could potentially add another optional argument for shifting labels, too, or require fixing that up by subtraction.

Note: you might argue that this is overly geoscience specific, and it would be, if this was only for handling a longitude coordinate. But periodic boundary conditions are common in many areas of the physical sciences.
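For reference, here's what the two label conventions look like in practice. This sketch assumes the roll_coords argument found in later xarray versions:

```python
import xarray as xr

arr = xr.DataArray(range(4), [('x', range(0, 360, 90))])

# Choice 1: roll the data but keep the original, unrolled labels.
print(arr.roll(x=2, roll_coords=False).x.values)  # [  0  90 180 270]

# Choice 2: shifted labels, fixed up by subtraction after rolling.
rolled = arr.roll(x=2, roll_coords=True)
rolled['x'] = rolled.x.where(rolled.x < 180, rolled.x - 360)
print(rolled.x.values)  # [-180  -90    0   90]
```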

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1875/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
323674056 MDU6SXNzdWUzMjM2NzQwNTY= 2137 0.10.4 release shoyer 1217238 closed 0     5 2018-05-16T15:31:57Z 2018-05-17T02:29:52Z 2018-05-17T02:29:52Z MEMBER      

Our last release was April 13 (just over a month ago), and we've had a number of features land, so I'd like to issue this shortly. Ideally within the next few days, or maybe even later today.

CC @pydata/xarray

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2137/reactions",
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
305702311 MDU6SXNzdWUzMDU3MDIzMTE= 1993 DataArray.rolling().mean() is way slower than it should be shoyer 1217238 closed 0     5 2018-03-15T20:10:22Z 2018-03-18T08:56:27Z 2018-03-18T08:56:27Z MEMBER      

Code Sample, a copy-pastable example if possible

From @RayPalmerTech in https://github.com/kwgoodman/bottleneck/issues/186:

```python
import numpy as np
import pandas as pd
import time
import bottleneck as bn
import xarray
import matplotlib.pyplot as plt

N = 30000200       # Number of datapoints
Fs = 30000         # sample rate
T = 1 / Fs         # sample period
duration = N / Fs  # duration in s
t = np.arange(0, duration, T)  # time vector
DATA = np.random.randn(N,) + 5 * np.sin(2 * np.pi * 0.01 * t)  # Example noisy sine data
w = 330000         # window size

def using_bottleneck_mean(data, width):
    return bn.move_mean(a=data, window=width, min_count=1)

def using_pandas_rolling_mean(data, width):
    return np.asarray(pd.DataFrame(data).rolling(window=width, center=True, min_periods=1).mean()).ravel()

def using_xarray_mean(data, width):
    return xarray.DataArray(data, dims='x').rolling(x=width, min_periods=1, center=True).mean()

start = time.time()
A = using_bottleneck_mean(DATA, w)
print('Bottleneck: ', time.time() - start, 's')
start = time.time()
B = using_pandas_rolling_mean(DATA, w)
print('Pandas: ', time.time() - start, 's')
start = time.time()
C = using_xarray_mean(DATA, w)
print('Xarray: ', time.time() - start, 's')
```

This results in:

    Bottleneck:  0.0867006778717041 s
    Pandas:  0.563546895980835 s
    Xarray:  25.133142709732056 s

Somehow xarray is way slower than pandas and bottleneck, even though it's using bottleneck under the hood!

Problem description

Profiling shows that the majority of time is spent in xarray.core.rolling.DataArrayRolling._setup_windows. Monkey-patching that method with a dummy rectifies the issue:

    xarray.core.rolling.DataArrayRolling._setup_windows = lambda *args: None

Now we obtain:

    Bottleneck:  0.06775331497192383 s
    Pandas:  0.48262882232666016 s
    Xarray:  0.1723031997680664 s

The solution is to set up the windows lazily (in `__iter__`) instead of in the constructor.
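A minimal sketch of the lazy approach (hypothetical class, not xarray's actual implementation):

```python
class LazyRolling:
    def __init__(self, data, window):
        self.data = data
        self.window = window
        self._windows = None            # NOT built in the constructor

    def _setup_windows(self):
        # Expensive: builds one slice per window position.
        self._windows = [slice(max(0, i - self.window + 1), i + 1)
                         for i in range(len(self.data))]

    def __iter__(self):
        if self._windows is None:       # built lazily, only when iterating
            self._setup_windows()
        return iter((s, self.data[s]) for s in self._windows)
```

Reductions like mean() that never iterate would then skip the setup cost entirely.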

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.96+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.2
pandas: 0.22.0
numpy: 1.14.2
scipy: 0.19.1
netCDF4: None
h5netcdf: None
h5py: 2.7.1
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.1.2
cartopy: None
seaborn: 0.7.1
setuptools: 36.2.7
pip: 9.0.1
conda: None
pytest: None
IPython: 5.5.0
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1993/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
299601789 MDExOlB1bGxSZXF1ZXN0MTcwOTM0ODg1 1936 Tweak stickler config: ignore Python files in the docs & disable fixer shoyer 1217238 closed 0     5 2018-02-23T05:18:29Z 2018-02-25T20:51:42Z 2018-02-25T20:49:15Z MEMBER   0 pydata/xarray/pulls/1936

It doesn't always make sense to lint these files fully.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1936/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
272460887 MDExOlB1bGxSZXF1ZXN0MTUxNTc2MzA1 1705 Make Indexer classes not inherit from tuple. shoyer 1217238 closed 0     5 2017-11-09T07:08:27Z 2017-11-17T16:33:40Z 2017-11-14T03:32:34Z MEMBER   0 pydata/xarray/pulls/1705

I'm not entirely sure this is a good idea. The advantage is that it ensures that all our indexing code is entirely explicit: everything that reaches a backend must be an ExplicitIndexer. The downside is that it removes a bit of internal flexibility: we can't just use tuples in place of basic indexers anymore. On the whole, I think this is probably worth it but I would appreciate feedback.
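To make the trade-off concrete, here's a hedged sketch of the wrapper approach (class names mirror the PR, but the bodies are illustrative):

```python
class ExplicitIndexer:
    """Wraps -- rather than inherits from -- a tuple key."""

    def __init__(self, key):
        if not isinstance(key, tuple):
            raise TypeError('key must be a tuple: {!r}'.format(key))
        self._key = key

    @property
    def tuple(self):
        return self._key


class BasicIndexer(ExplicitIndexer):
    """Integers and slices only; backends can rely on this guarantee."""


# A plain tuple no longer works where an ExplicitIndexer is required --
# exactly the explicitness (and lost flexibility) described above.
```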

@fujiisoup can you take a look?

  • [x] Tests added / passed
  • [x] Passes `git diff upstream/master **/*py | flake8 --diff`
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1705/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
112928260 MDExOlB1bGxSZXF1ZXN0NDg1MzUxMTA= 637 size and aspect arguments for plotting methods even without faceting shoyer 1217238 closed 0     5 2015-10-23T02:10:06Z 2016-12-20T10:08:35Z 2016-12-20T10:08:35Z MEMBER   0 pydata/xarray/pulls/637

I was finding myself writing plt.figure(figsize=(x, y)) way too often. This will be a convenient shortcut.

Still needs tests.
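Usage would look something like this (a sketch of the intended shortcut; size is the figure height in inches and aspect * size gives the width):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.randn(100, 200), dims=['y', 'x'])

# One call instead of plt.figure(figsize=(10, 5)) followed by da.plot():
da.plot(size=5, aspect=2)  # height 5 inches, width = aspect * size = 10
```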

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/637/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
180756013 MDU6SXNzdWUxODA3NTYwMTM= 1034 test_conventions.TestEncodeCFVariable failing on master for Appveyor Python 2.7 build shoyer 1217238 closed 0     5 2016-10-03T21:48:55Z 2016-10-22T00:49:53Z 2016-10-22T00:49:53Z MEMBER      

I have no idea what's going on here, but maybe somebody who knows Windows better has a guess:

```
================================== FAILURES ===================================
________________ TestEncodeCFVariable.test_missing_fillvalue _________________
self = <xarray.test.test_conventions.TestEncodeCFVariable testMethod=test_missing_fillvalue>

    def test_missing_fillvalue(self):
        v = Variable(['x'], np.array([np.nan, 1, 2, 3]))
        v.encoding = {'dtype': 'int16'}
        with self.assertWarns('floating point data as an integer'):
>           conventions.encode_cf_variable(v)

xarray\test\test_conventions.py:523:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
C:\Python27-conda32\lib\contextlib.py:24: in __exit__
    self.gen.next()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xarray.test.test_conventions.TestEncodeCFVariable testMethod=test_missing_fillvalue>
message = 'floating point data as an integer'

    @contextmanager
    def assertWarns(self, message):
        with warnings.catch_warnings(record=True) as w:
            warnings.filterwarnings('always', message)
            yield
            assert len(w) > 0
>           assert all(message in str(wi.message) for wi in w)
E           AssertionError: NameError: all(<generator object <genexpr> at 0x0617D170>) << global name 'message' is not defined

xarray\test\__init__.py:140: AssertionError
============== 1 failed, 970 passed, 67 skipped in 70.58 seconds ==============
```

I could understand a warning failing to be raised, but the NameError is especially strange.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1034/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
173908126 MDExOlB1bGxSZXF1ZXN0ODMxOTM2NTI= 993 Coordinate -> IndexVariable and other deprecations shoyer 1217238 closed 0     5 2016-08-30T01:12:19Z 2016-09-01T21:56:07Z 2016-09-01T21:56:02Z MEMBER   0 pydata/xarray/pulls/993
  • Renamed the Coordinate class from xarray's low level API to IndexVariable. xref https://github.com/pydata/xarray/pull/947#issuecomment-238549129
  • Deprecated supplying coords as a dictionary to the DataArray constructor without also supplying an explicit dims argument. The old behavior encouraged relying on the iteration order of dictionaries, which is a bad practice (fixes #727).
  • Removed a number of methods deprecated since v0.7.0 or earlier: load_data, vars, drop_vars, dump, dumps and the variables keyword argument alias to Dataset.
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/993/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
89866276 MDU6SXNzdWU4OTg2NjI3Ng== 439 Display datetime64 arrays without showing local timezones shoyer 1217238 closed 0     5 2015-06-21T05:13:58Z 2016-04-21T15:43:27Z 2016-04-21T15:43:27Z MEMBER      

NumPy has an unfortunate way of adding local timezone offsets when printing datetime64 arrays:

    <xray.DataArray 'time' (time: 4000)>
    array(['1999-12-31T16:00:00.000000000-0800',
           '2000-01-01T16:00:00.000000000-0800',
           '2000-01-02T16:00:00.000000000-0800', ...,
           '2010-12-10T16:00:00.000000000-0800',
           '2010-12-11T16:00:00.000000000-0800',
           '2010-12-12T16:00:00.000000000-0800'], dtype='datetime64[ns]')
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...

We should use custom formatting code to remove the local timezone (to encourage folks just to use naive timezones/UTC).
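A minimal sketch of the custom-formatting idea, routing values through pandas.Timestamp, which prints naive (timezone-free) strings:

```python
import numpy as np
import pandas as pd

def format_datetime64(values):
    # str(pd.Timestamp(...)) renders '2000-01-01 00:00:00' with no UTC offset.
    return [str(pd.Timestamp(v)) for v in np.asarray(values).ravel()]

print(format_datetime64(np.array(['2000-01-01'], dtype='datetime64[ns]')))
# ['2000-01-01 00:00:00']
```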

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/439/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
98498103 MDExOlB1bGxSZXF1ZXN0NDEzOTQyNzY= 507 Add sel_points for point-wise indexing by label shoyer 1217238 closed 0     5 2015-08-01T01:52:52Z 2015-08-05T03:51:46Z 2015-08-05T03:51:44Z MEMBER   0 pydata/xarray/pulls/507

xref #475

Example usage:

```
In [1]: da = xray.DataArray(np.arange(56).reshape((7, 8)),
   ...:                     coords={'x': list('abcdefg'),
   ...:                             'y': 10 * np.arange(8)},
   ...:                     dims=['x', 'y'])

In [2]: da
Out[2]:
<xray.DataArray (x: 7, y: 8)>
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31],
       [32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47],
       [48, 49, 50, 51, 52, 53, 54, 55]])
Coordinates:
  * y        (y) int64 0 10 20 30 40 50 60 70
  * x        (x) |S1 'a' 'b' 'c' 'd' 'e' 'f' 'g'

# we can index by position along each dimension
In [3]: da.isel_points(x=[0, 1, 6], y=[0, 1, 0], dim='points')
Out[3]:
<xray.DataArray (points: 3)>
array([ 0,  9, 48])
Coordinates:
    y        (points) int64 0 10 0
    x        (points) |S1 'a' 'b' 'g'
  * points   (points) int64 0 1 2

# or equivalently by label
In [4]: da.sel_points(x=['a', 'b', 'g'], y=[0, 10, 0], dim='points')
Out[4]:
<xray.DataArray (points: 3)>
array([ 0,  9, 48])
Coordinates:
    y        (points) int64 0 10 0
    x        (points) |S1 'a' 'b' 'g'
  * points   (points) int64 0 1 2
```

cc @jhamman

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/507/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
43442970 MDExOlB1bGxSZXF1ZXN0MjE1NjU3Mjg= 236 WIP: convert to/from cdms2 variables shoyer 1217238 closed 0   0.3.2 836999 5 2014-09-22T08:48:52Z 2014-12-19T09:11:42Z 2014-12-19T09:11:39Z MEMBER   0 pydata/xarray/pulls/236

Fixes #133

@DamienIrving am I missing anything obvious here?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/236/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
27625970 MDExOlB1bGxSZXF1ZXN0MTI1NzI5OTE= 12 Stephan's sprintbattical shoyer 1217238 closed 0     5 2014-02-14T21:23:09Z 2014-08-04T00:03:21Z 2014-02-21T00:36:53Z MEMBER   0 pydata/xarray/pulls/12
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/12/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);