id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
271043420,MDU6SXNzdWUyNzEwNDM0MjA=,1689,Roundtrip serialization of coordinate variables with spaces in their names,1217238,open,0,,,5,2017-11-03T16:43:20Z,2024-03-22T14:02:48Z,,MEMBER,,,,"If coordinates have spaces in their names, they get restored from netCDF files as data variables instead:

```
>>> xarray.open_dataset(xarray.Dataset(coords={'name with spaces': 1}).to_netcdf())
Dimensions:           ()
Data variables:
    name with spaces  int32 1
```

This happens because the CF convention is to indicate coordinates as a space-separated string, e.g., `coordinates='latitude longitude'`. Even though these aren't CF-compliant variable names (which cannot contain spaces), it would be nice to have an ad-hoc convention for xarray that allows us to serialize/deserialize coordinates in all/most cases. Maybe we could use escape characters for spaces (e.g., `coordinates='name\ with\ spaces'`) or quote names if they have spaces (e.g., `coordinates='""name with spaces""'`)?

At the very least, we should issue a warning in these cases.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1689/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
325439138,MDU6SXNzdWUzMjU0MzkxMzg=,2171,Support alignment/broadcasting with unlabeled dimensions of size 1,1217238,open,0,,,5,2018-05-22T19:52:21Z,2022-04-19T03:15:24Z,,MEMBER,,,,"Sometimes, it's convenient to include placeholder dimensions of size 1, which allows for removing any ambiguity related to the order of output dimensions. Currently, this is not supported with xarray:

```
>>> xr.DataArray([1], dims='x') + xr.DataArray([1, 2, 3], dims='x')
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension sizes: {1, 3}

>>> xr.Variable(('x',), [1]) + xr.Variable(('x',), [1, 2, 3])
ValueError: operands cannot be broadcast together with mismatched lengths for dimension 'x': (1, 3)
```

However, these operations aren't really ambiguous.
With size-1 dimensions, we could logically do broadcasting like NumPy arrays, e.g.,

```
>>> np.array([1]) + np.array([1, 2, 3])
array([2, 3, 4])
```

This would be particularly convenient if we add `keepdims=True` to xarray operations (#2170).","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2171/reactions"", ""total_count"": 4, ""+1"": 4, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
237008177,MDU6SXNzdWUyMzcwMDgxNzc=,1460,groupby should still squeeze for non-monotonic inputs,1217238,open,0,,,5,2017-06-19T20:05:14Z,2022-03-04T21:31:41Z,,MEMBER,,,,"We can simply use `argsort()` to determine `group_indices` instead of `np.arange()`: https://github.com/pydata/xarray/blob/22ff955d53e253071f6e4fa849e5291d0005282a/xarray/core/groupby.py#L256","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1460/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
314444743,MDU6SXNzdWUzMTQ0NDQ3NDM=,2059,How should xarray serialize bytes/unicode strings across Python/netCDF versions?,1217238,open,0,,,5,2018-04-15T19:36:55Z,2020-11-19T10:08:16Z,,MEMBER,,,,"# netCDF string types

We have several options for storing strings in netCDF files:

- `NC_CHAR`: netCDF's legacy character type. The closest match is NumPy's `'S1'` dtype. In principle, it's supposed to be able to store arbitrary bytes. On HDF5, it uses a UTF-8 encoded string with a fixed size of 1 (but note that HDF5 does not complain about storing arbitrary bytes).
- `NC_STRING`: netCDF's newer variable-length string type. It's only available on netCDF4 (not netCDF3). It corresponds to an HDF5 variable-length string with UTF-8 encoding.
- `NC_CHAR` with an `_Encoding` attribute: xarray and netCDF4-Python support an ad-hoc convention for storing unicode strings in `NC_CHAR` data-types, by adding an attribute `{'_Encoding': 'UTF-8'}`. The data is still stored as fixed-width strings, but xarray (and netCDF4-Python) can decode them as unicode.

`NC_STRING` would seem like a clear win in cases where it's supported, but as @crusaderky points out in https://github.com/pydata/xarray/issues/2040, it actually results in much larger netCDF files in many cases than using character arrays, which are more easily compressed. Nonetheless, we currently default to storing unicode strings in `NC_STRING`, because it's the most portable option -- every tool that handles HDF5 and netCDF4 should be able to read it properly as unicode strings.

# NumPy/Python string types

On the Python side, our options are perhaps even more confusing:

- NumPy's `dtype=np.string_` corresponds to fixed-length bytes. This is the default dtype for strings on Python 2, because on Python 2 strings are the same as bytes.
- NumPy's `dtype=np.unicode_` corresponds to fixed-length unicode. This is the default dtype for strings on Python 3, because on Python 3 strings are the same as unicode.
- Strings are also commonly stored in numpy arrays with `dtype=np.object_`, as arrays of either `bytes` or `unicode` objects. This is a pragmatic choice, because otherwise NumPy has no support for variable-length strings. We also use this (like pandas) to mark missing values with `np.nan`.

Like pandas, we are pretty liberal with converting back and forth between fixed-length (`np.string_`/`np.unicode_`) and variable-length (object dtype) representations of strings as necessary. This works pretty well, though converting from object arrays in particular has downsides, since it cannot be done lazily with dask.
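For concreteness, the three in-memory representations look like this in plain NumPy (an illustration added for this digest, not text from the original issue):

```python
import numpy as np

fixed_bytes = np.array([b'abc'])                     # dtype('S3'): fixed-length bytes (np.string_)
fixed_unicode = np.array([u'abc'])                   # dtype('<U3'): fixed-length unicode (np.unicode_)
variable = np.array([u'abc', np.nan], dtype=object)  # object dtype: variable length, np.nan marks missing
```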
# Current behavior of xarray

Currently, xarray uses the same behavior on Python 2/3. The priority was faithfully round-tripping data from a particular version of Python to netCDF and back, which the current serialization behavior achieves:

| Python version | NetCDF version | NumPy datatype | NetCDF datatype |
| -------------- | -------------- | -------------- | --------------- |
| Python 2 | NETCDF3 | np.string_ / str | NC_CHAR |
| Python 2 | NETCDF4 | np.string_ / str | NC_CHAR |
| Python 3 | NETCDF3 | np.string_ / bytes | NC_CHAR |
| Python 3 | NETCDF4 | np.string_ / bytes | NC_CHAR |
| Python 2 | NETCDF3 | np.unicode_ / unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | np.unicode_ / unicode | NC_STRING |
| Python 3 | NETCDF3 | np.unicode_ / str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | np.unicode_ / str | NC_STRING |
| Python 2 | NETCDF3 | object bytes/str | NC_CHAR |
| Python 2 | NETCDF4 | object bytes/str | NC_CHAR |
| Python 3 | NETCDF3 | object bytes | NC_CHAR |
| Python 3 | NETCDF4 | object bytes | NC_CHAR |
| Python 2 | NETCDF3 | object unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | object unicode | NC_STRING |
| Python 3 | NETCDF3 | object unicode/str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | object unicode/str | NC_STRING |

This can also be selected explicitly for most data-types by setting dtype in encoding:

- `'S1'` for NC_CHAR (with or without encoding)
- `str` for NC_STRING (though I'm not 100% sure it works properly currently when given bytes)

Script for generating table:
```python
from __future__ import print_function
import xarray as xr
import uuid
import netCDF4
import numpy as np
import sys

for dtype_name, value in [
        ('np.string_ / ' + type(b'').__name__, np.array([b'abc'])),
        ('np.unicode_ / ' + type(u'').__name__, np.array([u'abc'])),
        ('object bytes/' + type(b'').__name__, np.array([b'abc'], dtype=object)),
        ('object unicode/' + type(u'').__name__, np.array([u'abc'], dtype=object)),
]:
    for format in ['NETCDF3_64BIT', 'NETCDF4']:
        filename = str(uuid.uuid4()) + '.nc'
        xr.Dataset({'data': value}).to_netcdf(filename, format=format)
        with netCDF4.Dataset(filename) as f:
            var = f.variables['data']
            disk_dtype = var.dtype
            has_encoding = hasattr(var, '_Encoding')
            disk_dtype_name = (('NC_CHAR' if disk_dtype == 'S1' else 'NC_STRING')
                               + (' with UTF-8 encoding' if has_encoding else ''))
            print('|', 'Python %i' % sys.version_info[0],
                  '|', format[:7],
                  '|', dtype_name,
                  '|', disk_dtype_name, '|')
```
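For example, here is a minimal sketch (added for this digest) of the explicit selection described above, using the `encoding` argument to `to_netcdf` and the two dtype choices listed in the bullets:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'data': np.array([u'abc'])})
# NC_CHAR, i.e. a fixed-width character array (optionally decoded via _Encoding):
ds.to_netcdf('chars.nc', format='NETCDF4', encoding={'data': {'dtype': 'S1'}})
# NC_STRING, i.e. variable-length strings (netCDF4 only):
ds.to_netcdf('strings.nc', format='NETCDF4', encoding={'data': {'dtype': str}})
```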
# Potential alternatives

The main option I'm considering is switching to default to `NC_CHAR` with UTF-8 encoding for np.string_ / str and object bytes/str on Python 2. The current behavior could be explicitly toggled by setting an encoding of `{'_Encoding': None}`.

This would imply two changes:

1. Attempting to serialize arbitrary bytes (on Python 2) would start raising an error -- anything that isn't ASCII would require explicitly disabling `_Encoding`.
2. Strings read back from disk on Python 2 would come back as unicode instead of bytes. This implicit conversion would be consistent with Python 2's general handling of bytes/unicode, and would facilitate reading netCDF files on Python 3 that were written with Python 2.

The counter-argument is that it may not be worth changing this at this late point, given that we will be sunsetting Python 2 support by year's end.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2059/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
309136602,MDU6SXNzdWUzMDkxMzY2MDI=,2019,Appending to an existing netCDF file fails with scipy==1.0.1,1217238,closed,0,,,5,2018-03-27T21:15:05Z,2020-03-09T07:18:07Z,2020-03-09T07:18:07Z,MEMBER,,,,"https://travis-ci.org/pydata/xarray/builds/359093748

Example failure:

```
_____________________ ScipyFilePathTest.test_append_write ______________________

self = <xarray.tests.test_backends.ScipyFilePathTest testMethod=test_append_write>

    def test_append_write(self):
        # regression for GH1215
        data = create_test_data()
>       with self.roundtrip_append(data) as actual:

xarray/tests/test_backends.py:786:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../miniconda/envs/test_env/lib/python3.6/contextlib.py:81: in __enter__
    return next(self.gen)
xarray/tests/test_backends.py:155: in roundtrip_append
    self.save(data[[key]], path, mode=mode, **save_kwargs)
xarray/tests/test_backends.py:162: in save
    **kwargs)
xarray/core/dataset.py:1131: in to_netcdf
    unlimited_dims=unlimited_dims)
xarray/backends/api.py:657: in to_netcdf
    unlimited_dims=unlimited_dims)
xarray/core/dataset.py:1068: in dump_to_store
    unlimited_dims=unlimited_dims)
xarray/backends/common.py:363: in store
    unlimited_dims=unlimited_dims)
xarray/backends/common.py:402: in set_variables
    self.writer.add(source, target)
xarray/backends/common.py:265: in add
    target[...] = source
xarray/backends/scipy_.py:61: in __setitem__
    data[key] = value
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <scipy.io.netcdf.netcdf_variable object at 0x...>
index = Ellipsis, data = array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

    def __setitem__(self, index, data):
        if self.maskandscale:
            missing_value = (
                    self._get_missing_value() or
                    getattr(data, 'fill_value', 999999))
            self._attributes.setdefault('missing_value', missing_value)
            self._attributes.setdefault('_FillValue', missing_value)
            data = ((data - self._attributes.get('add_offset', 0.0)) /
                    self._attributes.get('scale_factor', 1.0))
            data = np.ma.asarray(data).filled(missing_value)
            if self._typecode not in 'fd' and data.dtype.kind == 'f':
                data = np.round(data)

        # Expand data for record vars?
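        # (Comment added in this digest, not in scipy's source: the resize
        # fallback below rebinds self.data to a fresh copy, but the final
        # assignment at the bottom of this traceback still fails -- apparently
        # because with scipy==1.0.1 the array backing a re-opened netCDF file
        # is a read-only memory map.)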
        if self.isrec:
            if isinstance(index, tuple):
                rec_index = index[0]
            else:
                rec_index = index
            if isinstance(rec_index, slice):
                recs = (rec_index.start or 0) + len(data)
            else:
                recs = rec_index + 1
            if recs > len(self.data):
                shape = (recs,) + self._shape[1:]
                # Resize in-place does not always work since
                # the array might not be single-segment
                try:
                    self.data.resize(shape)
                except ValueError:
                    self.__dict__['data'] = np.resize(self.data, shape).astype(self.data.dtype)
>       self.data[index] = data
E       ValueError: assignment destination is read-only
```","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2019/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
440233667,MDU6SXNzdWU0NDAyMzM2Njc=,2940,test_rolling_wrapped_dask is failing with dask-master,1217238,closed,0,,,5,2019-05-03T21:44:23Z,2019-06-28T16:49:04Z,2019-06-28T16:49:04Z,MEMBER,,,,"The `test_rolling_wrapped_dask` tests in `test_dataarray.py` are failing with dask master, e.g., as seen here: https://travis-ci.org/pydata/xarray/jobs/527936531

I reproduced this locally. `git bisect` identified the culprit as https://github.com/dask/dask/pull/4756.

The source of this issue on the xarray side appears to be these lines:
https://github.com/pydata/xarray/blob/dd99b7d7d8576eefcef4507ae9eb36a144b60adf/xarray/core/rolling.py#L287-L291

In particular, we are currently passing `padded` as an xarray.DataArray object, not a dask array. Changing this to `padded.data` shows that passing an actual dask array to `dask_array_ops.rolling_window` results in failing tests.

@fujiisoup @jhamman any idea what's going on here?","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2940/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
293345254,MDU6SXNzdWUyOTMzNDUyNTQ=,1875,roll doesn't handle periodic boundary conditions well,1217238,closed,0,,,5,2018-01-31T23:07:42Z,2018-08-15T08:11:29Z,2018-08-15T08:11:29Z,MEMBER,,,,"DataArray.roll() currently rolls both data variables and coordinates:

```
>>> arr = xr.DataArray(range(4), [('x', range(0, 360, 90))])
>>> arr.roll(x=2)
<xarray.DataArray (x: 4)>
array([2, 3, 0, 1])
Coordinates:
  * x        (x) int64 180 270 0 90
```

This sort of makes sense, but the labels are now all non-monotonic, so you can't even plot the data with xarray.

In my experience, you probably want coordinate labels that look like either:

1. The unrolled original coordinates: [0, 90, 180, 270]
2. Shifted coordinates: [-180, -90, 0, 90]

It should be easier to accomplish this in xarray. I currently resort to using roll and manually fixing up coordinates after the fact (sketched below). I'm actually not sure if there are any use-cases for the current behavior.

Choice (1) would have the virtue of being consistent with shift():

```
>>> arr.shift(x=2)
<xarray.DataArray (x: 4)>
array([nan, nan, 0., 1.])
Coordinates:
  * x        (x) int64 0 90 180 270
```

We could potentially add another optional argument for shifting labels too, or require fixing that up by subtraction.
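For reference, the manual fix-up looks roughly like this (a sketch of the workaround, not a proposed API):

```python
import xarray as xr

arr = xr.DataArray(range(4), [('x', range(0, 360, 90))])

# Choice (1): roll the data but restore the unrolled, monotonic labels
rolled = arr.roll(x=2).assign_coords(x=arr['x'].values)

# Choice (2): keep the rolled labels but wrap them into [-180, 180)
shifted = arr.roll(x=2)
shifted = shifted.assign_coords(x=(((shifted['x'] + 180) % 360) - 180))
```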
Note: you might argue that this is overly geoscience-specific, and it would be, if this were only for handling a longitude coordinate. But periodic boundary conditions are common in many areas of the physical sciences.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1875/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
323674056,MDU6SXNzdWUzMjM2NzQwNTY=,2137,0.10.4 release,1217238,closed,0,,,5,2018-05-16T15:31:57Z,2018-05-17T02:29:52Z,2018-05-17T02:29:52Z,MEMBER,,,,"Our last release was April 13 (just over a month ago), and we've had a number of features land since then, so I'd like to issue a new release shortly -- ideally within the next few days, or maybe even later today.

CC @pydata/xarray","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2137/reactions"", ""total_count"": 3, ""+1"": 3, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
305702311,MDU6SXNzdWUzMDU3MDIzMTE=,1993,DataArray.rolling().mean() is way slower than it should be,1217238,closed,0,,,5,2018-03-15T20:10:22Z,2018-03-18T08:56:27Z,2018-03-18T08:56:27Z,MEMBER,,,,"#### Code Sample, a copy-pastable example if possible

From @RayPalmerTech in https://github.com/kwgoodman/bottleneck/issues/186:

```python
import numpy as np
import pandas as pd
import time
import bottleneck as bn
import xarray
import matplotlib.pyplot as plt

N = 30000200  # Number of datapoints
Fs = 30000    # sample rate
T = 1/Fs      # sample period
duration = N/Fs  # duration in s
t = np.arange(0, duration, T)  # time vector
DATA = np.random.randn(N,) + 5*np.sin(2*np.pi*0.01*t)  # Example noisy sine data
w = 330000  # window size

def using_bottleneck_mean(data, width):
    return bn.move_mean(a=data, window=width, min_count=1)

def using_pandas_rolling_mean(data, width):
    return np.asarray(pd.DataFrame(data).rolling(window=width, center=True, min_periods=1).mean()).ravel()

def using_xarray_mean(data, width):
    return xarray.DataArray(data, dims='x').rolling(x=width, min_periods=1, center=True).mean()

start = time.time()
A = using_bottleneck_mean(DATA, w)
print('Bottleneck: ', time.time() - start, 's')

start = time.time()
B = using_pandas_rolling_mean(DATA, w)
print('Pandas: ', time.time() - start, 's')

start = time.time()
C = using_xarray_mean(DATA, w)
print('Xarray: ', time.time() - start, 's')
```

This results in:

```
Bottleneck: 0.0867006778717041 s
Pandas: 0.563546895980835 s
Xarray: 25.133142709732056 s
```

Somehow xarray is way slower than pandas and bottleneck, even though it's using bottleneck under the hood!

#### Problem description

Profiling shows that the majority of time is spent in `xarray.core.rolling.DataArrayRolling._setup_windows`. Monkey-patching that method with a dummy rectifies the issue:

```
xarray.core.rolling.DataArrayRolling._setup_windows = lambda *args: None
```

Now we obtain:

```
Bottleneck: 0.06775331497192383 s
Pandas: 0.48262882232666016 s
Xarray: 0.1723031997680664 s
```

The solution is to set up windows lazily (in `__iter__`), instead of doing it in the constructor.

#### Output of ``xr.show_versions()``
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.96+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.2
pandas: 0.22.0
numpy: 1.14.2
scipy: 0.19.1
netCDF4: None
h5netcdf: None
h5py: 2.7.1
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.1.2
cartopy: None
seaborn: 0.7.1
setuptools: 36.2.7
pip: 9.0.1
conda: None
pytest: None
IPython: 5.5.0
sphinx: None
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1993/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 180756013,MDU6SXNzdWUxODA3NTYwMTM=,1034,test_conventions.TestEncodeCFVariable failing on master for Appveyor Python 2.7 build,1217238,closed,0,,,5,2016-10-03T21:48:55Z,2016-10-22T00:49:53Z,2016-10-22T00:49:53Z,MEMBER,,,,"I have on idea what's going on here but maybe somebody who knows Windows better has a guess: ``` ================================== FAILURES =================================== _________________ TestEncodeCFVariable.test_missing_fillvalue _________________ self = def test_missing_fillvalue(self): v = Variable(['x'], np.array([np.nan, 1, 2, 3])) v.encoding = {'dtype': 'int16'} with self.assertWarns('floating point data as an integer'): > conventions.encode_cf_variable(v) xarray\test\test_conventions.py:523: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ C:\Python27-conda32\lib\contextlib.py:24: in __exit__ self.gen.next() _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = message = 'floating point data as an integer' @contextmanager def assertWarns(self, message): with warnings.catch_warnings(record=True) as w: warnings.filterwarnings('always', message) yield assert len(w) > 0 > assert all(message in str(wi.message) for wi in w) E AssertionError: NameError: all( at 0x0617D170>) << global name 'message' is not defined xarray\test\__init__.py:140: AssertionError ============== 1 failed, 970 passed, 67 skipped in 70.58 seconds ============== ``` I could understand a warning failing to be raised, but the `NameError` is especially strange. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1034/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 89866276,MDU6SXNzdWU4OTg2NjI3Ng==,439,Display datetime64 arrays without showing local timezones,1217238,closed,0,,,5,2015-06-21T05:13:58Z,2016-04-21T15:43:27Z,2016-04-21T15:43:27Z,MEMBER,,,,"NumPy has an unfortunate way of adding local timezone offsets when printing datetime64 arrays: ``` array(['1999-12-31T16:00:00.000000000-0800', '2000-01-01T16:00:00.000000000-0800', '2000-01-02T16:00:00.000000000-0800', ..., '2010-12-10T16:00:00.000000000-0800', '2010-12-11T16:00:00.000000000-0800', '2010-12-12T16:00:00.000000000-0800'], dtype='datetime64[ns]') Coordinates: * time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ... ``` We should use custom formatting code to remove the local timezone (to encourage folks just to use naive timezones/UTC). ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/439/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue