issues

4 rows where comments = 5, state = "open" and user = 1217238 sorted by updated_at descending

Issue 1689: Roundtrip serialization of coordinate variables with spaces in their names

id 271043420 · opened by shoyer (1217238) · state open · 5 comments · created 2017-11-03T16:43:20Z · updated 2024-03-22T14:02:48Z · author_association MEMBER · repo xarray (13221727)

If coordinates have spaces in their names, they get restored from netCDF files as data variables instead:

```
>>> xarray.open_dataset(xarray.Dataset(coords={'name with spaces': 1}).to_netcdf())
<xarray.Dataset>
Dimensions:  ()
Data variables:
    name with spaces  int32 1
```

This happens because the CF convention is to indicate coordinates as a space-separated string, e.g., coordinates='latitude longitude'.

Even though these aren't CF-compliant variable names (which cannot contain spaces), it would be nice to have an ad-hoc convention for xarray that allows us to serialize/deserialize coordinates in all/most cases. Maybe we could use escape characters for spaces (e.g., coordinates='name\ with\ spaces') or quote names if they have spaces (e.g., coordinates='"name with spaces"')?
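To make the quoting idea concrete, here is a minimal sketch of what such a convention could look like; the encode_coordinates/decode_coordinates helpers are hypothetical, not part of xarray:

```python
import re

def encode_coordinates(names):
    # Hypothetical convention: quote any coordinate name that contains
    # spaces before joining into a CF-style 'coordinates' attribute.
    return ' '.join('"%s"' % name if ' ' in name else name for name in names)

def decode_coordinates(attr):
    # Split the attribute back into names, treating quoted runs as one name.
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in re.finditer(r'"([^"]*)"|(\S+)', attr)]

names = ['latitude', 'name with spaces']
assert decode_coordinates(encode_coordinates(names)) == names
```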

At the very least, we should issue a warning in these cases.

Reactions: none

Issue 2171: Support alignment/broadcasting with unlabeled dimensions of size 1

id 325439138 · opened by shoyer (1217238) · state open · 5 comments · created 2018-05-22T19:52:21Z · updated 2022-04-19T03:15:24Z · author_association MEMBER · repo xarray (13221727)

Sometimes, it's convenient to include placeholder dimensions of size 1, which allows for removing any ambiguity related to the order of output dimensions.

Currently, this is not supported with xarray:

```
>>> xr.DataArray([1], dims='x') + xr.DataArray([1, 2, 3], dims='x')
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension sizes: {1, 3}

>>> xr.Variable(('x',), [1]) + xr.Variable(('x',), [1, 2, 3])
ValueError: operands cannot be broadcast together with mismatched lengths for dimension 'x': (1, 3)
```

However, these operations aren't really ambiguous. With size-1 dimensions, we could logically do broadcasting like NumPy arrays, e.g.,

```
>>> np.array([1]) + np.array([1, 2, 3])
array([2, 3, 4])
```

This would be particularly convenient if we add keepdims=True to xarray operations (#2170).
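A minimal sketch of the proposed rule (not xarray's actual implementation): sizes along a shared dimension are compatible when they are equal or when one of them is 1, exactly as in NumPy broadcasting.

```python
def broadcast_dim_size(name, size_a, size_b):
    # NumPy-style rule: equal sizes pass through; a size-1 dimension
    # stretches to match the other operand; anything else is an error.
    if size_a == size_b or size_b == 1:
        return size_a
    if size_a == 1:
        return size_b
    raise ValueError("cannot broadcast dimension %r with sizes %s and %s"
                     % (name, size_a, size_b))

assert broadcast_dim_size('x', 1, 3) == 3
assert broadcast_dim_size('x', 3, 3) == 3
```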

Reactions: +1 × 4

Issue 1460: groupby should still squeeze for non-monotonic inputs

id 237008177 · opened by shoyer (1217238) · state open · 5 comments · created 2017-06-19T20:05:14Z · updated 2022-03-04T21:31:41Z · author_association MEMBER · repo xarray (13221727)

We can simply use argsort() to determine group_indices instead of np.arange(): https://github.com/pydata/xarray/blob/22ff955d53e253071f6e4fa849e5291d0005282a/xarray/core/groupby.py#L256
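A minimal sketch of that idea (hypothetical helper, not the actual groupby code): argsort() the labels, then slice the sorted positions into one index array per group, which works even when the labels are not monotonic.

```python
import numpy as np

def group_indices(labels):
    labels = np.asarray(labels)
    order = np.argsort(labels, kind='stable')  # positions in sorted order
    sorted_labels = labels[order]
    # Boundaries between runs of identical labels in the sorted order.
    boundaries = np.flatnonzero(sorted_labels[1:] != sorted_labels[:-1]) + 1
    return np.split(order, boundaries)

# Non-monotonic input: group 'a' occurs at positions 0 and 2.
print(group_indices(['a', 'b', 'a']))  # [array([0, 2]), array([1])]
```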

Reactions: +1 × 1

Issue 2059: How should xarray serialize bytes/unicode strings across Python/netCDF versions?

id 314444743 · opened by shoyer (1217238) · state open · 5 comments · created 2018-04-15T19:36:55Z · updated 2020-11-19T10:08:16Z · author_association MEMBER · repo xarray (13221727)

netCDF string types

We have several options for storing strings in netCDF files:

- NC_CHAR: netCDF's legacy character type. The closest match is NumPy's 'S1' dtype. In principle, it's supposed to be able to store arbitrary bytes. On HDF5, it uses a UTF-8 encoded string with a fixed size of 1 (but note that HDF5 does not complain about storing arbitrary bytes).
- NC_STRING: netCDF's newer variable-length string type. It's only available on netCDF4 (not netCDF3). It corresponds to an HDF5 variable-length string with UTF-8 encoding.
- NC_CHAR with an _Encoding attribute: xarray and netCDF4-Python support an ad-hoc convention for storing unicode strings in NC_CHAR data-types, by adding the attribute {'_Encoding': 'UTF-8'}. The data is still stored as fixed-width strings, but xarray (and netCDF4-Python) can decode them as unicode.

NC_STRING would seem like a clear win in cases where it's supported, but as @crusaderky points out in https://github.com/pydata/xarray/issues/2040, it actually results in much larger netCDF files in many cases than using character arrays, which are more easily compressed. Nonetheless, we currently default to storing unicode strings in NC_STRING, because it's the most portable option -- every tool that handles HDF5 and netCDF4 should be able to read it properly as unicode strings.

NumPy/Python string types

On the Python side, our options are perhaps even more confusing (the three representations are sketched in code after this list):

- NumPy's dtype=np.string_ corresponds to fixed-length bytes. This is the default dtype for strings on Python 2, because on Python 2 strings are the same as bytes.
- NumPy's dtype=np.unicode_ corresponds to fixed-length unicode. This is the default dtype for strings on Python 3, because on Python 3 strings are the same as unicode.
- Strings are also commonly stored in NumPy arrays with dtype=np.object_, as arrays of either bytes or unicode objects. This is a pragmatic choice, because otherwise NumPy has no support for variable-length strings. We also use this (like pandas) to mark missing values with np.nan.
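For concreteness, the three representations side by side (a sketch; np.string_ and np.unicode_ are legacy NumPy 1.x aliases for np.bytes_ and np.str_):

```python
import numpy as np

fixed_bytes = np.array([b'abc'], dtype=np.string_)     # dtype '|S3': fixed-length bytes
fixed_unicode = np.array([u'abc'], dtype=np.unicode_)  # dtype '<U3': fixed-length unicode
# NumPy has no variable-length string dtype, so object arrays are used
# instead; np.nan can then mark missing values, pandas-style.
variable = np.array([u'abc', np.nan], dtype=object)
```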

Like pandas, we are pretty liberal with converting back and forth between fixed-length (np.string_/np.unicode_) and variable-length (object dtype) representations of strings as necessary. This works pretty well, though converting from object arrays in particular has downsides, since it cannot be done lazily with dask.

Current behavior of xarray

Currently, xarray uses the same behavior on Python 2/3. The priority was faithfully round-tripping data from a particular version of Python to netCDF and back, which the current serialization behavior achieves:

| Python version | NetCDF version | NumPy datatype | NetCDF datatype |
| -------------- | -------------- | -------------- | --------------- |
| Python 2 | NETCDF3 | np.string_ / str | NC_CHAR |
| Python 2 | NETCDF4 | np.string_ / str | NC_CHAR |
| Python 3 | NETCDF3 | np.string_ / bytes | NC_CHAR |
| Python 3 | NETCDF4 | np.string_ / bytes | NC_CHAR |
| Python 2 | NETCDF3 | np.unicode_ / unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | np.unicode_ / unicode | NC_STRING |
| Python 3 | NETCDF3 | np.unicode_ / str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | np.unicode_ / str | NC_STRING |
| Python 2 | NETCDF3 | object bytes/str | NC_CHAR |
| Python 2 | NETCDF4 | object bytes/str | NC_CHAR |
| Python 3 | NETCDF3 | object bytes | NC_CHAR |
| Python 3 | NETCDF4 | object bytes | NC_CHAR |
| Python 2 | NETCDF3 | object unicode | NC_CHAR with UTF-8 encoding |
| Python 2 | NETCDF4 | object unicode | NC_STRING |
| Python 3 | NETCDF3 | object unicode/str | NC_CHAR with UTF-8 encoding |
| Python 3 | NETCDF4 | object unicode/str | NC_STRING |

This can also be selected explicitly for most data-types by setting dtype in encoding (as sketched below):

- 'S1' for NC_CHAR (with or without encoding)
- str for NC_STRING (though I'm not 100% sure it works properly currently when given bytes)
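A short sketch of what that looks like (the file names are placeholders):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'data': np.array([u'abc'])})
# Character-array representation (NC_CHAR); unicode is stored as
# UTF-8 bytes with an _Encoding attribute.
ds.to_netcdf('as_char.nc', encoding={'data': {'dtype': 'S1'}})
# Variable-length string representation (NC_STRING, netCDF4 only).
ds.to_netcdf('as_string.nc', format='NETCDF4', encoding={'data': {'dtype': str}})
```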

Script for generating table:

```python
from __future__ import print_function
import xarray as xr
import uuid
import netCDF4
import numpy as np
import sys

for dtype_name, value in [
        ('np.string_ / ' + type(b'').__name__, np.array([b'abc'])),
        ('np.unicode_ / ' + type(u'').__name__, np.array([u'abc'])),
        ('object bytes/' + type(b'').__name__, np.array([b'abc'], dtype=object)),
        ('object unicode/' + type(u'').__name__, np.array([u'abc'], dtype=object)),
]:
    for format in ['NETCDF3_64BIT', 'NETCDF4']:
        filename = str(uuid.uuid4()) + '.nc'
        xr.Dataset({'data': value}).to_netcdf(filename, format=format)
        with netCDF4.Dataset(filename) as f:
            var = f.variables['data']
            disk_dtype = var.dtype
            has_encoding = hasattr(var, '_Encoding')
            disk_dtype_name = (('NC_CHAR' if disk_dtype == 'S1' else 'NC_STRING')
                               + (' with UTF-8 encoding' if has_encoding else ''))
            print('|', 'Python %i' % sys.version_info[0], '|', format[:7],
                  '|', dtype_name, '|', disk_dtype_name, '|')
```

Potential alternatives

The main option I'm considering is switching to default to NC_CHAR with UTF-8 encoding for np.string_ / str and object bytes/str on Python 2. The current behavior could be explicitly toggled by setting an encoding of {'_Encoding': None}.

This would imply two changes:

1. Attempting to serialize arbitrary bytes (on Python 2) would start raising an error -- anything that isn't ASCII would require explicitly disabling _Encoding.
2. Strings read back from disk on Python 2 would come back as unicode instead of bytes.

This implicit conversion would be consistent with Python 2's general handling of bytes/unicode, and facilitate reading netCDF files on Python 3 that were written with Python 2.

The counter-argument is that it may not be worth changing this at this late point, given that we will be sunsetting Python 2 support by year's end.

Reactions: none

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
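For reference, a sketch of the query this page renders, reconstructed from the filter description at the top; it assumes a local copy of the underlying SQLite database (the github.db filename is hypothetical):

```python
import sqlite3

conn = sqlite3.connect('github.db')  # hypothetical local copy of this database
rows = conn.execute(
    'select number, title, updated_at from issues '
    'where comments = 5 and state = ? and user = ? '
    'order by updated_at desc',
    ('open', 1217238),
).fetchall()
for number, title, updated_at in rows:
    print(number, title, updated_at)
```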