
issues


4 rows where user = 463809 sorted by updated_at descending




id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
789410367 MDU6SXNzdWU3ODk0MTAzNjc= 4826 Reading and writing a zarr dataset multiple times casts bools to int8 amatsukawa 463809 closed 0     10 2021-01-19T22:02:15Z 2023-04-10T09:26:27Z 2023-04-10T09:26:27Z CONTRIBUTOR      

What happened:

Reading and writing a zarr dataset multiple times into different paths changes bool-dtype arrays to int8. I think this issue is related to #2937.

What you expected to happen:

My array's dtype in numpy/dask should not change, even if certain storage backends store dtypes a certain way.

Minimal Complete Verifiable Example:

```python
import xarray as xr
import numpy as np

ds = xr.Dataset({
    "bool_field": xr.DataArray(
        np.random.randn(5) < 0.5,
        dims=('g'),
        coords={'g': np.arange(5)}
    )
})
ds.to_zarr('test.zarr', mode="w")

d2 = xr.open_zarr('test.zarr')
print(d2.bool_field.dtype)
print(d2.bool_field.encoding)
d2.to_zarr("test2.zarr", mode="w")

d3 = xr.open_zarr('test2.zarr')
print(d3.bool_field.dtype)
```

The above snippet prints the following. In `d3`, the dtype of `bool_field` is `int8`, presumably because `d3` inherited `d2`'s `encoding`, which says `int8`, despite the array having a `bool` dtype.

```
bool
{'chunks': (5,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('int8')}
int8
```

Anything else we need to know?:

The current workaround is to set the encoding explicitly. This fixes the problem:

```python
encoding = {k: {"dtype": d2[k].dtype} for k in d2}
d2.to_zarr('test2.zarr', mode="w", encoding=encoding)
```
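An alternative to spelling out every dtype is to drop the inherited on-disk `dtype` entry from each variable's encoding before the second write, so the writer re-derives it from the in-memory dtype. A minimal sketch (the helper name `drop_dtype_encoding` is mine, not xarray API; it assumes only that each variable exposes a mutable `.encoding` dict, as xarray variables do):

```python
def drop_dtype_encoding(dataset):
    """Remove inherited 'dtype' entries so the writer re-derives the
    on-disk dtype from each variable's in-memory dtype."""
    for name in dataset.variables:
        dataset[name].encoding.pop("dtype", None)
    return dataset
```

Usage would then be, e.g., `drop_dtype_encoding(d2).to_zarr("test2.zarr", mode="w")`.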

Environment:

Output of <tt>xr.show_versions()</tt>

```
# I'll update with the full output of xr.show_versions() soon.
In [4]: xr.__version__
Out[4]: '0.16.2'

In [2]: zarr.__version__
Out[2]: '2.6.1'
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4826/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
484098286 MDU6SXNzdWU0ODQwOTgyODY= 3242 An `asfreq` method without `resample`, and clarify or improve resample().asfreq() behavior for down-sampling amatsukawa 463809 open 0     2 2019-08-22T16:33:32Z 2022-04-18T16:01:07Z   CONTRIBUTOR      

MCVE Code Sample

```python
import numpy as np
import xarray as xr
import pandas as pd

data = np.random.random(300)

# Make a time grid that doesn't start exactly on the hour.
time = pd.date_range('2019-01-01', periods=300, freq='T') + pd.Timedelta('3T')
time
# DatetimeIndex(['2019-01-01 00:03:00', '2019-01-01 00:04:00',
#                '2019-01-01 00:05:00', '2019-01-01 00:06:00',
#                '2019-01-01 00:07:00', '2019-01-01 00:08:00',
#                '2019-01-01 00:09:00', '2019-01-01 00:10:00',
#                '2019-01-01 00:11:00', '2019-01-01 00:12:00',
#                ...
#                '2019-01-01 04:53:00', '2019-01-01 04:54:00',
#                '2019-01-01 04:55:00', '2019-01-01 04:56:00',
#                '2019-01-01 04:57:00', '2019-01-01 04:58:00',
#                '2019-01-01 04:59:00', '2019-01-01 05:00:00',
#                '2019-01-01 05:01:00', '2019-01-01 05:02:00'],
#               dtype='datetime64[ns]', length=300, freq='T')

da = xr.DataArray(data, dims=['time'], coords={'time': time})
resampled = da.resample(time='H').asfreq()
resampled
# <xarray.DataArray (time: 6)>
# array([0.478601, 0.488425, 0.496322, 0.479256, 0.523395, 0.201718])
# Coordinates:
#   * time     (time) datetime64[ns] 2019-01-01 ... 2019-01-01T05:00:00

# The value is actually the mean over the time window, e.g. the third value is:
da.loc['2019-01-01T02:00:00':'2019-01-01T02:59:00'].mean()
# <xarray.DataArray ()>
# array(0.496322)
```

Expected Output

The docs say: "Return values of original object at the new up-sampling frequency; essentially a re-index with new times set to NaN."

I suppose this doc is not technically wrong, since upon careful reading, I realize it does not define a behavior for down-sampling. But it's easy to: (1) assume the same behavior (reindexing) for down-sampling and up-sampling and/or (2) expect behavior similar to df.asfreq() in pandas.

Problem Description

I would argue for an `asfreq` method without resampling that matches the pandas behavior, which, AFAIK, is to reindex starting at the first timestamp at the specified interval.

```python
df = pd.DataFrame(da, index=time)
df.asfreq('H')
#                             0
# 2019-01-01 00:03:00  0.065304
# 2019-01-01 01:03:00  0.325814
# 2019-01-01 02:03:00  0.841201
# 2019-01-01 03:03:00  0.610266
# 2019-01-01 04:03:00  0.613906
```

This can already be achieved easily, so it's not a blocker:

```python
da.reindex(time=pd.date_range(da.time[0].values, da.time[-1].values, freq='H'))
# <xarray.DataArray (time: 5)>
# array([0.065304, 0.325814, 0.841201, 0.610266, 0.613906])
# Coordinates:
#   * time     (time) datetime64[ns] 2019-01-01T00:03:00 ... 2019-01-01T04:03:00
```

The reason I argue for `asfreq` functionality outside of resampling is that `asfreq(freq)` in pandas is purely a reindex, compared to e.g. `resample(freq).first()`, which would give you a different time index.
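The claimed pandas equivalence can be checked directly: `Series.asfreq(freq)` produces the same result as an explicit reindex onto a `date_range` anchored at the first timestamp. A sketch in plain pandas (the synthetic series is my own; modern-pandas aliases `min`/`h` are used in place of the deprecated `T`/`H`):

```python
import numpy as np
import pandas as pd

# A time grid that doesn't start exactly on the hour, as in the example above.
time = pd.date_range('2019-01-01', periods=300, freq='min') + pd.Timedelta(minutes=3)
s = pd.Series(np.arange(300.0), index=time)

# asfreq is purely a reindex anchored at the first timestamp...
hourly = s.asfreq('h')

# ...equivalent to an explicit reindex onto an hourly grid.
manual = s.reindex(pd.date_range(s.index[0], s.index[-1], freq='h'))
assert hourly.equals(manual)
```

Note the resulting index starts at 00:03 and steps hourly, unlike `resample('h')`, whose bins are anchored on the hour.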

Output of xr.show_versions()

I'm still on Python 2.7, where `show_versions` actually throws an exception because some HDF5 library doesn't have a magic property. I don't think that detail is relevant here, though.

```
>>> xr.__version__
u'0.11.3'
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3242/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
520507183 MDExOlB1bGxSZXF1ZXN0MzM5MDg0MjUz 3504 Allow appending datetime & boolean variables to zarr stores amatsukawa 463809 closed 0     5 2019-11-09T20:09:29Z 2019-11-13T18:47:42Z 2019-11-13T15:55:33Z CONTRIBUTOR   0 pydata/xarray/pulls/3504
  • [x] Closes #3480
  • [x] Tests added
  • [x] Passes `black . && mypy . && flake8`
  • [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API

AFAICT, the type checking in `_validate_datatypes_for_zarr_append` is simply too strict, and relaxing it seems to work fine. But this is my first time digging into the xarray source code, so please let me know if this issue is more complex.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3504/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
516725099 MDU6SXNzdWU1MTY3MjUwOTk= 3480 Allow appending non-numerical types to zarr arrays. amatsukawa 463809 closed 0     0 2019-11-02T21:20:53Z 2019-11-13T15:55:33Z 2019-11-13T15:55:33Z CONTRIBUTOR      

MCVE Code Sample

Zarr itself allows appending `np.datetime64` and `np.bool` types:

```python
path = 'tmp/test.zarr'
z1 = zarr.open(path, mode='w', shape=(10,), chunks=(10,), dtype='M8[D]')
z1[:] = '1990-01-01'
z2 = zarr.open(path, mode='a')
a = np.array(['1992-01-01'] * 10, dtype='datetime64[D]')
z2.append(a)
# (20,)
z2
# <zarr.core.Array (20,) datetime64[D]>
```

But its equivalent in xarray throws an error:

```
>>> ds = xr.Dataset(
...     {'y': (('x',), np.array(['1991-01-01'] * 10, dtype='datetime64[D]'))}
... )
>>> ds.to_zarr('tmp/test_xr.zarr', mode='w')
<xarray.backends.zarr.ZarrStore object at 0x31f403170>
>>> ds2 = xr.Dataset(
...     {'y': (('x',), np.array(['1992-01-01'] * 10, dtype='datetime64[D]'))}
... )
>>> ds2.to_zarr('tmp/test_xr.zarr', mode='a', append_dim='x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/personal/opt/anaconda3/lib/python3.7/site-packages/xarray/core/dataset.py", line 1616, in to_zarr
    append_dim=append_dim,
  File "/Users/personal/opt/anaconda3/lib/python3.7/site-packages/xarray/backends/api.py", line 1304, in to_zarr
    _validate_datatypes_for_zarr_append(dataset)
  File "/Users/personal/opt/anaconda3/lib/python3.7/site-packages/xarray/backends/api.py", line 1249, in _validate_datatypes_for_zarr_append
    check_dtype(k)
  File "/Users/personal/opt/anaconda3/lib/python3.7/site-packages/xarray/backends/api.py", line 1245, in check_dtype
    "unicode string or an object".format(var)
ValueError: Invalid dtype for data variable: <xarray.DataArray 'y' (x: 10)>
array(['1992-01-01T00:00:00.000000000', '1992-01-01T00:00:00.000000000',
       '1992-01-01T00:00:00.000000000', '1992-01-01T00:00:00.000000000',
       '1992-01-01T00:00:00.000000000', '1992-01-01T00:00:00.000000000',
       '1992-01-01T00:00:00.000000000', '1992-01-01T00:00:00.000000000',
       '1992-01-01T00:00:00.000000000', '1992-01-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
Dimensions without coordinates: x
dtype must be a subtype of number, a fixed sized string, a fixed size
unicode string or an object
```

Expected Output

The append should succeed.

Problem Description

This function in `xarray/backends/api.py` is too strict on types:

```python
def _validate_datatypes_for_zarr_append(dataset):
    """DataArray.name and Dataset keys must be a string or None"""

    def check_dtype(var):
        if (
            not np.issubdtype(var.dtype, np.number)
            and not coding.strings.is_unicode_dtype(var.dtype)
            and not var.dtype == object
        ):
            # and not re.match('^bytes[1-9]+$', var.dtype.name)):
            raise ValueError(
                "Invalid dtype for data variable: {} "
                "dtype must be a subtype of number, "
                "a fixed sized string, a fixed size "
                "unicode string or an object".format(var)
            )

    for k in dataset.data_vars.values():
        check_dtype(k)
```

`np.datetime64[.]` and `np.bool` are not numbers:

```python
>>> np.issubdtype(np.dtype('datetime64[D]'), np.number)
False
>>> np.issubdtype(np.dtype('bool'), np.number)
False
```
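A relaxed check would additionally admit these kinds. A hedged sketch of such a predicate (my own illustration, not the actual xarray code; string/unicode kinds are tested via `dtype.kind` instead of xarray's `coding.strings` helper):

```python
import numpy as np

def is_appendable_dtype(dtype):
    """Accept numeric, datetime64/timedelta64, bool, fixed-width
    string/unicode, and object dtypes; reject everything else."""
    return (
        np.issubdtype(dtype, np.number)
        or np.issubdtype(dtype, np.datetime64)
        or np.issubdtype(dtype, np.timedelta64)
        or np.issubdtype(dtype, np.bool_)
        or dtype == object
        or dtype.kind in ("S", "U")
    )
```

With this predicate, the datetime64 and bool appends above would pass validation while structured dtypes would still be rejected.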

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.4 (default, Aug 13 2019, 15:17:50) [Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 18.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: None
xarray: 0.14.0
pandas: 0.25.1
numpy: 1.17.2
scipy: 1.3.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: 2.3.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.5.2
distributed: 2.5.2
matplotlib: 3.1.1
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 41.4.0
pip: 19.2.3
conda: 4.7.12
pytest: 5.2.1
IPython: 7.8.0
sphinx: 2.2.0
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3480/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
Powered by Datasette · Queries took 23.258ms · About: xarray-datasette