id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
789410367,MDU6SXNzdWU3ODk0MTAzNjc=,4826,Reading and writing a zarr dataset multiple times casts bools to int8,463809,closed,0,,,10,2021-01-19T22:02:15Z,2023-04-10T09:26:27Z,2023-04-10T09:26:27Z,CONTRIBUTOR,,,,"**What happened**:
Reading and writing a zarr dataset multiple times into different paths changes `bool` dtype arrays to `int8`. I think this issue is related to #2937.
**What you expected to happen**:
My array's dtype in numpy/dask should not change, even if certain storage backends store dtypes a certain way.
**Minimal Complete Verifiable Example**:
```python
import xarray as xr
import numpy as np
ds = xr.Dataset({
""bool_field"": xr.DataArray(
np.random.randn(5) < 0.5,
dims=('g'),
coords={'g': np.arange(5)}
)
})
ds.to_zarr('test.zarr', mode=""w"")
d2 = xr.open_zarr('test.zarr')
print(d2.bool_field.dtype)
print(d2.bool_field.encoding)
d2.to_zarr(""test2.zarr"", mode=""w"")
d3 = xr.open_zarr('test2.zarr')
print(d3.bool_field.dtype)
```
The above snippet prints the following. In `d3`, the dtype of `bool_field` is `int8`, presumably because `d3` inherited `d2`'s `encoding`, which specifies `int8`, even though the in-memory array is `bool`.
```
bool
{'chunks': (5,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('int8')}
int8
```
**Anything else we need to know?**:
The current workaround is to set the encodings explicitly. This fixes the problem:
```python
encoding = {k: {""dtype"": d2[k].dtype} for k in d2}
d2.to_zarr('test2.zarr', mode=""w"", encoding=encoding)
```
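A possible alternative (an untested sketch, assuming the inherited `dtype` entry in `encoding` is what triggers the cast) is to drop that key before the second write, so xarray falls back to the in-memory dtype:
```python
# Sketch only: clear the dtype entries inherited from the first zarr store
for var in d2.variables.values():
    var.encoding.pop('dtype', None)
d2.to_zarr('test2.zarr', mode='w')
```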
**Environment**:
Output of xr.show_versions()
```
# I'll update with the full output of xr.show_versions() soon.
In [4]: xr.__version__
Out[4]: '0.16.2'
In [2]: zarr.__version__
Out[2]: '2.6.1'
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4826/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
484098286,MDU6SXNzdWU0ODQwOTgyODY=,3242,"An `asfreq` method without `resample`, and clarify or improve resample().asfreq() behavior for down-sampling",463809,open,0,,,2,2019-08-22T16:33:32Z,2022-04-18T16:01:07Z,,CONTRIBUTOR,,,,"#### MCVE Code Sample
```python
>>> import numpy as np
>>> import xarray as xr
>>> import pandas as pd
>>> data = np.random.random(300)
# Make a time grid that doesn't start exactly on the hour.
>>> time = pd.date_range('2019-01-01', periods=300, freq='T') + pd.Timedelta('3T')
>>> time
DatetimeIndex(['2019-01-01 00:03:00', '2019-01-01 00:04:00',
'2019-01-01 00:05:00', '2019-01-01 00:06:00',
'2019-01-01 00:07:00', '2019-01-01 00:08:00',
'2019-01-01 00:09:00', '2019-01-01 00:10:00',
'2019-01-01 00:11:00', '2019-01-01 00:12:00',
...
'2019-01-01 04:53:00', '2019-01-01 04:54:00',
'2019-01-01 04:55:00', '2019-01-01 04:56:00',
'2019-01-01 04:57:00', '2019-01-01 04:58:00',
'2019-01-01 04:59:00', '2019-01-01 05:00:00',
'2019-01-01 05:01:00', '2019-01-01 05:02:00'],
dtype='datetime64[ns]', length=300, freq='T')
>>> da = xr.DataArray(data, dims=['time'], coords={'time': time})
>>> resampled = da.resample(time='H').asfreq()
>>> resampled
array([0.478601, 0.488425, 0.496322, 0.479256, 0.523395, 0.201718])
Coordinates:
* time (time) datetime64[ns] 2019-01-01 ... 2019-01-01T05:00:00
# The value is actually the mean over the time window, e.g. the third value is:
>>> da.loc['2019-01-01T02:00:00':'2019-01-01T02:59:00'].mean()
array(0.496322)
```
#### Expected Output
Docs say this:
```
Return values of original object at the new up-sampling frequency;
essentially a re-index with new times set to NaN.
```
I suppose this doc is not technically wrong, since on careful reading it does not define any behavior for down-sampling. But it is easy to (1) assume the same behavior (reindexing) for down-sampling as for up-sampling, and/or (2) expect behavior similar to `df.asfreq()` in pandas.
#### Problem Description
I would argue for an `asfreq` method without resampling that matches the pandas behavior, which, AFAIK, is to reindex starting at the first timestamp, at the specified interval.
```
>>> df = pd.DataFrame(da, index=time)
>>> df.asfreq('H')
0
2019-01-01 00:03:00 0.065304
2019-01-01 01:03:00 0.325814
2019-01-01 02:03:00 0.841201
2019-01-01 03:03:00 0.610266
2019-01-01 04:03:00 0.613906
```
This can already be achieved easily, so it's not a blocker.
```
>>> da.reindex(time=pd.date_range(da.time[0].values, da.time[-1].values, freq='H'))
array([0.065304, 0.325814, 0.841201, 0.610266, 0.613906])
Coordinates:
* time (time) datetime64[ns] 2019-01-01T00:03:00 ... 2019-01-01T04:03:00
```
The reason I argue for `asfreq` functionality outside of resampling is that `asfreq(freq)` in pandas is purely a reindex, whereas e.g. `resample(freq).first()` would give you a different time index.
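To illustrate that index difference (a hypothetical continuation of the example above, not part of the original report):
```
>>> df.asfreq('H').index.min()            # keeps the original anchor: 2019-01-01 00:03:00
>>> df.resample('H').first().index.min()  # snaps to the hour:         2019-01-01 00:00:00
```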
#### Output of ``xr.show_versions()``
I'm still on Python 2.7, where `show_versions` actually throws an exception because an HDF5 library is missing a magic property. I don't think this detail is relevant here, though.
```
>>> xr.__version__
u'0.11.3'
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3242/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
516725099,MDU6SXNzdWU1MTY3MjUwOTk=,3480,Allow appending non-numerical types to zarr arrays.,463809,closed,0,,,0,2019-11-02T21:20:53Z,2019-11-13T15:55:33Z,2019-11-13T15:55:33Z,CONTRIBUTOR,,,,"#### MCVE Code Sample
Zarr itself allows appending `np.datetime64` and `np.bool` types.
```python
>>> path = 'tmp/test.zarr'
>>> z1 = zarr.open(path, mode='w', shape=(10,), chunks=(10,), dtype='M8[D]')
>>> z1[:] = '1990-01-01'
>>> z2 = zarr.open(path, mode='a')
>>> a = np.array(['1992-01-01'] * 10, dtype='datetime64[D]')
>>> z2.append(a)
(20,)
>>> z2
```
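For completeness, a hypothetical companion snippet (not from the original report, assuming `zarr` and `numpy` are imported as above) shows that booleans append fine too:
```python
>>> zb = zarr.open('tmp/test_bool.zarr', mode='w', shape=(5,), chunks=(5,), dtype='bool')
>>> zb[:] = True
>>> zb.append(np.zeros(5, dtype='bool'))
(10,)
```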
But the equivalent operation in xarray throws an error:
```
>>> ds = xr.Dataset(
... {'y': (('x',), np.array(['1991-01-01'] * 10, dtype='datetime64[D]'))}
... )
>>> ds.to_zarr('tmp/test_xr.zarr', mode='w')
>>> ds2 = xr.Dataset(
... {'y': (('x',), np.array(['1992-01-01'] * 10, dtype='datetime64[D]'))}
... )
>>> ds2.to_zarr('tmp/test_xr.zarr', mode='a', append_dim='x')
Traceback (most recent call last):
File """", line 1, in
File ""/Users/personal/opt/anaconda3/lib/python3.7/site-packages/xarray/core/dataset.py"", line 1616, in to_zarr
append_dim=append_dim,
File ""/Users/personal/opt/anaconda3/lib/python3.7/site-packages/xarray/backends/api.py"", line 1304, in to_zarr
_validate_datatypes_for_zarr_append(dataset)
File ""/Users/personal/opt/anaconda3/lib/python3.7/site-packages/xarray/backends/api.py"", line 1249, in _validate_datatypes_for_zarr_append
check_dtype(k)
File ""/Users/personal/opt/anaconda3/lib/python3.7/site-packages/xarray/backends/api.py"", line 1245, in check_dtype
""unicode string or an object"".format(var)
ValueError: Invalid dtype for data variable:
array(['1992-01-01T00:00:00.000000000', '1992-01-01T00:00:00.000000000',
'1992-01-01T00:00:00.000000000', '1992-01-01T00:00:00.000000000',
'1992-01-01T00:00:00.000000000', '1992-01-01T00:00:00.000000000',
'1992-01-01T00:00:00.000000000', '1992-01-01T00:00:00.000000000',
'1992-01-01T00:00:00.000000000', '1992-01-01T00:00:00.000000000'],
dtype='datetime64[ns]')
Dimensions without coordinates: x dtype must be a subtype of number, a fixed sized string, a fixed size unicode string or an object
```
#### Expected Output
The append should succeed.
#### Problem Description
This function in `xarray/backends/api.py` is too strict about types:
```
def _validate_datatypes_for_zarr_append(dataset):
""""""DataArray.name and Dataset keys must be a string or None""""""
def check_dtype(var):
if (
not np.issubdtype(var.dtype, np.number)
and not coding.strings.is_unicode_dtype(var.dtype)
and not var.dtype == object
):
# and not re.match('^bytes[1-9]+$', var.dtype.name)):
raise ValueError(
""Invalid dtype for data variable: {} ""
""dtype must be a subtype of number, ""
""a fixed sized string, a fixed size ""
""unicode string or an object"".format(var)
)
for k in dataset.data_vars.values():
check_dtype(k)
```
`np.datetime64` (with any unit) and `np.bool` are not subdtypes of `np.number`:
```
>>> np.issubdtype(np.dtype('datetime64[D]'), np.number)
False
>>> np.issubdtype(np.dtype('bool'), np.number)
False
```
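One possible direction (a minimal sketch, not a tested patch, assuming the same module context with `np` and `coding` available) would be to also accept datetime and boolean dtypes in the check:
```
def check_dtype(var):
    # sketch: additionally allow datetime64 and bool, which zarr itself can append
    if (
        not np.issubdtype(var.dtype, np.number)
        and not np.issubdtype(var.dtype, np.datetime64)
        and not np.issubdtype(var.dtype, np.bool_)
        and not coding.strings.is_unicode_dtype(var.dtype)
        and not var.dtype == object
    ):
        raise ValueError(
            'Invalid dtype for data variable: {} '
            'dtype must be a subtype of number, datetime, bool, '
            'a fixed sized string, a fixed size '
            'unicode string or an object'.format(var)
        )
```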
#### Output of ``xr.show_versions()``
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 18.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: None
xarray: 0.14.0
pandas: 0.25.1
numpy: 1.17.2
scipy: 1.3.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: 2.3.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.5.2
distributed: 2.5.2
matplotlib: 3.1.1
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 41.4.0
pip: 19.2.3
conda: 4.7.12
pytest: 5.2.1
IPython: 7.8.0
sphinx: 2.2.0
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3480/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue