html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/7790#issuecomment-1532441433,https://api.github.com/repos/pydata/xarray/issues/7790,1532441433,IC_kwDOAMm_X85bVzNZ,5821660,2023-05-03T04:25:50Z,2023-05-03T04:25:50Z,MEMBER,"@christine-e-smit Great, this works on your side with the proposed patch in #7098.
Nevertheless, we've identified three more issues here in the debugging process which can now be handled one by one. So again, thanks for your contribution here.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1685803922
https://github.com/pydata/xarray/issues/7790#issuecomment-1531050846,https://api.github.com/repos/pydata/xarray/issues/7790,1531050846,IC_kwDOAMm_X85bQfte,5821660,2023-05-02T08:04:45Z,2023-05-03T04:20:11Z,MEMBER,"As in #7098, citing @dcherian:
> I think the real solution here is to explicitly handle NaNs during the decoding step. We do want these to be NaT in the output.
There are three more issues revealed here when using datetime64:
- if _FillValue is set in encoding, it has to be of the same type/resolution as the times in the array
- if _FillValue is provided, we need to provide `dtype` and `units` which fit our data,
e.g. if the _FillValue is referenced to the unix-epoch, the units should be equivalent
- when encoding in the presence of NaT the data array is converted to floating point with NaN, which is problematic for the subsequent conversion to int64","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1685803922
https://github.com/pydata/xarray/issues/7790#issuecomment-1530991257,https://api.github.com/repos/pydata/xarray/issues/7790,1530991257,IC_kwDOAMm_X85bQRKZ,5821660,2023-05-02T07:09:38Z,2023-05-02T08:14:36Z,MEMBER,"@christine-e-smit I've created a fresh environment with only xarray and zarr and it still works on my machine. I've then followed the Darwin idea and dug up #6191 (I've got those casting warnings from exactly the line you were referring to). Comment https://github.com/pydata/xarray/issues/6191#issuecomment-1209567966 should explain what happens here.
tl;dr citing @DocOtak
> The short explanation is that the time conversion functions do an `astype(np.int64)` or equivalent cast on arrays that contain nans. This is [undefined behavior](https://github.com/numpy/numpy/issues/13101#issuecomment-740058842) and very soon, doing this will [start to emit RuntimeWarnings](https://github.com/numpy/numpy/pull/21437).
There is also an open PR #7098.
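For illustration, here is a minimal sketch of what handling NaNs explicitly before the integer cast could look like. This is not the patch from #7098; the array values and the names `encoded`, `missing` and `decoded` are made up for the example.

```python
import numpy as np

# Encoded times as floats, with NaN marking a missing value (assumed layout).
encoded = np.array([np.nan, 1672617600000000000.0])

# Mask the NaNs first so the integer cast never sees them.
missing = np.isnan(encoded)
as_int = np.where(missing, 0, encoded).astype(np.int64)

# Interpret the integers as nanoseconds since the unix epoch,
# then restore the missing entries as NaT.
decoded = as_int.astype('datetime64[ns]')
decoded[missing] = np.datetime64('NaT')

print(decoded)
```

The point is only that the NaN positions are recorded and replaced before `astype(np.int64)` runs, so the cast stays well defined.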
Thanks @christine-e-smit for sticking with me to find the root-cause here by providing detailed information and code examples. :+1: ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1685803922
https://github.com/pydata/xarray/issues/7790#issuecomment-1530141083,https://api.github.com/repos/pydata/xarray/issues/7790,1530141083,IC_kwDOAMm_X85bNBmb,5821660,2023-05-01T20:01:50Z,2023-05-01T20:01:50Z,MEMBER,"@christine-e-smit One more idea, you might delete the zarr folder before re-creating (if you are not doing that already). I've removed the complete folder before any new write (by putting eg. `!rm -rf xarray_and_units.zarr` at the beginning of the notebook-cell).
It would also be great if you could run the code from https://github.com/pydata/xarray/issues/7790#issuecomment-1529894939 and post the output here, just for the sake of comparison (please delete the zarr-folder before if it exists). Thanks!
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1685803922
https://github.com/pydata/xarray/issues/7790#issuecomment-1530131533,https://api.github.com/repos/pydata/xarray/issues/7790,1530131533,IC_kwDOAMm_X85bM_RN,5821660,2023-05-01T19:53:53Z,2023-05-01T19:53:53Z,MEMBER,"@christine-e-smit I've plugged your code into a fresh notebook, here is my output:
```python
**********************
xarray created with NaT fill value
----------------------
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
**********************
xarray created read with NaT fill value
----------------------
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9223372036854775808, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
```
The output seems OK on my side. I've no idea why the data isn't correctly decoded as NaT on your side. I've checked that my environment is comparable to yours. The only difference remaining is you are on Darwin arm64 whereas I'm on Linux.
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.2 | packaged by conda-forge | (main, Mar 31 2023, 17:51:05) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-144-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.14.0
libnetcdf: None
xarray: 2023.4.2
pandas: 2.0.1
numpy: 1.24.3
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.14.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.3.2
distributed: 2023.3.2
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.3.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.6.1
pip: 23.0.1
conda: None
pytest: 7.2.2
mypy: 0.982
IPython: 8.12.0
sphinx: None
``` ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1685803922
https://github.com/pydata/xarray/issues/7790#issuecomment-1530111912,https://api.github.com/repos/pydata/xarray/issues/7790,1530111912,IC_kwDOAMm_X85bM6eo,5821660,2023-05-01T19:30:22Z,2023-05-01T19:30:22Z,MEMBER,"> Unfortunately, I think you may have also gotten some wires crossed? You set the time fill value to 1900-01-01, but then use NaT in the actual array?
Yes, I use NaT because I want to check if the encoder does correctly translate NaT to the provided _FillValue on write.
So from your last example I'm assuming you would like to have the int64 representation of NaT as _FillValue, right?
I'll try to adapt this, and see what I get
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1685803922
https://github.com/pydata/xarray/issues/7790#issuecomment-1529894939,https://api.github.com/repos/pydata/xarray/issues/7790,1529894939,IC_kwDOAMm_X85bMFgb,5821660,2023-05-01T16:05:19Z,2023-05-01T16:05:19Z,MEMBER,"So, after some debugging I think I've found two issues here with the current code.
First, we need to give the fillvalue with a fitting resolution. Second, we have an issue with inferring the units from the data (if not given).
Here is some workaround code which (finally, :crossed_fingers:) should at least write and read correct data (added comments below):
```python
import numpy as np
import xarray as xr
import zarr

# Create a numpy array of type np.datetime64 with one fill value and one date
# FIRST ISSUE WITH _FillValue
# we need to provide ns resolution here too, otherwise we get wrong fillvalues (day-reference)
time_fill_value = np.datetime64(""1900-01-01 00:00:00.00000000"", ""ns"")
time = np.array([np.datetime64(""NaT"", ""ns""), '2023-01-02 00:00:00.00000000'], dtype='M8[ns]')
# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time,dims=['time'],name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))
print(""******************"")
print(""Created with fill value 1900-01-01"")
print(xr_ds[""time""])
# Save the dataset to zarr
location_new_fill = ""from_xarray_new_fill.zarr""
# SECOND ISSUE with inferring units from data
# We need to specify ""dtype"" and ""units"" which fit our data
# Note: as we provide a _FillValue with a reference to unix-epoch
# we need to provide a fitting units too
encoding = {
""time"":{""_FillValue"":time_fill_value, ""dtype"":np.int64, ""units"":""nanoseconds since 1970-01-01""}
}
xr_ds.to_zarr(location_new_fill, mode=""w"", encoding=encoding)
xr_read = xr.open_zarr(location_new_fill)
print(""******************"")
print(""Read back out of the zarr store with xarray"")
print(xr_read[""time""])
print(xr_read[""time""].attrs)
print(xr_read[""time""].encoding)
z_new_fill = zarr.open('from_xarray_new_fill.zarr','r', )
print(""******************"")
print(""Read back out of the zarr store with zarr"")
print(z_new_fill[""time""])
print(z_new_fill[""time""].attrs)
print(z_new_fill[""time""][:])
```
```python
******************
Created with fill value 1900-01-01
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
******************
Read back out of the zarr store with xarray
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -2208988800000000000, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
******************
Read back out of the zarr store with zarr
[-2208988800000000000 1672617600000000000]
```
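As a side note, the effect of the fill value resolution can be checked with plain numpy, independent of xarray and zarr. A small sketch (the variable names are only for illustration):

```python
import numpy as np

# Same date at two resolutions: day (what a date-only string gives) and ns.
fill_days = np.array(['1900-01-01'], dtype='datetime64[D]')
fill_ns = fill_days.astype('datetime64[ns]')

# The integer representation counts units since the unix epoch,
# so it depends on the resolution of the datetime64 value.
print(fill_days.astype('int64')[0])  # days since 1970-01-01
print(fill_ns.astype('int64')[0])    # nanoseconds since 1970-01-01
```

The first number should be -25567 (days) and the second should match the `_FillValue` of -2208988800000000000 shown in the encoding above.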
@christine-e-smit Please let me know, if the above workaround gives you correct results in your workflow. If so, then we can think about how to automatically align fillvalue-resolution with data-resolution and what needs to be done to correctly deduce the units.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1685803922
https://github.com/pydata/xarray/issues/7790#issuecomment-1529076482,https://api.github.com/repos/pydata/xarray/issues/7790,1529076482,IC_kwDOAMm_X85bI9sC,5821660,2023-04-30T16:52:25Z,2023-04-30T16:52:25Z,MEMBER,"> ```python
> xr_ds.to_zarr(location_new_fill,encoding=encoding)
>
> xr_read = xr.open_zarr(location)
> print(""******************"")
> print(""Read back out of the zarr store with xarray"")
> print(xr_read[""time""])
> print(xr_read[""time""].encoding)
> ```
@christine-e-smit Is this just a remnant of copy&paste? The above code writes to `location_new_fill`, but reads from `location`.
Here is my code and output for comparison (using latest zarr/xarray):
```python
import numpy as np
import xarray as xr

# Create a numpy array of type np.datetime64 with one fill value and one date
time_fill_value = np.datetime64(""1900-01-01"")
time = np.array([np.datetime64(""NaT""), '2023-01-02'], dtype='M8[ns]')
# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time,dims=['time'],name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))
print(""******************"")
print(""Created with fill value 1900-01-01"")
print(xr_ds[""time""])
# Save the dataset to zarr
location_new_fill = ""from_xarray_new_fill.zarr""
encoding = {
""time"":{""_FillValue"":time_fill_value,""dtype"":np.int64}
}
xr_ds.to_zarr(location_new_fill, encoding=encoding)
xr_read = xr.open_zarr(location_new_fill)
print(""******************"")
print(""Read back out of the zarr store with xarray"")
print(xr_read[""time""])
print(xr_read[""time""].encoding)
```
```python
******************
Created with fill value 1900-01-01
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
******************
Read back out of the zarr store with xarray
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -25567, 'units': 'days since 2023-01-02 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
```
This doesn't look correct either. At least the decoded `_FillValue` or the `units` are wrong. So -25567 is 1900-01-01 when referenced to the unix-epoch (Question: Is zarr time based on unix epoch?). When read back via zarr only this would decode into:
```python
array(['1953-01-02T00:00:00.000000000', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
```
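To make the mismatch explicit, here is a small numpy-only sketch (the names are just for illustration) that interprets the stored -25567 against both reference dates:

```python
import numpy as np

stored = np.timedelta64(-25567, 'D')

# Interpreted relative to the unix epoch, this is the intended fill value.
relative_to_epoch = np.datetime64('1970-01-01') + stored

# Interpreted relative to the units written by xarray
# ('days since 2023-01-02'), it becomes a different, valid-looking date.
relative_to_units = np.datetime64('2023-01-02') + stored

print(relative_to_epoch, relative_to_units)
```

So the same integer is either 1900-01-01 or 1953-01-02, depending on which reference date the reader assumes.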
I totally agree with @christine-e-smit, this is all very confusing. As said at the beginning, I have little knowledge of zarr. I'm currently digging into cf encoding/decoding which made me jump on here.
AFAICT, it looks like the encoding already has a problem, at least the data on disk is already not what we expect. It seems that somehow the xarray cf_encoding/decoding is not well aligned with the zarr writing/reading of datetimes.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1685803922
https://github.com/pydata/xarray/issues/7790#issuecomment-1527050493,https://api.github.com/repos/pydata/xarray/issues/7790,1527050493,IC_kwDOAMm_X85bBPD9,5821660,2023-04-28T06:21:38Z,2023-04-28T06:21:38Z,MEMBER,"Thanks @dcherian for filling in the details.
I've dug up some more related issues: #2265, #3942, #4045
IIUC, #4684 did a great job of ironing out many of these issues, but apparently only for the case where no `NaT` is in the time array (cc @spencerkclark). @christine-e-smit If you have no `NaT` in your time array then you can just omit `encoding` completely and Xarray will use int64 by default and your data should be fine on disk.
In the presence of `NaT`, one workaround for the time being is to add the `dtype` in addition to `_FillValue` when writing out to zarr:
```python
encoding = {
""time"":{""_FillValue"": time_fill_value, ""dtype"": np.int64}
}
xr_ds.to_zarr(location, encoding=encoding)
```
One note to this: Xarray is deducing the `units` from the current time data. So for the above example it will result in `'days since 2023-01-02 00:00:00'` where `days` would now be the resolution in the file. If you want the resolution to be nanoseconds on disk `units` would need to be added to the encoding.
```python
encoding = {
""time"":{""_FillValue"": time_fill_value, ""dtype"": np.int64, 'units': 'nanoseconds since 2023-01-02'}
}
xr_ds.to_zarr(location, encoding=encoding)
```
@christine-e-smit It would be great if you could confirm that from your side (some sanity check needed on my side).
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1685803922
https://github.com/pydata/xarray/issues/7790#issuecomment-1525790614,https://api.github.com/repos/pydata/xarray/issues/7790,1525790614,IC_kwDOAMm_X85a8beW,5821660,2023-04-27T14:23:16Z,2023-04-27T14:23:16Z,MEMBER,"@christine-e-smit I see, thanks for the details. AFAICT from the code it looks like `zarr` is special-cased in some ways compared to other backends. I'd really rely on some zarr-expert shedding light here and over at #7776.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1685803922
https://github.com/pydata/xarray/issues/7790#issuecomment-1525524428,https://api.github.com/repos/pydata/xarray/issues/7790,1525524428,IC_kwDOAMm_X85a7afM,5821660,2023-04-27T11:26:15Z,2023-04-27T11:26:15Z,MEMBER,"Xref: discussion #7776, which got no attention up to now.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1685803922
https://github.com/pydata/xarray/issues/7790#issuecomment-1525513525,https://api.github.com/repos/pydata/xarray/issues/7790,1525513525,IC_kwDOAMm_X85a7X01,5821660,2023-04-27T11:19:24Z,2023-04-27T11:19:24Z,MEMBER,"@christine-e-smit
So, I'm no `zarr` expert, but it turns out that your `NaT` was converted to `-9.223372036854776e+18` in the encoding step. I'm assuming that `zarr` is converting `NaT` because the format doesn't allow using `NaT` directly, so it chooses a (default) value.
The `_FillValue` is not lost; it is preserved in the `.encoding`-dict of the underlying Variable:
```python
xr_read = xr.open_zarr(location)
print(""******************"")
print(""No fill value"")
print(xr_read[""time""])
print(xr_read[""time""].encoding)
```
```python
******************
No fill value
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9.223372036854776e+18, 'units': 'days since 2023-01-02 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('float64')}
```
You might also check this without decoding (`decode_cf=False`):
```python
with xr.open_zarr(location, decode_cf=False) as xr_read:
    print(""******************"")
    print(""No fill value"")
    print(xr_read[""time""])
    print(xr_read[""time""].encoding)
```
```python
******************
No fill value
array([-9.223372e+18, 0.000000e+00])
Coordinates:
* time (time) float64 -9.223e+18 0.0
Attributes:
calendar: proleptic_gregorian
units: days since 2023-01-02 00:00:00
_FillValue: -9.223372036854776e+18
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('float64')}
```
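For reference, the value -9.223372036854776e+18 is not arbitrary. As far as I know, numpy stores `NaT` as the smallest int64, and the float above is that integer converted to float64. A quick numpy-only check (the variable names are illustrative):

```python
import numpy as np

nat = np.array(['NaT'], dtype='datetime64[ns]')

# Reinterpret the underlying 8 bytes as int64 without any conversion.
nat_as_int = nat.view('int64')[0]

print(nat_as_int == np.iinfo(np.int64).min)
print(float(nat_as_int))
```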
Maybe a zarr expert can chime in here on the best practice for time fill values.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1685803922