issue_comments

25 rows where issue = 1685803922 sorted by updated_at descending


user 5

  • kmuehlbauer 12
  • christine-e-smit 9
  • dcherian 2
  • spencerkclark 1
  • welcome[bot] 1

author_association 2

  • MEMBER 15
  • NONE 10

issue 1

  • Fill values in time arrays (numpy.datetime64) are lost in zarr · 25 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1532441433 https://github.com/pydata/xarray/issues/7790#issuecomment-1532441433 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bVzNZ kmuehlbauer 5821660 2023-05-03T04:25:50Z 2023-05-03T04:25:50Z MEMBER

@christine-e-smit Great, this works on your side with the patch proposed in #7098.

Nevertheless, we've identified three more issues here in the debugging process which can now be handled one by one. So again, thanks for your contribution here.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1531050846 https://github.com/pydata/xarray/issues/7790#issuecomment-1531050846 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bQfte kmuehlbauer 5821660 2023-05-02T08:04:45Z 2023-05-03T04:20:11Z MEMBER

As in #7098, citing @dcherian:

I think the real solution here is to explicitly handle NaNs during the decoding step. We do want these to be NaT in the output.

There are three more issues revealed here when using datetime64:

  • If _FillValue is set in the encoding, it has to be of the same type/resolution as the times in the array.
  • If _FillValue is provided, we need to provide a dtype and units that fit our data, e.g. if the _FillValue is referenced to the unix epoch, the units should be equivalent.
  • When encoding in the presence of NaT, the data array is converted to floating point with NaN, which is problematic for the subsequent conversion to int64.
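
The first point can be illustrated with a minimal sketch that uses only NumPy (the variable names `fill_days` and `fill_ns` are illustrative and do not come from xarray). The same calendar date maps to very different integers depending on the datetime64 resolution, which is why a day-resolution fill value does not fit data stored as nanoseconds:

```python
import numpy as np

# Without an explicit unit, NumPy infers day resolution from the date string.
fill_days = np.datetime64("1900-01-01")
# With an explicit unit, the same instant is counted in nanoseconds.
fill_ns = np.datetime64("1900-01-01", "ns")

# Both count from the unix epoch (1970-01-01), but in different units.
print(fill_days.dtype, int(fill_days.astype(np.int64)))
print(fill_ns.dtype, int(fill_ns.astype(np.int64)))
```

The two integers printed here are the same two `_FillValue` numbers that appear in the encodings reported elsewhere in this thread.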
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1532236037 https://github.com/pydata/xarray/issues/7790#issuecomment-1532236037 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bVBEF spencerkclark 6628425 2023-05-02T22:28:52Z 2023-05-02T22:28:52Z MEMBER

Thanks for the ping @dcherian -- I just gave #7098 a review. I think it's close to ready to merge.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1532152709 https://github.com/pydata/xarray/issues/7790#issuecomment-1532152709 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bUsuF christine-e-smit 14983768 2023-05-02T21:07:27Z 2023-05-02T21:09:10Z NONE

@kmuehlbauer - genius! Yes. That pull request should fix this issue exactly! And it explains why I see this issue and you don't - with undefined behavior anything can happen. Since we are on different OSes, our systems behave differently.

I just double checked with pandas and this fix will do the right thing:

```python
import numpy as np
import pandas as pd

print(pd.to_timedelta([np.nan, 0], "ns") + np.datetime64('1970-01-01'))
```

```
DatetimeIndex(['NaT', '1970-01-01'], dtype='datetime64[ns]', freq=None)
```

I see that the pull request with the fix has been sitting since December of last year. Is there some way to get someone to look at that pull request who can merge it?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1530991257 https://github.com/pydata/xarray/issues/7790#issuecomment-1530991257 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bQRKZ kmuehlbauer 5821660 2023-05-02T07:09:38Z 2023-05-02T08:14:36Z MEMBER

@christine-e-smit I've created a fresh environment with only xarray and zarr and it still works on my machine. I then followed the Darwin idea and dug up #6191 (I got those casting warnings from exactly the line you were referring to). Comment https://github.com/pydata/xarray/issues/6191#issuecomment-1209567966 should explain what happens here.

tl;dr citing @DocOtak

The short explanation is that the time conversion functions do an astype(np.int64) or equivalent cast on arrays that contain nans. This is undefined behavior and very soon, doing this will start to emit RuntimeWarnings.
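
A minimal sketch of how such a cast can be made well-defined, assuming the values are float nanoseconds since the unix epoch. The helper name `float_ns_to_datetime64` is hypothetical and this is not the code from PR #7098. The idea is to mask the NaNs before the integer cast and restore them as NaT afterwards:

```python
import numpy as np

def float_ns_to_datetime64(values):
    """Hypothetical helper: float nanoseconds since epoch -> datetime64[ns], NaN -> NaT."""
    values = np.asarray(values, dtype=np.float64)
    nan_mask = np.isnan(values)
    # Replace NaN with a harmless placeholder so the int64 cast is well-defined.
    as_int = np.where(nan_mask, 0, values).astype(np.int64)
    result = as_int.astype("datetime64[ns]")
    # Restore the missing entries as NaT instead of the placeholder date.
    result[nan_mask] = np.datetime64("NaT")
    return result

decoded = float_ns_to_datetime64([np.nan, 1.6726176e18])
print(decoded)
```

With this masking the result no longer depends on how a particular platform happens to cast NaN to an integer.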

There is also an open PR #7098.

Thanks @christine-e-smit for sticking with me to find the root-cause here by providing detailed information and code examples. :+1:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1530347592 https://github.com/pydata/xarray/issues/7790#issuecomment-1530347592 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bN0BI christine-e-smit 14983768 2023-05-01T21:43:08Z 2023-05-01T21:43:56Z NONE

Ah hah! Well, I don't know why this is working for you @kmuehlbauer, but I can see why it is not working for me. I've been debugging through the code and it looks like the problem is the _decode_datetime_with_pandas function. For me, it's converting a float NaN into an integer, which results in a zero value.

It all starts in the open_zarr function, which sets the use_cftime parameter to None by default: https://github.com/pydata/xarray/blob/25d9a28e12141b9b5e4a79454eb76ddd2ee2bc4d/xarray/backends/zarr.py#L701-L817

There's a bunch of stuff that gets called, but eventually we get to the function decode_cf_datetime, which ironically (given the name) also takes this use_cftime parameter, which is still None. Because use_cftime is None, the function calls _decode_datetime_with_pandas:

https://github.com/pydata/xarray/blob/25d9a28e12141b9b5e4a79454eb76ddd2ee2bc4d/xarray/coding/times.py#L265-L289

and then, in _decode_datetime_with_pandas, the code casts a float NaN value to zero:

https://github.com/pydata/xarray/blob/979b99831f5d34d33120312a15dad3e6a0830f32/xarray/coding/times.py#L216-L262

In line 254, flat_num_dates is array([ nan, 1.6726176e+18]). After line 254, flat_num_dates_ns_int is array([ 0, 1672617600000000000]).

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1530186148 https://github.com/pydata/xarray/issues/7790#issuecomment-1530186148 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bNMmk christine-e-smit 14983768 2023-05-01T20:25:34Z 2023-05-01T20:25:34Z NONE

@kmuehlbauer - I ran https://github.com/pydata/xarray/issues/7790#issuecomment-1529894939 and I get an incorrect fill value:

```
Created with fill value 1900-01-01
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02

Read back out of the zarr store with xarray
<xarray.DataArray 'time' (time: 2)>
array(['1970-01-01T00:00:00.000000000', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1970-01-01 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -2208988800000000000, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}

Read back out of the zarr store with zarr
<zarr.core.Array '/time' (2,) int64 read-only>
<zarr.attrs.Attributes object at 0x132802a50>
[-2208988800000000000 1672617600000000000]
```

and here is my show_versions, since it may have changed because I've added some new libraries. It looks like my ipython version is slightly different, but I can't see how that would affect things.

```
INSTALLED VERSIONS

commit: None
python: 3.11.3 | packaged by conda-forge | (main, Apr 6 2023, 08:58:31) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.4.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2023.4.2
pandas: 2.0.1
numpy: 1.24.3
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.14.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.7.2
pip: 23.1.2
conda: None
pytest: None
mypy: None
IPython: 8.13.1
sphinx: None
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1530141083 https://github.com/pydata/xarray/issues/7790#issuecomment-1530141083 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bNBmb kmuehlbauer 5821660 2023-05-01T20:01:50Z 2023-05-01T20:01:50Z MEMBER

@christine-e-smit One more idea, you might delete the zarr folder before re-creating (if you are not doing that already). I've removed the complete folder before any new write (by putting eg. !rm -rf xarray_and_units.zarr at the beginning of the notebook-cell).

It would also be great if you could run the code from https://github.com/pydata/xarray/issues/7790#issuecomment-1529894939 and post the output here, just for the sake of comparison (please delete the zarr-folder before if it exists). Thanks!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1530131533 https://github.com/pydata/xarray/issues/7790#issuecomment-1530131533 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bM_RN kmuehlbauer 5821660 2023-05-01T19:53:53Z 2023-05-01T19:53:53Z MEMBER

@christine-e-smit I've plugged your code into a fresh notebook, here is my output:

```python
xarray created with NaT fill value
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02

xarray created read with NaT fill value
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9223372036854775808, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
```

The output seems OK on my side. I've no idea why the data isn't correctly decoded as NaT on your side. I've checked that my environment is comparable to yours. The only difference remaining is you are on Darwin arm64 whereas I'm on Linux.

```
INSTALLED VERSIONS

commit: None
python: 3.11.2 | packaged by conda-forge | (main, Mar 31 2023, 17:51:05) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-144-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.14.0
libnetcdf: None

xarray: 2023.4.2
pandas: 2.0.1
numpy: 1.24.3
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.14.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.3.2
distributed: 2023.3.2
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.3.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.6.1
pip: 23.0.1
conda: None
pytest: 7.2.2
mypy: 0.982
IPython: 8.12.0
sphinx: None
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1530111912 https://github.com/pydata/xarray/issues/7790#issuecomment-1530111912 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bM6eo kmuehlbauer 5821660 2023-05-01T19:30:22Z 2023-05-01T19:30:22Z MEMBER

Unfortunately, I think you may have also gotten some wires crossed? You set the time fill value to 1900-01-01, but then use NaT in the actual array?

Yes, I use NaT because I want to check if the encoder does correctly translate NaT to the provided _FillValue on write.

So from your last example I'm assuming you would like to have the int64 representation of NaT as _FillValue, right? I'll try to adapt this, and see what I get

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1530056660 https://github.com/pydata/xarray/issues/7790#issuecomment-1530056660 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bMs_U christine-e-smit 14983768 2023-05-01T18:37:47Z 2023-05-01T18:39:21Z NONE

Oops! Yes. You are right. I had some cross-wording on the variable names. So I started a new notebook. Unfortunately, I think you may have also gotten some wires crossed? You set the time fill value to 1900-01-01, but then use NaT in the actual array?

Here is a fresh notebook with a stand-alone cell with everything that I think you were doing, but I'm not 100%. The fill value is still wrong when it gets read out, but it is at least different? The fill value is now set to the units for some reason. This seems like progress?

```python
import numpy as np
import xarray as xr
import zarr

# Create a time array with one fill value, NaT
time = np.array([np.datetime64("NaT", "ns"), '2023-01-02 00:00:00.00000000'], dtype='M8[ns]')

# Create xarray with this fill value
xr_time_array = xr.DataArray(data=time, dims=['time'], name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))
print("****")
print("xarray created with NaT fill value")
print("----------------------")
print(xr_ds["time"])

# Save as zarr
location_with_units = "xarray_and_units.zarr"
encoding = {
    "time": {"_FillValue": np.datetime64("NaT", "ns"), "dtype": np.int64, "units": "nanoseconds since 1970-01-01"}
}
xr_ds.to_zarr(location_with_units, mode="w", encoding=encoding)

# Read it back out again
xr_read = xr.open_zarr(location_with_units)
print("****")
print("xarray created read with NaT fill value")
print("----------------------")
print(xr_read["time"])
print(xr_read["time"].attrs)
print(xr_read["time"].encoding)
```

```
xarray created with NaT fill value
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02

xarray created read with NaT fill value
<xarray.DataArray 'time' (time: 2)>
array(['1970-01-01T00:00:00.000000000', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1970-01-01 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9223372036854775808, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1529894939 https://github.com/pydata/xarray/issues/7790#issuecomment-1529894939 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bMFgb kmuehlbauer 5821660 2023-05-01T16:05:19Z 2023-05-01T16:05:19Z MEMBER

So, after some debugging I think I've found two issues here with the current code.

First, we need to give the fillvalue with a fitting resolution. Second, we have an issue with inferring the units from the data (if not given).

Here is some workaround code which (finally, :crossed_fingers:) should at least write and read correct data (added comments below):

```python
import numpy as np
import xarray as xr
import zarr

# Create a numpy array of type np.datetime64 with one fill value and one date

# FIRST ISSUE WITH _FillValue
# we need to provide ns resolution here too, otherwise we get wrong fillvalues (day-reference)
time_fill_value = np.datetime64("1900-01-01 00:00:00.00000000", "ns")
time = np.array([np.datetime64("NaT", "ns"), '2023-01-02 00:00:00.00000000'], dtype='M8[ns]')

# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time, dims=['time'], name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))

print("******")
print("Created with fill value 1900-01-01")
print(xr_ds["time"])

# Save the dataset to zarr
location_new_fill = "from_xarray_new_fill.zarr"

# SECOND ISSUE with inferring units from data
# We need to specify "dtype" and "units" which fit our data
# Note: as we provide a _FillValue with a reference to unix-epoch
# we need to provide fitting units too
encoding = {
    "time": {"_FillValue": time_fill_value, "dtype": np.int64, "units": "nanoseconds since 1970-01-01"}
}
xr_ds.to_zarr(location_new_fill, mode="w", encoding=encoding)

xr_read = xr.open_zarr(location_new_fill)
print("******")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].attrs)
print(xr_read["time"].encoding)

z_new_fill = zarr.open('from_xarray_new_fill.zarr', 'r')
print("******")
print("Read back out of the zarr store with zarr")

print(z_new_fill["time"])
print(z_new_fill["time"].attrs)
print(z_new_fill["time"][:])
```

```python
Created with fill value 1900-01-01
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02

Read back out of the zarr store with xarray
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -2208988800000000000, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}

Read back out of the zarr store with zarr
<zarr.core.Array '/time' (2,) int64 read-only>
<zarr.attrs.Attributes object at 0x7f086ab8e710>
[-2208988800000000000 1672617600000000000]
```

@christine-e-smit Please let me know, if the above workaround gives you correct results in your workflow. If so, then we can think about how to automatically align fillvalue-resolution with data-resolution and what needs to be done to correctly deduce the units.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1529076482 https://github.com/pydata/xarray/issues/7790#issuecomment-1529076482 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bI9sC kmuehlbauer 5821660 2023-04-30T16:52:25Z 2023-04-30T16:52:25Z MEMBER

```python
xr_ds.to_zarr(location_new_fill,encoding=encoding)

xr_read = xr.open_zarr(location)
print("******")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].encoding)
```

@christine-e-smit Is this just a remnant of copy&paste? The above code writes to location_new_fill, but reads from location.

Here is my code and output for comparison (using latest zarr/xarray):

```python
import numpy as np
import xarray as xr

# Create a numpy array of type np.datetime64 with one fill value and one date
time_fill_value = np.datetime64("1900-01-01")
time = np.array([np.datetime64("NaT"), '2023-01-02'], dtype='M8[ns]')

# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time, dims=['time'], name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))

print("******")
print("Created with fill value 1900-01-01")
print(xr_ds["time"])

# Save the dataset to zarr
location_new_fill = "from_xarray_new_fill.zarr"
encoding = {
    "time": {"_FillValue": time_fill_value, "dtype": np.int64}
}
xr_ds.to_zarr(location_new_fill, encoding=encoding)

xr_read = xr.open_zarr(location_new_fill)
print("******")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].encoding)
```

```python
Created with fill value 1900-01-01
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02

Read back out of the zarr store with xarray
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -25567, 'units': 'days since 2023-01-02 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
```

This doesn't look correct either. Either the decoded _FillValue or the units are wrong. -25567 is 1900-01-01 when referenced to the unix epoch in days (Question: Is zarr time based on unix epoch?). When read back via zarr only, with the units from the encoding above, this would decode into:

```python
<xarray.DataArray 'time' (time: 2)>
array(['1953-01-02T00:00:00.000000000', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
```
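
The arithmetic behind these two readings can be checked with a short sketch in plain Python (standard library only; this is a simple day count, not xarray's decoding code, and the variable names are illustrative). It interprets the stored integer once against the unix epoch and once against the inferred reference date:

```python
from datetime import date, timedelta

stored_fill_value = -25567  # integer reported as _FillValue in the encoding

# Read as "days since 1970-01-01", i.e. a day-resolution datetime64 value.
as_epoch_days = date(1970, 1, 1) + timedelta(days=stored_fill_value)

# Read with the inferred units "days since 2023-01-02".
as_inferred_units = date(2023, 1, 2) + timedelta(days=stored_fill_value)

print(as_epoch_days)      # 1900-01-01
print(as_inferred_units)  # 1953-01-02
```

The same integer therefore only means 1900-01-01 when the units on disk share the reference date that was used to build the fill value.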

I totally agree with @christine-e-smit, this is all very confusing. As said at the beginning, I have little knowledge of zarr. I'm currently digging into cf encoding/decoding which made me jump on here.

AFAICT, the encoding already has a problem: the data on disk is already not what we expect. It seems that the xarray cf_encoding/decoding is somehow not well aligned with the zarr writing/reading of datetimes.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1528072972 https://github.com/pydata/xarray/issues/7790#issuecomment-1528072972 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bFIsM dcherian 2448579 2023-04-28T20:43:44Z 2023-04-28T20:43:44Z MEMBER

https://github.com/pydata/xarray/blob/25d9a28e12141b9b5e4a79454eb76ddd2ee2bc4d/xarray/coding/times.py#L717-L735

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1527948787 https://github.com/pydata/xarray/issues/7790#issuecomment-1527948787 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bEqXz christine-e-smit 14983768 2023-04-28T18:39:01Z 2023-04-28T18:39:01Z NONE

Where in the code is the time array being decoded? That seems to be where a lot of the issue is?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1527918654 https://github.com/pydata/xarray/issues/7790#issuecomment-1527918654 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bEjA- christine-e-smit 14983768 2023-04-28T18:08:16Z 2023-04-28T18:08:16Z NONE

The zarr store does indeed use an integer in this case according to the .zmetadata file:

```json
{
    "metadata": {
        ".zattrs": {},
        ".zgroup": {
            "zarr_format": 2
        },
        "time/.zarray": {
            "chunks": [
                2
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<i8",
            "fill_value": -25567,
            "filters": null,
            "order": "C",
            "shape": [
                2
            ],
            "zarr_format": 2
        },
        "time/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "time"
            ],
            "calendar": "proleptic_gregorian",
            "units": "days since 1900-01-01 00:00:00"
        }
    },
    "zarr_consolidated_format": 1
}
```

Once again the values in the zarr store are correct given the units, but xarray misreads the fill value for some reason:

```python
z_new_fill = zarr.open('from_xarray_new_fill.zarr','r')
z_new_fill["time"][:]
```

```
array([ 0, 44926])
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1527917772 https://github.com/pydata/xarray/issues/7790#issuecomment-1527917772 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bEizM christine-e-smit 14983768 2023-04-28T18:07:40Z 2023-04-28T18:07:40Z NONE

@kmuehlbauer - I think I'm not understanding what you are suggesting because the zarr store is still not being read correctly when I switch the fill value to a different date:

```python
# Create a numpy array of type np.datetime64 with one fill value and one date
time_fill_value = np.datetime64("1900-01-01")
time = np.array([time_fill_value,'2023-01-02'],dtype='M8[ns]')

# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time,dims=['time'],name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))

print("******")
print("Created with fill value 1900-01-01")
print(xr_ds["time"])

# Save the dataset to zarr
location_new_fill = "from_xarray_new_fill.zarr"
encoding = {
    "time":{"_FillValue":time_fill_value,"dtype":np.int64}
}
xr_ds.to_zarr(location_new_fill,encoding=encoding)

xr_read = xr.open_zarr(location)
print("******")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].encoding)
```

```
Created with fill value 1900-01-01
<xarray.DataArray 'time' (time: 2)>
array(['1900-01-01T00:00:00.000000000', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1900-01-01 2023-01-02

<xarray.DataArray 'time' (time: 2)>
array(['2023-01-02T00:00:00.000000000', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2023-01-02 2023-01-02
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9.223372036854776e+18, 'units': 'days since 2023-01-02 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('float64')}
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1527050493 https://github.com/pydata/xarray/issues/7790#issuecomment-1527050493 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bBPD9 kmuehlbauer 5821660 2023-04-28T06:21:38Z 2023-04-28T06:21:38Z MEMBER

Thanks @dcherian for filling in the details.

I've dug up some more related issues: #2265, #3942, #4045

IIUC, #4684 did a great job of ironing out many of these issues, but apparently only for the case where there is no NaT in the time array (cc @spencerkclark). @christine-e-smit If you have no NaT in your time array, you can omit the encoding completely: Xarray will use int64 by default and your data should be fine on disk.

In the presence of NaT it looks like one workaround to circumvent that issue for the time being is to add the dtype in addition to _FillValue when writing out to zarr :

```python
encoding = {
    "time": {"_FillValue": time_fill_value, "dtype": np.int64}
}
xr_ds.to_zarr(location, encoding=encoding)
```

One note to this: Xarray is deducing the units from the current time data. So for the above example it will result in 'days since 2023-01-02 00:00:00' where days would now be the resolution in the file. If you want the resolution to be nanoseconds on disk units would need to be added to the encoding.

```python
encoding = {
    "time": {"_FillValue": time_fill_value, "dtype": np.int64, 'units': 'nanoseconds since 2023-01-02'}
}
xr_ds.to_zarr(location, encoding=encoding)
```

@christine-e-smit It would be great if you could confirm that from your side (some sanity check needed on my side).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1526224630 https://github.com/pydata/xarray/issues/7790#issuecomment-1526224630 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85a-Fb2 dcherian 2448579 2023-04-27T19:18:12Z 2023-04-27T19:18:12Z MEMBER

I think the issue is that we're always running "CF encoding" which is more appropriate for netCDF4 than Zarr, since Zarr supports datetime64 natively. And currently there's no way to control whether the datetime encoder is applied or not, we just look at the dtype: https://github.com/pydata/xarray/blob/0f4e99d036b0d6d76a3271e6191eacbc9922662f/xarray/coding/times.py#L697-L704

I think the right way to fix this is to allow the user to run the encode and write steps separately, with the encoding steps being controllable: https://github.com/pydata/xarray/issues/4412

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1525790614 https://github.com/pydata/xarray/issues/7790#issuecomment-1525790614 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85a8beW kmuehlbauer 5821660 2023-04-27T14:23:16Z 2023-04-27T14:23:16Z MEMBER

@christine-e-smit I see, thanks for the details. AFAICT from the code it looks like zarr is special-cased in some ways compared to other backends. I'd really rely on some zarr-expert shedding light here and over at #7776.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1525774670 https://github.com/pydata/xarray/issues/7790#issuecomment-1525774670 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85a8XlO christine-e-smit 14983768 2023-04-27T14:13:58Z 2023-04-27T14:13:58Z NONE

Interestingly, xarray is also perfectly happy to read a numpy.datetime64 array out of a zarr store as long as the xarray metadata is present. xarray even helpfully creates a "_FillValue" attribute for the array so there is no confusion:

```python
# Create a zarr store directly with numpy.datetime64 type
location_zarr_direct = "from_zarr.zarr"
root = zarr.open(location_zarr_direct,mode='w')
z_time_array = root.create_dataset(
    "time",data=time,shape=time.shape,chunks=time.shape,dtype=time.dtype,
    fill_value=time_fill_value
)

# Add xarray metadata
z_time_array.attrs["_ARRAY_DIMENSIONS"] = ["time"]
zarr.convenience.consolidate_metadata(location_zarr_direct)

# Use xarray to read this data out
xr_read_from_zarr = xr.open_zarr(location_zarr_direct)
print(xr_read_from_zarr["time"])
```

```
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
Attributes:
    _FillValue:  NaT
```

So I am extremely confused as to why xarray encodes time arrays so strangely when it creates the zarr store itself! (Hence https://github.com/pydata/xarray/discussions/7776)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1525766244 https://github.com/pydata/xarray/issues/7790#issuecomment-1525766244 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85a8Vhk christine-e-smit 14983768 2023-04-27T14:08:37Z 2023-04-27T14:08:37Z NONE

Ah! Okay. I did not know about the .encoding option, which does indeed have the fill value. Thank you.

Interestingly, -9.223372036854776e+18 is just the float equivalent of numpy.datetime64('NaT'):

```python
float(np.datetime64('NaT').view('i8'))
# -9.223372036854776e+18
```

And I know this isn't an issue with zarr and NaT because I can create the zarr store directly with the zarr library and it's perfectly happy:

```python
# Create a zarr store directly with numpy.datetime64 type
location_zarr_direct = "from_zarr.zarr"
root = zarr.open(location_zarr_direct, mode='w')
z_time_array = root.create_dataset(
    "time", data=time, shape=time.shape, chunks=time.shape, dtype=time.dtype,
    fill_value=time_fill_value
)
zarr.convenience.consolidate_metadata(location_zarr_direct)

# Read it back out again
read_zarr = zarr.open(location_zarr_direct, mode='r')
print(read_zarr["time"][:])
[                          'NaT' '2023-01-02T00:00:00.000000000']
```
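The float value noted above can be traced back to how NumPy stores NaT. The following is a minimal sketch using only NumPy; the variable names are illustrative and do not come from the issue. It checks that NaT is the smallest signed 64-bit integer and that converting that integer to a float gives the value seen in the encoding.

```python
import numpy as np

# NaT at nanosecond resolution, as in the arrays discussed above.
nat = np.datetime64("NaT", "ns")

# Reinterpret the 64-bit payload as a signed integer.
nat_as_int = int(nat.view("i8"))
int64_min = int(np.iinfo(np.int64).min)

# Converting that integer to a Python float (float64).
nat_as_float = float(nat_as_int)

print(nat_as_int == int64_min)  # NaT is stored as the int64 minimum
print(nat_as_float)
```

Because -2**63 is exactly representable as a float64, the conversion loses no information: `int(nat_as_float)` gives back the same integer.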

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1525524428 https://github.com/pydata/xarray/issues/7790#issuecomment-1525524428 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85a7afM kmuehlbauer 5821660 2023-04-27T11:26:15Z 2023-04-27T11:26:15Z MEMBER

Xref: discussion #7776, which has received no attention so far.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1525513525 https://github.com/pydata/xarray/issues/7790#issuecomment-1525513525 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85a7X01 kmuehlbauer 5821660 2023-04-27T11:19:24Z 2023-04-27T11:19:24Z MEMBER

@christine-e-smit

So, I'm no zarr expert, but it turns out that your NaT was converted to -9.223372036854776e+18 in the encoding step. I assume zarr converts NaT because the format doesn't allow NaT to be stored directly, so it chooses a (default) value.

The _FillValue is not lost; it is preserved in the .encoding dict of the underlying Variable:

```python
xr_read = xr.open_zarr(location)
print("******************")
print("No fill value")
print(xr_read["time"])
print(xr_read["time"].encoding)
```

```python
******************
No fill value
<xarray.DataArray 'time' (time: 2)>
array([                          'NaT', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9.223372036854776e+18, 'units': 'days since 2023-01-02 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('float64')}
```

You might also check this without decoding (decode_cf=False):

```python
with xr.open_zarr(location, decode_cf=False) as xr_read:
    print("******************")
    print("No fill value")
    print(xr_read["time"])
    print(xr_read["time"].encoding)
```

```python
******************
No fill value
<xarray.DataArray 'time' (time: 2)>
array([-9.223372e+18,  0.000000e+00])
Coordinates:
  * time     (time) float64 -9.223e+18 0.0
Attributes:
    calendar:    proleptic_gregorian
    units:       days since 2023-01-02 00:00:00
    _FillValue:  -9.223372036854776e+18
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('float64')}
```
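To make the undecoded values above easier to read, here is a rough NumPy-only sketch of a CF-style round trip: times become float days since a reference date and NaT becomes a float fill value. The helpers `encode_days` and `decode_days` are hypothetical and are not xarray's implementation; the epoch and fill value are simply copied from the output shown above.

```python
import numpy as np

# Assumptions taken from the encoding printed above, not from xarray internals.
FILL = float(np.iinfo(np.int64).min)  # float form of NaT's int64 payload
EPOCH = np.datetime64("2023-01-02T00:00:00", "ns")
NS_PER_DAY = 86_400 * 10**9


def encode_days(times, epoch=EPOCH, fill=FILL):
    """Encode datetime64[ns] values as float days since `epoch`; NaT -> `fill`."""
    times = np.asarray(times, dtype="datetime64[ns]")
    mask = np.isnat(times)
    # Swap NaT for the epoch first so the subtraction is well defined.
    safe = np.where(mask, epoch, times)
    days = (safe - epoch).astype("i8") / NS_PER_DAY
    return np.where(mask, fill, days)


def decode_days(values, epoch=EPOCH, fill=FILL):
    """Decode float days since `epoch` to datetime64[ns]; `fill` -> NaT."""
    values = np.asarray(values, dtype="f8")
    mask = values == fill
    ns = np.round(np.where(mask, 0.0, values) * NS_PER_DAY).astype("i8")
    out = epoch + ns.astype("timedelta64[ns]")
    out[mask] = np.datetime64("NaT")
    return out


time = np.array(["NaT", "2023-01-02", "2023-01-03T12:00"], dtype="datetime64[ns]")
encoded = encode_days(time)
decoded = decode_days(encoded)
print(encoded)
print(decoded)
```

The point of the sketch is only that the fill value has to be handled explicitly in both directions; if the decode step does not mask it, the float is treated as an ordinary number of days.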

Maybe a zarr expert can chime in here on the best practice for time fill values.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1524099019 https://github.com/pydata/xarray/issues/7790#issuecomment-1524099019 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85a1-fL welcome[bot] 30606887 2023-04-26T22:03:08Z 2023-04-26T22:03:08Z NONE

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 18.324ms · About: xarray-datasette