
issue_comments


457 rows where author_association = "MEMBER" and user = 5821660 sorted by updated_at descending


issue >30

  • ENH: use `dask.array.apply_gufunc` in `xr.apply_ufunc` 37
  • Fill missing data_vars during concat by reindexing 18
  • cf-coding 18
  • FIX: correct dask array handling in _calc_idxminmax 17
  • concat changes variable order 12
  • Fill values in time arrays (numpy.datetime64) are lost in zarr 12
  • Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 10
  • Fix module name retrieval in `backend.plugins.remove_duplicates()`, plugin tests 9
  • Add defaults during concat 508 8
  • Preserve nanosecond resolution when encoding/decoding times 8
  • ENH: enable `H5NetCDFStore` to work with already open h5netcdf.File a… 7
  • Times not decoding due to time_reference being in another variable 7
  • Saving and loading an array of strings changes datatype to object 7
  • CF encoding should preserve vlen dtype for empty arrays 7
  • Set `allow_rechunk=True` in `apply_ufunc` 6
  • fix compatibility with h5py version 3 and unpin tests 6
  • h5netcdf fails to decode attribute coordinates. 6
  • FIX: h5py>=3 string decoding 6
  • xr.open_dataset() reading ubyte variables as float32 from DAP server 6
  • Loading datasets of numpy string arrays leads to error and/or segfault 5
  • Backend / plugin system `remove_duplicates` raises AttributeError on discovering duplicates 5
  • nightly failure with h5netcdf indexing 5
  • expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 5
  • Y-axis flipped when reading data with Xarray 5
  • `nan` values appearing when saving and loading from `netCDF` due to encoding 5
  • Fix as_compatible_data for read-only np.ma.MaskedArray 5
  • `open_dataset` with `chunks="auto"` fails when a netCDF4 variables/coordinates is encoded as `NC_STRING` 5
  • broken output of `find_root_and_group` for h5netcdf 4
  • Keep the original ordering of the coordinates 4
  • Alternative way to deal scale_factor and add_offset for opening datasets. 4
  • …

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1578775636 https://github.com/pydata/xarray/pull/7862#issuecomment-1578775636 https://api.github.com/repos/pydata/xarray/issues/7862 IC_kwDOAMm_X85eGjRU kmuehlbauer 5821660 2023-06-06T13:30:15Z 2023-06-06T13:30:15Z MEMBER

Might be worth an issue over at numpy with the example from the test.

numpy/numpy#23886

The issue is already resolved over at numpy, which is really great! It was also marked for backport. @headtr1ck How are these issues resolved currently, or how do we track removing the ignore?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  CF encoding should preserve vlen dtype for empty arrays 1720045908
1578248748 https://github.com/pydata/xarray/pull/7862#issuecomment-1578248748 https://api.github.com/repos/pydata/xarray/issues/7862 IC_kwDOAMm_X85eEios kmuehlbauer 5821660 2023-06-06T09:04:39Z 2023-06-06T09:04:39Z MEMBER

Might be worth an issue over at numpy with the example from the test.

https://github.com/numpy/numpy/issues/23886

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  CF encoding should preserve vlen dtype for empty arrays 1720045908
1576080083 https://github.com/pydata/xarray/issues/7866#issuecomment-1576080083 https://api.github.com/repos/pydata/xarray/issues/7866 IC_kwDOAMm_X85d8RLT kmuehlbauer 5821660 2023-06-05T05:45:30Z 2023-06-05T05:45:30Z MEMBER

@vrishk Sorry for the delay here and thanks for bringing this to our attention. We now have at least two requests which might move this forward (moving ensure_dtype_not_object into the backends). But this would need some discussion first on how to do it.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Enable object_codec in zarr backend 1720924071
1576074048 https://github.com/pydata/xarray/issues/7892#issuecomment-1576074048 https://api.github.com/repos/pydata/xarray/issues/7892 IC_kwDOAMm_X85d8PtA kmuehlbauer 5821660 2023-06-05T05:37:32Z 2023-06-05T05:37:32Z MEMBER

@mktippett Thanks for raising this. The issue should be cleared after #7888 is merged.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  GRIB Data Example is broken 1740685974
1572021301 https://github.com/pydata/xarray/pull/7862#issuecomment-1572021301 https://api.github.com/repos/pydata/xarray/issues/7862 IC_kwDOAMm_X85dsyQ1 kmuehlbauer 5821660 2023-06-01T13:06:32Z 2023-06-01T13:06:32Z MEMBER

@tomwhite I've added tests to check the backend code for vlen string dtype metadata. I also had to add a specific check for the h5py vlen string metadata. I think we've covered everything for the proposed change to allow empty vlen string dtype metadata.

I'm looking at the mypy error and do not have the slightest clue what and where to change. Any help appreciated.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  CF encoding should preserve vlen dtype for empty arrays 1720045908
1561584592 https://github.com/pydata/xarray/issues/7868#issuecomment-1561584592 https://api.github.com/repos/pydata/xarray/issues/7868 IC_kwDOAMm_X85dE-PQ kmuehlbauer 5821660 2023-05-24T16:50:34Z 2023-05-24T16:50:34Z MEMBER

Thanks @ghiggi for your comment.

The problem is we have at least two contradicting user requests here, see #7328 and #7862.

I'm sure there is a solution to accommodate both sides.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `open_dataset` with `chunks="auto"` fails when a netCDF4 variables/coordinates is encoded as `NC_STRING` 1722417436
1561285499 https://github.com/pydata/xarray/pull/7862#issuecomment-1561285499 https://api.github.com/repos/pydata/xarray/issues/7862 IC_kwDOAMm_X85dD1N7 kmuehlbauer 5821660 2023-05-24T14:37:58Z 2023-05-24T14:37:58Z MEMBER

Thanks for trying. I can't think of any downsides for the netcdf4-fix, as it just adds the needed metadata to the object-dtype. But you never know, so it would be good to get another set of eyes on it.

So it looks like the changes here with the fix in my branch will get your issue resolved @tomwhite, right?

I'm a bit worried that this might break other users' workflows if they depend on the current conversion to floating point for some reason. Other backends might also rely on this behaviour, especially since it has been there since the early days when xarray was known as xray.

@dcherian What would be the way to go here?

There is also a somehow contradicting issue in #7868.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  CF encoding should preserve vlen dtype for empty arrays 1720045908
1561214028 https://github.com/pydata/xarray/issues/7868#issuecomment-1561214028 https://api.github.com/repos/pydata/xarray/issues/7868 IC_kwDOAMm_X85dDjxM kmuehlbauer 5821660 2023-05-24T13:58:16Z 2023-05-24T13:58:16Z MEMBER

My main question here is: why is dask not trying to retrieve the object types from dtype.metadata? Or does it try and fail for some reason?
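For reference, a minimal sketch of the metadata in question (using xarray's internal create_vlen_dtype helper, as in the related PR examples; illustrative only):

```python
# Minimal sketch: the vlen string information lives on the numpy dtype itself,
# so in principle it is retrievable from dtype.metadata.
import xarray as xr

dt = xr.coding.strings.create_vlen_dtype(str)
print(dt)           # object
print(dt.metadata)  # {'element_type': <class 'str'>}
```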

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `open_dataset` with `chunks="auto"` fails when a netCDF4 variables/coordinates is encoded as `NC_STRING` 1722417436
1561195832 https://github.com/pydata/xarray/pull/7862#issuecomment-1561195832 https://api.github.com/repos/pydata/xarray/issues/7862 IC_kwDOAMm_X85dDfU4 kmuehlbauer 5821660 2023-05-24T13:52:04Z 2023-05-24T13:52:04Z MEMBER

@tomwhite I've put a commit with changes to zarr/netcdf4-backends which should preserve the dtype metadata here: https://github.com/kmuehlbauer/xarray/tree/preserve-vlen-string-dtype.

I'm not really sure if that is the right location, but as it was already present at that location in the netcdf4 backend, I think it will do.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  CF encoding should preserve vlen dtype for empty arrays 1720045908
1561162311 https://github.com/pydata/xarray/pull/7862#issuecomment-1561162311 https://api.github.com/repos/pydata/xarray/issues/7862 IC_kwDOAMm_X85dDXJH kmuehlbauer 5821660 2023-05-24T13:32:26Z 2023-05-24T13:32:57Z MEMBER

@tomwhite Special casing on netcdf4 backend should be possible, too.

But it might need fixing at zarr backend, too:

```python
ds = xr.Dataset({"a": np.array([], dtype=xr.coding.strings.create_vlen_dtype(str))})
print(f"dtype: {ds['a'].dtype}")
print(f"metadata: {ds['a'].dtype.metadata}")
ds.to_zarr("a.zarr")
print("\n### Loading ###")
with xr.open_dataset("a.zarr", engine="zarr") as ds:
    print(f"dtype: {ds['a'].dtype}")
    print(f"metadata: {ds['a'].dtype.metadata}")
```

```python
dtype: object
metadata: {'element_type': <class 'str'>}

### Loading ###

dtype: object
metadata: None
```

Could you verify the above example, please? I'm relatively new to zarr :grimacing:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  CF encoding should preserve vlen dtype for empty arrays 1720045908
1560674198 https://github.com/pydata/xarray/issues/7868#issuecomment-1560674198 https://api.github.com/repos/pydata/xarray/issues/7868 IC_kwDOAMm_X85dBf-W kmuehlbauer 5821660 2023-05-24T08:27:11Z 2023-05-24T08:27:11Z MEMBER

@ghiggi Glad it works, but we still have to check if that is the correct location for the fix, as it's not CF specific.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `open_dataset` with `chunks="auto"` fails when a netCDF4 variables/coordinates is encoded as `NC_STRING` 1722417436
1560559426 https://github.com/pydata/xarray/pull/7862#issuecomment-1560559426 https://api.github.com/repos/pydata/xarray/issues/7862 IC_kwDOAMm_X85dBD9C kmuehlbauer 5821660 2023-05-24T07:01:44Z 2023-05-24T07:01:44Z MEMBER

Thanks @tomwhite for the PR. I've only quickly checked the approach, which looks reasonable. But those changes have implications for several places in the backend code, which we would have to sort out.

Considering this example:

```python
import numpy as np
import xarray as xr

print(f"creating dataset with empty string array")
print("-----------------------------------------")
dtype = xr.coding.strings.create_vlen_dtype(str)
ds = xr.Dataset({"a": np.array([], dtype=dtype)})
print(f"dtype: {ds['a'].dtype}")
print(f"metadata: {ds['a'].dtype.metadata}")
ds.to_netcdf("a.nc", engine="netcdf4")

print("\nncdump")
print("-------")
!ncdump a.nc

engines = ["netcdf4", "h5netcdf"]
for engine in engines:
    with xr.open_dataset("a.nc", engine=engine) as ds:
        print(f"\nloading with {engine}")
        print("-------------------")
        print(f"dtype: {ds['a'].dtype}")
        print(f"metadata: {ds['a'].dtype.metadata}")
```

```python
creating dataset with empty string array

dtype: object
metadata: {'element_type': <class 'str'>}

ncdump

netcdf a {
dimensions:
    a = UNLIMITED ; // (0 currently)
variables:
    string a(a) ;
data:
}

loading with netcdf4

dtype: object
metadata: None

loading with h5netcdf

dtype: object
metadata: {'vlen': <class 'str'>}
```

The netcdf4 engine does not roundtrip here, losing the dtype metadata information. There is special casing for the h5netcdf backend, though.

The source is actually located in open_store_variable of the netcdf4 backend, where the underlying data is converted to Variable (which does some object dtype twiddling).

Unfortunately I do not have an immediate solution here.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  CF encoding should preserve vlen dtype for empty arrays 1720045908
1560534067 https://github.com/pydata/xarray/issues/7328#issuecomment-1560534067 https://api.github.com/repos/pydata/xarray/issues/7328 IC_kwDOAMm_X85dA9wz kmuehlbauer 5821660 2023-05-24T06:37:39Z 2023-05-24T06:37:39Z MEMBER

@tomwhite Sorry for the delay here. I'll respond shortly on your PR #7862, but we might have to reiterate here later.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr store array dtype changes for empty object string 1466586967
1559959581 https://github.com/pydata/xarray/issues/7868#issuecomment-1559959581 https://api.github.com/repos/pydata/xarray/issues/7868 IC_kwDOAMm_X85c-xgd kmuehlbauer 5821660 2023-05-23T18:42:55Z 2023-05-23T19:01:00Z MEMBER

@ghiggi Thanks for getting this back into action. I got dragged away from the one string object issue in #7654. I'll split this out and add a PR.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `open_dataset` with `chunks="auto"` fails when a netCDF4 variables/coordinates is encoded as `NC_STRING` 1722417436
1559973194 https://github.com/pydata/xarray/issues/7868#issuecomment-1559973194 https://api.github.com/repos/pydata/xarray/issues/7868 IC_kwDOAMm_X85c-01K kmuehlbauer 5821660 2023-05-23T18:55:46Z 2023-05-23T18:55:46Z MEMBER

@ghiggi I'd appreciate it if you could test your workflows against #7869. Your example and the one over in #7652 are working AFAICT.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `open_dataset` with `chunks="auto"` fails when a netCDF4 variables/coordinates is encoded as `NC_STRING` 1722417436
1556891860 https://github.com/pydata/xarray/pull/7827#issuecomment-1556891860 https://api.github.com/repos/pydata/xarray/issues/7827 IC_kwDOAMm_X85czEjU kmuehlbauer 5821660 2023-05-22T09:40:04Z 2023-05-22T09:40:04Z MEMBER

The example below is only based on Variable and the cf encode/decode variable functions.

```python
import xarray as xr
import numpy as np

# create DataArray
times = [np.datetime64("2000-01-01", "ns"), np.datetime64("NaT")]
da = xr.DataArray(times, dims=["time"], name="foo")
da.encoding["dtype"] = np.float64
da.encoding["_FillValue"] = 20.0

# extract Variable
source_var = da.variable
print("---------- source_var ------------------")
print(source_var)
print(source_var.encoding)

# encode Variable
encoded_var = xr.conventions.encode_cf_variable(source_var)
print("\n---------- encoded_var ------------------")
print(encoded_var)

# decode Variable
decoded_var = xr.conventions.decode_cf_variable("foo", encoded_var)
print("\n---------- decoded_var ------------------")
print(decoded_var.load())
```

```python
/home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/coding/times.py:618: RuntimeWarning: invalid value encountered in cast
  int_num = np.asarray(num, dtype=np.int64)
/home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/coding/times.py:254: RuntimeWarning: invalid value encountered in cast
  flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
/home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/coding/times.py:254: RuntimeWarning: invalid value encountered in cast
  flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(

---------- source_var ------------------
<xarray.Variable (time: 2)>
array(['2000-01-01T00:00:00.000000000', 'NaT'], dtype='datetime64[ns]')
{'dtype': <class 'numpy.float64'>, '_FillValue': 20.0}
dtype num float64

---------- encoded_var ------------------
<xarray.Variable (time: 2)>
array([ 0., 20.])
Attributes:
    units:       days since 2000-01-01 00:00:00
    calendar:    proleptic_gregorian
    _FillValue:  20.0

---------- decoded_var ------------------
<xarray.Variable (time: 2)>
array(['2000-01-01T00:00:00.000000000', 'NaT'], dtype='datetime64[ns]')
{'_FillValue': 20.0, 'units': 'days since 2000-01-01 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('float64')}
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preserve nanosecond resolution when encoding/decoding times 1700227455
1556869361 https://github.com/pydata/xarray/pull/7827#issuecomment-1556869361 https://api.github.com/repos/pydata/xarray/issues/7827 IC_kwDOAMm_X85cy_Dx kmuehlbauer 5821660 2023-05-22T09:24:47Z 2023-05-22T09:24:47Z MEMBER

@spencerkclark With current master I get the following RuntimeWarning running your code example:

  • on encoding (calling to_netcdf()):

```python
/home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/coding/times.py:618: RuntimeWarning: invalid value encountered in cast
  int_num = np.asarray(num, dtype=np.int64)
```

  • on decoding (calling open_dataset()):

```python
/home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/coding/times.py:254: RuntimeWarning: invalid value encountered in cast
  flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
/home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/coding/times.py:254: RuntimeWarning: invalid value encountered in cast
  flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
```

The latter was discussed in #7098 (casting float64 to int64), the former was aimed to be resolved with this PR.

I'll try to create a test case using Variable and the respective encoding/decoding functions without involving IO (per your suggestion @spencerkclark).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preserve nanosecond resolution when encoding/decoding times 1700227455
1554532844 https://github.com/pydata/xarray/pull/7827#issuecomment-1554532844 https://api.github.com/repos/pydata/xarray/issues/7827 IC_kwDOAMm_X85cqEns kmuehlbauer 5821660 2023-05-19T12:57:31Z 2023-05-19T12:57:31Z MEMBER

Thanks @spencerkclark for taking the time. NaN has been written to disk (as you assumed). Let's have another try next week.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preserve nanosecond resolution when encoding/decoding times 1700227455
1545446155 https://github.com/pydata/xarray/pull/7788#issuecomment-1545446155 https://api.github.com/repos/pydata/xarray/issues/7788 IC_kwDOAMm_X85cHaML kmuehlbauer 5821660 2023-05-12T09:23:13Z 2023-05-12T09:23:13Z MEMBER

@maxhollmann I'm sorry, I'm still finding my way into Xarray. I've taken a closer look at #2377, especially https://github.com/pydata/xarray/issues/2377#issuecomment-415074188.

There @shoyer suggested to just use:

```python
data = duck_array_ops.where_method(data, ~mask, fill_value)
```

instead of

```python
data[mask] = fill_value
```

I've checked, and it works nicely with your test. That way we would get away without the flags test, and the special handling would take place in duck_array_ops. It would be great if someone could double-check.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fix as_compatible_data for read-only np.ma.MaskedArray 1685422501
1545408039 https://github.com/pydata/xarray/issues/4220#issuecomment-1545408039 https://api.github.com/repos/pydata/xarray/issues/4220 IC_kwDOAMm_X85cHQ4n kmuehlbauer 5821660 2023-05-12T08:55:09Z 2023-05-12T08:55:09Z MEMBER

combine_first uses fillna under the hood -> #3570

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  combine_first of Datasets changes dtype of variable present only in one Dataset 656089264
1545346823 https://github.com/pydata/xarray/issues/5706#issuecomment-1545346823 https://api.github.com/repos/pydata/xarray/issues/5706 IC_kwDOAMm_X85cHB8H kmuehlbauer 5821660 2023-05-12T08:06:06Z 2023-05-12T08:06:06Z MEMBER

This is resolved in recent netcdf-c/netcdf4-python and works with recent Xarray.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Loading datasets of numpy string arrays leads to error and/or segfault 970619131
1545337724 https://github.com/pydata/xarray/pull/7788#issuecomment-1545337724 https://api.github.com/repos/pydata/xarray/issues/7788 IC_kwDOAMm_X85cG_t8 kmuehlbauer 5821660 2023-05-12T07:59:19Z 2023-05-12T07:59:19Z MEMBER

@maxhollmann We might get at least some more views on this. There have been discussions on handling masked arrays and we should make sure this is exactly the solution we want to have.

@dcherian This changes as_compatible_data. Could you please have another look here? I'm a bit unclear about the implications.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fix as_compatible_data for read-only np.ma.MaskedArray 1685422501
1543526954 https://github.com/pydata/xarray/pull/7834#issuecomment-1543526954 https://api.github.com/repos/pydata/xarray/issues/7834 IC_kwDOAMm_X85cAFoq kmuehlbauer 5821660 2023-05-11T08:03:01Z 2023-05-11T08:03:01Z MEMBER

@mx-moth Yes, this casting should be fixed.

I'm adding a bit of context here, as this might need to be solved in combination with #7098 and #7827. #7098 removes undefined casting for decoding. In #7827 there are efforts to do this for encoding, too.

As cast_to_int_if_safe is called for encoding as well as decoding, I'm not sure if all cases have been caught by these two PRs.

One issue on decoding is that, at least for datetime64-based times, the calculated time_deltas are currently converted to float64 in the presence of NaT (although NaT can perfectly well be expressed as int64). It would be great if you could try your PR on top of #7827 (which includes #7098) to see if that fixes the errors in this PR.
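As a side note, a hedged illustration (not from either PR) of the remark that NaT has an exact int64 representation:

```python
# Hedged illustration: NaT is representable exactly as int64 (numpy's iNaT sentinel),
# so a round-trip through float64/NaN is not strictly required for missing times.
import numpy as np

times = np.array(["2000-01-01", "NaT"], dtype="datetime64[ns]")
deltas = times - np.datetime64("1970-01-01", "ns")
print(deltas.view("int64"))  # [ 946684800000000000 -9223372036854775808]
```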

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Use `numpy.can_cast` instead of casting and checking 1705163672
1543285629 https://github.com/pydata/xarray/issues/7833#issuecomment-1543285629 https://api.github.com/repos/pydata/xarray/issues/7833 IC_kwDOAMm_X85b_Kt9 kmuehlbauer 5821660 2023-05-11T03:39:29Z 2023-05-11T03:39:29Z MEMBER

@alimanfoo The slow code stems from my changes in #7400. Obviously the performance drop did not manifest in the tests/benchmarks.

In #7824 @Illviljan is tackling concat performance.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Slow performance of concat() 1704950804
1542767369 https://github.com/pydata/xarray/pull/7827#issuecomment-1542767369 https://api.github.com/repos/pydata/xarray/issues/7827 IC_kwDOAMm_X85b9MMJ kmuehlbauer 5821660 2023-05-10T20:27:08Z 2023-05-10T20:27:08Z MEMBER

@dcherian You were right from the beginning: changing the order for decoding and handling _FillValue in CFDatetimeCoder seems to be one working solution with minimal code changes.

If the CI is happy I'll add tests to cover the nanosecond issues in #7817.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preserve nanosecond resolution when encoding/decoding times 1700227455
1541410601 https://github.com/pydata/xarray/issues/7831#issuecomment-1541410601 https://api.github.com/repos/pydata/xarray/issues/7831 IC_kwDOAMm_X85b4A8p kmuehlbauer 5821660 2023-05-10T06:13:20Z 2023-05-10T06:13:39Z MEMBER

Yet another idea would be to add an Engines heading on https://docs.xarray.dev/en/stable/ecosystem.html where engines/backends and their respective packages can be listed. The error could include a link to that page.

{
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Can't open datasets with the `rasterio` engine. 1702025553
1540845511 https://github.com/pydata/xarray/issues/7831#issuecomment-1540845511 https://api.github.com/repos/pydata/xarray/issues/7831 IC_kwDOAMm_X85b12_H kmuehlbauer 5821660 2023-05-09T20:26:32Z 2023-05-09T20:26:32Z MEMBER

Maybe it would also help to rephrase the error, something along the lines

"Engine rasterio is not available. Please install the needed package. Engines [xxx, yyy, zzz] are available."

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Can't open datasets with the `rasterio` engine. 1702025553
1539356386 https://github.com/pydata/xarray/pull/7827#issuecomment-1539356386 https://api.github.com/repos/pydata/xarray/issues/7827 IC_kwDOAMm_X85bwLbi kmuehlbauer 5821660 2023-05-09T03:51:39Z 2023-05-09T03:51:39Z MEMBER

Thanks for the heads-up, @spencerkclark. No worries, I need to apply some changes anyway as it turns out.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preserve nanosecond resolution when encoding/decoding times 1700227455
1538998850 https://github.com/pydata/xarray/pull/7827#issuecomment-1538998850 https://api.github.com/repos/pydata/xarray/issues/7827 IC_kwDOAMm_X85bu0JC kmuehlbauer 5821660 2023-05-08T20:22:28Z 2023-05-08T20:22:28Z MEMBER

All tests have passed. Rebased now on latest main. The issue described in #7817 is resolved. Ready for first reviews.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preserve nanosecond resolution when encoding/decoding times 1700227455
1538966366 https://github.com/pydata/xarray/pull/7827#issuecomment-1538966366 https://api.github.com/repos/pydata/xarray/issues/7827 IC_kwDOAMm_X85busNe kmuehlbauer 5821660 2023-05-08T20:01:17Z 2023-05-08T20:01:17Z MEMBER

I've reset the order of coders to the initial behaviour. Instead the times are special cased in the CFMaskCoder. Locally it works, but I'll only trust the CI.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preserve nanosecond resolution when encoding/decoding times 1700227455
1538819904 https://github.com/pydata/xarray/pull/7771#issuecomment-1538819904 https://api.github.com/repos/pydata/xarray/issues/7771 IC_kwDOAMm_X85buIdA kmuehlbauer 5821660 2023-05-08T18:11:00Z 2023-05-08T18:11:00Z MEMBER

Setting status back to draft for now, still evaluating solutions for the CF encoding/decoding.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  implement scale_factor/add_offset CF conformance test, add and align tests 1676309093
1538818465 https://github.com/pydata/xarray/pull/7654#issuecomment-1538818465 https://api.github.com/repos/pydata/xarray/issues/7654 IC_kwDOAMm_X85buIGh kmuehlbauer 5821660 2023-05-08T18:09:59Z 2023-05-08T18:09:59Z MEMBER

I've converted to draft for now, as I'm still evaluating solutions for the CF encoding/decoding.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  cf-coding 1633623916
1538364933 https://github.com/pydata/xarray/pull/7827#issuecomment-1538364933 https://api.github.com/repos/pydata/xarray/issues/7827 IC_kwDOAMm_X85bsZYF kmuehlbauer 5821660 2023-05-08T13:29:07Z 2023-05-08T13:29:07Z MEMBER

@spencerkclark I'd appreciate it if you could have a look here. All but one test pass, but I can't immediately see what that test is doing. It looks like mismatched dtypes on the attributes. If you have any suggestions on how to improve this, please let me know. I've not added tests here yet.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preserve nanosecond resolution when encoding/decoding times 1700227455
1538354499 https://github.com/pydata/xarray/issues/7817#issuecomment-1538354499 https://api.github.com/repos/pydata/xarray/issues/7817 IC_kwDOAMm_X85bsW1D kmuehlbauer 5821660 2023-05-08T13:22:22Z 2023-05-08T13:22:52Z MEMBER

@dcherian Yes, I've set up a prototype in #7827. But the overall solution doesn't look that nice. The handling of fill_value still has to be done in CFMaskCoder.

Also #7098 is needed for this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  nanosecond precision lost when reading time data 1696097756
1535941525 https://github.com/pydata/xarray/issues/7816#issuecomment-1535941525 https://api.github.com/repos/pydata/xarray/issues/7816 IC_kwDOAMm_X85bjJuV kmuehlbauer 5821660 2023-05-05T08:55:42Z 2023-05-05T08:55:42Z MEMBER

@gauteh No worries, glad it works now!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Backend registration does not match docs, and is no longer specifiable in maturin pyproject toml 1695809136
1535776861 https://github.com/pydata/xarray/issues/7814#issuecomment-1535776861 https://api.github.com/repos/pydata/xarray/issues/7814 IC_kwDOAMm_X85bihhd kmuehlbauer 5821660 2023-05-05T06:31:20Z 2023-05-05T06:31:20Z MEMBER

@paul0207 Thanks for providing the datafiles. I can't reproduce on my machine. Please provide more information: the output of xr.show_versions() and a complete traceback of the error you are experiencing would help.

A complete list of installed Python packages (e.g. via pip list) would be nice, too.

Another couple of questions to get some more insight:

  • Does this happen only with these specific files, or do you experience it every time?
  • Does the problem persist when specifying engine="netcdf4" or engine="h5netcdf" in the call to open_mfdataset?
  • Does this also happen if you open the files one by one (with xr.open_dataset) and combine the Datasets with xr.concat?
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  TypeError: 'NoneType' object is not callable when joining netCDF files. Works when ran interactively. 1695028906
1535724636 https://github.com/pydata/xarray/issues/7816#issuecomment-1535724636 https://api.github.com/repos/pydata/xarray/issues/7816 IC_kwDOAMm_X85biUxc kmuehlbauer 5821660 2023-05-05T05:46:46Z 2023-05-05T05:46:46Z MEMBER

@gauteh Yes, please provide as much information as possible. It is also of interest how you installed the package and what Python environment you are using (e.g. system Python, conda, venv, etc.).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Backend registration does not match docs, and is no longer specifiable in maturin pyproject toml 1695809136
1535596259 https://github.com/pydata/xarray/issues/7816#issuecomment-1535596259 https://api.github.com/repos/pydata/xarray/issues/7816 IC_kwDOAMm_X85bh1bj kmuehlbauer 5821660 2023-05-05T01:46:12Z 2023-05-05T01:46:12Z MEMBER

@gauteh You would probably have to delete this line:

https://github.com/gauteh/hidefix/blob/main/python/hidefix/xarray.py#L192

As @headtr1ck already explained, it is all handled via the plugin system, so that duplicate engine names can be handled on discovery through the Python package metadata.
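Purely for illustration, a minimal sketch (assuming the standard importlib.metadata API and xarray's "xarray.backends" entry-point group) of how engines are discovered, which is why no manual registration call is needed:

```python
# Hedged sketch: backend engines are discovered from the "xarray.backends"
# entry-point group declared in each package's metadata (Python 3.10+ API shown).
from importlib.metadata import entry_points

for ep in entry_points(group="xarray.backends"):
    print(ep.name, "->", ep.value)
```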

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Backend registration does not match docs, and is no longer specifiable in maturin pyproject toml 1695809136
1534855008 https://github.com/pydata/xarray/issues/7817#issuecomment-1534855008 https://api.github.com/repos/pydata/xarray/issues/7817 IC_kwDOAMm_X85bfAdg kmuehlbauer 5821660 2023-05-04T14:11:26Z 2023-05-04T14:11:26Z MEMBER

cc @spencerkclark @DocOtak I've tried to find at least one example which manifests as a bug. Nevertheless, the transformation from int to float in CFMaskCoder should be avoided.

We might think about special casing time data in CFMaskCoder, or handle masking of time data in CFDatetimeCoder/CFTimedeltaCoder.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  nanosecond precision lost when reading time data 1696097756
1532441433 https://github.com/pydata/xarray/issues/7790#issuecomment-1532441433 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bVzNZ kmuehlbauer 5821660 2023-05-03T04:25:50Z 2023-05-03T04:25:50Z MEMBER

@christine-e-smit Great that this works on your side with the proposed patch in #7098.

Nevertheless, we've identified three more issues in the debugging process which can now be handled one by one. So again, thanks for your contribution here.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1531050846 https://github.com/pydata/xarray/issues/7790#issuecomment-1531050846 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bQfte kmuehlbauer 5821660 2023-05-02T08:04:45Z 2023-05-03T04:20:11Z MEMBER

As in #7098, citing @dcherian:

> I think the real solution here is to explicitly handle NaNs during the decoding step. We do want these to be NaT in the output.

There are three more issues revealed here when using datetime64 (see the sketch after this list):

  • if _FillValue is set in encoding, it has to be of the same type/resolution as the times in the array
  • if _FillValue is provided, we need to provide dtype and units which fit our data, e.g. if the _FillValue is referenced to the unix epoch, the units should be equivalent
  • when encoding in the presence of NaT, the data array is converted to floating point with NaN, which is problematic for the subsequent conversion to int64
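A minimal sketch of an encoding that satisfies the first two points (values are illustrative only, mirroring the workaround discussed further down in this issue):

```python
# Hedged sketch: a nanosecond-resolution _FillValue plus an explicit dtype and
# units that match it (both referenced to the unix epoch). Illustrative only.
import numpy as np
import xarray as xr

time = np.array(["NaT", "2023-01-02"], dtype="datetime64[ns]")
ds = xr.Dataset({"time": xr.DataArray(data=time, dims=["time"], name="time")})
encoding = {
    "time": {
        "_FillValue": np.datetime64("1900-01-01", "ns"),
        "dtype": np.int64,
        "units": "nanoseconds since 1970-01-01",
    }
}
ds.to_zarr("example.zarr", mode="w", encoding=encoding)
```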
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1531496369 https://github.com/pydata/xarray/issues/5490#issuecomment-1531496369 https://api.github.com/repos/pydata/xarray/issues/5490 IC_kwDOAMm_X85bSMex kmuehlbauer 5821660 2023-05-02T13:38:49Z 2023-05-02T13:38:49Z MEMBER

This is indeed an issue with scale_factor and add_offset, as @d70-t has already mentioned.

That is not a problem per se, but those attributes are obviously different for different files. When concatenating, only the first file's attributes survive. That might already be the source of the above problem, as it might slightly change values.

An even bigger problem arises when the dynamic ranges of the decoded data (min/max) don't overlap. Then the data might be folded from the lower border to the upper border, or vice versa.

I've put an example into #5739. The suggestion for now is, as @keewis commented, to drop encoding in such cases and use floating point values for writing. You might use the available compression options for floating point data.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Nan/ changed values in output when only reading data, saving and reading again 924676925
1531465011 https://github.com/pydata/xarray/issues/5490#issuecomment-1531465011 https://api.github.com/repos/pydata/xarray/issues/5490 IC_kwDOAMm_X85bSE0z kmuehlbauer 5821660 2023-05-02T13:20:46Z 2023-05-02T13:20:46Z MEMBER

Xref: #5739

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Nan/ changed values in output when only reading data, saving and reading again 924676925
1530991257 https://github.com/pydata/xarray/issues/7790#issuecomment-1530991257 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bQRKZ kmuehlbauer 5821660 2023-05-02T07:09:38Z 2023-05-02T08:14:36Z MEMBER

@christine-e-smit I've created a fresh environment with only xarray and zarr and it still works on my machine. I've then followed the Darwin idea and dug up #6191 (I've got those casting warnings from exactly the line you were referring to). Comment https://github.com/pydata/xarray/issues/6191#issuecomment-1209567966 should explain what happens here.

tl;dr citing @DocOtak:

> The short explanation is that the time conversion functions do an astype(np.int64) or equivalent cast on arrays that contain nans. This is undefined behavior and very soon, doing this will start to emit RuntimeWarnings.
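A hedged one-liner reproducing the cast in question (not from the linked PR; recent NumPy emits the same RuntimeWarning):

```python
# Hedged illustration: casting an array containing NaN to int64 is undefined
# behaviour and warns on recent NumPy ("invalid value encountered in cast").
import numpy as np

print(np.array([np.nan, 1.0]).astype(np.int64))
```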

There is also an open PR #7098.

Thanks @christine-e-smit for sticking with me to find the root-cause here by providing detailed information and code examples. :+1:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1530141083 https://github.com/pydata/xarray/issues/7790#issuecomment-1530141083 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bNBmb kmuehlbauer 5821660 2023-05-01T20:01:50Z 2023-05-01T20:01:50Z MEMBER

@christine-e-smit One more idea: you might delete the zarr folder before re-creating it (if you are not doing that already). I removed the complete folder before any new write (by putting e.g. !rm -rf xarray_and_units.zarr at the beginning of the notebook cell).

It would also be great if you could run the code from https://github.com/pydata/xarray/issues/7790#issuecomment-1529894939 and post the output here, just for the sake of comparison (please delete the zarr folder beforehand if it exists). Thanks!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1530131533 https://github.com/pydata/xarray/issues/7790#issuecomment-1530131533 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bM_RN kmuehlbauer 5821660 2023-05-01T19:53:53Z 2023-05-01T19:53:53Z MEMBER

@christine-e-smit I've plugged your code into a fresh notebook, here is my output:

```python
xarray created with NaT fill value
<xarray.DataArray 'time' (time: 2)>
array(['NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02

xarray created read with NaT fill value
<xarray.DataArray 'time' (time: 2)>
array(['NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9223372036854775808, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
```

The output seems OK on my side. I have no idea why the data isn't correctly decoded as NaT on your side. I've checked that my environment is comparable to yours. The only remaining difference is that you are on Darwin arm64 whereas I'm on Linux.

```
INSTALLED VERSIONS

commit: None
python: 3.11.2 | packaged by conda-forge | (main, Mar 31 2023, 17:51:05) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-144-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.14.0
libnetcdf: None

xarray: 2023.4.2
pandas: 2.0.1
numpy: 1.24.3
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.14.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.3.2
distributed: 2023.3.2
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.3.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.6.1
pip: 23.0.1
conda: None
pytest: 7.2.2
mypy: 0.982
IPython: 8.12.0
sphinx: None
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1530111912 https://github.com/pydata/xarray/issues/7790#issuecomment-1530111912 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bM6eo kmuehlbauer 5821660 2023-05-01T19:30:22Z 2023-05-01T19:30:22Z MEMBER

> Unfortunately, I think you may have also gotten some wires crossed? You set the time fill value to 1900-01-01, but then use NaT in the actual array?

Yes, I use NaT because I want to check whether the encoder correctly translates NaT to the provided _FillValue on write.

So from your last example I'm assuming you would like to have the int64 representation of NaT as _FillValue, right? I'll try to adapt this and see what I get.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1529894939 https://github.com/pydata/xarray/issues/7790#issuecomment-1529894939 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bMFgb kmuehlbauer 5821660 2023-05-01T16:05:19Z 2023-05-01T16:05:19Z MEMBER

So, after some debugging I think I've found two issues here with the current code.

First, we need to give the fill value a fitting resolution. Second, we have an issue with inferring the units from the data (if not given).

Here is some workaround code which (finally, :crossed_fingers:) should at least write and read correct data (comments added in the code below):

```python
# Create a numpy array of type np.datetime64 with one fill value and one date
# FIRST ISSUE WITH _FillValue
# we need to provide ns resolution here too, otherwise we get wrong fillvalues (day-reference)
time_fill_value = np.datetime64("1900-01-01 00:00:00.00000000", "ns")
time = np.array([np.datetime64("NaT", "ns"), '2023-01-02 00:00:00.00000000'], dtype='M8[ns]')

# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time, dims=['time'], name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))

print("******")
print("Created with fill value 1900-01-01")
print(xr_ds["time"])

# Save the dataset to zarr
location_new_fill = "from_xarray_new_fill.zarr"

# SECOND ISSUE with inferring units from data
# We need to specify "dtype" and "units" which fit our data
# Note: as we provide a _FillValue with a reference to unix-epoch
# we need to provide a fitting units too
encoding = {
    "time": {"_FillValue": time_fill_value, "dtype": np.int64, "units": "nanoseconds since 1970-01-01"}
}
xr_ds.to_zarr(location_new_fill, mode="w", encoding=encoding)

xr_read = xr.open_zarr(location_new_fill)
print("******")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].attrs)
print(xr_read["time"].encoding)

z_new_fill = zarr.open('from_xarray_new_fill.zarr', 'r')
print("******")
print("Read back out of the zarr store with zarr")
print(z_new_fill["time"])
print(z_new_fill["time"].attrs)
print(z_new_fill["time"][:])
```

```python
Created with fill value 1900-01-01
<xarray.DataArray 'time' (time: 2)>
array(['NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02

Read back out of the zarr store with xarray
<xarray.DataArray 'time' (time: 2)>
array(['NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -2208988800000000000, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}

Read back out of the zarr store with zarr
<zarr.core.Array '/time' (2,) int64 read-only>
<zarr.attrs.Attributes object at 0x7f086ab8e710>
[-2208988800000000000  1672617600000000000]
```

@christine-e-smit Please let me know if the above workaround gives you correct results in your workflow. If so, we can think about how to automatically align the fill-value resolution with the data resolution and what needs to be done to correctly deduce the units.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1529076482 https://github.com/pydata/xarray/issues/7790#issuecomment-1529076482 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bI9sC kmuehlbauer 5821660 2023-04-30T16:52:25Z 2023-04-30T16:52:25Z MEMBER

```python
xr_ds.to_zarr(location_new_fill, encoding=encoding)

xr_read = xr.open_zarr(location)
print("******")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].encoding)
```

@christine-e-smit Is this just a remnant of copy&paste? The above code writes to location_new_fill, but reads from location.

Here is my code and output for comparison (using latest zarr/xarray):

```python
# Create a numpy array of type np.datetime64 with one fill value and one date
time_fill_value = np.datetime64("1900-01-01")
time = np.array([np.datetime64("NaT"), '2023-01-02'], dtype='M8[ns]')

# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time, dims=['time'], name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))

print("******")
print("Created with fill value 1900-01-01")
print(xr_ds["time"])

# Save the dataset to zarr
location_new_fill = "from_xarray_new_fill.zarr"
encoding = {
    "time": {"_FillValue": time_fill_value, "dtype": np.int64}
}
xr_ds.to_zarr(location_new_fill, encoding=encoding)

xr_read = xr.open_zarr(location_new_fill)
print("******")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].encoding)
```

```python
Created with fill value 1900-01-01
<xarray.DataArray 'time' (time: 2)>
array(['NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02

Read back out of the zarr store with xarray
<xarray.DataArray 'time' (time: 2)>
array(['NaT', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -25567, 'units': 'days since 2023-01-02 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
```

This doesn't look correct either. At least the decoded _FillValue or the units are wrong. -25567 is 1900-01-01 when referenced to the unix epoch (question: is zarr time based on the unix epoch?). When read back via zarr only, this would decode into:

```python
<xarray.DataArray 'time' (time: 2)>
array(['1953-01-02T00:00:00.000000000', '2023-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
```
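A quick, hedged arithmetic check of the two numbers involved:

```python
# -25567 is the day offset of 1900-01-01 relative to the unix epoch, while
# 2023-01-02 minus 25567 days lands in early 1953, hence the mismatch above.
import numpy as np

print(np.datetime64("1900-01-01") - np.datetime64("1970-01-01"))  # -25567 days
print(np.datetime64("2023-01-02") - np.timedelta64(25567, "D"))   # 1953-01-02
```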

I totally agree with @christine-e-smit, this is all very confusing. As said at the beginning, I have little knowledge of zarr. I'm currently digging into CF encoding/decoding, which is what made me jump in here.

AFAICT, the encoding already has a problem; at least, the data on disk is not what we expect. It seems that the xarray CF encoding/decoding is not well aligned with the zarr writing/reading of datetimes.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1527527029 https://github.com/pydata/xarray/issues/2478#issuecomment-1527527029 https://api.github.com/repos/pydata/xarray/issues/2478 IC_kwDOAMm_X85bDDZ1 kmuehlbauer 5821660 2023-04-28T12:59:04Z 2023-04-28T15:46:09Z MEMBER

@sbiner Sorry for the massive delay here. Not much has changed since the creation of your issue. Xarray doesn't take the netcdf default fill values into account (there are reasons, which @shoyer has explained in https://github.com/pydata/xarray/pull/5680#issuecomment-895455163 and https://github.com/pydata/xarray/pull/5680#issuecomment-895508489).

On write it just uses NaN as _FillValue (in case no specific encoding is given).
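For completeness, a hedged sketch (dataset and variable name made up) of overriding that default with an explicit fill value via encoding:

```python
# Hedged sketch: pass an explicit _FillValue through the encoding argument on write.
import numpy as np
import xarray as xr

ds = xr.Dataset({"var": ("x", np.array([1.0, np.nan, 3.0]))})
ds.to_netcdf("out.nc", encoding={"var": {"_FillValue": -9999.0}})
```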

Xref: #2374, #7723, #5680

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  masked_array write/read differences between xarray and netCDF4 368833116
1527605739 https://github.com/pydata/xarray/issues/7713#issuecomment-1527605739 https://api.github.com/repos/pydata/xarray/issues/7713 IC_kwDOAMm_X85bDWnr kmuehlbauer 5821660 2023-04-28T13:55:17Z 2023-04-28T13:55:17Z MEMBER

The code has been there since #867 by @shoyer, which was committed almost 7 years ago.

I have no idea what the purpose of packing tuples into 0d arrays is, but as there are also tests for it in that PR, I'm assuming there is a real reason. Maybe @shoyer can chime in here to shed some light?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `Variable/IndexVariable` do not accept a tuple for data. 1652227927
1527544656 https://github.com/pydata/xarray/issues/7647#issuecomment-1527544656 https://api.github.com/repos/pydata/xarray/issues/7647 IC_kwDOAMm_X85bDHtQ kmuehlbauer 5821660 2023-04-28T13:12:08Z 2023-04-28T13:12:08Z MEMBER

@wangshuaicumt Did you get anywhere with this issue? If it is still unresolved, it would be great if you could provide the data or an MCVE.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  merge 1631491844
1527541305 https://github.com/pydata/xarray/issues/7630#issuecomment-1527541305 https://api.github.com/repos/pydata/xarray/issues/7630 IC_kwDOAMm_X85bDG45 kmuehlbauer 5821660 2023-04-28T13:09:22Z 2023-04-28T13:09:22Z MEMBER

@AlxndrLhr I suppose your original issue is resolved. Please reopen or create a new issue if you still have problems with this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  .loc[] cannot find a value that .sel() can find without problem 1624560934
1527537064 https://github.com/pydata/xarray/issues/6429#issuecomment-1527537064 https://api.github.com/repos/pydata/xarray/issues/6429 IC_kwDOAMm_X85bDF2o kmuehlbauer 5821660 2023-04-28T13:06:14Z 2023-04-28T13:06:14Z MEMBER

It looks like this is no longer an issue with recent versions of the stack. At least I can't reproduce it. @mjwillson Please reopen if you still encounter problems while plotting.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  FacetGrid padding goes very bad when cartopy projection specified 1188262115
1527498384 https://github.com/pydata/xarray/issues/7092#issuecomment-1527498384 https://api.github.com/repos/pydata/xarray/issues/7092 IC_kwDOAMm_X85bC8aQ kmuehlbauer 5821660 2023-04-28T12:34:03Z 2023-04-28T12:34:03Z MEMBER

@leicunxing-rs Sorry for the delay here. Your issue might be connected with concatenation/merge of several files containing packed data with different scale_factor/add_offset. See issue #5739 for more details (there they also merge different ERA5 datasets, hence the idea).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Save an nc file and open it again, the content of the data inside has changed 1387341095
1527461082 https://github.com/pydata/xarray/issues/5739#issuecomment-1527461082 https://api.github.com/repos/pydata/xarray/issues/5739 IC_kwDOAMm_X85bCzTa kmuehlbauer 5821660 2023-04-28T12:00:15Z 2023-04-28T12:00:15Z MEMBER

@dougrichardson Sorry for the delay. If you are still interested in the source of this issue, here is what I found:

The root cause is different scale_factor and add_offset values in the source files.

When merging, only the .encoding of the first dataset survives. This leads to a wrongly encoded file for the May dates. But why is this so?

The issue is with the packed dtype ("int16") and the particular values of scale_factor/add_offset.

For February the dynamic range is (228.96394336525748, 309.9690856933594) K, whereas for May it is (205.7644192729947, 311.7797088623047) K.

Now we can clearly see that all values above 309.969 K will be folded to the lower end (> 229 K).

To circumvent that you have at least two options:

  • change scale_factor and add_offset values in the variables .encoding before writing to appropriate values which cover your whole dynamic range
  • drop scale_factor/add_offset (and other CF related attributes) from .encoding to write floating point values

It might be nice to have checks for that in the encoding steps, to prevent writing erroneous values. So this is not really a bug, but might be less impactful when encoding is dropped on operations (see discussion in #6323).
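A hedged sketch of the second option above (helper name is made up; assumes the merged dataset was opened with decoding enabled):

```python
import xarray as xr

def drop_packing_encoding(ds: xr.Dataset) -> xr.Dataset:
    """Drop CF packing keys from .encoding so variables are written as floating point."""
    for var in ds.data_vars.values():
        for key in ("scale_factor", "add_offset", "dtype"):
            var.encoding.pop(key, None)
    return ds

# hypothetical usage: ds = drop_packing_encoding(ds); ds.to_netcdf("merged_unpacked.nc")
```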

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Writing and reopening introduces bad values 979916914
1527376059 https://github.com/pydata/xarray/issues/5170#issuecomment-1527376059 https://api.github.com/repos/pydata/xarray/issues/5170 IC_kwDOAMm_X85bCei7 kmuehlbauer 5821660 2023-04-28T10:47:38Z 2023-04-28T10:47:38Z MEMBER

@floriankrb Sorry for the long delay. If you are still interested in the source of the issue, here is what I found:

By default Xarray will promote any data variable which shares its name with a dimension to a coordinate. That accounts for ['number', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude']. valid_time is a two-dimensional coordinate (by the CF standard) and is a coordinate here because the t2m data variable has a corresponding coordinates attribute containing valid_time. In the decoding step valid_time gets added to .coords. The attribute is removed from t2m's attrs and kept in t2m.encoding. So far so good.

When renaming number to n, that coordinates attribute (in encoding) does not change as well. So when the data is written, t2m will still hold number in its coordinates attribute (on disk).

The issue manifests on the subsequent read, as the decoding step now tries to align the found coordinates with the available data variables. Since number is not available, no coordinate from that string will be taken into account as a coordinate (note the all on line 444):

https://github.com/pydata/xarray/blob/0f4e99d036b0d6d76a3271e6191eacbc9922662f/xarray/conventions.py#L439-L447

This can easily be observed by looking into t2m.attrs where the coordinates remains instead of being preserved in .encoding.

So the source of all problems here is that the renaming number -> n was missed for the coordinates-attribute in t2m's .encoding.
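
A small sketch of the workaround this implies (untested; ds stands for the dataset from the report): after renaming, drop the stale coordinates entry from .encoding so Xarray regenerates the attribute on write.

```python
ds = ds.rename({"number": "n"})
# the on-disk "coordinates" attribute is kept in encoding and is not renamed,
# so remove it and let Xarray rebuild it from the actual coordinates
ds["t2m"].encoding.pop("coordinates", None)
ds.to_netcdf("out.nc")
```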

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_netcdf is not idempotent when stacking rename and set_coords 859772411
1527234694 https://github.com/pydata/xarray/issues/2192#issuecomment-1527234694 https://api.github.com/repos/pydata/xarray/issues/2192 IC_kwDOAMm_X85bB8CG kmuehlbauer 5821660 2023-04-28T09:06:22Z 2023-04-28T09:06:22Z MEMBER

Can't reproduce with recent xarray/matplotlib/cartopy. Looks like this has been resolved.

```python
import xarray as xr
import cartopy.crs as ccrs

ds = xr.tutorial.load_dataset('air_temperature')
ds = ds.sel(lon=slice(250, 300))
air = ds['air']
transform = ccrs.PlateCarree()
projection = ccrs.Mercator(air.lon.values.mean(), air.lat.values.min(), air.lat.values.max())
p = air.isel(time=[0, 1]).plot(
    transform=transform,
    aspect=ds.dims['lon'] / ds.dims['lat'],
    col='time',
    col_wrap=1,
    subplot_kws={'projection': projection},
)
for ax in p.axs.flat:
    ax.set_extent(
        (air.lon.values.min(), air.lon.values.max(), air.lat.values.min(), air.lat.values.max()),
        crs=transform,
    )
    ax.set_aspect('equal', 'box')
```

Please reopen, if this is still an issue.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Subplots overlap each other using plot() and cartopy 327101646
1527050493 https://github.com/pydata/xarray/issues/7790#issuecomment-1527050493 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85bBPD9 kmuehlbauer 5821660 2023-04-28T06:21:38Z 2023-04-28T06:21:38Z MEMBER

Thanks @dcherian for filling in the details.

I've dug up some more related issues: #2265, #3942, #4045

IIUC, #4684 did a great job of ironing out most of these issues, but it looks like only in the case when no NaT is within the time array (cc @spencerkclark). @christine-e-smit If you have no NaT in your time array then you can just omit encoding completely; Xarray will use int64 by default and your data should be fine on disk.

In the presence of NaT it looks like one workaround to circumvent that issue for the time being is to add the dtype in addition to _FillValue when writing out to zarr:

```python
encoding = {"time": {"_FillValue": time_fill_value, "dtype": np.int64}}
xr_ds.to_zarr(location, encoding=encoding)
```

One note to this: Xarray is deducing the units from the current time data. So for the above example it will result in 'days since 2023-01-02 00:00:00', where days would now be the resolution in the file. If you want the resolution to be nanoseconds on disk, units would need to be added to the encoding.

```python
encoding = {
    "time": {"_FillValue": time_fill_value, "dtype": np.int64, "units": "nanoseconds since 2023-01-02"}
}
xr_ds.to_zarr(location, encoding=encoding)
```

@christine-e-smit It would be great if you could confirm that from your side (some sanity check needed on my side).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1525790614 https://github.com/pydata/xarray/issues/7790#issuecomment-1525790614 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85a8beW kmuehlbauer 5821660 2023-04-27T14:23:16Z 2023-04-27T14:23:16Z MEMBER

@christine-e-smit I see, thanks for the details. AFAICT from the code it looks like zarr is special-cased in some ways compared to other backends. I'd really rely on some zarr-expert shedding light here and over at #7776.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1525780533 https://github.com/pydata/xarray/issues/7713#issuecomment-1525780533 https://api.github.com/repos/pydata/xarray/issues/7713 IC_kwDOAMm_X85a8ZA1 kmuehlbauer 5821660 2023-04-27T14:17:26Z 2023-04-27T14:17:26Z MEMBER

@zoj613 Thanks for raising this.

The root cause is that the tuple is returned from as_compatible_data as a single-element object array:

```python
import xarray as xr

print(xr.core.variable.as_compatible_data((2, 3, 4)))
```

```python
array((2, 3, 4), dtype=object)
```

This then breaks with the error you are seeing. I'm not quite sure if this is a bug in the code, a bug in the doc or no bug at all. But as a tuple is easily wrapped by np.array there should be a reason why Xarray is currently not able to digest tuples.
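
A possible workaround until this is settled, as a small sketch (simply wrapping the tuple yourself):

```python
import numpy as np
import xarray as xr

# wrapping the tuple in an explicit array yields a proper 1-d integer array
v = xr.Variable(dims="x", data=np.asarray((2, 3, 4)))
print(v)
```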

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `Variable/IndexVariable` do not accept a tuple for data. 1652227927
1525705799 https://github.com/pydata/xarray/issues/7782#issuecomment-1525705799 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85a8GxH kmuehlbauer 5821660 2023-04-27T13:33:50Z 2023-04-27T13:33:50Z MEMBER

As we can see from the above output, in netCDF4-python scaling is adapting the dtype to unsigned, not masking. This is also reflected in the docs unidata.github.io/netcdf4-python/#Variable.

Do we know why this is so?

TL;DR: a NETCDF3-era detail to signal unsigned integers, still used in recent formats

  • more discussion details on this over at https://github.com/Unidata/netcdf4-python/issues/656
  • at NetCDF Users Guide on packed data:

A conventional way to indicate whether a byte, short, or int variable is meant to be interpreted as unsigned, even for the netCDF-3 classic model that has no external unsigned integer type, is by providing the special variable attribute _Unsigned with value "true". However, most existing data for which packed values are intended to be interpreted as unsigned are stored without this attribute, so readers must be aware of packing assumptions in this case. In the enhanced netCDF-4 data model, packed integers may be declared to be of the appropriate unsigned type.

My suggestion would be to nudge the user by issuing warnings and linking to new, to-be-added documentation on the topic. This could be in line with the cf-coding conformance checks which were discussed yesterday in the dev-meeting.
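
For illustration, a small sketch of what the _Unsigned="true" reinterpretation does to the packed bytes (the values -41 and -1 are the flag and fill values of the scfv variable discussed elsewhere in this thread):

```python
import numpy as np

packed = np.array([-41, -1], dtype="int8")   # signed bytes as stored on disk
print(packed.view("uint8"))                  # [215 255] -> unsigned interpretation
```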

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1525524428 https://github.com/pydata/xarray/issues/7790#issuecomment-1525524428 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85a7afM kmuehlbauer 5821660 2023-04-27T11:26:15Z 2023-04-27T11:26:15Z MEMBER

Xref: discussion #7776, which got no attention up to now.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1525513525 https://github.com/pydata/xarray/issues/7790#issuecomment-1525513525 https://api.github.com/repos/pydata/xarray/issues/7790 IC_kwDOAMm_X85a7X01 kmuehlbauer 5821660 2023-04-27T11:19:24Z 2023-04-27T11:19:24Z MEMBER

@christine-e-smit

So, I'm no zarr expert, but it turns out that your NaT was converted to -9.223372036854776e+18 in the encoding step. I'm assuming that zarr converts NaT because the format doesn't allow using NaT directly, so it chooses a (default) value.

The _FillValue is not lost, but it will be preserved in the .encoding-dict of the underlying Variable:

```python
xr_read = xr.open_zarr(location)
print("******************")
print("No fill value")
print(xr_read["time"])
print(xr_read["time"].encoding)
```

```python
******************
No fill value
<xarray.DataArray 'time' (time: 2)>
array([                          'NaT', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9.223372036854776e+18, 'units': 'days since 2023-01-02 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('float64')}
```

You might also check this without decoding (decode_cf=False):

```python
with xr.open_zarr(location, decode_cf=False) as xr_read:
    print("******************")
    print("No fill value")
    print(xr_read["time"])
    print(xr_read["time"].encoding)
```

```python
******************
No fill value
<xarray.DataArray 'time' (time: 2)>
array([-9.223372e+18,  0.000000e+00])
Coordinates:
  * time     (time) float64 -9.223e+18 0.0
Attributes:
    calendar:    proleptic_gregorian
    units:       days since 2023-01-02 00:00:00
    _FillValue:  -9.223372036854776e+18
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('float64')}
```

Maybe a zarr expert can chime in here on what the best practice for time fill values is.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fill values in time arrays (numpy.datetime64) are lost in zarr 1685803922
1524805132 https://github.com/pydata/xarray/pull/7788#issuecomment-1524805132 https://api.github.com/repos/pydata/xarray/issues/7788 IC_kwDOAMm_X85a4q4M kmuehlbauer 5821660 2023-04-27T06:13:23Z 2023-04-27T07:19:47Z MEMBER

@maxhollmann I've checked and memory served well, the following issue might be related: #2377. It looks like your use-case is at least connected to @gerritholl's. It would be great if you could add your original use case (as MCVE, if possible) to get more details.

A special case (masked integer arrays) is discussed in #3955. While this might give additional information, it might not exactly fit your problem.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fix as_compatible_data for read-only np.ma.MaskedArray 1685422501
1523829332 https://github.com/pydata/xarray/pull/7788#issuecomment-1523829332 https://api.github.com/repos/pydata/xarray/issues/7788 IC_kwDOAMm_X85a08pU kmuehlbauer 5821660 2023-04-26T17:55:13Z 2023-04-26T17:55:13Z MEMBER

@maxhollmann I'll have a look into this, I think I've seen something like this some time ago.

Maybe you can add the tests to the PR or as comment? This might get more attention and will really help to debug.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fix as_compatible_data for read-only np.ma.MaskedArray 1685422501
1523786065 https://github.com/pydata/xarray/pull/7788#issuecomment-1523786065 https://api.github.com/repos/pydata/xarray/issues/7788 IC_kwDOAMm_X85a0yFR kmuehlbauer 5821660 2023-04-26T17:18:44Z 2023-04-26T17:18:44Z MEMBER

I've marked this by accident, sorry @maxhollmann. Let us know when you feel this is ready.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fix as_compatible_data for read-only np.ma.MaskedArray 1685422501
1522997083 https://github.com/pydata/xarray/issues/7782#issuecomment-1522997083 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85axxdb kmuehlbauer 5821660 2023-04-26T08:28:39Z 2023-04-26T08:28:39Z MEMBER

This is how netCDF4-python handles this data with different parameters:

```python
import netCDF4 as nc

with nc.Dataset(
    "http://dap.ceda.ac.uk/thredds/dodsC/neodc/esacci/snow/data/scfv/MODIS/v2.0/2010/01/20100101-ESACCI-L3C_SNOW-SCFV-MODIS_TERRA-fv2.0.nc"
) as ds_dap:
    v = ds_dap["scfv"]
    print(v)

    print("\n- default")
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")

    print("\n- maskandscale False")
    ds_dap.set_auto_maskandscale(False)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")

    print("\n- mask/scale False")
    ds_dap.set_auto_mask(False)
    ds_dap.set_auto_scale(False)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")

    print("\n- mask True / scale False")
    ds_dap.set_auto_mask(True)
    ds_dap.set_auto_scale(False)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")

    print("\n- mask False / scale True")
    ds_dap.set_auto_mask(False)
    ds_dap.set_auto_scale(True)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")

    print("\n- mask True / scale True")
    ds_dap.set_auto_mask(True)
    ds_dap.set_auto_scale(True)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")

    print("\n- maskandscale True")
    ds_dap.set_auto_mask(False)
    ds_dap.set_auto_scale(False)
    ds_dap.set_auto_maskandscale(True)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")
```

```
<class 'netCDF4._netCDF4.Variable'>
int8 scfv(time, lat, lon)
    _Unsigned: true
    _FillValue: -1
    standard_name: snow_area_fraction_viewable_from_above
    long_name: Snow Cover Fraction Viewable
    units: percent
    valid_range: [ 0 -2]
    actual_range: [ 0 100]
    flag_values: [-51 -50 -46 -41 -4 -3 -2]
    flag_meanings: Cloud Polar_Night_or_Night Water Permanent_Snow_and_Ice Classification_failed Input_Data_Error No_Satellite_Acquisition
    missing_value: -1
    ancillary_variables: scfv_unc
    grid_mapping: spatial_ref
    _ChunkSizes: [ 1 1385 2770]
unlimited dimensions: time
current shape = (1, 18000, 36000)
filling off

- default
variable dtype: int8
first 2 elements: uint8 [215 215]
last 2 elements: uint8 [215 215]

- maskandscale False
variable dtype: int8
first 2 elements: int8 [-41 -41]
last 2 elements: int8 [-41 -41]

- mask/scale False
variable dtype: int8
first 2 elements: int8 [-41 -41]
last 2 elements: int8 [-41 -41]

- mask True / scale False
variable dtype: int8
first 2 elements: int8 [-- --]
last 2 elements: int8 [-- --]

- mask False / scale True
variable dtype: int8
first 2 elements: uint8 [215 215]
last 2 elements: uint8 [215 215]

- mask True / scale True
variable dtype: int8
first 2 elements: uint8 [215 215]
last 2 elements: uint8 [215 215]

- maskandscale True
variable dtype: int8
first 2 elements: uint8 [215 215]
last 2 elements: uint8 [215 215]
```

First, the dataset was created with filling off (read more about that in the netcdf file format specs https://docs.unidata.ucar.edu/netcdf-c/current/file_format_specifications.html). This should not be a problem for the analysis, but it tells us that all data points should have been written somehow.

As we can see from the above output, in netCDF4-python scaling is adapting the dtype to unsigned, not masking. This is also reflected in the docs https://unidata.github.io/netcdf4-python/#Variable.

If Xarray is trying to align with netCDF4-python it should separate mask and scale as netCDF4-python is doing. It already does this by using different coders, but it doesn't separate them API-wise.

We would need a similar approach here for Xarray with additional kwargs scale and mask in addition to mask_and_scale. We cannot just move the UnsignedCoder out of mask_and_scale and apply it unconditionally.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520804745 https://github.com/pydata/xarray/issues/7782#issuecomment-1520804745 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85apaOJ kmuehlbauer 5821660 2023-04-24T20:47:43Z 2023-04-24T20:47:43Z MEMBER

@dcherian The main issue here is that we have two different CF things which are applied, Unsigned and _FillValue/missing_value.

For netcdf4-python the values would just be masked and the dtype would be preserved. For xarray it will be cast to float32 because of the _FillValue/missing_value.

I agree, moving the Unsigned Coder out of mask_and_scale should help in that particular case.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520514792 https://github.com/pydata/xarray/issues/7782#issuecomment-1520514792 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85aoTbo kmuehlbauer 5821660 2023-04-24T16:52:30Z 2023-04-24T16:52:30Z MEMBER

@dcherian Yes, that would work.

We would want to check the different attributes and apply the coders only as needed. That might need some refactoring. I've been wrapping my head around this for several weeks now.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520363622 https://github.com/pydata/xarray/issues/7782#issuecomment-1520363622 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85anuhm kmuehlbauer 5821660 2023-04-24T15:10:24Z 2023-04-24T15:11:00Z MEMBER

Then you are somewhat deadlocked. mask_and_scale=False will also deactivate the Unsigned decoding.

You might be able to achieve what you want by using decode_cf=False (completely deactivating CF decoding). Then you would have to remove the _FillValue attribute as well as the missing_value attribute from the variables. Finally, you can run xr.decode_cf(ds) to correctly decode your data.

I'll add a code example tomorrow if no one beats me to it.
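
Until then, a rough sketch of that workaround (untested; the URL is the one from the report, and attribute handling is applied to all data variables):

```python
import xarray as xr

url = (
    "http://dap.ceda.ac.uk/thredds/dodsC/neodc/esacci/snow/data/scfv/MODIS/"
    "v2.0/2010/01/20100101-ESACCI-L3C_SNOW-SCFV-MODIS_TERRA-fv2.0.nc"
)
ds = xr.open_dataset(url, decode_cf=False)

# drop the attributes that would trigger masking (and with it the float32 cast)
for var in ds.data_vars.values():
    var.attrs.pop("_FillValue", None)
    var.attrs.pop("missing_value", None)

# _Unsigned is still handled, so the data comes back as uint8 instead of float32
ds = xr.decode_cf(ds)
```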

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1520277594 https://github.com/pydata/xarray/issues/7782#issuecomment-1520277594 https://api.github.com/repos/pydata/xarray/issues/7782 IC_kwDOAMm_X85anZha kmuehlbauer 5821660 2023-04-24T14:31:00Z 2023-04-24T14:31:00Z MEMBER

@Articoking

As both variables have a _FillValue attached, xarray converts these values to NaN, effectively casting to float32 in this case.

You might inspect the .encoding-property of the respective variables to get information of the source dtype.

You can deactivate the automatic conversion by adding kwarg mask_and_scale=False.

There is more information in the docs https://docs.xarray.dev/en/stable/user-guide/io.html
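
For example, a small sketch using the DAP URL from the report:

```python
import xarray as xr

url = "http://dap.ceda.ac.uk/thredds/dodsC/neodc/esacci/snow/data/scfv/MODIS/v2.0/2010/01/20100101-ESACCI-L3C_SNOW-SCFV-MODIS_TERRA-fv2.0.nc"

ds = xr.open_dataset(url)                              # default: _FillValue -> NaN, float32
print(ds["scfv"].dtype, ds["scfv"].encoding["dtype"])  # decoded dtype vs. on-disk dtype

ds_raw = xr.open_dataset(url, mask_and_scale=False)    # keep the packed integers
print(ds_raw["scfv"].dtype)
```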

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.open_dataset() reading ubyte variables as float32 from DAP server 1681353195
1516573065 https://github.com/pydata/xarray/pull/7771#issuecomment-1516573065 https://api.github.com/repos/pydata/xarray/issues/7771 IC_kwDOAMm_X85aZRGJ kmuehlbauer 5821660 2023-04-20T15:53:58Z 2023-04-20T15:53:58Z MEMBER

OK it seems this is ready for a first round of reviews.

A bit of added context:

Currently there is no dedicated function for checking CF standard conformance. The idea is to read as much as possible, including non-standard-conforming data files, but to restrict writing non-conforming files.

The implemented function ensure_scale_offset_conformance takes a strict keyword argument, which is True when encoding and False when decoding. If strict=True it will raise errors if there is a mismatch with the standard and when strict=False it will issue warnings.

I've only had to adapt a few tests which were not conforming to the standard on encoding to align with that. I've observed some warnings in the test suite which we might want to have a look into.

One idea would be to fix erroneous scale_factor/add_offset with our best fitting estimate. This is already done for list-type scale_factor/add_offset.

I will follow-up with checks for CFMaskCoder.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  implement scale_factor/add_offset CF conformance test, add and align tests 1676309093
1515146820 https://github.com/pydata/xarray/issues/7770#issuecomment-1515146820 https://api.github.com/repos/pydata/xarray/issues/7770 IC_kwDOAMm_X85aT05E kmuehlbauer 5821660 2023-04-19T17:59:00Z 2023-04-19T17:59:00Z MEMBER

It's also possible to use the custom BackendEntrypoint-class directly in the call to xr.open_dataset with the engine keyword.
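
A minimal sketch of that usage (the backend here is hypothetical and ignores the file contents):

```python
import xarray as xr


class MyBackendEntrypoint(xr.backends.BackendEntrypoint):
    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # a real backend would parse filename_or_obj here
        return xr.Dataset({"x": ("t", [1, 2, 3])})


# no entry-point registration needed: pass the class itself as engine
ds = xr.open_dataset("ignored-file", engine=MyBackendEntrypoint)
```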

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Provide a public API for adding new backends 1675299031
1514437541 https://github.com/pydata/xarray/issues/7767#issuecomment-1514437541 https://api.github.com/repos/pydata/xarray/issues/7767 IC_kwDOAMm_X85aRHul kmuehlbauer 5821660 2023-04-19T09:42:29Z 2023-04-19T09:42:29Z MEMBER

I think the equivalent incantation would be (note the different order of arguments in xr.where):

```python
da = xr.DataArray(np.arange(10))
print(xr.where(da < 5, da, 0).values)
print(da.where(da < 5, 0).values)
```

```
[0 1 2 3 4 0 0 0 0 0]
[0 1 2 3 4 0 0 0 0 0]
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Inconsistency between xr.where() and da.where() 1674532233
1501070685 https://github.com/pydata/xarray/issues/7742#issuecomment-1501070685 https://api.github.com/repos/pydata/xarray/issues/7742 IC_kwDOAMm_X85ZeIVd kmuehlbauer 5821660 2023-04-09T08:03:18Z 2023-04-09T08:03:18Z MEMBER

@ChristmasZCY Please have a look at the documentation about string encoding

https://docs.xarray.dev/en/stable/user-guide/io.html#string-encoding

Good chance that this gives you the needed information.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  About save char into netcdf  1659786592
1500333865 https://github.com/pydata/xarray/pull/7720#issuecomment-1500333865 https://api.github.com/repos/pydata/xarray/issues/7720 IC_kwDOAMm_X85ZbUcp kmuehlbauer 5821660 2023-04-07T14:21:02Z 2023-04-07T14:21:21Z MEMBER

Rebased on top of main after merge of #7719. This is ready for review. It's a one-liner actually :grin:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  preserve boolean dtype in encoding 1655000231
1498799474 https://github.com/pydata/xarray/issues/4826#issuecomment-1498799474 https://api.github.com/repos/pydata/xarray/issues/4826 IC_kwDOAMm_X85ZVd1y kmuehlbauer 5821660 2023-04-06T09:59:42Z 2023-04-06T09:59:42Z MEMBER

@JoerivanEngelen Thanks for taking the time. Much appreciated.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Reading and writing a zarr dataset multiple times casts bools to int8 789410367
1498794212 https://github.com/pydata/xarray/pull/7719#issuecomment-1498794212 https://api.github.com/repos/pydata/xarray/issues/7719 IC_kwDOAMm_X85ZVcjk kmuehlbauer 5821660 2023-04-06T09:55:25Z 2023-04-06T09:55:25Z MEMBER

This looks like it is ready to go. This will surely help further refactoring of encode_cf_variable/decode_cf_variable. At least while working on it I spotted several locations where inconsistencies can be ironed out. A neat, mostly flaw-free encoding/decoding is needed, especially with regard to #6323.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Implement more Variable Coders 1654988876
1498647087 https://github.com/pydata/xarray/issues/7723#issuecomment-1498647087 https://api.github.com/repos/pydata/xarray/issues/7723 IC_kwDOAMm_X85ZU4ov kmuehlbauer 5821660 2023-04-06T08:00:09Z 2023-04-06T08:00:09Z MEMBER

I'm still convinced this could be fixed for floating point data.

Generally its worse if we obey some default fill values but not others, because it becomes quite confusing to a user.

I think this depends on which side you look at it :-) My point here is, we do not have to submissively obey default fill values, but just use them when decoding. This only needs to happen if no _FillValue is attached to the variable. By doing this we ensure that these missing values are mapped to np.nan (as expected by users).

Later on we can just apply the xarray standard np.nan when writing out. We need to document that in that case an exact roundtrip isn't possible (it also isn't currently possible, as this example shows).

Consider this example:

```python
import numpy as np
import netCDF4 as nc
import xarray as xr

dtype = "f4"
with nc.Dataset("test-fillvalues-01.nc", mode="w") as ds:
    x = ds.createDimension("x", 10)
    test_fillval_fillon = ds.createVariable(
        "test_fillval_fillon", dtype, ("x",), fill_value=nc.default_fillvals[dtype]
    )
    test_fillval_fillon[:5] = np.array(
        [0.0, nc.default_fillvals[dtype], np.nan, 1.0, 8.0], dtype=dtype
    )
    test_nofillval_fillon = ds.createVariable(
        "test_nofillval_fillon", dtype, ("x",), fill_value=None
    )
    test_nofillval_fillon[:5] = np.array(
        [0.0, nc.default_fillvals[dtype], np.nan, 1.0, 8.0], dtype=dtype
    )

with nc.Dataset("test-fillvalues-01.nc") as ds:
    print("\n read with netCDF4-python")
    print("---------------------------")
    print(ds["test_fillval_fillon"])
    print(ds["test_fillval_fillon"][:])
    print(ds["test_nofillval_fillon"])
    print(ds["test_nofillval_fillon"][:])

with xr.open_dataset("test-fillvalues-01.nc").load() as ds:
    print("\n read with xarray")
    print("---------------------------")
    print(ds["test_fillval_fillon"])
    print(ds["test_fillval_fillon"][:])
    print(ds["test_nofillval_fillon"])
    print(ds["test_nofillval_fillon"][:])
```

```
 read with netCDF4-python
---------------------------
<class 'netCDF4._netCDF4.Variable'>
float32 test_fillval_fillon(x)
    _FillValue: 9.96921e+36
unlimited dimensions:
current shape = (10,)
filling on
[0.0 -- nan 1.0 8.0 -- -- -- -- --]
<class 'netCDF4._netCDF4.Variable'>
float32 test_nofillval_fillon(x)
unlimited dimensions:
current shape = (10,)
filling on, default _FillValue of 9.969209968386869e+36 used
[0.0 -- nan 1.0 8.0 -- -- -- -- --]

 read with xarray
---------------------------
<xarray.DataArray 'test_fillval_fillon' (x: 10)>
array([ 0., nan, nan,  1.,  8., nan, nan, nan, nan, nan], dtype=float32)
Dimensions without coordinates: x
<xarray.DataArray 'test_nofillval_fillon' (x: 10)>
array([0.00000e+00, 9.96921e+36,         nan, 1.00000e+00, 8.00000e+00,
       9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36],
      dtype=float32)
Dimensions without coordinates: x
```

The only difference between these two variables is that for the first the _FillValue is declared, and for the other the default _FillValue is used. So if xarray obeys (per CF standard) the first, it should also obey the second.

This might just work if in these cases the default fillvalue is decoded to np.nan, and np.nan is declared as the new _FillValue. Does that make sense?
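
Expressed as a manual post-read step, the proposal would roughly amount to this sketch (using the file and variable names from the example above):

```python
import netCDF4 as nc
import numpy as np
import xarray as xr

with xr.open_dataset("test-fillvalues-01.nc") as ds:
    var = ds["test_nofillval_fillon"]
    # decode the undeclared default fill value to NaN, as proposed above
    fixed = var.where(var != np.float32(nc.default_fillvals["f4"]))
    print(fixed.values)
```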

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  default fill_value not masked when read from file 1655569401
1498540636 https://github.com/pydata/xarray/pull/7719#issuecomment-1498540636 https://api.github.com/repos/pydata/xarray/issues/7719 IC_kwDOAMm_X85ZUepc kmuehlbauer 5821660 2023-04-06T06:07:50Z 2023-04-06T06:23:40Z MEMBER

Now, this is interesting! It looks like those _FillValue issues are following me. What changed so that this now materializes here, all of a sudden?

Update: Small change - big issue. Checked for fv_exists instead of not fv_exists :grimacing:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Implement more Variable Coders 1654988876
1498490570 https://github.com/pydata/xarray/issues/7722#issuecomment-1498490570 https://api.github.com/repos/pydata/xarray/issues/7722 IC_kwDOAMm_X85ZUSbK kmuehlbauer 5821660 2023-04-06T04:55:02Z 2023-04-06T04:55:02Z MEMBER

The recommendation is to use _FillValue if there is only one value describing missing/fillvalue.

https://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html#missing-data

It's also written that missing_value is

This attribute is not treated in any special way by the library or conforming generic applications, but is often useful documentation and may be used by specific applications.

https://docs.unidata.ucar.edu/netcdf-c/current/attribute_conventions.html

Not sure if xarray is a conforming generic application or a specific application.
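
In practice, a small sketch of that recommendation when writing from xarray (dataset and variable name "var" are placeholders):

```python
import numpy as np
import xarray as xr

# hypothetical dataset standing in for the one with conflicting attributes
ds = xr.Dataset({"var": ("x", np.array([1.0, np.nan, 3.0]))})

# keep a single fill value definition: declare _FillValue in encoding and make
# sure no conflicting missing_value is written
ds["var"].encoding["_FillValue"] = -9999.0
ds["var"].attrs.pop("missing_value", None)
ds["var"].encoding.pop("missing_value", None)
ds.to_netcdf("single-fillvalue.nc")
```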

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Conflicting _FillValue and missing_value on write 1655483374
1498464352 https://github.com/pydata/xarray/issues/7723#issuecomment-1498464352 https://api.github.com/repos/pydata/xarray/issues/7723 IC_kwDOAMm_X85ZUMBg kmuehlbauer 5821660 2023-04-06T04:09:11Z 2023-04-06T04:09:11Z MEMBER

@dcherian Great, a duplicate. :-( Sorry I must have overlooked that one.

It's somewhat counter-intuitive to get differing results when using netcdf4-python and xarray. Would be a good idea to document this behaviour.

It looks like it might at least be resolved for floating point source data:

Let's take the above simple example. We have np.nan written to the file, but the netcdf representation on disk uses a default (undeclared by attribute) _FillValue for unwritten parts.

For the netcdf4-python user the np.nan will not be masked, but the unfilled parts will be masked.

For xarray the default fillvalue won't be masked, appearing as valid data, which it is not. On subsequent writes np.nan will be introduced as the new fillvalue (by attribute), effectively changing the meaning of the default fillvalues.

Wouldn't it make sense then to transform these default fill values to np.nan on read too, instead of giving them a seemingly meaningful value? Maybe yet another keyword switch, use_default_fillvalues?

There should be at least a warning on read, in these situations, that there are undefined values in the dataset which were never written and which will not be masked.

If the dataset contains unwritten parts and a default fillvalue is used, in turn meaning the data creator did this on purpose (by not setting a _FillValue), it can mean several things:

  • The creator's data does not actually have missing values which need declaring, but it means that their data will get masked for default fillvalue entries (maybe they don't know about this, but that might be unlikely).
  • The creator doesn't care at all, with same conclusion as above.
  • The creator purposefully uses the default fillvalue as missing value, as a means of saving disk space. But this could also be done by just defining that as the _FillValue attribute at creation time, if I'm not mistaken.

I'm still convinced this could be fixed for floating point data.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  default fill_value not masked when read from file 1655569401
1497971459 https://github.com/pydata/xarray/issues/4826#issuecomment-1497971459 https://api.github.com/repos/pydata/xarray/issues/4826 IC_kwDOAMm_X85ZSTsD kmuehlbauer 5821660 2023-04-05T18:56:23Z 2023-04-05T18:56:23Z MEMBER

Please check #7720 if that fixes the conversion problems. Thanks.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Reading and writing a zarr dataset multiple times casts bools to int8 789410367
1497866542 https://github.com/pydata/xarray/issues/7573#issuecomment-1497866542 https://api.github.com/repos/pydata/xarray/issues/7573 IC_kwDOAMm_X85ZR6Eu kmuehlbauer 5821660 2023-04-05T17:31:05Z 2023-04-05T17:31:05Z MEMBER

If it helps to minimize interoperability issues I'm all in for the change. One thing I would maybe do is wait for the next version. With the current PR we would end up with two different build numbers with differing behaviour, which might confuse folks.

But I'd rely on @ocefpaf's expertise.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add optional min versions to conda-forge recipe (`run_constrained`) 1603957501
1496973403 https://github.com/pydata/xarray/pull/7654#issuecomment-1496973403 https://api.github.com/repos/pydata/xarray/issues/7654 IC_kwDOAMm_X85ZOgBb kmuehlbauer 5821660 2023-04-05T06:15:58Z 2023-04-05T06:15:58Z MEMBER

As explained, I've created two PRs (#7719 and #7720) for the "easy" changes from this PR. Would be great if those could go in fast. Thanks!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  cf-coding 1633623916
1496950962 https://github.com/pydata/xarray/pull/7654#issuecomment-1496950962 https://api.github.com/repos/pydata/xarray/issues/7654 IC_kwDOAMm_X85ZOaiy kmuehlbauer 5821660 2023-04-05T05:46:15Z 2023-04-05T05:46:15Z MEMBER

@dcherian Just a heads-up: I find this PR is getting more and more involved, touching different parts of the machinery, and hard to follow for reviewers. I'll split this up and start with the more or less undisputed changes.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  cf-coding 1633623916
1496044623 https://github.com/pydata/xarray/pull/7654#issuecomment-1496044623 https://api.github.com/repos/pydata/xarray/issues/7654 IC_kwDOAMm_X85ZK9RP kmuehlbauer 5821660 2023-04-04T14:10:33Z 2023-04-04T14:10:33Z MEMBER

Still hunting for corner cases and issues inside encode_cf_variable/decode_cf_variable.

It looks like I can already see some light again. Not sure if this is the last iteration, but the test suite is still running green with added and enhanced tests, which is not that bad.

Unfortunately https://github.com/pydata/xarray/issues/2304 is still an issue for now. I'll clarify that later with an added test.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  cf-coding 1633623916
1493930592 https://github.com/pydata/xarray/pull/7654#issuecomment-1493930592 https://api.github.com/repos/pydata/xarray/issues/7654 IC_kwDOAMm_X85ZC5Jg kmuehlbauer 5821660 2023-04-03T08:53:17Z 2023-04-03T08:53:17Z MEMBER

While trying to create a test which specifically tests _choose_float_dtype I've found some issues with checking for availability of scale_factor/add_offset. Now testing for None.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  cf-coding 1633623916
1493296175 https://github.com/pydata/xarray/pull/7654#issuecomment-1493296175 https://api.github.com/repos/pydata/xarray/issues/7654 IC_kwDOAMm_X85ZAeQv kmuehlbauer 5821660 2023-04-02T10:47:21Z 2023-04-02T10:47:21Z MEMBER

This is now ready for another round of reviews, @dcherian, @Illviljan and @mankoff.

As @mankoff already pointed out, xarray is very generous in trying to encode/decode non-CF-conforming data. This makes things a bit complicated, as some issues only surface in rare corner cases.

I've tried to be as explicit as possible in _choose_float_dtype, and also added comments/tests where needed.

I'm finding the typing a bit hard. It seems that mypy can't derive the correct types from return types in certain cases.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  cf-coding 1633623916
1493127898 https://github.com/pydata/xarray/pull/7654#issuecomment-1493127898 https://api.github.com/repos/pydata/xarray/issues/7654 IC_kwDOAMm_X85Y_1La kmuehlbauer 5821660 2023-04-01T21:23:40Z 2023-04-01T21:23:40Z MEMBER

If at first you don't succeed... It looks like we have something working here.

Some more typing and maybe some more tests covering the cases where scale_factor/add_offset/_FillValue do not conform to CF, and we should be good to go. Or am I missing something?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  cf-coding 1633623916
1493084805 https://github.com/pydata/xarray/pull/7654#issuecomment-1493084805 https://api.github.com/repos/pydata/xarray/issues/7654 IC_kwDOAMm_X85Y_qqF kmuehlbauer 5821660 2023-04-01T19:34:18Z 2023-04-01T19:34:18Z MEMBER

The latest changes break #1840 again. We have two contradicting forces here which need to be aligned.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  cf-coding 1633623916
1492937244 https://github.com/pydata/xarray/issues/5597#issuecomment-1492937244 https://api.github.com/repos/pydata/xarray/issues/5597 IC_kwDOAMm_X85Y_Goc kmuehlbauer 5821660 2023-04-01T11:03:02Z 2023-04-01T11:03:02Z MEMBER

To fix this, I think logic in _choose_float_dtype should be updated to look at encoding['dtype'] (if available) instead of dtype, in order to understand how the data was originally stored.

This is being addressed in #7654

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decoding netCDF is giving incorrect values for a large file 942738904
1492895855 https://github.com/pydata/xarray/pull/7654#issuecomment-1492895855 https://api.github.com/repos/pydata/xarray/issues/7654 IC_kwDOAMm_X85Y-8hv kmuehlbauer 5821660 2023-04-01T09:48:57Z 2023-04-01T09:48:57Z MEMBER

@Illviljan I'm not able to figure out the typing if I want to use data types as functions to convert Python numbers to array scalars. If you have any suggestions on how to fix this, please let me know.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  cf-coding 1633623916
1492880874 https://github.com/pydata/xarray/pull/7654#issuecomment-1492880874 https://api.github.com/repos/pydata/xarray/issues/7654 IC_kwDOAMm_X85Y-43q kmuehlbauer 5821660 2023-04-01T08:46:49Z 2023-04-01T09:28:16Z MEMBER

@dcherian @Illviljan Thanks for the first round of review. I've rebased everything on latest main. Now the code moved from conventions.py to coding/variables.py is correct. I've also removed the functions which have been converted to VariableCoders and adapted the tests.

To sum up this PR, it does:

  • convert functions to VariableCoders along @shoyer's TODO: https://github.com/pydata/xarray/blob/1c81162755457b3f4dc1f551f0321c75ec9daf6c/xarray/conventions.py#L298-L302 https://github.com/pydata/xarray/blob/1c81162755457b3f4dc1f551f0321c75ec9daf6c/xarray/conventions.py#L393-L405
  • preserve boolean dtype within encoding: https://github.com/pydata/xarray/issues/7652#issuecomment-1476956975
  • determine CF packed dtype from scale_factor/add_offset

#7691, #2304

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  cf-coding 1633623916
1492078304 https://github.com/pydata/xarray/issues/7691#issuecomment-1492078304 https://api.github.com/repos/pydata/xarray/issues/7691 IC_kwDOAMm_X85Y707g kmuehlbauer 5821660 2023-03-31T15:05:17Z 2023-03-31T15:05:17Z MEMBER

The PR seems to solve my specific issue without changing the encoding

Great, thanks for testing.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `nan` values appearing when saving and loading from `netCDF` due to encoding 1643408278
1491915288 https://github.com/pydata/xarray/issues/7691#issuecomment-1491915288 https://api.github.com/repos/pydata/xarray/issues/7691 IC_kwDOAMm_X85Y7NIY kmuehlbauer 5821660 2023-03-31T13:19:01Z 2023-03-31T13:19:01Z MEMBER

@euronion There is a potential fix for your issue in #7654. It would be great, if you could have a closer look and test against that PR.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `nan` values appearing when saving and loading from `netCDF` due to encoding 1643408278
1491760266 https://github.com/pydata/xarray/pull/7654#issuecomment-1491760266 https://api.github.com/repos/pydata/xarray/issues/7654 IC_kwDOAMm_X85Y6nSK kmuehlbauer 5821660 2023-03-31T11:13:49Z 2023-03-31T11:13:49Z MEMBER

@dcherian @basnijholt

After the dev-meeting I've taken a step back and first implemented the coders as mentioned in @shoyer's ToDo.

I've fixed the one bool->int issue and it now derives the dtype for ScaleOffset coding from scale_factor/add_offset.

I've improved some tests with regard to the scale/offset issue.

I'll concentrate on the string fillvalue issues in a follow up PR.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  cf-coding 1633623916
1486870845 https://github.com/pydata/xarray/issues/7691#issuecomment-1486870845 https://api.github.com/repos/pydata/xarray/issues/7691 IC_kwDOAMm_X85Yn9k9 kmuehlbauer 5821660 2023-03-28T13:16:31Z 2023-03-28T13:31:46Z MEMBER

MCVE:

```python
import numpy as np
import netCDF4 as nc
import xarray as xr

fname = "test-7691.nc"
with nc.Dataset(fname, "w") as ds0:
    ds0.createDimension("t", 5)
    ds0.createVariable("x", "int16", ("t",), fill_value=-32767)
    v = ds0.variables["x"]
    v.set_auto_maskandscale(False)
    v.add_offset = 278.297319296597
    v.scale_factor = 1.16753614203674e-05
    v[:] = np.array([-32768, -32767, -32766, 32767, 0])

with nc.Dataset(fname) as ds1:
    x1 = ds1["x"][:]
    print("netCDF4-python:", x1.dtype, x1)

with xr.open_dataset(fname) as ds2:
    x2 = ds2["x"].values
    ds2.to_netcdf("test-7691-01.nc")
    print("xarray first read:", x2.dtype, x2)

with xr.open_dataset("test-7691-01.nc") as ds3:
    x3 = ds3["x"].values
    print("xarray roundtrip:", x3.dtype, x3)
```

```
netCDF4-python: float64 [277.9147410535744 -- 277.9147644042972 278.67988586425815 278.297319296597]
xarray first read: float32 [277.91476       nan 277.91476 278.6799  278.29733]
xarray roundtrip: float32 [      nan       nan       nan 278.6799  278.29733]
```

I've confirmed that correctly promoting to float64 in CFMaskCoder solves this issue.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `nan` values appearing when saving and loading from `netCDF` due to encoding 1643408278
1486817329 https://github.com/pydata/xarray/issues/7691#issuecomment-1486817329 https://api.github.com/repos/pydata/xarray/issues/7691 IC_kwDOAMm_X85Ynwgx kmuehlbauer 5821660 2023-03-28T12:41:43Z 2023-03-28T12:41:43Z MEMBER

As this doesn't surface that often, it might just happen here by accident. If the _FillValue/missing_value were -32768, the issue would not manifest.

So for NetCDF the default fillvalue for NC_SHORT (int16) is -32767. That means the promotion to float32 instead of the needed float64 is the problem here (floating point precision).
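
A small illustration of that precision argument (scale_factor/add_offset taken from the MCVE further down in this thread):

```python
import numpy as np

packed = np.array([-32768, -32767, -32766], dtype="int16")
scale, offset = 1.16753614203674e-05, 278.297319296597

f32 = packed.astype("float32") * np.float32(scale) + np.float32(offset)
f64 = packed.astype("float64") * scale + offset

print(f32)  # neighbouring packed values collapse onto the same float32 (ulp ~3e-05 near 278)
print(f64)  # float64 keeps them distinct
```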

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `nan` values appearing when saving and loading from `netCDF` due to encoding 1643408278

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);