issues: 1722417436

id node_id number title user state locked assignee milestone comments created_at updated_at closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1722417436 I_kwDOAMm_X85mqgEc 7868 `open_dataset` with `chunks="auto"` fails when a netCDF4 variables/coordinates is encoded as `NC_STRING` 19285200 closed 0     8 2023-05-23T16:23:07Z 2023-11-17T15:26:01Z 2023-11-17T15:26:01Z NONE      

What is your issue?

I noticed that open_dataset with chunks="auto" fails when netCDF4 variables or coordinates are encoded as NC_STRING. The reason is that xarray reads netCDF4 NC_STRING data as object dtype, and dask cannot estimate the size in bytes of an object-dtype array.
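To illustrate why an object dtype gives no usable size information, here is a minimal NumPy-only sketch (it does not call xarray or dask, and the array contents are made up for illustration). An object array only stores references, so its reported byte size is the same whether the strings are short or very long, whereas a fixed-length dtype has a well-defined size per element.

```python
import numpy as np

# Fixed-length byte strings: the size per element is known from the dtype.
fixed = np.array(["M6", "M3"], dtype="S2")

# Object arrays hold references to Python objects, not the string data.
short_obj = np.array(["M6", "M3"], dtype=object)
long_obj = np.array(["M6" * 1000, "M3" * 1000], dtype=object)

print(fixed.nbytes)                         # 2 elements * 2 bytes each
print(short_obj.nbytes == long_obj.nbytes)  # only the references are counted
print(short_obj.dtype.hasobject)            # flags a dtype containing Python objects
```

Since the reported size does not reflect the real memory footprint of the strings, a byte-based automatic chunk size cannot be derived from it.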

As a workaround, the user must currently rewrite the netCDF4 file and set the encoding of each string DataArray to a fixed-length string type (e.g. "S2" if the maximum string length is 2), so that the data are written as NC_CHAR and xarray reads them back as a fixed-length byte-string type.
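Choosing the fixed-length type by hand gets tedious with many string variables. The sketch below shows one way the width could be derived from the data. The helper name `fixed_length_dtype` is hypothetical (it is not part of xarray), and it assumes that the UTF-8 encoded byte length is the relevant width.

```python
import numpy as np


def fixed_length_dtype(values):
    """Hypothetical helper: return an "S<n>" dtype string wide enough
    for every string in `values` (width measured in UTF-8 bytes)."""
    encoded = [str(v).encode("utf-8") for v in np.asarray(values).ravel()]
    max_len = max((len(b) for b in encoded), default=1)
    # NumPy byte-string dtypes need a width of at least 1.
    return f"S{max(max_len, 1)}"


arr = np.array(["M6", "M3"], dtype=str)
encoding_dtype = fixed_length_dtype(arr)
print(encoding_dtype)
```

The result could then be assigned to the variable's encoding["dtype"] before calling to_netcdf, in the same way "S2" is set by hand for the NC_CHAR dataset in the example in this issue.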

A reproducible example follows:

```python
import xarray as xr
import numpy as np

# Define string DataArray
arr = np.array(["M6", "M3"], dtype=str)
print(arr.dtype)  # <U2
da = xr.DataArray(data=arr, dims=("time",))
data_vars = {"str_arr": da}

# Create dataset
ds_nc_string = xr.Dataset(data_vars=data_vars)

# Set chunking to see behaviour at read-time
ds_nc_string["str_arr"] = ds_nc_string["str_arr"].chunk(1)  # chunks ((1,1),)

# Write dataset with NC_STRING
ds_nc_string["str_arr"].encoding["dtype"] = str
ds_nc_string.to_netcdf("/tmp/nc_string.nc")

# Write dataset with NC_CHAR
ds_nc_char = xr.Dataset(data_vars=data_vars)
ds_nc_char["str_arr"].encoding["dtype"] = "S2"
ds_nc_char.to_netcdf("/tmp/nc_char.nc")

# With NC_STRING, chunks="auto" does not work:
# --> NC_STRING is read as object, and dask cannot estimate the chunk size!
# With chunks={}, the NC_STRING array is read in a single dask chunk.
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks="auto")  # NotImplementedError
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks={})  # Works
ds_nc_string.chunks  # chunks (2,)

# With NC_CHAR, chunks={} and chunks="auto" both work and return the same result!
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks={})
ds_nc_char.chunks  # chunks (2,)
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks="auto")
ds_nc_char.chunks  # chunks (2,)

# NC_STRING is read back as object
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks=None)
ds_nc_string["str_arr"].dtype  # object

# NC_CHAR is read back as a fixed-length byte-string representation (S2)
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks=None)
ds_nc_char["str_arr"].dtype  # S2
ds_nc_char["str_arr"].data.astype(str)  # U2
```

Questions:

- Shouldn't open_dataset automatically deserialize the NC_CHAR fixed-length byte-string representation into a Unicode string?
- Shouldn't open_dataset automatically read NC_STRING as a Unicode string (converting object to str)?
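For reference, the two conversions asked about can be sketched on plain NumPy arrays. This only imitates what the values look like after reading (the arrays are built by hand and UTF-8 is assumed); it is not what open_dataset does today.

```python
import numpy as np

# Stand-ins for the values as read back from the two files.
obj_arr = np.array(["M6", "M3"], dtype=object)     # like NC_STRING (object)
bytes_arr = np.array([b"M6", b"M3"], dtype="S2")   # like NC_CHAR (S2)

# object -> fixed-width Unicode
from_obj = obj_arr.astype(str)

# fixed-length bytes -> fixed-width Unicode, assuming UTF-8 encoded data
from_bytes = np.char.decode(bytes_arr, "utf-8")

print(from_obj.dtype, from_bytes.dtype)
```

Both results are fixed-width Unicode arrays, which have a known size per element, unlike the object array.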

Related issues are:

- https://github.com/pydata/xarray/issues/7652
- https://github.com/pydata/xarray/issues/2059
- https://github.com/pydata/xarray/pull/7654
- https://github.com/pydata/xarray/issues/2040

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7868/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed 13221727 issue
