Issue #7868: `open_dataset` with `chunks="auto"` fails when a netCDF4 variable/coordinate is encoded as `NC_STRING`

pydata/xarray · State: closed (completed) · Comments: 8 · Created: 2023-05-23 · Closed: 2023-11-17

What is your issue?

I noticed that `open_dataset` with `chunks="auto"` fails when netCDF4 variables/coordinates are encoded as `NC_STRING`. The reason is that xarray reads netCDF4 `NC_STRING` as the object dtype, and dask cannot estimate the size of an object-dtype element.
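The size-estimation problem can be seen with plain NumPy, independently of xarray/dask: an object-dtype array stores pointers to Python objects, so its `itemsize` says nothing about the strings themselves, while a fixed-length byte-string dtype has a known per-element size (a minimal sketch):

```python
import numpy as np

# Object-dtype array: each element is a pointer to a Python str object
arr_obj = np.array(["M6", "M3"], dtype=object)
# Fixed-length byte-string array: each element occupies exactly 2 bytes
arr_fixed = np.array([b"M6", b"M3"], dtype="S2")

print(arr_obj.dtype.itemsize)    # 8 on 64-bit builds: the pointer size, not the string size
print(arr_fixed.dtype.itemsize)  # 2: the real bytes per element
```

This is why dask's `"auto"` chunking can pick chunk sizes for `S2` data but has no way to estimate bytes per element for object-dtype data.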

As a workaround, the user must currently rewrite the netCDF4 file and set the string DataArray's encoding to a fixed-length string type (e.g. `"S2"` if the maximum string length is 2), so that the data are written as `NC_CHAR` and xarray reads them back as a fixed-length byte-string dtype.

Below is a reproducible example:

```python
import xarray as xr
import numpy as np

# Define string DataArray
arr = np.array(["M6", "M3"], dtype=str)
print(arr.dtype)  # <U2
da = xr.DataArray(data=arr, dims=("time",))
data_vars = {"str_arr": da}

# Create dataset
ds_nc_string = xr.Dataset(data_vars=data_vars)

# Set chunking to see behaviour at read-time
ds_nc_string["str_arr"] = ds_nc_string["str_arr"].chunk(1)  # chunks ((1, 1),)

# Write dataset with NC_STRING
ds_nc_string["str_arr"].encoding["dtype"] = str
ds_nc_string.to_netcdf("/tmp/nc_string.nc")

# Write dataset with NC_CHAR
ds_nc_char = xr.Dataset(data_vars=data_vars)
ds_nc_char["str_arr"].encoding["dtype"] = "S2"
ds_nc_char.to_netcdf("/tmp/nc_char.nc")

# With NC_STRING, chunks="auto" does not work:
# --> NC_STRING is read as object, and dask cannot estimate the chunk size!
# With chunks={}, it reads the NC_STRING array in a single dask chunk!
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks="auto")  # NotImplementedError
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks={})      # Works
ds_nc_string.chunks  # chunks (2,)

# With NC_CHAR, chunks={} and chunks="auto" both work and return the same result
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks={})
ds_nc_char.chunks  # chunks (2,)
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks="auto")
ds_nc_char.chunks  # chunks (2,)

# NC_STRING is read back as object
ds_nc_string = xr.open_dataset("/tmp/nc_string.nc", chunks=None)
ds_nc_string["str_arr"].dtype  # object

# NC_CHAR is read back as a fixed-length byte-string (S2)
ds_nc_char = xr.open_dataset("/tmp/nc_char.nc", chunks=None)
ds_nc_char["str_arr"].dtype  # S2
ds_nc_char["str_arr"].data.astype(str)  # <U2
```

Questions:

  • Shouldn't `open_dataset` automatically deserialize the `NC_CHAR` fixed-length byte-string representation into a Unicode string?
  • Shouldn't `open_dataset` automatically read `NC_STRING` as a Unicode string (converting object to str)?
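On the first question, the conversion being asked about can be sketched with plain NumPy (a sketch of the dtype conversion only, not xarray's actual decoding code path):

```python
import numpy as np

# What the NC_CHAR variable comes back as: fixed-length byte strings
b = np.array([b"M6", b"M3"], dtype="S2")

# Decoding to fixed-length Unicode strings
u = b.astype(str)                # dtype <U2
u2 = np.char.decode(b, "utf-8")  # same result, explicit about the codec

print(u.dtype, u2.dtype)  # <U2 <U2
```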

Related issues:

  • https://github.com/pydata/xarray/issues/7652
  • https://github.com/pydata/xarray/issues/2059
  • https://github.com/pydata/xarray/pull/7654
  • https://github.com/pydata/xarray/issues/2040

