issue_comments: 1477447875

html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/issues/7652#issuecomment-1477447875	https://api.github.com/repos/pydata/xarray/issues/7652	1477447875	IC_kwDOAMm_X85YEBDD	5821660	2023-03-21T08:37:48Z	2023-03-21T08:37:48Z	MEMBER	@basnijholt The string issue is more of a netCDF/numpy issue with VLEN types. XRef: https://unidata.github.io/netcdf4-python/#dealing-with-strings

> The most flexible way to store arrays of strings is with the Variable-length (vlen) string data type. However, this requires the use of the NETCDF4 data model, and the vlen type does not map very well numpy arrays (you have to use numpy arrays of dtype=object, which are arrays of arbitrary python objects).

Without an explicit dtype, numpy creates a unicode array (`<U1` in your case), and that gets written as a VLEN string array.

At least the netCDF4 and h5netcdf backends are consistent in writing (they create similar HDF5 files) and in reading back (object dtype):

plain netCDF4

```python
import netCDF4 as nc
import numpy as np

data = np.array([["a", "b"], ["c", "d"]], dtype="<U1")
print(f"source dtype: {data.dtype.str}\n")
auto = False
with nc.Dataset("test-plain-netcdf4.nc", mode="w") as ds:
    print("Write NC-File")
    ds.set_auto_maskandscale(auto)
    ds.set_auto_chartostring(auto)
    ds.createDimension("x", size=2)
    ds.createDimension("y", size=2)
    var = ds.createVariable("da", data.dtype.str, dimensions=("x", "y"))
    var[:] = data
    print("Variable\n")
    print(var)
    print(var.dtype)
    print("\nContents\n")
    print(var[:])
    print(var[:].dtype)

with nc.Dataset("test-plain-netcdf4.nc") as ds:
    print("\nRead NC-File")
    ds.set_auto_maskandscale(auto)
    ds.set_auto_chartostring(auto)
    da = ds["da"]
    print("Variable\n")
    print(da)
    print(da.dtype)
    da = ds["da"][:]
    print("\nContents\n")
    print(da)
    print(da.dtype)
```

```python
source dtype: <U1

Write NC-File
Variable

<class 'netCDF4._netCDF4.Variable'>
vlen da(x, y)
vlen data type: <class 'str'>
unlimited dimensions:
current shape = (2, 2)
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object

Read NC-File
Variable

<class 'netCDF4._netCDF4.Variable'>
vlen da(x, y)
vlen data type: <class 'str'>
unlimited dimensions:
current shape = (2, 2)
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object
```

```bash
netcdf test-plain-netcdf4 {
dimensions:
    x = 2 ;
    y = 2 ;
variables:
    string da(x, y) ;
data:

 da =
  "a", "b",
  "c", "d" ;
}
HDF5 "test-plain-netcdf4.nc" {
DATASET "da" {
   DATATYPE H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
   DATA {
   (0,0): "a", "b",
   (1,0): "c", "d"
   }
   ATTRIBUTE "DIMENSION_LIST" {
      DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
      DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): (), ()
      }
   }
   ATTRIBUTE "_Netcdf4Coordinates" {
      DATATYPE H5T_STD_I32LE
      DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): 0, 1
      }
   }
}
}
```

plain h5netcdf

```python
import h5netcdf.legacyapi as h5nc
import h5py
import numpy as np

data = np.array([["a", "b"], ["c", "d"]], dtype="<U1")
print(f"source dtype: {data.dtype.str}\n")
with h5nc.Dataset("test-plain-h5netcdf.nc", mode="w") as ds:
    print("Write NC-File")
    ds.createDimension("x", 2)
    ds.createDimension("y", 2)
    dtype = h5py.string_dtype()
    print("Source dtype:", dtype)
    var = ds.createVariable("da", dtype, dimensions=("x", "y"))
    var[:] = data
    print("Variable\n")
    print(var)
    print(var.dtype)
    print("\nContents\n")
    print(var[:])
    print(var[:].dtype)

with h5nc.Dataset("test-plain-h5netcdf.nc") as ds:
    print("\nRead NC-File")
    da = ds["da"]
    print("Variable\n")
    print(da)
    print(da.dtype)
    da = ds["da"][:]
    print("\nContents\n")
    print(da)
    print(da.dtype)
```

```python
source dtype: <U1

Write NC-File
Source dtype: object
Variable

<h5netcdf.legacyapi.Variable '/da': dimensions ('x', 'y'), shape (2, 2), dtype <class 'str'>>
Attributes:
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object

Read NC-File
Variable

<h5netcdf.legacyapi.Variable '/da': dimensions ('x', 'y'), shape (2, 2), dtype <class 'str'>>
Attributes:
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object
```

```bash
netcdf test-plain-h5netcdf {
dimensions:
    x = 2 ;
    y = 2 ;
variables:
    string da(x, y) ;
data:

 da =
  "a", "b",
  "c", "d" ;
}
HDF5 "test-plain-h5netcdf.nc" {
DATASET "da" {
   DATATYPE H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
   DATA {
   (0,0): "a", "b",
   (1,0): "c", "d"
   }
   ATTRIBUTE "DIMENSION_LIST" {
      DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
      DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): (), ()
      }
   }
   ATTRIBUTE "_Netcdf4Coordinates" {
      DATATYPE H5T_STD_I32LE
      DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): 0, 1
      }
   }
   ATTRIBUTE "_Netcdf4Dimid" {
      DATATYPE H5T_STD_I32LE
      DATASPACE SCALAR
      DATA {
      (0): 0
      }
   }
}
}
```

Both get written out as: `DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_UTF8; CTYPE H5T_C_S1; }`

If you use fixed-length strings (e.g. `|S1`), the dtype is preserved during the roundtrip:

```python
import numpy as np
import xarray as xr

# Make an xarray with an array of fixed-length strings
data = np.array([["a", "b"], ["c", "d"]], dtype="|S1")
da = xr.DataArray(
    data=data,
    dims=["x", "y"],
    coords={"x": [0, 1], "y": [0, 1]},
)
da.to_netcdf("test.nc", mode='w')

# Load the xarray back in
da_loaded = xr.load_dataarray("test.nc")
assert da.dtype == da_loaded.dtype, "Dtypes don't match"
```

Versions

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.21-150400.24.46-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.1

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.1
netCDF4: 1.6.3
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.6
cfgrib: None
iris: None
bottleneck: None
dask: 2023.3.1
distributed: 2023.3.1
matplotlib: 3.7.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.3.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.6.0
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: 8.11.0
sphinx: None
```
	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		1632718954
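The comment above shows that VLEN strings come back as object dtype. If a fixed-width unicode dtype is wanted after reading, one option is to cast the loaded array. This is a minimal numpy-only sketch, not something the comment itself proposes: the `loaded` array is a stand-in (an assumption) for what the backends return, and no netCDF file is involved.

```python
import numpy as np

# Stand-in for the array read back from a VLEN string variable (object dtype).
loaded = np.array([["a", "b"], ["c", "d"]], dtype=object)

# Casting to str lets numpy pick the smallest fixed-width unicode dtype
# that fits the longest element.
restored = loaded.astype(str)

print(loaded.dtype)    # object
print(restored.dtype)  # <U1
```

The width is chosen from the data, so the restored dtype only matches the source dtype if the longest string in the array is as long as the original item size.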