| field | value |
| --- | --- |
| html_url | https://github.com/pydata/xarray/issues/7652#issuecomment-1477447875 |
| issue_url | https://api.github.com/repos/pydata/xarray/issues/7652 |
| id | 1477447875 |
| node_id | IC_kwDOAMm_X85YEBDD |
| user | 5821660 |
| created_at | 2023-03-21T08:37:48Z |
| updated_at | 2023-03-21T08:37:48Z |
| author_association | MEMBER |

@basnijholt The string issue comes down to how netCDF and numpy handle VLEN types.

XRef: https://unidata.github.io/netcdf4-python/#dealing-with-strings

> The most flexible way to store arrays of strings is with the Variable-length (vlen) string data type. However, this requires the use of the NETCDF4 data model, and the vlen type does not map very well numpy arrays (you have to use numpy arrays of dtype=object, which are arrays of arbitrary python objects).

If no dtype is given, numpy creates a unicode array (`<U1` in your case), and that array gets written out as a VLEN string variable.
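To make the dtype distinction concrete, here is a minimal numpy-only sketch. The variable names are illustrative and nothing is written to disk; it only shows which dtype kind each construction produces.

```python
import numpy as np

# Without an explicit dtype, numpy infers a fixed-width unicode dtype (kind "U").
inferred = np.array([["a", "b"], ["c", "d"]])

# Fixed-length byte strings (kind "S") have to be requested explicitly.
fixed_bytes = np.array([["a", "b"], ["c", "d"]], dtype="|S1")

# VLEN strings map to object arrays holding Python str instances (kind "O").
as_object = inferred.astype(object)

for name, arr in [("inferred", inferred), ("fixed_bytes", fixed_bytes), ("as_object", as_object)]:
    print(f"{name}: kind={arr.dtype.kind}")
```

Only the `"S"` kind corresponds to the fixed-length case; the other two are where the unicode/object mismatch comes from.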

At least the netCDF4 and h5netcdf backends are consistent, both in writing (they create similar HDF5 files) and in reading back (object dtype):

plain netCDF4

```python
import netCDF4 as nc
import numpy as np

data = np.array([["a", "b"], ["c", "d"]], dtype="<U1")
print(f"source dtype: {data.dtype.str}\n")

auto = False
with nc.Dataset("test-plain-netcdf4.nc", mode="w") as ds:
    print("Write NC-File")
    ds.set_auto_maskandscale(auto)
    ds.set_auto_chartostring(auto)
    ds.createDimension("x", size=2)
    ds.createDimension("y", size=2)

    var = ds.createVariable("da", data.dtype.str, dimensions=("x", "y"))
    var[:] = data
    print("Variable\n")
    print(var)
    print(var.dtype)
    print("\nContents\n")
    print(var[:])
    print(var[:].dtype)

with nc.Dataset("test-plain-netcdf4.nc") as ds:
    print("\nRead NC-File")
    ds.set_auto_maskandscale(auto)
    ds.set_auto_chartostring(auto)
    da = ds["da"]
    print("Variable\n")
    print(da)
    print(da.dtype)
    da = ds["da"][:]
    print("\nContents\n")
    print(da)
    print(da.dtype)
```

```python
source dtype: <U1

Write NC-File
Variable

<class 'netCDF4._netCDF4.Variable'>
vlen da(x, y)
vlen data type: <class 'str'>
unlimited dimensions:
current shape = (2, 2)
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object

Read NC-File
Variable

<class 'netCDF4._netCDF4.Variable'>
vlen da(x, y)
vlen data type: <class 'str'>
unlimited dimensions:
current shape = (2, 2)
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object
```

```bash
netcdf test-plain-netcdf4 {
dimensions:
    x = 2 ;
    y = 2 ;
variables:
    string da(x, y) ;
data:

 da =
  "a", "b",
  "c", "d" ;
}
HDF5 "test-plain-netcdf4.nc" {
DATASET "da" {
   DATATYPE H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
   DATA {
   (0,0): "a", "b",
   (1,0): "c", "d"
   }
   ATTRIBUTE "DIMENSION_LIST" {
      DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
      DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): (), ()
      }
   }
   ATTRIBUTE "_Netcdf4Coordinates" {
      DATATYPE H5T_STD_I32LE
      DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): 0, 1
      }
   }
}
}
```
plain h5netcdf

```python
import h5netcdf.legacyapi as h5nc
import h5py
import numpy as np

data = np.array([["a", "b"], ["c", "d"]], dtype="<U1")
print(f"source dtype: {data.dtype.str}\n")

with h5nc.Dataset("test-plain-h5netcdf.nc", mode="w") as ds:
    print("Write NC-File")
    ds.createDimension("x", 2)
    ds.createDimension("y", 2)

    # h5py's variable-length string dtype (reported by numpy as object)
    dtype = h5py.string_dtype()
    print("Source dtype:", dtype)
    var = ds.createVariable("da", dtype, dimensions=("x", "y"))
    var[:] = data
    print("Variable\n")
    print(var)
    print(var.dtype)
    print("\nContents\n")
    print(var[:])
    print(var[:].dtype)

with h5nc.Dataset("test-plain-h5netcdf.nc") as ds:
    print("\nRead NC-File")
    da = ds["da"]
    print("Variable\n")
    print(da)
    print(da.dtype)
    da = ds["da"][:]
    print("\nContents\n")
    print(da)
    print(da.dtype)
```

```python
source dtype: <U1

Write NC-File
Source dtype: object
Variable

<h5netcdf.legacyapi.Variable '/da': dimensions ('x', 'y'), shape (2, 2), dtype <class 'str'>>
Attributes:
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object

Read NC-File
Variable

<h5netcdf.legacyapi.Variable '/da': dimensions ('x', 'y'), shape (2, 2), dtype <class 'str'>>
Attributes:
<class 'str'>

Contents

[['a' 'b']
 ['c' 'd']]
object
```

```bash
netcdf test-plain-h5netcdf {
dimensions:
    x = 2 ;
    y = 2 ;
variables:
    string da(x, y) ;
data:

 da =
  "a", "b",
  "c", "d" ;
}
HDF5 "test-plain-h5netcdf.nc" {
DATASET "da" {
   DATATYPE H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
   DATA {
   (0,0): "a", "b",
   (1,0): "c", "d"
   }
   ATTRIBUTE "DIMENSION_LIST" {
      DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
      DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): (), ()
      }
   }
   ATTRIBUTE "_Netcdf4Coordinates" {
      DATATYPE H5T_STD_I32LE
      DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): 0, 1
      }
   }
   ATTRIBUTE "_Netcdf4Dimid" {
      DATATYPE H5T_STD_I32LE
      DATASPACE SCALAR
      DATA {
      (0): 0
      }
   }
}
}
```

Both get written out as:

DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_UTF8; CTYPE H5T_C_S1; }

If you use fixed-length strings (e.g. `|S1`), the dtype is preserved during the roundtrip:

```python
import numpy as np
import xarray as xr

# Make an xarray with an array of fixed-length strings
data = np.array([["a", "b"], ["c", "d"]], dtype="|S1")
da = xr.DataArray(
    data=data,
    dims=["x", "y"],
    coords={"x": [0, 1], "y": [0, 1]},
)
da.to_netcdf("test.nc", mode="w")

# Load the xarray back in
da_loaded = xr.load_dataarray("test.nc")
assert da.dtype == da_loaded.dtype, "Dtypes don't match"
```
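If you need the unicode dtype back after a roundtrip through VLEN strings, one possible workaround is to cast the object array explicitly once it is loaded. The sketch below is numpy-only and uses a hypothetical stand-in (`loaded`) for the object array that comes back from the file; the same kind of cast should also work through `DataArray.astype`, though I have not checked that against this exact case.

```python
import numpy as np

# What the user started with: fixed-width unicode.
original = np.array([["a", "b"], ["c", "d"]])

# Hypothetical stand-in for the object-dtype array read back from a VLEN variable.
loaded = original.astype(object)

# Casting to str lets numpy pick the width from the longest element.
restored = loaded.astype(str)

print(restored.dtype.kind)
print(restored.dtype == original.dtype)
```

Note that the cast copies the data, and the resulting width depends on the longest string present, so it only matches the source dtype when the source width was also the minimal one.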

Versions

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.21-150400.24.46-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.1

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.1
netCDF4: 1.6.3
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.6
cfgrib: None
iris: None
bottleneck: None
dask: 2023.3.1
distributed: 2023.3.1
matplotlib: 3.7.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.3.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.6.0
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: 8.11.0
sphinx: None
```