issues


2 rows where user = 16100116 sorted by updated_at descending

Issue 6570: h5netcdf-engine now reads attributes with array length 1 as scalar
erik-mansson (16100116) · closed · 1 comment · opened 2022-05-04T10:34:06Z · closed 2023-09-19T01:02:24Z

What is your issue?

The h5netcdf engine for reading NetCDF4 files was recently changed (https://github.com/h5netcdf/h5netcdf/pull/151) so that, when reading attributes, any 1-D array/list of length 1 is turned into a scalar element/item. The change landed in version 0.14.0.

The issue is that the xarray documentation still describes the old h5netcdf behaviour at https://docs.xarray.dev/en/stable/user-guide/io.html?highlight=attributes%20h5netcdf#netcdf

Could we also mention this on https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html#xarray.open_dataset under the engine argument, or at least make sure it links to the page above?

I initially looked under https://docs.xarray.dev/en/stable/user-guide/io.html?highlight=string#string-encoding because my case involved a string array/list, but it is maybe too much to mention it there if this is a general change that affects attributes of all types.

As explained on the h5netcdf issue tracker, the reason for squeezing length-1 array attributes to scalars is compatibility with the other NetCDF4 engine, and with NetCDF in general (opinions may vary on how that trade-off compares with fully using the features available in HDF5). Interestingly, when writing, an attribute given as a Python list of length 1 still produces an array of length 1 in the HDF5/NetCDF4 file; the array dimension is dropped only when reading.

Adding the `invalid_netcdf=True` argument when loading does not change the behaviour. Maybe it could be used to generally allow length-1 attribute arrays? For now, I think every use of array-valued attributes will need a conversion like `attribute if isinstance(attribute, list) else [attribute]` or `list(attribute if isinstance(attribute, (list, np.ndarray)) else [attribute])` to support both old and new versions; otherwise, iterating over an attribute that is now a plain string will iterate over its characters instead of yielding the single string once (as in older versions). A minimal compatibility helper is sketched below.
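A minimal sketch of such a compatibility helper (hypothetical; `attr_as_list` is not part of xarray or h5netcdf):

```python
import numpy as np

def attr_as_list(attribute):
    # Return the attribute as a list, whether the engine gave us a scalar
    # (h5netcdf >= 0.14.0) or a length-1 array/list (older versions).
    if isinstance(attribute, (list, np.ndarray)):
        return list(attribute)
    return [attribute]

# Iterating over attr_as_list(ds2['stuff'].attrs['strings_1D_one']) then
# yields 'abc' once, instead of iterating over the characters of the string.
```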

Minimal example

This serves to clarify what happens. The issue is not about reverting to the old behaviour (although I liked it), just about updating the xarray documentation.

```python
import xarray as xr
import numpy as np

ds = xr.Dataset()
ds['stuff'] = xr.DataArray(np.random.randn(2), dims='x')
ds['stuff'].attrs['strings_0D_one'] = 'abc'
ds['stuff'].attrs['strings_1D_two'] = ['abc', 'def']
ds['stuff'].attrs['strings_1D_one'] = ['abc']

path = 'demo.nc'
ds.to_netcdf(path, engine='h5netcdf', format='netCDF4')
ds2 = xr.load_dataset(path, engine='h5netcdf')

print(type(ds2['stuff'].attrs['strings_0D_one']).__name__, repr(ds2['stuff'].attrs['strings_0D_one']))
print(type(ds2['stuff'].attrs['strings_1D_two']).__name__, repr(ds2['stuff'].attrs['strings_1D_two']))
print(type(ds2['stuff'].attrs['strings_1D_one']).__name__, repr(ds2['stuff'].attrs['strings_1D_one']))
```

With h5netcdf 0.12.0 (python: 3.7.9, OS: Windows 10, libhdf5: 1.10.4, xarray: 0.20.1, pandas: 1.3.4, numpy: 1.21.5, netCDF4: None, h5py: 2.10.0) the printouts are:

```
str 'abc'
ndarray array(['abc', 'def'], dtype=object)
ndarray array(['abc'], dtype=object)
```

With h5netcdf 1.0.0 (python: 3.8.11, OS: Linux 3.10.0-1160.49.1.el7.x86_64, libhdf5: 1.10.4, xarray: 0.20.1, pandas: 1.4.2, numpy: 1.21.2, netCDF4: None, h5py: 2.10.0) the printouts are:

```
str 'abc'
list ['abc', 'def']
str 'abc'
```

I have checked that reading the same file directly with h5py.File gives str, ndarray, ndarray, so the change is not in the writing or in h5py; a sketch of that check follows.
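For reference, a sketch of that direct check (assuming the demo.nc written by the example above; exact string types may differ between h5py versions):

```python
import h5py

with h5py.File('demo.nc', 'r') as f:
    attrs = f['stuff'].attrs
    for key in ('strings_0D_one', 'strings_1D_two', 'strings_1D_one'):
        print(key, type(attrs[key]).__name__, repr(attrs[key]))
# Prints str, ndarray, ndarray: the length-1 array is intact in the file,
# so the squeeze to a scalar happens in h5netcdf's reading layer.
```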

Issue 6517: Loading from NetCDF creates unnecessary numpy.ndarray-views that clears the OWNDATA-flag
erik-mansson (16100116) · closed · 6 comments · opened 2022-04-26T21:53:00Z · closed 2023-04-06T03:42:34Z

What happened?

When loading a NetCDF4 dataset from a file (at least with the 'h5netcdf' engine), I get an xarray.Dataset where each DataArray wraps a numpy.ndarray whose OWNDATA flag is False. This is counter-intuitive, since the high-level user has not knowingly done anything that would create a view/slice, i.e. a second array sharing memory with the "first" array.

This is of course a rather minor issue, but it got in my way when I was building tools to track which arrays in my dataset use a lot of RAM: I had added an option to report memory usage only for primary/base arrays, not for views that reuse the same memory. With this option enabled, the behaviour reported here gives me no useful information (nearly zero memory usage is shown) when inspecting a Dataset loaded from a NetCDF4 file, as opposed to a freshly created or deep-copied Dataset. A sketch of that kind of accounting is shown below.
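For illustration, a minimal sketch of that kind of accounting (a hypothetical helper along the lines of what I was doing, not my actual tool):

```python
def owned_nbytes(ds):
    # Count bytes only for arrays that own their buffer, so views that
    # merely share another array's memory are not double-counted.
    return {
        name: (var.values.nbytes if var.values.flags['OWNDATA'] else 0)
        for name, var in ds.variables.items()
    }

# For a Dataset loaded via the h5netcdf engine, every entry reports 0,
# because the wrapped ndarrays are views even though no user code sliced them.
```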

What did you expect to happen?

I would prefer the OWNDATA flag to be True, as it is when reading the HDF5 file at the lower h5py level. After some debugging, this means avoiding things like `array = array[:, ...]` at various places in the multiple layers of wrappers involved in dataset loading: that expression creates and returns a new ndarray instance sharing (not owning) memory with the original, while no user-accessible reference to the ndarray that technically "owns" the memory seems to be retained. The snippet below illustrates the effect.
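A small illustration of why such expressions clear the flag:

```python
import numpy as np

base = np.random.randn(4)
print(base.flags['OWNDATA'])  # True: base owns its buffer

view = base[:, ...]           # looks like a no-op, but returns a new ndarray
print(view.flags['OWNDATA'])  # False: view shares base's memory
print(view.base is base)      # True: the view keeps a reference to base
```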

See the minimal code example below. It and my patch were made with xarray 0.20.1 and h5netcdf 0.12.0, but the relevant parts of xarray/core/indexing.py still look the same.

Minimal Complete Verifiable Example

```python
import xarray as xr
import numpy as np

ds = xr.Dataset()
ds['stuff'] = xr.DataArray(np.random.randn(2), dims='x')
path = 'demo.nc'
ds.to_netcdf(path, engine='h5netcdf', format='netCDF4', invalid_netcdf=True)
ds2 = xr.load_dataset(path, engine='h5netcdf')
print(ds2['stuff'].values.flags['OWNDATA'])  # initially False, True after patching
```

Relevant log output

No response

Anything else we need to know?

I patched two parts of xarray/core/indexing.py to solve the issue: xarray core indexing.diff.txt

Testing in other situations will of course be needed to make sure this doesn't disturb anything else, but I hope the general idea is useful, even if an additional condition may be needed to decide when to take the shortcut of returning the original numpy.ndarray rather than a view of it. The sketch below illustrates the idea.
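As a hypothetical illustration of that shortcut (not the actual diff, which is attached above):

```python
import numpy as np

def maybe_shortcut(array, key):
    # Hypothetical sketch: if the key would select the whole array
    # unchanged, return the original ndarray instead of a view,
    # preserving its OWNDATA flag.
    if all(k == slice(None) for k in key):
        return array
    return array[key]

a = np.random.randn(3, 4)
b = maybe_shortcut(a, (slice(None), slice(None)))
print(b is a, b.flags['OWNDATA'])  # True True: no view was created
```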

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.9 (default, Aug 31 2020, 17:10:11) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: (None, None)
libhdf5: 1.10.4
libnetcdf: None
xarray: 0.20.1
pandas: 1.3.4
numpy: 1.21.5
scipy: 1.7.3
netCDF4: None
pydap: None
h5netcdf: 0.12.0
h5py: 2.10.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.10.0
distributed: 2021.10.0
matplotlib: 3.3.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 0.8.3
cupy: None
pint: 0.16.1
sparse: None
setuptools: 52.0.0.post20210125
pip: 20.3.3
conda: 4.12.0
pytest: 6.2.2
IPython: 7.20.0
sphinx: 3.4.3
```
