issues

2 rows where type = "issue" and user = 56827 sorted by updated_at descending

id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1536004355 I_kwDOAMm_X85bjZED 7446 Parallel + multi-threaded reading of NetCDF4 + HDF5: Hidefix! gauteh 56827 open 0     9 2023-01-17T08:56:03Z 2023-06-26T22:06:46Z   NONE      

What is your issue?

Greetings,

I have developed a parallel, multi-threaded (and even async) reader for HDF5 and NetCDF4 files. It is still somewhat experimental (it does not yet support all compression filters, for example), but it has been tested a fair bit by now. The reader is written in Rust with Python bindings:

https://github.com/gauteh/hidefix (pending conda package: https://github.com/conda-forge/staged-recipes/pull/21742)

The regular NetCDF4 and HDF5 libraries are not thread-safe, and a global process-wide lock guards all file reads. hidefix removes this lock, so datasets can be read in parallel within a single process instead of being split across processes. Additionally, the reader can decode directly into the target buffer and thus avoids a cache of decoded chunks (reducing both memory usage and chunk re-decoding).

The reader works by indexing the chunks of a dataset so that chunks can be accessed independently.
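
To illustrate the idea (this is a toy sketch, not hidefix's actual index format or API), the example below builds an in-memory "file" of float32 chunks stored out of order, records each chunk's byte offset and size in an index, and then decodes the chunks independently on a thread pool, each one writing straight into its own slot of a shared target buffer. All names (`build_file_and_index`, `read_chunk`, `read_all`) are hypothetical.

```python
import struct
from concurrent.futures import ThreadPoolExecutor

CHUNK_LEN = 3                 # values per chunk (assumption for this toy)
ITEM = struct.Struct("<f")    # one little-endian float32


def build_file_and_index():
    """Return (raw bytes, index); index maps chunk number -> (offset, size)."""
    raw = bytearray()
    index = {}
    for chunk_no in (2, 0, 3, 1):  # deliberately shuffled storage order
        values = [float(chunk_no * CHUNK_LEN + i) for i in range(CHUNK_LEN)]
        payload = b"".join(ITEM.pack(v) for v in values)
        index[chunk_no] = (len(raw), len(payload))
        raw += payload
    return bytes(raw), index


def read_chunk(raw, index, chunk_no, target):
    """Decode one chunk directly into its slot of the shared target list."""
    offset, size = index[chunk_no]
    values = [v for (v,) in ITEM.iter_unpack(raw[offset:offset + size])]
    start = chunk_no * CHUNK_LEN
    target[start:start + CHUNK_LEN] = values


def read_all(raw, index):
    """Read every chunk via the index; each worker fills a disjoint slice."""
    target = [0.0] * (len(index) * CHUNK_LEN)
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda n: read_chunk(raw, index, n, target), index))
    return target


raw, index = build_file_and_index()
print(read_all(raw, index))
```

Because the index alone says where each chunk lives, no worker depends on another having read "up to" a position in the file, which is what makes independent access possible.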

I have created a basic xarray backend, combined with the NetCDF4 backend for reading attributes etc: https://github.com/gauteh/hidefix/blob/main/python/hidefix/xarray.py and it works pretty well for reading:

On my laptop with 8 CPUs we get a 6x speed-up over the xarray NetCDF4 backend (reading a 380 MB variable)! On larger machines the speed-up is even greater (to control the number of CPUs used, set the RAYON_NUM_THREADS environment variable).
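
A minimal sketch of setting that variable from Python rather than the shell. It assumes Rayon reads RAYON_NUM_THREADS when its global thread pool is first created, so the variable has to be set before the first hidefix read in the process; the helper name `limit_rayon_threads` is hypothetical.

```python
import os


def limit_rayon_threads(n):
    """Set RAYON_NUM_THREADS for this process and return the stored value."""
    if n < 1:
        raise ValueError("thread count must be at least 1")
    os.environ["RAYON_NUM_THREADS"] = str(n)
    return os.environ["RAYON_NUM_THREADS"]


limit_rayon_threads(4)

# Only after this point would the dataset be opened, e.g.
# xr.open_dataset(path, engine="hidefix")
```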

Running benchmarks along the lines of:

```
import xarray as xr

i = xr.open_dataset('tests/data/barents_zdepth_m00_FC.nc', engine='hidefix')
d = i['v']
v = d[...].values
print(v.shape, type(v))
```

for the different backends (with or without xarray):

At this point it turns out that a significant portion of the time was spent setting the _FillValue for the returned array (this matters less for NetCDF4, since that reader took much longer anyway). This could also be done in Rust, in parallel: https://github.com/gauteh/hidefix/blob/main/src/python/mod.rs#L128 , which reduces it to a negligible amount of time. The same approach can also be used with the existing xarray NetCDF4 backend.

I hope this can be of general interest, and if it would be of interest to move the hidefix xarray backend into xarray that would be very cool.

Best regards, Gaute

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7446/reactions",
    "total_count": 10,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 5,
    "rocket": 4,
    "eyes": 1
}
    xarray 13221727 issue
1695809136 I_kwDOAMm_X85lE_5w 7816 Backend registration does not match docs, and is no longer specifiable in maturin pyproject toml gauteh 56827 closed 0     7 2023-05-04T11:17:08Z 2023-05-05T08:55:57Z 2023-05-05T08:46:42Z NONE      

What happened?

https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html#how-to-register-a-backend does not work with maturin any more. The registration format has changed to a tuple; for example, from the netcdf4 backend:

BACKEND_ENTRYPOINTS["netcdf4"] = ("netCDF4", NetCDF4BackendEntrypoint)

https://www.maturin.rs/metadata.html

Is this still specifiable in pyproject.toml?
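
One way to check what is actually registered in a given environment is to list the installed entry points directly. The sketch below assumes the `xarray.backends` entry-point group named in the xarray documentation linked above; the helper name `list_xarray_backends` is hypothetical, and the result depends entirely on which packages are installed.

```python
from importlib.metadata import entry_points


def list_xarray_backends():
    """Return {name: "module:attr"} for installed 'xarray.backends' entry points."""
    eps = entry_points()
    if hasattr(eps, "select"):   # Python 3.10 and later
        group = eps.select(group="xarray.backends")
    else:                        # Python 3.8/3.9: a plain dict of lists
        group = eps.get("xarray.backends", [])
    return {ep.name: ep.value for ep in group}


print(list_xarray_backends())
```

If a backend installed through maturin is missing from this mapping, the entry point was not written into the package metadata.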

Affects: #7446

What did you expect to happen?

No response

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [ ] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7816/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
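
As a rough sketch of the query behind this page (rows where type = "issue" and user = 56827, sorted by updated_at descending), the example below uses Python's built-in sqlite3 module against an in-memory table. The table is a trimmed subset of the schema above, chosen only for illustration, and holds just the two rows listed on this page; the helper name `issues_by_user` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed subset of the [issues] schema (assumption: enough for this query).
conn.execute(
    """
    CREATE TABLE issues (
        id INTEGER PRIMARY KEY,
        number INTEGER,
        title TEXT,
        user INTEGER,
        state TEXT,
        updated_at TEXT,
        type TEXT
    )
    """
)
conn.execute("CREATE INDEX idx_issues_user ON issues (user)")
conn.executemany(
    "INSERT INTO issues VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1695809136, 7816,
         "Backend registration does not match docs, and is no longer "
         "specifiable in maturin pyproject toml",
         56827, "closed", "2023-05-05T08:55:57Z", "issue"),
        (1536004355, 7446,
         "Parallel + multi-threaded reading of NetCDF4 + HDF5: Hidefix!",
         56827, "open", "2023-06-26T22:06:46Z", "issue"),
    ],
)


def issues_by_user(conn, user_id):
    """Return (number, state, updated_at) for a user's issues, newest first."""
    return conn.execute(
        "SELECT number, state, updated_at FROM issues "
        "WHERE type = ? AND user = ? ORDER BY updated_at DESC",
        ("issue", user_id),
    ).fetchall()


for row in issues_by_user(conn, 56827):
    print(row)
```

ISO 8601 timestamps stored as TEXT sort correctly as strings, which is why ordering by updated_at works without any date parsing.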