#6309: Read/Write performance optimizations for netcdf files (issue 1152047670)

open · 5 comments · opened 2022-02-26T17:40:40Z by user 90008 (CONTRIBUTOR) · updated 2023-09-13T08:27:47Z

What happened?

I'm not too sure this is a bug report, but I figured I would share some of the investigation I've done on the topic of writing large datasets to netcdf.

For clarity, the use case I'm considering is writing a large in-memory array to persistent storage on Linux.

  • Array size: 4-100 GB
  • File format: NetCDF
  • Dask: no
  • Hardware: a very modern SSD that can write more than 1 GB/s (a Sabrent Rocket 4 Plus, for example)
  • Operating system: Linux
  • Xarray version: 0.21.1 (or around there)

The symptoms are twofold:

  1. The write speed is slow: about 1 GB/s, much less than the 2-3 GB/s you can get by other means.
  2. The Linux disk cache just keeps filling up.

It's quite hard to get good performance from these systems, so I'm going to put a few more constraints on the type of data we are writing:

  1. The underlying numpy array must be aligned to the Linux page boundary of 4096 bytes.
  2. The underlying numpy array must have been pre-faulted and not swapped. (Do not use np.zeros; it doesn't fault the memory.)
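The first constraint is just modular arithmetic on the buffer's starting address; a minimal sketch in pure Python (the address value is illustrative):

```python
PAGE = 4096  # Linux page size in bytes

def align_offset(addr, align=PAGE):
    """Bytes to skip so that addr + offset lands on an align-byte boundary."""
    rem = addr % align
    return 0 if rem == 0 else align - rem

# Example with an arbitrary (illustrative) allocation address:
addr = 0x7F3A1C000123
offset = align_offset(addr)
assert (addr + offset) % PAGE == 0
```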

I feel these two constraints are rather easy to satisfy, as I'll show in my example.

What did you expect to happen?

I want to be able to write at 3.2GB/s with my shiny new SSD.

I want to leave my RAM unused when I'm archiving to disk.

Minimal Complete Verifiable Example

```python
import numpy as np
import xarray as xr


def empty_aligned(shape, dtype=np.float64, align=4096):
    if not isinstance(shape, tuple):
        shape = (shape,)

    dtype = np.dtype(dtype)
    size = dtype.itemsize
    # Compute the final size of the array
    for s in shape:
        size *= s

    a = np.empty(size + (align - 1), dtype=np.uint8)
    data_align = a.ctypes.data % align
    offset = 0 if data_align == 0 else (align - data_align)
    arr = a[offset:offset + size].view(dtype)
    # Don't use reshape, since reshape might copy the data.
    # Assigning to .shape is the suggested way to set a new shape
    # with the guarantee that the data won't be copied.
    arr.shape = shape
    return arr


dataset = xr.DataArray(
    empty_aligned((4, 1024, 1024, 1024), dtype='uint8'),
    name='mydata').to_dataset()

# Fault and write data to this dataset
dataset['mydata'].data[...] = 1

%time dataset.to_netcdf("test", engine='h5netcdf')
%time dataset.to_netcdf("test", engine='netcdf4')
```

Relevant log output

Both take about 3.5 s, equivalent to just about 1 GB/s.
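For reference, the arithmetic behind that estimate (the 4 GiB array from the example, written in about 3.5 s):

```python
size_bytes = 4 * 2**30  # the 4 GiB array from the example above
seconds = 3.5
throughput_gb_s = size_bytes / seconds / 1e9
print(round(throughput_gb_s, 2))  # ~1.23 GB/s
```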

To get to about 3 GB/s (taking about 1.27 s to write a 4 GB array), one needs to do a few things:

  1. Align the underlying data on disk:
       • h5netcdf (h5py) backend: https://github.com/h5py/h5py/pull/2040
       • netcdf4: https://github.com/Unidata/netcdf-c/pull/2206
  2. Use a driver that bypasses the operating system cache:
       • https://github.com/h5py/h5py/pull/2041

For the h5netcdf backend, you would have to add the following kwargs to the h5netcdf constructor:

```python
kwargs = {
    "invalid_netcdf": invalid_netcdf,
    "phony_dims": phony_dims,
    "decode_vlen_strings": decode_vlen_strings,
    'alignment_threshold': alignment_threshold,
    'alignment_interval': alignment_interval,
}
```
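For context, the h5py PR linked above exposes these two settings as keyword arguments on h5py.File. A minimal sketch of opening a file with 4 KiB alignment, assuming an h5py version that includes that PR (the file name and values are illustrative):

```python
def alignment_kwargs(page=4096):
    # Align any object of at least `page` bytes to a multiple of `page` bytes.
    return {"alignment_threshold": page, "alignment_interval": page}

try:
    import h5py  # needs a version that includes the PR above
    with h5py.File("aligned_test.h5", "w", **alignment_kwargs()) as f:
        f.create_dataset("mydata", data=list(range(10)))
except (ImportError, TypeError):
    pass  # h5py missing or too old; the kwargs dict is the relevant part
```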

Anything else we need to know?

The main challenge is that while writing aligned data this way is REALLY fast, writing small chunks and unaligned data becomes REALLY slow.

Personally, I think someone might be able to write a new HDF5 driver that optimizes this better. I feel this could help people loading large datasets, which seems to be a large part of the community of xarray users.

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.9 (main, Dec 29 2021, 07:47:36) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.13.0-30-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 0.21.1
pandas: 1.4.0
numpy: 1.22.2
scipy: 1.8.0
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.13.1
h5py: 3.6.0.post1
Nio: None
zarr: None
cftime: 1.5.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.01.1
distributed: None
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.01.0
cupy: None
pint: None
sparse: None
setuptools: 60.8.1
pip: 22.0.3
conda: None
pytest: None
IPython: 8.0.1
sphinx: None
```

h5py includes some additions of mine that allow you to use the DIRECT driver, and I am using a version of HDF5 that is built with the DIRECT driver.

