#6309: Read/Write performance optimizations for netcdf files (issue 1152047670)

open · 5 comments · opened 2022-02-26T17:40:40Z by user 90008 (CONTRIBUTOR) · updated 2023-09-13T08:27:47Z

What happened?

I'm not too sure this is a bug report, but I figured I would share some of the investigation I've done on the topic of writing large datasets to netcdf.

For clarity, the use case I'm considering is writing a large in-memory array to persistent storage on Linux.

  • Array size: 4-100 GB
  • File format: NetCDF
  • Dask: no
  • Hardware: a very modern SSD that can write more than 1 GB/s (a Sabrent Rocket 4 Plus, for example)
  • Operating system: Linux
  • Xarray version: 0.21.1 (or around there)

The symptoms are twofold:

  1. The write speed is slow: about 1 GB/s, much less than the 2-3 GB/s you can get by other means.
  2. The Linux disk cache just keeps filling up.

It's quite hard to get good performance from these systems, so I'm going to put a few more constraints on the type of data we are writing:

  1. The underlying numpy array must be aligned to the Linux page boundary of 4096 bytes.
  2. The underlying numpy array must have been pre-faulted and not swapped. (Do not use np.zeros; it doesn't fault the memory.)
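The first constraint is just modular arithmetic on the buffer's starting address; a minimal sketch in pure Python (the address value is illustrative):

```python
PAGE = 4096  # Linux page size in bytes

def align_offset(addr, align=PAGE):
    """Bytes to skip so that addr + offset lands on an align-byte boundary."""
    rem = addr % align
    return 0 if rem == 0 else align - rem

# Example with an arbitrary (illustrative) allocation address:
addr = 0x7F3A1C000123
offset = align_offset(addr)
assert (addr + offset) % PAGE == 0
```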

I feel these two constraints are rather easy to satisfy, as I'll show in my example.

What did you expect to happen?

I want to be able to write at 3.2GB/s with my shiny new SSD.

I want to leave my RAM unused when I'm archiving to disk.

Minimal Complete Verifiable Example

```python
import numpy as np
import xarray as xr


def empty_aligned(shape, dtype=np.float64, align=4096):
    if not isinstance(shape, tuple):
        shape = (shape,)

    dtype = np.dtype(dtype)
    size = dtype.itemsize
    # Compute the final size of the array
    for s in shape:
        size *= s

    a = np.empty(size + (align - 1), dtype=np.uint8)
    data_align = a.ctypes.data % align
    offset = 0 if data_align == 0 else (align - data_align)
    arr = a[offset:offset + size].view(dtype)
    # Don't use reshape, since reshape might copy the data.
    # Assigning to .shape is the suggested way to set a new shape
    # with the guarantee that the data won't be copied.
    arr.shape = shape
    return arr


dataset = xr.DataArray(
    empty_aligned((4, 1024, 1024, 1024), dtype='uint8'),
    name='mydata').to_dataset()

# Fault and write data to this dataset
dataset['mydata'].data[...] = 1

%time dataset.to_netcdf("test", engine='h5netcdf')
%time dataset.to_netcdf("test", engine='netcdf4')
```

Relevant log output

Both take about 3.5 s, equivalent to just about 1 GB/s.
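For reference, the arithmetic behind that estimate (the 4 GiB array from the example, written in about 3.5 s):

```python
size_bytes = 4 * 2**30  # the 4 GiB array from the example above
seconds = 3.5
throughput_gb_s = size_bytes / seconds / 1e9
print(round(throughput_gb_s, 2))  # ~1.23 GB/s
```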

To get to about 3 GB/s (taking about 1.27 s to write a 4 GB array), one needs to do a few things:

  1. Align the underlying data on disk:
       • h5netcdf (h5py) backend: https://github.com/h5py/h5py/pull/2040
       • netcdf4: https://github.com/Unidata/netcdf-c/pull/2206
  2. Use a driver that bypasses the operating system cache:
       • https://github.com/h5py/h5py/pull/2041

For the h5netcdf backend, you would have to add the following kwargs to the h5netcdf constructor:

```python
kwargs = {
    "invalid_netcdf": invalid_netcdf,
    "phony_dims": phony_dims,
    "decode_vlen_strings": decode_vlen_strings,
    'alignment_threshold': alignment_threshold,
    'alignment_interval': alignment_interval,
}
```
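For context, the h5py PR linked above exposes these two settings as keyword arguments on h5py.File. A minimal sketch of opening a file with 4 KiB alignment, assuming an h5py version that includes that PR (the file name and values are illustrative):

```python
def alignment_kwargs(page=4096):
    # Align any object of at least `page` bytes to a multiple of `page` bytes.
    return {"alignment_threshold": page, "alignment_interval": page}

try:
    import h5py  # needs a version that includes the PR above
    with h5py.File("aligned_test.h5", "w", **alignment_kwargs()) as f:
        f.create_dataset("mydata", data=list(range(10)))
except (ImportError, TypeError):
    pass  # h5py missing or too old; the kwargs dict is the relevant part
```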

Anything else we need to know?

The main challenge is that while writing aligned data this way is REALLY fast, writing small chunks and unaligned data becomes REALLY slow.

Personally, I think someone might be able to write a new HDF5 driver that optimizes this better. I feel this could help people loading large datasets, which seems to be a large part of the community of xarray users.

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.9 (main, Dec 29 2021, 07:47:36) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.13.0-30-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 0.21.1
pandas: 1.4.0
numpy: 1.22.2
scipy: 1.8.0
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.13.1
h5py: 3.6.0.post1
Nio: None
zarr: None
cftime: 1.5.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.01.1
distributed: None
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.01.0
cupy: None
pint: None
sparse: None
setuptools: 60.8.1
pip: 22.0.3
conda: None
pytest: None
IPython: 8.0.1
sphinx: None
```

h5py includes some additions of mine that allow you to use the DIRECT driver, and I am using a version of HDF5 that is built with the DIRECT driver.

