issues: 1152047670
id | node_id | number | title | user | state | locked | comments | created_at | updated_at | author_association | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1152047670 | I_kwDOAMm_X85Eqto2 | 6309 | Read/Write performance optimizations for netcdf files | 90008 | open | 0 | 5 | 2022-02-26T17:40:40Z | 2023-09-13T08:27:47Z | CONTRIBUTOR | 13221727 | issue |

### What happened?

I'm not too sure this is a bug report, but I figured I would share some of the investigation I've done on the topic of writing large datasets to netCDF. For clarity, the use case I'm considering is writing a large in-memory array to persistent storage on Linux.
The symptoms are twofold:

1. The write speed is slow: about 1 GB/s, much less than the 2-3 GB/s you can get with other means.
2. The Linux disk cache just keeps filling up.

It's quite hard to get good performance from these systems, so I'm going to put a few more constraints on the type of data we are writing:

1. The underlying numpy array must be aligned to the Linux page boundary of 4096 bytes.
2. The underlying numpy array must have been pre-faulted and not swapped. (Do not use …)

I feel like these two constraints are rather easy to meet, as I'll show in my example.

### What did you expect to happen?

I want to be able to write at 3.2 GB/s with my shiny new SSD. I want to leave my RAM unused when I'm archiving to disk.

### Minimal Complete Verifiable Example

```Python
import numpy as np
import xarray as xr


def empty_aligned(shape, dtype=np.float64, align=4096):
    if not isinstance(shape, tuple):
        shape = (shape,)
    # The rest of this helper was truncated; the body below is a
    # reconstruction of a standard page-aligned allocation, not the
    # author's exact code.
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    buf = np.empty(nbytes + align, dtype=np.uint8)
    offset = (-buf.ctypes.data) % align
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)


dataset = xr.DataArray(
    empty_aligned((4, 1024, 1024, 1024), dtype='uint8'),
    name='mydata').to_dataset()

# Fault and write data to this dataset
dataset['mydata'].data[...] = 1

%time dataset.to_netcdf("test", engine='h5netcdf')
%time dataset.to_netcdf("test", engine='netcdf4')
```
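As a quick follow-up to the example, the two constraints above can be checked directly. This is an illustrative sketch, not part of the original report, and it relies on the `empty_aligned` helper defined above (whose body is itself a reconstruction):

```Python
# Illustrative check only; assumes empty_aligned from the example above.
arr = empty_aligned((4, 1024, 1024, 1024), dtype='uint8')

# Constraint 1: the buffer starts on a 4096-byte (page) boundary.
assert arr.ctypes.data % 4096 == 0

# Constraint 2: touching every page pre-faults the array, so no page faults
# (or swap-ins) happen during the actual write.
arr[...] = 1
```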
### Relevant log output

Both engines take about 3.5 s, equivalent to just about 1 GB/s.

To get to about 3 GB/s (taking about 1.27 s to write a 4 GB array), one needs to do a few things:
For the h5netcdf backend, you would have to add the following kwargs to the h5netcdf constructor:
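The exact kwargs are not reproduced above, so the snippet below is only a guess at the kind of file-creation settings involved, based on the 4096-byte alignment discussed earlier. The `alignment_threshold`/`alignment_interval` names are assumptions about recent `h5py.File` signatures (to which `h5netcdf.File` would forward file kwargs), not values taken from the issue:

```Python
import h5py

# Hypothetical settings, NOT the author's actual kwargs: ask HDF5 to place
# storage allocations of at least 4096 bytes on 4096-byte boundaries so that
# writes line up with Linux pages. Verify these parameters exist in your h5py.
f = h5py.File(
    "test_aligned.h5",
    "w",
    alignment_threshold=4096,  # only align objects at least this large
    alignment_interval=4096,   # align them to 4096-byte boundaries
)
f.close()
```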
### Anything else we need to know?

The main challenge is that while writing aligned data this way is REALLY fast, writing small chunks and unaligned data becomes REALLY slow.

Personally, I think that someone might be able to write a new HDF5 driver that does better optimization. I feel like this can help people loading large datasets, which seems to be a large part of the xarray user community.

### Environment

```
INSTALLED VERSIONS
commit: None
python: 3.9.9 (main, Dec 29 2021, 07:47:36) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.13.0-30-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 0.21.1
pandas: 1.4.0
numpy: 1.22.2
scipy: 1.8.0
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.13.1
h5py: 3.6.0.post1
Nio: None
zarr: None
cftime: 1.5.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.01.1
distributed: None
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.01.0
cupy: None
pint: None
sparse: None
setuptools: 60.8.1
pip: 22.0.3
conda: None
pytest: None
IPython: 8.0.1
sphinx: None
```

h5py includes some additions of mine that allow you to use the DIRECT driver, and I am using a version of HDF5 that is built with the DIRECT driver.
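The DIRECT-driver setup itself is not spelled out in the issue; as a rough, hypothetical sketch (assuming an HDF5 build with the O_DIRECT VFD and an h5py release that exposes it), requesting the driver when opening a file looks roughly like this:

```Python
import h5py

# Hypothetical usage, assuming HDF5 was built with the direct (O_DIRECT) VFD
# and the installed h5py exposes it. O_DIRECT bypasses the Linux page cache,
# which is why the page-aligned, pre-faulted buffers above matter.
with h5py.File("test_direct.h5", "w", driver="direct") as f:
    f.create_dataset("mydata", shape=(1024,), dtype="uint8")
```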
{ "url": "https://api.github.com/repos/pydata/xarray/issues/6309/reactions", "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 1 } |
13221727 | issue |