issues: 2016875829
This data as json
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2016875829 | I_kwDOAMm_X854NxU1 | 8490 | performance regression 2023.08 -> 2023.09 to_zarr from netcdf4 open_mfdataset | 827870 | closed | 0 | 14 | 2023-11-29T15:40:24Z | 2023-11-30T03:37:41Z | 2023-11-30T03:37:41Z | NONE |

### What happened?

I'm probably doing something wrong, but I'm seeing a large performance regression from 2023.08 to 2023.09 when opening a set of NASA POWER netCDF files, reducing them to only a subset of variables, and then saving them as a zarr file.

**Updated, see comments below:** the speed difference is actually most apparent in the call to `.to_zarr()`. ~~Deleted: the regression is apparent just in the time to call `open_mfdataset`.~~ This operation on a month's worth of files went from about 3.5 seconds to 9 seconds between these two versions, and remains slow even with 2023.11.0.

One thing about my setup is that I'm reading the source files over NFS; the output zarr file goes to local fast temporary storage.

This regression coincides with https://github.com/pydata/xarray/pull/7948, which changed the chunking for netCDF4 files, but I'm not sure if that's the cause. The performance doesn't change if I use `chunks={}` or `chunks='auto'`. I've tried this with dask 2023.08.0 through 2023.11.0 and there is no change; I'm using netCDF4 version 1.6.5.

The MERRA-2 files are all lat/lon gridded and each represents a single day; I'm re-writing them to put multiple days in a one-month file:

```
netcdf power_901_daily_20230101_merra2_utc {
dimensions:
        lon = 576 ;
        lat = 361 ;
        time = UNLIMITED ; // (1 currently)
variables:
        double lon(lon) ;
                lon:_FillValue = -999. ;
                lon:long_name = "longitude" ;
                lon:units = "degrees_east" ;
                lon:standard_name = "longitude" ;
        double lat(lat) ;
                lat:_FillValue = -999. ;
                lat:long_name = "latitude" ;
                lat:units = "degrees_north" ;
                lat:standard_name = "latitude" ;
        double time(time) ;
                time:_FillValue = -999. ;
                time:long_name = "time" ;
                time:units = "days since 2023-01-01 00:00:00" ;
        ...
        double T2M(time, lat, lon) ;
                T2M:_FillValue = -999. ;
                T2M:least_significant_digit = 2LL ;
                T2M:units = "K" ;
                T2M:long_name = "Temperature at 2 Meters" ;
                T2M:standard_name = "Temperature_at_2_Meters" ;
                T2M:valid_min = 150. ;
                T2M:valid_max = 350. ;
                T2M:valid_range = 150., 350. ;
```

### What did you expect to happen?

No response

### Minimal Complete Verifiable Example
### MVCE confirmation
### Relevant log output

No response

### Anything else we need to know?

No response

### Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 5.19.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development
xarray: 2023.8.0
pandas: 2.1.2
numpy: 1.26.1
scipy: 1.11.3
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.11.0
distributed: None
matplotlib: 3.8.1
cartopy: None
seaborn: 0.13.0
numbagg: None
fsspec: 2023.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.17.2
sphinx: None
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/8490/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | 13221727 | issue |