issues: 2016875829
This data as json
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2016875829 | I_kwDOAMm_X854NxU1 | 8490 | performance regression 2023.08 -> 2023.09 to_zarr from netcdf4 open_mfdataset | 827870 | closed | 0 | 14 | 2023-11-29T15:40:24Z | 2023-11-30T03:37:41Z | 2023-11-30T03:37:41Z | NONE |

### What happened?

I'm probably doing something wrong, but I'm seeing a large performance regression from 2023.08 to 2023.09 when opening a set of NASA POWER netCDF files, reducing them to only a subset of variables, and then saving them as a zarr file.

**Updated, see comments below:** the speed difference is actually most apparent in the call to `.to_zarr()`. ~~Deleted: the regression is apparent just in the time to call `open_mfdataset`.~~ This operation on a month's worth of files went from about 3.5 seconds to 9 seconds between these two versions, and remains slow even with 2023.11.0.

One thing about my setup is that I'm reading the source files over NFS; the output zarr file goes to local fast temporary storage.

This regression coincides with https://github.com/pydata/xarray/pull/7948, which changed the chunking for netCDF4 files, but I'm not sure if that's the cause. The performance doesn't change if I use `chunks={}` or `chunks='auto'`. I've tried this with dask 2023.08.0 through 2023.11.0 and there is no change; I'm using netCDF4 version 1.6.5.

The MERRA-2 files are all lat/lon gridded and each represents a single day; I'm re-writing them to put multiple days in a one-month file:

```
netcdf power_901_daily_20230101_merra2_utc {
dimensions:
        lon = 576 ;
        lat = 361 ;
        time = UNLIMITED ; // (1 currently)
variables:
        double lon(lon) ;
                lon:_FillValue = -999. ;
                lon:long_name = "longitude" ;
                lon:units = "degrees_east" ;
                lon:standard_name = "longitude" ;
        double lat(lat) ;
                lat:_FillValue = -999. ;
                lat:long_name = "latitude" ;
                lat:units = "degrees_north" ;
                lat:standard_name = "latitude" ;
        double time(time) ;
                time:_FillValue = -999. ;
                time:long_name = "time" ;
                time:units = "days since 2023-01-01 00:00:00" ;
        ...
        double T2M(time, lat, lon) ;
                T2M:_FillValue = -999. ;
                T2M:least_significant_digit = 2LL ;
                T2M:units = "K" ;
                T2M:long_name = "Temperature at 2 Meters" ;
                T2M:standard_name = "Temperature_at_2_Meters" ;
                T2M:valid_min = 150. ;
                T2M:valid_max = 350. ;
                T2M:valid_range = 150., 350. ;
```

### What did you expect to happen?

No response

### Minimal Complete Verifiable Example
### MVCE confirmation
### Relevant log output

No response

### Anything else we need to know?

No response

### Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 5.19.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development
xarray: 2023.8.0
pandas: 2.1.2
numpy: 1.26.1
scipy: 1.11.3
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.11.0
distributed: None
matplotlib: 3.8.1
cartopy: None
seaborn: 0.13.0
numbagg: None
fsspec: 2023.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.17.2
sphinx: None
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/8490/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | 13221727 | issue |