issues: 2016875829

id: 2016875829
node_id: I_kwDOAMm_X854NxU1
number: 8490
title: performance regression 2023.08 -> 2023.09 to_zarr from netcdf4 open_mfdataset
user: 827870
state: closed
locked: 0
assignee: (empty)
milestone: (empty)
comments: 14
created_at: 2023-11-29T15:40:24Z
updated_at: 2023-11-30T03:37:41Z
closed_at: 2023-11-30T03:37:41Z
author_association: NONE
active_lock_reason: (empty)
draft: (empty)
pull_request: (empty)
body:

What happened?

I'm probably doing something wrong, but I'm seeing a large performance regression from xarray 2023.08 to 2023.09 when opening a set of NASA POWER netCDF files, reducing them to a subset of variables, and then saving them as a Zarr store. Update (see comments below): the speed difference is actually most apparent in the call to .to_zarr().

Deleted: The regression is apparent just in the time to call open_mfdataset; this operation on a month's worth of files went from about 3.5 seconds to 9 seconds between these two versions, and remains slow even with 2023.11.0.
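
Because open_mfdataset with a chunks argument returns lazy, dask-backed arrays, most of the file reads only happen once to_zarr() triggers the computation, so timing the two calls separately helps show where the time actually goes. A minimal sketch using only the standard library; the `timed` helper and the sleep placeholders are illustrative assumptions, not part of the original report:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

with timed("open", timings):
    time.sleep(0.01)  # placeholder for xarray.open_mfdataset(...)

with timed("write", timings):
    time.sleep(0.01)  # placeholder for the .to_zarr(...) call

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Running the same script under both xarray versions and comparing the "write" figure would isolate the stage that this report identifies as slow.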

One thing about my setup is that I'm reading the source files over NFS; the output zarr file is going to local fast temporary storage.

This regression coincides with https://github.com/pydata/xarray/pull/7948 which changed the chunking for netcdf4 files, but I'm not sure if that's the cause. The performance doesn't change if I use chunks={} or chunks='auto'.
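
To reason about how a chunking choice affects the amount of work, it can help to count the chunks and their size for one variable. The sketch below assumes a 31-day month and uses the grid size from the file header in this report (lat = 361, lon = 576). The helper names and the smaller (1, 91, 144) chunk shape are hypothetical, since the on-disk chunking of these files is not stated here:

```python
import math


def chunk_count(shape, chunks):
    """Number of blocks when an array of `shape` is split into blocks of size `chunks`."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))


def chunk_nbytes(chunks, itemsize):
    """Size in bytes of one full block with the given element size."""
    return math.prod(chunks) * itemsize


# One variable with dimensions (time, lat, lon); 31 days is an assumption.
shape = (31, 361, 576)

# chunks={'time': 1} with one time step per file: one block per day.
per_day = (1, 361, 576)

# Hypothetical finer chunking, for comparison only.
finer = (1, 91, 144)

print("blocks, one per day:", chunk_count(shape, per_day))
print("bytes per block as float64:", chunk_nbytes(per_day, 8))
print("bytes per block as float32:", chunk_nbytes(per_day, 4))
print("blocks, finer chunking:", chunk_count(shape, finer))
```

Many small blocks mean more tasks and more separate reads, which tends to matter more when the source files sit on a network filesystem.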

I've tried this with dask 2023.08.0 through 2023.11.0 and there are no changes; I'm using netcdf4 version 1.6.5.
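
When comparing runs across several library versions, recording the installed versions next to each timing avoids mix-ups. xarray.show_versions() prints the full report; the sketch below is a lighter standard-library alternative, and the `report_versions` helper name is an assumption made for illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(report_versions(["xarray", "dask", "netCDF4", "zarr"]))
```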

The merra2 files are all lat/lon gridded and each represents a single day; I'm re-writing them to put multiple days in a one-month file:

```
netcdf power_901_daily_20230101_merra2_utc {
dimensions:
    lon = 576 ;
    lat = 361 ;
    time = UNLIMITED ; // (1 currently)
variables:
    double lon(lon) ;
        lon:_FillValue = -999. ;
        lon:long_name = "longitude" ;
        lon:units = "degrees_east" ;
        lon:standard_name = "longitude" ;
    double lat(lat) ;
        lat:_FillValue = -999. ;
        lat:long_name = "latitude" ;
        lat:units = "degrees_north" ;
        lat:standard_name = "latitude" ;
    double time(time) ;
        time:_FillValue = -999. ;
        time:long_name = "time" ;
        time:units = "days since 2023-01-01 00:00:00" ;
    ...
    double T2M(time, lat, lon) ;
        T2M:_FillValue = -999. ;
        T2M:least_significant_digit = 2LL ;
        T2M:units = "K" ;
        T2M:long_name = "Temperature at 2 Meters" ;
        T2M:standard_name = "Temperature_at_2_Meters" ;
        T2M:valid_min = 150. ;
        T2M:valid_max = 350. ;
        T2M:valid_range = 150., 350. ;
```

What did you expect to happen?

No response

Minimal Complete Verifiable Example

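```python
import glob
import numpy as np
import xarray

desired_fields = ["PRECTOTCORR", "T2M"]
met_files = glob.glob("power_091_daily_*_merra2_utc.nc")
# Open lazily with one dask chunk per time step, casting every variable to float32
df = xarray.open_mfdataset(met_files, chunks={'time': 1}, parallel=False).astype(np.float32)
pt = df[desired_fields]  # keep only the variables of interest
pt.to_zarr("out.zarr", consolidated=True)
```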

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 5.19.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2023.8.0
pandas: 2.1.2
numpy: 1.26.1
scipy: 1.11.3
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.11.0
distributed: None
matplotlib: 3.8.1
cartopy: None
seaborn: 0.13.0
numbagg: None
fsspec: 2023.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.17.2
sphinx: None
reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8490/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
performed_via_github_app: (empty)
state_reason: completed
repo: 13221727
type: issue
