issues: 1371466778
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1371466778 | I_kwDOAMm_X85Rvuwa | 7028 | .to_zarr() or .to_netcdf slow and uses excess memory when datetime64[ns] variable in output; a reproducible example | 12818667 | closed | 0 | 3 | 2022-09-13T13:32:29Z | 2022-11-03T15:40:53Z | 2022-11-03T15:40:52Z | NONE | What happened?
This bug report is a reproducible example, with code, of an issue that may be the one behind #7018, #2912 and other bug reports describing slow performance and memory exhaustion when using .to_zarr() or .to_netcdf(). I think this has been hard to track down because it only occurs for large data sets. I have included code that replicates the problem without the need for downloading a large dataset. The problem is that saving an xarray dataset which includes a variable with type datetime64[ns] is several orders of magnitude slower (!!) and uses a great deal of memory (!!) relative to the same dataset where that variable has another type. The workaround is obvious -- turn off time decoding and treat time as a float64. But this is inelegant, and I think this problem has led to many unanswered questions on the issues page, such as the one above. If I save a dataset whose structure (based on my use case, the ocean-parcels Lagrangian particle tracker) is:
To recreate this graph, and to see very simple code that replicates this problem, see the attached Python code. Note that the directory you run it in should have at least 30 GB free for the data set it writes, and on machines with less than 256 GB of memory it will crash before completing, after exhausting the memory. However, the last figure will be saved in jnk_out.png, and you can always change the largest size it attempts to create.

SmallestExample_zarrOutProblem.zip

What did you expect to happen?
I expect that the time to save a dataset with .to_zarr or .to_netcdf does not change dramatically if one of the variables has a datetime64[ns] type.

Minimal Complete Verifiable Example
```Python
# this code is also included as a zip file above.
import xarray as xr
from pylab import *
from numpy import *
from glob import glob
from os import path
import time
import dask
from dask.diagnostics import ProgressBar
import shutil
import pickle

# this is a minimal code that illustrates issue with .to_zarr() or .to_netcdf when writing a dataset with datetime64 data

# outputDir is the name of the zarr output; it should be set to a location on a fast filesystem with enough space
outputDir='./testOut.zarr'

def testToZarr(dimensions,haveTimeType=True):
    '''This code writes out an empty dataset with the dimensions specified in the "dimensions" argument, and returns the time it took to create the dask delayed object and the time it took to compute the delayed object.
# now let's do some benchmarking
if __name__ == "__main__":
    figure(1,figsize=(10.0,8.0))
    clf()
    style.use('ggplot')
```
MVCE confirmation
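The workaround mentioned under "What happened?" (turn off time decoding and treat time as a float64) could look roughly like the sketch below. It is not taken from the attached zip file: the helper names and the choice of 1970-01-01 as the reference time are assumptions made for illustration, and it uses only NumPy so that it runs without writing any data. With xarray itself, reading a file with `decode_times=False` leaves the time variable as plain numbers in the first place.

```python
import numpy as np

# Assumed reference time for the float representation (hypothetical choice).
EPOCH = np.datetime64("1970-01-01T00:00:00", "ns")


def datetime_to_float_seconds(times):
    """Convert datetime64[ns] values to float64 seconds since EPOCH."""
    times = np.asarray(times, dtype="datetime64[ns]")
    # Dividing a timedelta64 array by a timedelta64 scalar yields float64.
    return (times - EPOCH) / np.timedelta64(1, "s")


def float_seconds_to_datetime(seconds):
    """Convert float64 seconds since EPOCH back to datetime64[ns]."""
    seconds = np.asarray(seconds, dtype="float64")
    # Handle whole seconds and the fractional part separately so that
    # whole-second timestamps survive the round trip exactly.
    whole = np.floor(seconds)
    frac_ns = np.round((seconds - whole) * 1e9).astype("int64")
    total_ns = whole.astype("int64") * 1_000_000_000 + frac_ns
    return EPOCH + total_ns.astype("timedelta64[ns]")


times = np.array(
    ["2022-09-13T13:32:29", "2022-11-03T15:40:52"], dtype="datetime64[ns]"
)
seconds = datetime_to_float_seconds(times)      # float64, safe to store as-is
restored = float_seconds_to_datetime(seconds)   # back to datetime64[ns]
```

Note that float64 seconds cannot represent present-day timestamps to full nanosecond precision, so this round trip is only exact for coarser resolutions such as whole seconds.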
Relevant log output
No response
Anything else we need to know?
No response
Environment
Note -- I see the same thing on my Linux machine
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:22) [Clang 13.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 21.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2022.6.0
pandas: 1.4.3
numpy: 1.23.2
scipy: 1.9.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.12.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.8.1
distributed: 2022.8.1
matplotlib: 3.5.3
cartopy: 0.20.3
seaborn: None
numbagg: None
fsspec: 2022.7.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.3.0
pip: 22.2.2
conda: None
pytest: None
IPython: 8.4.0
sphinx: None
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/7028/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | 13221727 | issue |