issues: 1056247970


id: 1056247970
node_id: I_kwDOAMm_X84-9RCi
number: 5995
title: High memory usage of xarray vs netCDF4 function
user: 49512274
state: closed
locked: 0
comments: 3
created_at: 2021-11-17T15:13:19Z
updated_at: 2023-09-12T15:44:19Z
closed_at: 2023-09-12T15:44:18Z
author_association: NONE
state_reason: completed
repo: 13221727
type: issue

Hi,

I would like to open a netCDF file, change some variable attributes, apply zlib compression, and sometimes change global attributes. I used to do this with netCDF4, and it worked.

Recently, I tried using xarray to perform the same job. The results are the same, but xarray always loads the entire file into memory instead of writing it variable by variable.

Here is a minimal example.

Creation of the example file:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset()
obs = 4835680
n = 20

basic_encoding = dict(zlib=True, shuffle=True, complevel=1)

# some variables with scale factor
for i in range(3):
    vname = f"scale{i:02d}"
    ds[vname] = (["obs"], np.random.rand(obs).astype(np.float32) / 1e3)
    ds[vname].encoding.update(basic_encoding)
    ds[vname].encoding.update({"dtype": np.uint16, "scale_factor": 0.0001, "add_offset": 0, "chunksizes": (1611894,)})

# some variables without scale factor
for i in range(3):
    vname = f"float{i:02d}"
    ds[vname] = (["obs"], np.random.rand(obs).astype(np.float32))
    ds[vname].encoding.update(basic_encoding)
    ds[vname].encoding.update({"chunksizes": (967136,)})

# some variables with 2 dimensions which use more memory
for i in range(3):
    vname = f"matrix{i:02d}"
    ds[vname] = (["obs", "n"], np.random.rand(obs, n).astype(np.float32) * 10)
    ds[vname].encoding.update(basic_encoding)
    ds[vname].encoding.update({"dtype": np.int16, "scale_factor": 0.01, "add_offset": 0, "chunksizes": (20000, 20)})

ds.to_netcdf("/tmp/test_original.nc")
```

Here is my old function to copy/rewrite my netCDF file, and the new function (I deleted irrelevant changes in both functions to keep only the important parts):

```python
import netCDF4
import xarray as xr


def old_copy(f_in, f_out):
    with netCDF4.Dataset(f_out, 'w') as h_out:
        with netCDF4.Dataset(f_in, 'r') as h_in:
            for dimension, size in h_in.dimensions.items():
                h_out.createDimension(dimension, len(size))

            for varname, var_in in h_in.variables.items():
                var_out = h_out.createVariable(
                    varname, var_in.dtype, var_in.dimensions,
                    zlib=True, complevel=2
                )
                for key in var_in.ncattrs():
                    if key != '_FillValue':
                        setattr(var_out, key, getattr(var_in, key))
                var_in.set_auto_maskandscale(False)
                var_out.set_auto_maskandscale(False)
                var_out[:] = var_in[:]

            for attr in h_in.ncattrs():
                setattr(h_out, attr, getattr(h_in, attr))


def new_copy(f_in, f_out):
    with xr.open_dataset(f_in) as d_in:
        d_in.to_netcdf(f_out)
```

Here I compare both functions in terms of memory usage:

```python
import holoviews as hv
from dask.diagnostics import ResourceProfiler, visualize

hv.extension("bokeh")

F_IN = "/tmp/test_original.nc"
F_OUT = "/tmp/test.nc"

!rm -rfv {F_OUT}
with ResourceProfiler(dt=0.1) as rprof_old:
    old_copy(F_IN, F_OUT)
rprof_old.visualize()

!rm -rfv {F_OUT}
with ResourceProfiler(dt=0.1) as rprof_new:
    new_copy(F_IN, F_OUT)

visualize([rprof_old, rprof_new])
```

What happened:

xarray seems to load the entire file into memory before dumping it.

What you expected to happen:

How can I tell xarray to load/dump variable by variable without loading the entire file?
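
(For reference, a minimal sketch of one approach that might keep memory usage closer to one chunk at a time: open the dataset with dask chunks so the variables stay lazy, and let `to_netcdf` write them chunk by chunk. The `chunked_copy` name and the chunk size below are illustrative, not from the original issue.)

```python
import xarray as xr


def chunked_copy(f_in, f_out):
    # Opening with dask chunks keeps each variable lazy, so to_netcdf can
    # stream the data chunk by chunk instead of holding the whole dataset
    # in memory. The chunk size along "obs" is an arbitrary choice here.
    with xr.open_dataset(f_in, chunks={"obs": 1_000_000}) as d_in:
        d_in.to_netcdf(f_out)
```

Under the same assumption, another option is to write one variable at a time to an existing file with `to_netcdf(..., mode="a")`, which should also keep peak memory close to the size of a single variable.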

Thank you.

Environment:

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.6 (default, Jul 30 2021, 16:35:19) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-142-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.19.0
pandas: 1.3.2
numpy: 1.20.3
scipy: 1.6.2
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.1
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.07.2
distributed: 2021.07.2
matplotlib: 3.4.3
cartopy: 0.19.0
seaborn: None
numbagg: None
pint: 0.17
setuptools: 52.0.0.post20210125
pip: 21.2.2
conda: 4.10.3
pytest: 6.2.5
IPython: 7.26.0
sphinx: None
```