**Issue #5995 — High memory usage of xarray vs netCDF4 function** (closed as completed; 3 comments; opened 2021-11-17, closed 2023-09-12)

Hi, I would like to open a netCDF file, change some variable attributes, the zlib compression settings, and sometimes the global attributes. I used to do this with netCDF4, and it worked. Recently I tried using xarray for the same job. The results are identical, but xarray always loads the entire file into memory instead of writing it variable by variable.

**Minimal example**

Creating the example file:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset()
obs = 4835680
n = 20
basic_encoding = dict(zlib=True, shuffle=True, complevel=1)

# some variables with a scale factor
for i in range(3):
    vname = f"scale{i:02d}"
    ds[vname] = (["obs"], np.random.rand(obs).astype(np.float32) / 1e3)
    ds[vname].encoding.update(basic_encoding)
    ds[vname].encoding.update(
        {"dtype": np.uint16, "scale_factor": 0.0001, "add_offset": 0, "chunksizes": (1611894,)}
    )

# some variables without a scale factor
for i in range(3):
    vname = f"float{i:02d}"
    ds[vname] = (["obs"], np.random.rand(obs).astype(np.float32))
    ds[vname].encoding.update(basic_encoding)
    ds[vname].encoding.update({"chunksizes": (967136,)})

# some variables with 2 dimensions, which use more memory
for i in range(3):
    vname = f"matrix{i:02d}"
    ds[vname] = (["obs", "n"], np.random.rand(obs, n).astype(np.float32) * 10)
    ds[vname].encoding.update(basic_encoding)
    ds[vname].encoding.update(
        {"dtype": np.int16, "scale_factor": 0.01, "add_offset": 0, "chunksizes": (20000, 20)}
    )

ds.to_netcdf("/tmp/test_original.nc")
```

Here is my old function to copy/rewrite the netCDF file, and the new one (I removed the irrelevant changes from both functions to keep only the important parts):

```python
import netCDF4
import xarray as xr


def old_copy(f_in, f_out):
    with netCDF4.Dataset(f_out, "w") as h_out:
        with netCDF4.Dataset(f_in, "r") as h_in:
            for dimension, size in h_in.dimensions.items():
                h_out.createDimension(dimension, len(size))

            for varname, var_in in h_in.variables.items():
                var_out = h_out.createVariable(
                    varname, var_in.dtype, var_in.dimensions, zlib=True, complevel=2
                )
                for key in var_in.ncattrs():
                    if key != "_FillValue":
                        setattr(var_out, key, getattr(var_in, key))

                var_in.set_auto_maskandscale(False)
                var_out.set_auto_maskandscale(False)
                var_out[:] = var_in[:]

            for attr in h_in.ncattrs():
                setattr(h_out, attr, getattr(h_in, attr))


def new_copy(f_in, f_out):
    with xr.open_dataset(f_in) as d_in:
        d_in.to_netcdf(f_out)
```

Here I compare both functions in terms of memory usage:

```python
import holoviews as hv
from dask.diagnostics import ResourceProfiler, visualize

hv.extension("bokeh")

F_IN = "/tmp/test_original.nc"
F_OUT = "/tmp/test.nc"

!rm -rfv {F_OUT}
with ResourceProfiler(dt=0.1) as rprof_old:
    old_copy(F_IN, F_OUT)
# rprof.visualize()

!rm -rfv {F_OUT}
with ResourceProfiler(dt=0.1) as rprof_new:
    new_copy(F_IN, F_OUT)

visualize([rprof_old, rprof_new])
```

![image](https://user-images.githubusercontent.com/49512274/142226010-b6a3b69a-82f3-46dd-b02c-88da065dbf52.png)

**What happened**: xarray seems to load the entire file into memory before writing it back out.

**What you expected to happen**: How can I tell xarray to read/write variable by variable without loading the entire file? Thank you.

**Environment**:
**Output of `xr.show_versions()`**

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.6 (default, Jul 30 2021, 16:35:19) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-142-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.19.0
pandas: 1.3.2
numpy: 1.20.3
scipy: 1.6.2
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.1
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.07.2
distributed: 2021.07.2
matplotlib: 3.4.3
cartopy: 0.19.0
seaborn: None
numbagg: None
pint: 0.17
setuptools: 52.0.0.post20210125
pip: 21.2.2
conda: 4.10.3
pytest: 6.2.5
IPython: 7.26.0
sphinx: None
```
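A minimal sketch of the variable-by-variable behaviour the question above asks for, assuming dask is installed (the chunk size and output path here are illustrative choices, not from the report): opening the file with a `chunks=` argument keeps every variable lazy, so `to_netcdf` streams the data chunk by chunk instead of materializing the whole dataset first.

```python
import xarray as xr

# chunks= gives dask-backed (lazy) variables; to_netcdf then reads and
# writes one chunk at a time, keeping peak memory near one chunk's size.
with xr.open_dataset("/tmp/test_original.nc", chunks={"obs": 500_000}) as ds:
    ds.to_netcdf("/tmp/test_streamed.nc")
```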
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5995/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 822987300,MDU6SXNzdWU4MjI5ODczMDA=,5001,.min() doesn't work on np.datetime64 with a chunked Dataset,49512274,open,0,,,2,2021-03-05T11:12:19Z,2022-05-01T16:11:48Z,,NONE,,,," Hi all, if a xr.Dataset is chunked, i cannot do ds.time.min(), i get an error : `ufunc 'add' cannot use operands with types dtype('Output of xr.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.15.0-133-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: fr_FR.UTF-8 LOCALE: fr_FR.UTF-8 libhdf5: 1.12.0 libnetcdf: 4.7.4 xarray: 0.16.2 pandas: 1.2.1 numpy: 1.19.5 scipy: 1.6.0 netCDF4: 1.5.5.1 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.6.1 cftime: 1.3.1 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2021.01.1 distributed: 2021.01.1 matplotlib: 3.3.4 cartopy: None seaborn: None numbagg: None pint: 0.16.1 setuptools: 52.0.0.post20210125 pip: 20.3.3 conda: None pytest: 6.2.2 IPython: 7.20.0 sphinx: 3.5.0 ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5001/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 673504545,MDU6SXNzdWU2NzM1MDQ1NDU=,4311,"uint32 variable in zarr, but float64 when loading with xarray",49512274,closed,0,,,1,2020-08-05T12:34:35Z,2021-04-19T08:59:51Z,2021-04-19T08:59:51Z,NONE,,,,"Hi all, I start to play with xarray and zarr and came across something curious : I create a zarr folder and a zarr variable in uint32. When i load this dataset with xarray, it loads in float64. I don't know if it is something expected ? ```python fichier1 = ""/tmp/test.zarr"" zh = zarr.open(fichier1, ""w"") example = np.zeros(10, dtype=np.uint32) myvar = zh.create_dataset(""myvar"", shape=example.shape, dtype=example.dtype ) myvar.attrs[""_ARRAY_DIMENSIONS""] = [""obs""] # <- without this, the zarr dataset will not be readable by xarray myvar[:] = example # dtype is uint32 zh.myvar.dtype ``` ```python >>> dtype('uint32') ``` when reloading with zarr : ```python # dtype is stil uint32 zh = zarr.open(fichier1, 'r') zh.myvar.dtype ``` ```python >>> dtype('uint32') ``` But when loading with xarray : ```python # dtype is float64 ds = xr.open_zarr(fichier1) ds.myvar.dtype ``` ```python >>> dtype('float64') ``` Is it something expected ? Am I missing something ? link to the notebook created : [bad_dtype_zarr_xarray](https://github.com/ludwigVonKoopa/problems/blob/master/bad_dtype_zarr_xarray.ipynb) **Environment**:
**Output of `xr.show_versions()`**

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-106-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.3.2
cftime: 1.1.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.13.0
distributed: 2.13.0
matplotlib: 3.1.3
cartopy: None
seaborn: None
numbagg: None
setuptools: 46.1.1.post20200323
pip: 20.0.2
conda: None
pytest: 5.4.1
IPython: 7.13.0
sphinx: None
```
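A hedged sketch of the usual explanation for the dtype change above: xarray interprets zarr's default `fill_value` as a CF `_FillValue` and masks the variable, which promotes integer dtypes to float64 during decoding. If that is the cause here, disabling mask-and-scale keeps the stored dtype (this reuses the `/tmp/test.zarr` store created above and trades away CF decoding):

```python
import xarray as xr

# Assumption: the float64 comes from CF decoding (zarr's fill_value being
# treated as _FillValue). Skipping mask-and-scale preserves the on-disk
# uint32 instead of masking it into a float array.
ds = xr.open_zarr("/tmp/test.zarr", mask_and_scale=False)
print(ds.myvar.dtype)  # uint32
```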
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4311/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 778221436,MDU6SXNzdWU3NzgyMjE0MzY=,4763,Keep attributes across operations,49512274,closed,0,,,1,2021-01-04T16:45:45Z,2021-01-04T16:52:15Z,2021-01-04T16:52:15Z,NONE,,,,"Hi, I felt on this [issue#2582](https://github.com/pydata/xarray/issues/2582) about the problem when arithmetic operation doesn't keep attribute in an DataArray. Is this problem not merged yet ? I just installed a fresh conda env with python3.8 & xarray 0.16.2 and the problem still persist : ```python ds = xr.Dataset({""a"": ((""x"",), np.array([1,2,3]))}) ds[""a""].attrs[""units""] = ""m"" ds.a Out[1]: array([1, 2, 3]) Dimensions without coordinates: x Attributes: units: m ``` ```python ds[""b""] = ds.a * 2 ds.b Out[2]: array([2, 4, 6]) Dimensions without coordinates: x ```
**Output of `xr.show_versions()`**

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-128-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.16.2
pandas: 1.1.5
numpy: 1.19.2
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 51.0.0.post20201207
pip: 20.3.3
conda: None
pytest: None
IPython: 7.19.0
sphinx: None
```
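For reference, a sketch of the opt-in behaviour that later xarray releases expose; this assumes a recent xarray where `xr.set_options(keep_attrs=True)` also covers binary operations (it does not in the 0.16.2 shown above):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": (("x",), np.array([1, 2, 3]))})
ds["a"].attrs["units"] = "m"

# With keep_attrs enabled, arithmetic carries attrs over to the result.
with xr.set_options(keep_attrs=True):
    ds["b"] = ds.a * 2

print(ds.b.attrs)  # {'units': 'm'}
```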
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4763/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue