# The method ``to_netcdf`` does not preserve chunks (#8385)

### What happened?

The methods ``to_zarr`` and ``to_netcdf`` behave inconsistently for chunked datasets. The latter does not preserve existing chunk information; the chunks must be specified within the ``encoding`` dictionary.

### What did you expect to happen?

I expected the behaviour to be consistent for all ``to_XXX()`` methods.

### Minimal Complete Verifiable Example

```Python
import xarray as xr
import dask.array as da

rng = da.random.RandomState()
shape = (20, 20)
chunks = [10, 10]
dims = ["x", "y"]
z = rng.standard_normal(shape, chunks=chunks)
ds = xr.DataArray(z, dims=dims, name="z").to_dataset()

ds.chunks
# Frozen({'x': (10, 10), 'y': (10, 10)})

# This one is rechunked
ds.to_netcdf("/tmp/test1.nc", encoding={"z": {"chunksizes": (5, 5)}})
# This one is not rechunked; the original chunks are also lost
ds.chunk({"x": 5, "y": 5}).to_netcdf("/tmp/test2.nc")
# This one is rechunked
ds.chunk({"x": 5, "y": 5}).to_zarr("/tmp/test2", mode="w")

xr.open_mfdataset("/tmp/test1.nc").chunks
# Frozen({'x': (5, 5, 5, 5), 'y': (5, 5, 5, 5)})
xr.open_mfdataset("/tmp/test2.nc").chunks
# Frozen({'x': (20,), 'y': (20,)})
xr.open_mfdataset("/tmp/test2", engine="zarr").chunks
# Frozen({'x': (5, 5, 5, 5), 'y': (5, 5, 5, 5)})
```

### MVCE confirmation

- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

### Relevant log output

_No response_

### Anything else we need to know?

I got the same results with the ``h5netcdf`` and ``scipy`` backends, so I am not sure whether this is a bug or not. The code above is a modified version of #2198.

A suggestion: the documentation provides only examples of encoding styles. It would be helpful to provide links to a full specification.
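A possible workaround until the behaviour is unified (an editorial sketch, not part of the original report): derive the ``encoding`` from the dataset's existing dask chunks before calling ``to_netcdf``. The helper name ``chunk_encoding`` is hypothetical, not an xarray API, and the sketch assumes uniform chunks per dimension.

```Python
# Hypothetical helper, not an xarray API: translate each dask-backed
# variable's chunk sizes into netCDF "chunksizes" encoding.
def chunk_encoding(ds):
    encoding = {}
    for name, var in ds.data_vars.items():
        if var.chunks is not None:
            # var.chunks is a tuple of block-size tuples, one per dimension;
            # with uniform chunking the first block size is the chunk size.
            encoding[name] = {"chunksizes": tuple(c[0] for c in var.chunks)}
    return encoding

chunked = ds.chunk({"x": 5, "y": 5})
chunked.to_netcdf("/tmp/test3.nc", encoding=chunk_encoding(chunked))
# By analogy with test1.nc above, reopening should show
# Frozen({'x': (5, 5, 5, 5), 'y': (5, 5, 5, 5)})
```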
### Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.5.5-1-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2

xarray: 2023.10.1
pandas: 2.1.1
numpy: 1.24.4
scipy: 1.11.3
netCDF4: 1.6.4
pydap: None
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.10.0
distributed: 2023.10.0
matplotlib: 3.8.0
cartopy: 0.22.0
seaborn: None
numbagg: 0.5.1
fsspec: 2023.10.0
cupy: None
pint: None
sparse: 0.14.0
flox: 0.8.1
numpy_groupies: 0.10.2
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: None
mypy: None
IPython: 8.16.1
sphinx: None
```

# GH2550 revisited (#4830)

**Is your feature request related to a problem? Please describe.**

I am retrieving files from AWS: https://registry.opendata.aws/wrf-se-alaska-snap/. An example:

```
import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)
s3path = 's3://wrf-se-ak-ar5/gfdl/hist/daily/1980/WRFDS_1980-01-0[12].nc'
remote_files = s3.glob(s3path)
fileset = [s3.open(file) for file in remote_files]
ds = xr.open_mfdataset(fileset, concat_dim='Time', decode_cf=False)
ds
```

The data files for 1980 are missing the time coordinate, so the code above fails. The time could be obtained by parsing the file name; however, in the current implementation the *source* attribute is available only when the fileset consists of strings or *Path*s.

**Describe the solution you'd like**

I would suggest returning to the original suggestion in #2550 (pass *filename_or_object* as an argument to the *preprocess* function), but with the necessary inspection. Here is my attempt (code in *open_mfdataset*):

```
open_kwargs = dict(
    engine=engine, chunks=chunks or {}, lock=lock, autoclose=autoclose, **kwargs
)

if preprocess is not None:
    # Get number of free arguments
    from inspect import signature

    parms = signature(preprocess).parameters
    num_preprocess_args = len([p for p in parms.values() if p.default == p.empty])
    if num_preprocess_args not in (1, 2):
        raise ValueError('preprocess accepts only 1 or 2 arguments')

if parallel:
    import dask

    # wrap the open_dataset, getattr, and preprocess with delayed
    open_ = dask.delayed(open_dataset)
    getattr_ = dask.delayed(getattr)
    if preprocess is not None:
        preprocess = dask.delayed(preprocess)
else:
    open_ = open_dataset
    getattr_ = getattr

datasets = [open_(p, **open_kwargs) for p in paths]
file_objs = [getattr_(ds, "_file_obj") for ds in datasets]
if preprocess is not None:
    if num_preprocess_args == 1:
        datasets = [preprocess(ds) for ds in datasets]
    else:
        datasets = [preprocess(ds, p) for (ds, p) in zip(datasets, paths)]
```

With this, I can define a function *fix* as follows:

```
import os
from datetime import datetime

def fix(ds, source):
    vtime = datetime.strptime(os.path.basename(source.path), 'WRFDS_%Y-%m-%d.nc')
    return ds.assign_coords(Time=[vtime])

ds = xr.open_mfdataset(fileset, preprocess=fix, concat_dim='Time', decode_cf=False)
```

This is backward compatible; *preprocess* can accept any number of arguments, because arguments bound with *partial* acquire defaults and no longer count as free:

```
from functools import partial
from pathlib import Path
import xarray as xr

def fix1(ds):
    print('fix1')
    return ds

def fix2(ds, file):
    print('fix2:', file.as_uri())
    return ds

def fix3(ds, file, arg):
    print('fix3:', file.as_uri(), arg)
    return ds

fileset = [Path('/home/george/Downloads/WRFDS_1988-04-23.nc'),
           Path('/home/george/Downloads/WRFDS_1988-04-24.nc')]

ds = xr.open_mfdataset(fileset, preprocess=fix1, concat_dim='Time', parallel=True)
ds = xr.open_mfdataset(fileset, preprocess=fix2, concat_dim='Time')
ds = xr.open_mfdataset(fileset, preprocess=partial(fix3, arg='additional argument'),
                       concat_dim='Time')
```

```
fix1
fix1
fix2: file:///home/george/Downloads/WRFDS_1988-04-23.nc
fix2: file:///home/george/Downloads/WRFDS_1988-04-24.nc
fix3: file:///home/george/Downloads/WRFDS_1988-04-23.nc additional argument
fix3: file:///home/george/Downloads/WRFDS_1988-04-24.nc additional argument
```
**Describe alternatives you've considered**

The simple solution would be to make xarray s3fs-aware. IMHO this is not particularly elegant: either a check for an attribute or an import within a *try/except* block would be needed.
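For reference, a workaround that stays within the current API (an editorial sketch under the assumptions of the example above: the 1980 file layout and the `WRFDS_%Y-%m-%d.nc` naming): open each file individually, assign the time parsed from its name, and concatenate.

```
import os
from datetime import datetime

import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)
paths = s3.glob('s3://wrf-se-ak-ar5/gfdl/hist/daily/1980/WRFDS_1980-01-0[12].nc')

datasets = []
for path in paths:
    # Opening each file separately keeps the path at hand, so the
    # missing Time coordinate can be reconstructed from the file name.
    # Depending on the installed backends, engine='h5netcdf' may be needed.
    ds = xr.open_dataset(s3.open(path), decode_cf=False)
    vtime = datetime.strptime(os.path.basename(path), 'WRFDS_%Y-%m-%d.nc')
    datasets.append(ds.assign_coords(Time=[vtime]))

ds = xr.concat(datasets, dim='Time')
```

This loses the parallel, delayed opening that *open_mfdataset* provides, which is part of why passing the path to *preprocess* would be the cleaner fix.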