**Issue #4261: to_zarr with append_dim behavior changed in 0.16.0 release**
*pydata/xarray · opened 2020-07-23 · closed 2020-11-19 (completed) · 4 comments · author association: NONE*

**What happened**: In version 0.15.1, calling `to_zarr` on a `Dataset` with a given `append_dim` would create a new zarr store if one did not already exist. In version 0.16.0 this is no longer the case: the call fails when the store does not exist.

**Minimal Complete Verifiable Example**:

```python
import xarray as xr

a = xr.DataArray([1, 2], {"t": [1, 2]}, ("t",))
ds = xr.Dataset({"v": a})
ds.to_zarr("CHOOSE_PATH", append_dim="t")
```

**Environment**:
Output of `xr.show_versions()`:

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.7 (default, May 7 2020, 21:25:33) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-177.el8.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

libhdf5: None
libnetcdf: None

xarray: 0.16.0
pandas: 1.0.5
numpy: 1.18.1
scipy: 1.5.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.14.0
distributed: 2.14.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.2.0.post20200714
pip: 20.1.1
conda: None
pytest: 5.4.3
IPython: None
sphinx: None
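*Editor's note:* one way to recover the 0.15.1 semantics on 0.16.0 is to branch on whether the store already exists. The helper below is a minimal, untested sketch assuming a local filesystem path; `append_or_create` is a hypothetical name, not xarray API.

```python
import os

def append_or_create(ds, path, dim):
    # Hypothetical workaround sketch: create the store on the first
    # write (mode="w"), append along `dim` on subsequent writes.
    if os.path.exists(path):
        ds.to_zarr(path, append_dim=dim)
    else:
        ds.to_zarr(path, mode="w")
```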
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4261/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 662982199,MDU6SXNzdWU2NjI5ODIxOTk=,4241,Parallel tasks on subsets of a dask array wrapped in an xarray Dataset,41797673,closed,0,,,5,2020-07-21T12:47:41Z,2020-07-27T08:18:13Z,2020-07-27T08:18:13Z,NONE,,,,"I have a large xarray.Dataset stored as a zarr. I want to perform some custom operations on it that cannot be done by just using numpy-like functions that a Dask cluster will automatically deal with. Therefore, I partition the dataset into small subsets and for each subset submit to my Dask cluster a task of the form ``` def my_task(zarr_path, subset_index): ds = xarray.open_zarr(zarr_path) # this returns an xarray.Dataset containing a dask.array sel = ds.sel(partition_index) sel = sel.load() # I want to get the data into memory # then do my custom operations ... ``` However, I have noticed this creates a ""task within a task"": when a worker receives ""my_task"", it in turn submits tasks to the cluster to load the relevant part of the dataset. To avoid this and ensure that the full task is executed within the worker, I am submitting instead the task: ``` def my_task_2(zarr_path, subset_index): with dask.config.set(scheduler=""threading""): my_task(zarr_path, subset_index) ``` Is this the best way to do this? What's the best practice for this kind of situation? I have already posted this on stackoverflow but did not get any answer, so I am adding this here hoping it increases visibility. Apologies if this is considered ""pollution"". https://stackoverflow.com/questions/62874267/parallel-tasks-on-subsets-of-a-dask-array-wrapped-in-an-xarray-dataset","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4241/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue