id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1083621690,I_kwDOAMm_X85AlsE6,6084,Initialise zarr metadata without computing dask graph,42455466,open,0,,,6,2021-12-17T21:17:42Z,2024-04-03T19:08:26Z,,NONE,,,,"**Is your feature request related to a problem? Please describe.**

When writing large Zarr stores, the [xarray docs](https://xarray.pydata.org/en/stable/user-guide/io.html#appending-to-existing-zarr-stores) recommend first creating an initial Zarr store without writing any of its array data. The recommended approach is to first create a dummy dask-backed `Dataset`, and then call `to_zarr` with `compute=False` to write only metadata to Zarr. This works great.

It seems that in one common use case for this approach (including the example in the docs above), the entire dataset to be written to Zarr is already represented in a `Dataset` (let's call this `ds`). Thus, rather than creating a dummy `Dataset` with exactly the same metadata as `ds`, it is more convenient to initialise the Zarr store with `ds.to_zarr(..., compute=False)`. See for example:

https://discourse.pangeo.io/t/many-netcdf-to-single-zarr-store-using-concurrent-futures/2029
https://discourse.pangeo.io/t/map-blocks-and-to-zarr-region/2019
https://discourse.pangeo.io/t/netcdf-to-zarr-best-practices/1119/12
https://discourse.pangeo.io/t/best-practice-for-memory-management-to-iteratively-write-a-large-dataset-with-xarray/1989

However, calling `to_zarr` with `compute=False` still constructs the dask graph for writing the Zarr store. The graph is never used in this use case, but constructing it can take a very long time for large graphs.

**Describe the solution you'd like**

Is there scope to add an option to `to_zarr` to initialise the store _without_ constructing the dask graph? Or perhaps an `initialise_zarr` method would be cleaner?
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6084/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1063046540,I_kwDOAMm_X84_XM2M,6026,Delaying open produces different type of `cftime` object,42455466,closed,0,,,3,2021-11-25T00:47:22Z,2022-01-13T13:49:27Z,2022-01-13T13:49:27Z,NONE,,,,"**What happened**:

The task is opening a dataset (e.g. a netCDF or Zarr file) with a time coordinate, using `use_cftime=True`. Delaying the task with dask results in the time coordinate being represented as base-class `cftime.datetime` objects, whereas when the task is not delayed, calendar-specific subclasses such as `cftime.DatetimeJulian` are used.

**What you expected to happen**:

Consistent `cftime` objects to be used, regardless of whether the opening task is delayed or not.

**Minimal Complete Verifiable Example**:

```python
import dask
import numpy as np
import xarray as xr
from dask.distributed import LocalCluster, Client

cluster = LocalCluster()
client = Client(cluster)

# Write some data
var = np.random.random(4)
time = xr.cftime_range('2000-01-01', periods=4, calendar='julian')
ds = xr.Dataset(data_vars={'var': ('time', var)}, coords={'time': time})
ds.to_netcdf('test.nc', mode='w')

# Open written data
ds1 = xr.open_dataset('test.nc', use_cftime=True)
print(f'ds1: {ds1.time} \n')

# Delayed open written data
ds2 = dask.delayed(xr.open_dataset)('test.nc', use_cftime=True)
ds2 = dask.compute(ds2)[0]
print(f'ds2: {ds2.time} \n')

# Operations like xr.open_mfdataset which use dask.delayed internally
# when parallel=True (I think) produce the same result as ds2
ds3 = xr.open_mfdataset('test.nc', use_cftime=True, parallel=True)
print(f'ds3: {ds3.time}')
```

returns

```
ds1: array([cftime.DatetimeJulian(2000, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeJulian(2000, 1, 2, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeJulian(2000, 1, 3, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeJulian(2000, 1, 4, 0, 0, 0, 0, has_year_zero=False)],
      dtype=object)
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2000-01-04 00:00:00

ds2: array([cftime.datetime(2000, 1, 1, 0, 0, 0, 0, calendar='julian', has_year_zero=False),
       cftime.datetime(2000, 1, 2, 0, 0, 0, 0, calendar='julian', has_year_zero=False),
       cftime.datetime(2000, 1, 3, 0, 0, 0, 0, calendar='julian', has_year_zero=False),
       cftime.datetime(2000, 1, 4, 0, 0, 0, 0, calendar='julian', has_year_zero=False)],
      dtype=object)
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2000-01-04 00:00:00

ds3: array([cftime.datetime(2000, 1, 1, 0, 0, 0, 0, calendar='julian', has_year_zero=False),
       cftime.datetime(2000, 1, 2, 0, 0, 0, 0, calendar='julian', has_year_zero=False),
       cftime.datetime(2000, 1, 3, 0, 0, 0, 0, calendar='julian', has_year_zero=False),
       cftime.datetime(2000, 1, 4, 0, 0, 0, 0, calendar='julian', has_year_zero=False)],
      dtype=object)
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2000-01-04 00:00:00
```

**Anything else we need to know?**:

I noticed this because the DatetimeAccessor `ceil`, `floor` and `round` methods return errors for base-class `cftime.datetime` objects (but not for the calendar-specific subclasses such as `cftime.DatetimeJulian`) for all calendar types other than 'gregorian'.
For example,

```python
ds3.time.dt.floor('D')
```

returns the following traceback:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
----> 1 ds3.time.dt.floor('D')

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/core/accessor_dt.py in floor(self, freq)
    220         """"""
    221 
--> 222         return self._tslib_round_accessor(""floor"", freq)
    223 
    224     def ceil(self, freq):

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/core/accessor_dt.py in _tslib_round_accessor(self, name, freq)
    202     def _tslib_round_accessor(self, name, freq):
    203         obj_type = type(self._obj)
--> 204         result = _round_field(self._obj.data, name, freq)
    205         return obj_type(result, name=name, coords=self._obj.coords, dims=self._obj.dims)
    206 

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/core/accessor_dt.py in _round_field(values, name, freq)
    142         )
    143     else:
--> 144         return _round_through_series_or_index(values, name, freq)
    145 
    146 

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/core/accessor_dt.py in _round_through_series_or_index(values, name, freq)
    110     method = getattr(values_as_cftimeindex, name)
    111 
--> 112     field_values = method(freq=freq).values
    113 
    114     return field_values.reshape(values.shape)

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/coding/cftimeindex.py in floor(self, freq)
    733         CFTimeIndex
    734         """"""
--> 735         return self._round_via_method(freq, _floor_int)
    736 
    737     def ceil(self, freq):

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/coding/cftimeindex.py in _round_via_method(self, freq, method)
    714 
    715         unit = _total_microseconds(offset.as_timedelta())
--> 716         values = self.asi8
    717         rounded = method(values, unit)
    718         return _cftimeindex_from_i8(rounded, self.date_type, self.name)
/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/coding/cftimeindex.py in asi8(self)
    684         epoch = self.date_type(1970, 1, 1)
    685         return np.array(
--> 686             [
    687                 _total_microseconds(exact_cftime_datetime_difference(epoch, date))
    688                 for date in self.values

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/coding/cftimeindex.py in (.0)
    685         return np.array(
    686             [
--> 687                 _total_microseconds(exact_cftime_datetime_difference(epoch, date))
    688                 for date in self.values
    689             ],

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/core/resample_cftime.py in exact_cftime_datetime_difference(a, b)
    356     datetime.timedelta
    357     """"""
--> 358     seconds = b.replace(microsecond=0) - a.replace(microsecond=0)
    359     seconds = int(round(seconds.total_seconds()))
    360     microseconds = b.microsecond - a.microsecond

src/cftime/_cftime.pyx in cftime._cftime.datetime.__sub__()

TypeError: cannot compute the time difference between dates with different calendars
```

My apologies for conflating two issues here. I'm happy to open a separate issue for this if that's preferred.

**Environment**:
Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.4 | packaged by conda-forge | (default, May 10 2021, 22:13:33) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-305.19.1.el8.nci.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.20.1
pandas: 1.3.4
numpy: 1.21.4
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: 2.9.5
cftime: 1.5.0
nc_time_axis: 1.4.0
PseudoNetCDF: None
rasterio: 1.2.4
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
fsspec: 2021.05.0
cupy: None
pint: 0.18
sparse: None
setuptools: 49.6.0.post20210108
pip: 21.1.2
conda: 4.10.1
pytest: None
IPython: 7.24.0
sphinx: None
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6026/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
789755611,MDU6SXNzdWU3ODk3NTU2MTE=,4833,Strange behaviour when overwriting files with to_netcdf and html repr,42455466,closed,0,,,2,2021-01-20T08:28:35Z,2021-01-20T20:00:23Z,2021-01-20T20:00:23Z,NONE,,,,"**What happened**:

I'm experiencing some strange behaviour when overwriting netCDF files using `to_netcdf` in a Jupyter notebook. The issue is a bit quirky and convoluted and only seems to come about when using xarray's html repr in Jupyter. I've tried to find a reproducible example that demonstrates the issue (it's still quite convoluted, sorry):

I can generate some data, save it to a netCDF file, reopen it and everything works as expected:

```python
import numpy as np
import xarray as xr

ones = xr.DataArray(np.ones(5), coords=[range(5)], dims=['x']).to_dataset(name='a')
ones.to_netcdf('./a.nc')
print(xr.open_dataset('./a.nc')['a'])
```
```
array([1., 1., 1., 1., 1.])
Coordinates:
  * x        (x) int64 0 1 2 3 4
```

I can overwrite `a.nc` with a modified dataset and everything still works as expected:

```python
twos = 2 * ones
twos.to_netcdf('./a.nc')
print(xr.open_dataset('./a.nc', cache=False)['a'])
```
```
array([2., 2., 2., 2., 2.])
Coordinates:
  * x        (x) int64 0 1 2 3 4
```

I can run the above cell as many times as I like and always get the expected behaviour.
However, if instead of `print`ing the `open_dataset` line, I allow it to be rendered by the xarray html repr, I find that the cell will run once and then will fail with a `Permission denied` error the second time it is run:

```python
twos.to_netcdf('./a.nc')
xr.open_dataset('./a.nc', cache=False)['a']
```
```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
.../lib/python3.8/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    198         try:
--> 199             file = self._cache[self._key]
    200         except KeyError:

.../lib/python3.8/site-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
     52         with self._lock:
---> 53             value = self._cache[key]
     54             self._cache.move_to_end(key)

KeyError: [, ('.../a.nc',), 'a', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False))]

During handling of the above exception, another exception occurred:
.
.
.
PermissionError: [Errno 13] Permission denied: b'.../a.nc'
```

If I manually remove the file in question, I can resave it, but from then on xarray seems to have its wires crossed somehow and will present `twos` from `a.nc` regardless of what it actually contains:

```python
!rm ./a.nc
ones.to_netcdf('./a.nc')
print(xr.open_dataset('./a.nc')['a'])
```
```
array([2., 2., 2., 2., 2.])
Coordinates:
  * x        (x) int64 0 1 2 3 4
```

Note that in the last example, the data saved on disk is correct (i.e. contains ones) but xarray is still somehow linked to the `twos` data.

**Anything else we need to know?**:

I've come across this unexpected behaviour a few times. In the above example, I've had to add `cache=False` to consistently produce the behaviour, but in the past I've managed to produce these symptoms _without_ `cache=False` (I'm just not exactly sure how).
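
For what it's worth, the `lru_cache.py` frame in the traceback above suggests a cached file handle is being reused. A toy, pure-Python sketch of how a handle cache keyed on the file path can serve stale data after the file on disk has been replaced (illustrative only, not xarray's actual implementation):

```python
from collections import OrderedDict

class TinyFileCache:
    # Illustrative stand-in for an LRU cache of open file handles keyed by
    # (path, mode). Once a handle is cached, later acquisitions with the
    # same key reuse it, even if the file on disk has since been replaced.
    def __init__(self, maxsize=16):
        self._cache = OrderedDict()
        self._maxsize = maxsize

    def acquire(self, key, opener):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        handle = self._cache[key] = opener()
        if len(self._cache) > self._maxsize:
            self._cache.popitem(last=False)  # evict least recently used
        return handle

cache = TinyFileCache()
# The first acquisition opens the file; the second returns the cached
# handle without touching the disk at all.
h1 = cache.acquire(('./a.nc', 'r'), lambda: {'contents': 'twos'})
h2 = cache.acquire(('./a.nc', 'r'), lambda: {'contents': 'ones'})
assert h2['contents'] == 'twos'  # stale: the second opener was never called
```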
Anecdotally, the behaviour always seems to occur after having rendered the xarray object in Jupyter using the html repr.
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4833/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue