
issues


3 rows where user = 42455466 sorted by updated_at descending


Issue #6084: Initialise zarr metadata without computing dask graph
id: 1083621690 · node_id: I_kwDOAMm_X85AlsE6 · opened by dougiesquire (42455466) · state: open · comments: 6 · created: 2021-12-17T21:17:42Z · updated: 2024-04-03T19:08:26Z · author_association: NONE · repo: xarray (13221727) · type: issue

Is your feature request related to a problem? Please describe.

When writing large zarr stores, the xarray docs recommend first creating an initial Zarr store without writing all of its array data. The recommended approach is to create a dummy dask-backed Dataset and then call `to_zarr` with `compute=False` to write only metadata to Zarr. This works great.

It seems that in one common use case for this approach (including the example in the docs above), the entire dataset to be written to Zarr is already represented in a Dataset (let's call this `ds`). Thus, rather than creating a dummy Dataset with exactly the same metadata as `ds`, it is more convenient to initialise the Zarr store with `ds.to_zarr(..., compute=False)`. See for example:

  • https://discourse.pangeo.io/t/many-netcdf-to-single-zarr-store-using-concurrent-futures/2029
  • https://discourse.pangeo.io/t/map-blocks-and-to-zarr-region/2019
  • https://discourse.pangeo.io/t/netcdf-to-zarr-best-practices/1119/12
  • https://discourse.pangeo.io/t/best-practice-for-memory-management-to-iteratively-write-a-large-dataset-with-xarray/1989

However, calling to_zarr with compute=False still computes the dask graph for writing the Zarr store. The graph is never used in this use-case, but computing the graph can take a really long time for large graphs.
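To make the pattern concrete, here is a minimal sketch of the workflow described above, using standard xarray API (the store path `out.zarr`, the variable and dimension names, and the chunk sizes are illustrative):

```python
import dask.array as da
import numpy as np
import xarray as xr

# A dask-backed Dataset standing in for the full result to be written.
ds = xr.Dataset(
    {"var": (("time",), da.zeros(100, chunks=10))},
    coords={"time": np.arange(100)},
)

# Write metadata only: no chunk data is computed or stored yet. Note that
# xarray still builds the full (unused) write graph here, which is the
# slow step this issue asks to avoid.
ds.to_zarr("out.zarr", mode="w", compute=False)

# Later (e.g. from separate workers), fill in one region at a time.
ds.isel(time=slice(0, 10)).to_zarr("out.zarr", region={"time": slice(0, 10)})
```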

Describe the solution you'd like

Is there scope to add an option to `to_zarr` to initialise the store without computing the dask graph? Or perhaps an `initialise_zarr` method would be cleaner?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6084/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Issue #6026: Delaying open produces different type of `cftime` object
id: 1063046540 · node_id: I_kwDOAMm_X84_XM2M · opened by dougiesquire (42455466) · state: closed (completed) · comments: 3 · created: 2021-11-25T00:47:22Z · updated: 2022-01-13T13:49:27Z · closed: 2022-01-13T13:49:27Z · author_association: NONE · repo: xarray (13221727) · type: issue

What happened:

The task is opening a dataset (e.g. a netcdf or zarr file) with a time coordinate, using `use_cftime=True`. Delaying the task with dask results in the time coordinate being represented as `cftime.datetime` objects, whereas when the task is not delayed, `cftime.Datetime<Calendar>` objects are used.

What you expected to happen:

Consistent cftime objects to be used, regardless of whether the opening task is delayed or not.
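For reference, the two flavours of time object can be constructed directly; a small sketch, assuming a cftime version where the base class accepts a `calendar` argument (as in the environment below):

```python
import cftime

# Calendar-specific subclass (what the non-delayed open produces):
d1 = cftime.DatetimeJulian(2000, 1, 1)
# Base class carrying the calendar as an attribute (what the delayed open produces):
d2 = cftime.datetime(2000, 1, 1, calendar='julian')

print(type(d1))  # <class 'cftime._cftime.DatetimeJulian'>
print(type(d2))  # <class 'cftime._cftime.datetime'>
```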

Minimal Complete Verifiable Example:

```python
import dask
import numpy as np
import xarray as xr
from dask.distributed import LocalCluster, Client

cluster = LocalCluster()
client = Client(cluster)

# Write some data
var = np.random.random(4)
time = xr.cftime_range('2000-01-01', periods=4, calendar='julian')
ds = xr.Dataset(data_vars={'var': ('time', var)}, coords={'time': time})
ds.to_netcdf('test.nc', mode='w')

# Open written data
ds1 = xr.open_dataset('test.nc', use_cftime=True)
print(f'ds1: {ds1.time} \n')

# Delayed open written data
ds2 = dask.delayed(xr.open_dataset)('test.nc', use_cftime=True)
ds2 = dask.compute(ds2)[0]
print(f'ds2: {ds2.time} \n')

# Operations like xr.open_mfdataset, which use dask.delayed internally
# when parallel=True, (I think) produce the same result as ds2
ds3 = xr.open_mfdataset('test.nc', use_cftime=True, parallel=True)
print(f'ds3: {ds3.time}')
```

returns

```
ds1: <xarray.DataArray 'time' (time: 4)>
array([cftime.DatetimeJulian(2000, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeJulian(2000, 1, 2, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeJulian(2000, 1, 3, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeJulian(2000, 1, 4, 0, 0, 0, 0, has_year_zero=False)],
      dtype=object)
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2000-01-04 00:00:00

ds2: <xarray.DataArray 'time' (time: 4)>
array([cftime.datetime(2000, 1, 1, 0, 0, 0, 0, calendar='julian', has_year_zero=False),
       cftime.datetime(2000, 1, 2, 0, 0, 0, 0, calendar='julian', has_year_zero=False),
       cftime.datetime(2000, 1, 3, 0, 0, 0, 0, calendar='julian', has_year_zero=False),
       cftime.datetime(2000, 1, 4, 0, 0, 0, 0, calendar='julian', has_year_zero=False)],
      dtype=object)
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2000-01-04 00:00:00

ds3: <xarray.DataArray 'time' (time: 4)>
array([cftime.datetime(2000, 1, 1, 0, 0, 0, 0, calendar='julian', has_year_zero=False),
       cftime.datetime(2000, 1, 2, 0, 0, 0, 0, calendar='julian', has_year_zero=False),
       cftime.datetime(2000, 1, 3, 0, 0, 0, 0, calendar='julian', has_year_zero=False),
       cftime.datetime(2000, 1, 4, 0, 0, 0, 0, calendar='julian', has_year_zero=False)],
      dtype=object)
Coordinates:
  * time     (time) object 2000-01-01 00:00:00 ... 2000-01-04 00:00:00
```

Anything else we need to know?:

I noticed this because the DatetimeAccessor `ceil`, `floor` and `round` methods return errors for `cftime.datetime` objects (but not `cftime.Datetime<Calendar>` objects) for all calendar types other than 'gregorian'. For example,

```python
ds3.time.dt.floor('D')
```

returns the following traceback:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-613e63624953> in <module>
----> 1 ds3.time.dt.floor('D')

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/core/accessor_dt.py in floor(self, freq)
    220         """
    221
--> 222         return self._tslib_round_accessor("floor", freq)
    223
    224     def ceil(self, freq):

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/core/accessor_dt.py in _tslib_round_accessor(self, name, freq)
    202     def _tslib_round_accessor(self, name, freq):
    203         obj_type = type(self._obj)
--> 204         result = _round_field(self._obj.data, name, freq)
    205         return obj_type(result, name=name, coords=self._obj.coords, dims=self._obj.dims)
    206

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/core/accessor_dt.py in _round_field(values, name, freq)
    142         )
    143     else:
--> 144         return _round_through_series_or_index(values, name, freq)
    145
    146

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/core/accessor_dt.py in _round_through_series_or_index(values, name, freq)
    110     method = getattr(values_as_cftimeindex, name)
    111
--> 112     field_values = method(freq=freq).values
    113
    114     return field_values.reshape(values.shape)

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/coding/cftimeindex.py in floor(self, freq)
    733         CFTimeIndex
    734         """
--> 735         return self._round_via_method(freq, _floor_int)
    736
    737     def ceil(self, freq):

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/coding/cftimeindex.py in _round_via_method(self, freq, method)
    714
    715         unit = _total_microseconds(offset.as_timedelta())
--> 716         values = self.asi8
    717         rounded = method(values, unit)
    718         return _cftimeindex_from_i8(rounded, self.date_type, self.name)

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/coding/cftimeindex.py in asi8(self)
    684         epoch = self.date_type(1970, 1, 1)
    685         return np.array(
--> 686             [
    687                 _total_microseconds(exact_cftime_datetime_difference(epoch, date))
    688                 for date in self.values

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/coding/cftimeindex.py in <listcomp>(.0)
    685         return np.array(
    686             [
--> 687                 _total_microseconds(exact_cftime_datetime_difference(epoch, date))
    688                 for date in self.values
    689             ],

/g/data/xv83/ds0092/software/miniconda3/envs/pangeo/lib/python3.9/site-packages/xarray/core/resample_cftime.py in exact_cftime_datetime_difference(a, b)
    356     datetime.timedelta
    357     """
--> 358     seconds = b.replace(microsecond=0) - a.replace(microsecond=0)
    359     seconds = int(round(seconds.total_seconds()))
    360     microseconds = b.microsecond - a.microsecond

src/cftime/_cftime.pyx in cftime._cftime.datetime.__sub__()

TypeError: cannot compute the time difference between dates with different calendars
```

My apologies for conflating two issues here. I'm happy to open a separate issue for this if that's preferred.
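For what it's worth, reading the frames above suggests a plausible (unverified) minimal reproduction of the final error: `CFTimeIndex.asi8` constructs its epoch via `self.date_type(1970, 1, 1)`, and for base-class `cftime.datetime` values that epoch gets the default 'standard' calendar, which cannot then be differenced against a 'julian' date:

```python
import cftime

# asi8 does roughly: epoch = self.date_type(1970, 1, 1). For base-class
# values, date_type is cftime.datetime, so the epoch defaults to the
# 'standard' calendar...
epoch = cftime.datetime(1970, 1, 1)

# ...while the index values carry calendar='julian':
date = cftime.datetime(2000, 1, 1, calendar='julian')

# Reproduces the TypeError shown in the traceback above.
date - epoch
```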

Environment:

Output of xr.show_versions():

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.4 | packaged by conda-forge | (default, May 10 2021, 22:13:33) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-305.19.1.el8.nci.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.20.1
pandas: 1.3.4
numpy: 1.21.4
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: 2.9.5
cftime: 1.5.0
nc_time_axis: 1.4.0
PseudoNetCDF: None
rasterio: 1.2.4
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
fsspec: 2021.05.0
cupy: None
pint: 0.18
sparse: None
setuptools: 49.6.0.post20210108
pip: 21.1.2
conda: 4.10.1
pytest: None
IPython: 7.24.0
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6026/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Issue #4833: Strange behaviour when overwriting files with to_netcdf and html repr
id: 789755611 · node_id: MDU6SXNzdWU3ODk3NTU2MTE= · opened by dougiesquire (42455466) · state: closed (completed) · comments: 2 · created: 2021-01-20T08:28:35Z · updated: 2021-01-20T20:00:23Z · closed: 2021-01-20T20:00:23Z · author_association: NONE · repo: xarray (13221727) · type: issue

What happened:

I'm experiencing some strange behaviour when overwriting netcdf files using to_netcdf in a Jupyter notebook. The issue is a bit quirky and convoluted and only seems to come about when using xarray's html repr in Jupyter. I've tried to find a reproducible example that demonstrates the issue (it's still quite convoluted, sorry):

I can generate some data, save it to a netcdf file, reopen it and everything works as expected:

```python
import numpy as np
import xarray as xr

ones = xr.DataArray(np.ones(5), coords=[range(5)], dims=['x']).to_dataset(name='a')

ones.to_netcdf('./a.nc')
print(xr.open_dataset('./a.nc')['a'])
```
```
<xarray.DataArray 'a' (x: 5)>
array([1., 1., 1., 1., 1.])
Coordinates:
  * x        (x) int64 0 1 2 3 4
```

I can overwrite `a.nc` with a modified dataset and everything still works as expected:

```python
twos = 2 * ones
twos.to_netcdf('./a.nc')
print(xr.open_dataset('./a.nc', cache=False)['a'])
```
```
<xarray.DataArray 'a' (x: 5)>
array([2., 2., 2., 2., 2.])
Coordinates:
  * x        (x) int64 0 1 2 3 4
```

I can run the above cell as many times as I like and always get the expected behaviour. However, if instead of `print`ing the `open_dataset` line, I allow it to be rendered by the xarray html repr, I find that the cell will run once and then will fail with a `Permission denied` error the second time it is run:

```python
twos.to_netcdf('./a.nc')
xr.open_dataset('./a.nc', cache=False)['a']
```
```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
.../lib/python3.8/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    198         try:
--> 199             file = self._cache[self._key]
    200         except KeyError:

.../lib/python3.8/site-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
     52         with self._lock:
---> 53             value = self._cache[key]
     54             self._cache.move_to_end(key)

KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('.../a.nc',), 'a', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False))]

During handling of the above exception, another exception occurred:
.
.
.
PermissionError: [Errno 13] Permission denied: b'.../a.nc'
```

If I manually remove the file in question, I can resave it, but from then on xarray seems to have its wires crossed somehow and will present `twos` from `a.nc` regardless of what it actually contains:

```python
!rm ./a.nc
ones.to_netcdf('./a.nc')
print(xr.open_dataset('./a.nc')['a'])
```
```
<xarray.DataArray 'a' (x: 5)>
array([2., 2., 2., 2., 2.])
Coordinates:
  * x        (x) int64 0 1 2 3 4
```

Note that in the last example, the data saved on disk is correct (i.e. it contains ones), but xarray is still somehow linked to the `twos` data.

Anything else we need to know?:

I've come across this unexpected behaviour a few times. In the above example, I've had to add `cache=False` to consistently produce the behaviour, but in the past I've managed to produce these symptoms without `cache=False` (I'm just not exactly sure how). Anecdotally, the behaviour always seems to occur after having rendered the xarray object in Jupyter using the html repr.
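A hedged workaround sketch, not taken from the issue itself: explicitly closing the dataset releases xarray's cached file handle, which may avoid the stale-handle collision when the file is later overwritten:

```python
import xarray as xr

# Opening via a context manager closes the dataset (and its cached
# netCDF4 file handle) at the end of the block, so a subsequent
# to_netcdf('./a.nc') is not blocked by a handle held from a prior cell.
with xr.open_dataset('./a.nc') as ds:
    print(ds['a'])
```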

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4833/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
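A minimal sketch of reproducing this page's query against the schema above with Python's sqlite3 module, assuming a local copy of the database (the filename `github.db` is illustrative):

```python
import sqlite3

# Hypothetical local copy of the Datasette database backing this page.
conn = sqlite3.connect("github.db")

# Equivalent of "rows where user = 42455466 sorted by updated_at descending".
rows = conn.execute(
    "select number, state, title from issues "
    "where user = ? order by updated_at desc",
    (42455466,),
).fetchall()

for number, state, title in rows:
    print(number, state, title)
```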