issues

3 rows where user = 1530840 sorted by updated_at descending

Issue #2300: zarr and xarray chunking compatibility and `to_zarr` performance

id: 342531772 · node_id: MDU6SXNzdWUzNDI1MzE3NzI= · user: chrisbarber (1530840) · state: closed · locked: 0 · comments: 15 · created_at: 2018-07-18T23:58:40Z · updated_at: 2021-04-26T16:37:42Z · closed_at: 2021-04-26T16:37:42Z · author_association: NONE

I have a situation where I build large zarr arrays with chunks that correspond to how I read data off the filesystem, for best I/O performance. Then I set these as variables on an xarray dataset, which I want to persist to zarr with different chunks that are better suited to querying.

One problem I ran into is that manually selecting chunks of a dataset prior to to_zarr trips the error raised at https://github.com/pydata/xarray/blob/66be9c5db7d86ea385c3a4cd4295bfce67e3f25b/xarray/backends/zarr.py#L83
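
To illustrate the failure mode, a minimal sketch (my reconstruction, not code from the issue; the store path and chunk sizes are made up). zarr only allows the final chunk along a dimension to differ in size, so irregular dask chunks have no direct zarr equivalent:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"x": ("t", np.arange(10))})
# Irregular dask chunks: zarr only permits the *final* chunk along a
# dimension to be a different size, so (3, 4, 3) cannot be mapped onto
# any valid zarr chunk layout.
ds = ds.chunk({"t": (3, 4, 3)})
ds.to_zarr("/tmp/example.zarr")  # raises the error linked above
```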

It's difficult for me to see exactly how to select chunks manually at the dataset level in a way that also satisfies zarr's "final chunk" constraint. I would have been happy to let zarr choose chunks for me, but I could not find a way to trigger this through the xarray API short of "unchunking" the dataset first, which would load entire variables into memory. I came up with the following hack to trigger zarr's automatic chunking despite having differently defined chunks on my xarray dataset:

```python
# monkey patch to get zarr to ignore dask chunks and use its own heuristics
import functools
import types

import xarray as xr

def copy_func(f):
    # shallow-copy a function object so the original remains callable
    g = types.FunctionType(f.__code__, f.__globals__, name=f.__name__,
                           argdefs=f.__defaults__, closure=f.__closure__)
    g = functools.update_wrapper(g, f)
    g.__kwdefaults__ = f.__kwdefaults__
    return g

orig_determine_zarr_chunks = copy_func(xr.backends.zarr._determine_zarr_chunks)
# drop the dask chunks (var_chunks=None) so zarr falls back to its own heuristics
xr.backends.zarr._determine_zarr_chunks = (
    lambda enc_chunks, var_chunks, ndim:
        orig_determine_zarr_chunks(enc_chunks, None, ndim)
)
```

The next problem to contend with is that da.store between source and destination with differing chunks is astronomically slow. The obvious first move is to rechunk the dask arrays to match the destination zarr chunks, but xarray's consistent-chunks constraint blocks that strategy as far as I can tell. Once again I took the dirty-hack approach and inject a rechunking on a per-variable basis during the to_zarr operation, as follows:

```python
# monkey patch to make dask arrays writable with different chunks than zarr dest
# could do without this but would have to contend with 'inconsistent chunks' on dataset
def sync_using_zarr_copy(self, compute=True):
    if self.sources:
        import dask.array as da
        # rechunk each source to match its destination before storing
        rechunked_sources = [source.rechunk(target.chunks)
                             for source, target in zip(self.sources, self.targets)]
        delayed_store = da.store(rechunked_sources, self.targets,
                                 lock=self.lock, compute=compute, flush=True)
        self.sources = []
        self.targets = []
        return delayed_store

xr.backends.common.ArrayWriter.sync = sync_using_zarr_copy
```

I may have missed something in the API that would have made this easier, or another workaround that would be less hacky, but in any case I'm wondering whether this scenario could be handled elegantly in xarray.

I'm not sure if there is a plan going forward to make legal xarray chunks 100% compatible with zarr; if so, that would go a fair way toward alleviating the first problem. Alternatively, perhaps the xarray API could expose some ability to adjust chunks to zarr's liking, as well as the option of deferring entirely to zarr's chunking heuristics.
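
As a point of reference, xarray's zarr backend does accept an explicit per-variable chunk layout through encoding, though the requested chunks still have to line up with the variable's dask chunks. A hedged sketch (the store path, variable name, and chunk size are illustrative, and behavior in the xarray version current to this issue may differ):

```python
# Sketch: request a specific zarr chunk layout for variable "x" via encoding
# instead of rechunking the dask array by hand. The requested chunks must
# still be compatible with the variable's dask chunking.
ds.to_zarr("/tmp/example.zarr", mode="w",
           encoding={"x": {"chunks": (5,)}})
```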

As for the performance issue with differing chunks, I'm not sure whether my rechunking patch could be applied without causing side effects, or where the right place to solve this would be; perhaps it could be addressed more naturally within da.store.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2300/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue
Issue #2371: `AttributeError: 'DataArray' object has no attribute 'ravel'` when using `np.intersect1d(..., assume_unique=True)`

id: 351343574 · node_id: MDU6SXNzdWUzNTEzNDM1NzQ= · user: chrisbarber (1530840) · state: closed · locked: 0 · comments: 5 · created_at: 2018-08-16T19:47:36Z · updated_at: 2018-10-22T21:27:22Z · closed_at: 2018-10-22T21:27:22Z · author_association: NONE

Code Sample, a copy-pastable example if possible

```python
>>> import xarray as xr
>>> import numpy as np
>>> np.intersect1d(xr.DataArray(np.empty(5), dims=('a',)),
...                xr.DataArray(np.empty(5), dims=('a',)))
array([2.37151510e-322, 6.92748216e-310])
>>> np.intersect1d(xr.DataArray(np.empty(5), dims=('a',)),
...                xr.DataArray(np.empty(5), dims=('a',)),
...                assume_unique=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/local1/opt/miniconda3/envs/datacube/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 368, in intersect1d
    ar1 = ar1.ravel()
  File "/local1/opt/miniconda3/envs/datacube/lib/python3.6/site-packages/xarray/core/common.py", line 176, in __getattr__
    (type(self).__name__, name))
AttributeError: 'DataArray' object has no attribute 'ravel'
```

Problem description

I believe this worked in a previous version; I'm not sure what might have changed. But I don't see any reason calling np.intersect1d on DataArrays shouldn't work, or why assume_unique=True ought to make any difference.
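
My reading of numpy's arraysetops.py, consistent with the traceback above (an inference on my part, not something stated in the issue): with assume_unique=False, intersect1d first runs np.unique on each input, which coerces it to a plain ndarray, whereas with assume_unique=True it calls .ravel() on the inputs directly, and DataArray does not expose ravel. A runnable sketch of that difference:

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.array([1, 2, 3]), dims=('a',))
b = xr.DataArray(np.array([2, 3, 4]), dims=('a',))

# np.unique coerces its argument to an ndarray, so routing the inputs
# through it reproduces what the assume_unique=False path does implicitly:
print(np.intersect1d(np.unique(a), np.unique(b), assume_unique=True))
# -> [2 3]
```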

Expected Output

Output should be the same as calling intersect1d with assume_unique=True directly on ndarrays, e.g.

```python
>>> np.intersect1d(xr.DataArray(np.empty(5), dims=('a',)).values,
...                xr.DataArray(np.empty(5), dims=('a',)).values,
...                assume_unique=True)
array([2.37151510e-322, 6.94714805e-310])
```

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.26.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.0
scipy: 1.1.0
netCDF4: None
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: 2.2.0
bottleneck: None
cyordereddict: None
dask: 0.18.2
distributed: 1.22.0
matplotlib: 2.2.3
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 18.0
conda: None
pytest: None
IPython: None
sphinx: None
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2371/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue
Pull request #1702: fix empty dataset from_dict

id: 272325640 · node_id: MDExOlB1bGxSZXF1ZXN0MTUxNDc3Nzcx · user: chrisbarber (1530840) · state: closed · locked: 0 · comments: 2 · created_at: 2017-11-08T19:47:37Z · updated_at: 2018-05-15T04:51:11Z · closed_at: 2018-05-15T04:51:03Z · author_association: NONE · draft: 0 · pull_request: pydata/xarray/pulls/1702
  • [ ] Closes #xxxx
  • [x] Tests added / passed
  • [x] Passes `git diff upstream/master **/*py | flake8 --diff`
  • [ ] Fully documented, including whats-new.rst for all changes and api.rst for new API

Not sure if you want an issue or a whats-new entry for a small fix like this.

Also not sure how xarray tends to handle np.array([]). One option that I did not take was to provide ndim to as_compatible_data in some fashion and make sure the result comes back with shape (0,)*ndim instead of just (0,).
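
For context, a minimal round trip exercising the fix (my sketch, not the PR's actual test):

```python
import xarray as xr

# An empty dataset should survive a to_dict/from_dict round trip unchanged.
ds = xr.Dataset()
roundtripped = xr.Dataset.from_dict(ds.to_dict())
assert ds.identical(roundtripped)
```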

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1702/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: pull


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
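
For reference, the page above corresponds to a query along these lines against the schema shown; a sketch using Python's sqlite3 (the database filename is an assumption, not given on the page):

```python
import sqlite3

# "github.db" is an assumed filename for the Datasette database behind this page.
conn = sqlite3.connect("github.db")
rows = conn.execute(
    "select number, title, state, type from issues "
    "where [user] = ? order by updated_at desc",
    (1530840,),
).fetchall()
for number, title, state, type_ in rows:
    print(number, state, type_, title)
```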