issues

6 rows where repo = 13221727, state = "open" and user = 6130352 sorted by updated_at descending

id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
696047530 MDU6SXNzdWU2OTYwNDc1MzA= 4412 Dataset.encode_cf function eric-czech 6130352 open 0     3 2020-09-08T17:22:55Z 2023-05-10T16:06:54Z   NONE      

I would like to be able to apply CF encoding to an existing DataArray (or multiple in a Dataset) and then store the encoded forms elsewhere. Is this already possible?

More specifically, I would like to encode a large array of 32-bit floats as 8-bit ints and then write them to a Zarr store using rechunker.

I'm essentially after this https://github.com/pangeo-data/rechunker/issues/45 (Xarray support in rechunker), but I'm looking for what functionality exists in Xarray to make it possible in the meantime.
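As a rough illustration of the arithmetic involved, CF-style packing stores `round((value - add_offset) / scale_factor)` as a small integer and reverses that on read. The sketch below uses only NumPy; the helper names `pack` and `unpack` and the chosen `scale_factor` are assumptions for illustration, not Xarray functions.

```python
import numpy as np

def pack(values, scale_factor, add_offset, dtype=np.int8):
    """Quantize floats to a small integer dtype (hypothetical helper)."""
    packed = np.round((values - add_offset) / scale_factor)
    info = np.iinfo(dtype)
    # Clip so out-of-range values saturate instead of wrapping around.
    return np.clip(packed, info.min, info.max).astype(dtype)

def unpack(packed, scale_factor, add_offset):
    """Reverse the packing; precision is limited to about scale_factor / 2."""
    return packed.astype(np.float32) * scale_factor + add_offset

values = np.array([0.0, 0.5, 1.0], dtype=np.float32)
scale_factor, add_offset = 0.01, 0.0  # assumed values for this example

packed = pack(values, scale_factor, add_offset)
restored = unpack(packed, scale_factor, add_offset)
```

The packed array is what would be handed to a tool that writes raw integers, with `scale_factor` and `add_offset` stored alongside as attributes so a reader can decode it.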

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4412/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
692238160 MDU6SXNzdWU2OTIyMzgxNjA= 4405 open_zarr: concat_characters has no effect when dtype=U1 eric-czech 6130352 open 0     8 2020-09-03T19:22:52Z 2022-04-27T23:48:29Z   NONE      

What happened:

It appears that either to_zarr or open_zarr is incorrectly concatenating the trailing dimension of single-byte character arrays and dropping the last dimension:

```python
import xarray as xr
import numpy as np

xr.set_options(display_style='text')

chrs = np.array([
    ['A', 'B'],
    ['C', 'D'],
    ['E', 'F'],
], dtype='S1')
ds = xr.Dataset(dict(x=(('dim0', 'dim1'), chrs)))
ds.x
# <xarray.DataArray 'x' (dim0: 3, dim1: 2)>
# array([[b'A', b'B'],
#        [b'C', b'D'],
#        [b'E', b'F']], dtype='|S1')
# Dimensions without coordinates: dim0, dim1

ds.to_zarr('/tmp/test.zarr', mode='w')
xr.open_zarr('/tmp/test.zarr').x.compute()
# The second dimension is lost and the values end up being concatenated
# <xarray.DataArray 'x' (dim0: 3)>
# array([b'AB', b'CD', b'EF'], dtype='|S2')
# Dimensions without coordinates: dim0
```

For N columns in a 2D array, you end up with an "|SN" 1D array. With "S2", or any fixed length greater than 1, this does not happen.

Interestingly though, it only affects the trailing dimension. I.e. if you use 3 dimensions, you get a 2D result with the 3rd dimension dropped:

```python
chrs = np.array([[
    ['A', 'B'],
    ['C', 'D'],
    ['E', 'F'],
]], dtype='S1')
ds = xr.Dataset(dict(x=(('dim0', 'dim1', 'dim2'), chrs)))
ds
# <xarray.Dataset>
# Dimensions:  (dim0: 1, dim1: 3, dim2: 2)
# Dimensions without coordinates: dim0, dim1, dim2
# Data variables:
#     x        (dim0, dim1, dim2) |S1 b'A' b'B' b'C' b'D' b'E' b'F'

ds.to_zarr('/tmp/test.zarr', mode='w')
xr.open_zarr('/tmp/test.zarr').x.compute()
# dim2 is gone and the data concatenated to dim1
# <xarray.DataArray 'x' (dim0: 1, dim1: 3)>
# array([[b'AB', b'CD', b'EF']], dtype='|S2')
# Dimensions without coordinates: dim0, dim1
```

In short, this only affects the "S1" data type. "U1" is fine as is "SN" where N > 1.
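The transformation described above can be reproduced with NumPy alone, which may help when reasoning about what the round trip does to the data. This is only a sketch of the effect (reinterpreting N adjacent `S1` items as one `SN` item), not a copy of Xarray's decoding code.

```python
import numpy as np

chrs = np.array([[b'A', b'B'], [b'C', b'D'], [b'E', b'F']], dtype='S1')
n = chrs.shape[-1]

# View each row of n single bytes as one n-byte string, then drop the
# trailing axis, which now has length 1.
joined = np.ascontiguousarray(chrs).view(f'S{n}')[..., 0]

# For this input the step can be undone by viewing the bytes as S1 again.
restored = joined.view('S1').reshape(joined.shape + (n,))
```

Here `joined` has one dimension fewer than `chrs` and dtype `|S2`, matching the shape and dtype reported in the issue.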

Environment:

Output of xr.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-42-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: None
xarray: 0.16.0
pandas: 1.0.5
numpy: 1.19.0
scipy: 1.5.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.21.0
distributed: 2.21.0
matplotlib: 3.3.0
cartopy: None
seaborn: 0.10.1
numbagg: None
pint: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: 4.8.2
pytest: 5.4.3
IPython: 7.15.0
sphinx: 3.2.1
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4405/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
707571360 MDU6SXNzdWU3MDc1NzEzNjA= 4452 Change default for concat_characters to False in open_* functions eric-czech 6130352 open 0     2 2020-09-23T18:06:07Z 2022-04-09T03:21:43Z   NONE      

I wanted to propose that concat_characters be False for open_{dataset,zarr,dataarray}. I'm not sure how often that affects anyone since working with individual character arrays is probably rare, but it's a particularly bad default in genetics. We often represent individual variations as single characters and the concatenation is destructive because we can't invert it when one of the characters is an empty string (which often corresponds to a deletion at a base pair location, and the order of the characters matters).

I also find it to be confusing behavior (e.g. https://github.com/pydata/xarray/issues/4405) since no other arrays are automatically transformed like this when deserialized.

If I submit a PR for this, would anybody object?
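To make the invertibility concern concrete, here is a small plain-Python sketch. It joins two different allele pairs and shows that they collapse to the same value; it illustrates the ambiguity in general and does not call Xarray.

```python
# Two different per-site encodings: a deletion in the second slot
# versus a deletion in the first slot (empty bytes mark the deletion).
pair_a = [b"A", b""]
pair_b = [b"", b"A"]

joined_a = b"".join(pair_a)
joined_b = b"".join(pair_b)

# Both collapse to the same bytes, so the original order is unrecoverable.
ambiguous = joined_a == joined_b
```

Once the characters are joined, nothing in the result says which position held the empty string.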

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4452/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
770006670 MDU6SXNzdWU3NzAwMDY2NzA= 4704 Retries for rare failures eric-czech 6130352 open 0     2 2020-12-17T13:06:51Z 2022-04-09T02:30:16Z   NONE      

I recently ran into several issues with gcsfs (https://github.com/dask/gcsfs/issues/316, https://github.com/dask/gcsfs/issues/315, and https://github.com/dask/gcsfs/issues/318) where errors are occasionally thrown, but only in large workflows where enough HTTP calls are made for them to become probable.

@martindurant suggested forcing dask to retry tasks that may fail like this with .compute(... retries=N) in https://github.com/dask/gcsfs/issues/316, which has worked well. However, I also see this in Xarray/Zarr code interacting with gcsfs directly:

Example Traceback

```
Traceback (most recent call last):
  File "scripts/convert_phesant_data.py", line 100, in <module>
    fire.Fire()
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "scripts/convert_phesant_data.py", line 96, in sort_zarr
    ds.to_zarr(fsspec.get_mapper(output_path), consolidated=True, mode="w")
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/xarray/core/dataset.py", line 1652, in to_zarr
    return to_zarr(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/xarray/backends/api.py", line 1368, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/xarray/backends/api.py", line 1128, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/xarray/backends/zarr.py", line 417, in store
    self.set_variables(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/xarray/backends/zarr.py", line 489, in set_variables
    writer.add(v.data, zarr_array, region=region)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/xarray/backends/common.py", line 145, in add
    target[...] = source
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/zarr/core.py", line 1115, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/zarr/core.py", line 1210, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/zarr/core.py", line 1501, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/zarr/core.py", line 1550, in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/zarr/core.py", line 1664, in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/zarr/core.py", line 1729, in _chunk_setitem_nosync
    self.chunk_store[ckey] = cdata
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/mapping.py", line 151, in __setitem__
    self.fs.pipe_file(key, maybe_convert(value))
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise exc.with_traceback(tb)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
    result[0] = await future
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1007, in _pipe_file
    return await simple_upload(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1523, in simple_upload
    j = await fs._call(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 525, in _call
    raise e
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
    self.validate_response(status, contents, json, path, headers)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1228, in validate_response
    raise HttpError(error)
gcsfs.utils.HttpError: Required
```

Has there already been a discussion about how to address rare errors like this? Arguably, I could file the same issue with Zarr but it seemed more productive to start here at a higher level of abstraction.

To be clear, the code for the example failure above typically succeeds and reproducing this failure is difficult. I have only seen it a couple times now like this, where the calling code does not include dask, but it did make me want to know if there were any plans to tolerate rare failures in Xarray as Dask does.
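In the absence of built-in support, one possible stopgap is a small retry wrapper around the call that occasionally fails. The sketch below is generic Python; `call_with_retries` and its parameters are hypothetical names, not part of Xarray, Zarr, or gcsfs, and it assumes the wrapped call is safe to repeat (for example, a full overwrite).

```python
import time

def call_with_retries(fn, retries=3, exceptions=(Exception,), base_delay=0.0):
    """Call fn(), retrying up to `retries` times on the given exceptions."""
    attempt = 0
    while True:
        try:
            return fn()
        except exceptions:
            if attempt >= retries:
                raise  # out of retries: surface the last error
            # Exponential backoff between attempts.
            time.sleep(base_delay * 2 ** attempt)
            attempt += 1

# Stand-in for a rarely failing write: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient")
    return "ok"

result = call_with_retries(flaky, retries=5, exceptions=(OSError,))
```

In the workflow from the traceback, `fn` might be something like `lambda: ds.to_zarr(store, mode="w")` with `exceptions` narrowed to the HTTP error type, though whether repeating a whole write is acceptable depends on the store and the dataset size.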

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4704/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
876394165 MDU6SXNzdWU4NzYzOTQxNjU= 5261 Export ufuncs from DataArray API eric-czech 6130352 open 0     3 2021-05-05T12:24:03Z 2021-05-07T13:53:08Z   NONE      

Have there been discussions on promoting other ufuncs out of xr.ufuncs and into the DataArray API like DataArray.isnull or DataArray.notnull?

I can see how those two would be an exception given the pandas semantics for them, as opposed to numpy, but I am curious how to recommend best practices for our users as we build a library for genetics on Xarray.

We prefer to avoid anything in our documentation or examples outside of the Xarray API to make things simple for our users, who would likely be easily confused/frustrated by the intricacies of numpy, dask, and xarray API interactions (as we were too not long ago). To that end, we have a number of methods that produce NaN and infinite values, but recommending use of either of these to identify those values via ds.my_variable.pipe(xr.ufuncs.isfinite) or np.isfinite(ds.my_variable) is not ideal.

I would prefer ds.my_variable.isfinite() or maybe even ds.my_variable.ufuncs.isfinite(). Is there a sane way to export all of xr.ufuncs from DataArray?
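One way a downstream library could approximate `ds.my_variable.ufuncs.isfinite()` today is with Xarray's accessor registration. The sketch below is an assumption-laden illustration: the accessor name `ufuncs` and the class `UfuncAccessor` are made up here, and it simply forwards attribute lookups to NumPy ufuncs.

```python
import numpy as np
import xarray as xr

@xr.register_dataarray_accessor("ufuncs")  # accessor name is hypothetical
class UfuncAccessor:
    def __init__(self, xarray_obj):
        self._obj = xarray_obj

    def __getattr__(self, name):
        # Only expose genuine NumPy ufuncs such as isfinite, isnan, sqrt.
        func = getattr(np, name, None)
        if not isinstance(func, np.ufunc):
            raise AttributeError(name)
        # Applying a ufunc to a DataArray returns a DataArray.
        return lambda *args, **kwargs: func(self._obj, *args, **kwargs)

da = xr.DataArray([1.0, np.nan, np.inf], dims="x")
finite = da.ufuncs.isfinite()
```

This keeps user-facing examples inside the DataArray namespace, at the cost of the library owning and documenting the accessor itself.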

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5261/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
688501399 MDU6SXNzdWU2ODg1MDEzOTk= 4386 Zarr store array dtype incorrect eric-czech 6130352 open 0     2 2020-08-29T09:54:19Z 2021-04-20T01:23:45Z   NONE      

Writing a boolean array to a zarr store once works, but not twice. The dtype switches to int8 after the second write:

```python
import xarray as xr
import numpy as np

ds = xr.Dataset(dict(
    x=xr.DataArray(np.random.rand(100) > .5, dims='d1')
))

ds.to_zarr('/tmp/ds1.zarr', mode='w')
xr.open_zarr('/tmp/ds1.zarr').x.dtype.str  # |b1

xr.open_zarr('/tmp/ds1.zarr').to_zarr('/tmp/ds2.zarr', mode='w')
xr.open_zarr('/tmp/ds2.zarr').x.dtype.str  # |i1
```

Environment:

Output of xr.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-42-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: None
xarray: 0.16.0
pandas: 1.0.5
numpy: 1.19.0
scipy: 1.5.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.21.0
distributed: 2.21.0
matplotlib: 3.3.0
cartopy: None
seaborn: 0.10.1
numbagg: None
pint: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: 4.8.2
pytest: 5.4.3
IPython: 7.15.0
sphinx: 3.2.1
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4386/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
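
For readers who want to reproduce the filter shown at the top of this page, the following is a minimal sketch using Python's built-in `sqlite3`. It builds an in-memory table with only a subset of the columns above and three of the rows listed on this page, so it is an illustration rather than the actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE [issues] ("
    " [id] INTEGER PRIMARY KEY, [number] INTEGER, [title] TEXT,"
    " [user] INTEGER, [state] TEXT, [updated_at] TEXT, [repo] INTEGER)"
)
conn.executemany(
    "INSERT INTO [issues] VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (707571360, 4452, "Change default for concat_characters to False in open_* functions",
         6130352, "open", "2022-04-09T03:21:43Z", 13221727),
        (696047530, 4412, "Dataset.encode_cf function",
         6130352, "open", "2023-05-10T16:06:54Z", 13221727),
        (692238160, 4405, "open_zarr: concat_characters has no effect when dtype=U1",
         6130352, "open", "2022-04-27T23:48:29Z", 13221727),
    ],
)

# ISO 8601 timestamps sort correctly as plain text.
rows = conn.execute(
    "SELECT [number] FROM [issues]"
    " WHERE [repo] = ? AND [state] = ? AND [user] = ?"
    " ORDER BY [updated_at] DESC",
    (13221727, "open", 6130352),
).fetchall()
numbers = [r[0] for r in rows]
```

The `idx_issues_repo` and `idx_issues_user` indexes defined above are the ones such a query could use on the full table.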