issues


7 rows where user = 6574622 sorted by updated_at descending


type 2

  • issue 4
  • pull 3

state 2

  • closed 5
  • open 2

repo 1

  • xarray 7
id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
922804256 MDU6SXNzdWU5MjI4MDQyNTY= 5475 Is `_FillValue` really the same as zarr's `fill_value`? d70-t 6574622 open 0     2 2021-06-16T16:03:21Z 2024-04-02T08:17:23Z   CONTRIBUTOR      

The zarr backend uses the fill_value from zarr's .zarray key as if it were the _FillValue defined by the CF-Conventions:

https://github.com/pydata/xarray/blob/1a7b285be676d5404a4140fc86e8756de75ee7ac/xarray/backends/zarr.py#L373

I think this interpretation of the fill_value is wrong and creates problems. Here's why:

The zarr v2 spec is still a little vague, but states that fill_value is

A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.

Accordingly, this value should fill every area of a variable that is not backed by a stored chunk. This differs from what the CF conventions state (emphasis mine):

The scalar attribute with the name _FillValue and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable. This value is considered to be a special value that indicates undefined or missing data, and is returned when reading values that were not written.

The difference between the two is that fill_value is only a background value, which simply isn't stored as a chunk, whereas _FillValue is (possibly) a background value and is also interpreted as not being valid data. In my opinion, this mix of _FillValue and missing_value could be considered a defect in the CF-Conventions, but it is probably far too late to change, as many depend on it.

Thinking of an example, when storing a density field (i.e. water droplets forming clouds) in a zarr dataset, it might be perfectly valid to set the fill_value to 0 and then store only chunks in regions of the atmosphere where clouds are actually present. In that case, 0 (i.e. no drops) would be a perfectly valid value, which just isn't stored. As most parts of the atmosphere are indeed cloud-free, this may save quite a bunch of storage. Other formats (e.g. OpenVDB) commonly use this trick.
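The cloud example can be sketched without zarr at all. The following is a minimal pure-Python illustration, not zarr's implementation: the `read` helper, the chunk length and the sample densities are all hypothetical. It shows how the same stored data reads under the two interpretations.

```python
from typing import Dict, List

CHUNK = 4  # hypothetical chunk length


def read(chunks: Dict[int, List[float]], n_chunks: int, fill_value: float) -> List[float]:
    """Read a 1-D array, substituting fill_value for chunks that were never stored."""
    out: List[float] = []
    for i in range(n_chunks):
        out.extend(chunks.get(i, [fill_value] * CHUNK))
    return out


# Droplet density: only the chunk that contains a cloud is stored.
stored = {1: [0.0, 0.3, 0.7, 0.1]}
density = read(stored, n_chunks=3, fill_value=0.0)

# fill_value read as a background value: 0.0 is valid data and can be summed.
total_background = sum(density)

# fill_value read as a CF-style _FillValue: every 0.0 becomes "missing",
# including the zero that was explicitly written inside the stored chunk.
masked = [None if v == 0.0 else v for v in density]
n_missing = sum(v is None for v in masked)
```

Under the second reading, 9 of the 12 elements are reported as missing although they are legitimate "no droplets" values.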


The issue gets worse when looking into the upcoming zarr v3 spec where fill_value is described as:

Provides an element value to use for uninitialised portions of the Zarr array.

If the data type of the Zarr array is Boolean then the value must be the literal false or true. If the data type is one of the integer data types defined in this specification, then the value must be a number with no fraction or exponent part and must be within the range of the data type.

For any data type, if the fill_value is the literal null then the fill value is undefined and the implementation may use any arbitrary value that is consistent with the data type as the fill value.

[...]

Thus for boolean arrays, if the fill_value were interpreted as a missing value indicator, only (missing, True) or (False, missing) arrays could be represented. A (False, True) array would not be possible. The same issue applies to integer types.
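A small sketch of the boolean case, assuming a hypothetical `decode` helper that masks the fill value the way a CF-style _FillValue would be masked:

```python
from typing import List, Optional


def decode(raw: List[bool], fill_value: bool) -> List[Optional[bool]]:
    """Treat fill_value as a missing-data marker (None stands for 'missing')."""
    return [None if v == fill_value else v for v in raw]


raw = [False, True, True, False]
as_missing_false = decode(raw, fill_value=False)  # only (missing, True) remains
as_missing_true = decode(raw, fill_value=True)    # only (False, missing) remains
```

Whichever literal is chosen as fill_value, one of the two boolean values can no longer be read back as data.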

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5475/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1159923690 I_kwDOAMm_X85FIwfq 6329 `to_zarr` with append or region mode and `_FillValue` doesnt work d70-t 6574622 open 0     17 2022-03-04T18:21:32Z 2023-03-17T16:14:30Z   CONTRIBUTOR      

What happened?

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", [3.], {"_FillValue": np.nan})})
m = {}
ds.to_zarr(m)
ds.to_zarr(m, append_dim="x")
```

raises

ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.

What did you expect to happen?

I'd expect this to just work (effectively concatenating the dataset to itself).

Anything else we need to know?

appears also for region writes

The same issue appears for region writes, as in:

```python
import numpy as np
import dask.array as da
import xarray as xr

ds = xr.Dataset({"a": ("x", da.array([3.,4.]), {"_FillValue": np.nan})})
m = {}
ds.to_zarr(m, compute=False, encoding={"a": {"chunks": (1,)}})
ds.isel(x=slice(0,1)).to_zarr(m, region={"x": slice(0,1)})
```

raises

ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.

there's a workaround

The workaround (deleting the _FillValue in subsequent writes):

```python
m = {}
ds.to_zarr(m)
del ds.a.attrs["_FillValue"]
ds.to_zarr(m, append_dim="x")
```

seems to do the trick.

There are indications that the result might still be broken, but it's not yet clear how to reproduce them (see comments below).

This issue has been split off from #6069

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.10 (main, Jan 15 2022, 11:48:00) [Clang 13.0.0 (clang-1300.0.29.3)]
python-bits: 64
OS: Darwin
OS-release: 20.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 0.20.1
pandas: 1.2.0
numpy: 1.21.2
scipy: 1.6.2
netCDF4: 1.5.8
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.11.0
cftime: 1.3.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.1
distributed: 2021.11.1
matplotlib: 3.4.1
cartopy: 0.20.1
seaborn: 0.11.1
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.17
sparse: 0.13.0
setuptools: 60.5.0
pip: 21.3.1
conda: None
pytest: 6.2.2
IPython: 8.0.0.dev
sphinx: 3.5.0
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6329/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1128485610 PR_kwDOAMm_X84yTE49 6258 removed check for last dask chunk size in to_zarr d70-t 6574622 closed 0     4 2022-02-09T12:34:43Z 2022-02-09T15:13:21Z 2022-02-09T15:12:32Z CONTRIBUTOR   0 pydata/xarray/pulls/6258

When storing a dask-chunked dataset to zarr, the size of the last chunk in each dimension does not matter, as this single last chunk will be written to any number of zarr chunks, but none of the zarr chunks which are being written to will be accessed by any other dask chunk.

  • [x] Closes #6255
  • [x] Tests added
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst

cc'ing @rabernat who seems to have worked on this lately.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6258/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1128282637 I_kwDOAMm_X85DQDoN 6255 Writing large (aligned) dask-chunks to small zarr chunks fails. d70-t 6574622 closed 0     0 2022-02-09T09:35:24Z 2022-02-09T15:12:31Z 2022-02-09T15:12:31Z CONTRIBUTOR      

What happened?

I'm trying to write a dataset which is (dask-) chunked in large chunks into zarr which should be chunked in smaller chunks. The dask chunks are intentionally chosen to be integer multiples of the zarr chunks, such that there will never be two dask chunks which may be written into a single zarr chunk.

When trying to write such a dataset using to_zarr, the following exception appears:

NotImplementedError: Final chunk of Zarr array must be the same size or smaller than the first. Specified Zarr chunk encoding['chunks']=(1,), for variable named 'a' but (2, 2) in the variable's Dask chunks ((2, 2),) are incompatible with this encoding. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.

What did you expect to happen?

I'd expect the write to "just work".

Minimal Complete Verifiable Example

```python
import xarray as xr

ds = xr.Dataset({"a": ("x", [1, 2, 3, 4])}).chunk({"x": 2})
m = {}
ds.to_zarr(m, encoding={"a": {"chunks": (1,)}})
```

Relevant log output

No response

Anything else we need to know?

I believe that the expected behaviour is according to this design choice:

# DESIGN CHOICE: do not allow multiple dask chunks on a single zarr chunk
# this avoids the need to get involved in zarr synchronization / locking
# From zarr docs:
# "If each worker in a parallel computation is writing to a separate
# region of the array, and if region boundaries are perfectly aligned
# with chunk boundaries, then no synchronization is required."

But I believe that this if-statement is not needed and should be removed. The if-statement compares the size of the last dask chunk within each dimension to the zarr chunk size. There are three possible cases, which (as far as I understand) should all be just fine:

  • the dask chunk is smaller than the zarr chunk: one dask chunk will write into one (smaller, last) zarr chunk
  • the dask chunk is equal to the zarr chunk: one dask chunk will write into one zarr chunk
  • the dask chunk is larger than the zarr chunk: one dask chunk will write into multiple zarr chunks. None of these zarr chunks will be touched by any other dask chunk, as all previous dask chunks are aligned to zarr chunk boundaries.
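The three cases can be checked by listing which zarr chunks each dask chunk writes to. This is an illustrative pure-Python sketch, not xarray's code: `touched_zarr_chunks` and `shared_zarr_chunks` are hypothetical helpers for a single dimension, and dask chunk sizes are assumed to be positive.

```python
from collections import Counter
from typing import List, Sequence


def touched_zarr_chunks(dask_chunks: Sequence[int], zarr_chunk: int) -> List[range]:
    """For each dask chunk, return the range of zarr chunk indices it writes to."""
    touched = []
    start = 0
    for size in dask_chunks:
        stop = start + size
        first = start // zarr_chunk
        last = (stop - 1) // zarr_chunk
        touched.append(range(first, last + 1))
        start = stop
    return touched


def shared_zarr_chunks(dask_chunks: Sequence[int], zarr_chunk: int) -> List[int]:
    """Return the zarr chunk indices written by more than one dask chunk."""
    counts = Counter(i for r in touched_zarr_chunks(dask_chunks, zarr_chunk) for i in r)
    return sorted(i for i, n in counts.items() if n > 1)


smaller = shared_zarr_chunks((4, 4, 2), 4)  # last dask chunk smaller than zarr chunk
equal = shared_zarr_chunks((4, 4, 4), 4)    # last dask chunk equal to zarr chunk
larger = shared_zarr_chunks((2, 2), 1)      # the failing example: dask (2, 2), zarr 1
misaligned = shared_zarr_chunks((3, 3), 2)  # earlier dask chunk not on a boundary
```

In the three aligned cases no zarr chunk is shared. Only the misaligned layout, where an earlier dask chunk ends inside a zarr chunk, produces a conflict (zarr chunk 1 here).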

Note: If that if-statement goes away, this one may go away as well (was introduced in #4312).

Environment

INSTALLED VERSIONS
```
commit: None
python: 3.9.10 (main, Jan 15 2022, 11:48:00) [Clang 13.0.0 (clang-1300.0.29.3)]
python-bits: 64
OS: Darwin
OS-release: 20.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 0.20.1
pandas: 1.2.0
numpy: 1.21.2
scipy: 1.6.2
netCDF4: 1.5.8
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.10.2
cftime: 1.3.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.1
distributed: 2021.11.1
matplotlib: 3.4.1
cartopy: 0.20.1
seaborn: 0.11.1
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.17
sparse: 0.13.0
setuptools: 60.5.0
pip: 21.3.1
conda: None
pytest: 6.2.2
IPython: 8.0.0.dev
sphinx: 3.5.0
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6255/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
673513695 MDExOlB1bGxSZXF1ZXN0NDYzMzYyMTIw 4312 allow manual zarr encoding on unchunked dask dimensions d70-t 6574622 closed 0     3 2020-08-05T12:49:04Z 2022-02-09T09:31:51Z 2020-08-19T14:58:09Z CONTRIBUTOR   0 pydata/xarray/pulls/4312

If a dask array is chunked along one dimension but not chunked along another, any manually specified zarr chunk size should be valid, but before this patch, this resulted in an error.
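Why an unchunked dimension accepts any zarr chunk size can be illustrated with a simplified alignment rule. This is a sketch under assumptions: `aligned_for_zarr` is a hypothetical helper, not the check xarray performs, and it only looks at one dimension.

```python
from typing import Sequence


def aligned_for_zarr(dask_chunks: Sequence[int], zarr_chunk: int) -> bool:
    """Simplified rule: every dask chunk except the last ends on a zarr chunk boundary."""
    return all(size % zarr_chunk == 0 for size in dask_chunks[:-1])


# A dimension of length 10 that is not chunked in dask is a single chunk of size 10.
unchunked = (10,)
valid_for_any_size = all(aligned_for_zarr(unchunked, z) for z in range(1, 11))

# A dimension that is chunked in dask still has to respect zarr chunk boundaries.
chunked_ok = aligned_for_zarr((4, 4, 2), 2)
chunked_bad = aligned_for_zarr((3, 3, 4), 2)
```

With a single dask chunk there is no second writer along that dimension, so the rule holds trivially for every zarr chunk size.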

  • [x] Tests added
  • [x] Passes isort . && black . && mypy . && flake8 (only for modified sections)
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4312/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
817302678 MDExOlB1bGxSZXF1ZXN0NTgwODE3NDQ5 4966 conventions: decode unsigned integers to signed if _Unsigned=false d70-t 6574622 closed 0     5 2021-02-26T12:05:51Z 2021-03-12T14:21:12Z 2021-03-12T14:20:20Z CONTRIBUTOR   0 pydata/xarray/pulls/4966

netCDF3 doesn't know unsigned while OPeNDAP doesn't know signed (bytes). Depending on which backend source is used, the original data is stored with the wrong signedness and needs to be decoded based on the _Unsigned attribute. While the netCDF3 variant is already implemented, this commit adds the symmetric case covering OPeNDAP.

  • [x] Closes #4954
  • [x] Tests added
  • [x] Passes pre-commit run --all-files
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4966/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
815858485 MDU6SXNzdWU4MTU4NTg0ODU= 4954 Handling of signed bytes from OPeNDAP via pydap d70-t 6574622 closed 0     2 2021-02-24T21:21:38Z 2021-03-12T14:20:19Z 2021-03-12T14:20:19Z CONTRIBUTOR      

netCDF3 only knows signed bytes, but there's a convention of adding an attribute _Unsigned=True to the variable to be able to store unsigned bytes nonetheless. This convention is handled at this place by xarray.

OPeNDAP only knows unsigned bytes, but there's a hack which is used by the thredds server and the netCDF-c library of adding an attribute _Unsigned=False to the variable to be able to store signed bytes nonetheless. This hack is not handled by xarray, but maybe should be handled symmetrically at the same place (i.e. if .kind == "u" and unsigned == False).
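The two symmetric reinterpretations can be sketched with the standard library alone. This is not xarray's implementation: the helper names and the sample bytes are chosen purely for illustration.

```python
import struct
from typing import List, Sequence


def decode_unsigned_false(values: Sequence[int]) -> List[int]:
    """Reinterpret unsigned bytes (0..255) as signed bytes (-128..127)."""
    raw = struct.pack(f"{len(values)}B", *values)
    return list(struct.unpack(f"{len(values)}b", raw))


def decode_unsigned_true(values: Sequence[int]) -> List[int]:
    """Reinterpret signed bytes (-128..127) as unsigned bytes (0..255)."""
    raw = struct.pack(f"{len(values)}b", *values)
    return list(struct.unpack(f"{len(values)}B", raw))


# Bytes as an OPeNDAP client would see them when the variable carries _Unsigned=False.
received = [128, 255, 0, 1, 2, 127]
decoded = decode_unsigned_false(received)  # [-128, -1, 0, 1, 2, 127]
roundtrip = decode_unsigned_true(decoded)
```

Both directions only reinterpret the same bit patterns, so the round trip restores the original bytes.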

As described in the "hack", netCDF-c handles this internally, but pydap doesn't. This is why the engine="netcdf4" variant returns (correctly according to the hack) negative values and the engine="pydap" variant doesn't. However, as xarray returns a warning at exactly the location referenced above, I think that this is the place where it should be fixed.

If you agree, I could prepare a PR to implement the fix.

```python
In [1]: import xarray as xr

In [2]: xr.open_dataset("https://observations.ipsl.fr/thredds/dodsC/EUREC4A/PRODUCTS/testdata/netcdf_testfiles/test_NC_BYTE_neg.nc", engine="netcdf4")
Out[2]:
<xarray.Dataset>
Dimensions:  (test: 7)
Coordinates:
  * test     (test) float32 -128.0 -1.0 0.0 1.0 2.0 nan 127.0
Data variables:
    *empty*

In [3]: xr.open_dataset("https://observations.ipsl.fr/thredds/dodsC/EUREC4A/PRODUCTS/testdata/netcdf_testfiles/test_NC_BYTE_neg.nc", engine="pydap")
/usr/local/lib/python3.9/site-packages/xarray/conventions.py:492: SerializationWarning: variable 'test' has _Unsigned attribute but is not of integer type. Ignoring attribute.
  new_vars[k] = decode_cf_variable(
Out[3]:
<xarray.Dataset>
Dimensions:  (test: 7)
Coordinates:
  * test     (test) float32 128.0 255.0 0.0 1.0 2.0 nan 127.0
Data variables:
    *empty*
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4954/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
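
The filter shown at the top of the page (rows where user = 6574622, sorted by updated_at descending) corresponds roughly to the query below. This is a sketch using Python's built-in sqlite3 module with an in-memory table that holds only four of the columns; the row values are copied from the seven rows listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE issues ([number] INTEGER, [user] INTEGER, [updated_at] TEXT, [type] TEXT)"
)
rows = [
    (4954, 6574622, "2021-03-12T14:20:19Z", "issue"),
    (4966, 6574622, "2021-03-12T14:21:12Z", "pull"),
    (4312, 6574622, "2022-02-09T09:31:51Z", "pull"),
    (6255, 6574622, "2022-02-09T15:12:31Z", "issue"),
    (6258, 6574622, "2022-02-09T15:13:21Z", "pull"),
    (6329, 6574622, "2023-03-17T16:14:30Z", "issue"),
    (5475, 6574622, "2024-04-02T08:17:23Z", "issue"),
]
conn.executemany("INSERT INTO issues VALUES (?, ?, ?, ?)", rows)

# ISO 8601 timestamps stored as TEXT sort correctly as plain strings.
query = "SELECT [number] FROM issues WHERE [user] = ? ORDER BY [updated_at] DESC"
numbers = [n for (n,) in conn.execute(query, (6574622,))]

# The "type" facet counts are a simple GROUP BY over the same rows.
by_type = dict(conn.execute("SELECT [type], COUNT(*) FROM issues GROUP BY [type]"))
```

The resulting order matches the order of the rows on this page, and the grouped counts match the type facet (4 issues, 3 pulls).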