
issue_comments


28 rows where user = 6574622 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1064981526 https://github.com/pydata/xarray/issues/6329#issuecomment-1064981526 https://api.github.com/repos/pydata/xarray/issues/6329 IC_kwDOAMm_X84_elQW d70-t 6574622 2022-03-11T10:28:35Z 2022-03-11T10:28:35Z CONTRIBUTOR

Thanks for pointing out region again. I've updated the header and the initial comment.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `to_zarr` with append or region mode and `_FillValue` doesnt work 1159923690
1063977656 https://github.com/pydata/xarray/issues/6329#issuecomment-1063977656 https://api.github.com/repos/pydata/xarray/issues/6329 IC_kwDOAMm_X84_awK4 d70-t 6574622 2022-03-10T11:56:44Z 2022-03-10T11:56:44Z CONTRIBUTOR

Yes, this is kind of the behaviour I'd expect. And great that it helped clarify things. Still, building up the metadata nicely upfront (which is required for region writes) is quite convoluted... That's what I meant with

some better tooling for writing and updating zarr dataset metadata (I don't know if that would fit in the realm of xarray though, as it looks like handling Datasets without content. For "appending" metadata, I really don't know how I'd picture this properly in xarray world.)

in the previous comment. I think establishing and documenting good practices for this would help, but probably we also want better tools. In any case, this would probably be yet another issue.

Note that if you care about this particular example (e.g. appending in a single thread in increasing order of timesteps), then it should also be possible to do this much more simply using append:

```python
import os
import numpy as np
import xarray as xr

filename = 'processed_dataset.zarr'
ds = xr.tutorial.open_dataset('air_temperature')
ds.air.encoding['dtype'] = np.dtype('float32')
X, Y = 250, 250  # size of each final timestep

for i in range(len(ds.time)):
    # some kind of heavy processing
    arr_r = some_processing(ds.isel(time=slice(i, i+1)), X, Y)
    del arr_r.air.attrs["_FillValue"]
    if os.path.exists(filename):
        arr_r.to_zarr(filename, append_dim='time')
    else:
        arr_r.to_zarr(filename)
```

If you find out more about the cloud case, please post a note, otherwise, we can assume that the original bug report is fine?

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `to_zarr` with append or region mode and `_FillValue` doesnt work 1159923690
1063859715 https://github.com/pydata/xarray/issues/6329#issuecomment-1063859715 https://api.github.com/repos/pydata/xarray/issues/6329 IC_kwDOAMm_X84_aTYD d70-t 6574622 2022-03-10T09:44:59Z 2022-03-10T09:44:59Z CONTRIBUTOR

Sure, no problem. I believe this page has a good summary:

mode ({"w", "w-", "a", "r+", None}, optional) – Persistence mode: “w” means create (overwrite if exists); “w-” means create (fail if exists); “a” means override existing variables (create if does not exist); “r+” means modify existing array values only (raise an error if any metadata or shapes would change). The default mode is “a” if append_dim is set. Otherwise, it is “r+” if region is set and w- otherwise.

So the difference between "a" and "r+" roughly codifies the intended behaviour for sequential access (it's ok to modify everything) and parallel access to independent chunks (where modifying metadata would be bad).

So probably that message was suggesting that you have to use "a" if you want to modify metadata (e.g. by expanding the shape), which is true. But to me, it's unclear how one would do that safely with (potentially) parallel region writes, so it's kind of reasonable that region writes don't like to modify metadata.
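
A minimal sketch of the practical difference (my own illustration, not from the docs; `m` is an in-memory store and the dataset is made up):

```python
import xarray as xr

ds = xr.Dataset({"a": ("x", [1., 2., 3.])}).chunk({"x": 1})

m = {}
ds.to_zarr(m)  # default "w-": create, fail if the store already exists

# "a": may modify metadata, e.g. grow the shape while appending
ds.to_zarr(m, mode="a", append_dim="x")

# "r+": only overwrite existing values; shape and metadata stay fixed
ds.isel(x=slice(0, 1)).to_zarr(m, mode="r+", region={"x": slice(0, 1)})
```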

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `to_zarr` with append or region mode and `_FillValue` doesnt work 1159923690
1062755678 https://github.com/pydata/xarray/issues/6329#issuecomment-1062755678 https://api.github.com/repos/pydata/xarray/issues/6329 IC_kwDOAMm_X84_WF1e d70-t 6574622 2022-03-09T10:06:22Z 2022-03-09T10:06:22Z CONTRIBUTOR

Yes, that looks like the error as described in the initial post. Adding the described workaround (i.e. del buff.air.attrs["_FillValue"] in this case) leads to the next error message:

ValueError: variable 'air' already exists with different dimension sizes: {'time': 0, 'y': 250, 'x': 250} != {'time': 1, 'y': 250, 'x': 250}. to_zarr() only supports changing dimension sizes when explicitly appending, but append_dim=None.

Which is due to a mix of append mode (mode='a') and region write (region={'time': slice(i, i+1)}), which is out of scope, as outlined e.g. in this comment. It may or may not be possible or intended to support this, but I'm not deep enough into the design of xarray to give a definitive answer here. For me, it's unclear how this should behave. My current point of view is:

  • append: may change structure-defining metadata, must be sequential, mode='a'
  • region: may not change structure-defining metadata, can be parallel, mode='r+'

Currently, I can't really imagine how a mix of both should behave. If you can't prepare the dataset for the final shape upfront (to use region) and you also can't use append_dim, then probably what's needed is a separate method of expanding the dataset (i.e. reshape) without filling in the data. If such a thing were available, one could (as a user) ensure that all reshaping operations are properly sequenced with region operations, while the region operations themselves could run in parallel. (I think this is possible with plain zarr, but I'm not aware of a corresponding xarray API.)
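
A rough sketch of that plain-zarr route (my illustration only; array names follow the examples above, and the exact workflow is an assumption):

```python
import zarr

g = zarr.open_group("test.zarr", mode="r+")
n = g["air"].shape[0]

# grow the time dimension by one step, without writing any chunk data
g["air"].resize((n + 1,) + g["air"].shape[1:])
g["time"].resize((n + 1,))

# if consolidated metadata is in use, it needs refreshing afterwards
zarr.consolidate_metadata("test.zarr")

# workers could then fill the new slots in parallel, e.g. via
# to_zarr(..., mode="r+", region={"time": slice(n, n + 1)})
```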

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `to_zarr` with append or region mode and `_FillValue` doesnt work 1159923690
1061711069 https://github.com/pydata/xarray/issues/6329#issuecomment-1061711069 https://api.github.com/repos/pydata/xarray/issues/6329 IC_kwDOAMm_X84_SGzd d70-t 6574622 2022-03-08T12:09:38Z 2022-03-08T12:09:38Z CONTRIBUTOR

You've got the encoding of `air` set to `int16`:

```python
print(buff.air.encoding)
{'source': '.../xarray_tutorial_data/69c68be1605878a6c8efdd34d85b4ca1-air_temperature.nc', 'original_shape': (2920, 25, 53), 'dtype': dtype('int16'), 'scale_factor': 0.01, 'grid_mapping': 'spatial_ref'}
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `to_zarr` with append or region mode and `_FillValue` doesnt work 1159923690
1061081884 https://github.com/pydata/xarray/issues/6329#issuecomment-1061081884 https://api.github.com/repos/pydata/xarray/issues/6329 IC_kwDOAMm_X84_PtMc d70-t 6574622 2022-03-07T20:03:18Z 2022-03-07T20:03:18Z CONTRIBUTOR

Sorry, @Boorhin, but the code example you showed has many syntax errors:

```
$ python3 test.py
  File "test.py", line 8
    return arr_r.x.values, arr_r.y.values
    ^
SyntaxError: invalid syntax
```

(there are more, and I wasn't sure how to fix them in all places to match what you likely wanted to express)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `to_zarr` with append or region mode and `_FillValue` doesnt work 1159923690
1059426353 https://github.com/pydata/xarray/issues/6329#issuecomment-1059426353 https://api.github.com/repos/pydata/xarray/issues/6329 IC_kwDOAMm_X84_JZAx d70-t 6574622 2022-03-04T18:48:13Z 2022-03-04T18:48:13Z CONTRIBUTOR

If that's necessary to reproduce the problem, then yes. If it's possible to show the same thing with less "noise", then it's better to not use the tutorial dataset and to not use something like a cloud backend. But we can also try to iterate on this again, to progressively get down to a smaller example.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `to_zarr` with append or region mode and `_FillValue` doesnt work 1159923690
1059405550 https://github.com/pydata/xarray/issues/6069#issuecomment-1059405550 https://api.github.com/repos/pydata/xarray/issues/6069 IC_kwDOAMm_X84_JT7u d70-t 6574622 2022-03-04T18:16:57Z 2022-03-04T18:16:57Z CONTRIBUTOR

I'll set up a new issue. @Boorhin, I couldn't confirm the weirdness with the small example, but will put in a note referring to your comment. If you can reproduce the weirdness on the minimal example, would you make a comment on the new issue?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_zarr: region not recognised as dataset dimensions 1077079208
1059378287 https://github.com/pydata/xarray/issues/6069#issuecomment-1059378287 https://api.github.com/repos/pydata/xarray/issues/6069 IC_kwDOAMm_X84_JNRv d70-t 6574622 2022-03-04T17:39:24Z 2022-03-04T17:39:24Z CONTRIBUTOR

I've made a simpler example of the `_FillValue` / append issue:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", [3.], {"_FillValue": np.nan})})
m = {}
ds.to_zarr(m)
ds.to_zarr(m, append_dim="x")
```

raises

```
ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
```

I'd expect this to just work (effectively concatenating the dataset to itself).

The workaround:

```python
m = {}
ds.to_zarr(m)
del ds.a.attrs["_FillValue"]
ds.to_zarr(m, append_dim="x")
```

does the trick, but doesn't look right.

@dcherian, @Boorhin: should we make a new (CF-related) issue out of this and keep focusing here on the append and region use cases, which seemed to be the initial problem in this thread (probably by going further through your example, @Boorhin)?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_zarr: region not recognised as dataset dimensions 1077079208
1059078961 https://github.com/pydata/xarray/issues/6069#issuecomment-1059078961 https://api.github.com/repos/pydata/xarray/issues/6069 IC_kwDOAMm_X84_IEMx d70-t 6574622 2022-03-04T11:27:12Z 2022-03-04T11:27:44Z CONTRIBUTOR

By the way, as a workaround it works when removing the `_FillValue` from `dst.air` (you'll likely only want to do this for the append writes, not the initial write):

```python
del dst.air.attrs["_FillValue"]
dst.to_zarr(m, append_dim="time")
```

works.

But still, this might call for another issue to solve.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_zarr: region not recognised as dataset dimensions 1077079208
1059076885 https://github.com/pydata/xarray/issues/6069#issuecomment-1059076885 https://api.github.com/repos/pydata/xarray/issues/6069 IC_kwDOAMm_X84_IDsV d70-t 6574622 2022-03-04T11:23:56Z 2022-03-04T11:23:56Z CONTRIBUTOR

Ok, I believe I've now reproduced your error:

```python
import xarray as xr
from rasterio.enums import Resampling
import numpy as np

ds = xr.tutorial.open_dataset('air_temperature').isel(time=0)
ds = ds.rio.write_crs('EPSG:4326')
dst = ds.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan)
dst.air.encoding = {}
dst = dst.assign(air=dst.air.expand_dims("time"), time=dst.time.expand_dims("time"))

m = {}
dst.to_zarr(m)
dst.to_zarr(m, append_dim="time")
```

raises:

```
ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
```

This seems to be due to the handling of CF conventions, which might go wrong in the append case: the CFMaskCoder verifies that there isn't any fill value present in the dataset before defining one here. I'd guess that in the append case, one wouldn't want to check whether the fill value is already defined, but instead check that it is the same. However, I don't know a lot about the CF encoding pieces of xarray...
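
Sketched, that alternative check could look something like this (a hypothetical helper of mine, not the actual CFMaskCoder code):

```python
import numpy as np

def check_fill_value(attrs, new_fill):
    # append-friendly behaviour: an existing _FillValue is fine
    # as long as it matches what we are about to write
    existing = attrs.get("_FillValue")
    if existing is None:
        attrs["_FillValue"] = new_fill
    elif not (existing == new_fill
              or (np.isnan(existing) and np.isnan(new_fill))):
        raise ValueError("_FillValue differs between existing store and appended data")
```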

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_zarr: region not recognised as dataset dimensions 1077079208
1059063397 https://github.com/pydata/xarray/issues/6069#issuecomment-1059063397 https://api.github.com/repos/pydata/xarray/issues/6069 IC_kwDOAMm_X84_IAZl d70-t 6574622 2022-03-04T11:05:07Z 2022-03-04T11:05:07Z CONTRIBUTOR

This error is unrelated to region or append writes. The dataset `dst` got the `_FillValue` attribute from `rio.reproject`:

```python
>>> dst.air.attrs
{... '_FillValue': nan}
```

but still carries encoding information from `ds`, i.e.:

```python
>>> dst.air.encoding
{'source': '...air_temperature.nc', 'original_shape': (2920, 25, 53), 'dtype': dtype('int16'), 'scale_factor': 0.01, 'grid_mapping': 'spatial_ref'}
```

The encoding gets picked up by `to_zarr`, but as `nan` (the `_FillValue` from `rio.reproject`) can't be expressed as an `int16`, it's not possible to write that data. You'll have to get rid of the encoding or specify some encoding and `_FillValue` which fit together. #5219 might be related.
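
In code, the two ways out might look like this (a sketch, assuming `dst` from the example above; -32767 is just an illustrative int16-representable fill value):

```python
# option 1: drop the stale int16 encoding; the data is then stored as
# float, for which nan is a valid fill value
dst.air.encoding = {}

# option 2: keep the int16 packing, but move the fill value into the
# encoding and pick one that int16 can represent
del dst.air.attrs["_FillValue"]
dst.air.encoding["_FillValue"] = -32767
```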

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_zarr: region not recognised as dataset dimensions 1077079208
1059025444 https://github.com/pydata/xarray/issues/6069#issuecomment-1059025444 https://api.github.com/repos/pydata/xarray/issues/6069 IC_kwDOAMm_X84_H3Ik d70-t 6574622 2022-03-04T10:13:40Z 2022-03-04T10:13:40Z CONTRIBUTOR

🤷 can't help any further without a minimal reproducible example here...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_zarr: region not recognised as dataset dimensions 1077079208
1058381922 https://github.com/pydata/xarray/issues/6069#issuecomment-1058381922 https://api.github.com/repos/pydata/xarray/issues/6069 IC_kwDOAMm_X84_FaBi d70-t 6574622 2022-03-03T18:56:13Z 2022-03-03T18:56:13Z CONTRIBUTOR

I don't yet know a proper answer, but I have three observations:

  • The ValueError seems to be related to the handling of CF conventions. I don't yet know whether that's independent of this issue or whether the error only appears in conjunction with it.
  • As far as I understand, appending should be possible without dropping anything (while potentially overwriting some things).
  • It shouldn't be possible to change _FillValue during appends, because that might require rewriting everything previously written, which you likely don't want. So if _FillValue is different on the append call, I'd want xarray to produce an error.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_zarr: region not recognised as dataset dimensions 1077079208
1052252098 https://github.com/pydata/xarray/issues/6069#issuecomment-1052252098 https://api.github.com/repos/pydata/xarray/issues/6069 IC_kwDOAMm_X84-uBfC d70-t 6574622 2022-02-26T16:07:56Z 2022-02-26T16:07:56Z CONTRIBUTOR

While testing a bit further, I found another case which might potentially be dangerous:

```python
# ds is the same as above, but chunksize is {"time": 1, "x": 1}

# once on the coordinator
ds.to_zarr("test.zarr", compute=False, encoding={"time": {"chunks": [1]}, "x": {"chunks": [1]}})

# in parallel
ds.isel(time=slice(0,1), x=slice(0,1)).to_zarr("test.zarr", mode="r+", region={"time": slice(0,1), "x": slice(0,1)})
ds.isel(time=slice(0,1), x=slice(1,2)).to_zarr("test.zarr", mode="r+", region={"time": slice(0,1), "x": slice(1,2)})
ds.isel(time=slice(0,1), x=slice(2,3)).to_zarr("test.zarr", mode="r+", region={"time": slice(0,1), "x": slice(2,3)})
ds.isel(time=slice(1,2), x=slice(0,1)).to_zarr("test.zarr", mode="r+", region={"time": slice(1,2), "x": slice(0,1)})
ds.isel(time=slice(1,2), x=slice(1,2)).to_zarr("test.zarr", mode="r+", region={"time": slice(1,2), "x": slice(1,2)})
ds.isel(time=slice(1,2), x=slice(2,3)).to_zarr("test.zarr", mode="r+", region={"time": slice(1,2), "x": slice(2,3)})
```

This example doesn't produce any error, but the time and x coordinates are re-written multiple times without any warning. However, I don't yet know how a proper error / warning should be generated in this case. Maybe the check must be whether every written variable touches all region-ed dimensions? But maybe that's overly restrictive?
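
A sketch of what such a check could look like (a hypothetical helper, not an xarray API):

```python
def check_region_write(ds, region):
    # flag variables that don't span every region-ed dimension: each
    # parallel region write would rewrite them in full, racing the others
    for name, var in ds.variables.items():
        missing = set(region) - set(var.dims)
        if missing:
            print(f"warning: {name!r} does not depend on {sorted(missing)}; "
                  f"every region write would rewrite it")
```

For the example above, this would flag the time and x coordinate variables, which are exactly the ones being rewritten by every call.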

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_zarr: region not recognised as dataset dimensions 1077079208
1052240616 https://github.com/pydata/xarray/issues/6069#issuecomment-1052240616 https://api.github.com/repos/pydata/xarray/issues/6069 IC_kwDOAMm_X84-t-ro d70-t 6574622 2022-02-26T15:58:48Z 2022-02-26T15:58:48Z CONTRIBUTOR

I'm trying to picture some usage scenarios based on incrementally adding timesteps to data on a store. I hope these might help answer the questions from above. In particular, I think the append and region options of to_zarr imply different usage patterns, so they might lead to different answers, and mixing the terms might lead to confusion.

I'll use the following dataset for demonstration code:

```python
ds = xr.Dataset({
    "T": (("time", "x"), [[1., 2., 3.], [11., 12., 13.]]),
}, coords={
    "time": (("time",), [21., 22.]),
    "x": (("x",), [100., 200., 300.]),
}).chunk({"time": 1})
```

append

The purpose of append is to add (one or many) elements along one dimension after the end of all currently existing elements. This implies a read-modify-write cycle on at least the total shape of the array. Furthermore, the place to write new chunks is determined by the current shape of the existing array. Due to these implications, it doesn't seem useful to try append in parallel (it would become ambiguous where to write), and it doesn't seem too useful (though possible) to write only some of the variables defined on the append dimension, because all other variables would implicitly be filled with fill_value and those places couldn't be filled by another append anymore.

As a consequence, append-mode writes will always have to be sequential, and writes to objects touched by multiple append calls will always have defined behaviour, even if they are modified / overwritten with each call. Creating and appending works as follows:

```python
# writes 0-sized time-dimension, so only metadata and non-time-dependent variables
ds.isel(time=slice(0,0)).to_zarr("test_append.zarr")

!tree -a test_append.zarr

ds.isel(time=slice(0,1)).to_zarr("test_append.zarr", mode="a", append_dim="time")
ds.isel(time=slice(1,2)).to_zarr("test_append.zarr", mode="a", append_dim="time")

print()
print("final dataset:")
!tree -a test_append.zarr
```

Output:

```
test_append.zarr
├── .zattrs
├── .zgroup
├── .zmetadata
├── T
│   ├── .zarray
│   └── .zattrs
├── time
│   ├── .zarray
│   └── .zattrs
└── x
    ├── .zarray
    ├── .zattrs
    └── 0

3 directories, 10 files

final dataset:
test_append.zarr
├── .zattrs
├── .zgroup
├── .zmetadata
├── T
│   ├── .zarray
│   ├── .zattrs
│   ├── 0.0
│   └── 1.0
├── time
│   ├── .zarray
│   ├── .zattrs
│   ├── 0
│   └── 1
└── x
    ├── .zarray
    ├── .zattrs
    └── 0

3 directories, 14 files
```

In this case, x would be overwritten with each append call, but the behaviour is well defined as we will only ever append sequentially, so whatever the last write puts into x will be the final result, e.g. [1, 2, 3] in the following case:

```python
ds.isel(time=slice(0,1)).to_zarr("test_append.zarr", mode="a", append_dim="time")
ds2 = ds.assign(x=[1,2,3])
ds2.isel(time=slice(1,2)).to_zarr("test_append.zarr", mode="a", append_dim="time")
```

If instead x shouldn't be overwritten, it's possible to append using:

```python
ds.drop(["x"]).isel(time=slice(0,1)).to_zarr("test_append.zarr", mode="a", append_dim="time")
ds.drop(["x"]).isel(time=slice(1,2)).to_zarr("test_append.zarr", mode="a", append_dim="time")
```

This also works with current xarray and has well-defined behaviour. However, if there are many time-independent variables, it might be easier if something like .drop_if_not("time") or similar were available; see the sketch below.
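
A possible shape for such a helper (hypothetical name and semantics, not an existing xarray method):

```python
import xarray as xr

def drop_if_not(ds: xr.Dataset, dim: str) -> xr.Dataset:
    # keep only the variables (including coordinates) that depend on `dim`
    return ds.drop_vars([name for name, var in ds.variables.items()
                         if dim not in var.dims])

# usage in the append case above:
# drop_if_not(ds, "time").isel(time=slice(0, 1)).to_zarr(..., append_dim="time")
```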

region

region behaves quite differently from append. It does not modify the shape of the arrays, and it does not depend on the shape's current value to determine where to write new data (it requires user input to do so). This generally enables parallel writes to the same dataset (as long as only distinct chunks are touched). But as metadata (e.g. shape) is still shared, updates to metadata must be done in a coordinated (likely sequential) manner.

Generally, the workflow with region would imply writing the metadata once, maybe updating it from time to time but sequentially (e.g. on a coordinating node), and writing all the chunks in parallel on worker nodes, while carefully ensuring that no common chunks are overwritten. Let's see how this might look:

```python
ds.to_zarr("test.zarr", compute=False, encoding={"time": {"chunks": [1]}})
!rm test.zarr/time/0
!rm test.zarr/time/1

!tree -a test.zarr

# NOTE: these may run in parallel (even if that's not useful in time,
# but region might also be in time and space)
ds.drop(['x']).isel(time=slice(0,1)).to_zarr("test.zarr", mode="r+", region={"time": slice(0,1)})
ds.drop(['x']).isel(time=slice(1,2)).to_zarr("test.zarr", mode="r+", region={"time": slice(1,2)})

print()
print("final dataset:")
!tree -a test.zarr
```

Output:

```
test.zarr
├── .zattrs
├── .zgroup
├── .zmetadata
├── T
│   ├── .zarray
│   └── .zattrs
├── time
│   ├── .zarray
│   └── .zattrs
└── x
    ├── .zarray
    ├── .zattrs
    └── 0

3 directories, 10 files

final dataset:
test.zarr
├── .zattrs
├── .zgroup
├── .zmetadata
├── T
│   ├── .zarray
│   ├── .zattrs
│   ├── 0.0
│   └── 1.0
├── time
│   ├── .zarray
│   ├── .zattrs
│   ├── 0
│   └── 1
└── x
    ├── .zarray
    ├── .zattrs
    └── 0

3 directories, 14 files
```

The above works and, as far as I understand, does what we'd want for parallel writes. It also avoids the mentioned ambiguous cases (due to the drop(['x']) statements). However, this case is even more cumbersome to write than the append case. The parallel writes might again benefit from something like .drop_if_not("time") (which probably can't be optional in this case, due to ambiguity). But what's even more problematic is the initial write of array metadata. In order to start building the dataset, I'll have to scaffold a (potentially not yet computed) Dataset of full size and use compute=False to write only metadata. However, this fails for coordinate variables (like time), because those are eagerly loaded and will still be written out. That's why I've removed those chunks in the example above.

If region should be used for parallel append, then there must be some process on a coordinating node which updates the metadata keys (at least by increasing the shape). I don't yet see how that could be written nicely using xarray.


So based on these two kinds of tasks, it seems to me that the actual append and region write-modes of to_zarr are already doing what they should do, but there could be some more convenience functions which would make those tasks much simpler:

  • some method like drop_if_not (maybe with a better name) which drops all the things we don't want to keep (maybe we should call it keep instead of drop). This method would essentially result in and simplify mode 1 in @shoyer's answer, which I'd argue is what we actually want in both use cases, because the dropped data would already have been written by the coordinating process. I'd believe that mode 1 shouldn't be the default for to_zarr though, because silently dropping data from being written isn't nice to the user.
  • some better tooling for writing and updating zarr dataset metadata (I don't know if that would fit in the realm of xarray though, as it looks like handling Datasets without content. For "appending" metadata, I really don't know how I'd picture this properly in xarray world.)
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_zarr: region not recognised as dataset dimensions 1077079208
1034170927 https://github.com/pydata/xarray/pull/6260#issuecomment-1034170927 https://api.github.com/repos/pydata/xarray/issues/6260 IC_kwDOAMm_X849pDIv d70-t 6574622 2022-02-09T20:38:27Z 2022-02-09T20:39:07Z CONTRIBUTOR

I'm wondering what the right option for this case would be:

```python
import numpy as np
from xarray import Dataset

data = Dataset(
    {"u": (("x", "y"), np.array([[10], [11], [12]]))},
    coords={"x": [0, 1, 2], "y": [0], "z": ("x", [10, 11, 12])},
)
data2 = Dataset(
    {"u": (("x", "y"), np.array([[13], [14]]))},
    coords={"x": [3, 4], "y": [1], "z": ("x", [13, 14])},
)
```

In this case, the y-coordinate would be independent of x, so it probably should not be updated during the region write (multiple concurrent writes on distinct regions would interfere). However, the z-coordinate probably should be written, as those writes would go to distinct regions.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  allow coordinates to be independent of `region` selection in to_zarr 1128821318
1034098265 https://github.com/pydata/xarray/issues/6259#issuecomment-1034098265 https://api.github.com/repos/pydata/xarray/issues/6259 IC_kwDOAMm_X849oxZZ d70-t 6574622 2022-02-09T19:05:44Z 2022-02-09T19:05:44Z CONTRIBUTOR

This sounds like it could theoretically be handled using intake derived datasets. To be fair, derived datasets are probably still in their early stages. But the basic idea would be to apply arbitrary transformations to a dataset after it has been opened (e.g. with decode_cf=False) and represent the outcome of this transformation as an entry in the catalog. A suitable transformation function might be something like:

```python
def fix_calendar(ds):
    ds.time.calendar = "proleptic_gregorian"
    return xr.decode_cf(ds)
```

... but maybe it is still more convenient or useful to handle it in xarray directly (e.g. I don't know if stac has a similar approach).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Be able to override calendar in `open_dataset`/`open_mfdataset`/etc OR include another calendar name 1128759050
1033803014 https://github.com/pydata/xarray/pull/6258#issuecomment-1033803014 https://api.github.com/repos/pydata/xarray/issues/6258 IC_kwDOAMm_X849npUG d70-t 6574622 2022-02-09T14:12:21Z 2022-02-09T14:12:21Z CONTRIBUTOR

Indeed, those variable names have been quite unfortunate! I've changed them to goodenc. Thanks again for the review.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  removed check for last dask chunk size in to_zarr 1128485610
1033781323 https://github.com/pydata/xarray/pull/6258#issuecomment-1033781323 https://api.github.com/repos/pydata/xarray/issues/6258 IC_kwDOAMm_X849nkBL d70-t 6574622 2022-02-09T13:50:13Z 2022-02-09T13:50:13Z CONTRIBUTOR

Thanks Ryan for having a look into this.

I accidentally didn't run enough of the tests locally before submitting the PR. I've now checked the failing tests and concluded that the previously existing tests were overly restrictive, so I rewrote them to reflect more closely what I believe we actually want.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  removed check for last dask chunk size in to_zarr 1128485610
863972083 https://github.com/pydata/xarray/issues/5490#issuecomment-863972083 https://api.github.com/repos/pydata/xarray/issues/5490 MDEyOklzc3VlQ29tbWVudDg2Mzk3MjA4Mw== d70-t 6574622 2021-06-18T11:32:38Z 2021-06-18T11:33:14Z CONTRIBUTOR

I've checked your example files. This is mostly related to the fact that the original data is encoded as short and uses scale_factor and add_offset:

```python
In [35]: ds_loc.q.encoding
Out[35]:
{'source': '/private/tmp/test_xarray/Minimal_test_data/2012_europe_9_130_131_132_133_135.nc',
 'original_shape': (720, 26, 36, 41),
 'dtype': dtype('int16'),
 'missing_value': -32767,
 '_FillValue': -32767,
 'scale_factor': 3.0672840096982675e-07,
 'add_offset': 0.010050721147263318}
```

Probably the scaling and adding is carried out in float64, but the result is then rounded down to float32. When storing the dataset back to netCDF, xarray re-uses the information from the encoding attribute and goes back to int16, possibly creating even more rounding errors. Reading the data back in then no longer reproduces the original values.
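
A small numeric illustration of that round trip (my sketch, using the encoding values above; whether a value actually shifts depends on where it falls between quantization steps):

```python
import numpy as np

scale = 3.0672840096982675e-07
offset = 0.010050721147263318

packed = np.int16(12345)
decoded64 = packed * scale + offset   # decoding happens in float64
decoded32 = np.float32(decoded64)     # ... but the result is kept as float32

# re-encoding from the float32 value can land on a neighbouring int16
repacked = np.int16(np.round((np.float64(decoded32) - offset) / scale))
print(packed, repacked)  # may or may not match exactly
```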

Possibly related issues are #4826 and #3020

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Nan/ changed values in output when only reading data, saving and reading again 924676925
863945975 https://github.com/pydata/xarray/issues/5490#issuecomment-863945975 https://api.github.com/repos/pydata/xarray/issues/5490 MDEyOklzc3VlQ29tbWVudDg2Mzk0NTk3NQ== d70-t 6574622 2021-06-18T10:44:38Z 2021-06-18T10:44:38Z CONTRIBUTOR

Are your input files on (exactly) the same grid? If not, combining the files might introduce NaN to fill up mismatching cells. Furthermore, if you are working with NaNs, are you aware of:

```python
In [1]: import numpy as np

In [2]: np.nan == np.nan
Out[2]: False
```

which is as it should be per IEEE 754.

When writing out the files to netCDF, do you accidentally convert from 64bit float to 32bit float?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Nan/ changed values in output when only reading data, saving and reading again 924676925
863939178 https://github.com/pydata/xarray/issues/5489#issuecomment-863939178 https://api.github.com/repos/pydata/xarray/issues/5489 MDEyOklzc3VlQ29tbWVudDg2MzkzOTE3OA== d70-t 6574622 2021-06-18T10:32:10Z 2021-06-18T10:32:10Z CONTRIBUTOR

I think there's more to think about than the suggested solution. For example, when opening remote datasets (e.g. OPeNDAP resources), the supplied path will be a string which does not refer to a local path. Deciding whether a supplied "path" is valid might thus require finding an appropriate IO backend and then asking the backend whether the supplied "path" is a valid one.
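
A sketch of that probing idea (the helper itself is hypothetical, assuming backend objects expose a guess_can_open-style predicate, as xarray's backend entrypoints do):

```python
def path_seems_valid(path, backends):
    # a "path" is acceptable if at least one installed backend claims
    # it can open it (local file, URL, store, ...)
    return any(backend.guess_can_open(path) for backend in backends)
```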

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Misleading error when opening file that does not exist 924559401
863098508 https://github.com/pydata/xarray/issues/5189#issuecomment-863098508 https://api.github.com/repos/pydata/xarray/issues/5189 MDEyOklzc3VlQ29tbWVudDg2MzA5ODUwOA== d70-t 6574622 2021-06-17T09:49:48Z 2021-06-17T09:49:48Z CONTRIBUTOR

Pydap has several important fixes which have already been merged into master. Nevertheless, the latest release of Pydap is from May 2017, which predates the referenced PR.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  KeyError pulling from Nasa server with Pydap 861684673
824207037 https://github.com/pydata/xarray/issues/1650#issuecomment-824207037 https://api.github.com/repos/pydata/xarray/issues/1650 MDEyOklzc3VlQ29tbWVudDgyNDIwNzAzNw== d70-t 6574622 2021-04-21T16:46:54Z 2021-06-15T16:18:54Z CONTRIBUTOR

I'd be interested in this kind of thing as well. :+1:

We have long time-series data which we would like to access via OPeNDAP or zarr over HTTP. Currently, the time coordinate variable alone is already more than 1 GB in size, which makes loading the dataset very slow or even impossible, given the limitations of the OPeNDAP server and my home internet connection. Nonetheless, we know that the timestamps are in order and reasonably close to equidistant, so binary search or even interpolation search should be a quick method to find the right indices.
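
A sketch of the idea (illustrative only; `time` stands for any sorted, lazily loaded 1-D array where indexing reads just one element or chunk over the wire):

```python
def bisect_time(time, target):
    # classic binary search: O(log n) element reads instead of
    # downloading the entire coordinate array
    lo, hi = 0, time.shape[0]
    while lo < hi:
        mid = (lo + hi) // 2
        if time[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo  # index of the first timestamp >= target
```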

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Low memory/out-of-core index? 267628781
797151374 https://github.com/pydata/xarray/pull/4966#issuecomment-797151374 https://api.github.com/repos/pydata/xarray/issues/4966 MDEyOklzc3VlQ29tbWVudDc5NzE1MTM3NA== d70-t 6574622 2021-03-12T00:38:47Z 2021-03-12T00:38:47Z CONTRIBUTOR

I don't know if this qualifies as "documentation", but according to this merged PR on the netcdf-c sources, this is how the thredds OPeNDAP server behaves, from which they conclude that netCDF should behave accordingly. I confirmed myself that this is also how my currently installed netCDF-C behaves.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  conventions: decode unsigned integers to signed if _Unsigned=false 817302678
786615778 https://github.com/pydata/xarray/issues/4954#issuecomment-786615778 https://api.github.com/repos/pydata/xarray/issues/4954 MDEyOklzc3VlQ29tbWVudDc4NjYxNTc3OA== d70-t 6574622 2021-02-26T12:22:52Z 2021-02-26T12:22:52Z CONTRIBUTOR

Thanks @dcherian. I added a PR #4966

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Handling of signed bytes from OPeNDAP via pydap 815858485
676193395 https://github.com/pydata/xarray/pull/4312#issuecomment-676193395 https://api.github.com/repos/pydata/xarray/issues/4312 MDEyOklzc3VlQ29tbWVudDY3NjE5MzM5NQ== d70-t 6574622 2020-08-19T11:32:37Z 2020-08-19T11:32:37Z CONTRIBUTOR

Do you know why Read the Docs complains? And if this is related to the PR?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  allow manual zarr encoding on unchunked dask dimensions 673513695

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);