
issue_comments


731 rows where user = 1197350 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
285380106 https://github.com/pydata/xarray/issues/1303#issuecomment-285380106 https://api.github.com/repos/pydata/xarray/issues/1303 MDEyOklzc3VlQ29tbWVudDI4NTM4MDEwNg== rabernat 1197350 2017-03-09T15:18:18Z 2024-02-06T17:57:21Z MEMBER

Just wanted to link to a somewhat related discussion happening in brian-rose/climlab#50.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `xarray.core.variable.as_variable()` part of the public API? 213004586
1534724554 https://github.com/pydata/xarray/issues/3213#issuecomment-1534724554 https://api.github.com/repos/pydata/xarray/issues/3213 IC_kwDOAMm_X85begnK rabernat 1197350 2023-05-04T12:51:59Z 2023-05-04T12:51:59Z MEMBER

> I suspect (but don't know, as I'm just a user of xarray, not a developer) that it's also not thoroughly tested.

Existing sparse testing is here: https://github.com/pydata/xarray/blob/main/xarray/tests/test_sparse.py

We would welcome enhancements to this!

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  How should xarray use/support sparse arrays? 479942077
1534001190 https://github.com/pydata/xarray/issues/3213#issuecomment-1534001190 https://api.github.com/repos/pydata/xarray/issues/3213 IC_kwDOAMm_X85bbwAm rabernat 1197350 2023-05-04T02:36:57Z 2023-05-04T02:36:57Z MEMBER

Hi @jdbutler and welcome! We would welcome this sort of contribution eagerly.

I would characterize our current support of sparse arrays as really just a proof of concept. When to use sparse and how to do it effectively is not well documented. Simply adding more documentation around the already-supported use cases would be a great place to start IMO.

My own explorations of this are described in this Pangeo post. The use case is regridding. It touches on quite a few of the points you're interested in, in particular the integration with geodataframe. Along similar lines, @dcherian has been working on using opt_einsum together with sparse in https://github.com/pangeo-data/xESMF/issues/222#issuecomment-1524041837 and https://github.com/pydata/xarray/issues/7764.

I'd also suggest catching up on what @martinfleis is doing with vector data cubes in xvec. (See also Pangeo post on this topic.)

Of the three topics you enumerated, I'm most interested in the serialization one. However, I'd rather see serialization of sparse arrays prototyped in Zarr, as it's much more conducive to experimentation than NetCDF (which requires writing C to do anything custom). I would recommend exploring serialization from a sparse array in memory to a sparse format on disk via a custom codec. Zarr recently added support for a meta_array parameter that determines what array type is materialized by the codec pipeline (see https://github.com/zarr-developers/zarr-python/pull/1131). The use case there was loading data directly to GPU. In a way sparse is similar--it's an array container that is not numpy or dask.
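To make the custom-codec idea concrete, here is a rough, hypothetical sketch (not an existing xarray or zarr feature; the codec id and byte layout are invented for illustration) of packing a sparse.COO chunk into bytes with a numcodecs codec:

```python
import numpy as np
import sparse
from numcodecs import register_codec
from numcodecs.abc import Codec


class SparseCOOCodec(Codec):
    """Pack a chunk as (ndim, nnz, shape, coords, data); purely illustrative."""

    codec_id = "sparse_coo"  # hypothetical id, not a registered zarr codec

    def encode(self, buf):
        arr = sparse.as_coo(buf)
        header = np.array([arr.ndim, arr.nnz, *arr.shape], dtype="<i8")
        return (header.tobytes()
                + arr.coords.astype("<i8").tobytes()
                + arr.data.astype("<f8").tobytes())

    def decode(self, buf, out=None):
        buf = memoryview(buf)
        ndim, nnz = np.frombuffer(buf[:16], dtype="<i8")
        shape = tuple(np.frombuffer(buf[16:16 + 8 * ndim], dtype="<i8"))
        off = 16 + 8 * ndim
        coords = np.frombuffer(buf[off:off + 8 * ndim * nnz], dtype="<i8")
        data = np.frombuffer(buf[off + 8 * ndim * nnz:], dtype="<f8")
        return sparse.COO(coords.reshape(ndim, nnz), data, shape=shape)


register_codec(SparseCOOCodec)
```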

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  How should xarray use/support sparse arrays? 479942077
1524332001 https://github.com/pydata/xarray/issues/7764#issuecomment-1524332001 https://api.github.com/repos/pydata/xarray/issues/7764 IC_kwDOAMm_X85a23Xh rabernat 1197350 2023-04-27T00:56:21Z 2023-04-27T00:56:21Z MEMBER

Is there ever a case where it would be preferable to use numpy if opt_einsum were installed? If not, I would propose that, like bottleneck, we just automatically use it if available.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Support opt_einsum in xr.dot 1672288892
1497579600 https://github.com/pydata/xarray/issues/7716#issuecomment-1497579600 https://api.github.com/repos/pydata/xarray/issues/7716 IC_kwDOAMm_X85ZQ0BQ rabernat 1197350 2023-04-05T14:23:57Z 2023-04-05T14:23:57Z MEMBER

Do we have a plan to support pandas 2?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  bad conda solve with pandas 2 1654022522
1492139481 https://github.com/pydata/xarray/issues/6323#issuecomment-1492139481 https://api.github.com/repos/pydata/xarray/issues/6323 IC_kwDOAMm_X85Y8D3Z rabernat 1197350 2023-03-31T15:31:55Z 2023-03-31T15:31:55Z MEMBER

We should also consider a configuration option to automatically drop encoding.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  propagation of `encoding` 1158378382
1460185069 https://github.com/pydata/xarray/issues/7039#issuecomment-1460185069 https://api.github.com/repos/pydata/xarray/issues/7039 IC_kwDOAMm_X85XCKft rabernat 1197350 2023-03-08T13:51:06Z 2023-03-08T13:51:06Z MEMBER

Rather than using the scale_factor and add_offset approach, I would look into xbitinfo if you want to optimize your compression.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Encoding error when saving netcdf 1373352524
1460182260 https://github.com/pydata/xarray/pull/7540#issuecomment-1460182260 https://api.github.com/repos/pydata/xarray/issues/7540 IC_kwDOAMm_X85XCJz0 rabernat 1197350 2023-03-08T13:48:51Z 2023-03-08T13:49:21Z MEMBER

Regarding locks, I think we need to think hard about the best way to deal with this across the stack. There are a couple of different options:

  • Current status: just use a global lock on the entire array--super inefficient.
  • A bit better: use per-variable locks.
  • Even better: have locks at the shard level. This would allow concurrent writing of shards.
  • Alternative which accomplishes the same thing: expose different virtual chunks when reading vs. writing. When writing, the writer library (e.g. Xarray or Dask) would see the shards as the chunks (with a lower layer of the stack handling breaking the shard down into chunks). When reading, the individual, smaller chunks would be accessible.

Note that there are still some deep inefficiencies in the way zarr-python writes shards (see https://github.com/zarr-developers/zarr-python/discussions/1338). I think we should be optimizing things at the Zarr level first, before implementing workarounds in Xarray.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  added 'storage_transformers' to valid_encodings 1588516592
1460175664 https://github.com/pydata/xarray/pull/7540#issuecomment-1460175664 https://api.github.com/repos/pydata/xarray/issues/7540 IC_kwDOAMm_X85XCIMw rabernat 1197350 2023-03-08T13:44:02Z 2023-03-08T13:44:02Z MEMBER

It's great to see this PR get started in Xarray! Thanks @JMorado!

From the perspective of a Zarr developer, the sharding feature is still highly experimental. The API may change significantly. While the sharding code is released in the sense that it is available deep in Zarr, it is not really considered part of the public API yet.

So perhaps it's a bit too early to be doing this?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  added 'storage_transformers' to valid_encodings 1588516592
1422860618 https://github.com/pydata/xarray/issues/7515#issuecomment-1422860618 https://api.github.com/repos/pydata/xarray/issues/7515 IC_kwDOAMm_X85UzyFK rabernat 1197350 2023-02-08T16:05:13Z 2023-02-08T16:47:59Z MEMBER

It seems like there are at least 3 separate topics being discussed here.

  1. Could Xarray wrap Aesara / PyTensor arrays, in the same way it wraps numpy arrays, Dask arrays, cupy arrays, sparse arrays, pint arrays, etc? This way, Xarray users could benefit from the performance and other features of Aesara while keeping the high-level analysis API they know and love. AFAIU, any array library that implements the NEP 37 protocol should be wrappable. This is Joe's original topic.
  2. Should Aesara / PyTensor implement their own versions of named dimensions and coordinates? This is an internal question for those projects. Not the original topic, but nevertheless we would love to help by exposing some Xarray internals for reuse by other packages (this is on our roadmap). It would be a shame to reinvent wheels unnecessarily. I would be interested in understanding the tradeoffs and different use cases between this and topic 1.
  3. Pre-existing tensions between Aesara and PyTensor. Since this conversation is happening on our issue tracker, I'll point to our code of conduct and hope that the conversation can remain positive and respectful of all viewpoints. From our point of view as Xarray devs, PyTensor and Aesara do indeed seem quite similar in scope. It would be wonderful if we could all work together in some way towards topic 1.
{
    "total_count": 8,
    "+1": 8,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Aesara as an array backend in Xarray 1575494367
1416643026 https://github.com/pydata/xarray/pull/7142#issuecomment-1416643026 https://api.github.com/repos/pydata/xarray/issues/7142 IC_kwDOAMm_X85UcEHS rabernat 1197350 2023-02-04T03:02:09Z 2023-02-04T03:02:09Z MEMBER

I just noticed our very low coverage rating and found this PR. Did this PR work? Should we update it and merge?

It would be great to have our coverage back in the 90s rather than the 50s 😝 .

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fix Codecov 1401132297
1412408324 https://github.com/pydata/xarray/pull/7496#issuecomment-1412408324 https://api.github.com/repos/pydata/xarray/issues/7496 IC_kwDOAMm_X85UL6QE rabernat 1197350 2023-02-01T17:06:47Z 2023-02-01T17:06:47Z MEMBER

It is true that Xarray is now becoming very different from pandas in how it opens data.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  deprecate open_zarr 1564661430
1385683582 https://github.com/pydata/xarray/issues/7446#issuecomment-1385683582 https://api.github.com/repos/pydata/xarray/issues/7446 IC_kwDOAMm_X85Sl9p- rabernat 1197350 2023-01-17T16:23:01Z 2023-01-17T16:23:01Z MEMBER

Hi @gauteh! This is very cool! Thanks for sharing. I'm really excited about the way that Rust can be used to optimize different parts of our stack.

A couple of questions:

  • Can your reader read over the HTTP / S3 protocol? Or is it just local files?
  • Do you know about kerchunk? The approach you described:

    > The reader works by indexing the chunks of a dataset so that chunks can be accessed independently.

    ...is identical to the approach taken by Kerchunk (although the implementation is different). I'm curious what specification you use to store your indexes. Could we make your implementation interoperable with kerchunk, such that a kerchunk reference specification could be read by your reader? It would be great to reach for some degree of alignment here.
  • Do you know about hdf5-coro (http://icesat2sliderule.org/h5coro/)? They have similar goals, but focused on cloud-based access.

> I hope this can be of general interest, and if it would be of interest to move the hidefix xarray backend into xarray that would be very cool.

This is definitely of general interest! However, it is not necessary to add a new backend directly into xarray. We support entry points which allow packages to implement their own readers, as you have apparently already discovered: https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html

Installing your package should be enough to enable the new engine.

We would, however, welcome a documentation PR that described how to use this package on the I/O page.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Parallel + multi-threaded reading of NetCDF4 + HDF5: Hidefix! 1536004355
1378079073 https://github.com/pydata/xarray/pull/7418#issuecomment-1378079073 https://api.github.com/repos/pydata/xarray/issues/7418 IC_kwDOAMm_X85SI9Fh rabernat 1197350 2023-01-11T00:34:03Z 2023-01-11T00:34:03Z MEMBER

> we should carefully evaluate the datatree API to make sure we won't want to change it soon

I agree with this. We could use the PR process for this review.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Import datatree in xarray? 1519552711
1373993285 https://github.com/pydata/xarray/issues/3996#issuecomment-1373993285 https://api.github.com/repos/pydata/xarray/issues/3996 IC_kwDOAMm_X85R5XlF rabernat 1197350 2023-01-06T18:36:56Z 2023-01-06T18:47:48Z MEMBER

We found a nice solution to this using @TomNicholas's Datatree

```python
import xarray as xr
import datatree

dt = datatree.open_datatree("AQUA_MODIS.20220809T182500.L2.OC.nc")

def fix_dimension_names(ds):
    if 'pixel_control_points' in ds.dims:
        ds = ds.swap_dims({'pixel_control_points': 'pixels_per_line'})
    return ds

dt_fixed = dt.map_over_subtree(fix_dimension_names)

all_dsets = [subtree.ds for node, subtree in dt_fixed.items()]
ds = xr.merge(all_dsets, combine_attrs="drop_conflicts")
ds = ds.set_coords(['latitude', 'longitude'])

ds.chlor_a.plot(x="longitude", y="latitude", robust=True)
```

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 1,
    "eyes": 0
}
  MODIS L2 Data Missing Data Variables and Geolocation Data 605608998
1372822656 https://github.com/pydata/xarray/pull/7418#issuecomment-1372822656 https://api.github.com/repos/pydata/xarray/issues/7418 IC_kwDOAMm_X85R05yA rabernat 1197350 2023-01-05T21:50:53Z 2023-01-05T21:50:53Z MEMBER

I personally favor just copying the code into Xarray and archiving the old repo.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Import datatree in xarray? 1519552711
1372802153 https://github.com/pydata/xarray/pull/7418#issuecomment-1372802153 https://api.github.com/repos/pydata/xarray/issues/7418 IC_kwDOAMm_X85R00xp rabernat 1197350 2023-01-05T21:31:33Z 2023-01-05T21:31:33Z MEMBER
  • At what stage is datatree "ready" to be moved in here? At what stage should it become encouraged public API?

My opinion is that Datatree should move into Xarray now, ideally in a way that does not disrupt any existing user code, and that Datatree should become a first-class Xarray object (together with DataArray and Dataset). Since it's a new feature, we don't necessarily have to be super conservative here. I think it is more than good enough / stable enough in its current state.

  • What's a good way to slowly roll the feature out?

Since Datatree sits above DataArray and Dataset, it should not interfere with any of our existing API. As long as test coverage is good, documentation is solid, and the code style matches the rest of Xarray, I think we can just bring it in.

  • How do I decrease the bus factor on datatree's code? Can I get some code reviews during the merging process? 🙏

I think that it is inevitable that you, Tom, will be the main owner of the Datatree code at the beginning (as @shoyer was of all of Xarray when he first released it). Over time, if people use it, some fraction of users will become maintainers, starting with the existing dev team.

  • Should I make a new CI environment just for testing datatree stuff?

Why? Are its dependencies different from Xarray?

{
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Import datatree in xarray? 1519552711
1315553661 https://github.com/pydata/xarray/issues/5878#issuecomment-1315553661 https://api.github.com/repos/pydata/xarray/issues/5878 IC_kwDOAMm_X85OacF9 rabernat 1197350 2022-11-15T16:22:30Z 2022-11-15T16:22:30Z MEMBER

Your issue is that the consolidated metadata have not been updated:

```python
import gcsfs
fs = gcsfs.GCSFileSystem()

# the latest array metadata
print(fs.cat('gs://ldeo-glaciology/append_test/test30/temperature/.zarray').decode())
# -> "shape": [ 6 ]

# the consolidated metadata
print(fs.cat('gs://ldeo-glaciology/append_test/test30/.zmetadata').decode())
# -> "shape": [ 3 ]
```

There are two ways to fix this.

  1. Don't use consolidated metadata on read. (This will be a bit slower.)

     ```python
     ds = xr.open_dataset('gs://ldeo-glaciology/append_test/test30', engine='zarr', consolidated=False)
     ```

  2. Reconsolidate your metadata after append (see the sketch below): https://zarr.readthedocs.io/en/stable/tutorial.html#consolidating-metadata
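For option 2, a minimal sketch of reconsolidating in place, assuming gcsfs is installed and you have write access to the bucket:

```python
import gcsfs
import zarr

fs = gcsfs.GCSFileSystem()
store = fs.get_mapper("gs://ldeo-glaciology/append_test/test30")
zarr.consolidate_metadata(store)  # rewrite .zmetadata from the arrays' current metadata
```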
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  problem appending to zarr on GCS when using json token  1030811490
1300863799 https://github.com/pydata/xarray/issues/6308#issuecomment-1300863799 https://api.github.com/repos/pydata/xarray/issues/6308 IC_kwDOAMm_X85NiZs3 rabernat 1197350 2022-11-02T16:39:53Z 2022-11-02T16:39:53Z MEMBER

Just found this issue! I agree that this would be helpful. But isn't it fundamentally a Dask issue? Vanilla Xarray + Numpy has none of these problems because everything is in memory.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xr.doctor(): diagnostics on a Dataset / DataArray ? 1151751524
1255550548 https://github.com/pydata/xarray/issues/6818#issuecomment-1255550548 https://api.github.com/repos/pydata/xarray/issues/6818 IC_kwDOAMm_X85K1i5U rabernat 1197350 2022-09-22T21:09:15Z 2022-09-22T21:09:15Z MEMBER

I just hit this same bug with numpy 1.23.3. Installing xarray from github main branch fixed it.

I think we really need to release soon (#7069).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray 2022.6.0 doesn't work well with numpy 1.20 1315607023
1248302788 https://github.com/pydata/xarray/issues/7039#issuecomment-1248302788 https://api.github.com/repos/pydata/xarray/issues/7039 IC_kwDOAMm_X85KZ5bE rabernat 1197350 2022-09-15T16:02:17Z 2022-09-15T16:02:17Z MEMBER

> I am curious as to what exactly from the encoding introduces the noise (I still need to read through the documentation more thoroughly)?

The encoding says that your data should be encoded according to the following pseudocode formula:

```
encoded = int((original - offset) / scale_factor)
decoded = (scale_factor * float(encoded)) + offset
```

So the floating-point data are converted back and forth to a less precise type (integer) in order to save space. These numerical operations cannot preserve exact floating-point accuracy. That's just how numerical floating-point operations work. If you skip the encoding, then you just write the floating point bytes directly to disk, with no loss of precision.

This sort of encoding is a crude form of lossy compression that is still unfortunately in use, even though there are much better algorithms available (and built into netcdf and zarr). Differences on the order of 10^-14 should not affect any real-world calculations.
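As a small numeric illustration of that round trip (the scale_factor, add_offset, and data value here are made up, not taken from this issue):

```python
import numpy as np

scale_factor, add_offset = 0.01, 250.0   # illustrative encoding parameters
original = np.float64(287.654321)

encoded = np.int16(round((original - add_offset) / scale_factor))
decoded = scale_factor * np.float64(encoded) + add_offset

print(decoded - original)   # quantization error, bounded by scale_factor / 2
```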

However, this seems like a much, much smaller difference than the problem you originally reported. This suggests that the MRE does not actually reproduce the bug after all. How was the plot above (https://github.com/pydata/xarray/issues/7039#issue-1373352524) generated? From your actual MRE code? Or from your earlier example with real data?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Encoding error when saving netcdf 1373352524
1248241823 https://github.com/pydata/xarray/issues/7039#issuecomment-1248241823 https://api.github.com/repos/pydata/xarray/issues/7039 IC_kwDOAMm_X85KZqif rabernat 1197350 2022-09-15T15:12:34Z 2022-09-15T15:12:34Z MEMBER

I'm puzzled that I was not able to reproduce this error. I modified the end slightly as follows

```python
# save dataset as netcdf
ds.to_netcdf("test.nc")

# load saved dataset
ds_test = xr.open_dataset('test.nc')

# verify that the two are equal within numerical precision
xr.testing.assert_allclose(ds, ds_test)

# plot
plt.plot(ds.t2m - ds_test.t2m)
```

In my case, the differences were just numerical noise (order 10^-14)

I used the binder environment for this.

I'm pretty stumped.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Encoding error when saving netcdf 1373352524
1248098918 https://github.com/pydata/xarray/issues/7039#issuecomment-1248098918 https://api.github.com/repos/pydata/xarray/issues/7039 IC_kwDOAMm_X85KZHpm rabernat 1197350 2022-09-15T13:25:11Z 2022-09-15T13:25:11Z MEMBER

Thanks so much for taking the time to write up this detailed bug report! 🙏

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Encoding error when saving netcdf 1373352524
1246005938 https://github.com/pydata/xarray/issues/2812#issuecomment-1246005938 https://api.github.com/repos/pydata/xarray/issues/2812 IC_kwDOAMm_X85KRIqy rabernat 1197350 2022-09-13T22:18:31Z 2022-09-13T22:18:31Z MEMBER

Glad you got it working! So you're saying it does not work with open_zarr and does work with open_dataset(...engine='zarr')? Weird. We should deprecate open_zarr.

> However, the behavior in Dask is strange. I think it is making each worker have its own cache and blowing up memory if I ask for a large cache.

Yes, I think I experienced that as well. I think the entire cache is serialized and passed around between workers.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expose zarr caching from xarray 421029352
1243823078 https://github.com/pydata/xarray/issues/2812#issuecomment-1243823078 https://api.github.com/repos/pydata/xarray/issues/2812 IC_kwDOAMm_X85KIzvm rabernat 1197350 2022-09-12T14:25:39Z 2022-09-12T14:25:39Z MEMBER

I have successfully used the Zarr LRU cache with Xarray. You just have to initialize the Store object outside of Xarray and then pass it to open_zarr or open_dataset(store, engine="zarr").

Have you tried that?
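A minimal sketch of that pattern, assuming zarr v2 and an fsspec-backed store (the bucket path and cache size are placeholders):

```python
import fsspec
import xarray as xr
import zarr

store = fsspec.get_mapper("gs://my-bucket/my-dataset.zarr")  # placeholder path
cached_store = zarr.LRUStoreCache(store, max_size=2**28)     # ~256 MB in-memory cache

ds = xr.open_dataset(cached_store, engine="zarr")
```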

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expose zarr caching from xarray 421029352
1216491512 https://github.com/pydata/xarray/issues/6916#issuecomment-1216491512 https://api.github.com/repos/pydata/xarray/issues/6916 IC_kwDOAMm_X85Igi_4 rabernat 1197350 2022-08-16T11:11:38Z 2022-08-16T11:11:38Z MEMBER

As a general principle, I think we should try to put enough information in encoding to enable one to re-open the dataset from scratch with the same parameters. So that would mean including the engine and other open_dataset options in encoding.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Given zarr-backed Xarray determine store and group 1339129609
1170451917 https://github.com/pydata/xarray/pull/6721#issuecomment-1170451917 https://api.github.com/repos/pydata/xarray/issues/6721 IC_kwDOAMm_X85Fw63N rabernat 1197350 2022-06-29T20:15:15Z 2022-06-29T20:15:15Z MEMBER

Awesome work!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fix .chunks loading lazy backed array data 1284071791
1146377099 https://github.com/pydata/xarray/issues/6662#issuecomment-1146377099 https://api.github.com/repos/pydata/xarray/issues/6662 IC_kwDOAMm_X85EVFOL rabernat 1197350 2022-06-03T21:30:48Z 2022-06-03T21:30:48Z MEMBER

Following up on the suggestion from @shoyer to not use a context manager, if I redefine my function as

```python
def open_pickle_and_reload(path):
    of = fsspec.open(path, mode='rb').open()
    ds1 = xr.open_dataset(of, engine='h5netcdf')

    # pickle it and reload it
    ds2 = loads(dumps(ds1))
    ds2.load()
```

...it appears to work fine.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Obscure h5netcdf http serialization issue with python's http.server 1260047355
1146184372 https://github.com/pydata/xarray/issues/6662#issuecomment-1146184372 https://api.github.com/repos/pydata/xarray/issues/6662 IC_kwDOAMm_X85EUWK0 rabernat 1197350 2022-06-03T17:05:00Z 2022-06-03T17:06:26Z MEMBER

```python
with fsspec.open('http://127.0.0.1:8000/tiny.nc', mode='rb') as fp:
    with xr.open_dataset(fp, engine='h5netcdf') as ds1:
        print(type(fp))
        print(fp.__dict__)
        ds1.load()
```

```
<class 'fsspec.implementations.http.HTTPFile'>
{'asynchronous': False, 'url': 'http://127.0.0.1:8000/tiny.nc', 'session': <aiohttp.client.ClientSession object at 0x18bcdddc0>, '_details': {'name': 'http://127.0.0.1:8000/tiny.nc', 'size': 6164, 'type': 'file'}, 'size': 6164, 'path': 'http://127.0.0.1:8000/tiny.nc', 'fs': <fsspec.implementations.http.HTTPFileSystem object at 0x110059dc0>, 'mode': 'rb', 'blocksize': 5242880, 'loc': 1075, 'autocommit': True, 'end': None, 'start': None, '_closed': False, 'kwargs': {}, 'cache': <fsspec.caching.BytesCache object at 0x18eda16d0>, 'loop': <_UnixSelectorEventLoop running=True closed=False debug=False>}
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Obscure h5netcdf http serialization issue with python's http.server 1260047355
1146119478 https://github.com/pydata/xarray/issues/6662#issuecomment-1146119478 https://api.github.com/repos/pydata/xarray/issues/6662 IC_kwDOAMm_X85EUGU2 rabernat 1197350 2022-06-03T16:04:21Z 2022-06-03T16:05:40Z MEMBER

The http.server apparently does not accept range requests. That could definitely be related. However, I don't understand why that would affect only the pickled version. If the server doesn't support range requests, how are we able to load the file at all? This works:

```python
with fsspec.open('http://127.0.0.1:8000/tiny.nc', mode='rb') as fp:
    with xr.open_dataset(fp, engine='h5netcdf') as ds1:
        ds1.load()
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Obscure h5netcdf http serialization issue with python's http.server 1260047355
1146099479 https://github.com/pydata/xarray/issues/6662#issuecomment-1146099479 https://api.github.com/repos/pydata/xarray/issues/6662 IC_kwDOAMm_X85EUBcX rabernat 1197350 2022-06-03T15:54:34Z 2022-06-03T15:54:34Z MEMBER

> Python's HTTP server does not normally provide content lengths without some extra work, that might be the difference.

Don't think that's it.

```
% curl -I "http://127.0.0.1:8000/tiny.nc"
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/3.9.9
Date: Fri, 03 Jun 2022 15:53:52 GMT
Content-type: application/x-netcdf
Content-Length: 6164
Last-Modified: Fri, 03 Jun 2022 15:00:52 GMT
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Obscure h5netcdf http serialization issue with python's http.server 1260047355
1137851771 https://github.com/pydata/xarray/issues/6633#issuecomment-1137851771 https://api.github.com/repos/pydata/xarray/issues/6633 IC_kwDOAMm_X85D0j17 rabernat 1197350 2022-05-25T21:10:44Z 2022-05-25T21:10:44Z MEMBER

Yes it is definitely a pathological example. 💣 But the fact remains that there are many cases where we just want to discover dataset contents as quickly as possible and want to avoid the cost of loading coordinates and creating indexes.

{
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Opening dataset without loading any indexes? 1247010680
1137821786 https://github.com/pydata/xarray/issues/6633#issuecomment-1137821786 https://api.github.com/repos/pydata/xarray/issues/6633 IC_kwDOAMm_X85D0cha rabernat 1197350 2022-05-25T20:34:30Z 2022-05-25T20:34:59Z MEMBER

Here is an example that really highlights the performance cost of always loading dimension coordinates:

```python
import zarr
store = zarr.storage.FSStore("s3://mur-sst/zarr/", anon=True)
%time list(zarr.open_consolidated(store))         # -> Wall time: 86.4 ms
%time ds = xr.open_dataset(store, engine='zarr')  # -> Wall time: 17.1 s
```

%prun confirms that Xarray is spending most of its time just loading data for the time axis, which you can reproduce at the zarr level as:

```python
zgroup = zarr.open_consolidated(store)
%time _ = zgroup['time'][:]   # -> Wall time: 14.7 s
```

Obviously this example is pretty extreme. There are things that could be done to optimize it, etc. But it really highlights the costs of eagerly loading dimension coordinates. If I don't care about label-based indexing for this dataset, I would rather have my 17s back!

:+1: to "indexes={} (empty dictionary) to explicitly skip creating indexes".

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Opening dataset without loading any indexes? 1247010680
1122649316 https://github.com/pydata/xarray/issues/4628#issuecomment-1122649316 https://api.github.com/repos/pydata/xarray/issues/4628 IC_kwDOAMm_X85C6kTk rabernat 1197350 2022-05-10T17:00:47Z 2022-05-10T17:02:34Z MEMBER

> Any pointers regarding where to start / modules involved to implement this? I would like to have a try.

The starting point would be to look at the code in indexing.py and try to understand how lazy indexing works.

In particular, look at

https://github.com/pydata/xarray/blob/3920c48d61d1f213a849bae51faa473b9c471946/xarray/core/indexing.py#L465-L470

Then you may want to try writing a class that looks like

```python
class LazilyConcatenatedArray:  # have to decide what to inherit from

    def __init__(self, *arrays: LazilyIndexedArray, concat_axis=0):
        ...  # figure out what you need to keep track of

    @property
    def shape(self):
        ...  # figure out how to determine the total shape

    def __getitem__(self, indexer) -> LazilyIndexedArray:
        ...  # figure out how to map an indexer to the right piece of data
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Lazy concatenation of arrays 753852119
1122567902 https://github.com/pydata/xarray/issues/6588#issuecomment-1122567902 https://api.github.com/repos/pydata/xarray/issues/6588 IC_kwDOAMm_X85C6Qbe rabernat 1197350 2022-05-10T15:48:03Z 2022-05-10T15:48:03Z MEMBER

Oops sorry for the duplicate issue! 🤦

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Support lazy concatenation *without dask* 1231184996
1115292947 https://github.com/pydata/xarray/pull/6566#issuecomment-1115292947 https://api.github.com/repos/pydata/xarray/issues/6566 IC_kwDOAMm_X85CegUT rabernat 1197350 2022-05-02T19:46:06Z 2022-05-02T19:46:06Z MEMBER

Exposing this option seems like a great idea IMO.

I'm not sure the best way to test this. I think the most basic test is just to make sure the inline=True option gets invoked in the test suite. Going further, one could examine the dask graph to make sure inlining is actually happening, but that sounds fragile and maybe also not xarray's responsibility. Let's just make sure it gets to dask.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  New inline_array kwarg for open_dataset 1223270563
1113408611 https://github.com/pydata/xarray/issues/6538#issuecomment-1113408611 https://api.github.com/repos/pydata/xarray/issues/6538 IC_kwDOAMm_X85CXURj rabernat 1197350 2022-04-29T14:46:13Z 2022-04-29T14:46:13Z MEMBER

Thanks so much for opening this @philippjfr!

I agree this is a major regression. Accessing .chunk on a variable should not trigger eager loading of the data.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Accessing chunks on zarr backed xarray seems to load entire array into memory 1220990859
1102992117 https://github.com/pydata/xarray/issues/6484#issuecomment-1102992117 https://api.github.com/repos/pydata/xarray/issues/6484 IC_kwDOAMm_X85BvlL1 rabernat 1197350 2022-04-19T19:08:31Z 2022-04-19T19:08:31Z MEMBER

Big :+1:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should we raise a more informative error on no zarr dir? 1203835220
1099797820 https://github.com/pydata/xarray/issues/6448#issuecomment-1099797820 https://api.github.com/repos/pydata/xarray/issues/6448 IC_kwDOAMm_X85BjZU8 rabernat 1197350 2022-04-15T02:38:48Z 2022-04-15T02:38:48Z MEMBER

I am guilty of sidetracking this issue into the politics of CRS encoding. That discussion is important. But in the meantime, @wankoelias's original issue reveals a narrower technical issue with Xarray's Zarr writer: Xarray won't let you serialize a dictionary attribute to zarr, even though zarr has no problem with this. That is a problem we can fix pretty easily.

The _validate_attrs helper function was just borrowed from to_netcdf:

https://github.com/pydata/xarray/blob/586992e8d2998751cb97b1cab4d3caa9dca116e0/xarray/backends/api.py#L133-L135

We could refactor this function to be more flexible to account for zarr's broader range of allowed attribute types (as we have evidently already done for h5netcdf). Or we could just bypass it completely in the to_zarr method. That is the only real decision we need to make here right now.

@wankoelias - you seem to understand the issue pretty well. Would you be game for making a PR? We would be glad to support you along the way.

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 2,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Writing GDAL ZARR _CRS attribute not possible 1194993450
1091703481 https://github.com/pydata/xarray/issues/6448#issuecomment-1091703481 https://api.github.com/repos/pydata/xarray/issues/6448 IC_kwDOAMm_X85BEhK5 rabernat 1197350 2022-04-07T12:57:17Z 2022-04-07T12:57:17Z MEMBER

@christophenoel - I share your perspective. But there is a huge swath of the geospatial world who basically hate NetCDF and avoid it like the plague. These communities prefer to use geotiff and GDAL. We need to reach for interoperability.

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 1,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
  Writing GDAL ZARR _CRS attribute not possible 1194993450
1090742693 https://github.com/pydata/xarray/issues/6448#issuecomment-1090742693 https://api.github.com/repos/pydata/xarray/issues/6448 IC_kwDOAMm_X85BA2ml rabernat 1197350 2022-04-06T20:21:20Z 2022-04-06T20:22:40Z MEMBER

I think the core problem here is that Zarr itself supports arbitrary json data structures as attributes, but netCDF does not. The Zarr serialization in Xarray is designed to emulate netCDF, but we could make that optional, for example, with a flag to bypass attribute encoding / decoding and just pass the python data directly through to Zarr.

However, my concern would be that the netCDF4 C library would not be able to read those files (nczarr). What happens if you try to open up a GDAL-created Zarr with netCDF4?

FWIW, the new GeoZarr Spec by @christophenoel does not use the GDAL convention for CRS. Instead, it recommends to use CF conventions for encoding CRS. This is more compatible with NetCDF, but won't be parsed correctly by GDAL.

I am a little discouraged that we have not managed to align better across projects so far (e.g. having this conversation before the GDAL Zarr CRS convention was implemented). 😞 For example, either of these two GDAL PRs:
  • https://github.com/OSGeo/gdal/pull/3896
  • https://github.com/OSGeo/gdal/pull/4521

However, it is not too late! Let's try to reach for a standard way of encoding CRS in Zarr that can be used across languages and implementations of Zarr.

My own preference would be to try to get GDAL to support the GeoZarr Spec and thus the CF-convention CRS attribute, rather than trying to get Xarray to be able to write the GDAL CRS convention.

{
    "total_count": 7,
    "+1": 7,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Writing GDAL ZARR _CRS attribute not possible 1194993450
1076810559 https://github.com/pydata/xarray/issues/6374#issuecomment-1076810559 https://api.github.com/repos/pydata/xarray/issues/6374 IC_kwDOAMm_X85ALtM_ rabernat 1197350 2022-03-23T20:54:39Z 2022-03-23T20:54:39Z MEMBER

Sure, to be clear, my hesitancy is mostly just around being reluctant to maintain more complexity in our zarr interface. If there is momentum to implement and maintain this compatibility, I am definitely not opposed. 🚀

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should the zarr backend support NCZarr conventions? 1172229856
1076622767 https://github.com/pydata/xarray/issues/6374#issuecomment-1076622767 https://api.github.com/repos/pydata/xarray/issues/6374 IC_kwDOAMm_X85AK_Wv rabernat 1197350 2022-03-23T17:39:57Z 2022-03-23T17:39:57Z MEMBER

My opinion is that we should not try to support the nczarr conventions directly. Xarray already supports nczarr via netCDF4. If netCDF4 can open the Zarr store, then Xarray can read it.

Supporting nczarr directly would require lots of custom logic within xarray. That's because nczarr introduces several additional metadata files that are not part of the zarr spec. These additional metadata files break the abstractions through which xarray interacts with zarr; working around this requires going under the hood and accessing the store object directly (rather than the zarr groups and arrays).

I would turn this question around and ask: if netCDF4 supports access to these datasets directly, what's the advantage of xarray bypassing netCDF4 and opening them directly? If there are significant performance benefits, I would be more likely to consider it worthwhile.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should the zarr backend support NCZarr conventions? 1172229856
1065385198 https://github.com/pydata/xarray/issues/6345#issuecomment-1065385198 https://api.github.com/repos/pydata/xarray/issues/6345 IC_kwDOAMm_X84_gHzu rabernat 1197350 2022-03-11T18:41:11Z 2022-03-11T18:41:11Z MEMBER

It seems like what we really want to do is verify that the datatype of the appended data matches the data type on disk.
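A rough sketch of the kind of check being suggested; check_append_dtypes is a hypothetical helper, not existing xarray code:

```python
import zarr

def check_append_dtypes(ds, store, group=None):
    """Compare each appended variable's dtype with the on-disk zarr array."""
    zgroup = zarr.open_group(store, mode="r", path=group)
    for name, var in ds.variables.items():
        if name in zgroup and zgroup[name].dtype != var.dtype:
            raise ValueError(
                f"dtype mismatch for {name!r}: on disk {zgroup[name].dtype}, "
                f"appending {var.dtype}"
            )
```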

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `to_zarr` raises `ValueError: Invalid dtype` with `mode='a'` (but not with `mode='w'`) 1164454058
1065350469 https://github.com/pydata/xarray/issues/6345#issuecomment-1065350469 https://api.github.com/repos/pydata/xarray/issues/6345 IC_kwDOAMm_X84_f_VF rabernat 1197350 2022-03-11T17:58:28Z 2022-03-11T17:58:28Z MEMBER

Thanks for reporting this @kmsampson. My feeling is that it is a bug...which we can hopefully fix pretty easily!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `to_zarr` raises `ValueError: Invalid dtype` with `mode='a'` (but not with `mode='w'`) 1164454058
1063401936 https://github.com/pydata/xarray/issues/6345#issuecomment-1063401936 https://api.github.com/repos/pydata/xarray/issues/6345 IC_kwDOAMm_X84_YjnQ rabernat 1197350 2022-03-09T21:43:49Z 2022-03-09T21:43:49Z MEMBER

The relevant code is here

https://github.com/pydata/xarray/blob/d293f50f9590251ce09543319d1f0dc760466f1b/xarray/backends/api.py#L1405-L1406

and here

https://github.com/pydata/xarray/blob/d293f50f9590251ce09543319d1f0dc760466f1b/xarray/backends/api.py#L1280-L1298

What I don't understand is why different validation is needed for the append scenario than for the write scenario. @shoyer worked on this in #5252, so maybe he has some ideas.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `to_zarr` raises `ValueError: Invalid dtype` with `mode='a'` (but not with `mode='w'`) 1164454058
1043038150 https://github.com/pydata/xarray/issues/1385#issuecomment-1043038150 https://api.github.com/repos/pydata/xarray/issues/1385 IC_kwDOAMm_X84-K3_G rabernat 1197350 2022-02-17T14:57:03Z 2022-02-17T14:57:03Z MEMBER

See deeper dive in https://github.com/pydata/xarray/discussions/6284

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
1043016100 https://github.com/pydata/xarray/issues/1385#issuecomment-1043016100 https://api.github.com/repos/pydata/xarray/issues/1385 IC_kwDOAMm_X84-Kymk rabernat 1197350 2022-02-17T14:36:23Z 2022-02-17T14:36:23Z MEMBER

Ah ok so if that is your goal, decode_times=False should be enough to solve it.

There is a problem with the time encoding in this file. The units (days since 1950-01-01T00:00:00Z) are not compatible with the values (738457.04166667, etc.). That would place your measurements sometime in the year 3971. This is part of the problem, but not the whole story.
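The arithmetic behind "the year 3971", using the values quoted above:

```python
days = 738457.04166667   # time value stored in the file
years = days / 365.25     # interpret "days since 1950-01-01" literally
print(1950 + years)       # ~3971.8, i.e. roughly the year 3971
```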

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
1043001146 https://github.com/pydata/xarray/issues/1385#issuecomment-1043001146 https://api.github.com/repos/pydata/xarray/issues/1385 IC_kwDOAMm_X84-Ku86 rabernat 1197350 2022-02-17T14:21:45Z 2022-02-17T14:22:23Z MEMBER

> (I could post to a web server if there's any reason to prefer that.)

In general that would be a little more convenient than google drive, because then we could download the file from python (rather than having a manual step). This would allow us to share a fully copy-pasteable code snippet to reproduce the issue. But don't worry about that for now.

First, I'd note that your issue is not really related to open_mfdataset at all, since it is reproduced just using open_dataset. The core problem is that you have ~15M timesteps, and it is taking forever to decode the times out of them. It's fast when you do decode_times=False because the data aren't actually being read. I'm going to make a post over in discussions to dig a bit deeper into this. StackOverflow isn't monitored too regularly by this community.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
1042937825 https://github.com/pydata/xarray/issues/1385#issuecomment-1042937825 https://api.github.com/repos/pydata/xarray/issues/1385 IC_kwDOAMm_X84-Kffh rabernat 1197350 2022-02-17T13:14:50Z 2022-02-17T13:14:50Z MEMBER

Hi Tom! 👋

So much has evolved about xarray since this original issue was posted. However, we continue to use it as a catchall for people looking to speed up open_mfdataset. I saw your stackoverflow post. Any chance you could post a link to the actual file in question?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
1033782892 https://github.com/pydata/xarray/pull/6258#issuecomment-1033782892 https://api.github.com/repos/pydata/xarray/issues/6258 IC_kwDOAMm_X849nkZs rabernat 1197350 2022-02-09T13:51:55Z 2022-02-09T13:51:55Z MEMBER

> came to the conclusion that the previously existing tests had been overly restrictive

Sounds very likely!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  removed check for last dask chunk size in to_zarr 1128485610
1033779138 https://github.com/pydata/xarray/pull/5692#issuecomment-1033779138 https://api.github.com/repos/pydata/xarray/issues/5692 IC_kwDOAMm_X849njfC rabernat 1197350 2022-02-09T13:47:43Z 2022-02-09T13:47:43Z MEMBER

Just chiming in to say 💪 ! We see the work you are putting in @benbovy. I'm so excited to be using this feature. Is there a way I can help?

{
    "total_count": 5,
    "+1": 5,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes 966983801
1033757210 https://github.com/pydata/xarray/pull/6258#issuecomment-1033757210 https://api.github.com/repos/pydata/xarray/issues/6258 IC_kwDOAMm_X849neIa rabernat 1197350 2022-02-09T13:23:23Z 2022-02-09T13:23:23Z MEMBER

Thanks for working on this Tobias! Yes I implemented much of the Dask / Zarr interface and would be happy to review when you're ready.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  removed check for last dask chunk size in to_zarr 1128485610
984940677 https://github.com/pydata/xarray/issues/1068#issuecomment-984940677 https://api.github.com/repos/pydata/xarray/issues/1068 IC_kwDOAMm_X846tQCF rabernat 1197350 2021-12-02T19:36:12Z 2021-12-02T19:36:12Z MEMBER

One solution to this problem might be the creation of a custom Xarray backend for NASA EarthData. This backend could manage authentication with EDL and have its own documentation. If this package were maintained by NASA, it would close the feedback loop more effectively.
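A very rough sketch of what such a backend could look like, using xarray's documented BackendEntrypoint mechanism; the class name and the earthdata_open helper are hypothetical, and a real implementation would live in its own package registered under the "xarray.backends" entry-point group:

```python
import xarray as xr
from xarray.backends import BackendEntrypoint


def earthdata_open(url):
    # placeholder for Earthdata Login (EDL) authentication; a real backend
    # would return an authenticated file-like object here
    raise NotImplementedError


class EarthDataBackendEntrypoint(BackendEntrypoint):
    def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
        fileobj = earthdata_open(filename_or_obj)
        return xr.open_dataset(fileobj, engine="h5netcdf", drop_variables=drop_variables)

    def guess_can_open(self, filename_or_obj):
        return isinstance(filename_or_obj, str) and "earthdata.nasa.gov" in filename_or_obj
```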

{
    "total_count": 5,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 4,
    "eyes": 1
}
  Use xarray.open_dataset() for password-protected Opendap files 186169975
984920867 https://github.com/pydata/xarray/issues/1068#issuecomment-984920867 https://api.github.com/repos/pydata/xarray/issues/1068 IC_kwDOAMm_X846tLMj rabernat 1197350 2021-12-02T19:08:54Z 2021-12-02T19:08:54Z MEMBER

Just wanted to say how much I appreciate @betolink acting as a communication channel between Xarray and NASA. Users often end up on our issue tracker because Xarray raises errors whenever it can't read data. But the source of these problems is not with Xarray, it's with the upstream data provider.

This also happens all the time with xmitgcm, e.g. https://github.com/MITgcm/xmitgcm/issues/266

It would be great if NASA had a better way to respond to these issues which didn't require that you "know a guy".

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Use xarray.open_dataset() for password-protected Opendap files 186169975
971790307 https://github.com/pydata/xarray/issues/5995#issuecomment-971790307 https://api.github.com/repos/pydata/xarray/issues/5995 IC_kwDOAMm_X8457Ffj rabernat 1197350 2021-11-17T17:18:41Z 2021-11-17T17:18:41Z MEMBER

> How can i tell xarray to load/dump variable by variable without loading the entire file?

You could try to chunk the data and then Dask will write it for you in chunks. To do it in serial you could use the dask single-threaded scheduler.
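A minimal sketch of that suggestion (the file names and chunk sizes are placeholders):

```python
import dask
import xarray as xr

ds = xr.open_dataset("input.nc", chunks={"time": 100})  # chunk so Dask writes piecewise
with dask.config.set(scheduler="single-threaded"):       # serial, low-memory write
    ds.to_netcdf("output.nc")
```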

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  High memory usage of xarray vs netCDF4 function 1056247970
969021506 https://github.com/pydata/xarray/issues/5878#issuecomment-969021506 https://api.github.com/repos/pydata/xarray/issues/5878 IC_kwDOAMm_X845whhC rabernat 1197350 2021-11-15T15:25:37Z 2021-11-15T15:25:46Z MEMBER

So there are two layers here where caching could be happening:
  • gcsfs / fsspec (python)
  • gcs itself

I propose we eliminate the python layer entirely for the moment. Whenever you load the dataset, its shape is completely determined by whatever zarr sees in gs://ldeo-glaciology/append_test/test5/temperature/.zarray. So try looking at this file directly. You can figure out its public URL and just do curl, e.g.

```
curl https://storage.googleapis.com/ldeo-glaciology/append_test/test5/temperature/.zarray
{
    "chunks": [ 3 ],
    "compressor": { "blocksize": 0, "clevel": 5, "cname": "lz4", "id": "blosc", "shuffle": 1 },
    "dtype": "<i8",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [ 6 ],
    "zarr_format": 2
}
```

Run this from the command line on your JupyterHub. Then try gcs.cat('ldeo-glaciology/append_test/test5/temperature/.zarray') and see if you see the same thing. Basically, just eliminate as many layers as possible from the problem until you get to the core issue.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  problem appending to zarr on GCS when using json token  1030811490
968993065 https://github.com/pydata/xarray/issues/1068#issuecomment-968993065 https://api.github.com/repos/pydata/xarray/issues/1068 IC_kwDOAMm_X845wakp rabernat 1197350 2021-11-15T14:58:05Z 2021-11-15T14:58:05Z MEMBER

At what point do we escalate this issue to NASA? Is there a channel via which they can receive and respond to user feedback?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Use xarray.open_dataset() for password-protected Opendap files 186169975
967363845 https://github.com/pydata/xarray/issues/5878#issuecomment-967363845 https://api.github.com/repos/pydata/xarray/issues/5878 IC_kwDOAMm_X845qM0F rabernat 1197350 2021-11-12T19:18:38Z 2021-11-12T19:18:38Z MEMBER

Ok I think I may understand what is happening

```python
# load the zarr store
ds_both = xr.open_zarr(mapper)
```

When you do this, zarr reads a file called gs://ldeo-glaciology/append_test/test5/temperature/.zarray. Since the data are public, I can look at it right now:

```
$ gsutil cat gs://ldeo-glaciology/append_test/test5/temperature/.zarray
{
    "chunks": [ 3 ],
    "compressor": { "blocksize": 0, "clevel": 5, "cname": "lz4", "id": "blosc", "shuffle": 1 },
    "dtype": "<i8",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [ 6 ],
}
```

Right now, it shows the shape is [6], as expected after the appending. However, if you read the file immediately after appending (within the 3600s max-age), you will get the cached copy. The cached copy will still be of shape [3]--it won't know about the append.

To test this hypothesis, you would need to disable caching on the bucket. Do you have privileges to do that?
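If you do have those privileges, one possible way is to set Cache-Control metadata on the object; a sketch using the google-cloud-storage client (not gcsfs):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("ldeo-glaciology")
blob = bucket.blob("append_test/test5/temperature/.zarray")
blob.cache_control = "no-store"   # ask GCS not to serve cached copies of this object
blob.patch()                      # push the metadata change
```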

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  problem appending to zarr on GCS when using json token  1030811490
967142419 https://github.com/pydata/xarray/issues/5878#issuecomment-967142419 https://api.github.com/repos/pydata/xarray/issues/5878 IC_kwDOAMm_X845pWwT rabernat 1197350 2021-11-12T14:05:36Z 2021-11-12T14:05:36Z MEMBER

Can you post the full stack trace of the error you get when you try to append?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  problem appending to zarr on GCS when using json token  1030811490
966665066 https://github.com/pydata/xarray/issues/5878#issuecomment-966665066 https://api.github.com/repos/pydata/xarray/issues/5878 IC_kwDOAMm_X845niNq rabernat 1197350 2021-11-11T22:17:32Z 2021-11-11T22:17:32Z MEMBER

I think that this is not an issue with xarray, zarr, or anything in python world but rather an issue with how caching works on GCS public buckets: https://cloud.google.com/storage/docs/metadata

To test this, forget about xarray and zarr for a minute and just use gcsfs to list the bucket contents before and after your writes. I think you will find that the default cache lifetime of 3600 seconds means that you cannot "see" the changes to the bucket or the objects as quickly as needed in order to append.
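A minimal version of that test, using the paths from this issue (the append call in the middle is a placeholder):

```python
import gcsfs

fs = gcsfs.GCSFileSystem()
before = fs.ls("ldeo-glaciology/append_test/test5")
# ... run the to_zarr(..., append_dim=...) call here ...
fs.invalidate_cache()             # drop gcsfs's own directory-listing cache
after = fs.ls("ldeo-glaciology/append_test/test5")
print(set(after) - set(before))   # new objects visible after the append (if any)
```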

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  problem appending to zarr on GCS when using json token  1030811490
966324523 https://github.com/pydata/xarray/issues/1068#issuecomment-966324523 https://api.github.com/repos/pydata/xarray/issues/1068 IC_kwDOAMm_X845mPEr rabernat 1197350 2021-11-11T13:59:55Z 2021-11-11T13:59:55Z MEMBER

I'd like to tag @betolink in this issue. He knows quite a bit about both Xarray and Earthdata login. Maybe he can help us get to the bottom of these problems. Luis, any ideas?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Use xarray.open_dataset() for password-protected Opendap files 186169975
964084038 https://github.com/pydata/xarray/issues/5954#issuecomment-964084038 https://api.github.com/repos/pydata/xarray/issues/5954 IC_kwDOAMm_X845dsFG rabernat 1197350 2021-11-09T11:56:30Z 2021-11-09T11:56:30Z MEMBER

Thanks for the info @alexamici!

> 2. but most backends serialise writes anyway, so the advantage is limited.

I'm not sure I understand this comment, specifically what is meant by "serialise writes". I often use Xarray to do distributed writes to Zarr stores using 100+ distributed dask workers. It works great. We would need the same thing from a TileDB backend.

We are focusing on the user-facing API, but in the end, whether we call it .to, .to_dataset, or .store_dataset is not really a difficult or important question. It's clear we need some generic writing method. The much harder question is the back-end API. As Alessandro says:

Adding support for a single save_dataset entry point to the backend API is trivial, but adding full support for possibly distributed writes looks like it is much more work.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Writeable backends via entrypoints 1047608434
961202990 https://github.com/pydata/xarray/issues/5918#issuecomment-961202990 https://api.github.com/repos/pydata/xarray/issues/5918 IC_kwDOAMm_X845Sssu rabernat 1197350 2021-11-04T16:21:23Z 2021-11-04T16:21:23Z MEMBER

Maybe @martindurant has some insights?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Reading zarr gives unspecific PermissionError: Access Denied when public data has been consolidated after being written to S3 1039844354
938741037 https://github.com/pydata/xarray/issues/1900#issuecomment-938741037 https://api.github.com/repos/pydata/xarray/issues/1900 IC_kwDOAMm_X8439A0t rabernat 1197350 2021-10-08T15:41:29Z 2021-10-08T15:41:29Z MEMBER

But Pydantic looks promising

Big :+1: to this.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Representing & checking Dataset schemas  295959111
863452266 https://github.com/pydata/xarray/pull/5252#issuecomment-863452266 https://api.github.com/repos/pydata/xarray/issues/5252 MDEyOklzc3VlQ29tbWVudDg2MzQ1MjI2Ng== rabernat 1197350 2021-06-17T18:07:28Z 2021-06-17T18:07:28Z MEMBER

Really sorry I didn't get around to review. My excuse is that I moved back to NYC last week and fell behind on everything. Thanks for moving it forward. 💪

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add mode="r+" for to_zarr and use consolidated writes/reads by default 874331538
863213400 https://github.com/pydata/xarray/issues/5028#issuecomment-863213400 https://api.github.com/repos/pydata/xarray/issues/5028 MDEyOklzc3VlQ29tbWVudDg2MzIxMzQwMA== rabernat 1197350 2021-06-17T12:53:16Z 2021-06-17T12:53:22Z MEMBER

So glad this got fixed upstream! That's how it is supposed to work! 🏆 Thanks to everyone for making this happen.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Saving zarr to remote location lower cases all data_vars 830507003
839106491 https://github.com/pydata/xarray/issues/5219#issuecomment-839106491 https://api.github.com/repos/pydata/xarray/issues/5219 MDEyOklzc3VlQ29tbWVudDgzOTEwNjQ5MQ== rabernat 1197350 2021-05-11T20:08:27Z 2021-05-11T20:08:27Z MEMBER

Instead we could require explicitly supplying chunks via the encoding parameter in the to_zarr() call.

This could also break existing workflows though. For example, pangeo-forge is using the encoding.chunks attribute to specify target dataset chunks.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr encoding attributes persist after slicing data, raising error on `to_zarr` 868352536
832712426 https://github.com/pydata/xarray/issues/3653#issuecomment-832712426 https://api.github.com/repos/pydata/xarray/issues/3653 MDEyOklzc3VlQ29tbWVudDgzMjcxMjQyNg== rabernat 1197350 2021-05-05T14:01:25Z 2021-05-05T14:01:33Z MEMBER

Update: there is now a way to read a remote netCDF file from an HTTP server directly using the netcdf-python library. The trick is to append #mode=bytes to the end of the url.

```python
import xarray as xr
import netCDF4  # I'm using version 1.5.6

url = "https://www.ldeo.columbia.edu/~rpa/NOAA_NCDC_ERSST_v3b_SST.nc#mode=bytes"

# raw netcdf4 Dataset
ds = netCDF4.Dataset(url)

# xarray Dataset
ds = xr.open_dataset(url)
```

{
    "total_count": 12,
    "+1": 5,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 6,
    "rocket": 0,
    "eyes": 0
}
  "[Errno -90] NetCDF: file not found: b" when opening netCDF from server 543197350
831970193 https://github.com/pydata/xarray/pull/5252#issuecomment-831970193 https://api.github.com/repos/pydata/xarray/issues/5252 MDEyOklzc3VlQ29tbWVudDgzMTk3MDE5Mw== rabernat 1197350 2021-05-04T14:07:03Z 2021-05-04T14:07:03Z MEMBER

Question: does this mode still require eager loading of dimension coordinates?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add mode="r+" for to_zarr and use consolidated writes/reads by default 874331538
828071017 https://github.com/pydata/xarray/issues/5219#issuecomment-828071017 https://api.github.com/repos/pydata/xarray/issues/5219 MDEyOklzc3VlQ29tbWVudDgyODA3MTAxNw== rabernat 1197350 2021-04-28T01:26:34Z 2021-04-28T01:26:34Z MEMBER

we probably would NOT want to use safe_chunks=False, correct?

correct

The problem in this issue is that the dataset is carrying around its original chunks in .encoding, and then xarray tries to use these values to set the chunk encoding on the second write op. The solution is to manually delete the chunk encoding from all your data variables. Something like

```python
for var in ds:
    del ds[var].encoding['chunks']
```

Originally part of #5056 was a change that would have xarray automatically do this deletion after some operations (such as calling .chunk()); however, we could not reach a consensus on the best way to implement that change. Your example is interesting because it is a slightly different scenario -- calling sel() instead of chunk() -- but the root cause appears to be the same: encoding['chunks'] is being kept around too conservatively.
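
For concreteness, a hypothetical version of the workflow in question (store paths and the time slice are made up):

```python
# encoding['chunks'] read from the source store survives the .sel() and can
# then conflict with the dask chunking of the subset on write.
import xarray as xr

ds = xr.open_zarr("source.zarr")                  # picks up encoding['chunks']
subset = ds.sel(time=slice("2000-01", "2000-06"))
for var in subset:
    del subset[var].encoding["chunks"]            # drop the stale chunk encoding
subset.to_zarr("subset.zarr", mode="w")
```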

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr encoding attributes persist after slicing data, raising error on `to_zarr` 868352536
826913149 https://github.com/pydata/xarray/pull/5065#issuecomment-826913149 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgyNjkxMzE0OQ== rabernat 1197350 2021-04-26T15:08:43Z 2021-04-26T15:08:43Z MEMBER

I think this PR has received a very thorough review. I would be pleased if someone from @pydata/xarray would merge it soon.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
826888674 https://github.com/pydata/xarray/pull/5065#issuecomment-826888674 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgyNjg4ODY3NA== rabernat 1197350 2021-04-26T14:38:49Z 2021-04-26T14:38:49Z MEMBER

The pre-commit workflow is raising a blackdoc error I am not seeing in my local env

```diff
diff --git a/doc/internals/duck-arrays-integration.rst b/doc/internals/duck-arrays-integration.rst
index eb5c4d8..2bc3c1f 100644
--- a/doc/internals/duck-arrays-integration.rst
+++ b/doc/internals/duck-arrays-integration.rst
@@ -25,7 +25,7 @@ argument:
     ...

     def _repr_inline_(self, max_width):
-        """ format to a single line with at most max_width characters """
+        """format to a single line with at most max_width characters"""
     ...
```
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
822571688 https://github.com/pydata/xarray/issues/4554#issuecomment-822571688 https://api.github.com/repos/pydata/xarray/issues/4554 MDEyOklzc3VlQ29tbWVudDgyMjU3MTY4OA== rabernat 1197350 2021-04-19T15:44:07Z 2021-04-19T15:44:07Z MEMBER

we rearrange the DataArrays to 2D arrays

FWIW, this is the exact same thing we do in xhistogram in order to apply histogram over a specific group of axes:

https://github.com/xgcm/xhistogram/blob/2681aee6fe04e7656c458f32277f87e76653b6e8/xhistogram/core.py#L238-L254

We noticed a similar problem with Dask's reshape implementation, raised here: https://github.com/dask/dask/issues/5544
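
A rough illustration of the reshape-to-2D trick and its chunking side effects (the array shape and chunks here are made up):

```python
# Collapse the axes being reduced into a single trailing axis; dask's reshape
# can rechunk the result in surprising ways (see dask/dask#5544).
import dask.array as da

arr = da.random.random((12, 100, 200), chunks=(1, 100, 200))  # (time, y, x)
flat = arr.reshape(arr.shape[0], -1)                          # -> (time, y*x)
print(flat.chunks)
```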

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Unexpected chunking of 3d DataArray in `polyfit()` 732910109
821315433 https://github.com/pydata/xarray/issues/5172#issuecomment-821315433 https://api.github.com/repos/pydata/xarray/issues/5172 MDEyOklzc3VlQ29tbWVudDgyMTMxNTQzMw== rabernat 1197350 2021-04-16T17:07:03Z 2021-04-16T17:07:03Z MEMBER

Yes I agree. Should I just close this and move it to h5netcdf?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Inconsistent attribute handling between netcdf4 and h5netcdf engines 859945463
817990859 https://github.com/pydata/xarray/pull/5065#issuecomment-817990859 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgxNzk5MDg1OQ== rabernat 1197350 2021-04-12T17:27:28Z 2021-04-12T17:27:28Z MEMBER

Any further feedback on this now reduced-scope PR? Merging this would help move Pangeo Forge forward.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
815019613 https://github.com/pydata/xarray/pull/5065#issuecomment-815019613 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgxNTAxOTYxMw== rabernat 1197350 2021-04-07T15:44:25Z 2021-04-07T15:44:25Z MEMBER

I have removed the controversial encoding['chunks'] stuff from the PR. Now it only contains the safe_chunks option in to_zarr.

If there are no further comments on this, I think this is good to go.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
814102743 https://github.com/pydata/xarray/pull/5065#issuecomment-814102743 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgxNDEwMjc0Mw== rabernat 1197350 2021-04-06T13:03:53Z 2021-04-06T13:03:53Z MEMBER

We seem to be unable to resolve the complexities around chunk encoding. I propose to remove this from the PR and reduce the scope to just the safe_chunks features. @aurghs should probably be the one to tackle the chunk encoding problem; unfortunately it exceeds my understanding, and I don't have time to dig deeper at the moment. In the meantime safe_chunks is important for pangeo-forge forward progress.

Please give a 👍 or 👎 to this idea if you have an opinion.

{
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
811975731 https://github.com/pydata/xarray/pull/5065#issuecomment-811975731 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgxMTk3NTczMQ== rabernat 1197350 2021-04-01T15:12:15Z 2021-04-01T15:12:15Z MEMBER

But it seems to me that having two different definitions of chunks (dask one and encoded one), is not very intuitive and it's not easy to define a clear default in writing.

My use for encoding.chunks is to tell Zarr what chunks to use on disk.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
811308284 https://github.com/pydata/xarray/pull/5065#issuecomment-811308284 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgxMTMwODI4NA== rabernat 1197350 2021-03-31T18:23:03Z 2021-03-31T18:23:03Z MEMBER

So any ideas how to proceed? 🧐

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
811275436 https://github.com/pydata/xarray/pull/5065#issuecomment-811275436 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgxMTI3NTQzNg== rabernat 1197350 2021-03-31T17:31:53Z 2021-03-31T17:32:12Z MEMBER

I just pushed a new commit which deletes all encoding inside variable.chunk(). But as you will see when the CI finishes, this leads to a lot of test failures. For example:

```
=============================================================================== FAILURES ===============================================================================
_______________________ TestNetCDF4ViaDaskData.test_roundtrip_string_encoded_characters _______________________

self = <xarray.tests.test_backends.TestNetCDF4ViaDaskData object at 0x18cba4c40>

    def test_roundtrip_string_encoded_characters(self):
        expected = Dataset({"x": ("t", ["ab", "cdef"])})
        expected["x"].encoding["dtype"] = "S1"
        with self.roundtrip(expected) as actual:
            assert_identical(expected, actual)
>           assert actual["x"].encoding["_Encoding"] == "utf-8"
E           KeyError: '_Encoding'

/Users/rpa/Code/xarray/xarray/tests/test_backends.py:485: KeyError
```

Why is chunk getting called here? Does it actually get called every time we load a dataset with chunks? If so, we will need a more sophisticated solution.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
811265134 https://github.com/pydata/xarray/pull/5065#issuecomment-811265134 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgxMTI2NTEzNA== rabernat 1197350 2021-03-31T17:17:07Z 2021-03-31T17:17:07Z MEMBER

Replace self._encoding with None here?

Thanks! Yeah that's what I had in mind. But I was wondering if there was an existing example of doing that elsewhere that I could copy.

In any case, I'll give it a try now.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
811189539 https://github.com/pydata/xarray/pull/5065#issuecomment-811189539 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgxMTE4OTUzOQ== rabernat 1197350 2021-03-31T16:12:13Z 2021-03-31T16:12:23Z MEMBER

In today's dev call, we proposed to handle encoding in chunk the same way we handle it in indexing: by deleting all encoding.

The problem is, I can't figure out where this happens. Can someone point me to the place in the code where indexing operations delete encoding?

A related question: I discovered this encoding option preferred_chunks, which is treated specially: https://github.com/pydata/xarray/blob/57a4479fcd3ebc579cf00e0d6bf85007eda44b56/xarray/core/dataset.py#L396

Should the Zarr backend be setting this?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
811148122 https://github.com/pydata/xarray/pull/5065#issuecomment-811148122 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgxMTE0ODEyMg== rabernat 1197350 2021-03-31T15:16:37Z 2021-03-31T15:16:37Z MEMBER

I appreciate the discussion on this PR. Does anyone have a concrete suggestion of what to do?

If we are not in agreement about the encoding stuff, perhaps I should remove that and just move forward with the safe_chunks part of this PR?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
810683846 https://github.com/pydata/xarray/issues/4470#issuecomment-810683846 https://api.github.com/repos/pydata/xarray/issues/4470 MDEyOklzc3VlQ29tbWVudDgxMDY4Mzg0Ng== rabernat 1197350 2021-03-31T01:22:29Z 2021-03-31T01:22:29Z MEMBER

I just saw this very cool tweet about ipyvista / iris integration and it reminded me of this thread.

Are there any clear steps we can take to help advance the vtk / pyvista / xarray integration further?

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray / vtk integration 710357592
807128780 https://github.com/pydata/xarray/pull/5065#issuecomment-807128780 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgwNzEyODc4MA== rabernat 1197350 2021-03-25T17:19:15Z 2021-03-25T17:19:15Z MEMBER

Perhaps a kwarg in to_zarr like ignore_encoding_chunks?

I would argue that this is unnecessary. If you want to explicitly drop encoding, just del da.encoding['chunks'] before writing. But most users don't figure out that they should do this, because the default behavior is counterintuitive.

The problem here is with the default behavior of propagating chunk encoding through computations when it no longer makes sense. My example with the dtype encoding illustrates that we already drop encoding on certain operations, so it's not unprecedented. It's more of an implementation question: where and how to do the dropping.

FWIW, I would also favor dropping encoding['chunks'] after indexing, coarsening, interpolating, etc. Basically anything that changes the array shape or chunk structure.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
806724345 https://github.com/pydata/xarray/pull/5065#issuecomment-806724345 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgwNjcyNDM0NQ== rabernat 1197350 2021-03-25T13:17:03Z 2021-03-25T13:17:59Z MEMBER

I see your point. I guess I don't fully understand where else in the code path encoding gets dropped. Consider this example

```python
import xarray as xr

ds = xr.Dataset({'foo': ('time', [1, 1], {'dtype': 'int16'})})
ds = xr.decode_cf(ds).compute()
assert "dtype" in ds.foo.encoding
assert "dtype" not in (0.5 * ds.foo).encoding
```

Xarray knows to drop the dtype encoding after an arithmetic operation. How does that work? To me .chunk feels like a similar case: an operation that invalidates any existing encoding.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
806701802 https://github.com/pydata/xarray/issues/4118#issuecomment-806701802 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDgwNjcwMTgwMg== rabernat 1197350 2021-03-25T13:01:56Z 2021-03-25T13:05:03Z MEMBER

So we have:
- Numerous promising prototypes to draw from
- A technical team who can write the proposal and execute the proposed work (@aurghs & @alexamici of B-open)
- Numerous supporting use cases from the bioimaging (@joshmoore), condensed matter (@tacaswell), and bayesian modeling (ArviZ; @OriolAbril) domains

We are just missing a PI, someone who is willing to put their name on top of the proposal and click submit. I have gone on record as committed to not leading any new proposals this year. And in any case, this is a good opportunity for someone else from the @pydata/xarray core dev team to try on a leadership role.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
805883595 https://github.com/pydata/xarray/issues/2300#issuecomment-805883595 https://api.github.com/repos/pydata/xarray/issues/2300 MDEyOklzc3VlQ29tbWVudDgwNTg4MzU5NQ== rabernat 1197350 2021-03-24T14:48:55Z 2021-03-24T14:48:55Z MEMBER

In #5056, I have implemented the solution of deleting chunks from encoding when chunk() is called on a variable. A review of that PR would be welcome.

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 2,
    "rocket": 0,
    "eyes": 0
}
  zarr and xarray chunking compatibility and `to_zarr` performance 342531772
804050169 https://github.com/pydata/xarray/pull/5065#issuecomment-804050169 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgwNDA1MDE2OQ== rabernat 1197350 2021-03-22T13:12:45Z 2021-03-22T13:12:45Z MEMBER

Thanks Anderson. Fixed by rebasing. Now the RTD build is failing, but there is no obvious error in the logs...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
803712024 https://github.com/pydata/xarray/pull/5065#issuecomment-803712024 https://api.github.com/repos/pydata/xarray/issues/5065 MDEyOklzc3VlQ29tbWVudDgwMzcxMjAyNA== rabernat 1197350 2021-03-22T01:58:23Z 2021-03-22T02:02:00Z MEMBER

Confused about the test error. It seems unrelated. In test_sparse.py:test_variable_method

```
E TypeError: no implementation found for 'numpy.allclose' on types that implement __array_function__: [<class 'numpy.ndarray'>, <class 'sparse._coo.core.COO'>]
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr chunking fixes 837243943
801240559 https://github.com/pydata/xarray/issues/4118#issuecomment-801240559 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDgwMTI0MDU1OQ== rabernat 1197350 2021-03-17T16:47:20Z 2021-03-17T16:47:20Z MEMBER

On today's Xarray dev call, we discussed pursuing another CZI grant to support this feature in Xarray. The image pyramid use case would provide a strong link to the bioimaging community. @alexamici and the B-open folks seem enthusiastic.

I had to leave the meeting early, so I didn't hear the end of the conversation. But did we decide who might serve as PI for such a proposal?

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
790088409 https://github.com/pydata/xarray/issues/2300#issuecomment-790088409 https://api.github.com/repos/pydata/xarray/issues/2300 MDEyOklzc3VlQ29tbWVudDc5MDA4ODQwOQ== rabernat 1197350 2021-03-03T21:55:44Z 2021-03-03T21:55:44Z MEMBER

alternatively to_zarr could ignore encoding["chunks"] when the data is already chunked?

I would not favor that. A user may choose to define their desired zarr chunks by putting this information in encoding. In this case, it's good to raise the error. (This is the case I had in mind when I wrote this code.)

The problem here is that encoding is often being carried over from the original dataset and persisted across operations that change chunk size.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  zarr and xarray chunking compatibility and `to_zarr` performance 342531772
789974968 https://github.com/pydata/xarray/issues/2300#issuecomment-789974968 https://api.github.com/repos/pydata/xarray/issues/2300 MDEyOklzc3VlQ29tbWVudDc4OTk3NDk2OA== rabernat 1197350 2021-03-03T18:54:43Z 2021-03-03T18:54:43Z MEMBER

I think we are all in agreement. Just waiting for someone to make a PR. It's probably just a few lines of code changes.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  zarr and xarray chunking compatibility and `to_zarr` performance 342531772
761136148 https://github.com/pydata/xarray/issues/4691#issuecomment-761136148 https://api.github.com/repos/pydata/xarray/issues/4691 MDEyOklzc3VlQ29tbWVudDc2MTEzNjE0OA== rabernat 1197350 2021-01-15T19:18:50Z 2021-01-15T19:18:50Z MEMBER

cc @martindurant for fsspec issue

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Non-HTTPS remote URLs no longer work as input for open_zarr 766826777
758373462 https://github.com/pydata/xarray/issues/4789#issuecomment-758373462 https://api.github.com/repos/pydata/xarray/issues/4789 MDEyOklzc3VlQ29tbWVudDc1ODM3MzQ2Mg== rabernat 1197350 2021-01-12T03:36:26Z 2021-01-12T03:36:26Z MEMBER

I uncovered this issue with Dask's SVG in its _repr_html_ method: https://github.com/dask/dask/issues/6670. The fix made a big difference in repr size. Possibly related?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Poor performance of repr of large arrays, particularly jupyter repr 782943813
741949159 https://github.com/pydata/xarray/pull/4461#issuecomment-741949159 https://api.github.com/repos/pydata/xarray/issues/4461 MDEyOklzc3VlQ29tbWVudDc0MTk0OTE1OQ== rabernat 1197350 2020-12-09T18:02:03Z 2020-12-09T18:02:11Z MEMBER

I think @shoyer has laid out the options in a very clear way.

I weakly favor option 2, as I think it preferable in terms of software architecture and our broader roadmap for Xarray. However, I am cognizant of the significant effort that @martindurant has put into this, and I don't want his effort to go to waste.

Some mitigating factors are:
- The example I gave above (https://github.com/pydata/xarray/pull/4461#issuecomment-741939277) shows that one high-impact feature that users want (async capabilities in Zarr) already works, albeit with a different syntax. So this PR is more about convenience.
- Presumably the knowledge about Xarray that Martin has gained by implementing this PR is transferable to a different context, and so we would not be starting from scratch if we went with 2.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec/zarr/mfdataset 709187212
741939277 https://github.com/pydata/xarray/pull/4461#issuecomment-741939277 https://api.github.com/repos/pydata/xarray/issues/4461 MDEyOklzc3VlQ29tbWVudDc0MTkzOTI3Nw== rabernat 1197350 2020-12-09T17:44:55Z 2020-12-09T17:44:55Z MEMBER

@rsignell-usgs: note that your example works without this PR (but with the just-released zarr 2.6.1) as follows:

```python
mapper = fsspec.get_mapper('s3://noaa-nwm-retro-v2.0-zarr-pds')
ds = xr.open_zarr(mapper, consolidated=True)
```

Took 4s on my laptop (outside of AWS).

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow fsspec/zarr/mfdataset 709187212
736786380 https://github.com/pydata/xarray/issues/4631#issuecomment-736786380 https://api.github.com/repos/pydata/xarray/issues/4631 MDEyOklzc3VlQ29tbWVudDczNjc4NjM4MA== rabernat 1197350 2020-12-01T20:03:54Z 2020-12-01T20:03:54Z MEMBER

Ok then I am 👍 on @dcherian's solution.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decode_cf fails when scale_factor is a length-1 list 753965875
736526797 https://github.com/pydata/xarray/issues/4631#issuecomment-736526797 https://api.github.com/repos/pydata/xarray/issues/4631 MDEyOklzc3VlQ29tbWVudDczNjUyNjc5Nw== rabernat 1197350 2020-12-01T12:39:53Z 2020-12-01T12:39:53Z MEMBER

But what did we do before?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Decode_cf fails when scale_factor is a length-1 list 753965875

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);