
issue_comments


5,143 rows where author_association = "MEMBER" and user = 1217238 sorted by updated_at descending


issue >30

  • WIP: indexing with broadcasting 25
  • Explicit indexes in xarray's data-model (Future of MultiIndex) 24
  • support for units 23
  • Multidimensional groupby 19
  • WIP: Optional indexes (no more default coordinates given by range(n)) 18
  • Hooks for XArray operations 18
  • WIP: Zarr backend 17
  • open_mfdataset too many files 15
  • API design for pointwise indexing 14
  • xarray to and from iris 14
  • New function for applying vectorized functions for unlabeled arrays to xarray objects 14
  • CFTimeIndex 14
  • xarray.backends refactor 14
  • Fixes OS error arising from too many files open 13
  • Document the new __repr__ 13
  • implement interp() 13
  • segmentation fault with `open_mfdataset` 12
  • Integration with dask/distributed (xarray backend design) 12
  • html repr of xarray object (for the notebook) 12
  • allow passing coordinate names as x and y to plot methods 11
  • Feature/rolling 11
  • Add tensordot to dataarray class also add its test to test_dataarray 11
  • Allow concat() to drop/replace duplicate index labels? 11
  • Remove caching logic from xarray.Variable 11
  • Sortby 11
  • Added PNC backend to xarray 11
  • pd.Grouper support? 10
  • added to_dict function for xarray objects 10
  • Towards a (temporary?) workaround for datetime issues at the xarray-level 10
  • v0.10 Release 10
  • …

user 1

  • shoyer · 5,143

author_association 1

  • MEMBER · 5,143
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1572412059 https://github.com/pydata/xarray/pull/7880#issuecomment-1572412059 https://api.github.com/repos/pydata/xarray/issues/7880 IC_kwDOAMm_X85duRqb shoyer 1217238 2023-06-01T16:51:07Z 2023-06-01T17:10:49Z MEMBER

Given that this error is only raised when Python is shutting down, which is exactly a case in which we do not need to clean up open file objects, maybe we can remove the __del__ method instead?

Something like:

```python
import atexit

@atexit.register
def _remove_del_method():
    # We don't need to close unclosed files at program exit,
    # and may not be able to do so, because Python is cleaning up
    # imports.
    del CachingFileManager.__del__
```

(I have not tested this!)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  don't use `CacheFileManager.__del__` on interpreter shutdown 1730664352
1572350143 https://github.com/pydata/xarray/pull/7880#issuecomment-1572350143 https://api.github.com/repos/pydata/xarray/issues/7880 IC_kwDOAMm_X85duCi_ shoyer 1217238 2023-06-01T16:16:40Z 2023-06-01T16:16:40Z MEMBER

I agree that this seems very hard to test!

Have you verified that this fixes things at least on your machine?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  don't use `CacheFileManager.__del__` on interpreter shutdown 1730664352
1546951468 https://github.com/pydata/xarray/issues/5511#issuecomment-1546951468 https://api.github.com/repos/pydata/xarray/issues/5511 IC_kwDOAMm_X85cNJss shoyer 1217238 2023-05-14T17:17:56Z 2023-05-14T17:17:56Z MEMBER

If we can find cases where we know concurrent writes are unsafe, we can definitely start raising errors. Better to be safe than to suffer from silent data corruption!

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Appending data to a dataset stored in Zarr format produce PermissonError or NaN values in the final result 927617256
1543042186 https://github.com/pydata/xarray/issues/7325#issuecomment-1543042186 https://api.github.com/repos/pydata/xarray/issues/7325 IC_kwDOAMm_X85b-PSK shoyer 1217238 2023-05-11T01:24:27Z 2023-05-11T01:24:27Z MEMBER

For anyone following along, I released a small package for reading TensorStore data into Xarray: https://github.com/google/xarray-tensorstore

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Support reading Zarr data via TensorStore 1465287257
1530685353 https://github.com/pydata/xarray/issues/4001#issuecomment-1530685353 https://api.github.com/repos/pydata/xarray/issues/4001 IC_kwDOAMm_X85bPGep shoyer 1217238 2023-05-02T00:35:52Z 2023-05-02T00:35:52Z MEMBER

Can we delete the "Flexible indexes" meeting? It doesn't happen anymore.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [community] Bi-weekly community developers meeting 606530049
1526489103 https://github.com/pydata/xarray/issues/7764#issuecomment-1526489103 https://api.github.com/repos/pydata/xarray/issues/7764 IC_kwDOAMm_X85a_GAP shoyer 1217238 2023-04-27T21:15:23Z 2023-04-27T21:15:23Z MEMBER

Allowing for explicitly passing a function matching the einsum interface is certainly more flexible than a boolean or enum argument, so @TomNicholas's suggestion of `einsum_func=np.einsum` is the version I would suggest.

The overhead from optimizing contraction paths is probably very small relative to the overhead of Xarray in general, so I would support setting `optimize=True` by default in Xarray, and/or using opt-einsum automatically if it is installed. JAX always uses opt-einsum (opt-einsum is actually a hard dependency) and I have never heard any complaints.
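
For context, a quick sketch of how the proposed `einsum_func` argument could be exercised (the argument itself is the proposal under discussion, not an existing xarray parameter; `opt_einsum.contract` is a real drop-in replacement for `np.einsum`):

```python
import numpy as np
import opt_einsum

a = np.random.rand(10, 20)
b = np.random.rand(20, 30)

# opt_einsum.contract accepts the same subscripts-and-operands call
# signature as np.einsum, so either function could be passed through
assert np.allclose(
    np.einsum("ij,jk->ik", a, b),
    opt_einsum.contract("ij,jk->ik", a, b),
)
```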

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Support opt_einsum in xr.dot 1672288892
1496912849 https://github.com/pydata/xarray/issues/6323#issuecomment-1496912849 https://api.github.com/repos/pydata/xarray/issues/6323 IC_kwDOAMm_X85ZORPR shoyer 1217238 2023-04-05T04:49:34Z 2023-04-05T04:49:34Z MEMBER

In the hypothetical invocation open_dataset(..., return_encoding=True), do you envision the returned encoding as being a separate returned object, or would it still be an attribute on the Dataset object?

My expectation was that this would be a separate object, e.g., dataset, encoding = xarray.open_dataset(..., return_encoding=True), where encoding is a dict providing the encoding on each variable, and which could be passed as the encoding argument into to_netcdf(). That said, I can see how keeping encoding as variable attributes could also be convenient.

"disable all encoding propagation by discarding encoding attributes once a Dataset has been modified" would be an intermediate step, on the route to removing encoding from Xarray's data model entirely entirely.

(As a side note, I would probably spell this as open_dataset_with_encoding rather than having a function with a variable return signature.)
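
A minimal sketch of the hypothetical API being discussed (neither `return_encoding` nor `open_dataset_with_encoding` exists in xarray today):

```python
import xarray

# hypothetical: open_dataset returns the dataset plus a dict mapping
# each variable name to its on-disk encoding
dataset, encoding = xarray.open_dataset("data.nc", return_encoding=True)

# ... modify dataset ...

# the captured encoding can be passed straight back when writing
dataset.to_netcdf("out.nc", encoding=encoding)
```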

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  propagation of `encoding` 1158378382
1464180874 https://github.com/pydata/xarray/issues/2227#issuecomment-1464180874 https://api.github.com/repos/pydata/xarray/issues/2227 IC_kwDOAMm_X85XRaCK shoyer 1217238 2023-03-10T18:04:23Z 2023-03-10T18:04:23Z MEMBER

@dschwoerer are you sure that you are actually calculating the same thing in both cases? What exactly do the values of slc[d] look like? I would test this on smaller inputs to verify. My guess is that you are inadvertently calculating something different, recalling that Xarray's broadcasting rules differ slightly from NumPy's.
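
To illustrate the broadcasting difference, a minimal, self-contained example (the arrays and names here are illustrative, not from the issue):

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(3), dims="x")
b = xr.DataArray(np.arange(4), dims="y")

# NumPy would raise for shapes (3,) and (4,); xarray broadcasts by
# dimension *name*, producing an outer-product-like (3, 4) result
print((a * b).dims, (a * b).shape)  # ('x', 'y') (3, 4)
```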

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Slow performance of isel 331668890
1434932769 https://github.com/pydata/xarray/issues/4079#issuecomment-1434932769 https://api.github.com/repos/pydata/xarray/issues/4079 IC_kwDOAMm_X85Vh1Yh shoyer 1217238 2023-02-17T17:03:52Z 2023-02-17T17:03:52Z MEMBER

I agree, automatic dimension names only ever really made sense for interactive use cases, where a user could see and fix the default names.

It's a little late to change the default now to raising an error instead, but maybe we could add a warning?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Unnamed dimensions 621078539
1414068565 https://github.com/pydata/xarray/issues/5081#issuecomment-1414068565 https://api.github.com/repos/pydata/xarray/issues/5081 IC_kwDOAMm_X85USPlV shoyer 1217238 2023-02-02T17:00:39Z 2023-02-02T17:00:39Z MEMBER

Is LazilyIndexedArray really a public API? I don't see it on the API docs page.

Personally I would not want to guarantee external stability/availability for this API in its current state.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Lazy indexing arrays as a stand-alone package 842436143
1412396693 https://github.com/pydata/xarray/pull/7496#issuecomment-1412396693 https://api.github.com/repos/pydata/xarray/issues/7496 IC_kwDOAMm_X85UL3aV shoyer 1217238 2023-02-01T17:00:21Z 2023-02-01T17:00:21Z MEMBER

I like open_zarr(...) because it's less typing than open_dataset(..., engine='zarr'). The automatic backend detection logic doesn't currently work for Zarr, and in every case it adds overhead, which could be significant in the case of remote storage backends like Zarr.

So personally I would rather go the other direction and add open_netcdf().

The inconsistency in the chunks argument is non-ideal, but that could be handled by a separate deprecation process.
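
For reference, the two spellings compared above (a sketch; note that the default for `chunks` differs between the two entry points, which is exactly the inconsistency mentioned):

```python
import xarray as xr

ds1 = xr.open_zarr("store.zarr")
# roughly equivalent, but spelled through the generic entry point:
ds2 = xr.open_dataset("store.zarr", engine="zarr", chunks="auto")
```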

{
    "total_count": 5,
    "+1": 5,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  deprecate open_zarr 1564661430
1378074559 https://github.com/pydata/xarray/pull/7418#issuecomment-1378074559 https://api.github.com/repos/pydata/xarray/issues/7418 IC_kwDOAMm_X85SI7-_ shoyer 1217238 2023-01-11T00:27:47Z 2023-01-11T00:27:47Z MEMBER

I agree, datatree is an important data structure for Xarray. My preferred way to do this would be to follow @rabernat's suggestion and to fork the code from the existing repo into the Xarray main codebase.

My main concern is that we should carefully evaluate the datatree API to make sure we won't want to change it soon. Once we bring it into Xarray, there will be a higher expectation that the interface will remain stable.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Import datatree in xarray? 1519552711
1366880017 https://github.com/pydata/xarray/issues/7404#issuecomment-1366880017 https://api.github.com/repos/pydata/xarray/issues/7404 IC_kwDOAMm_X85ReO8R shoyer 1217238 2022-12-28T19:46:07Z 2022-12-28T19:46:07Z MEMBER

If you care about memory usage, you should explicitly close files after you use them, e.g., by calling ds.close() or by using a context manager. Does that work for you?
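
For example, a minimal sketch (the file and variable names are placeholders):

```python
import xarray as xr

with xr.open_dataset("data.nc") as ds:
    result = ds["var"].mean().load()
# the underlying file handle is closed as soon as the block exits
```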

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak - xr.open_dataset() not releasing memory. 1512460818
1351908915 https://github.com/pydata/xarray/issues/7344#issuecomment-1351908915 https://api.github.com/repos/pydata/xarray/issues/7344 IC_kwDOAMm_X85QlH4z shoyer 1217238 2022-12-14T18:24:04Z 2022-12-14T18:24:04Z MEMBER

I think it's OK to still require bottleneck for ffill and bfill:

  1. There are no numerical concerns: these functions simply repeat numbers forward (or backwards).
  2. There is no good alternative to using a loop, and writing the loop in NumPy would be prohibitively slow (see the sketch below for the simple 1D case).
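
As an aside, the simple 1D case does have a well-known vectorized NumPy trick (sketched below, under the assumption of a plain float array); what it does not handle is the `limit` argument or axis-generic n-dimensional arrays, which is where bottleneck's compiled loop earns its keep:

```python
import numpy as np

def ffill_1d(a):
    # index of the most recent non-NaN position at or before each element
    idx = np.where(np.isnan(a), 0, np.arange(len(a)))
    np.maximum.accumulate(idx, out=idx)
    return a[idx]

print(ffill_1d(np.array([1.0, np.nan, np.nan, 4.0, np.nan])))
# [1. 1. 1. 4. 4.]
```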
{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Disable bottleneck by default? 1471685307
1345646743 https://github.com/pydata/xarray/pull/7368#issuecomment-1345646743 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QNPCX shoyer 1217238 2022-12-11T20:17:15Z 2022-12-11T20:17:15Z MEMBER

I'm actually trying to merge IndexedCoordinates with Coordinates but I'm stuck: the latter is abstract and I don't really see how I could refactor it together with DatasetCoordinates and DataArrayCoordinates

Coordinates is abstract because in the (current) Xarray data model, it doesn't actually store any data -- coordinates are stored in private attributes of the original Dataset (._variables and ._coord_names) or DataArray (._coords). So Coordinates needs to serve as a proxy for the data.

In the long term, I think we should refactor Dataset/DataArray to actually store data (coordinate variables, indexes and dimension sizes) on Coordinates, but that's a bigger refactor.

For now, it's worth noting that the current Coordinates class isn't actually exposed in Xarray's public API, just the DatasetCoordinates and DataArrayCoordinates classes (and not even their constructors). So the intermediate step I would try is:

  1. Rename the current Coordinates base class to AbstractCoordinates.
  2. Rename your IndexedCoordinates class to Coordinates. Expose it in the public API. Make sure it can handle DatasetCoordinates and DataArrayCoordinates in the constructor.
  3. Maybe: use some Python magic to make DatasetCoordinates/DataArrayCoordinates subclasses of the new Coordinates. Or maybe make them actual subclasses, overriding many of the methods (including the constructor).
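
A rough sketch of the class hierarchy those steps would produce (hypothetical names and structure, not current xarray code):

```python
from collections.abc import Mapping


class AbstractCoordinates(Mapping):
    # formerly ``Coordinates``: a read-only proxy over coordinate data
    # stored elsewhere (on a Dataset or DataArray)
    ...


class Coordinates(AbstractCoordinates):
    # formerly ``IndexedCoordinates``: actually owns its variables and
    # indexes, and is safe to expose publicly
    def __init__(self, coords=None, indexes=None):
        ...


class DatasetCoordinates(Coordinates):
    ...


class DataArrayCoordinates(Coordinates):
    ...
```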

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1344968954 https://github.com/pydata/xarray/pull/7368#issuecomment-1344968954 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QKpj6 shoyer 1217238 2022-12-10T01:37:35Z 2022-12-10T01:37:35Z MEMBER

Long term, do you think it would make sense to merge together Indexes, Coordinates and IndexedCoordinates? They are sort of all containers for the same thing.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1344944917 https://github.com/pydata/xarray/pull/7368#issuecomment-1344944917 https://api.github.com/repos/pydata/xarray/issues/7368 IC_kwDOAMm_X85QKjsV shoyer 1217238 2022-12-10T00:31:46Z 2022-12-10T00:31:46Z MEMBER

what do you think about the approach proposed here? I'd like to check that with you before going further with tests, docs, etc.

Generally this looks great to me!

  • How to avoid building any default index? It seems silly to add or use the indexes argument just for that purpose? We could address that later.

My suggestion would be:

  • coords passed as a dict: create default indexes
  • coords passed as IndexedCoordinates: do not create defaults

Alternatively to an IndexedCoordinates subclass I was wondering if we could reuse the Coordinates base class?

Yes, this makes more sense to me!

What if the Indexes class was a facade based on IndexedCoordinates instead of the other way around?

Yes, I also agree! This makes more sense.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Expose "Coordinates" as part of Xarray's public API 1485037066
1341296800 https://github.com/pydata/xarray/issues/6610#issuecomment-1341296800 https://api.github.com/repos/pydata/xarray/issues/6610 IC_kwDOAMm_X85P8pCg shoyer 1217238 2022-12-07T17:12:05Z 2022-12-07T17:12:05Z MEMBER

I also like the idea of creating specific Grouper objects for different types of selection, e.g., UniqueGrouper (the default), BinGrouper, TimeResampleGrouper, etc.

{
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Update GroupBy constructor for grouping by multiple variables, dask arrays 1236174701
1338121102 https://github.com/pydata/xarray/issues/7350#issuecomment-1338121102 https://api.github.com/repos/pydata/xarray/issues/7350 IC_kwDOAMm_X85PwhuO shoyer 1217238 2022-12-05T20:23:46Z 2022-12-05T20:23:46Z MEMBER

IMO, it's not correctly implementing the rule as you phrased it. You said "still present", which isn't the case here since the coordinate wasn't present before.

Another way of describing the current behavior would be that xarray keeps around "every coordinate which could possibly still be valid," which is determined based upon dimension names.

The main challenge is that "Coordinate variables should not have their coordinates changed" doesn't really make sense in Xarray's data model. Only Dataset or DataArray objects have coordinates, which apply to the entire Dataset/DataArray.

Let me give an example of why we might want to keep scalar coordinates around. Consider a Dataset where lat and lon need to be represented as 2D arrays, along x and y dimensions. If we index out a single lat/lon point, i.e., ds.isel(x=0, y=0) it would have scalar coordinates "x", "y", "lat" and "lon." If we now convert any of these to a DataArray, arguably all the coordinates are still valid.
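
The example in the paragraph above, spelled out as a minimal sketch (with made-up values):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"data": (("x", "y"), np.zeros((2, 3)))},
    coords={
        "lat": (("x", "y"), np.random.rand(2, 3)),
        "lon": (("x", "y"), np.random.rand(2, 3)),
    },
)

point = ds.isel(x=0, y=0)
# "lat" and "lon" survive as scalar coordinates, and arguably remain
# valid on any DataArray extracted from this point
print(point["data"].coords)
```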

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Coordinate variable gains coordinate on subset 1473329967
1336304695 https://github.com/pydata/xarray/issues/7342#issuecomment-1336304695 https://api.github.com/repos/pydata/xarray/issues/7342 IC_kwDOAMm_X85PpmQ3 shoyer 1217238 2022-12-04T02:28:45Z 2022-12-04T02:28:45Z MEMBER

The "robust" part is really just a modification to how the limits for color scales are chosen, i.e., ignoring the bottom and top 2% of the dtaa from the color scale. So it sounds like what you're hoping for is separate per-column or per-row color scaling?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `xr.DataArray.plot.pcolormesh(robust="col/row")` 1471561942
1336302962 https://github.com/pydata/xarray/issues/7350#issuecomment-1336302962 https://api.github.com/repos/pydata/xarray/issues/7350 IC_kwDOAMm_X85Ppl1y shoyer 1217238 2022-12-04T02:16:25Z 2022-12-04T02:16:25Z MEMBER

This was an intentional design choice, back in the early days of Xarray.

The rule Xarray uses for choosing which coordinates to associate with a DataArray created from a Dataset or DataArray is "every coordinate whose dimensions are still present on the new DataArray." This includes scalar coordinates, which are always kept around (because their dimensions are always included).

What rule would you suggest instead? I agree that the behavior in this case "feels" wrong, but keep in mind that once time becomes a scalar coordinate, Xarray doesn't have any way of knowing that it used to have its own dimension.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Coordinate variable gains coordinate on subset 1473329967
1336299057 https://github.com/pydata/xarray/issues/7344#issuecomment-1336299057 https://api.github.com/repos/pydata/xarray/issues/7344 IC_kwDOAMm_X85Ppk4x shoyer 1217238 2022-12-04T01:55:34Z 2022-12-04T01:55:34Z MEMBER

The case where Bottleneck really makes a difference is moving window statistics, where it uses a smarter algorithm than our current NumPy implementation, which creates a moving window view.

Otherwise, I agree, it probably isn't worth the trouble.

That said -- we could also switch to smarter NumPy-based algorithms to implement most moving window calculations, e.g., using np.nancumsum for moving window means.
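
A sketch of the cumulative-sum idea for a NaN-skipping moving mean (illustrative only; window and edge handling would need to match bottleneck's semantics):

```python
import numpy as np

def moving_mean(a, window):
    # prepend 0 so that s[i] is the sum of the first i elements
    s = np.nancumsum(np.concatenate([[0.0], a]))
    n = np.cumsum(np.concatenate([[0], ~np.isnan(a)]))
    # mean over each full window, counting only non-NaN entries
    return (s[window:] - s[:-window]) / (n[window:] - n[:-window])

print(moving_mean(np.array([1.0, 2.0, np.nan, 4.0]), 2))
# [1.5 2.  4. ]
```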

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Disable bottleneck by default? 1471685307
1330164500 https://github.com/pydata/xarray/issues/7299#issuecomment-1330164500 https://api.github.com/repos/pydata/xarray/issues/7299 IC_kwDOAMm_X85PSLMU shoyer 1217238 2022-11-29T06:53:48Z 2022-11-29T06:53:48Z MEMBER

Difference between empty and non-empty arrays comes from different logic used for empty arrays in Variable._getitem_with_mask.

The problem itself can be solved by fixing dtypes.maybe_promote to return fill_value=np.float32('nan') instead of fill_value=np.nan on dtype('float32') input.

Thanks for the excellent report!

I agree, this sounds like a good fix to me.

I think something like the following would work: replace the return line of maybe_promote

```python
return np.dtype(dtype), fill_value
```

with

```python
dtype = np.dtype(dtype)
fill_value = dtype.type(fill_value)
return dtype, fill_value
```
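
The key detail is that `dtype.type(...)` produces a NumPy scalar of the matching precision (a quick demonstration, independent of xarray):

```python
import numpy as np

dtype = np.dtype("float32")
fill_value = dtype.type(np.nan)
print(type(fill_value))  # <class 'numpy.float32'>
print(fill_value.dtype)  # float32
```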

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.reindex of empty array changes dtype from float32 to float64 1455771929
1328156723 https://github.com/pydata/xarray/pull/7323#issuecomment-1328156723 https://api.github.com/repos/pydata/xarray/issues/7323 IC_kwDOAMm_X85PKhAz shoyer 1217238 2022-11-27T02:31:51Z 2022-11-27T02:31:51Z MEMBER

Use cases would be in any web service that would like to provide the final data values back to a user in JSON.

For what it's worth, I think your users will have a poor experience with encoded JSON data for very large arrays. It will be slow to compress and transfer this data.

In the long term, you would probably do better to transmit the data in some binary form (e.g., by calling tobytes() on the underlying np.ndarray objects, or by using Xarray's to_netcdf).
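
For instance, a sketch (where `ds` stands for any loaded Dataset and "var" is a placeholder name):

```python
import xarray as xr

ds = xr.open_dataset("data.nc")

raw = ds["var"].values.tobytes()  # raw binary buffer for one variable
nc_bytes = ds.to_netcdf()         # whole dataset as netCDF bytes when no path is given
```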

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  (Issue #7324) added functions that return data values in memory efficient manner 1465047346
1328156304 https://github.com/pydata/xarray/pull/7323#issuecomment-1328156304 https://api.github.com/repos/pydata/xarray/issues/7323 IC_kwDOAMm_X85PKg6Q shoyer 1217238 2022-11-27T02:27:07Z 2022-11-27T02:27:07Z MEMBER

Thanks for the report and the PR!

This really needs a "minimal complete verifiable" example (e.g., by creating and loading a Zarr array with random data) so others can verify the performance gains you reported: https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports https://stackoverflow.com/help/minimal-reproducible-example

To be honest, this fix looks a little funny to me, because NumPy's own implementation of tolist() is so similar. I would love to understand what is going on.

If you can reproduce the issue only using NumPy, it could also make sense to file this as an upstream bug report to NumPy. The NumPy maintainers are in a better position to debug tricky memory allocation issues involving NumPy.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  (Issue #7324) added functions that return data values in memory efficient manner 1465047346
1295283938 https://github.com/pydata/xarray/pull/7214#issuecomment-1295283938 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85NNHbi shoyer 1217238 2022-10-28T17:49:10Z 2022-10-28T17:49:10Z MEMBER

Explicitly providing indexes is an advanced user feature.

Agreed. However, xr.Dataset(coords={"x": pandas_midx}) is something that presumably a lot of users rely on (it is used extensively in Xarray's tests) and that we should really deprecate IMO. If we don't provide a convenient alternative, I expect many of those users will complain.

I agree -- we should support this for backwards compatibility (even if we deprecate it).

it's easier to explicitly manipulate indexes in the form of a dict

While generally I also prefer handling plain dict objects over custom dict-like objects, here I don't see many reasons for manipulating Xarray index objects independently of their coordinate variables. Indexes allows keeping them tied together, and it is already returned by .xindexes.

EDIT -- For more context: initially an Indexes object was almost equivalent to a Frozen(obj._indexes). In #5692 I tried hard and struggled to keep dealing with separate dicts of indexes and indexed variables, but in the end it made things much easier to encapsulate the variables in Indexes, which is also used internally in different places.

OK, this totally makes sense.

I don't love that it is possible to express invalid states in Xarray's data model. This motivated the creation of assert_internal_invariants, which is currently mostly a concern for Xarray's own developers, but when we expose the indexes argument, it will be easier for users to make the same sort of errors.

I wonder if we should consider the broader refactor of merging the Indexes and Coordinates objects, and expose the constructor as a public API. For clarity, I'll call it CoordinatesAndIndexes for now, but it could likely reuse the public name of Coordinates.

This would have a number of benefits:

  1. It's impossible to provide inconsistent coords and indexes, because there is no separate indexes argument.
  2. Likewise, it is impossible to create inconsistent coordinates and indexes on an existing Xarray object.
  3. All the logic for verifying consistent coords and indexes can go in one place, shared between Dataset/DataArray. (Yes, it would be annoying to refactor Dataset to merge in variables from CoordinatesAndIndexes rather than the current separate Dataset._variables)
  4. The public API also becomes clearer: if users want default indexes, they can pass a dict of variables into coords. If they want to copy indexes from another object, they can pass in a CoordinatesAndIndexes object (either from another Xarray object or constructed directly), as sketched below.
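
A sketch of what that usage could look like (CoordinatesAndIndexes is the placeholder name from this comment; nothing here exists in xarray today):

```python
import xarray as xr

# dict of variables: default indexes get created
ds1 = xr.Dataset({"data": ("x", [10, 20, 30])}, coords={"x": [1, 2, 3]})

# hypothetical: a CoordinatesAndIndexes object carries its indexes with
# it, so no separate ``indexes`` argument is needed and no defaults are
# built behind the user's back
coords = CoordinatesAndIndexes({"x": [1, 2, 3]}, indexes=...)
ds2 = xr.Dataset({"data": ("x", [10, 20, 30])}, coords=coords)
```
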
{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378
1294262457 https://github.com/pydata/xarray/pull/7221#issuecomment-1294262457 https://api.github.com/repos/pydata/xarray/issues/7221 IC_kwDOAMm_X85NJOC5 shoyer 1217238 2022-10-28T00:27:22Z 2022-10-28T00:27:22Z MEMBER

I no longer remember why I added these checks, but I certainly did not expect to see this sort of performance penalty!

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Remove debugging slow assert statement 1423312198
1293909730 https://github.com/pydata/xarray/pull/7214#issuecomment-1293909730 https://api.github.com/repos/pydata/xarray/issues/7214 IC_kwDOAMm_X85NH37i shoyer 1217238 2022-10-27T18:28:40Z 2022-10-27T18:28:40Z MEMBER

I'm thinking of only accepting one or more instances of Indexes as indexes argument in the Dataset and DataArray constructors

I would lean against this, only because it's easier to explicitly manipulate indexes in the form of a dict than an xarray.Indexes object.

Explicitly providing indexes is an advanced user feature. I think it's OK to require users to do a bit more work in this case and to not necessarily do consistency checks (beyond verifying that the coordinate variables exist).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Pass indexes directly to the DataArray and Dataset constructors 1422543378
1288188522 https://github.com/pydata/xarray/issues/7132#issuecomment-1288188522 https://api.github.com/repos/pydata/xarray/issues/7132 IC_kwDOAMm_X85MyDJq shoyer 1217238 2022-10-23T19:59:28Z 2022-10-23T19:59:28Z MEMBER

This is correct -- CFDatetimeCoder.encode is not lazy, even if the inputs are Dask arrays.

We would welcome contributions to fix this. This would entail making the encode method look similar to the decode method (using lazy_elemwise_func).

We would also need a fall-back method for determining appropriate time units without looking at the array values. Something like seconds since 1900-01-01T00:00:00 would probably be a reasonable choice.
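
A sketch of such a fall-back encoding, which picks fixed units instead of inspecting array values (illustrative only; the real implementation would live in xarray's CF coding machinery):

```python
import numpy as np

FALLBACK_UNITS = "seconds since 1900-01-01T00:00:00"
_REFERENCE = np.datetime64("1900-01-01T00:00:00", "ns")

def encode_datetime(times):
    # purely elementwise, so it could be wrapped with lazy_elemwise_func
    # and applied lazily to dask arrays
    return (times - _REFERENCE) / np.timedelta64(1, "s")

print(encode_datetime(np.array(["1900-01-01T00:01:00"], dtype="datetime64[ns]")))
# [60.]
```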

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Saving a DataArray of datetime objects as zarr is not a lazy operation despite compute=False 1397532790
1286421985 https://github.com/pydata/xarray/issues/6807#issuecomment-1286421985 https://api.github.com/repos/pydata/xarray/issues/6807 IC_kwDOAMm_X85MrT3h shoyer 1217238 2022-10-21T03:49:18Z 2022-10-21T03:49:18Z MEMBER

Cubed should define a concatenate function, so that should be OK

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Alternative parallel execution frameworks in xarray 1308715638
1278202565 https://github.com/pydata/xarray/pull/4879#issuecomment-1278202565 https://api.github.com/repos/pydata/xarray/issues/4879 IC_kwDOAMm_X85ML9LF shoyer 1217238 2022-10-13T21:34:05Z 2022-10-13T21:34:05Z MEMBER

I think we could fix this by marking CachingFileManager with typing.final
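
For reference, that would look something like this (a sketch; FileManager is the existing base class in xarray.backends):

```python
from typing import final

@final  # type checkers will now flag any attempt to subclass
class CachingFileManager(FileManager):
    ...
```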

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Cache files for different CachingFileManager objects separately 803068773
1269050790 https://github.com/pydata/xarray/pull/4879#issuecomment-1269050790 https://api.github.com/repos/pydata/xarray/issues/4879 IC_kwDOAMm_X85LpC2m shoyer 1217238 2022-10-05T22:27:28Z 2022-10-05T22:27:28Z MEMBER

Anyone want to review here? I think this should be ready to go in.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Cache files for different CachingFileManager objects separately 803068773
1268700309 https://github.com/pydata/xarray/pull/4879#issuecomment-1268700309 https://api.github.com/repos/pydata/xarray/issues/4879 IC_kwDOAMm_X85LntSV shoyer 1217238 2022-10-05T17:06:02Z 2022-10-05T17:57:19Z MEMBER

~~Actually maybe we don't need to keep files open after pickling... let me give this one more try.~~

Nevermind, this didn't work -- it still results in failing tests with dask-distributed on Windows.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Cache files for different CachingFileManager objects separately 803068773
1268684962 https://github.com/pydata/xarray/pull/4879#issuecomment-1268684962 https://api.github.com/repos/pydata/xarray/issues/4879 IC_kwDOAMm_X85Lnpii shoyer 1217238 2022-10-05T16:51:14Z 2022-10-05T16:51:14Z MEMBER

OK, after a bit more futzing, tests are passing and I think this is actually ready to go in. I ended up leaving in the reference counting after all -- I couldn't figure out another way to keep files open after a pickle round-trip.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Cache files for different CachingFileManager objects separately 803068773
1260250383 https://github.com/pydata/xarray/issues/6293#issuecomment-1260250383 https://api.github.com/repos/pydata/xarray/issues/6293 IC_kwDOAMm_X85LHeUP shoyer 1217238 2022-09-28T00:49:26Z 2022-09-28T00:49:26Z MEMBER

Yes yes -- the sooner we can get rid of MultiIndex special cases the better!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes: next steps 1148021907
1259823913 https://github.com/pydata/xarray/pull/4879#issuecomment-1259823913 https://api.github.com/repos/pydata/xarray/issues/4879 IC_kwDOAMm_X85LF2Mp shoyer 1217238 2022-09-27T17:26:06Z 2022-09-27T17:26:06Z MEMBER

I added @cjauvin's integration test, and verified that the fix works for the scipy and h5netcdf backends.

Unfortunately, it doesn't work yet for the netCDF4 backend. I don't think we can solve this in Xarray without fixes to netCDF4-Python or the netCDF-C library: https://github.com/Unidata/netcdf4-python/issues/1195

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Cache files for different CachingFileManager objects separately 803068773
1249910951 https://github.com/pydata/xarray/issues/7045#issuecomment-1249910951 https://api.github.com/repos/pydata/xarray/issues/7045 IC_kwDOAMm_X85KgCCn shoyer 1217238 2022-09-16T22:26:36Z 2022-09-16T22:26:36Z MEMBER

As a concrete example, suppose we have two datasets:

  1. Hourly predictions for 10 days
  2. Daily observations for a month

```python
import numpy as np
import pandas as pd
import xarray

predictions = xarray.DataArray(
    np.random.RandomState(0).randn(24 * 10),
    {'time': pd.date_range('2022-01-01', '2022-01-11', freq='1h', closed='left')},
)
observations = xarray.DataArray(
    np.random.RandomState(1).randn(31),
    {'time': pd.date_range('2022-01-01', '2022-01-31', freq='24h')},
)
```

Today, if you compare these datasets, they automatically align:

```
>>> predictions - observations
<xarray.DataArray (time: 10)>
array([ 0.13970698,  2.88151104, -1.0857261 ,  2.21236931, -0.85490761,
        2.67796423,  0.63833301,  1.94923669, -0.35832191,  0.23234996])
Coordinates:
  * time     (time) datetime64[ns] 2022-01-01 2022-01-02 ... 2022-01-10
```

With this proposed change, you would get an error, e.g., something like:

```
>>> predictions - observations
ValueError: xarray objects are not aligned along dimension 'time':
array(['2022-01-01T00:00:00.000000000', '2022-01-02T00:00:00.000000000',
       '2022-01-03T00:00:00.000000000', '2022-01-04T00:00:00.000000000',
       '2022-01-05T00:00:00.000000000', '2022-01-06T00:00:00.000000000',
       '2022-01-07T00:00:00.000000000', '2022-01-08T00:00:00.000000000',
       '2022-01-09T00:00:00.000000000', '2022-01-10T00:00:00.000000000',
       '2022-01-11T00:00:00.000000000', '2022-01-12T00:00:00.000000000',
       '2022-01-13T00:00:00.000000000', '2022-01-14T00:00:00.000000000',
       '2022-01-15T00:00:00.000000000', '2022-01-16T00:00:00.000000000',
       '2022-01-17T00:00:00.000000000', '2022-01-18T00:00:00.000000000',
       '2022-01-19T00:00:00.000000000', '2022-01-20T00:00:00.000000000',
       '2022-01-21T00:00:00.000000000', '2022-01-22T00:00:00.000000000',
       '2022-01-23T00:00:00.000000000', '2022-01-24T00:00:00.000000000',
       '2022-01-25T00:00:00.000000000', '2022-01-26T00:00:00.000000000',
       '2022-01-27T00:00:00.000000000', '2022-01-28T00:00:00.000000000',
       '2022-01-29T00:00:00.000000000', '2022-01-30T00:00:00.000000000',
       '2022-01-31T00:00:00.000000000'], dtype='datetime64[ns]')
vs
array(['2022-01-01T00:00:00.000000000', '2022-01-01T01:00:00.000000000',
       '2022-01-01T02:00:00.000000000', ...,
       '2022-01-10T21:00:00.000000000', '2022-01-10T22:00:00.000000000',
       '2022-01-10T23:00:00.000000000'], dtype='datetime64[ns]')
```

Instead, you would need to manually align these objects, e.g., with xarray.align, reindex_like() or interp_like():

```python
predictions, observations = xarray.align(predictions, observations)
# or
observations = observations.reindex_like(predictions)
# or
predictions = predictions.interp_like(observations)
```

To (partially) simulate the effect of this change on a codebase today, you could write xarray.set_options(arithmetic_join='exact') -- but presumably it would also make sense to change Xarray's other alignment code (e.g., in concat and merge).

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should Xarray stop doing automatic index-based alignment? 1376109308
1249601076 https://github.com/pydata/xarray/issues/7045#issuecomment-1249601076 https://api.github.com/repos/pydata/xarray/issues/7045 IC_kwDOAMm_X85Ke2Y0 shoyer 1217238 2022-09-16T17:16:52Z 2022-09-16T17:18:38Z MEMBER

IMO we could first align (hah) these choices to be the same:

the exact mode of automatic alignment (outer vs inner vs left join) depends on the specific operation.

The problem is that user expectations are actually rather different for different operations:

  • With data movement operations like xarray.merge, you expect to keep around all existing data -- so you want an outer join.
  • With in-place operations that modify an existing Dataset, e.g., by adding new variables, you don't expect the existing coordinates to change -- so you want a left join.
  • With compute-based operations (like arithmetic), you don't have an expectation that all existing data is unmodified, so keeping around a bunch of NaN values felt very wasteful -- hence the inner join.

What do you think of making the default FloatIndex use a reasonable (hard to define!) rtol for comparisons?

This would definitely be a step forward! However, it's a tricky nut to crack. We would both need a heuristic for defining rtol (some fraction of coordinate spacing?) and a method for deciding what the resulting coordinates should be (use values from the first object?).

Even then, automatic alignment is often problematic, e.g., imagine cases where a coordinate is defined in different units.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should Xarray stop doing automatic index-based alignment? 1376109308
1244918028 https://github.com/pydata/xarray/issues/7002#issuecomment-1244918028 https://api.github.com/repos/pydata/xarray/issues/7002 IC_kwDOAMm_X85KM_EM shoyer 1217238 2022-09-13T05:30:12Z 2022-09-13T05:30:12Z MEMBER

I like option (4). If a multi-coordinate index needs to care about order, it can implement that logic itself.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Custom indexes and coordinate (re)ordering 1364388790
1210976795 https://github.com/pydata/xarray/issues/6904#issuecomment-1210976795 https://api.github.com/repos/pydata/xarray/issues/6904 IC_kwDOAMm_X85ILgob shoyer 1217238 2022-08-10T16:43:36Z 2022-08-10T16:43:36Z MEMBER

You might look into different multiprocessing modes: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
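
If you do experiment with start methods, a generic sketch (worker and tasks are placeholders for your own code):

```python
import multiprocessing

def worker(task):
    # open the file *inside* the subprocess, so no HDF5/netCDF state
    # is inherited across a fork
    ...

if __name__ == "__main__":
    # "spawn" starts fresh interpreters instead of forking
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(4) as pool:
        results = pool.map(worker, ["task1", "task2"])
```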

It may also be that the NetCDF or HDF5 libraries were simply not written in a way that can support multi-processing. This would not surprise me.

BTW is there any advantage or difference in terms of cpu and memory consumption in opening the file only one or let it open by every process? I'm asking because I thought opening in every process was just plain stupid but it seems to perform exactly the same, so maybe I'm just creating a problem where there is none

I agree, maybe this isn't worth the trouble. I have not seen it done successfully before.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `sel` behaving randomly when applying to a dataset with multiprocessing 1333650265
1210255676 https://github.com/pydata/xarray/issues/6904#issuecomment-1210255676 https://api.github.com/repos/pydata/xarray/issues/6904 IC_kwDOAMm_X85IIwk8 shoyer 1217238 2022-08-10T07:10:41Z 2022-08-10T07:10:41Z MEMBER

Will that work in the same way if I still use process_map, which uses concurrent.futures under the hood?

Yes it should, as long as you're using multi-processing under the covers.

If you do multi-threading, then you would want to use threading.Lock(). But I believe we already apply a thread lock by default.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `sel` behaving randomly when applying to a dataset with multiprocessing 1333650265
1210233503 https://github.com/pydata/xarray/issues/6904#issuecomment-1210233503 https://api.github.com/repos/pydata/xarray/issues/6904 IC_kwDOAMm_X85IIrKf shoyer 1217238 2022-08-10T06:45:06Z 2022-08-10T06:45:06Z MEMBER

Can you try explicitly passing in a multiprocessing lock into the open_dataset() constructor? Something like:

```python
from multiprocessing import Lock

ds = xarray.open_dataset(file, lock=Lock())
```

(We automatically select appropriate locks if using Dask, but I'm not sure how we would do that more generally...)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `sel` behaving randomly when applying to a dataset with multiprocessing 1333650265
1210190649 https://github.com/pydata/xarray/issues/4285#issuecomment-1210190649 https://api.github.com/repos/pydata/xarray/issues/4285 IC_kwDOAMm_X85IIgs5 shoyer 1217238 2022-08-10T05:48:47Z 2022-08-10T05:48:47Z MEMBER

I am tempted to suggest that the right way to handle Awkward array is to treat "var" dimensions similarly to NumPy's structured dtypes, with shape only handling non-variable dimensions. The uniform dimensions are the only ones for which Xarray's API is going to work properly out of the box, and Awkward array probably already has the right tools for working with ragged dimensions.

Either way, I would definitely encourage figuring out some actual use-cases before building this out :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Awkward array backend? 667864088
1204280307 https://github.com/pydata/xarray/pull/6874#issuecomment-1204280307 https://api.github.com/repos/pydata/xarray/issues/6874 IC_kwDOAMm_X85Hx9vz shoyer 1217238 2022-08-03T17:44:20Z 2022-08-03T17:44:20Z MEMBER

As I understand it, the main purpose here is to remove the Xarray lazy indexing classes.

Maybe call this get_duck_array(), just to be a little more descriptive?

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Avoid calling np.asarray on lazy indexing classes 1327380960
1200314984 https://github.com/pydata/xarray/issues/2304#issuecomment-1200314984 https://api.github.com/repos/pydata/xarray/issues/2304 IC_kwDOAMm_X85Hi1po shoyer 1217238 2022-07-30T23:55:04Z 2022-07-30T23:55:04Z MEMBER

the unpacked data should match the type of these attributes, which must both be of type float or both be of type double. An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int. It is not advised to unpack an int into a float as there is a potential precision loss.

I find this ambiguous. Is float above referring to float16 or float32? Is double referring to float64?

Yes, I'm pretty sure "float" means single precision (np.float32), given that "double" certainly means double precision (np.float64).

If so, then they do recommend float64, as requested by the OP, because the test data is short and the scale_factor is float64 (a.k.a double?)

Yes, I believe so.

The broader discussion here is about CF compliance. I find the spec ambiguous and xarray non-compliant. So many tests rely on the existing behavior, that I am unsure how best to proceed to improve compliance. I worry it may be a major refactor, and possibly break things relying on the existing behavior. I'd like to discuss architecture. Should this be in a new issue, if this closes with PR #6851? Should there be a new keyword for cf_strict or something?

I think we can treat this as a bug fix and just go forward with it. Yes, some people are going to be surprised, but I don't think it's disruptive enough that we need to go to a major effort to preserve backwards compatibility. It should already be straightforward to work around by setting decode_cf=False when opening a file and then explicitly calling xarray.decode_cf().
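
That workaround, spelled out (the file name is a placeholder):

```python
import xarray as xr

# keep the raw packed values, e.g., int16 plus scale_factor attributes
ds_raw = xr.open_dataset("file.nc", decode_cf=False)

# then decode explicitly, after adjusting attributes if needed
ds = xr.decode_cf(ds_raw)
```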

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray  343659822
1199939328 https://github.com/pydata/xarray/issues/6849#issuecomment-1199939328 https://api.github.com/repos/pydata/xarray/issues/6849 IC_kwDOAMm_X85HhZ8A shoyer 1217238 2022-07-29T20:56:05Z 2022-07-29T20:56:05Z MEMBER

I agree, I think only setting a few indexes at a time would be normal. If we eventually need convenience methods for setting multiple indexes we can add those later.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Public API for setting new indexes: add a set_xindex method? 1322198907
1199753281 https://github.com/pydata/xarray/issues/6849#issuecomment-1199753281 https://api.github.com/repos/pydata/xarray/issues/6849 IC_kwDOAMm_X85HgshB shoyer 1217238 2022-07-29T17:00:06Z 2022-07-29T17:00:06Z MEMBER

This sounds great to me!

I don't think we need support for setting multiple indexes at once in a single method call. You can call set_xindex multiple times for that if needed.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Public API for setting new indexes: add a set_xindex method? 1322198907
1198375377 https://github.com/pydata/xarray/issues/6833#issuecomment-1198375377 https://api.github.com/repos/pydata/xarray/issues/6833 IC_kwDOAMm_X85HbcHR shoyer 1217238 2022-07-28T16:29:30Z 2022-07-28T16:29:30Z MEMBER

I just toggled the "Require a pull request before merging" option

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Require a pull request before merging to main 1318800553
1188520871 https://github.com/pydata/xarray/issues/6807#issuecomment-1188520871 https://api.github.com/repos/pydata/xarray/issues/6807 IC_kwDOAMm_X85G12On shoyer 1217238 2022-07-19T02:18:03Z 2022-07-19T02:18:03Z MEMBER

Sounds good to me. The challenge will be defining a parallel computing API that works across all these projects, with their slightly different models.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Alternative parallel execution frameworks in xarray 1308715638
1183458691 https://github.com/pydata/xarray/issues/6505#issuecomment-1183458691 https://api.github.com/repos/pydata/xarray/issues/6505 IC_kwDOAMm_X85GiiWD shoyer 1217238 2022-07-13T16:51:09Z 2022-07-13T16:51:31Z MEMBER

Reopening because my second example print(stacked.assign_coords(z=[1, 2, 3, 4])) is still broken with the same error message. It would be ideal to fix this before the release.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dropping a MultiIndex variable raises an error after explicit indexes refactor 1210267320
1176808719 https://github.com/pydata/xarray/issues/2697#issuecomment-1176808719 https://api.github.com/repos/pydata/xarray/issues/2697 IC_kwDOAMm_X85GJK0P shoyer 1217238 2022-07-06T22:21:48Z 2022-07-06T22:21:48Z MEMBER

Maybe a separate project in xarray-contrib would make sense?

I would be reluctant to add this into Xarray proper if we need a new external dependency for reading XML files.

On Wed, Jul 6, 2022 at 2:37 PM David Huard @.***> wrote:

I've got a first draft that parses an NcML document and spits out an xarray.Dataset. It does not cover all the NcML syntax, but the essential elements are there.

It uses xsdata https://xsdata.readthedocs.io/en/latest/ to parse the XML, using a datamodel automatically generated from the NcML 2-2 schema. I've scrapped test files from the netcdf-java https://github.com/Unidata/netcdf-java repo to create a test suite.

Wondering what's the best place to host the code, tests and test data so others may give it a spin ?


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  read ncml files to create multifile datasets 401874795
1165848854 https://github.com/pydata/xarray/pull/6721#issuecomment-1165848854 https://api.github.com/repos/pydata/xarray/issues/6721 IC_kwDOAMm_X85FfXEW shoyer 1217238 2022-06-24T18:57:42Z 2022-06-24T18:57:42Z MEMBER

The simplest option would probably be a custom Zarr store that raises an error if you try to look at array data. This could be implemented as a subclass of an existing Zarr store (e.g., the in-memory store) that raises an error in __getitem__ if the filename of the requested key does not start with `.`.
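
A minimal sketch of that idea (assuming a Zarr v2-style store, where any MutableMapping works as a store and metadata keys end in filenames like .zarray or .zattrs):

```python
class InaccessibleStore(dict):
    """Zarr store that allows metadata reads but forbids loading chunks."""

    def __getitem__(self, key):
        filename = key.rsplit("/", 1)[-1]
        if not filename.startswith("."):
            raise RuntimeError(f"tried to load array data from {key!r}")
        return super().__getitem__(key)
```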

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fix .chunks loading lazy backed array data 1284071791
1165847538 https://github.com/pydata/xarray/pull/6721#issuecomment-1165847538 https://api.github.com/repos/pydata/xarray/issues/6721 IC_kwDOAMm_X85FfWvy shoyer 1217238 2022-06-24T18:55:51Z 2022-06-24T18:55:51Z MEMBER

We have some tests with InaccessibleVariableDataStore for this sort of thing, but I don't know immediately how to hook that into the Zarr backend.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fix .chunks loading lazy backed array data 1284071791
1163345547 https://github.com/pydata/xarray/issues/6704#issuecomment-1163345547 https://api.github.com/repos/pydata/xarray/issues/6704 IC_kwDOAMm_X85FVz6L shoyer 1217238 2022-06-22T16:31:33Z 2022-06-22T16:31:33Z MEMBER

Dataset.rename does both variables and dimensions. That seems useful in many cases. I think it also makes more sense than Dataset.drop does, given that variables and dimensions often use the same names -- whereas drop mixed up variable names and index values.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Future of `DataArray.rename` 1275752720
1163299397 https://github.com/pydata/xarray/issues/6646#issuecomment-1163299397 https://api.github.com/repos/pydata/xarray/issues/6646 IC_kwDOAMm_X85FVopF shoyer 1217238 2022-06-22T15:57:14Z 2022-06-22T15:57:14Z MEMBER

NumPy mostly uses axis instead of axes, which we could copy.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `dim` vs `dims` 1250939008
1163296444 https://github.com/pydata/xarray/issues/6646#issuecomment-1163296444 https://api.github.com/repos/pydata/xarray/issues/6646 IC_kwDOAMm_X85FVn68 shoyer 1217238 2022-06-22T15:55:13Z 2022-06-22T15:56:35Z MEMBER

It would be helpful to understand if there are also other uses of dim/dims that are inconsistent. Which is the most common pattern?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `dim` vs `dims` 1250939008
1163292851 https://github.com/pydata/xarray/issues/6704#issuecomment-1163292851 https://api.github.com/repos/pydata/xarray/issues/6704 IC_kwDOAMm_X85FVnCz shoyer 1217238 2022-06-22T15:52:12Z 2022-06-22T15:52:12Z MEMBER

Should we call it rename_vars or rename_coords?

The latter might make more sense, but then it wouldn't mirror Dataset.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Future of `DataArray.rename` 1275752720
1150280375 https://github.com/pydata/xarray/issues/644#issuecomment-1150280375 https://api.github.com/repos/pydata/xarray/issues/644 IC_kwDOAMm_X85Ej-K3 shoyer 1217238 2022-06-08T18:56:17Z 2022-06-08T18:56:17Z MEMBER

This might fit more naturally into interp() as a new method like "nearest-valid" rather than in sel().

The difference is that sel() only looks at indexes (and not the data) to select out a single value, whereas interp() can combine adjacent values in arbitrary ways.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature request: only allow nearest-neighbor .sel for valid data (not NaN positions) 114773593
1146873595 https://github.com/pydata/xarray/issues/6524#issuecomment-1146873595 https://api.github.com/repos/pydata/xarray/issues/6524 IC_kwDOAMm_X85EW-b7 shoyer 1217238 2022-06-05T19:54:47Z 2022-06-05T19:54:47Z MEMBER

error: "ndarray[Any, dtype[Any]]" has no attribute "rename"

Yes, it's worth discussing. I don't know if there will be a satisfying resolution, though.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  NumPy `__array_ufunc__` does not work with typing 1217425815
1137839614 https://github.com/pydata/xarray/issues/6633#issuecomment-1137839614 https://api.github.com/repos/pydata/xarray/issues/6633 IC_kwDOAMm_X85D0g3- shoyer 1217238 2022-05-25T20:55:14Z 2022-05-25T20:55:14Z MEMBER

Looking at this mur-sst dataset in particular, it stores time in chunks of size 5. That means fetching the 6443 time values requires 1289 separate HTTP requests -- no wonder it's so slow! If the time axis were instead stored in a single chunk of 51 KB, Xarray would only need 3 small HTTP requests to load the lat, lon and time indexes, which would probably complete in a fraction of a second.

That said, I agree that this would be nice to have in general.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Opening dataset without loading any indexes? 1247010680
1137754031 https://github.com/pydata/xarray/issues/6633#issuecomment-1137754031 https://api.github.com/repos/pydata/xarray/issues/6633 IC_kwDOAMm_X85D0L-v shoyer 1217238 2022-05-25T19:12:40Z 2022-05-25T19:12:40Z MEMBER

but another option (post explicit index refactor) might be an option for opening a dataset without creating indexes for 1D coordinates along dimensions.

It might indeed be worth considering this case too in #6392. Maybe indexes=None (default) to create default indexes for 1D coordinates and indexes={} (empty dictionary) to explicitly skip creating indexes?

+1 this syntax makes sense to me!
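
In code, the proposal reads (hypothetical; an indexes argument to open_dataset does not exist yet):

```python
import xarray as xr

ds = xr.open_dataset("store.zarr")              # indexes=None: build default indexes
ds = xr.open_dataset("store.zarr", indexes={})  # skip creating any indexes at all
```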

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Opening dataset without loading any indexes? 1247010680
1137661171 https://github.com/pydata/xarray/pull/6475#issuecomment-1137661171 https://api.github.com/repos/pydata/xarray/issues/6475 IC_kwDOAMm_X85Dz1Tz shoyer 1217238 2022-05-25T18:10:21Z 2022-05-25T18:10:21Z MEMBER

One issue with relying only on Array and Group as currently implemented in Zarr-Python is that we can create array nodes outside of any group subfolder. e.g. one can currently create an Array directly at path 'array1' and this would put the chunks under 'data/root/array1/', and metadata at 'meta/root/array1.array.json'. However, the root itself is not a Group. A group is basically a subfolder under root (e.g.' open_group with path = group1 creates '/meta/root/group1/' folder and 'meta/root/group1.group.json' metadata). There is no mechanism in the spec to open root directly as a Group!

is there an issue on the Zarr side where this is currently being discussed?

I opened up https://github.com/zarr-developers/zarr-python/issues/1039

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  implement Zarr v3 spec support 1200581329
1137572812 https://github.com/pydata/xarray/issues/6633#issuecomment-1137572812 https://api.github.com/repos/pydata/xarray/issues/6633 IC_kwDOAMm_X85DzfvM shoyer 1217238 2022-05-25T17:10:04Z 2022-05-25T17:10:04Z MEMBER

Early versions of Xarray used to have lazy loading of data for indexes, but we removed this for the sake of simplicity. In principle we could restore lazy indexes, but another option (post explicit index refactor) might be to open a dataset without creating indexes for 1D coordinates along dimensions.

Another way to solve this sort of challenge might be to load index data in parallel when using Dask. Right now I believe the data corresponding to indexes is always loaded eagerly, without using Dask.

All that said -- do you have a specific example where this has been problematic? In my experience it has been pretty reasonable to use xarray.Dataset objects for schema-like templates, even with index data needing to be loaded eagerly. Possibly another Zarr chunking scheme for your index data could be more efficient?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Opening dataset without loading any indexes? 1247010680
1126587818 https://github.com/pydata/xarray/issues/6607#issuecomment-1126587818 https://api.github.com/repos/pydata/xarray/issues/6607 IC_kwDOAMm_X85DJl2q shoyer 1217238 2022-05-14T00:10:13Z 2022-05-14T00:10:13Z MEMBER

We could raise an error asking the user to switch to swap_dims.

This seems like a good idea

In the long term, we'd like to decouple indexes from coordinates, and make something like the following work: ds.set_coords(['lon']).rename(x='lon').set_index('lon')

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Coordinate promotion workaround broken 1235725650
1126255398 https://github.com/pydata/xarray/pull/5734#issuecomment-1126255398 https://api.github.com/repos/pydata/xarray/issues/5734 IC_kwDOAMm_X85DIUsm shoyer 1217238 2022-05-13T16:51:24Z 2022-05-13T16:51:24Z MEMBER

👍 this looks great to me!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Enable `flox` in `GroupBy` and `resample` 978356586
1124302215 https://github.com/pydata/xarray/pull/6566#issuecomment-1124302215 https://api.github.com/repos/pydata/xarray/issues/6566 IC_kwDOAMm_X85DA32H shoyer 1217238 2022-05-11T21:15:36Z 2022-05-11T21:15:36Z MEMBER

For whatever reason, Windows seems to be much stricter about requiring file handles to be explicitly closed. So my guess is that this could be solved by using open_dataset() as a context manager.
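
A minimal sketch of that pattern (the filename is a placeholder):

```python
import xarray as xr

with xr.open_dataset('example.nc') as ds:
    ds.load()  # pull everything needed into memory before the file closes
# The file handle is closed deterministically here, which Windows requires
# before the file can be deleted or overwritten.
```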

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  New inline_array kwarg for open_dataset 1223270563
1116397246 https://github.com/pydata/xarray/issues/6517#issuecomment-1116397246 https://api.github.com/repos/pydata/xarray/issues/6517 IC_kwDOAMm_X85Cit6- shoyer 1217238 2022-05-03T18:09:42Z 2022-05-03T18:09:42Z MEMBER

I'm a little skeptical that it makes sense to add special case logic into Xarray in an attempt to keep NumPy's "OWNDATA" flag up to date. There are lots of places where we create views of data from existing arrays inside Xarray operations.

There are definitely cases where Xarray's internal operations do memory copies followed by views, which would also result in datasets with misleading "OWNDATA" flags if you look only at resulting datasets, e.g., DataArray.interp() which definitely does internal memory copies:

```
>>> y = xarray.DataArray([1, 2, 3], dims='x', coords={'x': [0, 1, 2]})
>>> y.interp(x=0.5).data.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
```

Overall, I just don't think this is a reliable way to trace memory allocation with NumPy. Maybe you could do better by also tracing back to source arrays with .base?
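
For example, a simple helper along those lines (this is a sketch, not an xarray API):

```python
import numpy as np

def owning_array(arr: np.ndarray) -> np.ndarray:
    """Follow .base until reaching the array that actually owns the memory."""
    while isinstance(arr.base, np.ndarray):
        arr = arr.base
    return arr

a = np.arange(10)
view = a[2:5]          # a view: OWNDATA is False
assert owning_array(view) is a
```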

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Loading from NetCDF creates unnecessary numpy.ndarray-views that clears the OWNDATA-flag 1216517115
1114173984 https://github.com/pydata/xarray/issues/1621#issuecomment-1114173984 https://api.github.com/repos/pydata/xarray/issues/1621 IC_kwDOAMm_X85CaPIg shoyer 1217238 2022-05-01T08:49:40Z 2022-05-01T08:49:40Z MEMBER

Still relevant!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Undesired decoding to timedelta64 (was: units of "seconds" translated to time coordinate) 264321376
1111813044 https://github.com/pydata/xarray/issues/6524#issuecomment-1111813044 https://api.github.com/repos/pydata/xarray/issues/6524 IC_kwDOAMm_X85CROu0 shoyer 1217238 2022-04-28T06:52:04Z 2022-04-28T06:52:04Z MEMBER

I think this would need to get updated on the NumPy side. Ideally NumPy ufuncs would be typed to check for __array_ufunc__. Something like:

```python
from typing import Protocol, TypeVar

from numpy import ndarray

class HasArrayUFunc(Protocol):
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): pass

ArrayOrHasArrayUFunc = TypeVar("ArrayOrHasArrayUFunc", ndarray, HasArrayUFunc)

def exp(x: ArrayOrHasArrayUFunc) -> ArrayOrHasArrayUFunc: ...
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  NumPy `__array_ufunc__` does not work with typing 1217425815
863427710 https://github.com/pydata/xarray/issues/2171#issuecomment-863427710 https://api.github.com/repos/pydata/xarray/issues/2171 MDEyOklzc3VlQ29tbWVudDg2MzQyNzcxMA== shoyer 1217238 2021-06-17T17:30:17Z 2022-04-19T03:15:24Z MEMBER

@gagebeni please open a new discussion for your issue: https://github.com/pydata/xarray/discussions

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Support alignment/broadcasting with unlabeled dimensions of size 1 325439138
1100953736 https://github.com/pydata/xarray/issues/4267#issuecomment-1100953736 https://api.github.com/repos/pydata/xarray/issues/4267 IC_kwDOAMm_X85BnziI shoyer 1217238 2022-04-17T21:42:36Z 2022-04-17T21:42:36Z MEMBER

This is still relevant

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  CachingFileManager should not use __del__ 665488672
1099788049 https://github.com/pydata/xarray/pull/6476#issuecomment-1099788049 https://api.github.com/repos/pydata/xarray/issues/6476 IC_kwDOAMm_X85BjW8R shoyer 1217238 2022-04-15T02:14:56Z 2022-04-15T02:14:56Z MEMBER

I will take a look soon!

On Thu, Apr 14, 2022 at 6:23 PM Maximilian Roos @.***> wrote:

Hi @cisaacstern https://github.com/cisaacstern — thanks a lot and welcome to xarray!

This looks very coherent, as far as the context I have. Any thoughts from others who know the area better?


{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
  Fix zarr append dtype checks 1200716594
1099309755 https://github.com/pydata/xarray/pull/6420#issuecomment-1099309755 https://api.github.com/repos/pydata/xarray/issues/6420 IC_kwDOAMm_X85BhiK7 shoyer 1217238 2022-04-14T15:36:14Z 2022-04-14T15:36:14Z MEMBER

Thanks @malmans2 !

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add support in the "zarr" backend for reading NCZarr data 1183534905
1099307673 https://github.com/pydata/xarray/pull/6475#issuecomment-1099307673 https://api.github.com/repos/pydata/xarray/issues/6475 IC_kwDOAMm_X85BhhqZ shoyer 1217238 2022-04-14T15:33:54Z 2022-04-14T15:33:54Z MEMBER

One issue with relying only on Array and Group as currently implemented in Zarr-Python is that we can create array nodes outside of any group subfolder. E.g. one can currently create an Array directly at path 'array1', and this would put the chunks under 'data/root/array1/' and the metadata at 'meta/root/array1.array.json'. However, the root itself is not a Group. A group is basically a subfolder under root (e.g. open_group with path='group1' creates the 'meta/root/group1/' folder and 'meta/root/group1.group.json' metadata). There is no mechanism in the spec to open root directly as a Group!

is there an issue on the Zarr side where this is currently being discussed?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  implement Zarr v3 spec support 1200581329
1098229361 https://github.com/pydata/xarray/pull/6475#issuecomment-1098229361 https://api.github.com/repos/pydata/xarray/issues/6475 IC_kwDOAMm_X85BdaZx shoyer 1217238 2022-04-13T16:04:23Z 2022-04-13T16:04:23Z MEMBER
  • The v3 spec requires a path be specified when calling open_group or open_consolidated. This PR currently just sets a default group name of 'xarray' if one is not specified via the group kwarg to ZarrStore.open_group. I think that is convenient, but one could instead be stricter and raise an error in this case.

Does Zarr v3 have a notion of a "root" group? That feels like a more sensible default to me, both for Xarray and Zarr-Python

  • If a string corresponding to a filesystem path or URL is used for store, then it is not possible to infer which version of the zarr spec is desired. In this case, the user must specify zarr_version to choose the zarr protocol version. The default of zarr_version=None will infer the version from a zarr BaseStore subclass when possible, otherwise defaulting to zarr_version=2 for backwards compatibility.

This sounds fine for now, but I am concerned that it will slow the adoption of Zarr v3. Eventually, we would presumably want to change the default to version 3, but this is difficult to do if it entirely breaks backwards compatibility.

My preference would be for the default behavior to try opening Zarr v2, and fall back to opening in v3 mode, even if this requires attempting to open a file from the store. This is similar to how Xarray handles other Zarr versioning issues (e.g., for consolidated metadata). Perhaps Zarr-Python could raise an informative error that we could catch if the Zarr version is incorrect, or even handle this behavior itself?
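
A hedged sketch of that fallback; zarr.open_group is a real zarr-python call, but the zarr_version keyword and the exact exception to catch are assumptions based on zarr-python's experimental v3 support:

```python
import zarr

def open_group_any_version(store):
    # Try the default (v2) layout first, then fall back to v3. The broad
    # exception handling is an assumption, not a confirmed zarr-python contract.
    try:
        return zarr.open_group(store, mode='r')
    except Exception:
        return zarr.open_group(store, mode='r', zarr_version=3)
```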

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  implement Zarr v3 spec support 1200581329
1094123521 https://github.com/pydata/xarray/pull/6420#issuecomment-1094123521 https://api.github.com/repos/pydata/xarray/issues/6420 IC_kwDOAMm_X85BNwAB shoyer 1217238 2022-04-09T21:00:04Z 2022-04-09T21:00:04Z MEMBER

Could you also add brief updates to mention NCZarr support in the docstring for open_zarr and the user guide? In particular this paragraph should be updated:

Xarray can’t open just any zarr dataset, because xarray requires special metadata (attributes) describing the dataset dimensions and coordinates. At this time, xarray can only open zarr datasets that have been written by xarray. For implementation details, see Zarr Encoding Specification.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Add support in the "zarr" backend for reading NCZarr data 1183534905
1090499559 https://github.com/pydata/xarray/issues/6374#issuecomment-1090499559 https://api.github.com/repos/pydata/xarray/issues/6374 IC_kwDOAMm_X85A_7Pn shoyer 1217238 2022-04-06T17:04:26Z 2022-04-06T17:04:26Z MEMBER

As it is currently it is also not possible to write a zarr which follows the GDAL ZARR driver conventions. Writing the _CRS attribute also results in a TypeError:

Can you elaborate? What API are you using to do the write: python, netcdf-c, or what?

This error message comes from Xarray and can be triggered by calling to_zarr(): https://github.com/pydata/xarray/blob/facafac359c39c3e940391a3829869b4a3df5d70/xarray/backends/api.py#L162

I don't think netCDF-C needs to be involved at all, which is why I suggested opening a separate issue.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should the zarr backend support NCZarr conventions? 1172229856
1090464275 https://github.com/pydata/xarray/issues/6374#issuecomment-1090464275 https://api.github.com/repos/pydata/xarray/issues/6374 IC_kwDOAMm_X85A_yoT shoyer 1217238 2022-04-06T16:25:40Z 2022-04-06T16:25:40Z MEMBER

@wankoelias could you kindly open a new issue for writing GDAL ZARR?

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should the zarr backend support NCZarr conventions? 1172229856
1078695337 https://github.com/pydata/xarray/issues/2233#issuecomment-1078695337 https://api.github.com/repos/pydata/xarray/issues/2233 IC_kwDOAMm_X85AS5Wp shoyer 1217238 2022-03-25T06:20:10Z 2022-03-25T06:20:10Z MEMBER

This is the second follow-up item in https://github.com/pydata/xarray/issues/6293

I think we could definitely experiment with relaxing this constraint now, although ideally we would continue to check off auditing all of the methods in that long list first.

{
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Problem opening unstructured grid ocean forecasts with 4D vertical coordinates 332471780
1077253534 https://github.com/pydata/xarray/issues/6408#issuecomment-1077253534 https://api.github.com/repos/pydata/xarray/issues/6408 IC_kwDOAMm_X85ANZWe shoyer 1217238 2022-03-24T05:53:56Z 2022-03-24T05:53:56Z MEMBER

I think this is probably fine without a deprecation cycle. This is a very easy fix for users.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  backwards incompatible changes in reductions 1178949620
1076796582 https://github.com/pydata/xarray/issues/6374#issuecomment-1076796582 https://api.github.com/repos/pydata/xarray/issues/6374 IC_kwDOAMm_X85ALpym shoyer 1217238 2022-03-23T20:38:12Z 2022-03-23T20:38:12Z MEMBER

@DennisHeimbigner I think it would be great to standardize NCZarr as a super-set of the "Xarray-Zarr" standard! I think Xarray should indeed be able to read such files. If you want to read a sub-group, you can read the sub-group in a separate call to xarray.open_zarr().

@rabernat I would not be opposed to adding support inside Xarray for reading NCZarr data, specifically to understand NCZarr's encoding of dimension names when using Zarr-Python. This wouldn't give 100% compatibility with NCZarr, but it would be very close (maybe just with incorrect dtypes for attributes) with a minimal amount of work. I don't think it would be a big deal to look for .nczvar files.

{
    "total_count": 3,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 2,
    "rocket": 0,
    "eyes": 0
}
  Should the zarr backend support NCZarr conventions? 1172229856
1071104882 https://github.com/pydata/xarray/pull/5692#issuecomment-1071104882 https://api.github.com/repos/pydata/xarray/issues/5692 IC_kwDOAMm_X84_18Ny shoyer 1217238 2022-03-17T17:12:07Z 2022-03-17T17:12:07Z MEMBER

OK, in it goes! Big thanks to @benbovy for seeing this through :)

{
    "total_count": 24,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 13,
    "confused": 0,
    "heart": 1,
    "rocket": 10,
    "eyes": 0
}
  Explicit indexes 966983801
1069344000 https://github.com/pydata/xarray/pull/5692#issuecomment-1069344000 https://api.github.com/repos/pydata/xarray/issues/5692 IC_kwDOAMm_X84_vOUA shoyer 1217238 2022-03-16T16:47:45Z 2022-03-16T16:47:45Z MEMBER

OK, I think we’re good to go here?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes 966983801
1065381346 https://github.com/pydata/xarray/issues/6345#issuecomment-1065381346 https://api.github.com/repos/pydata/xarray/issues/6345 IC_kwDOAMm_X84_gG3i shoyer 1217238 2022-03-11T18:38:42Z 2022-03-11T18:38:42Z MEMBER

The data type restriction here seems to date back to the original PR adding support for appending. I turned up this comment that seems to summarize the motivation for this check: https://github.com/pydata/xarray/pull/2706#issuecomment-502481584

I think the original issue was that appending a fixed-width string could be a problem if the fixed-width does not match the width of the existing string dtype stored in Zarr.

This obviously doesn't apply in this case, because you are adding an entirely new variable. So I guess the check could be removed in that case.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `to_zarr` raises `ValueError: Invalid dtype` with `mode='a'` (but not with `mode='w'`) 1164454058
1062211273 https://github.com/pydata/xarray/issues/1613#issuecomment-1062211273 https://api.github.com/repos/pydata/xarray/issues/1613 IC_kwDOAMm_X84_UA7J shoyer 1217238 2022-03-08T21:09:05Z 2022-03-08T21:09:05Z MEMBER

Another challenge with changing the meaning of slice is handling partial slices, e.g., what does slice(500, None) mean? With a monotonic decreasing index, that would select values below 500, but ignoring underlying coordinate order it would presumably mean selecting values above 500.

I think the separate new API (e.g., xarray.Between or .sel_between()) is probably a better idea.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should sel with slice objects care about underlying coordinate order? 263403430
1059662347 https://github.com/pydata/xarray/pull/5692#issuecomment-1059662347 https://api.github.com/repos/pydata/xarray/issues/5692 IC_kwDOAMm_X84_KSoL shoyer 1217238 2022-03-05T03:05:36Z 2022-03-05T03:05:36Z MEMBER

I would like to merge this PR very soon so it can get testing before the next release. If anyone has any remaining concerns, please speak up!

{
    "total_count": 5,
    "+1": 5,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes 966983801
1059546596 https://github.com/pydata/xarray/issues/1460#issuecomment-1059546596 https://api.github.com/repos/pydata/xarray/issues/1460 IC_kwDOAMm_X84_J2Xk shoyer 1217238 2022-03-04T21:31:41Z 2022-03-04T21:31:41Z MEMBER

Well, even if we keep squeeze as an option, I think squeeze=False would be much more consistent default behavior :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  groupby should still squeeze for non-monotonic inputs 237008177
1058366320 https://github.com/pydata/xarray/issues/1613#issuecomment-1058366320 https://api.github.com/repos/pydata/xarray/issues/1613 IC_kwDOAMm_X84_FWNw shoyer 1217238 2022-03-03T18:39:59Z 2022-03-03T18:39:59Z MEMBER

One complication with using sel() with slice objects is that you can do selection over non-monotonic indexes, merely based on matching bounds:

```
>>> data = xarray.DataArray([1, 2, 3, 4, 5], dims=['x'], coords=[[5, 1, 4, 3, 2]])
>>> data
<xarray.DataArray (x: 5)>
array([1, 2, 3, 4, 5])
Coordinates:
  * x        (x) int64 5 1 4 3 2
>>> data.sel(x=slice(1, 3))
<xarray.DataArray (x: 3)>
array([2, 3, 4])
Coordinates:
  * x        (x) int64 1 4 3
```

If we change the semantics of slice in sel() to do filtering rather than be concerned about order (which does seem much less useful), we should probably deprecate the handling of non-monotonic ascending or descending indexes.

Alternatively, we could either add a dedicated indexing object like xarray.Between(lower, upper) or add a dedicated method for selecting between values, e.g., perhaps data.sel_between(x=(1, 3)) or data.sel_bounds(x=(1, 3)).
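
A rough sketch of what such a method could do, implemented here with a boolean mask (the name sel_between mirrors the hypothetical API above):

```python
import xarray as xr

def sel_between(obj, dim, lower, upper):
    # Select values whose coordinate falls in [lower, upper], regardless of
    # whether the index is sorted ascending, descending, or not at all.
    coord = obj[dim]
    return obj.where((coord >= lower) & (coord <= upper), drop=True)

data = xr.DataArray([1, 2, 3, 4, 5], dims=['x'], coords=[[5, 1, 4, 3, 2]])
sel_between(data, 'x', 1, 3)  # keeps the values at x = 1, 3 and 2
```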

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
  Should sel with slice objects care about underlying coordinate order? 263403430
1058293194 https://github.com/pydata/xarray/issues/1613#issuecomment-1058293194 https://api.github.com/repos/pydata/xarray/issues/1613 IC_kwDOAMm_X84_FEXK shoyer 1217238 2022-03-03T17:23:09Z 2022-03-03T17:23:09Z MEMBER

This is probably worth fixing if possible in a straightforward way. I don't think anyone is well served by matching the behavior of Python list indexing here -- it's a strange edge case that indexing a list like x[5:0] returns an empty list.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Should sel with slice objects care about underlying coordinate order? 263403430
1057657161 https://github.com/pydata/xarray/issues/6176#issuecomment-1057657161 https://api.github.com/repos/pydata/xarray/issues/6176 IC_kwDOAMm_X84_CpFJ shoyer 1217238 2022-03-03T04:32:10Z 2022-03-03T04:32:10Z MEMBER

Breaking changes will continue to be very rare, and whenever possible will be preceded by deprecation or future warnings for multiple months.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Xarray versioning to switch to CalVer 1108564253
1051297372 https://github.com/pydata/xarray/issues/6304#issuecomment-1051297372 https://api.github.com/repos/pydata/xarray/issues/6304 IC_kwDOAMm_X84-qYZc shoyer 1217238 2022-02-25T21:50:15Z 2022-02-25T21:50:15Z MEMBER

Adding a join argument sounds good to me. I do not remember why the default is an outer join.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  add join argument to xr.broadcast? 1150251120
1042660100 https://github.com/pydata/xarray/issues/4118#issuecomment-1042660100 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X84-JbsE shoyer 1217238 2022-02-17T07:45:24Z 2022-02-17T07:45:24Z MEMBER

One thing that came up in our discussion about this in the developer meeting today is that we could also pretty easily expose a "low level" API for IO using dictionaries of xarray.Variable objects. This intermediate representation could be useful for cleaning up data into a form suitable for conversion into Dataset objects.

On Wed, Feb 16, 2022 at 11:39 PM Alessandro Amici @.***> wrote:

@TomNicholas https://github.com/TomNicholas (cc @mraspaud https://github.com/mraspaud)

Do you have use cases which one of these designs could handle but the other couldn't?

The two main classes of on-disk formats that, I know of, which cannot be always represented in the "group is a Dataset" approach are:

  • in netCDF following the CF conventions for groups https://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#groups, it is legal for an array to refer to a dimension or a coordinate in a different group and so arrays in the same group may have dimensions with the same name, but different size / coordinate values,
  • the current spec for the Next-generation file formats (NGFF) https://ngff.openmicroscopy.org for bio-imaging has all scales of the same 5D data in the same group.

I don't have an example at hand, but my impression is that satellite products that use HDF5 file format also place arrays with inconsistent dimensions / coordinates in the same group.


{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
1035611864 https://github.com/pydata/xarray/issues/2186#issuecomment-1035611864 https://api.github.com/repos/pydata/xarray/issues/2186 IC_kwDOAMm_X849ui7Y shoyer 1217238 2022-02-10T22:49:40Z 2022-02-10T22:50:01Z MEMBER

For what it's worth, the recommended way to do this is to explicitly close the Dataset with ds.close() rather than using del ds.

Or with a context manager, e.g.:

```python
for num in range(100):
    with xr.open_dataset('data.{}.nc'.format(num)) as ds:
        # do some stuff, but NOT assigning any data in ds to new variables
        ...
```

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory leak while looping through a Dataset 326533369
1034196986 https://github.com/pydata/xarray/issues/6069#issuecomment-1034196986 https://api.github.com/repos/pydata/xarray/issues/6069 IC_kwDOAMm_X849pJf6 shoyer 1217238 2022-02-09T21:12:31Z 2022-02-09T21:12:31Z MEMBER

The reason why this isn't allowed is because it's ambiguous what to do with the other variables that are not restricted to the region (['cell', 'face', 'layer', 'max_cell_node', 'max_face_nodes', 'node', 'siglay'] in this case).

I can imagine quite a few different ways this behavior could be implemented:

  1. Ignore these variables entirely.
  2. Ignore variables if they also already exist, but write new ones.
  3. Write or overwrite both new and existing these variables.
  4. Write new variables. Ignore existing variables only if they already exist with the same values, and if not, raise an error.

I believe your proposal here (removing these checks from _validate_region) would achieve (3), but I'm not sure that's the best option.

(4) seems like perhaps the most user-friendly option, but checking existing variables can add significant overhead. When experimenting with adding region support to Xarray-Beam, I found many cases where it was easy to inadvertently make large parallel pipelines much slower by downloading existing variables.

The current solution is not to do any of these, and to force the user to make an explicit choice by dropping new variables, or writing them in a separate call to to_zarr. I think it would also be OK to let a user explicitly opt-in to one of these behaviors, but I don't think guessing what the user wants would be ideal.
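
For reference, a minimal sketch of that explicit choice (variable names are taken from the example above; the store path is hypothetical, and the store is assumed to already be initialized, e.g. via to_zarr(..., compute=False)):

```python
# Drop the variables that do not fall within the region, then write it:
region_ds = ds.isel(time=slice(0, 100)).drop_vars(['cell', 'face', 'layer'])
region_ds.to_zarr('store.zarr', region={'time': slice(0, 100)})
```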

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_zarr: region not recognised as dataset dimensions 1077079208
1032051447 https://github.com/pydata/xarray/issues/6230#issuecomment-1032051447 https://api.github.com/repos/pydata/xarray/issues/6230 IC_kwDOAMm_X849g9r3 shoyer 1217238 2022-02-07T23:40:48Z 2022-02-07T23:40:48Z MEMBER

In the long term (cc @benbovy) I think we would ideally split IndexVariable into two classes:

  1. FrozenVariable which is just an immutable Variable, and thus that can be safely used for coordinates that have indexes.
  2. PandasIndexArray which wraps pandas.Index objects to satisfy the np.ndarray interface. This is the object which could allow duck_array_ops.isin to use the pandas.Index.isin method.
{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [PERFORMANCE]: `isin` on `CFTimeIndex`-backed `Coordinate` slow  1120583442
1031811347 https://github.com/pydata/xarray/issues/6230#issuecomment-1031811347 https://api.github.com/repos/pydata/xarray/issues/6230 IC_kwDOAMm_X849gDET shoyer 1217238 2022-02-07T19:01:54Z 2022-02-07T19:01:54Z MEMBER

Oh, I guess the challenge is that apply_ufunc operates on arrays, not indexes. I'm not entirely sure how to deal with this easily....

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [PERFORMANCE]: `isin` on `CFTimeIndex`-backed `Coordinate` slow  1120583442
1031810590 https://github.com/pydata/xarray/issues/6230#issuecomment-1031810590 https://api.github.com/repos/pydata/xarray/issues/6230 IC_kwDOAMm_X849gC4e shoyer 1217238 2022-02-07T19:01:08Z 2022-02-07T19:01:08Z MEMBER

Yes, I think replacing this with something like lambda x, y: x.isin(y) if isinstance(x, pd.Index) else np.isin(x, y) could work
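
Spelled out, that replacement might look like the following sketch (not the final patch):

```python
import numpy as np
import pandas as pd

def isin(element, test_elements):
    # Dispatch to the much faster pandas.Index.isin when handed an index
    # object; otherwise fall back to the generic NumPy implementation.
    if isinstance(element, pd.Index):
        return element.isin(test_elements)
    return np.isin(element, test_elements)
```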

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [PERFORMANCE]: `isin` on `CFTimeIndex`-backed `Coordinate` slow  1120583442
1028136906 https://github.com/pydata/xarray/issues/6174#issuecomment-1028136906 https://api.github.com/repos/pydata/xarray/issues/6174 IC_kwDOAMm_X849SB_K shoyer 1217238 2022-02-02T16:46:24Z 2022-02-02T17:20:50Z MEMBER

Have you seen xarray.save_mfdataset?

In principle, it was designed for exactly this sort of thing.
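
A minimal sketch of how it can be used (this assumes an existing Dataset ds with a time coordinate; paths are illustrative):

```python
import xarray as xr

# Split one dataset into per-year pieces and write them all in a single call:
years, datasets = zip(*ds.groupby('time.year'))
paths = ['out_%s.nc' % year for year in years]
xr.save_mfdataset(datasets, paths)
```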

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation 1108138101
1020635094 https://github.com/pydata/xarray/pull/6187#issuecomment-1020635094 https://api.github.com/repos/pydata/xarray/issues/6187 IC_kwDOAMm_X8481afW shoyer 1217238 2022-01-24T23:01:14Z 2022-01-24T23:01:14Z MEMBER

Let me ponder the linked issue. This was not an intentional feature for compute=False, so I'd like to be sure we can be committed to supporting it before we document it :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  to_netcdf: docstrings for compute parameter 1112365912
1011450955 https://github.com/pydata/xarray/issues/6084#issuecomment-1011450955 https://api.github.com/repos/pydata/xarray/issues/6084 IC_kwDOAMm_X848SYRL shoyer 1217238 2022-01-12T21:05:59Z 2022-01-12T21:05:59Z MEMBER

E.g., I think skipping this line would save some of the users in my original post a lot of time.

I don't think that line adds any measurable overhead. It's just telling dask to delay computation of a single function.

For sure this would be worth elaborating on in the Xarray docs! I wrote a little bit about this in the docs for Xarray-Beam: see "One recommended pattern" in https://xarray-beam.readthedocs.io/en/latest/read-write.html#writing-data-to-zarr
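
A condensed sketch of that recommended pattern (paths are placeholders, and ds is assumed to be a dask-backed Dataset):

```python
import dask

# Writing with compute=False initializes the store's metadata eagerly and
# returns a dask.delayed object representing the deferred array writes:
delayed = ds.to_zarr('out.zarr', compute=False)
dask.compute(delayed)  # or hand the delayed object to a distributed scheduler
```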

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Initialise zarr metadata without computing dask graph 1083621690

Next page

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);