home / github

Menu
  • GraphQL API
  • Search all tables

issue_comments

Table actions
  • GraphQL API for issue_comments

17 rows where issue = 1368740629 and user = 35968931 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

Suggested facets: reactions, created_at (date), updated_at (date)

user 1

  • TomNicholas · 17 ✖

issue 1

  • Generalize handling of chunked array types · 17 ✖

author_association 1

  • MEMBER 17
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1553395594 https://github.com/pydata/xarray/pull/7019#issuecomment-1553395594 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85clu-K TomNicholas 35968931 2023-05-18T17:37:22Z 2023-05-18T17:37:22Z MEMBER

Woooo thanks @dcherian !

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1550564976 https://github.com/pydata/xarray/pull/7019#issuecomment-1550564976 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85ca75w TomNicholas 35968931 2023-05-17T01:39:08Z 2023-05-17T01:39:08Z MEMBER

@Illviljan thanks for all your comments!

Would you (or @keewis?) be willing to approve this PR now? I would really like to merge this so that I can release a version of xarray that I can use as a dependency for cubed-xarray.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1534087275 https://github.com/pydata/xarray/pull/7019#issuecomment-1534087275 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85bcFBr TomNicholas 35968931 2023-05-04T04:41:22Z 2023-05-04T04:41:22Z MEMBER

(Okay now the failures are from https://github.com/pydata/xarray/pull/7815 which I've separated out, and from https://github.com/pydata/xarray/pull/7561 being recently merged into main which is definitely not my fault :sweat_smile: https://github.com/pydata/xarray/pull/7019/commits/316c63d55f4e2c317b028842f752a40596f16c6d shows that this PR passes the tests by itself.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1531992793 https://github.com/pydata/xarray/pull/7019#issuecomment-1531992793 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85bUFrZ TomNicholas 35968931 2023-05-02T18:58:23Z 2023-05-02T19:01:20Z MEMBER

I would like to merge this now please! It works, it passes the tests, including mypy.

The main feature not in this PR is using parallel=True with open_mfdataset, which is still coupled to dask.delayed - I made #7811 to track that so I could get this PR merged.

If we merge this I can start properly testing cubed with xarray (in cubed-xarray).

@shoyer @dcherian if one of you could merge this or otherwise tell me anything else you think is still required!

{
    "total_count": 4,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 2,
    "rocket": 2,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1499791533 https://github.com/pydata/xarray/pull/7019#issuecomment-1499791533 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85ZZQCt TomNicholas 35968931 2023-04-07T00:32:47Z 2023-04-07T00:59:03Z MEMBER

I'm having problems with ensuring the behaviour of the chunks='auto' option is consistent between .chunk and open_dataset

Update on this rabbit hole: This commit to dask changed the behaviour of dask's auto-chunking logic, such that if I run my little test script test_old_get_chunk.py on dask releases before and after that commit I get different chunking patterns:

```python from xarray.core.variable import IndexVariable from dask.array.core import normalize_chunks # import the import itertools from numbers import Number import dask import dask.array as da import xarray as xr import numpy as np # This function is copied from xarray, but calls dask.array.core.normalize_chunks # It is used in open_dataset, but not in Dataset.chunk def _get_chunk(var, chunks): """ Return map from each dim to chunk sizes, accounting for backend's preferred chunks. """ if isinstance(var, IndexVariable): return {} dims = var.dims shape = var.shape # Determine the explicit requested chunks. preferred_chunks = var.encoding.get("preferred_chunks", {}) preferred_chunk_shape = tuple( preferred_chunks.get(dim, size) for dim, size in zip(dims, shape) ) if isinstance(chunks, Number) or (chunks == "auto"): chunks = dict.fromkeys(dims, chunks) chunk_shape = tuple( chunks.get(dim, None) or preferred_chunk_sizes for dim, preferred_chunk_sizes in zip(dims, preferred_chunk_shape) ) chunk_shape = normalize_chunks( chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape ) # Warn where requested chunks break preferred chunks, provided that the variable # contains data. if var.size: for dim, size, chunk_sizes in zip(dims, shape, chunk_shape): try: preferred_chunk_sizes = preferred_chunks[dim] except KeyError: continue # Determine the stop indices of the preferred chunks, but omit the last stop # (equal to the dim size). In particular, assume that when a sequence # expresses the preferred chunks, the sequence sums to the size. preferred_stops = ( range(preferred_chunk_sizes, size, preferred_chunk_sizes) if isinstance(preferred_chunk_sizes, Number) else itertools.accumulate(preferred_chunk_sizes[:-1]) ) # Gather any stop indices of the specified chunks that are not a stop index # of a preferred chunk. Again, omit the last stop, assuming that it equals # the dim size. breaks = set(itertools.accumulate(chunk_sizes[:-1])).difference( preferred_stops ) if breaks: warnings.warn( "The specified Dask chunks separate the stored chunks along " f'dimension "{dim}" starting at index {min(breaks)}. This could ' "degrade performance. Instead, consider rechunking after loading." ) return dict(zip(dims, chunk_shape)) chunks = 'auto' encoded_chunks = 100 dask_arr = da.from_array( np.ones((500, 500), dtype="float64"), chunks=encoded_chunks ) var = xr.core.variable.Variable(data=dask_arr, dims=['x', 'y']) with dask.config.set({"array.chunk-size": "1MiB"}): chunks_suggested = _get_chunk(var, chunks) print(chunks_suggested) ```

python (cubed) tom@tom-XPS-9315:~/Documents/Work/Code/dask$ git checkout 2022.9.2 Previous HEAD position was 7fe622b44 Add docs on running Dask in a standalone Python script (#9513) HEAD is now at 3ef47422b bump version to 2022.9.2 (cubed) tom@tom-XPS-9315:~/Documents/Work/Code/dask$ python ../experimentation/bugs/auto_chunking/test_old_get_chunk.py {'x': (362, 138), 'y': (362, 138)} (cubed) tom@tom-XPS-9315:~/Documents/Work/Code/dask$ git checkout 2022.9.1 Previous HEAD position was 3ef47422b bump version to 2022.9.2 HEAD is now at b944abf68 bump version to 2022.9.1 (cubed) tom@tom-XPS-9315:~/Documents/Work/Code/dask$ python ../experimentation/bugs/auto_chunking/test_old_get_chunk.py {'x': (250, 250), 'y': (250, 250)} (I was absolutely tearing my hair out trying to find this bug, because after the change normalize_chunks became a pure function, but before the change it actually wasn't, so I was trying calling normalize_chunks with the exact same set of input arguments and was still not able to reproduce the bug :angry: )

Anyway what this means is as this PR vendors dask.array.core.normalize_chunks, but the behaviour of dask.array.core.normalize_chunks changed between the version in CI job min-all-deps and the other CI jobs, the single vendored function cannot possibly match both behaviours.

I think one simple way to fix this failure without should be to upgrade the minimum version of dask to >=2022.9.2 (from 2022.1.1 where it currently is).

EDIT: I tried changing the minimum version of dask-core in min-all-deps.yml but the conda solve failed. But also would updating to 2022.9.2 now violate xarray's minimum dependency versions policy?

EDIT2: Another way to fix this should be to un-vendor dask.array.core.normalize_chunks within xarray. We could still achieve the goal of running cubed without dask by making normalize_chunks the responsibility of the chunkmanager instead, as cubed's vendored version of that function is not subject to xarray's minimum dependencies requirement.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1499432372 https://github.com/pydata/xarray/pull/7019#issuecomment-1499432372 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85ZX4W0 TomNicholas 35968931 2023-04-06T18:03:48Z 2023-04-06T18:07:24Z MEMBER

I'm having problems with ensuring the behaviour of the chunks='auto' option is consistent between .chunk and open_dataset. These problems appeared since vendoring dask.array.core.normalize_chunks. Right now the only failing tests use chunks='auto' (e.g xarray/tests/test_backends.py::test_chunking_consintency[auto] - yes there's a typo in that test's name), and they fail because xarray decides on different sizes for the automatically-chosen chunks.

What's weird is that all tests pass for me locally but these failures occur on just some of the CI jobs (and which CI jobs is not even consistent apparently???). I have no idea why this would behave differently on only some of the CI jobs, especially after double-checking that array-chunk-size is being correctly determined from the dask config variable within normalize_chunks.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1483155153 https://github.com/pydata/xarray/pull/7019#issuecomment-1483155153 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85YZybR TomNicholas 35968931 2023-03-24T17:19:44Z 2023-03-24T17:21:32Z MEMBER

I've made a bare-bones cubed-xarray package to store the CubedManager class, as well as any integration tests (yet to be written). @tomwhite you should have an invitation to be an owner of that repo. It uses the entrypoint exposed in this PR to hook in, and seems to work for me locally :grin:

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 1,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1481626515 https://github.com/pydata/xarray/pull/7019#issuecomment-1481626515 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85YT9OT TomNicholas 35968931 2023-03-23T17:47:35Z 2023-03-23T17:47:51Z MEMBER

Thanks for the review @dcherian! I agree with basically everything you wrote.

The main difficulty I have at this point is non-reproducible failures as described here

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1481504146 https://github.com/pydata/xarray/pull/7019#issuecomment-1481504146 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85YTfWS TomNicholas 35968931 2023-03-23T16:26:53Z 2023-03-23T17:36:41Z MEMBER

I would like to get to the point where you can use xarray with a chunked array without ever importing dask. I think this PR gets very close, but that would be tricky to test because cubed depends on dask (so I can't just run the test suite without dask in the environment

I just released Cubed 0.6.0 which doesn't have a dependency on Dask, so this should be possible now.

Actually testing cubed with xarray in an environment without dask is currently blocked by rechunker's explicitly dependency on dask, see https://github.com/pangeo-data/rechunker/issues/139

EDIT: We can hack around this by pip installing cubed, then pip uninstalling dask as mentioned here

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1478328995 https://github.com/pydata/xarray/pull/7019#issuecomment-1478328995 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85YHYKj TomNicholas 35968931 2023-03-21T17:40:36Z 2023-03-21T17:40:36Z MEMBER

Does this mean my comment https://github.com/pydata/xarray/pull/7019#discussion_r970713341 is valid again?

Yes I think it does @headtr1ck - thanks for the reminder about that.

I now want to finish this PR by exposing the "chunk manager" interface as a new entrypoint, copying the pattern used for xarray's backends. That would allow me to move the cubed-specific CubedManager code into a separate repository, have the choice of chunkmanager default to whatever is installed, but ask explicitly what to do if multiple chunkmanagers are installed. That should address your comment @headtr1ck.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1472766481 https://github.com/pydata/xarray/pull/7019#issuecomment-1472766481 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85XyKIR TomNicholas 35968931 2023-03-16T21:26:36Z 2023-03-16T21:26:36Z MEMBER

Thanks @dcherian ! Once I copied that explicit indexer business I was able to get serialization to and from zarr working with cubed!

```python In [1]: import xarray as xr

In [2]: from cubed import Spec

In [3]: ds = xr.open_dataset( ...: 'airtemps.zarr', ...: chunks={}, ...: from_array_kwargs={ ...: 'manager': 'cubed', ...: 'spec': Spec(work_dir="tmp", max_mem=20e6), ...: } ...: ) /home/tom/Documents/Work/Code/xarray/xarray/backends/plugins.py:139: RuntimeWarning: 'netcdf4' fails while guessing warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning) /home/tom/Documents/Work/Code/xarray/xarray/backends/plugins.py:139: RuntimeWarning: 'scipy' fails while guessing warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)

In [4]: ds['air'] Out[4]: <xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> cubed.Array<array-004, shape=(2920, 25, 53), dtype=float32, chunks=((730, 730, 730, 730), (13, 12), (27, 26))> Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Attributes: GRIB_id: 11 ...

In [5]: ds.isel(time=slice(100, 300)).to_zarr("cubed_subset.zarr") /home/tom/Documents/Work/Code/xarray/xarray/core/dataset.py:2118: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs return to_zarr( # type: ignore Out[5]: <xarray.backends.zarr.ZarrStore at 0x7f34953033c0> ```

{
    "total_count": 8,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 6,
    "rocket": 0,
    "eyes": 2
}
  Generalize handling of chunked array types 1368740629
1469050607 https://github.com/pydata/xarray/pull/7019#issuecomment-1469050607 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85Xj-7v TomNicholas 35968931 2023-03-15T00:30:08Z 2023-03-15T10:03:11Z MEMBER

I tried opening a zarr store into xarray with chunking via cubed, but I got an error inside the indexing adapter classes. Somehow the type is completely wrong - would be good to type hint this part of the code, because this happens despite mypy passing now.

```python

create example zarr store

orig = xr.tutorial.open_dataset("air_temperature") orig.to_zarr('air2.zarr')

open it as a cubed array

ds = xr.open_dataset('air2.zarr', engine='zarr', chunks={}, from_array_kwargs={'manager': 'cubed'})

fails at this point

ds.load() ```

```python --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) File ~/miniconda3/envs/cubed/lib/python3.9/site-packages/tenacity/__init__.py:382, in Retrying.__call__(self, fn, *args, **kwargs) 381 try: --> 382 result = fn(*args, **kwargs) 383 except BaseException: # noqa: B902 File ~/miniconda3/envs/cubed/lib/python3.9/site-packages/cubed/runtime/executors/python.py:10, in exec_stage_func(func, *args, **kwargs) 8 @retry(stop=stop_after_attempt(3)) 9 def exec_stage_func(func, *args, **kwargs): ---> 10 return func(*args, **kwargs) File ~/miniconda3/envs/cubed/lib/python3.9/site-packages/cubed/primitive/blockwise.py:66, in apply_blockwise(out_key, config) 64 args.append(arg) ---> 66 result = config.function(*args) 67 if isinstance(result, dict): # structured array with named fields File ~/miniconda3/envs/cubed/lib/python3.9/site-packages/cubed/core/ops.py:439, in map_blocks.<locals>.func_with_block_id.<locals>.wrap(*a, **kw) 438 block_id = offset_to_block_id(a[-1].item()) --> 439 return func(*a[:-1], block_id=block_id, **kw) File ~/miniconda3/envs/cubed/lib/python3.9/site-packages/cubed/core/ops.py:572, in map_direct.<locals>.new_func.<locals>.wrap(block_id, *a, **kw) 571 args = a + arrays --> 572 return func(*args, block_id=block_id, **kw) File ~/miniconda3/envs/cubed/lib/python3.9/site-packages/cubed/core/ops.py:76, in _from_array(e, x, outchunks, asarray, block_id) 75 def _from_array(e, x, outchunks=None, asarray=None, block_id=None): ---> 76 out = x[get_item(outchunks, block_id)] 77 if asarray: File ~/Documents/Work/Code/xarray/xarray/core/indexing.py:627, in CopyOnWriteArray.__getitem__(self, key) 626 def __getitem__(self, key): --> 627 return type(self)(_wrap_numpy_scalars(self.array[key])) File ~/Documents/Work/Code/xarray/xarray/core/indexing.py:534, in LazilyIndexedArray.__getitem__(self, indexer) 533 return array[indexer] --> 534 return type(self)(self.array, self._updated_key(indexer)) File ~/Documents/Work/Code/xarray/xarray/core/indexing.py:500, in LazilyIndexedArray._updated_key(self, new_key) 499 def _updated_key(self, new_key): --> 500 iter_new_key = iter(expanded_indexer(new_key.tuple, self.ndim)) 501 full_key = [] AttributeError: 'tuple' object has no attribute 'tuple' The above exception was the direct cause of the following exception: RetryError Traceback (most recent call last) Cell In[69], line 1 ----> 1 ds.load() File ~/Documents/Work/Code/xarray/xarray/core/dataset.py:761, in Dataset.load(self, **kwargs) 758 chunkmanager = get_chunked_array_type(*lazy_data.values()) 760 # evaluate all the chunked arrays simultaneously --> 761 evaluated_data = chunkmanager.compute(*lazy_data.values(), **kwargs) 763 for k, data in zip(lazy_data, evaluated_data): 764 self.variables[k].data = data File ~/Documents/Work/Code/xarray/xarray/core/parallelcompat.py:451, in CubedManager.compute(self, *data, **kwargs) 448 def compute(self, *data: "CubedArray", **kwargs) -> np.ndarray: 449 from cubed import compute --> 451 return compute(*data, **kwargs) File ~/miniconda3/envs/cubed/lib/python3.9/site-packages/cubed/core/array.py:300, in compute(executor, callbacks, optimize_graph, *arrays, **kwargs) 297 executor = PythonDagExecutor() 299 _return_in_memory_array = kwargs.pop("_return_in_memory_array", True) --> 300 plan.execute( 301 executor=executor, 302 callbacks=callbacks, 303 optimize_graph=optimize_graph, 304 array_names=[a.name for a in arrays], 305 **kwargs, 306 ) 308 if _return_in_memory_array: 309 return tuple(a._read_stored() for a in arrays) File ~/miniconda3/envs/cubed/lib/python3.9/site-packages/cubed/core/plan.py:154, in Plan.execute(self, executor, callbacks, optimize_graph, array_names, **kwargs) 152 if callbacks is not None: 153 [callback.on_compute_start(dag) for callback in callbacks] --> 154 executor.execute_dag( 155 dag, callbacks=callbacks, array_names=array_names, **kwargs 156 ) 157 if callbacks is not None: 158 [callback.on_compute_end(dag) for callback in callbacks] File ~/miniconda3/envs/cubed/lib/python3.9/site-packages/cubed/runtime/executors/python.py:22, in PythonDagExecutor.execute_dag(self, dag, callbacks, array_names, **kwargs) 20 if stage.mappable is not None: 21 for m in stage.mappable: ---> 22 exec_stage_func(stage.function, m, config=pipeline.config) 23 if callbacks is not None: 24 event = TaskEndEvent(array_name=name) File ~/miniconda3/envs/cubed/lib/python3.9/site-packages/tenacity/__init__.py:289, in BaseRetrying.wraps.<locals>.wrapped_f(*args, **kw) 287 @functools.wraps(f) 288 def wrapped_f(*args: t.Any, **kw: t.Any) -> t.Any: --> 289 return self(f, *args, **kw) File ~/miniconda3/envs/cubed/lib/python3.9/site-packages/tenacity/__init__.py:379, in Retrying.__call__(self, fn, *args, **kwargs) 377 retry_state = RetryCallState(retry_object=self, fn=fn, args=args, kwargs=kwargs) 378 while True: --> 379 do = self.iter(retry_state=retry_state) 380 if isinstance(do, DoAttempt): 381 try: File ~/miniconda3/envs/cubed/lib/python3.9/site-packages/tenacity/__init__.py:326, in BaseRetrying.iter(self, retry_state) 324 if self.reraise: 325 raise retry_exc.reraise() --> 326 raise retry_exc from fut.exception() 328 if self.wait: 329 sleep = self.wait(retry_state) RetryError: RetryError[<Future at 0x7fc0c69be4f0 state=finished raised AttributeError> ```

This still works fine for dask.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1468732253 https://github.com/pydata/xarray/pull/7019#issuecomment-1468732253 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85XixNd TomNicholas 35968931 2023-03-14T19:52:38Z 2023-03-14T19:52:38Z MEMBER

Thanks @tomwhite - I think it might make sense for me to remove the CubedManager class from this PR and instead put that & cubed+xarray tests into another repo. That keeps xarray's changes minimal, doesn't require putting cubed in any xarray CI envs, and hopefully allows us to merge the ChunkManager changes here earlier.


Places dask is still explicitly imported in xarray

There are a few remaining places where I haven't generalised to remove specific import dask calls either because it won't be imported at runtime unless you ask for it, cubed doesn't implement the equivalent function, that function isn't in the array API standard, or because I'm not sure if the dask concept used generalises to other parallel frameworks.

  • [ ] open_mfdataset(..., parallel=True) - there is no cubed.delayed to wrap the open_dataset calls in,
  • [ ] Dataset.__dask_graph__ and all the other similar dask magic methods
  • [ ] dask_array_ops.rolling - uses functions from dask.array.overlap,
  • [ ] dask_array_ops.least_squares - uses dask.array.apply_along_axis and dask.array.linalg.lstsq,
  • [ ] dask_array_ops.push - uses dask.array.reductions.cumreduction

I would like to get to the point where you can use xarray with a chunked array without ever importing dask. I think this PR gets very close, but that would be tricky to test because cubed depends on dask (so I can't just run the test suite without dask in the environment), and there are not yet any other parallel chunk-aware frameworks I know of (ramba and arkouda don't have a chunks attribute so wouldn't require this PR).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1460980509 https://github.com/pydata/xarray/pull/7019#issuecomment-1460980509 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85XFMsd TomNicholas 35968931 2023-03-08T22:37:21Z 2023-03-08T22:39:46Z MEMBER

I'm making progress with this PR, and now that @tomwhite implemented cubed.apply_gufunc I've re-routed xarray.apply_ufunc to use whatever version of apply_gufunc is defined by the chosen ChunkManager. This means many basic operations should now just work:

```python In [1]: import xarray as xr

In [2]: da = xr.DataArray([1, 2, 3], dims='x')

In [3]: da_chunked = da.chunk(from_array_kwargs={'manager': 'cubed'})

In [4]: da_chunked Out[4]: <xarray.DataArray (x: 3)> cubed.Array<array-003, shape=(3,), dtype=int64, chunks=((3,),)> Dimensions without coordinates: x

In [5]: da_chunked.mean() Out[5]: <xarray.DataArray ()> cubed.Array<array-006, shape=(), dtype=int64, chunks=()>

In [6]: da_chunked.mean().compute() [cubed.Array<array-009, shape=(), dtype=int64, chunks=()>] Out[6]: <xarray.DataArray ()> array(2) ```

(You need to install both cubed>0.5.0 and the main branch of rechunker for this to work.)

I still have a fair bit more to do on this PR (see checklist at top), but for testing should I:

  1. Start making a test_cubed.py file in xarray as part of this PR with bespoke tests,
  2. Put bespoke tests for xarray wrapping cubed somewhere else (e.g. the cubed repo or a new cubed-xarray repo),
  3. Merge this PR without cubed-specific tests and concentrate on finishing the general duck-array testing framework in #6908 so we can implement (b) in the way we actually eventually want things to work for 3rd-party duck array libraries?

I would prefer not to have this PR grow to be thousands of lines by including tests in it, but also waiting for #6908 might take a while because that's also a fairly ambitious PR.

The fact that the tests are currently green for this PR (ignoring some mypy stuff) is evidence that the decoupling of dask from xarray is working so far.

(I have already added some tests for the ability to register custom ChunkManagers though.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1400848135 https://github.com/pydata/xarray/pull/7019#issuecomment-1400848135 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85Tfz8H TomNicholas 35968931 2023-01-23T19:14:26Z 2023-01-23T19:14:26Z MEMBER

@drtodd13 mentioned today that ramba doesn't actually require explicit chunks to work, which I hadn't realised. So forcing wrapped libraries to implement an explicit chunks method might be too restrictive. Ramba could possibly work entirely through the numpy array API standard.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1400846131 https://github.com/pydata/xarray/pull/7019#issuecomment-1400846131 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85Tfzcz TomNicholas 35968931 2023-01-23T19:12:57Z 2023-01-23T19:12:57Z MEMBER

@drtodd13 tagging you here and linking my notes from today's distributed arrays working group meeting for the links and references to this PR.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629
1255310245 https://github.com/pydata/xarray/pull/7019#issuecomment-1255310245 https://api.github.com/repos/pydata/xarray/issues/7019 IC_kwDOAMm_X85K0oOl TomNicholas 35968931 2022-09-22T17:06:42Z 2022-09-22T17:06:42Z MEMBER

This was more abstract than expected

Yeah I was kind of asking whether this was unnecessarily abstract, and if there was a simpler design that achieved the same flexibility.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Generalize handling of chunked array types 1368740629

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 543.7ms · About: xarray-datasette