issues: 2128501296

id node_id number title user state locked assignee milestone comments created_at updated_at closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
2128501296 I_kwDOAMm_X85-3low 8733 A basic default ChunkManager for arrays that report their own chunks 90008 open 0     21 2024-02-10T14:36:55Z 2024-03-10T17:26:13Z   CONTRIBUTOR      

Is your feature request related to a problem?

I'm creating duck arrays for various file-backed data structures of mine that are naturally "chunked", i.e. different parts of the array may live in completely different files.

Using these "chunks" and the "strides", algorithms can better decide how to iterate in a convenient manner.

For example, an MP4 file's chunks may be defined as being delimited by I frames, while images stored in a TIFF may be delimited by a page.

So for me, chunks are not so useful for parallel computing, but rather for computing locally and choosing an appropriate way to iterate through large arrays (TB of uncompressed data).
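To make the "iterate chunk by chunk" idea concrete, here is a minimal sketch. It assumes the array reports its chunks in the dask-style normalized form (one tuple of block sizes per dimension); `chunk_slices` and the example layout are hypothetical names used only for illustration, not part of xarray.

```python
import itertools


def chunk_slices(chunks):
    """Yield one tuple of slices per chunk.

    `chunks` is assumed to be in normalized form: a tuple with one entry per
    dimension, each entry being a tuple of block sizes along that dimension.
    """
    per_dim = []
    for sizes in chunks:
        stops = list(itertools.accumulate(sizes))
        starts = [0] + stops[:-1]
        per_dim.append([slice(a, b) for a, b in zip(starts, stops)])
    # The Cartesian product visits every chunk exactly once.
    yield from itertools.product(*per_dim)


# Hypothetical layout: 5 frames of 4x4 pixels, stored as groups of 2, 2 and 1
# frames (e.g. delimited by I-frames or TIFF pages).
chunks = ((2, 2, 1), (4,), (4,))
all_slices = list(chunk_slices(chunks))
```

Each element of `all_slices` can be used directly as an index, e.g. `arr[all_slices[0]]`, so a local computation can load one file-aligned block at a time instead of slicing across file boundaries.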

Describe the solution you'd like

I think a default chunk manager could simply implement compute as np.asarray, and act as a catch-all for any chunked array that no other chunk manager claims.

Advanced users could then go in and reimplement their own chunk manager, but I was unable to use my duck arrays that included a chunks property because they weren't associated with any chunk manager.

Something as simple as:

```patch
diff --git a/xarray/core/parallelcompat.py b/xarray/core/parallelcompat.py
index c009ef48..bf500abb 100644
--- a/xarray/core/parallelcompat.py
+++ b/xarray/core/parallelcompat.py
@@ -681,3 +681,26 @@ class ChunkManagerEntrypoint(ABC, Generic[T_ChunkedArray]):
         cubed.store
         """
         raise NotImplementedError()
+
+
+class DefaultChunkManager(ChunkManagerEntrypoint):
+    def __init__(self) -> None:
+        self.array_cls = None
+
+    def is_chunked_array(self, data: Any) -> bool:
+        return is_duck_array(data) and hasattr(data, "chunks")
+
+    def chunks(self, data: T_ChunkedArray) -> T_NormalizedChunks:
+        return data.chunks
+
+    def compute(self, *data: T_ChunkedArray | Any, **kwargs) -> tuple[np.ndarray, ...]:
+        return tuple(np.asarray(d) for d in data)
+
+    def normalize_chunks(self, *args, **kwargs):
+        raise NotImplementedError()
+
+    def from_array(self, *args, **kwargs):
+        raise NotImplementedError()
+
+    def apply_gufunc(self, *args, **kwargs):
+        raise NotImplementedError()
```
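For readers who want to try the idea without patching xarray, the behaviour can be sketched standalone. Everything below is a hypothetical stand-in: `PagedArray` imitates a file-backed array with one chunk per page, and the two free functions mirror only the `is_chunked_array` and `compute` methods of the patch, with a simplified attribute check assumed in place of xarray's `is_duck_array`.

```python
import numpy as np


class PagedArray:
    """Hypothetical duck array holding one chunk per 'page' along axis 0."""

    def __init__(self, pages):
        self._pages = [np.asarray(p) for p in pages]
        self.shape = (len(self._pages),) + self._pages[0].shape
        self.dtype = self._pages[0].dtype
        self.ndim = len(self.shape)

    @property
    def chunks(self):
        # Normalized form: block sizes per dimension, one page per block.
        return ((1,) * self.shape[0],) + tuple((n,) for n in self.shape[1:])

    def __array__(self, dtype=None, copy=None):
        # Materialize every page in memory; this is what np.asarray triggers.
        stacked = np.stack(self._pages)
        return stacked if dtype is None else stacked.astype(dtype)


def is_chunked_array(data):
    # Simplified stand-in for `is_duck_array(data) and hasattr(data, "chunks")`.
    return hasattr(data, "shape") and hasattr(data, "dtype") and hasattr(data, "chunks")


def compute(*data):
    # The catch-all strategy from the patch: load everything with np.asarray.
    return tuple(np.asarray(d) for d in data)


pages = PagedArray([np.zeros((2, 3)), np.ones((2, 3))])
(loaded,) = compute(pages)
```

A plain `np.ndarray` has no `chunks` attribute, so it is not treated as chunked here, while `pages` is and gets loaded eagerly by `compute`.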

Describe alternatives you've considered

I created my own chunk manager, with my own chunk manager entry point.

Kinda tedious...

Additional context

It seems that this is related to: https://github.com/pydata/xarray/pull/7019

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8733/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    13221727 issue