Comment on pydata/xarray#7019 — user 35968931 (MEMBER), created 2023-04-07T00:32:47Z, updated 2023-04-07T00:59:03Z
https://github.com/pydata/xarray/pull/7019#issuecomment-1499791533

> I'm having problems with ensuring the behaviour of the `chunks='auto'` option is consistent between `.chunk` and `open_dataset`

Update on this rabbit hole: This commit to dask changed the behaviour of dask's auto-chunking logic, such that if I run my little test script `test_old_get_chunk.py` on dask releases from before and after that commit, I get different chunking patterns:

```python
import itertools
import warnings
from numbers import Number

import dask
import dask.array as da
import numpy as np
import xarray as xr
from dask.array.core import normalize_chunks  # import the dask version of normalize_chunks
from xarray.core.variable import IndexVariable


# This function is copied from xarray, but calls dask.array.core.normalize_chunks.
# It is used in open_dataset, but not in Dataset.chunk
def _get_chunk(var, chunks):
    """
    Return map from each dim to chunk sizes, accounting for backend's preferred chunks.
    """
    if isinstance(var, IndexVariable):
        return {}
    dims = var.dims
    shape = var.shape

    # Determine the explicit requested chunks.
    preferred_chunks = var.encoding.get("preferred_chunks", {})
    preferred_chunk_shape = tuple(
        preferred_chunks.get(dim, size) for dim, size in zip(dims, shape)
    )
    if isinstance(chunks, Number) or (chunks == "auto"):
        chunks = dict.fromkeys(dims, chunks)
    chunk_shape = tuple(
        chunks.get(dim, None) or preferred_chunk_sizes
        for dim, preferred_chunk_sizes in zip(dims, preferred_chunk_shape)
    )
    chunk_shape = normalize_chunks(
        chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape
    )

    # Warn where requested chunks break preferred chunks, provided that the variable
    # contains data.
    if var.size:
        for dim, size, chunk_sizes in zip(dims, shape, chunk_shape):
            try:
                preferred_chunk_sizes = preferred_chunks[dim]
            except KeyError:
                continue
            # Determine the stop indices of the preferred chunks, but omit the last stop
            # (equal to the dim size). In particular, assume that when a sequence
            # expresses the preferred chunks, the sequence sums to the size.
            preferred_stops = (
                range(preferred_chunk_sizes, size, preferred_chunk_sizes)
                if isinstance(preferred_chunk_sizes, Number)
                else itertools.accumulate(preferred_chunk_sizes[:-1])
            )
            # Gather any stop indices of the specified chunks that are not a stop index
            # of a preferred chunk. Again, omit the last stop, assuming that it equals
            # the dim size.
            breaks = set(itertools.accumulate(chunk_sizes[:-1])).difference(
                preferred_stops
            )
            if breaks:
                warnings.warn(
                    "The specified Dask chunks separate the stored chunks along "
                    f'dimension "{dim}" starting at index {min(breaks)}. This could '
                    "degrade performance. Instead, consider rechunking after loading."
                )

    return dict(zip(dims, chunk_shape))


chunks = "auto"
encoded_chunks = 100
dask_arr = da.from_array(
    np.ones((500, 500), dtype="float64"), chunks=encoded_chunks
)
var = xr.core.variable.Variable(data=dask_arr, dims=["x", "y"])
with dask.config.set({"array.chunk-size": "1MiB"}):
    chunks_suggested = _get_chunk(var, chunks)
    print(chunks_suggested)
```

```
(cubed) tom@tom-XPS-9315:~/Documents/Work/Code/dask$ git checkout 2022.9.2
Previous HEAD position was 7fe622b44 Add docs on running Dask in a standalone Python script (#9513)
HEAD is now at 3ef47422b bump version to 2022.9.2
(cubed) tom@tom-XPS-9315:~/Documents/Work/Code/dask$ python ../experimentation/bugs/auto_chunking/test_old_get_chunk.py
{'x': (362, 138), 'y': (362, 138)}
(cubed) tom@tom-XPS-9315:~/Documents/Work/Code/dask$ git checkout 2022.9.1
Previous HEAD position was 3ef47422b bump version to 2022.9.2
HEAD is now at b944abf68 bump version to 2022.9.1
(cubed) tom@tom-XPS-9315:~/Documents/Work/Code/dask$ python ../experimentation/bugs/auto_chunking/test_old_get_chunk.py
{'x': (250, 250), 'y': (250, 250)}
```

(I was absolutely tearing my hair out trying to find this bug, because after the change `normalize_chunks` became a pure function, but before the change it actually wasn't, so I was calling `normalize_chunks` with the exact same set of input arguments and was still not able to reproduce the bug :angry:)
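For concreteness, as far as I can trace it, the script above reduces to the single direct call below (`var.encoding` is empty, so `previous_chunks` ends up as the full shape) — this is the kind of invocation that, on its own, did *not* reproduce the difference on 2022.9.1:

```python
# Minimal sketch of the direct call, with argument values traced from the
# test script above. On 2022.9.2 this prints ((362, 138), (362, 138)).
import dask
import numpy as np
from dask.array.core import normalize_chunks

with dask.config.set({"array.chunk-size": "1MiB"}):
    print(
        normalize_chunks(
            ("auto", "auto"),
            shape=(500, 500),
            dtype=np.dtype("float64"),
            previous_chunks=(500, 500),
        )
    )
```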

Anyway, what this means is that because this PR vendors `dask.array.core.normalize_chunks`, and the behaviour of `dask.array.core.normalize_chunks` changed between the dask version in the min-all-deps CI job and the versions in the other CI jobs, the single vendored function cannot possibly match both behaviours.
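To see the tension concretely: any test asserting the vendored function's output would have to be gated on the installed dask version, something like this hypothetical skip marker:

```python
# Hypothetical illustration only: a test pinned to either chunking pattern
# must be skipped on the other side of the dask 2022.9.2 boundary.
import dask
import pytest
from packaging.version import Version

OLD_AUTO_CHUNKING = Version(dask.__version__) < Version("2022.9.2")


@pytest.mark.skipif(OLD_AUTO_CHUNKING, reason="dask auto-chunking changed in 2022.9.2")
def test_auto_chunking_matches_new_dask():
    ...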

I think one simple way to fix this failure would be to upgrade the minimum version of dask to >=2022.9.2 (from 2022.1.1, where it currently is).

EDIT: I tried changing the minimum version of dask-core in min-all-deps.yml, but the conda solve failed. But also, would updating to 2022.9.2 now violate xarray's minimum dependency versions policy?

EDIT2: Another way to fix this would be to un-vendor `dask.array.core.normalize_chunks` within xarray. We could still achieve the goal of running cubed without dask by making `normalize_chunks` the responsibility of the chunkmanager instead, as cubed's vendored version of that function is not subject to xarray's minimum dependencies requirement. A rough sketch of what I mean is below.
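Something like this (class and method names are illustrative, not the actual PR code): each chunkmanager exposes its own `normalize_chunks`, and xarray just delegates, so its behaviour always matches whichever chunked-array library is installed:

```python
# Hypothetical sketch: delegate normalize_chunks to the chunkmanager instead
# of vendoring it in xarray.
class DaskManager:
    def normalize_chunks(
        self, chunks, shape=None, limit=None, dtype=None, previous_chunks=None
    ):
        # defer to whichever dask version is installed
        from dask.array.core import normalize_chunks

        return normalize_chunks(
            chunks,
            shape=shape,
            limit=limit,
            dtype=dtype,
            previous_chunks=previous_chunks,
        )


class CubedManager:
    def normalize_chunks(
        self, chunks, shape=None, limit=None, dtype=None, previous_chunks=None
    ):
        # cubed would call its own vendored copy here, which is not
        # constrained by xarray's min-deps policy (import path illustrative)
        ...
```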
