Comment on pydata/xarray#7019 — user 35968931 (MEMBER), created 2023-04-07T00:32:47Z, updated 2023-04-07T00:59:03Z
https://github.com/pydata/xarray/pull/7019#issuecomment-1499791533

> I'm having problems with ensuring the behaviour of the `chunks='auto'` option is consistent between `.chunk` and `open_dataset`

Update on this rabbit hole: This commit to dask changed the behaviour of dask's auto-chunking logic, such that if I run my little test script `test_old_get_chunk.py` on dask releases from before and after that commit, I get different chunking patterns:

```python
import itertools
import warnings
from numbers import Number

import dask
import dask.array as da
import numpy as np
import xarray as xr
from dask.array.core import normalize_chunks  # import the dask version of normalize_chunks
from xarray.core.variable import IndexVariable


# This function is copied from xarray, but calls dask.array.core.normalize_chunks.
# It is used in open_dataset, but not in Dataset.chunk
def _get_chunk(var, chunks):
    """
    Return map from each dim to chunk sizes, accounting for backend's preferred chunks.
    """
    if isinstance(var, IndexVariable):
        return {}
    dims = var.dims
    shape = var.shape

    # Determine the explicit requested chunks.
    preferred_chunks = var.encoding.get("preferred_chunks", {})
    preferred_chunk_shape = tuple(
        preferred_chunks.get(dim, size) for dim, size in zip(dims, shape)
    )
    if isinstance(chunks, Number) or (chunks == "auto"):
        chunks = dict.fromkeys(dims, chunks)
    chunk_shape = tuple(
        chunks.get(dim, None) or preferred_chunk_sizes
        for dim, preferred_chunk_sizes in zip(dims, preferred_chunk_shape)
    )
    chunk_shape = normalize_chunks(
        chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape
    )

    # Warn where requested chunks break preferred chunks, provided that the variable
    # contains data.
    if var.size:
        for dim, size, chunk_sizes in zip(dims, shape, chunk_shape):
            try:
                preferred_chunk_sizes = preferred_chunks[dim]
            except KeyError:
                continue
            # Determine the stop indices of the preferred chunks, but omit the last stop
            # (equal to the dim size). In particular, assume that when a sequence
            # expresses the preferred chunks, the sequence sums to the size.
            preferred_stops = (
                range(preferred_chunk_sizes, size, preferred_chunk_sizes)
                if isinstance(preferred_chunk_sizes, Number)
                else itertools.accumulate(preferred_chunk_sizes[:-1])
            )
            # Gather any stop indices of the specified chunks that are not a stop index
            # of a preferred chunk. Again, omit the last stop, assuming that it equals
            # the dim size.
            breaks = set(itertools.accumulate(chunk_sizes[:-1])).difference(
                preferred_stops
            )
            if breaks:
                warnings.warn(
                    "The specified Dask chunks separate the stored chunks along "
                    f'dimension "{dim}" starting at index {min(breaks)}. This could '
                    "degrade performance. Instead, consider rechunking after loading."
                )

    return dict(zip(dims, chunk_shape))


chunks = "auto"
encoded_chunks = 100
dask_arr = da.from_array(
    np.ones((500, 500), dtype="float64"), chunks=encoded_chunks
)
var = xr.core.variable.Variable(data=dask_arr, dims=["x", "y"])
with dask.config.set({"array.chunk-size": "1MiB"}):
    chunks_suggested = _get_chunk(var, chunks)
    print(chunks_suggested)
```

```
(cubed) tom@tom-XPS-9315:~/Documents/Work/Code/dask$ git checkout 2022.9.2
Previous HEAD position was 7fe622b44 Add docs on running Dask in a standalone Python script (#9513)
HEAD is now at 3ef47422b bump version to 2022.9.2
(cubed) tom@tom-XPS-9315:~/Documents/Work/Code/dask$ python ../experimentation/bugs/auto_chunking/test_old_get_chunk.py
{'x': (362, 138), 'y': (362, 138)}
(cubed) tom@tom-XPS-9315:~/Documents/Work/Code/dask$ git checkout 2022.9.1
Previous HEAD position was 3ef47422b bump version to 2022.9.2
HEAD is now at b944abf68 bump version to 2022.9.1
(cubed) tom@tom-XPS-9315:~/Documents/Work/Code/dask$ python ../experimentation/bugs/auto_chunking/test_old_get_chunk.py
{'x': (250, 250), 'y': (250, 250)}
```

(I was absolutely tearing my hair out trying to find this bug, because after the change `normalize_chunks` became a pure function, but before the change it actually wasn't, so I was calling `normalize_chunks` with the exact same set of input arguments and was still not able to reproduce the bug :angry:)
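For concreteness, as far as I can trace it, the script above reduces to the single direct call below (`var.encoding` is empty, so `previous_chunks` ends up as the full shape) — this is the kind of invocation that, on its own, did *not* reproduce the difference on 2022.9.1:

```python
# Minimal sketch of the direct call, with argument values traced from the
# test script above. On 2022.9.2 this prints ((362, 138), (362, 138)).
import dask
import numpy as np
from dask.array.core import normalize_chunks

with dask.config.set({"array.chunk-size": "1MiB"}):
    print(
        normalize_chunks(
            ("auto", "auto"),
            shape=(500, 500),
            dtype=np.dtype("float64"),
            previous_chunks=(500, 500),
        )
    )
```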

Anyway, what this means is that because this PR vendors `dask.array.core.normalize_chunks`, and the behaviour of `dask.array.core.normalize_chunks` changed between the dask version in the min-all-deps CI job and the versions in the other CI jobs, the single vendored function cannot possibly match both behaviours.
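To see the tension concretely: any test asserting the vendored function's output would have to be gated on the installed dask version, something like this hypothetical skip marker:

```python
# Hypothetical illustration only: a test pinned to either chunking pattern
# must be skipped on the other side of the dask 2022.9.2 boundary.
import dask
import pytest
from packaging.version import Version

OLD_AUTO_CHUNKING = Version(dask.__version__) < Version("2022.9.2")


@pytest.mark.skipif(OLD_AUTO_CHUNKING, reason="dask auto-chunking changed in 2022.9.2")
def test_auto_chunking_matches_new_dask():
    ...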

I think one simple way to fix this failure would be to upgrade the minimum version of dask to >=2022.9.2 (from 2022.1.1, where it currently is).

EDIT: I tried changing the minimum version of dask-core in min-all-deps.yml, but the conda solve failed. But also, would updating to 2022.9.2 now violate xarray's minimum dependency versions policy?

EDIT2: Another way to fix this would be to un-vendor `dask.array.core.normalize_chunks` within xarray. We could still achieve the goal of running cubed without dask by making `normalize_chunks` the responsibility of the chunkmanager instead, as cubed's vendored version of that function is not subject to xarray's minimum dependencies requirement. A rough sketch of what I mean is below.
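Something like this (class and method names are illustrative, not the actual PR code): each chunkmanager exposes its own `normalize_chunks`, and xarray just delegates, so its behaviour always matches whichever chunked-array library is installed:

```python
# Hypothetical sketch: delegate normalize_chunks to the chunkmanager instead
# of vendoring it in xarray.
class DaskManager:
    def normalize_chunks(
        self, chunks, shape=None, limit=None, dtype=None, previous_chunks=None
    ):
        # defer to whichever dask version is installed
        from dask.array.core import normalize_chunks

        return normalize_chunks(
            chunks,
            shape=shape,
            limit=limit,
            dtype=dtype,
            previous_chunks=previous_chunks,
        )


class CubedManager:
    def normalize_chunks(
        self, chunks, shape=None, limit=None, dtype=None, previous_chunks=None
    ):
        # cubed would call its own vendored copy here, which is not
        # constrained by xarray's min-deps policy (import path illustrative)
        ...
```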
