html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/2300#issuecomment-673565228,https://api.github.com/repos/pydata/xarray/issues/2300,673565228,MDEyOklzc3VlQ29tbWVudDY3MzU2NTIyOA==,4441338,2020-08-13T16:04:04Z,2020-08-13T16:04:04Z,NONE,"I arrived here via a different use case, which I ultimately solved, but I think there's value in documenting it here.
My use case is the following workflow:
1. Take raw data, build a dataset, and append it to a zarr store Z.
2. Analyze the data on Z, then maybe go to step 1.
Step 2's performance is much better when the data on Z is chunked properly along the append dimension 'frame' (chunks of size 50); however, step 1 only adds one element along it, so I end up with Z having chunks (1, 1, 1, 1, 1, ...) along 'frame'.
On xarray 0.16.0, this is solvable via the `encoding` parameter, provided it is passed only when the store is first created.
Before that version, I was using something like the monkey patch posted by @chrisbarber.
Code:
```python
import shutil
import tempfile
import contextlib

import numpy as np
import xarray as xr

zarr_path = tempfile.mkdtemp()

def append_test(ds, chunks):
    shutil.rmtree(zarr_path)
    for i in range(21):
        d = ds.isel(frame=slice(i, i + 1)).chunk(chunks)
        d.to_zarr(zarr_path, consolidated=True,
                  **(dict(mode='a', append_dim='frame') if i > 0 else {}))
    dsa = xr.open_zarr(str(zarr_path), consolidated=True)
    print(dsa.chunks, dsa.dims)

# Workaround for versions sometime before 0.16.0: monkey-patch the chunk logic.
@contextlib.contextmanager
def change_determine_zarr_chunks(chunks):
    orig_determine_zarr_chunks = xr.backends.zarr._determine_zarr_chunks
    try:
        def new_determine_zarr_chunks(enc_chunks, var_chunks, ndim, name):
            # Ignore the computed chunks; derive them from the desired
            # `chunks` mapping and the (global) source dataset `ds`.
            da = ds[name]
            return tuple(
                chunks[dim] if chunks.get(dim) is not None else da.shape[i]
                for i, dim in enumerate(da.dims)
            )
        xr.backends.zarr._determine_zarr_chunks = new_determine_zarr_chunks
        yield
    finally:
        xr.backends.zarr._determine_zarr_chunks = orig_determine_zarr_chunks

chunks = {'frame': 10, 'other': 50}
ds = xr.Dataset({'data': xr.DataArray(data=np.random.rand(100, 100),
                                      dims=('frame', 'other'))})

append_test(ds, chunks)            # default behavior: chunks of 1 on 'frame'
with change_determine_zarr_chunks(chunks):
    append_test(ds, chunks)        # patched: chunks of 10 on 'frame'

# With 0.16.0: pass `encoding` on store creation only.
def append_test_encoding(ds, chunks):
    shutil.rmtree(zarr_path)
    encoding = {
        k: {'chunks': tuple(chunks.get(dk, v.shape[i])
                            for i, dk in enumerate(v.dims))}
        for k, v in ds.variables.items()
    }
    for i in range(21):
        d = ds.isel(frame=slice(i, i + 1)).chunk(chunks)
        d.to_zarr(zarr_path, consolidated=True,
                  **(dict(mode='a', append_dim='frame') if i > 0
                     else dict(encoding=encoding)))
    dsa = xr.open_zarr(str(zarr_path), consolidated=True)
    print(dsa.chunks, dsa.dims)

append_test_encoding(ds, chunks)
```
```
Frozen(SortedKeysDict({'frame': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 'other': (50, 50)})) Frozen(SortedKeysDict({'frame': 21, 'other': 100}))
Frozen(SortedKeysDict({'frame': (10, 10, 1), 'other': (50, 50)})) Frozen(SortedKeysDict({'frame': 21, 'other': 100}))
Frozen(SortedKeysDict({'frame': (10, 10, 1), 'other': (50, 50)})) Frozen(SortedKeysDict({'frame': 21, 'other': 100}))
```
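As a standalone illustration of the encoding construction used above, here is a minimal pure-Python sketch (the helper name `zarr_chunks_for` is my own) of how the per-variable `chunks` tuple is derived from a `{dim: size}` mapping:

```python
def zarr_chunks_for(dims, shape, chunks):
    """Build the zarr 'chunks' encoding tuple for one variable.

    Dimensions not listed in `chunks` (or mapped to None) get a single
    chunk spanning the whole axis, matching the encoding built above.
    """
    return tuple(
        chunks[d] if chunks.get(d) is not None else n
        for d, n in zip(dims, shape)
    )

# The 'data' variable from the example above:
print(zarr_chunks_for(('frame', 'other'), (100, 100), {'frame': 10, 'other': 50}))  # (10, 50)
# A variable along a dimension with no requested chunking gets one chunk:
print(zarr_chunks_for(('time',), (500,), {'frame': 10}))  # (500,)
```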
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,342531772
https://github.com/pydata/xarray/issues/2300#issuecomment-493408428,https://api.github.com/repos/pydata/xarray/issues/2300,493408428,MDEyOklzc3VlQ29tbWVudDQ5MzQwODQyOA==,46813815,2019-05-17T10:37:35Z,2019-05-17T10:37:35Z,NONE,"Hi, I'm new to xarray and zarr.
After reading a zarr store, I re-chunk the data using `xarray.Dataset.chunk`, then try to write the newly chunked data to a new zarr store with `xarray.Dataset.to_zarr`, but I get this error message:
'NotImplementedError: Specified zarr chunks (200, 100, 1) would overlap multiple dask chunks ((50, 50, 50, 50), (25, 25, 25, 25), (10000,)). This is not implemented in xarray yet. Consider rechunking the data using `chunk()` or specifying different chunks in encoding.'
My xarray version is 0.12.1, and my understanding from this issue (https://github.com/pydata/xarray/issues/2300) is that this was fixed, so is the fix included in 0.12.1?
If so, why do I get the NotImplementedError?
Do I have to run `del dsread.data.encoding['chunks']` each time before calling `Dataset.to_zarr` as a workaround? I am probably missing something; I hope someone can point it out.
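For what it's worth, the error is about chunk alignment: xarray refuses to write when a stored zarr chunk would span several dask chunks. A simplified sketch of that condition (plain Python, not xarray's actual implementation):

```python
def chunks_conflict(zarr_chunks, dask_chunks):
    """Return True if some zarr chunk would overlap multiple dask chunks.

    Simplified rule: along each axis, every dask chunk except the last
    must be an exact multiple of the zarr chunk size.
    """
    for zc, axis in zip(zarr_chunks, dask_chunks):
        if any(dc % zc for dc in axis[:-1]):
            return True
    return False

# The case from the error above: zarr chunks (200, 100, 1) vs dask chunks
# of 50, 25 and 10000 -> the 200-sized zarr chunk would span four 50-sized
# dask chunks, hence the NotImplementedError.
print(chunks_conflict((200, 100, 1), ((50, 50, 50, 50), (25, 25, 25, 25), (10000,))))  # True
```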
I made a notebook here reproducing the problem:
https://github.com/tinaok/Pangeo-for-beginners/blob/master/3-1%20zarr%20and%20re-chunking%20bug%20report.ipynb
thanks for your help, regards Tina","{""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,342531772
https://github.com/pydata/xarray/issues/2300#issuecomment-406732486,https://api.github.com/repos/pydata/xarray/issues/2300,406732486,MDEyOklzc3VlQ29tbWVudDQwNjczMjQ4Ng==,1530840,2018-07-20T21:33:08Z,2018-07-20T21:33:08Z,NONE,"I took a closer look and noticed my one-dimensional fields of size 505359 were reporting a chunksize of 63170. Turns out that's enough to come up with a minimal repro:
```python
>>> xr.__version__
'0.10.8'
>>> ds=xr.Dataset({'foo': (['bar'], np.zeros((505359,)))})
>>> ds.to_zarr('test.zarr')
>>> ds2=xr.open_zarr('test.zarr')
>>> ds2
Dimensions: (bar: 505359)
Dimensions without coordinates: bar
Data variables:
foo (bar) float64 dask.array
>>> ds2.foo.encoding
{'chunks': (63170,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64')}
>>> ds2.to_zarr('test2.zarr')
```
raises
```
NotImplementedError: Specified zarr chunks (63170,) would overlap multiple dask chunks ((63170, 63170, 63170, 63170, 63170, 63170, 63170, 63169),). This is not implemented in xarray yet. Consider rechunking the data using `chunk()` or specifying different chunks in encoding.
```","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,342531772
https://github.com/pydata/xarray/issues/2300#issuecomment-406705740,https://api.github.com/repos/pydata/xarray/issues/2300,406705740,MDEyOklzc3VlQ29tbWVudDQwNjcwNTc0MA==,1530840,2018-07-20T19:36:08Z,2018-07-20T19:38:03Z,NONE,"Ah, that's great. I do see *some* improvement. Specifically, I can now set chunks using xarray, and successfully write to zarr, and reopen it. However, when reopening it I do find that the chunks have been inconsistently applied (some fields have the expected chunksize whereas some small fields have the entire variable in one chunk). Furthermore, trying to write a second time with `to_zarr` leads to:
```
*** NotImplementedError: Specified zarr chunks (100,) would overlap multiple dask chunks ((100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 4),). This is not implemented in xarray yet. Consider rechunking the data using `chunk()` or specifying different chunks in encoding.
```
Trying to reapply the original chunks with `xr.Dataset.chunk` succeeds, and `ds.chunks` no longer reports ""inconsistent chunks"", but trying to write still produces the same error.
I also tried loading my entire dataset into memory, allowing the initial `to_zarr` to default to zarr's chunking heuristics. Trying to read and write a second time again results in the same error:
```
NotImplementedError: Specified zarr chunks (63170,) would overlap multiple dask chunks ((63170, 63170, 63170, 63170, 63170, 63170, 63170, 63169),). This is not implemented in xarray yet. Consider rechunking the data using `chunk()` or specifying different chunks in encoding.
```
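One workaround discussed elsewhere in this thread is to drop the stored `chunks` entry from each variable's encoding before the second write, so `to_zarr` falls back to the dask chunking. A standalone sketch (hypothetical helper; plain dicts stand in for each `ds[var].encoding`):

```python
def drop_chunks_encoding(encodings):
    """Return copies of per-variable encoding dicts without 'chunks',
    leaving the input dicts untouched."""
    return {name: {k: v for k, v in enc.items() if k != 'chunks'}
            for name, enc in encodings.items()}

encodings = {'foo': {'chunks': (63170,), 'dtype': 'float64'}}
print(drop_chunks_encoding(encodings))  # {'foo': {'dtype': 'float64'}}
```

With a real dataset this corresponds to `del ds2.foo.encoding['chunks']` before calling `to_zarr` again.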
I tried this round-tripping experiment with my monkey patches, and it works for a sequence of read/write/read/write... without any intervention in between. This only works for default zarr-chunking, however, since the patch to `xr.backends.zarr._determine_zarr_chunks` overrides whatever chunks are on the originating dataset.
Curious: is there any downside in xarray to using datasets with inconsistent chunks? I take it this is a supported configuration, since xarray allows it to happen and only raises that error from `ds.chunks`, which seems to be just a convenience property for viewing chunks across a whole dataset that happens to have consistent chunks?
One other thing to add: it might be nice to have an option to allow zarr auto-chunking even when `chunks!={}`. I don't know how sensitive zarr performance is to chunk sizes, but it would be nice to have some form of sane auto-chunking available when you don't want to bother choosing them manually.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,342531772