Comment 636491064 on pydata/xarray#4113 (2020-05-31)
https://github.com/pydata/xarray/issues/4113#issuecomment-636491064

Thanks for the answer.

I tried some experiments with chunked reading with dask, but I have some observations I don't fully understand:

1) Memory is still being loaded

Reading with chunks loads more memory than reading without chunks, though not an amount equal to the size of the array (about 300 MB for an 800 MB array in the example below). And by the way, stacking also increases memory usage a bit.

But I think this may be normal, perhaps just the dask machinery being loaded into memory, and that I will see the full benefit when working on bigger data. Am I right?
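
For what it's worth, a rough back-of-the-envelope check (my own estimate, not measured) suggests the task graph itself could account for a lot of this, given how small the chunks are:

```python
# chunks=dict(x=1, y=1) on a (512, 2048, 100) array gives one chunk per
# (x, y) pair, so the dask graph has to track over a million tiny chunks.
shape = (512, 2048, 100)
n_chunks = shape[0] * shape[1]
chunk_mb = shape[2] * 8 / 1e6  # each chunk holds 100 float64 values
print(f"{n_chunks} chunks of {chunk_mb:.4f} MB each")
# 1048576 chunks of 0.0008 MB each
```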

2) Stacking breaks the chunks

When stacking a chunked array, only the chunks along the first stacked dimension are preserved; the chunks along the second stacked dimension seem to be merged.

I think this has something to do with the very nature of indexes, but I'm not sure.
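
A small reproduction of what I mean (a sketch; the exact chunk tuples shown in the comments are what I'd expect from the merging described above, not copied output):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((4, 4, 2)), dims=("x", "y", "z")).chunk(dict(x=1, y=1))
print(da.chunks)
# ((1, 1, 1, 1), (1, 1, 1, 1), (2,))

stacked = da.stack(px=("x", "y"))
print(stacked.chunks)
# Expected, per observation 2): the x chunks survive but the y chunks are
# merged into them, i.e. something like ((2,), (4, 4, 4, 4)) with px last.
```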

3) Rechunking loads the data into memory

A workaround for 2) could have been to re-chunk as desired after stacking, but that fully loads the data.

Example

(Consider the following as a replacement for the main() function of the script in the original post.)

```python
import numpy as np
import xarray as xr

# mb() and print_ram_state() are helpers defined in the original post's script.

def main():
    fname = "da.nc"
    shape = 512, 2048, 100  # 800 MB of float64

    # Write a random array to disk.
    xr.DataArray(
        np.random.randn(*shape),
        dims=("x", "y", "z"),
    ).to_netcdf(fname)
    print_ram_state()

    # Re-open lazily, with one chunk per (x, y) pair.
    da = xr.open_dataarray(fname, chunks=dict(x=1, y=1))
    print(f" da: {mb(da.nbytes)} MB")
    print_ram_state()

    # Stack x and y into a single "px" dimension.
    mda = da.stack(px=("x", "y"))
    print_ram_state()

    # Re-chunk along the stacked dimension.
    mda = mda.chunk(dict(px=1))
    print_ram_state()
```

which outputs something like:

```
RAM: 94.52 MB
 da: 800.0 MB
RAM: 398.83 MB
RAM: 589.05 MB
RAM: 1409.11 MB
```

Chunk layouts displayed thanks to the Jupyter notebook visualization (images in the original comment): one before stacking, one after stacking.

A workaround could have been to save the data already stacked, but "MultiIndex cannot yet be serialized to netCDF".
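
One thing I might try instead (an untested sketch on my side, using xarray's reset_index/set_index): drop the MultiIndex down to plain coordinates before writing, then rebuild it after reading.

```python
# Untested sketch: the MultiIndex itself cannot be written to netCDF, but its
# levels can, as plain coordinates along the stacked dimension.
stacked = da.stack(px=("x", "y"))
stacked.reset_index("px").to_netcdf("stacked.nc")

# Rebuild the MultiIndex after reading the file back.
restored = xr.open_dataarray("stacked.nc").set_index(px=("x", "y"))
```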

Maybe there is another workaround?

(Sorry for the long post)
