html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/2329#issuecomment-415450958,https://api.github.com/repos/pydata/xarray/issues/2329,415450958,MDEyOklzc3VlQ29tbWVudDQxNTQ1MDk1OA==,12278765,2018-08-23T15:02:57Z,2018-08-23T15:02:57Z,NONE,"It seems that I managed to get something working as it should.
I first load my monthly grib files with iris, convert to xarray, then write to zarr. This uses all the CPU cores, but loads the full array into memory. Since the individual arrays are relatively small, that is not an issue.
Then I load the monthly zarr stores with xarray, concatenate them with `auto_combine` and write to a big zarr. The memory usage peaked just above 17GB with 32 CPU threads. The array and chunk dimensions are:
`(time, latitude, longitude) float16 dask.array<shape=(113969, 721, 1440), chunksize=(113969, 20, 20)>`
I guess that reducing the chunk size will lower the memory usage.

Using that big zarr storage, plotting a map of the mean values along the time axis takes around 15min, uses all the cores and around 24GB of RAM.
The strange part is: I think I tried that before and it was not working[...](https://boyter.org/static/books/Cf7eHZ1W4AEeZJA.jpg)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,345715825
https://github.com/pydata/xarray/issues/2329#issuecomment-415005804,https://api.github.com/repos/pydata/xarray/issues/2329,415005804,MDEyOklzc3VlQ29tbWVudDQxNTAwNTgwNA==,12278765,2018-08-22T11:51:36Z,2018-08-22T11:51:36Z,NONE,"The dask task graph seems right (mean on the time dimension, lower number of chunks to make the visualisation practical):

![mean](https://user-images.githubusercontent.com/12278765/44461604-70ab6f80-a609-11e8-8213-4e6732b0a23f.png)

If I understand correctly, the 'getter' tasks do the actual reading of the file, but in practice they do not seem to run in parallel.

As for the zarr writing part, I do not know how to check the task graph.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,345715825
https://github.com/pydata/xarray/issues/2329#issuecomment-409565674,https://api.github.com/repos/pydata/xarray/issues/2329,409565674,MDEyOklzc3VlQ29tbWVudDQwOTU2NTY3NA==,12278765,2018-08-01T12:58:31Z,2018-08-01T12:58:31Z,NONE,"I ran a comparison of the impact of chunk sizes with a profiler:

```python
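import xarray as xr
from pyinstrument import Profiler  # assumed: output_text(unicode=..., color=...) matches pyinstrument

nc_path = 'era5_precip.nc'  # assumed path; file name taken from elsewhere in this thread

profiler = Profiler()
# Profile the mean over `time` for three different chunking layouts.
for chunks in [{'time': 30}, {'lat': 30}, {'lon': 30}]:
    print(chunks)
    profiler.start()
    with xr.open_dataset(nc_path, chunks=chunks) as ds:
        print(ds.mean(dim='time').load())
    profiler.stop()
    print(profiler.output_text(unicode=True, color=True))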
```

I am not sure if the profiler results are useful:

```
    {'time': 30}
    <xarray.Dataset>
    Dimensions:  (lat: 721, lon: 1440)
    Coordinates:
      * lon      (lon) float32 0.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 2.25 2.5 ...
      * lat      (lat) float32 90.0 89.75 89.5 89.25 89.0 88.75 88.5 88.25 88.0 ...
    Data variables:
        mtpr     (lat, lon) float32 8.30159e-06 8.30159e-06 8.30159e-06 ...
    5652.770 compare_chunks  read_grib.py:281
    └─ 5652.613 load  xarray/core/dataset.py:466
       └─ 5652.613 compute  dask/base.py:349
          └─ 5652.404 get  dask/threaded.py:33
             └─ 5652.400 get_async  dask/local.py:389
                └─ 5629.663 queue_get  dask/local.py:127
                   └─ 5629.663 get  Queue.py:150
                      └─ 5629.656 wait  threading.py:309
```

In the case of chunks on `lat` or `lon` only, I get a `MemoryError`.
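
A rough size estimate may explain the `MemoryError`. The sketch below assumes the file has the shape `(119330, 721, 1440)` and dtype `float32` reported for `mtpr` elsewhere in this thread, and that a dimension missing from `chunks` is kept as one full-length chunk. `chunk_nbytes` is a hypothetical helper, not an xarray function.

```python
from math import prod

# Assumed shape and dtype, taken from the `mtpr` repr in this thread.
SHAPE = {'time': 119330, 'lat': 721, 'lon': 1440}
ITEMSIZE = 4  # bytes per float32 value


def chunk_nbytes(chunks):
    # Dimensions missing from `chunks` are treated as one full-length chunk.
    return ITEMSIZE * prod(
        min(chunks.get(dim, size), size) for dim, size in SHAPE.items()
    )


for chunks in [{'time': 30}, {'lat': 30}, {'lon': 30}]:
    print(chunks, round(chunk_nbytes(chunks) / 1e9, 2), 'GB per chunk')
```

Under these assumptions a `time` chunk is on the order of 0.1 GB, while a single `lat` or `lon` chunk would be roughly 20 GB or 10 GB respectively.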

I don't know if this helps, but it would be great to have a solution or workaround for that. Surely I am not the only one working with datasets of that size? What would be the best practice in my case?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,345715825
https://github.com/pydata/xarray/issues/2329#issuecomment-409276937,https://api.github.com/repos/pydata/xarray/issues/2329,409276937,MDEyOklzc3VlQ29tbWVudDQwOTI3NjkzNw==,12278765,2018-07-31T16:08:33Z,2018-07-31T16:08:33Z,NONE,"I did some tests with my big netcdf. The chunking indeed makes a difference.

```
chunks = {'time': 'auto', 'lat': 'auto', 'lon': 'auto'}
ds = xr.open_dataset('era5_precip.nc', chunks=chunks)
ds.sum().load()

real	161m37.119s
user	33m9.720s
sys	63m47.696s

chunks = {'time': 1}
ds = xr.open_dataset('era5_precip.nc', chunks=chunks)
print(ds.sum().load())

real	109m55.839s
user	303m40.665s
sys	451m30.788s
```

I'll do some more tests with the calculation of the mean on the time axis, which might be more representative of what I want to do.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,345715825
https://github.com/pydata/xarray/issues/2329#issuecomment-409167123,https://api.github.com/repos/pydata/xarray/issues/2329,409167123,MDEyOklzc3VlQ29tbWVudDQwOTE2NzEyMw==,12278765,2018-07-31T10:04:10Z,2018-07-31T10:04:41Z,NONE,"@fmaussion `ds.mtpr`:
```
<xarray.DataArray 'mtpr' (time: 119330, lat: 721, lon: 1440)>
dask.array<shape=(119330, 721, 1440), dtype=float32, chunksize=(119330, 16, 16)>
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01T06:00:00 2000-01-01T06:00:00 ...
  * lon      (lon) float32 0.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 2.25 2.5 ...
  * lat      (lat) float32 90.0 89.75 89.5 89.25 89.0 88.75 88.5 88.25 88.0 ...
Attributes:
    long_name:  Mean total precipitation rate
    units:      kg m**-2 s**-1
    code:       55
    table:      235
```
`ds.mtpr.encoding`:
```
{'complevel': 0, 'shuffle': False, 'dtype': dtype('float32'), 'contiguous': False,
 'zlib': False, 'source': u'era5_precip.nc', 'fletcher32': False,
 'original_shape': (119330, 721, 1440), 'chunksizes': (1, 721, 1440)}
```
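
The `chunksizes` entry above suggests one possible reason for the endless disk reading. As a rough sketch, assuming each on-disk chunk has to be read whole to extract any part of it (an assumption about the storage layer, not something confirmed here), a dask chunk that spans the whole time axis touches every on-disk chunk in the file. The variable names are illustrative only.

```python
from math import ceil, prod

shape = (119330, 721, 1440)     # (time, lat, lon), from the repr above
disk_chunks = (1, 721, 1440)    # 'chunksizes' from the encoding
dask_chunks = (119330, 16, 16)  # chunksize from the dask array repr
itemsize = 4                    # bytes per float32 value

# Number of dask chunks needed to cover the whole array.
n_dask = prod(ceil(s / c) for s, c in zip(shape, dask_chunks))

# On-disk chunks overlapped by the first dask chunk.
touched = prod(ceil(d / c) for d, c in zip(dask_chunks, disk_chunks))

useful_bytes = prod(dask_chunks) * itemsize
read_bytes = touched * prod(disk_chunks) * itemsize

print('dask chunks:', n_dask)
print('on-disk chunks touched per dask chunk:', touched)
print('useful GB per dask chunk:', round(useful_bytes / 1e9, 2))
print('GB touched per dask chunk:', round(read_bytes / 1e9, 2))
```

If the assumption holds, each of the 4140 dask chunks would need about 0.12 GB of data but touch roughly 496 GB on disk, i.e. the whole file.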
Those chunksizes are the opposite of what I was expecting...","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,345715825
https://github.com/pydata/xarray/issues/2329#issuecomment-409157118,https://api.github.com/repos/pydata/xarray/issues/2329,409157118,MDEyOklzc3VlQ29tbWVudDQwOTE1NzExOA==,12278765,2018-07-31T09:28:48Z,2018-07-31T09:28:48Z,NONE,"@rabernat I tried to do the sum. I have the same issue. The process just seems to read the disk endlessly, without even writing to the RAM:

![screenshot from 2018-07-31 10-05-04](https://user-images.githubusercontent.com/12278765/43450682-28db1f30-94ab-11e8-87f5-2d8cda915a04.png)

I tried to lower the chunk size, but it doesn't seem to change anything. Without chunking, I get a `MemoryError`, as expected.

I plan to do time-series analysis, so I thought that having contiguous chunks in time would be more efficient. The netcdf was created with `cdo -f nc4 mergetime`, so it should have mostly the same structure, I guess.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,345715825
https://github.com/pydata/xarray/issues/2329#issuecomment-408894639,https://api.github.com/repos/pydata/xarray/issues/2329,408894639,MDEyOklzc3VlQ29tbWVudDQwODg5NDYzOQ==,12278765,2018-07-30T15:01:27Z,2018-07-30T15:10:43Z,NONE,"@rabernat Thanks for your answer.

I have one big NetCDF of ~500GB.
What I have changed:
- Run in a Jupyter notebook with distributed to get the dashboard
- Change the chunks to `{'lat': 90, 'lon': 90}`. That should be around 1GB per chunk.
- Chunk from the beginning with `ds = xr.open_dataset('my_netcdf.nc', chunks=chunks)`
- About the LZ4 compression, I did some tests with a 1.5GB extract and the writing time was just 2% slower than uncompressed.
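
As a sanity check on the chunk size, here is a small sketch that estimates the size of one chunk and the square `lat`/`lon` edge that would fit a target size. It assumes `float32` data and 119330 time steps kept in a single chunk (the shape reported elsewhere in this thread); the function names are made up for illustration.

```python
from math import isqrt

N_TIME = 119330  # assumed length of the unchunked time axis
ITEMSIZE = 4     # bytes per float32 value


def chunk_gb(edge):
    # Size in GB of one (N_TIME, edge, edge) chunk.
    return N_TIME * edge * edge * ITEMSIZE / 1e9


def max_edge(target_gb):
    # Largest square lat/lon edge whose chunk stays within target_gb.
    return isqrt(int(target_gb * 1e9) // (N_TIME * ITEMSIZE))


print('90 x 90 chunk:', round(chunk_gb(90), 2), 'GB')
edge = max_edge(1.0)
print('edge for about 1 GB:', edge, '->', round(chunk_gb(edge), 2), 'GB')
```

With these assumptions a 90 x 90 chunk comes out near 3.9 GB rather than 1 GB, and an edge of about 45 would be closer to 1 GB.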

Now when I run `to_zarr()`, it creates a zarr store (~40kB) and all the workers start to read the disk, but they don't write anything.

The Dask dashboard looks like this:

![screenshot from 2018-07-30 of the Dask dashboard](https://user-images.githubusercontent.com/12278765/43405382-98e24ad2-9411-11e8-9526-a39e6204056a.png)

After a while I get warnings:

> distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 1.55 GB -- Worker memory limit: 2.08 GB

Is this the expected behaviour? I was expecting each worker to read a chunk and then write it to zarr in a streaming fashion. This does not seem to be the case.

","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,345715825