issue_comments


16 rows where issue = 345715825 sorted by updated_at descending


Issue: Out-of-core processing with dask not working properly? (345715825)

Commenters: lrntct (7), fmaussion (5), rabernat (3), shoyer (1). Author associations: MEMBER (9), NONE (7).
lrntct (NONE) commented on 2018-08-23T15:02:57Z
https://github.com/pydata/xarray/issues/2329#issuecomment-415450958

It seems that I managed to get something working as it should. I first load my monthly grib files with iris, convert to xarray, then write to zarr. This uses all the CPU cores but loads the full array into memory. Since the individual arrays are relatively small, that is not an issue. Then I load the monthly zarr stores with xarray, concatenate them with auto_combine and write to a big zarr. The memory usage peaked just above 17GB with 32 CPU threads. The array and chunk dimensions are: (time, latitude, longitude) float16 dask.array<shape=(113969, 721, 1440), chunksize=(113969, 20, 20)>. I guess that reducing the chunk size would lower the memory usage.

Using that big zarr storage, plotting a map of the mean values along the time axis takes around 15min, uses all the cores and around 24GB of RAM. The strange part is: I think I tried that before and it was not working...
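For reference, a minimal sketch of the workflow described above; the file patterns, the variable name, and the iris-to-xarray conversion call are illustrative assumptions, not the exact code used:

```python
import glob
import iris                      # reads the monthly GRIB files
import xarray as xr

# Step 1: one zarr store per monthly GRIB file (each month fits in memory).
for grib_path in sorted(glob.glob('era5_??????.grib')):      # hypothetical file pattern
    cube = iris.load_cube(grib_path)
    da = xr.DataArray.from_iris(cube)
    da.to_dataset(name='mtpr').to_zarr(grib_path.replace('.grib', '.zarr'))

# Step 2: re-open the monthly stores lazily, concatenate, and write one big store.
monthly = [xr.open_zarr(p) for p in sorted(glob.glob('era5_??????.zarr'))]
big = xr.auto_combine(monthly)
big = big.chunk({'time': -1, 'latitude': 20, 'longitude': 20})  # chunks reported above: (113969, 20, 20)
big.to_zarr('era5_precip_all.zarr')
```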

shoyer (MEMBER) commented on 2018-08-22T20:58:36Z
https://github.com/pydata/xarray/issues/2329#issuecomment-415177535

This might be worth testing with the changes from https://github.com/pydata/xarray/pull/2261, which refactors xarray's IO handling.

lrntct (NONE) commented on 2018-08-22T11:51:36Z
https://github.com/pydata/xarray/issues/2329#issuecomment-415005804

The dask task graph seems right (mean over the time dimension, with a reduced number of chunks to keep the visualisation practical):

If I understand correctly, the 'getter' tasks do the actual reading of the file, but in practice they do not seem to run in parallel.

As for the zarr writing part, I do not know how to check the task graph.
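For reference, one way to look at both graphs is sketched below; it assumes a recent-enough xarray where to_zarr accepts compute=False, and requires graphviz for the rendering:

```python
import xarray as xr

ds = xr.open_dataset('era5_precip.nc', chunks={'time': 1000})   # illustrative chunking

# Task graph of the reduction (what is visualised above).
mean = ds['mtpr'].mean(dim='time')
mean.data.visualize(filename='mean_graph.png')

# Task graph of the zarr write: ask to_zarr for a lazy result instead of computing it,
# then render the delayed object's graph.
delayed_write = ds.to_zarr('era5_precip.zarr', compute=False)
delayed_write.visualize(filename='write_graph.png')
```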

lrntct (NONE) commented on 2018-08-01T12:58:31Z
https://github.com/pydata/xarray/issues/2329#issuecomment-409565674

I ran a comparison of the impact of chunk sizes with a profiler:

```python
profiler = Profiler()
for chunks in [{'time': 30}, {'lat': 30}, {'lon': 30}]:
    print(chunks)
    profiler.start()
    with xr.open_dataset(nc_path, chunks=chunks) as ds:
        print(ds.mean(dim='time').load())
    profiler.stop()
    print(profiler.output_text(unicode=True, color=True))
```

I am not sure if the profiler results are useful:

```
{'time': 30}
<xarray.Dataset>
Dimensions:  (lat: 721, lon: 1440)
Coordinates:
  * lon      (lon) float32 0.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 2.25 2.5 ...
  * lat      (lat) float32 90.0 89.75 89.5 89.25 89.0 88.75 88.5 88.25 88.0 ...
Data variables:
    mtpr     (lat, lon) float32 8.30159e-06 8.30159e-06 8.30159e-06 ...

5652.770 compare_chunks  read_grib.py:281
└─ 5652.613 load  xarray/core/dataset.py:466
   └─ 5652.613 compute  dask/base.py:349
      └─ 5652.404 get  dask/threaded.py:33
         └─ 5652.400 get_async  dask/local.py:389
            └─ 5629.663 queue_get  dask/local.py:127
               └─ 5629.663 get  Queue.py:150
                  └─ 5629.656 wait  threading.py:309
```

With chunks on lat or lon only, I get a MemoryError.

I don't know if this helps, but it would be great to have a solution or workaround for this. Surely I am not the only one working with datasets of this size? What would be the best practice in my case?

lrntct (NONE) commented on 2018-07-31T16:08:33Z
https://github.com/pydata/xarray/issues/2329#issuecomment-409276937

I did some tests with my big netcdf. The chunking indeed makes a difference.

```
chunks = {'time': 'auto', 'lat': 'auto', 'lon': 'auto'}
ds = xr.open_dataset('era5_precip.nc', chunks=chunks)
ds.sum().load()

real    161m37.119s
user    33m9.720s
sys     63m47.696s

chunks = {'time': 1}
ds = xr.open_dataset('era5_precip.nc', chunks=chunks)
print(ds.sum().load())

real    109m55.839s
user    303m40.665s
sys     451m30.788s
```

I'll do some more tests with the calculation of the mean on the time axis, it might be more representative of what I want to do.

fmaussion (MEMBER) commented on 2018-07-31T14:20:06Z
https://github.com/pydata/xarray/issues/2329#issuecomment-409238042

I updated my example above to show that the chunking over the last dimension is ridiculously slow.

fmaussion (MEMBER) commented on 2018-07-31T10:25:16Z, edited 2018-07-31T14:18:29Z
https://github.com/pydata/xarray/issues/2329#issuecomment-409172635

Sorry for the confusion, I had an obvious mistake in my timing experiment above (forgot to do the actual computations...).

The dimension order does make a difference:

```python
import dask as da
import xarray as xr

d = xr.DataArray(da.array.zeros((1000, 721, 1440), chunks=(10, 721, 1440)),
                 dims=('z', 'y', 'x'))
d.to_netcdf('da.nc')  # 8.3 Gb

with xr.open_dataarray('da.nc', chunks={'z': 10}) as d:
    %timeit d.sum().load()
# 3.94 s ± 95.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

with xr.open_dataarray('da.nc', chunks={'y': 10}) as d:
    %timeit d.sum().load()
# 4.15 s ± 316 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

with xr.open_dataarray('da.nc', chunks={'x': 10}) as d:
    %timeit d.sum().load()
# 1min 54s ± 1.43 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

with xr.open_dataarray('da.nc', chunks={'y': 10, 'x': 10}) as d:
    %timeit d.sum().load()
# 2min 23s ± 215 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

fmaussion (MEMBER) commented on 2018-07-31T10:09:36Z, edited 2018-07-31T13:21:34Z
https://github.com/pydata/xarray/issues/2329#issuecomment-409168605

> Those chunksizes are the opposite of what I was expecting...

The chunksizes in encoding are ignored in your case; dask still uses the chunks you provided.

Can you still try to chunk along one dimension only? i.e. chunks={'time': 200}
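That suggestion would look something like this (the file name is taken from elsewhere in the thread):

```python
import xarray as xr

# Chunk along the time dimension only; each chunk is then 200 full lat/lon slices.
ds = xr.open_dataset('era5_precip.nc', chunks={'time': 200})
print(ds['mtpr'].data.chunksize)   # expected: (200, 721, 1440)
```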

fmaussion (MEMBER) commented on 2018-07-31T09:56:54Z, edited 2018-07-31T10:20:32Z
https://github.com/pydata/xarray/issues/2329#issuecomment-409165114

[EDIT]: forgot the load ...

~~forget my comment about chunks - I thought this would make a difference but it's actually the opposite (to my surprise):~~

fmaussion (MEMBER) commented on 2018-07-31T09:38:37Z, edited 2018-07-31T10:19:37Z
https://github.com/pydata/xarray/issues/2329#issuecomment-409159969

Out of curiosity:
- why do you chunk over lats and lons rather than time? The order of dimensions in your dataarray suggests that chunking over time could be more efficient
- can you show the output of ds.mtpr and ds.mtpr.encoding?

lrntct (NONE) commented on 2018-07-31T10:04:10Z, edited 2018-07-31T10:04:41Z
https://github.com/pydata/xarray/issues/2329#issuecomment-409167123

@fmaussion

ds.mtpr:

```
<xarray.DataArray 'mtpr' (time: 119330, lat: 721, lon: 1440)>
dask.array<shape=(119330, 721, 1440), dtype=float32, chunksize=(119330, 16, 16)>
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01T06:00:00 2000-01-01T06:00:00 ...
  * lon      (lon) float32 0.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 2.25 2.5 ...
  * lat      (lat) float32 90.0 89.75 89.5 89.25 89.0 88.75 88.5 88.25 88.0 ...
Attributes:
    long_name:  Mean total precipitation rate
    units:      kg m**-2 s**-1
    code:       55
    table:      235
```

ds.mtpr.encoding:

```
{'complevel': 0, 'shuffle': False, 'dtype': dtype('float32'), 'contiguous': False,
 'zlib': False, 'source': u'era5_precip.nc', 'fletcher32': False,
 'original_shape': (119330, 721, 1440), 'chunksizes': (1, 721, 1440)}
```

Those chunksizes are the opposite of what I was expecting...

lrntct (NONE) commented on 2018-07-31T09:28:48Z
https://github.com/pydata/xarray/issues/2329#issuecomment-409157118

@rabernat I tried to do the sum. I have the same issue. The process just seems to read the disk endlessly, without even writing to RAM:

I tried to lower the chunk size, but it doesn't seem to change anything. Without chunks, I logically get a MemoryError.

I plan to do time-series analysis, so I thought that having contiguous chunks in time would be more efficient. The netcdf was created with cdo -f nc4 mergetime, so it should have mostly the same structure, I guess.

rabernat (MEMBER) commented on 2018-07-30T16:37:05Z, edited 2018-07-30T16:37:23Z
https://github.com/pydata/xarray/issues/2329#issuecomment-408928221

Can you forget about zarr for a moment and just do a reduction on your dataset? For example:

```python
ds.sum().load()
```

Keep the same chunk arguments you are currently using. This will help us understand if the problem is with reading the files.

Is it your intention to chunk the files contiguously in time? Depending on the underlying structure of the data within the netCDF file, this could amount to a complete transposition of the data, which could be very slow / expensive. This could have some parallels with #2004.
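One way to see what such a rechunking is up against is to compare the on-disk chunking (reported in the variable's encoding, as shown later in this thread) with the requested dask chunks; a sketch, with names taken from the thread:

```python
import xarray as xr

ds = xr.open_dataset('era5_precip.nc')
print(ds['mtpr'].encoding.get('chunksizes'))
# e.g. (1, 721, 1440): one time step per on-disk chunk, so a dask chunk that is
# contiguous in time has to read from every on-disk chunk, which is close to a
# full transposition of the data.
```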

rabernat (MEMBER) commented on 2018-07-30T16:28:31Z
https://github.com/pydata/xarray/issues/2329#issuecomment-408925488

> I was somehow expecting that each worker will read a chunk and then write it to zarr, streamlined.

Yes, this is what we want!

lrntct (NONE) commented on 2018-07-30T15:01:27Z, edited 2018-07-30T15:10:43Z
https://github.com/pydata/xarray/issues/2329#issuecomment-408894639

@rabernat Thanks for your answer.

I have one big NetCDF of ~500 GB. What I have changed:
- Run in a Jupyter notebook with distributed to get the dashboard
- Changed the chunks to {'lat': 90, 'lon': 90}. That should be around 1 GB per chunk.
- Chunk from the beginning with ds = xr.open_dataset('my_netcdf.nc', chunks=chunks)
- About the LZ4 compression, I did some tests with a 1.5 GB extract and the writing time was just 2% slower than uncompressed.

Now when I run to_zarr(), it creates a zarr store (~40kB) and all the workers start to read the disk, but they don't write anything.

The Dask dashboard looks like this:

After a while I get warnings:

```
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 1.55 GB -- Worker memory limit: 2.08 GB
```

Is this the expected behaviour? I was somehow expecting that each worker will read a chunk and then write it to zarr, streamlined. This does not seem to be the case.
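A rough sketch of the setup described in this comment; the store and file names, the compressor object, and the compression level are assumptions (the exact code was not posted):

```python
import xarray as xr
from dask.distributed import Client
from numcodecs import Blosc

client = Client()   # distributed scheduler, gives access to the dashboard

chunks = {'lat': 90, 'lon': 90}                      # ~1 GB per chunk as described
ds = xr.open_dataset('my_netcdf.nc', chunks=chunks)  # chunk from the beginning

compressor = Blosc(cname='lz4', clevel=5)            # LZ4 compression; clevel is a guess
encoding = {var: {'compressor': compressor} for var in ds.data_vars}
ds.to_zarr('my_store.zarr', encoding=encoding)
```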

rabernat (MEMBER) commented on 2018-07-30T13:20:59Z
https://github.com/pydata/xarray/issues/2329#issuecomment-408860643

@lrntct - this sounds like a reasonable way to use zarr. We routinely do this sort of transcoding and it works reasonably well. Unfortunately something clearly isn't working right in your case. These things can be hard to debug, but we will try to help you.

You might want to start by reviewing the guide I wrote for Pangeo on preparing zarr datasets.

It would also be good to see a bit more detail. You posted a function netcdf2zarr that converts a single netcdf file to a single zarr file. How are you invoking that function? Are you trying to create one zarr store for each netCDF file? How many netCDF files are there? If there are many (e.g. one per timestep), my recommendation is to create only one zarr store for the whole dataset. Open the netcdf files using open_mfdataset.

If instead you have just one big netCDF file as in the example you posted above, I think I see your problem: you are calling .chunk() after calling open_dataset(), rather than calling open_dataset(nc_path, chunks=chunks). This probably means that you are loading the whole dataset in a single task and then re-chunking. That could be the source of the inefficiency.

More ideas:
- explicitly specify the chunks (rather than using 'auto')
- eliminate the negative number in your chunk sizes
- make sure you really need clevel=9

Another useful piece of advice would be to use the dask distributed dashboard to monitor what is happening under the hood. You can do this by running

```python
from dask.distributed import Client
client = Client()
client
```

In a notebook, this should provide you a link to the scheduler dashboard. Once you call ds.to_zarr(), watch the task stream in the dashboard to see what is happening.

Hopefully these ideas can help you move forward.
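Pulling that advice together, a minimal sketch of the recommended pattern (file patterns and chunk sizes are illustrative):

```python
import xarray as xr

# Many netCDF files (e.g. one per timestep): open them lazily as one dataset ...
ds = xr.open_mfdataset('era5_precip_*.nc', chunks={'time': 200})

# ... or, for a single big file, pass chunks directly to open_dataset instead of
# calling .chunk() afterwards, so the data is never loaded in one task:
# ds = xr.open_dataset('era5_precip.nc', chunks={'time': 200})

# Then write the whole dataset to a single zarr store.
ds.to_zarr('era5_precip.zarr')
```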



CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);