issue_comments: 306814837

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/1440#issuecomment-306814837 https://api.github.com/repos/pydata/xarray/issues/1440 306814837 MDEyOklzc3VlQ29tbWVudDMwNjgxNDgzNw== 9655353 2017-06-07T14:37:14Z 2017-06-07T14:37:14Z NONE

We had a similar issue some time ago. We use xr.open_mfdataset to open long time series of data, where each time slice is a single file. In this case each file becomes a single dask chunk, which is appropriate for most data we have to work with (ESA CCI datasets).

We ran into a problem, however, with a few datasets that were very heavily compressed: a single file would fit in memory once decompressed, but several of them would not, at least on a consumer-grade laptop. The machine would quickly run out of memory when working with the opened dataset.

As we have to be able to open all ESA CCI datasets 'automatically', manually specifying the chunk sizes was not an option, so we explored a few ways to do this. Aligning the dask chunk sizes with the NetCDF chunking was not a great idea, for the reason shoyer mentions above: the chunks for some datasets would be too small, and the bottleneck would move from memory consumption to the number of read/write operations.

We eventually figured out (with help from shoyer :)) that the chunks should be small enough to fit in memory on an average user's laptop, yet as big as possible, so that as many NetCDF chunks as possible fall neatly inside each dask chunk. The shape of the dask chunk matters for the same reason. We figured it's a good guess to divide both the lat and lon dimensions by the same divisor, as that's also how NetCDF files are often chunked.

So, we open the first file, determine its 'uncompressed' size, and then figure out whether we should chunk it as 1, 2x2, 3x3, etc. It's far from a perfect solution, but it works in our case. Here's how we have implemented this: https://github.com/CCI-Tools/cate-core/blob/master/cate/core/ds.py#L506
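
To illustrate the idea, here is a minimal sketch of the divisor search. It is not the code behind the link above. The function name `estimate_chunks`, the `bytes_per_cell` parameter (uncompressed bytes per lat/lon cell for one file, summed over all variables), and the memory budget are assumptions made for this example.

```python
import math


def estimate_chunks(lat_size, lon_size, bytes_per_cell, max_chunk_bytes):
    """Find the smallest n such that an n x n split of one file fits the budget.

    Hypothetical helper: bytes_per_cell is the uncompressed size of a single
    lat/lon cell in one file, summed over all variables in that file.
    """
    for n in range(1, min(lat_size, lon_size) + 1):
        # Divide both dimensions by the same divisor, rounding up so the
        # chunks still cover the whole grid.
        lat_chunk = math.ceil(lat_size / n)
        lon_chunk = math.ceil(lon_size / n)
        if lat_chunk * lon_chunk * bytes_per_cell <= max_chunk_bytes:
            return n, {"lat": lat_chunk, "lon": lon_chunk}
    raise ValueError("no n x n split fits into max_chunk_bytes")


# Example: a 3600 x 7200 grid with ten float32 variables (40 bytes per cell)
# and an assumed budget of 250 MiB per chunk.
divisor, chunks = estimate_chunks(3600, 7200, 40, 250 * 1024 ** 2)
print(divisor, chunks)
```

The resulting dictionary could then be passed as the `chunks` argument of `xr.open_mfdataset`, assuming the dimensions really are named `lat` and `lon` in the files.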

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  233350060