issues: 225774140
| field | value |
| --- | --- |
| id | 225774140 |
| node_id | MDU6SXNzdWUyMjU3NzQxNDA= |
| number | 1396 |
| title | selecting a point from an mfdataset |
| user | 1197350 |
| state | closed |
| locked | 0 |
| assignee | |
| milestone | |
| comments | 12 |
| created_at | 2017-05-02T18:02:50Z |
| updated_at | 2019-01-13T06:32:45Z |
| closed_at | 2019-01-13T06:32:45Z |
| author_association | MEMBER |
| active_lock_reason | |
| draft | |
| pull_request | |
| body | see below |
| reactions | see below |
| performed_via_github_app | |
| state_reason | completed |
| repo | 13221727 |
| type | issue |

body:

Sorry to be opening so many vague performance issues. I am really having a hard time with my current dataset, which is exposing certain limitations of xarray and dask in a way none of my previous work has done.

I have a directory full of netCDF4 files. There are 1754 files, each 8.1 GB in size, each representing a single model timestep, so there is ~14 TB of data in total. (In addition to the time-dependent output, there is a single file with information about the grid.)

Imagine I want to extract a timeseries from a single point (indexed by its integer position along each non-time dimension). With the low-level netCDF4 library, I could loop over every file, open it, read the one value I need, and close it again. I could do the same sort of loop using xarray.
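(The code blocks in the original body were dropped in this export; below is a minimal sketch of such a per-file loop, in which the variable name `THETA`, the file pattern, and the point indices `k`, `j`, `i` are all assumptions.)

```python
import glob

import xarray as xr

# Hypothetical layout: one netCDF file per model timestep.
all_files = sorted(glob.glob("output/*.nc"))

# Hypothetical point indices (vertical level and horizontal position).
k, j, i = 40, 100, 200

series = []
for fname in all_files:
    # Open one file, read the single value, close the file again.
    with xr.open_dataset(fname) as ds:
        series.append(float(ds["THETA"][0, k, j, i]))
```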
Of course, what I really want is to avoid a loop and deal with the whole dataset as a single self-contained object.
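(Per the issue title, the dataset was presumably assembled with `open_mfdataset`; a minimal sketch, again with an assumed file pattern:)

```python
import xarray as xr

# Open all 1754 files as one lazy, dask-backed dataset,
# concatenated along the time dimension.
ds = xr.open_mfdataset("output/*.nc")
```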
Now, to extract the same timeseries, I would like to say:
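(The expression itself was also lost in this export; presumably it was a lazy pointwise selection along these lines, continuing the sketch above, with `THETA`, `k`, `j`, `i` still hypothetical.)

```python
# Lazily pick out one point for every timestep...
ts = ds["THETA"][:, k, j, i]

# ...and only now trigger the actual reads.
timeseries = ts.load()
```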
I monitor what happens under the hood when I call this, using netdata and the dask.distributed dashboard, with only a single process and thread. First, all the files are opened (see #1394). Then they start getting read. Each read takes between 10 and 30 seconds, and memory usage climbs steadily. My impression is that the entire dataset is being read into memory for concatenation. (I have dumped out the dask graph in case anyone can make sense of it.) I have never let this calculation complete, as it looks like it would eat up all the memory on my system, and it is extremely slow in any case.

To me, this looks like a failure of lazy indexing. I naively expected that the underlying file access would work much like my loop above, perhaps even in parallel. Can anyone shed some light on what might be going wrong?
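(For reference, a sketch of how the chunk structure and the dask graph mentioned above can be inspected; this is illustrative, not necessarily what was run.)

```python
# One dask chunk per input file is expected here; anything coarser
# would force reading far more data than the single point needs.
print(ds["THETA"].chunks)

# Render the task graph behind the point selection (requires graphviz).
ts.data.visualize(filename="point-selection-graph.png")
```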
{ "url": "https://api.github.com/repos/pydata/xarray/issues/1396/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |