html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/2501#issuecomment-510217080,https://api.github.com/repos/pydata/xarray/issues/2501,510217080,MDEyOklzc3VlQ29tbWVudDUxMDIxNzA4MA==,1312546,2019-07-10T20:30:41Z,2019-07-10T20:30:41Z,MEMBER,"Yep, that’s my suspicion as well. I’m still plugging away at it. Currently the pausing logic isn’t quite working well.

> On Jul 10, 2019, at 12:10, Ryan Abernathey wrote:
>
> I believe that the memory issue is basically the same as dask/distributed#2602.
>
> The graphs look like: read --> rechunk --> write.
>
> Reading and rechunking increase memory consumption. Writing relieves it. In Rich's case, the workers just load too much data before they write it. Eventually they run out of memory.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub, or mute the thread.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-510167911,https://api.github.com/repos/pydata/xarray/issues/2501,510167911,MDEyOklzc3VlQ29tbWVudDUxMDE2NzkxMQ==,1312546,2019-07-10T18:05:07Z,2019-07-10T18:05:07Z,MEMBER,"Great, thanks. I’ll look into the memory issue when writing. We may already have an issue for it.

> On Jul 10, 2019, at 10:59, Rich Signell wrote:
>
> @TomAugspurger , I sat down here at Scipy with @rabernat and he instantly realized that we needed to drop the feature_id coordinate to prevent open_mfdataset from trying to harmonize that coordinate from all the chunks.
>
> So if I use this code, the open_mfdataset command finishes:
>
> def drop_coords(ds):
>     ds = ds.drop(['reference_time','feature_id'])
>     return ds.reset_coords(drop=True)
>
> and I can then add back in the dropped coordinate values at the end:
>
> dsets = [xr.open_dataset(f) for f in files[:3]]
> ds.coords['feature_id'] = dsets[0].coords['feature_id']
>
> I'm now running into memory issues when I write the zarr data -- but I should raise that as a new issue, right?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub, or mute the thread.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-509346055,https://api.github.com/repos/pydata/xarray/issues/2501,509346055,MDEyOklzc3VlQ29tbWVudDUwOTM0NjA1NQ==,1312546,2019-07-08T18:46:58Z,2019-07-08T18:46:58Z,MEMBER,"@rsignell-usgs very helpful, thanks. I'd noticed that there was a pause after the open_dataset tasks finish, indicating that either the scheduler or (more likely) the client was doing work rather than the cluster.

Most likely @rabernat's guess

> In open_mfdataset, all of the dimensions and coordinates of the individual files have to be checked and verified to be compatible. That is often the source of slow performance with open_mfdataset.

is correct.
Verifying all that now, and looking into if / how that can be done on the workers.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-509307081,https://api.github.com/repos/pydata/xarray/issues/2501,509307081,MDEyOklzc3VlQ29tbWVudDUwOTMwNzA4MQ==,1312546,2019-07-08T16:57:15Z,2019-07-08T16:57:15Z,MEMBER,"I'm looking into it today. Can you clarify

> The memory use kept growing until the process died.

by ""process"" do you mean a dask worker process, or just the main python process executing the `ds = xr.open_mfdataset(...)` code?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-506497180,https://api.github.com/repos/pydata/xarray/issues/2501,506497180,MDEyOklzc3VlQ29tbWVudDUwNjQ5NzE4MA==,1312546,2019-06-27T20:24:26Z,2019-06-27T20:24:26Z,MEMBER,"> The datasets in our cloud datastore are designed explicitly to avoid this problem!

Good to know! FYI, https://github.com/pydata/xarray/issues/2501#issuecomment-506478508 was user error (I can access it, but need to specify the us-east-1 region). Taking a look now.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-506478508,https://api.github.com/repos/pydata/xarray/issues/2501,506478508,MDEyOklzc3VlQ29tbWVudDUwNjQ3ODUwOA==,1312546,2019-06-27T19:25:05Z,2019-06-27T19:25:05Z,MEMBER,"Thanks, will take a look this afternoon. Are there any datasets on https://pangeo-data.github.io/pangeo-datastore/ that would exhibit this poor behavior? I may not have access to the bucket (or I'm misusing `rclone`)

```
2019/06/27 14:23:50 NOTICE: Config file ""/Users/taugspurger/.config/rclone/rclone.conf"" not found - using defaults
2019/06/27 14:23:50 Failed to create file system for ""aws-east:nwm-archive/2009"": didn't find section in config file
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
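
For reference, a minimal sketch of the drop-coordinates workaround quoted in the comments above. The glob pattern and `files` list are hypothetical placeholders (not from the original thread), and `drop_vars` is the current spelling of the `drop` call quoted above; `preprocess=` is the `open_mfdataset` hook used to apply the function to each file before concatenation.

```python
import glob

import xarray as xr

# Hypothetical file list; adjust the pattern to the actual archive layout.
files = sorted(glob.glob("nwm-archive/2009/*.nc"))


def drop_coords(ds):
    # Dropping these coordinates keeps open_mfdataset from checking and
    # aligning them across every file, which was the slow step noted above.
    ds = ds.drop_vars(["reference_time", "feature_id"])
    return ds.reset_coords(drop=True)


# Apply drop_coords to each file's dataset before the datasets are combined.
ds = xr.open_mfdataset(files, preprocess=drop_coords)

# Restore the dropped feature_id values from a single representative file.
template = xr.open_dataset(files[0])
ds.coords["feature_id"] = template.coords["feature_id"]
```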