html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/1521#issuecomment-327338750,https://api.github.com/repos/pydata/xarray/issues/1521,327338750,MDEyOklzc3VlQ29tbWVudDMyNzMzODc1MA==,1217238,2017-09-06T00:20:49Z,2017-09-06T00:20:49Z,MEMBER,"Enjoy your holiday!
On Tue, Sep 5, 2017 at 5:01 PM crusaderky wrote:
> P.S. need to put #1522 as a prerequisite in order not to lose my sanity, as this change is very much hitting on the same nails
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,252541496
https://github.com/pydata/xarray/issues/1521#issuecomment-326867685,https://api.github.com/repos/pydata/xarray/issues/1521,326867685,MDEyOklzc3VlQ29tbWVudDMyNjg2NzY4NQ==,1217238,2017-09-04T05:13:59Z,2017-09-04T05:20:50Z,MEMBER,"The problem is these lines in `combine.py`:
https://github.com/pydata/xarray/blob/78ca20a6ea1a42eb637ae2ef09189f481cfda9a2/xarray/core/combine.py#L158-L168
We compare coordinates for equality in order to decide whether to ignore redundant coordinates or stack them up. This happens when `coords='different'`, which is the default choice; that default was convenient before we supported dask, but it is now a source of performance trouble, as you point out.
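For illustration, here is a toy example of the two modes (hypothetical data standing in for real files):

```python
import numpy as np
import xarray as xr

# Two dask-backed datasets that share a non-index coordinate 'lon'.
ds1 = xr.Dataset({"v": (("t", "x"), np.zeros((1, 2)))},
                 coords={"lon": ("x", [10.0, 20.0])}).chunk()
ds2 = xr.Dataset({"v": (("t", "x"), np.ones((1, 2)))},
                 coords={"lon": ("x", [10.0, 20.0])}).chunk()

# coords='different' (the default) compares 'lon' across datasets,
# which computes the dask-backed coordinate; coords='minimal' reuses
# the first dataset's copy without comparing.
xr.concat([ds1, ds2], dim="t", coords="different")
xr.concat([ds1, ds2], dim="t", coords="minimal")
```
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,252541496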
https://github.com/pydata/xarray/issues/1521#issuecomment-326868217,https://api.github.com/repos/pydata/xarray/issues/1521,326868217,MDEyOklzc3VlQ29tbWVudDMyNjg2ODIxNw==,1217238,2017-09-04T05:18:55Z,2017-09-04T05:18:55Z,MEMBER,"So, to be more precise, I think the problem is that the first variable is computed many times over (once per comparison), inside the `differs` helper function above.
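To make this concrete, here is a minimal sketch of the pattern (a hypothetical simplification, not the actual helper):

```python
import numpy as np
import xarray as xr

# Hypothetical simplification of the comparison loop: `first` is a
# lazy, dask-backed variable, and every equals() call below recomputes
# its values from scratch -- once per dataset being compared against.
datasets = [
    xr.Dataset({"lon": ("x", np.arange(3.0))}).chunk()
    for _ in range(4)
]
first = datasets[0].variables["lon"]
for other in datasets[1:]:
    other.variables["lon"].equals(first)  # recomputes `first` each time
```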
A very simple fix, slightly more conservative than loading every coordinate into memory, is to eagerly compute the coordinates from the first dataset only, e.g., changing `v = datasets[0].variables[vname]` to `v = datasets[0].variables[vname].compute()`. I am slightly nervous about the potential memory overhead of loading *all* coordinates into memory.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,252541496
https://github.com/pydata/xarray/issues/1521#issuecomment-324708622,https://api.github.com/repos/pydata/xarray/issues/1521,324708622,MDEyOklzc3VlQ29tbWVudDMyNDcwODYyMg==,1217238,2017-08-24T17:51:42Z,2017-08-24T17:51:42Z,MEMBER,"> change open_dataset() to always eagerly load the coords to memory, regardless of the chunks parameter. Is there any valid use case where lazy coords are actually desirable?
In principle, coords can have the same shape as data variables. In those cases, you probably want to use the same chunking scheme.
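For example (hypothetical toy sizes), a 2-D coordinate is chunked along with the data variable, and loading it eagerly would defeat the chunking:

```python
import numpy as np
import xarray as xr

# A 2-D coordinate with the same shape as the data variable
# (hypothetical toy sizes); .chunk() splits both the same way.
ds = xr.Dataset(
    {"temp": (("y", "x"), np.zeros((4, 6)))},
    coords={"lon2d": (("y", "x"), np.ones((4, 6)))},
).chunk({"y": 2, "x": 3})
print(ds["lon2d"].chunks)  # ((2, 2), (3, 3)) -- chunked like 'temp'
```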
> An additional, more radical observation is that, very frequently, a user knows in advance that all coords are aligned. In this use case, the user could explicitly request xarray to blindly trust this assumption, and thus skip loading the coords not based on concat_dim in all datasets beyond the first.
@rabernat is interested in this use case. See https://github.com/pydata/xarray/issues/1385 and https://github.com/pydata/xarray/pull/1413 for discussion.
> This also leads to another inefficiency of open_dataset(chunks=...), where you may have your data e.g. shape=(50000, 2**30), chunks=(1, 2**30). If you pass the chunks above to open_dataset, it will break down the coords on the first dim into dask arrays of 1 element - which hardly benefits anybody. Things get worse if the dataset is compressed with zlib or whatever, but only the data vars were chunked at the moment of writing. Am I correct in understanding that the whole coord var will be read from disk 50000 times over?
Yes, I think you're correct here as well. This is also an annoying inefficiency, but the API design is a little tricky.
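For reference, a toy reproduction (hypothetical file name and sizes):

```python
import numpy as np
import xarray as xr

# A non-index coordinate along the chunked dimension gets split into
# one single-element dask chunk per element of that dimension.
ds = xr.Dataset(
    {"v": (("t", "x"), np.zeros((8, 4)))},
    coords={"label": ("t", np.arange(8))},
)
ds.to_netcdf("tmp.nc")  # hypothetical throwaway file
ds2 = xr.open_dataset("tmp.nc", chunks={"t": 1})
print(ds2["label"].chunks)  # ((1, 1, 1, 1, 1, 1, 1, 1),)
```
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,252541496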