issues: 785329941

id: 785329941
node_id: MDU6SXNzdWU3ODUzMjk5NDE=
number: 4804
title: Improve performance of xarray.corr() on big datasets
user: 37177103
state: open
locked: 0
comments: 9
created_at: 2021-01-13T18:18:12Z
updated_at: 2021-06-05T18:23:47Z
author_association: NONE
Empty fields: assignee, milestone, closed_at, active_lock_reason, draft, pull_request, performed_via_github_app, state_reason

Is your feature request related to a problem? Please describe.

I calculated correlation coefficients on datasets of 90-180 GB using xarray and Dask distributed, and the xarray.corr() function performed very poorly. Judging by the Dask dashboard, the whole datasets seem to be loaded from disk several times during the calculation, which, given the size of my datasets, became a major performance bottleneck for some of the calculations.

Describe the solution you'd like

The problem became so annoying that I implemented my own function to calculate the correlation coefficient (thanks @willirath!). It is considerably faster, especially for the big datasets, because it only touches the full data once. I have uploaded a Jupyter notebook with two examples. The first uses unaligned data with NaN values (the case that xarray.corr() covers) and shows that xarray.corr() and my implementation give equivalent results. The second is based on Dask arrays and demonstrates the performance problems stated above, and also that xarray.corr() is not fully lazy (which I assume is actually not very desirable?).
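
To make "only touches the full data once" concrete, here is a minimal sketch of a single-pass Pearson correlation. It is not the code from the notebook: the function name `streaming_corr` and the chunked-list input are assumptions made for illustration, and it uses plain Python so the arithmetic stays visible. The idea is that six running sums are enough, so each chunk is read exactly once and pairs containing a NaN are skipped as they are encountered.

```python
import math


def streaming_corr(chunks):
    """Pearson correlation over an iterable of (x_chunk, y_chunk) pairs.

    Each chunk is visited exactly once. A pair is skipped when either
    value is NaN, so only positions valid in both inputs contribute.
    """
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for xs, ys in chunks:
        for x, y in zip(xs, ys):
            if math.isnan(x) or math.isnan(y):
                continue
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
    if n == 0:
        return math.nan
    mean_x, mean_y = sx / n, sy / n
    cov = sxy / n - mean_x * mean_y
    var_x = sxx / n - mean_x ** 2
    var_y = syy / n - mean_y ** 2
    if var_x <= 0 or var_y <= 0:
        # Constant input (or rounding noise): correlation is undefined.
        return math.nan
    return cov / math.sqrt(var_x * var_y)


# Three chunks; the two pairs containing a NaN are ignored.
chunks = [
    ([1.0, 2.0, math.nan], [2.0, 4.0, 1.0]),
    ([3.0, 4.0], [6.0, math.nan]),
    ([5.0], [10.0]),
]
print(streaming_corr(chunks))
```

With xarray and Dask the same sums would presumably be written as lazy reductions over the masked arrays, so that one graph computes all of them in a single pass. Note that the sum-of-products form can lose precision when the means are large relative to the spread, so a real implementation might prefer a more stable formulation.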

For now, I think big-data performance could be improved considerably by removing the if not valid_values.all() clause here. That clause seems to be what makes a call of xarray.corr() not fully lazy, and it seems to cause the first (of several?) full passes over the datasets. I haven't checked what happens afterwards, but maybe this is already a useful starting point? :thinking:
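
A toy illustration of why a Python `if` on a deferred value cannot stay lazy: the `if` asks for a concrete True or False, and a deferred value can only answer by running its computation. The `Lazy` class and both `build_*` functions below are hypothetical stand-ins written for this sketch. They are not Dask or xarray code, and the filtering step only loosely mirrors the masking that xarray.corr() performs.

```python
import math


class Lazy:
    """Toy deferred value: wraps a zero-argument function, counts evaluations."""

    def __init__(self, func):
        self._func = func
        self.evaluations = 0  # how often the wrapped work actually ran

    def compute(self):
        self.evaluations += 1
        return self._func()

    def __bool__(self):
        # `if lazy_value:` ends up here; answering requires computing.
        return bool(self.compute())


def build_with_branch(data, all_valid):
    """Same shape as `if not valid_values.all(): ...`; the check runs now."""
    if not all_valid:
        return Lazy(lambda: [v for v in data if not math.isnan(v)])
    return Lazy(lambda: list(data))


def build_without_branch(data):
    """Always apply the filter; nothing is evaluated while building."""
    return Lazy(lambda: [v for v in data if not math.isnan(v)])


data = [1.0, math.nan, 3.0]
all_valid = Lazy(lambda: all(not math.isnan(v) for v in data))

branched = build_with_branch(data, all_valid)
print(all_valid.evaluations)   # the validity check already ran during building

unbranched = build_without_branch(data)
print(unbranched.evaluations)  # still untouched until compute() is called
```

Both variants give the same result once computed. The difference is only that the branching version has to scan the data once before any result is requested, which for a 90-180 GB dataset would mean a full extra read.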

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4804/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: 13221727
type: issue
