issues: 416962458

  • id: 416962458
  • node_id: MDU6SXNzdWU0MTY5NjI0NTg=
  • number: 2799
  • title: Performance: numpy indexes small amounts of data 1000 faster than xarray
  • user: 1386642
  • state: open
  • locked: 0
  • comments: 42
  • created_at: 2019-03-04T19:44:17Z
  • updated_at: 2024-03-18T17:51:25Z
  • author_association: CONTRIBUTOR
  • repo: 13221727
  • type: issue

Machine learning applications often require iterating over every index along some of the dimensions of a dataset. For instance, iterating over all the (lat, lon) pairs in a 4D dataset with dimensions (time, level, lat, lon). Unfortunately, this is very slow with xarray objects compared to numpy (or h5py) arrays. When the Pangeo machine learning working group met today, we found that several of us have struggled with this.
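
As a concrete illustration of that access pattern, here is a minimal sketch (the dimension names follow the report, but the array sizes and variable names are made up for illustration):

```python
import numpy as np
import xarray as xr

# Hypothetical 4D array with dims (time, level, lat, lon); the sizes
# here are illustrative, not taken from the original report.
da = xr.DataArray(
    np.random.rand(4, 5, 32, 64),
    dims=("time", "level", "lat", "lon"),
)

# Visit every (lat, lon) pair, extracting the (time, level) slice at
# each point -- one small indexing operation per grid cell.
for i in range(da.sizes["lat"]):
    for j in range(da.sizes["lon"]):
        sample = da.isel(lat=i, lon=j)  # each call pays xarray's indexing overhead
```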

I made some simplified benchmarks, which show that xarray is about 1000 times slower than numpy when repeatedly grabbing a small amount of data from an array. This is a problem with both isel and [] indexing. After some profiling, the main culprits appear to be xarray routines like _validate_indexers and _broadcast_indexes.
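
A minimal sketch of that kind of benchmark (the shapes, repeat count, and dimension names here are mine, not the original benchmark code) times one small lookup through numpy, isel, and []:

```python
import timeit

import numpy as np
import xarray as xr

arr = np.random.rand(100, 100)
da = xr.DataArray(arr, dims=("x", "y"))

n = 10_000

# Average time per single small lookup, over n repetitions.
t_numpy = timeit.timeit(lambda: arr[0, 0], number=n) / n
t_isel = timeit.timeit(lambda: da.isel(x=0, y=0), number=n) / n
t_brackets = timeit.timeit(lambda: da[0, 0], number=n) / n

print(f"numpy [0, 0]   : {t_numpy:.2e} s")
print(f"xarray isel    : {t_isel:.2e} s")
print(f"xarray [0, 0]  : {t_brackets:.2e} s")
print(f"slowdown (isel): {t_isel / t_numpy:.0f}x")
```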

While Python will always be slower than C when iterating over an array in this fashion, I would hope that xarray could be nearly as fast as numpy. I am not sure what the best way to improve this is, though.

reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2799/reactions",
    "total_count": 9,
    "+1": 9,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
