
issue_comments


8 rows where author_association = "NONE" and issue = 416962458 ("Performance: numpy indexes small amounts of data 1000 faster than xarray"), sorted by updated_at descending


1306386310 · openSourcerer9000 (61931826) · NONE · created 2022-11-07T23:53:17Z · updated 2022-11-07T23:53:17Z
https://github.com/pydata/xarray/issues/2799#issuecomment-1306386310

A workaround I was able to use: load the whole thing into a numpy array (18 GB!) in about a minute with da.values, index 15 nodes in 0.4 seconds (this was taking ~5 min in xarray), then load it back into a DataArray (sketched below). Not accommodating for our friends with differently-abled memory cards, but it worked in my case.

Reactions: none
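A minimal sketch of the round-trip workaround described in the comment above, assuming a 2-D DataArray with ("node", "time") dims; the sizes and node ids here are illustrative, not from the original post.

import numpy as np
import xarray as xr

# Illustrative array; the commenter's real array is far larger (~18 GB).
da = xr.DataArray(
    np.random.rand(10_000, 1_500),
    dims=("node", "time"),
    coords={"node": np.arange(10_000), "time": np.arange(1_500)},
)

node_ids = [3, 17, 42]              # the handful of nodes to extract

raw = da.values                     # 1. pull everything into plain numpy
subset = raw[node_ids, :]           # 2. index in numpy, bypassing xarray
out = xr.DataArray(                 # 3. wrap the result back into a DataArray
    subset,
    dims=("node", "time"),
    coords={"node": da["node"].values[node_ids], "time": da["time"]},
)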
1306300937 · openSourcerer9000 (61931826) · NONE · created 2022-11-07T22:16:55Z · updated 2022-11-07T22:16:55Z
https://github.com/pydata/xarray/issues/2799#issuecomment-1306300937

I'm really not understanding why indexing is so slow. My DataArray has 2 dims, one axis 1.5 million long ('node') and the other 1,500 ('time'). Pulling a single timeseries by indexing 1 node takes 16 seconds, and neither the Variable workaround nor playing around with chunking changes anything (both sketched below). The only thing loaded into memory should be an array of 1,500 values.

I'm not sure what's going on under the hood, but there may be a way to specify that you're only looking to optimize indexing along 1 dim; once it is indexed, it becomes a very tiny dataset. I would think chunks={'node': 1} would do exactly this, but apparently not.

Reactions: none
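For reference, a toy-scale sketch of the two approaches tried above; the dim names follow the comment, the sizes are scaled down, and the chunked variant assumes dask is installed.

import numpy as np
import xarray as xr

# Toy sizes; the comment's array is 1,500,000 ('node') x 1,500 ('time').
da = xr.DataArray(np.random.rand(10_000, 1_500), dims=("node", "time"))

# Plain DataArray indexing (the path reported slow at full scale):
series = da.isel(node=42)

# The "Variable workaround": index the underlying Variable directly,
# skipping DataArray-level coordinate handling.
series_var = da.variable[42]

# The chunking attempt mentioned above (requires dask):
da_chunked = da.chunk({"node": 1})
series_chunked = da_chunked.isel(node=42).compute()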
863062212 · eserie (17484729) · NONE · created 2021-06-17T08:57:28Z · updated 2021-06-17T08:57:28Z
https://github.com/pydata/xarray/issues/2799#issuecomment-863062212

Hello,

I don't want to disrupt the issue too much (so let me know if you'd rather we continue the discussion outside).

Somewhat related to the discussions in this issue, I recently released an open-source library, WAX-ML (https://github.com/eserie/wax-ml), where I implement an accessor to unroll JAX transformations on xarray Dataset and DataArray containers along a time dimension.

I hope this can help!

Reactions: +1 × 1
797116001 · eserie (17484729) · NONE · created 2021-03-11T23:20:50Z · updated 2021-03-11T23:20:50Z
https://github.com/pydata/xarray/issues/2799#issuecomment-797116001

FWIW, I think the xarray-lite concept would be a great chunk of work to write a small-ish proposal around. I think we could target the next round of CZI EOSS with such a concept.

@jhamman I'll be happy to participate in the discussion.

Reactions: none
790546398 · eserie (17484729) · NONE · created 2021-03-04T11:29:37Z · updated 2021-03-04T11:29:37Z
https://github.com/pydata/xarray/issues/2799#issuecomment-790546398

In case it is useful, and can be reused for benchmarks, I released two notebooks on my GitHub with an implementation of a faster (though much simplified, and not very optimized in terms of code architecture) version of the DataArray and Dataset containers. The second notebook contains line profilings of buffer experiments with various containers, which makes it possible to pinpoint the operations that are slow in the DataArray implementation for this use case (a stand-in timing sketch follows this comment).

Reactions: none
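The notebooks themselves are on the commenter's GitHub; as a stand-in, here is a minimal timing sketch of the kind of numpy-vs-xarray indexing comparison this issue is about (names and sizes illustrative).

import timeit
import numpy as np
import xarray as xr

arr = np.random.rand(10_000, 100)
da = xr.DataArray(arr, dims=("node", "time"))

# Time the same single-node lookup through numpy and through xarray.
t_np = timeit.timeit(lambda: arr[5], number=10_000)
t_xr = timeit.timeit(lambda: da.isel(node=5), number=10_000)
print(f"numpy {t_np:.4f}s, xarray {t_xr:.4f}s, overhead {t_xr / t_np:.0f}x")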
786837666 · eserie (17484729) · NONE · created 2021-02-26T19:06:08Z · updated 2021-02-26T19:07:20Z
https://github.com/pydata/xarray/issues/2799#issuecomment-786837666

Thanks all for your prompt responses!

@hmaarrfk, I share your recommendation, and it's a great thing to be able to fall back to numpy arrays when the algorithmic part is well decoupled from the data preparation process; it's what I also do when I can. However, in workflows operating on streaming data, the two things (data preparation and computation) may be intertwined or frequently alternated. My example of a "buffer data array" structure is quite natural to consider in such a context, and an efficient implementation of a labelled ndarray could really serve the task (a sketch of the idea follows this comment).

@shoyer I think a first "lite" implementation fully in Python would already be a great thing. It would not achieve numpy performance, but the additional cost due to the management of coordinate alignment should not be too expensive.

An additional suggestion: if the target is computational workflows, aiming for some compatibility with packages such as eagerpy would enable working with other tensor frameworks commonly used in machine learning. This kind of feature could be addressed in yet another package, but having it in mind may influence early implementation choices (e.g. pure Python vs C++).

@jhamman, @shoyer I would be pleased to share my work on the buffer data array if you think it could serve as a use case. In this context, I experimented a bit with a "crafted" lite version of xarray and could achieve a 10x performance improvement.

Reactions: +1 × 1
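The "buffer data array" referenced above is not public in this thread; the following is a hypothetical sketch of the idea, a fixed-depth ring buffer with a "lag" dimension that stays in plain Python/numpy per sample and is wrapped into a labelled DataArray only on demand.

from collections import deque
import numpy as np
import xarray as xr

LAG = 10        # illustrative buffer depth
N_FEATURES = 4  # illustrative number of streamed features

buf = deque(maxlen=LAG)  # cheap append/evict, no xarray work per sample

def push(sample):
    """Append one streamed sample (1-D array of features)."""
    buf.append(np.asarray(sample))

def as_dataarray():
    """Materialize the buffer as a labelled array only when needed."""
    data = np.stack(buf)  # shape: (lag, feature)
    return xr.DataArray(
        data,
        dims=("lag", "feature"),
        coords={"lag": np.arange(len(buf)), "feature": np.arange(N_FEATURES)},
    )

for _ in range(25):
    push(np.random.rand(N_FEATURES))
da = as_dataarray()   # dims ("lag", "feature"), at most LAG rows

Keeping per-sample work in deque/numpy and paying the coordinate bookkeeping once per read is the design choice that makes this pattern fast.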
786759897 · eserie (17484729) · NONE · created 2021-02-26T16:43:23Z · updated 2021-02-26T16:43:23Z
https://github.com/pydata/xarray/issues/2799#issuecomment-786759897

Hi,

I'm working on a machine learning application where I want to stream data and use xarray containers to store it in a buffer (with an additional "lag" dimension) while guaranteeing good alignment of the coordinates along the various dimensions of the streamed data. Doing so, I noticed that the version of my code using xarray is very slow compared to a pure numpy implementation (with no coordinate alignment), or even an implementation with deque + pandas. I think the performance issue I noticed is basically the same as the one reported in this issue.

I have the impression that for this kind of application, and more generally for intensive algorithmic usage (as stated at the beginning of this issue), a light (with fewer functionalities and checks) and fast version of the xarray DataArray and Dataset containers could be developed (a minimal sketch follows this comment).

Do you think this could be something doable within the scope of xarray? Would it be preferable to create a dedicated library?

Reactions: none
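To make the "lite" idea concrete, here is a hypothetical minimal sketch of what such a container could look like: named dimensions on top of a bare numpy array, with no coordinates, alignment, or validation. Nothing here is an actual or proposed xarray API.

import numpy as np

class LiteArray:
    """Hypothetical 'lite' labelled array: numpy data plus dimension
    names, and nothing else -- no coords, no alignment, no checks."""

    def __init__(self, data, dims):
        self.data = np.asarray(data)
        self.dims = tuple(dims)

    def isel(self, **indexers):
        # Turn dim-name keywords into a positional index tuple.
        key = tuple(indexers.get(d, slice(None)) for d in self.dims)
        # Integer indexing drops that dimension, as in xarray.
        kept = tuple(d for d in self.dims
                     if not isinstance(indexers.get(d), int))
        return LiteArray(self.data[key], kept)

arr = LiteArray(np.random.rand(100, 50), dims=("node", "time"))
series = arr.isel(node=3)   # LiteArray with dims ("time",)

Skipping coordinate bookkeeping is where such a container would win its speed, and it is exactly the functionality it gives up.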
529569885 · DerWeh (22542812) · NONE · created 2019-09-09T16:53:20Z · updated 2019-09-09T16:53:20Z
https://github.com/pydata/xarray/issues/2799#issuecomment-529569885

It might be interesting to see if pythran is an alternative to Cython. It seems to handle high-level numpy quite well, and would retain the readability of Python. Of course, it has its own issues...

But it seems that other libraries, e.g. scikit-image, have had good experiences with it (a minimal example follows this comment).

Sadly I can't be of much help, as I lack experience (and, most importantly, time).

Reactions: none
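For readers unfamiliar with pythran: it compiles an ordinary Python module to native code, driven by "#pythran export" comments. A minimal illustrative example; the function and module names are hypothetical, not taken from scikit-image or xarray.

# rolling.py -- compile with: pythran rolling.py
#pythran export window_mean(float64[:,:], int)
import numpy as np

def window_mean(a, w):
    """Rolling mean along axis 0, written as plain numpy code;
    pythran turns this module into a native extension, while the
    file remains importable as ordinary Python."""
    out = np.empty((a.shape[0] - w + 1, a.shape[1]))
    for i in range(out.shape[0]):
        out[i] = a[i:i + w].mean(axis=0)
    return out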

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);