
issue_comments


15 rows where issue = 323703742 sorted by updated_at descending




Commenters (5): ghost (6), mankoff (3), max-sixty (3), shoyer (2), jhamman (1)

Author associations: MEMBER (6), NONE (6), CONTRIBUTOR (3)

Issue: From pandas to xarray without blowing up memory · 15 comments
max-sixty (MEMBER) · 2020-10-14T19:34:53Z
https://github.com/pydata/xarray/issues/2139#issuecomment-708616198

As you wish — if there's a motivating example then that has more weight, and big issues should have ample supply of motivating examples. That said, if you have something ready to go, then happy to take a look at it.

mankoff (CONTRIBUTOR) · 2020-10-14T18:52:38Z
https://github.com/pydata/xarray/issues/2139#issuecomment-708594913

The issue is that if you pass names = ['a','b','c'] to pd.read_csv and there are more columns than names, it takes all the columns without a name and creates a multi-index. The bug in my code was that I had more columns than names, didn't want a multi-index, and didn't make use of usecols.
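That pd.read_csv behavior can be reproduced in a few lines; the data and column names below are dummy placeholders, not the original file:

```python
import io
import pandas as pd

csv = "1,2,3,4,5\n6,7,8,9,10\n"  # five columns of dummy data

# Only 3 names for 5 columns: pandas uses the two leftmost columns
# as a MultiIndex rather than raising an error.
df = pd.read_csv(io.StringIO(csv), names=["a", "b", "c"])
print(df.index.nlevels)      # 2

# Restricting to the first three columns with usecols avoids the surprise index.
df2 = pd.read_csv(io.StringIO(csv), names=["a", "b", "c"], usecols=[0, 1, 2])
print(df2.index.nlevels)     # 1 (default RangeIndex)
```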

This multi-index came from a small 12 MB file - 5000 rows and 40 variables. When I then did df.to_xarray() it filled up my RAM. If I ran the code I provided above, it worked.

Now that I've figured all this out, I don't think any bugs exist in xarray or pandas, just in my code. As usual :). But if the fact that I can fill RAM with df.to_xarray() but not with the three lines shown above sounds like an issue you want to explore, I'm happy to provide an MWE on a new ticket and tag you there. Let me know...
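A pandas-only sketch of why this fills RAM: df.to_xarray() materializes the full Cartesian grid of the MultiIndex levels, so a sparse index can imply far more cells than there are rows (the frame below is dummy data):

```python
import numpy as np
import pandas as pd

# A sparse MultiIndex: 1000 rows, but the levels span a much larger grid.
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_arrays(
    [rng.integers(0, 500, 1000), rng.integers(0, 500, 1000)],
    names=["i", "j"],
)
df = pd.DataFrame({"v": np.arange(1000.0)}, index=idx)

# df.to_xarray() would allocate the dense i x j grid, not just the rows present.
dense_cells = int(np.prod([len(lev) for lev in df.index.levels]))
print(len(df), "rows ->", dense_cells, "dense cells")
```

With 40 real-valued variables and levels drawn from thousands of distinct values, the dense grid dwarfs the 12 MB input, which matches the blow-up described above.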

max-sixty (MEMBER) · 2020-10-14T18:23:16Z
https://github.com/pydata/xarray/issues/2139#issuecomment-708579401

Great! Post here / a new issue if something does come up!

mankoff (CONTRIBUTOR) · 2020-10-14T16:23:36Z
https://github.com/pydata/xarray/issues/2139#issuecomment-708513119

@max-sixty Sorry for posting this here. This memory blow-up was a byproduct of another bug that it took me a few more hours to track down. This other bug is in Pandas, not xarray.

Reactions: 🎉 1
max-sixty (MEMBER) · 2020-10-14T16:00:35Z
https://github.com/pydata/xarray/issues/2139#issuecomment-708499472

@mankoff Thanks for the issue. Do you have a fuller reproduction? I'm happy to take a look at this.

mankoff (CONTRIBUTOR) · 2020-10-14T11:25:03Z
https://github.com/pydata/xarray/issues/2139#issuecomment-708339519

Late reply, but if anyone else finds this issue, I was filling memory with: ds = df.to_xarray(), but if I build the dataset more manually, I have no memory issues:

ds = xr.Dataset({df.columns[0]: xr.DataArray(data=df[df.columns[0]],
                                             dims=['index'],
                                             coords={'index': df.index})})
for c in df.columns[1:]:
    ds[c] = (('index'), df[c])

ghost (NONE) · 2018-05-16T18:37:24Z
https://github.com/pydata/xarray/issues/2139#issuecomment-389622523

Does that sound like it will play well with GeoViews if I want widgets for the categorical vars?

ghost (NONE) · 2018-05-16T18:36:17Z
https://github.com/pydata/xarray/issues/2139#issuecomment-389622155

Ok. Looks like the way forward is a netCDF file for each level of my categorical variables. Will give it a shot.

shoyer (MEMBER) · 2018-05-16T18:31:35Z
https://github.com/pydata/xarray/issues/2139#issuecomment-389620638

MetaCSV looks interesting but I haven't used it myself. My guess would be that it just wraps pandas/xarray for processing data, so I think it's unlikely to give a performance boost. It's more about a declarative way to specify how to load a CSV into pandas/xarray.

ghost (NONE) · 2018-05-16T18:24:02Z
https://github.com/pydata/xarray/issues/2139#issuecomment-389618279

@shoyer Thank you. Does MetaCSV look likely to work to you? It has attracted almost no attention, so I wonder whether it will also exhaust memory. I'm kind of surprised this path (CSV -> xarray) isn't better fleshed out, as I would have expected it to be very common, perhaps the most common, especially for "found data."

shoyer (MEMBER) · 2018-05-16T17:20:03Z
https://github.com/pydata/xarray/issues/2139#issuecomment-389598338

If you don't want the full Cartesian product, you need to ensure that the index only contains the variables you want to expand into a grid, e.g., time, lat and lon.
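A small pandas-only sketch of that first point (the column names time, lat, lon, cat1, var1 below are illustrative): index only the grid dimensions before calling to_xarray(), so everything else stays a data variable rather than becoming an extra dimension.

```python
import pandas as pd

# Dummy flat records, standing in for rows read from the CSV.
df = pd.DataFrame({
    "time": [0, 0, 1, 1],
    "lat":  [10.0, 20.0, 10.0, 20.0],
    "lon":  [30.0, 30.0, 30.0, 30.0],
    "cat1": ["a", "b", "a", "b"],
    "var1": [1.0, 2.0, 3.0, 4.0],
})

# Index on the grid dimensions only; cat1 remains a column, so it is not
# expanded into its own dimension when the frame is converted with to_xarray().
gridded = df.set_index(["time", "lat", "lon"])
print(gridded.index.names)   # ['time', 'lat', 'lon']
```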

If the problem is only running out of memory (which is indeed likely with 1e9 rows), then you'll need to think about a more clever way to convert the data. One good option might be to group over subsets of the data (using dask or another parallel processing library like spark or beam), and write a bunch of smaller netCDF files, which you then open with xarray's open_mfdataset(). It's probably most convenient to split over time, e.g., into files for each day or month.
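The splitting step above can be sketched with pandas alone (the paths, dummy data, and monthly grouping are illustrative; with real data each piece would go through to_xarray().to_netcdf(...) and be reopened with xr.open_mfdataset):

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Dummy time-indexed records standing in for the billion-row CSV.
times = pd.date_range("2018-01-01", periods=100, freq="D")
df = pd.DataFrame({"var1": np.arange(100.0)}, index=times)

outdir = Path(tempfile.mkdtemp())
written = []
# Group by calendar month and write each piece separately; in the real
# workflow each chunk would be saved as its own netCDF file instead of CSV.
for period, chunk in df.groupby(df.index.to_period("M")):
    written.append(outdir / f"{period}.csv")
    chunk.to_csv(written[-1])
print(len(written), "monthly files")
```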

Reactions: 👍 1
ghost (NONE) · 2018-05-16T17:13:11Z
https://github.com/pydata/xarray/issues/2139#issuecomment-389596244

This looks potentially helpful: http://metacsv.readthedocs.io/en/latest/

ghost (NONE) · 2018-05-16T17:01:37Z
https://github.com/pydata/xarray/issues/2139#issuecomment-389592602

PS: I started with Dask but haven't found a way to go from Dask to xarray.

ghost (NONE) · 2018-05-16T17:00:24Z
https://github.com/pydata/xarray/issues/2139#issuecomment-389592243

Hi @jhamman The original data is literally just a flat CSV file with columns like lat,lon,epoch,cat1,cat2,var1,var2,...,var50 and 1 billion rows.

I'm looking to xarray for GeoViews, which I think would benefit from having the data properly grouped/indexed by its categories.

jhamman (MEMBER) · 2018-05-16T16:55:27Z
https://github.com/pydata/xarray/issues/2139#issuecomment-389590507

@brianmingus - any chance you can provide a reproducible example with some dummy data?


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 14.267ms · About: xarray-datasette