
issue_comments


12 rows where author_association = "NONE" and issue = 187608079 sorted by updated_at descending


user 4

  • naught101 7
  • andreall 3
  • darothen 1
  • stale[bot] 1

issue 1

  • Is there a more efficient way to convert a subset of variables to a dataframe? · 12

author_association 1

  • NONE · 12
id html_url issue_url node_id user created_at updated_at author_association body reactions performed_via_github_app issue
1100969648 https://github.com/pydata/xarray/issues/1086#issuecomment-1100969648 https://api.github.com/repos/pydata/xarray/issues/1086 IC_kwDOAMm_X85Bn3aw stale[bot] 26384082 2022-04-17T23:43:46Z 2022-04-17T23:43:46Z NONE

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
  Is there a more efficient way to convert a subset of variables to a dataframe? 187608079
661972749 https://github.com/pydata/xarray/issues/1086#issuecomment-661972749 https://api.github.com/repos/pydata/xarray/issues/1086 MDEyOklzc3VlQ29tbWVudDY2MTk3Mjc0OQ== andreall 25382032 2020-07-21T16:41:52Z 2020-07-21T16:41:52Z NONE

Hi @darothen, thanks a lot. I hadn't thought of processing each file and then merging. I'll give it a try. Thanks,

{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
  Is there a more efficient way to convert a subset of variables to a dataframe? 187608079
661953980 https://github.com/pydata/xarray/issues/1086#issuecomment-661953980 https://api.github.com/repos/pydata/xarray/issues/1086 MDEyOklzc3VlQ29tbWVudDY2MTk1Mzk4MA== darothen 4992424 2020-07-21T16:09:25Z 2020-07-21T16:09:52Z NONE

Hi @andreall, I'll leave @dcherian or another maintainer to comment on the internals of xarray that might be pertinent for optimization here. However, just to throw it out there: for workflows like this, it can sometimes be a bit easier to process each NetCDF file (subsetting your locations and whatnot) and convert it to CSV individually, then merge/concatenate those CSV files together at the end. This sort of workflow can be parallelized in a few different ways, and it's nice because you can parallelize across the number of files you need to process. A simple example based on your MRE:

```python
import xarray as xr
import pandas as pd
from pathlib import Path
from joblib import delayed, Parallel

dir_input = Path('.')
# Note: the '*' before '.nc' was likely lost to markdown italics in the original comment
fns = list(sorted(dir_input.glob('*/WW3_EUR-11_CCCma-CanESM2_r1i1p1_CLMcom-CCLM4-8-17_v1_6hr_*.nc')))

# Helper function to convert NetCDF to CSV with our processing
def _nc_to_csv(fn):
    data_ww3 = xr.open_dataset(fn)
    data_ww3 = data_ww3.isel(latitude=74, longitude=18)
    df_ww3 = data_ww3[['hs', 't02', 't0m1', 't01', 'fp', 'dir', 'spr', 'dp']].to_dataframe()

    out_fn = fn.with_suffix('.csv')  # fn is a Path, so build the output name with with_suffix()
    df_ww3.to_csv(out_fn)

    return out_fn

# Using joblib.Parallel to distribute my work across whatever resources I have
out_fns = Parallel(n_jobs=-1)(  # Use all cores available here
    delayed(_nc_to_csv)(fn) for fn in fns
)

# Read the CSV files and merge them
dfs = [pd.read_csv(fn) for fn in out_fns]
df_ww3_all = pd.concat(dfs, ignore_index=True)
```

YMMV but this pattern often works for many types of processing applications.

{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
  Is there a more efficient way to convert a subset of variables to a dataframe? 187608079
661940009 https://github.com/pydata/xarray/issues/1086#issuecomment-661940009 https://api.github.com/repos/pydata/xarray/issues/1086 MDEyOklzc3VlQ29tbWVudDY2MTk0MDAwOQ== andreall 25382032 2020-07-21T15:44:54Z 2020-07-21T15:46:06Z NONE

Hi,

```python
import xarray as xr
from pathlib import Path

dir_input = Path('.')
# Note: the '*' before '.nc' was likely lost to markdown italics in the original comment
data_ww3 = xr.open_mfdataset(sorted(dir_input.glob('*/WW3_EUR-11_CCCma-CanESM2_r1i1p1_CLMcom-CCLM4-8-17_v1_6hr_*.nc')))

data_ww3 = data_ww3.isel(latitude=74, longitude=18)
df_ww3 = data_ww3[['hs', 't02', 't0m1', 't01', 'fp', 'dir', 'spr', 'dp']].to_dataframe()
```

You can download one file here: https://nasgdfa.ugr.es:5001/d/f/566168344466602780 (3.5 GB). I ran a profiler when opening 2 .nc files, and it showed that the to_dataframe() call was taking most of the time.

I'm just wondering if there's a way to reduce computing time. I need to open 95 files, and it takes about 1.5 hours.
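
A minimal sketch of that kind of profiling, assuming the data_ww3 dataset from the snippet above:

```python
import cProfile

# Profile just the conversion step; data_ww3 is assumed to exist from the
# snippet above. Sorting by cumulative time surfaces to_dataframe() and its callees.
cProfile.run(
    "data_ww3[['hs', 't02', 't0m1', 't01', 'fp', 'dir', 'spr', 'dp']].to_dataframe()",
    sort='cumtime',
)
```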

Thanks,

{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
  Is there a more efficient way to convert a subset of variables to a dataframe? 187608079
661775197 https://github.com/pydata/xarray/issues/1086#issuecomment-661775197 https://api.github.com/repos/pydata/xarray/issues/1086 MDEyOklzc3VlQ29tbWVudDY2MTc3NTE5Nw== andreall 25382032 2020-07-21T10:29:48Z 2020-07-21T10:29:48Z NONE

I am running into the same problem. This might be a long shot, but @naught101, do you remember if you managed to convert to a dataframe in a more efficient way? Thanks,

{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
  Is there a more efficient way to convert a subset of variables to a dataframe? 187608079
259044958 https://github.com/pydata/xarray/issues/1086#issuecomment-259044958 https://api.github.com/repos/pydata/xarray/issues/1086 MDEyOklzc3VlQ29tbWVudDI1OTA0NDk1OA== naught101 167164 2016-11-08T04:47:56Z 2016-11-08T04:47:56Z NONE

OK, no worries. I'll try it if things get desperate :)

Thanks for your help, shoyer!

{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
  Is there a more efficient way to convert a subset of variables to a dataframe? 187608079
259041491 https://github.com/pydata/xarray/issues/1086#issuecomment-259041491 https://api.github.com/repos/pydata/xarray/issues/1086 MDEyOklzc3VlQ29tbWVudDI1OTA0MTQ5MQ== naught101 167164 2016-11-08T04:16:26Z 2016-11-08T04:16:26Z NONE

So it would be more efficient to concat all of the datasets (subset to the relevant variables), and then just use a single .to_dataframe() call on the entire dataset? If so, that would require quite a bit of refactoring on my part, but it could be worth it.
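
A minimal sketch of that refactoring, with toy per-site datasets standing in for the real ones (variable names are illustrative):

```python
import numpy as np
import xarray as xr

# Toy stand-ins for the per-site datasets loaded with open_dataset()
datasets = [
    xr.Dataset(
        {'var_a': ('time', np.random.rand(4)),
         'var_b': ('time', np.random.rand(4))},
        coords={'time': np.arange(4)},
    )
    for _ in range(3)
]

# Subset each dataset to the relevant variables, concat once along a new
# 'site' dimension, then make a single to_dataframe() call at the end.
keep_vars = ['var_a', 'var_b']
combined = xr.concat([ds[keep_vars] for ds in datasets], dim='site')
df = combined.to_dataframe()
```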

{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
  Is there a more efficient way to convert a subset of variables to a dataframe? 187608079
259033970 https://github.com/pydata/xarray/issues/1086#issuecomment-259033970 https://api.github.com/repos/pydata/xarray/issues/1086 MDEyOklzc3VlQ29tbWVudDI1OTAzMzk3MA== naught101 167164 2016-11-08T03:14:50Z 2016-11-08T03:14:50Z NONE

Yeah, I'm loading each file separately with xr.open_dataset(), since it's not really a multi-file dataset (it's a lot of single-site datasets, some of which have different variables, and overlapping time dimensions). I don't think I can avoid loading them separately...

{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
  Is there a more efficient way to convert a subset of variables to a dataframe? 187608079
259026069 https://github.com/pydata/xarray/issues/1086#issuecomment-259026069 https://api.github.com/repos/pydata/xarray/issues/1086 MDEyOklzc3VlQ29tbWVudDI1OTAyNjA2OQ== naught101 167164 2016-11-08T02:19:01Z 2016-11-08T02:19:01Z NONE

Not easily. Most scripts require many of these datasets (up to 200; the linked one is among the smallest, and some are up to 10 MB) in a specific directory structure, and they rely on a couple of private Python modules. I was just asking because I thought I might have been missing something obvious, but now I guess that isn't the case. It's probably not worth spending too much time on this; if it starts becoming a real problem for me, I will try to generate something self-contained that shows the problem. Until then, maybe it's best to assume that xarray/pandas are doing the best they can given the requirements, and close this for now.

{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
  Is there a more efficient way to convert a subset of variables to a dataframe? 187608079
258774196 https://github.com/pydata/xarray/issues/1086#issuecomment-258774196 https://api.github.com/repos/pydata/xarray/issues/1086 MDEyOklzc3VlQ29tbWVudDI1ODc3NDE5Ng== naught101 167164 2016-11-07T08:30:25Z 2016-11-07T08:30:25Z NONE

I loaded it from a NetCDF file. There's an example you can play with at https://dl.dropboxusercontent.com/u/50684199/MitraEFluxnet.1.4_flux.nc

{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
  Is there a more efficient way to convert a subset of variables to a dataframe? 187608079
258755061 https://github.com/pydata/xarray/issues/1086#issuecomment-258755061 https://api.github.com/repos/pydata/xarray/issues/1086 MDEyOklzc3VlQ29tbWVudDI1ODc1NTA2MQ== naught101 167164 2016-11-07T06:12:27Z 2016-11-07T06:12:27Z NONE

Slightly slower (using %timeit in IPython).
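
A minimal sketch of that kind of comparison in an IPython session, assuming the example file linked a few comments above and the squeeze variant discussed nearby:

```python
# IPython session sketch; the file and variable list are assumptions based on the thread.
import xarray as xr

ds = xr.open_dataset('MitraEFluxnet.1.4_flux.nc')
data_vars = list(ds.data_vars)

%timeit ds[data_vars].to_dataframe()            # baseline conversion
%timeit ds.squeeze()[data_vars].to_dataframe()  # squeeze first, then convert
```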

{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
  Is there a more efficient way to convert a subset of variables to a dataframe? 187608079
258753366 https://github.com/pydata/xarray/issues/1086#issuecomment-258753366 https://api.github.com/repos/pydata/xarray/issues/1086 MDEyOklzc3VlQ29tbWVudDI1ODc1MzM2Ng== naught101 167164 2016-11-07T05:56:26Z 2016-11-07T05:56:26Z NONE

Squeeze is pretty much identical in efficiency. It seems very slightly better (2-5%) on smaller datasets. (I still need to add the final [data_vars] to get rid of the extraneous index_var columns, but that doesn't affect performance much.)

I'm not calling pandas.tslib.array_to_timedelta64 directly; to_dataframe is. The caller list is (sorry, I'm not sure of a better way to show this):
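
One way to produce a caller list like the one described is cProfile plus pstats.print_callers; a minimal sketch, assuming the example file linked above (the output filename is illustrative):

```python
import cProfile
import pstats

import xarray as xr

# Assumptions: the example file linked above, and all of its data variables
ds = xr.open_dataset('MitraEFluxnet.1.4_flux.nc')
data_vars = list(ds.data_vars)

# Profile the conversion, then ask pstats which functions call the hot spot
cProfile.run('ds.squeeze().to_dataframe()[data_vars]', 'to_df.prof')
stats = pstats.Stats('to_df.prof')
stats.print_callers('array_to_timedelta64')
```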

{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}
  Is there a more efficient way to convert a subset of variables to a dataframe? 187608079


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);