html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/1086#issuecomment-1100969648,https://api.github.com/repos/pydata/xarray/issues/1086,1100969648,IC_kwDOAMm_X85Bn3aw,26384082,2022-04-17T23:43:46Z,2022-04-17T23:43:46Z,NONE,"In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here or remove the `stale` label; otherwise it will be marked as closed automatically.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-661972749,https://api.github.com/repos/pydata/xarray/issues/1086,661972749,MDEyOklzc3VlQ29tbWVudDY2MTk3Mjc0OQ==,25382032,2020-07-21T16:41:52Z,2020-07-21T16:41:52Z,NONE,"Hi @darothen, thanks a lot. I hadn't thought of processing each file and then merging. Will give it a try. Thanks,","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-661953980,https://api.github.com/repos/pydata/xarray/issues/1086,661953980,MDEyOklzc3VlQ29tbWVudDY2MTk1Mzk4MA==,4992424,2020-07-21T16:09:25Z,2020-07-21T16:09:52Z,NONE,"Hi @andreall, I'll leave @dcherian or another maintainer to comment on internals of `xarray` that might be pertinent for optimization here. However, just to throw it out there, for workflows like this, it can sometimes be a bit easier to process each NetCDF file (subsetting your locations and whatnot) and convert it to CSV individually, then merge/concatenate those CSV files together at the end. This sort of workflow can be parallelized a few different ways, but is nice because you can parallelize across the number of files you need to process.
A simple example based on your MRE:

``` python
import pandas as pd
import xarray as xr
from pathlib import Path
from joblib import delayed, Parallel

dir_input = Path('.')
fns = list(sorted(dir_input.glob('**/' + 'WW3_EUR-11_CCCma-CanESM2_r1i1p1_CLMcom-CCLM4-8-17_v1_6hr_*.nc')))

# Helper function to convert NetCDF to CSV with our processing
def _nc_to_csv(fn):
    data_ww3 = xr.open_dataset(fn)
    data_ww3 = data_ww3.isel(latitude=74, longitude=18)
    df_ww3 = data_ww3[['hs', 't02', 't0m1', 't01', 'fp', 'dir', 'spr', 'dp']].to_dataframe()
    out_fn = fn.with_suffix('.csv')
    df_ww3.to_csv(out_fn)
    return out_fn

# Using joblib.Parallel to distribute my work across whatever resources I have
out_fns = Parallel(n_jobs=-1)(  # n_jobs=-1: use all cores available here
    delayed(_nc_to_csv)(fn) for fn in fns
)

# Read the CSV files and merge them
dfs = [pd.read_csv(fn) for fn in out_fns]
df_ww3_all = pd.concat(dfs, ignore_index=True)
```

YMMV but this pattern often works for many types of processing applications.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-661940009,https://api.github.com/repos/pydata/xarray/issues/1086,661940009,MDEyOklzc3VlQ29tbWVudDY2MTk0MDAwOQ==,25382032,2020-07-21T15:44:54Z,2020-07-21T15:46:06Z,NONE,"Hi,

```
import xarray as xr
from pathlib import Path

dir_input = Path('.')
data_ww3 = xr.open_mfdataset(dir_input.glob('**/' + 'WW3_EUR-11_CCCma-CanESM2_r1i1p1_CLMcom-CCLM4-8-17_v1_6hr_*.nc'))
data_ww3 = data_ww3.isel(latitude=74, longitude=18)
df_ww3 = data_ww3[['hs', 't02', 't0m1', 't01', 'fp', 'dir', 'spr', 'dp']].to_dataframe()
```

You can download one file here: https://nasgdfa.ugr.es:5001/d/f/566168344466602780 (3.5 GB). I ran a profiler when opening 2 .nc files and it said the to_dataframe() call was the one taking most of the time. I'm just wondering if there's a way to reduce computing time. I need to open 95 files and it takes about 1.5 hours. Thanks,","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-661919828,https://api.github.com/repos/pydata/xarray/issues/1086,661919828,MDEyOklzc3VlQ29tbWVudDY2MTkxOTgyOA==,2448579,2020-07-21T15:10:02Z,2020-07-21T15:10:02Z,MEMBER,can you make a reproducible example @andreall?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-661775197,https://api.github.com/repos/pydata/xarray/issues/1086,661775197,MDEyOklzc3VlQ29tbWVudDY2MTc3NTE5Nw==,25382032,2020-07-21T10:29:48Z,2020-07-21T10:29:48Z,NONE,"I am running into the same problem. This might be a long shot, but @naught101, do you remember if you managed to convert to dataframe in a more efficient way? Thanks,","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-259044958,https://api.github.com/repos/pydata/xarray/issues/1086,259044958,MDEyOklzc3VlQ29tbWVudDI1OTA0NDk1OA==,167164,2016-11-08T04:47:56Z,2016-11-08T04:47:56Z,NONE,"Ok, no worries. I'll try it if it gets desperate :) Thanks for your help, shoyer!
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-259044805,https://api.github.com/repos/pydata/xarray/issues/1086,259044805,MDEyOklzc3VlQ29tbWVudDI1OTA0NDgwNQ==,1217238,2016-11-08T04:46:23Z,2016-11-08T04:46:23Z,MEMBER,"> So it would be more efficient to concat all of the datasets (subset for the relevant variables), and then just use a single .to_dataframe() call on the entire dataset? If so, that would require quite a bit of refactoring on my part, but it could be worth it. Maybe? I'm not confident enough to advise you to go to that trouble. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-259041491,https://api.github.com/repos/pydata/xarray/issues/1086,259041491,MDEyOklzc3VlQ29tbWVudDI1OTA0MTQ5MQ==,167164,2016-11-08T04:16:26Z,2016-11-08T04:16:26Z,NONE,"So it would be more efficient to concat all of the datasets (subset for the relevant variables), and then just use a single .to_dataframe() call on the entire dataset? If so, that would require quite a bit of refactoring on my part, but it could be worth it. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-259035428,https://api.github.com/repos/pydata/xarray/issues/1086,259035428,MDEyOklzc3VlQ29tbWVudDI1OTAzNTQyOA==,1217238,2016-11-08T03:25:58Z,2016-11-08T03:25:58Z,MEMBER,"Under the covers open_mfdataset just uses open_dataset and merge/concat. So this would be similar either way. On Mon, Nov 7, 2016 at 7:14 PM naught101 notifications@github.com wrote: > Yeah, I'm loading each file separately with xr.open_dataset(), since it's > not really a multi-file dataset (it's a lot of single-site datasets, some > of which have different variables, and overlapping time dimensions). I > don't think I can avoid loading them separately... > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > https://github.com/pydata/xarray/issues/1086#issuecomment-259033970, or mute > the thread > https://github.com/notifications/unsubscribe-auth/ABKS1oUWnGIBO3mX5h56mgPvCbCU7PI3ks5q7-krgaJpZM4Kqw2_ > . ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-259033970,https://api.github.com/repos/pydata/xarray/issues/1086,259033970,MDEyOklzc3VlQ29tbWVudDI1OTAzMzk3MA==,167164,2016-11-08T03:14:50Z,2016-11-08T03:14:50Z,NONE,"Yeah, I'm loading each file separately with `xr.open_dataset()`, since it's not really a multi-file dataset (it's a lot of single-site datasets, some of which have different variables, and overlapping time dimensions). I don't think I can avoid loading them separately... 
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-259028693,https://api.github.com/repos/pydata/xarray/issues/1086,259028693,MDEyOklzc3VlQ29tbWVudDI1OTAyODY5Mw==,1217238,2016-11-08T02:36:16Z,2016-11-08T02:36:16Z,MEMBER,"One thing that might hurt is that xarray (lazily) decodes times from each file separately, rather than decoding times all at one. But this hasn't been much of an issue before even with hundreds of times, so I'm not sure what's going on here. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-259026069,https://api.github.com/repos/pydata/xarray/issues/1086,259026069,MDEyOklzc3VlQ29tbWVudDI1OTAyNjA2OQ==,167164,2016-11-08T02:19:01Z,2016-11-08T02:19:01Z,NONE,"Not easily - most scripts require multiple (up to 200, of which the linked one is one of the smallest, some are up to 10Mb) of these datasets in a specific directory structure, and rely on a couple of private python modules. I was just asking because I thought I might have been missing something obvious, but now I guess that isn't the case. Probably not worth spending too much time on this - if it starts becoming a real problem for me, I will try to generate something self-contained that shows the problem. Until then, maybe it's best to assume that xarray/pandas are doing the best they can given the requirements, and close this for now. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-258884141,https://api.github.com/repos/pydata/xarray/issues/1086,258884141,MDEyOklzc3VlQ29tbWVudDI1ODg4NDE0MQ==,1217238,2016-11-07T16:27:21Z,2016-11-07T16:27:21Z,MEMBER,"can you give me a copy/pastable script that has the slowness issue with that file? ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-258774196,https://api.github.com/repos/pydata/xarray/issues/1086,258774196,MDEyOklzc3VlQ29tbWVudDI1ODc3NDE5Ng==,167164,2016-11-07T08:30:25Z,2016-11-07T08:30:25Z,NONE,"I loaded it from a netcdf file. There's an example you can play with at https://dl.dropboxusercontent.com/u/50684199/MitraEFluxnet.1.4_flux.nc ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-258755912,https://api.github.com/repos/pydata/xarray/issues/1086,258755912,MDEyOklzc3VlQ29tbWVudDI1ODc1NTkxMg==,1217238,2016-11-07T06:20:18Z,2016-11-07T06:20:18Z,MEMBER,"How did you construct this dataset? 
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-258755061,https://api.github.com/repos/pydata/xarray/issues/1086,258755061,MDEyOklzc3VlQ29tbWVudDI1ODc1NTA2MQ==,167164,2016-11-07T06:12:27Z,2016-11-07T06:12:27Z,NONE,"Slightly slower (using `%timeit` in ipython) ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-258754037,https://api.github.com/repos/pydata/xarray/issues/1086,258754037,MDEyOklzc3VlQ29tbWVudDI1ODc1NDAzNw==,1217238,2016-11-07T06:02:56Z,2016-11-07T06:02:56Z,MEMBER,"Try calling `.load()` before `.to_dataframe` ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-258753366,https://api.github.com/repos/pydata/xarray/issues/1086,258753366,MDEyOklzc3VlQ29tbWVudDI1ODc1MzM2Ng==,167164,2016-11-07T05:56:26Z,2016-11-07T05:56:26Z,NONE,"Squeeze is pretty much identical in efficiency. Seems very slightly better (2-5%) on smaller datasets. (I still need to add the final `[data_vars]` to get rid of the extraneous index_var columns, but that doesn't affect performance much). I'm not calling `pandas.tslib.array_to_timedelta64`, `to_dataframe` is - the caller list is (sorry, I'm not sure of a better way to show this):  ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079 https://github.com/pydata/xarray/issues/1086#issuecomment-258748969,https://api.github.com/repos/pydata/xarray/issues/1086,258748969,MDEyOklzc3VlQ29tbWVudDI1ODc0ODk2OQ==,1217238,2016-11-07T05:14:11Z,2016-11-07T05:14:24Z,MEMBER,"The simplest thing to try is making use of `.squeeze()`, e.g., `dataset[data_vars].squeeze().to_dataframe()`. Does that have any better performance? At least it's a bit less typing. I'm not sure why `pandas.tslib.array_to_timedelta64` is slow here, or even how it is being called in your example. I would need a complete example that I can run to debug that. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079