html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/1086#issuecomment-1100969648,https://api.github.com/repos/pydata/xarray/issues/1086,1100969648,IC_kwDOAMm_X85Bn3aw,26384082,2022-04-17T23:43:46Z,2022-04-17T23:43:46Z,NONE,"In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here or remove the `stale` label; otherwise it will be marked as closed automatically.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-661972749,https://api.github.com/repos/pydata/xarray/issues/1086,661972749,MDEyOklzc3VlQ29tbWVudDY2MTk3Mjc0OQ==,25382032,2020-07-21T16:41:52Z,2020-07-21T16:41:52Z,NONE,"Hi @darothen, thanks a lot. I hadn't thought of processing each file and then merging. Will give it a try. Thanks,","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-661953980,https://api.github.com/repos/pydata/xarray/issues/1086,661953980,MDEyOklzc3VlQ29tbWVudDY2MTk1Mzk4MA==,4992424,2020-07-21T16:09:25Z,2020-07-21T16:09:52Z,NONE,"Hi @andreall, I'll leave @dcherian or another maintainer to comment on internals of `xarray` that might be pertinent for optimization here. However, just to throw it out there, for workflows like this, it can sometimes be a bit easier to process each NetCDF file (subsetting your locations and whatnot) and convert it to CSV individually, then merge/concatenate those CSV files together at the end. This sort of workflow can be parallelized a few different ways, but is nice because you can parallelize across the number of files you need to process.
A simple example based on your MRE:

``` python
import pandas as pd
import xarray as xr
from pathlib import Path
from joblib import delayed, Parallel

dir_input = Path('.')
fns = sorted(dir_input.glob('**/' + 'WW3_EUR-11_CCCma-CanESM2_r1i1p1_CLMcom-CCLM4-8-17_v1_6hr_*.nc'))

# Helper function to convert NetCDF to CSV with our processing
def _nc_to_csv(fn):
    data_ww3 = xr.open_dataset(fn)
    data_ww3 = data_ww3.isel(latitude=74, longitude=18)
    df_ww3 = data_ww3[['hs', 't02', 't0m1', 't01', 'fp', 'dir', 'spr', 'dp']].to_dataframe()

    # fn is a pathlib.Path, so swap the suffix rather than using str.replace
    out_fn = fn.with_suffix('.csv')
    df_ww3.to_csv(out_fn)

    return out_fn

# Using joblib.Parallel to distribute my work across whatever resources I have
out_fns = Parallel(n_jobs=-1)(  # n_jobs=-1 uses all available cores
    delayed(_nc_to_csv)(fn) for fn in fns
)

# Read the CSV files and merge them
dfs = [
    pd.read_csv(fn) for fn in out_fns
]
df_ww3_all = pd.concat(dfs, ignore_index=True)
```

YMMV but this pattern often works for many types of processing applications.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-661940009,https://api.github.com/repos/pydata/xarray/issues/1086,661940009,MDEyOklzc3VlQ29tbWVudDY2MTk0MDAwOQ==,25382032,2020-07-21T15:44:54Z,2020-07-21T15:46:06Z,NONE,"Hi,

```
import xarray as xr
from pathlib import Path

dir_input = Path('.')
data_ww3 = xr.open_mfdataset(dir_input.glob('**/' + 'WW3_EUR-11_CCCma-CanESM2_r1i1p1_CLMcom-CCLM4-8-17_v1_6hr_*.nc'))

data_ww3 = data_ww3.isel(latitude=74, longitude=18)
df_ww3 = data_ww3[['hs', 't02', 't0m1', 't01', 'fp', 'dir', 'spr', 'dp']].to_dataframe()
```

You can download one file here: https://nasgdfa.ugr.es:5001/d/f/566168344466602780 (3.5 GB). I ran a profiler while opening 2 .nc files, and it showed that the `to_dataframe()` call was taking most of the time.
![src1](https://user-images.githubusercontent.com/25382032/88075274-db101600-cb78-11ea-8424-5d60a80b9bc4.png) I'm just wondering if there's a way to reduce computing time. I need to open 95 files and it takes about 1.5 hours. Thanks,","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-661775197,https://api.github.com/repos/pydata/xarray/issues/1086,661775197,MDEyOklzc3VlQ29tbWVudDY2MTc3NTE5Nw==,25382032,2020-07-21T10:29:48Z,2020-07-21T10:29:48Z,NONE,"I am running into the same problem. This might be a long shot, but @naught101, do you remember if you managed to convert to a dataframe in a more efficient way? Thanks,","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-259044958,https://api.github.com/repos/pydata/xarray/issues/1086,259044958,MDEyOklzc3VlQ29tbWVudDI1OTA0NDk1OA==,167164,2016-11-08T04:47:56Z,2016-11-08T04:47:56Z,NONE,"Ok, no worries. I'll try it if it gets desperate :) Thanks for your help, shoyer!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-259041491,https://api.github.com/repos/pydata/xarray/issues/1086,259041491,MDEyOklzc3VlQ29tbWVudDI1OTA0MTQ5MQ==,167164,2016-11-08T04:16:26Z,2016-11-08T04:16:26Z,NONE,"So it would be more efficient to concat all of the datasets (subset for the relevant variables), and then just use a single `.to_dataframe()` call on the entire dataset? If so, that would require quite a bit of refactoring on my part, but it could be worth it.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-259033970,https://api.github.com/repos/pydata/xarray/issues/1086,259033970,MDEyOklzc3VlQ29tbWVudDI1OTAzMzk3MA==,167164,2016-11-08T03:14:50Z,2016-11-08T03:14:50Z,NONE,"Yeah, I'm loading each file separately with `xr.open_dataset()`, since it's not really a multi-file dataset (it's a lot of single-site datasets, some of which have different variables, and overlapping time dimensions). I don't think I can avoid loading them separately...","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-259026069,https://api.github.com/repos/pydata/xarray/issues/1086,259026069,MDEyOklzc3VlQ29tbWVudDI1OTAyNjA2OQ==,167164,2016-11-08T02:19:01Z,2016-11-08T02:19:01Z,NONE,"Not easily. Most scripts require several of these datasets (up to 200; the linked one is one of the smallest, and some are up to 10Mb) in a specific directory structure, and rely on a couple of private Python modules. I was just asking because I thought I might have been missing something obvious, but now I guess that isn't the case. Probably not worth spending too much time on this. If it starts becoming a real problem for me, I will try to generate something self-contained that shows the problem. Until then, maybe it's best to assume that xarray/pandas are doing the best they can given the requirements, and close this for now.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-258774196,https://api.github.com/repos/pydata/xarray/issues/1086,258774196,MDEyOklzc3VlQ29tbWVudDI1ODc3NDE5Ng==,167164,2016-11-07T08:30:25Z,2016-11-07T08:30:25Z,NONE,"I loaded it from a NetCDF file. There's an example you can play with at https://dl.dropboxusercontent.com/u/50684199/MitraEFluxnet.1.4_flux.nc","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-258755061,https://api.github.com/repos/pydata/xarray/issues/1086,258755061,MDEyOklzc3VlQ29tbWVudDI1ODc1NTA2MQ==,167164,2016-11-07T06:12:27Z,2016-11-07T06:12:27Z,NONE,"Slightly slower (using `%timeit` in IPython)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079
https://github.com/pydata/xarray/issues/1086#issuecomment-258753366,https://api.github.com/repos/pydata/xarray/issues/1086,258753366,MDEyOklzc3VlQ29tbWVudDI1ODc1MzM2Ng==,167164,2016-11-07T05:56:26Z,2016-11-07T05:56:26Z,NONE,"Squeeze is pretty much identical in efficiency. Seems very slightly better (2-5%) on smaller datasets. (I still need to add the final `[data_vars]` to get rid of the extraneous index_var columns, but that doesn't affect performance much). I'm not calling `pandas.tslib.array_to_timedelta64`; `to_dataframe` is. The caller list is (sorry, I'm not sure of a better way to show this): ![caller_graph](https://cloud.githubusercontent.com/assets/167164/20047700/ce39d18e-a50a-11e6-8fd9-bda2dfe54188.png)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,187608079