issues: 376370028
field | value |
---|---|
id | 376370028 |
node_id | MDU6SXNzdWUzNzYzNzAwMjg= |
number | 2534 |
title | to_dataframe() excessive memory usage |
user | 1665346 |
state | closed |
locked | 0 |
assignee | |
milestone | |
comments | 3 |
created_at | 2018-11-01T12:20:39Z |
updated_at | 2022-05-01T22:04:51Z |
closed_at | 2022-05-01T22:04:43Z |
author_association | NONE |
active_lock_reason | |
draft | |
pull_request | |
body | (see below) |
reactions | (see below) |
performed_via_github_app | |
state_reason | completed |
repo | 13221727 |
type | issue |

body:

#### Code Sample, a copy-pastable example if possible

```python
import xarray as xr
from glob import glob

# This refers to a large multi-file NetCDF dataset
# (the wildcards in this path appear to have been eaten by markdown
# rendering; '~/Data/*/*/*.nc' is the most likely original)
file_list = sorted(glob('~/Data/*/*/*.nc'))
dataset = xr.open_mfdataset(file_list, decode_times=True, autoclose=True,
                            decode_cf=True, cache=False, concat_dim='time')

# At this point, the total RAM used by the python process is ~1.4G

# Select a timeseries at a single point
# This is near instantaneous and uses no additional memory
ts = dataset.sel({'lat': 10, 'lon': 10}, method='nearest')

# Convert that timeseries to a pandas dataframe.
# This is where the actual data reading happens, and reads the data into memory
df = ts.to_dataframe()

# At this point, the total RAM used by the python process is ~10.5G
```

#### Problem description

Despite the fact that the resulting dataframe only has a single lat/lon point's worth of data, a huge amount of RAM is used. I can get (what appears to be) an identical pandas DataFrame by changing the final line to: …
#### Expected Output

I would expect that …

#### Output of `xr.show_versions()`

…
{ "url": "https://api.github.com/repos/pydata/xarray/issues/2534/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | 13221727 | issue |
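The alternative final line the reporter mentions is truncated in this export. A minimal sketch of one pattern along those lines, assuming the same lazily opened dataset: explicitly realize the single-point selection before converting, so `to_dataframe()` only sees the small in-memory subset. The path here is a placeholder, and the use of `Dataset.load()` is illustrative rather than the reporter's confirmed workaround.

```python
import xarray as xr
from glob import glob

# Illustrative only: the reporter's actual replacement line is truncated above.
# Open the multi-file dataset lazily (dask-backed), as in the report.
file_list = sorted(glob('/path/to/data/*/*/*.nc'))  # placeholder path
dataset = xr.open_mfdataset(file_list)

# Selection on a dask-backed dataset stays lazy: no data is read yet.
ts = dataset.sel({'lat': 10, 'lon': 10}, method='nearest')

# Realize just the selected point, then convert. Dataset.load() (and its
# non-mutating counterpart Dataset.compute()) are standard xarray methods;
# only the chunks covering this point should be read, rather than the whole
# dataset being pulled into memory by to_dataframe().
df = ts.load().to_dataframe()
```

Whether this actually reduces peak memory depends on how the underlying files are chunked; it is a sketch of the pattern, not a verified fix for the reported behavior.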