issues: 376370028
| id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | performed_via_github_app | state_reason | repo | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 376370028 | MDU6SXNzdWUzNzYzNzAwMjg= | 2534 | to_dataframe() excessive memory usage | 1665346 | closed | 0 | | | 3 | 2018-11-01T12:20:39Z | 2022-05-01T22:04:51Z | 2022-05-01T22:04:43Z | NONE | | | | | completed | 13221727 | issue |

**body**

#### Code Sample, a copy-pastable example if possible

```python
import xarray as xr
from glob import glob

# This refers to a large multi-file NetCDF dataset
file_list = sorted(glob('~/Data///*.nc'))
dataset = xr.open_mfdataset(file_list, decode_times=True, autoclose=True,
                            decode_cf=True, cache=False, concat_dim='time')
# At this point, the total RAM used by the python process is ~1.4G

# Select a timeseries at a single point
# This is near instantaneous and uses no additional memory
ts = dataset.sel({'lat': 10, 'lon': 10}, method='nearest')

# Convert that timeseries to a pandas dataframe.
# This is where the actual data reading happens, and reads the data into memory
df = ts.to_dataframe()
# At this point, the total RAM used by the python process is ~10.5G
```

#### Problem description

Despite the fact that the resulting dataframe only has a single lat/lon point's worth of data, a huge amount of RAM is used. I can get (what appears to be) an identical pandas DataFrame by changing the final line to:

#### Expected Output

I would expect that

#### Output of

**reactions**

```json
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2534/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
```
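The behavior the report expects — that selecting a single point before conversion should yield a frame with only that point's data — can be illustrated with a small, fully in-memory dataset. This sketch is not from the issue: the variable name, sizes, and coordinate values are hypothetical stand-ins for the large multi-file archive, and it does not reproduce the dask-backed lazy-loading path where the excessive memory use was reported.

```python
import numpy as np
import xarray as xr

# Hypothetical small dataset standing in for the large multi-file NetCDF archive:
# 8 time steps on a 4x4 lat/lon grid.
ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"),
              np.arange(8 * 4 * 4, dtype=float).reshape(8, 4, 4))},
    coords={
        "time": np.arange(8),
        "lat": np.linspace(0.0, 30.0, 4),
        "lon": np.linspace(0.0, 30.0, 4),
    },
)

# Nearest-neighbour selection of a single point, as in the report.
# lat/lon collapse to scalar coordinates; only the time dimension remains.
ts = ds.sel({"lat": 10, "lon": 10}, method="nearest")

# Converting the selection yields one row per time step,
# i.e. a single point's worth of data.
df = ts.to_dataframe()
print(len(df))  # 8
```

With a dask-backed dataset opened via `open_mfdataset`, a commonly suggested pattern along these lines is to materialize only the selected subset (e.g. `ts.load()` or `ts.compute()`) before calling `to_dataframe()`, so that at most the touched chunks are read into memory.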