issues


1 row where user = 1665346 sorted by updated_at descending

id: 376370028
node_id: MDU6SXNzdWUzNzYzNzAwMjg=
number: 2534
title: to_dataframe() excessive memory usage
user: guygriffiths (1665346)
state: closed
locked: 0
assignee: (none)
milestone: (none)
comments: 3
created_at: 2018-11-01T12:20:39Z
updated_at: 2022-05-01T22:04:51Z
closed_at: 2022-05-01T22:04:43Z
author_association: NONE
active_lock_reason: (none)
draft: (none)
pull_request: (none)

body:

Code Sample, a copy-pastable example if possible

```python
import xarray as xr
from glob import glob

# This refers to a large multi-file NetCDF dataset
file_list = sorted(glob('~/Data///*.nc'))

dataset = xr.open_mfdataset(file_list, decode_times=True, autoclose=True, decode_cf=True, cache=False, concat_dim='time')

# At this point, the total RAM used by the python process is ~1.4G

# Select a timeseries at a single point
# This is near instantaneous and uses no additional memory
ts = dataset.sel({'lat': 10, 'lon': 10}, method='nearest')

# Convert that timeseries to a pandas dataframe.
# This is where the actual data reading happens, and reads the data into memory
df = ts.to_dataframe()

# At this point, the total RAM used by the python process is ~10.5G
```
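
The RAM figures in the comments above refer to the resident memory of the Python process. The following is a minimal sketch of how that measurement can be reproduced; it is not part of the original report, it assumes the optional psutil package is installed, it reuses the ts variable from the block above, and the rss_gb helper is hypothetical.

```python
import os

import psutil  # assumption: psutil is installed; it is not used in the original report


def rss_gb():
    """Resident set size of the current Python process, in gigabytes."""
    return psutil.Process(os.getpid()).memory_info().rss / 1e9


print("before to_dataframe: {:.1f} GB".format(rss_gb()))  # reported as ~1.4G
df = ts.to_dataframe()
print("after to_dataframe:  {:.1f} GB".format(rss_gb()))  # reported as ~10.5G
```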

Problem description

Although the resulting dataframe only holds a single lat/lon point's worth of data, a huge amount of RAM is used. I can get (what appears to be) an identical pandas DataFrame by changing the final line to:

```python
df = (ts * 1.0).to_dataframe()
```

which reduces the total RAM to ~2.2G (i.e. 0.6G of additional RAM for that single line, versus 9G of additional RAM). No type conversion is taking place (i.e. ts and ts * 1.0 have identical data types).
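
As a hedged aside (not part of the original report), the dtype claim can be checked directly, and forcing the dask computation before the pandas conversion is one way to separate the "read the data" step from the "build the DataFrame" step; ts here is the selection from the code sample above.

```python
# Confirm that multiplying by 1.0 leaves every data variable's dtype unchanged.
print({name: da.dtype for name, da in ts.data_vars.items()})
print({name: da.dtype for name, da in (ts * 1.0).data_vars.items()})

# Compute the (small) selection first, then convert. This separates reading
# the data from constructing the DataFrame, which helps show where the extra
# memory is allocated.
ts_computed = ts.compute()
df = ts_computed.to_dataframe()
```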

Expected Output

I would expect to_dataframe() to require the same amount of memory whether or not the dataset is first multiplied by 1.0. I'm aware there could be a good reason for this, but it took me somewhat by surprise.

Output of xr.show_versions()

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-36-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

xarray: 0.10.7
pandas: 0.23.1
numpy: 1.13.3
scipy: 0.17.0
netCDF4: 1.4.1
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: 0.19.0
distributed: None
matplotlib: 1.5.1
cartopy: None
seaborn: 0.8.1
setuptools: 20.7.0
pip: 18.0
conda: None
pytest: None
IPython: 2.4.1
sphinx: None
reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2534/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
performed_via_github_app: (none)
state_reason: completed
repo: xarray (13221727)
type: issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
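
As a hedged illustration of how the row above is retrieved, the sketch below runs the page's filter ("1 row where user = 1665346 sorted by updated_at descending") against the underlying SQLite database using Python's standard sqlite3 module; the github.db filename is an assumption.

```python
import sqlite3

# Assumption: the github-to-sqlite database file is named github.db.
conn = sqlite3.connect("github.db")
conn.row_factory = sqlite3.Row  # rows behave like dictionaries

# Equivalent of "1 row where user = 1665346 sorted by updated_at descending".
rows = conn.execute(
    "SELECT id, number, title, state, updated_at "
    "FROM issues WHERE [user] = ? ORDER BY updated_at DESC",
    (1665346,),
).fetchall()

for row in rows:
    print(dict(row))
```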