Issue #1086: Is there a more efficient way to convert a subset of variables to a dataframe?

State: closed (completed) · Opened: 2016-11-07 · Closed: 2023-12-15 · Comments: 21

I have the following chunk of code that gets used a lot in my scripts:

```{python}
> /data/documents/uni/phd/projects/pals_utils/pals_utils/data.py(291)pals_xr_to_df()
    289         # TODO: This is not suitable for gridded datasets:
    290         index_vars = {v: dataset.coords[v].values[0] for v in index_vars}
1-> 291         df = dataset.sel(**index_vars)[data_vars].to_dataframe()[data_vars]
    292
    293         if qc:
```

It basically extracts a few data_vars from a dataset and converts them to a dataframe, limiting the axes to a single grid cell (this particular dataset only has one location anyway). The first `[data_vars]` call massively improves efficiency by dropping most variables before the conversion to a dataframe; the second one gets rid of the x, y, and z columns in the dataframe. (Side issue: it would be nice to have a `drop_dims=` option in `.to_dataframe()` that dropped all dimensions of length 1.)
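To make the pattern concrete, here is a minimal, self-contained sketch of the same two-step indexing on a small synthetic single-cell dataset (the variable names mirror the snippet above; the data itself is made up):

```{python}
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical single-grid-cell dataset standing in for the real PALS file.
time = pd.date_range("2002-01-01 00:30", periods=5, freq="30min")
ds = xr.Dataset(
    {
        "Qle": (("time", "y", "x"), np.arange(5.0).reshape(5, 1, 1)),
        "Qh": (("time", "y", "x"), np.arange(5.0, 10.0).reshape(5, 1, 1)),
        "latitude": (("y", "x"), [[-35.66]]),
    },
    coords={"time": time, "x": [1.0], "y": [1.0]},
)

index_vars = {"x": 1.0, "y": 1.0}
data_vars = ["Qle"]

# First [data_vars]: drop unwanted variables before converting.
# Second [data_vars]: drop the scalar x/y coordinate columns afterwards.
df = ds.sel(**index_vars)[data_vars].to_dataframe()[data_vars]
```

After the second `[data_vars]`, `df` is indexed by `time` and contains only the `Qle` column, matching the output shown in the ipdb session below.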

Here's an example of it in use:

```{python}
ipdb> index_vars
{'y': 1.0, 'x': 1.0, 'z': 1.0}

ipdb> data_vars
['Qle']

ipdb> dataset
<xarray.Dataset>
Dimensions:           (time: 70128, x: 1, y: 1, z: 1)
Coordinates:
  * x                 (x) float64 1.0
  * y                 (y) float64 1.0
  * time              (time) datetime64[ns] 2002-01-01T00:30:00 ...
  * z                 (z) float64 1.0
Data variables:
    latitude          (y, x) float64 -35.66
    longitude         (y, x) float64 148.2
    elevation         (y, x) float64 1.2e+03
    reference_height  (y, x) float64 70.0
    NEE               (time, y, x) float64 1.597 1.651 1.691 1.735 1.778 ...
    Qh                (time, y, x) float64 -26.11 -25.99 -25.89 -25.78 ...
    Qle               (time, y, x) float64 5.892 5.898 5.864 5.826 5.788 ...
Attributes:
    Production_time: 2012-09-27 12:44:42
    Production_source: PALS automated netcdf conversion
    PALS_fluxtower_template_version: 1.0.2
    PALS_dataset_name: TumbaFluxnet
    PALS_dataset_version: 1.4
    Contact: palshelp@gmail.com

ipdb> dataset.sel(**index_vars)[data_vars].to_dataframe()[data_vars].head()
                          Qle
time
2002-01-01 00:30:00  5.891888
2002-01-01 01:00:00  5.898049
2002-01-01 01:30:00  5.863696
2002-01-01 02:00:00  5.825712
2002-01-01 02:30:00  5.787727
```

This particular line of code eventually calls `pandas.tslib.array_to_timedelta64`, which takes up a significant chunk of my script's run time. My line doesn't look like the best way to do things, and I'm wondering if there's a more efficient way to get the same resulting data. Any help would be greatly appreciated.
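One variant that might avoid the double indexing (a sketch only, not a verified speed-up): subset the variables before selecting, and pass `drop=True` to `.sel()` so the scalar x/y coordinates never reach the dataframe at all. The dataset here is synthetic; whether this actually sidesteps the `array_to_timedelta64` cost would need profiling:

```{python}
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical single-cell dataset standing in for the PALS file above.
time = pd.date_range("2002-01-01 00:30", periods=5, freq="30min")
ds = xr.Dataset(
    {
        "Qle": (("time", "y", "x"), np.arange(5.0).reshape(5, 1, 1)),
        "Qh": (("time", "y", "x"), np.arange(5.0, 10.0).reshape(5, 1, 1)),
    },
    coords={"time": time, "x": [1.0], "y": [1.0]},
)

data_vars = ["Qle"]
index_vars = {"x": 1.0, "y": 1.0}

# Subset variables first, then select with drop=True: the scalar x/y
# coordinates are dropped during selection, so no second [data_vars]
# indexing is needed to remove them from the dataframe.
df = ds[data_vars].sel(drop=True, **index_vars).to_dataframe()
```

The result is the same time-indexed single-column dataframe as before, built with one indexing pass instead of two.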

