Issue #1086: Is there a more efficient way to convert a subset of variables to a dataframe?

id: 187608079
node_id: MDU6SXNzdWUxODc2MDgwNzk=
user: 167164
state: closed
locked: 0
comments: 21
created_at: 2016-11-07T01:43:20Z
updated_at: 2023-12-15T20:47:53Z
closed_at: 2023-12-15T20:47:53Z
author_association: NONE

I have the following chunk of code that gets used a lot in my scripts:

```{python}
# /data/documents/uni/phd/projects/pals_utils/pals_utils/data.py, in pals_xr_to_df()

# TODO: This is not suitable for gridded datasets:
index_vars = {v: dataset.coords[v].values[0] for v in index_vars}
df = dataset.sel(**index_vars)[data_vars].to_dataframe()[data_vars]

if qc:
    ...  # remainder of the function is not shown in the debugger listing
```

It extracts a few data_vars from a dataset and converts them to a dataframe, limiting the axes to a single grid cell (this particular data only has one location anyway). The first [data_vars] call massively improves the efficiency (by dropping most variables before converting to a dataframe); the second one gets rid of the x, y, and z columns in the dataframe. (Side issue: it would be nice to have a drop_dims= option in .to_dataframe that dropped all dimensions of length 1.)
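For illustration, the "drop all dimensions of length 1" behaviour described above can be approximated with `Dataset.squeeze(drop=True)` before calling `.to_dataframe()`. This is only a sketch on a small synthetic dataset shaped like the one in this issue; the helper name `subset_to_df` is hypothetical and not part of xarray or pals_utils, and it assumes the time dimension is longer than 1 (otherwise it would be squeezed away too).

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic stand-in for the flux tower dataset (values are made up).
times = pd.date_range("2002-01-01 00:30", periods=5, freq="30min")
ds = xr.Dataset(
    {
        "Qle": (("time", "y", "x"), np.arange(5.0).reshape(5, 1, 1)),
        "Qh": (("time", "y", "x"), np.arange(5.0, 10.0).reshape(5, 1, 1)),
        "latitude": (("y", "x"), [[-35.66]]),
    },
    coords={"time": times, "x": [1.0], "y": [1.0], "z": [1.0]},
)


def subset_to_df(dataset, data_vars):
    """Hypothetical helper: convert selected variables to a DataFrame."""
    # Keep only the requested variables before any conversion work is done.
    subset = dataset[data_vars]
    # Remove every length-1 dimension and drop its coordinate as well.
    subset = subset.squeeze(drop=True)
    # Only the time dimension is left, so the frame is indexed by time.
    return subset.to_dataframe()[data_vars]


df = subset_to_df(ds, ["Qle"])
print(df.head())
```

Because the length-1 coordinates are dropped before the conversion, there are no x, y, or z columns to remove afterwards; the final `[data_vars]` only fixes the column order.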

Here's an example of it in use:

```{python}
ipdb> index_vars
{'y': 1.0, 'x': 1.0, 'z': 1.0}

ipdb> data_vars
['Qle']

ipdb> dataset
<xarray.Dataset>
Dimensions:           (time: 70128, x: 1, y: 1, z: 1)
Coordinates:
  * x                 (x) float64 1.0
  * y                 (y) float64 1.0
  * time              (time) datetime64[ns] 2002-01-01T00:30:00 ...
  * z                 (z) float64 1.0
Data variables:
    latitude          (y, x) float64 -35.66
    longitude         (y, x) float64 148.2
    elevation         (y, x) float64 1.2e+03
    reference_height  (y, x) float64 70.0
    NEE               (time, y, x) float64 1.597 1.651 1.691 1.735 1.778 ...
    Qh                (time, y, x) float64 -26.11 -25.99 -25.89 -25.78 ...
    Qle               (time, y, x) float64 5.892 5.898 5.864 5.826 5.788 ...
Attributes:
    Production_time: 2012-09-27 12:44:42
    Production_source: PALS automated netcdf conversion
    PALS_fluxtower_template_version: 1.0.2
    PALS_dataset_name: TumbaFluxnet
    PALS_dataset_version: 1.4
    Contact: palshelp@gmail.com

ipdb> dataset.sel(**index_vars)[data_vars].to_dataframe()[data_vars].head()
                          Qle
time
2002-01-01 00:30:00  5.891888
2002-01-01 01:00:00  5.898049
2002-01-01 01:30:00  5.863696
2002-01-01 02:00:00  5.825712
2002-01-01 02:30:00  5.787727
```

This particular line of code eventually calls pandas.tslib.array_to_timedelta64, which takes up a significant chunk of my script's run time. My line of code doesn't look like it's the best way to do things, and I'm wondering if there's any way to get the same resulting data that's more efficient. Any help would be greatly appreciated.
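One possible alternative, sketched here under assumptions rather than offered as a confirmed fix, is to skip `.to_dataframe()` and build the frame directly from the underlying NumPy arrays. The helper name `subset_to_df_direct` and the `time_dim` parameter are hypothetical; the sketch assumes every requested variable has only length-1 dimensions besides time, and whether it is actually faster would need to be profiled on the real data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic stand-in for the flux tower dataset (values are made up).
times = pd.date_range("2002-01-01 00:30", periods=5, freq="30min")
ds = xr.Dataset(
    {
        "Qle": (("time", "y", "x"), np.arange(5.0).reshape(5, 1, 1)),
        "Qh": (("time", "y", "x"), np.arange(5.0, 10.0).reshape(5, 1, 1)),
    },
    coords={"time": times, "x": [1.0], "y": [1.0], "z": [1.0]},
)


def subset_to_df_direct(dataset, data_vars, time_dim="time"):
    """Hypothetical helper: build a DataFrame from raw arrays, single grid cell only."""
    # Reuse the pandas index that backs the time coordinate.
    index = dataset[time_dim].to_index()
    # Flatten each (time, 1, 1) array to 1-D; this is only valid when all
    # non-time dimensions have length 1, as in the dataset from this issue.
    columns = {name: dataset[name].values.reshape(-1) for name in data_vars}
    return pd.DataFrame(columns, index=index)


df_direct = subset_to_df_direct(ds, ["Qle", "Qh"])
print(df_direct.head())
```

For gridded data with more than one location this shortcut does not apply, since flattening would mix grid cells into a single column.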

state_reason: completed
repo: 13221727
type: issue
