issues: 1384226112

id node_id number title user state locked assignee milestone comments created_at updated_at closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1384226112 I_kwDOAMm_X85SgZ1A 7075 Convert xarray dataset to pandas dataframe is much slower in newest xarray version 20794996 closed 0     4 2022-09-23T19:36:28Z 2023-10-14T20:37:40Z 2023-10-14T20:37:40Z NONE      

What is your issue?

Converting an xarray dataset to a pandas DataFrame has become much slower in the newest xarray version.

I want to read in very large NetCDF files, extract a slice, and convert the slice to a pandas DataFrame. For an input size of 2 GB, xarray 0.21.0 takes 3 seconds, whereas xarray 2022.6.0 takes 44 seconds. See the table below for more tests with increasing xarray dataset size.

| Number of NetCDF input files in xarray dataset (~1 GB per file) | 2 | 5 | 10 | 15 | 20 | 30 | 40 |
| -- | -- | -- | -- | -- | -- | -- | -- |
| Older xarray version 0.21.0 | 0:03 | 0:02 | 0:04 | 0:06 | 0:09 | 0:13 | 0:17 |
| Newer xarray version 2022.6.0 | 0:44 | 1:30 | 2:46 | 4:01 | 5:23 | 7:56 | 10:29 |

Times are in minutes:seconds.
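To make the size of the regression easier to read off, the timings in the table can be converted to seconds and divided. The short sketch below does only that arithmetic on the numbers reported above; the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(mm_ss: str) -> int:
    """Convert a 'minutes:seconds' string such as '4:01' to whole seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


# Values copied from the table in this issue.
n_files = [2, 5, 10, 15, 20, 30, 40]
old_times = ["0:03", "0:02", "0:04", "0:06", "0:09", "0:13", "0:17"]
new_times = ["0:44", "1:30", "2:46", "4:01", "5:23", "7:56", "10:29"]

# Slowdown factor = time with xarray 2022.6.0 / time with xarray 0.21.0.
slowdowns = {
    n: to_seconds(new) / to_seconds(old)
    for n, old, new in zip(n_files, old_times, new_times)
}

for n, factor in slowdowns.items():
    print(f"{n:>2} files: {factor:.1f}x slower")
```

By these numbers the newer version is roughly 15x slower for 2 files and between about 35x and 45x slower for 5 or more files.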

Here is my code:

```
import xarray as xr

# infile_list is the list of NetCDF file paths (one file per year).
# Read in a list of netcdf files and combine into a single dataset.
with xr.open_mfdataset(infile_list, combine='by_coords') as ds:

    # Extract the data for a single location (the nearest grid point) using the provided coordinates (lat/lon).
    ds_slice = ds.sel(lon=-84.725, lat=42.3583, method='nearest')

    # Convert xarray dataset to a pandas dataframe.
    # This is now the slow part since the xarray library was updated.
    df = ds_slice.to_dataframe()
```

The NetCDF files I am reading in are about 1 GB each and contain daily weather data for the entire CONUS. There is one file per year, so if I read in 2 files, the dimensions are (lon: 1386, lat: 585, day: 731, crs: 1), with coordinates lon, lat, day, and crs. The files include 8 float data variables.
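For anyone trying to narrow this down without the original files, here is a rough timing sketch on synthetic data. It is an assumption-laden stand-in, not a reproduction: the grid is much smaller, the variable names `var_a` and `var_b` are invented, the `crs` dimension is omitted, and the data is held in memory rather than opened lazily with `open_mfdataset`. It times `load()` and `to_dataframe()` separately, which may help show whether the time goes into reading the slice or into the conversion itself.

```python
import time

import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the real files (sizes and names are illustrative only).
days = pd.date_range("2020-01-01", periods=731, freq="D")
lon = np.linspace(-125.0, -67.0, 20)
lat = np.linspace(25.0, 49.0, 10)
shape = (days.size, lat.size, lon.size)
rng = np.random.default_rng(0)

ds = xr.Dataset(
    {
        "var_a": (("day", "lat", "lon"), rng.random(shape, dtype=np.float32)),
        "var_b": (("day", "lat", "lon"), rng.random(shape, dtype=np.float32)),
    },
    coords={"day": days, "lat": lat, "lon": lon},
)

# Same selection as in the issue: nearest grid point to one lat/lon pair.
ds_slice = ds.sel(lon=-84.725, lat=42.3583, method="nearest")

# Time pulling the slice into memory (a no-op here, since the data is not lazy).
start = time.perf_counter()
ds_loaded = ds_slice.load()
load_seconds = time.perf_counter() - start

# Time the conversion on its own.
start = time.perf_counter()
df = ds_loaded.to_dataframe()
convert_seconds = time.perf_counter() - start

print(f"xarray {xr.__version__}: load {load_seconds:.3f}s, convert {convert_seconds:.3f}s")
print(df.shape)
```

With the real files, the same two timers could be wrapped around `ds_slice.load()` and `to_dataframe()` inside the `with` block. Whether loading first changes the total time is something to measure, not something this sketch establishes.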

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7075/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  not_planned 13221727 issue