pydata/xarray issue #1385: slow performance with open_mfdataset

Opened 2017-04-26 by user 1197350 (MEMBER) · State: open · Comments: 52 · Last updated: 2024-03-14

We have a dataset stored across multiple netCDF files. We are getting very slow performance with `open_mfdataset`, and I would like to improve this.

Each individual netCDF file looks like this:

```python
%time ds_single = xr.open_dataset('float_trajectories.0000000000.nc')
ds_single
```
```
CPU times: user 14.9 ms, sys: 48.4 ms, total: 63.4 ms
Wall time: 60.8 ms

<xarray.Dataset>
Dimensions:  (npart: 8192000, time: 1)
Coordinates:
  * time     (time) datetime64[ns] 1993-01-01
  * npart    (npart) int32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    z        (time, npart) float32 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
    vort     (time, npart) float32 -9.71733e-10 -9.72858e-10 -9.73001e-10 ...
    u        (time, npart) float32 0.000545563 0.000544884 0.000544204 ...
    v        (time, npart) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    x        (time, npart) float32 180.016 180.047 180.078 180.109 180.141 ...
    y        (time, npart) float32 -79.9844 -79.9844 -79.9844 -79.9844 ...
```

As shown above, a single data file opens in ~60 ms, so naively opening all 49 files in a loop should take on the order of 3 seconds.

When I call `open_mfdataset` on the 49 files (each with a different time coordinate but the same npart), here is what happens:

```python
%time ds = xr.open_mfdataset('*.nc')
ds
```
```
CPU times: user 1min 31s, sys: 25.4 s, total: 1min 57s
Wall time: 2min 4s

<xarray.Dataset>
Dimensions:  (npart: 8192000, time: 49)
Coordinates:
  * npart    (npart) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * time     (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
Data variables:
    z        (time, npart) float64 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
    vort     (time, npart) float64 -9.717e-10 -9.729e-10 -9.73e-10 -9.73e-10 ...
    u        (time, npart) float64 0.0005456 0.0005449 0.0005442 0.0005437 ...
    v        (time, npart) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    x        (time, npart) float64 180.0 180.0 180.1 180.1 180.1 180.2 180.2 ...
    y        (time, npart) float64 -79.98 -79.98 -79.98 -79.98 -79.98 -79.98 ...
```

It takes over 2 minutes to open the dataset. Specifying `concat_dim='time'` does not improve performance.
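To separate the cost of opening the files from the cost of combining them, the two steps can be timed independently. This is an untested sketch; `xr.concat` here is a stand-in for whatever `open_mfdataset` does internally, so it may not hit exactly the same code path:

```python
import glob
import time

import xarray as xr

files = sorted(glob.glob('*.nc'))

# Opening alone should be cheap: ~60 ms per file, so ~3 s for 49 files.
t0 = time.time()
datasets = [xr.open_dataset(f) for f in files]
print('open:   %5.1f s' % (time.time() - t0))

# Concatenation is where the alignment machinery kicks in.
t0 = time.time()
combined = xr.concat(datasets, dim='time')
print('concat: %5.1f s' % (time.time() - t0))
```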

Here is the `%prun` output for the `open_mfdataset` call:

```
         748994 function calls (724222 primitive calls) in 142.160 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       49   62.455    1.275   62.458    1.275 {method 'get_indexer' of 'pandas.index.IndexEngine' objects}
       49   47.207    0.963   47.209    0.963 base.py:1067(is_unique)
      196    7.198    0.037    7.267    0.037 {operator.getitem}
       49    4.632    0.095    4.687    0.096 netCDF4_.py:182(_open_netcdf4_group)
      240    3.189    0.013    3.426    0.014 numeric.py:2476(array_equal)
       98    1.937    0.020    1.937    0.020 {numpy.core.multiarray.arange}
4175/3146    1.867    0.000    9.296    0.003 {numpy.core.multiarray.array}
       49    1.525    0.031  119.144    2.432 alignment.py:251(reindex_variables)
       24    1.065    0.044    1.065    0.044 {method 'cumsum' of 'numpy.ndarray' objects}
       12    1.010    0.084    1.010    0.084 {method 'sort' of 'numpy.ndarray' objects}
5227/4035    0.660    0.000    1.688    0.000 collections.py:50(__init__)
       12    0.600    0.050    3.238    0.270 core.py:2761(insert)
12691/7497   0.473    0.000    0.875    0.000 indexing.py:363(shape)
   110728    0.425    0.000    0.663    0.000 {isinstance}
       12    0.413    0.034    0.413    0.034 {method 'flatten' of 'numpy.ndarray' objects}
       12    0.341    0.028    0.341    0.028 {numpy.core.multiarray.where}
        2    0.333    0.166    0.333    0.166 {pandas._join.outer_join_indexer_int64}
        1    0.331    0.331  142.164  142.164 <string>:1(<module>)
```

It looks like most of the time is being spent in `reindex_variables`. I understand why this happens: xarray needs to make sure the indexes are the same across files in order to concatenate them together.
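In this case, though, the check is redundant, because every file carries an identical npart index. A quick way to verify that directly (an untested sketch, reusing the same glob pattern as above):

```python
import glob

import numpy as np
import xarray as xr

files = sorted(glob.glob('*.nc'))
with xr.open_dataset(files[0]) as first:
    npart0 = first['npart'].values

# If no assertion fires, the per-file comparison/reindex along npart
# inside open_mfdataset is pure overhead for this dataset.
for f in files[1:]:
    with xr.open_dataset(f) as ds:
        assert np.array_equal(npart0, ds['npart'].values), f
```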

Is there any obvious way I could improve the load time? For example, can I give a hint to xarray that this `reindex_variables` step is not necessary, since I know that the npart index is identical in every file?
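To illustrate the kind of hint I mean, a hypothetical call might look like this (the `compat` and `join` values shown are illustrative, not existing `open_mfdataset` options):

```python
import xarray as xr

# Hypothetical API: take the npart index from the first file and skip
# the per-file equality check and reindex entirely.
ds = xr.open_mfdataset(
    '*.nc',
    concat_dim='time',
    compat='override',  # hypothetical: trust that non-concatenated coords match
    join='override',    # hypothetical: skip alignment along npart
)
```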

Possibly related to #1301 and #1340.

