pydata/xarray issue #1385: slow performance with open_mfdataset

Opened 2017-04-26 by user 1197350 (MEMBER) · State: open · Comments: 52 · Last updated: 2024-03-14

We have a dataset stored across multiple netCDF files. We are getting very slow performance with `open_mfdataset`, and I would like to improve this.

Each individual netCDF file looks like this:

```python
%time ds_single = xr.open_dataset('float_trajectories.0000000000.nc')
ds_single
```
```
CPU times: user 14.9 ms, sys: 48.4 ms, total: 63.4 ms
Wall time: 60.8 ms

<xarray.Dataset>
Dimensions:  (npart: 8192000, time: 1)
Coordinates:
  * time     (time) datetime64[ns] 1993-01-01
  * npart    (npart) int32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    z        (time, npart) float32 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
    vort     (time, npart) float32 -9.71733e-10 -9.72858e-10 -9.73001e-10 ...
    u        (time, npart) float32 0.000545563 0.000544884 0.000544204 ...
    v        (time, npart) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    x        (time, npart) float32 180.016 180.047 180.078 180.109 180.141 ...
    y        (time, npart) float32 -79.9844 -79.9844 -79.9844 -79.9844 ...
```

As shown above, a single data file opens in ~60 ms, so naively opening all 49 files in a loop should take on the order of 3 seconds.

When I call `open_mfdataset` on the 49 files (each with a different time coordinate but the same npart), here is what happens:

```python
%time ds = xr.open_mfdataset('*.nc')
ds
```
```
CPU times: user 1min 31s, sys: 25.4 s, total: 1min 57s
Wall time: 2min 4s

<xarray.Dataset>
Dimensions:  (npart: 8192000, time: 49)
Coordinates:
  * npart    (npart) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * time     (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
Data variables:
    z        (time, npart) float64 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
    vort     (time, npart) float64 -9.717e-10 -9.729e-10 -9.73e-10 -9.73e-10 ...
    u        (time, npart) float64 0.0005456 0.0005449 0.0005442 0.0005437 ...
    v        (time, npart) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    x        (time, npart) float64 180.0 180.0 180.1 180.1 180.1 180.2 180.2 ...
    y        (time, npart) float64 -79.98 -79.98 -79.98 -79.98 -79.98 -79.98 ...
```

It takes over 2 minutes to open the dataset. Specifying `concat_dim='time'` does not improve performance.
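To separate the cost of opening the files from the cost of combining them, the two steps can be timed independently. This is an untested sketch; `xr.concat` here is a stand-in for whatever `open_mfdataset` does internally, so it may not hit exactly the same code path:

```python
import glob
import time

import xarray as xr

files = sorted(glob.glob('*.nc'))

# Opening alone should be cheap: ~60 ms per file, so ~3 s for 49 files.
t0 = time.time()
datasets = [xr.open_dataset(f) for f in files]
print('open:   %5.1f s' % (time.time() - t0))

# Concatenation is where the alignment machinery kicks in.
t0 = time.time()
combined = xr.concat(datasets, dim='time')
print('concat: %5.1f s' % (time.time() - t0))
```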

Here is the `%prun` output for the `open_mfdataset` call:

```
         748994 function calls (724222 primitive calls) in 142.160 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       49   62.455    1.275   62.458    1.275 {method 'get_indexer' of 'pandas.index.IndexEngine' objects}
       49   47.207    0.963   47.209    0.963 base.py:1067(is_unique)
      196    7.198    0.037    7.267    0.037 {operator.getitem}
       49    4.632    0.095    4.687    0.096 netCDF4_.py:182(_open_netcdf4_group)
      240    3.189    0.013    3.426    0.014 numeric.py:2476(array_equal)
       98    1.937    0.020    1.937    0.020 {numpy.core.multiarray.arange}
4175/3146    1.867    0.000    9.296    0.003 {numpy.core.multiarray.array}
       49    1.525    0.031  119.144    2.432 alignment.py:251(reindex_variables)
       24    1.065    0.044    1.065    0.044 {method 'cumsum' of 'numpy.ndarray' objects}
       12    1.010    0.084    1.010    0.084 {method 'sort' of 'numpy.ndarray' objects}
5227/4035    0.660    0.000    1.688    0.000 collections.py:50(__init__)
       12    0.600    0.050    3.238    0.270 core.py:2761(insert)
12691/7497   0.473    0.000    0.875    0.000 indexing.py:363(shape)
   110728    0.425    0.000    0.663    0.000 {isinstance}
       12    0.413    0.034    0.413    0.034 {method 'flatten' of 'numpy.ndarray' objects}
       12    0.341    0.028    0.341    0.028 {numpy.core.multiarray.where}
        2    0.333    0.166    0.333    0.166 {pandas._join.outer_join_indexer_int64}
        1    0.331    0.331  142.164  142.164 <string>:1(<module>)
```

It looks like most of the time is being spent in `reindex_variables`. I understand why this happens: xarray needs to make sure the indexes are the same across files in order to concatenate them together.
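In this case, though, the check is redundant, because every file carries an identical npart index. A quick way to verify that directly (an untested sketch, reusing the same glob pattern as above):

```python
import glob

import numpy as np
import xarray as xr

files = sorted(glob.glob('*.nc'))
with xr.open_dataset(files[0]) as first:
    npart0 = first['npart'].values

# If no assertion fires, the per-file comparison/reindex along npart
# inside open_mfdataset is pure overhead for this dataset.
for f in files[1:]:
    with xr.open_dataset(f) as ds:
        assert np.array_equal(npart0, ds['npart'].values), f
```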

Is there any obvious way I could improve the load time? For example, can I give a hint to xarray that this `reindex_variables` step is not necessary, since I know that the npart index is identical in every file?
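To illustrate the kind of hint I mean, a hypothetical call might look like this (the `compat` and `join` values shown are illustrative, not existing `open_mfdataset` options):

```python
import xarray as xr

# Hypothetical API: take the npart index from the first file and skip
# the per-file equality check and reindex entirely.
ds = xr.open_mfdataset(
    '*.nc',
    concat_dim='time',
    compat='override',  # hypothetical: trust that non-concatenated coords match
    join='override',    # hypothetical: skip alignment along npart
)
```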

Possibly related to #1301 and #1340.

