
pydata/xarray issue #2039: open_mfdataset: skip loading for indexes and coordinates from all but the first file

State: open · Author association: MEMBER · Comments: 1 · Created: 2018-04-05 · Updated: 2021-01-27

This is a follow-up from #1521.

When invoking open_mfdataset, the user very frequently knows in advance that all of their coords that don't lie on concat_dim are already aligned, and may be willing to blindly trust that assumption in exchange for a huge performance boost.

My production data: 200 NetCDF files on a slow NFS file system, concatenated on the "scenario" dimension:

```
%time xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
    currency     (instr_id) object 'ZAR' 'EUR' 'EUR' 'EUR' 'EUR' 'EUR' 'GBP' ...
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    type         (instr_id) object 'American' 'Bond Future' 'Bond Future' ...
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 19.6 s, sys: 981 ms, total: 20.6 s
Wall time: 24.4 s
```

If I skip loading and comparing the non-index coords from all 200 files:

```
%time xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario', coords='all')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    currency     (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
    type         (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 12.7 s, sys: 305 ms, total: 13 s
Wall time: 14.8 s
```

If I also skip loading and comparing the index coords from all 200 files:

```
%time xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario', drop_variables=['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type'])

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Dimensions without coordinates: attribute, fx_id, instr_id, timestep
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 7.31 s, sys: 61 ms, total: 7.37 s
Wall time: 9.05 s
```

Proposed design

Add a new optional parameter to open_mfdataset, assume_aligned=None. It can be set to a list of variable names or to "all", and requires concat_dim to be explicitly set. It causes open_mfdataset to load the first occurrence of every such variable and blindly skip loading the subsequent ones.
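For illustration, a call with the proposed parameter could look like this (assume_aligned does not exist in xarray today; it is the parameter proposed here, and the variable names are the ones from the example above):

```
cube = xarray.open_mfdataset(
    'cube.*.nc',
    engine='h5netcdf',
    concat_dim='scenario',
    # Proposed parameter (not an existing xarray argument): trust that
    # these variables are identical across files and load them only once.
    assume_aligned=['currency', 'type'],  # or assume_aligned='all'
)
```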

Algorithm

  1. Perform the first invocation of the underlying open_dataset exactly as happens now.
  2. if assume_aligned is not None: for each new NetCDF file, figure out which variables would need to be aligned & compared (as opposed to concatenated), and add them to a drop_variables list.
  3. if assume_aligned != "all": drop_variables &= assume_aligned
  4. Pass the increasingly long drop_variables list to the underlying open_dataset (see the sketch after this list).
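
A rough sketch of the algorithm in pure Python, assuming the hypothetical assume_aligned parameter; the function name and the final concat call are illustrative, not the actual implementation:

```
from glob import glob
import xarray

def open_mfdataset_assume_aligned(pattern, concat_dim,
                                  assume_aligned=None, **kwargs):
    """Sketch: load each variable assumed aligned from the first file only."""
    drop = set()       # grows as more files are seen (step 4)
    datasets = []
    for path in sorted(glob(pattern)):
        # Steps 1 and 4: open the file, skipping variables already seen.
        ds = xarray.open_dataset(path, drop_variables=drop, **kwargs)
        datasets.append(ds)
        if assume_aligned is not None:
            # Step 2: variables that don't lie on concat_dim would be
            # aligned & compared rather than concatenated.
            aligned = {name for name, var in ds.variables.items()
                       if concat_dim not in var.dims}
            # Step 3: restrict to the user-given names unless "all".
            if assume_aligned != 'all':
                aligned &= set(assume_aligned)
            drop |= aligned
    # Dropped variables survive from the first file only; a real
    # implementation would merge them back into the concatenated result.
    return xarray.concat(datasets, dim=concat_dim)
```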
