issues: 252541496


id: 252541496
node_id: MDU6SXNzdWUyNTI1NDE0OTY=
number: 1521
title: open_mfdataset reads coords from disk multiple times
user: 6213168
state: closed
locked: 0
comments: 14
created_at: 2017-08-24T09:29:57Z
updated_at: 2017-10-09T21:15:31Z
closed_at: 2017-10-09T21:15:31Z
author_association: MEMBER

I have 200x of the below dataset, split on the 'scenario' axis:

```
<xarray.Dataset>
Dimensions:      (fx_id: 39, instr_id: 16095, scenario: 2501)
Coordinates:
    currency     (instr_id) object 'GBP' 'USD' 'GBP' 'GBP' 'GBP' 'EUR' 'CHF' ...
  * fx_id        (fx_id) object 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' 'CAD' ...
  * instr_id     (instr_id) object 'property_standard_gbp' ...
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
    type         (instr_id) object 'Common Stock' 'Fixed Amortizing Bond' ...
Data variables:
    fx_rates     (fx_id, scenario) float64 1.236 1.191 1.481 1.12 1.264 ...
    instruments  (instr_id, scenario) float64 1.0 1.143 0.9443 1.013 1.176 ...
Attributes:
    base_currency: GBP
```

I individually dump them to disk with Dataset.to_netcdf(fname, engine='h5netcdf'). Then I try loading them back up with open_mfdataset, but it is painfully slow:

```
%%time
xarray.open_mfdataset('*.nc', engine='h5netcdf')

Wall time: 30.3 s
```

The problem is caused by the coords being read from disk multiple times. Workaround:

```
%%time
def load_coords(ds):
    for coord in ds.coords.values():
        coord.load()
    return ds

xarray.open_mfdataset('*.nc', engine='h5netcdf', preprocess=load_coords)

Wall time: 12.3 s
```

Proposed solutions:

1. Implement the above workaround directly inside open_mfdataset().
2. Change open_dataset() to always eagerly load the coords to memory, regardless of the chunks parameter. Is there any valid use case where lazy coords are actually desirable?
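Solution 1 amounts to baking the preprocess hook in by default. A minimal sketch of what that could look like as a user-side wrapper (the name open_mfdataset_eager_coords is hypothetical, not xarray API):

```python
import xarray as xr

def load_coords(ds):
    # Eagerly pull every coordinate into memory, so that the alignment
    # step inside open_mfdataset compares in-memory arrays instead of
    # re-reading the same coords from disk for each pairwise comparison.
    for coord in ds.coords.values():
        coord.load()
    return ds

def open_mfdataset_eager_coords(paths, **kwargs):
    # Hypothetical wrapper: open_mfdataset with the workaround on by default.
    return xr.open_mfdataset(paths, preprocess=load_coords, **kwargs)
```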

An additional, more radical observation is that, very frequently, a user knows in advance that all coords are aligned. In this use case, the user could explicitly request xarray to blindly trust this assumption, and thus skip loading the coords not based on concat_dim in all datasets beyond the first.
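That "trust the coords" mode could behave roughly like the helper below, which drops every coord not depending on the concat dimension, concatenates, and then copies those coords back from the first dataset so they are read only once. The function name and the toy two-scenario data are invented for illustration; this is a sketch, not xarray API:

```python
import numpy as np
import xarray as xr

def concat_trusting_first(datasets, dim):
    # Blindly assume that coords not based on `dim` are aligned across all
    # datasets: drop them everywhere so xarray never compares them, concat,
    # then restore them from the first dataset.
    first = datasets[0]
    shared = [name for name, coord in first.coords.items() if dim not in coord.dims]
    trimmed = [ds.drop_vars(shared) for ds in datasets]
    out = xr.concat(trimmed, dim=dim)
    return out.assign_coords({name: first[name] for name in shared})

# Toy stand-ins for two single-scenario files
ds1 = xr.Dataset(
    {"fx_rates": (("fx_id", "scenario"), np.ones((2, 1)))},
    coords={"fx_id": ["USD", "EUR"], "scenario": ["Base Scenario"]},
)
ds2 = ds1.assign_coords(scenario=["SSMC_1"])
combined = concat_trusting_first([ds1, ds2], dim="scenario")
```

The trade-off is exactly the one described above: misaligned coords go silently undetected, so this only makes sense when the user explicitly opts in.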

state_reason: completed · repo: 13221727 · type: issue

Links from other tables

  • 2 rows from issues_id in issues_labels
  • 14 rows from issue in issue_comments
Powered by Datasette · About: xarray-datasette