
Issue #8778: Stricter defaults for concat, combine, open_mfdataset

  • id: 2149485914
  • node_id: I_kwDOAMm_X86AHo1a
  • user: 2448579
  • state: open
  • locked: 0
  • comments: 2
  • created_at: 2024-02-22T16:43:38Z
  • updated_at: 2024-02-23T04:17:40Z
  • author_association: MEMBER

Is your feature request related to a problem?

The defaults for concat, combine, and open_mfdataset are excessively permissive: data_vars="all", coords="different", compat="no_conflicts", join="outer". This comment illustrates why the result can be hard to predict or understand: a seemingly unrelated option, decode_cf, controls whether a variable ends up in data_vars or coords, and so can result in wildly different concatenation behaviour.
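As a rough illustration of that coupling, here is a minimal sketch, with made-up variable names ("temp", "lon"), of how decode_cf moves a variable between data_vars and coords and therefore changes which of the rules below governs it:

```python
import xarray as xr

# Hypothetical file contents: "lon" is a plain variable, but the CF "coordinates"
# attribute on "temp" promotes it to a coordinate once decoded.
raw = xr.Dataset(
    {
        "temp": ("x", [1.0, 2.0], {"coordinates": "lon"}),
        "lon": ("x", [10.0, 20.0]),
    }
)
decoded = xr.decode_cf(raw)

print(list(raw.data_vars))   # ['temp', 'lon'] -> "lon" is governed by data_vars="all"
print(list(decoded.coords))  # ['lon']         -> "lon" is governed by coords="different"
```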

  1. data_vars="all" always concatenates data variables along concat_dim, even if they did not have that dimension to begin with.
  2. coords="different" means that if the same coordinate variable exists in different datasets/files, the copies are sequentially compared for equality to decide whether they get concatenated.
  3. join="outer" (applied along all dimensions that are not concat_dim) can result in very large datasets due to small floating-point differences in the indexes, and also questionable behaviour with staggered-grid datasets.
  4. compat="no_conflicts" basically picks the first non-NaN value after aligning all datasets, but is quite slow (we should be using duck_array_ops.nanfirst here, I think); see the merge sketch after this list.

While "convenient" this really just makes the default experience quite bad with hard-to-understand slowdowns.

Describe the solution you'd like

I propose we migrate to data_vars="minimal", coords="minimal", join="exact", compat="override". This would:

  1. Only concatenate data_vars and coords variables when they already have concat_dim.
  2. Blindly pick any variables that do not have concat_dim from the first file (compat="override").
  3. Prevent ballooning of dimension sizes due to floating-point inequalities (join="exact").
  4. Avoid any data reads entirely unless explicitly requested by the user.
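For reference, a minimal sketch of opting in to the proposed behaviour today by passing these options explicitly; the datasets and the "files-*.nc" glob are placeholders, not from the issue:

```python
import xarray as xr

# Two tiny placeholder datasets with an almost-equal "x" index.
ds1 = xr.Dataset({"temp": ("x", [1.0, 2.0])}, coords={"x": [0.0, 1.0], "time": 0})
ds2 = xr.Dataset({"temp": ("x", [3.0, 4.0])}, coords={"x": [0.0, 1.0 + 1e-12], "time": 1})

strict = dict(data_vars="minimal", coords="minimal", compat="override", join="exact")

try:
    xr.concat([ds1, ds2], dim="time", **strict)
except Exception as err:
    # join="exact" surfaces the floating-point mismatch in "x" instead of
    # silently outer-joining it into a larger index.
    print(type(err).__name__, err)

# The same keywords are accepted by open_mfdataset:
# xr.open_mfdataset("files-*.nc", **strict)   # "files-*.nc" is a placeholder glob
```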

Unfortunately, this has a pretty big blast radius so we'd need a long deprecation cycle.

Describe alternatives you've considered

No response

Additional context

  • xref https://github.com/pydata/xarray/issues/4824
  • xref https://github.com/pydata/xarray/issues/1385
  • xref https://github.com/pydata/xarray/issues/8231
  • xref https://github.com/pydata/xarray/issues/5381
  • xref https://github.com/pydata/xarray/issues/2064
  • xref https://github.com/pydata/xarray/issues/2217

Reactions

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8778/reactions",
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}

repo: 13221727 · type: issue
