Comment on pydata/xarray#964 (MEMBER, 2016-09-23): https://github.com/pydata/xarray/pull/964#issuecomment-249118534

One of the tricky things with apply is that there are a lot of similar but distinct use cases to disambiguate. I'll outline a few of these below.

I'd appreciate feedback on which cases are most essential and which can wait until later (this PR is already getting pretty big).

Also, I'd appreciate ideas for how to make the API more easily understood. We will have extensive docs either way, but `xarray.apply` is probably already in the realm of "too many arguments for one function". The last thing I want to do is to make a Swiss Army knife so flexible (like [numpy.nditer](http://docs.scipy.org/doc/numpy/reference/generated/numpy.nditer.html)) that nobody uses it because they don't understand how it works.

## How `func` vectorizes

There are two main cases here:

1. Functions already written to vectorize their arguments:
   1. Scalar functions built out of NumPy primitives (e.g., `a + b + c`). These work by default.
   2. Functions that use a core dimension referred to by `axis` (e.g., `np.mean`). These work if you set `axis=-1` and put the dimension in the signature, but the API is kind of awkward. You'd rather that the wrapper just converted an argument like `dim='time'` automatically into `axis=2`. Transposing these core dimensions to the end also feels unnecessary, though maybe not a serious concern given that transposing NumPy arrays involves no memory copies.
   3. Functions that work mostly like gufuncs, but aren't actually (e.g., `np.linalg.svd`). This is pretty common, because NumPy ufuncs have some serious limitations (e.g., they can't handle non-vectorized arguments). These work about as well as we could hope, modulo possible improvements to the signature spec.
   4. True gufuncs, most likely written with `numba.guvectorize`. For these functions, we'd like a way to extract/use the signature automatically.
2. Functions for which you only have the inner loop (e.g., `np.polyfit` or `scipy.stats.pearsonr`). Running these is going to entail large Python overhead, but often that's acceptable.

One option for these is to wrap them into something that broadcasts like a gufunc, e.g., via a new function `numpy.guvectorize` (https://github.com/numpy/numpy/pull/8054). But as a user, this is a lot of wrappers to write. You'd rather just add something like `vectorize=True` and let xarray handle all the automatic broadcasting, e.g.,

```python
def poly_fit(x, y, dim='time', deg=1):
    return xr.apply(np.polyfit, x, y,
                    signature=([(dim,), (dim,)], [('poly_order',)]),
                    new_coords={'poly_order': range(deg + 1)},
                    kwargs={'deg': deg}, vectorize=True)
```
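Until something like `vectorize=True` exists, the broadcasting that case 2 needs can be sketched with plain NumPy's `np.vectorize` and a gufunc-style `signature` (available in NumPy 1.12+); the helper names here are illustrative, not part of the proposal:

```python
import numpy as np

def polyfit_inner(x, y):
    # the "inner loop": np.polyfit only accepts 1-D x and y
    return np.polyfit(x, y, 1)

# np.vectorize with a gufunc-style signature loops this over any extra
# leading dimensions -- in Python, so with Python-level overhead
poly_fit = np.vectorize(polyfit_inner, signature='(n),(n)->(m)')

x = np.linspace(0, 1, 10)
y = np.stack([2 * x + 1, -3 * x + 4])  # two series sharing one x
coeffs = poly_fit(x, y)  # x (10,) broadcasts against y (2, 10)
print(coeffs.shape)  # (2, 2): slope and intercept for each series
```

This gives gufunc-like broadcasting, but the user still has to write the wrapper by hand, which is exactly the boilerplate `vectorize=True` would absorb.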

## Whether `func` applies to "data only" or "everything"

Most "computation" functions/methods in xarray (e.g., arithmetic and `reduce` methods) follow the rule of merging coordinates and applying the core function only to data variables. Coordinates that are no longer valid with new dimensions are dropped. This is currently what we do in `apply`.

On the other hand, there are also functions/methods that we might refer to as "organizing" (e.g., indexing methods, `concat`, `stack`/`unstack`, `transpose`), which generally apply to every variable, including coordinates. It seems like there are definitely use cases for applying these sorts of functions, too, e.g., to wrap Cartopy's `add_cyclic_point` utility (#1005). So, I think we might need another option to toggle what happens to coordinates (e.g., `variables='data'` vs `variables='all'`).
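The distinction is easy to see with a small dataset (a minimal sketch, assuming current xarray behavior for `mean` and `isel`):

```python
import xarray as xr

ds = xr.Dataset(
    {'t': (('x',), [10.0, 20.0, 30.0])},
    coords={'x': [0, 1, 2], 'label': (('x',), ['a', 'b', 'c'])},
)

# "computation" behavior: mean reduces data variables only, and
# coordinates along the reduced dimension are dropped
reduced = ds.mean('x')
print(reduced)  # no 'x' or 'label' coordinates remain

# "organizing" behavior: indexing applies to every variable,
# coordinates included
sub = ds.isel(x=[0, 1])
print(sub)  # 'x' and 'label' are subset along with 't'
```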

## How to handle mismatched core dimensions

Xarray methods often have fallbacks to handle data with different dimensions. For example, if you write `ds.mean(['x', 'y'])`, it matches on core dimensions to apply four different possible functions to each data variable:
- mean over `('x', 'y')`: for variables with both dimensions
- mean over `'x'`: for variables with only `'x'`
- mean over `'y'`: for variables with only `'y'`
- identity: for variables with neither `'x'` nor `'y'`
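This fallback behavior is easy to check (a minimal example; the values assume current xarray semantics for `Dataset.mean`):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({
    'both': (('x', 'y'), np.arange(6.0).reshape(2, 3)),
    'only_x': (('x',), [1.0, 3.0]),
    'neither': ((), 7.0),
})

reduced = ds.mean(['x', 'y'])
print(reduced['both'].item())     # mean over ('x', 'y') -> 2.5
print(reduced['only_x'].item())   # mean over 'x' only -> 2.0
print(reduced['neither'].item())  # identity -> 7.0
```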

Indexing is another example -- it applies to both data and coordinates, but only to matching dimensions for each variable. If you don't have the dimensions, we ignore the variable.

Writing something like `mean` with a single call to `apply` would entail something like a dispatching system to pick which function to use, e.g., instead of a single `func`/`signature` pair you pass a dispatcher function that chooses `func`/`signature` based on the core dimensions of the passed variable. This feels like serious over-engineering.

Instead, we might support a few pre-canned options for how to deal with mismatched dimensions. For example:
- `missing_core_dims='drop'`: silently drop these variables in the output(s)
- `missing_core_dims='error'`: raise an error. This is the current default behavior, which is probably only useful with `variables='data'` -- otherwise some coordinate variables would always error.
- `missing_core_dims='keep'`: keep these variables unchanged in the output(s) (use `merge_variables` to check for conflicts)
- `missing_core_dims='broadcast'`: broadcast all inputs to have the necessary core dimensions if they don't have them already
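A rough sketch of the per-variable dispatch such an option implies (pure Python; the function name and return values are hypothetical, not part of the proposal):

```python
def resolve_missing(var_dims, core_dims, missing_core_dims='error'):
    """Decide what an apply wrapper would do with one variable."""
    missing = [d for d in core_dims if d not in var_dims]
    if not missing:
        return 'apply'
    if missing_core_dims == 'drop':
        return 'drop'   # leave the variable out of the output(s)
    if missing_core_dims == 'keep':
        return 'keep'   # pass it through unchanged
    if missing_core_dims == 'broadcast':
        return 'apply'  # after inserting the missing dimensions
    raise ValueError('variable is missing core dimensions %r' % missing)

print(resolve_missing(('time', 'x'), ('time',)))   # 'apply'
print(resolve_missing(('x',), ('time',), 'drop'))  # 'drop'
```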

Another option would be to consolidate this with the `variables` option to allow only two modes of operation:
- `variables='data'`: For "computation" functions. Apply only to data variables, and error if any data variables are missing a core dimension. Merge coordinates on the output(s), silently dropping conflicts for variables that don't label a dimension.
- `variables='matching'`: For "organizing" functions. Apply to every variable with matching core dimensions. Merge everything else on the output(s), silently dropping conflicts for variables that don't label a dimension.
