Comment on pydata/xarray#964 (MEMBER, 2016-09-23): https://github.com/pydata/xarray/pull/964#issuecomment-249118534

One of the tricky things with apply is that there are a lot of similar but distinct use cases to disambiguate. I'll outline a few of these below.

I'd appreciate feedback on which cases are most essential and which can wait until later (this PR is already getting pretty big).

Also, I'd appreciate ideas for how to make the API more easily understood. We will have extensive docs either way, but `xarray.apply` is probably already in the realm of "too many arguments for one function". The last thing I want to do is to make a Swiss Army knife so flexible (like [numpy.nditer](http://docs.scipy.org/doc/numpy/reference/generated/numpy.nditer.html)) that nobody uses it because they don't understand how it works.

## How `func` vectorizes

There are two main cases here:

1. Functions already written to vectorize their arguments:
   1. Scalar functions built out of NumPy primitives (e.g., `a + b + c`). These work by default.
   2. Functions that use a core dimension referred to by `axis` (e.g., `np.mean`). These work if you set `axis=-1` and put the dimension in the signature, but the API is kind of awkward. You'd rather that the wrapper just converted an argument like `dim='time'` automatically into `axis=2`. Transposing these core dimensions to the end also feels unnecessary, though maybe not a serious concern given that transposing NumPy arrays involves no memory copies.
   3. Functions that work mostly like gufuncs, but aren't actually (e.g., `np.linalg.svd`). This is pretty common, because NumPy ufuncs have some serious limitations (e.g., they can't handle non-vectorized arguments). These work about as well as we could hope, modulo possible improvements to the signature spec.
   4. True gufuncs, most likely written with `numba.guvectorize`. For these functions, we'd like a way to extract/use the signature automatically.
2. Functions for which you only have the inner loop (e.g., `np.polyfit` or `scipy.stats.pearsonr`). Running these is going to entail large Python overhead, but often that's acceptable.

One option for these is to wrap them into something that broadcasts like a gufunc, e.g., via a new function `numpy.guvectorize` (https://github.com/numpy/numpy/pull/8054). But as a user, this is a lot of wrappers to write. You'd rather just add something like `vectorize=True` and let xarray handle all the automatic broadcasting, e.g.,

```python
def poly_fit(x, y, dim='time', deg=1):
    return xr.apply(np.polyfit, x, y,
                    signature=([(dim,), (dim,)], [('poly_order',)]),
                    new_coords={'poly_order': range(deg + 1)},
                    kwargs={'deg': deg}, vectorize=True)
```
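Until something like `vectorize=True` exists, the broadcasting that case 2 needs can be sketched with plain NumPy's `np.vectorize` and a gufunc-style `signature` (available in NumPy 1.12+); the helper names here are illustrative, not part of the proposal:

```python
import numpy as np

def polyfit_inner(x, y):
    # the "inner loop": np.polyfit only accepts 1-D x and y
    return np.polyfit(x, y, 1)

# np.vectorize with a gufunc-style signature loops this over any extra
# leading dimensions -- in Python, so with Python-level overhead
poly_fit = np.vectorize(polyfit_inner, signature='(n),(n)->(m)')

x = np.linspace(0, 1, 10)
y = np.stack([2 * x + 1, -3 * x + 4])  # two series sharing one x
coeffs = poly_fit(x, y)  # x (10,) broadcasts against y (2, 10)
print(coeffs.shape)  # (2, 2): slope and intercept for each series
```

This gives gufunc-like broadcasting, but the user still has to write the wrapper by hand, which is exactly the boilerplate `vectorize=True` would absorb.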

## Whether `func` applies to "data only" or "everything"

Most "computation" functions/methods in xarray (e.g., arithmetic and `reduce` methods) follow the rule of merging coordinates and applying the core function only to data variables. Coordinates that are no longer valid with new dimensions are dropped. This is currently what we do in `apply`.

On the other hand, there are also functions/methods that we might refer to as "organizing" (e.g., indexing methods, `concat`, `stack`/`unstack`, `transpose`), which generally apply to every variable, including coordinates. It seems like there are definitely use cases for applying these sorts of functions, too, e.g., to wrap Cartopy's `add_cyclic_point` utility (#1005). So, I think we might need another option to toggle what happens to coordinates (e.g., `variables='data'` vs `variables='all'`).
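The distinction is easy to see with a small dataset (a minimal sketch, assuming current xarray behavior for `mean` and `isel`):

```python
import xarray as xr

ds = xr.Dataset(
    {'t': (('x',), [10.0, 20.0, 30.0])},
    coords={'x': [0, 1, 2], 'label': (('x',), ['a', 'b', 'c'])},
)

# "computation" behavior: mean reduces data variables only, and
# coordinates along the reduced dimension are dropped
reduced = ds.mean('x')
print(reduced)  # no 'x' or 'label' coordinates remain

# "organizing" behavior: indexing applies to every variable,
# coordinates included
sub = ds.isel(x=[0, 1])
print(sub)  # 'x' and 'label' are subset along with 't'
```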

## How to handle mismatched core dimensions

Xarray methods often have fallbacks to handle data with different dimensions. For example, if you write `ds.mean(['x', 'y'])`, it matches on core dimensions to apply four different possible functions to each data variable:
- mean over `('x', 'y')`: for variables with both dimensions
- mean over `'x'`: for variables with only `'x'`
- mean over `'y'`: for variables with only `'y'`
- identity: for variables with neither `'x'` nor `'y'`
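This fallback behavior is easy to check (a minimal example; the values assume current xarray semantics for `Dataset.mean`):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({
    'both': (('x', 'y'), np.arange(6.0).reshape(2, 3)),
    'only_x': (('x',), [1.0, 3.0]),
    'neither': ((), 7.0),
})

reduced = ds.mean(['x', 'y'])
print(reduced['both'].item())     # mean over ('x', 'y') -> 2.5
print(reduced['only_x'].item())   # mean over 'x' only -> 2.0
print(reduced['neither'].item())  # identity -> 7.0
```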

Indexing is another example -- it applies to both data and coordinates, but only to matching dimensions for each variable. If you don't have the dimensions, we ignore the variable.

Writing something like `mean` with a single call to `apply` would entail something like a dispatching system to pick which function to use, e.g., instead of a single `func`/`signature` pair you pass a dispatcher function that chooses `func`/`signature` based on the core dimensions of the passed variable. This feels like serious over-engineering.

Instead, we might support a few pre-canned options for how to deal with mismatched dimensions. For example:
- `missing_core_dims='drop'`: silently drop these variables in the output(s)
- `missing_core_dims='error'`: raise an error. This is the current default behavior, which is probably only useful with `variables='data'` -- otherwise some coordinate variables would always error.
- `missing_core_dims='keep'`: keep these variables unchanged in the output(s) (use `merge_variables` to check for conflicts)
- `missing_core_dims='broadcast'`: broadcast all inputs to have the necessary core dimensions if they don't have them already
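A rough sketch of the per-variable dispatch such an option implies (pure Python; the function name and return values are hypothetical, not part of the proposal):

```python
def resolve_missing(var_dims, core_dims, missing_core_dims='error'):
    """Decide what an apply wrapper would do with one variable."""
    missing = [d for d in core_dims if d not in var_dims]
    if not missing:
        return 'apply'
    if missing_core_dims == 'drop':
        return 'drop'   # leave the variable out of the output(s)
    if missing_core_dims == 'keep':
        return 'keep'   # pass it through unchanged
    if missing_core_dims == 'broadcast':
        return 'apply'  # after inserting the missing dimensions
    raise ValueError('variable is missing core dimensions %r' % missing)

print(resolve_missing(('time', 'x'), ('time',)))   # 'apply'
print(resolve_missing(('x',), ('time',), 'drop'))  # 'drop'
```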

Another option would be to consolidate this with the `variables` option to allow only two modes of operation:
- `variables='data'`: For "computation" functions. Apply only to data variables, and error if any data variables are missing a core dimension. Merge coordinates on the output(s), silently dropping conflicts for variables that don't label a dimension.
- `variables='matching'`: For "organizing" functions. Apply to every variable with matching core dimensions. Merge everything else on the output(s), silently dropping conflicts for variables that don't label a dimension.
