pull_requests: 18385180
field | value
---|---
id | 18385180
node_id | MDExOlB1bGxSZXF1ZXN0MTgzODUxODA=
number | 184
state | closed
locked | 0
title | WIP: Automatic label alignment for mathematical operations
user | 1217238
created_at | 2014-07-15T01:57:18Z
updated_at | 2014-08-21T05:44:30Z
closed_at | 2014-08-21T05:44:30Z
merge_commit_sha | 3c7636c0a42add63080db091293596ed0b1cba1a
draft | 0
head | 8724e86031d07ffb2f8d743f5a83b12b645aadc6
base | 6c394b14ecc04a53d804893060ed33cadfde688e
author_association | MEMBER
repo | 13221727
url | https://github.com/pydata/xarray/pull/184

body:

This still needs a bit of cleanup (note the failing test), but an interesting design decision came up: how should we handle alignment for in-place operations when the operation would result in missing values that cannot be represented by the existing data type? For example, what should `x` be after the following?

```python
x = DataArray([1, 2], coordinates=[['a', 'b']], dimensions=['foo'])
y = DataArray([3], coordinates=[['b']], dimensions=['foo'])
x += y
```

If we do automatic alignment like pandas, in-place operations should not change the coordinates of the object to which the operation is being applied. Thus, `y` should be equivalent to:

```python
y_prime = DataArray([np.nan, 3], coordinates=[['a', 'b']], dimensions=['foo'])
```

Here arises the problem: `x` has `dtype=int`, so it cannot represent `NaN`. If I run this example using the current version of this patch, I end up with:

```
In [5]: x
Out[5]:
<xray.DataArray (foo: 2)>
array([-9223372036854775808, 5])
Coordinates:
    foo: Index([u'a', u'b'], dtype='object')
Attributes:
    Empty
```

There are several options here:

1. Don't actually do in-place operations on the underlying ndarrays: `x += y` would translate under the hood to `x = x + y`, which sidesteps the issue because `x + y` results in a new floating point array. This is what pandas does.
2. Do the operation in-place on the ndarray like numpy -- it's the user's problem if they try to add `np.nan` in-place to an integer.
3. Do the operation in-place, but raise a warning or error if the right-hand side expression ends up including any missing values. Interestingly, this is what numpy does, but only for 0-dimensional arrays:

   ```
   In [3]: x = np.array(0)

   In [4]: x += np.nan
   /Users/shoyer/miniconda/envs/tcc-climatology/bin/ipython:1: RuntimeWarning: invalid value encountered in add
     #!/Users/shoyer/miniconda/envs/tcc-climatology/python.app/Contents/MacOS/python
   ```

Option 1 has negative performance implications for all in-place array operations (they would be no faster than the non-in-place versions), and might also complicate the hypothetical future feature of datasets linked on disk (though we might simply disallow in-place operations for such arrays). Option 2 is one principled choice, but the outcome with missing values would be pretty surprising (note that in this scenario, both `x` and `y` were integer arrays). I like option 3 (with the warning), but unfortunately it has most of the negative performance implications of option 1, because we would need to make a copy of `y` to check for missing values. This could be partially alleviated by using something like `bottleneck.anynan` instead, and by the fact that we would only need to do this check if the in-place operation is adding a float to an int. Any thoughts?
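The `DataArray` constructor in the PR body is from this WIP branch and may not run as written. As a runnable stand-in, pandas `Series` arithmetic (whose alignment semantics option 1 would mirror) shows how aligning on the union of labels introduces `NaN` and upcasts the integer result to float:

```python
import numpy as np
import pandas as pd

# x and y play the roles of the integer DataArrays in the example above
x = pd.Series([1, 2], index=['a', 'b'])
y = pd.Series([3], index=['b'])

# pandas aligns on the union of the labels; 'a' is missing from y,
# so the result is upcast to float64 in order to hold the NaN
result = x + y
print(result.dtype)     # float64
print(result.tolist())  # [nan, 5.0]
```

This is exactly why option 1 sidesteps the dtype problem: `x = x + y` rebinds `x` to a fresh float array instead of writing a `NaN` into integer storage.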
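A minimal sketch of option 3, assuming a hypothetical helper `checked_inplace_add` that is not part of this patch: the missing-value scan is only needed when an integer target receives a floating-point right-hand side, and `np.isnan(...).any()` (or `bottleneck.anynan`, which avoids the temporary boolean array) performs the check before the in-place ufunc call:

```python
import warnings
import numpy as np

def checked_inplace_add(target, rhs):
    """Hypothetical helper sketching option 3: add rhs into target in
    place, warning first if missing values would be silently corrupted."""
    rhs = np.asarray(rhs)
    # the scan is only needed in the int-target / float-rhs case
    if (np.issubdtype(target.dtype, np.integer)
            and np.issubdtype(rhs.dtype, np.floating)):
        # bottleneck.anynan(rhs) would avoid the temporary boolean array
        if np.isnan(rhs).any():
            warnings.warn('in-place operation introduces missing values '
                          'that cannot be represented by dtype %s'
                          % target.dtype, RuntimeWarning)
    # casting='unsafe' is required to write a float result into int storage
    np.add(target, rhs, out=target, casting='unsafe')
    return target
```

Note that the resulting integer where the `NaN` landed is not portable (it is whatever the platform's float-to-int cast produces, e.g. the `-9223372036854775808` shown above), which is why the warning matters.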
Links from other tables
- 2 rows from pull_requests_id in labels_pull_requests