pull_requests: 18385180

id: 18385180
node_id: MDExOlB1bGxSZXF1ZXN0MTgzODUxODA=
number: 184
state: closed
locked: 0
title: WIP: Automatic label alignment for mathematical operations
user: 1217238
created_at: 2014-07-15T01:57:18Z
updated_at: 2014-08-21T05:44:30Z
closed_at: 2014-08-21T05:44:30Z
merged_at:
merge_commit_sha: 3c7636c0a42add63080db091293596ed0b1cba1a
assignee:
milestone:
draft: 0
head: 8724e86031d07ffb2f8d743f5a83b12b645aadc6
base: 6c394b14ecc04a53d804893060ed33cadfde688e
author_association: MEMBER
auto_merge:
repo: 13221727
url: https://github.com/pydata/xarray/pull/184
merged_by:
body:

This still needs a bit of cleanup (note the failing test), but an interesting design decision came up:

How should we handle alignment for in-place operations when the operation would result in missing values that cannot be represented by the existing data type?

For example, what should `x` be after the following?

``` python
x = DataArray([1, 2], coordinates=[['a', 'b']], dimensions=['foo'])
y = DataArray([3], coordinates=[['b']], dimensions=['foo'])
x += y
```

If we do automatic alignment like pandas, in-place operations should not change the coordinates of the object to which the operation is being applied. Thus, `y` should be equivalent to:

``` python
y_prime = DataArray([np.nan, 3], coordinates=[['a', 'b']], dimensions=['foo'])
```

Here arises the problem: `x` has `dtype=int`, so it cannot represent `NaN`. If I run this example using the current version of this patch, I end up with:

```
In [5]: x
Out[5]:
<xray.DataArray (foo: 2)>
array([-9223372036854775808, 5])
Coordinates:
    foo: Index([u'a', u'b'], dtype='object')
Attributes:
    Empty
```

There are several options here:

1. Don't actually do in-place operations on the underlying ndarrays: `x += y` should translate under the hood to `x = x + y`, which sidesteps the issue, because `x + y` results in a new floating point array. This is what pandas does.
2. Do the operation in-place on the ndarray like numpy -- it's the user's problem if they try to add `np.nan` in-place to an integer.
3. Do the operation in-place, but raise a warning or error if the right hand side expression ends up including any missing values.

Interestingly, this is what numpy does, but only for 0-dimensional arrays:

```
In [3]: x = np.array(0)

In [4]: x += np.nan
/Users/shoyer/miniconda/envs/tcc-climatology/bin/ipython:1: RuntimeWarning: invalid value encountered in add
  #!/Users/shoyer/miniconda/envs/tcc-climatology/python.app/Contents/MacOS/python
```

Option 1 has negative performance implications for all in-place array operations (they would be no faster than the non-in-place versions), and might also complicate the hypothetical future feature of datasets linked on disk (but we might also just disallow in-place operations for such arrays).

Option 2 is one principled choice, but the outcome with missing values would be pretty surprising (note that in this scenario, both `x` and `y` were integer arrays).

I like option 3 (with the warning), but unfortunately it has most of the negative performance implications of option 1, because we would need to make a copy of `y` to check for missing values. This could be partially alleviated by using something like `bottleneck.anynan` instead, and by the fact that we would only need to do this check if the in-place operation is adding a float to an int.

Any thoughts?
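
To make option 3 concrete, here is a minimal sketch on plain NumPy arrays rather than `DataArray` objects, assuming the right-hand side has already been aligned to the left-hand side's labels (so missing labels show up as `NaN`). The helper name `checked_inplace_add` is hypothetical and is not part of xray or NumPy. The check uses `np.isnan(...).any()` as a stand-in for `bottleneck.anynan`; unlike the latter, it allocates a temporary boolean array. Because recent NumPy versions reject a plain `x += y` when `x` is an integer array and `y` is a float array, the sketch calls `np.add` with `casting="unsafe"` explicitly.

```python
import numpy as np


def checked_inplace_add(target, other):
    """Hypothetical helper: add `other` to `target` in place.

    Raises ValueError instead of silently writing garbage when an
    integer `target` would have to store NaN (option 3, error variant).
    """
    other = np.asarray(other)
    # The check is only needed when a float is added to an integer array.
    if target.dtype.kind in "iu" and other.dtype.kind == "f":
        if np.isnan(other).any():
            raise ValueError(
                "in-place operation would introduce missing values "
                "that dtype %s cannot represent" % target.dtype
            )
    # Write the result back into `target`, casting to its dtype.
    np.add(target, other, out=target, casting="unsafe")
    return target


x = np.array([1, 2])
checked_inplace_add(x, np.array([0.0, 3.0]))  # no NaN, so this succeeds

try:
    # Mirrors the aligned `y_prime` from the example above.
    checked_inplace_add(x, np.array([np.nan, 3.0]))
except ValueError as exc:
    print("refused:", exc)
```

Swapping the `raise` for `warnings.warn` would give the warning variant; either way the cost of the check is paid only on the int-plus-float path.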