home / github / issues

Menu
  • Search all tables
  • GraphQL API

issues: 485708282

This data as json

id node_id number title user state locked assignee milestone comments created_at updated_at closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
485708282 MDU6SXNzdWU0ODU3MDgyODI= 3268 Stateful user-defined accessors 6213168 open 0     15 2019-08-27T09:54:28Z 2019-10-08T11:13:25Z   MEMBER      

If anybody decorates a stateful class with @register_dataarray_accessor or @register_dataset_accessor, the instance will lose its state on any method that invokes _to_temp_dataset, as well as on a shallow copy.

```python

In [1]: @xarray.register_dataarray_accessor('foo') ...: class Foo: ...: def init(self, obj): ...: self.obj = obj ...: self.x = 1 ...:
...:

In [2]: a = xarray.DataArray()

In [3]: a.foo.x
Out[3]: 1

In [4]: a.foo.x = 2

In [5]: a.foo.x
Out[5]: 2

In [6]: a.roll().foo.x
Out[6]: 1

In [7]: a.copy(deep=False).foo.x
Out[7]: 1 ```

While in the case of _to_temp_dataset it could be possible to spend (substantial) effort to retain the state, on the case of copy() it's impossible without modifying the accessor duck API, as one would need to tamper with the accessor instance in place and modify the pointer back to the DataArray/Dataset.

This issue is so glaring that it makes me strongly suspect that nobody saves any state in accessor classes. This kind of use would also be problematic in practical terms, as the accessor object would have a hard time realising when its own state is no longer coherent with the referenced DataArray/Dataset.

This design also carries the problem that it introduces a circular reference in the DataArray/Dataset. This means that, after someone invokes an accessor method on his DataArray/Dataset, then the whole object - including the numpy buffers! - won't be instantly collected when it's dereferenced by the user, and it will have to instead wait for the next gc pass. This could cause huge increases in RAM usage overnight in a user application, which would be very hard to logically link to a change that just added a custom method.

Finally, with https://github.com/pydata/xarray/pull/3250/, this statefulness forces us to increase the RAM usage of all datasets and dataarrays by an extra slot, for all users, even if this feature is quite niche.

Proposed solution

Get rid of accessor caching altogether, and just recreate the accessor object from scratch every time it is invoked. In the documentation, clarify that the __init__ method should not perform anything computationally intensive.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3268/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    13221727 issue

Links from other tables

  • 0 rows from issues_id in issues_labels
  • 15 rows from issue in issue_comments
Powered by Datasette · Queries took 0.785ms · About: xarray-datasette