issues: 184722754

id: 184722754
node_id: MDU6SXNzdWUxODQ3MjI3NTQ=
number: 1058
title: shallow copies become deep copies when pickling
user: 6213168
state: closed
locked: 0
comments: 10
created_at: 2016-10-23T23:12:03Z
updated_at: 2017-02-05T21:13:41Z
closed_at: 2017-01-17T01:53:18Z
author_association: MEMBER

Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays. This design fails when the object is pickled.

Whenever a numpy view is pickled, it becomes a regular array:

```
>>> import pickle
>>> import numpy
>>> a = numpy.arange(2**26)
>>> print(len(pickle.dumps(a)) / 2**20)
256.00015354156494
>>> b = a.view()
>>> print(len(pickle.dumps((a, b))) / 2**20)
512.0001964569092
>>> b.base is a
True
>>> a2, b2 = pickle.loads(pickle.dumps((a, b)))
>>> b2.base is a2
False
```

This has devastating effects in my use case. I start from a dask-backed DataArray with a dimension of 500,000 elements and no coord, so the coord is auto-assigned by xarray as an incremental integer. Then, I perform ~3000 transformations and dump the resulting dask-backed array with pickle. However, I have to dump all intermediate steps for audit purposes as well. This means that xarray invokes numpy.arange to create (500k * 4 bytes) ~ 2MB worth of coord, then creates 3000 views of it. The moment they're pickled, those views become 3000 independent copies and expand to several GBs.
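The growth described above can be reproduced at a much smaller scale. The following is only a sketch: the array length, the view counts, and the helper name `pickled_size` are illustrative choices and do not come from the original report. It pickles a list of views of one array and compares the result with the size of the array itself.

```python
import pickle
import numpy as np

# Stand-in for the auto-generated coordinate (scaled down from 500,000 elements).
coord = np.arange(10_000, dtype=np.int32)  # 40,000 bytes of data

def pickled_size(n_views):
    """Bytes needed to pickle n_views views of `coord` in a single dump."""
    views = [coord.view() for _ in range(n_views)]
    return len(pickle.dumps(views, pickle.HIGHEST_PROTOCOL))

sizes = {n: pickled_size(n) for n in (1, 10, 100)}
for n, size in sizes.items():
    # Each view carries its own full copy of the data in the pickle stream.
    print(f"{n:>3} views -> {size / coord.nbytes:.1f}x the size of one array")
```

The pickled size grows roughly linearly with the number of views, even though all of them share a single buffer in memory.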

I see a few possible solutions to this:

1. Implement pandas range indexes in xarray. This would be nice as a general thing and would solve my specific problem, but anybody who does not fall in my very specific use case won't benefit from it.
2. Do not auto-generate a coord with numpy.arange() if the user doesn't explicitly ask for it; just leave a None and maybe generate it on the fly when requested. Again, this would solve my specific problem but not other people's.
3. Force the coord to be a dask.array.arange. Actually supporting unconverted dask arrays as coordinates would take a considerable amount of work; they would get converted to numpy several times, among other issues. Again, it wouldn't solve the general problem.
4. Fix the issue upstream in numpy. I didn't look into it yet and it's definitely worth investigating, but I found reports of it dating back to 2012, so I suspect there might be some pretty good reason why it works like that...
5. Whenever xarray performs a shallow copy, take the numpy array instead of creating a view.
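Options 1 and 2 rest on the same idea: keep a description of the default coordinate (start, stop, step) and only materialize the array when it is needed. As a rough illustration, the sketch below uses Python's built-in `range` as a stand-in for such a lazy coordinate. It is not xarray or pandas API, and the `int32` dtype is assumed only to match the 4-byte elements mentioned above.

```python
import pickle
import numpy as np

n = 500_000
eager = np.arange(n, dtype=np.int32)  # materialized coordinate, about 2 MB
lazy = range(n)                       # stores only start, stop and step

eager_size = len(pickle.dumps(eager, pickle.HIGHEST_PROTOCOL))
lazy_size = len(pickle.dumps(lazy, pickle.HIGHEST_PROTOCOL))
print(f"eager: {eager_size} bytes, lazy: {lazy_size} bytes")

# After unpickling, the array can be rebuilt on demand from the description.
r = pickle.loads(pickle.dumps(lazy, pickle.HIGHEST_PROTOCOL))
restored = np.arange(r.start, r.stop, r.step, dtype=np.int32)
```

With a representation like this, the number of shallow copies stops mattering for the pickle size, because each copy costs only a few dozen bytes.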

I implemented (5) as a workaround in my `__getstate__` method. Before:

```
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
2.535497265867889
Wall time: 33.3 s
```

Workaround:

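```
import numpy
import xarray

def get_base(array):
    """Return the base array if `array` is a same-shape, same-dtype view of it."""
    if not isinstance(array, numpy.ndarray):
        return array
    elif array.base is None:
        # The array owns its data; nothing to unwrap
        return array
    elif array.base.dtype != array.dtype:
        return array
    elif array.base.shape != array.shape:
        return array
    else:
        return array.base

# cache: the dict of xarray objects that is about to be pickled
for v in cache.values():
    if isinstance(v, xarray.DataArray):
        v.data = get_base(v.data)
        for coord in v.coords.values():
            coord.data = get_base(coord.data)
    elif isinstance(v, xarray.Dataset):
        # Dataset.variables is a mapping property, not a method
        for var in v.variables.values():
            var.data = get_base(var.data)
```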

After:

```
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
0.9733252348378301
Wall time: 21.1 s
```
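The reason the workaround helps is that pickle memoizes objects by identity within a single dump: once every view has been replaced by its base, the same array object appears many times and is written only once. Below is a small self-contained sketch of that effect. The `get_base` variant here adds two guards that are not in the original workaround (a check that the base is itself an ndarray, and a strides comparison), and the array size and view count are arbitrary.

```python
import pickle
import numpy as np

def get_base(array):
    """Return the owning array when `array` is an identical-layout view of it."""
    if not isinstance(array, np.ndarray):
        return array
    base = array.base
    # Extra guard (assumption): the base may be None or a non-ndarray buffer.
    if not isinstance(base, np.ndarray):
        return array
    if base.dtype != array.dtype or base.shape != array.shape:
        return array
    # Extra guard (assumption): a view such as a[::-1] has the same shape and
    # dtype as its base but different strides, and must not be swapped.
    if base.strides != array.strides:
        return array
    return base

coord = np.arange(10_000, dtype=np.int32)
views = [coord.view() for _ in range(100)]
collapsed = [get_base(v) for v in views]  # every entry is now `coord` itself

size_views = len(pickle.dumps(views, pickle.HIGHEST_PROTOCOL))
size_collapsed = len(pickle.dumps(collapsed, pickle.HIGHEST_PROTOCOL))
print(f"views: {size_views} bytes, collapsed: {size_collapsed} bytes")

# Sharing survives the round trip: all entries load as one object.
restored = pickle.loads(pickle.dumps(collapsed, pickle.HIGHEST_PROTOCOL))
reversed_view = coord[::-1]
```

Note that the saving only applies to objects pickled in the same `pickle.dumps` call, since the memo is not shared between separate dumps.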

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1058/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: 13221727 · type: issue
