issues: 184722754
This data as json
id | node_id | number | title | user | state | locked | comments | created_at | updated_at | closed_at | author_association
---|---|---|---|---|---|---|---|---|---|---|---
184722754 | MDU6SXNzdWUxODQ3MjI3NTQ= | 1058 | shallow copies become deep copies when pickling | 6213168 | closed | 0 | 10 | 2016-10-23T23:12:03Z | 2017-02-05T21:13:41Z | 2017-01-17T01:53:18Z | MEMBER

body:

Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays. This design fails when the object is pickled: whenever a numpy view is pickled, it becomes a regular, standalone array.
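The code sample that demonstrated this did not survive the export; a minimal reproduction of the underlying numpy behaviour (mine, not the original author's) looks like this:

```python
import pickle

import numpy

a = numpy.arange(100_000)  # base array that owns the buffer
v = a.view()               # shallow copy: shares a's memory
assert v.base is a

# Each view is pickled as an independent, full copy of the data:
size_one = len(pickle.dumps(a))
size_ten = len(pickle.dumps([a.view() for _ in range(10)]))
print(round(size_ten / size_one))  # roughly 10: every view expanded to a copy
```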
This has devastating effects in my use case. I start from a dask-backed DataArray with a dimension of 500,000 elements and no coord, so xarray auto-assigns an incremental integer coord. Then I perform ~3000 transformations and dump the resulting dask-backed array with pickle. However, I have to dump all intermediate steps for audit purposes as well. This means that xarray invokes numpy.arange to create (500k * 4 bytes) ~2 MB worth of coord, then creates 3000 views of it, which the moment they're pickled expand to several GB as they become 3000 independent copies.

I see a few possible solutions to this:

1. Implement pandas range indexes in xarray. This would be nice as a general feature and would solve my specific problem, but anybody who does not fall into my very specific use case won't benefit from it.
2. Do not auto-generate a coord with numpy.arange() if the user doesn't explicitly ask for it; just leave it as None and maybe generate it on the fly when requested. Again, this would solve my specific problem but not other people's.
3. Force the coord to be a dask.array.arange. Actually supporting unconverted dask arrays as coordinates would take a considerable amount of work: they would get converted to numpy several times, among other issues. Again, it wouldn't solve the general problem.
4. Fix the issue upstream in numpy. I haven't looked into it yet and it's definitely worth investigating, but I found out about this behaviour as early as 2012, so I suspect there might be a pretty good reason why it works like that...
5. Whenever xarray performs a shallow copy, take the base numpy array instead of creating a view.

I implemented (5) as a workaround in my `__getstate__` method.
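For scale, on option (1): a pandas RangeIndex stores only start/stop/step, so it pickles in constant space no matter how long it is, unlike a materialized numpy.arange. A quick comparison (my own illustration, not from the original issue):

```python
import pickle

import numpy
import pandas

ri = pandas.RangeIndex(500_000)  # lazy: start/stop/step only
arr = numpy.arange(500_000)      # materialized buffer

print(len(pickle.dumps(ri)))   # a few hundred bytes
print(len(pickle.dumps(arr)))  # megabytes: 4 or 8 bytes per element
```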
Workaround (the middle of `get_base` was truncated in this export; the guards below restore its evident intent of returning the base only for whole-array views):

```
def get_base(array):
    # Replace a numpy view with the array that owns its memory, so that
    # pickle serializes the shared buffer only once.
    if not isinstance(array, numpy.ndarray):
        return array
    if array.base is None or not isinstance(array.base, numpy.ndarray):
        return array
    if array.base.dtype != array.dtype or array.base.shape != array.shape:
        # Not a plain whole-array view (e.g. a slice); keep it as is.
        return array
    return array.base

for v in cache.values():
    if isinstance(v, xarray.DataArray):
        v.data = get_base(v.data)
        for coord in v.coords.values():
            coord.data = get_base(coord.data)
    elif isinstance(v, xarray.Dataset):
        for var in v.variables.values():
            var.data = get_base(var.data)
```
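Why rebasing helps: pickle memoizes objects by identity, so once every view has been replaced by its base, a container of 3000 references serializes the buffer just once. A self-contained sketch (not the issue's actual data):

```python
import pickle

import numpy

base = numpy.arange(100_000)
views = [base.view() for _ in range(10)]

# Ten views pickle as ten independent copies...
fat = len(pickle.dumps(views))
# ...but ten references to the same base pickle as one copy plus memo refs.
slim = len(pickle.dumps([v.base for v in views]))
print(f"{fat / slim:.0f}x smaller after rebasing")
```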
reactions: { "url": "https://api.github.com/repos/pydata/xarray/issues/1058/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }

state_reason: completed
repo: 13221727
type: issue