issues: 184722754


id: 184722754
node_id: MDU6SXNzdWUxODQ3MjI3NTQ=
number: 1058
title: shallow copies become deep copies when pickling
user: 6213168
state: closed
locked: 0
comments: 10
created_at: 2016-10-23T23:12:03Z
updated_at: 2017-02-05T21:13:41Z
closed_at: 2017-01-17T01:53:18Z
author_association: MEMBER

Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays. This design fails when the object is pickled.

Whenever a numpy view is pickled, it becomes a regular array:

```
>>> import numpy, pickle
>>> a = numpy.arange(2**26)
>>> print(len(pickle.dumps(a)) / 2**20)
256.00015354156494
>>> b = a.view()
>>> print(len(pickle.dumps((a, b))) / 2**20)
512.0001964569092
>>> b.base is a
True
>>> a2, b2 = pickle.loads(pickle.dumps((a, b)))
>>> b2.base is a2
False
```

This has devastating effects in my use case. I start from a dask-backed DataArray with a dimension of 500,000 elements and no coord, so the coord is auto-assigned by xarray as an incremental integer. Then I perform ~3000 transformations and dump the resulting dask-backed array with pickle; for audit purposes, I have to dump all the intermediate steps as well. This means that xarray invokes numpy.arange to create (500k * 4 bytes) ~ 2 MB worth of coord, then creates 3000 views of it, which expand to several GB the moment they are pickled, since they become 3000 independent copies.
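The arithmetic can be reproduced with plain numpy. In this sketch, `coord` stands in for the auto-generated integer coord and the 100 views stand in for the ~3000 shallow copies (scaled down so it runs in modest memory):

```
import pickle
import numpy as np

# ~2 MB stand-in for the auto-generated coord of a 500k-element dimension
coord = np.arange(500_000, dtype=np.int32)
# stand-in for the views shared by the shallow copies; 100 instead of ~3000
views = [coord.view() for _ in range(100)]

print(len(pickle.dumps(coord)) / 2**20)  # ~1.9 MiB for the base array
print(len(pickle.dumps(views)) / 2**20)  # ~190 MiB: each view pickles as a full copy
```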

I see a few possible solutions to this:

1. Implement pandas range indexes in xarray. This would be nice as a general feature and would solve my specific problem (see the sketch after this list), but anybody who doesn't fall into my very specific use case won't benefit from it.
2. Do not auto-generate a coord with numpy.arange() if the user doesn't explicitly ask for it; just leave it as None and perhaps generate it on the fly when requested. Again, this would solve my specific problem but not other people's.
3. Force the coord to be a dask.array.arange. Properly supporting unconverted dask arrays as coordinates would take a considerable amount of work (they would get converted to numpy several times, among other issues), and again it wouldn't solve the general problem.
4. Fix the issue upstream in numpy. I haven't looked into it yet and it's definitely worth investigating, but I found mentions of it as early as 2012, so I suspect there might be some pretty good reason why it works like that...
5. Whenever xarray performs a shallow copy, take the base numpy array instead of creating a view.
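To illustrate option 1: a pandas RangeIndex stores only start/stop/step, so it pickles to a handful of bytes regardless of length, unlike a materialized integer index. A minimal sketch, independent of xarray:

```
import pickle
import numpy as np
import pandas as pd

# RangeIndex serializes just start/stop/step; a materialized index serializes every value
print(len(pickle.dumps(pd.RangeIndex(500_000))))        # a few hundred bytes
print(len(pickle.dumps(pd.Index(np.arange(500_000)))))  # ~4 MB of int64 values
```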

I implemented (5) as a workaround in my __getstate__ method. Before:

```
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
2.535497265867889
Wall time: 33.3 s
```

Workaround:

```
def get_base(array):
    # Collapse a view back to its base, but only when it is safe to do so:
    # the view must span the whole base with the same dtype and shape.
    if not isinstance(array, numpy.ndarray):
        return array
    elif array.base is None:
        return array
    elif array.base.dtype != array.dtype:
        return array
    elif array.base.shape != array.shape:
        return array
    else:
        return array.base

for v in cache.values():
    if isinstance(v, xarray.DataArray):
        v.data = get_base(v.data)
        for coord in v.coords.values():
            coord.data = get_base(coord.data)
    elif isinstance(v, xarray.Dataset):
        for var in v.variables.values():
            var.data = get_base(var.data)
```
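A quick sanity check of the helper, assuming the definitions above (the arrays here are purely illustrative):

```
import numpy

a = numpy.arange(10)
assert get_base(a[:]) is a         # full-array view: safe to collapse to the base
assert get_base(a[::2]) is not a   # partial view: shapes differ, left untouched
```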

After:

```
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
0.9733252348378301
Wall time: 21.1 s
```

reactions:

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1058/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}

state_reason: completed
repo: 13221727
type: issue

Links from other tables

  • 0 rows from issues_id in issues_labels
  • 10 rows from issue in issue_comments