issues: 1561508426

id	node_id	number	title	user	state	locked	assignee	milestone	comments	created_at	updated_at	closed_at	author_association	active_lock_reason	draft	pull_request	body	reactions	performed_via_github_app	state_reason	repo	type
1561508426	I_kwDOAMm_X85dErpK	7484	Opening datasets with large object dtype arrays is very slow	5179430	closed	0			3	2023-01-29T23:31:51Z	2023-03-07T16:10:31Z	2023-03-07T16:10:31Z	CONTRIBUTOR				What is your issue?

Opening a dataset that contains a very large array with object dtype is much slower than it should be. I initially noticed this when working with a dataset spanning around 24000 netCDF files. I have been using kerchunk references to load them with a consolidated metadata key, so I was expecting it to be fairly quick to open, but it actually took several minutes. I realized that all of this time was spent on one variable consisting of strings; when I drop it, the whole dataset opens in seconds.

Sharing this dataset would be a bit difficult, so instead I will illustrate the problem with a simple, easy-to-reproduce example, with the latest released versions of xarray and zarr installed:

```python
import numpy as np
import xarray as xr

str_array = np.arange(100000000).astype(str)
ds = xr.DataArray(dims=('x',), data=str_array).to_dataset(name='str_array')
ds['str_array'] = ds.str_array.astype('O')  # Needs to actually be object dtype to show the problem
ds.to_zarr('str_array.zarr')

%time xr.open_zarr('str_array.zarr/')
CPU times: user 8.24 s, sys: 5.23 s, total: 13.5 s
Wall time: 12.9 s
```

I did some digging and found that pretty much all the time was spent on the check done by `contains_cftime_datetimes` in https://github.com/pydata/xarray/blob/d385e2063a6b5919e1fe9dd3e27a24bc7117137e/xarray/core/common.py#L1793

This operation is not lazy and ends up requiring every single chunk of this variable to be opened, all for the sake of checking the very first element of the entire array.
A quick fix I tried is updating `contains_cftime_datetimes` to do the following:

```python
def contains_cftime_datetimes(var) -> bool:
    """Check if an xarray.Variable contains cftime.datetime objects"""
    if var.dtype == np.dtype("O") and var.size > 0:
        ndims = len(var.shape)
        first_idx = np.zeros(ndims, dtype='int32')
        # Star-unpacking inside a subscript requires Python 3.11+
        array = var[*first_idx].data
        return _contains_cftime_datetimes(array)
    else:
        return False
```

This drastically reduced the time to open the dataset, as expected:

```python
%time xr.open_zarr('str_array.zarr/')
CPU times: user 384 ms, sys: 502 ms, total: 887 ms
Wall time: 985 ms
```

I would like to make a PR with this change, but I realize that it could affect every backend. Although I have been using xarray for many years, this would be my first contribution, so I would like to briefly discuss it in case there are better ways to address the issue. Thanks!	{ "url": "https://api.github.com/repos/pydata/xarray/issues/7484/reactions", "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 1, "rocket": 0, "eyes": 0 }		completed	13221727	issue
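
The first-element lookup in the quick fix relies on `var[*first_idx]`, which only parses on Python 3.11 and later. One way to write the same idea portably is to index with a tuple of zeros. The following is only a minimal sketch on plain NumPy arrays, not xarray's actual implementation: `first_element` and `first_element_is` are hypothetical helper names introduced for illustration. With an `xarray.Variable` the indexing step would presumably need to be followed by `.data`, as in the quick fix.

```python
import numpy as np


def first_element(array):
    """Return the element at index (0, 0, ..., 0) of an N-d array.

    Hypothetical helper: indexing with a tuple of zeros touches only one
    element, and an empty tuple handles the 0-d case.
    """
    first_idx = (0,) * array.ndim
    return array[first_idx]


def first_element_is(array, types) -> bool:
    """Check the type of the first element of an object-dtype array only.

    Hypothetical helper mirroring the shape of the quick fix: non-object
    and empty arrays are rejected before any element is read.
    """
    if array.dtype != np.dtype("O") or array.size == 0:
        return False
    return isinstance(first_element(array), types)


strings = np.arange(6).astype(str).astype("O").reshape(2, 3)
print(first_element_is(strings, str))
print(first_element_is(np.arange(3), int))
```

Whether a chunked or lazily loaded backend array reads only one chunk for this lookup depends on how that backend implements indexing, so the sketch illustrates the indexing pattern rather than guaranteeing the speed-up.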

Links from other tables

3 rows from issues_id in issues_labels
3 rows from issue in issue_comments