xarray issue #6033: Threadlocking in DataArray calculations for zarr data depending on where it's loaded from (S3 vs local)

State: open · Comments: 7 · Opened: 2021-11-26T22:23:14Z · Updated: 2022-01-24T15:00:53Z · Author association: NONE

What happened:

I am consistently seeing an issue where, if I download the same dataset from a particular zarr archive on S3, calculations are slowed down by 3 seconds or more compared to when the same data is loaded locally. (I make sure to load all the data into the datasets before trying the calculations.) The same calculations are subsecond on the locally-loaded data, and it is literally the same data, copied from S3 using the AWS CLI.

Profiling shows that this is a threadlocking issue.

I have been able to reproduce it:

- Using DataArray.mean or min as the calculation
- On different machines, macOS and Linux, on different networks
- In different versions of Python, xarray, dask, and zarr
- Loading the full Dataset with coordinates or just the data variable in question (they're in different directories for this archive)
- On different zarr arrays in the same archive

What you expected to happen:

No major threadlocking issues; calculations on the same data should perform the same regardless of where it was loaded from.

Minimal Complete Verifiable Example:

See the attached Jupyter Notebook (thread_locking.ipynb.zip), which includes the magics used for timing and profiling the operations.

```python
import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)

def lookup(path):
    return s3fs.S3Map(path, s3=s3)

path_forecast = "hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES"
ds_from_s3 = xr.open_zarr(lookup(f"{path_forecast}/surface"))
_ = ds_from_s3.PRES.values
```

```
%%time
%%prun -l 2

_ = ds_from_s3.PRES.mean(dim="time").values
```

This takes over 3 seconds, most of it in `{method 'acquire' of '_thread.lock' objects}`.
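Outside a notebook, the same per-function breakdown the `%%prun` magic gives can be obtained with `cProfile` from the standard library. A minimal stdlib-only sketch, using a synthetic contended lock rather than the actual xarray/dask workload, that surfaces the same `{method 'acquire' of '_thread.lock' objects}` line in the profile:

```python
import cProfile
import io
import pstats
import threading
import time

# Synthetic stand-in for the issue's workload: one thread holds a lock while
# the profiled (main) thread blocks trying to acquire it, so the profile
# attributes the wait to {method 'acquire' of '_thread.lock' objects}.
lock = threading.Lock()

def hold_lock_briefly():
    with lock:
        time.sleep(0.05)

holder = threading.Thread(target=hold_lock_briefly)
holder.start()
time.sleep(0.01)  # give the holder thread a head start on the lock

profiler = cProfile.Profile()
profiler.enable()
lock.acquire()    # blocks here until the holder releases
lock.release()
profiler.disable()
holder.join()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by `tottime` makes the lock wait easy to spot at the top of the report, the same way it dominates the `%%prun` output above.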

The same mean, with the data in question downloaded via `aws s3 cp --recursive` and then opened locally, is 10x faster. The threadlock is still where most of the time is spent, but it takes much less.
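To make the S3-loaded vs. locally-loaded comparison fair, it helps to time a few repeats of each calculation and keep the best run. A minimal stdlib sketch; the `ds_local` name in the commented usage is hypothetical (a dataset opened from the local copy), not constructed here:

```python
import time

def timed(fn, repeats=3):
    """Best wall-clock time over a few runs of a zero-argument callable."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical usage against the S3-loaded dataset from the example above
# and a dataset `ds_local` opened from the aws-s3-cp'd local copy:
# s3_secs    = timed(lambda: ds_from_s3.PRES.mean(dim="time").values)
# local_secs = timed(lambda: ds_local.PRES.mean(dim="time").values)
# print(f"S3-loaded: {s3_secs:.2f}s  locally-loaded: {local_secs:.2f}s")
demo_secs = timed(lambda: sum(range(100_000)))
```

Taking the best of several runs reduces noise from caching and scheduler warm-up, which matters when the difference being measured is a few seconds.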

Environment:

Tested with these and more:

```
Python: 3.10.0
xarray: 0.20.1
dask: 2021.11.2
zarr: 2.10.3
s3fs: 2021.11.0
```

```
Python: 3.9.6
xarray: 0.19.0
dask: 2021.07.1
zarr: 2.8.3
s3fs: 2021.07.0
```

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:25:38) [Clang 11.1.0]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 0.20.1
pandas: 1.3.4
numpy: 1.21.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.10.3
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.0
cupy: None
pint: None
sparse: None
setuptools: 59.2.0
pip: 21.3.1
conda: None
pytest: None
IPython: 7.29.0
sphinx: None
```
