issues: 1064837571
**id**: 1064837571 · **node_id**: I_kwDOAMm_X84_eCHD · **number**: 6033
**title**: Threadlocking in DataArray calculations for zarr data depending on where it's loaded from (S3 vs local)
**user**: 5509356 · **state**: open · **comments**: 7 · **author_association**: NONE
**created_at**: 2021-11-26T22:23:14Z · **updated_at**: 2022-01-24T15:00:53Z

**What happened**: I am consistently seeing an issue where, if I download the same dataset from a particular zarr archive on S3, calculations are slowed down by 3 seconds or more compared to when the same data was loaded locally. (I make sure to load all the data into the datasets before trying the calculations.) These same calculations are subsecond on the locally loaded data, and it's literally just the same data copied from S3 using the AWS CLI. Profiling shows that this is a threadlocking issue. I have been able to reproduce it:

- Using `DataArray.mean` or `min` as the calculation
- On different machines (OS X and Linux), on different networks
- In different versions of Python, xarray, dask, and zarr
- Loading the full Dataset with coordinates or just the data variable in question (they're in different directories for this archive)
- On different zarr arrays in the same archive

**What you expected to happen**: No major threadlocking issues; also, calculations on the same data should perform the same regardless of where it was loaded from.

**Minimal Complete Verifiable Example**: See the attached Jupyter Notebook (thread_locking.ipynb.zip), which has the magics for timing and profiling the operations.

```python
import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)

def lookup(path):
    return s3fs.S3Map(path, s3=s3)

path_forecast = "hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES"
ds_from_s3 = xr.open_zarr(lookup(f"{path_forecast}/surface"))
_ = ds_from_s3.PRES.values
```

```
%%time
%%prun -l 2
_ = ds_from_s3.PRES.mean(dim="time").values
```
The same

**Environment**: Tested with these and more:
```
Python: 3.9.6
xarray: 0.19.0
dask: 2021.07.1
zarr: 2.8.3
s3fs: 2021.07.0
```

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:25:38) [Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 0.20.1
pandas: 1.3.4
numpy: 1.21.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.10.3
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.0
cupy: None
pint: None
sparse: None
setuptools: 59.2.0
pip: 21.3.1
conda: None
pytest: None
IPython: 7.29.0
sphinx: None
```
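The report describes spotting the problem via `%%prun`: heavy cumulative time under lock-acquire calls. As a minimal, stdlib-only illustration of what that signature looks like (this is a synthetic sketch, not the issue's actual profile — the lock and workload below are stand-ins for whatever the zarr/s3fs store is serializing on), threads contending on a single `threading.Lock` make `{method 'acquire' of '_thread.lock' objects}` dominate a `cProfile` report:

```python
import cProfile
import io
import pstats
import threading

lock = threading.Lock()
counter = 0

def contended_task(n):
    # Every increment takes the shared lock, so all threads serialize here,
    # mimicking a threadlocked read path.
    global counter
    for _ in range(n):
        lock.acquire()
        counter += 1
        lock.release()

# Background threads competing for the same lock.
workers = [threading.Thread(target=contended_task, args=(50_000,)) for _ in range(4)]
for w in workers:
    w.start()

# Profile the main thread doing the same contended work; cProfile only
# traces the calling thread, so we profile from inside the contention.
profiler = cProfile.Profile()
profiler.enable()
contended_task(50_000)
profiler.disable()
for w in workers:
    w.join()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("tottime").print_stats(5)
report = out.getvalue()

print(counter)              # 250000: 5 threads x 50_000 increments
print("acquire" in report)  # lock acquisition shows up in the profile
```

In a healthy run, time concentrates in the actual computation; when lock entries dominate like this, the workload is serialized regardless of thread count, which matches the multi-second slowdown described above.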
**repo**: 13221727 · **type**: issue
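For reproducing the S3-vs-local comparison outside a notebook (where `%%time` isn't available), the two paths can be timed with the stdlib `timeit` module. This is a sketch with stand-in functions so it runs without S3 access; in the real case the callables would be `lambda: ds_from_s3.PRES.mean(dim="time").values` and the same reduction on a dataset opened from the local copy of the archive:

```python
import timeit

def compare(fn_a, fn_b, repeat=3, number=1):
    """Return the best-of-N wall time for each callable.

    Taking the minimum over repeats reduces noise from caching and
    scheduling, which matters when the gap being measured is seconds.
    """
    t_a = min(timeit.repeat(fn_a, repeat=repeat, number=number))
    t_b = min(timeit.repeat(fn_b, repeat=repeat, number=number))
    return t_a, t_b

# Stand-ins with an obvious speed difference, so the sketch is self-contained.
def slow():
    return sum(range(1_000_000))

def fast():
    return sum(range(1_000))

t_slow, t_fast = compare(slow, fast)
print(t_fast < t_slow)  # the cheaper call measures faster
```

A persistent multi-second gap between the two callables on identical data, as reported above, points at the I/O/locking layer rather than the computation itself.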