xarray issue #6033: Threadlocking in DataArray calculations for zarr data depending on where it's loaded from (S3 vs local)

State: open · Comments: 7 · Opened: 2021-11-26T22:23:14Z · Updated: 2022-01-24T15:00:53Z · Author association: NONE

What happened:

I am consistently seeing an issue where, if I download the same dataset from a particular zarr archive on S3, calculations are slowed down by 3 seconds or more compared to when the same data is loaded locally. (I make sure to load all the data into the datasets before trying the calculations.) The same calculations are subsecond on the locally-loaded data, and it is literally the same data, copied from S3 using the AWS CLI.

Profiling shows that this is a threadlocking issue.

I have been able to reproduce it:

- Using DataArray.mean or min as the calculation
- On different machines, macOS and Linux, on different networks
- In different versions of Python, xarray, dask, and zarr
- Loading the full Dataset with coordinates or just the data variable in question (they're in different directories for this archive)
- On different zarr arrays in the same archive

What you expected to happen:

No major threadlocking issues; calculations on the same data should perform the same regardless of where it was loaded from.

Minimal Complete Verifiable Example:

See the attached Jupyter Notebook (thread_locking.ipynb.zip), which includes the magics used for timing and profiling the operations.

```python
import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)

def lookup(path):
    return s3fs.S3Map(path, s3=s3)

path_forecast = "hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES"
ds_from_s3 = xr.open_zarr(lookup(f"{path_forecast}/surface"))
_ = ds_from_s3.PRES.values
```

```
%%time
%%prun -l 2

_ = ds_from_s3.PRES.mean(dim="time").values
```

This takes over 3 seconds, most of it in `{method 'acquire' of '_thread.lock' objects}`.
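Outside a notebook, the same per-function breakdown the `%%prun` magic gives can be obtained with `cProfile` from the standard library. A minimal stdlib-only sketch, using a synthetic contended lock rather than the actual xarray/dask workload, that surfaces the same `{method 'acquire' of '_thread.lock' objects}` line in the profile:

```python
import cProfile
import io
import pstats
import threading
import time

# Synthetic stand-in for the issue's workload: one thread holds a lock while
# the profiled (main) thread blocks trying to acquire it, so the profile
# attributes the wait to {method 'acquire' of '_thread.lock' objects}.
lock = threading.Lock()

def hold_lock_briefly():
    with lock:
        time.sleep(0.05)

holder = threading.Thread(target=hold_lock_briefly)
holder.start()
time.sleep(0.01)  # give the holder thread a head start on the lock

profiler = cProfile.Profile()
profiler.enable()
lock.acquire()    # blocks here until the holder releases
lock.release()
profiler.disable()
holder.join()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by `tottime` makes the lock wait easy to spot at the top of the report, the same way it dominates the `%%prun` output above.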

The same mean, with the data in question downloaded via `aws s3 cp --recursive` and then opened locally, is 10x faster. The threadlock is still where most of the time is spent, but it takes much less.
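To make the S3-loaded vs. locally-loaded comparison fair, it helps to time a few repeats of each calculation and keep the best run. A minimal stdlib sketch; the `ds_local` name in the commented usage is hypothetical (a dataset opened from the local copy), not constructed here:

```python
import time

def timed(fn, repeats=3):
    """Best wall-clock time over a few runs of a zero-argument callable."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical usage against the S3-loaded dataset from the example above
# and a dataset `ds_local` opened from the aws-s3-cp'd local copy:
# s3_secs    = timed(lambda: ds_from_s3.PRES.mean(dim="time").values)
# local_secs = timed(lambda: ds_local.PRES.mean(dim="time").values)
# print(f"S3-loaded: {s3_secs:.2f}s  locally-loaded: {local_secs:.2f}s")
demo_secs = timed(lambda: sum(range(100_000)))
```

Taking the best of several runs reduces noise from caching and scheduler warm-up, which matters when the difference being measured is a few seconds.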

Environment:

Tested with these and more:

```
Python: 3.10.0
xarray: 0.20.1
dask: 2021.11.2
zarr: 2.10.3
s3fs: 2021.11.0
```

```
Python: 3.9.6
xarray: 0.19.0
dask: 2021.07.1
zarr: 2.8.3
s3fs: 2021.07.0
```

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:25:38) [Clang 11.1.0]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 0.20.1
pandas: 1.3.4
numpy: 1.21.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.10.3
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.0
cupy: None
pint: None
sparse: None
setuptools: 59.2.0
pip: 21.3.1
conda: None
pytest: None
IPython: 7.29.0
sphinx: None
```
