id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1064837571,I_kwDOAMm_X84_eCHD,6033,Threadlocking in DataArray calculations for zarr data depending on where it's loaded from (S3 vs local),5509356,open,0,,,7,2021-11-26T22:23:14Z,2022-01-24T15:00:53Z,,NONE,,,,"**What happened**:
I am consistently seeing an issue where, if I load the same dataset from a particular zarr archive on S3, calculations are slowed down by 3 seconds or more compared to when the data is loaded from a local copy. (I make sure to load all the data into the datasets before running the calculations.) The same calculations are subsecond on the locally-loaded data, and it is literally the same data, copied down from S3 with the AWS CLI.
Profiling shows that this is a threadlocking issue.
I have been able to reproduce it:
- Using DataArray.mean or min as the calculation
- On different machines, macOS and Linux, on different networks
- In different versions of Python, xarray, dask, and zarr
- Loading the full Dataset with coordinates or just the data variable in question (they're in different directories for this archive)
- On different zarr arrays in the same archive
**What you expected to happen**:
No major threadlocking issues; calculations on the same data should perform the same regardless of where it was loaded from.
**Minimal Complete Verifiable Example**:
See the attached Jupyter Notebook ([thread_locking.ipynb.zip](https://github.com/pydata/xarray/files/7610791/thread_locking.ipynb.zip)) which has the magic for timing and profiling the operations.
```python
import s3fs
import xarray as xr
s3 = s3fs.S3FileSystem(anon=True)
def lookup(path):
    return s3fs.S3Map(path, s3=s3)
path_forecast = ""hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES""
ds_from_s3 = xr.open_zarr(lookup(f""{path_forecast}/surface""))
_ = ds_from_s3.PRES.values  # read all the data once before timing
```
```
%%time
%%prun -l 2
_ = ds_from_s3.PRES.mean(dim=""time"").values
```
This takes over 3 seconds, most of it in `{method 'acquire' of '_thread.lock' objects}`.
The same `mean` is 10x faster when the data in question is downloaded via `aws s3 cp --recursive` and opened locally. The thread lock still accounts for most of the time spent, but there is far less of it.
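For reference, here's a minimal sketch of the local-side comparison (the local destination path is an assumption; the subgroup layout mirrors the S3 paths above):
```python
# Assumes the archive was first mirrored locally, e.g.:
#   aws s3 cp --recursive s3://hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES ./PRES
import xarray as xr

ds_local = xr.open_zarr('./PRES/surface')   # same subgroup layout as the S3 store
_ = ds_local.PRES.values                    # read all the data, as in the S3 case
_ = ds_local.PRES.mean(dim='time').values   # subsecond here vs ~3 s on the S3-loaded dataset
```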
**Environment**:
Tested with these and more:
```
Python: 3.10.0
xarray: 0.20.1
dask: 2021.11.2
zarr: 2.10.3
s3fs: 2021.11.0
```
```
Python: 3.9.6
xarray: 0.19.0
dask: 2021.07.1
zarr: 2.8.3
s3fs: 2021.07.0
```
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:25:38) [Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 0.20.1
pandas: 1.3.4
numpy: 1.21.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.10.3
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.0
cupy: None
pint: None
sparse: None
setuptools: 59.2.0
pip: 21.3.1
conda: None
pytest: None
IPython: 7.29.0
sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6033/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1039844354,I_kwDOAMm_X849-sQC,5918,Reading zarr gives unspecific PermissionError: Access Denied when public data has been consolidated after being written to S3,5509356,open,0,,,2,2021-10-29T18:54:23Z,2021-11-04T16:30:33Z,,NONE,,,,"This is a minor issue with an invalid data setup that's easy to get into; I'm reporting it here more for documentation than in expectation of a fix.
**Quick summary**: If you consolidate metadata on a public zarr dataset in S3, the .zmetadata files end up permission-restricted. Reading with xarray 0.19 then fails to open the dataset with an unclear error message, while reading with xarray <=0.18.x works fine. Contrast this with the helpful warning 0.19 gives when the .zmetadata simply doesn't exist.
-----
**How this happens**
People who wrote data without consolidated=True in past versions of xarray might run into this issue, but in our case we actually did have that parameter originally.
I've previously reported a few issues stemming from the fact that xarray will write arbitrary zarr hierarchies when variable names contain slashes, and then can't read them back properly. One consequence is that data written with consolidated=True still doesn't have .zmetadata files where xarray.open_mfdataset needs them.
If you try to add the .zmetadata by running
```python
import s3fs
import zarr
fs = s3fs.S3FileSystem()
zarr.consolidate_metadata(s3fs.S3Map(s3_path, s3=fs))  # s3_path: the dataset's location in the bucket
```
directly on the cloud bucket in S3, it writes the .zmetadata files... but permissions are restricted to the user who uploaded them even if you're writing to a public bucket. (It's an [AWS thing](https://docs.aws.amazon.com/config/latest/developerguide/s3-bucket-policy.html).)
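If you control the consolidation step, one possible mitigation is to have s3fs attach a public-read ACL to the objects it uploads. This is only a sketch, assuming the bucket permits object ACLs; s3_additional_kwargs is an s3fs parameter that forwards extra arguments to the underlying S3 calls:
```python
import s3fs
import zarr

# Sketch: forward a public-read ACL so the uploaded .zmetadata files
# stay readable by anonymous users (assumes the bucket allows object ACLs).
fs = s3fs.S3FileSystem(s3_additional_kwargs={'ACL': 'public-read'})
zarr.consolidate_metadata(s3fs.S3Map(s3_path, s3=fs))  # s3_path as above
```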
-----
**Why it's a problem**
It would be *nice* if, when xarray goes to read this data and finds that it has access to the data but not to a usable .zmetadata, it emitted the same warning it does when .zmetadata doesn't exist. Instead it fails with an uncaught PermissionError: Access Denied, and it's not clear from the output that this is *just* a .zmetadata issue and that the user can still get the data by passing consolidated=False.
Another problem: data that reads fine in xarray 0.18.x, without even a warning message, suddenly gives Access Denied from the same code once you update to xarray 0.19.
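Concretely, the friendlier behavior could look like widening the try/except that already exists in ZarrStore.open_group (visible in the stacktrace below) so that an unreadable .zmetadata is treated like a missing one. A rough sketch of the idea, not the actual xarray code; store, open_kwargs, and stacklevel are the enclosing function's variables as quoted in the traceback:
```python
# Hypothetical variant of xarray's consolidated=None fallback:
# treat an ACL-restricted .zmetadata the same as an absent one.
try:
    zarr_group = zarr.open_consolidated(store, **open_kwargs)
except (KeyError, PermissionError):
    warnings.warn(
        'Could not read consolidated metadata (missing or unreadable '
        '.zmetadata); falling back to consolidated=False.',
        stacklevel=stacklevel,
    )
    zarr_group = zarr.open_group(store, **open_kwargs)
```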
-----
**Workaround**
If you're trying to read a dataset that has this issue, you can get the same behavior as in previous versions of xarray like so:
```python
# s3_lookups: the list of S3Map stores for the dataset's groups
xr.open_mfdataset(s3_lookups, engine=""zarr"", consolidated=False)
```
The data loads fine, just more slowly than if the .zmetadata were accessible.
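If you'd rather keep the consolidated-metadata speedup whenever it's available, a small probe can decide per run whether .zmetadata is actually readable. A sketch reusing the s3_lookups mappers from above; the fsspec mapper raises KeyError for a missing key and, as the stacktrace below shows, PermissionError for a restricted one:
```python
import xarray as xr

def can_read_zmetadata(mapper):
    # KeyError if .zmetadata is absent; PermissionError if ACL-restricted
    try:
        mapper['.zmetadata']
        return True
    except (KeyError, PermissionError):
        return False

consolidated = all(can_read_zmetadata(m) for m in s3_lookups)
ds = xr.open_mfdataset(s3_lookups, engine='zarr', consolidated=consolidated)
```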
-----
**Stacktrace**
```
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
245 try:
--> 246 out = await method(**additional_kwargs)
247 return out
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/aiobotocore/client.py in _make_api_call(self, operation_name, api_params)
154 error_class = self.exceptions.from_code(error_code)
--> 155 raise error_class(parsed_response, operation_name)
156 else:
ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied
The above exception was the direct cause of the following exception:
PermissionError Traceback (most recent call last)
/var/folders/xf/xwjm3rj52ls9780rvrbbb9tm0000gn/T/ipykernel_84970/645651303.py in <module>
16
17 # Look up the data
---> 18 test_data = xr.open_mfdataset(s3_lookups, engine=""zarr"")
19 test_data
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
911 getattr_ = getattr
912
--> 913 datasets = [open_(p, **open_kwargs) for p in paths]
914 closers = [getattr_(ds, ""_close"") for ds in datasets]
915 if preprocess is not None:
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/api.py in <listcomp>(.0)
911 getattr_ = getattr
912
--> 913 datasets = [open_(p, **open_kwargs) for p in paths]
914 closers = [getattr_(ds, ""_close"") for ds in datasets]
915 if preprocess is not None:
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
495
496 overwrite_encoded_chunks = kwargs.pop(""overwrite_encoded_chunks"", None)
--> 497 backend_ds = backend.open_dataset(
498 filename_or_obj,
499 drop_variables=drop_variables,
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel, lock)
824
825 filename_or_obj = _normalize_path(filename_or_obj)
--> 826 store = ZarrStore.open_group(
827 filename_or_obj,
828 group=group,
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, stacklevel)
367 if consolidated is None:
368 try:
--> 369 zarr_group = zarr.open_consolidated(store, **open_kwargs)
370 except KeyError:
371 warnings.warn(
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/zarr/convenience.py in open_consolidated(store, metadata_key, mode, **kwargs)
1176
1177 # setup metadata store
-> 1178 meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
1179
1180 # pass through
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/zarr/storage.py in __init__(self, store, metadata_key)
2767
2768 # retrieve consolidated metadata
-> 2769 meta = json_loads(store[metadata_key])
2770
2771 # check format of consolidated metadata
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
131 k = self._key_to_str(key)
132 try:
--> 133 result = self.fs.cat(k)
134 except self.missing_exceptions:
135 if default is not None:
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
86 def wrapper(*args, **kwargs):
87 self = obj or args[0]
---> 88 return sync(self.loop, func, *args, **kwargs)
89
90 return wrapper
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
67 raise FSTimeoutError
68 if isinstance(result[0], BaseException):
---> 69 raise result[0]
70 return result[0]
71
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
23 coro = asyncio.wait_for(coro, timeout=timeout)
24 try:
---> 25 result[0] = await coro
26 except Exception as ex:
27 result[0] = ex
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in _cat(self, path, recursive, on_error, **kwargs)
342 ex = next(filter(is_exception, out), False)
343 if ex:
--> 344 raise ex
345 if (
346 len(paths) > 1
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/s3fs/core.py in _cat_file(self, path, version_id, start, end)
849 else:
850 head = {}
--> 851 resp = await self._call_s3(
852 ""get_object"",
853 Bucket=bucket,
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
263 except Exception as e:
264 err = e
--> 265 raise translate_boto_error(err)
266
267 call_s3 = sync_wrapper(_call_s3)
PermissionError: Access Denied
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5918/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
955029073,MDU6SXNzdWU5NTUwMjkwNzM=,5643,"open_mfdataset from zarr store, consolidated=None warns, consolidated=False is slow, consolidated=True fails",5509356,open,0,,,0,2021-07-28T16:23:53Z,2021-08-14T17:41:40Z,,NONE,,,,"**What happened**:
With xarray 0.19.0, using open_mfdataset to read from a zarr store written with a previous version of xarray (with consolidated=True), I get the following results depending on the consolidated parameter:
1. None or not provided: The call takes 3 seconds and spits out a warning saying to set the consolidated flag.
2. True: The call fails
3. False: The call takes 15 seconds (5x longer than None) but doesn't warn
Hopefully it's okay if I include the actual code rather than trying to create a test zarr store that reproduces the situation:
```python
import s3fs
import xarray as xr
top_group_url = 's3://hrrrzarr/sfc/20200801/20200801_00z_anl.zarr'
group_url = f'{top_group_url}/surface/GUST'
subgroup_url = f""{group_url}/surface""
fs = s3fs.S3FileSystem(anon=True)
ds = xr.open_mfdataset([s3fs.S3Map(url, s3=fs) for url in [top_group_url, group_url, subgroup_url]],
engine='zarr', consolidated=None)
```
Note that if I remove top_group_url, consolidated=False returns as quickly as consolidated=None and gives the same results, so it's just the top_group_url that slows things down, even though xarray doesn't load any data from it. However, if I pass only top_group_url, xarray loads the dataset in about 300 ms, but it's empty.
**What I expected to happen**:
1. I would expect that providing top_group_url (where the .zmetadata file is located) would speed up the call and prevent the warning when consolidated=None, since the metadata obviously is consolidated.
2. I wouldn't expect it to be possible to construct a call where every value of consolidated either warns, fails, or is really slow, and where the warning directs me either to one of the calls that fails or is slow, or to update a data store I don't have write permissions for.
3. I would expect the data store to be efficiently and easily read using xarray since it was written by xarray.
**Anything else we need to know?**:
This zarr store cannot be (usefully) opened by xarray without open_mfdataset, due to an issue I brought up in discussion #5584 that no one has replied to so far. Basically, the person creating it assumed that if they used xarray to write it, xarray would have no problem reading it; but because there's a slash in the variable names, xarray created a deeply-nested zarr store instead of a store with each variable as a single-level (sub)group, which xarray would have been able to handle.
Each variable was written like this:
```python
ds.to_zarr(store=store, group=i, mode='w', encoding=encoding, consolidated=True)
```
In the example, ""surface/GUST"" is the variable name.
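For completeness, a sketch of one way to reach the data in this nested layout without open_mfdataset: open the leaf subgroup directly. The URL is assembled from the ones above; note this may miss coordinates stored in the parent groups:
```python
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
# The slash in the variable name 'surface/GUST' nested the array under
# surface/GUST/surface, so we target that leaf group directly.
leaf = s3fs.S3Map('s3://hrrrzarr/sfc/20200801/20200801_00z_anl.zarr/surface/GUST/surface', s3=fs)
ds_leaf = xr.open_dataset(leaf, engine='zarr', consolidated=False)
```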
**Environment**:
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:36:15)
[Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.0
xarray: 0.19.0
pandas: 1.3.1
numpy: 1.21.1
scipy: 1.7.0
netCDF4: 1.5.7
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.07.1
distributed: 2021.07.1
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 21.2.1
conda: None
pytest: None
IPython: 7.25.0
sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5643/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue