
issues


3 rows where user = 5509356 sorted by updated_at descending



id: 1064837571 · node_id: I_kwDOAMm_X84_eCHD · number: 6033
title: Threadlocking in DataArray calculations for zarr data depending on where it's loaded from (S3 vs local)
user: adair-kovac (5509356) · state: open · locked: 0 · comments: 7
created_at: 2021-11-26T22:23:14Z · updated_at: 2022-01-24T15:00:53Z · author_association: NONE

What happened:

I am consistently seeing an issue where if I download the same dataset from a particular zarr archive on S3, calculations are slowed down by 3 seconds or more compared to when the same data was loaded locally. (I make sure to load all the data into the datasets before trying the calculations.) These same calculations are subsecond on the locally-loaded data, and it's literally just the same data copied from S3 using the aws cli.

Profiling shows that this is a threadlocking issue.

I have been able to reproduce it:

  • Using DataArray.mean or min as the calculation
  • On different machines, macOS and Linux, on different networks
  • In different versions of Python, xarray, dask, and zarr
  • Loading the full Dataset with coordinates or just the data variable in question (they're in different directories for this archive)
  • On different zarr arrays in the same archive

What you expected to happen:

No major threadlocking issues; calculations on the same data should also perform the same regardless of where it was loaded from.

Minimal Complete Verifiable Example:

See the attached Jupyter Notebook (thread_locking.ipynb.zip) which has the magic for timing and profiling the operations.

```python
import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)

def lookup(path):
    return s3fs.S3Map(path, s3=s3)

path_forecast = "hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES"
ds_from_s3 = xr.open_zarr(lookup(f"{path_forecast}/surface"))
_ = ds_from_s3.PRES.values
```

```
%%time
%%prun -l 2

_ = ds_from_s3.PRES.mean(dim="time").values
```

This takes over 3 seconds, most of it in `{method 'acquire' of '_thread.lock' objects}`.

The same mean with the data in question downloaded via `aws s3 cp --recursive` and then opened locally is 10x faster. The thread lock is still where most of the time goes, but it's much less.
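For comparison, here is a minimal sketch of the local case; the directory path is hypothetical and stands in for wherever `aws s3 cp` put the data:

```python
import xarray as xr

# Hypothetical local copy of the same PRES group, made with
# `aws s3 cp --recursive`; adjust the path to the actual download location.
local_path = "hrrrzarr_local/20211124_00z_fcst.zarr/surface/PRES/surface"
ds_local = xr.open_zarr(local_path)

# Load everything up front, as in the S3 case, so only the calculation is timed.
_ = ds_local.PRES.values

# In the notebook this cell is wrapped in %%time / %%prun -l 2; per the report
# it completes roughly 10x faster than the S3-backed version.
_ = ds_local.PRES.mean(dim="time").values
```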

Environment:

Tested with these and more:

```
Python: 3.10.0
xarray: 0.20.1
dask: 2021.11.2
zarr: 2.10.3
s3fs: 2021.11.0
```

```
Python: 3.9.6
xarray: 0.19.0
dask: 2021.07.1
zarr: 2.8.3
s3fs: 2021.07.0
```

Output of `xr.show_versions()`

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:25:38) [Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 0.20.1
pandas: 1.3.4
numpy: 1.21.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.10.3
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.0
cupy: None
pint: None
sparse: None
setuptools: 59.2.0
pip: 21.3.1
conda: None
pytest: None
IPython: 7.29.0
sphinx: None
```

reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6033/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: issue
id: 1039844354 · node_id: I_kwDOAMm_X849-sQC · number: 5918
title: Reading zarr gives unspecific PermissionError: Access Denied when public data has been consolidated after being written to S3
user: adair-kovac (5509356) · state: open · locked: 0 · comments: 2
created_at: 2021-10-29T18:54:23Z · updated_at: 2021-11-04T16:30:33Z · author_association: NONE

This is a minor issue with an invalid data setup that's easy to get into; I'm reporting it here more for documentation than in expectation of a fix.

Quick summary: If you consolidate metadata on a public zarr dataset in S3, the .zmetadata files end up permission-restricted. If you then try reading with xarray 0.19, it gives an unclear error message and fails to open the dataset, while reading with xarray <=0.18.x works fine. Contrast this with the nice warning you get in 0.19 if the .zmetadata simply doesn't exist.


How this happens

People who wrote data without consolidated=True in past versions of xarray might run into this issue, but in our case we actually did have that parameter originally.

I've previously reported a few issues with the fact that xarray will write arbitrary zarr hierarchies if the variable names contain slashes, and then can't read them properly. One consequence of this is that data written with consolidated=True still doesn't have .zmetadata files where they're needed for xarray.open_mfdataset to read them.

If you try to add the .zmetadata by running

```python
import s3fs
import zarr

fs = s3fs.S3FileSystem()
zarr.consolidate_metadata(s3fs.S3Map(s3_path, s3=fs))
```

directly on the cloud bucket in S3, it writes the .zmetadata files... but permissions are restricted to the user who uploaded them, even if you're writing to a public bucket. (It's an AWS thing.)
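If the goal is to leave the consolidated metadata publicly readable, one possible approach (an untested sketch; it assumes the bucket accepts object ACLs and relies on s3fs forwarding extra parameters via s3_additional_kwargs) is:

```python
import s3fs
import zarr

# Assumption: "ACL": "public-read" is passed through by s3fs to the S3
# put_object calls, so the new .zmetadata objects are world-readable.
fs = s3fs.S3FileSystem(s3_additional_kwargs={"ACL": "public-read"})
zarr.consolidate_metadata(s3fs.S3Map(s3_path, s3=fs))  # same s3_path as above
```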


Why it's a problem

It would be nice if, when xarray goes to read this data, it would see that it has access to the data but not to any usable .zmetadata, and spit out the same warning it gives when .zmetadata doesn't exist. Instead it fails with an uncaught PermissionError: Access Denied, and it's not clear from the output that this is just a .zmetadata issue and that the user can still get the data by passing consolidated=False.

Another problem with this situation is that it causes data that reads just fine in xarray 0.18.x without even a warning message to suddenly give Access Denied from the same code when you update to xarray 0.19.


Workaround

If you're trying to read a dataset that has this issue, you can get the same behavior as in previous versions of xarray like so:

```python
xr.open_mfdataset(s3_lookups, engine="zarr", consolidated=False)
```

The data loads fine, just more slowly than if the .zmetadata were accessible.
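A hypothetical user-side wrapper (not part of xarray) that combines the two behaviours discussed above, trying consolidated metadata first and falling back with a warning when the .zmetadata object exists but can't be read:

```python
import warnings

import xarray as xr

def open_zarr_with_fallback(mapper, **kwargs):
    """Hypothetical helper: treat an unreadable .zmetadata like a missing one
    by retrying with consolidated=False (slower, but it works)."""
    try:
        return xr.open_dataset(mapper, engine="zarr", consolidated=True, **kwargs)
    except PermissionError:
        warnings.warn(
            "Could not read .zmetadata (access denied); "
            "retrying with consolidated=False.",
            RuntimeWarning,
        )
        return xr.open_dataset(mapper, engine="zarr", consolidated=False, **kwargs)
```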


Stacktrace

```
ClientError                               Traceback (most recent call last)
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
    245         try:
--> 246             out = await method(*additional_kwargs)
    247             return out

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/aiobotocore/client.py in _make_api_call(self, operation_name, api_params)
    154             error_class = self.exceptions.from_code(error_code)
--> 155             raise error_class(parsed_response, operation_name)
    156         else:

ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

The above exception was the direct cause of the following exception:

PermissionError                           Traceback (most recent call last)
/var/folders/xf/xwjm3rj52ls9780rvrbbb9tm0000gn/T/ipykernel_84970/645651303.py in <module>
     16
     17 # Look up the data
---> 18 test_data = xr.open_mfdataset(s3_lookups, engine="zarr")
     19 test_data

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
    911         getattr_ = getattr
    912
--> 913     datasets = [open_(p, **open_kwargs) for p in paths]
    914     closers = [getattr_(ds, "_close") for ds in datasets]
    915     if preprocess is not None:

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/api.py in <listcomp>(.0)
    911         getattr_ = getattr
    912
--> 913     datasets = [open_(p, **open_kwargs) for p in paths]
    914     closers = [getattr_(ds, "_close") for ds in datasets]
    915     if preprocess is not None:

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    495
    496     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 497     backend_ds = backend.open_dataset(
    498         filename_or_obj,
    499         drop_variables=drop_variables,

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel, lock)
    824
    825         filename_or_obj = _normalize_path(filename_or_obj)
--> 826         store = ZarrStore.open_group(
    827             filename_or_obj,
    828             group=group,

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, stacklevel)
    367         if consolidated is None:
    368             try:
--> 369                 zarr_group = zarr.open_consolidated(store, **open_kwargs)
    370             except KeyError:
    371                 warnings.warn(

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/zarr/convenience.py in open_consolidated(store, metadata_key, mode, **kwargs)
   1176
   1177     # setup metadata store
-> 1178     meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
   1179
   1180     # pass through

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/zarr/storage.py in __init__(self, store, metadata_key)
   2767
   2768         # retrieve consolidated metadata
-> 2769         meta = json_loads(store[metadata_key])
   2770
   2771         # check format of consolidated metadata

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
    131         k = self._key_to_str(key)
    132         try:
--> 133             result = self.fs.cat(k)
    134         except self.missing_exceptions:
    135             if default is not None:

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     86     def wrapper(*args, **kwargs):
     87         self = obj or args[0]
---> 88         return sync(self.loop, func, *args, **kwargs)
     89
     90     return wrapper

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in sync(loop, func, *args, timeout=None, **kwargs)
     67         raise FSTimeoutError
     68     if isinstance(result[0], BaseException):
---> 69         raise result[0]
     70     return result[0]
     71

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     23         coro = asyncio.wait_for(coro, timeout=timeout)
     24     try:
---> 25         result[0] = await coro
     26     except Exception as ex:
     27         result[0] = ex

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in _cat(self, path, recursive, on_error, **kwargs)
    342         ex = next(filter(is_exception, out), False)
    343         if ex:
--> 344             raise ex
    345         if (
    346             len(paths) > 1

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/s3fs/core.py in _cat_file(self, path, version_id, start, end)
    849         else:
    850             head = {}
--> 851         resp = await self._call_s3(
    852             "get_object",
    853             Bucket=bucket,

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
    263         except Exception as e:
    264             err = e
--> 265             raise translate_boto_error(err)
    266
    267     call_s3 = sync_wrapper(_call_s3)

PermissionError: Access Denied
```

reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5918/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: issue
id: 955029073 · node_id: MDU6SXNzdWU5NTUwMjkwNzM= · number: 5643
title: open_mfdataset from zarr store, consolidated=None warns, consolidated=False is slow, consolidated=True fails
user: adair-kovac (5509356) · state: open · locked: 0 · comments: 0
created_at: 2021-07-28T16:23:53Z · updated_at: 2021-08-14T17:41:40Z · author_association: NONE

What happened:

With xarray 0.19.0, using open_mfdataset to read from a zarr store written with a previous version of xarray (with consolidated=True), I get the following results depending on the consolidated parameter:

  1. None or not provided: The call takes 3 seconds and spits out a warning saying to set the consolidated flag.
  2. True: The call fails
  3. False: The call takes 15 seconds (5x longer than None) but doesn't warn

Hopefully it's okay if I include the actual code rather than trying to create a test zarr store that reproduces the situation:

```python
import s3fs
import xarray as xr

top_group_url = 's3://hrrrzarr/sfc/20200801/20200801_00z_anl.zarr'
group_url = f'{top_group_url}/surface/GUST'
subgroup_url = f"{group_url}/surface"

fs = s3fs.S3FileSystem(anon=True)
ds = xr.open_mfdataset(
    [s3fs.S3Map(url, s3=fs) for url in [top_group_url, group_url, subgroup_url]],
    engine='zarr',
    consolidated=None,
)
```

Note that if I remove top_group_url, consolidated=False returns as quickly as consolidated=None and gives the same results (so it's just the top_group_url that is slowing things down, even though xarray doesn't load any data from it). However, if I only pass top_group_url, xarray loads the dataset in about 300ms; it's just empty.
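Continuing from the snippet above (reusing fs and the three URL variables), a rough sketch of how the three consolidated cases were compared; the timings in the comment are the ones reported in the list at the top of this issue:

```python
import time

# reported behaviour: None ~3 s plus a warning, True fails, False ~15 s
for consolidated in (None, True, False):
    start = time.perf_counter()
    try:
        xr.open_mfdataset(
            [s3fs.S3Map(url, s3=fs) for url in [top_group_url, group_url, subgroup_url]],
            engine="zarr",
            consolidated=consolidated,
        )
    except Exception as err:
        print(f"consolidated={consolidated}: failed with {type(err).__name__}")
    else:
        print(f"consolidated={consolidated}: {time.perf_counter() - start:.1f} s")
```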

What I expected to happen:

  1. I would expect that providing top_group_url (where the .zmetadata file is located) would speed up the call and prevent the warning when consolidated=None, since obviously the metadata is actually consolidated.
  2. I wouldn't expect to be able to construct a call where every possible value of consolidated either warns, fails, or is really slow, and where the warning directs me either to switch to one of the calls that fails or is slow, or to update the data store, which I don't have write permissions to.
  3. I would expect the data store to be efficiently and easily read using xarray since it was written by xarray.

Anything else we need to know?:

This zarr store cannot be (usefully) opened by xarray without using open_mfdataset due to an issue I brought up in discussion #5584 which no one has replied to so far. Basically, the person creating it assumed that if they used xarray to write it, xarray would have no problem reading it, but since there's a slash in the variable names, xarray created it as a deeply-nested zarr store instead of a store with each variable as a single-level (sub)group that xarray would have been able to handle.

Each variable was written like this:

```python
ds.to_zarr(store=store, group=i, mode='w', encoding=encoding, consolidated=True)
```

In the example, "surface/GUST" is the variable name.
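As a self-contained illustration (made-up data, local store) of why a slash in the group name produces the deeply nested layout described above:

```python
import numpy as np
import xarray as xr

# Hypothetical one-variable dataset standing in for a real HRRR field.
ds = xr.Dataset({"GUST": ("time", np.zeros(3))})

# A "/" in `group` nests the arrays under example.zarr/surface/GUST/...
# rather than one flat group per variable, so readers have to be pointed
# at the nested group paths explicitly.
ds.to_zarr("example.zarr", group="surface/GUST", mode="w", consolidated=True)
```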

Environment:

Output of `xr.show_versions()`

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:36:15) [Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.0
xarray: 0.19.0
pandas: 1.3.1
numpy: 1.21.1
scipy: 1.7.0
netCDF4: 1.5.7
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.07.1
distributed: 2021.07.1
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 21.2.1
conda: None
pytest: None
IPython: 7.25.0
sphinx: None
```

reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5643/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: issue


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);