issues
3 rows where state = "open" and user = 5509356 sorted by updated_at descending
number: 6033 | id: 1064837571 | node_id: I_kwDOAMm_X84_eCHD
title: Threadlocking in DataArray calculations for zarr data depending on where it's loaded from (S3 vs local)
user: adair-kovac (5509356) | state: open | locked: 0 | comments: 7
created_at: 2021-11-26T22:23:14Z | updated_at: 2022-01-24T15:00:53Z | author_association: NONE

body:

What happened: I am consistently seeing an issue where, if I download the same dataset from a particular zarr archive on S3, calculations are slowed down by 3 seconds or more compared to when the same data was loaded locally. (I make sure to load all the data into the datasets before trying the calculations.) These same calculations are subsecond on the locally-loaded data, and it's literally just the same data copied from S3 using the AWS CLI. Profiling shows that this is a threadlocking issue. I have been able to reproduce it:

- Using DataArray.mean or min as the calculation
- On different machines, macOS and Linux, on different networks
- In different versions of Python, xarray, dask, and zarr
- Loading the full Dataset with coordinates or just the data variable in question (they're in different directories for this archive)
- On different zarr arrays in the same archive

What you expected to happen: No major threadlocking issues; also, calculations on the same data should perform the same regardless of where it was loaded from.

Minimal Complete Verifiable Example: See the attached Jupyter Notebook (thread_locking.ipynb.zip), which has the magic for timing and profiling the operations.

```python
import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)

def lookup(path):
    return s3fs.S3Map(path, s3=s3)

path_forecast = "hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES"
ds_from_s3 = xr.open_zarr(lookup(f"{path_forecast}/surface"))
_ = ds_from_s3.PRES.values
```

```
%%time
%%prun -l 2
_ = ds_from_s3.PRES.mean(dim="time").values
```

The same […]

Environment: Tested with these and more:

```
Python: 3.9.6
xarray: 0.19.0
dask: 2021.07.1
zarr: 2.8.3
s3fs: 2021.07.0
```

Output of xr.show_versions():

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:25:38) [Clang 11.1.0]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 0.20.1
pandas: 1.3.4
numpy: 1.21.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.10.3
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.0
cupy: None
pint: None
sparse: None
setuptools: 59.2.0
pip: 21.3.1
conda: None
pytest: None
IPython: 7.29.0
sphinx: None
```
{ "url": "https://api.github.com/repos/pydata/xarray/issues/6033/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
repo: xarray (13221727) | type: issue
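To make the comparison in the report above concrete outside the notebook magics, here is a minimal sketch that times the same mean reduction on the S3-backed dataset and on a local copy. The local path ./PRES_local is a hypothetical placeholder for data copied down with the AWS CLI; this is not the author's attached notebook.

```python
import time

import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)
path = "hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES/surface"

ds_s3 = xr.open_zarr(s3fs.S3Map(path, s3=s3))
ds_local = xr.open_zarr("./PRES_local")  # hypothetical local copy made with `aws s3 cp --recursive`

for label, ds in [("s3", ds_s3), ("local", ds_local)]:
    _ = ds.PRES.values                      # force the initial load, as in the issue
    start = time.perf_counter()
    _ = ds.PRES.mean(dim="time").values     # the reduction that shows the slowdown
    print(label, time.perf_counter() - start)
```

Per the report, the local run should be subsecond while the S3-loaded one stalls for several seconds in thread locks, even though the data has already been pulled into memory.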
number: 5918 | id: 1039844354 | node_id: I_kwDOAMm_X849-sQC
title: Reading zarr gives unspecific PermissionError: Access Denied when public data has been consolidated after being written to S3
user: adair-kovac (5509356) | state: open | locked: 0 | comments: 2
created_at: 2021-10-29T18:54:23Z | updated_at: 2021-11-04T16:30:33Z | author_association: NONE

body:

This is a minor issue with an invalid data setup that's easy to get into; I'm reporting it here more for documentation than expecting a fix.

Quick summary: If you consolidate metadata on a public zarr dataset in S3, the .zmetadata files end up permission-restricted. So if you try reading with xarray 0.19, it gives an unclear error message and fails to open the dataset, while reading with xarray <= 0.18.x works fine. Contrast this with the nice warnings you get in 0.19 if the .zmetadata simply doesn't exist.

How this happens

People who wrote data without consolidated=True in past versions of xarray might run into this issue, but in our case we actually did have that parameter originally. I've previously reported a few issues with the fact that xarray will write arbitrary zarr hierarchies if the variable names contain slashes, and then can't read them properly. One consequence of this is that data written with consolidated=True still doesn't have .zmetadata files where they're needed for xarray.open_mfdataset to read them. If you try to add the .zmetadata by running […]
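The exact command is elided in this export; presumably it is a call to zarr.consolidate_metadata against the S3 store, along these lines (the bucket path and credentials are hypothetical placeholders, not taken from the issue):

```python
import s3fs
import zarr

# Writer credentials (not anonymous), since .zmetadata has to be created in the bucket.
s3 = s3fs.S3FileSystem()
store = s3fs.S3Map("bucket/path/to/dataset.zarr/surface/GUST/surface", s3=s3)

# Consolidate metadata in place; per the summary above, the resulting .zmetadata
# ends up permission-restricted on an otherwise public bucket.
zarr.consolidate_metadata(store)
```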
Why it's a problem

It would be nice if, when xarray goes to read this data, it saw that it has access to the data but not to any usable .zmetadata and emitted the warning it gives when .zmetadata doesn't exist. Instead it fails on an uncaught PermissionError: Access Denied, and it's not clear from the output that this is just a .zmetadata issue and that the user can still get the data by passing consolidated=False.

Another problem with this situation is that data which reads just fine in xarray 0.18.x, without even a warning message, suddenly gives Access Denied from the same code when you update to xarray 0.19.

Workaround

If you're trying to read a dataset that has this issue, you can get the same behavior as in previous versions of xarray like so: […]
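The workaround code itself is elided in this export; based on the surrounding text ("the user can still get the data by passing consolidated=False"), it is presumably something like the following, where s3_lookups is the list of S3Map stores also visible in the stacktrace below (the store path here is a hypothetical placeholder):

```python
import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)
# Hypothetical store paths; the real ones point at the affected archive.
s3_lookups = [s3fs.S3Map("bucket/path/to/dataset.zarr/surface/GUST/surface", s3=s3)]

# consolidated=False skips the restricted .zmetadata entirely,
# matching the behavior of xarray <= 0.18.x.
test_data = xr.open_mfdataset(
    s3_lookups, engine="zarr", backend_kwargs={"consolidated": False}
)
```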
Stacktrace:

```
ClientError                               Traceback (most recent call last)
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
--> 246     out = await method(**additional_kwargs)

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/aiobotocore/client.py in _make_api_call(self, operation_name, api_params)
--> 155     raise error_class(parsed_response, operation_name)

ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

The above exception was the direct cause of the following exception:

PermissionError                           Traceback (most recent call last)
/var/folders/xf/xwjm3rj52ls9780rvrbbb9tm0000gn/T/ipykernel_84970/645651303.py in <module>
     17 # Look up the data
---> 18 test_data = xr.open_mfdataset(s3_lookups, engine="zarr")

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
--> 913     datasets = [open_(p, **open_kwargs) for p in paths]

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/api.py in <listcomp>(.0)
--> 913     datasets = [open_(p, **open_kwargs) for p in paths]

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
--> 497     backend_ds = backend.open_dataset(

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel, lock)
--> 826     store = ZarrStore.open_group(

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, stacklevel)
    367     if consolidated is None:
    368         try:
--> 369             zarr_group = zarr.open_consolidated(store, **open_kwargs)
    370         except KeyError:
    371             warnings.warn(

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/zarr/convenience.py in open_consolidated(store, metadata_key, mode, **kwargs)
-> 1178     meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/zarr/storage.py in __init__(self, store, metadata_key)
-> 2769     meta = json_loads(store[metadata_key])

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
--> 133     result = self.fs.cat(k)

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
---> 88     return sync(self.loop, func, *args, **kwargs)

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
---> 69     raise result[0]

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
---> 25     result[0] = await coro

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in _cat(self, path, recursive, on_error, **kwargs)
--> 344     raise ex

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/s3fs/core.py in _cat_file(self, path, version_id, start, end)
--> 851     resp = await self._call_s3(

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
--> 265     raise translate_boto_error(err)

PermissionError: Access Denied
```
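For context, the frame at xarray/backends/zarr.py lines 367-371 in the trace shows that the consolidated=None fallback only catches KeyError. A hedged sketch of the behavior the report asks for (not the actual xarray source) would also catch PermissionError and fall back with a warning:

```python
import warnings

import zarr

def open_group_tolerating_bad_zmetadata(store, **open_kwargs):
    """Sketch: treat an unreadable .zmetadata the same as a missing one."""
    try:
        return zarr.open_consolidated(store, **open_kwargs)
    except (KeyError, PermissionError):  # PermissionError is the suggested addition
        warnings.warn(
            "Could not read consolidated metadata; falling back to consolidated=False.",
            RuntimeWarning,
        )
        return zarr.open_group(store, **open_kwargs)
```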
{ "url": "https://api.github.com/repos/pydata/xarray/issues/5918/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
repo: xarray (13221727) | type: issue
number: 5643 | id: 955029073 | node_id: MDU6SXNzdWU5NTUwMjkwNzM=
title: open_mfdataset from zarr store, consolidated=None warns, consolidated=False is slow, consolidated=True fails
user: adair-kovac (5509356) | state: open | locked: 0 | comments: 0
created_at: 2021-07-28T16:23:53Z | updated_at: 2021-08-14T17:41:40Z | author_association: NONE

body:

What happened: With xarray 0.19.0, using open_mfdataset to read from a zarr store written with a previous version of xarray (with consolidated=True), I get the following results depending on the consolidated parameter (as the title says: consolidated=None warns, consolidated=False is slow, and consolidated=True fails): […]
Hopefully it's okay if I include the actual code rather than trying to create a test zarr store that reproduces the situation:

```python
import s3fs
import xarray as xr

top_group_url = 's3://hrrrzarr/sfc/20200801/20200801_00z_anl.zarr'
group_url = f'{top_group_url}/surface/GUST'
subgroup_url = f"{group_url}/surface"

fs = s3fs.S3FileSystem(anon=True)
```

What I expected to happen: […]
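The reproduction cell above appears to be cut off before the actual open_mfdataset calls. A hedged guess at the missing part, reusing subgroup_url and fs from the snippet and exercising the three consolidated settings named in the title (a reconstruction, not the author's original code):

```python
# Continues the snippet above (s3fs, xarray, subgroup_url and fs already defined).
store = s3fs.S3Map(subgroup_url, s3=fs)

ds = xr.open_mfdataset([store], engine="zarr")  # consolidated left as None: opens, but warns
ds = xr.open_mfdataset([store], engine="zarr",
                       backend_kwargs={"consolidated": False})  # opens, but slow
ds = xr.open_mfdataset([store], engine="zarr",
                       backend_kwargs={"consolidated": True})   # fails, per the title
```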
Anything else we need to know?: This zarr store cannot be (usefully) opened by xarray without using open_mfdataset, due to an issue I brought up in discussion #5584 which no one has replied to so far. Basically, the person creating it assumed that if they used xarray to write it, xarray would have no problem reading it. But since there's a slash in the variable names, xarray created it as a deeply-nested zarr store, instead of a store with each variable as a single-level (sub)group that xarray would have been able to handle. Each variable was written like this: […]
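The write-side call is elided in this export. One plausible reconstruction, consistent with the nested .../surface/GUST/surface paths used earlier in this issue (hypothetical dataset contents and store name, not the author's code):

```python
import numpy as np
import xarray as xr

# A variable whose name contains a slash, written one variable at a time.
var = "surface/GUST"
ds = xr.Dataset({var: (("time", "y", "x"), np.zeros((1, 2, 2)))})

# Writing with the slashed name as the group produces the deeply nested layout
# (.../surface/GUST/surface/GUST) rather than one flat (sub)group per variable.
ds[[var]].to_zarr("20200801_00z_anl.zarr", group=var, mode="w", consolidated=True)
```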
Environment:

Output of xr.show_versions():

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:36:15) [Clang 11.1.0]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.0
xarray: 0.19.0
pandas: 1.3.1
numpy: 1.21.1
scipy: 1.7.0
netCDF4: 1.5.7
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.07.1
distributed: 2021.07.1
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 21.2.1
conda: None
pytest: None
IPython: 7.25.0
sphinx: None
```
{ "url": "https://api.github.com/repos/pydata/xarray/issues/5643/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
repo: xarray (13221727) | type: issue
Table schema:

```sql
CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo] ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone] ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee] ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user] ON [issues] ([user]);
```
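As a usage note, the filter described at the top of this page can be reproduced against a local copy of the database with the standard library; the file name github.db is a placeholder, not part of the export:

```python
import sqlite3

conn = sqlite3.connect("github.db")  # hypothetical local copy of this database
rows = conn.execute(
    """
    SELECT id, number, title, state, comments, created_at, updated_at
    FROM issues
    WHERE state = 'open' AND user = 5509356
    ORDER BY updated_at DESC
    """
).fetchall()
for row in rows:
    print(row)
```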