Comment thread on pydata/xarray issue #6033 (https://github.com/pydata/xarray/issues/6033), listed newest first.

---

**user 6042212 (CONTRIBUTOR)**, 2022-01-24T15:00:53Z
<https://github.com/pydata/xarray/issues/6033#issuecomment-1020190813>

It would be interesting to turn on s3fs logging to see the access pattern, if you are interested:

```python
fsspec.utils.setup_logging(logger_name="s3fs")
```

In particular, I am interested in whether xarray is loading chunk-by-chunk serially or concurrently. It would also be good to know your chunk size versus the total array size.

The dask version is interesting:

```python
xr.open_zarr(lookup(f"{path_forecast}/surface"), chunks={})  # uses dask
```

Here the dask partition size will be the same as the underlying chunk size. If you find a lot of latency (small chunks), you can sometimes get an order-of-magnitude increase in download performance by specifying a chunk size along some dimension(s) that is a multiple of the on-disk size. I wouldn't normally recommend dask just for loading the data into memory, but feel free to experiment.

---

**user 2443309 (MEMBER)**, 2022-01-20T07:25:28Z (edited 2022-01-20T19:59:22Z)
<https://github.com/pydata/xarray/issues/6033#issuecomment-1017189009>

It is worth mentioning that, specifically when using Zarr with fsspec, you have multiple layers of caching available.

1. You can ask fsspec to cache files locally:

   ```python
   path = 's3://hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES'
   ds = xr.open_zarr('simplecache::' + path)
   ```

   (more details on configuration: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally)

2. You can ask Zarr to cache chunks as they are read:

   ```python
   mapper = fsspec.get_mapper(path)
   store = LRUStoreCache(mapper, max_size=1e9)
   ds = xr.open_zarr(store)
   ```

   (more details on configuration: https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.LRUStoreCache)

3. You can configure a more complex mapper/cache using third-party mappers (e.g. [Zict](https://zict.readthedocs.io/en/latest/)).

Perhaps @martindurant has more to add here?

---

**user 14371165 (MEMBER)**, 2021-11-28T10:59:05Z
<https://github.com/pydata/xarray/issues/6033#issuecomment-981064185>

If you think the data would fit in memory, maybe #5704 would be enough?
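---

*Editor's note:* a sketch of the chunk-aggregation advice from the first comment above. Everything here is illustrative, not from the thread: `anon=True` assumes the HRRR bucket allows anonymous access, and `"time"` is a hypothetical dimension name standing in for whichever dimension you aggregate along.

```python
import fsspec
import xarray as xr

# Path from the thread; anon=True is an assumption (public bucket).
path = "s3://hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES"
mapper = fsspec.get_mapper(path, anon=True)

# chunks={} gives one dask partition per on-disk zarr chunk (the baseline above).
ds = xr.open_zarr(mapper, chunks={})

# With small on-disk chunks, per-request latency dominates. Requesting
# partitions that are an integer multiple of the on-disk chunk size along
# a dimension means fewer, larger reads. "time" is a hypothetical name.
on_disk = ds.chunks["time"][0]
ds_coarse = xr.open_zarr(mapper, chunks={"time": 4 * on_disk})
```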
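---

*Editor's note:* the `LRUStoreCache` snippet in the caching list above omits its imports. A self-contained version, assuming zarr 2.x (where `LRUStoreCache` lives in `zarr.storage`) and anonymous access to the public bucket:

```python
import fsspec
import xarray as xr
from zarr.storage import LRUStoreCache  # zarr 2.x location

path = "s3://hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES"
mapper = fsspec.get_mapper(path, anon=True)  # anon=True: assumed public bucket

# Keep up to ~1 GB of fetched chunks in memory; repeated reads of the
# same chunk then hit the cache instead of going back to S3.
store = LRUStoreCache(mapper, max_size=1_000_000_000)
ds = xr.open_zarr(store)
```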
---

**user 5509356 (adair-kovac, NONE)**, 2021-11-28T04:57:48Z
<https://github.com/pydata/xarray/issues/6033#issuecomment-980840643>

@max-sixty Okay, yeah, that's the problem: it's re-downloading the data every time the values are accessed. Apparently this is the default behavior given that zarr is a chunked format.

Adding `cache=True`:

- fixes the problem in `open_dataset`
- throws an error in `open_zarr`
- doesn't have any noticeable effect in `open_mfdataset`

My data archive can't normally be read usefully without `open_mfdataset`, and it's small enough to fit easily in memory, so this behavior isn't ideal. I had assumed the data would be stored on disk temporarily even if it wasn't kept in memory, so it's an unexpected limitation that the only choices are caching it in memory or re-reading from S3 every time the data is accessed. It also seems odd that the default caching logic only considers whether the data is chunked, not how big (or small) it is, how slow the store is to access, or whether the data is being repeatedly accessed.

---

**user 5635139 (max-sixty, MEMBER)**, 2021-11-27T04:43:31Z
<https://github.com/pydata/xarray/issues/6033#issuecomment-980501470>

> Is there a way to check what is and isn't downloaded?

What is the time difference between the approach you've tried vs. before anything is downloaded?

---

**user 5509356 (adair-kovac, NONE)**, 2021-11-27T00:49:25Z (edited 2021-11-27T00:50:28Z)
<https://github.com/pydata/xarray/issues/6033#issuecomment-980477705>

@max-sixty There shouldn't be any download happening by the time I'm seeing this issue. If you check the notebook ([also here](https://github.com/adair-kovac/examples/blob/master/thread_locking.ipynb) if it's easier to read), ~~I check that the data is downloaded (by looking at the dataset `nbytes`) before attempting the computation and verify it hasn't changed afterward.~~ Wait, never mind, that doesn't actually work; I just verified that `nbytes` reports the same size even when I've only just opened the dataset. Is there a way to check what is and isn't downloaded?

In any case, I call `.values` on the data beforehand, and it has the same issue if I run the method a second (third, fourth, fifth) time. Unless it's repeatedly re-downloading the same data for some reason, download doesn't seem to be the problem.

The dataset is about 350 MB and has 48 x 150 x 150 chunks. I haven't tried creating smaller or larger datasets and posting them to S3 to see if it happens with them too.

---

**user 5635139 (max-sixty, MEMBER)**, 2021-11-26T22:41:32Z
<https://github.com/pydata/xarray/issues/6033#issuecomment-980459050>

Thanks @adair-kovac. To what extent is this the time to download the data? How big is the dataset? What's the absolute difference for a very small dataset? Or for a large dataset, including the time to download the data first?

The threading issue may be threading contention, or it could be the main thread waiting for another thread to complete the download (others will know more here).
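---

*Editor's note:* a minimal sketch of the `cache=True` workaround that was reported above to fix `open_dataset`. The `engine="zarr"` route and `anon=True` are assumptions; with `chunks=None` the variables are lazy backend arrays rather than dask arrays, and `cache=True` keeps each variable's values in memory after the first read.

```python
import fsspec
import xarray as xr

# Path and anon=True are assumptions (public HRRR bucket from the thread).
path = "s3://hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES"
mapper = fsspec.get_mapper(path, anon=True)

# chunks=None avoids dask; cache=True stores values on first access,
# so repeated .values calls do not re-download from S3.
ds = xr.open_dataset(mapper, engine="zarr", cache=True, chunks=None)
```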
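---

*Editor's note:* one way to answer the two questions above ("what is and isn't downloaded?", "what is the time difference?") is to combine the s3fs logging tip from the first comment with a repeated-read timing loop. A sketch, again assuming anonymous access to the same bucket:

```python
import time

import fsspec
import xarray as xr

# Log every request s3fs issues; repeated GETs for the same chunk keys
# on the second pass confirm the data is being re-downloaded.
fsspec.utils.setup_logging(logger_name="s3fs")

path = "s3://hrrrzarr/sfc/20211124/20211124_00z_fcst.zarr/surface/PRES"
ds = xr.open_zarr(fsspec.get_mapper(path, anon=True))

for attempt in (1, 2):
    t0 = time.perf_counter()
    for var in ds.data_vars.values():
        _ = var.values  # dask-backed: the graph re-executes on every call
    print(f"pass {attempt}: {time.perf_counter() - t0:.1f} s")
```

Similar timings on both passes, together with repeated GETs in the log, would indicate the chunks are fetched from S3 again on every access.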