# Opening zarr dataset with poor connection leads to NaN chunks

*xarray issue #8842 · opened 2024-03-15 · open · 21 comments*

## Problem

I am using xarray to open zarr datasets located in an S3 bucket. However, it can happen that the result does not contain all the chunks and we get NaNs instead. It is usually linked to a low-bandwidth internet connection combined with requesting a lot of chunks.

## More details

In our case (see code below), we started tracking the HTTP calls with HTTP-tracking software to understand the problem a bit better. Three cases are possible for the response when getting a chunk:

- 200: we get the chunk with the data.
- 403: missing data; this is normal, as I am dealing with ocean data, so the chunks over the continents don't exist.
- no response: there isn't even a response, so the GET request "fails" and we don't get the data.

![Screenshot_http_tracking](https://github.com/pydata/xarray/assets/38732257/4a8c9a54-19ac-4990-a955-a72893b47cb0)

The last case is a big problem, because we end up with randomly empty chunks, and as a user it is also very annoying to detect. We also noticed that with `xarray.open_dataset` the calls all seem to be made at the same time, which increases the probability of NaN chunks. That is why we tried `xarray.open_mfdataset`, since each worker then issues its GET requests, i.e. fetches the chunks, one by one.

## Questions

- Why does `xarray.open_dataset` send all the requests concurrently? Is it possible to control the number of requests and do some kind of rolling batch gather? (A concurrency-limiting sketch is included at the end of this issue.)
- Is there a way to raise an exception when there is no response from the server, so that at least, as users, we don't have to check the data manually? (A detection sketch follows right after this list.)
- Any idea how to solve this problem?
- Maybe this is linked to the zarr library?
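To make the second question concrete: the only detection we have come up with so far is a manual post-hoc check, sketched below. It downloads the same subset twice (the variable `thetao` and the slice values are copied from the reproduction scripts further down) and compares the NaN masks: land points are NaN in both downloads, whereas silently dropped chunks are usually NaN in only one of them. This is a sketch of a workaround, not a fix, and it doubles the traffic.

```python
import xarray as xr

URL = "https://s3.waw3-1.cloudferro.com/mdl-arco-geo-012/arco/GLOBAL_ANALYSISFORECAST_PHY_001_024/cmems_mod_glo_phy-thetao_anfc_0.083deg_P1D-m_202211/geoChunked.zarr"


def load_subset() -> xr.DataArray:
    # Same subset as in the reproduction scripts below; .load() forces the
    # chunk downloads to happen here.
    data = xr.open_dataset(URL, engine="zarr")
    return (
        data.thetao.sel(
            longitude=slice(-170.037309004901026, -70.037309004901026),
            latitude=slice(-80.27257431850789, -40.27257431850789),
            time=slice("2023-03-20T00:00:00", "2023-03-20T00:00:00"),
        )
        .sel(elevation=0, method="nearest")
        .load()
    )


# NaNs caused by land are identical in both downloads; NaNs caused by a
# missing HTTP response usually are not.
first, second = load_subset(), load_subset()
if bool((first.isnull() != second.isnull()).any()):
    raise RuntimeError(
        "NaN pattern differs between two downloads: some chunks were silently dropped"
    )
```

Of course this cannot catch a chunk that happens to be dropped in both downloads, which is why a real exception from the reader side would be much better.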
## To reproduce

This bug is difficult to reproduce. The only way I managed to reproduce it is with a computer tethered to a phone on 3G; with that setup it happens every time. With a good connection and on my own machine it never happens. We have nevertheless received several reports of this problem.

See the two scripts: one with `open_dataset`:

```python
import xarray as xr
import matplotlib.pyplot as plt
import time
import sys
import logging

logging.basicConfig(
    stream=sys.stdout,
    format="%(asctime)s | %(name)14s | %(levelname)7s | %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    encoding="utf-8",
    level=logging.ERROR,
)
logging.getLogger("timeloop").setLevel(logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)
logging.getLogger("s3fs").setLevel(logging.DEBUG)
logging.getLogger("fsspec").setLevel(logging.DEBUG)
logging.getLogger("asyncio").setLevel(logging.DEBUG)
# logging.getLogger("numba").setLevel(logging.ERROR)
logging.getLogger("s3transfer").setLevel(logging.DEBUG)

start_time = time.time()
print("Starting...")
data = xr.open_dataset(
    "https://s3.waw3-1.cloudferro.com/mdl-arco-geo-012/arco/GLOBAL_ANALYSISFORECAST_PHY_001_024/cmems_mod_glo_phy-thetao_anfc_0.083deg_P1D-m_202211/geoChunked.zarr",
    engine="zarr",
)
print("Dataset opened...")
bla = data.thetao.sel(
    longitude=slice(-170.037309004901026, -70.037309004901026),
    latitude=slice(-80.27257431850789, -40.27257431850789),
    time=slice("2023-03-20T00:00:00", "2023-03-20T00:00:00"),
).sel(elevation=0, method="nearest")
print("Plotting...")
map = bla.isel(time=0).plot()
# map = data.isel(time=0).plot()
print("Saving image...")
plt.savefig("./bla_fast.png")
print("Total processing time:", (time.time() - start_time))
```

and the other one with `open_mfdataset`:

```python
import xarray as xr
import matplotlib.pyplot as plt
import time
import sys
import dask
import logging

logging.basicConfig(
    stream=sys.stdout,
    format="%(asctime)s | %(name)14s | %(levelname)7s | %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    encoding="utf-8",
    level=logging.ERROR,
)
logging.getLogger("timeloop").setLevel(logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)
logging.getLogger("s3fs").setLevel(logging.DEBUG)
logging.getLogger("fsspec").setLevel(logging.DEBUG)
logging.getLogger("asyncio").setLevel(logging.DEBUG)
# logging.getLogger("numba").setLevel(logging.ERROR)
logging.getLogger("s3transfer").setLevel(logging.DEBUG)

start_time = time.time()
with dask.config.set(num_workers=2):
    print("Starting...")
    data = xr.open_mfdataset(
        ["https://s3.waw3-1.cloudferro.com/mdl-arco-geo-012/arco/GLOBAL_ANALYSISFORECAST_PHY_001_024/cmems_mod_glo_phy-thetao_anfc_0.083deg_P1D-m_202211/geoChunked.zarr"],
        engine="zarr",
    )
    # .thetao.sel(longitude=slice(-170.037309004901026, -70.037309004901026),
    # latitude=slice(-80.27257431850789, -40.27257431850789),
    # time=slice("2023-03-20T00:00:00", "2023-03-20T00:00:00")).sel(elevation=0,
    # method="nearest")
    print("Dataset opened...")
    bla = data.thetao.sel(
        longitude=slice(-170.037309004901026, -70.037309004901026),
        latitude=slice(-80.27257431850789, -40.27257431850789),
        time=slice("2023-03-20T00:00:00", "2023-03-20T00:00:00"),
    ).sel(elevation=0, method="nearest")
    print("Plotting...")
    map = bla.isel(time=0).plot()
    # map = data.isel(time=0).plot()
    print("Saving image...")
    plt.savefig("./bla_long.png")
    print("Total processing time:", (time.time() - start_time))
```

## Expected result

![bla_long](https://github.com/pydata/xarray/assets/38732257/c08d5586-f2fe-4826-99aa-5acb2bd2de81)

(or a failed run with an explicit error)

## Obtained result

![bla_fast](https://github.com/pydata/xarray/assets/38732257/391a543b-de28-44e6-82ec-221f43890a71)