issues
1 row where user = 38732257 sorted by updated_at descending
id: 2188557281
node_id: I_kwDOAMm_X86Ccrvh
number: 8842
title: Opening zarr dataset with poor connection leads to NaN chunks
user: renaudjester (38732257)
state: open
locked: 0
comments: 21
created_at: 2024-03-15T13:47:18Z
updated_at: 2024-04-28T20:05:15Z
author_association: NONE

body:

**Problem**

I am using xarray to open zarr datasets located in an S3 bucket. However, it can happen that the result doesn't contain all the chunks, and we get NaNs instead. It is usually linked to a low-bandwidth internet connection combined with a request for a lot of chunks.

**More details**

In our case (see the code below), we started tracking the HTTP calls with HTTP tracking software to understand the problem a bit better. Three cases are possible for the response when getting a chunk:
- 200: we get the chunk with the data
- 403: missing data; this is expected, as I am dealing with ocean data, so the chunks associated with the continents don't exist
- no response: there isn't even a response, so the GET request "fails" and we don't have the data
The latter is a big problem, as we end up with randomly empty chunks! As a user, this is also very annoying to detect (see the detection sketch after the scripts below). We also noticed that when using […]

**Questions**

[…]
**To reproduce**

This bug is difficult to reproduce. The only way I managed to reproduce it is with a computer connected to a phone that is on 3G; with that setup it happens all the time, while with a good connection on my computer it never happens. We have had several reports of this problem otherwise. See the two scripts: one with […]

```python
import sys
import time
import logging

import matplotlib.pyplot as plt
import xarray as xr

logging.basicConfig(
    stream=sys.stdout,
    format="%(asctime)s | %(name)14s | %(levelname)7s | %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    encoding="utf-8",
    level=logging.ERROR,
)
logging.getLogger("timeloop").setLevel(logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)
logging.getLogger("s3fs").setLevel(logging.DEBUG)
logging.getLogger("fsspec").setLevel(logging.DEBUG)
logging.getLogger("asyncio").setLevel(logging.DEBUG)
logging.getLogger("numba").setLevel(logging.ERROR)
logging.getLogger("s3transfer").setLevel(logging.DEBUG)

start_time = time.time()
print("Starting...")
data = xr.open_dataset(
    "https://s3.waw3-1.cloudferro.com/mdl-arco-geo-012/arco/"
    "GLOBAL_ANALYSISFORECAST_PHY_001_024/"
    "cmems_mod_glo_phy-thetao_anfc_0.083deg_P1D-m_202211/geoChunked.zarr",
    engine="zarr",
)
print("Dataset opened...")
bla = data.thetao.sel(
    longitude=slice(-170.037309004901026, -70.037309004901026),
    latitude=slice(-80.27257431850789, -40.27257431850789),
    time=slice("2023-03-20T00:00:00", "2023-03-20T00:00:00"),
).sel(elevation=0, method="nearest")
print("Plotting... ")
map = bla.isel(time=0).plot()
print("Saving image...")
plt.savefig("./bla_fast.png")
print("Total processing time:", (time.time() - start_time))
```
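The "no response" case reads like a transport failure that is never retried. As one possible mitigation, here is a minimal sketch that opens the same store through s3fs with an explicit botocore retry policy. The endpoint and bucket path are taken from the URL above, but anonymous access, reachability over the S3 protocol, and the retry settings are assumptions, not something the report confirms:

```python
import s3fs
import xarray as xr

# Assumption: the store is also reachable over the S3 protocol with
# anonymous access. botocore's adaptive retry mode re-issues failed
# requests up to `max_attempts` times.
fs = s3fs.S3FileSystem(
    anon=True,
    client_kwargs={"endpoint_url": "https://s3.waw3-1.cloudferro.com"},
    config_kwargs={"retries": {"max_attempts": 10, "mode": "adaptive"}},
)
store = fs.get_mapper(
    "mdl-arco-geo-012/arco/GLOBAL_ANALYSISFORECAST_PHY_001_024/"
    "cmems_mod_glo_phy-thetao_anfc_0.083deg_P1D-m_202211/geoChunked.zarr"
)
data = xr.open_dataset(store, engine="zarr")
```

Whether such retries actually cover the silent failures seen here is exactly the open question of this report.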
```python
import sys
import time
import logging

import dask

logging.basicConfig(
    stream=sys.stdout,
    format="%(asctime)s | %(name)14s | %(levelname)7s | %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    encoding="utf-8",
    level=logging.ERROR,
)
logging.getLogger("timeloop").setLevel(logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)
logging.getLogger("s3fs").setLevel(logging.DEBUG)
logging.getLogger("fsspec").setLevel(logging.DEBUG)
logging.getLogger("asyncio").setLevel(logging.DEBUG)
logging.getLogger("numba").setLevel(logging.ERROR)
logging.getLogger("s3transfer").setLevel(logging.DEBUG)

start_time = time.time()
with dask.config.set(num_workers=2):
    ...  # rest of the script truncated in the export

print("Total processing time:", (time.time() - start_time))
```

**Expected result**

[…] or failed run

**Obtained result**

[…]
{ "url": "https://api.github.com/repos/pydata/xarray/issues/8842/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
xarray 13221727 | issue |
```sql
CREATE TABLE [issues] (
    [id] INTEGER PRIMARY KEY,
    [node_id] TEXT,
    [number] INTEGER,
    [title] TEXT,
    [user] INTEGER REFERENCES [users]([id]),
    [state] TEXT,
    [locked] INTEGER,
    [assignee] INTEGER REFERENCES [users]([id]),
    [milestone] INTEGER REFERENCES [milestones]([id]),
    [comments] INTEGER,
    [created_at] TEXT,
    [updated_at] TEXT,
    [closed_at] TEXT,
    [author_association] TEXT,
    [active_lock_reason] TEXT,
    [draft] INTEGER,
    [pull_request] TEXT,
    [body] TEXT,
    [reactions] TEXT,
    [performed_via_github_app] TEXT,
    [state_reason] TEXT,
    [repo] INTEGER REFERENCES [repos]([id]),
    [type] TEXT
);
CREATE INDEX [idx_issues_repo] ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone] ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee] ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user] ON [issues] ([user]);
```
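For reference, the row above corresponds to a query along these lines against that schema (a sketch of the filter described at the top of the page, not necessarily the exact SQL the page ran):

```sql
SELECT id, number, title, state, comments, created_at, updated_at
FROM issues
WHERE [user] = 38732257
ORDER BY updated_at DESC;
```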