issues: 2188557281

id: 2188557281
node_id: I_kwDOAMm_X86Ccrvh
number: 8842
title: Opening zarr dataset with poor connection leads to NaN chunks
user: 38732257
state: open
locked: 0
comments: 21
created_at: 2024-03-15T13:47:18Z
updated_at: 2024-04-28T20:05:15Z
author_association: NONE

Problem

I am using xarray to open zarr datasets located in an S3 bucket. However, sometimes the result doesn't contain all the chunks and we get NaNs instead. This usually happens with a low-bandwidth internet connection when requesting a lot of chunks.

More details

In our case (see code below), we started tracking the HTTP calls with HTTP tracking software to understand the problem a bit better. When requesting a chunk, three outcomes are possible:

  • 200: we get the chunk with the data.
  • 403: missing data. This is normal: I am dealing with ocean data, so the chunks associated with the continents don't exist.
  • No response: the server never answers, so the GET request "fails" and we don't have the data.

The last case is a big problem, as chunks end up empty at random. As a user, it is also very annoying to detect.
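The three outcomes above could be told apart explicitly instead of all collapsing into NaN. The following is only a minimal sketch of that idea, not how xarray, zarr or fsspec actually handle responses: `ChunkFetchError` and `interpret_chunk_response` are hypothetical names, and treating 403 as "chunk legitimately absent" is an assumption specific to this bucket.

```python
from typing import Optional


class ChunkFetchError(RuntimeError):
    """Hypothetical error for a chunk request that gave no usable response."""


def interpret_chunk_response(
    status: Optional[int], payload: Optional[bytes]
) -> Optional[bytes]:
    """Classify the outcome of one chunk GET request.

    status is None when the request produced no response at all.
    Returns the payload for 200, returns None for a chunk the store
    reports as absent (403 on this bucket), and raises otherwise.
    """
    if status is None:
        # Third case from the list: nothing came back, so fail loudly.
        raise ChunkFetchError("no response from server")
    if status == 200:
        if payload is None:
            raise ChunkFetchError("200 response without a body")
        return payload
    if status == 403:
        # Assumed to mean "no data here" (e.g. land); the caller fills with NaN.
        return None
    raise ChunkFetchError(f"unexpected HTTP status {status}")
```

With such a split, only the 403 case would be allowed to become a NaN chunk, while a dropped request would surface as an exception.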

We also noticed that with xarray.open_dataset the calls seem to be made all at the same time, which increases the probability of NaN chunks. That's why we tried xarray.open_mfdataset, since each worker issues the GET requests for its chunks one by one.

Questions

  • Why does xarray.open_dataset send all the requests concurrently? Is it possible to control the number of requests and do some kind of rolling batch gather?
  • Is there a way to raise an exception when there is no response from the server, so that at least, as users, we don't have to manually check the data?
  • Any ideas on how to solve this problem?
  • Maybe this is linked to the zarr library?
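To make the first two questions concrete, here is a rough sketch of what a "rolling batch gather" that raises on failure could look like, using only the standard library. It does not reflect what xarray, zarr or s3fs do internally: `fetch_all`, `fetch_one`, `max_concurrent` and `retries` are hypothetical names, and the flaky fetcher in `demo` only simulates a dropped request.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def fetch_all(
    keys: Iterable[str],
    fetch_one: Callable[[str], Awaitable[bytes]],
    max_concurrent: int = 4,
    retries: int = 2,
) -> Dict[str, bytes]:
    """Fetch every key with at most max_concurrent requests in flight.

    A key that still fails after retries + 1 attempts raises RuntimeError
    instead of being silently dropped.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(key: str):
        async with semaphore:  # rolling window: a slot frees up as each request ends
            last_error = None
            for _ in range(retries + 1):
                try:
                    return key, await fetch_one(key)
                except (asyncio.TimeoutError, OSError) as err:
                    last_error = err  # timeout or connection problem: try again
            raise RuntimeError(
                f"chunk {key!r} failed after {retries + 1} attempts"
            ) from last_error

    pairs = await asyncio.gather(*(guarded(key) for key in keys))
    return dict(pairs)


async def demo() -> Dict[str, bytes]:
    calls: Dict[str, int] = {}

    async def flaky_fetch(key: str) -> bytes:
        # Simulated store: the first request for chunk "0.1" is dropped.
        calls[key] = calls.get(key, 0) + 1
        if key == "0.1" and calls[key] == 1:
            raise ConnectionError("simulated dropped request")
        return key.encode()

    return await fetch_all(["0.0", "0.1", "0.2"], flaky_fetch, max_concurrent=2)
```

Running `asyncio.run(demo())` retries the dropped chunk once and returns all three chunks. The semaphore is what bounds the number of simultaneous requests, and the final `RuntimeError` is what would replace a silent NaN chunk.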

To reproduce

This bug is difficult to reproduce. The only way I managed to reproduce it is with a computer tethered to a phone on a 3G connection. With this setup it happens all the time. With a good connection on my computer it never happens. We have also had several reports of this problem from other users.

See the two scripts. The first one uses open_dataset:

```
import xarray as xr
import matplotlib.pyplot as plt
import time
import sys

import logging

logging.basicConfig(
    stream=sys.stdout,
    format="%(asctime)s | %(name)14s | %(levelname)7s | %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    encoding="utf-8",
    level=logging.ERROR,
)
# Log the HTTP and filesystem layers in detail to trace each chunk request.
logging.getLogger("timeloop").setLevel(logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)
logging.getLogger("s3fs").setLevel(logging.DEBUG)
logging.getLogger("fsspec").setLevel(logging.DEBUG)
logging.getLogger("asyncio").setLevel(logging.DEBUG)
# logging.getLogger("numba").setLevel(logging.ERROR)
# logging.getLogger("s3transfer").setLevel(logging.DEBUG)

start_time = time.time()

print("Starting...")
data = xr.open_dataset(
    "https://s3.waw3-1.cloudferro.com/mdl-arco-geo-012/arco/GLOBAL_ANALYSISFORECAST_PHY_001_024/cmems_mod_glo_phy-thetao_anfc_0.083deg_P1D-m_202211/geoChunked.zarr",
    engine="zarr",
)

print("Dataset opened...")
# Select one day of surface temperature over a large ocean region.
bla = data.thetao.sel(
    longitude=slice(-170.037309004901026, -70.037309004901026),
    latitude=slice(-80.27257431850789, -40.27257431850789),
    time=slice("2023-03-20T00:00:00", "2023-03-20T00:00:00"),
).sel(elevation=0, method="nearest")

print("Plotting... ")
map = bla.isel(time=0).plot()
# map = data.isel(time=0).plot()

print("Saving image...")
plt.savefig("./bla_fast.png")

print("Total processing time:", (time.time() - start_time))
```

and the other one with `open_mfdataset`:

```
import xarray as xr
import matplotlib.pyplot as plt
import time
import sys
import dask

import logging

logging.basicConfig(
    stream=sys.stdout,
    format="%(asctime)s | %(name)14s | %(levelname)7s | %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    encoding="utf-8",
    level=logging.ERROR,
)
# Log the HTTP and filesystem layers in detail to trace each chunk request.
logging.getLogger("timeloop").setLevel(logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)
logging.getLogger("s3fs").setLevel(logging.DEBUG)
logging.getLogger("fsspec").setLevel(logging.DEBUG)
logging.getLogger("asyncio").setLevel(logging.DEBUG)
# logging.getLogger("numba").setLevel(logging.ERROR)
# logging.getLogger("s3transfer").setLevel(logging.DEBUG)

start_time = time.time()

# Run with two dask workers; the load happens at plot time, so the
# whole pipeline stays inside the config context.
with dask.config.set(num_workers=2):
    print("Starting...")
    data = xr.open_mfdataset(
        ["https://s3.waw3-1.cloudferro.com/mdl-arco-geo-012/arco/GLOBAL_ANALYSISFORECAST_PHY_001_024/cmems_mod_glo_phy-thetao_anfc_0.083deg_P1D-m_202211/geoChunked.zarr"],
        engine="zarr",
    )

    print("Dataset opened...")
    bla = data.thetao.sel(
        longitude=slice(-170.037309004901026, -70.037309004901026),
        latitude=slice(-80.27257431850789, -40.27257431850789),
        time=slice("2023-03-20T00:00:00", "2023-03-20T00:00:00"),
    ).sel(elevation=0, method="nearest")

    print("Plotting... ")
    map = bla.isel(time=0).plot()
    # map = data.isel(time=0).plot()

    print("Saving image...")
    plt.savefig("./bla_long.png")

print("Total processing time:", (time.time() - start_time))
```

Expected result

(image not preserved)

or failed run

Obtained result

(image not preserved)

reactions: total_count 0 (https://api.github.com/repos/pydata/xarray/issues/8842/reactions)
repo: 13221727
type: issue
