issues: 225774140


number: 1396 · title: selecting a point from an mfdataset
id: 225774140 · node_id: MDU6SXNzdWUyMjU3NzQxNDA= · type: issue · repo: 13221727
user: 1197350 · author_association: MEMBER
state: closed · state_reason: completed · locked: 0 · comments: 12 · reactions: 0
created_at: 2017-05-02T18:02:50Z · updated_at: 2019-01-13T06:32:45Z · closed_at: 2019-01-13T06:32:45Z

Sorry to be opening so many vague performance issues. I am really having a hard time with my current dataset, which is exposing certain limitations of xarray and dask in a way none of my previous work has done.

I have a directory full of netCDF4 files. There are 1754 files, each 8.1GB in size, each representing a single model timestep. So there is ~14 TB of data total. (In addition to the time-dependent output, there is a single file with information about the grid.)
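(In the snippets below, `all_files` stands for the sorted list of these files; something like the following, with a made-up path and filename pattern:)

```python
import glob

# hypothetical location/pattern for the 1754 per-timestep files
all_files = sorted(glob.glob('/path/to/run/output.*.nc'))
```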

Imagine I want to extract a timeseries from a single point (indexed by k, j, i) in this simulation. Without xarray, I would do something like this:

```python
import netCDF4
import numpy as np
from tqdm import tqdm

ts = np.zeros(len(all_files))
for n, fname in enumerate(tqdm(all_files)):
    nc = netCDF4.Dataset(fname)
    ts[n] = nc.variables['Salt'][k, j, i]
    nc.close()
```

This goes reasonably quickly: tqdm gives [02:38<00:00, 11.56it/s].

I could do the same sort of loop using xarray:

```python
import xarray as xr

ts = np.zeros(len(all_files))
for n, fname in enumerate(tqdm(all_files)):
    ds = xr.open_dataset(fname)
    ts[n] = ds['Salt'][k, j, i]
    ds.close()
```

This has a <50% performance overhead: [03:29<00:00, 8.74it/s]. Totally acceptable.

Of course, what I really want is to avoid a loop and deal with the whole dataset as a single self-contained object.

```python
ds = xr.open_mfdataset(all_files, decode_cf=False, autoclose=True)
```

This alone takes between 4-5 minutes to run (see #1385). If I want to print the repr, it takes another 3 minutes or so to `print(ds)`. The full dataset looks like this:

```python
<xarray.Dataset>
Dimensions:   (i: 2160, i_g: 2160, j: 2160, j_g: 2160, k: 90, k_l: 90, k_p1: 91, k_u: 90, time: 1752)
Coordinates:
  * j         (j) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ...
  * k         (k) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * j_g       (j_g) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
  * i         (i) int64 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 ...
  * k_p1      (k_p1) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * k_u       (k_u) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * i_g       (i_g) int64 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 ...
  * k_l       (k_l) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * time      (time) float64 2.592e+05 2.628e+05 2.664e+05 2.7e+05 2.736e+05 ...
Data variables:
    face      (time) int64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
    PhiBot    (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    oceQnet   (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    SIvice    (time, j_g, i) float32 0.0516454 0.0523205 0.0308559 ...
    SIhsalt   (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    oceFWflx  (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    V         (time, k, j_g, i) float32 0.0491903 0.0496442 0.0276739 ...
    iter      (time) int64 10368 10512 10656 10800 10944 11088 11232 11376 ...
    oceQsw    (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    oceTAUY   (time, j_g, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    Theta     (time, k, j, i) float32 -1.31868 -1.27825 -1.21401 -1.17964 ...
    SIhsnow   (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    U         (time, k, j, i_g) float32 0.0281392 0.0203967 0.0075199 ...
    SIheff    (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    SIuice    (time, j, i_g) float32 -0.041163 -0.0487612 -0.0614498 ...
    SIarea    (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    Salt      (time, k, j, i) float32 33.7534 33.7652 33.7755 33.7723 ...
    oceSflux  (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    W         (time, k_l, j, i) float32 -2.27453e-05 -2.28018e-05 ...
    oceTAUX   (time, j, i_g) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    Eta       (time, j, i) float32 -1.28886 -1.28811 -1.2871 -1.28567 ...
    YC        (j, i) float32 -57.001 -57.001 -57.001 -57.001 -57.001 -57.001 ...
    YG        (j_g, i_g) float32 -57.0066 -57.0066 -57.0066 -57.0066 ...
    XC        (j, i) float32 -15.4896 -15.4688 -15.4479 -15.4271 -15.4062 ...
    XG        (j_g, i_g) float32 -15.5 -15.4792 -15.4583 -15.4375 -15.4167 ...
    Zp1       (k_p1) float32 0.0 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 ...
    Z         (k) float32 -0.5 -1.57 -2.79 -4.185 -5.78 -7.595 -9.66 -12.01 ...
    Zl        (k_l) float32 0.0 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 ...
    Zu        (k_u) float32 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 -13.26 ...
    rA        (j, i) float32 1.5528e+06 1.5528e+06 1.5528e+06 1.5528e+06 ...
    rAw       (j, i_g) float32 1.5528e+06 1.5528e+06 1.5528e+06 1.5528e+06 ...
    rAs       (j_g, i) float32 9.96921e+36 9.96921e+36 9.96921e+36 ...
    rAz       (j_g, i_g) float32 1.55245e+06 1.55245e+06 1.55245e+06 ...
    dxG       (j_g, i) float32 1261.27 1261.27 1261.27 1261.27 1261.27 ...
    dyG       (j, i_g) float32 1230.96 1230.96 1230.96 1230.96 1230.96 ...
    dxC       (j, i_g) float32 1261.46 1261.46 1261.46 1261.46 1261.46 ...
    Depth     (j, i) float32 4578.67 4611.09 4647.6 4674.88 4766.75 4782.64 ...
    dyC       (j_g, i) float32 1230.86 1230.86 1230.86 1230.86 1230.86 ...
    PHrefF    (k_p1) float32 0.0 9.81 20.9934 33.7464 48.3633 65.0403 ...
    drF       (k) float32 1.0 1.14 1.3 1.49 1.7 1.93 2.2 2.5 2.84 3.21 3.63 ...
    PHrefC    (k) float32 4.905 15.4017 27.3699 41.0549 56.7018 74.507 ...
    drC       (k_p1) float32 0.5 1.07 1.22 1.395 1.595 1.815 2.065 2.35 2.67 ...
    hFacW     (k, j, i_g) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
    hFacS     (k, j_g, i) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
    hFacC     (k, j, i) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
Attributes:
    coordinates: face
```

Now, to extract the same timeseries, I would like to say

```python
ts = ds.Salt[:, k, j, i].load()
```

I monitor what is happening under the hood when I call this by using netdata and the dask.distributed dashboard, with only a single process and thread. First, all the files are opened (see #1394). Then they start getting read. Each read takes between 10 and 30 seconds, and memory usage increases steadily. My impression is that the entire dataset is being read into memory for concatenation. (I have dumped out the dask graph in case anyone can make sense of it.) I have never let this calculation complete, as it looks like it would eat up all the memory on my system... plus it's extremely slow.
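For concreteness, the setup described above looks roughly like this (a sketch, not the exact script I ran; it assumes graphviz is installed for the graph dump and reuses `all_files`, `k`, `j`, `i` from above):

```python
import xarray as xr
from dask.distributed import Client

# single in-process worker with one thread, so reads happen serially
client = Client(processes=False, n_workers=1, threads_per_worker=1)

# same open_mfdataset call as above
ds = xr.open_mfdataset(all_files, decode_cf=False, autoclose=True)

# build the lazy point extraction without computing it
ts_lazy = ds.Salt[:, k, j, i]

# dump the underlying dask graph to a file for inspection (needs graphviz)
ts_lazy.data.visualize('salt_point_graph.png')

# trigger the computation; progress appears on the distributed dashboard
ts = ts_lazy.load()
```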

To me, this seems like a failure of lazy indexing. I naively expected that the underlying file access would work similarly to my loop, perhaps even in parallel.
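To make that expectation concrete, something along these lines is what I had in mind (a rough sketch using dask.delayed, purely illustrative and not what open_mfdataset actually does internally):

```python
import dask
import netCDF4
import numpy as np

@dask.delayed
def point_from_file(fname, k, j, i):
    # open one file, read a single value, close the file
    with netCDF4.Dataset(fname) as nc:
        return float(nc.variables['Salt'][k, j, i])

# one small, independent read per file...
lazy_values = [point_from_file(fname, k, j, i) for fname in all_files]

# ...which dask could execute in parallel across files
ts = np.array(dask.compute(*lazy_values))
```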

Can anyone shed some light on what might be going wrong?

