issues: 225774140


number: 1396 · title: selecting a point from an mfdataset
id: 225774140 · node_id: MDU6SXNzdWUyMjU3NzQxNDA= · type: issue · repo: 13221727
user: 1197350 · author_association: MEMBER
state: closed · state_reason: completed · locked: 0 · comments: 12 · reactions: 0
created_at: 2017-05-02T18:02:50Z · updated_at: 2019-01-13T06:32:45Z · closed_at: 2019-01-13T06:32:45Z

Sorry to be opening so many vague performance issues. I am really having a hard time with my current dataset, which is exposing certain limitations of xarray and dask in a way none of my previous work has done.

I have a directory full of netCDF4 files. There are 1754 files, each 8.1GB in size, each representing a single model timestep. So there is ~14 TB of data total. (In addition to the time-dependent output, there is a single file with information about the grid.)
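(In the snippets below, `all_files` stands for the sorted list of these files; something like the following, with a made-up path and filename pattern:)

```python
import glob

# hypothetical location/pattern for the 1754 per-timestep files
all_files = sorted(glob.glob('/path/to/run/output.*.nc'))
```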

Imagine I want to extract a timeseries from a single point (indexed by k, j, i) in this simulation. Without xarray, I would do something like this:

```python
import netCDF4
import numpy as np
from tqdm import tqdm

ts = np.zeros(len(all_files))
for n, fname in enumerate(tqdm(all_files)):
    nc = netCDF4.Dataset(fname)
    ts[n] = nc.variables['Salt'][k, j, i]
    nc.close()
```

This goes reasonably quickly: tqdm gives [02:38<00:00, 11.56it/s].

I could do the same sort of loop using xarray:

```python
import xarray as xr

ts = np.zeros(len(all_files))
for n, fname in enumerate(tqdm(all_files)):
    ds = xr.open_dataset(fname)
    ts[n] = ds['Salt'][k, j, i]
    ds.close()
```

This has a <50% performance overhead: [03:29<00:00, 8.74it/s]. Totally acceptable.

Of course, what I really want is to avoid a loop and deal with the whole dataset as a single self-contained object.

```python
ds = xr.open_mfdataset(all_files, decode_cf=False, autoclose=True)
```

This alone takes between 4-5 minutes to run (see #1385). If I want to print the repr, it takes another 3 minutes or so to `print(ds)`. The full dataset looks like this:

```python
<xarray.Dataset>
Dimensions:   (i: 2160, i_g: 2160, j: 2160, j_g: 2160, k: 90, k_l: 90, k_p1: 91, k_u: 90, time: 1752)
Coordinates:
  * j         (j) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ...
  * k         (k) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * j_g       (j_g) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
  * i         (i) int64 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 ...
  * k_p1      (k_p1) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * k_u       (k_u) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * i_g       (i_g) int64 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 ...
  * k_l       (k_l) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * time      (time) float64 2.592e+05 2.628e+05 2.664e+05 2.7e+05 2.736e+05 ...
Data variables:
    face      (time) int64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
    PhiBot    (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    oceQnet   (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    SIvice    (time, j_g, i) float32 0.0516454 0.0523205 0.0308559 ...
    SIhsalt   (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    oceFWflx  (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    V         (time, k, j_g, i) float32 0.0491903 0.0496442 0.0276739 ...
    iter      (time) int64 10368 10512 10656 10800 10944 11088 11232 11376 ...
    oceQsw    (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    oceTAUY   (time, j_g, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    Theta     (time, k, j, i) float32 -1.31868 -1.27825 -1.21401 -1.17964 ...
    SIhsnow   (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    U         (time, k, j, i_g) float32 0.0281392 0.0203967 0.0075199 ...
    SIheff    (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    SIuice    (time, j, i_g) float32 -0.041163 -0.0487612 -0.0614498 ...
    SIarea    (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    Salt      (time, k, j, i) float32 33.7534 33.7652 33.7755 33.7723 ...
    oceSflux  (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    W         (time, k_l, j, i) float32 -2.27453e-05 -2.28018e-05 ...
    oceTAUX   (time, j, i_g) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    Eta       (time, j, i) float32 -1.28886 -1.28811 -1.2871 -1.28567 ...
    YC        (j, i) float32 -57.001 -57.001 -57.001 -57.001 -57.001 -57.001 ...
    YG        (j_g, i_g) float32 -57.0066 -57.0066 -57.0066 -57.0066 ...
    XC        (j, i) float32 -15.4896 -15.4688 -15.4479 -15.4271 -15.4062 ...
    XG        (j_g, i_g) float32 -15.5 -15.4792 -15.4583 -15.4375 -15.4167 ...
    Zp1       (k_p1) float32 0.0 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 ...
    Z         (k) float32 -0.5 -1.57 -2.79 -4.185 -5.78 -7.595 -9.66 -12.01 ...
    Zl        (k_l) float32 0.0 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 ...
    Zu        (k_u) float32 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 -13.26 ...
    rA        (j, i) float32 1.5528e+06 1.5528e+06 1.5528e+06 1.5528e+06 ...
    rAw       (j, i_g) float32 1.5528e+06 1.5528e+06 1.5528e+06 1.5528e+06 ...
    rAs       (j_g, i) float32 9.96921e+36 9.96921e+36 9.96921e+36 ...
    rAz       (j_g, i_g) float32 1.55245e+06 1.55245e+06 1.55245e+06 ...
    dxG       (j_g, i) float32 1261.27 1261.27 1261.27 1261.27 1261.27 ...
    dyG       (j, i_g) float32 1230.96 1230.96 1230.96 1230.96 1230.96 ...
    dxC       (j, i_g) float32 1261.46 1261.46 1261.46 1261.46 1261.46 ...
    Depth     (j, i) float32 4578.67 4611.09 4647.6 4674.88 4766.75 4782.64 ...
    dyC       (j_g, i) float32 1230.86 1230.86 1230.86 1230.86 1230.86 ...
    PHrefF    (k_p1) float32 0.0 9.81 20.9934 33.7464 48.3633 65.0403 ...
    drF       (k) float32 1.0 1.14 1.3 1.49 1.7 1.93 2.2 2.5 2.84 3.21 3.63 ...
    PHrefC    (k) float32 4.905 15.4017 27.3699 41.0549 56.7018 74.507 ...
    drC       (k_p1) float32 0.5 1.07 1.22 1.395 1.595 1.815 2.065 2.35 2.67 ...
    hFacW     (k, j, i_g) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
    hFacS     (k, j_g, i) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
    hFacC     (k, j, i) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
Attributes:
    coordinates: face
```

Now, to extract the same timeseries, I would like to say

```python
ts = ds.Salt[:, k, j, i].load()
```

I monitor what is happening under the hood when I call this by using netdata and the dask.distributed dashboard, with only a single process and thread. First, all the files are opened (see #1394). Then they start getting read. Each read takes between 10 and 30 seconds, and memory usage increases steadily. My impression is that the entire dataset is being read into memory for concatenation. (I have dumped out the dask graph in case anyone can make sense of it.) I have never let this calculation complete, as it looks like it would eat up all the memory on my system... plus it's extremely slow.
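For concreteness, the setup described above looks roughly like this (a sketch, not the exact script I ran; it assumes graphviz is installed for the graph dump and reuses `all_files`, `k`, `j`, `i` from above):

```python
import xarray as xr
from dask.distributed import Client

# single in-process worker with one thread, so reads happen serially
client = Client(processes=False, n_workers=1, threads_per_worker=1)

# same open_mfdataset call as above
ds = xr.open_mfdataset(all_files, decode_cf=False, autoclose=True)

# build the lazy point extraction without computing it
ts_lazy = ds.Salt[:, k, j, i]

# dump the underlying dask graph to a file for inspection (needs graphviz)
ts_lazy.data.visualize('salt_point_graph.png')

# trigger the computation; progress appears on the distributed dashboard
ts = ts_lazy.load()
```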

To me, this seems like a failure of lazy indexing. I naively expected that the underlying file access would work similarly to my loop, perhaps even in parallel.
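To make that expectation concrete, something along these lines is what I had in mind (a rough sketch using dask.delayed, purely illustrative and not what open_mfdataset actually does internally):

```python
import dask
import netCDF4
import numpy as np

@dask.delayed
def point_from_file(fname, k, j, i):
    # open one file, read a single value, close the file
    with netCDF4.Dataset(fname) as nc:
        return float(nc.variables['Salt'][k, j, i])

# one small, independent read per file...
lazy_values = [point_from_file(fname, k, j, i) for fname in all_files]

# ...which dask could execute in parallel across files
ts = np.array(dask.compute(*lazy_values))
```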

Can anyone shed some light on what might be going wrong?

