home / github

Menu
  • Search all tables
  • GraphQL API

issue_comments

Table actions
  • GraphQL API for issue_comments

21 rows where author_association = "MEMBER" and issue = 224553135 sorted by updated_at descending

✎ View and edit SQL

This data as json, CSV (advanced)

Suggested facets: reactions, created_at (date), updated_at (date)

user 4

  • rabernat 10
  • shoyer 7
  • dcherian 3
  • TomNicholas 1

issue 1

  • slow performance with open_mfdataset · 21 ✖

author_association 1

  • MEMBER · 21 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1043038150 https://github.com/pydata/xarray/issues/1385#issuecomment-1043038150 https://api.github.com/repos/pydata/xarray/issues/1385 IC_kwDOAMm_X84-K3_G rabernat 1197350 2022-02-17T14:57:03Z 2022-02-17T14:57:03Z MEMBER

See deeper dive in https://github.com/pydata/xarray/discussions/6284

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
1043016100 https://github.com/pydata/xarray/issues/1385#issuecomment-1043016100 https://api.github.com/repos/pydata/xarray/issues/1385 IC_kwDOAMm_X84-Kymk rabernat 1197350 2022-02-17T14:36:23Z 2022-02-17T14:36:23Z MEMBER

Ah ok so if that is your goal, decode_times=False should be enough to solve it.

There is a problem with the time encoding in this file. The units (days since 1950-01-01T00:00:00Z) are not compatible with the values (738457.04166667, etc.). That would place your measurements sometime in the year 3971. This is part of the problem, but not the whole story.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
1043001146 https://github.com/pydata/xarray/issues/1385#issuecomment-1043001146 https://api.github.com/repos/pydata/xarray/issues/1385 IC_kwDOAMm_X84-Ku86 rabernat 1197350 2022-02-17T14:21:45Z 2022-02-17T14:22:23Z MEMBER

(I could post to a web server if there's any reason to prefer that.)

In general that would be a little more convenient than google drive, because then we could download the file from python (rather than having a manual step). This would allow us to share a fully copy-pasteable code snippet to reproduce the issue. But don't worry about that for now.

First, I'd note that your issue is not really related to open_mfdataset at all, since it is reproduced just using open_dataset. The core problem is that you have ~15M timesteps, and it is taking forever to decode the times out of them. It's fast when you do decode_times=False because the data aren't actually being read. I'm going to make a post over in discussions to dig a bit deeper into this. StackOverflow isn't monitored too regularly by this community.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
1042937825 https://github.com/pydata/xarray/issues/1385#issuecomment-1042937825 https://api.github.com/repos/pydata/xarray/issues/1385 IC_kwDOAMm_X84-Kffh rabernat 1197350 2022-02-17T13:14:50Z 2022-02-17T13:14:50Z MEMBER

Hi Tom! 👋

So much has evolved about xarray since this original issue was posted. However, we continue to use it as a catchall for people looking to speed up open_mfdataset. I saw your stackoverflow post. Any chance you could post a link to the actual file in question?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
756948150 https://github.com/pydata/xarray/issues/1385#issuecomment-756948150 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDc1Njk0ODE1MA== dcherian 2448579 2021-01-08T19:20:51Z 2021-01-08T19:20:51Z MEMBER

setting parallel=True seg faults... I'm betting that is some quirk of my python environment, though.

This is important! Otherwise that timing scales with number of files. If you get that to work, then you can convert to a dask dataframe and keep things lazy.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
756830228 https://github.com/pydata/xarray/issues/1385#issuecomment-756830228 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDc1NjgzMDIyOA== dcherian 2448579 2021-01-08T15:52:49Z 2021-01-08T15:53:22Z MEMBER

@jameshalgren A lot of these issues have been fixed. Have you tried the advice here: https://xarray.pydata.org/en/stable/io.html#reading-multi-file-datasets?

If not, a reproducible example would help (I have access to Cheyenne). Let's also move this conversation to the "Discussions" forum: https://github.com/pydata/xarray/discussions

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
685164626 https://github.com/pydata/xarray/issues/1385#issuecomment-685164626 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDY4NTE2NDYyNg== dcherian 2448579 2020-09-01T22:21:58Z 2020-09-01T22:21:58Z MEMBER

This is the most up-to-date documentation on this issue: https://xarray.pydata.org/en/stable/io.html#reading-multi-file-datasets

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
561920115 https://github.com/pydata/xarray/issues/1385#issuecomment-561920115 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDU2MTkyMDExNQ== rabernat 1197350 2019-12-05T01:09:25Z 2019-12-05T01:09:25Z MEMBER

In your twitter thread you said

Do any of my xarray/dask folks know why open_mfdataset takes such a significant amount of time compared to looping over a list of files? Each file corresponds to a new time, just wanting to open multiple times at once...

The general reason for this is usually that open_mfdataset performs coordinate compatibility checks when it concatenates the files. It's useful to actually read the code of open_mfdataset to see how it works.

First, all the files are opened individually https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L900-L903

You can recreate this step outside of xarray yourself by doing something like python from glob import glob datasets = [xr.open_dataset(fname, chunks={}) for fname in glob('*.nc')]

Once each dataset is open, xarray calls out to one of its combine functions. This logic has gotten more complex over the years as different options have been introduced, but the gist is this: https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L947-L952

You can reproduce this step outside of xarray, e.g. ds = xr.concat(datasets, dim='time') At that point, various checks will kick in to be sure that the coordinates in the different datasets are compatible. Performing these checks requires the data to be read eagerly, which can be a source of slow performance.

Without seeing more details about your files, it's hard to know exactly where the issue lies. A good place to start is to simply drop all coordinates from your data as a preprocessing step.

``` def drop_all_coords(ds): return ds.reset_coords(drop=True)

xr.open_mfdataset('*.nc', combine='by_coords', preprocess=drop_all_coords) ```

If you observe a big speedup, this points at coordinate compatibility checks as the culprit. From there you can experiment with the various options for open_mfdataset, such as coords='minimal', compat='override', etc.

Once you post your file details, we can provide more concrete suggestions.

{
    "total_count": 6,
    "+1": 6,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
561915767 https://github.com/pydata/xarray/issues/1385#issuecomment-561915767 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDU2MTkxNTc2Nw== rabernat 1197350 2019-12-05T00:52:06Z 2019-12-05T00:52:06Z MEMBER

@keltonhalbert - I'm sorry you're frustrated by this issue. It's hard to provide a general answer to "why is open_mfdataset slow?" without seeing the data in question. I'll try to provide some best practices and recommendations here. In the meantime, could you please post the xarray repr of two of your files? To be explicit. python ds1 = xr.open_dataset('file1.nc') print(ds1) ds2 = xr.open_dataset('file2.nc') print(ds2) This will help us debug.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
463369751 https://github.com/pydata/xarray/issues/1385#issuecomment-463369751 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQ2MzM2OTc1MQ== rabernat 1197350 2019-02-13T21:04:03Z 2019-02-13T21:04:03Z MEMBER

What if you do xr.open_mfdataset(fname, decode_times=False)?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
461554066 https://github.com/pydata/xarray/issues/1385#issuecomment-461554066 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQ2MTU1NDA2Ng== TomNicholas 35968931 2019-02-07T19:00:57Z 2019-02-07T19:00:57Z MEMBER

Looks like you're using xarray v0.11.0, but the most recent one is v0.11.3. There have been several changes since then which might affect this, try that first.

On Thu, 7 Feb 2019, 18:53 sbiner, notifications@github.com wrote:

I have the same problem. open_mfdatasset is 10X slower than nc.MFDataset. I used the following code to get some timing on opening 456 local netcdf files located in a nc_local directory (of total size of 532MB)

clef = 'nc_local/*.nc' t00 = time.time() l_fichiers_nc = sorted(glob.glob(clef)) print ('timing glob: {:6.2f}s'.format(time.time()-t00))

netcdf4

t00 = time.time() ds1 = nc.MFDataset(l_fichiers_nc)

dates1 = ouralib.netcdf.calcule_dates(ds1)

print ('timing netcdf4: {:6.2f}s'.format(time.time()-t00))

xarray

t00 = time.time() ds2 = xr.open_mfdataset(l_fichiers_nc) print ('timing xarray: {:6.2f}s'.format(time.time()-t00))

xarray tune

t00 = time.time() ds3 = xr.open_mfdataset(l_fichiers_nc, decode_cf=False, concat_dim='time') ds3 = xr.decode_cf(ds3) print ('timing xarray tune: {:6.2f}s'.format(time.time()-t00))

The output I get is :

timing glob: 0.00s timing netcdf4: 3.80s timing xarray: 44.60s timing xarray tune: 15.61s

I made tests on a centOS server using python2.7 and 3.6, and on mac OS as well with python3.6. The timing changes but the ratios are similar between netCDF4 and xarray.

Is there any way of making open_mfdataset go faster?

In case it helps, here are output from xr.show_versions and %prun xr.open_mfdataset(l_fichiers_nc). I do not know anything about the output of %prun but I have noticed that the first two lines of the ouput are different wether I'm using python 2.7 or python 3.6. I made those tests on centOS and macOS with anaconda environments.

for python 2.7:

    13996351 function calls (13773659 primitive calls) in 42.133 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function) 2664 16.290 0.006 16.290 0.006 {time.sleep} 912 6.330 0.007 6.623 0.007 netCDF4_.py:244(_open_netcdf4_group)

for python 3.6:

   9663408 function calls (9499759 primitive calls) in 31.934 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function) 5472 15.140 0.003 15.140 0.003 {method 'acquire' of 'thread.lock' objects} 912 5.661 0.006 5.718 0.006 netCDF4.py:244(_open_netcdf4_group)

longer output of %prun with python3.6:

     9663408 function calls (9499759 primitive calls) in 31.934 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function) 5472 15.140 0.003 15.140 0.003 {method 'acquire' of 'thread.lock' objects} 912 5.661 0.006 5.718 0.006 netCDF4.py:244(open_netcdf4_group) 4104 0.564 0.000 0.757 0.000 {built-in method _operator.getitem} 133152/129960 0.477 0.000 0.660 0.000 indexing.py:496(shape) 1554550/1554153 0.414 0.000 0.711 0.000 {built-in method builtins.isinstance} 912 0.260 0.000 0.260 0.000 {method 'close' of 'netCDF4._netCDF4.Dataset' objects} 6384 0.244 0.000 0.953 0.000 netCDF4.py:361(open_store_variable) 910 0.241 0.000 0.595 0.001 duck_array_ops.py:141(array_equiv) 20990 0.235 0.000 0.343 0.000 {pandas.libs.lib.is_scalar} 37483/36567 0.228 0.000 0.230 0.000 {built-in method builtins.iter} 93986 0.219 0.000 1.607 0.000 variable.py:239(__init__) 93982 0.194 0.000 0.194 0.000 variable.py:706(attrs) 33744 0.189 0.000 0.189 0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects} 15511 0.175 0.000 0.638 0.000 core.py:1776(normalize_chunks) 5930 0.162 0.000 0.350 0.000 missing.py:183(_isna_ndarraylike) 297391/296926 0.159 0.000 0.380 0.000 {built-in method builtins.getattr} 134230 0.155 0.000 0.269 0.000 abc.py:180(__instancecheck__) 6384 0.142 0.000 0.199 0.000 netCDF4.py:34(init) 93986 0.126 0.000 0.671 0.000 variable.py:414(_parse_dimensions) 156545 0.119 0.000 0.811 0.000 utils.py:450(ndim) 12768 0.119 0.000 0.203 0.000 core.py:747(blockdims_from_blockshape) 6384 0.117 0.000 2.526 0.000 conventions.py:245(decode_cf_variable) 741183/696380 0.116 0.000 0.134 0.000 {built-in method builtins.len} 41957/23717 0.110 0.000 4.395 0.000 {built-in method numpy.core.multiarray.array} 93978 0.110 0.000 0.110 0.000 variable.py:718(encoding) 219940 0.109 0.000 0.109 0.000 _weakrefset.py:70(contains) 99458 0.100 0.000 0.440 0.000 variable.py:137(as_compatible_data) 53882 0.085 0.000 0.095 0.000 core.py:891(shape) 140604 0.084 0.000 0.628 0.000 variable.py:272(shape) 3192 0.084 0.000 0.170 0.000 utils.py:88(_StartCountStride) 10494 0.081 0.000 0.081 0.000 {method 'reduce' of 'numpy.ufunc' objects} 44688 0.077 0.000 0.157 0.000 variables.py:102(unpack_for_decoding)

output of xr.show_versions()

xr.show_versions()

INSTALLED VERSIONS

commit: None python: 3.6.8.final.0 python-bits: 64 OS: Linux OS-release: 3.10.0-514.2.2.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_CA.UTF-8 LOCALE: en_CA.UTF-8

xarray: 0.11.0 pandas: 0.24.1 numpy: 1.15.4 scipy: None netCDF4: 1.4.2 h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.0.3.4 PseudonetCDF: None rasterio: None iris: None bottleneck: None cyordereddict: None dask: 1.1.1 distributed: 1.25.3 matplotlib: 3.0.2 cartopy: None seaborn: None setuptools: 40.7.3 pip: 19.0.1 conda: None pytest: None IPython: 7.2.0 sphinx: None

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/issues/1385#issuecomment-461551320, or mute the thread https://github.com/notifications/unsubscribe-auth/AiTXo7EQRagOh5iu3cYenLDTFDk68MGIks5vLHYMgaJpZM4NJOcQ .

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
439454213 https://github.com/pydata/xarray/issues/1385#issuecomment-439454213 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQzOTQ1NDIxMw== shoyer 1217238 2018-11-16T16:46:55Z 2018-11-16T16:46:55Z MEMBER

Does it take 10 seconds even to open a single file? The big mystery is what that top line ("_operator.getitem") is but my guess is it's netCDF4-python. h5netcdf might also give different results... On Fri, Nov 16, 2018 at 8:20 AM chuaxr notifications@github.com wrote:

Sorry, I think the speedup had to do with accessing a file that had previously been loaded rather than due to decode_cf. Here's the output of prun using two different files of approximately the same size (~75 GB), run from a notebook without using distributed (which doesn't lead to any speedup):

Output of %prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/ atmos_level.1999010100-2000123123.sphum.nc ',chunks={'lat':20,'time':50,'lon':12,'pfull':11})

      780980 function calls (780741 primitive calls) in 55.374 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7   54.448    7.778   54.448    7.778 {built-in method _operator.getitem}
764838    0.473    0.000    0.473    0.000 core.py:169(<genexpr>)
     3    0.285    0.095    0.758    0.253 core.py:169(<listcomp>)
     2    0.041    0.020    0.041    0.020 {cftime._cftime.num2date}
     3    0.040    0.013    0.821    0.274 core.py:173(getem)
     1    0.027    0.027   55.374   55.374 <string>:1(<module>)

Output of %prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/ atmos_level.2001010100-2002123123.temp.nc ',chunks={'lat':20,'time':50,'lon':12,'pfull':11}, decode_cf=False)

      772212 function calls (772026 primitive calls) in 56.000 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5   55.213   11.043   55.214   11.043 {built-in method _operator.getitem}
764838    0.486    0.000    0.486    0.000 core.py:169(<genexpr>)
     3    0.185    0.062    0.671    0.224 core.py:169(<listcomp>)
     3    0.041    0.014    0.735    0.245 core.py:173(getem)
     1    0.027    0.027   56.001   56.001 <string>:1(<module>)

/work isn't a remote archive, so it surprises me that this should happen.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/issues/1385#issuecomment-439445695, or mute the thread https://github.com/notifications/unsubscribe-auth/ABKS1jmFqfe9_dIgHAMYlVOh7WKhzO8Kks5uvuXKgaJpZM4NJOcQ .

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
439263419 https://github.com/pydata/xarray/issues/1385#issuecomment-439263419 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQzOTI2MzQxOQ== shoyer 1217238 2018-11-16T02:45:05Z 2018-11-16T02:45:05Z MEMBER

@chuaxr What do you see when you use %prun when opening the dataset? This might point to the bottleneck.

One way to fix this would be to move our call to decode_cf() in open_dataset() to after applying chunking, i.e., to switch up the order of operations on these lines: https://github.com/pydata/xarray/blob/f547ed0b379ef70a3bda5e77f66de95ec2332ddf/xarray/backends/api.py#L270-L296

In practice, is the difference between using xarray's internal lazy array classes for decoding and dask for decoding. I would expect to see small differences in performance between these approaches (especially when actually computing data), but for constructing the computation graph I would expect them to have similar performance. It is puzzling that dask is orders of magnitude faster -- that suggests that something else is going wrong in the normal code path for decode_cf(). It would certainly be good to understand this before trying to apply any fixes.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
438873285 https://github.com/pydata/xarray/issues/1385#issuecomment-438873285 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQzODg3MzI4NQ== shoyer 1217238 2018-11-15T00:45:53Z 2018-11-15T00:45:53Z MEMBER

@chuaxr I assume you're testing this with xarray 0.11?

It would be good to do some profiling to figure out what is going wrong here.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
437630511 https://github.com/pydata/xarray/issues/1385#issuecomment-437630511 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQzNzYzMDUxMQ== shoyer 1217238 2018-11-10T23:38:10Z 2018-11-10T23:38:10Z MEMBER

Was this fixed by https://github.com/pydata/xarray/pull/2047?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
371933603 https://github.com/pydata/xarray/issues/1385#issuecomment-371933603 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDM3MTkzMzYwMw== shoyer 1217238 2018-03-09T20:17:19Z 2018-03-09T20:17:19Z MEMBER

OK, so it seems that we need a change to disable wrapping dask arrays with LazilyIndexedArray. Dask arrays are already lazy!

{
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
371891466 https://github.com/pydata/xarray/issues/1385#issuecomment-371891466 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDM3MTg5MTQ2Ng== rabernat 1197350 2018-03-09T17:53:15Z 2018-03-09T17:53:15Z MEMBER

Calling ds = xr.decode_cf(ds, decode_times=False) on the dataset returns instantly. However, the variable data is wrapped in the adaptors, effectively destroying the chunks ```python

ds.SST.variable._data LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array<_apply_mask, shape=(16401, 2400, 3600), dtype=float32, chunksize=(1, 2400, 3600)>), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))) ```

Calling getitem on this array triggers the whole dask array to be computed, which would takes forever and would completely blow out the notebook memory. This is because of #1372, which would be fixed by #1725.

This has actually become a major showstopper for me. I need to work with this dataset in decoded form.

Versions

INSTALLED VERSIONS ------------------ commit: None python: 3.6.4.final.0 python-bits: 64 OS: Linux OS-release: 3.12.62-60.64.8-default machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.1 pandas: 0.22.0 numpy: 1.13.3 scipy: 1.0.0 netCDF4: 1.3.1 h5netcdf: 0.5.0 h5py: 2.7.1 Nio: None zarr: 2.2.0a2.dev176 bottleneck: 1.2.1 cyordereddict: None dask: 0.17.1 distributed: 1.21.3 matplotlib: 2.1.2 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 38.4.0 pip: 9.0.1 conda: None pytest: 3.3.2 IPython: 6.2.1
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
370092011 https://github.com/pydata/xarray/issues/1385#issuecomment-370092011 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDM3MDA5MjAxMQ== shoyer 1217238 2018-03-02T23:58:26Z 2018-03-02T23:58:26Z MEMBER

@rabernat How does performance compare if you call xarray.decode_cf() on the opened dataset? The adjustments I recently did to lazy decoding should only help once the data is already loaded into dask.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
370064483 https://github.com/pydata/xarray/issues/1385#issuecomment-370064483 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDM3MDA2NDQ4Mw== rabernat 1197350 2018-03-02T21:57:26Z 2018-03-02T21:57:26Z MEMBER

An update on this long-standing issue.

I have learned that open_mfdataset can be blazingly fast if decode_cf=False but extremely slow with decode_cf=True.

As an example, I am loading a POP datataset on cheyenne. Anyone with access can try this example. ```python base_dir = '/glade/scratch/rpa/' prefix = 'BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001' code = 'pop.h.nday1.SST' glob_pattern = os.path.join(base_dir, prefix, '%s.%s.*.nc' % (prefix, code))

def non_time_coords(ds): return [v for v in ds.data_vars if 'time' not in ds[v].dims]

def drop_non_essential_vars_pop(ds): return ds.drop(non_time_coords(ds))

this runs almost instantly

ds = xr.open_mfdataset(glob_pattern, decode_times=False, chunks={'time': 1}, preprocess=drop_non_essential_vars_pop, decode_cf=False) And returns this <xarray.Dataset> Dimensions: (d2: 2, nlat: 2400, nlon: 3600, time: 16401, z_t: 62, z_t_150m: 15, z_w: 62, z_w_bot: 62, z_w_top: 62) Coordinates: * z_w_top (z_w_top) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 ... * z_t (z_t) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ... * z_w (z_w) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ... * z_t_150m (z_t_150m) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ... * z_w_bot (z_w_bot) float32 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ... * time (time) float64 7.322e+05 7.322e+05 7.322e+05 7.322e+05 ... Dimensions without coordinates: d2, nlat, nlon Data variables: time_bound (time, d2) float64 dask.array<shape=(16401, 2), chunksize=(1, 2)> SST (time, nlat, nlon) float32 dask.array<shape=(16401, 2400, 3600), chunksize=(1, 2400, 3600)> Attributes: nsteps_total: 480 tavg_sum: 64800.0 title: BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001 start_time: This dataset was created on 2016-03-14 at 05:32:30.3 Conventions: CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netcdf/CF-curren... source: CCSM POP2, the CCSM Ocean Component cell_methods: cell_methods = time: mean ==> the variable values are aver... calendar: All years have exactly 365 days. history: none contents: Diagnostic and Prognostic Variables revision: $Id: tavg.F90 56176 2013-12-20 18:35:46Z mlevy@ucar.edu $ ```

This is roughly 45 years of daily data, one file per year.

Instead, if I just change decode_cf=True (the default), it takes forever. I can monitor what is happening via the distributed dashboard. It looks like this:

There are more of these open_dataset tasks then there are number of files (45), so I can only presume there are 16401 individual tasks (one for each timestep), which each takes about 1 s in serial.

This is a real failure of lazy decoding. Maybe it can be fixed by #1725, possibly related to #1372.

cc Pangeo folks: @jhamman, @mrocklin

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 2,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
297539517 https://github.com/pydata/xarray/issues/1385#issuecomment-297539517 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDI5NzUzOTUxNw== shoyer 1217238 2017-04-26T20:59:23Z 2017-04-26T20:59:23Z MEMBER

For example, can I give a hint to xarray that this reindex_variables step is not necessary

Yes, adding an boolean argument prealigned which defaults to False to concat seems like a very reasonable optimization here.

But more generally, I am a little surprised by how slow pandas.Index.get_indexer and pandas.Index.is_unique are. This suggests we should add a fast-path optimization to skip these steps in reindex_variables: https://github.com/pydata/xarray/blob/ab4ffee919d4abe9f6c0cf6399a5827c38b9eb5d/xarray/core/alignment.py#L302-L306

Basically, if index.equals(target), we should just set indexer = np.arange(target.size). Although, if we have duplicate values in the index, the operation should arguably fail for correctness.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135
297494539 https://github.com/pydata/xarray/issues/1385#issuecomment-297494539 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDI5NzQ5NDUzOQ== rabernat 1197350 2017-04-26T18:07:03Z 2017-04-26T18:07:03Z MEMBER

cc: @geosciz, who is helping with this project.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  slow performance with open_mfdataset 224553135

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 15.782ms · About: xarray-datasette