issue_comments


3 rows where issue = 224553135 and user = 53343824 sorted by updated_at descending




id 781407863 · node_id MDEyOklzc3VlQ29tbWVudDc4MTQwNzg2Mw== · user jameshalgren (53343824) · author_association NONE · reactions 0
created_at 2021-02-18T15:06:13Z · updated_at 2021-02-18T15:06:13Z
html_url https://github.com/pydata/xarray/issues/1385#issuecomment-781407863 · issue_url https://api.github.com/repos/pydata/xarray/issues/1385 · issue slow performance with open_mfdataset (224553135)

> setting parallel=True seg faults... I'm betting that is some quirk of my python environment, though.

> This is important! Otherwise that timing scales with number of files. If you get that to work, then you can convert to a dask dataframe and keep things lazy.

Indeed @dcherian -- it took some experimentation to find an engine that supports parallel execution, and even then the results are still mixed, which, to me, means further work is needed to isolate the issue.
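As a rough illustration of that experimentation (not the commenter's actual configuration -- the engine choice, glob pattern, and cluster setup below are assumptions), one way to combine `parallel=True` with an explicit engine and a dask cluster looks like this:

```python
# Sketch only: parallel=True opens/preprocesses the files via dask.delayed,
# so a dask scheduler should be running; the best engine varies by environment.
from dask.distributed import Client
import xarray as xr

client = Client()  # local cluster; on an HPC system a batch-scheduler cluster would be used

ds = xr.open_mfdataset(
    "/path/to/*.CHRTOUT*",   # hypothetical glob
    combine="by_coords",
    parallel=True,           # open and preprocess files in parallel
    engine="h5netcdf",       # compare with engine="netcdf4"; results were mixed for the commenter
)
```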

Along the lines of suggestions here (thanks @jmccreight for pointing this out), we've introduced a very practical pre-processing step that rewrites the datasets so that the read is not striped across the file system, effectively isolating the performance bottleneck so it can be dealt with independently. Of course, such an asynchronous workflow is not possible in every situation, so we are still looking at improving the direct read performance.

Two notes as we keep working:

- The preprocessor: reading and re-manipulating an individual dataset is lightning fast. We saw that a small change or adjustment in the individual files, made with a preprocessor, made the multi-file read massively faster (a rough sketch of this rewrite-then-read idea follows below).
- The "more sophisticated example" referenced here has proven to be very useful.
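The following is a minimal sketch of the rewrite idea described above, not the project's actual pre-processing script; the output directory, glob pattern, and the choice to drop non-index coordinates are assumptions:

```python
# Sketch: copy each CHRTOUT file to a single scratch location with
# non-essential coordinates dropped, so the later multi-file read is not
# striped across the shared file system.
import glob
import os

import xarray as xr


def rewrite_one(path, out_dir="/local/scratch/qlat_rewrite"):  # hypothetical output dir
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, os.path.basename(path))
    with xr.open_dataset(path) as ds:
        ds.reset_coords(drop=True).to_netcdf(out_path)  # drop non-index coords, write a clean copy
    return out_path


rewritten = [rewrite_one(p) for p in sorted(glob.glob("/path/to/*.CHRTOUT*"))]  # hypothetical glob
ds = xr.open_mfdataset(rewritten, combine="by_coords")
```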

id 756922963 · node_id MDEyOklzc3VlQ29tbWVudDc1NjkyMjk2Mw== · user jameshalgren (53343824) · author_association NONE · reactions 0
created_at 2021-01-08T18:26:44Z · updated_at 2021-01-08T18:34:49Z
html_url https://github.com/pydata/xarray/issues/1385#issuecomment-756922963 · issue_url https://api.github.com/repos/pydata/xarray/issues/1385 · issue slow performance with open_mfdataset (224553135)

@dcherian We had looked at a number of options. In the end, the best performance I could achieve was with the work-around pre-processor script rather than any of the built-in options. It's worth noting that a major part of the slowdown we were experiencing came from the dataframe transformation we were doing after reading the files. Once that was fixed, performance was much better, but not necessarily with any of the expected options. The script below, reading one day's worth of NWM q_laterals, runs in about 8 seconds (on Cheyenne). If you change the globbing pattern to include a full month, it takes about 380 seconds.

setting parallel=True seg faults... I'm betting that is some quirk of my python environment, though.

We are reading everything into memory, which negates the lazy-access benefits of using a dataset; our next steps include looking into that (a lazy variant is sketched after the script below).

300 seconds to read a month isn't totally unacceptable, but we'd like it to be faster for the operational runs we'll eventually be doing -- for longer simulations, we may be able to achieve some improvement with asynchronous data access. We'll keep looking into it. (We'll start by trying to adapt the "slightly more sophisticated example" under the docs you referenced here...)

Thanks (for the great package and for getting back on this question!)

```python
# python /glade/scratch/halgren/qlat_mfopen_test.py
import time

import pandas as pd
import xarray as xr


def get_ql_from_wrf_hydro_mf(
    qlat_files, index_col="feature_id", value_col="q_lateral"
):
    """
    qlat_files: globbed list of CHRTOUT files containing desired lateral inflows
    index_col: column/field in the CHRTOUT files with the segment/link id
    value_col: column/field in the CHRTOUT files with the lateral inflow value

    In general the CHRTOUT files contain one value per time step. At present,
    there is no capability for handling non-uniform timesteps in the qlaterals.

    The qlateral may also be input using a comma-delimited file -- see
    `get_ql_from_csv`.

    Note/Todo:
    For later needs, filtering for specific features or times may
    be accomplished with one of:
        ds.loc[{selectors}]
        ds.sel({selectors})
        ds.isel({selectors})

    Returns from these selection functions are sub-datasets.

    For example:

        (Pdb) ds.sel({"feature_id": [4186117, 4186169], "time": ds.time.values[:2]})['q_lateral'].to_dataframe()
                                         latitude  longitude  q_lateral
        time                feature_id
        2018-01-01 13:00:00 4186117     41.233807 -75.413895   0.006496
        2018-01-02 00:00:00 4186117     41.233807 -75.413895   0.006460

    or...

        (Pdb) ds.sel({"feature_id": [4186117, 4186169], "time": [np.datetime64('2018-01-01T13:00:00')]})['q_lateral'].to_dataframe()
                                         latitude  longitude  q_lateral
        time                feature_id
        2018-01-01 13:00:00 4186117     41.233807 -75.413895   0.006496
    """
    filter_list = None  # currently unused

    with xr.open_mfdataset(
        qlat_files,
        combine="by_coords",
        # combine="nested",
        # concat_dim="time",
        # data_vars="minimal",
        # coords="minimal",
        # compat="override",
        preprocess=drop_all_coords,
        # parallel=True,
    ) as ds:
        ql = pd.DataFrame(
            ds[value_col].values.T,
            index=ds[index_col].values,
            columns=ds.time.values,
            # dtype=float,
        )

    return ql


def drop_all_coords(ds):
    # Strip non-index coordinates so the multi-file combine stays cheap.
    return ds.reset_coords(drop=True)


def main():
    input_folder = "/glade/p/cisl/nwc/nwmv21_finals/CONUS/retro/Retro8yr/FullRouting/OUTPUT_chrtout_comp_20181001_20191231"
    file_pattern_filter = "/20181101*.CHRTOUT*"
    file_index_col = "feature_id"
    file_value_col = "q_lateral"
    # file_value_col = "streamflow"

    start_time = time.time()

    qlat_files = input_folder + file_pattern_filter
    print(f"reading {qlat_files}")
    qlat_df = get_ql_from_wrf_hydro_mf(
        qlat_files=qlat_files,
        index_col=file_index_col,
        value_col=file_value_col,
    )
    print(qlat_df)
    print("read qlaterals in %s seconds." % (time.time() - start_time))


if __name__ == "__main__":
    main()
```

@groutr, @jmccreight
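As a hedged sketch of the lazy alternative mentioned above (following @dcherian's quoted suggestion to convert to a dask dataframe), the in-memory `pd.DataFrame(... .values ...)` step could be replaced with a deferred conversion; `qlat_files` and `drop_all_coords` are the names from the script above, and the final aggregation is only an illustration:

```python
# Sketch: keep the reads lazy by converting to a dask dataframe instead of
# materializing ds[value_col].values into pandas.
import xarray as xr

ds = xr.open_mfdataset(
    qlat_files,                  # same glob as in the script above
    combine="by_coords",
    preprocess=drop_all_coords,  # same helper as in the script above
)

# Dimensions/coordinates become columns; no data is read at this point.
ql_lazy = ds[["q_lateral"]].to_dask_dataframe()

# Reads happen only when a result is actually computed, e.g. a per-feature mean.
mean_ql = ql_lazy.groupby("feature_id")["q_lateral"].mean().compute()

ds.close()
```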

id 756364564 · node_id MDEyOklzc3VlQ29tbWVudDc1NjM2NDU2NA== · user jameshalgren (53343824) · author_association NONE · reactions 0
created_at 2021-01-07T20:28:32Z · updated_at 2021-01-07T20:28:32Z
html_url https://github.com/pydata/xarray/issues/1385#issuecomment-756364564 · issue_url https://api.github.com/repos/pydata/xarray/issues/1385 · issue slow performance with open_mfdataset (224553135)

@rabernat Is the test dataset you mention still somewhere on Cheyenne? We're seeing general slowness processing multi-file netCDF output from the National Water Model (our project here: NOAA-OWP/t-route) and would like to see how things compare to your mini-benchmark test.

cc @groutr

An update on this long-standing issue.

I have learned that `open_mfdataset` can be blazingly fast with `decode_cf=False` but extremely slow with `decode_cf=True`.

As an example, I am loading a POP dataset on Cheyenne. Anyone with access can try this example.

```python
import os
import xarray as xr

base_dir = '/glade/scratch/rpa/'
prefix = 'BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001'
code = 'pop.h.nday1.SST'
glob_pattern = os.path.join(base_dir, prefix, '%s.%s.*.nc' % (prefix, code))


def non_time_coords(ds):
    return [v for v in ds.data_vars if 'time' not in ds[v].dims]


def drop_non_essential_vars_pop(ds):
    return ds.drop(non_time_coords(ds))


# this runs almost instantly
ds = xr.open_mfdataset(glob_pattern, decode_times=False,
                       chunks={'time': 1},
                       preprocess=drop_non_essential_vars_pop,
                       decode_cf=False)
```

And returns this:

```
<xarray.Dataset>
Dimensions:     (d2: 2, nlat: 2400, nlon: 3600, time: 16401, z_t: 62,
                 z_t_150m: 15, z_w: 62, z_w_bot: 62, z_w_top: 62)
Coordinates:
  * z_w_top     (z_w_top) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 ...
  * z_t         (z_t) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ...
  * z_w         (z_w) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ...
  * z_t_150m    (z_t_150m) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ...
  * z_w_bot     (z_w_bot) float32 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ...
  * time        (time) float64 7.322e+05 7.322e+05 7.322e+05 7.322e+05 ...
Dimensions without coordinates: d2, nlat, nlon
Data variables:
    time_bound  (time, d2) float64 dask.array<shape=(16401, 2), chunksize=(1, 2)>
    SST         (time, nlat, nlon) float32 dask.array<shape=(16401, 2400, 3600), chunksize=(1, 2400, 3600)>
Attributes:
    nsteps_total:  480
    tavg_sum:      64800.0
    title:         BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001
    start_time:    This dataset was created on 2016-03-14 at 05:32:30.3
    Conventions:   CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netcdf/CF-curren...
    source:        CCSM POP2, the CCSM Ocean Component
    cell_methods:  cell_methods = time: mean ==> the variable values are aver...
    calendar:      All years have exactly 365 days.
    history:       none
    contents:      Diagnostic and Prognostic Variables
    revision:      $Id: tavg.F90 56176 2013-12-20 18:35:46Z mlevy@ucar.edu $
```

This is roughly 45 years of daily data, one file per year.

Instead, if I just change to `decode_cf=True` (the default), it takes forever. I can monitor what is happening via the distributed dashboard; it looks like this: [screenshot of the distributed dashboard showing the open_dataset tasks]

There are many more of these open_dataset tasks than there are files (45), so I can only presume there are 16401 individual tasks (one for each timestep), each of which takes about 1 s in serial.

This is a real failure of lazy decoding. Maybe it can be fixed by #1725, possibly related to #1372.

cc Pangeo folks: @jhamman, @mrocklin
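One possible follow-up, not stated in the quoted comment: since the un-decoded open is fast, a workaround pattern is to open with `decode_cf=False` and apply CF decoding to the already-combined, dask-backed dataset afterwards with `xarray.decode_cf`. This sketch reuses `glob_pattern` and `drop_non_essential_vars_pop` from the example above:

```python
# Sketch of the decode-later pattern.
import xarray as xr

ds = xr.open_mfdataset(glob_pattern, decode_times=False,
                       chunks={'time': 1},
                       preprocess=drop_non_essential_vars_pop,
                       decode_cf=False)   # fast: no per-variable CF decoding on open
ds = xr.decode_cf(ds)                     # decode once, on the combined lazy dataset
```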



CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);