
issue_comments: 756922963


Comment on pydata/xarray issue #1385: https://github.com/pydata/xarray/issues/1385#issuecomment-756922963
issue_url: https://api.github.com/repos/pydata/xarray/issues/1385
id: 756922963 · node_id: MDEyOklzc3VlQ29tbWVudDc1NjkyMjk2Mw== · user: 53343824
created_at: 2021-01-08T18:26:44Z · updated_at: 2021-01-08T18:34:49Z · author_association: NONE

@dcherian We had looked at a number of options. In the end, the best performance I could achieve was with the work-around pre-processor script, rather than with any of the built-in options. It's worth noting that a major part of the slowdown we were experiencing came from the dataframe transform we were doing after reading the files. Once that was fixed, performance was much better, though still not with any of the expected options. This script, reading one day's worth of NWM q_laterals, runs in about 8 seconds (on Cheyenne). If you change the globbing pattern to include a full month, it takes about 380 seconds.
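For context, the wide table the script ultimately builds (segment ids as rows, time steps as columns) can be produced in a single vectorized step from the array that comes out of the concatenated dataset. A minimal sketch with made-up feature ids and values (not the actual NWM data):

```python
import numpy as np
import pandas as pd

# Stand-in for the (time, feature) array pulled from the concatenated
# dataset: 3 time steps x 2 features, invented values.
values = np.arange(6.0).reshape(3, 2)
times = pd.date_range("2018-11-01", periods=3, freq="H")
feature_ids = [101, 202]  # hypothetical segment/link ids

# One transpose plus the DataFrame constructor, instead of a
# per-file transform after each read.
ql = pd.DataFrame(values.T, index=feature_ids, columns=times)
```

Building the frame once from the already-concatenated values is what avoided the per-file transform cost mentioned above.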

Setting `parallel=True` seg-faults... I'm betting that's some quirk of my Python environment, though.

We are reading everything into memory, which negates the lazy-access benefits of using a dataset; our next steps include looking into that.
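One way to avoid materializing everything would be to subset the dataset before converting to a DataFrame, so only the needed slice gets pulled in. A rough sketch on a small in-memory stand-in dataset (the dimension names mirror the CHRTOUT files, but the values are invented):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small in-memory stand-in for a concatenated CHRTOUT dataset.
times = pd.date_range("2018-11-01", periods=4, freq="H")
feature_ids = [4186117, 4186169]
ds = xr.Dataset(
    {"q_lateral": (("time", "feature_id"), np.random.rand(4, 2))},
    coords={"time": times, "feature_id": feature_ids},
)

# Select the features/times of interest *before* .to_dataframe(),
# so only that slice is converted.
df = ds.sel(feature_id=[4186117], time=times[:2])["q_lateral"].to_dataframe()
```

With files opened via `open_mfdataset`, the same `.sel` pattern would keep the untouched portions of the data on disk until explicitly loaded.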

300 seconds to read a month isn't totally unacceptable, but we'd like it to be faster for the operational runs we'll eventually be doing -- for longer simulations, we may be able to achieve some improvement with asynchronous data access. We'll keep looking into it. (We'll start by trying to adapt the "slightly more sophisticated example" in the docs you referenced here...)

Thanks (for the great package and for getting back on this question!)

```python
# python /glade/scratch/halgren/qlat_mfopen_test.py

import time

import pandas as pd
import xarray as xr


def drop_all_coords(ds):
    return ds.reset_coords(drop=True)


def get_ql_from_wrf_hydro_mf(
    qlat_files, index_col="feature_id", value_col="q_lateral"
):
    """
    qlat_files: globbed list of CHRTOUT files containing desired lateral inflows
    index_col: column/field in the CHRTOUT files with the segment/link id
    value_col: column/field in the CHRTOUT files with the lateral inflow value

    In general, the CHRTOUT files contain one value per time step. At present,
    there is no capability for handling non-uniform timesteps in the qlaterals.

    The qlateral may also be input using a comma-delimited file -- see
    `get_ql_from_csv`.

    Note/Todo:
    For later needs, filtering for specific features or times may
    be accomplished with one of:
        ds.loc[{selectors}]
        ds.sel({selectors})
        ds.isel({selectors})

    Returns from these selection functions are sub-datasets.

    For example:

        (Pdb) ds.sel({"feature_id": [4186117, 4186169], "time": ds.time.values[:2]})["q_lateral"].to_dataframe()
                                         latitude  longitude  q_lateral
        time                feature_id
        2018-01-01 13:00:00 4186117     41.233807 -75.413895   0.006496
        2018-01-02 00:00:00 4186117     41.233807 -75.413895   0.006460

    or...

        (Pdb) ds.sel({"feature_id": [4186117, 4186169], "time": [np.datetime64("2018-01-01T13:00:00")]})["q_lateral"].to_dataframe()
                                         latitude  longitude  q_lateral
        time                feature_id
        2018-01-01 13:00:00 4186117     41.233807 -75.413895   0.006496
    """
    with xr.open_mfdataset(
        qlat_files,
        combine="by_coords",
        # combine="nested",
        # concat_dim="time",
        # data_vars="minimal",
        # coords="minimal",
        # compat="override",
        preprocess=drop_all_coords,
        # parallel=True,
    ) as ds:
        # Rows are segment/link ids, columns are time steps.
        ql = pd.DataFrame(
            ds[value_col].values.T,
            index=ds[index_col].values,
            columns=ds.time.values,
            # dtype=float,
        )

    return ql


def main():
    input_folder = "/glade/p/cisl/nwc/nwmv21_finals/CONUS/retro/Retro8yr/FullRouting/OUTPUT_chrtout_comp_20181001_20191231"
    file_pattern_filter = "/20181101*.CHRTOUT*"
    file_index_col = "feature_id"
    file_value_col = "q_lateral"
    # file_value_col = "streamflow"

    start_time = time.time()

    qlat_files = input_folder + file_pattern_filter
    print(f"reading {qlat_files}")
    qlat_df = get_ql_from_wrf_hydro_mf(
        qlat_files=qlat_files,
        index_col=file_index_col,
        value_col=file_value_col,
    )
    print(qlat_df)
    print("read qlaterals in %s seconds." % (time.time() - start_time))


if __name__ == "__main__":
    main()
```

@groutr, @jmccreight
