issue_comments: 781407863

html_url: https://github.com/pydata/xarray/issues/1385#issuecomment-781407863
issue_url: https://api.github.com/repos/pydata/xarray/issues/1385
id: 781407863
node_id: MDEyOklzc3VlQ29tbWVudDc4MTQwNzg2Mw==
user: 53343824
created_at: 2021-02-18T15:06:13Z
updated_at: 2021-02-18T15:06:13Z
author_association: NONE

> setting `parallel=True` seg faults... I'm betting that is some quirk of my python environment, though.

> This is important! Otherwise that timing scales with the number of files. If you get that to work, then you can convert to a dask dataframe and keep things lazy.

Indeed @dcherian -- it took some experimentation to find an engine that supports parallel execution, and even then the results are mixed, which tells me further work is needed to isolate the issue.
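For context, here is a minimal sketch of that kind of experiment, assuming a set of netCDF files matched by a glob pattern; the paths, engine names, and combine settings are illustrative rather than the exact configuration used:

```python
import glob
import time

import xarray as xr

# Hypothetical file pattern; the real paths depend on the dataset layout.
paths = sorted(glob.glob("output/*.nc"))

# Try a couple of backend engines with parallel=True and compare open times.
for engine in ("netcdf4", "h5netcdf"):
    try:
        t0 = time.perf_counter()
        ds = xr.open_mfdataset(
            paths,
            engine=engine,
            parallel=True,        # open/decode each file in a dask.delayed task
            combine="by_coords",
        )
        print(f"{engine}: opened in {time.perf_counter() - t0:.2f} s")

        # Per the suggestion above, stay lazy by converting to a dask dataframe
        # rather than loading everything into memory.
        ddf = ds.to_dask_dataframe()
        print(f"{engine}: dask dataframe with {ddf.npartitions} partitions")
        ds.close()
    except Exception as exc:
        # A segfault cannot be caught here, but ordinary read errors can.
        print(f"{engine}: failed ({exc})")
```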

Along the lines of the suggestions here (thanks @jmccreight for pointing this out), we've introduced a very practical pre-processing step that rewrites the datasets so that the read is not striped across the file system. This effectively isolates the performance bottleneck to a place where it can be dealt with independently. Of course, such an asynchronous workflow is not possible in every situation, so we're still looking at improving the direct read performance.
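As an illustration of that rewrite step, a sketch along these lines reads each file once from the striped file system and writes a plain local copy; the directory names are made up, and the real workflow may rewrite the files quite differently:

```python
import pathlib

import xarray as xr

# Hypothetical locations: a striped parallel file system as the source and
# local scratch space as the destination for the rewritten copies.
src = pathlib.Path("/striped/run")
dst = pathlib.Path("/local/scratch/run")
dst.mkdir(parents=True, exist_ok=True)

for path in sorted(src.glob("*.nc")):
    with xr.open_dataset(path) as ds:
        ds.load()                      # one read from the striped file system
        ds.to_netcdf(dst / path.name)  # rewrite as a contiguous local copy
```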

Two notes as we keep working:

- The preprocessor: reading and re-manipulating an individual dataset is lightning fast. A small change to the individual files, made with a preprocessor, made the multi-file read massively faster (see the sketch below).
- The "more sophisticated example" referenced here has proven to be very useful.
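On the first note, a minimal sketch of passing a per-file adjustment through `open_mfdataset`'s `preprocess` argument; the specific adjustment here is a placeholder for whatever small change actually proved effective:

```python
import xarray as xr

def adjust(ds):
    # Hypothetical per-file tweak applied as each dataset is opened,
    # e.g. dropping an unneeded variable before concatenation.
    return ds.drop_vars("crs", errors="ignore")

ds = xr.open_mfdataset(
    "rewritten/*.nc",      # e.g. the locally rewritten copies from above
    preprocess=adjust,     # applied to each file's dataset before combining
    combine="by_coords",
    parallel=True,
)
```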
