
issue_comments


5 rows where author_association = "NONE" and issue = 435535284 (Writing a netCDF file is unexpectedly slow), sorted by updated_at descending

Comment by pinshuai · 2021-05-05T17:12:19Z · https://github.com/pydata/xarray/issues/2912#issuecomment-832864415

I had a similar issue. I am trying to save a big xarray dataset (~2 GB) using to_netcdf().

I tried the following three approaches:

  1. Directly save using dset.to_netcdf()
  2. Load before save using dset.load().to_netcdf()
  3. Chunk data and save using dset.chunk({'time': 19968}).to_netcdf()

All three approaches failed to write the file, causing the Python kernel to hang indefinitely or die.

Any suggestions?
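
For concreteness, here is a minimal sketch of the three approaches on a small synthetic dataset (the variable names, shapes, and output paths are hypothetical; dask and a netCDF backend are assumed to be installed):

import numpy as np
import xarray as xr

# Small synthetic stand-in for the ~2 GB dataset described above.
dset = xr.Dataset(
    {"var": (("time", "x"), np.random.rand(1000, 100))},
    coords={"time": np.arange(1000), "x": np.arange(100)},
)

# 1. Direct write: a dask-backed dataset is computed chunk by chunk during the write.
dset.to_netcdf("direct.nc")

# 2. Load everything into memory first, then write in one pass.
dset.load().to_netcdf("loaded.nc")

# 3. Rechunk along time before writing (chunk size taken from the comment above).
dset.chunk({"time": 19968}).to_netcdf("chunked.nc")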

Comment by bhanu-magotra · 2021-02-05T06:20:40Z (edited 2021-02-05T06:56:05Z) · https://github.com/pydata/xarray/issues/2912#issuecomment-773820054

I am trying to perform a fairly simple operation on individual netCDF files of 3.5 GB each: editing variable and global attributes. The files load instantly using xr.open_dataset, but dataset.to_netcdf() is too slow to export after the modifications. I have tried:

  1. Without rechunking or dask invocations.
  2. Varying chunk sizes, followed by:
  3. Using load() before to_netcdf().
  4. Using persist() or compute() before to_netcdf().

I am working on an HPC with 10 distributed workers. In all cases, the time taken is more than 15 minutes per file. Is this expected? What else can I try to speed up this process, apart from further parallelizing the single-file operations using dask.delayed?
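
For context, a minimal sketch of the edit-and-export pattern described here, assuming a netCDF backend is installed (file, variable, and attribute names are hypothetical):

import xarray as xr

# Opening is lazy, so even a 3.5 GB file returns almost immediately.
ds = xr.open_dataset("input.nc")

# Edit variable and global attributes (names and values are hypothetical).
ds["precip"].attrs["units"] = "mm/day"
ds.attrs["history"] = "attributes edited"

# Loading into memory before writing trades RAM for write speed,
# one of the variants tried above.
ds.load().to_netcdf("output.nc")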

Comment by fsteinmetz · 2019-10-15T19:32:50Z · https://github.com/pydata/xarray/issues/2912#issuecomment-542369777

Thanks for the explanations @jhamman and @shoyer :) Actually it turns out that I was not using particularly small chunks, but the filesystem for /tmp was faulty... After trying on a reliable filesystem, the results are much more reasonable.

Comment by fsteinmetz · 2019-09-21T14:21:17Z · https://github.com/pydata/xarray/issues/2912#issuecomment-533801682

"There are ways to side step some of these challenges (save_mfdataset and the distributed dask scheduler)"

@jhamman Could you elaborate on these ways?

I am having severe slowdowns when writing Datasets by blocks (backed by dask). I have also noticed that the slowdowns do not occur when writing to a ramdisk. Here are the timings of to_netcdf with the default engine and encoding (the nc file is 4.3 GB):

  • When writing to ramdisk (/dev/shm/) : 2min 1s
  • When writing to /tmp/ : 27min 28s
  • When writing to /tmp/ after .load(), as suggested here: 34s (.load takes 1min 43s)

The workaround suggested here works, but the datasets may not always fit in memory, and it defeats the essential purpose of dask...

Note: I am using dask 2.3.0 and xarray 0.12.3
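
For reference, a sketch of the save_mfdataset route mentioned in the quote, along the lines of the xarray documentation; the input path, chunking, and the presence of a datetime "time" coordinate are assumptions:

import xarray as xr

# Open lazily with dask (chunk size is illustrative).
ds = xr.open_dataset("big_input.nc", chunks={"time": 1000})

# Split the dataset along time into one piece per year.
years, datasets = zip(*ds.groupby("time.year"))
paths = [f"out_{year}.nc" for year in years]

# save_mfdataset writes the pieces to separate files; combined with the
# dask distributed scheduler, the writes can proceed in parallel.
xr.save_mfdataset(datasets, paths)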

Comment by msaharia · 2019-04-22T18:32:30Z (edited 2019-04-22T18:36:38Z) · https://github.com/pydata/xarray/issues/2912#issuecomment-485505651

Diagnosis

Thank you very much! I found this. For now, I will use the load() option.

Loading netCDFs

In [8]: time ncdat = reformat_LIS_outputs(outlist)
CPU times: user 7.78 s, sys: 220 ms, total: 8 s
Wall time: 8.02 s

Slower export

In [6]: time ncdat.to_netcdf('test_slow')
CPU times: user 12min, sys: 8.19 s, total: 12min 9s
Wall time: 12min 14s

Faster export

In [9]: time ncdat.load().to_netcdf('test_faster.nc')
CPU times: user 42.6 s, sys: 2.82 s, total: 45.4 s
Wall time: 54.6 s

Reactions: +1 × 5, laugh × 1, hooray × 1, heart × 1, rocket × 1 (9 total)


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
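
For anyone working from a local copy of this database, the page's filter and sort can be reproduced with Python's sqlite3 module (the database file name is an assumption):

import sqlite3

# Hypothetical local copy of the database behind this page.
conn = sqlite3.connect("github.db")

# Same filter and ordering as the view above.
rows = conn.execute(
    """
    SELECT user, created_at, body
    FROM issue_comments
    WHERE author_association = 'NONE' AND issue = 435535284
    ORDER BY updated_at DESC
    """
).fetchall()

# Print the user id, timestamp, and a short preview of each comment.
for user, created_at, body in rows:
    print(user, created_at, body[:60])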