
issue_comments


11 rows where issue = 435535284 sorted by updated_at descending



pinshuai (NONE) · 2021-05-05T17:12:19Z · https://github.com/pydata/xarray/issues/2912#issuecomment-832864415

I had a similar issue. I am trying to save a big xarray (~2 GB) dataset using to_netcdf().

Dataset:

I tried the following three approaches:

  1. Directly save using dset.to_netcdf()
  2. Load before save using dset.load().to_netcdf()
  3. Chunk data and save using dset.chunk({'time': 19968}).to_netcdf()

All three approaches failed to write the file, causing the Python kernel to hang indefinitely or die.

Any suggestion?
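
For reference, a minimal runnable sketch of the three approaches above (the input path, output paths, and opening chunks are hypothetical; dset stands in for the ~2 GB dataset):

import xarray as xr

dset = xr.open_dataset("input.nc", chunks={"time": 1000})  # hypothetical file and chunking

# 1. Direct lazy write: dask computes and writes chunk by chunk
dset.to_netcdf("out_direct.nc")

# 2. Force the data into memory first, then write in one pass
dset.load().to_netcdf("out_loaded.nc")

# 3. Rechunk to larger blocks before writing (chunk size from the comment)
dset.chunk({"time": 19968}).to_netcdf("out_rechunked.nc")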

bhanu-magotra (NONE) · 2021-02-05T06:20:40Z (edited 2021-02-05T06:56:05Z) · https://github.com/pydata/xarray/issues/2912#issuecomment-773820054

I am trying to perform a fairly simple operation on a dataset: editing variable and global attributes on individual netCDF files of 3.5 GB each. The files load instantly using xr.open_dataset, but dataset.to_netcdf() is too slow to export after the modifications. I have tried:

  1. Without rechunking and dask invocations
  2. Varying chunk sizes, followed by:
  3. Using load() before to_netcdf
  4. Using persist() or compute() before to_netcdf

I am working on an HPC with 10 distributed workers. In all cases, the time taken is more than 15 minutes per file. Is this expected? What else can I try to speed up this process, apart from further parallelizing the single-file operations using dask delayed?
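
A minimal sketch of the workflow described above (file names, the variable name, and attribute values are hypothetical):

import xarray as xr

ds = xr.open_dataset("in.nc")              # lazy open, so this is fast
ds.attrs["history"] = "attributes edited"  # edit a global attribute
ds["precip"].attrs["units"] = "mm/day"     # edit a variable attribute (hypothetical name)
ds.load().to_netcdf("out.nc")              # force computation before the write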

fsteinmetz (NONE) · 2019-10-15T19:32:50Z · https://github.com/pydata/xarray/issues/2912#issuecomment-542369777

Thanks for the explanations @jhamman and @shoyer :) Actually it turns out that I was not using particularly small chunks, but the filesystem for /tmp was faulty... After trying on a reliable filesystem, the results are much more reasonable.

shoyer (MEMBER) · 2019-09-25T06:08:43Z · https://github.com/pydata/xarray/issues/2912#issuecomment-534869060

I suspect it could work pretty well to explicitly rechunk your dataset into larger chunks (e.g., with the Dataset.chunk() method). This way you could continue to use dask for lazy writes, but reduce the overhead of writing individual chunks.

Reactions: 1 (+1: 1)
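
A minimal sketch of that rechunking suggestion (chunk sizes and paths are hypothetical):

import xarray as xr

ds = xr.open_dataset("in.nc", chunks={"time": 10})  # many small chunks: high per-chunk write overhead
ds = ds.chunk({"time": 1000})                       # consolidate into larger chunks
ds.to_netcdf("out.nc")                              # still a lazy dask write, but fewer, bigger chunks
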
jhamman (MEMBER) · 2019-09-25T05:12:32Z · https://github.com/pydata/xarray/issues/2912#issuecomment-534855337

@fsteinmetz - in my experience, the main thing to consider here is how and when xarray's backends lock/block for certain operations. The hdf5 library is not thread-safe, so we implement a global lock around all hdf5 read/write operations. In most cases, this means we can only do one read or one write at a time per process. We have found that using Dask's distributed (or multiprocessing) scheduler allows us to bypass the thread locks required by hdf5 by using multiple processes. We also need a per-file lock when writing, so using multiple output datasets theoretically allows for concurrent writes (provided your filesystem and OS support this).

Finally, it's best not to jump to the complicated explanations first. If you have many small dask chunks in your dataset, both reading and writing will be quite inefficient. This is simply because there is some non-trivial overhead when accessing partial datasets. This is even worse when the dataset is chunked/compressed.

Hope that helps.
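
As an illustration of the per-file lock point above, a hedged sketch that splits the output across files with xarray.save_mfdataset (the grouping by year and the paths are hypothetical):

import xarray as xr

# one output file per year, so writes can proceed independently per file
years, datasets = zip(*ds.groupby("time.year"))
paths = [f"out_{year}.nc" for year in years]
xr.save_mfdataset(datasets, paths)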

fsteinmetz (NONE) · 2019-09-21T14:21:17Z · https://github.com/pydata/xarray/issues/2912#issuecomment-533801682

"There are ways to side step some of these challenges (save_mfdataset and the distributed dask scheduler)"

@jhamman Could you elaborate on these ways?

I am having severe slowdowns when writing Datasets in blocks (backed by dask). I have also noticed that the slowdowns do not occur when writing to ramdisk. Here are the timings of to_netcdf, which uses the default engine and encoding (the nc file is 4.3 GB):

  • When writing to ramdisk (/dev/shm/) : 2min 1s
  • When writing to /tmp/ : 27min 28s
  • When writing to /tmp/ after .load(), as suggested here : 34s (.load takes 1min 43s)

The workaround suggested here works, but the datasets may not always fit in memory, and it defeats the essential purpose of dask...

Note: I am using dask 2.3.0 and xarray 0.12.3

msaharia (NONE) · 2019-04-22T18:32:30Z (edited 2019-04-22T18:36:38Z) · https://github.com/pydata/xarray/issues/2912#issuecomment-485505651

Diagnosis

Thank you very much! I found this. For now, I will use the load() option.

Loading netCDFs

In [8]: time ncdat = reformat_LIS_outputs(outlist)
CPU times: user 7.78 s, sys: 220 ms, total: 8 s
Wall time: 8.02 s

Slower export

In [6]: time ncdat.to_netcdf('test_slow')
CPU times: user 12min, sys: 8.19 s, total: 12min 9s
Wall time: 12min 14s

Faster export

In [9]: time ncdat.load().to_netcdf('test_faster.nc')
CPU times: user 42.6 s, sys: 2.82 s, total: 45.4 s
Wall time: 54.6 s

Reactions: 9 total (+1: 5, laugh: 1, hooray: 1, heart: 1, rocket: 1)

jhamman (MEMBER) · 2019-04-22T18:06:56Z · https://github.com/pydata/xarray/issues/2912#issuecomment-485497398

Since the final dataset size is quite manageable, I would start by forcing computation before the write step:

ncdat.load().to_netcdf(...)

While writing of xarray datasets backed by dask is possible, it's a poorly optimized operation. Most of this comes from constraints in netCDF4/HDF5. There are ways to side step some of these challenges (save_mfdataset and the distributed dask scheduler), but they are probably overkill for this use case.

Reactions: 2 (+1: 2)
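
For the distributed-scheduler route mentioned above, a minimal sketch (assuming dask.distributed is installed; ncdat stands in for the dask-backed dataset):

from dask.distributed import Client

client = Client()          # local cluster of worker processes; bypasses the hdf5 thread lock
ncdat.to_netcdf("out.nc")  # dask collections now compute on the distributed scheduler
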
shoyer (MEMBER) · 2019-04-22T16:23:44Z · https://github.com/pydata/xarray/issues/2912#issuecomment-485465687

It really depends on the underlying cause. In most cases, writing a file to disk is not the slow part, only the place where the slow-down is manifested.

dcherian (MEMBER) · 2019-04-22T16:21:00Z (edited 2019-04-22T16:21:20Z) · https://github.com/pydata/xarray/issues/2912#issuecomment-485464872

Are there "best practices" for a situation like this? Parallel writes? save_mfdataset?

ping @jhamman @rabernat

shoyer (MEMBER) · 2019-04-22T16:06:50Z · https://github.com/pydata/xarray/issues/2912#issuecomment-485460901

You're using dask, so the Dataset is being lazily computed. If one part of your pipeline is very expensive (perhaps reading the original data from disk?) then the process of saving can be very slow.

I would suggest doing some profiling, e.g., as shown in this example: http://docs.dask.org/en/latest/diagnostics-local.html#example
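
A minimal sketch of that profiling recipe using dask's local diagnostics (ds and the output path are placeholders; visualize() needs bokeh):

from dask.diagnostics import Profiler, ResourceProfiler, visualize

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    ds.to_netcdf("out.nc")

visualize([prof, rprof])   # timeline of task execution and CPU/memory use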

Once we know what the slow part is, that will hopefully make opportunities for improvement more obvious.

Reactions: 1 (+1: 1)

Table schema

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);