pydata/xarray issue #2389: Large pickle overhead in ds.to_netcdf() involving dask.delayed functions

State: closed (completed) · Comments: 11 · Opened: 2018-08-29 · Closed: 2019-01-13

If we write a dask array that doesn't involve `dask.delayed` functions to disk using `ds.to_netcdf`, there is only a little overhead from pickle:

```python
import dask.array as da
import xarray as xr

vals = da.random.random(500, chunks=(1,))
ds = xr.Dataset({'vals': (['a'], vals)})
write = ds.to_netcdf('file2.nc', compute=False)
%prun -stime -l10 write.compute()
```

```
         123410 function calls (104395 primitive calls) in 13.720 seconds

   Ordered by: internal time
   List reduced from 203 to 10 due to restriction <10>

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           8   10.032    1.254   10.032    1.254 {method 'acquire' of '_thread.lock' objects}
        1001    2.939    0.003    2.950    0.003 {built-in method _pickle.dumps}
        1001    0.614    0.001    3.569    0.004 pickle.py:30(dumps)
   6504/1002    0.012    0.000    0.021    0.000 utils.py:803(convert)
  11507/1002    0.010    0.000    0.019    0.000 utils_comm.py:144(unpack_remotedata)
        6013    0.009    0.000    0.009    0.000 utils.py:767(tokey)
   3002/1002    0.008    0.000    0.017    0.000 utils_comm.py:181(<listcomp>)
       11512    0.007    0.000    0.008    0.000 core.py:26(istask)
        1002    0.006    0.000    3.589    0.004 worker.py:788(dumps_task)
           1    0.005    0.005    0.007    0.007 core.py:273(<dictcomp>)
```

But if we use results from `dask.delayed`, pickle takes up most of the time:

```python
import dask
import numpy as np

@dask.delayed
def make_data():
    return np.array(np.random.randn())

vals = da.stack([da.from_delayed(make_data(), (), np.float64) for _ in range(500)])
ds = xr.Dataset({'vals': (['a'], vals)})
write = ds.to_netcdf('file5.nc', compute=False)
%prun -stime -l10 write.compute()
```

```
         115045243 function calls (104115443 primitive calls) in 67.240 seconds

   Ordered by: internal time
   List reduced from 292 to 10 due to restriction <10>

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 8120705/501   17.597    0.000   59.036    0.118 pickle.py:457(save)
 2519027/501    7.581    0.000   59.032    0.118 pickle.py:723(save_tuple)
           4    6.978    1.745    6.978    1.745 {method 'acquire' of '_thread.lock' objects}
     3082150    5.362    0.000    8.748    0.000 pickle.py:413(memoize)
    11474396    4.516    0.000    5.970    0.000 pickle.py:213(write)
     8121206    4.186    0.000    5.202    0.000 pickle.py:200(commit_frame)
    13747943    2.703    0.000    2.703    0.000 {method 'get' of 'dict' objects}
    17057538    1.887    0.000    1.887    0.000 {built-in method builtins.id}
     4568116    1.772    0.000    1.782    0.000 {built-in method _struct.pack}
     2762513    1.613    0.000    2.826    0.000 pickle.py:448(get)
```
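As a rough check that it is the task graph itself that blows up when serialized, the two graphs can be pickled directly (a minimal sketch, not part of the measurements above; `write_plain` and `write_delayed` are hypothetical names for the two `compute=False` writes, and I use `cloudpickle` because the interactively defined `make_data` is not picklable with plain `pickle`):

```python
import cloudpickle

# Hypothetical names for the two `ds.to_netcdf(..., compute=False)`
# results built above (without and with dask.delayed, respectively).
plain_bytes = cloudpickle.dumps(dict(write_plain.__dask_graph__()))
delayed_bytes = cloudpickle.dumps(dict(write_delayed.__dask_graph__()))

# If the sizes differ by orders of magnitude, the pickle time in the
# profile above is going into serializing the delayed tasks themselves.
print(len(plain_bytes), len(delayed_bytes))
```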

This additional pickle overhead does not happen if we compute the dataset without writing it to a file.

Output of `%prun -stime -l10 ds.compute()` without `dask.delayed`:

```
         83856 function calls (73348 primitive calls) in 0.566 seconds

   Ordered by: internal time
   List reduced from 259 to 10 due to restriction <10>

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           4    0.441    0.110    0.441    0.110 {method 'acquire' of '_thread.lock' objects}
         502    0.013    0.000    0.013    0.000 {method 'send' of '_socket.socket' objects}
         500    0.011    0.000    0.011    0.000 {built-in method _pickle.dumps}
        1000    0.007    0.000    0.008    0.000 core.py:159(get_dependencies)
        3500    0.007    0.000    0.007    0.000 utils.py:767(tokey)
    3000/500    0.006    0.000    0.010    0.000 utils.py:803(convert)
         500    0.005    0.000    0.019    0.000 pickle.py:30(dumps)
           1    0.004    0.004    0.008    0.008 core.py:3826(concatenate3)
    4500/500    0.004    0.000    0.008    0.000 utils_comm.py:144(unpack_remotedata)
           1    0.004    0.004    0.017    0.017 order.py:83(order)
```

With `dask.delayed`:

```
         149376 function calls (139868 primitive calls) in 1.738 seconds

   Ordered by: internal time
   List reduced from 264 to 10 due to restriction <10>

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           4    1.568    0.392    1.568    0.392 {method 'acquire' of '_thread.lock' objects}
           1    0.015    0.015    0.038    0.038 optimization.py:455(fuse)
         502    0.012    0.000    0.012    0.000 {method 'send' of '_socket.socket' objects}
        6500    0.010    0.000    0.010    0.000 utils.py:767(tokey)
   5500/1000    0.009    0.000    0.012    0.000 utils_comm.py:144(unpack_remotedata)
        2500    0.008    0.000    0.009    0.000 core.py:159(get_dependencies)
         500    0.007    0.000    0.009    0.000 client.py:142(__init__)
        1000    0.005    0.000    0.008    0.000 core.py:280(subs)
   2000/1000    0.005    0.000    0.008    0.000 utils.py:803(convert)
           1    0.004    0.004    0.022    0.022 order.py:83(order)
```
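In the meantime the overhead can be sidestepped by loading the result into memory first and only then writing it (a minimal workaround sketch, not a fix for the underlying issue):

```python
# Workaround sketch: materialize the delayed-backed dataset first,
# then write the purely in-memory result. to_netcdf() on an
# in-memory Dataset never has to pickle the dask.delayed tasks.
ds_in_memory = ds.compute()
ds_in_memory.to_netcdf('file5.nc')
```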

I am using dask.distributed; I haven't tested this with any other scheduler.
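For reference, the profiles above assume a default local client along these lines (a minimal sketch; the exact worker and thread configuration is not captured in the output):

```python
from dask.distributed import Client

# Client() with no arguments starts an in-process LocalCluster;
# the actual scheduler/worker setup used for the timings may differ.
client = Client()
```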

Software versions

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_GB.UTF-8
LANG: None
LOCALE: en_GB.UTF-8

xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: 2.2.2
cartopy: None
seaborn: 0.9.0
setuptools: 40.2.0
pip: 18.0
conda: 4.5.11
pytest: 3.7.3
IPython: 6.5.0
sphinx: 1.7.7
```
