issues: 355264812
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
355264812 | MDU6SXNzdWUzNTUyNjQ4MTI= | 2389 | Large pickle overhead in ds.to_netcdf() involving dask.delayed functions | 1882397 | closed | 0 | | | 11 | 2018-08-29T17:43:28Z | 2019-01-13T21:17:12Z | 2019-01-13T21:17:12Z | NONE | | | | If we write a dask array that doesn't involve `dask.delayed` functions, the pickle overhead during the write is small:
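A minimal sketch of a setup matching this description, assuming a plain 500-chunk dask array and a local `dask.distributed` client (the array construction and the `file1.nc` name are assumptions, not the issue's original code):

```python
import numpy as np
import xarray as xr
import dask.array as da
from dask.distributed import Client

client = Client()  # run the write against a local distributed scheduler

# 500 single-element chunks backed by ordinary dask tasks, no dask.delayed
vals = da.random.random(500, chunks=1)
ds = xr.Dataset({'vals': (['a'], vals)})
write = ds.to_netcdf('file1.nc', compute=False)
%prun -stime -l10 write.compute()
```

Profiling `write.compute()` for this non-delayed case gives: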
```
123410 function calls (104395 primitive calls) in 13.720 seconds
Ordered by: internal time
List reduced from 203 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
8 10.032 1.254 10.032 1.254 {method 'acquire' of '_thread.lock' objects}
1001 2.939 0.003 2.950 0.003 {built-in method _pickle.dumps}
1001 0.614 0.001 3.569 0.004 pickle.py:30(dumps)
6504/1002 0.012 0.000 0.021 0.000 utils.py:803(convert)
11507/1002 0.010 0.000 0.019 0.000 utils_comm.py:144(unpack_remotedata)
6013 0.009 0.000 0.009 0.000 utils.py:767(tokey)
3002/1002 0.008 0.000 0.017 0.000 utils_comm.py:181(<listcomp>)
11512 0.007 0.000 0.008 0.000 core.py:26(istask)
1002 0.006 0.000 3.589 0.004 worker.py:788(dumps_task)
1 0.005 0.005 0.007 0.007 core.py:273(<dictcomp>)
```

But if we use results from `dask.delayed` functions:

```python
vals = da.stack([da.from_delayed(make_data(), (), np.float64) for _ in range(500)])
ds = xr.Dataset({'vals': (['a'], vals)})
write = ds.to_netcdf('file5.nc', compute=False)
%prun -stime -l10 write.compute()
```

```
115045243 function calls (104115443 primitive calls) in 67.240 seconds
Ordered by: internal time
List reduced from 292 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
8120705/501 17.597 0.000 59.036 0.118 pickle.py:457(save)
2519027/501 7.581 0.000 59.032 0.118 pickle.py:723(save_tuple)
4 6.978 1.745 6.978 1.745 {method 'acquire' of '_thread.lock' objects}
3082150 5.362 0.000 8.748 0.000 pickle.py:413(memoize)
11474396 4.516 0.000 5.970 0.000 pickle.py:213(write)
8121206 4.186 0.000 5.202 0.000 pickle.py:200(commit_frame)
13747943 2.703 0.000 2.703 0.000 {method 'get' of 'dict' objects}
17057538 1.887 0.000 1.887 0.000 {built-in method builtins.id}
4568116 1.772 0.000 1.782 0.000 {built-in method _struct.pack}
2762513 1.613 0.000 2.826 0.000 pickle.py:448(get)
```

This additional pickle overhead does not happen if we compute the dataset without writing it to a file.
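`make_data` in the snippet above is not defined anywhere in the issue body; a minimal sketch of a compatible definition, assuming any `dask.delayed` function returning a scalar float reproduces the overhead:

```python
import numpy as np
from dask import delayed

@delayed
def make_data():
    # hypothetical stand-in: a 0-d float64 result, matching the
    # shape () and dtype in da.from_delayed(make_data(), (), np.float64)
    return np.float64(np.random.rand())
```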
Output of `%prun -stime -l10 ds.compute()` without `dask.delayed`:
```
83856 function calls (73348 primitive calls) in 0.566 seconds
Ordered by: internal time
List reduced from 259 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
4 0.441 0.110 0.441 0.110 {method 'acquire' of '_thread.lock' objects}
502 0.013 0.000 0.013 0.000 {method 'send' of '_socket.socket' objects}
500 0.011 0.000 0.011 0.000 {built-in method _pickle.dumps}
1000 0.007 0.000 0.008 0.000 core.py:159(get_dependencies)
3500 0.007 0.000 0.007 0.000 utils.py:767(tokey)
3000/500 0.006 0.000 0.010 0.000 utils.py:803(convert)
500 0.005 0.000 0.019 0.000 pickle.py:30(dumps)
1 0.004 0.004 0.008 0.008 core.py:3826(concatenate3)
4500/500 0.004 0.000 0.008 0.000 utils_comm.py:144(unpack_remotedata)
1 0.004 0.004 0.017 0.017 order.py:83(order)
```
With `dask.delayed`:
```
149376 function calls (139868 primitive calls) in 1.738 seconds
Ordered by: internal time
List reduced from 264 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
4 1.568 0.392 1.568 0.392 {method 'acquire' of '_thread.lock' objects}
1 0.015 0.015 0.038 0.038 optimization.py:455(fuse)
502 0.012 0.000 0.012 0.000 {method 'send' of '_socket.socket' objects}
6500 0.010 0.000 0.010 0.000 utils.py:767(tokey)
5500/1000 0.009 0.000 0.012 0.000 utils_comm.py:144(unpack_remotedata)
2500 0.008 0.000 0.009 0.000 core.py:159(get_dependencies)
500 0.007 0.000 0.009 0.000 client.py:142(__init__)
1000 0.005 0.000 0.008 0.000 core.py:280(subs)
2000/1000 0.005 0.000 0.008 0.000 utils.py:803(convert)
1 0.004 0.004 0.022 0.022 order.py:83(order)
```
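Since the profiles show that plain `ds.compute()` avoids most of the penalty, they point at a workaround (a sketch of the implication, not a fix proposed in this issue): materialize the delayed-backed dataset first, then write the in-memory result, so the netCDF write never has to serialize the delayed graph:

```python
# compute on the cluster first (cheap to pickle, per the profiles above),
# then write the already in-memory dataset to disk
ds_computed = ds.compute()
ds_computed.to_netcdf('file5.nc')
```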
I am using the following software versions:
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_GB.UTF-8
LANG: None
LOCALE: en_GB.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: 2.2.2
cartopy: None
seaborn: 0.9.0
setuptools: 40.2.0
pip: 18.0
conda: 4.5.11
pytest: 3.7.3
IPython: 6.5.0
sphinx: 1.7.7
```
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/2389/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
 | completed | 13221727 | issue |