issue_comments
10 rows where author_association = "NONE", issue = 427410885 (pydata/xarray#2857: "Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf)") and user = 2418513 (aldanor), sorted by updated_at descending.
**[807126680](https://github.com/pydata/xarray/issues/2857#issuecomment-807126680)** · aldanor · created 2021-03-25T17:17:48Z · updated 2021-03-25T17:18:21Z

It scales with the data size, it seems, but: even if you reduce the data to a single element, a single write already takes around 150 ms after 50 iterations (whereas it's a few milliseconds in an empty file). Those 150 ms are the pure "file traversal" part; the rest (of the 2 seconds) is the part where it seemingly reads data back, which scales with the data size. Ideally it should just stay under 10 ms the whole time.
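A minimal sketch of that one-element experiment (my own reconstruction, not code from the issue; it reuses the `test.h5` name from the reproduction script further down):

```python
import time

import h5py
import numpy as np
import xarray as xr

# One-element dataset: if the cost were dominated by the data itself,
# per-write time would stay flat instead of growing with the group count.
ds = xr.Dataset({'arr': xr.DataArray(np.zeros(1), dims=['x'])})

with h5py.File('test.h5', 'w'):
    pass  # start from an empty file

for i in range(50):
    t0 = time.time()
    ds.to_netcdf('test.h5', engine='h5netcdf', mode='a', group=str(i))
    print(f'write {i}: {(time.time() - t0) * 1e3:.1f} ms')
```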
**[806863336](https://github.com/pydata/xarray/issues/2857#issuecomment-806863336)** · aldanor · created 2021-03-25T14:35:28Z · updated 2021-03-25T17:15:06Z

I don't think it's about what's happening in the current Python process or which instances are being cached; it's about the general logic. For instance, take the example above: run it once (e.g. with the range set to 50), then run it again with the file-clearing block commented out and the range set to 50–100. The very first dataset written in the second run is already very slow, slower than the last dataset written in the first run, which means it's not about reusing the same …
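A sketch of that second run (my reconstruction; it assumes `test.h5` already holds groups 0–49 from a first run of the reproduction script shown further down):

```python
import time

import numpy as np
import xarray as xr

ds = xr.Dataset({'arr': xr.DataArray(
    np.random.RandomState(0).randint(-100, 100, (50_000, 3)), dims=['x', 'y'])})

# The file-clearing block is intentionally omitted, and the range continues
# where the first run stopped. If per-process caching were to blame, a fresh
# process would start fast again; instead its first write is already slow.
for i in range(50, 100):
    t0 = time.time()
    ds.to_netcdf('test.h5', engine='h5netcdf', mode='a', group=str(i))
    print(f'group {i}: {(time.time() - t0) * 1e3:.1f} ms')
```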
**[806776909](https://github.com/pydata/xarray/issues/2857#issuecomment-806776909)** · aldanor · created 2021-03-25T13:48:04Z · updated 2021-03-25T13:48:29Z

Without digging into implementation details, my logic as a library user would be this: …
**[806767981](https://github.com/pydata/xarray/issues/2857#issuecomment-806767981)** · aldanor · created 2021-03-25T13:44:22Z · updated 2021-03-25T13:45:04Z

Just checked it out:

| Number of datasets in file | netCDF4 (ms/write) | h5netcdf (ms/write) |
| --- | --- | --- |
| 1 | 4 | 11 |
| 250 | 142 | 1933 |
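A sketch of how such a comparison can be reproduced (my own harness, not from the issue; `netcdf4` and `h5netcdf` are real xarray engine names, while the per-engine file names are assumptions):

```python
import os
import time

import numpy as np
import xarray as xr

ds = xr.Dataset({'arr': xr.DataArray(
    np.random.RandomState(0).randint(-100, 100, (50_000, 3)), dims=['x', 'y'])})

for engine in ('netcdf4', 'h5netcdf'):
    filename = f'test_{engine}.h5'  # hypothetical per-engine file
    if os.path.exists(filename):
        os.remove(filename)
    times = []
    for i in range(250):
        t0 = time.time()
        # mode='a' creates the file on the first write, then appends
        ds.to_netcdf(filename, engine=engine, mode='a', group=str(i))
        times.append(time.time() - t0)
    print(f"{engine}: write #1 took {times[0] * 1e3:.0f} ms, "
          f"write #250 took {times[-1] * 1e3:.0f} ms")
```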
**[806740965](https://github.com/pydata/xarray/issues/2857#issuecomment-806740965)** · aldanor · created 2021-03-25T13:27:17Z

Here's the minimal example, try running this:

```python
import time
import xarray as xr
import numpy as np
import h5py

arr = xr.DataArray(
    np.random.RandomState(0).randint(-100, 100, (50_000, 3)), dims=['x', 'y']
)
ds = xr.Dataset({'arr': arr})
filename = 'test.h5'
save = lambda group: ds.to_netcdf(filename, engine='h5netcdf', mode='a', group=str(group))

# start from an empty HDF5 file
with h5py.File(filename, 'w') as _:
    pass

for i in range(250):
    t0 = time.time()
    save(i)
    print(time.time() - t0)
```

Reactions: 👍 1
**[806713825](https://github.com/pydata/xarray/issues/2857#issuecomment-806713825)** · aldanor · created 2021-03-25T13:10:13Z

Is it possible to use …? Or, at the very least, don't traverse anything above the current root group that the dataset is being written into.
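A control sketch with raw h5py (my own, not from the issue) illustrating the suggestion: creating a dataset touches only its own group, so per-write time stays flat no matter how many sibling groups already exist:

```python
import time

import h5py
import numpy as np

data = np.random.RandomState(0).randint(-100, 100, (50_000, 3))

with h5py.File('test_raw.h5', 'w') as f:
    for i in range(250):
        t0 = time.time()
        # Only the new group is touched; nothing above it is traversed.
        f.create_group(str(i)).create_dataset('arr', data=data)
        if i % 50 == 0:
            print(f'group {i}: {(time.time() - t0) * 1e3:.2f} ms')
```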
**[806711702](https://github.com/pydata/xarray/issues/2857#issuecomment-806711702)** · aldanor · created 2021-03-25T13:08:46Z

@kmuehlbauer Just installed h5netcdf=0.10.0; here are the timings when there are 200 groups in the file: …

And here's the line profile:

```
Timer unit: 1e-06 s

Total time: 2.44857 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: _lookup_dimensions at line 92

Line #   Hits      Time  Per Hit   % Time  Line Contents
==============================================================
   100    400   44938.0    112.3      1.8          child_name = self.name.split("/")[-1]
   101    400    5006.0     12.5      0.2          if child_name in self._parent.dimensions:
   102                                                 return (child_name,)
   103
```
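The hot line above is `self.name`, which in h5py resolves the object's full path in the file via `H5Iget_name` on every access. A small sketch (mine, not from the issue; file name is hypothetical) to measure that lookup cost in a file with many groups:

```python
import timeit

import h5py
import numpy as np

with h5py.File('names.h5', 'w') as f:
    for i in range(200):
        f.create_group(str(i)).create_dataset('arr', data=np.zeros(10))
    dset = f['199/arr']
    # .name is not cached: each access asks the HDF5 library to resolve
    # the object's path within the file structure.
    per_call = timeit.timeit(lambda: dset.name, number=1000) / 1000
    print(f'{per_call * 1e6:.1f} us per .name lookup')
```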
**[806680140](https://github.com/pydata/xarray/issues/2857#issuecomment-806680140)** · aldanor · created 2021-03-25T12:48:23Z · updated 2021-03-25T12:49:19Z

There are some absolutely obscure things here, e.g. …

For 0 datasets:

```
Timer unit: 1e-06 s

Total time: 0.005034 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: _lookup_dimensions at line 86

Line #   Hits    Time  Per Hit   % Time  Line Contents
==============================================================
   100                                       # scale? it's not netCDF4 spec, but it is unambiguous...
   101                                       # Also: the netCDF lib can read HDF5 datasets with unlabeled
   102                                       # dimensions.
   103      5   355.0     71.0      7.1      if len(dim) == 0:
   104                                           raise ValueError('variable %r has no dimension scale '
   105                                                            'associated with axis %s'
   106                                                            % (self.name, axis))
   107      5  1772.0    354.4     35.2      name = _name_from_dimension(dim)
   108      5    18.0      3.6      0.4      dims.append(name)
   109      2     6.0      3.0      0.1      return tuple(dims)
```

For 200 datasets:

```
Timer unit: 1e-06 s

Total time: 2.34179 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: _lookup_dimensions at line 86

Line #   Hits       Time  Per Hit   % Time  Line Contents
==============================================================
   100                                          # scale? it's not netCDF4 spec, but it is unambiguous...
   101                                          # Also: the netCDF lib can read HDF5 datasets with unlabeled
   102                                          # dimensions.
   103   1000    35653.0     35.7      1.5      if len(dim) == 0:
   104                                              raise ValueError('variable %r has no dimension scale '
   105                                                               'associated with axis %s'
   106                                                               % (self.name, axis))
   107   1000  2012597.0   2012.6     85.9      name = _name_from_dimension(dim)
   108   1000     1640.0      1.6      0.1      dims.append(name)
   109    400      400.0      1.0      0.0      return tuple(dims)
```
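A sketch of collecting per-line timings like the ones above with the line_profiler package (the class name `BaseVariable` is my assumption about where `_lookup_dimensions` lives in h5netcdf's core.py, and the output file name is hypothetical):

```python
import numpy as np
import xarray as xr
from line_profiler import LineProfiler

import h5netcdf.core

profiler = LineProfiler()
# Register the method whose lines should be timed.
profiler.add_function(h5netcdf.core.BaseVariable._lookup_dimensions)

ds = xr.Dataset({'arr': xr.DataArray(np.zeros((10, 3)), dims=['x', 'y'])})

profiler.enable_by_count()
ds.to_netcdf('profiled.h5', engine='h5netcdf', mode='a', group='g0')
profiler.disable_by_count()
profiler.print_stats()
```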
**[806667029](https://github.com/pydata/xarray/issues/2857#issuecomment-806667029)** · aldanor · created 2021-03-25T12:40:18Z · updated 2021-03-25T12:49:00Z

```
0.21619391441345215

Timer unit: 1e-06 s

Total time: 0.006862 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: flush at line 689

Line #   Hits      Time  Per Hit   % Time  Line Contents
==============================================================
   689                                      def flush(self):
   690      1       4.0      4.0      0.1      if 'r' not in self._mode:
   691      1     111.0    111.0      1.6          self._set_unassigned_dimension_ids()
   692      1    3521.0   3521.0     51.3          self._create_dim_scales()
   693      1    3224.0   3224.0     47.0          self._attach_dim_scales()
   694      1       2.0      2.0      0.0      if not self._preexisting_file and self._write_ncproperties:
   695                                             self.attrs._h5attrs['_NCProperties'] = _NC_PROPERTIES
```

```
Timer unit: 1e-06 s

Total time: 4.55295 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: flush at line 689

Line #   Hits        Time    Per Hit   % Time  Line Contents
==============================================================
   689                                          def flush(self):
   690      1         3.0        3.0      0.0      if 'r' not in self._mode:
   691      1   1148237.0  1148237.0     25.2          self._set_unassigned_dimension_ids()
   692      1    462926.0   462926.0     10.2          self._create_dim_scales()
   693      1   2941779.0  2941779.0     64.6          self._attach_dim_scales()
   694      1         2.0        2.0      0.0      if not self._preexisting_file and self._write_ncproperties:
   695                                                 self.attrs._h5attrs['_NCProperties'] = _NC_PROPERTIES
```
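A coarser alternative (my sketch, not from the issue) that shows the same growth without line-level instrumentation, using the standard library's cProfile (the context-manager form needs Python 3.8+):

```python
import cProfile
import pstats

import numpy as np
import xarray as xr

ds = xr.Dataset({'arr': xr.DataArray(np.zeros((10, 3)), dims=['x', 'y'])})

# Profile a single append and list cumulative time spent inside h5netcdf
# calls; run this at different file sizes to watch flush() grow.
with cProfile.Profile() as pr:
    ds.to_netcdf('test.h5', engine='h5netcdf', mode='a', group='g_profiled')
pstats.Stats(pr).sort_stats('cumulative').print_stats('h5netcdf')
```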
**[806651823](https://github.com/pydata/xarray/issues/2857#issuecomment-806651823)** · aldanor · created 2021-03-25T12:30:39Z · updated 2021-03-25T12:46:26Z

@shoyer This problem has persisted all this time, but since I ran into it again, I did a bit of digging (it's strange no one else has noticed it so far, as it's pretty bad). I've line-profiled this snippet for various numbers of datasets already written to the file (…):

| Number of datasets in file | … |

The above can be measured simply in a notebook via …
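The notebook command is elided above; one common way to get measurements like these (an assumption on my part, not necessarily what was used) is line_profiler's IPython magic:

```python
# In a Jupyter/IPython session with the line_profiler package installed:
#
#   %load_ext line_profiler
#   %lprun -f h5netcdf.core.File.flush ds.to_netcdf(
#       'test.h5', engine='h5netcdf', mode='a', group='g')
#
# -f registers the function whose lines should be timed; the statement
# after it is executed under the profiler and the per-line report printed.
```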
```sql
CREATE TABLE [issue_comments] (
    [html_url] TEXT,
    [issue_url] TEXT,
    [id] INTEGER PRIMARY KEY,
    [node_id] TEXT,
    [user] INTEGER REFERENCES [users]([id]),
    [created_at] TEXT,
    [updated_at] TEXT,
    [author_association] TEXT,
    [body] TEXT,
    [reactions] TEXT,
    [performed_via_github_app] TEXT,
    [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);
```