
issue_comments


24 rows where issue = 427410885 sorted by updated_at descending


user (3 distinct values)

  • aldanor · 10
  • kmuehlbauer · 10
  • shoyer · 4

author_association (2 distinct values)

  • MEMBER · 14
  • NONE · 10

issue (1 distinct value)

  • Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) · 24
Columns: id · html_url · issue_url · node_id · user · created_at · updated_at (sorted descending) · author_association · body · reactions · performed_via_github_app · issue
1010716076 https://github.com/pydata/xarray/issues/2857#issuecomment-1010716076 https://api.github.com/repos/pydata/xarray/issues/2857 IC_kwDOAMm_X848Pk2s shoyer 1217238 2022-01-12T07:18:57Z 2022-01-12T07:18:57Z MEMBER

Well done, Kai!

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 2,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
1010713645 https://github.com/pydata/xarray/issues/2857#issuecomment-1010713645 https://api.github.com/repos/pydata/xarray/issues/2857 IC_kwDOAMm_X848PkQt kmuehlbauer 5821660 2022-01-12T07:15:39Z 2022-01-12T07:15:39Z MEMBER

This issue has been fixed to some extent as of h5netcdf 0.12.0.

h5netcdf still does not match the timings of the netCDF4 engine, but the improvement is quite significant.

| Number of datasets in file | netCDF4 write (ms) | h5netcdf <= 0.11.0 write (ms) | h5netcdf >= 0.12.0 write (ms) |
|-----|------|-----|-----|
| 1 | 2 | 7 | 7 |
| 250 | 104 | 1710 | 164 |

The issue can be closed.

Ping @aldanor.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
999410802 https://github.com/pydata/xarray/issues/2857#issuecomment-999410802 https://api.github.com/repos/pydata/xarray/issues/2857 IC_kwDOAMm_X847kcxy kmuehlbauer 5821660 2021-12-22T09:11:05Z 2021-12-22T09:11:05Z MEMBER

FYI: h5netcdf has just merged a refactor of the dimension scale handling, which greatly improves the performance here. It will be released in the next version (0.13.0).

See https://github.com/h5netcdf/h5netcdf/pull/112

I'll come back when the release is out so we can close this issue.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
825579825 https://github.com/pydata/xarray/issues/2857#issuecomment-825579825 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgyNTU3OTgyNQ== kmuehlbauer 5821660 2021-04-23T11:01:04Z 2021-04-23T11:01:04Z MEMBER

@aldanor Could you please have a look at https://github.com/h5netcdf/h5netcdf/pull/101 for a fix? Any comments are very much appreciated.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
807472405 https://github.com/pydata/xarray/issues/2857#issuecomment-807472405 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNzQ3MjQwNQ== shoyer 1217238 2021-03-25T20:56:36Z 2021-03-25T20:56:36Z MEMBER

It appears that issues can only be moved within a GitHub organization. So I guess we'll need to start a new one.

On Thu, Mar 25, 2021 at 12:35 PM Kai Mühlbauer @.***> wrote:

> @shoyer Could we move the entire issue? Or just open another one over at 'h5netcdf' and reference this one?


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
807344131 https://github.com/pydata/xarray/issues/2857#issuecomment-807344131 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNzM0NDEzMQ== kmuehlbauer 5821660 2021-03-25T19:34:55Z 2021-03-25T19:34:55Z MEMBER

@shoyer Could we move the entire issue? Or just open another one over at 'h5netcdf' and reference this one?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
807319582 https://github.com/pydata/xarray/issues/2857#issuecomment-807319582 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNzMxOTU4Mg== shoyer 1217238 2021-03-25T19:19:08Z 2021-03-25T19:19:08Z MEMBER

I suspect this could be solved by adding an optimization to h5netcdf so that _attach_dim_scales() (and maybe some other methods) is only called on variables/groups that have been modified, as opposed to the entire file.

It's probably worth moving the discussion over to the h5netcdf tracker anyway :)

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
807126680 https://github.com/pydata/xarray/issues/2857#issuecomment-807126680 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNzEyNjY4MA== aldanor 2418513 2021-03-25T17:17:48Z 2021-03-25T17:18:21Z NONE

> OK, we might check if that depends on the data size or on the number of groups, or both.

It seems to scale with data size, but: even if you reduce the data size to 1 element, after 50 iterations a single write already takes 150 ms (whereas it's a few milliseconds in an empty file). Those 150 ms are the pure 'file traversal' part; the rest (of the 2 seconds) is the part where it seemingly reads data back, which scales with data size. Ideally it should just stay at <10 ms the whole time.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806863336 https://github.com/pydata/xarray/issues/2857#issuecomment-806863336 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjg2MzMzNg== aldanor 2418513 2021-03-25T14:35:28Z 2021-03-25T17:15:06Z NONE

> I wonder if it would help to use the same underlying h5py.File or h5netcdf.File when appending.

I don't think it's about what's happening in the current Python process or which instances are being cached; it's about the general logic.

For instance, take the example above: run it once with the range set to 50; then run it again with the file-clearing block commented out and the range set to 50-100. The very first dataset written in the second run is already very slow, slower than the last dataset written in the first run - which means it's not about reusing the same File instance.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806982015 https://github.com/pydata/xarray/issues/2857#issuecomment-806982015 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjk4MjAxNQ== kmuehlbauer 5821660 2021-03-25T15:48:35Z 2021-03-25T15:48:35Z MEMBER

OK, we might check if that depends on the data size or on the number of groups, or both.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806853536 https://github.com/pydata/xarray/issues/2857#issuecomment-806853536 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjg1MzUzNg== kmuehlbauer 5821660 2021-03-25T14:29:24Z 2021-03-25T14:29:24Z MEMBER

> I wonder if it would help to use the same underlying h5py.File or h5netcdf.File when appending.

This should somehow be possible. I'll try to create a proof-of-concept script bypassing to_netcdf when I find the time. If there are other ideas or solutions, please comment here. Thanks @aldanor for the intensive testing and the minimal example.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806825379 https://github.com/pydata/xarray/issues/2857#issuecomment-806825379 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjgyNTM3OQ== kmuehlbauer 5821660 2021-03-25T14:11:43Z 2021-03-25T14:11:43Z MEMBER

From my understanding, part of the problem is the use of CachingFileManager. Every call to to_netcdf(filename....) reopens this particular file (with all the downsides) and wraps it in CachingFileManager again. I wonder if it would help to use the same underlying h5py.File or h5netcdf.File when appending.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806776909 https://github.com/pydata/xarray/issues/2857#issuecomment-806776909 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjc3NjkwOQ== aldanor 2418513 2021-03-25T13:48:04Z 2021-03-25T13:48:29Z NONE

Without digging into implementation details, my logic as a library user would be this:

  • If I write one dataset to file1 and another dataset to file2 using to_netcdf(), to different groups
  • And I then simply combine the two hdf5 files using some external tool (again, datasets stored in different groups; see the sketch below)
  • I will be able to read them both perfectly well using open_dataset() or load_dataset()
  • This implies that the datasets can be written just fine independently, without knowing about each other
  • Why, then, do those writing functions (flush in particular) traverse and read the entire file every time?
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806767981 https://github.com/pydata/xarray/issues/2857#issuecomment-806767981 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjc2Nzk4MQ== aldanor 2418513 2021-03-25T13:44:22Z 2021-03-25T13:45:04Z NONE

Just checked it out.

| Number of datasets in file | netCDF4 (ms/write) | h5netcdf (ms/write) |
| --- | --- | --- |
| 1 | 4 | 11 |
| 250 | 142 | 1933 |

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806759522 https://github.com/pydata/xarray/issues/2857#issuecomment-806759522 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjc1OTUyMg== kmuehlbauer 5821660 2021-03-25T13:39:02Z 2021-03-25T13:39:02Z MEMBER

@aldanor If I change your example to use engine=netcdf4, the times increase too, but not to the extent of the h5netcdf case.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806741704 https://github.com/pydata/xarray/issues/2857#issuecomment-806741704 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjc0MTcwNA== kmuehlbauer 5821660 2021-03-25T13:27:43Z 2021-03-25T13:27:43Z MEMBER

@aldanor Thanks, that's what I expected (that the new version doesn't change the behaviour you are showing).

I think your assessment of the situation is correct. It looks like to_netcdf is re-reading the whole file when in append mode. Or, better said, the underlying machinery re-reads the complete file. Would it be possible to use engine=netcdf4, just to see whether that engine is affected?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806740965 https://github.com/pydata/xarray/issues/2857#issuecomment-806740965 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjc0MDk2NQ== aldanor 2418513 2021-03-25T13:27:17Z 2021-03-25T13:27:17Z NONE

Here's the minimal example; try running this:

```python
import time
import xarray as xr
import numpy as np
import h5py

arr = xr.DataArray(np.random.RandomState(0).randint(-100, 100, (50_000, 3)), dims=['x', 'y'])
ds = xr.Dataset({'arr': arr})

filename = 'test.h5'
save = lambda group: ds.to_netcdf(filename, engine='h5netcdf', mode='a', group=str(group))

with h5py.File(filename, 'w') as _:
    pass

for i in range(250):
    t0 = time.time()
    save(i)
    print(time.time() - t0)
```

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806713825 https://github.com/pydata/xarray/issues/2857#issuecomment-806713825 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjcxMzgyNQ== aldanor 2418513 2021-03-25T13:10:13Z 2021-03-25T13:10:13Z NONE

Is it possible to use .to_netcdf() without h5netcdf.File touching any of the pre-existing data or attempting to read or traverse it? The current behaviour inevitably causes quadratic slowdowns as you write multiple datasets to the file - and that's what seems to be happening.

Or at least, don't traverse anything above the current root group that the dataset is being written into.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806711702 https://github.com/pydata/xarray/issues/2857#issuecomment-806711702 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjcxMTcwMg== aldanor 2418513 2021-03-25T13:08:46Z 2021-03-25T13:08:46Z NONE

@kmuehlbauer Just installed h5netcdf=0.10.0; here are the timings when there are 200 groups in the file - store.close() takes 92.4% of the time again:

```
Line #      Hits         Time   Per Hit   % Time  Line Contents
  1078         1          1.0       1.0      0.0  try:
  1079                                                # TODO: allow this work (setting up the file for writing array data)
  1080                                                # to be parallelized with dask
  1081         2     221642.0  110821.0      4.2      dump_to_store(
  1082         1          2.0       2.0      0.0          dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
  1083                                                )
  1084         1          3.0       3.0      0.0      if autoclose:
  1085                                                    store.close()
  1086
  1087         1          1.0       1.0      0.0      if multifile:
  1088                                                    return writer, store
  1089
  1090         1          6.0       6.0      0.0      writes = writer.sync(compute=compute)
  1091
  1092         1          1.0       1.0      0.0      if path_or_file is None:
  1093                                                    store.sync()
  1094                                                    return target.getvalue()
  1095                                            finally:
  1096         1          2.0       2.0      0.0      if not multifile and compute:
  1097         1    4857912.0 4857912.0     92.6          store.close()
```

And here's _lookup_dimensions() (note that it accounts for only about half of the time; there's a lot of other time spent in File.flush() that I don't understand):

```
Timer unit: 1e-06 s

Total time: 2.44857 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: _lookup_dimensions at line 92

Line #      Hits         Time  Per Hit   % Time  Line Contents
    92                                           def _lookup_dimensions(self):
    93       400      65513.0    163.8      2.7      attrs = self._h5ds.attrs
    94       400       6175.0     15.4      0.3      if "_Netcdf4Coordinates" in attrs:
    95                                                   order_dim = _reverse_dict(self._parent._dim_order)
    96                                                   return tuple(
    97                                                       order_dim[coord_id] for coord_id in attrs["_Netcdf4Coordinates"]
    98                                                   )
    99
   100       400      44938.0    112.3      1.8      child_name = self.name.split("/")[-1]
   101       400       5006.0     12.5      0.2      if child_name in self._parent.dimensions:
   102                                                   return (child_name,)
   103
   104       400        350.0      0.9      0.0      dims = []
   105       400        781.0      2.0      0.0      phony_dims = defaultdict(int)
   106      1400     166093.0    118.6      6.8      for axis, dim in enumerate(self._h5ds.dims):
   107                                                   # get current dimension
   108      1000     119507.0    119.5      4.9          dimsize = self.shape[axis]
   109      1000       2459.0      2.5      0.1          phony_dims[dimsize] += 1
   110      1000      34345.0     34.3      1.4          if len(dim):
   111      1000    2001071.0   2001.1     81.7              name = _name_from_dimension(dim)
   112                                                   else:
   113                                                       # if unlabeled dimensions are found
   114                                                       if self._root._phony_dims_mode is None:
   115                                                           raise ValueError(
   116                                                               "variable %r has no dimension scale "
   117                                                               "associated with axis %s. \n"
   118                                                               "Use phony_dims=%r for sorted naming or "
   119                                                               "phony_dims=%r for per access naming."
   120                                                               % (self.name, axis, "sort", "access")
   121                                                           )
   122                                                       else:
   123                                                           # get dimension name
   124                                                           name = self._parent._phony_dims[(dimsize, phony_dims[dimsize] - 1)]
   125      1000       1820.0      1.8      0.1          dims.append(name)
   126       400        512.0      1.3      0.0      return tuple(dims)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806697600 https://github.com/pydata/xarray/issues/2857#issuecomment-806697600 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjY5NzYwMA== kmuehlbauer 5821660 2021-03-25T12:59:11Z 2021-03-25T12:59:11Z MEMBER

@aldanor Which h5netcdf version are you using? There have been changes to the _lookup_dimensions function (which should not change behaviour). I'd like to check this out; could you help with a minimal script to reproduce?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806680140 https://github.com/pydata/xarray/issues/2857#issuecomment-806680140 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjY4MDE0MA== aldanor 2418513 2021-03-25T12:48:23Z 2021-03-25T12:49:19Z NONE

There are some absolutely obscure things here, e.g. in h5netcdf.core.BaseVariable._lookup_dimensions:

For 0 datasets:

```
Timer unit: 1e-06 s

Total time: 0.005034 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: _lookup_dimensions at line 86

Line #      Hits         Time  Per Hit   % Time  Line Contents
    86                                           def _lookup_dimensions(self):
    87         2        633.0    316.5     12.6      attrs = self._h5ds.attrs
    88         2         53.0     26.5      1.1      if '_Netcdf4Coordinates' in attrs:
    89                                                   order_dim = _reverse_dict(self._parent._dim_order)
    90                                                   return tuple(order_dim[coord_id]
    91                                                                for coord_id in attrs['_Netcdf4Coordinates'])
    92
    93         2        471.0    235.5      9.4      child_name = self.name.split('/')[-1]
    94         2         51.0     25.5      1.0      if child_name in self._parent.dimensions:
    95                                                   return (child_name,)
    96
    97         2          4.0      2.0      0.1      dims = []
    98         7       1671.0    238.7     33.2      for axis, dim in enumerate(self._h5ds.dims):
    99                                                   # TODO: read dimension labels even if there is no associated
   100                                                   # scale? it's not netCDF4 spec, but it is unambiguous...
   101                                                   # Also: the netCDF lib can read HDF5 datasets with unlabeled
   102                                                   # dimensions.
   103         5        355.0     71.0      7.1          if len(dim) == 0:
   104                                                       raise ValueError('variable %r has no dimension scale '
   105                                                                        'associated with axis %s'
   106                                                                        % (self.name, axis))
   107         5       1772.0    354.4     35.2          name = _name_from_dimension(dim)
   108         5         18.0      3.6      0.4          dims.append(name)
   109         2          6.0      3.0      0.1      return tuple(dims)
```

For 200 datasets:

```
Timer unit: 1e-06 s

Total time: 2.34179 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: _lookup_dimensions at line 86

Line #      Hits         Time  Per Hit   % Time  Line Contents
    86                                           def _lookup_dimensions(self):
    87       400      66185.0    165.5      2.8      attrs = self._h5ds.attrs
    88       400       6106.0     15.3      0.3      if '_Netcdf4Coordinates' in attrs:
    89                                                   order_dim = _reverse_dict(self._parent._dim_order)
    90                                                   return tuple(order_dim[coord_id]
    91                                                                for coord_id in attrs['_Netcdf4Coordinates'])
    92
    93       400      45176.0    112.9      1.9      child_name = self.name.split('/')[-1]
    94       400       5006.0     12.5      0.2      if child_name in self._parent.dimensions:
    95                                                   return (child_name,)
    96
    97       400        317.0      0.8      0.0      dims = []
    98      1400     168708.0    120.5      7.2      for axis, dim in enumerate(self._h5ds.dims):
    99                                                   # TODO: read dimension labels even if there is no associated
   100                                                   # scale? it's not netCDF4 spec, but it is unambiguous...
   101                                                   # Also: the netCDF lib can read HDF5 datasets with unlabeled
   102                                                   # dimensions.
   103      1000      35653.0     35.7      1.5          if len(dim) == 0:
   104                                                       raise ValueError('variable %r has no dimension scale '
   105                                                                        'associated with axis %s'
   106                                                                        % (self.name, axis))
   107      1000    2012597.0   2012.6     85.9          name = _name_from_dimension(dim)
   108      1000       1640.0      1.6      0.1          dims.append(name)
   109       400        400.0      1.0      0.0      return tuple(dims)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806667029 https://github.com/pydata/xarray/issues/2857#issuecomment-806667029 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjY2NzAyOQ== aldanor 2418513 2021-03-25T12:40:18Z 2021-03-25T12:49:00Z NONE
  • All of the time in store.close() is, in turn, spent in CachingFileManager.close()
  • That time is spent in h5netcdf.File.close()
  • All of which is spent in h5netcdf.File.flush()

h5netcdf.File.flush() when there are 0 datasets in the file:

```
Timer unit: 1e-06 s

Total time: 0.006862 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: flush at line 689

Line #      Hits         Time  Per Hit   % Time  Line Contents
   689                                           def flush(self):
   690         1          4.0      4.0      0.1      if 'r' not in self._mode:
   691         1        111.0    111.0      1.6          self._set_unassigned_dimension_ids()
   692         1       3521.0   3521.0     51.3          self._create_dim_scales()
   693         1       3224.0   3224.0     47.0          self._attach_dim_scales()
   694         1          2.0      2.0      0.0      if not self._preexisting_file and self._write_ncproperties:
   695                                                   self.attrs._h5attrs['_NCProperties'] = _NC_PROPERTIES
```

h5netcdf.File.flush() when there are 200 datasets in the file (about 660 times slower, per the two totals shown):

```
Timer unit: 1e-06 s

Total time: 4.55295 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: flush at line 689

Line #      Hits         Time   Per Hit   % Time  Line Contents
   689                                            def flush(self):
   690         1          3.0       3.0      0.0      if 'r' not in self._mode:
   691         1    1148237.0 1148237.0     25.2          self._set_unassigned_dimension_ids()
   692         1     462926.0  462926.0     10.2          self._create_dim_scales()
   693         1    2941779.0 2941779.0     64.6          self._attach_dim_scales()
   694         1          2.0       2.0      0.0      if not self._preexisting_file and self._write_ncproperties:
   695                                                    self.attrs._h5attrs['_NCProperties'] = _NC_PROPERTIES
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
806651823 https://github.com/pydata/xarray/issues/2857#issuecomment-806651823 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDgwNjY1MTgyMw== aldanor 2418513 2021-03-25T12:30:39Z 2021-03-25T12:46:26Z NONE

@shoyer This problem has persisted all this time, but since I ran into it again, I did a bit of digging. (It's strange no one else has noticed it so far, as it's pretty bad.)

I've line-profiled this snippet (xarray.backends.api.to_netcdf) for various numbers of datasets already written to the file:

https://github.com/pydata/xarray/blob/8452120e52862df564a6e629d1ab5a7d392853b0/xarray/backends/api.py#L1075-L1094

| Number of datasets in file | dump_to_store() | store_open() | store.close() |
| --- | --- | --- | --- |
| 0 | 88% | 1% | 10% |
| 50 | 18% | 2% | 80% |
| 200 | 4% | 2% | 94% |

The above can be measured simply in a notebook via %lprun -f xarray.backends.api.to_netcdf test_func(). The writing was done in mode='a' with blosc:zstd compression. All datasets were written into different groups (i.e., by passing group=...).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885
478359263 https://github.com/pydata/xarray/issues/2857#issuecomment-478359263 https://api.github.com/repos/pydata/xarray/issues/2857 MDEyOklzc3VlQ29tbWVudDQ3ODM1OTI2Mw== shoyer 1217238 2019-03-31T17:03:21Z 2019-03-31T17:03:21Z MEMBER

I don't think this is expected. Can you try profiling to_netcdf() (e.g., with %prun in IPython) to see what the source of the slowdown is?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf) 427410885

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);