html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/2857#issuecomment-1010716076,https://api.github.com/repos/pydata/xarray/issues/2857,1010716076,IC_kwDOAMm_X848Pk2s,1217238,2022-01-12T07:18:57Z,2022-01-12T07:18:57Z,MEMBER,"Well done, Kai!","{""total_count"": 2, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 2, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-1010713645,https://api.github.com/repos/pydata/xarray/issues/2857,1010713645,IC_kwDOAMm_X848PkQt,5821660,2022-01-12T07:15:39Z,2022-01-12T07:15:39Z,MEMBER,"This issue is fixed to some extent since `h5netcdf 0.12.0`.
`h5netcdf` does not reach the timings of netCDF4 engine, but the improvement is quite significant.
| Number of datasets in file | netCDF4 write (ms) | h5netcdf <= 0.11.0 write (ms) | h5netcdf >= 0.12.0 write (ms) |
|-----|------|-----|-----|
| 1 | 2 | 7 | 7 |
| 250 | 104 | 1710 | 164 |
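
For reference, the ratios implied by the 250-dataset row can be worked out directly. This is only arithmetic on the numbers in the table above; the variable names are chosen for illustration.

```python
# Write times (ms) for the 250-dataset row of the table above.
netcdf4_ms = 104
h5netcdf_old_ms = 1710   # h5netcdf <= 0.11.0
h5netcdf_new_ms = 164    # h5netcdf >= 0.12.0

# Speed-up of h5netcdf between the two version ranges.
speedup = h5netcdf_old_ms / h5netcdf_new_ms
# Remaining gap between the newer h5netcdf and the netCDF4 engine.
gap = h5netcdf_new_ms / netcdf4_ms

print(f'h5netcdf speed-up: {speedup:.1f}x')
print(f'remaining gap to netCDF4: {gap:.1f}x')
```
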
The issue can be closed.
Ping @aldanor.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-999410802,https://api.github.com/repos/pydata/xarray/issues/2857,999410802,IC_kwDOAMm_X847kcxy,5821660,2021-12-22T09:11:05Z,2021-12-22T09:11:05Z,MEMBER,"FYI: `h5netcdf` has just merged a refactor of the dimension scale handling, which greatly improves the performance here. It will be released in the next version (0.13.0).
See https://github.com/h5netcdf/h5netcdf/pull/112
I'll come back once the release is out, so we can close this issue.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-825579825,https://api.github.com/repos/pydata/xarray/issues/2857,825579825,MDEyOklzc3VlQ29tbWVudDgyNTU3OTgyNQ==,5821660,2021-04-23T11:01:04Z,2021-04-23T11:01:04Z,MEMBER,@aldanor Could you please have a look into https://github.com/h5netcdf/h5netcdf/pull/101 for a fix. Any comments are very much appreciated.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-807472405,https://api.github.com/repos/pydata/xarray/issues/2857,807472405,MDEyOklzc3VlQ29tbWVudDgwNzQ3MjQwNQ==,1217238,2021-03-25T20:56:36Z,2021-03-25T20:56:36Z,MEMBER,"It appears that issues can only be moved within a GitHub organization. So I
guess we'll need to start a new one.
On Thu, Mar 25, 2021 at 12:35 PM Kai Mühlbauer wrote:
> @shoyer Could we move the entire issue? Or just open another one over at 'h5netcdf' and reference this one?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-807344131,https://api.github.com/repos/pydata/xarray/issues/2857,807344131,MDEyOklzc3VlQ29tbWVudDgwNzM0NDEzMQ==,5821660,2021-03-25T19:34:55Z,2021-03-25T19:34:55Z,MEMBER,@shoyer Could we move the entire issue? Or just open another one over at 'h5netcdf' and reference this one? ,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-807319582,https://api.github.com/repos/pydata/xarray/issues/2857,807319582,MDEyOklzc3VlQ29tbWVudDgwNzMxOTU4Mg==,1217238,2021-03-25T19:19:08Z,2021-03-25T19:19:08Z,MEMBER,"I suspect this could be solved by adding an optimization into h5netcdf to only call `_attach_dim_scales()` (and maybe some other methods) on variables/groups that have been modified (as opposed to the entire file).
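
A minimal sketch of that idea, assuming a simple per-group dirty flag. `ToyFile`, `ToyGroup` and `attach_dim_scales` are hypothetical stand-ins written for illustration; they are not the h5netcdf classes or its actual implementation.

```python
class ToyGroup:
    # Hypothetical stand-in for a group; not the h5netcdf class.
    def __init__(self, name):
        self.name = name
        self.dirty = True      # new groups need their scales attached once
        self.attach_calls = 0  # counts how often the expensive step ran

    def attach_dim_scales(self):
        # Placeholder for the expensive per-group work.
        self.attach_calls += 1
        self.dirty = False


class ToyFile:
    # Hypothetical container that only flushes modified groups.
    def __init__(self):
        self.groups = {}

    def create_group(self, name):
        group = ToyGroup(name)
        self.groups[name] = group
        return group

    def flush(self):
        # Skip groups that have not changed since the last flush.
        for group in self.groups.values():
            if group.dirty:
                group.attach_dim_scales()


toy = ToyFile()
for i in range(5):
    toy.create_group(str(i))
    toy.flush()  # one flush per appended group, as in the append loop

calls = [g.attach_calls for g in toy.groups.values()]
print(calls)
```

With the flag in place, each group is processed once no matter how many flushes follow, so the work per flush no longer grows with the number of groups already in the file.
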
It's probably worth moving the discussion over into the h5netcdf tracker, anyways :)","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-807126680,https://api.github.com/repos/pydata/xarray/issues/2857,807126680,MDEyOklzc3VlQ29tbWVudDgwNzEyNjY4MA==,2418513,2021-03-25T17:17:48Z,2021-03-25T17:18:21Z,NONE,"> OK, we might check if that depends on the data size or on the number of groups, or both.
It seems to scale with data size, but even if you reduce the data size to 1 element, after 50 iterations a single write already takes 150 ms (whereas it takes a few milliseconds in an empty file). These 150 ms are the pure 'file traversal' etc. part; the rest (of the 2 seconds) is the part where it seemingly reads stuff, which scales with data. Ideally it should just stay at <10 ms all the time.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806863336,https://api.github.com/repos/pydata/xarray/issues/2857,806863336,MDEyOklzc3VlQ29tbWVudDgwNjg2MzMzNg==,2418513,2021-03-25T14:35:28Z,2021-03-25T17:15:06Z,NONE,"> I wonder if it would help to use the same underlying `h5py.File` or `h5netcdf.File` when appending.
I don't think it's about what's happening in the current Python process or which instances are cached; it's about the general logic.
For instance, take the example above and run it once (e.g. with the range set to 50); then run it again with the block that clears the file commented out and the range set to 50-100. The very first dataset written in the second run will already be very slow, slower than the last dataset written in the first run, which means it's not about reusing the same `File` instance.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806982015,https://api.github.com/repos/pydata/xarray/issues/2857,806982015,MDEyOklzc3VlQ29tbWVudDgwNjk4MjAxNQ==,5821660,2021-03-25T15:48:35Z,2021-03-25T15:48:35Z,MEMBER,"OK, we might check if that depends on the data size or on the number of groups, or both.
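
A rough sketch of how such a check could be organised: sweep over both factors and record the time of the last append for each combination. The `sweep` helper and the `fake_write` stand-in are hypothetical and written only for illustration; nothing here touches a real file.

```python
import itertools
import time


def sweep(write, sizes, group_counts):
    # Time the last of `n_groups` appends for every (size, n_groups) pair.
    results = {}
    for size, n_groups in itertools.product(sizes, group_counts):
        state = []  # fresh, empty stand-in 'file' for each combination
        for i in range(n_groups):
            t0 = time.perf_counter()
            write(state, size, i)
            elapsed = time.perf_counter() - t0
        results[(size, n_groups)] = elapsed
    return results


def fake_write(state, size, group):
    # Hypothetical writer: appends a record instead of touching disk.
    state.append((group, size))


timings = sweep(fake_write, sizes=[1, 1000], group_counts=[1, 50])
for key in sorted(timings):
    print(key, f'{timings[key] * 1e3:.3f} ms')
```

For a real measurement, `write` would instead recreate the file for each combination and call `ds.to_netcdf(filename, engine='h5netcdf', mode='a', group=str(group))` on a dataset with `size` rows.
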
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806853536,https://api.github.com/repos/pydata/xarray/issues/2857,806853536,MDEyOklzc3VlQ29tbWVudDgwNjg1MzUzNg==,5821660,2021-03-25T14:29:24Z,2021-03-25T14:29:24Z,MEMBER,"> I wonder if it would help to use the same underlying `h5py.File` or `h5netcdf.File` when appending.
This should somehow be possible. I'll try to create a proof-of-concept script that bypasses `to_netcdf` when I find the time. If there are other ideas or solutions, please comment here. Thanks @aldanor for the intensive testing and the minimal example.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806825379,https://api.github.com/repos/pydata/xarray/issues/2857,806825379,MDEyOklzc3VlQ29tbWVudDgwNjgyNTM3OQ==,5821660,2021-03-25T14:11:43Z,2021-03-25T14:11:43Z,MEMBER,"From my understanding, part of the problem is with the use of `CachingFileManager`. Every call to `to_netcdf(filename....)` reopens this particular file (with all the downsides) and wraps it in `CachingFileManager` again. I wonder if it would help to use the same underlying `h5py.File` or `h5netcdf.File` when appending. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806776909,https://api.github.com/repos/pydata/xarray/issues/2857,806776909,MDEyOklzc3VlQ29tbWVudDgwNjc3NjkwOQ==,2418513,2021-03-25T13:48:04Z,2021-03-25T13:48:29Z,NONE,"Without digging into implementation details, my logic as a library user would be this:
- If I write one dataset to file1 and another dataset to file2 using `to_netcdf()`, to different groups
- And then I simply combine the two hdf5 files using some external tools (again, datasets stored in different groups)
- I will be able to read them both perfectly well using `open_dataset()` or `load_dataset()`
- This implies that the datasets can be written just fine independently without knowing about each other
- Why, then, do those writing functions (flush in particular) traverse and read the entire file every time?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806767981,https://api.github.com/repos/pydata/xarray/issues/2857,806767981,MDEyOklzc3VlQ29tbWVudDgwNjc2Nzk4MQ==,2418513,2021-03-25T13:44:22Z,2021-03-25T13:45:04Z,NONE,"Just checked it out.
| Number of datasets in file | netCDF4 (ms/write) | h5netcdf (ms/write) |
| --- | --- | --- |
| 1 | 4 | 11 |
| 250 | 142| 1933 |","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806759522,https://api.github.com/repos/pydata/xarray/issues/2857,806759522,MDEyOklzc3VlQ29tbWVudDgwNjc1OTUyMg==,5821660,2021-03-25T13:39:02Z,2021-03-25T13:39:02Z,MEMBER,"@aldanor If I change your example to use `engine='netcdf4'`, the times increase too, but not to the extent of the `h5netcdf` case. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806741704,https://api.github.com/repos/pydata/xarray/issues/2857,806741704,MDEyOklzc3VlQ29tbWVudDgwNjc0MTcwNA==,5821660,2021-03-25T13:27:43Z,2021-03-25T13:27:43Z,MEMBER,"@aldanor Thanks, that's what I expected (that the new version doesn't change the behaviour you are showing).
I think your assessment of the situation is correct. It looks like `to_netcdf` is re-reading the whole file when in append mode. Or, better said, the underlying machinery re-reads the complete file. Would it be possible to use `engine='netcdf4'` just to see whether it is affected too?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806740965,https://api.github.com/repos/pydata/xarray/issues/2857,806740965,MDEyOklzc3VlQ29tbWVudDgwNjc0MDk2NQ==,2418513,2021-03-25T13:27:17Z,2021-03-25T13:27:17Z,NONE,"Here's the minimal example, try running this:
```python
import time
import xarray as xr
import numpy as np
import h5py
arr = xr.DataArray(np.random.RandomState(0).randint(-100, 100, (50_000, 3)), dims=['x', 'y'])
ds = xr.Dataset({'arr': arr})
filename = 'test.h5'
save = lambda group: ds.to_netcdf(filename, engine='h5netcdf', mode='a', group=str(group))
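# Create (or truncate) an empty file to append into.
with h5py.File(filename, 'w') as _:
    pass
# Append one group per iteration and print the time each write takes.
for i in range(250):
    t0 = time.time()
    save(i)
    print(time.time() - t0)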
```","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806713825,https://api.github.com/repos/pydata/xarray/issues/2857,806713825,MDEyOklzc3VlQ29tbWVudDgwNjcxMzgyNQ==,2418513,2021-03-25T13:10:13Z,2021-03-25T13:10:13Z,NONE,"Is it possible to use `.to_netcdf()` without `h5netcdf.File` touching **any** of the pre-existing data or attempting to read it or traverse it? This will inevitably cause quadratic slowdowns as you write multiple datasets to the file - and that's what seems to be happening.
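
To make the quadratic claim concrete, here is a toy cost model. It assumes that every append walks each group already in the file exactly once; that is an assumption for illustration, not a measurement of h5netcdf. Under it, n appends cost 1 + 2 + ... + n = n(n + 1)/2 group visits in total.

```python
def total_group_visits(n_writes):
    # Count group visits if every append traverses all groups in the file.
    visits = 0
    groups_in_file = 0
    for _ in range(n_writes):
        groups_in_file += 1        # the append adds one new group
        visits += groups_in_file   # the flush then walks every group
    return visits


for n in (50, 100, 200):
    print(n, total_group_visits(n))
```

Doubling the number of datasets roughly quadruples the total work in this model.
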
Or at least, don't traverse anything above the current root group that the dataset is being written into.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806711702,https://api.github.com/repos/pydata/xarray/issues/2857,806711702,MDEyOklzc3VlQ29tbWVudDgwNjcxMTcwMg==,2418513,2021-03-25T13:08:46Z,2021-03-25T13:08:46Z,NONE,"@kmuehlbauer Just installed h5netcdf=0.10.0, here are the timings when there are 200 groups in the file - `store.close()` takes 92.6% of the time again:
```
1078 1 1.0 1.0 0.0 try:
1079 # TODO: allow this work (setting up the file for writing array data)
1080 # to be parallelized with dask
1081 2 221642.0 110821.0 4.2 dump_to_store(
1082 1 2.0 2.0 0.0 dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
1083 )
1084 1 3.0 3.0 0.0 if autoclose:
1085 store.close()
1086
1087 1 1.0 1.0 0.0 if multifile:
1088 return writer, store
1089
1090 1 6.0 6.0 0.0 writes = writer.sync(compute=compute)
1091
1092 1 1.0 1.0 0.0 if path_or_file is None:
1093 store.sync()
1094 return target.getvalue()
1095 finally:
1096 1 2.0 2.0 0.0 if not multifile and compute:
1097 1 4857912.0 4857912.0 92.6 store.close()
```
And here's `_lookup_dimensions()` (note that it only takes **half** of the time; there's tons of other time spent in `File.flush()` which I don't understand):
```
Timer unit: 1e-06 s
Total time: 2.44857 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: _lookup_dimensions at line 92
Line # Hits Time Per Hit % Time Line Contents
==============================================================
92 def _lookup_dimensions(self):
93 400 65513.0 163.8 2.7 attrs = self._h5ds.attrs
94 400 6175.0 15.4 0.3 if ""_Netcdf4Coordinates"" in attrs:
95 order_dim = _reverse_dict(self._parent._dim_order)
96 return tuple(
97 order_dim[coord_id] for coord_id in attrs[""_Netcdf4Coordinates""]
98 )
99
100 400 44938.0 112.3 1.8 child_name = self.name.split(""/"")[-1]
101 400 5006.0 12.5 0.2 if child_name in self._parent.dimensions:
102 return (child_name,)
103
104 400 350.0 0.9 0.0 dims = []
105 400 781.0 2.0 0.0 phony_dims = defaultdict(int)
106 1400 166093.0 118.6 6.8 for axis, dim in enumerate(self._h5ds.dims):
107 # get current dimension
108 1000 119507.0 119.5 4.9 dimsize = self.shape[axis]
109 1000 2459.0 2.5 0.1 phony_dims[dimsize] += 1
110 1000 34345.0 34.3 1.4 if len(dim):
111 1000 2001071.0 2001.1 81.7 name = _name_from_dimension(dim)
112 else:
113 # if unlabeled dimensions are found
114 if self._root._phony_dims_mode is None:
115 raise ValueError(
116 ""variable %r has no dimension scale ""
117 ""associated with axis %s. \n""
118 ""Use phony_dims=%r for sorted naming or ""
119 ""phony_dims=%r for per access naming.""
120 % (self.name, axis, ""sort"", ""access"")
121 )
122 else:
123 # get dimension name
124 name = self._parent._phony_dims[(dimsize, phony_dims[dimsize] - 1)]
125 1000 1820.0 1.8 0.1 dims.append(name)
126 400 512.0 1.3 0.0 return tuple(dims)
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806697600,https://api.github.com/repos/pydata/xarray/issues/2857,806697600,MDEyOklzc3VlQ29tbWVudDgwNjY5NzYwMA==,5821660,2021-03-25T12:59:11Z,2021-03-25T12:59:11Z,MEMBER,"@aldanor Which `h5netcdf`-version are you using? There have been changes to the `_lookup_dimensions`-function (which should not change behaviour). I'd try to check this out, could you help with a minimal script to reproduce?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806680140,https://api.github.com/repos/pydata/xarray/issues/2857,806680140,MDEyOklzc3VlQ29tbWVudDgwNjY4MDE0MA==,2418513,2021-03-25T12:48:23Z,2021-03-25T12:49:19Z,NONE,"There's some absolutely obscure things here, e.g. `h5netcdf.core.BaseVariable._lookup_dimensions`:
For 0 datasets:
```
Timer unit: 1e-06 s
Total time: 0.005034 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: _lookup_dimensions at line 86
Line # Hits Time Per Hit % Time Line Contents
==============================================================
86 def _lookup_dimensions(self):
87 2 633.0 316.5 12.6 attrs = self._h5ds.attrs
88 2 53.0 26.5 1.1 if '_Netcdf4Coordinates' in attrs:
89 order_dim = _reverse_dict(self._parent._dim_order)
90 return tuple(order_dim[coord_id]
91 for coord_id in attrs['_Netcdf4Coordinates'])
92
93 2 471.0 235.5 9.4 child_name = self.name.split('/')[-1]
94 2 51.0 25.5 1.0 if child_name in self._parent.dimensions:
95 return (child_name,)
96
97 2 4.0 2.0 0.1 dims = []
98 7 1671.0 238.7 33.2 for axis, dim in enumerate(self._h5ds.dims):
99 # TODO: read dimension labels even if there is no associated
100 # scale? it's not netCDF4 spec, but it is unambiguous...
101 # Also: the netCDF lib can read HDF5 datasets with unlabeled
102 # dimensions.
103 5 355.0 71.0 7.1 if len(dim) == 0:
104 raise ValueError('variable %r has no dimension scale '
105 'associated with axis %s'
106 % (self.name, axis))
107 5 1772.0 354.4 35.2 name = _name_from_dimension(dim)
108 5 18.0 3.6 0.4 dims.append(name)
109 2 6.0 3.0 0.1 return tuple(dims)
```
For 200 datasets:
```
Timer unit: 1e-06 s
Total time: 2.34179 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: _lookup_dimensions at line 86
Line # Hits Time Per Hit % Time Line Contents
==============================================================
86 def _lookup_dimensions(self):
87 400 66185.0 165.5 2.8 attrs = self._h5ds.attrs
88 400 6106.0 15.3 0.3 if '_Netcdf4Coordinates' in attrs:
89 order_dim = _reverse_dict(self._parent._dim_order)
90 return tuple(order_dim[coord_id]
91 for coord_id in attrs['_Netcdf4Coordinates'])
92
93 400 45176.0 112.9 1.9 child_name = self.name.split('/')[-1]
94 400 5006.0 12.5 0.2 if child_name in self._parent.dimensions:
95 return (child_name,)
96
97 400 317.0 0.8 0.0 dims = []
98 1400 168708.0 120.5 7.2 for axis, dim in enumerate(self._h5ds.dims):
99 # TODO: read dimension labels even if there is no associated
100 # scale? it's not netCDF4 spec, but it is unambiguous...
101 # Also: the netCDF lib can read HDF5 datasets with unlabeled
102 # dimensions.
103 1000 35653.0 35.7 1.5 if len(dim) == 0:
104 raise ValueError('variable %r has no dimension scale '
105 'associated with axis %s'
106 % (self.name, axis))
107 1000 2012597.0 2012.6 85.9 name = _name_from_dimension(dim)
108 1000 1640.0 1.6 0.1 dims.append(name)
109 400 400.0 1.0 0.0 return tuple(dims)
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806667029,https://api.github.com/repos/pydata/xarray/issues/2857,806667029,MDEyOklzc3VlQ29tbWVudDgwNjY2NzAyOQ==,2418513,2021-03-25T12:40:18Z,2021-03-25T12:49:00Z,NONE,"- All of the time in `store.close()` is, in turn, spent in `CachingFileManager.close()`
- That time is spent in `h5netcdf.File.close()`
- All of which is spent in `h5netcdf.File.flush()`
`h5netcdf.File.flush()` when there's 0 datasets in file:
```
0.21619391441345215
Timer unit: 1e-06 s
Total time: 0.006862 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: flush at line 689
Line # Hits Time Per Hit % Time Line Contents
==============================================================
689 def flush(self):
690 1 4.0 4.0 0.1 if 'r' not in self._mode:
691 1 111.0 111.0 1.6 self._set_unassigned_dimension_ids()
692 1 3521.0 3521.0 51.3 self._create_dim_scales()
693 1 3224.0 3224.0 47.0 self._attach_dim_scales()
694 1 2.0 2.0 0.0 if not self._preexisting_file and self._write_ncproperties:
695 self.attrs._h5attrs['_NCProperties'] = _NC_PROPERTIES
```
`h5netcdf.File.flush()` when there's 200 datasets in file (**about 660 times slower**):
```
Timer unit: 1e-06 s
Total time: 4.55295 s
File: .../python3.8/site-packages/h5netcdf/core.py
Function: flush at line 689
Line # Hits Time Per Hit % Time Line Contents
==============================================================
689 def flush(self):
690 1 3.0 3.0 0.0 if 'r' not in self._mode:
691 1 1148237.0 1148237.0 25.2 self._set_unassigned_dimension_ids()
692 1 462926.0 462926.0 10.2 self._create_dim_scales()
693 1 2941779.0 2941779.0 64.6 self._attach_dim_scales()
694 1 2.0 2.0 0.0 if not self._preexisting_file and self._write_ncproperties:
695 self.attrs._h5attrs['_NCProperties'] = _NC_PROPERTIES
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-806651823,https://api.github.com/repos/pydata/xarray/issues/2857,806651823,MDEyOklzc3VlQ29tbWVudDgwNjY1MTgyMw==,2418513,2021-03-25T12:30:39Z,2021-03-25T12:46:26Z,NONE,"@shoyer This problem has persisted all this time, and since I faced it again, I did a bit of digging. (It's strange no one else has noticed it so far, as it's pretty bad.)
I've line-profiled this snippet (`xarray.backends.api.to_netcdf`) for various numbers of datasets already written to the file:
https://github.com/pydata/xarray/blob/8452120e52862df564a6e629d1ab5a7d392853b0/xarray/backends/api.py#L1075-L1094
| Number of datasets in file | `dump_to_store()` | `store_open()` | `store.close()` |
| --- | --- | --- | --- |
| 0 | 88% | 1% | 10% |
| 50 | 18% | 2% | 80% |
| 200 | 4% | 2% | 94% |
The above can be measured simply in a notebook via `%lprun -f xarray.backends.api.to_netcdf test_func()`. The writing was done in `mode='a'`, with blosc:zstd compression. All datasets are written into *different groups* (i.e. by passing `group=...`).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885
https://github.com/pydata/xarray/issues/2857#issuecomment-478359263,https://api.github.com/repos/pydata/xarray/issues/2857,478359263,MDEyOklzc3VlQ29tbWVudDQ3ODM1OTI2Mw==,1217238,2019-03-31T17:03:21Z,2019-03-31T17:03:21Z,MEMBER,"I don't think this is expected. Can you try profiling `to_netcdf()` (e.g., `%prun` in IPython) to see what the source of the slowdown is?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,427410885