issue_comments
18 rows where author_association = "NONE" and issue = 224553135 sorted by updated_at descending
id | html_url | issue_url | node_id | user | created_at | updated_at ▲ | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
1043022273 | https://github.com/pydata/xarray/issues/1385#issuecomment-1043022273 | https://api.github.com/repos/pydata/xarray/issues/1385 | IC_kwDOAMm_X84-K0HB | jtomfarrar 44488331 | 2022-02-17T14:42:41Z | 2022-02-17T14:42:41Z | NONE | Thank you. A member of my research group made the netcdf file, so we will make a second file with the time encoding fixed. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
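The fix described in the comment above -- rewriting the file with corrected time encoding -- might look like the minimal sketch below. Everything here is an assumption for illustration: `sample.nc` is a placeholder filename, the dimension is assumed to be named `time`, and the chosen units string is arbitrary.

```python
# Hypothetical sketch: rewrite a netCDF file so the time coordinate uses
# standard CF encoding that xarray can decode quickly.
import numpy as np
import xarray as xr

ds = xr.open_dataset("sample.nc", decode_times=False)  # placeholder file

# Re-express the time coordinate with standard CF units so decoding is
# cheap and unambiguous on the next read (offsets are assumed hourly here).
ds = ds.assign_coords(time=np.arange(ds.sizes["time"], dtype="float64"))
ds["time"].attrs.update(units="hours since 2022-01-01 00:00:00",
                        calendar="standard")

ds.to_netcdf("sample_fixed_time.nc")
```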
1043009735 | https://github.com/pydata/xarray/issues/1385#issuecomment-1043009735 | https://api.github.com/repos/pydata/xarray/issues/1385 | IC_kwDOAMm_X84-KxDH | jtomfarrar 44488331 | 2022-02-17T14:30:03Z | 2022-02-17T14:30:03Z | NONE | Thank you, Ryan. I will post the file to a server with a stable URL and replace the google drive link in the other post. My original issue was that I wanted to not read the data (yet), only to have a look at the metadata. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
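For the "look at the metadata without reading the data" use case mentioned above, a minimal sketch follows; `file.nc` is a placeholder. `open_dataset` keeps data variables lazy, and `decode_times=False` skips the (sometimes slow) conversion of raw time values to datetimes.

```python
import xarray as xr

ds = xr.open_dataset("file.nc", decode_times=False)  # placeholder file
print(ds)        # dims, coords, variables, attrs -- no data values printed
print(ds.attrs)  # global attributes only
ds.close()
```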
1042962960 | https://github.com/pydata/xarray/issues/1385#issuecomment-1042962960 | https://api.github.com/repos/pydata/xarray/issues/1385 | IC_kwDOAMm_X84-KloQ | jtomfarrar 44488331 | 2022-02-17T13:43:21Z | 2022-02-17T13:43:21Z | NONE | Thanks, Ryan! Sure-- here's a link to the file: https://drive.google.com/file/d/1-05bG2kF8wbvldYtDpZ3LYLyqXnvZyw1/view?usp=sharing (I could post to a web server if there's any reason to prefer that.) |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
1042930077 | https://github.com/pydata/xarray/issues/1385#issuecomment-1042930077 | https://api.github.com/repos/pydata/xarray/issues/1385 | IC_kwDOAMm_X84-Kdmd | jtomfarrar 44488331 | 2022-02-17T13:06:18Z | 2022-02-17T13:06:18Z | NONE | @rabernat wrote:
I seem to be experiencing a similar (same?) issue with open_dataset: https://stackoverflow.com/questions/71147712/can-i-force-xarray-open-dataset-to-do-a-lazy-load?stw=2 |
{ "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 1, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
781407863 | https://github.com/pydata/xarray/issues/1385#issuecomment-781407863 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDc4MTQwNzg2Mw== | jameshalgren 53343824 | 2021-02-18T15:06:13Z | 2021-02-18T15:06:13Z | NONE |
Indeed @dcherian -- it took some experimentation to get the right engine to support parallel execution, and even then results are still mixed, which, to me, means further work is needed to isolate the issue. Along the lines of suggestions here (thanks @jmccreight for pointing this out), we've introduced a very practical pre-processing step to rewrite the datasets so that the read is not striped across the file system, effectively isolating the performance bottleneck to a position where it can be dealt with independently. Of course, such an asynchronous workflow is not possible in all situations, so we're still looking at improving the direct performance. Two notes as we keep working:
- The preprocessor. Reading and re-manipulating an individual dataset is lightning fast. We saw that a small change or adjustment in the individual files, made with a preprocessor, made the multi-file read massively faster.
- The "more sophisticated example" referenced here has proven to be very useful. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
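The per-file pre-processing the comment above describes maps onto the `preprocess` argument of `open_mfdataset`. A sketch follows, using the `drop_all_coords` helper that appears later in this thread; the glob pattern is a placeholder.

```python
import xarray as xr

def drop_all_coords(ds):
    # Strip non-index coordinates from each file before concatenation,
    # so the per-file datasets are smaller and cheaper to align.
    return ds.reset_coords(drop=True)

ds = xr.open_mfdataset(
    "CHRTOUT_*.nc",             # placeholder glob
    preprocess=drop_all_coords,  # applied to each file as it is opened
    combine="by_coords",
)
```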
756922963 | https://github.com/pydata/xarray/issues/1385#issuecomment-756922963 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDc1NjkyMjk2Mw== | jameshalgren 53343824 | 2021-01-08T18:26:44Z | 2021-01-08T18:34:49Z | NONE | @dcherian We had looked at a number of options. In the end, the best performance I could achieve was with the work-around pre-processor script, rather than any of the built-in options. It's worth noting that a major part of the slowdown we were experiencing was from the dataframe transform we were doing after reading the files. Once that was fixed, performance was much better, but not necessarily with any of the expected options. This script, reading one day's worth of NWM q_laterals, runs in about 8 seconds (on Cheyenne). If you change the globbing pattern to include a full month, it takes about 380 seconds. We are reading everything into memory, which negates the lazy-access benefits of using a dataset, and our next steps include looking into that. 300 seconds to read a month isn't totally unacceptable, but we'd like it to be faster for the operational runs we'll eventually be doing -- for longer simulations, we may be able to achieve some improvement with asynchronous data access. We'll keep looking into it. (We'll start by trying to adapt the "slightly more sophisticated example" under the docs you referenced here...) Thanks (for the great package and for getting back on this question!)

```python
# /glade/scratch/halgren/qlat_mfopen_test.py
import time

import pandas as pd
import xarray as xr


def get_ql_from_wrf_hydro_mf(
    qlat_files, index_col="feature_id", value_col="q_lateral"
):
    """
    qlat_files: globbed list of CHRTOUT files containing desired lateral inflows
    index_col: column/field in the CHRTOUT files with the segment/link id
    value_col: column/field in the CHRTOUT files with the lateral inflow value
    """
    ...  # (function body elided in the original comment)


def drop_all_coords(ds):
    return ds.reset_coords(drop=True)


def main():
    ...  # (function body elided in the original comment)


if __name__ == "__main__":
    main()
```

@groutr, @jmccreight |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
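The body of `get_ql_from_wrf_hydro_mf` is elided in the comment above. A hedged sketch of what such a reader could look like, based only on its docstring, follows; the combine strategy and the pivot to a DataFrame are assumptions, not the authors' code.

```python
import xarray as xr

def get_ql_from_wrf_hydro_mf(qlat_files, index_col="feature_id", value_col="q_lateral"):
    # Open the globbed CHRTOUT files as one dataset along their coordinates,
    # then pivot to one row per timestep, one column per segment/link id.
    with xr.open_mfdataset(qlat_files, combine="by_coords") as ds:
        df = ds[value_col].to_dataframe()[value_col].unstack(index_col)
    return df
```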
756364564 | https://github.com/pydata/xarray/issues/1385#issuecomment-756364564 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDc1NjM2NDU2NA== | jameshalgren 53343824 | 2021-01-07T20:28:32Z | 2021-01-07T20:28:32Z | NONE | @rabernat Is the test dataset you mention still somewhere on Cheyenne? We're seeing general slowness processing multi-file netcdf output from the National Water Model (our project here: NOAA-OWP/t-route) and would like to see how things compare to your mini-benchmark test. cc @groutr
|
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
685141540 | https://github.com/pydata/xarray/issues/1385#issuecomment-685141540 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDY4NTE0MTU0MA== | dksasaki 17645581 | 2020-09-01T21:25:24Z | 2020-09-01T21:25:24Z | NONE | Hi, I have used xarray for a few years now and always had this slow performance associated with open_mfdataset. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
561900194 | https://github.com/pydata/xarray/issues/1385#issuecomment-561900194 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDU2MTkwMDE5NA== | keltonhalbert 1411265 | 2019-12-04T23:57:07Z | 2019-12-04T23:57:07Z | NONE | So is there any word on a best practice, fix, or workaround with the MFDataset performance? Still getting abysmal reading performance with a list of NetCDF files that represent sequential times. I want to use MFDataset to chunk multiple time steps into memory at once, but it's taking 5-10 minutes to construct MFDataset objects and even longer to call .values on them. |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
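One workaround that recurs later in this thread is to skip CF decoding while combining, then decode once at the end. A sketch under that assumption follows; the glob pattern is a placeholder, and `combine="nested"` reflects the newer `open_mfdataset` signature (older versions took `concat_dim` alone).

```python
import glob
import xarray as xr

file_list = sorted(glob.glob("mydata_*.nc"))  # placeholder pattern

# Combine with CF decoding switched off, then decode a single time at the end,
# instead of decoding each file during the multi-file open.
ds = xr.open_mfdataset(file_list, decode_cf=False,
                       combine="nested", concat_dim="time")
ds = xr.decode_cf(ds)
```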
464100720 | https://github.com/pydata/xarray/issues/1385#issuecomment-464100720 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDQ2NDEwMDcyMA== | chuaxr 30007270 | 2019-02-15T15:57:01Z | 2019-02-15T18:33:31Z | NONE | In that case, the speedup disappears. It seems that the slowdown arises from the entire time array being loaded into memory at once. EDIT: I subsequently realized that using drop_variables = 'time' caused all the data values to become nan, which makes that an invalid option.

```
%prun ds = xr.open_mfdataset(fname,decode_times=False)

         8025 function calls (7856 primitive calls) in 29.662 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4   29.608    7.402   29.608    7.402 {built-in method _operator.getitem}
        1    0.032    0.032    0.032    0.032 netCDF4_.py:244(_open_netcdf4_group)
        1    0.015    0.015    0.015    0.015 {built-in method posix.lstat}
  126/114    0.000    0.000    0.001    0.000 indexing.py:504(shape)
     1196    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
       81    0.000    0.000    0.001    0.000 variable.py:239(__init__)
```

See the rest of the prun output under the Details for more information:

```
30 0.000 0.000 0.000 0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects}
81 0.000 0.000 0.000 0.000 variable.py:709(attrs)
736/672 0.000 0.000 0.000 0.000 {built-in method builtins.len}
157 0.000 0.000 0.001 0.000 utils.py:450(ndim)
81 0.000 0.000 0.001 0.000 variable.py:417(_parse_dimensions)
7 0.000 0.000 0.001 0.000 netCDF4_.py:361(open_store_variable)
4 0.000 0.000 0.000 0.000 base.py:253(__new__)
1 0.000 0.000 29.662 29.662 <string>:1(<module>)
7 0.000 0.000 0.001 0.000 conventions.py:245(decode_cf_variable)
39/19 0.000 0.000 29.609 1.558 {built-in method numpy.core.multiarray.array}
9 0.000 0.000 0.000 0.000 core.py:1776(normalize_chunks)
104 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
143 0.000 0.000 0.001 0.000 variable.py:272(shape)
4 0.000 0.000 0.000 0.000 utils.py:88(_StartCountStride)
8 0.000 0.000 0.000 0.000 core.py:747(blockdims_from_blockshape)
23 0.000 0.000 0.032 0.001 file_manager.py:150(acquire)
8 0.000 0.000 0.000 0.000 base.py:590(tokenize)
84 0.000 0.000 0.000 0.000 variable.py:137(as_compatible_data)
268 0.000 0.000 0.000 0.000 {method 'indices' of 'slice' objects}
14 0.000 0.000 29.610 2.115 variable.py:41(as_variable)
35 0.000 0.000 0.000 0.000 variables.py:102(unpack_for_decoding)
81 0.000 0.000 0.000 0.000 variable.py:721(encoding)
192 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
2 0.000 0.000 0.000 0.000 merge.py:109(merge_variables)
2 0.000 0.000 29.610 14.805 merge.py:392(merge_core)
7 0.000 0.000 0.000 0.000 variables.py:161(<setcomp>)
103 0.000 0.000 0.000 0.000 {built-in method _abc._abc_instancecheck}
1 0.000 0.000 0.001 0.001 conventions.py:351(decode_cf_variables)
3 0.000 0.000 0.000 0.000 dataset.py:90(calculate_dimensions)
1 0.000 0.000 0.000 0.000 {built-in method posix.stat}
361 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
20 0.000 0.000 0.000 0.000 variable.py:728(copy)
23 0.000 0.000 0.000 0.000 lru_cache.py:40(__getitem__)
12 0.000 0.000 0.000 0.000 base.py:504(_simple_new)
2 0.000 0.000 0.000 0.000 variable.py:1985(assert_unique_multiindex_level_names)
2 0.000 0.000 0.000 0.000 alignment.py:172(deep_align)
14 0.000 0.000 0.000 0.000 indexing.py:469(__init__)
16 0.000 0.000 29.609 1.851 variable.py:1710(__init__)
1 0.000 0.000 29.662 29.662 {built-in method builtins.exec}
25 0.000 0.000 0.000 0.000 contextlib.py:81(__init__)
7 0.000 0.000 0.000 0.000 {method 'getncattr' of 'netCDF4._netCDF4.Dataset' objects}
24 0.000 0.000 0.000 0.000 indexing.py:331(as_integer_slice)
50/46 0.000 0.000 0.000 0.000 common.py:181(__setattr__)
7 0.000 0.000 0.000 0.000 variables.py:155(decode)
4 0.000 0.000 29.609 7.402 indexing.py:760(explicit_indexing_adapter)
48 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:416(parent)
103 0.000 0.000 0.000 0.000 abc.py:137(__instancecheck__)
48 0.000 0.000 0.000 0.000 _collections_abc.py:742(__iter__)
180 0.000 0.000 0.000 0.000 variable.py:411(dims)
4 0.000 0.000 0.000 0.000 locks.py:158(__exit__)
3 0.000 0.000 0.001 0.000 core.py:2048(from_array)
1 0.000 0.000 29.612 29.612 conventions.py:412(decode_cf)
4 0.000 0.000 0.000 0.000 utils.py:50(_maybe_cast_to_cftimeindex)
77/59 0.000 0.000 0.000 0.000 utils.py:473(dtype)
84 0.000 0.000 0.000 0.000 generic.py:7(_check)
146 0.000 0.000 0.000 0.000 indexing.py:319(tuple)
7 0.000 0.000 0.000 0.000 netCDF4_.py:34(__init__)
1 0.000 0.000 29.614 29.614 api.py:270(maybe_decode_store)
1 0.000 0.000 29.662 29.662 api.py:487(open_mfdataset)
20 0.000 0.000 0.000 0.000 common.py:1845(_is_dtype_type)
33 0.000 0.000 0.000 0.000 core.py:1911(<genexpr>)
84 0.000 0.000 0.000 0.000 variable.py:117(_maybe_wrap_data)
3 0.000 0.000 0.001 0.000 variable.py:830(chunk)
25 0.000 0.000 0.000 0.000 contextlib.py:237(helper)
36/25 0.000 0.000 0.000 0.000 utils.py:477(shape)
8 0.000 0.000 0.000 0.000 base.py:566(_shallow_copy)
8 0.000 0.000 0.000 0.000 indexing.py:346(__init__)
26/25 0.000 0.000 0.000 0.000 utils.py:408(__call__)
4 0.000 0.000 0.000 0.000 indexing.py:886(_decompose_outer_indexer)
2 0.000 0.000 29.610 14.805 merge.py:172(expand_variable_dicts)
4 0.000 0.000 29.608 7.402 netCDF4_.py:67(_getitem)
2 0.000 0.000 0.000 0.000 dataset.py:722(copy)
7 0.000 0.000 0.001 0.000 dataset.py:1383(maybe_chunk)
16 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.empty}
14 0.000 0.000 0.000 0.000 fromnumeric.py:1471(ravel)
60 0.000 0.000 0.000 0.000 base.py:652(__len__)
3 0.000 0.000 0.000 0.000 core.py:141(getem)
25 0.000 0.000 0.000 0.000 contextlib.py:116(__exit__)
4 0.000 0.000 29.609 7.402 utils.py:62(safe_cast_to_index)
18 0.000 0.000 0.000 0.000 core.py:891(shape)
25 0.000 0.000 0.000 0.000 contextlib.py:107(__enter__)
4 0.000 0.000 0.001 0.000 utils.py:332(FrozenOrderedDict)
8 0.000 0.000 0.000 0.000 base.py:1271(set_names)
4 0.000 0.000 0.000 0.000 numeric.py:34(__new__)
24 0.000 0.000 0.000 0.000 inference.py:253(is_list_like)
3 0.000 0.000 0.000 0.000 core.py:820(__new__)
12 0.000 0.000 0.000 0.000 variable.py:1785(copy)
36 0.000 0.000 0.000 0.000 {method 'copy' of 'collections.OrderedDict' objects}
8/7 0.000 0.000 0.000 0.000 {built-in method builtins.sorted}
2 0.000 0.000 0.000 0.000 merge.py:220(determine_coords)
46 0.000 0.000 0.000 0.000 file_manager.py:141(_optional_lock)
60 0.000 0.000 0.000 0.000 indexing.py:1252(shape)
50 0.000 0.000 0.000 0.000 {built-in method builtins.next}
59 0.000 0.000 0.000 0.000 {built-in method builtins.iter}
54 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
1 0.000 0.000 0.000 0.000 api.py:146(_protect_dataset_variables_inplace)
1 0.000 0.000 29.646 29.646 api.py:162(open_dataset)
4 0.000 0.000 0.000 0.000 utils.py:424(_out_array_shape)
4 0.000 0.000 29.609 7.402 indexing.py:1224(__init__)
24 0.000 0.000 0.000 0.000 function_base.py:241(iterable)
4 0.000 0.000 0.000 0.000 dtypes.py:968(is_dtype)
2 0.000 0.000 0.000 0.000 merge.py:257(coerce_pandas_values)
14 0.000 0.000 0.000 0.000 missing.py:105(_isna_new)
8 0.000 0.000 0.000 0.000 variable.py:1840(to_index)
7 0.000 0.000 0.000 0.000 {method 'search' of 're.Pattern' objects}
48 0.000 0.000 0.000 0.000 {method 'rpartition' of 'str' objects}
7 0.000 0.000 0.000 0.000 strings.py:66(decode)
7 0.000 0.000 0.000 0.000 netCDF4_.py:257(_disable_auto_decode_variable)
14 0.000 0.000 0.000 0.000 numerictypes.py:619(issubclass_)
24/4 0.000 0.000 29.609 7.402 numeric.py:433(asarray)
7 0.000 0.000 0.000 0.000 {method 'ncattrs' of 'netCDF4._netCDF4.Variable' objects}
8 0.000 0.000 0.000 0.000 numeric.py:67(_shallow_copy)
8 0.000 0.000 0.000 0.000 indexing.py:373(__init__)
3 0.000 0.000 0.000 0.000 core.py:134(<listcomp>)
14 0.000 0.000 0.000 0.000 merge.py:154(<listcomp>)
16 0.000 0.000 0.000 0.000 dataset.py:816(<genexpr>)
11 0.000 0.000 0.000 0.000 netCDF4_.py:56(get_array)
40 0.000 0.000 0.000 0.000 utils.py:40(_find_dim)
22 0.000 0.000 0.000 0.000 core.py:1893(<genexpr>)
27 0.000 0.000 0.000 0.000 {built-in method builtins.all}
26/10 0.000 0.000 0.000 0.000 {built-in method builtins.sum}
2 0.000 0.000 0.000 0.000 dataset.py:424(attrs)
7 0.000 0.000 0.000 0.000 variables.py:231(decode)
1 0.000 0.000 0.000 0.000 file_manager.py:66(__init__)
67 0.000 0.000 0.000 0.000 utils.py:316(__getitem__)
22 0.000 0.000 0.000 0.000 {method 'move_to_end' of 'collections.OrderedDict' objects}
53 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 combine.py:374(_infer_concat_order_from_positions)
7 0.000 0.000 0.000 0.000 dataset.py:1378(selkeys)
1 0.000 0.000 0.001 0.001 dataset.py:1333(chunk)
4 0.000 0.000 29.609 7.402 netCDF4_.py:62(__getitem__)
37 0.000 0.000 0.000 0.000 netCDF4_.py:365(<genexpr>)
18 0.000 0.000 0.000 0.000 {method 'ravel' of 'numpy.ndarray' objects}
2 0.000 0.000 0.000 0.000 alignment.py:37(align)
14 0.000 0.000 0.000 0.000 {pandas._libs.lib.is_scalar}
8 0.000 0.000 0.000 0.000 base.py:1239(_set_names)
16 0.000 0.000 0.000 0.000 indexing.py:314(__init__)
3 0.000 0.000 0.000 0.000 config.py:414(get)
7 0.000 0.000 0.000 0.000 dtypes.py:68(maybe_promote)
8 0.000 0.000 0.000 0.000 variable.py:1856(level_names)
37 0.000 0.000 0.000 0.000 {method 'copy' of 'dict' objects}
6 0.000 0.000 0.000 0.000 re.py:180(search)
6 0.000 0.000 0.000 0.000 re.py:271(_compile)
8 0.000 0.000 0.000 0.000 {built-in method _hashlib.openssl_md5}
1 0.000 0.000 0.000 0.000 merge.py:463(merge)
7 0.000 0.000 0.000 0.000 variables.py:158(<listcomp>)
7 0.000 0.000 0.000 0.000 numerictypes.py:687(issubdtype)
6 0.000 0.000 0.000 0.000 utils.py:510(is_remote_uri)
8 0.000 0.000 0.000 0.000 common.py:1702(is_extension_array_dtype)
25 0.000 0.000 0.000 0.000 indexing.py:645(as_indexable)
21 0.000 0.000 0.000 0.000 {method 'pop' of 'collections.OrderedDict' objects}
19 0.000 0.000 0.000 0.000 {built-in method __new__ of type object at 0x2b324a13e3c0}
1 0.000 0.000 0.001 0.001 dataset.py:1394(<listcomp>)
21 0.000 0.000 0.000 0.000 variables.py:117(pop_to)
1 0.000 0.000 0.032 0.032 netCDF4_.py:320(open)
8 0.000 0.000 0.000 0.000 netCDF4_.py:399(<genexpr>)
12 0.000 0.000 0.000 0.000 __init__.py:221(iteritems)
4 0.000 0.000 0.000 0.000 common.py:403(is_datetime64_dtype)
8 0.000 0.000 0.000 0.000 common.py:1809(_get_dtype)
8 0.000 0.000 0.000 0.000 dtypes.py:68(find)
8 0.000 0.000 0.000 0.000 base.py:3607(values)
22 0.000 0.000 0.000 0.000 pycompat.py:32(move_to_end)
8 0.000 0.000 0.000 0.000 utils.py:792(__exit__)
3 0.000 0.000 0.000 0.000 highlevelgraph.py:84(from_collections)
22 0.000 0.000 0.000 0.000 core.py:1906(<genexpr>)
16 0.000 0.000 0.000 0.000 abc.py:141(__subclasscheck__)
1 0.000 0.000 0.000 0.000 posixpath.py:104(split)
1 0.000 0.000 0.001 0.001 combine.py:479(_auto_combine_all_along_first_dim)
1 0.000 0.000 29.610 29.610 dataset.py:321(__init__)
4 0.000 0.000 0.000 0.000 dataset.py:643(_construct_direct)
7 0.000 0.000 0.000 0.000 variables.py:266(decode)
1 0.000 0.000 0.032 0.032 netCDF4_.py:306(__init__)
14 0.000 0.000 0.000 0.000 numeric.py:504(asanyarray)
4 0.000 0.000 0.000 0.000 common.py:503(is_period_dtype)
8 0.000 0.000 0.000 0.000 common.py:1981(pandas_dtype)
12 0.000 0.000 0.000 0.000 base.py:633(_reset_identity)
11 0.000 0.000 0.000 0.000 pycompat.py:18(iteritems)
16 0.000 0.000 0.000 0.000 utils.py:279(is_integer)
14 0.000 0.000 0.000 0.000 variable.py:268(dtype)
4 0.000 0.000 0.000 0.000 indexing.py:698(_outer_to_numpy_indexer)
42 0.000 0.000 0.000 0.000 variable.py:701(attrs)
9 0.000 0.000 0.000 0.000 {built-in method builtins.any}
1 0.000 0.000 0.000 0.000 posixpath.py:338(normpath)
6 0.000 0.000 0.000 0.000 _collections_abc.py:676(items)
24 0.000 0.000 0.000 0.000 {built-in method math.isnan}
1 0.000 0.000 29.610 29.610 merge.py:360(merge_data_and_coords)
1 0.000 0.000 0.000 0.000 dataset.py:1084(set_coords)
1 0.000 0.000 0.001 0.001 common.py:99(load)
1 0.000 0.000 0.000 0.000 file_manager.py:250(decrement)
4 0.000 0.000 0.000 0.000 locks.py:154(__enter__)
7 0.000 0.000 0.000 0.000 netCDF4_.py:160(_ensure_fill_value_valid)
8 0.000 0.000 0.001 0.000 netCDF4_.py:393(<genexpr>)
8 0.000 0.000 0.000 0.000 common.py:572(is_categorical_dtype)
16 0.000 0.000 0.000 0.000 base.py:75(is_dtype)
72 0.000 0.000 0.000 0.000 indexing.py:327(as_integer_or_none)
26 0.000 0.000 0.000 0.000 utils.py:382(dispatch)
3 0.000 0.000 0.000 0.000 core.py:123(slices_from_chunks)
16 0.000 0.000 0.000 0.000 core.py:768(<genexpr>)
4 0.000 0.000 29.609 7.402 indexing.py:514(__array__)
4 0.000 0.000 0.000 0.000 indexing.py:1146(__init__)
4 0.000 0.000 0.000 0.000 indexing.py:1153(_indexing_array_and_key)
4 0.000 0.000 29.609 7.402 variable.py:400(to_index_variable)
30 0.000 0.000 0.000 0.000 {method 'items' of 'collections.OrderedDict' objects}
16 0.000 0.000 0.000 0.000 {built-in method _abc._abc_subclasscheck}
19 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}
1 0.000 0.000 0.000 0.000 combine.py:423(_check_shape_tile_ids)
4 0.000 0.000 0.000 0.000 merge.py:91(_assert_compat_valid)
12 0.000 0.000 0.000 0.000 dataset.py:263(<genexpr>)
1 0.000 0.000 29.610 29.610 dataset.py:372(_set_init_vars_and_dims)
3 0.000 0.000 0.000 0.000 dataset.py:413(_attrs_copy)
8 0.000 0.000 0.000 0.000 common.py:120(<genexpr>)
14 0.000 0.000 0.000 0.000 {built-in method pandas._libs.missing.checknull}
4 0.000 0.000 0.000 0.000 common.py:746(is_dtype_equal)
4 0.000 0.000 0.000 0.000 common.py:923(is_signed_integer_dtype)
4 0.000 0.000 0.000 0.000 common.py:1545(is_float_dtype)
14 0.000 0.000 0.000 0.000 missing.py:25(isna)
3 0.000 0.000 0.000 0.000 highlevelgraph.py:71(__init__)
3 0.000 0.000 0.000 0.000 core.py:137(<listcomp>)
33 0.000 0.000 0.000 0.000 core.py:1883(<genexpr>)
35 0.000 0.000 0.000 0.000 variable.py:713(encoding)
2 0.000 0.000 0.000 0.000 {built-in method builtins.min}
16 0.000 0.000 0.000 0.000 _collections_abc.py:719(__iter__)
8 0.000 0.000 0.000 0.000 _collections_abc.py:760(__iter__)
1 0.000 0.000 0.015 0.015 glob.py:9(glob)
2 0.000 0.000 0.015 0.008 glob.py:39(_iglob)
8 0.000 0.000 0.000 0.000 {method 'hexdigest' of '_hashlib.HASH' objects}
1 0.000 0.000 0.000 0.000 combine.py:500(_auto_combine_1d)
14 0.000 0.000 0.000 0.000 merge.py:104(__missing__)
1 0.000 0.000 0.000 0.000 coordinates.py:167(variables)
3 0.000 0.000 0.000 0.000 dataset.py:98(<genexpr>)
4 0.000 0.000 0.000 0.000 dataset.py:402(variables)
1 0.000 0.000 0.000 0.000 netCDF4_.py:269(_disable_auto_decode_group)
12 0.000 0.000 0.032 0.003 netCDF4_.py:357(ds)
1 0.000 0.000 29.646 29.646 api.py:637(<listcomp>)
9 0.000 0.000 0.000 0.000 utils.py:313(__init__)
7 0.000 0.000 0.000 0.000 {method 'filters' of 'netCDF4._netCDF4.Variable' objects}
12 0.000 0.000 0.000 0.000 common.py:117(classes)
8 0.000 0.000 0.000 0.000 common.py:536(is_interval_dtype)
4 0.000 0.000 0.000 0.000 common.py:1078(is_datetime64_any_dtype)
4 0.000 0.000 0.000 0.000 dtypes.py:827(is_dtype)
8 0.000 0.000 0.000 0.000 base.py:551(<dictcomp>)
8 0.000 0.000 0.000 0.000 base.py:547(_get_attributes_dict)
8 0.000 0.000 0.000 0.000 utils.py:789(__enter__)
18 0.000 0.000 0.000 0.000 core.py:903(_get_chunks)
33 0.000 0.000 0.000 0.000 core.py:1885(<genexpr>)
22 0.000 0.000 0.000 0.000 core.py:1889(<genexpr>)
4 0.000 0.000 0.000 0.000 indexing.py:799(_decompose_slice)
4 0.000 0.000 0.000 0.000 indexing.py:1174(__getitem__)
3 0.000 0.000 0.000 0.000 variable.py:294(data)
8 0.000 0.000 0.000 0.000 {method '__enter__' of '_thread.lock' objects}
9 0.000 0.000 0.000 0.000 {built-in method builtins.hash}
4 0.000 0.000 0.000 0.000 {built-in method builtins.max}
4 0.000 0.000 0.000 0.000 {method 'update' of 'set' objects}
7 0.000 0.000 0.000 0.000 {method 'values' of 'dict' objects}
8 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
1 0.000 0.000 0.000 0.000 posixpath.py:376(abspath)
1 0.000 0.000 0.000 0.000 genericpath.py:53(getmtime)
4 0.000 0.000 0.000 0.000 _collections_abc.py:657(get)
1 0.000 0.000 0.000 0.000 __init__.py:548(__init__)
1 0.000 0.000 0.000 0.000 __init__.py:617(update)
4/2 0.000 0.000 0.000 0.000 combine.py:392(_infer_tile_ids_from_nested_list)
1 0.000 0.000 0.001 0.001 combine.py:522(_auto_combine)
2 0.000 0.000 0.000 0.000 merge.py:100(__init__)
5 0.000 0.000 0.000 0.000 coordinates.py:38(__iter__)
5 0.000 0.000 0.000 0.000 coordinates.py:169(<genexpr>)
1 0.000 0.000 0.000 0.000 dataset.py:666(_replace_vars_and_dims)
5 0.000 0.000 0.000 0.000 dataset.py:1078(data_vars)
1 0.000 0.000 0.000 0.000 file_manager.py:133(_make_key)
1 0.000 0.000 0.000 0.000 file_manager.py:245(increment)
1 0.000 0.000 0.000 0.000 lru_cache.py:54(__setitem__)
1 0.000 0.000 0.000 0.000 netCDF4_.py:398(get_attrs)
1 0.000 0.000 0.000 0.000 api.py:80(_get_default_engine)
1 0.000 0.000 0.000 0.000 api.py:92(_normalize_path)
8 0.000 0.000 0.000 0.000 {method 'view' of 'numpy.ndarray' objects}
8 0.000 0.000 0.000 0.000 utils.py:187(is_dict_like)
4 0.000 0.000 0.000 0.000 utils.py:219(is_valid_numpy_dtype)
10 0.000 0.000 0.000 0.000 utils.py:319(__iter__)
1 0.000 0.000 0.000 0.000 {method 'filepath' of 'netCDF4._netCDF4.Dataset' objects}
4 0.000 0.000 0.000 0.000 common.py:434(is_datetime64tz_dtype)
3 0.000 0.000 0.000 0.000 config.py:107(normalize_key)
3 0.000 0.000 0.000 0.000 core.py:160(<listcomp>)
6 0.000 0.000 0.000 0.000 core.py:966(ndim)
4 0.000 0.000 0.000 0.000 indexing.py:791(decompose_indexer)
8 0.000 0.000 0.000 0.000 {method '__exit__' of '_thread.lock' objects}
3 0.000 0.000 0.000 0.000 {method 'replace' of 'str' objects}
4 0.000 0.000 0.000 0.000 {method 'split' of 'str' objects}
1 0.000 0.000 0.000 0.000 posixpath.py:121(splitext)
1 0.000 0.000 0.000 0.000 genericpath.py:117(_splitext)
1 0.000 0.000 0.001 0.001 combine.py:443(_combine_nd)
1 0.000 0.000 0.000 0.000 combine.py:508(<listcomp>)
14 0.000 0.000 0.000 0.000 merge.py:41(unique_variable)
11 0.000 0.000 0.000 0.000 coordinates.py:163(_names)
1 0.000 0.000 0.000 0.000 dataset.py:2593(_assert_all_in_dataset)
1 0.000 0.000 0.000 0.000 variables.py:55(__init__)
1 0.000 0.000 0.000 0.000 file_manager.py:269(__init__)
29 0.000 0.000 0.000 0.000 file_manager.py:273(__hash__)
1 0.000 0.000 0.001 0.001 netCDF4_.py:392(get_variables)
1 0.000 0.000 0.000 0.000 netCDF4_.py:410(<setcomp>)
7 0.000 0.000 0.000 0.000 {method 'set_auto_chartostring' of 'netCDF4._netCDF4.Variable' objects}
1 0.000 0.000 0.000 0.000 {method 'ncattrs' of 'netCDF4._netCDF4.Dataset' objects}
4 0.000 0.000 0.000 0.000 common.py:472(is_timedelta64_dtype)
4 0.000 0.000 0.000 0.000 common.py:980(is_unsigned_integer_dtype)
4 0.000 0.000 0.000 0.000 base.py:3805(_coerce_to_ndarray)
3 0.000 0.000 0.000 0.000 itertoolz.py:241(unique)
11 0.000 0.000 0.000 0.000 core.py:137(<genexpr>)
3 0.000 0.000 0.000 0.000 indexing.py:600(__init__)
2 0.000 0.000 0.000 0.000 {method 'keys' of 'collections.OrderedDict' objects}
2 0.000 0.000 0.000 0.000 {built-in method _thread.allocate_lock}
1 0.000 0.000 0.000 0.000 {built-in method _collections._count_elements}
8 0.000 0.000 0.000 0.000 {method 'encode' of 'str' objects}
3 0.000 0.000 0.000 0.000 {method 'rfind' of 'str' objects}
8 0.000 0.000 0.000 0.000 {method 'add' of 'set' objects}
3 0.000 0.000 0.000 0.000 {method 'intersection' of 'set' objects}
7 0.000 0.000 0.000 0.000 {method 'setdefault' of 'dict' objects}
13 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}
1 0.000 0.000 0.000 0.000 posixpath.py:64(isabs)
1 0.000 0.000 0.015 0.015 posixpath.py:178(lexists)
1 0.000 0.000 0.000 0.000 posixpath.py:232(expanduser)
2 0.000 0.000 0.000 0.000 _collections_abc.py:672(keys)
7 0.000 0.000 0.000 0.000 contextlib.py:352(__init__)
7 0.000 0.000 0.000 0.000 contextlib.py:355(__enter__)
2 0.000 0.000 0.000 0.000 combine.py:496(vars_as_keys)
2 0.000 0.000 0.000 0.000 combine.py:517(_new_tile_id)
7 0.000 0.000 0.000 0.000 common.py:29(_decode_variable_name)
1 0.000 0.000 0.000 0.000 coordinates.py:160(__init__)
3 0.000 0.000 0.000 0.000 dataset.py:262(__iter__)
2 0.000 0.000 0.000 0.000 dataset.py:266(__len__)
2 0.000 0.000 0.000 0.000 dataset.py:940(__iter__)
1 0.000 0.000 0.000 0.000 dataset.py:1071(coords)
7 0.000 0.000 0.000 0.000 dataset.py:1381(<genexpr>)
4 0.000 0.000 0.000 0.000 variables.py:61(dtype)
1 0.000 0.000 0.000 0.000 file_manager.py:189(__del__)
1 0.000 0.000 0.000 0.000 lru_cache.py:47(_enforce_size_limit)
1 0.000 0.000 0.000 0.000 netCDF4_.py:138(_nc4_require_group)
1 0.000 0.000 0.000 0.000 netCDF4_.py:408(get_encoding)
1 0.000 0.000 0.000 0.000 api.py:66(_get_default_engine_netcdf)
4 0.000 0.000 0.000 0.000 utils.py:197(<genexpr>)
1 0.000 0.000 0.000 0.000 alignment.py:17(_get_joiner)
10 0.000 0.000 0.000 0.000 alignment.py:184(is_alignable)
5 0.000 0.000 0.000 0.000 alignment.py:226(<genexpr>)
5 0.000 0.000 0.000 0.000 utils.py:325(__contains__)
5 0.000 0.000 0.000 0.000 {method 'isunlimited' of 'netCDF4._netCDF4.Dimension' objects}
8 0.000 0.000 0.000 0.000 inference.py:435(is_hashable)
12 0.000 0.000 0.000 0.000 common.py:119(<lambda>)
8 0.000 0.000 0.000 0.000 common.py:127(<lambda>)
8 0.000 0.000 0.000 0.000 common.py:122(classes_and_not_datetimelike)
4 0.000 0.000 0.000 0.000 base.py:675(dtype)
8 0.000 0.000 0.000 0.000 base.py:1395(nlevels)
24 0.000 0.000 0.000 0.000 functoolz.py:15(identity)
1 0.000 0.000 0.000 0.000 base.py:610(normalize_dict)
1 0.000 0.000 0.000 0.000 base.py:625(normalize_seq)
3 0.000 0.000 0.000 0.000 indexing.py:453(__init__)
4 0.000 0.000 0.000 0.000 indexing.py:713(<listcomp>)
3 0.000 0.000 0.000 0.000 variable.py:821(chunks)
4 0.000 0.000 0.000 0.000 variable.py:1731(chunk)
8 0.000 0.000 0.000 0.000 variable.py:1874(name)
3 0.000 0.000 0.000 0.000 {method 'values' of 'collections.OrderedDict' objects}
6 0.000 0.000 0.000 0.000 {built-in method posix.fspath}
1 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
4 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
3 0.000 0.000 0.000 0.000 {method 'copy' of 'set' objects}
1 0.000 0.000 0.000 0.000 {method 'union' of 'set' objects}
1 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
2 0.000 0.000 0.000 0.000 posixpath.py:41(_get_sep)
1 0.000 0.000 0.000 0.000 _collections_abc.py:680(values)
9 0.000 0.000 0.000 0.000 _collections_abc.py:698(__init__)
7 0.000 0.000 0.000 0.000 contextlib.py:358(__exit__)
1 0.000 0.000 0.000 0.000 glob.py:145(has_magic)
1 0.000 0.000 0.000 0.000 combine.py:428(<listcomp>)
2 0.000 0.000 0.000 0.000 merge.py:301(_get_priority_vars)
1 0.000 0.000 0.000 0.000 merge.py:370(extract_indexes)
1 0.000 0.000 0.000 0.000 merge.py:378(assert_valid_explicit_coords)
5 0.000 0.000 0.000 0.000 dataset.py:259(__init__)
1 0.000 0.000 0.000 0.000 dataset.py:375(<listcomp>)
2 0.000 0.000 0.000 0.000 dataset.py:416(attrs)
5 0.000 0.000 0.000 0.000 dataset.py:428(encoding)
1 0.000 0.000 0.000 0.000 dataset.py:436(encoding)
1 0.000 0.000 0.000 0.000 dataset.py:1373(<listcomp>)
1 0.000 0.000 0.000 0.000 variables.py:76(lazy_elemwise_func)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
7 0.000 0.000 0.000 0.000 strings.py:39(__init__)
1 0.000 0.000 0.000 0.000 file_manager.py:241(__init__)
1 0.000 0.000 0.000 0.000 locks.py:206(ensure_lock)
1 0.000 0.000 0.000 0.000 netCDF4_.py:236(__init__)
1 0.000 0.000 0.000 0.000 api.py:638(<listcomp>)
1 0.000 0.000 0.000 0.000 utils.py:452(_tostr)
7 0.000 0.000 0.000 0.000 {method 'set_auto_maskandscale' of 'netCDF4._netCDF4.Variable' objects}
1 0.000 0.000 0.000 0.000 utils.py:514(is_grib_path)
3 0.000 0.000 0.000 0.000 core.py:989(name)
8 0.000 0.000 0.000 0.000 variable.py:1834(to_index_variable)
1 0.000 0.000 0.000 0.000 {method 'rstrip' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'endswith' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'keys' of 'dict' objects}
1 0.000 0.000 0.000 0.000 glob.py:22(iglob)
2 0.000 0.000 0.000 0.000 variable.py:2007(<listcomp>)
1 0.000 0.000 0.000 0.000 combine.py:345(_auto_concat)
1 0.000 0.000 0.000 0.000 combine.py:435(<listcomp>)
1 0.000 0.000 0.000 0.000 merge.py:519(<listcomp>)
2 0.000 0.000 0.000 0.000 dataset.py:934(__len__)
2 0.000 0.000 0.000 0.000 variables.py:106(safe_setitem)
1 0.000 0.000 0.000 0.000 api.py:479(__init__)
1 0.000 0.000 0.000 0.000 utils.py:20(_check_inplace)
7 0.000 0.000 0.000 0.000 {method 'chunking' of 'netCDF4._netCDF4.Variable' objects}
4 0.000 0.000 0.000 0.000 utils.py:498(close_on_error)
1 0.000 0.000 0.000 0.000 numeric.py:101(_assert_safe_casting)
3 0.000 0.000 0.000 0.000 core.py:167(<listcomp>)
```

Output of ds:
|
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
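A small timing harness for the comparison being profiled above; `fname` is a placeholder glob, not a path from the comment. The hypothesis in this comment is that decoding the time coordinate forces a large read, so the two timings should diverge.

```python
import time
import xarray as xr

fname = "atmos.*.sphum.nc"  # placeholder glob

for decode in (False, True):
    t0 = time.time()
    ds = xr.open_mfdataset(fname, decode_times=decode)
    ds.close()
    print("decode_times={}: {:6.2f}s".format(decode, time.time() - t0))
```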
464113917 | https://github.com/pydata/xarray/issues/1385#issuecomment-464113917 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDQ2NDExMzkxNw== | chuaxr 30007270 | 2019-02-15T16:34:02Z | 2019-02-15T16:34:35Z | NONE | On a related note, is it possible to clear out the memory used by the xarray dataset after it is no longer needed? Here's an example:
```python
fname2 = '/work/xrc/AM4_xrc/c192L33_am4p0_cmip6Diag/daily/5yr/atmos.20100101-20141231.ucomp.nc'

with xr.set_options(file_cache_maxsize=1):
    %time ds = xr.open_mfdataset(fname2)  # would like this to free up memory used by fname
```
```python
with xr.set_options(file_cache_maxsize=1):
    # expected to take same time as first call
    %time ds = xr.open_mfdataset(fname)
```
|
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
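On the question above about releasing memory: xarray's file cache governs open file handles (`xr.set_options(file_cache_maxsize=...)`), but the loaded arrays are freed by Python only once no references remain. A sketch under those assumptions; the path is a placeholder and the variable name `ucomp` is inferred from the filename in the comment, not confirmed by it.

```python
import xarray as xr

fname = "atmos.20100101-20141231.ucomp.nc"  # placeholder path

with xr.set_options(file_cache_maxsize=1):
    with xr.open_mfdataset(fname) as ds:
        result = ds["ucomp"].mean().compute()  # variable name assumed
# Exiting the inner block closes the underlying file handles; dropping the
# last reference to ds (e.g. `del ds`) lets Python reclaim the loaded arrays.
```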
463367754 | https://github.com/pydata/xarray/issues/1385#issuecomment-463367754 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDQ2MzM2Nzc1NA== | chuaxr 30007270 | 2019-02-13T20:58:52Z | 2019-02-13T20:59:06Z | NONE | It seems my issue has to do with the time coordinate:

```
fname = '/work/xrc/AM4_xrc/c192L33_am4p0_cmip6Diag/daily/5yr/atmos.20100101-20141231.sphum.nc'

%prun ds = xr.open_mfdataset(fname,drop_variables='time')

         7510 function calls (7366 primitive calls) in 0.068 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.039    0.039    0.039    0.039 netCDF4_.py:244(_open_netcdf4_group)
        3    0.022    0.007    0.022    0.007 {built-in method _operator.getitem}
        1    0.001    0.001    0.001    0.001 {built-in method posix.lstat}
  125/113    0.000    0.000    0.001    0.000 indexing.py:504(shape)
       11    0.000    0.000    0.000    0.000 core.py:137(<genexpr>)

fname = '/work/xrc/AM4_xrc/c192L33_am4p0_cmip6Diag/daily/5yr/atmos.20000101-20041231.sphum.nc'

%prun ds = xr.open_mfdataset(fname)

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        6   23.791    3.965   23.791    3.965 {built-in method _operator.getitem}
        1    0.029    0.029    0.029    0.029 netCDF4_.py:244(_open_netcdf4_group)
        2    0.023    0.012    0.023    0.012 {cftime._cftime.num2date}
        1    0.001    0.001    0.001    0.001 {built-in method posix.lstat}
  158/139    0.000    0.000    0.001    0.000 indexing.py:504(shape)
```

Both files are 33 GB. This is using xarray 0.11.3. I also confirm that nc.MFDataset is much faster (<1s). Is there any speed-up for the time coordinates possible, given that my data follows a standard calendar? (Short of using drop_variables='time' and then manually adding the time coordinate...) |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
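The "short of" workaround mentioned above, sketched: skip the slow decode of the time variable and rebuild the coordinate by hand. The path, the start date, and the daily frequency are all assumptions for illustration.

```python
import pandas as pd
import xarray as xr

fname = "atmos.20000101-20041231.sphum.nc"  # placeholder path

ds = xr.open_mfdataset(fname, decode_times=False)
# Replace the raw numeric time values with a hand-built standard-calendar
# index (assumed daily from 2000-01-01).
ds = ds.assign_coords(
    time=pd.date_range("2000-01-01", periods=ds.sizes["time"], freq="D")
)
```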
461561653 | https://github.com/pydata/xarray/issues/1385#issuecomment-461561653 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDQ2MTU2MTY1Mw== | sbiner 16655388 | 2019-02-07T19:22:58Z | 2019-02-07T19:22:58Z | NONE | I just tried and it did not help ...

```
In [5]: run test_ouverture_fichier_nc_vs_xr.py
timing glob:   0.00s
timing netcdf4:   3.36s
timing xarray:  44.82s
timing xarray tune:  14.47s

In [6]: xr.show_versions()

INSTALLED VERSIONS
commit: None
python: 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 19:04:19) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: None.None
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.11.3
pandas: 0.24.0
numpy: 1.13.3
scipy: 1.2.0
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: 1.0.0
distributed: 1.25.2
matplotlib: 2.2.3
cartopy: None
seaborn: None
setuptools: 40.5.0
pip: 19.0.1
conda: None
pytest: None
IPython: 5.8.0
sphinx: 1.8.2
``` |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
461551320 | https://github.com/pydata/xarray/issues/1385#issuecomment-461551320 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDQ2MTU1MTMyMA== | sbiner 16655388 | 2019-02-07T18:52:53Z | 2019-02-07T18:52:53Z | NONE | I have the same problem. open_mfdataset is 10X slower than nc.MFDataset. I used the following code to get some timing on opening 456 local netcdf files located in a local directory:

```python
# netcdf4
t00 = time.time()
ds1 = nc.MFDataset(l_fichiers_nc)
dates1 = ouralib.netcdf.calcule_dates(ds1)
print('timing netcdf4: {:6.2f}s'.format(time.time() - t00))

# xarray
t00 = time.time()
ds2 = xr.open_mfdataset(l_fichiers_nc)
print('timing xarray: {:6.2f}s'.format(time.time() - t00))

# xarray tune
t00 = time.time()
ds3 = xr.open_mfdataset(l_fichiers_nc, decode_cf=False, concat_dim='time')
ds3 = xr.decode_cf(ds3)
print('timing xarray tune: {:6.2f}s'.format(time.time() - t00))
```

The output I get is:
I made tests on a CentOS server using python 2.7 and 3.6, and on macOS as well with python 3.6. The timing changes but the ratios are similar between netCDF4 and xarray. Is there any way of making open_mfdataset go faster? In case it helps, here are profiling outputs for python 2.7:

```
         13996351 function calls (13773659 primitive calls) in 42.133 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2664   16.290    0.006   16.290    0.006 {time.sleep}
      912    6.330    0.007    6.623    0.007 netCDF4_.py:244(_open_netcdf4_group)
```

For python 3.6:

```
         9663408 function calls (9499759 primitive calls) in 31.934 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5472   15.140    0.003   15.140    0.003 {method 'acquire' of '_thread.lock' objects}
      912    5.661    0.006    5.718    0.006 netCDF4_.py:244(_open_netcdf4_group)
     4104    0.564    0.000    0.757    0.000 {built-in method _operator.getitem}
133152/129960    0.477    0.000    0.660    0.000 indexing.py:496(shape)
1554550/1554153    0.414    0.000    0.711    0.000 {built-in method builtins.isinstance}
      912    0.260    0.000    0.260    0.000 {method 'close' of 'netCDF4._netCDF4.Dataset' objects}
     6384    0.244    0.000    0.953    0.000 netCDF4_.py:361(open_store_variable)
      910    0.241    0.000    0.595    0.001 duck_array_ops.py:141(array_equiv)
    20990    0.235    0.000    0.343    0.000 {pandas._libs.lib.is_scalar}
37483/36567    0.228    0.000    0.230    0.000 {built-in method builtins.iter}
    93986    0.219    0.000    1.607    0.000 variable.py:239(__init__)
    93982    0.194    0.000    0.194    0.000 variable.py:706(attrs)
    33744    0.189    0.000    0.189    0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects}
    15511    0.175    0.000    0.638    0.000 core.py:1776(normalize_chunks)
     5930    0.162    0.000    0.350    0.000 missing.py:183(_isna_ndarraylike)
297391/296926    0.159    0.000    0.380    0.000 {built-in method builtins.getattr}
   134230    0.155    0.000    0.269    0.000 abc.py:180(__instancecheck__)
     6384    0.142    0.000    0.199    0.000 netCDF4_.py:34(__init__)
    93986    0.126    0.000    0.671    0.000 variable.py:414(_parse_dimensions)
   156545    0.119    0.000    0.811    0.000 utils.py:450(ndim)
    12768    0.119    0.000    0.203    0.000 core.py:747(blockdims_from_blockshape)
     6384    0.117    0.000    2.526    0.000 conventions.py:245(decode_cf_variable)
741183/696380    0.116    0.000    0.134    0.000 {built-in method builtins.len}
41957/23717    0.110    0.000    4.395    0.000 {built-in method numpy.core.multiarray.array}
    93978    0.110    0.000    0.110    0.000 variable.py:718(encoding)
   219940    0.109    0.000    0.109    0.000 _weakrefset.py:70(__contains__)
    99458    0.100    0.000    0.440    0.000 variable.py:137(as_compatible_data)
    53882    0.085    0.000    0.095    0.000 core.py:891(shape)
   140604    0.084    0.000    0.628    0.000 variable.py:272(shape)
     3192    0.084    0.000    0.170    0.000 utils.py:88(_StartCountStride)
    10494    0.081    0.000    0.081    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    44688    0.077    0.000    0.157    0.000 variables.py:102(unpack_for_decoding)
```

Output of xr.show_versions():

```
INSTALLED VERSIONS
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

xarray: 0.11.0
pandas: 0.24.1
numpy: 1.15.4
scipy: None
netCDF4: 1.4.2
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: None
cyordereddict: None
dask: 1.1.1
distributed: 1.25.3
matplotlib: 3.0.2
cartopy: None
seaborn: None
setuptools: 40.7.3
pip: 19.0.1
conda: None
pytest: None
IPython: 7.2.0
sphinx: None
``` |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
439478904 | https://github.com/pydata/xarray/issues/1385#issuecomment-439478904 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDQzOTQ3ODkwNA== | chuaxr 30007270 | 2018-11-16T18:10:53Z | 2018-11-16T18:10:53Z | NONE | h5netcdf fails with the following error (presumably the file is not compatible):

```
/nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
     97         if swmr and swmr_support:
     98             flags |= h5f.ACC_SWMR_READ
---> 99         fid = h5f.open(name, flags, fapl=fapl)
    100     elif mode == 'r+':
    101         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.open()

OSError: Unable to open file (file signature not found)
```
|
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
439445695 | https://github.com/pydata/xarray/issues/1385#issuecomment-439445695 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDQzOTQ0NTY5NQ== | chuaxr 30007270 | 2018-11-16T16:20:25Z | 2018-11-16T16:20:25Z | NONE | Sorry, I think the speedup had to do with accessing a file that had previously been loaded rather than due to h5netcdf. Output of:

```
%prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/atmos_level.1999010100-2000123123.sphum.nc',chunks={'lat':20,'time':50,'lon':12,'pfull':11})
```

/work isn't a remote archive, so it surprises me that this should happen. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
439042364 | https://github.com/pydata/xarray/issues/1385#issuecomment-439042364 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDQzOTA0MjM2NA== | chuaxr 30007270 | 2018-11-15T13:37:16Z | 2018-11-15T14:06:04Z | NONE | Yes, I'm on 0.11. Nothing displays on the task stream/progress bar when using open_mfdataset. The output from %prun, for decode_cf = False and for decode_cf = True:
If I repeat the open_mfdataset call for another 5 files (after opening the first 5), I occasionally get this warning:
I only began using the dashboard recently; please let me know if there's something basic I'm missing. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 | |
438870575 | https://github.com/pydata/xarray/issues/1385#issuecomment-438870575 | https://api.github.com/repos/pydata/xarray/issues/1385 | MDEyOklzc3VlQ29tbWVudDQzODg3MDU3NQ== | chuaxr 30007270 | 2018-11-15T00:32:42Z | 2018-11-15T00:32:42Z | NONE | I can confirm that:

```python
ds = xr.open_mfdataset(data_fnames, chunks={'lat': 20, 'time': 50, 'lon': 24, 'pfull': 11})
```

For reference, data_fnames is a list of 5 files, each of which is ~75 GB. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
slow performance with open_mfdataset 224553135 |
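Chunk sizes like those in the comment above interact with the file's native netCDF4/HDF5 chunking, so it can help to inspect the on-disk layout before choosing xarray chunks. A sketch follows; the path is taken from an earlier comment and the variable name `sphum` is inferred from that filename, not confirmed.

```python
import netCDF4 as nc

# Inspect the native chunking of a variable before picking dask chunks.
with nc.Dataset("atmos_level.1999010100-2000123123.sphum.nc") as f:  # placeholder
    var = f.variables["sphum"]  # variable name assumed from the filename
    print(var.chunking())       # per-dimension chunk sizes, or 'contiguous'
```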
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);