html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/2313#issuecomment-1062761948,https://api.github.com/repos/pydata/xarray/issues/2313,1062761948,IC_kwDOAMm_X84_WHXc,30007270,2022-03-09T10:13:09Z,2022-03-09T10:13:09Z,NONE,"Seconding @dcherian's comment in #4901 on an example for `.encoding['source']`. Working off @raybellwaves' example, something like this would have been useful to me:
```
>>> import xarray as xr
>>> import numpy as np
>>> model1 = xr.DataArray(np.arange(2), coords=[np.arange(2)], name=""f"")
>>> model1.to_dataset().to_netcdf(""model1.nc"")
>>> model2 = xr.DataArray(np.arange(2), coords=[np.arange(2)], name=""f"")
>>> model2.to_dataset().to_netcdf(""model2.nc"")
>>> ds = xr.open_mfdataset(
... [""model1.nc"", ""model2.nc""],
... preprocess=lambda ds: ds.expand_dims(
... {""model_name"": [ds.encoding[""source""].split(""/"")[-1].split(""."")[0]]}
... ),
... )
>>> ds
Dimensions:     (dim_0: 2, model_name: 2)
Coordinates:
  * dim_0       (dim_0) int64 0 1
  * model_name  (model_name) object 'model1' 'model2'
Data variables:
    f           (model_name, dim_0) int64 dask.array
```
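As an aside on the filename parsing: `pathlib.Path.stem` is a sturdier way to pull the model name out of `.encoding['source']` than the chained `split` calls (a minimal stdlib-only sketch; the path is hypothetical):

```python
from pathlib import Path

# Hypothetical value of ds.encoding['source']
src = '/some/dir/model1.nc'

# stem strips both the directory and the extension in one step,
# and also handles Windows-style separators correctly.
name = Path(src).stem
print(name)  # model1
```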
On that note, @raybellwaves' example seems to work with some slight changes:
```
>>> import numpy as np
>>> import xarray as xr
>>>
>>> f1 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=[""a""], name=""f1"")
>>> f1 = f1.assign_coords(t='t0')
>>> f1.to_dataset().to_netcdf(""f1.nc"")
>>>
>>> f2 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=[""a""], name=""f2"")
>>> f2 = f2.assign_coords(t='t1')
>>> f2.to_dataset().to_netcdf(""f2.nc"")
>>>
>>> # Concat along t
>>> def preprocess(ds):
... return ds.expand_dims(""t"")
...
>>>
>>> ds = xr.open_mfdataset([""f1.nc"", ""f2.nc""], concat_dim=""t"", preprocess=preprocess)
>>> ds
Dimensions:  (a: 2, t: 2)
Coordinates:
  * t        (t) object 't0' 't1'
  * a        (a) int64 0 1
Data variables:
    f1       (t, a) float64 dask.array
    f2       (t, a) float64 dask.array
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,344614881
https://github.com/pydata/xarray/issues/1385#issuecomment-464100720,https://api.github.com/repos/pydata/xarray/issues/1385,464100720,MDEyOklzc3VlQ29tbWVudDQ2NDEwMDcyMA==,30007270,2019-02-15T15:57:01Z,2019-02-15T18:33:31Z,NONE,"In that case, the speedup disappears. It seems that the slowdown arises from the entire time array being loaded into memory at once.
EDIT: I subsequently realized that using `drop_variables='time'` caused all the data values to become NaN, which makes that an invalid option.
```
%prun ds = xr.open_mfdataset(fname,decode_times=False)
8025 function calls (7856 primitive calls) in 29.662 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
4 29.608 7.402 29.608 7.402 {built-in method _operator.getitem}
1 0.032 0.032 0.032 0.032 netCDF4_.py:244(_open_netcdf4_group)
1 0.015 0.015 0.015 0.015 {built-in method posix.lstat}
126/114 0.000 0.000 0.001 0.000 indexing.py:504(shape)
1196 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
81 0.000 0.000 0.001 0.000 variable.py:239(__init__)
```
See the rest of the `prun` output below for more information:
30 0.000 0.000 0.000 0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects}
81 0.000 0.000 0.000 0.000 variable.py:709(attrs)
736/672 0.000 0.000 0.000 0.000 {built-in method builtins.len}
157 0.000 0.000 0.001 0.000 utils.py:450(ndim)
81 0.000 0.000 0.001 0.000 variable.py:417(_parse_dimensions)
7 0.000 0.000 0.001 0.000 netCDF4_.py:361(open_store_variable)
4 0.000 0.000 0.000 0.000 base.py:253(__new__)
1 0.000 0.000 29.662 29.662 :1()
7 0.000 0.000 0.001 0.000 conventions.py:245(decode_cf_variable)
39/19 0.000 0.000 29.609 1.558 {built-in method numpy.core.multiarray.array}
9 0.000 0.000 0.000 0.000 core.py:1776(normalize_chunks)
104 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
143 0.000 0.000 0.001 0.000 variable.py:272(shape)
4 0.000 0.000 0.000 0.000 utils.py:88(_StartCountStride)
8 0.000 0.000 0.000 0.000 core.py:747(blockdims_from_blockshape)
23 0.000 0.000 0.032 0.001 file_manager.py:150(acquire)
8 0.000 0.000 0.000 0.000 base.py:590(tokenize)
84 0.000 0.000 0.000 0.000 variable.py:137(as_compatible_data)
268 0.000 0.000 0.000 0.000 {method 'indices' of 'slice' objects}
14 0.000 0.000 29.610 2.115 variable.py:41(as_variable)
35 0.000 0.000 0.000 0.000 variables.py:102(unpack_for_decoding)
81 0.000 0.000 0.000 0.000 variable.py:721(encoding)
192 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
2 0.000 0.000 0.000 0.000 merge.py:109(merge_variables)
2 0.000 0.000 29.610 14.805 merge.py:392(merge_core)
7 0.000 0.000 0.000 0.000 variables.py:161()
103 0.000 0.000 0.000 0.000 {built-in method _abc._abc_instancecheck}
1 0.000 0.000 0.001 0.001 conventions.py:351(decode_cf_variables)
3 0.000 0.000 0.000 0.000 dataset.py:90(calculate_dimensions)
1 0.000 0.000 0.000 0.000 {built-in method posix.stat}
361 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
20 0.000 0.000 0.000 0.000 variable.py:728(copy)
23 0.000 0.000 0.000 0.000 lru_cache.py:40(__getitem__)
12 0.000 0.000 0.000 0.000 base.py:504(_simple_new)
2 0.000 0.000 0.000 0.000 variable.py:1985(assert_unique_multiindex_level_names)
2 0.000 0.000 0.000 0.000 alignment.py:172(deep_align)
14 0.000 0.000 0.000 0.000 indexing.py:469(__init__)
16 0.000 0.000 29.609 1.851 variable.py:1710(__init__)
1 0.000 0.000 29.662 29.662 {built-in method builtins.exec}
25 0.000 0.000 0.000 0.000 contextlib.py:81(__init__)
7 0.000 0.000 0.000 0.000 {method 'getncattr' of 'netCDF4._netCDF4.Dataset' objects}
24 0.000 0.000 0.000 0.000 indexing.py:331(as_integer_slice)
50/46 0.000 0.000 0.000 0.000 common.py:181(__setattr__)
7 0.000 0.000 0.000 0.000 variables.py:155(decode)
4 0.000 0.000 29.609 7.402 indexing.py:760(explicit_indexing_adapter)
48 0.000 0.000 0.000 0.000 :416(parent)
103 0.000 0.000 0.000 0.000 abc.py:137(__instancecheck__)
48 0.000 0.000 0.000 0.000 _collections_abc.py:742(__iter__)
180 0.000 0.000 0.000 0.000 variable.py:411(dims)
4 0.000 0.000 0.000 0.000 locks.py:158(__exit__)
3 0.000 0.000 0.001 0.000 core.py:2048(from_array)
1 0.000 0.000 29.612 29.612 conventions.py:412(decode_cf)
4 0.000 0.000 0.000 0.000 utils.py:50(_maybe_cast_to_cftimeindex)
77/59 0.000 0.000 0.000 0.000 utils.py:473(dtype)
84 0.000 0.000 0.000 0.000 generic.py:7(_check)
146 0.000 0.000 0.000 0.000 indexing.py:319(tuple)
7 0.000 0.000 0.000 0.000 netCDF4_.py:34(__init__)
1 0.000 0.000 29.614 29.614 api.py:270(maybe_decode_store)
1 0.000 0.000 29.662 29.662 api.py:487(open_mfdataset)
20 0.000 0.000 0.000 0.000 common.py:1845(_is_dtype_type)
33 0.000 0.000 0.000 0.000 core.py:1911()
84 0.000 0.000 0.000 0.000 variable.py:117(_maybe_wrap_data)
3 0.000 0.000 0.001 0.000 variable.py:830(chunk)
25 0.000 0.000 0.000 0.000 contextlib.py:237(helper)
36/25 0.000 0.000 0.000 0.000 utils.py:477(shape)
8 0.000 0.000 0.000 0.000 base.py:566(_shallow_copy)
8 0.000 0.000 0.000 0.000 indexing.py:346(__init__)
26/25 0.000 0.000 0.000 0.000 utils.py:408(__call__)
4 0.000 0.000 0.000 0.000 indexing.py:886(_decompose_outer_indexer)
2 0.000 0.000 29.610 14.805 merge.py:172(expand_variable_dicts)
4 0.000 0.000 29.608 7.402 netCDF4_.py:67(_getitem)
2 0.000 0.000 0.000 0.000 dataset.py:722(copy)
7 0.000 0.000 0.001 0.000 dataset.py:1383(maybe_chunk)
16 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.empty}
14 0.000 0.000 0.000 0.000 fromnumeric.py:1471(ravel)
60 0.000 0.000 0.000 0.000 base.py:652(__len__)
3 0.000 0.000 0.000 0.000 core.py:141(getem)
25 0.000 0.000 0.000 0.000 contextlib.py:116(__exit__)
4 0.000 0.000 29.609 7.402 utils.py:62(safe_cast_to_index)
18 0.000 0.000 0.000 0.000 core.py:891(shape)
25 0.000 0.000 0.000 0.000 contextlib.py:107(__enter__)
4 0.000 0.000 0.001 0.000 utils.py:332(FrozenOrderedDict)
8 0.000 0.000 0.000 0.000 base.py:1271(set_names)
4 0.000 0.000 0.000 0.000 numeric.py:34(__new__)
24 0.000 0.000 0.000 0.000 inference.py:253(is_list_like)
3 0.000 0.000 0.000 0.000 core.py:820(__new__)
12 0.000 0.000 0.000 0.000 variable.py:1785(copy)
36 0.000 0.000 0.000 0.000 {method 'copy' of 'collections.OrderedDict' objects}
8/7 0.000 0.000 0.000 0.000 {built-in method builtins.sorted}
2 0.000 0.000 0.000 0.000 merge.py:220(determine_coords)
46 0.000 0.000 0.000 0.000 file_manager.py:141(_optional_lock)
60 0.000 0.000 0.000 0.000 indexing.py:1252(shape)
50 0.000 0.000 0.000 0.000 {built-in method builtins.next}
59 0.000 0.000 0.000 0.000 {built-in method builtins.iter}
54 0.000 0.000 0.000 0.000 :1009(_handle_fromlist)
1 0.000 0.000 0.000 0.000 api.py:146(_protect_dataset_variables_inplace)
1 0.000 0.000 29.646 29.646 api.py:162(open_dataset)
4 0.000 0.000 0.000 0.000 utils.py:424(_out_array_shape)
4 0.000 0.000 29.609 7.402 indexing.py:1224(__init__)
24 0.000 0.000 0.000 0.000 function_base.py:241(iterable)
4 0.000 0.000 0.000 0.000 dtypes.py:968(is_dtype)
2 0.000 0.000 0.000 0.000 merge.py:257(coerce_pandas_values)
14 0.000 0.000 0.000 0.000 missing.py:105(_isna_new)
8 0.000 0.000 0.000 0.000 variable.py:1840(to_index)
7 0.000 0.000 0.000 0.000 {method 'search' of 're.Pattern' objects}
48 0.000 0.000 0.000 0.000 {method 'rpartition' of 'str' objects}
7 0.000 0.000 0.000 0.000 strings.py:66(decode)
7 0.000 0.000 0.000 0.000 netCDF4_.py:257(_disable_auto_decode_variable)
14 0.000 0.000 0.000 0.000 numerictypes.py:619(issubclass_)
24/4 0.000 0.000 29.609 7.402 numeric.py:433(asarray)
7 0.000 0.000 0.000 0.000 {method 'ncattrs' of 'netCDF4._netCDF4.Variable' objects}
8 0.000 0.000 0.000 0.000 numeric.py:67(_shallow_copy)
8 0.000 0.000 0.000 0.000 indexing.py:373(__init__)
3 0.000 0.000 0.000 0.000 core.py:134()
14 0.000 0.000 0.000 0.000 merge.py:154()
16 0.000 0.000 0.000 0.000 dataset.py:816()
11 0.000 0.000 0.000 0.000 netCDF4_.py:56(get_array)
40 0.000 0.000 0.000 0.000 utils.py:40(_find_dim)
22 0.000 0.000 0.000 0.000 core.py:1893()
27 0.000 0.000 0.000 0.000 {built-in method builtins.all}
26/10 0.000 0.000 0.000 0.000 {built-in method builtins.sum}
2 0.000 0.000 0.000 0.000 dataset.py:424(attrs)
7 0.000 0.000 0.000 0.000 variables.py:231(decode)
1 0.000 0.000 0.000 0.000 file_manager.py:66(__init__)
67 0.000 0.000 0.000 0.000 utils.py:316(__getitem__)
22 0.000 0.000 0.000 0.000 {method 'move_to_end' of 'collections.OrderedDict' objects}
53 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 combine.py:374(_infer_concat_order_from_positions)
7 0.000 0.000 0.000 0.000 dataset.py:1378(selkeys)
1 0.000 0.000 0.001 0.001 dataset.py:1333(chunk)
4 0.000 0.000 29.609 7.402 netCDF4_.py:62(__getitem__)
37 0.000 0.000 0.000 0.000 netCDF4_.py:365()
18 0.000 0.000 0.000 0.000 {method 'ravel' of 'numpy.ndarray' objects}
2 0.000 0.000 0.000 0.000 alignment.py:37(align)
14 0.000 0.000 0.000 0.000 {pandas._libs.lib.is_scalar}
8 0.000 0.000 0.000 0.000 base.py:1239(_set_names)
16 0.000 0.000 0.000 0.000 indexing.py:314(__init__)
3 0.000 0.000 0.000 0.000 config.py:414(get)
7 0.000 0.000 0.000 0.000 dtypes.py:68(maybe_promote)
8 0.000 0.000 0.000 0.000 variable.py:1856(level_names)
37 0.000 0.000 0.000 0.000 {method 'copy' of 'dict' objects}
6 0.000 0.000 0.000 0.000 re.py:180(search)
6 0.000 0.000 0.000 0.000 re.py:271(_compile)
8 0.000 0.000 0.000 0.000 {built-in method _hashlib.openssl_md5}
1 0.000 0.000 0.000 0.000 merge.py:463(merge)
7 0.000 0.000 0.000 0.000 variables.py:158()
7 0.000 0.000 0.000 0.000 numerictypes.py:687(issubdtype)
6 0.000 0.000 0.000 0.000 utils.py:510(is_remote_uri)
8 0.000 0.000 0.000 0.000 common.py:1702(is_extension_array_dtype)
25 0.000 0.000 0.000 0.000 indexing.py:645(as_indexable)
21 0.000 0.000 0.000 0.000 {method 'pop' of 'collections.OrderedDict' objects}
19 0.000 0.000 0.000 0.000 {built-in method __new__ of type object at 0x2b324a13e3c0}
1 0.000 0.000 0.001 0.001 dataset.py:1394()
21 0.000 0.000 0.000 0.000 variables.py:117(pop_to)
1 0.000 0.000 0.032 0.032 netCDF4_.py:320(open)
8 0.000 0.000 0.000 0.000 netCDF4_.py:399()
12 0.000 0.000 0.000 0.000 __init__.py:221(iteritems)
4 0.000 0.000 0.000 0.000 common.py:403(is_datetime64_dtype)
8 0.000 0.000 0.000 0.000 common.py:1809(_get_dtype)
8 0.000 0.000 0.000 0.000 dtypes.py:68(find)
8 0.000 0.000 0.000 0.000 base.py:3607(values)
22 0.000 0.000 0.000 0.000 pycompat.py:32(move_to_end)
8 0.000 0.000 0.000 0.000 utils.py:792(__exit__)
3 0.000 0.000 0.000 0.000 highlevelgraph.py:84(from_collections)
22 0.000 0.000 0.000 0.000 core.py:1906()
16 0.000 0.000 0.000 0.000 abc.py:141(__subclasscheck__)
1 0.000 0.000 0.000 0.000 posixpath.py:104(split)
1 0.000 0.000 0.001 0.001 combine.py:479(_auto_combine_all_along_first_dim)
1 0.000 0.000 29.610 29.610 dataset.py:321(__init__)
4 0.000 0.000 0.000 0.000 dataset.py:643(_construct_direct)
7 0.000 0.000 0.000 0.000 variables.py:266(decode)
1 0.000 0.000 0.032 0.032 netCDF4_.py:306(__init__)
14 0.000 0.000 0.000 0.000 numeric.py:504(asanyarray)
4 0.000 0.000 0.000 0.000 common.py:503(is_period_dtype)
8 0.000 0.000 0.000 0.000 common.py:1981(pandas_dtype)
12 0.000 0.000 0.000 0.000 base.py:633(_reset_identity)
11 0.000 0.000 0.000 0.000 pycompat.py:18(iteritems)
16 0.000 0.000 0.000 0.000 utils.py:279(is_integer)
14 0.000 0.000 0.000 0.000 variable.py:268(dtype)
4 0.000 0.000 0.000 0.000 indexing.py:698(_outer_to_numpy_indexer)
42 0.000 0.000 0.000 0.000 variable.py:701(attrs)
9 0.000 0.000 0.000 0.000 {built-in method builtins.any}
1 0.000 0.000 0.000 0.000 posixpath.py:338(normpath)
6 0.000 0.000 0.000 0.000 _collections_abc.py:676(items)
24 0.000 0.000 0.000 0.000 {built-in method math.isnan}
1 0.000 0.000 29.610 29.610 merge.py:360(merge_data_and_coords)
1 0.000 0.000 0.000 0.000 dataset.py:1084(set_coords)
1 0.000 0.000 0.001 0.001 common.py:99(load)
1 0.000 0.000 0.000 0.000 file_manager.py:250(decrement)
4 0.000 0.000 0.000 0.000 locks.py:154(__enter__)
7 0.000 0.000 0.000 0.000 netCDF4_.py:160(_ensure_fill_value_valid)
8 0.000 0.000 0.001 0.000 netCDF4_.py:393()
8 0.000 0.000 0.000 0.000 common.py:572(is_categorical_dtype)
16 0.000 0.000 0.000 0.000 base.py:75(is_dtype)
72 0.000 0.000 0.000 0.000 indexing.py:327(as_integer_or_none)
26 0.000 0.000 0.000 0.000 utils.py:382(dispatch)
3 0.000 0.000 0.000 0.000 core.py:123(slices_from_chunks)
16 0.000 0.000 0.000 0.000 core.py:768()
4 0.000 0.000 29.609 7.402 indexing.py:514(__array__)
4 0.000 0.000 0.000 0.000 indexing.py:1146(__init__)
4 0.000 0.000 0.000 0.000 indexing.py:1153(_indexing_array_and_key)
4 0.000 0.000 29.609 7.402 variable.py:400(to_index_variable)
30 0.000 0.000 0.000 0.000 {method 'items' of 'collections.OrderedDict' objects}
16 0.000 0.000 0.000 0.000 {built-in method _abc._abc_subclasscheck}
19 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}
1 0.000 0.000 0.000 0.000 combine.py:423(_check_shape_tile_ids)
4 0.000 0.000 0.000 0.000 merge.py:91(_assert_compat_valid)
12 0.000 0.000 0.000 0.000 dataset.py:263()
1 0.000 0.000 29.610 29.610 dataset.py:372(_set_init_vars_and_dims)
3 0.000 0.000 0.000 0.000 dataset.py:413(_attrs_copy)
8 0.000 0.000 0.000 0.000 common.py:120()
14 0.000 0.000 0.000 0.000 {built-in method pandas._libs.missing.checknull}
4 0.000 0.000 0.000 0.000 common.py:746(is_dtype_equal)
4 0.000 0.000 0.000 0.000 common.py:923(is_signed_integer_dtype)
4 0.000 0.000 0.000 0.000 common.py:1545(is_float_dtype)
14 0.000 0.000 0.000 0.000 missing.py:25(isna)
3 0.000 0.000 0.000 0.000 highlevelgraph.py:71(__init__)
3 0.000 0.000 0.000 0.000 core.py:137()
33 0.000 0.000 0.000 0.000 core.py:1883()
35 0.000 0.000 0.000 0.000 variable.py:713(encoding)
2 0.000 0.000 0.000 0.000 {built-in method builtins.min}
16 0.000 0.000 0.000 0.000 _collections_abc.py:719(__iter__)
8 0.000 0.000 0.000 0.000 _collections_abc.py:760(__iter__)
1 0.000 0.000 0.015 0.015 glob.py:9(glob)
2 0.000 0.000 0.015 0.008 glob.py:39(_iglob)
8 0.000 0.000 0.000 0.000 {method 'hexdigest' of '_hashlib.HASH' objects}
1 0.000 0.000 0.000 0.000 combine.py:500(_auto_combine_1d)
14 0.000 0.000 0.000 0.000 merge.py:104(__missing__)
1 0.000 0.000 0.000 0.000 coordinates.py:167(variables)
3 0.000 0.000 0.000 0.000 dataset.py:98()
4 0.000 0.000 0.000 0.000 dataset.py:402(variables)
1 0.000 0.000 0.000 0.000 netCDF4_.py:269(_disable_auto_decode_group)
12 0.000 0.000 0.032 0.003 netCDF4_.py:357(ds)
1 0.000 0.000 29.646 29.646 api.py:637()
9 0.000 0.000 0.000 0.000 utils.py:313(__init__)
7 0.000 0.000 0.000 0.000 {method 'filters' of 'netCDF4._netCDF4.Variable' objects}
12 0.000 0.000 0.000 0.000 common.py:117(classes)
8 0.000 0.000 0.000 0.000 common.py:536(is_interval_dtype)
4 0.000 0.000 0.000 0.000 common.py:1078(is_datetime64_any_dtype)
4 0.000 0.000 0.000 0.000 dtypes.py:827(is_dtype)
8 0.000 0.000 0.000 0.000 base.py:551()
8 0.000 0.000 0.000 0.000 base.py:547(_get_attributes_dict)
8 0.000 0.000 0.000 0.000 utils.py:789(__enter__)
18 0.000 0.000 0.000 0.000 core.py:903(_get_chunks)
33 0.000 0.000 0.000 0.000 core.py:1885()
22 0.000 0.000 0.000 0.000 core.py:1889()
4 0.000 0.000 0.000 0.000 indexing.py:799(_decompose_slice)
4 0.000 0.000 0.000 0.000 indexing.py:1174(__getitem__)
3 0.000 0.000 0.000 0.000 variable.py:294(data)
8 0.000 0.000 0.000 0.000 {method '__enter__' of '_thread.lock' objects}
9 0.000 0.000 0.000 0.000 {built-in method builtins.hash}
4 0.000 0.000 0.000 0.000 {built-in method builtins.max}
4 0.000 0.000 0.000 0.000 {method 'update' of 'set' objects}
7 0.000 0.000 0.000 0.000 {method 'values' of 'dict' objects}
8 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
1 0.000 0.000 0.000 0.000 posixpath.py:376(abspath)
1 0.000 0.000 0.000 0.000 genericpath.py:53(getmtime)
4 0.000 0.000 0.000 0.000 _collections_abc.py:657(get)
1 0.000 0.000 0.000 0.000 __init__.py:548(__init__)
1 0.000 0.000 0.000 0.000 __init__.py:617(update)
4/2 0.000 0.000 0.000 0.000 combine.py:392(_infer_tile_ids_from_nested_list)
1 0.000 0.000 0.001 0.001 combine.py:522(_auto_combine)
2 0.000 0.000 0.000 0.000 merge.py:100(__init__)
5 0.000 0.000 0.000 0.000 coordinates.py:38(__iter__)
5 0.000 0.000 0.000 0.000 coordinates.py:169()
1 0.000 0.000 0.000 0.000 dataset.py:666(_replace_vars_and_dims)
5 0.000 0.000 0.000 0.000 dataset.py:1078(data_vars)
1 0.000 0.000 0.000 0.000 file_manager.py:133(_make_key)
1 0.000 0.000 0.000 0.000 file_manager.py:245(increment)
1 0.000 0.000 0.000 0.000 lru_cache.py:54(__setitem__)
1 0.000 0.000 0.000 0.000 netCDF4_.py:398(get_attrs)
1 0.000 0.000 0.000 0.000 api.py:80(_get_default_engine)
1 0.000 0.000 0.000 0.000 api.py:92(_normalize_path)
8 0.000 0.000 0.000 0.000 {method 'view' of 'numpy.ndarray' objects}
8 0.000 0.000 0.000 0.000 utils.py:187(is_dict_like)
4 0.000 0.000 0.000 0.000 utils.py:219(is_valid_numpy_dtype)
10 0.000 0.000 0.000 0.000 utils.py:319(__iter__)
1 0.000 0.000 0.000 0.000 {method 'filepath' of 'netCDF4._netCDF4.Dataset' objects}
4 0.000 0.000 0.000 0.000 common.py:434(is_datetime64tz_dtype)
3 0.000 0.000 0.000 0.000 config.py:107(normalize_key)
3 0.000 0.000 0.000 0.000 core.py:160()
6 0.000 0.000 0.000 0.000 core.py:966(ndim)
4 0.000 0.000 0.000 0.000 indexing.py:791(decompose_indexer)
8 0.000 0.000 0.000 0.000 {method '__exit__' of '_thread.lock' objects}
3 0.000 0.000 0.000 0.000 {method 'replace' of 'str' objects}
4 0.000 0.000 0.000 0.000 {method 'split' of 'str' objects}
1 0.000 0.000 0.000 0.000 posixpath.py:121(splitext)
1 0.000 0.000 0.000 0.000 genericpath.py:117(_splitext)
1 0.000 0.000 0.001 0.001 combine.py:443(_combine_nd)
1 0.000 0.000 0.000 0.000 combine.py:508()
14 0.000 0.000 0.000 0.000 merge.py:41(unique_variable)
11 0.000 0.000 0.000 0.000 coordinates.py:163(_names)
1 0.000 0.000 0.000 0.000 dataset.py:2593(_assert_all_in_dataset)
1 0.000 0.000 0.000 0.000 variables.py:55(__init__)
1 0.000 0.000 0.000 0.000 file_manager.py:269(__init__)
29 0.000 0.000 0.000 0.000 file_manager.py:273(__hash__)
1 0.000 0.000 0.001 0.001 netCDF4_.py:392(get_variables)
1 0.000 0.000 0.000 0.000 netCDF4_.py:410()
7 0.000 0.000 0.000 0.000 {method 'set_auto_chartostring' of 'netCDF4._netCDF4.Variable' objects}
1 0.000 0.000 0.000 0.000 {method 'ncattrs' of 'netCDF4._netCDF4.Dataset' objects}
4 0.000 0.000 0.000 0.000 common.py:472(is_timedelta64_dtype)
4 0.000 0.000 0.000 0.000 common.py:980(is_unsigned_integer_dtype)
4 0.000 0.000 0.000 0.000 base.py:3805(_coerce_to_ndarray)
3 0.000 0.000 0.000 0.000 itertoolz.py:241(unique)
11 0.000 0.000 0.000 0.000 core.py:137()
3 0.000 0.000 0.000 0.000 indexing.py:600(__init__)
2 0.000 0.000 0.000 0.000 {method 'keys' of 'collections.OrderedDict' objects}
2 0.000 0.000 0.000 0.000 {built-in method _thread.allocate_lock}
1 0.000 0.000 0.000 0.000 {built-in method _collections._count_elements}
8 0.000 0.000 0.000 0.000 {method 'encode' of 'str' objects}
3 0.000 0.000 0.000 0.000 {method 'rfind' of 'str' objects}
8 0.000 0.000 0.000 0.000 {method 'add' of 'set' objects}
3 0.000 0.000 0.000 0.000 {method 'intersection' of 'set' objects}
7 0.000 0.000 0.000 0.000 {method 'setdefault' of 'dict' objects}
13 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}
1 0.000 0.000 0.000 0.000 posixpath.py:64(isabs)
1 0.000 0.000 0.015 0.015 posixpath.py:178(lexists)
1 0.000 0.000 0.000 0.000 posixpath.py:232(expanduser)
2 0.000 0.000 0.000 0.000 _collections_abc.py:672(keys)
7 0.000 0.000 0.000 0.000 contextlib.py:352(__init__)
7 0.000 0.000 0.000 0.000 contextlib.py:355(__enter__)
2 0.000 0.000 0.000 0.000 combine.py:496(vars_as_keys)
2 0.000 0.000 0.000 0.000 combine.py:517(_new_tile_id)
7 0.000 0.000 0.000 0.000 common.py:29(_decode_variable_name)
1 0.000 0.000 0.000 0.000 coordinates.py:160(__init__)
3 0.000 0.000 0.000 0.000 dataset.py:262(__iter__)
2 0.000 0.000 0.000 0.000 dataset.py:266(__len__)
2 0.000 0.000 0.000 0.000 dataset.py:940(__iter__)
1 0.000 0.000 0.000 0.000 dataset.py:1071(coords)
7 0.000 0.000 0.000 0.000 dataset.py:1381()
4 0.000 0.000 0.000 0.000 variables.py:61(dtype)
1 0.000 0.000 0.000 0.000 file_manager.py:189(__del__)
1 0.000 0.000 0.000 0.000 lru_cache.py:47(_enforce_size_limit)
1 0.000 0.000 0.000 0.000 netCDF4_.py:138(_nc4_require_group)
1 0.000 0.000 0.000 0.000 netCDF4_.py:408(get_encoding)
1 0.000 0.000 0.000 0.000 api.py:66(_get_default_engine_netcdf)
4 0.000 0.000 0.000 0.000 utils.py:197()
1 0.000 0.000 0.000 0.000 alignment.py:17(_get_joiner)
10 0.000 0.000 0.000 0.000 alignment.py:184(is_alignable)
5 0.000 0.000 0.000 0.000 alignment.py:226()
5 0.000 0.000 0.000 0.000 utils.py:325(__contains__)
5 0.000 0.000 0.000 0.000 {method 'isunlimited' of 'netCDF4._netCDF4.Dimension' objects}
8 0.000 0.000 0.000 0.000 inference.py:435(is_hashable)
12 0.000 0.000 0.000 0.000 common.py:119()
8 0.000 0.000 0.000 0.000 common.py:127()
8 0.000 0.000 0.000 0.000 common.py:122(classes_and_not_datetimelike)
4 0.000 0.000 0.000 0.000 base.py:675(dtype)
8 0.000 0.000 0.000 0.000 base.py:1395(nlevels)
24 0.000 0.000 0.000 0.000 functoolz.py:15(identity)
1 0.000 0.000 0.000 0.000 base.py:610(normalize_dict)
1 0.000 0.000 0.000 0.000 base.py:625(normalize_seq)
3 0.000 0.000 0.000 0.000 indexing.py:453(__init__)
4 0.000 0.000 0.000 0.000 indexing.py:713()
3 0.000 0.000 0.000 0.000 variable.py:821(chunks)
4 0.000 0.000 0.000 0.000 variable.py:1731(chunk)
8 0.000 0.000 0.000 0.000 variable.py:1874(name)
3 0.000 0.000 0.000 0.000 {method 'values' of 'collections.OrderedDict' objects}
6 0.000 0.000 0.000 0.000 {built-in method posix.fspath}
1 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
4 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
3 0.000 0.000 0.000 0.000 {method 'copy' of 'set' objects}
1 0.000 0.000 0.000 0.000 {method 'union' of 'set' objects}
1 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
2 0.000 0.000 0.000 0.000 posixpath.py:41(_get_sep)
1 0.000 0.000 0.000 0.000 _collections_abc.py:680(values)
9 0.000 0.000 0.000 0.000 _collections_abc.py:698(__init__)
7 0.000 0.000 0.000 0.000 contextlib.py:358(__exit__)
1 0.000 0.000 0.000 0.000 glob.py:145(has_magic)
1 0.000 0.000 0.000 0.000 combine.py:428()
2 0.000 0.000 0.000 0.000 merge.py:301(_get_priority_vars)
1 0.000 0.000 0.000 0.000 merge.py:370(extract_indexes)
1 0.000 0.000 0.000 0.000 merge.py:378(assert_valid_explicit_coords)
5 0.000 0.000 0.000 0.000 dataset.py:259(__init__)
1 0.000 0.000 0.000 0.000 dataset.py:375()
2 0.000 0.000 0.000 0.000 dataset.py:416(attrs)
5 0.000 0.000 0.000 0.000 dataset.py:428(encoding)
1 0.000 0.000 0.000 0.000 dataset.py:436(encoding)
1 0.000 0.000 0.000 0.000 dataset.py:1373()
1 0.000 0.000 0.000 0.000 variables.py:76(lazy_elemwise_func)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
7 0.000 0.000 0.000 0.000 strings.py:39(__init__)
1 0.000 0.000 0.000 0.000 file_manager.py:241(__init__)
1 0.000 0.000 0.000 0.000 locks.py:206(ensure_lock)
1 0.000 0.000 0.000 0.000 netCDF4_.py:236(__init__)
1 0.000 0.000 0.000 0.000 api.py:638()
1 0.000 0.000 0.000 0.000 utils.py:452(_tostr)
7 0.000 0.000 0.000 0.000 {method 'set_auto_maskandscale' of 'netCDF4._netCDF4.Variable' objects}
1 0.000 0.000 0.000 0.000 utils.py:514(is_grib_path)
3 0.000 0.000 0.000 0.000 core.py:989(name)
8 0.000 0.000 0.000 0.000 variable.py:1834(to_index_variable)
1 0.000 0.000 0.000 0.000 {method 'rstrip' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'endswith' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'keys' of 'dict' objects}
1 0.000 0.000 0.000 0.000 glob.py:22(iglob)
2 0.000 0.000 0.000 0.000 variable.py:2007()
1 0.000 0.000 0.000 0.000 combine.py:345(_auto_concat)
1 0.000 0.000 0.000 0.000 combine.py:435()
1 0.000 0.000 0.000 0.000 merge.py:519()
2 0.000 0.000 0.000 0.000 dataset.py:934(__len__)
2 0.000 0.000 0.000 0.000 variables.py:106(safe_setitem)
1 0.000 0.000 0.000 0.000 api.py:479(__init__)
1 0.000 0.000 0.000 0.000 utils.py:20(_check_inplace)
7 0.000 0.000 0.000 0.000 {method 'chunking' of 'netCDF4._netCDF4.Variable' objects}
4 0.000 0.000 0.000 0.000 utils.py:498(close_on_error)
1 0.000 0.000 0.000 0.000 numeric.py:101(_assert_safe_casting)
3 0.000 0.000 0.000 0.000 core.py:167()
Output of ds:
```
Dimensions:   (bnds: 2, lat: 360, level: 23, lon: 576, time: 1827)
Coordinates:
  * lat       (lat) float64 -89.75 -89.25 -88.75 -88.25 ... 88.75 89.25 89.75
  * level     (level) float32 1000.0 925.0 850.0 775.0 700.0 ... 5.0 3.0 2.0 1.0
  * lon       (lon) float64 0.3125 0.9375 1.562 2.188 ... 358.4 359.1 359.7
  * time      (time) float64 7.671e+03 7.672e+03 ... 9.496e+03 9.497e+03
Dimensions without coordinates: bnds
Data variables:
    lat_bnds  (lat, bnds) float64 dask.array
    lon_bnds  (lon, bnds) float64 dask.array
    sphum     (time, level, lat, lon) float32 dask.array
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-464113917,https://api.github.com/repos/pydata/xarray/issues/1385,464113917,MDEyOklzc3VlQ29tbWVudDQ2NDExMzkxNw==,30007270,2019-02-15T16:34:02Z,2019-02-15T16:34:35Z,NONE,"On a related note, is it possible to clear out the memory used by the xarray dataset after it is no longer needed?
Here's an example:
```python
fname = '/work/xrc/AM4_xrc/c192L33_am4p0_cmip6Diag/daily/5yr/atmos.19800101-19841231.ucomp.nc'
```
```python
import xarray as xr
```
```python
with xr.set_options(file_cache_maxsize=1):
%time ds = xr.open_mfdataset(fname)
```
CPU times: user 48 ms, sys: 124 ms, total: 172 ms
Wall time: 29.7 s
```python
fname2 = '/work/xrc/AM4_xrc/c192L33_am4p0_cmip6Diag/daily/5yr/atmos.20100101-20141231.ucomp.nc'
```
```python
with xr.set_options(file_cache_maxsize=1):
%time ds = xr.open_mfdataset(fname2) # would like this to free up memory used by fname
```
CPU times: user 39 ms, sys: 124 ms, total: 163 ms
Wall time: 28.8 s
```python
import gc
gc.collect()
```
```python
with xr.set_options(file_cache_maxsize=1): # expected to take same time as first call
%time ds = xr.open_mfdataset(fname)
```
CPU times: user 28 ms, sys: 10 ms, total: 38 ms
Wall time: 37.9 ms
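On the freeing-memory question: `Dataset.close()` (or a `with` block) releases the underlying file handles, which is usually what keeps a cached file alive. A minimal sketch on an in-memory dataset, where `close()` is simply a harmless no-op:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'u': ('x', np.arange(3))})

# For datasets backed by files (open_dataset / open_mfdataset), close()
# releases every underlying file handle; here it is a no-op.
ds.close()

# The context-manager form closes automatically on exit:
# with xr.open_mfdataset(fname) as ds:
#     ...
```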
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-463367754,https://api.github.com/repos/pydata/xarray/issues/1385,463367754,MDEyOklzc3VlQ29tbWVudDQ2MzM2Nzc1NA==,30007270,2019-02-13T20:58:52Z,2019-02-13T20:59:06Z,NONE,"It seems my issue has to do with the time coordinate:
```
fname = '/work/xrc/AM4_xrc/c192L33_am4p0_cmip6Diag/daily/5yr/atmos.20100101-20141231.sphum.nc'
%prun ds = xr.open_mfdataset(fname,drop_variables='time')
7510 function calls (7366 primitive calls) in 0.068 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.039 0.039 0.039 0.039 netCDF4_.py:244(_open_netcdf4_group)
3 0.022 0.007 0.022 0.007 {built-in method _operator.getitem}
1 0.001 0.001 0.001 0.001 {built-in method posix.lstat}
125/113 0.000 0.000 0.001 0.000 indexing.py:504(shape)
11 0.000 0.000 0.000 0.000 core.py:137()
fname = '/work/xrc/AM4_xrc/c192L33_am4p0_cmip6Diag/daily/5yr/atmos.20000101-20041231.sphum.nc'
%prun ds = xr.open_mfdataset(fname)
13143 function calls (12936 primitive calls) in 23.853 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
6 23.791 3.965 23.791 3.965 {built-in method _operator.getitem}
1 0.029 0.029 0.029 0.029 netCDF4_.py:244(_open_netcdf4_group)
2 0.023 0.012 0.023 0.012 {cftime._cftime.num2date}
1 0.001 0.001 0.001 0.001 {built-in method posix.lstat}
158/139 0.000 0.000 0.001 0.000 indexing.py:504(shape)
```
Both files are 33 GB. This is using xarray 0.11.3.
I also confirm that nc.MFDataset is much faster (<1s).
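For reference, the manual-time-coordinate workaround mentioned below could be sketched like this (hypothetical; assumes daily data on a standard calendar, with an in-memory stand-in for a dataset opened with `decode_times=False`):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for a dataset opened with decode_times=False:
# raw integer time values, daily data, standard calendar assumed.
ds = xr.Dataset({'sphum': ('time', np.zeros(3))},
                coords={'time': np.arange(3)})

# Overwrite the raw axis with dates built from the known calendar,
# skipping per-value decoding entirely.
ds = ds.assign_coords(time=pd.date_range('2000-01-01',
                                         periods=ds.sizes['time'],
                                         freq='D'))
```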
Is there any speed-up for the time coordinates possible, given that my data follows a standard calendar? (Short of using `drop_variables='time'` and then manually adding the time coordinate...)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-439478904,https://api.github.com/repos/pydata/xarray/issues/1385,439478904,MDEyOklzc3VlQ29tbWVudDQzOTQ3ODkwNA==,30007270,2018-11-16T18:10:53Z,2018-11-16T18:10:53Z,NONE,"h5netcdf fails with the following error (presumably the file is not compatible):
```
/nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
97 if swmr and swmr_support:
98 flags |= h5f.ACC_SWMR_READ
---> 99 fid = h5f.open(name, flags, fapl=fapl)
100 elif mode == 'r+':
101 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.open()
OSError: Unable to open file (file signature not found)
```
Using scipy:
```
ncalls tottime percall cumtime percall filename:lineno(function)
65/42 80.448 1.238 80.489 1.916 {built-in method numpy.core.multiarray.array}
764838 0.548 0.000 0.548 0.000 core.py:169()
3 0.169 0.056 0.717 0.239 core.py:169()
2 0.041 0.021 0.041 0.021 {cftime._cftime.num2date}
3 0.038 0.013 0.775 0.258 core.py:173(getem)
1 0.024 0.024 81.313 81.313 :1()
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-439445695,https://api.github.com/repos/pydata/xarray/issues/1385,439445695,MDEyOklzc3VlQ29tbWVudDQzOTQ0NTY5NQ==,30007270,2018-11-16T16:20:25Z,2018-11-16T16:20:25Z,NONE,"Sorry, I think the speedup had to do with accessing a file that had previously been loaded rather than due to `decode_cf`. Here's the output of `prun` using two different files of approximately the same size (~75 GB), run from a notebook without using distributed (which doesn't lead to any speedup):
Output of `%prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/atmos_level.1999010100-2000123123.sphum.nc', chunks={'lat':20,'time':50,'lon':12,'pfull':11})`:
```
780980 function calls (780741 primitive calls) in 55.374 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
7 54.448 7.778 54.448 7.778 {built-in method _operator.getitem}
764838 0.473 0.000 0.473 0.000 core.py:169()
3 0.285 0.095 0.758 0.253 core.py:169()
2 0.041 0.020 0.041 0.020 {cftime._cftime.num2date}
3 0.040 0.013 0.821 0.274 core.py:173(getem)
1 0.027 0.027 55.374 55.374 :1()
```
Output of `%prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/atmos_level.2001010100-2002123123.temp.nc', chunks={'lat':20,'time':50,'lon':12,'pfull':11}, decode_cf=False)`:
```
772212 function calls (772026 primitive calls) in 56.000 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
5 55.213 11.043 55.214 11.043 {built-in method _operator.getitem}
764838 0.486 0.000 0.486 0.000 core.py:169()
3 0.185 0.062 0.671 0.224 core.py:169()
3 0.041 0.014 0.735 0.245 core.py:173(getem)
1 0.027 0.027 56.001 56.001 :1()
```
/work isn't a remote archive, so it surprises me that this should happen. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-439042364,https://api.github.com/repos/pydata/xarray/issues/1385,439042364,MDEyOklzc3VlQ29tbWVudDQzOTA0MjM2NA==,30007270,2018-11-15T13:37:16Z,2018-11-15T14:06:04Z,NONE,"Yes, I'm on 0.11.
Nothing displays on the task stream/ progress bar when using `open_mfdataset`, although I can monitor progress when, say, computing the mean.
The output from `%time` using `decode_cf=False` is
```
CPU times: user 4.42 s, sys: 392 ms, total: 4.82 s
Wall time: 4.74 s
```
and for `decode_cf=True`:
```
CPU times: user 11.6 s, sys: 1.61 s, total: 13.2 s
Wall time: 3min 28s
```
Using `xr.set_options(file_cache_maxsize=1)` doesn't make any noticeable difference.
If I repeat `open_mfdataset` for another 5 files (after opening the first 5), I occasionally get this warning:
`distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)`
I only began using the dashboard recently; please let me know if there's something basic I'm missing.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135
https://github.com/pydata/xarray/issues/1385#issuecomment-438870575,https://api.github.com/repos/pydata/xarray/issues/1385,438870575,MDEyOklzc3VlQ29tbWVudDQzODg3MDU3NQ==,30007270,2018-11-15T00:32:42Z,2018-11-15T00:32:42Z,NONE,"I can confirm that
```
ds = xr.open_mfdataset(data_fnames,
                       chunks={'lat':20,'time':50,'lon':24,'pfull':11},
                       decode_cf=False)
ds = xr.decode_cf(ds)
```
is much faster (seconds vs minutes) than
```
ds = xr.open_mfdataset(data_fnames,chunks={'lat':20,'time':50,'lon':24,'pfull':11})
```
For reference, `data_fnames` is a list of 5 files, each of which is ~75 GB.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,224553135