id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
279456192,MDU6SXNzdWUyNzk0NTYxOTI=,1761,Importing xarray fails if old version of bottleneck is installed,1882397,closed,0,,,5,2017-12-05T17:10:25Z,2020-02-09T21:39:48Z,2020-02-09T21:39:48Z,NONE,,,,"Importing version 0.11 of xarray fails if version 1.0.0 of Bottleneck is installed. Bottleneck is an optional dependency of xarray: at runtime xarray replaces functions with their bottleneck counterparts if bottleneck is installed, but it does not check whether the installed bottleneck version is new enough to provide those functions. The `getattr` here fails with an AttributeError in this case:

https://github.com/pydata/xarray/blob/b46fcd656391d786b8d25b0615f6d4bd30b524b7/xarray/core/ops.py#L361-L365

`AttributeError: 'module' object has no attribute 'move_argmax'`

`move_argmax` was added to bottleneck in version 1.1.0, so if version 1.0 is installed this can't work. I saw this on Python 2.7, but I don't think that should matter.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1761/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
372006204,MDU6SXNzdWUzNzIwMDYyMDQ=,2496,Incorrect conversion from sliced pd.MultiIndex,1882397,closed,0,,,2,2018-10-19T15:25:38Z,2019-02-19T09:42:52Z,2019-02-19T09:42:51Z,NONE,,,,"If we take a pandas DataFrame with a MultiIndex, slice it to remove some entries from the index, and then convert it, the resulting DataArray still contains the removed items in the coordinates (although the values are NaN).

```python
# We create an example dataframe
idx = pd.MultiIndex.from_product([list('abc'), list('xyz')])
df = pd.DataFrame(data={'col': np.random.randn(len(idx))}, index=idx)
df.columns.name = 'cols'
df.index.names = ['idx1', 'idx2']
df2 = df.loc[['a', 'b']]
```

```python
# df2 does not contain `c` in the first level
>>> df2
cols            col
idx1 idx2
a    x    -0.844476
     y    -0.845998
     z     1.965143
b    x    -0.159293
     y     0.188163
     z    -1.076204

# It still shows up in the converted xarray though:
>>> xr.DataArray(df2).unstack('dim_0')
array([[[-0.844476, -0.845998,  1.965143],
        [-0.159293,  0.188163, -1.076204],
        [      nan,       nan,       nan]]])
Coordinates:
  * cols     (cols) object 'col'
  * idx1     (idx1) object 'a' 'b' 'c'
  * idx2     (idx2) object 'x' 'y' 'z'
```

If the original dataframe is very sparse, this can lead to gigantic unnecessary memory usage.
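A possible workaround on the pandas side (a sketch of mine, not from the report above; it relies on `MultiIndex.remove_unused_levels`, available since pandas 0.20) is to drop the now-unused level values from the sliced index before converting:

```python
# Drop unused MultiIndex levels so 'c' never reaches the DataArray coordinates.
# Workaround sketch only; it does not address the underlying conversion issue.
df2 = df.loc[['a', 'b']]
df2.index = df2.index.remove_unused_levels()
xr.DataArray(df2).unstack('dim_0')  # idx1 now only contains 'a' and 'b'
```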
#### Output of ``xr.show_versions()``

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_GB.UTF-8
LANG: None
LOCALE: en_GB.UTF-8
xarray: 0.10.9
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: None
cftime: 1.0.0b1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.19.2
distributed: 1.23.2
matplotlib: 3.0.0
cartopy: None
seaborn: 0.9.0
setuptools: 40.4.3
pip: 18.0
conda: 4.5.11
pytest: 3.8.1
IPython: 7.0.1
sphinx: 1.8.1
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2496/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
355264812,MDU6SXNzdWUzNTUyNjQ4MTI=,2389,Large pickle overhead in ds.to_netcdf() involving dask.delayed functions,1882397,closed,0,,,11,2018-08-29T17:43:28Z,2019-01-13T21:17:12Z,2019-01-13T21:17:12Z,NONE,,,,"If we use `ds.to_netcdf` to write a dask array that doesn't involve `dask.delayed` functions, there is only a little pickle overhead:

```python
vals = da.random.random(500, chunks=(1,))
ds = xr.Dataset({'vals': (['a'], vals)})
write = ds.to_netcdf('file2.nc', compute=False)
%prun -stime -l10 write.compute()
```

```
         123410 function calls (104395 primitive calls) in 13.720 seconds

   Ordered by: internal time
   List reduced from 203 to 10 due to restriction <10>

     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          8   10.032    1.254   10.032    1.254 {method 'acquire' of '_thread.lock' objects}
       1001    2.939    0.003    2.950    0.003 {built-in method _pickle.dumps}
       1001    0.614    0.001    3.569    0.004 pickle.py:30(dumps)
  6504/1002    0.012    0.000    0.021    0.000 utils.py:803(convert)
 11507/1002    0.010    0.000    0.019    0.000 utils_comm.py:144(unpack_remotedata)
       6013    0.009    0.000    0.009    0.000 utils.py:767(tokey)
  3002/1002    0.008    0.000    0.017    0.000 utils_comm.py:181()
      11512    0.007    0.000    0.008    0.000 core.py:26(istask)
       1002    0.006    0.000    3.589    0.004 worker.py:788(dumps_task)
          1    0.005    0.005    0.007    0.007 core.py:273()
```

But if we use results from `dask.delayed`, pickle takes up most of the time:

```python
@dask.delayed
def make_data():
    return np.array(np.random.randn())

vals = da.stack([da.from_delayed(make_data(), (), np.float64) for _ in range(500)])
ds = xr.Dataset({'vals': (['a'], vals)})
write = ds.to_netcdf('file5.nc', compute=False)
%prun -stime -l10 write.compute()
```

```
         115045243 function calls (104115443 primitive calls) in 67.240 seconds

   Ordered by: internal time
   List reduced from 292 to 10 due to restriction <10>

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 8120705/501   17.597    0.000   59.036    0.118 pickle.py:457(save)
 2519027/501    7.581    0.000   59.032    0.118 pickle.py:723(save_tuple)
           4    6.978    1.745    6.978    1.745 {method 'acquire' of '_thread.lock' objects}
     3082150    5.362    0.000    8.748    0.000 pickle.py:413(memoize)
    11474396    4.516    0.000    5.970    0.000 pickle.py:213(write)
     8121206    4.186    0.000    5.202    0.000 pickle.py:200(commit_frame)
    13747943    2.703    0.000    2.703    0.000 {method 'get' of 'dict' objects}
    17057538    1.887    0.000    1.887    0.000 {built-in method builtins.id}
     4568116    1.772    0.000    1.782    0.000 {built-in method _struct.pack}
     2762513    1.613    0.000    2.826    0.000 pickle.py:448(get)
```

This additional pickle overhead does not happen if we compute the dataset without writing it to a file.
Output of `%prun -stime -l10 ds.compute()` without `dask.delayed`:

```
         83856 function calls (73348 primitive calls) in 0.566 seconds

   Ordered by: internal time
   List reduced from 259 to 10 due to restriction <10>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         4    0.441    0.110    0.441    0.110 {method 'acquire' of '_thread.lock' objects}
       502    0.013    0.000    0.013    0.000 {method 'send' of '_socket.socket' objects}
       500    0.011    0.000    0.011    0.000 {built-in method _pickle.dumps}
      1000    0.007    0.000    0.008    0.000 core.py:159(get_dependencies)
      3500    0.007    0.000    0.007    0.000 utils.py:767(tokey)
  3000/500    0.006    0.000    0.010    0.000 utils.py:803(convert)
       500    0.005    0.000    0.019    0.000 pickle.py:30(dumps)
         1    0.004    0.004    0.008    0.008 core.py:3826(concatenate3)
  4500/500    0.004    0.000    0.008    0.000 utils_comm.py:144(unpack_remotedata)
         1    0.004    0.004    0.017    0.017 order.py:83(order)
```

With `dask.delayed`:

```
         149376 function calls (139868 primitive calls) in 1.738 seconds

   Ordered by: internal time
   List reduced from 264 to 10 due to restriction <10>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         4    1.568    0.392    1.568    0.392 {method 'acquire' of '_thread.lock' objects}
         1    0.015    0.015    0.038    0.038 optimization.py:455(fuse)
       502    0.012    0.000    0.012    0.000 {method 'send' of '_socket.socket' objects}
      6500    0.010    0.000    0.010    0.000 utils.py:767(tokey)
 5500/1000    0.009    0.000    0.012    0.000 utils_comm.py:144(unpack_remotedata)
      2500    0.008    0.000    0.009    0.000 core.py:159(get_dependencies)
       500    0.007    0.000    0.009    0.000 client.py:142(__init__)
      1000    0.005    0.000    0.008    0.000 core.py:280(subs)
 2000/1000    0.005    0.000    0.008    0.000 utils.py:803(convert)
         1    0.004    0.004    0.022    0.022 order.py:83(order)
```
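For reference, here is a small parametrised version of the delayed-backed reproduction above (the helper name, the file name, and the chunk counts are mine, not from the original report), which makes it easy to check how the overhead scales with the number of `dask.delayed` tasks:

```python
import numpy as np
import dask
import dask.array as da
import xarray as xr


def make_delayed_ds(n):
    # Build a Dataset whose 1-d variable is backed by n dask.delayed tasks,
    # mirroring the construction in the snippet above.
    @dask.delayed
    def make_data():
        return np.array(np.random.randn())

    vals = da.stack([da.from_delayed(make_data(), (), np.float64) for _ in range(n)])
    return xr.Dataset({'vals': (['a'], vals)})


# e.g. profile the write at different sizes (in IPython):
#   write = make_delayed_ds(1000).to_netcdf('scaling_test.nc', compute=False)
#   %prun -stime -l10 write.compute()
```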
I am using `dask.distributed`. I haven't tested it with anything else.

#### Software versions
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_GB.UTF-8
LANG: None
LOCALE: en_GB.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: 2.2.2
cartopy: None
seaborn: 0.9.0
setuptools: 40.2.0
pip: 18.0
conda: 4.5.11
pytest: 3.7.3
IPython: 6.5.0
sphinx: 1.7.7
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2389/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
342426261,MDU6SXNzdWUzNDI0MjYyNjE=,2299,Confusing behaviour with MultiIndex,1882397,closed,0,6815844,,1,2018-07-18T17:41:12Z,2018-08-13T22:16:31Z,2018-08-13T22:16:31Z,NONE,,,,"`Dataset` allows assignment of new variables with dimension names that are used in a MultiIndex, even if the lengths do not match the existing coordinate.

```python
a = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).unstack('a')
a.index.names = ['dim0', 'dim1']
a.index.name = 'stacked_dim'

b = xr.Dataset(coords={'dim0': ['a', 'b'], 'dim1': [0, 1]})
b = b.stack(dim_stacked=['dim0', 'dim1'])
assert(len(b.dim0) == 4)

# This should raise an error because the length is != 4
b['c'] = (('dim0',), [10, 11])
b
```

Instead, it reports `dim0` as a new dimension without coordinates:

```
Dimensions:      (dim0: 2, dim_stacked: 4)
Coordinates:
  * dim_stacked  (dim_stacked) MultiIndex
  - dim0         (dim_stacked) object 'a' 'a' 'b' 'b'
  - dim1         (dim_stacked) int64 0 1 0 1
Dimensions without coordinates: dim0
Data variables:
    c            (dim0) int64 10 11
```

Similar cases with coordinates that are not used as dimensions do raise an error:

```python
ds = xr.Dataset()
ds.coords['a'] = [1, 2, 3]
ds = ds.sel(a=1)
ds['b'] = (('a',), [1, 2])
ds
```

A possible stop-gap guard is sketched below, after the version output.

#### Output of ``xr.show_versions()``
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_GB.UTF-8
LANG: None
LOCALE: en_GB.UTF-8
xarray: 0.10.7
pandas: 0.23.2
numpy: 1.14.5
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.1
distributed: 1.22.0
matplotlib: 2.2.2
cartopy: None
seaborn: 0.8.1
setuptools: 39.2.0
pip: 10.0.1
conda: 4.5.8
pytest: 3.6.2
IPython: 6.4.0
sphinx: 1.7.5
```
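As mentioned above, a manual guard can catch the length mismatch before the assignment (a sketch of mine, not part of the report; `new_data` is a hypothetical name and `b` is the stacked Dataset from the example):

```python
# Refuse the assignment when 'dim0' already exists as a (MultiIndex level)
# coordinate whose length differs from the new data: the kind of check the
# report expects xarray itself to perform.
new_data = [10, 11]
if 'dim0' in b.coords and b.coords['dim0'].size != len(new_data):
    msg = 'dim0 has length {}, but the new variable has length {}'
    raise ValueError(msg.format(b.coords['dim0'].size, len(new_data)))
b['c'] = (('dim0',), new_data)
```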
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2299/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue