id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type 1503046820,I_kwDOAMm_X85Zlqyk,7388,Xarray does not support full range of netcdf-python compression options,1197350,closed,0,,,22,2022-12-19T14:21:17Z,2023-12-21T15:43:06Z,2023-12-21T15:24:17Z,MEMBER,,,,"### What is your issue? ### Summary The [netcdf4-python API docs](https://unidata.github.io/netcdf4-python/#Dataset.createVariable) say the following > If the optional keyword argument compression is set, the data will be compressed in the netCDF file using the specified compression algorithm. Currently `zlib`,`szip`,`zstd`,`bzip2`,`blosc_lz`,`blosc_lz4`,`blosc_lz4hc`, `blosc_zlib` and `blosc_zstd` are supported. Default is None (no compression). All of the compressors except `zlib` and `szip` use the HDF5 plugin architecture. > > If the optional keyword `zlib` is True, the data will be compressed in the netCDF file using zlib compression (default False). The use of this option is deprecated in favor of `compression='zlib'`. Although `compression` is considered a valid encoding option by Xarray https://github.com/pydata/xarray/blob/bbe63ab657e9cb16a7cbbf6338a8606676ddd7b0/xarray/backends/netCDF4_.py#L232-L242 ...it appears that we silently ignores the `compression` option when creating new netCDF4 variables: https://github.com/pydata/xarray/blob/bbe63ab657e9cb16a7cbbf6338a8606676ddd7b0/xarray/backends/netCDF4_.py#L488-L501 ### Code example ```python shape = (10, 20) chunksizes = (1, 10) encoding = { 'compression': 'zlib', 'shuffle': True, 'complevel': 8, 'fletcher32': False, 'contiguous': False, 'chunksizes': chunksizes } da = xr.DataArray( data=np.random.rand(*shape), dims=['y', 'x'], name=""foo"", attrs={""bar"": ""baz""} ) da.encoding = encoding ds = da.to_dataset() fname = ""test.nc"" ds.to_netcdf(fname, engine=""netcdf4"", mode=""w"") with xr.open_dataset(fname, engine=""netcdf4"") as ds1: display(ds1.foo.encoding) ``` ``` {'zlib': False, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': False, 'chunksizes': (1, 10), 'source': 'test.nc', 'original_shape': (10, 20), 'dtype': dtype('float64'), '_FillValue': nan} ``` In addition to showing that `compression` is ignored, this also reveals several other encoding options that are not available when writing data from xarray (`szip`, `zstd`, `bzip2`, `blosc`). ### Proposal We should align with the recommendation from the netcdf4 docs and support `compression=` style encoding in NetCDF. We should deprecate `zlib=True` syntax.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7388/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 1983894219,PR_kwDOAMm_X85e8V31,8428,Add mode='a-': Do not overwrite coordinates when appending to Zarr with `append_dim`,1197350,closed,0,,,3,2023-11-08T15:41:58Z,2023-12-01T04:21:57Z,2023-12-01T03:58:54Z,MEMBER,,0,pydata/xarray/pulls/8428,"This implements the 1b option described in #8427. 
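A minimal usage sketch of the new mode (the `a-` name comes from this PR's title; the exact call pattern is an illustrative assumption, not copied from the implementation, reusing `ds2` and `store` from the example in #8427):

```python
# append along time, but leave already-written coordinate arrays untouched
ds2.to_zarr(store, mode='a-', append_dim='time', consolidated=False)
```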
- [x] Closes #8427
- [x] Tests added
- [x] User visible changes (including notable bug fixes) are documented in `whats-new.rst` ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8428/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
1983891070,I_kwDOAMm_X852P8Z-,8427,Ambiguous behavior with coordinates when appending to Zarr store with append_dim,1197350,closed,0,,,4,2023-11-08T15:40:19Z,2023-12-01T03:58:56Z,2023-12-01T03:58:55Z,MEMBER,,,,"### What happened?

There are two quite different scenarios covered by ""append"" with Zarr

- Adding new variables to a dataset
- Extending arrays along a dimension (via `append_dim`)

This issue is about what should happen when using `append_dim` with variables that _do not contain `append_dim`_. Here's the current behavior.

```python
import numpy as np
import xarray as xr
import zarr

ds1 = xr.DataArray(
    np.array([1, 2, 3]).reshape(3, 1, 1),
    dims=('time', 'y', 'x'),
    coords={'x': [1], 'y': [2]},
    name=""foo""
).to_dataset()

ds2 = xr.DataArray(
    np.array([4, 5]).reshape(2, 1, 1),
    dims=('time', 'y', 'x'),
    coords={'x': [-1], 'y': [-2]},
    name=""foo""
).to_dataset()

# how concat works: data are aligned
ds_concat = xr.concat([ds1, ds2], dim=""time"")
assert ds_concat.dims == {""time"": 5, ""y"": 2, ""x"": 2}

# now do a Zarr append
store = zarr.storage.MemoryStore()
ds1.to_zarr(store, consolidated=False)

# we do not check that the coordinates are aligned--just that they have the same shape and dtype
ds2.to_zarr(store, append_dim=""time"", consolidated=False)

ds_append = xr.open_zarr(store, consolidated=False)

# coordinates data have been overwritten
assert ds_append.dims == {""time"": 5, ""y"": 1, ""x"": 1}
# ...with the latest values
assert ds_append.x.data[0] == -1
```

Currently, we _always write all data variables in this scenario_. That includes overwriting the coordinates every time we append. That makes appending more expensive than it needs to be. I don't think that is the behavior most users want or expect.

### What did you expect to happen?

There are a couple of different options we could consider for how to handle this ""extending"" situation (with `append_dim`)

1. [current behavior] Do not attempt to align coordinates
    a. [current behavior] Overwrite coordinates with new data
    b. Keep original coordinates
    c. Force the user to explicitly drop the coordinates, as we do for `region` operations.
2. Attempt to align coordinates
    a. Fail if coordinates don't match
    b. Extend the arrays to replicate the behavior of `concat`

We currently do 1a. **I propose to switch to 1b**. I think it is closer to what users want, and it requires less I/O.

### Anything else we need to know?

_No response_

### Environment
INSTALLED VERSIONS ------------------ commit: None python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 5.10.176-157.645.amzn2.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.2 libnetcdf: 4.9.2 xarray: 2023.10.1 pandas: 2.1.2 numpy: 1.24.4 scipy: 1.11.3 netCDF4: 1.6.5 pydap: installed h5netcdf: 1.2.0 h5py: 3.10.0 Nio: None zarr: 2.16.0 cftime: 1.6.2 nc_time_axis: 1.4.1 PseudoNetCDF: None iris: None bottleneck: 1.3.7 dask: 2023.10.1 distributed: 2023.10.1 matplotlib: 3.8.0 cartopy: 0.22.0 seaborn: 0.13.0 numbagg: 0.6.0 fsspec: 2023.10.0 cupy: None pint: 0.22 sparse: 0.14.0 flox: 0.8.1 numpy_groupies: 0.10.2 setuptools: 68.2.2 pip: 23.3.1 conda: None pytest: 7.4.3 mypy: None IPython: 8.16.1 sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8427/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 350899839,MDU6SXNzdWUzNTA4OTk4Mzk=,2368,Let's list all the netCDF files that xarray can't open,1197350,closed,0,,,32,2018-08-15T17:41:13Z,2023-11-30T04:36:42Z,2023-11-30T04:36:42Z,MEMBER,,,,"At the Pangeo developers meetings, I am hearing lots of reports from folks like @dopplershift and @rsignell-usgs about netCDF datasets that xarray can't open. My expectation is that xarray doesn't have strong requirements on the contents of datasets. (It doesn't ""enforce"" cf compatibility for example; that's optional.) Anything that can be written to netCDF should be readable by xarray. I would like to collect examples of places where xarray fails. So far, I am only aware of one: - Self-referential multidimensional coordinates (#2233). Datasets which contain variables like `siglay(siglay, node)`. Only `siglay(siglay)` would work. __Are there other distinct cases?__ Please provide links / sample code of netCDF datasets that xarray can't read. Even better would be short code snippets to create such datasets in python using the netcdf4 interface.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2368/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 1935984485,I_kwDOAMm_X85zZMdl,8290,Potential performance optimization for Zarr backend,1197350,closed,0,,,0,2023-10-10T18:41:19Z,2023-10-13T16:38:58Z,2023-10-13T16:38:58Z,MEMBER,,,,"### What is your issue? We have identified an inefficiency in the way the `ZarrArrayWrapper` works. This class currently stores a reference to a `ZarrStore` and a variable name https://github.com/pydata/xarray/blob/75af56c33a29529269a73bdd00df2d3af17ee0f5/xarray/backends/zarr.py#L63-L68 When accessing the array, the parent group of the array is read and used to open a new Zarr array. https://github.com/pydata/xarray/blob/75af56c33a29529269a73bdd00df2d3af17ee0f5/xarray/backends/zarr.py#L83-L84 This is a relatively metadata-intensive operation for Zarr. It requires reading both the group metadata and the array metadata. Because of how this wrapper works, these operations currently happen _every time data is read from the array_. If we have a dask array wrapping the zarr array with thousands of chunks, these metadata operations will happen within every single task. For high latency stores, this is really bad. Instead, we should just reference the `zarr.Array` object directly within the `ZarrArrayWrapper`. It's lightweight and easily serializable. There is no need to re-open the array each time we want to read data from it. 
This change will lead to an immediate performance enhancement in all Zarr operations.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8290/reactions"", ""total_count"": 6, ""+1"": 4, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 2, ""eyes"": 0}",,completed,13221727,issue 357808970,MDExOlB1bGxSZXF1ZXN0MjEzNzM2NTAx,2405,WIP: don't create indexes on multidimensional dimensions,1197350,closed,0,,,7,2018-09-06T20:13:11Z,2023-07-19T18:33:17Z,2023-07-19T18:33:17Z,MEMBER,,0,pydata/xarray/pulls/2405," - [x] Closes #2368, Closes #2233 - [ ] Tests added (for all bug fixes or enhancements) - [ ] Tests passed (for all non-documentation changes) - [ ] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later) This is just a start to the solution proposed in #2368. A surprisingly small number of tests broke in my local environment.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2405/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 401874795,MDU6SXNzdWU0MDE4NzQ3OTU=,2697,read ncml files to create multifile datasets,1197350,closed,0,,,18,2019-01-22T17:33:08Z,2023-05-29T13:41:38Z,2023-05-29T13:41:38Z,MEMBER,,,,"This issue was motivated by a recent conversation with @jdha regarding how they are preparing inputs for regional ocean models. They are currently using ncml with netcdf-java to consolidate and homogenize diverse data sources. But this approach doesn't play well with the xarray / dask stack. [ncml](https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/) is standard developed by Unidata for use with their netCDF-java library: > NcML is an XML representation of netCDF metadata, (approximately) the header information one gets from a netCDF file with the ""ncdump -h"" command. In addition to describing individual netCDF files, ncml can be used to annotate modifications to netCDF metadata (attributes, dimension names, etc.) and also to [aggregate](https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/Aggregation.html) multiple files into a single logical dataset. This is what such an aggregation over an existing dimension looks like in ncml: ```xml ``` Obviously this maps very well to xarray's `concat` operation. Similar aggregations can be defined that map to `merge` operations. I think it would be great if we could support the ncml spec in xarray, allowing us to write code like ```python ds = xr.open_ncml('file.ncml') ``` This idea has been discussed before in #893. Perhaps it's time has finally come. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2697/reactions"", ""total_count"": 7, ""+1"": 7, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 1231184996,I_kwDOAMm_X85JYmRk,6588,Support lazy concatenation *without dask*,1197350,closed,0,,,2,2022-05-10T13:40:20Z,2023-03-10T18:40:22Z,2022-05-10T15:38:20Z,MEMBER,,,,"### Is your feature request related to a problem? Right now, if I want to concatenate multiple datasets (e.g. 
as in `open_mfdataset`), I have two options: - Eagerly load the data as numpy arrays ➡️ xarray will dispatch to np.concatenate - Chunk each dataset ➡️ xarray will dispatch to dask.array.concatenate In pseudocode: ```python ds1 = xr.open_dataset(""some_big_lazy_source_1.nc"") ds2 = xr.open_dataset(""some_big_lazy_source_2.nc"") item1 = ds1.foo[0, 0, 0] # lazily access a single item ds = xr.concat([ds1.chunk(), ds2.chunk()], ""time"") # only way to lazily concat # trying to access the same item will now trigger loading of all of ds1 item1 = ds.foo[0, 0, 0] # yes I could use different chunks, but the point is that I should not have to # arbitrarily choose chunks to make this work ``` However, I am increasingly encountering scenarios where I would like to lazily concatenate datasets (without loading into memory), but also without the requirement of using dask. This would be useful, for example, for creating composite datasets that point back to an OpenDAP server, preserving the possibility of granular lazy access to any array element without the requirement of arbitrary chunking at an intermediate stage. ### Describe the solution you'd like I propose to extend our LazilyIndexedArray classes to support simple concatenation and stacking. The result of applying concat to such arrays will be a new LazilyIndexedArray that wraps the underlying arrays into a single object. The main difficulty in implementing this will probably be with indexing: the concatenated array will need to understand how to map global indexes to the underling individual array indexes. That is a little tricky but eminently solvable. ### Describe alternatives you've considered The alternative is to structure your code in a way that avoids needing to lazily concatenate arrays. That is what we do now. It is not optimal. ### Additional context _No response_","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6588/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 1260047355,I_kwDOAMm_X85LGsv7,6662,Obscure h5netcdf http serialization issue with python's http.server,1197350,closed,0,,,6,2022-06-03T15:28:15Z,2022-06-04T22:13:05Z,2022-06-04T22:13:05Z,MEMBER,,,,"### What is your issue? In Pangeo Forge, we try to test our ability to read data over http. This often surfaces edge cases involving xarray and fsspec. This is one such edge case. However, it is kind of important, because it affects our ability to reliably test http-based datasets using python's built-in http server. Here is some code that: - Creates a tiny dataset on disk - Serves it over http via `python -m http.server` - Opens the dataset with fsspec and xarray with the h5netcdf engine - Pickles the dataset, loads it, and calls `.load()` to load the data into memory As you can see, this works with a local file, but not with the http file, with h5py raising a checksum-related error. 
```python import fsspec import xarray as xr from pickle import dumps, loads ds_orig = xr.tutorial.load_dataset('tiny') ds_orig fname = 'tiny.nc' ds_orig.to_netcdf(fname, engine='netcdf4') # now start an http server in a terminal in the same working directory # $ python -m http.server def open_pickle_and_reload(path): with fsspec.open(path, mode='rb') as fp: with xr.open_dataset(fp, engine='h5netcdf') as ds1: pass # pickle it and reload it ds2 = loads(dumps(ds1)) ds2.load() open_pickle_and_reload(fname) # works url = f'http://127.0.0.1:8000/{fname}' open_pickle_and_reload(url) # OSError: Unable to open file (incorrect metadata checksum after all read attempts) ```
full traceback ``` --------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~/Code/xarray/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock) 198 try: --> 199 file = self._cache[self._key] 200 except KeyError: ~/Code/xarray/xarray/backends/lru_cache.py in __getitem__(self, key) 52 with self._lock: ---> 53 value = self._cache[key] 54 self._cache.move_to_end(key) KeyError: [, (,), 'r', (('decode_vlen_strings', True), ('invalid_netcdf', None))] During handling of the above exception, another exception occurred: OSError Traceback (most recent call last) in 24 open_pickle_and_reload(fname) # works 25 url = f'[http://127.0.0.1:8000/{fname}'](http://127.0.0.1:8000/%7Bfname%7D'%3C/span%3E) ---> 26 open_pickle_and_reload(url) # OSError: Unable to open file (incorrect metadata checksum after all read attempts) in open_pickle_and_reload(path) 20 # pickle it and reload it 21 ds2 = loads(dumps(ds1)) ---> 22 ds2.load() # works 23 24 open_pickle_and_reload(fname) # works ~/Code/xarray/xarray/core/dataset.py in load(self, **kwargs) 687 for k, v in self.variables.items(): 688 if k not in lazy_data: --> 689 v.load() 690 691 return self ~/Code/xarray/xarray/core/variable.py in load(self, **kwargs) 442 self._data = as_compatible_data(self._data.compute(**kwargs)) 443 elif not is_duck_array(self._data): --> 444 self._data = np.asarray(self._data) 445 return self 446 ~/Code/xarray/xarray/core/indexing.py in __array__(self, dtype) 654 655 def __array__(self, dtype=None): --> 656 self._ensure_cached() 657 return np.asarray(self.array, dtype=dtype) 658 ~/Code/xarray/xarray/core/indexing.py in _ensure_cached(self) 651 def _ensure_cached(self): 652 if not isinstance(self.array, NumpyIndexingAdapter): --> 653 self.array = NumpyIndexingAdapter(np.asarray(self.array)) 654 655 def __array__(self, dtype=None): ~/Code/xarray/xarray/core/indexing.py in __array__(self, dtype) 624 625 def __array__(self, dtype=None): --> 626 return np.asarray(self.array, dtype=dtype) 627 628 def __getitem__(self, key): ~/Code/xarray/xarray/core/indexing.py in __array__(self, dtype) 525 def __array__(self, dtype=None): 526 array = as_indexable(self.array) --> 527 return np.asarray(array[self.key], dtype=None) 528 529 def transpose(self, order): ~/Code/xarray/xarray/backends/h5netcdf_.py in __getitem__(self, key) 49 50 def __getitem__(self, key): ---> 51 return indexing.explicit_indexing_adapter( 52 key, self.shape, indexing.IndexingSupport.OUTER_1VECTOR, self._getitem 53 ) ~/Code/xarray/xarray/core/indexing.py in explicit_indexing_adapter(key, shape, indexing_support, raw_indexing_method) 814 """""" 815 raw_key, numpy_indices = decompose_indexer(key, shape, indexing_support) --> 816 result = raw_indexing_method(raw_key.tuple) 817 if numpy_indices.tuple: 818 # index the loaded np.ndarray ~/Code/xarray/xarray/backends/h5netcdf_.py in _getitem(self, key) 58 key = tuple(list(k) if isinstance(k, np.ndarray) else k for k in key) 59 with self.datastore.lock: ---> 60 array = self.get_array(needs_lock=False) 61 return array[key] 62 ~/Code/xarray/xarray/backends/h5netcdf_.py in get_array(self, needs_lock) 45 class H5NetCDFArrayWrapper(BaseNetCDF4Array): 46 def get_array(self, needs_lock=True): ---> 47 ds = self.datastore._acquire(needs_lock) 48 return ds.variables[self.variable_name] 49 ~/Code/xarray/xarray/backends/h5netcdf_.py in _acquire(self, needs_lock) 180 181 def _acquire(self, needs_lock=True): --> 182 with self._manager.acquire_context(needs_lock) as root: 
183 ds = _nc4_require_group( 184 root, self._group, self._mode, create_group=_h5netcdf_create_group /opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/contextlib.py in __enter__(self) 117 del self.args, self.kwds, self.func 118 try: --> 119 return next(self.gen) 120 except StopIteration: 121 raise RuntimeError(""generator didn't yield"") from None ~/Code/xarray/xarray/backends/file_manager.py in acquire_context(self, needs_lock) 185 def acquire_context(self, needs_lock=True): 186 """"""Context manager for acquiring a file."""""" --> 187 file, cached = self._acquire_with_cache_info(needs_lock) 188 try: 189 yield file ~/Code/xarray/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock) 203 kwargs = kwargs.copy() 204 kwargs[""mode""] = self._mode --> 205 file = self._opener(*self._args, **kwargs) 206 if self._mode == ""w"": 207 # ensure file doesn't get overridden when opened again /opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/h5netcdf/core.py in __init__(self, path, mode, invalid_netcdf, phony_dims, **kwargs) 719 else: 720 self._preexisting_file = mode in {""r"", ""r+"", ""a""} --> 721 self._h5file = h5py.File(path, mode, **kwargs) 722 except Exception: 723 self._closed = True /opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, **kwds) 505 fs_persist=fs_persist, fs_threshold=fs_threshold, 506 fs_page_size=fs_page_size) --> 507 fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr) 508 509 if isinstance(libver, tuple): /opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr) 218 if swmr and swmr_support: 219 flags |= h5f.ACC_SWMR_READ --> 220 fid = h5f.open(name, flags, fapl=fapl) 221 elif mode == 'r+': 222 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl) h5py/_objects.pyx in h5py._objects.with_phil.wrapper() h5py/_objects.pyx in h5py._objects.with_phil.wrapper() h5py/h5f.pyx in h5py.h5f.open() OSError: Unable to open file (incorrect metadata checksum after all read attempts) (external_url) ```
Strangely, a similar workflow _does work_ with http files hosted elsewhere, e.g. ```python external_url = 'https://power-datastore.s3.amazonaws.com/v9/climatology/power_901_rolling_zones_utc.nc' open_pickle_and_reload(external_url) ``` This suggests there is something peculiar about python's `http.server` as compared to other http servers that makes this break. I would appreciate any thoughts or ideas about what might be going on here (pinging @martindurant and @shoyer) xref: - https://github.com/pangeo-forge/pangeo-forge-recipes/pull/373 - https://github.com/pydata/xarray/issues/4242 - https://github.com/google/xarray-beam/issues/49","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6662/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 333312849,MDU6SXNzdWUzMzMzMTI4NDk=,2237,why time grouping doesn't preserve chunks,1197350,closed,0,,,30,2018-06-18T15:12:38Z,2022-05-15T02:44:06Z,2022-05-15T02:38:30Z,MEMBER,,,,"#### Code Sample, a copy-pastable example if possible I am continuing my quest to obtain more efficient time grouping for calculation of climatologies and climatological anomalies. I believe this is one of the major performance bottlenecks facing xarray users today. I have raised this in other issues (e.g. #1832), but I believe I have narrowed it down here to a more specific problem. The easiest way to summarize the problem is with an example. Consider the following dataset ```python import xarray as xr ds = xr.Dataset({'foo': (['x'], [1, 1, 1, 1])}, coords={'x': (['x'], [0, 1, 2, 3]), 'bar': (['x'], ['a', 'a', 'b', 'b']), 'baz': (['x'], ['a', 'b', 'a', 'b'])}) ds = ds.chunk({'x': 2}) ds ``` ``` Dimensions: (x: 4) Coordinates: * x (x) int64 0 1 2 3 bar (x) baz (x) Data variables: foo (x) int64 dask.array ``` One non-dimension coordinate (`bar`) is contiguous with respect to `x` while the other `baz` is not. This is important. `baz` is structured similar to the way that `month` would be distributed on a timeseries dataset. Now let's do a trivial groupby operation on `bar` that does nothing, just returns the group unchanged: ```python ds.foo.groupby('bar').apply(lambda x: x) ``` ``` dask.array Coordinates: * x (x) int64 0 1 2 3 bar (x) baz (x) ``` This operation *preserved this original chunks in `foo`*. But if we group by `baz` we see something different ```python ds.foo.groupby('baz').apply(lambda x: x) ``` ``` dask.array Coordinates: * x (x) int64 0 1 2 3 bar (x) baz (x) ``` #### Problem description When grouping over a non-contiguous variable (`baz`) the result has no chunks. That means that we can't lazily access a single item without computing the whole array. This has major performance consequences that make it hard to calculate anomaly values in a more realistic case. What we really want to do is often something like ``` ds = xr.open_mfdataset('lots/of/files/*.nc') ds_anom = ds.groupby('time.month').apply(lambda x: x - x.mean(dim='time) ``` It is currently impossible to do this lazily due to the issue described above. #### Expected Output We would like to preserve the original chunk structure of `foo`. #### Output of ``xr.show_versions()`` `xr.show_versions()` is triggering a segfault right now on my system for unknown reasons! I am using xarray 0.10.7. 
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2237/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 413589315,MDU6SXNzdWU0MTM1ODkzMTU=,2785,error decoding cftime time_bnds over opendap with pydap,1197350,closed,0,,,2,2019-02-22T21:38:24Z,2021-07-21T14:51:36Z,2021-07-21T14:51:36Z,MEMBER,,,,"#### Code Sample, a copy-pastable example if possible I try to load the following dataset over opendap with the pydap engine. It only works if I do decode_times=False ```python url = 'http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NOAA-GFDL/GFDL-AM4/amip/r1i1p1f1/Amon/ta/gr1/v20180807/ta_Amon_GFDL-AM4_amip_r1i1p1f1_gr1_198001-201412.nc' ds = xr.open_dataset(url, decode_times=False, engine='pydap') xr.decode_times(ds) ``` raises ``` --------------------------------------------------------------------------- IndexError Traceback (most recent call last) in () 1 #ds.time_bnds.load() ----> 2 xr.decode_cf(ds) ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables) 459 vars, attrs, coord_names = decode_cf_variables( 460 vars, attrs, concat_characters, mask_and_scale, decode_times, --> 461 decode_coords, drop_variables=drop_variables) 462 ds = Dataset(vars, attrs=attrs) 463 ds = ds.set_coords(coord_names.union(extra_coords).intersection(vars)) ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/conventions.py in decode_cf_variables(variables, attributes, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables) 392 k, v, concat_characters=concat_characters, 393 mask_and_scale=mask_and_scale, decode_times=decode_times, --> 394 stack_char_dim=stack_char_dim) 395 if decode_coords: 396 var_attrs = new_vars[k].attrs ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/conventions.py in decode_cf_variable(name, var, concat_characters, mask_and_scale, decode_times, decode_endianness, stack_char_dim) 298 for coder in [times.CFTimedeltaCoder(), 299 times.CFDatetimeCoder()]: --> 300 var = coder.decode(var, name=name) 301 302 dimensions, data, attributes, encoding = ( ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/coding/times.py in decode(self, variable, name) 410 units = pop_to(attrs, encoding, 'units') 411 calendar = pop_to(attrs, encoding, 'calendar') --> 412 dtype = _decode_cf_datetime_dtype(data, units, calendar) 413 transform = partial( 414 decode_cf_datetime, units=units, calendar=calendar) ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/coding/times.py in _decode_cf_datetime_dtype(data, units, calendar) 116 values = indexing.ImplicitToExplicitIndexingAdapter( 117 indexing.as_indexable(data)) --> 118 example_value = np.concatenate([first_n_items(values, 1) or [0], 119 last_item(values) or [0]]) 120 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/formatting.py in first_n_items(array, n_desired) 94 from_end=False) 95 array = array[indexer] ---> 96 return np.asarray(array).flat[:n_desired] 97 98 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order) 529 530 """""" --> 531 return array(a, dtype, copy=False, order=order) 532 533 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in __array__(self, dtype) 630 631 def __array__(self, dtype=None): --> 632 self._ensure_cached() 633 
return np.asarray(self.array, dtype=dtype) 634 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in _ensure_cached(self) 627 def _ensure_cached(self): 628 if not isinstance(self.array, NumpyIndexingAdapter): --> 629 self.array = NumpyIndexingAdapter(np.asarray(self.array)) 630 631 def __array__(self, dtype=None): ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order) 529 530 """""" --> 531 return array(a, dtype, copy=False, order=order) 532 533 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in __array__(self, dtype) 608 609 def __array__(self, dtype=None): --> 610 return np.asarray(self.array, dtype=dtype) 611 612 def __getitem__(self, key): ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order) 529 530 """""" --> 531 return array(a, dtype, copy=False, order=order) 532 533 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in __array__(self, dtype) 514 def __array__(self, dtype=None): 515 array = as_indexable(self.array) --> 516 return np.asarray(array[self.key], dtype=None) 517 518 def transpose(self, order): ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/conventions.py in __getitem__(self, key) 43 44 def __getitem__(self, key): ---> 45 return np.asarray(self.array[key], dtype=self.dtype) 46 47 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order) 529 530 """""" --> 531 return array(a, dtype, copy=False, order=order) 532 533 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in __array__(self, dtype) 514 def __array__(self, dtype=None): 515 array = as_indexable(self.array) --> 516 return np.asarray(array[self.key], dtype=None) 517 518 def transpose(self, order): ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/backends/pydap_.py in __getitem__(self, key) 24 def __getitem__(self, key): 25 return indexing.explicit_indexing_adapter( ---> 26 key, self.shape, indexing.IndexingSupport.BASIC, self._getitem) 27 28 def _getitem(self, key): ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in explicit_indexing_adapter(key, shape, indexing_support, raw_indexing_method) 785 if numpy_indices.tuple: 786 # index the loaded np.ndarray --> 787 result = NumpyIndexingAdapter(np.asarray(result))[numpy_indices] 788 return result 789 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in __getitem__(self, key) 1174 def __getitem__(self, key): 1175 array, key = self._indexing_array_and_key(key) -> 1176 return array[key] 1177 1178 def __setitem__(self, key, value): IndexError: too many indices for array ``` Strangely, I can overcome the error by first explicitly loading (or dropping) the `time_bnds` variable: ```python ds.time_bnds.load() xr.decode_cf(ds) ``` I wish this would work without the `.load()` step. I think it has something to do with the many layers of array wrappers involved in lazy opening. The problem does not occur with the netcdf4 engine. I know this is a very obscure problem, but I thought I would open an issue to document. #### Output of ``xr.show_versions()``
INSTALLED VERSIONS ------------------ commit: None python: 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2 xarray: 0.11.3 pandas: 0.23.4 numpy: 1.13.1 scipy: 0.19.1 netCDF4: 1.4.2 pydap: installed h5netcdf: None h5py: None Nio: None zarr: 2.2.1.dev126+dirty cftime: 1.0.3.4 PseudonetCDF: None rasterio: None cfgrib: None iris: None bottleneck: 1.2.1 cyordereddict: None dask: 0.20.2 distributed: 1.24.2 matplotlib: 2.1.0 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 40.6.2 pip: 18.1 conda: None pytest: 4.0.0 IPython: 6.1.0 sphinx: 1.6.5
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2785/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 745801652,MDU6SXNzdWU3NDU4MDE2NTI=,4591,"Serialization issue with distributed, h5netcdf, and fsspec (ImplicitToExplicitIndexingAdapter)",1197350,closed,0,,,12,2020-11-18T16:18:42Z,2021-06-30T17:53:54Z,2020-11-19T15:54:38Z,MEMBER,,,,"This was originally reported by @jkingslake at https://github.com/pangeo-data/pangeo-datastore/issues/116. **What happened**: I tried to open a netcdf file over http using fsspec and the h5netcdf engine and compute data using dask.distributed. It appears that our `ImplicitToExplicitIndexingAdapter` is [no longer?] serializable? **What you expected to happen**: Things would work. Indeed, I could swear this _used to work_ with previous versions. **Minimal Complete Verifiable Example**: ```python import xarray as xr import fsspec from dask.distributed import Client # example needs to use distributed to reproduce the bug client = Client() url = 'https://storage.googleapis.com/ldeo-glaciology/bedmachine/BedMachineAntarctica_2019-11-05_v01.nc' with fsspec.open(url, mode='rb') as openfile: dsc = xr.open_dataset(openfile, chunks=3000) dsc.surface.mean().compute() ``` raises the following error ``` Traceback (most recent call last): File ""/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/core.py"", line 50, in dumps data = { File ""/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/core.py"", line 51, in key: serialize( File ""/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/serialize.py"", line 277, in serialize raise TypeError(msg, str(x)[:10000]) TypeError: ('Could not serialize object of type ImplicitToExplicitIndexingAdapter.', 'ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=, key=BasicIndexer((slice(None, None, None), slice(None, None, None))))))') distributed.comm.utils - ERROR - ('Could not serialize object of type ImplicitToExplicitIndexingAdapter.', 'ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=, key=BasicIndexer((slice(None, None, None), slice(None, None, None))))))') ``` **Anything else we need to know?**: One can work around this by using the netcdf4 library's new and undocumented [ability to open files over http](https://github.com/Unidata/netcdf4-python/issues/1043#issuecomment-697313022). ```python url = 'https://storage.googleapis.com/ldeo-glaciology/bedmachine/BedMachineAntarctica_2019-11-05_v01.nc#mode=bytes' ds = xr.open_dataset(url, engine='netcdf4', chunks=3000) ds ``` However, the fsspec + h5netcdf path _should_ work! **Environment**:
Output of xr.show_versions() ``` INSTALLED VERSIONS ------------------ commit: None python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 4.19.112+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.16.1 pandas: 1.1.3 numpy: 1.19.2 scipy: 1.5.2 netCDF4: 1.5.4 pydap: installed h5netcdf: 0.8.1 h5py: 2.10.0 Nio: None zarr: 2.4.0 cftime: 1.2.1 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: 1.1.7 cfgrib: 0.9.8.4 iris: None bottleneck: 1.3.2 dask: 2.30.0 distributed: 2.30.0 matplotlib: 3.3.2 cartopy: 0.18.0 seaborn: None numbagg: None pint: 0.16.1 setuptools: 49.6.0.post20201009 pip: 20.2.4 conda: None pytest: 6.1.1 IPython: 7.18.1 sphinx: 3.2.1 ``` Also fsspec 0.8.4
cc @martindurant for fsspec integration.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4591/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 836391524,MDU6SXNzdWU4MzYzOTE1MjQ=,5056,"Allow ""unsafe"" mode for zarr writing",1197350,closed,0,,,1,2021-03-19T21:57:47Z,2021-04-26T16:37:43Z,2021-04-26T16:37:43Z,MEMBER,,,,"Curently, `Dataset.to_zarr` will only write Zarr datasets in cases in which - The Dataset arrays are in memory (no dask) - The arrays are chunked with dask with a one-to-many relationship between dask chunks and zarr chunks If I try to violate the one-to-many condition, I get an error ```python import xarray as xr ds = xr.DataArray([0, 1., 2], name='foo').chunk({'dim_0': 1}).to_dataset() d = ds.to_zarr('test.zarr', encoding={'foo': {'chunks': (3,)}}, compute=False) ``` ``` /srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/zarr.py in _determine_zarr_chunks(enc_chunks, var_chunks, ndim, name) 148 for dchunk in dchunks[:-1]: 149 if dchunk % zchunk: --> 150 raise NotImplementedError( 151 f""Specified zarr chunks encoding['chunks']={enc_chunks_tuple!r} for "" 152 f""variable named {name!r} would overlap multiple dask chunks {var_chunks!r}. "" NotImplementedError: Specified zarr chunks encoding['chunks']=(3,) for variable named 'foo' would overlap multiple dask chunks ((1, 1, 1),). This is not implemented in xarray yet. Consider either rechunking using `chunk()` or instead deleting or modifying `encoding['chunks']`. ``` In this case, the error is particularly frustrating because I'm not even writing any data yet. (Also related to #2300, #4046, #4380). There are at least two scenarios in which we might want to have more flexibility. 1. The case above, when we want to lazily initialize a Zarr array based on a Dataset, without actually computing anything. 2. The more general case, where we actually write arrays with many-to-many dask-chunk <-> zarr-chunk relationships For 1, I propose we add a new option like `safe_chunks=True` to `to_zarr`. `safe_chunks=False` would permit just bypassing this chunk. For 2, we could consider implementing locks. This probably has to be done at the Dask level. But is actually [not super hard](https://github.com/pangeo-forge/pangeo-forge/blob/c42ead11cf2643e815d353637ecb305973b86a53/pangeo_forge/utils.py#L38-L61) to deterministically figure out which chunks need to share a lock. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5056/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 837243943,MDExOlB1bGxSZXF1ZXN0NTk3NjA4NTg0,5065,Zarr chunking fixes,1197350,closed,0,,,32,2021-03-22T01:35:22Z,2021-04-26T16:37:43Z,2021-04-26T16:37:43Z,MEMBER,,0,pydata/xarray/pulls/5065," - [x] Closes #2300, closes #5056 - [x] Tests added - [x] Passes `pre-commit run --all-files` - [x] User visible changes (including notable bug fixes) are documented in `whats-new.rst` This PR contains two small, related updates to how Zarr chunks are handled. 1. We now delete the `encoding` attribute at the Variable level whenever `chunk` is called. The persistence of `chunk` encoding has been the source of lots of confusion (see #2300, #4046, #4380, https://github.com/dcs4cop/xcube/issues/347) 2. 
Added a new option called `safe_chunks` in `to_zarr` which allows for bypassing the requirement of the many-to-one relationship between Zarr chunks and Dask chunks (see #5056). Both these touch the internal logic for how chunks are handled, so I thought it was easiest to tackle them with a single PR.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5065/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 859945463,MDU6SXNzdWU4NTk5NDU0NjM=,5172,Inconsistent attribute handling between netcdf4 and h5netcdf engines,1197350,closed,0,,,3,2021-04-16T15:54:03Z,2021-04-20T14:00:34Z,2021-04-16T17:13:26Z,MEMBER,,,," I have found a netCDF file that cannot be decoded by xarray via the h5netcdf engine but CAN be decoded via netCDF4. This could be considered an h5netcdf bug, but I thought I would raise it first here for visibility. This file will reproduce the bug ``` ! wget 'https://esgf-world.s3.amazonaws.com/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/r1i1p1f1/Lmon/cLeaf/gr/v20190118/cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc' ``` ```python import netCDF4 import h5netcdf.legacyapi as netCDF4_h5 local_path = ""cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc"" with netCDF4_h5.Dataset(local_path, mode='r') as ncfile: print('h5netcdf:', ncfile['cLeaf'].getncattr(""coordinates"")) with netCDF4.Dataset(local_path, mode='r') as ncfile: #assert ""coordinates"" not in ncfile['cLeaf'].attrs print('netCDF4:', ncfile['cLeaf'].getncattr(""coordinates"")) ``` ``` h5netcdf: Empty(dtype=dtype('S1')) netCDF4: ``` As we can see, we get an empty string `''` in netCDF4 but a `` object from h5netcdf. This weird attribute prevents xarray from decoding the dataset. We could: - Fix it in xarray, but having special handling for this sort of `Empty` object - Fix it in h5netcdf **Environment**:
Output of xr.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 4.19.150+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.17.0 pandas: 1.2.3 numpy: 1.20.2 scipy: 1.6.2 netCDF4: 1.5.6 pydap: installed h5netcdf: 0.10.0 h5py: 3.1.0 Nio: None zarr: 2.7.0 cftime: 1.4.1 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: 1.2.1 cfgrib: 0.9.8.5 iris: None bottleneck: 1.3.2 dask: 2021.03.1 distributed: 2021.03.1 matplotlib: 3.3.4 cartopy: 0.18.0 seaborn: None numbagg: None pint: 0.17 setuptools: 49.6.0.post20210108 pip: 20.3.4 conda: None pytest: None IPython: 7.22.0 sphinx: None
xref https://github.com/pangeo-forge/pangeo-forge/issues/105","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5172/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 548607657,MDU6SXNzdWU1NDg2MDc2NTc=,3689,Decode CF bounds to coords,1197350,closed,0,,,5,2020-01-12T18:23:26Z,2021-04-19T03:32:26Z,2021-04-19T03:32:26Z,MEMBER,,,,"CF conventions define [Cell Boundaries](http://cfconventions.org/cf-conventions/cf-conventions.html#cell-boundaries) and specify how to encode the presence of cell boundary variables in dataset attributes. > To represent cells we add the attribute bounds to the appropriate coordinate variable(s). The value of `bounds` is the name of the variable that contains the vertices of the cell boundaries. For example consider this dataset: `http://esgf-data.ucar.edu/thredds/dodsC/esg_dataroot/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_200001-201412.nc` ```python url = 'http://esgf-data.ucar.edu/thredds/dodsC/esg_dataroot/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_200001-201412.nc' ds = xr.open_dataset(url) ds ``` gives ``` Dimensions: (lat: 192, lon: 288, nbnd: 2, time: 180) Coordinates: * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0 * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8 * time (time) object 2000-01-15 12:00:00 ... 2014-12-15 12:00:00 Dimensions without coordinates: nbnd Data variables: time_bnds (time, nbnd) object ... lat_bnds (lat, nbnd) float64 ... lon_bnds (lon, nbnd) float64 ... tas (time, lat, lon) float32 ... ``` Despite the presence of the bounds attributes ``` >>> print(ds.time.bounds, ds.lat.bounds, ds.lon.bounds) time_bnds lat_bnds lon_bnds ``` The variables `time_bnds`, `lat_bnds`, and `lon_bnds` are not decoded as coordinates but as data variables. I believe that this is not in accordance with CF conventions. **Instead, we should decode all `bounds` variables to coordinates.** I cannot think of a single use case where one would want to treat these variables as data variables rather than coordinates. It would be easy to implement, but it is a breaking change. Not that this is just a proposal to move bounds variables to the coords part of the dataset. It does not address the more difficult / complex question of how to actually use the bounds for indexing or plotting operations (see e.g. #1475, #1613), although it could be a first step in that direction. #### Full ncdump of dataset
``` xarray.Dataset { dimensions: lat = 192 ; lon = 288 ; nbnd = 2 ; time = 180 ; variables: float64 lat(lat) ; lat:axis = Y ; lat:bounds = lat_bnds ; lat:standard_name = latitude ; lat:title = Latitude ; lat:type = double ; lat:units = degrees_north ; lat:valid_max = 90.0 ; lat:valid_min = -90.0 ; lat:_ChunkSizes = 192 ; float64 lon(lon) ; lon:axis = X ; lon:bounds = lon_bnds ; lon:standard_name = longitude ; lon:title = Longitude ; lon:type = double ; lon:units = degrees_east ; lon:valid_max = 360.0 ; lon:valid_min = 0.0 ; lon:_ChunkSizes = 288 ; object time(time) ; time:axis = T ; time:bounds = time_bnds ; time:standard_name = time ; time:title = time ; time:type = double ; time:_ChunkSizes = 512 ; object time_bnds(time, nbnd) ; time_bnds:_ChunkSizes = [1 2] ; float64 lat_bnds(lat, nbnd) ; lat_bnds:units = degrees_north ; lat_bnds:_ChunkSizes = [192 2] ; float64 lon_bnds(lon, nbnd) ; lon_bnds:units = degrees_east ; lon_bnds:_ChunkSizes = [288 2] ; float32 tas(time, lat, lon) ; tas:cell_measures = area: areacella ; tas:cell_methods = area: time: mean ; tas:comment = near-surface (usually, 2 meter) air temperature ; tas:description = near-surface (usually, 2 meter) air temperature ; tas:frequency = mon ; tas:id = tas ; tas:long_name = Near-Surface Air Temperature ; tas:mipTable = Amon ; tas:out_name = tas ; tas:prov = Amon ((isd.003)) ; tas:realm = atmos ; tas:standard_name = air_temperature ; tas:time = time ; tas:time_label = time-mean ; tas:time_title = Temporal mean ; tas:title = Near-Surface Air Temperature ; tas:type = real ; tas:units = K ; tas:variable_id = tas ; tas:_ChunkSizes = [ 1 192 288] ; // global attributes: :Conventions = CF-1.7 CMIP-6.2 ; ... [truncated] ```
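As a hedged illustration of the proposal above (a hypothetical helper, not existing xarray behavior), variables named by a coordinate's `bounds` attribute could be promoted to coordinates roughly like this:

```python
def promote_bounds_to_coords(ds):
    # hypothetical helper: move CF bounds variables (e.g. time_bnds) from
    # data variables to coordinates, following each coordinate's 'bounds' attribute
    bounds_names = [
        ds[name].attrs['bounds']
        for name in ds.coords
        if 'bounds' in ds[name].attrs and ds[name].attrs['bounds'] in ds
    ]
    return ds.set_coords(bounds_names)
```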
#### Output of ``xr.show_versions()``
INSTALLED VERSIONS ------------------ commit: None python: 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:07:37) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.6.2 xarray: 0.14.0+19.gba48fbcd pandas: 0.25.1 numpy: 1.17.2 scipy: 1.3.1 netCDF4: 1.5.1.2 pydap: None h5netcdf: 0.7.4 h5py: 2.10.0 Nio: None zarr: 2.3.2 cftime: 1.0.3.4 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: None cfgrib: 0.9.7.1 iris: None bottleneck: 1.2.1 dask: 2.4.0 distributed: 2.4.0 matplotlib: 3.1.1 cartopy: 0.17.0 seaborn: 0.9.0 numbagg: None setuptools: 41.2.0 pip: 19.2.3 conda: None pytest: 5.1.2 IPython: 7.8.0 sphinx: 1.6.5
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3689/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 99836561,MDU6SXNzdWU5OTgzNjU2MQ==,521,"time decoding error with ""days since"" ",1197350,closed,0,,,20,2015-08-08T21:54:24Z,2021-03-29T14:12:38Z,2015-08-14T17:23:26Z,MEMBER,,,,"I am trying to use xray with some CESM [POP model netCDF output](http://www.cesm.ucar.edu/models/ccsm3.0/pop/doc/POPusers_chap4.html), which supposedly follows CF-1.0 conventions. It is failing because the models time units are ""'days since 0000-01-01 00:00:00"". When calling open_dataset, I get the following error: ``` ValueError: unable to decode time units u'days since 0000-01-01 00:00:00' with the default calendar. Try opening your dataset with decode_times=False. Full traceback: Traceback (most recent call last): File ""/home/rpa/xray/xray/conventions.py"", line 372, in __init__ # Otherwise, tracebacks end up swallowed by Dataset.__repr__ when users File ""/home/rpa/xray/xray/conventions.py"", line 145, in decode_cf_datetime dates = _decode_datetime_with_netcdf4(flat_num_dates, units, calendar) File ""/home/rpa/xray/xray/conventions.py"", line 97, in _decode_datetime_with_netcdf4 dates = np.asarray(nc4.num2date(num_dates, units, calendar)) File ""netCDF4/_netCDF4.pyx"", line 4522, in netCDF4._netCDF4.num2date (netCDF4/_netCDF4.c:50388) File ""netCDF4/_netCDF4.pyx"", line 4337, in netCDF4._netCDF4._dateparse (netCDF4/_netCDF4.c:48234) ValueError: year is out of range ``` Full metadata for the time variable: ``` double time(time) ; time:long_name = ""time"" ; time:units = ""days since 0000-01-01 00:00:00"" ; time:bounds = ""time_bound"" ; time:calendar = ""noleap"" ; ``` I guess this is a problem with the underlying netCDF4 num2date package? ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/521/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 288184220,MDU6SXNzdWUyODgxODQyMjA=,1823,We need a fast path for open_mfdataset,1197350,closed,0,,,19,2018-01-12T17:01:49Z,2021-01-28T18:00:15Z,2021-01-27T17:50:09Z,MEMBER,,,,"It would be great to have a ""fast path"" option for `open_mfdataset`, in which all alignment / coordinate checking is bypassed. This would be used in cases where the user knows that many netCDF files all share the same coordinates (e.g. model output, satellite records from the same product, etc.). The coordinates would just be taken from the first file, and only the data variables would be read from all subsequent files. The only checking would be that the data variables have the correct shape. Implementing this would require some refactoring. @jbusecke mentioned that he had developed a solution for this (related to #1704), so maybe he could be the one to add this feature to xarray. This is also related to #1385.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1823/reactions"", ""total_count"": 9, ""+1"": 9, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 753965875,MDU6SXNzdWU3NTM5NjU4NzU=,4631,Decode_cf fails when scale_factor is a length-1 list,1197350,closed,0,,,4,2020-12-01T03:07:48Z,2021-01-15T18:19:56Z,2021-01-15T18:19:56Z,MEMBER,,,,"Some datasets I work with have `scale_factor` and `add_offset` encoded as length-1 lists. 
The following code worked as of Xarray 0.16.1 ```python import xarray as xr ds = xr.DataArray([0, 1, 2], name='foo', attrs={'scale_factor': [0.01], 'add_offset': [1.0]}).to_dataset() xr.decode_cf(ds) ``` In 0.16.2 (just released) and current master, it fails with this error ``` --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in 2 attrs={'scale_factor': [0.01], 3 'add_offset': [1.0]}).to_dataset() ----> 4 xr.decode_cf(ds) ~/Code/xarray/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta) 587 raise TypeError(""can only decode Dataset or DataStore objects"") 588 --> 589 vars, attrs, coord_names = decode_cf_variables( 590 vars, 591 attrs, ~/Code/xarray/xarray/conventions.py in decode_cf_variables(variables, attributes, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta) 490 and stackable(v.dims[-1]) 491 ) --> 492 new_vars[k] = decode_cf_variable( 493 k, 494 v, ~/Code/xarray/xarray/conventions.py in decode_cf_variable(name, var, concat_characters, mask_and_scale, decode_times, decode_endianness, stack_char_dim, use_cftime, decode_timedelta) 333 variables.CFScaleOffsetCoder(), 334 ]: --> 335 var = coder.decode(var, name=name) 336 337 if decode_timedelta: ~/Code/xarray/xarray/coding/variables.py in decode(self, variable, name) 271 dtype = _choose_float_dtype(data.dtype, ""add_offset"" in attrs) 272 if np.ndim(scale_factor) > 0: --> 273 scale_factor = scale_factor.item() 274 if np.ndim(add_offset) > 0: 275 add_offset = add_offset.item() AttributeError: 'list' object has no attribute 'item' ``` I'm very confused, because this feels quite similar to #4471, and I thought it was resolved #4485. However, the behavior is different with `'scale_factor': np.array([0.01])`. That works fine--no error. How might I end up with a dataset with `scale_factor` as a python list? It happens when I open a netcdf file using the `h5netcdf` engine (documented by @gerritholl in https://github.com/pydata/xarray/issues/4471#issuecomment-702018925) and then write it to zarr. The numpy array gets encoded as a list in the zarr json metadata. 🙃 This problem would go away if we could resolve the discrepancies between the two engines' treatment of scalar attributes. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4631/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 753514595,MDU6SXNzdWU3NTM1MTQ1OTU=,4624,Release 0.16.2?,1197350,closed,0,,,6,2020-11-30T14:15:55Z,2020-12-02T00:24:31Z,2020-12-01T15:09:38Z,MEMBER,,,,"Looking at our [what's new](http://xarray.pydata.org/en/latest/whats-new.html#v0-16-2-unreleased), we have quite a few important new features, as well as significant bug fixes. I propose we move towards releasing ~0.17.0~ 0.16.2 asap. (I have selfish motives for this, as I want to use the new features in production.) We can use this issue to track any PRs or issues we want to resolve before the next release. I personally am not aware of any major blockers, but other devs should feel free to edit this list. 
- [ ] #4461 - requires decisions - [x] #4618 - [x] #4621 cc @pydata/xarray ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4624/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 375663610,MDU6SXNzdWUzNzU2NjM2MTA=,2528,display_width doesn't apply to dask-backed arrays,1197350,closed,0,,,3,2018-10-30T19:49:05Z,2020-09-30T06:17:17Z,2020-09-30T06:17:17Z,MEMBER,,,,"The representation of dask-backed arrays in xarray's `__repr__` methods results in very long lines which often overflow the desired line width. Unfortunately, this can't be controlled or overridden with `xr.set_options(display_width=...)`. #### Code Sample, a copy-pastable example if possible ```python import xarray as xr xr.set_options(display_width=20) ds = (xr.DataArray(range(100)) .chunk({'dim_0': 10}) .to_dataset(name='really_long_long_name')) ds ``` ``` Dimensions: (dim_0: 100) Dimensions without coordinates: dim_0 Data variables: really_long_long_name (dim_0) int64 dask.array ``` #### Problem description [this should explain **why** the current behavior is a problem and why the expected output is a better solution.] #### Expected Output We need to decide how to abbreviate dask arrays with something more concise. I'm not sure the best way to do this. Maybe ``` really_long_long_name (dim_0) int64 dask chunks=(10,) ``` ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2528/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 614814400,MDExOlB1bGxSZXF1ZXN0NDE1MjkyMzM3,4047,Document Xarray zarr encoding conventions,1197350,closed,0,,,3,2020-05-08T15:29:14Z,2020-05-22T21:59:09Z,2020-05-20T17:04:02Z,MEMBER,,0,pydata/xarray/pulls/4047,"When we implemented the Zarr backend, we made some _ad hoc_ choices about how to encode NetCDF data in Zarr. At this stage, it would be useful to explicitly document this encoding. I decided to put it on the ""Xarray Internals"" page, but I'm open to moving if folks feel it fits better elsewhere. cc @jeffdlb, @WardF, @DennisHeimbigner","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4047/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 528884925,MDU6SXNzdWU1Mjg4ODQ5MjU=,3575,map_blocks output inference problems,1197350,closed,0,,,6,2019-11-26T17:56:11Z,2020-05-06T16:41:54Z,2020-05-06T16:41:54Z,MEMBER,,,,"I am excited about using `map_blocks` to overcome a long-standing challenge related to calculating climatologies / anomalies with dask arrays. However, I hit what feels like a bug. I don't love how the new `map_blocks` function does this: > The function will be first run on mocked-up data, that looks like ‘obj’ but has sizes 0, to determine properties of the returned object such as dtype, variable names, new dimensions and new indexes (if any). The problem is that many functions will simply error on size 0 data. 
As in the example below #### MCVE Code Sample ```python import xarray as xr ds = xr.tutorial.load_dataset('rasm').chunk({'y': 20}) def calculate_anomaly(ds): # needed to workaround xarray's check with zero dimensions #if len(ds['time']) == 0: # return ds gb = ds.groupby(""time.month"") clim = gb.mean(dim='T') return gb - clim xr.map_blocks(calculate_anomaly, ds) ``` Raises ``` --------------------------------------------------------------------------- KeyError Traceback (most recent call last) /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataset.py in _construct_dataarray(self, name) 1145 try: -> 1146 variable = self._variables[name] 1147 except KeyError: KeyError: 'time.month' During handling of the above exception, another exception occurred: AttributeError Traceback (most recent call last) /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/parallel.py in infer_template(func, obj, *args, **kwargs) 77 try: ---> 78 template = func(*meta_args, **kwargs) 79 except Exception as e: in calculate_anomaly(ds) 5 # return ds ----> 6 gb = ds.groupby(""time.month"") 7 clim = gb.mean(dim='T') /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/common.py in groupby(self, group, squeeze, restore_coord_dims) 656 return self._groupby_cls( --> 657 self, group, squeeze=squeeze, restore_coord_dims=restore_coord_dims 658 ) /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/groupby.py in __init__(self, obj, group, squeeze, grouper, bins, restore_coord_dims, cut_kwargs) 298 ) --> 299 group = obj[group] 300 if len(group) == 0: /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataset.py in __getitem__(self, key) 1235 if hashable(key): -> 1236 return self._construct_dataarray(key) 1237 else: /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataset.py in _construct_dataarray(self, name) 1148 _, name, variable = _get_virtual_variable( -> 1149 self._variables, name, self._level_coords, self.dims 1150 ) /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataset.py in _get_virtual_variable(variables, key, level_vars, dim_sizes) 157 else: --> 158 data = getattr(ref_var, var_name).data 159 virtual_var = Variable(ref_var.dims, data) AttributeError: 'IndexVariable' object has no attribute 'month' The above exception was the direct cause of the following exception: Exception Traceback (most recent call last) in 8 return gb - clim 9 ---> 10 xr.map_blocks(calculate_anomaly, ds) /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/parallel.py in map_blocks(func, obj, args, kwargs) 203 input_chunks = dataset.chunks 204 --> 205 template: Union[DataArray, Dataset] = infer_template(func, obj, *args, **kwargs) 206 if isinstance(template, DataArray): 207 result_is_array = True /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/parallel.py in infer_template(func, obj, *args, **kwargs) 80 raise Exception( 81 ""Cannot infer object returned from running user provided function."" ---> 82 ) from e 83 84 if not isinstance(template, (Dataset, DataArray)): Exception: Cannot infer object returned from running user provided function. ``` #### Problem Description We should try to imitate what dask does in `map_blocks`: https://docs.dask.org/en/latest/array-api.html#dask.array.map_blocks Specifically: - We should allow the user to override the checks by explicitly specifying output dtype and shape - Maybe the check should be on small, rather than zero size, test data #### Output of ``xr.show_versions()``
INSTALLED VERSIONS ------------------ commit: None python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.14.138+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.6.2 xarray: 0.14.0 pandas: 0.25.3 numpy: 1.17.3 scipy: 1.3.2 netCDF4: 1.5.1.2 pydap: installed h5netcdf: 0.7.4 h5py: 2.10.0 Nio: None zarr: 2.3.2 cftime: 1.0.4.2 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: 1.0.25 cfgrib: None iris: 2.2.0 bottleneck: 1.3.0 dask: 2.7.0 distributed: 2.7.0 matplotlib: 3.1.2 cartopy: 0.17.0 seaborn: 0.9.0 numbagg: None setuptools: 41.6.0.post20191101 pip: 19.3.1 conda: None pytest: 5.3.1 IPython: 7.9.0 sphinx: None 
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3575/reactions"", ""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 499477363,MDU6SXNzdWU0OTk0NzczNjM=,3349,Implement polyfit?,1197350,closed,0,,,25,2019-09-27T14:25:14Z,2020-03-25T17:17:45Z,2020-03-25T17:17:45Z,MEMBER,,,,"Fitting a line (or curve) to data along a specified axis is a long-standing need of xarray users. There are many blog posts and SO questions about how to do it: - http://atedstone.github.io/rate-of-change-maps/ - https://gist.github.com/luke-gregor/4bb5c483b2d111e52413b260311fbe43 - https://stackoverflow.com/questions/38960903/applying-numpy-polyfit-to-xarray-dataset - https://stackoverflow.com/questions/52094320/with-xarray-how-to-parallelize-1d-operations-on-a-multidimensional-dataset - https://stackoverflow.com/questions/36275052/applying-a-function-along-an-axis-of-a-dask-array The main use case in my domain is finding the temporal trend on a 3D variable (e.g. temperature in time, lon, lat). Yes, you can do it with apply_ufunc, but apply_ufunc is inaccessibly complex for many users. Much of our existing API could be removed and replaced with apply_ufunc calls, but that doesn't mean we should do it. I am proposing we add a Dataarray method called `polyfit`. It would work like this: ```python x_ = np.linspace(0, 1, 10) y_ = np.arange(5) a_ = np.cos(y_) x = xr.DataArray(x_, dims=['x'], coords={'x': x_}) a = xr.DataArray(a_, dims=['y']) f = a*x p = f.polyfit(dim='x', deg=1) # equivalent numpy code p_ = np.polyfit(x_, f.values.transpose(), 1) np.testing.assert_allclose(p_[0], a_) ``` Numpy's [polyfit](https://docs.scipy.org/doc/numpy/reference/generated/numpy.polynomial.polynomial.Polynomial.fit.html#numpy.polynomial.polynomial.Polynomial.fit) function is already vectorized in the sense that it accepts 1D x and 2D y, performing the fit independently over each column of y. To extend this to ND, we would just need to reshape the data going in and out of the function. We do this already in [other packages](https://github.com/xgcm/xcape/blob/master/xcape/core.py#L16-L34). For dask, we could simply require that the dimension over which the fit is calculated be contiguous, and then call map_blocks. Thoughts? ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3349/reactions"", ""total_count"": 9, ""+1"": 9, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 361858640,MDU6SXNzdWUzNjE4NTg2NDA=,2423,manually specify chunks in open_zarr,1197350,closed,0,,,2,2018-09-19T17:52:31Z,2020-01-09T15:21:35Z,2020-01-09T15:21:35Z,MEMBER,,,,"Currently, `open_zarr` has two possible chunking behaviors. `auto_chunk=True` (default) creates dask chunks corresponding with zarr chunks. `auto_chunk=False` creates no chunks. But what if you want to manually specify the chunks, as with `open_dataset(chunks=...)`. `open_zarr` could easily support this, but it does not currently. Note that this is *not* the same as calling `.chunk()` post dataset creation. 
That operation is very inefficient, since it begins from a single global chunk for each variable.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2423/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 396285440,MDU6SXNzdWUzOTYyODU0NDA=,2656,dataset info in .json format,1197350,closed,0,,,9,2019-01-06T19:13:34Z,2020-01-08T22:43:25Z,2019-01-21T23:25:56Z,MEMBER,,,,"I am exploring the world of [Spatio Temporal Asset Catalogs](https://github.com/radiantearth/stac-spec) (STAC), in which all datasets are described using json/ geojson: > The STAC specification aims to standardize the way geospatial assets are exposed online and queried. I am thinking about how to put the sort of datasets that xarray deals with into STAC items (see https://github.com/radiantearth/stac-spec). This would be particular valuable in the context of Pangeo and the zarr-based datasets we have been putting in cloud storage. For this purpose, it would be very useful to have a concise summary of an xarray dataset's contents (minus the actual data) in .json format. I'm talking about the kind of info we currently get from the `.info()` method, which is designed to mirror the CDL output of [`ncdump -h`](https://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf/ncdump.html). For example ```python ds = xr.Dataset({'foo': ('x', np.ones(10, 'f8'), {'units': 'm s-1'})}, {'x': ('x', np.arange(10), {'units': 'm'})}, {'conventions': 'made up'}) ds.info() ``` ``` xarray.Dataset { dimensions: x = 10 ; variables: float64 foo(x) ; foo:units = m s-1 ; int64 x(x) ; x:units = m ; // global attributes: :conventions = made up ; ``` I would like to be able to do `ds.info(format='json')` and see something like this ``` { ""coords"": { ""x"": { ""dims"": [ ""x"" ], ""attrs"": { ""units"": ""m"" } } }, ""attrs"": { ""conventions"": ""made up"" }, ""dims"": { ""x"": 10 }, ""data_vars"": { ""foo"": { ""dims"": [ ""x"" ], ""attrs"": { ""units"": ""m s-1"" } } } } ``` Which is what I get by doing `print(json.dumps(ds.to_dict(), indent=2))` and manually stripping out all the `data` fields. So an alternative api might be something like `ds.to_dict(data=False)`. If anyone is aware of an existing spec for expressing [Common Data Language](https://www.unidata.ucar.edu/software/netcdf/workshops/2011/utilities/CDL.html) in json, we should probably use that instead of inventing our own. But I think some version of this would be a very useful addition to xarray.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2656/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 288785270,MDU6SXNzdWUyODg3ODUyNzA=,1832,groupby on dask objects doesn't handle chunks well,1197350,closed,0,,,22,2018-01-16T04:50:22Z,2019-11-27T16:45:14Z,2019-06-06T20:01:40Z,MEMBER,,,,"80% of climate data analysis begins with calculating the monthly-mean climatology and subtracting it from the dataset to get an anomaly. Unfortunately this is a fail case for xarray / dask with out-of-core datasets. This is becoming a serious problem for me. 
#### Code Sample ```python # Your code here import xarray as xr import dask.array as da import pandas as pd # construct an example datatset chunked in time nt, ny, nx = 366, 180, 360 time = pd.date_range(start='1950-01-01', periods=nt, freq='10D') ds = xr.DataArray(da.random.random((nt, ny, nx), chunks=(1, ny, nx)), dims=('time', 'lat', 'lon'), coords={'time': time}).to_dataset(name='field') # monthly climatology ds_mm = ds.groupby('time.month').mean(dim='time') # anomaly ds_anom = ds.groupby('time.month')- ds_mm print(ds_anom) ``` ``` Dimensions: (lat: 180, lon: 360, time: 366) Coordinates: * time (time) datetime64[ns] 1950-01-01 1950-01-11 1950-01-21 ... month (time) int64 1 1 1 1 2 2 3 3 3 4 4 4 5 5 5 5 6 6 6 7 7 7 8 8 8 ... Dimensions without coordinates: lat, lon Data variables: field (time, lat, lon) float64 dask.array ``` #### Problem description As we can see in the example above, the chunking has been lost. The dataset contains just one single huge chunk. This happens with any non-reducing operation on the groupby, even ```python ds.groupby('time.month').apply(lambda x: x) ``` Say we wanted to compute some statistics of the anomaly, like the variance: ```python (ds_anom.field**2).mean(dim='time').load() ``` This triggers the whole big chunk (with the whole timeseries) to be loaded into memory somewhere. For out-of-core datasets, this will crash our system. #### Expected Output It seems like we should be able to do this lazily, maintaining a chunk size of `(1, 180, 360)` for ds_anom. #### Output of ``xr.show_versions()``
INSTALLED VERSIONS ------------------ commit: None python: 3.6.2.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.0+dev27.g049cbdd pandas: 0.20.3 numpy: 1.13.1 scipy: 0.19.1 netCDF4: 1.3.1 h5netcdf: 0.4.1 Nio: None zarr: 2.2.0a2.dev91 bottleneck: 1.2.1 cyordereddict: None dask: 0.16.0 distributed: 1.20.1 matplotlib: 2.1.0 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 36.3.0 pip: 9.0.1 conda: None pytest: 3.2.1 IPython: 6.1.0 sphinx: 1.6.5
Possibly related to #392. cc @mrocklin ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1832/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 467776251,MDExOlB1bGxSZXF1ZXN0Mjk3MzU0NTEx,3121,Allow other tutorial filename extensions,1197350,closed,0,,,3,2019-07-13T23:27:44Z,2019-07-14T01:07:55Z,2019-07-14T01:07:51Z,MEMBER,,0,pydata/xarray/pulls/3121," - [x] Closes #3118 - [ ] Tests added - [ ] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API Together with https://github.com/pydata/xarray-data/pull/15, this allows us to generalize out tutorial datasets to non netCDF files. But it is backwards compatible--if there is no file suffix, it will append `.nc`.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3121/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 467674875,MDExOlB1bGxSZXF1ZXN0Mjk3MjgyNzA1,3106,Replace sphinx_gallery with notebook,1197350,closed,0,,,3,2019-07-13T05:35:34Z,2019-07-13T14:03:20Z,2019-07-13T14:03:19Z,MEMBER,,0,pydata/xarray/pulls/3106,"Today @jhamman and I discussed how to refactor our somewhat fragmented ""examples"". We decided to basically copy the approach of the [dask-examples](https://github.com/dask/dask-examples) repo, but have it live here in the main xarray repo. Basically this approach is: - all examples are notebooks - examples are rendered during doc build by nbsphinx - we will eventually have a binder that works with all of the same examples This PR removes the dependency on sphinx_gallery and replaces the existing gallery with a standalone notebook called `visualization_gallery.ipynb`. However, not all of the links that worked in the gallery work here, since we are now using nbsphinx to render the notebooks (see https://github.com/spatialaudio/nbsphinx/issues/308). Really important to get @dcherian's feedback on this, as he was the one who originally introduced the gallery. My view is that having everything as notebooks makes examples easier to maintain. But I'm curious to hear other views.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3106/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 467658326,MDExOlB1bGxSZXF1ZXN0Mjk3MjcwNjYw,3105,Switch doc examples to use nbsphinx,1197350,closed,0,,,4,2019-07-13T02:28:34Z,2019-07-13T04:53:09Z,2019-07-13T04:52:52Z,MEMBER,,0,pydata/xarray/pulls/3105,"This is the beginning of the docs refactor we have in mind for the sprint tomorrow. We will merge things first to the scipy19-docs branch so we can make sure things build on RTD. http://xarray.pydata.org/en/scipy19-docs","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3105/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 218260909,MDU6SXNzdWUyMTgyNjA5MDk=,1340,round-trip performance with save_mfdataset / open_mfdataset,1197350,closed,0,,,11,2017-03-30T16:52:26Z,2019-05-01T22:12:06Z,2019-05-01T22:12:06Z,MEMBER,,,,"I have encountered some major performance bottlenecks in trying to write and then read multi-file netcdf datasets. 
I start with an xarray dataset created by [xgcm](https://github.com/xgcm/xmitgcm) with the following repr: ``` Dimensions: (XC: 400, XG: 400, YC: 400, YG: 400, Z: 40, Zl: 40, Zp1: 41, Zu: 40, layer_1TH_bounds: 43, layer_1TH_center: 42, layer_1TH_interface: 41, time: 1566) Coordinates: iter (time) int64 8294400 8294976 8295552 8296128 ... * time (time) int64 8294400 8294976 8295552 8296128 ... * XC (XC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ... * YG (YG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ... * XG (XG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ... * YC (YC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ... * Zu (Zu) >f4 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 -91.0 ... * Zl (Zl) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ... * Zp1 (Zp1) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ... * Z (Z) >f4 -5.0 -15.0 -25.0 -36.0 -49.0 -64.0 -81.5 ... rAz (YG, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... dyC (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... rAw (YC, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... dxC (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... dxG (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... dyG (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... rAs (YG, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... Depth (YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... rA (YC, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... PHrefF (Zp1) >f4 0.0 98.1 196.2 294.3 412.02 549.36 706.32 ... PHrefC (Z) >f4 49.05 147.15 245.25 353.16 480.69 627.84 ... drC (Zp1) >f4 5.0 10.0 10.0 11.0 13.0 15.0 17.5 20.5 ... drF (Z) >f4 10.0 10.0 10.0 12.0 14.0 16.0 19.0 22.0 ... hFacC (Z, YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... hFacW (Z, YC, XG) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... hFacS (Z, YG, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... * layer_1TH_bounds (layer_1TH_bounds) >f4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 ... * layer_1TH_interface (layer_1TH_interface) >f4 0.0 0.2 0.4 0.6 0.8 1.0 ... * layer_1TH_center (layer_1TH_center) float32 -0.1 0.1 0.3 0.5 0.7 0.9 ... Data variables: T (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... U (time, Z, YC, XG) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... V (time, Z, YG, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... S (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... Eta (time, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... W (time, Zl, YC, XC) float32 -0.0 -0.0 -0.0 -0.0 -0.0 ... ``` An important point to note is that there are lots of ""non-dimension coordinates"" corresponding to various parameters of the numerical grid. I save this dataset to a multi-file netCDF dataset as follows: ```python iternums, datasets = zip(*ds.groupby('time')) paths = [outdir + 'xmitgcm_data.%010d.nc' % it for it in iternums] xr.save_mfdataset(datasets, paths) ``` This takes many hours to run, since it has to read and write all the data. (I think there are some performance issues here too, related to how dask schedules the read / write tasks, but that is probably a separate issue.) Then I try to re-load this dataset ```python ds_nc = xr.open_mfdataset('xmitgcm_data.*.nc') ``` This raises an error: ``` ValueError: too many different dimensions to concatenate: {'YG', 'Z', 'Zl', 'Zp1', 'layer_1TH_interface', 'YC', 'XC', 'layer_1TH_center', 'Zu', 'layer_1TH_bounds', 'XG'} ``` I need to specify `concat_dim='time'` in order to properly concatenate the data. 
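In other words, the call that does work looks roughly like this (same glob as above):

```python
ds_nc = xr.open_mfdataset('xmitgcm_data.*.nc', concat_dim='time')
```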
It seems like this should be unnecessary, since I am reading back data that was just written with xarray, but I understand why (the dimensions of the Data Variables in each file are just Z, YC, XC, with no time dimension). Once I do that, it works, but it takes 18 minutes to load the dataset. I assume this is because it has to check the compatibility of all all the non-dimension coordinates. I just thought I would document this, because 18 minutes seems way too long to load a dataset.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1340/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 431199282,MDExOlB1bGxSZXF1ZXN0MjY4OTI3MjU0,2881,decreased pytest verbosity,1197350,closed,0,,,1,2019-04-09T21:12:50Z,2019-04-09T23:36:01Z,2019-04-09T23:34:22Z,MEMBER,,0,pydata/xarray/pulls/2881,"This removes the `--verbose` flag from py.test in .travis.yml. - [x] Closes #2880 ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2881/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 431156227,MDU6SXNzdWU0MzExNTYyMjc=,2880,pytest output on travis is too verbose,1197350,closed,0,,,1,2019-04-09T19:39:46Z,2019-04-09T23:34:22Z,2019-04-09T23:34:22Z,MEMBER,,,,"I have to scroll over an immense amount of passing tests on travis before I can get to the failures. ([example](https://travis-ci.org/pydata/xarray/jobs/515490337)) This is pretty annoying. The amount of tests in xarray has exploded recently. This is good! But maybe we should turn off `--verbose` in travis. What does @pydata/xarray think?","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2880/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 373121666,MDU6SXNzdWUzNzMxMjE2NjY=,2503,Problems with distributed and opendap netCDF endpoint,1197350,closed,0,,,26,2018-10-23T17:48:20Z,2019-04-09T12:02:01Z,2019-04-09T12:02:01Z,MEMBER,,,,"#### Code Sample I am trying to load a dataset from an opendap endpoint using xarray, netCDF4, and distributed. I am having a problem only with non-local distributed schedulers (KubeCluster specifically). This could plausibly be an xarray, dask, or pangeo issue, but I have decided to post it here. ```python import xarray as xr import dask # create dataset from Unidata's test opendap endpoint, chunked in time url = 'http://remotetest.unidata.ucar.edu/thredds/dodsC/testdods/coads_climatology.nc' ds = xr.open_dataset(url, decode_times=False, chunks={'TIME': 1}) # all these work with dask.config.set(scheduler='synchronous'): ds.SST.compute() with dask.config.set(scheduler='processes'): ds.SST.compute() with dask.config.set(scheduler='threads'): ds.SST.compute() # this works too from dask.distributed import Client local_client = Client() with dask.config.set(get=local_client): ds.SST.compute() # but this does not cluster = KubeCluster(n_workers=2) kube_client = Client(cluster) with dask.config.set(get=kube_client): ds.SST.compute() ``` In the worker log, I see the following sort of errors. 
``` distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 5, 0, 0) distributed.worker - INFO - Dependent not found: open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf 0 . Asking scheduler distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 3, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 0, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 1, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 7, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 6, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 2, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 9, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 8, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 11, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 10, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 4, 0, 0) distributed.worker - WARNING - Compute Failed Function: getter args: (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=_ElementwiseFunctionArray(LazilyOuterIndexedArray(array=, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))), func=functools.partial(, encoded_fill_values={-1e+34}, decoded_fill_value=nan, dtype=dtype('float32')), dtype=dtype('float32')), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))))), (slice(3, 4, None), slice(0, 90, None), slice(0, 180, None))) kwargs: {} Exception: RuntimeError('NetCDF: Not a valid ID',) ``` Ultimately, the error comes from the netCDF library: `RuntimeError('NetCDF: Not a valid ID',)` This seems like something to do with serialization of the netCDF store. The worker images have identical netcdf version (and all other package versions). I am at a loss for how to debug further. #### Output of ``xr.show_versions()``
xr.show_versions() ``` INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.4.111+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.8 pandas: 0.23.2 numpy: 1.15.1 scipy: 1.1.0 netCDF4: 1.4.1 h5netcdf: None h5py: None Nio: None zarr: 2.2.0 bottleneck: None cyordereddict: None dask: 0.18.2 distributed: 1.22.1 matplotlib: 2.2.3 cartopy: None seaborn: None setuptools: 39.2.0 pip: 18.0 conda: 4.5.4 pytest: 3.8.0 IPython: 6.4.0 sphinx: None ``` `cube_client.get_versions(check=True)` ``` {'scheduler': {'host': (('python', '3.6.3.final.0'), ('python-bits', 64), ('OS', 'Linux'), ('OS-release', '4.4.111+'), ('machine', 'x86_64'), ('processor', 'x86_64'), ('byteorder', 'little'), ('LC_ALL', 'en_US.UTF-8'), ('LANG', 'en_US.UTF-8'), ('LOCALE', 'en_US.UTF-8')), 'packages': {'required': (('dask', '0.18.2'), ('distributed', '1.22.1'), ('msgpack', '0.5.6'), ('cloudpickle', '0.5.5'), ('tornado', '5.0.2'), ('toolz', '0.9.0')), 'optional': (('numpy', '1.15.1'), ('pandas', '0.23.2'), ('bokeh', '0.12.16'), ('lz4', '1.1.0'), ('blosc', '1.5.1'))}}, 'workers': {'tcp://10.20.8.4:36940': {'host': (('python', '3.6.3.final.0'), ('python-bits', 64), ('OS', 'Linux'), ('OS-release', '4.4.111+'), ('machine', 'x86_64'), ('processor', 'x86_64'), ('byteorder', 'little'), ('LC_ALL', 'en_US.UTF-8'), ('LANG', 'en_US.UTF-8'), ('LOCALE', 'en_US.UTF-8')), 'packages': {'required': (('dask', '0.18.2'), ('distributed', '1.22.1'), ('msgpack', '0.5.6'), ('cloudpickle', '0.5.5'), ('tornado', '5.0.2'), ('toolz', '0.9.0')), 'optional': (('numpy', '1.15.1'), ('pandas', '0.23.2'), ('bokeh', '0.12.16'), ('lz4', '1.1.0'), ('blosc', '1.5.1'))}}, 'tcp://10.21.177.254:42939': {'host': (('python', '3.6.3.final.0'), ('python-bits', 64), ('OS', 'Linux'), ('OS-release', '4.4.111+'), ('machine', 'x86_64'), ('processor', 'x86_64'), ('byteorder', 'little'), ('LC_ALL', 'en_US.UTF-8'), ('LANG', 'en_US.UTF-8'), ('LOCALE', 'en_US.UTF-8')), 'packages': {'required': (('dask', '0.18.2'), ('distributed', '1.22.1'), ('msgpack', '0.5.6'), ('cloudpickle', '0.5.5'), ('tornado', '5.0.2'), ('toolz', '0.9.0')), 'optional': (('numpy', '1.15.1'), ('pandas', '0.23.2'), ('bokeh', '0.12.16'), ('lz4', '1.1.0'), ('blosc', '1.5.1'))}}}, 'client': {'host': [('python', '3.6.3.final.0'), ('python-bits', 64), ('OS', 'Linux'), ('OS-release', '4.4.111+'), ('machine', 'x86_64'), ('processor', 'x86_64'), ('byteorder', 'little'), ('LC_ALL', 'en_US.UTF-8'), ('LANG', 'en_US.UTF-8'), ('LOCALE', 'en_US.UTF-8')], 'packages': {'required': [('dask', '0.18.2'), ('distributed', '1.22.1'), ('msgpack', '0.5.6'), ('cloudpickle', '0.5.5'), ('tornado', '5.0.2'), ('toolz', '0.9.0')], 'optional': [('numpy', '1.15.1'), ('pandas', '0.23.2'), ('bokeh', '0.12.16'), ('lz4', '1.1.0'), ('blosc', '1.5.1')]}}} ```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2503/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 209561985,MDU6SXNzdWUyMDk1NjE5ODU=,1282,description of xarray assumes knowledge of pandas,1197350,closed,0,,,4,2017-02-22T19:52:54Z,2019-02-26T19:01:47Z,2019-02-26T19:01:46Z,MEMBER,,,,"The first sentence a potential new user reads about xarray is > xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures. Now imagine you had never heard of pandas (like most new Ph.D. students in physical sciences). You would have no idea how useful and powerful xarray was. I would propose modifying these top-level descriptions to remove the assumption that the user understands pandas. Of course we can still refer to pandas, but a more self-contained description would serve us well. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1282/reactions"", ""total_count"": 3, ""+1"": 3, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 396501063,MDExOlB1bGxSZXF1ZXN0MjQyNjY4ODEw,2659,to_dict without data,1197350,closed,0,,,14,2019-01-07T14:09:25Z,2019-02-12T21:21:13Z,2019-01-21T23:25:56Z,MEMBER,,0,pydata/xarray/pulls/2659,"This PR provides the ability to export Datasets and DataArrays to dictionary _without_ the actual data. This could be useful for generating indices of dataset contents to expose to search indices or other automated data discovery tools In the process of doing this, I refactored the core dictionary export function to live in the Variable class, since the same code was duplicated in several places. - [x] Closes #2656 - [x] Tests added - [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2659/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 324740017,MDU6SXNzdWUzMjQ3NDAwMTc=,2164,holoviews / bokeh doesn't like cftime coords,1197350,closed,0,,,16,2018-05-20T20:29:03Z,2019-02-08T00:11:14Z,2019-02-08T00:11:14Z,MEMBER,,,,"#### Code Sample, a copy-pastable example if possible Consider a simple working example of converting an xarray dataset to holoviews for plotting: ```python ref_date = '1981-01-01' ds = xr.DataArray([1, 2, 3], dims=['time'], coords={'time': ('time', [1, 2, 3], {'units': 'days since %s' % ref_date})} ).to_dataset(name='foo') with xr.set_options(enable_cftimeindex=True): ds = xr.decode_cf(ds) print(ds) hv_ds = hv.Dataset(ds) hv_ds.to(hv.Curve) ``` This gives ``` Dimensions: (time: 3) Coordinates: * time (time) datetime64[ns] 1981-01-02 1981-01-03 1981-01-04 Data variables: foo (time) int64 ... ``` and ![image](https://user-images.githubusercontent.com/1197350/40283280-c3dd5506-5c49-11e8-8301-f21068dd50e9.png) #### Problem description Now change `ref_date = '0181-01-01'` (or anything outside of the valid range for regular pandas datetime index). We get a beautiful new cftimeindex ``` Dimensions: (time: 3) Coordinates: * time (time) object 0181-01-02 00:00:00 0181-01-03 00:00:00 ... Data variables: foo (time) int64 ... 
``` but holoviews / bokeh doesn't like it ``` /opt/conda/lib/python3.6/site-packages/xarray/coding/times.py:132: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy cftime.datetime objects instead, reason: dates out of range enable_cftimeindex) /opt/conda/lib/python3.6/site-packages/xarray/coding/variables.py:66: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy cftime.datetime objects instead, reason: dates out of range return self.func(self.array[key]) --------------------------------------------------------------------------- TypeError Traceback (most recent call last) /opt/conda/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj, include, exclude) 968 969 if method is not None: --> 970 return method(include=include, exclude=exclude) 971 return None 972 else: /opt/conda/lib/python3.6/site-packages/holoviews/core/dimension.py in _repr_mimebundle_(self, include, exclude) 1229 combined and returned. 1230 """""" -> 1231 return Store.render(self) 1232 1233 /opt/conda/lib/python3.6/site-packages/holoviews/core/options.py in render(cls, obj) 1287 data, metadata = {}, {} 1288 for hook in hooks: -> 1289 ret = hook(obj) 1290 if ret is None: 1291 continue /opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in pprint_display(obj) 278 if not ip.display_formatter.formatters['text/plain'].pprint: 279 return None --> 280 return display(obj, raw_output=True) 281 282 /opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in display(obj, raw_output, **kwargs) 248 elif isinstance(obj, (CompositeOverlay, ViewableElement)): 249 with option_state(obj): --> 250 output = element_display(obj) 251 elif isinstance(obj, (Layout, NdLayout, AdjointLayout)): 252 with option_state(obj): /opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in wrapped(element) 140 try: 141 max_frames = OutputSettings.options['max_frames'] --> 142 mimebundle = fn(element, max_frames=max_frames) 143 if mimebundle is None: 144 return {}, {} /opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in element_display(element, max_frames) 186 return None 187 --> 188 return render(element) 189 190 /opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in render(obj, **kwargs) 63 renderer = renderer.instance(fig='png') 64 ---> 65 return renderer.components(obj, **kwargs) 66 67 /opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/renderer.py in components(self, obj, fmt, comm, **kwargs) 257 # Bokeh has to handle comms directly in <0.12.15 258 comm = False if bokeh_version < '0.12.15' else comm --> 259 return super(BokehRenderer, self).components(obj,fmt, comm, **kwargs) 260 261 /opt/conda/lib/python3.6/site-packages/holoviews/plotting/renderer.py in components(self, obj, fmt, comm, **kwargs) 319 plot = obj 320 else: --> 321 plot, fmt = self._validate(obj, fmt) 322 323 widget_id = None /opt/conda/lib/python3.6/site-packages/holoviews/plotting/renderer.py in _validate(self, obj, fmt, **kwargs) 218 if isinstance(obj, tuple(self.widgets.values())): 219 return obj, 'html' --> 220 plot = self.get_plot(obj, renderer=self, **kwargs) 221 222 fig_formats = self.mode_formats['fig'][self.mode] /opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/renderer.py in get_plot(self_or_cls, obj, doc, renderer) 150 doc = Document() if self_or_cls.notebook_context else curdoc() 151 doc.theme = self_or_cls.theme 
--> 152 plot = super(BokehRenderer, self_or_cls).get_plot(obj, renderer) 153 plot.document = doc 154 return plot /opt/conda/lib/python3.6/site-packages/holoviews/plotting/renderer.py in get_plot(self_or_cls, obj, renderer) 205 init_key = tuple(v if d is None else d for v, d in 206 zip(plot.keys[0], defaults)) --> 207 plot.update(init_key) 208 else: 209 plot = obj /opt/conda/lib/python3.6/site-packages/holoviews/plotting/plot.py in update(self, key) 511 def update(self, key): 512 if len(self) == 1 and ((key == 0) or (key == self.keys[0])) and not self.drawn: --> 513 return self.initialize_plot() 514 item = self.__getitem__(key) 515 self.traverse(lambda x: setattr(x, '_updated', True)) /opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/element.py in initialize_plot(self, ranges, plot, plots, source) 729 if not self.overlaid: 730 self._update_plot(key, plot, style_element) --> 731 self._update_ranges(style_element, ranges) 732 733 for cb in self.callbacks: /opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/element.py in _update_ranges(self, element, ranges) 498 if not self.drawn or xupdate: 499 self._update_range(x_range, l, r, xfactors, self.invert_xaxis, --> 500 self._shared['x'], self.logx, streaming) 501 if not self.drawn or yupdate: 502 self._update_range(y_range, b, t, yfactors, self.invert_yaxis, /opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/element.py in _update_range(self, axis_range, low, high, factors, invert, shared, log, streaming) 525 updates = {} 526 if low is not None and (isinstance(low, util.datetime_types) --> 527 or np.isfinite(low)): 528 updates['start'] = (axis_range.start, low) 529 if high is not None and (isinstance(high, util.datetime_types) TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'' ``` Similar but slightly different errors arise for different holoviews types (e.g. `hv.Image`) and contexts (using time as a holoviews kdim). #### Expected Output This should work. I'm not sure if this is really an xarray problem. Maybe it needs a fix in holoviews (or bokeh). But I'm raising it here first since clearly we have introduced this new wrinkle in the stack. Cc'ing @philippjfr since he is the expert on all things holoviews. #### Output of ``xr.show_versions()``
INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.4.111+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.4 pandas: 0.23.0 numpy: 1.14.3 scipy: 1.1.0 netCDF4: 1.4.0 h5netcdf: None h5py: None Nio: None zarr: 2.2.0 bottleneck: None cyordereddict: None dask: 0.17.5 distributed: 1.21.8 matplotlib: 2.2.2 cartopy: None seaborn: None setuptools: 39.0.1 pip: 10.0.1 conda: 4.3.34 pytest: 3.5.1 IPython: 6.3.1 sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2164/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 193657418,MDU6SXNzdWUxOTM2NTc0MTg=,1154,netCDF reading is not prominent in the docs,1197350,closed,0,,,7,2016-12-06T01:18:40Z,2019-02-02T06:33:44Z,2019-02-02T06:33:44Z,MEMBER,,,,"Just opening an issue to highlight what I think is a problem with the docs. For me, the primary use of xarray is to read and process existing netCDF data files. @shoyer's popular [blog post](https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python) illustrates this use case extremely well. However, when I open the [docs](http://xarray.pydata.org/), I have to dig quite deep before I can see how to read a netCDF file. This could be turning away many potential users. The stuff about netCDF reading is hidden under ""Serialization and IO"". Many potential users will have no idea what either of these words mean. IMO the solution to this is to reorganize the docs to make reading netCDF much more prominent and obvious.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1154/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 225734529,MDU6SXNzdWUyMjU3MzQ1Mjk=,1394,autoclose with distributed doesn't seem to work,1197350,closed,0,,,9,2017-05-02T15:37:07Z,2019-01-13T19:35:10Z,2019-01-13T19:35:10Z,MEMBER,,,,"I am trying to analyze a very large netCDF dataset using xarray and distributed. I open my dataset with the new `autoclose` option: ```python ds = xr.open_mfdataset(ddir + '*.nc', decode_cf=False, autoclose=True) ``` However, when I try some reduction operation (e.g. `ds['Salt'].mean()`), I can see my open file count continue to rise monotonically. Eventually the dask worker dies with `OSError: [Errno 24] Too many open files: '/proc/65644/sta` once I hit the system ulimit. Am I doing something wrong here? Why are the files not being closed? cc: @pwolfram ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1394/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 225774140,MDU6SXNzdWUyMjU3NzQxNDA=,1396,selecting a point from an mfdataset,1197350,closed,0,,,12,2017-05-02T18:02:50Z,2019-01-13T06:32:45Z,2019-01-13T06:32:45Z,MEMBER,,,,"Sorry to be opening so many vague performance issues. I am really having a hard time with my current dataset, which is exposing certain limitations of xarray and dask in a way none of my previous work has done. I have a directory full of netCDF4 files. There are 1754 files, each 8.1GB in size, each representing a single model timestep. So there is ~14 TB of data total. (In addition to the time-dependent output, there is a single file with information about the grid.) Imagine I want to extract a timeseries from a single point (indexed by `k, j, i`) in this simulation. Without xarray, I would do something like this: ```python import netCDF4 ts = np.zeros(len(all_files)) for n, fname in enumerate(tqdm(all_files)): nc = netCDF4.Dataset(fname) ts[n] = nc.variables['Salt'][k, j, i] nc.close() ``` Which goes reasonably quick: tqdm gives `[02:38<00:00, 11.56it/s]`. 
I could do the same sort of loop using xarray: ```python import xarray as xr ts = np.zeros(len(all_files)) for n, fname in enumerate(tqdm(all_files)): ds = xr.open_dataset(fname) ts[n] = ds['Salt'][k, j, i] ds.close() ``` Which has a <50% performance overhead: `[03:29<00:00, 8.74it/s]`. Totally acceptable. Of course, what I really want is to avoid a loop and deal with the whole dataset as a single self-contained object. ```python ds = xr.open_mfdataset(all_files, decode_cf=False, autoclose=True) ``` This alone takes between 4-5 minutes to run (see #1385). If I want to print the repr, it takes another 3 minutes or so to `print(ds)`. The full dataset looks like this: ```python Dimensions: (i: 2160, i_g: 2160, j: 2160, j_g: 2160, k: 90, k_l: 90, k_p1: 91, k_u: 90, time: 1752) Coordinates: * j (j) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ... * k (k) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ... * j_g (j_g) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ... * i (i) int64 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 ... * k_p1 (k_p1) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... * k_u (k_u) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... * i_g (i_g) int64 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 ... * k_l (k_l) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... * time (time) float64 2.592e+05 2.628e+05 2.664e+05 2.7e+05 2.736e+05 ... Data variables: face (time) int64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... PhiBot (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... oceQnet (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... SIvice (time, j_g, i) float32 0.0516454 0.0523205 0.0308559 ... SIhsalt (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... oceFWflx (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... V (time, k, j_g, i) float32 0.0491903 0.0496442 0.0276739 ... iter (time) int64 10368 10512 10656 10800 10944 11088 11232 11376 ... oceQsw (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... oceTAUY (time, j_g, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... Theta (time, k, j, i) float32 -1.31868 -1.27825 -1.21401 -1.17964 ... SIhsnow (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... U (time, k, j, i_g) float32 0.0281392 0.0203967 0.0075199 ... SIheff (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... SIuice (time, j, i_g) float32 -0.041163 -0.0487612 -0.0614498 ... SIarea (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... Salt (time, k, j, i) float32 33.7534 33.7652 33.7755 33.7723 ... oceSflux (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... W (time, k_l, j, i) float32 -2.27453e-05 -2.28018e-05 ... oceTAUX (time, j, i_g) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... Eta (time, j, i) float32 -1.28886 -1.28811 -1.2871 -1.28567 ... YC (j, i) float32 -57.001 -57.001 -57.001 -57.001 -57.001 -57.001 ... YG (j_g, i_g) float32 -57.0066 -57.0066 -57.0066 -57.0066 ... XC (j, i) float32 -15.4896 -15.4688 -15.4479 -15.4271 -15.4062 ... XG (j_g, i_g) float32 -15.5 -15.4792 -15.4583 -15.4375 -15.4167 ... Zp1 (k_p1) float32 0.0 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 ... Z (k) float32 -0.5 -1.57 -2.79 -4.185 -5.78 -7.595 -9.66 -12.01 ... Zl (k_l) float32 0.0 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 ... Zu (k_u) float32 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 -13.26 ... rA (j, i) float32 1.5528e+06 1.5528e+06 1.5528e+06 1.5528e+06 ... 
rAw (j, i_g) float32 1.5528e+06 1.5528e+06 1.5528e+06 1.5528e+06 ... rAs (j_g, i) float32 9.96921e+36 9.96921e+36 9.96921e+36 ... rAz (j_g, i_g) float32 1.55245e+06 1.55245e+06 1.55245e+06 ... dxG (j_g, i) float32 1261.27 1261.27 1261.27 1261.27 1261.27 ... dyG (j, i_g) float32 1230.96 1230.96 1230.96 1230.96 1230.96 ... dxC (j, i_g) float32 1261.46 1261.46 1261.46 1261.46 1261.46 ... Depth (j, i) float32 4578.67 4611.09 4647.6 4674.88 4766.75 4782.64 ... dyC (j_g, i) float32 1230.86 1230.86 1230.86 1230.86 1230.86 ... PHrefF (k_p1) float32 0.0 9.81 20.9934 33.7464 48.3633 65.0403 ... drF (k) float32 1.0 1.14 1.3 1.49 1.7 1.93 2.2 2.5 2.84 3.21 3.63 ... PHrefC (k) float32 4.905 15.4017 27.3699 41.0549 56.7018 74.507 ... drC (k_p1) float32 0.5 1.07 1.22 1.395 1.595 1.815 2.065 2.35 2.67 ... hFacW (k, j, i_g) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... hFacS (k, j_g, i) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... hFacC (k, j, i) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... Attributes: coordinates: face ``` Now, to extract the same timeseries, I would like to say ```python ts = ds.Salt[:, k, j, i].load() ``` I monitor what is happening under the hood using when I call this by using [netdata](https://my-netdata.io/) and the dask.distributed dashboard, using only a single process and thread. First, all the files are opened (see #1394). Then they start getting read. Each read takes between 10 and 30 seconds, and the memory usage starts increasing steadily. My impression is that the entire dataset is being read into memory for concatenation. (I have dumped out the [dask graph](https://gist.github.com/rabernat/3e4fe655c6352accbd033b1face20b9c) in case anyone can make sense of it.) I have never let this calculation complete, as it looks like it would eat up all the memory on my system...plus it's extremely slow. To me, this seems like a failure of lazy indexing. I naively expected that the underlying file access would work similar to my loop, perhaps even in parallel. Can anyone shed some light on what might be going wrong? ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1396/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 108623921,MDU6SXNzdWUxMDg2MjM5MjE=,591,distarray backend?,1197350,closed,0,,,5,2015-09-28T09:49:52Z,2019-01-13T04:11:08Z,2019-01-13T04:11:08Z,MEMBER,,,,"This is probably a long shot, but I think a [distarray](https://github.com/enthought/distarray) backend could potentially be very useful in xray. Distarray implements the numpy interface, so it should be possible in principle. Distarray has a different architecture from dask (using MPI for parallelization) and in this way is more similar to traditional HPC codes. The application I have in mind is very high resolution GCM output where one wants to tile the data spatially across multiple nodes on a cluster. (This is how a GCM itself works.) ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/591/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 280626621,MDU6SXNzdWUyODA2MjY2MjE=,1770,slow performance when storing datasets in gcsfs-backed zarr stores,1197350,closed,0,,,11,2017-12-08T21:46:32Z,2019-01-13T03:52:46Z,2019-01-13T03:52:46Z,MEMBER,,,,"We are working on integrating zarr with xarray. 
In the process, we have encountered a performance issue that I am documenting here. At this point, it is not clear if the core issue is in zarr, gcsfs, dask, or xarray. I originally started posting this in zarr, but in the process, I became more convinced the issue was with xarray. ### Dask Only Here is an example using only dask and zarr. ```python # connect to a local dask scheduler from dask.distributed import Client client = Client('tcp://129.236.20.45:8786') # create a big dask array import dask.array as dsa shape = (30, 50, 1080, 2160) chunkshape = (1, 1, 1080, 2160) ar = dsa.random.random(shape, chunks=chunkshape) # connect to gcs and create MutableMapping import gcsfs fs = gcsfs.GCSFileSystem(project='pangeo-181919') gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test999', gcs=fs, check=True, create=True) # create a zarr array to store into import zarr za = zarr.create(ar.shape, chunks=chunkshape, dtype=ar.dtype, store=gcsmap) # write it ar.store(za, lock=False) ``` When you do this, it spends a long time serializing stuff before the computation starts. For a more fine-grained look at the process, one can instead do ```python delayed_obj = a.store(za, compute=False, lock=False) %prun future = client.compute(dobj) ``` This reveals that the pre-compute step takes about 10s. Monitoring the distributed scheduler, I can see that, once the computation starts, it takes about 1:30 to store the array (27 GB). (This is actually not bad!) Some debugging by @mrocklin revealed the following step is quite slow ```python import cloudpickle %time len(cloudpickle.dumps(za)) ``` On my system, this was taking close to 1s. On contrast, when the `store` passed to `gcsmap` is not a `GCSMap` but instead a path, it is in the microsecond territory. So pickling `GCSMap` objects is relatively slow. I'm not sure whether this pickling happens when we call `client.compute` or during the task execution. There is room for improvement here, but overall, zarr + gcsfs + dask seem to integrate well and give decent performance. ### Xarray This get much worse once xarray enters the picture. (Note that this example requires the xarray PR pydata/xarray#1528, which has not been merged yet.) ```python # wrap the dask array in an xarray import xarray as xr import numpy as np ds = xr.DataArray(ar, dims=['time', 'depth', 'lat', 'lon'], coords={'lat': np.linspace(-90, 90, Ny), 'lon': np.linspace(0, 360, Nx)}).to_dataset(name='temperature') # store to a different bucket gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test1', gcs=fs, check=True, create=True) ds.to_zarr(store=gcsmap, mode='w') ``` Now the store step takes 18 minutes. Most of this time, is upfront, during which there is little CPU activity and no network activity. After about 15 minutes or so, it finally starts computing, at which point the writes to gcs proceed more-or-less at the same rate as with the dask-only example. Profiling the `to_zarr` with snakeviz reveals that it is spending most of its time waiting for thread locks. ![image](https://user-images.githubusercontent.com/1197350/33786360-d645461a-dc36-11e7-8341-e60675af7eb9.png) I don't understand this, since I specifically eliminated locks when storing the zarr arrays. 
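As a possible follow-up experiment (this assumes a newer xarray in which `to_zarr` accepts `compute=False`, which is not the version used above), the graph-construction phase could be timed separately from the execution phase:

```python
# build the dask graph only; the slow up-front serialization happens here
%time delayed_store = ds.to_zarr(store=gcsmap, mode='w', compute=False)
# then execute the actual writes separately
%time delayed_store.compute()
```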
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1770/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 362866468,MDExOlB1bGxSZXF1ZXN0MjE3NDYzMTU4,2430,WIP: revise top-level package description,1197350,closed,0,,,10,2018-09-22T15:35:47Z,2019-01-07T01:04:19Z,2019-01-06T00:31:57Z,MEMBER,,0,pydata/xarray/pulls/2430,"I have often complained that xarray's top-level package description assumes that the user knows all about pandas. I think this alienates many new users. This is a first draft at revising that top-level description. Feedback from the community very needed here.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2430/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 389594572,MDU6SXNzdWUzODk1OTQ1NzI=,2597,add dayofyear to CFTimeIndex,1197350,closed,0,,,2,2018-12-11T04:41:59Z,2018-12-11T19:28:31Z,2018-12-11T19:28:31Z,MEMBER,,,,"I have noticed that `CFTimeIndex` does not provide the `.dayofyear` attributes. Pandas `DatetimeIndex` does. Implementing these attributes would make certain grouping operations much easier on non-standard calendars. Perhaps there are other similar attributes. I don't know if `.dayofweek` makes sense for non-standard calendars. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2597/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 382497709,MDExOlB1bGxSZXF1ZXN0MjMyMTkwMjg5,2559,Zarr consolidated,1197350,closed,0,,,19,2018-11-20T04:39:41Z,2018-12-05T14:58:58Z,2018-12-04T23:51:00Z,MEMBER,,0,pydata/xarray/pulls/2559,"This PR adds support for reading and writing of [consolidated metadata](https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata) in zarr stores. - [x] Closes #2558 (remove if there is no corresponding issue, which should only be the case for minor changes) - [x] Tests added (for all bug fixes or enhancements) - [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2559/reactions"", ""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 1, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 382043672,MDU6SXNzdWUzODIwNDM2NzI=,2558,how to incorporate zarr's new open_consolidated method?,1197350,closed,0,,,1,2018-11-19T03:28:40Z,2018-12-04T23:51:00Z,2018-12-04T23:51:00Z,MEMBER,,,,"Zarr has a new feature called [consolidated metadata](https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata). This feature will make it much faster to open certain zarr datasets, because all the metadata needed to construct the xarray dataset will live in a single .json file. To use this new feature, the new function `zarr.open_consolidated` needs to be called. So it won't work with xarray out of the box. We need to decide how to add support for this at the xarray level. 
**I am seeking feedback on what API people would like to see before starting a PR.** My proposal is to add a new keyword argument to `xarray.open_zarr` called `consolidated` (default = False). An alternative would be to automatically try `open_consolidated` and fall back on the standard `open_group` function if that fails. I played around with this a bit and realized that https://github.com/zarr-developers/zarr/issues/336 needs to be resolved before we can do the xarray side. cc @martindurant, who might want to weigh on what would be most convenient for intake.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2558/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 301891754,MDU6SXNzdWUzMDE4OTE3NTQ=,1955,Skipping / failing zarr tests,1197350,closed,0,,,3,2018-03-02T20:17:31Z,2018-10-29T00:25:34Z,2018-10-29T00:25:34Z,MEMBER,,,,"Zarr tests are currently getting skipped on our main testing environments (because the zarr version is less than 2.2): https://travis-ci.org/pydata/xarray/jobs/348350073#L1264 And failing in the `py36-zarr-dev` environment https://travis-ci.org/pydata/xarray/jobs/348350087#L4989 I'm not sure how this regression occurred, but the zarr tests have been failing for a long time, e.g. https://travis-ci.org/pydata/xarray/jobs/342651302 Possibly related to #1954 cc @jhamman ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1955/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 332762756,MDU6SXNzdWUzMzI3NjI3NTY=,2234,fillna error with distributed,1197350,closed,0,,,3,2018-06-15T12:54:54Z,2018-06-15T13:13:54Z,2018-06-15T13:13:54Z,MEMBER,,,,"#### Code Sample, a copy-pastable example if possible The following code works with the default dask threaded scheduler. ```python da = xr.DataArray([1, 1, 1, np.nan]).chunk() da.fillna(0.).mean().load() ``` It fails with distributed. 
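For concreteness, the distributed setup I mean is along these lines (a default local `Client` stands in here for the actual cluster I am running against):

```python
import numpy as np
import xarray as xr
from dask.distributed import Client

client = Client()  # registering a Client makes distributed the default scheduler
da = xr.DataArray([1, 1, 1, np.nan]).chunk()
da.fillna(0.).mean().load()
```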
I see the following error on the client side: ``` --------------------------------------------------------------------------- KilledWorker Traceback (most recent call last) in () ----> 1 da.fillna(0.).mean().load() /opt/conda/lib/python3.6/site-packages/xarray/core/dataarray.py in load(self, **kwargs) 631 dask.array.compute 632 """""" --> 633 ds = self._to_temp_dataset().load(**kwargs) 634 new = self._from_temp_dataset(ds) 635 self._variable = new._variable /opt/conda/lib/python3.6/site-packages/xarray/core/dataset.py in load(self, **kwargs) 489 490 # evaluate all the dask arrays simultaneously --> 491 evaluated_data = da.compute(*lazy_data.values(), **kwargs) 492 493 for k, data in zip(lazy_data, evaluated_data): /opt/conda/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs) 398 keys = [x.__dask_keys__() for x in collections] 399 postcomputes = [x.__dask_postcompute__() for x in collections] --> 400 results = schedule(dsk, keys, **kwargs) 401 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)]) 402 /opt/conda/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, **kwargs) 2157 try: 2158 results = self.gather(packed, asynchronous=asynchronous, -> 2159 direct=direct) 2160 finally: 2161 for f in futures.values(): /opt/conda/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous) 1560 return self.sync(self._gather, futures, errors=errors, 1561 direct=direct, local_worker=local_worker, -> 1562 asynchronous=asynchronous) 1563 1564 @gen.coroutine /opt/conda/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs) 650 return future 651 else: --> 652 return sync(self.loop, func, *args, **kwargs) 653 654 def __repr__(self): /opt/conda/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs) 273 e.wait(10) 274 if error[0]: --> 275 six.reraise(*error[0]) 276 else: 277 return result[0] /opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb) 691 if value.__traceback__ is not tb: 692 raise value.with_traceback(tb) --> 693 raise value 694 finally: 695 value = None /opt/conda/lib/python3.6/site-packages/distributed/utils.py in f() 258 yield gen.moment 259 thread_state.asynchronous = True --> 260 result[0] = yield make_coro() 261 except Exception as exc: 262 error[0] = sys.exc_info() /opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self) 1097 1098 try: -> 1099 value = future.result() 1100 except Exception: 1101 self.had_exception = True /opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self) 1105 if exc_info is not None: 1106 try: -> 1107 yielded = self.gen.throw(*exc_info) 1108 finally: 1109 # Break up a reference to itself /opt/conda/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker) 1437 six.reraise(type(exception), 1438 exception, -> 1439 traceback) 1440 if errors == 'skip': 1441 bad_keys.add(key) /opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb) 691 if value.__traceback__ is not tb: 692 raise value.with_traceback(tb) --> 693 raise value 694 finally: 695 value = None KilledWorker: (""('isna-mean_chunk-where-mean_agg-aggregate-74ec0f30171c1c667640f1f18df5f84b',)"", 'tcp://10.20.197.7:43357') ``` While the worker logs show this: ``` distributed.worker - ERROR - Can't get attribute 'isna' on Traceback (most recent call 
last): File ""/opt/conda/lib/python3.6/site-packages/distributed/worker.py"", line 346, in handle_scheduler self.ensure_computing]) File ""/opt/conda/lib/python3.6/site-packages/tornado/gen.py"", line 1055, in run value = future.result() File ""/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py"", line 238, in result raise_exc_info(self._exc_info) File """", line 4, in raise_exc_info File ""/opt/conda/lib/python3.6/site-packages/tornado/gen.py"", line 1063, in run yielded = self.gen.throw(*exc_info) File ""/opt/conda/lib/python3.6/site-packages/distributed/core.py"", line 361, in handle_stream msgs = yield comm.read() File ""/opt/conda/lib/python3.6/site-packages/tornado/gen.py"", line 1055, in run value = future.result() File ""/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py"", line 238, in result raise_exc_info(self._exc_info) File """", line 4, in raise_exc_info File ""/opt/conda/lib/python3.6/site-packages/tornado/gen.py"", line 1063, in run yielded = self.gen.throw(*exc_info) File ""/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py"", line 203, in read deserializers=deserializers) File ""/opt/conda/lib/python3.6/site-packages/tornado/gen.py"", line 1055, in run value = future.result() File ""/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py"", line 238, in result raise_exc_info(self._exc_info) File """", line 4, in raise_exc_info File ""/opt/conda/lib/python3.6/site-packages/tornado/gen.py"", line 307, in wrapper yielded = next(result) File ""/opt/conda/lib/python3.6/site-packages/distributed/comm/utils.py"", line 79, in from_frames res = _from_frames() File ""/opt/conda/lib/python3.6/site-packages/distributed/comm/utils.py"", line 65, in _from_frames deserializers=deserializers) File ""/opt/conda/lib/python3.6/site-packages/distributed/protocol/core.py"", line 122, in loads value = _deserialize(head, fs, deserializers=deserializers) File ""/opt/conda/lib/python3.6/site-packages/distributed/protocol/serialize.py"", line 236, in deserialize return loads(header, frames) File ""/opt/conda/lib/python3.6/site-packages/distributed/protocol/serialize.py"", line 58, in pickle_loads return pickle.loads(b''.join(frames)) File ""/opt/conda/lib/python3.6/site-packages/distributed/protocol/pickle.py"", line 59, in loads return pickle.loads(x) AttributeError: Can't get attribute 'isna' on ``` This could very well be a distributed issue. Or a pandas issue. I'm not too sure what is going on. Why is pandas even involved at all? #### Problem description This should not raise an error. It worked fine in previous versions, but something in our latest environment has caused it to break. #### Expected Output ``` array(0.75) ``` #### Output of ``xr.show_versions()`` This is running in the latest pangeo.pydata.org environment (https://github.com/pangeo-data/helm-chart/pull/29). @mrocklin picked a custom set of dask / distributed commits to install.
``` INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.4.111+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.7 pandas: 0.23.1 numpy: 1.14.5 scipy: 1.1.0 netCDF4: 1.3.1 h5netcdf: None h5py: None Nio: None zarr: 2.2.0 bottleneck: None cyordereddict: None dask: 0.17.4+51.g0a7fe8de distributed: 1.21.8+54.g7909f27d matplotlib: 2.2.2 cartopy: None seaborn: None setuptools: 39.2.0 pip: 10.0.1 conda: 4.5.4 pytest: 3.6.1 IPython: 6.4.0 sphinx: None ```
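One extra diagnostic worth running in a situation like this (an addition for context, not part of the original report) is distributed's built-in version check, which raises if the client, scheduler, and workers disagree on installed packages:

```python
from dask.distributed import Client

client = Client()  # or connect to the remote scheduler address
versions = client.get_versions(check=True)  # raises on mismatched package versions
print(versions['client'])
```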
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2234/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 323359733,MDU6SXNzdWUzMjMzNTk3MzM=,2135,use CF conventions to enhance plot labels,1197350,closed,0,,,4,2018-05-15T19:53:51Z,2018-06-02T00:10:26Z,2018-06-02T00:10:26Z,MEMBER,,,,"Elsewhere in xarray we use CF conventions to help with automatic decoding of datasets. Here I propose we consider using [CF metadata conventions](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch03s03.html) to improve the automatic labelling of plots. If datasets declare `long_name`, `standard_name`, and `units` attributes, we could use these instead of the variable name to label the relevant axes / colorbars. This feature would have helped me avoid several past mistakes due to my failure to examine the `units` attribute (e.g. data given in cm when I assumed m). #### Code Sample, a copy-pastable example if possible Here I create some data with relevant attributes ```python import xarray as xr import numpy as np ds = xr.Dataset({'foo': ('x', np.random.rand(10), {'long_name': 'height', 'units': 'm'})}, coords={'x': ('x', np.arange(10), {'long_name': 'distance', 'units': 'km'})}) ds.foo.plot() ``` ![image](https://user-images.githubusercontent.com/1197350/40079941-7b7d338a-5857-11e8-8f6e-abd530c29ac8.png) #### Problem description We have neglected the variable attributes, which would provide better axis labels. #### Expected Output Consider this instead: ```python def label_from_attrs(da): attrs = da.attrs if 'long_name' in attrs: name = attrs['long_name'] elif 'standard_name' in attrs: name = attrs['standard_name'] else: name = da.name if 'units' in da.attrs: units = ' [{}]'.format(da.attrs['units']) label = name + units return label ds.foo.plot() plt.xlabel(label_from_attrs(ds.x)) plt.ylabel(label_from_attrs(ds.foo)) ``` ![image](https://user-images.githubusercontent.com/1197350/40079995-abbabbee-5857-11e8-8296-905bc8545cd1.png) I feel like this would be a sensible default. But it would be a breaking change. We could make it optional with a keyword like `labels_from_attrs=True`. #### Output of ``xr.show_versions()``
INSTALLED VERSIONS ------------------ commit: None python: 3.6.5.final.0 python-bits: 64 OS: Linux OS-release: 4.4.111+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.3+dev13.g98373f0 pandas: 0.22.0 numpy: 1.14.3 scipy: 1.0.1 netCDF4: 1.3.1 h5netcdf: 0.5.1 h5py: 2.7.1 Nio: None zarr: 2.2.1.dev2 bottleneck: 1.2.1 cyordereddict: None dask: 0.17.4 distributed: 1.21.8 matplotlib: 2.2.2 cartopy: None seaborn: None setuptools: 39.1.0 pip: 9.0.1 conda: 4.3.29 pytest: 3.5.1 IPython: 6.3.1 sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2135/reactions"", ""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 1, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 180516114,MDU6SXNzdWUxODA1MTYxMTQ=,1026,multidim groupby on dask arrays: dask.array.reshape error,1197350,closed,0,,,17,2016-10-02T14:55:25Z,2018-05-24T17:59:31Z,2018-05-24T17:59:31Z,MEMBER,,,,"If I try to run a groupby operation using a multidimensional group, I get an error from dask about ""dask.array.reshape requires that reshaped dimensions after the first contain at most one chunk"". This error is arises with dask 0.11.0 but NOT dask 0.8.0. Consider the following test example: ``` python import dask.array as da import xarray as xr nz, ny, nx = (10,20,30) data = da.ones((nz,ny,nx), chunks=(5,ny,nx)) coord_2d = da.random.random((ny,nx), chunks=(ny,nx))>0.5 ds = xr.Dataset({'thedata': (('z','y','x'), data)}, coords={'thegroup': (('y','x'), coord_2d)}) # this works fine ds.thedata.groupby('thegroup') ``` Now I rechunk one of the later dimensions and group again: ``` python ds.chunk({'x': 5}).thedata.groupby('thegroup') ``` This raises the following error and stack trace ``` ValueError Traceback (most recent call last) in () ----> 1 ds.chunk({'x': 5}).thedata.groupby('thegroup') /Users/rpa/RND/open_source/xray/xarray/core/common.pyc in groupby(self, group, squeeze) 343 if isinstance(group, basestring): 344 group = self[group] --> 345 return self.groupby_cls(self, group, squeeze=squeeze) 346 347 def groupby_bins(self, group, bins, right=True, labels=None, precision=3, /Users/rpa/RND/open_source/xray/xarray/core/groupby.pyc in __init__(self, obj, group, squeeze, grouper, bins, cut_kwargs) 170 # the copy is necessary here, otherwise read only array raises error 171 # in pandas: https://github.com/pydata/pandas/issues/12813> --> 172 group = group.stack(**{stacked_dim_name: orig_dims}).copy() 173 obj = obj.stack(**{stacked_dim_name: orig_dims}) 174 self._stacked_dim = stacked_dim_name /Users/rpa/RND/open_source/xray/xarray/core/dataarray.pyc in stack(self, **dimensions) 857 DataArray.unstack 858 """""" --> 859 ds = self._to_temp_dataset().stack(**dimensions) 860 return self._from_temp_dataset(ds) 861 /Users/rpa/RND/open_source/xray/xarray/core/dataset.pyc in stack(self, **dimensions) 1359 result = self 1360 for new_dim, dims in dimensions.items(): -> 1361 result = result._stack_once(dims, new_dim) 1362 return result 1363 /Users/rpa/RND/open_source/xray/xarray/core/dataset.pyc in _stack_once(self, dims, new_dim) 1322 shape = [self.dims[d] for d in vdims] 1323 exp_var = var.expand_dims(vdims, shape) -> 1324 stacked_var = exp_var.stack(**{new_dim: dims}) 1325 variables[name] = stacked_var 1326 else: /Users/rpa/RND/open_source/xray/xarray/core/variable.pyc in stack(self, **dimensions) 801 result = self 802 for new_dim, dims in dimensions.items(): --> 803 result = result._stack_once(dims, new_dim) 804 return result 805 /Users/rpa/RND/open_source/xray/xarray/core/variable.pyc in _stack_once(self, dims, new_dim) 771 772 new_shape = reordered.shape[:len(other_dims)] + (-1,) --> 773 new_data = reordered.data.reshape(new_shape) 774 new_dims = reordered.dims[:len(other_dims)] + (new_dim,) 775 /Users/rpa/anaconda/lib/python2.7/site-packages/dask/array/core.pyc in reshape(self, *shape) 1101 if len(shape) == 1 and not isinstance(shape[0], Number): 1102 shape = shape[0] -> 1103 return reshape(self, shape) 1104 1105 @wraps(topk) 
/Users/rpa/anaconda/lib/python2.7/site-packages/dask/array/core.pyc in reshape(array, shape) 2585 2586 if any(len(c) != 1 for c in array.chunks[ndim_same+1:]): -> 2587 raise ValueError('dask.array.reshape requires that reshaped ' 2588 'dimensions after the first contain at most one chunk') 2589 ValueError: dask.array.reshape requires that reshaped dimensions after the first contain at most one chunk ``` I am using the latest xarray master and dask version 0.11.0. Note that the example works _fine_ if I use an earlier version of dask (e.g. 0.8.0, the only other one I tested.) This suggests an upstream issue with dask, but I wanted to bring it up here first. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1026/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 317783678,MDU6SXNzdWUzMTc3ODM2Nzg=,2082,searching is broken on readthedocs,1197350,closed,0,,,2,2018-04-25T20:34:13Z,2018-05-04T20:10:31Z,2018-05-04T20:10:31Z,MEMBER,,,,"Searches return no results for me. For example: http://xarray.pydata.org/en/latest/search.html?q=xarray&check_keywords=yes&area=default","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2082/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 312986662,MDExOlB1bGxSZXF1ZXN0MTgwNjUwMjc5,2047,Fix decode cf with dask,1197350,closed,0,,,1,2018-04-10T15:56:20Z,2018-04-12T23:38:02Z,2018-04-12T23:38:02Z,MEMBER,,0,pydata/xarray/pulls/2047," - [x] Closes #1372 - [x] Tests added - [x] Tests passed - [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API This was a very simple fix for an issue that has vexed me for quite a while. Am I missing something obvious here? ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2047/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 293913247,MDU6SXNzdWUyOTM5MTMyNDc=,1882,xarray tutorial at SciPy 2018?,1197350,closed,0,,,17,2018-02-02T14:52:11Z,2018-04-09T20:30:13Z,2018-04-09T20:30:13Z,MEMBER,,,,"It would be great to hold an xarray tutorial at SciPy 2018. Xarray has matured a lot recently, and it would be great to raise awareness of what it can do among the broader scipy community. From the [conference website](https://scipy2018.scipy.org/ehome/299527/648139/): > Tutorials should be focused on covering a well-defined topic in a hands-on manner. We want to see attendees coding! We encourage submissions to be designed to allow at least 50% of the time for hands-on exercises even if this means the subject matter needs to be limited. Tutorials will be 4 hours in duration. In your tutorial application, you can indicate what prerequisite skills and knowledge will be needed for your tutorial, and the approximate expected level of knowledge of your students (i.e., beginner, intermediate, advanced). I'm curious if anyone was already planning on submitting a tutorial. If not, let's put together a team. @jhamman has indicated interest in participating in, but not leading, the tutorial. Anyone else interested? 
xref pangeo-data/pangeo#97","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1882/reactions"", ""total_count"": 4, ""+1"": 4, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 106562046,MDU6SXNzdWUxMDY1NjIwNDY=,575,1D line plot with data on the x axis,1197350,closed,0,,,13,2015-09-15T13:56:51Z,2018-03-05T22:14:46Z,2018-03-05T22:14:46Z,MEMBER,,,,"Consider the following Dataset, representing a function f = cos(z) ``` python z = np.arange(10) ds = xray.Dataset( {'f': ('z', np.cos(z))}, coords={'z': z}) ``` If I call ``` python ds.f.plot() ``` xray naturally puts ""z"" on the x-axis. However, since z represents the vertical dimension, it would be more natural do put it on the y-axis, i.e. ``` python plt.plot(ds.f, ds.z) ``` This is conventional in atmospheric science and oceanography for buoy data or balloon data. Is there an easy way to do this with xray's plotting functions? I scanned the code and didn't see an obvious solution, but maybe I missed it. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/575/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 295744504,MDU6SXNzdWUyOTU3NDQ1MDQ=,1898,zarr RTD docs broken,1197350,closed,0,,3008859,1,2018-02-09T03:35:05Z,2018-02-15T23:20:31Z,2018-02-15T23:20:31Z,MEMBER,,,,"This is what is getting rendered on RTD http://xarray.pydata.org/en/latest/io.html#zarr ``` In [26]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))}, ....: coords={'x': [10, 20, 30, 40], ....: 'y': pd.date_range('2000-01-01', periods=5), ....: 'z': ('x', list('abcd'))}) ....: In [27]: ds.to_zarr('path/to/directory.zarr') --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in () ----> 1 ds.to_zarr('path/to/directory.zarr') /home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/core/dataset.py in to_zarr(self, store, mode, synchronizer, group, encoding) 1165 from ..backends.api import to_zarr 1166 return to_zarr(self, store=store, mode=mode, synchronizer=synchronizer, -> 1167 group=group, encoding=encoding) 1168 1169 def __unicode__(self): /home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/backends/api.py in to_zarr(dataset, store, mode, synchronizer, group, encoding) 752 # I think zarr stores should always be sync'd immediately 753 # TODO: figure out how to properly handle unlimited_dims --> 754 dataset.dump_to_store(store, sync=True, encoding=encoding) 755 return store /home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/core/dataset.py in dump_to_store(self, store, encoder, sync, encoding, unlimited_dims) 1068 1069 store.store(variables, attrs, check_encoding, -> 1070 unlimited_dims=unlimited_dims) 1071 if sync: 1072 store.sync() /home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/backends/zarr.py in store(self, variables, attributes, *args, **kwargs) 378 def store(self, variables, attributes, *args, **kwargs): 379 AbstractWritableDataStore.store(self, variables, attributes, --> 380 *args, **kwargs) 381 382 
/home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, unlimited_dims) 275 variables, attributes = self.encode(variables, attributes) 276 --> 277 self.set_attributes(attributes) 278 self.set_dimensions(variables, unlimited_dims=unlimited_dims) 279 self.set_variables(variables, check_encoding_set, /home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/backends/zarr.py in set_attributes(self, attributes) 341 342 def set_attributes(self, attributes): --> 343 self.ds.attrs.put(attributes) 344 345 def encode_variable(self, variable): AttributeError: 'Attributes' object has no attribute 'put' ```","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1898/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 253136694,MDExOlB1bGxSZXF1ZXN0MTM3ODE5MTA0,1528,WIP: Zarr backend,1197350,closed,0,,,103,2017-08-27T02:38:01Z,2018-02-13T21:35:03Z,2017-12-14T02:11:36Z,MEMBER,,0,pydata/xarray/pulls/1528," - [x] Closes #1223 - [x] Tests added / passed - [x] Passes ``git diff upstream/master | flake8 --diff`` - [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API I think that a zarr backend could be the ideal storage format for xarray datasets, overcoming many of the frustrations associated with netcdf and enabling optimal performance on cloud platforms. This is a very basic start to implementing a zarr backend (as proposed in #1223); however, I am taking a somewhat different approach. I store the whole dataset in a single zarr group. I encode the extra metadata needed by xarray (so far just dimension information) as attributes within the zarr group and child arrays. I hide these special attributes from the user by wrapping the attribute dictionaries in a ""`HiddenKeyDict`"", so that they can't be viewed or modified. I have no tests yet (:flushed:), but the following code works. ```python from xarray.backends.zarr import ZarrStore import xarray as xr import numpy as np ds = xr.Dataset( {'foo': (('y', 'x'), np.ones((100, 200)), {'myattr1': 1, 'myattr2': 2}), 'bar': (('x',), np.zeros(200))}, {'y': (('y',), np.arange(100)), 'x': (('x',), np.arange(200))}, {'some_attr': 'copana'} ).chunk({'y': 50, 'x': 40}) zs = ZarrStore(store='zarr_test') ds.dump_to_store(zs) ds2 = xr.Dataset.load_store(zs) assert ds2.equals(ds) ``` There is a very long way to go here, but I thought I would just get a PR started. Some questions that would help me move forward. 1. What is ""encoding"" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr? 1. Should we encode / decode CF for zarr stores? 1. Do we want to always automatically align dask chunks with the underlying zarr chunks? 1. What sort of public API should the zarr backend have? Should you be able to load zarr stores via `open_dataset`? Or do we need a new method? I think `.to_zarr()` would be quite useful. 1. zarr arrays are extensible along all axes. What does this imply for unlimited dimensions? 1. Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed. 
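Regarding question 4, here is the shape of the public API that xarray ultimately adopted for zarr I/O, shown only for orientation (it is not part of this WIP diff, and the path is hypothetical):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'foo': (('y', 'x'), np.ones((100, 200)))}).chunk({'y': 50, 'x': 40})
ds.to_zarr('zarr_api_test', mode='w')
ds2 = xr.open_zarr('zarr_api_test')
assert ds2.foo.shape == (100, 200)
```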
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1528/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 287569331,MDExOlB1bGxSZXF1ZXN0MTYyMjI0MTg2,1817,fix rasterio chunking with s3 datasets,1197350,closed,0,,,11,2018-01-10T20:37:45Z,2018-01-24T09:33:07Z,2018-01-23T16:33:28Z,MEMBER,,0,pydata/xarray/pulls/1817," - [x] Closes #1816 (remove if there is no corresponding issue, which should only be the case for minor changes) - [x] Tests added (for all bug fixes or enhancements) - [x] Tests passed (for all non-documentation changes) - [x] Passes ``git diff upstream/master **/*py | flake8 --diff`` (remove if you did not edit any Python files) - [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later) This is a simple fix for token generation of non-filename targets for rasterio. The problem is that I have no idea how to test it without actually hitting s3 (which requires boto and aws credentials). ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1817/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 287566823,MDU6SXNzdWUyODc1NjY4MjM=,1816,rasterio chunks argument causes loading from s3 to fail,1197350,closed,0,,,1,2018-01-10T20:28:40Z,2018-01-23T16:33:28Z,2018-01-23T16:33:28Z,MEMBER,,,,"#### Code Sample, a copy-pastable example if possible ```python # This works url = 's3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF' ds = xr.open_rasterio(url) # this doesn't ds = xr.open_rasterio(url, chunks=512) ``` The error is ``` --------------------------------------------------------------------------- FileNotFoundError Traceback (most recent call last) in () 6 # https://aws.amazon.com/public-datasets/landsat/ 7 # 512x512 chunking ----> 8 ds = xr.open_rasterio(url, chunks=512) 9 ds ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray-0.10.0-py3.6.egg/xarray/backends/rasterio_.py in open_rasterio(filename, chunks, cache, lock) 172 from dask.base import tokenize 173 # augment the token with the file modification time --> 174 mtime = os.path.getmtime(filename) 175 token = tokenize(filename, mtime, chunks) 176 name_prefix = 'open_rasterio-%s' % token ~/miniconda3/envs/geo_scipy/lib/python3.6/genericpath.py in getmtime(filename) 53 def getmtime(filename): 54 """"""Return the last modification time of a file, reported by os.stat()."""""" ---> 55 return os.stat(filename).st_mtime 56 57 FileNotFoundError: [Errno 2] No such file or directory: 's3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF' ``` #### Problem description It is pretty clear that the current xarray code expects to receive a filename. (The name of the argument is `filename`.) But rasterio's `open` function accepts a much wider range of [dataset identifiers](https://mapbox.github.io/rasterio/switch.html#dataset-identifiers). The tokenizing function should be updated to allow for this. Seems like it should be a pretty easy fix. #### Output of ``xr.show_versions()``
INSTALLED VERSIONS ------------------ commit: None python: 3.6.2.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.0 pandas: 0.20.3 numpy: 1.13.1 scipy: 0.19.1 netCDF4: 1.3.1 h5netcdf: 0.4.1 Nio: None bottleneck: 1.2.1 cyordereddict: None dask: 0.16.0 matplotlib: 2.1.0 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 36.3.0 pip: 9.0.1 conda: None pytest: 3.2.1 IPython: 6.1.0 sphinx: 1.6.5
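One possible shape of the fix, sketched here rather than taken from the merged patch: only consult the filesystem mtime when the identifier is an actual local file, and otherwise tokenize the identifier string alone.

```python
import os
from dask.base import tokenize

def _rasterio_token(identifier, chunks):
    # hypothetical helper: s3:// and other rasterio dataset identifiers have no
    # local mtime, so fall back to hashing the identifier itself
    mtime = os.path.getmtime(identifier) if os.path.isfile(identifier) else None
    return tokenize(identifier, mtime, chunks)

print(_rasterio_token('s3://landsat-pds/example.TIF', 512))
```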
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1816/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 281983819,MDU6SXNzdWUyODE5ODM4MTk=,1779,decode_cf destroys chunks,1197350,closed,0,,,2,2017-12-14T05:12:00Z,2017-12-15T14:50:42Z,2017-12-15T14:50:41Z,MEMBER,,,,"#### Code Sample, a copy-pastable example if possible ```python import numpy as np import xarray as xr xr.DataArray(np.random.rand(1000)).to_dataset(name='random').chunk(100) ds_cf = xr.decode_cf(ds) assert not ds_cf.chunks ``` #### Problem description Calling `decode_cf` causes variables whose data is dask arrays to be wrapped in two layers of abstractions: `DaskIndexingAdapter` and `LazilyIndexedArray`. In the example above ```python >>> ds.random.variable._data dask.array >>> ds_cf.random.variable._data LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array), key=BasicIndexer((slice(None, None, None),))) ``` At least part of the problem comes from this line: https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L1045 This is especially problematic if we want to concatenate several such datasets together with dask. Chunking the decoded dataset creates a nested dask-within-dask array which is sure to cause undesirable behavior down the line ```python >>> dict(ds_cf.chunk().random.data.dask) {('xarray-random-bf5298b8790e93c1564b5dca9e04399e', 0): (, 'xarray-random-bf5298b8790e93c1564b5dca9e04399e', (slice(0, 1000, None),)), 'xarray-random-bf5298b8790e93c1564b5dca9e04399e': ImplicitToExplicitIndexingAdapter(array=LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array), key=BasicIndexer((slice(None, None, None),))))} ``` #### Expected Output If we call `decode_cf` on a dataset made of dask arrays, it should preserve the chunks of the original dask arrays. Hopefully this can be addressed by #1752. #### Output of ``xr.show_versions()``
commit: 85174cda6440c2f6eed7860357e79897e796e623 python: 3.6.2.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.0-52-gd8842a6 pandas: 0.20.3 numpy: 1.13.1 scipy: 0.19.1 netCDF4: 1.2.9 h5netcdf: 0.4.1 Nio: None bottleneck: 1.2.1 cyordereddict: None dask: 0.16.0 matplotlib: 2.1.0 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 36.3.0 pip: 9.0.1 conda: None pytest: 3.2.1 IPython: 6.1.0 sphinx: 1.6.5
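The expected behavior, written out as a standalone check (a sketch that fails on the affected versions; it is essentially what a regression test could assert):

```python
import numpy as np
import xarray as xr

ds = xr.DataArray(np.random.rand(1000)).to_dataset(name='random').chunk(100)
decoded = xr.decode_cf(ds)
assert dict(decoded.chunks) == dict(ds.chunks)  # chunking should survive decoding
```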
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1779/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 94328498,MDU6SXNzdWU5NDMyODQ5OA==,463,open_mfdataset too many files,1197350,closed,0,,,47,2015-07-10T15:24:14Z,2017-11-27T12:17:17Z,2017-03-23T19:22:43Z,MEMBER,,,,"I am very excited to try xray. On my first attempt, I tried to use open_mfdataset on a set of ~8000 netcdf files. I hit a ""RuntimeError: Too many open files"". The ulimit on my system is 1024, so clearly that is the source of the error. I am curious whether this is the desired behavior for open_mfdataset. Does xray have to keep all the files open? If so, I will work with my sysadmin to increase the ulimit. It seems like the whole point of this function is to work with large collections of files, so this could be a significant limitation. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/463/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 229474101,MDExOlB1bGxSZXF1ZXN0MTIxMTQyODkw,1413,concat prealigned objects,1197350,closed,0,,,11,2017-05-17T20:16:00Z,2017-07-17T21:53:53Z,2017-07-17T21:53:40Z,MEMBER,,0,pydata/xarray/pulls/1413," - [x] Closes #1385 - [ ] Tests added / passed - [ ] Passes ``git diff upstream/master | flake8 --diff`` - [ ] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API This is an initial PR to bypass index alignment and coordinate checking when concatenating datasets.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1413/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 229138906,MDExOlB1bGxSZXF1ZXN0MTIwOTAzMjY5,1411,fixed dask prefix naming,1197350,closed,0,,,6,2017-05-16T19:10:30Z,2017-05-22T20:39:01Z,2017-05-22T20:38:56Z,MEMBER,,0,pydata/xarray/pulls/1411," - [x] Closes #1343 - [x] Tests added / passed - [x] Passes ``git diff upstream/master | flake8 --diff`` - [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API I am starting a new PR for this since the original one (#1345) was not branched of my own fork. As the discussion there stood, @shoyer suggested that `dataset.chunk` should also be updated to match the latest conventions in dask naming. The relevant code is here ```python def maybe_chunk(name, var, chunks): chunks = selkeys(chunks, var.dims) if not chunks: chunks = None if var.ndim > 0: token2 = tokenize(name, token if token else var._data) name2 = '%s%s-%s' % (name_prefix, name, token2) return var.chunk(chunks, name=name2, lock=lock) else: return var variables = OrderedDict([(k, maybe_chunk(k, v, chunks)) for k, v in self.variables.items()]) ``` Currently, `chunk` has an optional keyword argument `name_prefix='xarray-'`. Do we want to keep this optional? 
IMO, the current naming logic in `chunk` is not a problem for dask and will not cause problems for the distributed bokeh dashboard (as `open_dataset` did).","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1411/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 218368855,MDExOlB1bGxSZXF1ZXN0MTEzNTU0Njk4,1345,new dask prefix,1197350,closed,0,,,2,2017-03-31T00:56:24Z,2017-05-21T09:45:39Z,2017-05-16T19:11:13Z,MEMBER,,0,pydata/xarray/pulls/1345," - [x] closes #1343 - [ ] tests added / passed - [ ] passes ``git diff upstream/master | flake8 --diff`` - [ ] whatsnew entry ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1345/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 225482023,MDExOlB1bGxSZXF1ZXN0MTE4NDA4NDc1,1390,Fix groupby bins tests,1197350,closed,0,,,1,2017-05-01T17:46:41Z,2017-05-01T21:52:14Z,2017-05-01T21:52:14Z,MEMBER,,0,pydata/xarray/pulls/1390," - [x] closes #1386 - [x] tests added / passed - [x] passes ``git diff upstream/master | flake8 --diff`` - [x] whatsnew entry ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1390/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 220078792,MDU6SXNzdWUyMjAwNzg3OTI=,1357,dask strict version check fails,1197350,closed,0,,,1,2017-04-07T01:08:56Z,2017-04-07T01:43:53Z,2017-04-07T01:43:53Z,MEMBER,,,,"I am on xarray version 0.9.1-28-g1cad803 and dask version 0.14.1+39.g964b377 (both from recent github masters). I can't save chunked data to netcdf because of a failing dask version check. 
```python ds = xr.Dataset({'a': (['x'], np.random.rand(100)), 'b': (['x'], np.random.rand(100))}) ds = ds.chunk({'x': 20}) ds.to_netcdf('test.nc') ``` The relevant part of the stack trace is ``` /home/rpa/xarray/xarray/backends/common.pyc in sync(self) 165 import dask.array as da 166 import dask --> 167 if StrictVersion(dask.__version__) > StrictVersion('0.8.1'): 168 da.store(self.sources, self.targets, lock=GLOBAL_LOCK) 169 else: /home/rpa/.conda/envs/lagrangian_vorticity/lib/python2.7/distutils/version.pyc in __init__(self, vstring) 38 def __init__ (self, vstring=None): 39 if vstring: ---> 40 self.parse(vstring) 41 42 def __repr__ (self): /home/rpa/.conda/envs/lagrangian_vorticity/lib/python2.7/distutils/version.pyc in parse(self, vstring) 105 match = self.version_re.match(vstring) 106 if not match: --> 107 raise ValueError, ""invalid version number '%s'"" % vstring 108 109 (major, minor, patch, prerelease, prerelease_num) = \ ValueError: invalid version number '0.14.1+39.g964b377' ``` It appears that `StrictVersion` does not like the dask version numbering scheme.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1357/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 188537472,MDExOlB1bGxSZXF1ZXN0OTMxNzEyODE=,1104,add optimization tips,1197350,closed,0,,,1,2016-11-10T15:26:25Z,2016-11-10T16:49:13Z,2016-11-10T16:49:06Z,MEMBER,,0,pydata/xarray/pulls/1104,This adds some dask optimization tips from the mailing list (closes #1103).,"{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1104/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 188517316,MDU6SXNzdWUxODg1MTczMTY=,1103,add dask optimization tips to docs,1197350,closed,0,,,0,2016-11-10T14:08:39Z,2016-11-10T16:49:06Z,2016-11-10T16:49:06Z,MEMBER,,,,"We should add the optimization tips that @shoyer describes in this mailing list thread to @karenamckinnon. https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/xarray/11lDGSeza78/lR1uj9yWDAAJ Specific things to try (we should add similar guidelines to xarray's docs): 1. Do your spatial and temporal indexing with .sel() earlier in the pipeline, specifically before you resample. Resample triggers some computation on all the blocks, which in theory should commute with indexing, but we haven't implemented this optimization in dask yet: https://github.com/dask/dask/issues/746 2. Save the temporal mean to disk as a netCDF file (and then load it again with open_dataset) before subtracting it. Again, in theory, dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the dask scheduler, because it tries to keep every chunk of an array that it computes in memory: https://github.com/dask/dask/issues/874 3. Specify smaller chunks across space when using open_mfdataset, e.g., chunks={'latitude': 10, 'longitude': 10}. This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you do my suggestion 1). 
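A compact illustration of tips 1-3 above (the file pattern, variable name, and coordinate ranges are hypothetical):

```python
import xarray as xr

# tip 3: chunk in space when opening many files
ds = xr.open_mfdataset('data/*.nc', chunks={'latitude': 10, 'longitude': 10})

# tip 1: subset with .sel() before any resampling or reductions
sub = ds['t2m'].sel(latitude=slice(60, 20), time=slice('1990-01-01', '1999-12-31'))

# tip 2: write the temporal mean to disk and reload it before subtracting
sub.mean('time').to_netcdf('t2m_mean.nc')
mean = xr.open_dataset('t2m_mean.nc')['t2m']
anomaly = sub - mean
```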
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1103/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 180536861,MDExOlB1bGxSZXF1ZXN0ODc2NDc0MDk=,1027,Groupby bins empty groups,1197350,closed,0,,,7,2016-10-02T21:31:32Z,2016-10-03T15:22:18Z,2016-10-03T15:22:15Z,MEMBER,,0,pydata/xarray/pulls/1027,"This PR fixes a bug in `groupby_bins` in which empty bins were dropped from the grouped results. Now `groupby_bins` restores any empty bins automatically. To recover the old behavior, one could apply `dropna` after a groupby operation. Fixes #1019 ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1027/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 178359375,MDU6SXNzdWUxNzgzNTkzNzU=,1014,dask tokenize error with chunking,1197350,closed,0,,,1,2016-09-21T14:14:10Z,2016-09-22T02:38:08Z,2016-09-22T02:38:08Z,MEMBER,,,,"I have hit a problem with my custom xarray store: https://github.com/xgcm/xgcm/blob/master/xgcm/models/mitgcm/mds_store.py Unfortunately it is hard for me to create a re-producible example, since this error is only coming up when I try to read a large binary dataset stored on my server. Nevertheless, I am opening an issue in hopes that someone can help me. I create an xarray dataset via a custom function ``` python ds = xgcm.open_mdsdataset(ddir, iters, delta_t=deltaT, prefix=['DiagLAYERS-diapycnal','DiagLAYERS-transport']) ``` This function creates a dataset object successfully and then calls `ds.chunk()`. Dask is unable to tokenize the variables and fails. I don't really understand why, but it seems to ultimately depend on the presence and value of the `filename` attribute in the data getting passed to dask. Any advice would be appreciated. 
The relevant stack trace is ``` python /home/rpa/xgcm/xgcm/models/mitgcm/mds_store.pyc in open_mdsdataset(dirname, iters, prefix, read_grid, delta_t, ref_date, calendar, geometry, grid_vars_to_coords, swap_dims, endian, chunks, ignore_unknown_vars) 154 # do we need more fancy logic (like open_dataset), or is this enough 155 if chunks is not None: --> 156 ds = ds.chunk(chunks) 157 158 return ds /home/rpa/xarray/xarray/core/dataset.py in chunk(self, chunks, name_prefix, token, lock) 863 864 variables = OrderedDict([(k, maybe_chunk(k, v, chunks)) --> 865 for k, v in self.variables.items()]) 866 return self._replace_vars_and_dims(variables) 867 /home/rpa/xarray/xarray/core/dataset.py in maybe_chunk(name, var, chunks) 856 chunks = None 857 if var.ndim > 0: --> 858 token2 = tokenize(name, token if token else var._data) 859 name2 = '%s%s-%s' % (name_prefix, name, token2) 860 return var.chunk(chunks, name=name2, lock=lock) /home/rpa/dask/dask/base.pyc in tokenize(*args, **kwargs) 355 if kwargs: 356 args = args + (kwargs,) --> 357 return md5(str(tuple(map(normalize_token, args))).encode()).hexdigest() /home/rpa/dask/dask/utils.pyc in __call__(self, arg) 510 for cls in inspect.getmro(typ)[1:]: 511 if cls in lk: --> 512 return lk[cls](arg) 513 raise TypeError(""No dispatch for {0} type"".format(typ)) 514 /home/rpa/dask/dask/base.pyc in normalize_array(x) 320 return (str(x), x.dtype) 321 if hasattr(x, 'mode') and hasattr(x, 'filename'): --> 322 return x.filename, os.path.getmtime(x.filename), x.dtype, x.shape 323 if x.dtype.hasobject: 324 try: /usr/local/anaconda/lib/python2.7/genericpath.pyc in getmtime(filename) 60 def getmtime(filename): 61 """"""Return the last modification time of a file, reported by os.stat()."""""" ---> 62 return os.stat(filename).st_mtime 63 64 TypeError: coercing to Unicode: need string or buffer, NoneType found ``` ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1014/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 146182176,MDExOlB1bGxSZXF1ZXN0NjU0MDc4NzA=,818,Multidimensional groupby,1197350,closed,0,,,61,2016-04-06T04:14:37Z,2016-07-31T23:02:59Z,2016-07-08T01:50:38Z,MEMBER,,0,pydata/xarray/pulls/818,"Many datasets have a two dimensional coordinate variable (e.g. longitude) which is different from the logical grid coordinates (e.g. nx, ny). (See #605.) For plotting purposes, this is solved by #608. However, we still might want to split / apply / combine over such coordinates. That has not been possible, because groupby only supports creating groups on one-dimensional arrays. This PR overcomes that issue by using `stack` to collapse multiple dimensions in the group variable. A minimal example of the new functionality is ``` python >>> da = xr.DataArray([[0,1],[2,3]], coords={'lon': (['ny','nx'], [[30,40],[40,50]] ), 'lat': (['ny','nx'], [[10,10],[20,20]] )}, dims=['ny','nx']) >>> da.groupby('lon').sum() array([0, 3, 3]) Coordinates: * lon (lon) int64 30 40 50 ``` This feature could have broad applicability for many realistic datasets (particularly model output on irregular grids): for example, averaging non-rectangular grids zonally (i.e. in latitude), binning in temperature, etc. If you think this is worth pursuing, I would love some feedback. The PR is not complete. Some items to address are - [x] Create a specialized grouper to allow coarser bins. 
By default, if no `grouper` is specified, the `GroupBy` object uses all unique values to define the groups. With a high resolution dataset, this could balloon to a huge number of groups. With the latitude example, we would like to be able to specify e.g. 1-degree bins. Usage would be `da.groupby('lon', bins=range(-90,90))`. - [ ] Allow specification of which dims to stack. For example, stack in space but keep time dimension intact. (Currently it just stacks all the dimensions of the group variable.) - [x] A nice example for the docs. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/818/reactions"", ""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 162974170,MDExOlB1bGxSZXF1ZXN0NzU2ODI3NzM=,892,fix printing of unicode attributes,1197350,closed,0,,,2,2016-06-29T16:47:27Z,2016-07-24T02:57:13Z,2016-07-24T02:57:13Z,MEMBER,,0,pydata/xarray/pulls/892,"fixes #834 I would welcome a suggestion of how to test this in a way that works with both python 2 and 3. This is somewhat outside my expertise. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/892/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 100055216,MDExOlB1bGxSZXF1ZXN0NDIwMTYyMDg=,524,Option for closing files with scipy backend,1197350,closed,0,,,6,2015-08-10T12:49:23Z,2016-06-24T17:45:07Z,2016-06-24T17:45:07Z,MEMBER,,0,pydata/xarray/pulls/524,"This is the same as #468, which was accidentally closed. I just copied and pasted my comment below This addresses issue #463, in which open_mfdataset failed when trying to open a list of files longer than my system's ulimit. I tried to find a solution in which the underlying netcdf file objects are kept closed by default and only reopened ""when needed"". I ended up subclassing scipy.io.netcdf_file and overwriting the variable attribute with a property which first checks whether the file is open or closed and opens it if needed. That was the easy part. The hard part was figuring out when to close them. The problem is that a couple of different parts of the code (e.g. each individual variable and also the datastore object itself) keep references to the netcdf_file object. In the end I used the debugger to find out when during initialization the variables were actually being read and added some calls to close() in various different places. It is relatively easy to close the files up at the end of the initialization, but it was much harder to make sure that the whole array of files is never open at the same time. I also had to disable mmap when this option is active. This solution is messy and, moreover, extremely slow. There is a factor of ~100 performance penalty during initialization for reopening and closing the files all the time (but only a factor of 10 for the actual calculation). I am sure this could be reduced if someone who understands the code better found some judicious points at which to call close() on the netcdf_file. The loss of mmap also sucks. This option can be accessed with the close_files key word, which I added to api. 
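A toy version of the idea described above (a sketch only; the actual PR subclasses `scipy.io.netcdf_file` rather than wrapping it): keep the file closed between reads, reopen it around each access, and leave mmap off.

```python
import scipy.io

class LazyNetcdfVariable(object):
    def __init__(self, path, varname):
        self.path = path
        self.varname = varname

    def __getitem__(self, key):
        f = scipy.io.netcdf_file(self.path, mmap=False)  # mmap must stay disabled
        try:
            return f.variables[self.varname][key].copy()  # copy so data outlives the file
        finally:
            f.close()
```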
Timing for loading and doing a calculation with close_files=True: ``` python count_open_files() %time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=True) count_open_files() %time print float(mfds.variables['u'].mean()) count_open_files() ``` output: ``` 3 open files CPU times: user 11.1 s, sys: 17.5 s, total: 28.5 s Wall time: 27.7 s 2 open files 0.0055650632367 CPU times: user 649 ms, sys: 974 ms, total: 1.62 s Wall time: 633 ms 2 open files ``` Timing for loading and doing a calculation with close_files=False (default, should revert to old behavior): ``` python count_open_files() %time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=False) count_open_files() %time print float(mfds.variables['u'].mean()) count_open_files() ``` ``` 3 open files CPU times: user 264 ms, sys: 85.3 ms, total: 349 ms Wall time: 291 ms 22 open files 0.0055650632367 CPU times: user 174 ms, sys: 141 ms, total: 315 ms Wall time: 56 ms 22 open files ``` This is not a very serious pull request, but I spent all day on it, so I thought I would share. Maybe you can see some obvious way to improve it... ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/524/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 111471076,MDU6SXNzdWUxMTE0NzEwNzY=,624,roll method,1197350,closed,0,,,8,2015-10-14T19:14:36Z,2015-12-02T23:32:28Z,2015-12-02T23:32:28Z,MEMBER,,,,"I would like to pick up my idea to add a roll method. Among many uses, it could help with #623. The method is pretty simple. ``` python def roll(darr, n, dim): """"""Clone of numpy.roll for xray objects."""""" left = darr.isel(**{dim: slice(None, -n)}) right = darr.isel(**{dim: slice(-n, None)}) return xray.concat([right, left], dim=dim, data_vars='minimal', coords='minimal') ``` I have already been using this function a lot (defined from outside xray) and find it quite useful. I would like to create a PR to add it, but I am having a little trouble understanding how to correctly ""inject"" it into the api. A few words of advice from @shoyer would probably save me a lot of trial and error. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/624/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 115897556,MDU6SXNzdWUxMTU4OTc1NTY=,649,error when using broadcast_arrays with coordinates,1197350,closed,0,,,5,2015-11-09T15:16:32Z,2015-11-10T14:27:41Z,2015-11-10T14:27:41Z,MEMBER,,,,"I frequently use `broadcast_arrays` to to feed xray variables to non-xray libraries (e.g. [gsw](https://github.com/TEOS-10/python-gsw).) Often I need to broadcast the coordinates and variables in order to do call functions that take both as arguments. I have found that `broadcast_arrays` doesn't work as I expect with coordinates. For example ``` python import xray import numpy as np ds = xray.Dataset({'a': (['y','x'], np.ones((20,10)))}, coords={'x': (['x'], np.arange(10)), 'y': (['y'], np.arange(20))}) xbc, ybc, abc = xray.broadcast_arrays(ds.x, ds.y, ds.a) ``` This raises `ValueError: an index variable must be defined with 1-dimensional data`. If I change the last line to ``` python xbc, ybc, abc = xray.broadcast_arrays(1*ds.x, 1*ds.y, ds.a) ``` it works fine. 
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/649/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 101719623,MDExOlB1bGxSZXF1ZXN0NDI3MzE1NDg=,538,Fix contour color,1197350,closed,0,,,25,2015-08-18T18:24:36Z,2015-09-01T17:48:12Z,2015-09-01T17:20:56Z,MEMBER,,0,pydata/xarray/pulls/538,"This fixes #537 by adding a check for the presence of the colors kwarg. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/538/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 101716715,MDU6SXNzdWUxMDE3MTY3MTU=,537,xray.plot.contour doesn't handle colors kwarg correctly,1197350,closed,0,,,2,2015-08-18T18:11:55Z,2015-09-01T17:20:55Z,2015-09-01T17:20:55Z,MEMBER,,,,"I found this while playing around with the plotting functions. (Really nice work btw @clarkfitzg!) I know the plotting is still under heavy development, but I thought I would share this issue anyway. I might take a crack at fixing it myself... The goal is to make an unfilled contour plot with no colors. In matplotlib this is easy ``` python x, y = np.arange(20), np.arange(20) xx, yy = np.meshgrid(x, y) f = np.sqrt(xx**2 + yy**2) plt.contour(x, y, f, colors='k') ``` If I try the same thing in dask ``` python da = xray.DataArray(f, coords={'y': y, 'x': x}) plt.figure() xray.plot.contour(da, colors='k') ``` I get `ValueError: Either colors or cmap must be None`. I can't find any way around this (e.g. adding a `cmap=None` argument has no effect). If I remove the colors keyword, it works and makes colored contours, as expected. I think this could be fixed easily if you agree it is a bug... ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/537/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 99847237,MDExOlB1bGxSZXF1ZXN0NDE5NjI5MDg=,523,Fix datetime decoding when time units are 'days since 0000-01-01 00:00:00',1197350,closed,0,,,22,2015-08-09T00:12:00Z,2015-08-14T17:22:02Z,2015-08-14T17:22:02Z,MEMBER,,0,pydata/xarray/pulls/523,"This fixes #521 using the workaround described in Unidata/netcdf4-python#442. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/523/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 94508580,MDExOlB1bGxSZXF1ZXN0Mzk3NTI1MTQ=,468,Option for closing files with scipy backend,1197350,closed,0,,,7,2015-07-11T21:24:24Z,2015-08-10T12:50:45Z,2015-08-09T00:04:12Z,MEMBER,,0,pydata/xarray/pulls/468,"This addresses issue #463, in which open_mfdataset failed when trying to open a list of files longer than my system's ulimit. I tried to find a solution in which the underlying netcdf file objects are kept closed by default and only reopened ""when needed"". I ended up subclassing scipy.io.netcdf_file and overwriting the variable attribute with a property which first checks whether the file is open or closed and opens it if needed. That was the easy part. The hard part was figuring out when to close them. The problem is that a couple of different parts of the code (e.g. each individual variable and also the datastore object itself) keep references to the netcdf_file object. 
In the end I used the debugger to find out when during initialization the variables were actually being read and added some calls to close() in various different places. It is relatively easy to close the files up at the end of the initialization, but it was much harder to make sure that the whole array of files is never open at the same time. I also had to disable mmap when this option is active. This solution is messy and, moreover, extremely slow. There is a factor of ~100 performance penalty during initialization for reopening and closing the files all the time (but only a factor of 10 for the actual calculation). I am sure this could be reduced if someone who understands the code better found some judicious points at which to call close() on the netcdf_file. The loss of mmap also sucks. This option can be accessed with the close_files key word, which I added to api. Timing for loading and doing a calculation with close_files=True: ``` python count_open_files() %time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=True) count_open_files() %time print float(mfds.variables['u'].mean()) count_open_files() ``` output: ``` 3 open files CPU times: user 11.1 s, sys: 17.5 s, total: 28.5 s Wall time: 27.7 s 2 open files 0.0055650632367 CPU times: user 649 ms, sys: 974 ms, total: 1.62 s Wall time: 633 ms 2 open files ``` Timing for loading and doing a calculation with close_files=False (default, should revert to old behavior): ``` python count_open_files() %time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=False) count_open_files() %time print float(mfds.variables['u'].mean()) count_open_files() ``` ``` 3 open files CPU times: user 264 ms, sys: 85.3 ms, total: 349 ms Wall time: 291 ms 22 open files 0.0055650632367 CPU times: user 174 ms, sys: 141 ms, total: 315 ms Wall time: 56 ms 22 open files ``` This is not a very serious pull request, but I spent all day on it, so I thought I would share. Maybe you can see some obvious way to improve it... ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/468/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 99844089,MDExOlB1bGxSZXF1ZXN0NDE5NjI0NDM=,522,Fix datetime decoding when time units are 'days since 0000-01-01 00:00:00',1197350,closed,0,,,1,2015-08-08T23:26:07Z,2015-08-09T00:10:18Z,2015-08-09T00:06:49Z,MEMBER,,0,pydata/xarray/pulls/522,"This fixes #521 using the workaround described in Unidata/netcdf4-python#442. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/522/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 96732359,MDU6SXNzdWU5NjczMjM1OQ==,489,problems with big endian DataArrays,1197350,closed,0,,,4,2015-07-23T05:24:07Z,2015-07-23T20:28:00Z,2015-07-23T20:28:00Z,MEMBER,,,,"I have some [MITgcm](http://mitgcm.org/) data in a [custom binary format](http://mitgcm.org/public/r2_manual/latest/online_documents/node277.html) that I am trying to wedge into xray. I found that DataArray does not know how to handle big endian datatypes, at least on my system. 
``` python x = xray.DataArray( np.ones(10, dtype='>f4')) print float(x.sum()), x.data.sum() ``` result: ``` 4.60060298822e-40 10.0 ``` ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/489/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 96185559,MDU6SXNzdWU5NjE4NTU1OQ==,484,segfault with hdf4 file,1197350,closed,0,,,5,2015-07-20T23:15:06Z,2015-07-21T02:34:16Z,2015-07-21T02:34:16Z,MEMBER,,,,"I am trying to read data from the NASA MERRA reanalysis. An example file is: ftp://goldsmr3.sci.gsfc.nasa.gov/data/s4pa/MERRA/MAI3CPASM.5.2.0/2014/01/MERRA300.prod.assim.inst3_3d_asm_Cp.20140101.hdf The file format is hdf4 (NOT hdf5). ([full file specification](http://gmao.gsfc.nasa.gov/pubs/docs/Lucchesi528.pdf)) This file can be read by netCDF4.Dataset ``` python from netCDF4 import Dataset fname = 'MERRA300.prod.assim.inst3_3d_asm_Cp.20140101.hdf' nc = Dataset(fname) nc.variables['SLP'][0] ``` No errors However, with xray ``` python import xray ds = xray.open_dataset(fname) ``` I get a segfault. Is this behavior unique to my system? Or is this a reproducible bug? Note: I am not using anaconda's netCDF package, because it does not have hdf4 file support. I had my sysadmin build us a custom netcdf and netCDF4 python. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/484/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue