github: issues: 81 rows where state = "closed" and user = 1197350 sorted by updated

81 rows where state = "closed" and user = 1197350 sorted by updated_at descending

Search:

descending

id	node_id	number	title	user	state	milestone	comments	created_at	updated_at ▲	closed_at	author_association	draft	pull_request	body	reactions	state_reason	repo	type
1503046820	I_kwDOAMm_X85Zlqyk	7388	Xarray does not support full range of netcdf-python compression options	rabernat 1197350	closed		22	2022-12-19T14:21:17Z	2023-12-21T15:43:06Z	2023-12-21T15:24:17Z	MEMBER			What is your issue? Summary The netcdf4-python API docs say the following If the optional keyword argument compression is set, the data will be compressed in the netCDF file using the specified compression algorithm. Currently `zlib`,`szip`,`zstd`,`bzip2`,`blosc_lz`,`blosc_lz4`,`blosc_lz4hc`, `blosc_zlib` and `blosc_zstd` are supported. Default is None (no compression). All of the compressors except `zlib` and `szip` use the HDF5 plugin architecture. If the optional keyword `zlib` is True, the data will be compressed in the netCDF file using zlib compression (default False). The use of this option is deprecated in favor of `compression='zlib'`. Although `compression` is considered a valid encoding option by Xarray https://github.com/pydata/xarray/blob/bbe63ab657e9cb16a7cbbf6338a8606676ddd7b0/xarray/backends/netCDF4_.py#L232-L242 ...it appears that we silently ignores the `compression` option when creating new netCDF4 variables: https://github.com/pydata/xarray/blob/bbe63ab657e9cb16a7cbbf6338a8606676ddd7b0/xarray/backends/netCDF4_.py#L488-L501 Code example ```python shape = (10, 20) chunksizes = (1, 10) encoding = { 'compression': 'zlib', 'shuffle': True, 'complevel': 8, 'fletcher32': False, 'contiguous': False, 'chunksizes': chunksizes } da = xr.DataArray( data=np.random.rand(*shape), dims=['y', 'x'], name="foo", attrs={"bar": "baz"} ) da.encoding = encoding ds = da.to_dataset() fname = "test.nc" ds.to_netcdf(fname, engine="netcdf4", mode="w") with xr.open_dataset(fname, engine="netcdf4") as ds1: display(ds1.foo.encoding) ``` `{'zlib': False, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': False, 'chunksizes': (1, 10), 'source': 'test.nc', 'original_shape': (10, 20), 'dtype': dtype('float64'), '_FillValue': nan}` In addition to showing that `compression` is ignored, this also reveals several other encoding options that are not available when writing data from xarray (`szip`, `zstd`, `bzip2`, `blosc`). Proposal We should align with the recommendation from the netcdf4 docs and support `compression=` style encoding in NetCDF. We should deprecate `zlib=True` syntax.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/7388/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
1983894219	PR_kwDOAMm_X85e8V31	8428	Add mode='a-': Do not overwrite coordinates when appending to Zarr with `append_dim`	rabernat 1197350	closed		3	2023-11-08T15:41:58Z	2023-12-01T04:21:57Z	2023-12-01T03:58:54Z	MEMBER	0	pydata/xarray/pulls/8428	This implements the 1b option described in #8427. [x] Closes #8427 [x] Tests added [x] User visible changes (including notable bug fixes) are documented in `whats-new.rst`	{ "url": "https://api.github.com/repos/pydata/xarray/issues/8428/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
1983891070	I_kwDOAMm_X852P8Z-	8427	Ambiguous behavior with coordinates when appending to Zarr store with append_dim	rabernat 1197350	closed		4	2023-11-08T15:40:19Z	2023-12-01T03:58:56Z	2023-12-01T03:58:55Z	MEMBER			What happened? There are two quite different scenarios covered by "append" with Zarr Adding new variables to a dataset Extending arrays along a dimensions (via `append_dim`) This issue is about what should happen when using `append_dim` with variables that do not contain `append_dim`. Here's the current behavior. ```python import xarray as xr import zarr ds1 = xr.DataArray( np.array([1, 2, 3]).reshape(3, 1, 1), dims=('time', 'y', 'x'), coords={'x': [1], 'y': [2]}, name="foo" ).to_dataset() ds2 = xr.DataArray( np.array([4, 5]).reshape(2, 1, 1), dims=('time', 'y', 'x'), coords={'x':[-1], 'y': [-2]}, name="foo" ).to_dataset() how concat works: data are aligned ds_concat = xr.concat([ds1, ds2], dim="time") assert ds_concat.dims == {"time": 5, "y": 2, "x": 2} now do a Zarr append store = zarr.storage.MemoryStore() ds1.to_zarr(store, consolidated=False) we do not check that the coordinates are aligned--just that they have the same shape and dtype ds2.to_zarr(store, append_dim="time", consolidated=False) ds_append = xr.open_zarr(store, consolidated=False) coordinates data have been overwritten assert ds_append.dims == {"time": 5, "y": 1, "x": 1} ...with the latest values assert ds_append.x.data[0] == -1 ``` Currently, we always write all data variables in this scenario. That includes overwriting the coordinates every time we append. That makes appending more expensive than it needs to be. I don't think that is the behavior most users want or expect. What did you expect to happen? There are a couple of different options we could consider for how to handle this "extending" situation (with `append_dim`) [current behavior] Do not attempt to align coordinates a. [current behavior] Overwrite coordinates with new data b. Keep original coordinates c. Force the user to explicitly drop the coordinates, as we do for `region` operations. Attempt to align coordinates a. Fail if coordinates don't match b. Extend the arrays to replicate the behavior of `concat` We currently do 1a. I propose to switch to 1b. I think it is closer to what users want, and it requires less I/O. Anything else we need to know? No response Environment INSTALLED VERSIONS ------------------ commit: None python: 3.11.6 \| packaged by conda-forge \| (main, Oct 3 2023, 10:40:35) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 5.10.176-157.645.amzn2.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.2 libnetcdf: 4.9.2 xarray: 2023.10.1 pandas: 2.1.2 numpy: 1.24.4 scipy: 1.11.3 netCDF4: 1.6.5 pydap: installed h5netcdf: 1.2.0 h5py: 3.10.0 Nio: None zarr: 2.16.0 cftime: 1.6.2 nc_time_axis: 1.4.1 PseudoNetCDF: None iris: None bottleneck: 1.3.7 dask: 2023.10.1 distributed: 2023.10.1 matplotlib: 3.8.0 cartopy: 0.22.0 seaborn: 0.13.0 numbagg: 0.6.0 fsspec: 2023.10.0 cupy: None pint: 0.22 sparse: 0.14.0 flox: 0.8.1 numpy_groupies: 0.10.2 setuptools: 68.2.2 pip: 23.3.1 conda: None pytest: 7.4.3 mypy: None IPython: 8.16.1 sphinx: None	{ "url": "https://api.github.com/repos/pydata/xarray/issues/8427/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
350899839	MDU6SXNzdWUzNTA4OTk4Mzk=	2368	Let's list all the netCDF files that xarray can't open	rabernat 1197350	closed		32	2018-08-15T17:41:13Z	2023-11-30T04:36:42Z	2023-11-30T04:36:42Z	MEMBER			At the Pangeo developers meetings, I am hearing lots of reports from folks like @dopplershift and @rsignell-usgs about netCDF datasets that xarray can't open. My expectation is that xarray doesn't have strong requirements on the contents of datasets. (It doesn't "enforce" cf compatibility for example; that's optional.) Anything that can be written to netCDF should be readable by xarray. I would like to collect examples of places where xarray fails. So far, I am only aware of one: Self-referential multidimensional coordinates (#2233). Datasets which contain variables like `siglay(siglay, node)`. Only `siglay(siglay)` would work. Are there other distinct cases? Please provide links / sample code of netCDF datasets that xarray can't read. Even better would be short code snippets to create such datasets in python using the netcdf4 interface.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2368/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
1935984485	I_kwDOAMm_X85zZMdl	8290	Potential performance optimization for Zarr backend	rabernat 1197350	closed		0	2023-10-10T18:41:19Z	2023-10-13T16:38:58Z	2023-10-13T16:38:58Z	MEMBER			What is your issue? We have identified an inefficiency in the way the `ZarrArrayWrapper` works. This class currently stores a reference to a `ZarrStore` and a variable name https://github.com/pydata/xarray/blob/75af56c33a29529269a73bdd00df2d3af17ee0f5/xarray/backends/zarr.py#L63-L68 When accessing the array, the parent group of the array is read and used to open a new Zarr array. https://github.com/pydata/xarray/blob/75af56c33a29529269a73bdd00df2d3af17ee0f5/xarray/backends/zarr.py#L83-L84 This is a relatively metadata-intensive operation for Zarr. It requires reading both the group metadata and the array metadata. Because of how this wrapper works, these operations currently happen every time data is read from the array. If we have a dask array wrapping the zarr array with thousands of chunks, these metadata operations will happen within every single task. For high latency stores, this is really bad. Instead, we should just reference the `zarr.Array` object directly within the `ZarrArrayWrapper`. It's lightweight and easily serializable. There is no need to re-open the array each time we want to read data from it. This change will lead to an immediate performance enhancement in all Zarr operations.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/8290/reactions", "total_count": 6, "+1": 4, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 2, "eyes": 0 }	completed	xarray 13221727	issue
357808970	MDExOlB1bGxSZXF1ZXN0MjEzNzM2NTAx	2405	WIP: don't create indexes on multidimensional dimensions	rabernat 1197350	closed		7	2018-09-06T20:13:11Z	2023-07-19T18:33:17Z	2023-07-19T18:33:17Z	MEMBER	0	pydata/xarray/pulls/2405	[x] Closes #2368, Closes #2233 [ ] Tests added (for all bug fixes or enhancements) [ ] Tests passed (for all non-documentation changes) [ ] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later) This is just a start to the solution proposed in #2368. A surprisingly small number of tests broke in my local environment.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2405/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
401874795	MDU6SXNzdWU0MDE4NzQ3OTU=	2697	read ncml files to create multifile datasets	rabernat 1197350	closed		18	2019-01-22T17:33:08Z	2023-05-29T13:41:38Z	2023-05-29T13:41:38Z	MEMBER			This issue was motivated by a recent conversation with @jdha regarding how they are preparing inputs for regional ocean models. They are currently using ncml with netcdf-java to consolidate and homogenize diverse data sources. But this approach doesn't play well with the xarray / dask stack. ncml is standard developed by Unidata for use with their netCDF-java library: NcML is an XML representation of netCDF metadata, (approximately) the header information one gets from a netCDF file with the "ncdump -h" command. In addition to describing individual netCDF files, ncml can be used to annotate modifications to netCDF metadata (attributes, dimension names, etc.) and also to aggregate multiple files into a single logical dataset. This is what such an aggregation over an existing dimension looks like in ncml: `xml <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"> <aggregation dimName="time" type="joinExisting"> <netcdf location="jan.nc" /> <netcdf location="feb.nc" /> </aggregation> </netcdf>` Obviously this maps very well to xarray's `concat` operation. Similar aggregations can be defined that map to `merge` operations. I think it would be great if we could support the ncml spec in xarray, allowing us to write code like `python ds = xr.open_ncml('file.ncml')` This idea has been discussed before in #893. Perhaps it's time has finally come.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2697/reactions", "total_count": 7, "+1": 7, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
1231184996	I_kwDOAMm_X85JYmRk	6588	Support lazy concatenation without dask	rabernat 1197350	closed		2	2022-05-10T13:40:20Z	2023-03-10T18:40:22Z	2022-05-10T15:38:20Z	MEMBER			Is your feature request related to a problem? Right now, if I want to concatenate multiple datasets (e.g. as in `open_mfdataset`), I have two options: - Eagerly load the data as numpy arrays ➡️ xarray will dispatch to np.concatenate - Chunk each dataset ➡️ xarray will dispatch to dask.array.concatenate In pseudocode: ```python ds1 = xr.open_dataset("some_big_lazy_source_1.nc") ds2 = xr.open_dataset("some_big_lazy_source_2.nc") item1 = ds1.foo[0, 0, 0] # lazily access a single item ds = xr.concat([ds1.chunk(), ds2.chunk()], "time") # only way to lazily concat trying to access the same item will now trigger loading of all of ds1 item1 = ds.foo[0, 0, 0] yes I could use different chunks, but the point is that I should not have to arbitrarily choose chunks to make this work ``` However, I am increasingly encountering scenarios where I would like to lazily concatenate datasets (without loading into memory), but also without the requirement of using dask. This would be useful, for example, for creating composite datasets that point back to an OpenDAP server, preserving the possibility of granular lazy access to any array element without the requirement of arbitrary chunking at an intermediate stage. Describe the solution you'd like I propose to extend our LazilyIndexedArray classes to support simple concatenation and stacking. The result of applying concat to such arrays will be a new LazilyIndexedArray that wraps the underlying arrays into a single object. The main difficulty in implementing this will probably be with indexing: the concatenated array will need to understand how to map global indexes to the underling individual array indexes. That is a little tricky but eminently solvable. Describe alternatives you've considered The alternative is to structure your code in a way that avoids needing to lazily concatenate arrays. That is what we do now. It is not optimal. Additional context No response	{ "url": "https://api.github.com/repos/pydata/xarray/issues/6588/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
1260047355	I_kwDOAMm_X85LGsv7	6662	Obscure h5netcdf http serialization issue with python's http.server	rabernat 1197350	closed		6	2022-06-03T15:28:15Z	2022-06-04T22:13:05Z	2022-06-04T22:13:05Z	MEMBER			What is your issue? In Pangeo Forge, we try to test our ability to read data over http. This often surfaces edge cases involving xarray and fsspec. This is one such edge case. However, it is kind of important, because it affects our ability to reliably test http-based datasets using python's built-in http server. Here is some code that: - Creates a tiny dataset on disk - Serves it over http via `python -m http.server` - Opens the dataset with fsspec and xarray with the h5netcdf engine - Pickles the dataset, loads it, and calls `.load()` to load the data into memory As you can see, this works with a local file, but not with the http file, with h5py raising a checksum-related error. ```python import fsspec import xarray as xr from pickle import dumps, loads ds_orig = xr.tutorial.load_dataset('tiny') ds_orig fname = 'tiny.nc' ds_orig.to_netcdf(fname, engine='netcdf4') now start an http server in a terminal in the same working directory $ python -m http.server def open_pickle_and_reload(path): with fsspec.open(path, mode='rb') as fp: with xr.open_dataset(fp, engine='h5netcdf') as ds1: pass `# pickle it and reload it ds2 = loads(dumps(ds1)) ds2.load()` open_pickle_and_reload(fname) # works url = f'http://127.0.0.1:8000/{fname}' open_pickle_and_reload(url) # OSError: Unable to open file (incorrect metadata checksum after all read attempts) ``` full traceback ``` --------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~/Code/xarray/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock) 198 try: --> 199 file = self._cache[self._key] 200 except KeyError: ~/Code/xarray/xarray/backends/lru_cache.py in __getitem__(self, key) 52 with self._lock: ---> 53 value = self._cache[key] 54 self._cache.move_to_end(key) KeyError: [<class 'h5netcdf.core.File'>, (<File-like object HTTPFileSystem, http://127.0.0.1:8000/tiny.nc>,), 'r', (('decode_vlen_strings', True), ('invalid_netcdf', None))] During handling of the above exception, another exception occurred: OSError Traceback (most recent call last) <ipython-input-2-195ac3fcdb43> in <module> 24 open_pickle_and_reload(fname) # works 25 url = f'[http://127.0.0.1:8000/{fname}'](http://127.0.0.1:8000/%7Bfname%7D'%3C/span%3E) ---> 26 open_pickle_and_reload(url) # OSError: Unable to open file (incorrect metadata checksum after all read attempts) <ipython-input-2-195ac3fcdb43> in open_pickle_and_reload(path) 20 # pickle it and reload it 21 ds2 = loads(dumps(ds1)) ---> 22 ds2.load() # works 23 24 open_pickle_and_reload(fname) # works ~/Code/xarray/xarray/core/dataset.py in load(self, kwargs) 687 for k, v in self.variables.items(): 688 if k not in lazy_data: --> 689 v.load() 690 691 return self ~/Code/xarray/xarray/core/variable.py in load(self, kwargs) 442 self._data = as_compatible_data(self._data.compute(*kwargs)) 443 elif not is_duck_array(self._data): --> 444 self._data = np.asarray(self._data) 445 return self 446 ~/Code/xarray/xarray/core/indexing.py in __array__(self, dtype) 654 655 def __array__(self, dtype=None): --> 656 self._ensure_cached() 657 return np.asarray(self.array, dtype=dtype) 658 ~/Code/xarray/xarray/core/indexing.py in _ensure_cached(self) 651 def _ensure_cached(self): 652 if not isinstance(self.array, NumpyIndexingAdapter): --> 653 self.array = NumpyIndexingAdapter(np.asarray(self.array)) 654 655 def __array__(self, dtype=None): ~/Code/xarray/xarray/core/indexing.py in __array__(self, dtype) 624 625 def __array__(self, dtype=None): --> 626 return np.asarray(self.array, dtype=dtype) 627 628 def __getitem__(self, key): ~/Code/xarray/xarray/core/indexing.py in __array__(self, dtype) 525 def __array__(self, dtype=None): 526 array = as_indexable(self.array) --> 527 return np.asarray(array[self.key], dtype=None) 528 529 def transpose(self, order): ~/Code/xarray/xarray/backends/h5netcdf_.py in __getitem__(self, key) 49 50 def __getitem__(self, key): ---> 51 return indexing.explicit_indexing_adapter( 52 key, self.shape, indexing.IndexingSupport.OUTER_1VECTOR, self._getitem 53 ) ~/Code/xarray/xarray/core/indexing.py in explicit_indexing_adapter(key, shape, indexing_support, raw_indexing_method) 814 """ 815 raw_key, numpy_indices = decompose_indexer(key, shape, indexing_support) --> 816 result = raw_indexing_method(raw_key.tuple) 817 if numpy_indices.tuple: 818 # index the loaded np.ndarray ~/Code/xarray/xarray/backends/h5netcdf_.py in _getitem(self, key) 58 key = tuple(list(k) if isinstance(k, np.ndarray) else k for k in key) 59 with self.datastore.lock: ---> 60 array = self.get_array(needs_lock=False) 61 return array[key] 62 ~/Code/xarray/xarray/backends/h5netcdf_.py in get_array(self, needs_lock) 45 class H5NetCDFArrayWrapper(BaseNetCDF4Array): 46 def get_array(self, needs_lock=True): ---> 47 ds = self.datastore._acquire(needs_lock) 48 return ds.variables[self.variable_name] 49 ~/Code/xarray/xarray/backends/h5netcdf_.py in _acquire(self, needs_lock) 180 181 def _acquire(self, needs_lock=True): --> 182 with self._manager.acquire_context(needs_lock) as root: 183 ds = _nc4_require_group( 184 root, self._group, self._mode, create_group=_h5netcdf_create_group /opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/contextlib.py in __enter__(self) 117 del self.args, self.kwds, self.func 118 try: --> 119 return next(self.gen) 120 except StopIteration: 121 raise RuntimeError("generator didn't yield") from None ~/Code/xarray/xarray/backends/file_manager.py in acquire_context(self, needs_lock) 185 def acquire_context(self, needs_lock=True): 186 """Context manager for acquiring a file.""" --> 187 file, cached = self._acquire_with_cache_info(needs_lock) 188 try: 189 yield file ~/Code/xarray/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock) 203 kwargs = kwargs.copy() 204 kwargs["mode"] = self._mode --> 205 file = self._opener(self._args, kwargs) 206 if self._mode == "w": 207 # ensure file doesn't get overridden when opened again /opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/h5netcdf/core.py in __init__(self, path, mode, invalid_netcdf, phony_dims, kwargs) 719 else: 720 self._preexisting_file = mode in {"r", "r+", "a"} --> 721 self._h5file = h5py.File(path, mode, kwargs) 722 except Exception: 723 self._closed = True /opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, kwds) 505 fs_persist=fs_persist, fs_threshold=fs_threshold, 506 fs_page_size=fs_page_size) --> 507 fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr) 508 509 if isinstance(libver, tuple): /opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr) 218 if swmr and swmr_support: 219 flags \|= h5f.ACC_SWMR_READ --> 220 fid = h5f.open(name, flags, fapl=fapl) 221 elif mode == 'r+': 222 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl) h5py/_objects.pyx in h5py._objects.with_phil.wrapper() h5py/_objects.pyx in h5py._objects.with_phil.wrapper() h5py/h5f.pyx in h5py.h5f.open() OSError: Unable to open file (incorrect metadata checksum after all read attempts) (external_url) ``` Strangely, a similar workflow does work with http files hosted elsewhere, e.g. `python external_url = 'https://power-datastore.s3.amazonaws.com/v9/climatology/power_901_rolling_zones_utc.nc' open_pickle_and_reload(external_url)` This suggests there is something peculiar about python's `http.server` as compared to other http servers that makes this break. I would appreciate any thoughts or ideas about what might be going on here (pinging @martindurant and @shoyer) xref: - https://github.com/pangeo-forge/pangeo-forge-recipes/pull/373 - https://github.com/pydata/xarray/issues/4242 - https://github.com/google/xarray-beam/issues/49	{ "url": "https://api.github.com/repos/pydata/xarray/issues/6662/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
333312849	MDU6SXNzdWUzMzMzMTI4NDk=	2237	why time grouping doesn't preserve chunks	rabernat 1197350	closed		30	2018-06-18T15:12:38Z	2022-05-15T02:44:06Z	2022-05-15T02:38:30Z	MEMBER			Code Sample, a copy-pastable example if possible I am continuing my quest to obtain more efficient time grouping for calculation of climatologies and climatological anomalies. I believe this is one of the major performance bottlenecks facing xarray users today. I have raised this in other issues (e.g. #1832), but I believe I have narrowed it down here to a more specific problem. The easiest way to summarize the problem is with an example. Consider the following dataset `python import xarray as xr ds = xr.Dataset({'foo': (['x'], [1, 1, 1, 1])}, coords={'x': (['x'], [0, 1, 2, 3]), 'bar': (['x'], ['a', 'a', 'b', 'b']), 'baz': (['x'], ['a', 'b', 'a', 'b'])}) ds = ds.chunk({'x': 2}) ds` `<xarray.Dataset> Dimensions: (x: 4) Coordinates: * x (x) int64 0 1 2 3 bar (x) <U1 dask.array<shape=(4,), chunksize=(2,)> baz (x) <U1 dask.array<shape=(4,), chunksize=(2,)> Data variables: foo (x) int64 dask.array<shape=(4,), chunksize=(2,)>` One non-dimension coordinate (`bar`) is contiguous with respect to `x` while the other `baz` is not. This is important. `baz` is structured similar to the way that `month` would be distributed on a timeseries dataset. Now let's do a trivial groupby operation on `bar` that does nothing, just returns the group unchanged: `python ds.foo.groupby('bar').apply(lambda x: x)` `<xarray.DataArray 'foo' (x: 4)> dask.array<shape=(4,), dtype=int64, chunksize=(2,)> Coordinates: * x (x) int64 0 1 2 3 bar (x) <U1 dask.array<shape=(4,), chunksize=(2,)> baz (x) <U1 dask.array<shape=(4,), chunksize=(2,)>` This operation preserved this original chunks in `foo`. But if we group by `baz` we see something different `python ds.foo.groupby('baz').apply(lambda x: x)` `<xarray.DataArray 'foo' (x: 4)> dask.array<shape=(4,), dtype=int64, chunksize=(4,)> Coordinates: * x (x) int64 0 1 2 3 bar (x) <U1 dask.array<shape=(4,), chunksize=(2,)> baz (x) <U1 dask.array<shape=(4,), chunksize=(2,)>` Problem description When grouping over a non-contiguous variable (`baz`) the result has no chunks. That means that we can't lazily access a single item without computing the whole array. This has major performance consequences that make it hard to calculate anomaly values in a more realistic case. What we really want to do is often something like `ds = xr.open_mfdataset('lots/of/files/*.nc') ds_anom = ds.groupby('time.month').apply(lambda x: x - x.mean(dim='time)` It is currently impossible to do this lazily due to the issue described above. Expected Output We would like to preserve the original chunk structure of `foo`. Output of `xr.show_versions()` `xr.show_versions()` is triggering a segfault right now on my system for unknown reasons! I am using xarray 0.10.7.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2237/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
413589315	MDU6SXNzdWU0MTM1ODkzMTU=	2785	error decoding cftime time_bnds over opendap with pydap	rabernat 1197350	closed		2	2019-02-22T21:38:24Z	2021-07-21T14:51:36Z	2021-07-21T14:51:36Z	MEMBER			Code Sample, a copy-pastable example if possible I try to load the following dataset over opendap with the pydap engine. It only works if I do decode_times=False `python url = 'http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NOAA-GFDL/GFDL-AM4/amip/r1i1p1f1/Amon/ta/gr1/v20180807/ta_Amon_GFDL-AM4_amip_r1i1p1f1_gr1_198001-201412.nc' ds = xr.open_dataset(url, decode_times=False, engine='pydap') xr.decode_times(ds)` raises ``` IndexError Traceback (most recent call last) <ipython-input-52-df985a95e29e> in <module>() 1 #ds.time_bnds.load() ----> 2 xr.decode_cf(ds) ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables) 459 vars, attrs, coord_names = decode_cf_variables( 460 vars, attrs, concat_characters, mask_and_scale, decode_times, --> 461 decode_coords, drop_variables=drop_variables) 462 ds = Dataset(vars, attrs=attrs) 463 ds = ds.set_coords(coord_names.union(extra_coords).intersection(vars)) ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/conventions.py in decode_cf_variables(variables, attributes, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables) 392 k, v, concat_characters=concat_characters, 393 mask_and_scale=mask_and_scale, decode_times=decode_times, --> 394 stack_char_dim=stack_char_dim) 395 if decode_coords: 396 var_attrs = new_vars[k].attrs ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/conventions.py in decode_cf_variable(name, var, concat_characters, mask_and_scale, decode_times, decode_endianness, stack_char_dim) 298 for coder in [times.CFTimedeltaCoder(), 299 times.CFDatetimeCoder()]: --> 300 var = coder.decode(var, name=name) 301 302 dimensions, data, attributes, encoding = ( ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/coding/times.py in decode(self, variable, name) 410 units = pop_to(attrs, encoding, 'units') 411 calendar = pop_to(attrs, encoding, 'calendar') --> 412 dtype = _decode_cf_datetime_dtype(data, units, calendar) 413 transform = partial( 414 decode_cf_datetime, units=units, calendar=calendar) ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/coding/times.py in _decode_cf_datetime_dtype(data, units, calendar) 116 values = indexing.ImplicitToExplicitIndexingAdapter( 117 indexing.as_indexable(data)) --> 118 example_value = np.concatenate([first_n_items(values, 1) or [0], 119 last_item(values) or [0]]) 120 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/formatting.py in first_n_items(array, n_desired) 94 from_end=False) 95 array = array[indexer] ---> 96 return np.asarray(array).flat[:n_desired] 97 98 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order) 529 530 """ --> 531 return array(a, dtype, copy=False, order=order) 532 533 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in array(self, dtype) 630 631 def array(self, dtype=None): --> 632 self._ensure_cached() 633 return np.asarray(self.array, dtype=dtype) 634 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in _ensure_cached(self) 627 def _ensure_cached(self): 628 if not isinstance(self.array, NumpyIndexingAdapter): --> 629 self.array = NumpyIndexingAdapter(np.asarray(self.array)) 630 631 def array(self, dtype=None): ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order) 529 530 """ --> 531 return array(a, dtype, copy=False, order=order) 532 533 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in array(self, dtype) 608 609 def array(self, dtype=None): --> 610 return np.asarray(self.array, dtype=dtype) 611 612 def getitem(self, key): ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order) 529 530 """ --> 531 return array(a, dtype, copy=False, order=order) 532 533 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in array(self, dtype) 514 def array(self, dtype=None): 515 array = as_indexable(self.array) --> 516 return np.asarray(array[self.key], dtype=None) 517 518 def transpose(self, order): ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/conventions.py in getitem(self, key) 43 44 def getitem(self, key): ---> 45 return np.asarray(self.array[key], dtype=self.dtype) 46 47 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order) 529 530 """ --> 531 return array(a, dtype, copy=False, order=order) 532 533 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in array(self, dtype) 514 def array(self, dtype=None): 515 array = as_indexable(self.array) --> 516 return np.asarray(array[self.key], dtype=None) 517 518 def transpose(self, order): ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/backends/pydap_.py in getitem(self, key) 24 def getitem(self, key): 25 return indexing.explicit_indexing_adapter( ---> 26 key, self.shape, indexing.IndexingSupport.BASIC, self._getitem) 27 28 def _getitem(self, key): ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in explicit_indexing_adapter(key, shape, indexing_support, raw_indexing_method) 785 if numpy_indices.tuple: 786 # index the loaded np.ndarray --> 787 result = NumpyIndexingAdapter(np.asarray(result))[numpy_indices] 788 return result 789 ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in getitem(self, key) 1174 def getitem(self, key): 1175 array, key = self._indexing_array_and_key(key) -> 1176 return array[key] 1177 1178 def setitem(self, key, value): IndexError: too many indices for array ``` Strangely, I can overcome the error by first explicitly loading (or dropping) the `time_bnds` variable: `python ds.time_bnds.load() xr.decode_cf(ds)` I wish this would work without the `.load()` step. I think it has something to do with the many layers of array wrappers involved in lazy opening. The problem does not occur with the netcdf4 engine. I know this is a very obscure problem, but I thought I would open an issue to document. Output of `xr.show_versions()` INSTALLED VERSIONS ------------------ commit: None python: 3.6.8 \|Anaconda, Inc.\| (default, Dec 29 2018, 19:04:46) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2 xarray: 0.11.3 pandas: 0.23.4 numpy: 1.13.1 scipy: 0.19.1 netCDF4: 1.4.2 pydap: installed h5netcdf: None h5py: None Nio: None zarr: 2.2.1.dev126+dirty cftime: 1.0.3.4 PseudonetCDF: None rasterio: None cfgrib: None iris: None bottleneck: 1.2.1 cyordereddict: None dask: 0.20.2 distributed: 1.24.2 matplotlib: 2.1.0 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 40.6.2 pip: 18.1 conda: None pytest: 4.0.0 IPython: 6.1.0 sphinx: 1.6.5	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2785/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
745801652	MDU6SXNzdWU3NDU4MDE2NTI=	4591	Serialization issue with distributed, h5netcdf, and fsspec (ImplicitToExplicitIndexingAdapter)	rabernat 1197350	closed		12	2020-11-18T16:18:42Z	2021-06-30T17:53:54Z	2020-11-19T15:54:38Z	MEMBER			This was originally reported by @jkingslake at https://github.com/pangeo-data/pangeo-datastore/issues/116. What happened: I tried to open a netcdf file over http using fsspec and the h5netcdf engine and compute data using dask.distributed. It appears that our `ImplicitToExplicitIndexingAdapter` is [no longer?] serializable? What you expected to happen: Things would work. Indeed, I could swear this used to work with previous versions. Minimal Complete Verifiable Example: ```python import xarray as xr import fsspec from dask.distributed import Client example needs to use distributed to reproduce the bug client = Client() url = 'https://storage.googleapis.com/ldeo-glaciology/bedmachine/BedMachineAntarctica_2019-11-05_v01.nc' with fsspec.open(url, mode='rb') as openfile: dsc = xr.open_dataset(openfile, chunks=3000) dsc.surface.mean().compute() ``` raises the following error Traceback (most recent call last): File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/core.py", line 50, in dumps data = { File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/core.py", line 51, in <dictcomp> key: serialize( File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 277, in serialize raise TypeError(msg, str(x)[:10000]) TypeError: ('Could not serialize object of type ImplicitToExplicitIndexingAdapter.', 'ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=<xarray.backends.h5netcdf_.H5NetCDFArrayWrapper object at 0x7ff8e3988540>, key=BasicIndexer((slice(None, None, None), slice(None, None, None))))))') distributed.comm.utils - ERROR - ('Could not serialize object of type ImplicitToExplicitIndexingAdapter.', 'ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=<xarray.backends.h5netcdf_.H5NetCDFArrayWrapper object at 0x7ff8e3988540>, key=BasicIndexer((slice(None, None, None), slice(None, None, None))))))') Anything else we need to know?: One can work around this by using the netcdf4 library's new and undocumented ability to open files over http. `python url = 'https://storage.googleapis.com/ldeo-glaciology/bedmachine/BedMachineAntarctica_2019-11-05_v01.nc#mode=bytes' ds = xr.open_dataset(url, engine='netcdf4', chunks=3000) ds` However, the fsspec + h5netcdf path should work! Environment: Output of <tt>xr.show_versions()</tt> ``` INSTALLED VERSIONS ------------------ commit: None python: 3.8.6 \| packaged by conda-forge \| (default, Oct 7 2020, 19:08:05) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 4.19.112+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.16.1 pandas: 1.1.3 numpy: 1.19.2 scipy: 1.5.2 netCDF4: 1.5.4 pydap: installed h5netcdf: 0.8.1 h5py: 2.10.0 Nio: None zarr: 2.4.0 cftime: 1.2.1 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: 1.1.7 cfgrib: 0.9.8.4 iris: None bottleneck: 1.3.2 dask: 2.30.0 distributed: 2.30.0 matplotlib: 3.3.2 cartopy: 0.18.0 seaborn: None numbagg: None pint: 0.16.1 setuptools: 49.6.0.post20201009 pip: 20.2.4 conda: None pytest: 6.1.1 IPython: 7.18.1 sphinx: 3.2.1 ``` Also fsspec 0.8.4 cc @martindurant for fsspec integration.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/4591/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
836391524	MDU6SXNzdWU4MzYzOTE1MjQ=	5056	Allow "unsafe" mode for zarr writing	rabernat 1197350	closed		1	2021-03-19T21:57:47Z	2021-04-26T16:37:43Z	2021-04-26T16:37:43Z	MEMBER			Curently, `Dataset.to_zarr` will only write Zarr datasets in cases in which - The Dataset arrays are in memory (no dask) - The arrays are chunked with dask with a one-to-many relationship between dask chunks and zarr chunks If I try to violate the one-to-many condition, I get an error `python import xarray as xr ds = xr.DataArray([0, 1., 2], name='foo').chunk({'dim_0': 1}).to_dataset() d = ds.to_zarr('test.zarr', encoding={'foo': {'chunks': (3,)}}, compute=False)` ``` /srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/zarr.py in _determine_zarr_chunks(enc_chunks, var_chunks, ndim, name) 148 for dchunk in dchunks[:-1]: 149 if dchunk % zchunk: --> 150 raise NotImplementedError( 151 f"Specified zarr chunks encoding['chunks']={enc_chunks_tuple!r} for " 152 f"variable named {name!r} would overlap multiple dask chunks {var_chunks!r}. " NotImplementedError: Specified zarr chunks encoding['chunks']=(3,) for variable named 'foo' would overlap multiple dask chunks ((1, 1, 1),). This is not implemented in xarray yet. Consider either rechunking using `chunk()` or instead deleting or modifying `encoding['chunks']`. ``` In this case, the error is particularly frustrating because I'm not even writing any data yet. (Also related to #2300, #4046, #4380). There are at least two scenarios in which we might want to have more flexibility. 1. The case above, when we want to lazily initialize a Zarr array based on a Dataset, without actually computing anything. 2. The more general case, where we actually write arrays with many-to-many dask-chunk <-> zarr-chunk relationships For 1, I propose we add a new option like `safe_chunks=True` to `to_zarr`. `safe_chunks=False` would permit just bypassing this chunk. For 2, we could consider implementing locks. This probably has to be done at the Dask level. But is actually not super hard to deterministically figure out which chunks need to share a lock.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/5056/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
837243943	MDExOlB1bGxSZXF1ZXN0NTk3NjA4NTg0	5065	Zarr chunking fixes	rabernat 1197350	closed		32	2021-03-22T01:35:22Z	2021-04-26T16:37:43Z	2021-04-26T16:37:43Z	MEMBER	0	pydata/xarray/pulls/5065	[x] Closes #2300, closes #5056 [x] Tests added [x] Passes `pre-commit run --all-files` [x] User visible changes (including notable bug fixes) are documented in `whats-new.rst` This PR contains two small, related updates to how Zarr chunks are handled. We now delete the `encoding` attribute at the Variable level whenever `chunk` is called. The persistence of `chunk` encoding has been the source of lots of confusion (see #2300, #4046, #4380, https://github.com/dcs4cop/xcube/issues/347) Added a new option called `safe_chunks` in `to_zarr` which allows for bypassing the requirement of the many-to-one relationship between Zarr chunks and Dask chunks (see #5056). Both these touch the internal logic for how chunks are handled, so I thought it was easiest to tackle them with a single PR.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/5065/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
859945463	MDU6SXNzdWU4NTk5NDU0NjM=	5172	Inconsistent attribute handling between netcdf4 and h5netcdf engines	rabernat 1197350	closed		3	2021-04-16T15:54:03Z	2021-04-20T14:00:34Z	2021-04-16T17:13:26Z	MEMBER			I have found a netCDF file that cannot be decoded by xarray via the h5netcdf engine but CAN be decoded via netCDF4. This could be considered an h5netcdf bug, but I thought I would raise it first here for visibility. This file will reproduce the bug `! wget 'https://esgf-world.s3.amazonaws.com/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/r1i1p1f1/Lmon/cLeaf/gr/v20190118/cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc'` ```python import netCDF4 import h5netcdf.legacyapi as netCDF4_h5 local_path = "cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc" with netCDF4_h5.Dataset(local_path, mode='r') as ncfile: print('h5netcdf:', ncfile['cLeaf'].getncattr("coordinates")) with netCDF4.Dataset(local_path, mode='r') as ncfile: #assert "coordinates" not in ncfile['cLeaf'].attrs print('netCDF4:', ncfile['cLeaf'].getncattr("coordinates")) ``` `h5netcdf: Empty(dtype=dtype('S1')) netCDF4:` As we can see, we get an empty string `''` in netCDF4 but a `<class 'h5py._hl.base.Empty'>` object from h5netcdf. This weird attribute prevents xarray from decoding the dataset. We could: - Fix it in xarray, but having special handling for this sort of `Empty` object - Fix it in h5netcdf Environment: Output of <tt>xr.show_versions()</tt> INSTALLED VERSIONS ------------------ commit: None python: 3.8.8 \| packaged by conda-forge \| (default, Feb 20 2021, 16:22:27) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 4.19.150+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.17.0 pandas: 1.2.3 numpy: 1.20.2 scipy: 1.6.2 netCDF4: 1.5.6 pydap: installed h5netcdf: 0.10.0 h5py: 3.1.0 Nio: None zarr: 2.7.0 cftime: 1.4.1 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: 1.2.1 cfgrib: 0.9.8.5 iris: None bottleneck: 1.3.2 dask: 2021.03.1 distributed: 2021.03.1 matplotlib: 3.3.4 cartopy: 0.18.0 seaborn: None numbagg: None pint: 0.17 setuptools: 49.6.0.post20210108 pip: 20.3.4 conda: None pytest: None IPython: 7.22.0 sphinx: None xref https://github.com/pangeo-forge/pangeo-forge/issues/105	{ "url": "https://api.github.com/repos/pydata/xarray/issues/5172/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
548607657	MDU6SXNzdWU1NDg2MDc2NTc=	3689	Decode CF bounds to coords	rabernat 1197350	closed		5	2020-01-12T18:23:26Z	2021-04-19T03:32:26Z	2021-04-19T03:32:26Z	MEMBER			CF conventions define Cell Boundaries and specify how to encode the presence of cell boundary variables in dataset attributes. To represent cells we add the attribute bounds to the appropriate coordinate variable(s). The value of `bounds` is the name of the variable that contains the vertices of the cell boundaries. For example consider this dataset: `http://esgf-data.ucar.edu/thredds/dodsC/esg_dataroot/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_200001-201412.nc` `python url = 'http://esgf-data.ucar.edu/thredds/dodsC/esg_dataroot/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_200001-201412.nc' ds = xr.open_dataset(url) ds` gives `<xarray.Dataset> Dimensions: (lat: 192, lon: 288, nbnd: 2, time: 180) Coordinates: * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0 * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8 * time (time) object 2000-01-15 12:00:00 ... 2014-12-15 12:00:00 Dimensions without coordinates: nbnd Data variables: time_bnds (time, nbnd) object ... lat_bnds (lat, nbnd) float64 ... lon_bnds (lon, nbnd) float64 ... tas (time, lat, lon) float32 ...` Despite the presence of the bounds attributes ``` print(ds.time.bounds, ds.lat.bounds, ds.lon.bounds) time_bnds lat_bnds lon_bnds ``` The variables `time_bnds`, `lat_bnds`, and `lon_bnds` are not decoded as coordinates but as data variables. I believe that this is not in accordance with CF conventions. Instead, we should decode all `bounds` variables to coordinates. I cannot think of a single use case where one would want to treat these variables as data variables rather than coordinates. It would be easy to implement, but it is a breaking change. Not that this is just a proposal to move bounds variables to the coords part of the dataset. It does not address the more difficult / complex question of how to actually use the bounds for indexing or plotting operations (see e.g. #1475, #1613), although it could be a first step in that direction. Full ncdump of dataset ``` xarray.Dataset { dimensions: lat = 192 ; lon = 288 ; nbnd = 2 ; time = 180 ; variables: float64 lat(lat) ; lat:axis = Y ; lat:bounds = lat_bnds ; lat:standard_name = latitude ; lat:title = Latitude ; lat:type = double ; lat:units = degrees_north ; lat:valid_max = 90.0 ; lat:valid_min = -90.0 ; lat:_ChunkSizes = 192 ; float64 lon(lon) ; lon:axis = X ; lon:bounds = lon_bnds ; lon:standard_name = longitude ; lon:title = Longitude ; lon:type = double ; lon:units = degrees_east ; lon:valid_max = 360.0 ; lon:valid_min = 0.0 ; lon:_ChunkSizes = 288 ; object time(time) ; time:axis = T ; time:bounds = time_bnds ; time:standard_name = time ; time:title = time ; time:type = double ; time:_ChunkSizes = 512 ; object time_bnds(time, nbnd) ; time_bnds:_ChunkSizes = [1 2] ; float64 lat_bnds(lat, nbnd) ; lat_bnds:units = degrees_north ; lat_bnds:_ChunkSizes = [192 2] ; float64 lon_bnds(lon, nbnd) ; lon_bnds:units = degrees_east ; lon_bnds:_ChunkSizes = [288 2] ; float32 tas(time, lat, lon) ; tas:cell_measures = area: areacella ; tas:cell_methods = area: time: mean ; tas:comment = near-surface (usually, 2 meter) air temperature ; tas:description = near-surface (usually, 2 meter) air temperature ; tas:frequency = mon ; tas:id = tas ; tas:long_name = Near-Surface Air Temperature ; tas:mipTable = Amon ; tas:out_name = tas ; tas:prov = Amon ((isd.003)) ; tas:realm = atmos ; tas:standard_name = air_temperature ; tas:time = time ; tas:time_label = time-mean ; tas:time_title = Temporal mean ; tas:title = Near-Surface Air Temperature ; tas:type = real ; tas:units = K ; tas:variable_id = tas ; tas:_ChunkSizes = [ 1 192 288] ; // global attributes: :Conventions = CF-1.7 CMIP-6.2 ; ... [truncated] ``` Output of `xr.show_versions()` INSTALLED VERSIONS ------------------ commit: None python: 3.6.7 \| packaged by conda-forge \| (default, Jul 2 2019, 02:07:37) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.6.2 xarray: 0.14.0+19.gba48fbcd pandas: 0.25.1 numpy: 1.17.2 scipy: 1.3.1 netCDF4: 1.5.1.2 pydap: None h5netcdf: 0.7.4 h5py: 2.10.0 Nio: None zarr: 2.3.2 cftime: 1.0.3.4 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: None cfgrib: 0.9.7.1 iris: None bottleneck: 1.2.1 dask: 2.4.0 distributed: 2.4.0 matplotlib: 3.1.1 cartopy: 0.17.0 seaborn: 0.9.0 numbagg: None setuptools: 41.2.0 pip: 19.2.3 conda: None pytest: 5.1.2 IPython: 7.8.0 sphinx: 1.6.5	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3689/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
99836561	MDU6SXNzdWU5OTgzNjU2MQ==	521	time decoding error with "days since"	rabernat 1197350	closed		20	2015-08-08T21:54:24Z	2021-03-29T14:12:38Z	2015-08-14T17:23:26Z	MEMBER			I am trying to use xray with some CESM POP model netCDF output, which supposedly follows CF-1.0 conventions. It is failing because the models time units are "'days since 0000-01-01 00:00:00". When calling open_dataset, I get the following error: ValueError: unable to decode time units u'days since 0000-01-01 00:00:00' with the default calendar. Try opening your dataset with decode_times=False. Full traceback: Traceback (most recent call last): File "/home/rpa/xray/xray/conventions.py", line 372, in __init__ # Otherwise, tracebacks end up swallowed by Dataset.__repr__ when users File "/home/rpa/xray/xray/conventions.py", line 145, in decode_cf_datetime dates = _decode_datetime_with_netcdf4(flat_num_dates, units, calendar) File "/home/rpa/xray/xray/conventions.py", line 97, in _decode_datetime_with_netcdf4 dates = np.asarray(nc4.num2date(num_dates, units, calendar)) File "netCDF4/_netCDF4.pyx", line 4522, in netCDF4._netCDF4.num2date (netCDF4/_netCDF4.c:50388) File "netCDF4/_netCDF4.pyx", line 4337, in netCDF4._netCDF4._dateparse (netCDF4/_netCDF4.c:48234) ValueError: year is out of range Full metadata for the time variable: `double time(time) ; time:long_name = "time" ; time:units = "days since 0000-01-01 00:00:00" ; time:bounds = "time_bound" ; time:calendar = "noleap" ;` I guess this is a problem with the underlying netCDF4 num2date package?	{ "url": "https://api.github.com/repos/pydata/xarray/issues/521/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
288184220	MDU6SXNzdWUyODgxODQyMjA=	1823	We need a fast path for open_mfdataset	rabernat 1197350	closed		19	2018-01-12T17:01:49Z	2021-01-28T18:00:15Z	2021-01-27T17:50:09Z	MEMBER			It would be great to have a "fast path" option for `open_mfdataset`, in which all alignment / coordinate checking is bypassed. This would be used in cases where the user knows that many netCDF files all share the same coordinates (e.g. model output, satellite records from the same product, etc.). The coordinates would just be taken from the first file, and only the data variables would be read from all subsequent files. The only checking would be that the data variables have the correct shape. Implementing this would require some refactoring. @jbusecke mentioned that he had developed a solution for this (related to #1704), so maybe he could be the one to add this feature to xarray. This is also related to #1385.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1823/reactions", "total_count": 9, "+1": 9, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
753965875	MDU6SXNzdWU3NTM5NjU4NzU=	4631	Decode_cf fails when scale_factor is a length-1 list	rabernat 1197350	closed		4	2020-12-01T03:07:48Z	2021-01-15T18:19:56Z	2021-01-15T18:19:56Z	MEMBER			Some datasets I work with have `scale_factor` and `add_offset` encoded as length-1 lists. The following code worked as of Xarray 0.16.1 `python import xarray as xr ds = xr.DataArray([0, 1, 2], name='foo', attrs={'scale_factor': [0.01], 'add_offset': [1.0]}).to_dataset() xr.decode_cf(ds)` In 0.16.2 (just released) and current master, it fails with this error ``` AttributeError Traceback (most recent call last) <ipython-input-2-a0b01d6a314b> in <module> 2 attrs={'scale_factor': [0.01], 3 'add_offset': [1.0]}).to_dataset() ----> 4 xr.decode_cf(ds) ~/Code/xarray/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta) 587 raise TypeError("can only decode Dataset or DataStore objects") 588 --> 589 vars, attrs, coord_names = decode_cf_variables( 590 vars, 591 attrs, ~/Code/xarray/xarray/conventions.py in decode_cf_variables(variables, attributes, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta) 490 and stackable(v.dims[-1]) 491 ) --> 492 new_vars[k] = decode_cf_variable( 493 k, 494 v, ~/Code/xarray/xarray/conventions.py in decode_cf_variable(name, var, concat_characters, mask_and_scale, decode_times, decode_endianness, stack_char_dim, use_cftime, decode_timedelta) 333 variables.CFScaleOffsetCoder(), 334 ]: --> 335 var = coder.decode(var, name=name) 336 337 if decode_timedelta: ~/Code/xarray/xarray/coding/variables.py in decode(self, variable, name) 271 dtype = _choose_float_dtype(data.dtype, "add_offset" in attrs) 272 if np.ndim(scale_factor) > 0: --> 273 scale_factor = scale_factor.item() 274 if np.ndim(add_offset) > 0: 275 add_offset = add_offset.item() AttributeError: 'list' object has no attribute 'item' ``` I'm very confused, because this feels quite similar to #4471, and I thought it was resolved #4485. However, the behavior is different with `'scale_factor': np.array([0.01])`. That works fine--no error. How might I end up with a dataset with `scale_factor` as a python list? It happens when I open a netcdf file using the `h5netcdf` engine (documented by @gerritholl in https://github.com/pydata/xarray/issues/4471#issuecomment-702018925) and then write it to zarr. The numpy array gets encoded as a list in the zarr json metadata. 🙃 This problem would go away if we could resolve the discrepancies between the two engines' treatment of scalar attributes.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/4631/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
753514595	MDU6SXNzdWU3NTM1MTQ1OTU=	4624	Release 0.16.2?	rabernat 1197350	closed		6	2020-11-30T14:15:55Z	2020-12-02T00:24:31Z	2020-12-01T15:09:38Z	MEMBER			Looking at our what's new, we have quite a few important new features, as well as significant bug fixes. I propose we move towards releasing ~0.17.0~ 0.16.2 asap. (I have selfish motives for this, as I want to use the new features in production.) We can use this issue to track any PRs or issues we want to resolve before the next release. I personally am not aware of any major blockers, but other devs should feel free to edit this list. [ ] #4461 - requires decisions [x] #4618 [x] #4621 cc @pydata/xarray	{ "url": "https://api.github.com/repos/pydata/xarray/issues/4624/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
375663610	MDU6SXNzdWUzNzU2NjM2MTA=	2528	display_width doesn't apply to dask-backed arrays	rabernat 1197350	closed		3	2018-10-30T19:49:05Z	2020-09-30T06:17:17Z	2020-09-30T06:17:17Z	MEMBER			The representation of dask-backed arrays in xarray's `__repr__` methods results in very long lines which often overflow the desired line width. Unfortunately, this can't be controlled or overridden with `xr.set_options(display_width=...)`. Code Sample, a copy-pastable example if possible `python import xarray as xr xr.set_options(display_width=20) ds = (xr.DataArray(range(100)) .chunk({'dim_0': 10}) .to_dataset(name='really_long_long_name')) ds` `<xarray.Dataset> Dimensions: (dim_0: 100) Dimensions without coordinates: dim_0 Data variables: really_long_long_name (dim_0) int64 dask.array<shape=(100,), chunksize=(10,)>` Problem description [this should explain why the current behavior is a problem and why the expected output is a better solution.] Expected Output We need to decide how to abbreviate dask arrays with something more concise. I'm not sure the best way to do this. Maybe `really_long_long_name (dim_0) int64 dask chunks=(10,)`	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2528/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
614814400	MDExOlB1bGxSZXF1ZXN0NDE1MjkyMzM3	4047	Document Xarray zarr encoding conventions	rabernat 1197350	closed		3	2020-05-08T15:29:14Z	2020-05-22T21:59:09Z	2020-05-20T17:04:02Z	MEMBER	0	pydata/xarray/pulls/4047	When we implemented the Zarr backend, we made some ad hoc choices about how to encode NetCDF data in Zarr. At this stage, it would be useful to explicitly document this encoding. I decided to put it on the "Xarray Internals" page, but I'm open to moving if folks feel it fits better elsewhere. cc @jeffdlb, @WardF, @DennisHeimbigner	{ "url": "https://api.github.com/repos/pydata/xarray/issues/4047/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
528884925	MDU6SXNzdWU1Mjg4ODQ5MjU=	3575	map_blocks output inference problems	rabernat 1197350	closed		6	2019-11-26T17:56:11Z	2020-05-06T16:41:54Z	2020-05-06T16:41:54Z	MEMBER			I am excited about using `map_blocks` to overcome a long-standing challenge related to calculating climatologies / anomalies with dask arrays. However, I hit what feels like a bug. I don't love how the new `map_blocks` function does this: The function will be first run on mocked-up data, that looks like ‘obj’ but has sizes 0, to determine properties of the returned object such as dtype, variable names, new dimensions and new indexes (if any). The problem is that many functions will simply error on size 0 data. As in the example below MCVE Code Sample ```python import xarray as xr ds = xr.tutorial.load_dataset('rasm').chunk({'y': 20}) def calculate_anomaly(ds): # needed to workaround xarray's check with zero dimensions #if len(ds['time']) == 0: # return ds gb = ds.groupby("time.month") clim = gb.mean(dim='T') return gb - clim xr.map_blocks(calculate_anomaly, ds) ``` Raises ``` KeyError Traceback (most recent call last) /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataset.py in _construct_dataarray(self, name) 1145 try: -> 1146 variable = self._variables[name] 1147 except KeyError: KeyError: 'time.month' During handling of the above exception, another exception occurred: AttributeError Traceback (most recent call last) /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/parallel.py in infer_template(func, obj, args, kwargs) 77 try: ---> 78 template = func(meta_args, kwargs) 79 except Exception as e: <ipython-input-40-d7b2b2978c29> in calculate_anomaly(ds) 5 # return ds ----> 6 gb = ds.groupby("time.month") 7 clim = gb.mean(dim='T') /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/common.py in groupby(self, group, squeeze, restore_coord_dims) 656 return self._groupby_cls( --> 657 self, group, squeeze=squeeze, restore_coord_dims=restore_coord_dims 658 ) /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/groupby.py in init(self, obj, group, squeeze, grouper, bins, restore_coord_dims, cut_kwargs) 298 ) --> 299 group = obj[group] 300 if len(group) == 0: /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataset.py in getitem*(self, key) 1235 if hashable(key): -> 1236 return self._construct_dataarray(key) 1237 else: /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataset.py in construct_dataarray(self, name) 1148* , name, variable = _get_virtual_variable( -> 1149 self._variables, name, self._level_coords, self.dims 1150 ) /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataset.py in _get_virtual_variable(variables, key, level_vars, dim_sizes) 157 else: --> 158 data = getattr(ref_var, var_name).data 159 virtual_var = Variable(ref_var.dims, data) AttributeError: 'IndexVariable' object has no attribute 'month' The above exception was the direct cause of the following exception: Exception Traceback (most recent call last) <ipython-input-40-d7b2b2978c29> in <module> 8 return gb - clim 9 ---> 10 xr.map_blocks(calculate_anomaly, ds) /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/parallel.py in map_blocks(func, obj, args, kwargs) 203 input_chunks = dataset.chunks 204 --> 205 template: Union[DataArray, Dataset] = infer_template(func, obj, args, kwargs) 206 if isinstance(template, DataArray): 207 result_is_array = True /srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/parallel.py in infer_template(func, obj, args,* *kwargs) 80 raise Exception( 81 "Cannot infer object returned from running user provided function." ---> 82 ) from e 83 84 if not isinstance(template, (Dataset, DataArray)): Exception: Cannot infer object returned from running user provided function. ``` Problem Description We should try to imitate what dask does in `map_blocks`: https://docs.dask.org/en/latest/array-api.html#dask.array.map_blocks Specifically: - We should allow the user to override the checks by explicitly specifying output dtype and shape - Maybe the check should be on small, rather than zero size, test data Output of `xr.show_versions()` # Paste the output here xr.show_versions() here INSTALLED VERSIONS ------------------ commit: None python: 3.7.3 \| packaged by conda-forge \| (default, Jul 1 2019, 21:52:21) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.14.138+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.6.2 xarray: 0.14.0 pandas: 0.25.3 numpy: 1.17.3 scipy: 1.3.2 netCDF4: 1.5.1.2 pydap: installed h5netcdf: 0.7.4 h5py: 2.10.0 Nio: None zarr: 2.3.2 cftime: 1.0.4.2 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: 1.0.25 cfgrib: None iris: 2.2.0 bottleneck: 1.3.0 dask: 2.7.0 distributed: 2.7.0 matplotlib: 3.1.2 cartopy: 0.17.0 seaborn: 0.9.0 numbagg: None setuptools: 41.6.0.post20191101 pip: 19.3.1 conda: None pytest: 5.3.1 IPython: 7.9.0 sphinx: None	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3575/reactions", "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
499477363	MDU6SXNzdWU0OTk0NzczNjM=	3349	Implement polyfit?	rabernat 1197350	closed		25	2019-09-27T14:25:14Z	2020-03-25T17:17:45Z	2020-03-25T17:17:45Z	MEMBER			Fitting a line (or curve) to data along a specified axis is a long-standing need of xarray users. There are many blog posts and SO questions about how to do it: - http://atedstone.github.io/rate-of-change-maps/ - https://gist.github.com/luke-gregor/4bb5c483b2d111e52413b260311fbe43 - https://stackoverflow.com/questions/38960903/applying-numpy-polyfit-to-xarray-dataset - https://stackoverflow.com/questions/52094320/with-xarray-how-to-parallelize-1d-operations-on-a-multidimensional-dataset - https://stackoverflow.com/questions/36275052/applying-a-function-along-an-axis-of-a-dask-array The main use case in my domain is finding the temporal trend on a 3D variable (e.g. temperature in time, lon, lat). Yes, you can do it with apply_ufunc, but apply_ufunc is inaccessibly complex for many users. Much of our existing API could be removed and replaced with apply_ufunc calls, but that doesn't mean we should do it. I am proposing we add a Dataarray method called `polyfit`. It would work like this: ```python x_ = np.linspace(0, 1, 10) y_ = np.arange(5) a_ = np.cos(y_) x = xr.DataArray(x_, dims=['x'], coords={'x': x_}) a = xr.DataArray(a_, dims=['y']) f = a*x p = f.polyfit(dim='x', deg=1) equivalent numpy code p_ = np.polyfit(x_, f.values.transpose(), 1) np.testing.assert_allclose(p_[0], a_) ``` Numpy's polyfit function is already vectorized in the sense that it accepts 1D x and 2D y, performing the fit independently over each column of y. To extend this to ND, we would just need to reshape the data going in and out of the function. We do this already in other packages. For dask, we could simply require that the dimension over which the fit is calculated be contiguous, and then call map_blocks. Thoughts?	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3349/reactions", "total_count": 9, "+1": 9, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
361858640	MDU6SXNzdWUzNjE4NTg2NDA=	2423	manually specify chunks in open_zarr	rabernat 1197350	closed		2	2018-09-19T17:52:31Z	2020-01-09T15:21:35Z	2020-01-09T15:21:35Z	MEMBER			Currently, `open_zarr` has two possible chunking behaviors. `auto_chunk=True` (default) creates dask chunks corresponding with zarr chunks. `auto_chunk=False` creates no chunks. But what if you want to manually specify the chunks, as with `open_dataset(chunks=...)`. `open_zarr` could easily support this, but it does not currently. Note that this is not the same as calling `.chunk()` post dataset creation. That operation is very inefficient, since it begins from a single global chunk for each variable.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2423/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
396285440	MDU6SXNzdWUzOTYyODU0NDA=	2656	dataset info in .json format	rabernat 1197350	closed		9	2019-01-06T19:13:34Z	2020-01-08T22:43:25Z	2019-01-21T23:25:56Z	MEMBER			I am exploring the world of Spatio Temporal Asset Catalogs (STAC), in which all datasets are described using json/ geojson: The STAC specification aims to standardize the way geospatial assets are exposed online and queried. I am thinking about how to put the sort of datasets that xarray deals with into STAC items (see https://github.com/radiantearth/stac-spec). This would be particular valuable in the context of Pangeo and the zarr-based datasets we have been putting in cloud storage. For this purpose, it would be very useful to have a concise summary of an xarray dataset's contents (minus the actual data) in .json format. I'm talking about the kind of info we currently get from the `.info()` method, which is designed to mirror the CDL output of `ncdump -h`. For example `python ds = xr.Dataset({'foo': ('x', np.ones(10, 'f8'), {'units': 'm s-1'})}, {'x': ('x', np.arange(10), {'units': 'm'})}, {'conventions': 'made up'}) ds.info()` ``` xarray.Dataset { dimensions: x = 10 ; variables: float64 foo(x) ; foo:units = m s-1 ; int64 x(x) ; x:units = m ; // global attributes: :conventions = made up ; ``` I would like to be able to do `ds.info(format='json')` and see something like this `{ "coords": { "x": { "dims": [ "x" ], "attrs": { "units": "m" } } }, "attrs": { "conventions": "made up" }, "dims": { "x": 10 }, "data_vars": { "foo": { "dims": [ "x" ], "attrs": { "units": "m s-1" } } } }` Which is what I get by doing `print(json.dumps(ds.to_dict(), indent=2))` and manually stripping out all the `data` fields. So an alternative api might be something like `ds.to_dict(data=False)`. If anyone is aware of an existing spec for expressing Common Data Language in json, we should probably use that instead of inventing our own. But I think some version of this would be a very useful addition to xarray.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2656/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
288785270	MDU6SXNzdWUyODg3ODUyNzA=	1832	groupby on dask objects doesn't handle chunks well	rabernat 1197350	closed		22	2018-01-16T04:50:22Z	2019-11-27T16:45:14Z	2019-06-06T20:01:40Z	MEMBER			80% of climate data analysis begins with calculating the monthly-mean climatology and subtracting it from the dataset to get an anomaly. Unfortunately this is a fail case for xarray / dask with out-of-core datasets. This is becoming a serious problem for me. Code Sample ```python Your code here import xarray as xr import dask.array as da import pandas as pd construct an example datatset chunked in time nt, ny, nx = 366, 180, 360 time = pd.date_range(start='1950-01-01', periods=nt, freq='10D') ds = xr.DataArray(da.random.random((nt, ny, nx), chunks=(1, ny, nx)), dims=('time', 'lat', 'lon'), coords={'time': time}).to_dataset(name='field') monthly climatology ds_mm = ds.groupby('time.month').mean(dim='time') anomaly ds_anom = ds.groupby('time.month')- ds_mm print(ds_anom) <xarray.Dataset> Dimensions: (lat: 180, lon: 360, time: 366) Coordinates: * time (time) datetime64[ns] 1950-01-01 1950-01-11 1950-01-21 ... month (time) int64 1 1 1 1 2 2 3 3 3 4 4 4 5 5 5 5 6 6 6 7 7 7 8 8 8 ... Dimensions without coordinates: lat, lon Data variables: field (time, lat, lon) float64 dask.array<shape=(366, 180, 360), chunksize=(366, 180, 360)> ``` Problem description As we can see in the example above, the chunking has been lost. The dataset contains just one single huge chunk. This happens with any non-reducing operation on the groupby, even `python ds.groupby('time.month').apply(lambda x: x)` Say we wanted to compute some statistics of the anomaly, like the variance: `python (ds_anom.field**2).mean(dim='time').load()` This triggers the whole big chunk (with the whole timeseries) to be loaded into memory somewhere. For out-of-core datasets, this will crash our system. Expected Output It seems like we should be able to do this lazily, maintaining a chunk size of `(1, 180, 360)` for ds_anom. Output of `xr.show_versions()` INSTALLED VERSIONS ------------------ commit: None python: 3.6.2.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.0+dev27.g049cbdd pandas: 0.20.3 numpy: 1.13.1 scipy: 0.19.1 netCDF4: 1.3.1 h5netcdf: 0.4.1 Nio: None zarr: 2.2.0a2.dev91 bottleneck: 1.2.1 cyordereddict: None dask: 0.16.0 distributed: 1.20.1 matplotlib: 2.1.0 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 36.3.0 pip: 9.0.1 conda: None pytest: 3.2.1 IPython: 6.1.0 sphinx: 1.6.5 Possibly related to #392. cc @mrocklin	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1832/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
467776251	MDExOlB1bGxSZXF1ZXN0Mjk3MzU0NTEx	3121	Allow other tutorial filename extensions	rabernat 1197350	closed		3	2019-07-13T23:27:44Z	2019-07-14T01:07:55Z	2019-07-14T01:07:51Z	MEMBER	0	pydata/xarray/pulls/3121	[x] Closes #3118 [ ] Tests added [ ] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API Together with https://github.com/pydata/xarray-data/pull/15, this allows us to generalize out tutorial datasets to non netCDF files. But it is backwards compatible--if there is no file suffix, it will append `.nc`.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3121/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
467674875	MDExOlB1bGxSZXF1ZXN0Mjk3MjgyNzA1	3106	Replace sphinx_gallery with notebook	rabernat 1197350	closed		3	2019-07-13T05:35:34Z	2019-07-13T14:03:20Z	2019-07-13T14:03:19Z	MEMBER	0	pydata/xarray/pulls/3106	Today @jhamman and I discussed how to refactor our somewhat fragmented "examples". We decided to basically copy the approach of the dask-examples repo, but have it live here in the main xarray repo. Basically this approach is: - all examples are notebooks - examples are rendered during doc build by nbsphinx - we will eventually have a binder that works with all of the same examples This PR removes the dependency on sphinx_gallery and replaces the existing gallery with a standalone notebook called `visualization_gallery.ipynb`. However, not all of the links that worked in the gallery work here, since we are now using nbsphinx to render the notebooks (see https://github.com/spatialaudio/nbsphinx/issues/308). Really important to get @dcherian's feedback on this, as he was the one who originally introduced the gallery. My view is that having everything as notebooks makes examples easier to maintain. But I'm curious to hear other views.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3106/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
467658326	MDExOlB1bGxSZXF1ZXN0Mjk3MjcwNjYw	3105	Switch doc examples to use nbsphinx	rabernat 1197350	closed		4	2019-07-13T02:28:34Z	2019-07-13T04:53:09Z	2019-07-13T04:52:52Z	MEMBER	0	pydata/xarray/pulls/3105	This is the beginning of the docs refactor we have in mind for the sprint tomorrow. We will merge things first to the scipy19-docs branch so we can make sure things build on RTD. http://xarray.pydata.org/en/scipy19-docs	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3105/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
218260909	MDU6SXNzdWUyMTgyNjA5MDk=	1340	round-trip performance with save_mfdataset / open_mfdataset	rabernat 1197350	closed		11	2017-03-30T16:52:26Z	2019-05-01T22:12:06Z	2019-05-01T22:12:06Z	MEMBER			I have encountered some major performance bottlenecks in trying to write and then read multi-file netcdf datasets. I start with an xarray dataset created by xgcm with the following repr: <xarray.Dataset> Dimensions: (XC: 400, XG: 400, YC: 400, YG: 400, Z: 40, Zl: 40, Zp1: 41, Zu: 40, layer_1TH_bounds: 43, layer_1TH_center: 42, layer_1TH_interface: 41, time: 1566) Coordinates: iter (time) int64 8294400 8294976 8295552 8296128 ... * time (time) int64 8294400 8294976 8295552 8296128 ... * XC (XC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ... * YG (YG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ... * XG (XG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ... * YC (YC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ... * Zu (Zu) >f4 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 -91.0 ... * Zl (Zl) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ... * Zp1 (Zp1) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ... * Z (Z) >f4 -5.0 -15.0 -25.0 -36.0 -49.0 -64.0 -81.5 ... rAz (YG, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... dyC (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... rAw (YC, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... dxC (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... dxG (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... dyG (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... rAs (YG, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... Depth (YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... rA (YC, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... PHrefF (Zp1) >f4 0.0 98.1 196.2 294.3 412.02 549.36 706.32 ... PHrefC (Z) >f4 49.05 147.15 245.25 353.16 480.69 627.84 ... drC (Zp1) >f4 5.0 10.0 10.0 11.0 13.0 15.0 17.5 20.5 ... drF (Z) >f4 10.0 10.0 10.0 12.0 14.0 16.0 19.0 22.0 ... hFacC (Z, YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... hFacW (Z, YC, XG) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... hFacS (Z, YG, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... * layer_1TH_bounds (layer_1TH_bounds) >f4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 ... * layer_1TH_interface (layer_1TH_interface) >f4 0.0 0.2 0.4 0.6 0.8 1.0 ... * layer_1TH_center (layer_1TH_center) float32 -0.1 0.1 0.3 0.5 0.7 0.9 ... Data variables: T (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... U (time, Z, YC, XG) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... V (time, Z, YG, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... S (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... Eta (time, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... W (time, Zl, YC, XC) float32 -0.0 -0.0 -0.0 -0.0 -0.0 ... An important point to note is that there are lots of "non-dimension coordinates" corresponding to various parameters of the numerical grid. I save this dataset to a multi-file netCDF dataset as follows: `python iternums, datasets = zip(ds.groupby('time')) paths = [outdir + 'xmitgcm_data.%010d.nc' % it for it in iternums] xr.save_mfdataset(datasets, paths)` This takes many hours to run, since it has to read and write all the data. (I think there are some performance issues here too, related to how dask schedules the read / write tasks, but that is probably a separate issue.) Then I try to re-load this dataset `python ds_nc = xr.open_mfdataset('xmitgcm_data..nc')` This raises an error: `ValueError: too many different dimensions to concatenate: {'YG', 'Z', 'Zl', 'Zp1', 'layer_1TH_interface', 'YC', 'XC', 'layer_1TH_center', 'Zu', 'layer_1TH_bounds', 'XG'}` I need to specify `concat_dim='time'` in order to properly concatenate the data. It seems like this should be unnecessary, since I am reading back data that was just written with xarray, but I understand why (the dimensions of the Data Variables in each file are just Z, YC, XC, with no time dimension). Once I do that, it works, but it takes 18 minutes to load the dataset. I assume this is because it has to check the compatibility of all all the non-dimension coordinates. I just thought I would document this, because 18 minutes seems way too long to load a dataset.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1340/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
431199282	MDExOlB1bGxSZXF1ZXN0MjY4OTI3MjU0	2881	decreased pytest verbosity	rabernat 1197350	closed		1	2019-04-09T21:12:50Z	2019-04-09T23:36:01Z	2019-04-09T23:34:22Z	MEMBER	0	pydata/xarray/pulls/2881	This removes the `--verbose` flag from py.test in .travis.yml. [x] Closes #2880	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2881/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
431156227	MDU6SXNzdWU0MzExNTYyMjc=	2880	pytest output on travis is too verbose	rabernat 1197350	closed		1	2019-04-09T19:39:46Z	2019-04-09T23:34:22Z	2019-04-09T23:34:22Z	MEMBER			I have to scroll over an immense amount of passing tests on travis before I can get to the failures. (example) This is pretty annoying. The amount of tests in xarray has exploded recently. This is good! But maybe we should turn off `--verbose` in travis. What does @pydata/xarray think?	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2880/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
373121666	MDU6SXNzdWUzNzMxMjE2NjY=	2503	Problems with distributed and opendap netCDF endpoint	rabernat 1197350	closed		26	2018-10-23T17:48:20Z	2019-04-09T12:02:01Z	2019-04-09T12:02:01Z	MEMBER			Code Sample I am trying to load a dataset from an opendap endpoint using xarray, netCDF4, and distributed. I am having a problem only with non-local distributed schedulers (KubeCluster specifically). This could plausibly be an xarray, dask, or pangeo issue, but I have decided to post it here. ```python import xarray as xr import dask create dataset from Unidata's test opendap endpoint, chunked in time url = 'http://remotetest.unidata.ucar.edu/thredds/dodsC/testdods/coads_climatology.nc' ds = xr.open_dataset(url, decode_times=False, chunks={'TIME': 1}) all these work with dask.config.set(scheduler='synchronous'): ds.SST.compute() with dask.config.set(scheduler='processes'): ds.SST.compute() with dask.config.set(scheduler='threads'): ds.SST.compute() this works too from dask.distributed import Client local_client = Client() with dask.config.set(get=local_client): ds.SST.compute() but this does not cluster = KubeCluster(n_workers=2) kube_client = Client(cluster) with dask.config.set(get=kube_client): ds.SST.compute() ``` In the worker log, I see the following sort of errors. distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 5, 0, 0) distributed.worker - INFO - Dependent not found: open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf 0 . Asking scheduler distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 3, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 0, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 1, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 7, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 6, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 2, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 9, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 8, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 11, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 10, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 4, 0, 0) distributed.worker - WARNING - Compute Failed Function: getter args: (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=_ElementwiseFunctionArray(LazilyOuterIndexedArray(array=<xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x7f45d6fcbb38>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))), func=functools.partial(<function _apply_mask at 0x7f45d70507b8>, encoded_fill_values={-1e+34}, decoded_fill_value=nan, dtype=dtype('float32')), dtype=dtype('float32')), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))))), (slice(3, 4, None), slice(0, 90, None), slice(0, 180, None))) kwargs: {} Exception: RuntimeError('NetCDF: Not a valid ID',) Ultimately, the error comes from the netCDF library: `RuntimeError('NetCDF: Not a valid ID',)` This seems like something to do with serialization of the netCDF store. The worker images have identical netcdf version (and all other package versions). I am at a loss for how to debug further. Output of `xr.show_versions()` xr.show_versions() ``` INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.4.111+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.8 pandas: 0.23.2 numpy: 1.15.1 scipy: 1.1.0 netCDF4: 1.4.1 h5netcdf: None h5py: None Nio: None zarr: 2.2.0 bottleneck: None cyordereddict: None dask: 0.18.2 distributed: 1.22.1 matplotlib: 2.2.3 cartopy: None seaborn: None setuptools: 39.2.0 pip: 18.0 conda: 4.5.4 pytest: 3.8.0 IPython: 6.4.0 sphinx: None ``` `cube_client.get_versions(check=True)` ``` {'scheduler': {'host': (('python', '3.6.3.final.0'), ('python-bits', 64), ('OS', 'Linux'), ('OS-release', '4.4.111+'), ('machine', 'x86_64'), ('processor', 'x86_64'), ('byteorder', 'little'), ('LC_ALL', 'en_US.UTF-8'), ('LANG', 'en_US.UTF-8'), ('LOCALE', 'en_US.UTF-8')), 'packages': {'required': (('dask', '0.18.2'), ('distributed', '1.22.1'), ('msgpack', '0.5.6'), ('cloudpickle', '0.5.5'), ('tornado', '5.0.2'), ('toolz', '0.9.0')), 'optional': (('numpy', '1.15.1'), ('pandas', '0.23.2'), ('bokeh', '0.12.16'), ('lz4', '1.1.0'), ('blosc', '1.5.1'))}}, 'workers': {'tcp://10.20.8.4:36940': {'host': (('python', '3.6.3.final.0'), ('python-bits', 64), ('OS', 'Linux'), ('OS-release', '4.4.111+'), ('machine', 'x86_64'), ('processor', 'x86_64'), ('byteorder', 'little'), ('LC_ALL', 'en_US.UTF-8'), ('LANG', 'en_US.UTF-8'), ('LOCALE', 'en_US.UTF-8')), 'packages': {'required': (('dask', '0.18.2'), ('distributed', '1.22.1'), ('msgpack', '0.5.6'), ('cloudpickle', '0.5.5'), ('tornado', '5.0.2'), ('toolz', '0.9.0')), 'optional': (('numpy', '1.15.1'), ('pandas', '0.23.2'), ('bokeh', '0.12.16'), ('lz4', '1.1.0'), ('blosc', '1.5.1'))}}, 'tcp://10.21.177.254:42939': {'host': (('python', '3.6.3.final.0'), ('python-bits', 64), ('OS', 'Linux'), ('OS-release', '4.4.111+'), ('machine', 'x86_64'), ('processor', 'x86_64'), ('byteorder', 'little'), ('LC_ALL', 'en_US.UTF-8'), ('LANG', 'en_US.UTF-8'), ('LOCALE', 'en_US.UTF-8')), 'packages': {'required': (('dask', '0.18.2'), ('distributed', '1.22.1'), ('msgpack', '0.5.6'), ('cloudpickle', '0.5.5'), ('tornado', '5.0.2'), ('toolz', '0.9.0')), 'optional': (('numpy', '1.15.1'), ('pandas', '0.23.2'), ('bokeh', '0.12.16'), ('lz4', '1.1.0'), ('blosc', '1.5.1'))}}}, 'client': {'host': [('python', '3.6.3.final.0'), ('python-bits', 64), ('OS', 'Linux'), ('OS-release', '4.4.111+'), ('machine', 'x86_64'), ('processor', 'x86_64'), ('byteorder', 'little'), ('LC_ALL', 'en_US.UTF-8'), ('LANG', 'en_US.UTF-8'), ('LOCALE', 'en_US.UTF-8')], 'packages': {'required': [('dask', '0.18.2'), ('distributed', '1.22.1'), ('msgpack', '0.5.6'), ('cloudpickle', '0.5.5'), ('tornado', '5.0.2'), ('toolz', '0.9.0')], 'optional': [('numpy', '1.15.1'), ('pandas', '0.23.2'), ('bokeh', '0.12.16'), ('lz4', '1.1.0'), ('blosc', '1.5.1')]}}} ```	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2503/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
209561985	MDU6SXNzdWUyMDk1NjE5ODU=	1282	description of xarray assumes knowledge of pandas	rabernat 1197350	closed		4	2017-02-22T19:52:54Z	2019-02-26T19:01:47Z	2019-02-26T19:01:46Z	MEMBER			The first sentence a potential new user reads about xarray is xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures. Now imagine you had never heard of pandas (like most new Ph.D. students in physical sciences). You would have no idea how useful and powerful xarray was. I would propose modifying these top-level descriptions to remove the assumption that the user understands pandas. Of course we can still refer to pandas, but a more self-contained description would serve us well.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1282/reactions", "total_count": 3, "+1": 3, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
396501063	MDExOlB1bGxSZXF1ZXN0MjQyNjY4ODEw	2659	to_dict without data	rabernat 1197350	closed		14	2019-01-07T14:09:25Z	2019-02-12T21:21:13Z	2019-01-21T23:25:56Z	MEMBER	0	pydata/xarray/pulls/2659	This PR provides the ability to export Datasets and DataArrays to dictionary without the actual data. This could be useful for generating indices of dataset contents to expose to search indices or other automated data discovery tools In the process of doing this, I refactored the core dictionary export function to live in the Variable class, since the same code was duplicated in several places. [x] Closes #2656 [x] Tests added [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2659/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
324740017	MDU6SXNzdWUzMjQ3NDAwMTc=	2164	holoviews / bokeh doesn't like cftime coords	rabernat 1197350	closed		16	2018-05-20T20:29:03Z	2019-02-08T00:11:14Z	2019-02-08T00:11:14Z	MEMBER			Code Sample, a copy-pastable example if possible Consider a simple working example of converting an xarray dataset to holoviews for plotting: `python ref_date = '1981-01-01' ds = xr.DataArray([1, 2, 3], dims=['time'], coords={'time': ('time', [1, 2, 3], {'units': 'days since %s' % ref_date})} ).to_dataset(name='foo') with xr.set_options(enable_cftimeindex=True): ds = xr.decode_cf(ds) print(ds) hv_ds = hv.Dataset(ds) hv_ds.to(hv.Curve)` This gives `<xarray.Dataset> Dimensions: (time: 3) Coordinates: * time (time) datetime64[ns] 1981-01-02 1981-01-03 1981-01-04 Data variables: foo (time) int64 ...` and Problem description Now change `ref_date = '0181-01-01'` (or anything outside of the valid range for regular pandas datetime index). We get a beautiful new cftimeindex `<xarray.Dataset> Dimensions: (time: 3) Coordinates: * time (time) object 0181-01-02 00:00:00 0181-01-03 00:00:00 ... Data variables: foo (time) int64 ...` but holoviews / bokeh doesn't like it ``` /opt/conda/lib/python3.6/site-packages/xarray/coding/times.py:132: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy cftime.datetime objects instead, reason: dates out of range enable_cftimeindex) /opt/conda/lib/python3.6/site-packages/xarray/coding/variables.py:66: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy cftime.datetime objects instead, reason: dates out of range return self.func(self.array[key]) TypeError Traceback (most recent call last) /opt/conda/lib/python3.6/site-packages/IPython/core/formatters.py in call(self, obj, include, exclude) 968 969 if method is not None: --> 970 return method(include=include, exclude=exclude) 971 return None 972 else: /opt/conda/lib/python3.6/site-packages/holoviews/core/dimension.py in repr_mimebundle(self, include, exclude) 1229 combined and returned. 1230 """ -> 1231 return Store.render(self) 1232 1233 /opt/conda/lib/python3.6/site-packages/holoviews/core/options.py in render(cls, obj) 1287 data, metadata = {}, {} 1288 for hook in hooks: -> 1289 ret = hook(obj) 1290 if ret is None: 1291 continue /opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in pprint_display(obj) 278 if not ip.display_formatter.formatters['text/plain'].pprint: 279 return None --> 280 return display(obj, raw_output=True) 281 282 /opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in display(obj, raw_output, kwargs) 248 elif isinstance(obj, (CompositeOverlay, ViewableElement)): 249 with option_state(obj): --> 250 output = element_display(obj) 251 elif isinstance(obj, (Layout, NdLayout, AdjointLayout)): 252 with option_state(obj): /opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in wrapped(element) 140 try: 141 max_frames = OutputSettings.options['max_frames'] --> 142 mimebundle = fn(element, max_frames=max_frames) 143 if mimebundle is None: 144 return {}, {} /opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in element_display(element, max_frames) 186 return None 187 --> 188 return render(element) 189 190 /opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in render(obj, kwargs) 63 renderer = renderer.instance(fig='png') 64 ---> 65 return renderer.components(obj, kwargs) 66 67 /opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/renderer.py in components(self, obj, fmt, comm, kwargs) 257 # Bokeh has to handle comms directly in <0.12.15 258 comm = False if bokeh_version < '0.12.15' else comm --> 259 return super(BokehRenderer, self).components(obj,fmt, comm, kwargs) 260 261 /opt/conda/lib/python3.6/site-packages/holoviews/plotting/renderer.py in components(self, obj, fmt, comm, kwargs) 319 plot = obj 320 else: --> 321 plot, fmt = self._validate(obj, fmt) 322 323 widget_id = None /opt/conda/lib/python3.6/site-packages/holoviews/plotting/renderer.py in _validate(self, obj, fmt, kwargs) 218 if isinstance(obj, tuple(self.widgets.values())): 219 return obj, 'html' --> 220 plot = self.get_plot(obj, renderer=self, kwargs) 221 222 fig_formats = self.mode_formats['fig'][self.mode] /opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/renderer.py in get_plot(self_or_cls, obj, doc, renderer) 150 doc = Document() if self_or_cls.notebook_context else curdoc() 151 doc.theme = self_or_cls.theme --> 152 plot = super(BokehRenderer, self_or_cls).get_plot(obj, renderer) 153 plot.document = doc 154 return plot /opt/conda/lib/python3.6/site-packages/holoviews/plotting/renderer.py in get_plot(self_or_cls, obj, renderer) 205 init_key = tuple(v if d is None else d for v, d in 206 zip(plot.keys[0], defaults)) --> 207 plot.update(init_key) 208 else: 209 plot = obj /opt/conda/lib/python3.6/site-packages/holoviews/plotting/plot.py in update(self, key) 511 def update(self, key): 512 if len(self) == 1 and ((key == 0) or (key == self.keys[0])) and not self.drawn: --> 513 return self.initialize_plot() 514 item = self.getitem(key) 515 self.traverse(lambda x: setattr(x, '_updated', True)) /opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/element.py in initialize_plot(self, ranges, plot, plots, source) 729 if not self.overlaid: 730 self._update_plot(key, plot, style_element) --> 731 self._update_ranges(style_element, ranges) 732 733 for cb in self.callbacks: /opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/element.py in _update_ranges(self, element, ranges) 498 if not self.drawn or xupdate: 499 self._update_range(x_range, l, r, xfactors, self.invert_xaxis, --> 500 self._shared['x'], self.logx, streaming) 501 if not self.drawn or yupdate: 502 self._update_range(y_range, b, t, yfactors, self.invert_yaxis, /opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/element.py in _update_range(self, axis_range, low, high, factors, invert, shared, log, streaming) 525 updates = {} 526 if low is not None and (isinstance(low, util.datetime_types) --> 527 or np.isfinite(low)): 528 updates['start'] = (axis_range.start, low) 529 if high is not None and (isinstance(high, util.datetime_types) TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'' ``` Similar but slightly different errors arise for different holoviews types (e.g. `hv.Image`) and contexts (using time as a holoviews kdim). Expected Output This should work. I'm not sure if this is really an xarray problem. Maybe it needs a fix in holoviews (or bokeh). But I'm raising it here first since clearly we have introduced this new wrinkle in the stack. Cc'ing @philippjfr since he is the expert on all things holoviews. Output of `xr.show_versions()` INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.4.111+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.4 pandas: 0.23.0 numpy: 1.14.3 scipy: 1.1.0 netCDF4: 1.4.0 h5netcdf: None h5py: None Nio: None zarr: 2.2.0 bottleneck: None cyordereddict: None dask: 0.17.5 distributed: 1.21.8 matplotlib: 2.2.2 cartopy: None seaborn: None setuptools: 39.0.1 pip: 10.0.1 conda: 4.3.34 pytest: 3.5.1 IPython: 6.3.1 sphinx: None	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2164/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
193657418	MDU6SXNzdWUxOTM2NTc0MTg=	1154	netCDF reading is not prominent in the docs	rabernat 1197350	closed		7	2016-12-06T01:18:40Z	2019-02-02T06:33:44Z	2019-02-02T06:33:44Z	MEMBER			Just opening an issue to highlight what I think is a problem with the docs. For me, the primary use of xarray is to read and process existing netCDF data files. @shoyer's popular blog post illustrates this use case extremely well. However, when I open the docs, I have to dig quite deep before I can see how to read a netCDF file. This could be turning away many potential users. The stuff about netCDF reading is hidden under "Serialization and IO". Many potential users will have no idea what either of these words mean. IMO the solution to this is to reorganize the docs to make reading netCDF much more prominent and obvious.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1154/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
225734529	MDU6SXNzdWUyMjU3MzQ1Mjk=	1394	autoclose with distributed doesn't seem to work	rabernat 1197350	closed		9	2017-05-02T15:37:07Z	2019-01-13T19:35:10Z	2019-01-13T19:35:10Z	MEMBER			I am trying to analyze a very large netCDF dataset using xarray and distributed. I open my dataset with the new `autoclose` option: `python ds = xr.open_mfdataset(ddir + '*.nc', decode_cf=False, autoclose=True)` However, when I try some reduction operation (e.g. `ds['Salt'].mean()`), I can see my open file count continue to rise monotonically. Eventually the dask worker dies with `OSError: [Errno 24] Too many open files: '/proc/65644/sta` once I hit the system ulimit. Am I doing something wrong here? Why are the files not being closed? cc: @pwolfram	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1394/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
225774140	MDU6SXNzdWUyMjU3NzQxNDA=	1396	selecting a point from an mfdataset	rabernat 1197350	closed		12	2017-05-02T18:02:50Z	2019-01-13T06:32:45Z	2019-01-13T06:32:45Z	MEMBER			Sorry to be opening so many vague performance issues. I am really having a hard time with my current dataset, which is exposing certain limitations of xarray and dask in a way none of my previous work has done. I have a directory full of netCDF4 files. There are 1754 files, each 8.1GB in size, each representing a single model timestep. So there is ~14 TB of data total. (In addition to the time-dependent output, there is a single file with information about the grid.) Imagine I want to extract a timeseries from a single point (indexed by `k, j, i`) in this simulation. Without xarray, I would do something like this: `python import netCDF4 ts = np.zeros(len(all_files)) for n, fname in enumerate(tqdm(all_files)): nc = netCDF4.Dataset(fname) ts[n] = nc.variables['Salt'][k, j, i] nc.close()` Which goes reasonably quick: tqdm gives `[02:38<00:00, 11.56it/s]`. I could do the same sort of loop using xarray: `python import xarray as xr ts = np.zeros(len(all_files)) for n, fname in enumerate(tqdm(all_files)): ds = xr.open_dataset(fname) ts[n] = ds['Salt'][k, j, i] ds.close()` Which has a <50% performance overhead: `[03:29<00:00, 8.74it/s]`. Totally acceptable. Of course, what I really want is to avoid a loop and deal with the whole dataset as a single self-contained object. `python ds = xr.open_mfdataset(all_files, decode_cf=False, autoclose=True)` This alone takes between 4-5 minutes to run (see #1385). If I want to print the repr, it takes another 3 minutes or so to `print(ds)`. The full dataset looks like this: python <xarray.Dataset> Dimensions: (i: 2160, i_g: 2160, j: 2160, j_g: 2160, k: 90, k_l: 90, k_p1: 91, k_u: 90, time: 1752) Coordinates: * j (j) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ... * k (k) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ... * j_g (j_g) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ... * i (i) int64 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 ... * k_p1 (k_p1) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... * k_u (k_u) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... * i_g (i_g) int64 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 ... * k_l (k_l) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... * time (time) float64 2.592e+05 2.628e+05 2.664e+05 2.7e+05 2.736e+05 ... Data variables: face (time) int64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... PhiBot (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... oceQnet (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... SIvice (time, j_g, i) float32 0.0516454 0.0523205 0.0308559 ... SIhsalt (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... oceFWflx (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... V (time, k, j_g, i) float32 0.0491903 0.0496442 0.0276739 ... iter (time) int64 10368 10512 10656 10800 10944 11088 11232 11376 ... oceQsw (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... oceTAUY (time, j_g, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... Theta (time, k, j, i) float32 -1.31868 -1.27825 -1.21401 -1.17964 ... SIhsnow (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... U (time, k, j, i_g) float32 0.0281392 0.0203967 0.0075199 ... SIheff (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... SIuice (time, j, i_g) float32 -0.041163 -0.0487612 -0.0614498 ... SIarea (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... Salt (time, k, j, i) float32 33.7534 33.7652 33.7755 33.7723 ... oceSflux (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... W (time, k_l, j, i) float32 -2.27453e-05 -2.28018e-05 ... oceTAUX (time, j, i_g) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... Eta (time, j, i) float32 -1.28886 -1.28811 -1.2871 -1.28567 ... YC (j, i) float32 -57.001 -57.001 -57.001 -57.001 -57.001 -57.001 ... YG (j_g, i_g) float32 -57.0066 -57.0066 -57.0066 -57.0066 ... XC (j, i) float32 -15.4896 -15.4688 -15.4479 -15.4271 -15.4062 ... XG (j_g, i_g) float32 -15.5 -15.4792 -15.4583 -15.4375 -15.4167 ... Zp1 (k_p1) float32 0.0 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 ... Z (k) float32 -0.5 -1.57 -2.79 -4.185 -5.78 -7.595 -9.66 -12.01 ... Zl (k_l) float32 0.0 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 ... Zu (k_u) float32 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 -13.26 ... rA (j, i) float32 1.5528e+06 1.5528e+06 1.5528e+06 1.5528e+06 ... rAw (j, i_g) float32 1.5528e+06 1.5528e+06 1.5528e+06 1.5528e+06 ... rAs (j_g, i) float32 9.96921e+36 9.96921e+36 9.96921e+36 ... rAz (j_g, i_g) float32 1.55245e+06 1.55245e+06 1.55245e+06 ... dxG (j_g, i) float32 1261.27 1261.27 1261.27 1261.27 1261.27 ... dyG (j, i_g) float32 1230.96 1230.96 1230.96 1230.96 1230.96 ... dxC (j, i_g) float32 1261.46 1261.46 1261.46 1261.46 1261.46 ... Depth (j, i) float32 4578.67 4611.09 4647.6 4674.88 4766.75 4782.64 ... dyC (j_g, i) float32 1230.86 1230.86 1230.86 1230.86 1230.86 ... PHrefF (k_p1) float32 0.0 9.81 20.9934 33.7464 48.3633 65.0403 ... drF (k) float32 1.0 1.14 1.3 1.49 1.7 1.93 2.2 2.5 2.84 3.21 3.63 ... PHrefC (k) float32 4.905 15.4017 27.3699 41.0549 56.7018 74.507 ... drC (k_p1) float32 0.5 1.07 1.22 1.395 1.595 1.815 2.065 2.35 2.67 ... hFacW (k, j, i_g) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... hFacS (k, j_g, i) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... hFacC (k, j, i) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... Attributes: coordinates: face Now, to extract the same timeseries, I would like to say `python ts = ds.Salt[:, k, j, i].load()` I monitor what is happening under the hood using when I call this by using netdata and the dask.distributed dashboard, using only a single process and thread. First, all the files are opened (see #1394). Then they start getting read. Each read takes between 10 and 30 seconds, and the memory usage starts increasing steadily. My impression is that the entire dataset is being read into memory for concatenation. (I have dumped out the dask graph in case anyone can make sense of it.) I have never let this calculation complete, as it looks like it would eat up all the memory on my system...plus it's extremely slow. To me, this seems like a failure of lazy indexing. I naively expected that the underlying file access would work similar to my loop, perhaps even in parallel. Can anyone shed some light on what might be going wrong?	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1396/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
108623921	MDU6SXNzdWUxMDg2MjM5MjE=	591	distarray backend?	rabernat 1197350	closed		5	2015-09-28T09:49:52Z	2019-01-13T04:11:08Z	2019-01-13T04:11:08Z	MEMBER			This is probably a long shot, but I think a distarray backend could potentially be very useful in xray. Distarray implements the numpy interface, so it should be possible in principle. Distarray has a different architecture from dask (using MPI for parallelization) and in this way is more similar to traditional HPC codes. The application I have in mind is very high resolution GCM output where one wants to tile the data spatially across multiple nodes on a cluster. (This is how a GCM itself works.)	{ "url": "https://api.github.com/repos/pydata/xarray/issues/591/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
280626621	MDU6SXNzdWUyODA2MjY2MjE=	1770	slow performance when storing datasets in gcsfs-backed zarr stores	rabernat 1197350	closed		11	2017-12-08T21:46:32Z	2019-01-13T03:52:46Z	2019-01-13T03:52:46Z	MEMBER			We are working on integrating zarr with xarray. In the process, we have encountered a performance issue that I am documenting here. At this point, it is not clear if the core issue is in zarr, gcsfs, dask, or xarray. I originally started posting this in zarr, but in the process, I became more convinced the issue was with xarray. Dask Only Here is an example using only dask and zarr. ```python connect to a local dask scheduler from dask.distributed import Client client = Client('tcp://129.236.20.45:8786') create a big dask array import dask.array as dsa shape = (30, 50, 1080, 2160) chunkshape = (1, 1, 1080, 2160) ar = dsa.random.random(shape, chunks=chunkshape) connect to gcs and create MutableMapping import gcsfs fs = gcsfs.GCSFileSystem(project='pangeo-181919') gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test999', gcs=fs, check=True, create=True) create a zarr array to store into import zarr za = zarr.create(ar.shape, chunks=chunkshape, dtype=ar.dtype, store=gcsmap) write it ar.store(za, lock=False) ``` When you do this, it spends a long time serializing stuff before the computation starts. For a more fine-grained look at the process, one can instead do `python delayed_obj = a.store(za, compute=False, lock=False) %prun future = client.compute(dobj)` This reveals that the pre-compute step takes about 10s. Monitoring the distributed scheduler, I can see that, once the computation starts, it takes about 1:30 to store the array (27 GB). (This is actually not bad!) Some debugging by @mrocklin revealed the following step is quite slow `python import cloudpickle %time len(cloudpickle.dumps(za))` On my system, this was taking close to 1s. On contrast, when the `store` passed to `gcsmap` is not a `GCSMap` but instead a path, it is in the microsecond territory. So pickling `GCSMap` objects is relatively slow. I'm not sure whether this pickling happens when we call `client.compute` or during the task execution. There is room for improvement here, but overall, zarr + gcsfs + dask seem to integrate well and give decent performance. Xarray This get much worse once xarray enters the picture. (Note that this example requires the xarray PR pydata/xarray#1528, which has not been merged yet.) ```python wrap the dask array in an xarray import xarray as xr import numpy as np ds = xr.DataArray(ar, dims=['time', 'depth', 'lat', 'lon'], coords={'lat': np.linspace(-90, 90, Ny), 'lon': np.linspace(0, 360, Nx)}).to_dataset(name='temperature') store to a different bucket gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test1', gcs=fs, check=True, create=True) ds.to_zarr(store=gcsmap, mode='w') ``` Now the store step takes 18 minutes. Most of this time, is upfront, during which there is little CPU activity and no network activity. After about 15 minutes or so, it finally starts computing, at which point the writes to gcs proceed more-or-less at the same rate as with the dask-only example. Profiling the `to_zarr` with snakeviz reveals that it is spending most of its time waiting for thread locks. I don't understand this, since I specifically eliminated locks when storing the zarr arrays.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1770/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
362866468	MDExOlB1bGxSZXF1ZXN0MjE3NDYzMTU4	2430	WIP: revise top-level package description	rabernat 1197350	closed		10	2018-09-22T15:35:47Z	2019-01-07T01:04:19Z	2019-01-06T00:31:57Z	MEMBER	0	pydata/xarray/pulls/2430	I have often complained that xarray's top-level package description assumes that the user knows all about pandas. I think this alienates many new users. This is a first draft at revising that top-level description. Feedback from the community very needed here.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2430/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
389594572	MDU6SXNzdWUzODk1OTQ1NzI=	2597	add dayofyear to CFTimeIndex	rabernat 1197350	closed		2	2018-12-11T04:41:59Z	2018-12-11T19:28:31Z	2018-12-11T19:28:31Z	MEMBER			I have noticed that `CFTimeIndex` does not provide the `.dayofyear` attributes. Pandas `DatetimeIndex` does. Implementing these attributes would make certain grouping operations much easier on non-standard calendars. Perhaps there are other similar attributes. I don't know if `.dayofweek` makes sense for non-standard calendars.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2597/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
382497709	MDExOlB1bGxSZXF1ZXN0MjMyMTkwMjg5	2559	Zarr consolidated	rabernat 1197350	closed		19	2018-11-20T04:39:41Z	2018-12-05T14:58:58Z	2018-12-04T23:51:00Z	MEMBER	0	pydata/xarray/pulls/2559	This PR adds support for reading and writing of consolidated metadata in zarr stores. [x] Closes #2558 (remove if there is no corresponding issue, which should only be the case for minor changes) [x] Tests added (for all bug fixes or enhancements) [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2559/reactions", "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 1, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
382043672	MDU6SXNzdWUzODIwNDM2NzI=	2558	how to incorporate zarr's new open_consolidated method?	rabernat 1197350	closed		1	2018-11-19T03:28:40Z	2018-12-04T23:51:00Z	2018-12-04T23:51:00Z	MEMBER			Zarr has a new feature called consolidated metadata. This feature will make it much faster to open certain zarr datasets, because all the metadata needed to construct the xarray dataset will live in a single .json file. To use this new feature, the new function `zarr.open_consolidated` needs to be called. So it won't work with xarray out of the box. We need to decide how to add support for this at the xarray level. I am seeking feedback on what API people would like to see before starting a PR. My proposal is to add a new keyword argument to `xarray.open_zarr` called `consolidated` (default = False). An alternative would be to automatically try `open_consolidated` and fall back on the standard `open_group` function if that fails. I played around with this a bit and realized that https://github.com/zarr-developers/zarr/issues/336 needs to be resolved before we can do the xarray side. cc @martindurant, who might want to weigh on what would be most convenient for intake.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2558/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
301891754	MDU6SXNzdWUzMDE4OTE3NTQ=	1955	Skipping / failing zarr tests	rabernat 1197350	closed		3	2018-03-02T20:17:31Z	2018-10-29T00:25:34Z	2018-10-29T00:25:34Z	MEMBER			Zarr tests are currently getting skipped on our main testing environments (because the zarr version is less than 2.2): https://travis-ci.org/pydata/xarray/jobs/348350073#L1264 And failing in the `py36-zarr-dev` environment https://travis-ci.org/pydata/xarray/jobs/348350087#L4989 I'm not sure how this regression occurred, but the zarr tests have been failing for a long time, e.g. https://travis-ci.org/pydata/xarray/jobs/342651302 Possibly related to #1954 cc @jhamman	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1955/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
332762756	MDU6SXNzdWUzMzI3NjI3NTY=	2234	fillna error with distributed	rabernat 1197350	closed		3	2018-06-15T12:54:54Z	2018-06-15T13:13:54Z	2018-06-15T13:13:54Z	MEMBER			Code Sample, a copy-pastable example if possible The following code works with the default dask threaded scheduler. `python da = xr.DataArray([1, 1, 1, np.nan]).chunk() da.fillna(0.).mean().load()` It fails with distributed. I see the following error on the client side: ``` KilledWorker Traceback (most recent call last) <ipython-input-7-5ed3c292af2e> in <module>() ----> 1 da.fillna(0.).mean().load() /opt/conda/lib/python3.6/site-packages/xarray/core/dataarray.py in load(self, kwargs) 631 dask.array.compute 632 """ --> 633 ds = self._to_temp_dataset().load(kwargs) 634 new = self._from_temp_dataset(ds) 635 self._variable = new._variable /opt/conda/lib/python3.6/site-packages/xarray/core/dataset.py in load(self, *kwargs) 489 490 # evaluate all the dask arrays simultaneously --> 491 evaluated_data = da.compute(lazy_data.values(),** kwargs) 492 493 for k, data in zip(lazy_data, evaluated_data): /opt/conda/lib/python3.6/site-packages/dask/base.py in compute(args, kwargs) 398 keys = [x.dask_keys() for x in collections] 399 postcomputes = [x.dask_postcompute() for x in collections] --> 400 results = schedule(dsk, keys, kwargs) 401 return repack([f(r, a) for r, (f, a) in zip(results, postcomputes)]) 402 /opt/conda/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, *kwargs) 2157 try: 2158 results = self.gather(packed, asynchronous=asynchronous, -> 2159 direct=direct) 2160 finally: 2161 for f in futures.values(): /opt/conda/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous) 1560 return self.sync(self._gather, futures, errors=errors, 1561 direct=direct, local_worker=local_worker, -> 1562 asynchronous=asynchronous) 1563 1564 @gen.coroutine /opt/conda/lib/python3.6/site-packages/distributed/client.py in sync(self, func, args,* kwargs) 650 return future 651 else: --> 652 return sync(self.loop, func, args, kwargs) 653 654 def repr*(self): /opt/conda/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, args,* kwargs) 273 e.wait(10) 274 if error[0]: --> 275 six.reraise(error[0]) 276 else: 277 return result[0] /opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb) 691 if value.traceback is not tb: 692 raise value.with_traceback(tb) --> 693 raise value 694 finally: 695 value = None /opt/conda/lib/python3.6/site-packages/distributed/utils.py in f() 258 yield gen.moment 259 thread_state.asynchronous = True --> 260 result[0] = yield make_coro() 261 except Exception as exc: 262 error[0] = sys.exc_info() /opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self) 1097 1098 try: -> 1099 value = future.result() 1100 except Exception: 1101 self.had_exception = True /opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self) 1105 if exc_info is not None: 1106 try: -> 1107 yielded = self.gen.throw(*exc_info) 1108 finally: 1109 # Break up a reference to itself /opt/conda/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker) 1437 six.reraise(type(exception), 1438 exception, -> 1439 traceback) 1440 if errors == 'skip': 1441 bad_keys.add(key) /opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb) 691 if value.traceback is not tb: 692 raise value.with_traceback(tb) --> 693 raise value 694 finally: 695 value = None KilledWorker: ("('isna-mean_chunk-where-mean_agg-aggregate-74ec0f30171c1c667640f1f18df5f84b',)", 'tcp://10.20.197.7:43357') `While the worker logs show this:` distributed.worker - ERROR - Can't get attribute 'isna' on <module 'pandas.core.dtypes.missing' from '/opt/conda/lib/python3.6/site-packages/pandas/core/dtypes/missing.py'> Traceback (most recent call last): File "/opt/conda/lib/python3.6/site-packages/distributed/worker.py", line 346, in handle_scheduler self.ensure_computing]) File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(exc_info) File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 361, in handle_stream msgs = yield comm.read() File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(exc_info) File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 203, in read deserializers=deserializers) File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 307, in wrapper yielded = next(result) File "/opt/conda/lib/python3.6/site-packages/distributed/comm/utils.py", line 79, in from_frames res = _from_frames() File "/opt/conda/lib/python3.6/site-packages/distributed/comm/utils.py", line 65, in _from_frames deserializers=deserializers) File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/core.py", line 122, in loads value = _deserialize(head, fs, deserializers=deserializers) File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/serialize.py", line 236, in deserialize return loads(header, frames) File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/serialize.py", line 58, in pickle_loads return pickle.loads(b''.join(frames)) File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads return pickle.loads(x) AttributeError: Can't get attribute 'isna' on <module 'pandas.core.dtypes.missing' from '/opt/conda/lib/python3.6/site-packages/pandas/core/dtypes/missing.py'> ``` This could very well be a distributed issue. Or a pandas issue. I'm not too sure what is going on. Why is pandas even involved at all? Problem description This should not raise an error. It worked fine in previous versions, but something in our latest environment has caused it to break. Expected Output `<xarray.DataArray ()> array(0.75)` Output of `xr.show_versions()` This is running in the latest pangeo.pydata.org environment (https://github.com/pangeo-data/helm-chart/pull/29). @mrocklin picked a custom set of dask / distributed commits to install. ``` INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.4.111+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.7 pandas: 0.23.1 numpy: 1.14.5 scipy: 1.1.0 netCDF4: 1.3.1 h5netcdf: None h5py: None Nio: None zarr: 2.2.0 bottleneck: None cyordereddict: None dask: 0.17.4+51.g0a7fe8de distributed: 1.21.8+54.g7909f27d matplotlib: 2.2.2 cartopy: None seaborn: None setuptools: 39.2.0 pip: 10.0.1 conda: 4.5.4 pytest: 3.6.1 IPython: 6.4.0 sphinx: None ```	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2234/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
323359733	MDU6SXNzdWUzMjMzNTk3MzM=	2135	use CF conventions to enhance plot labels	rabernat 1197350	closed		4	2018-05-15T19:53:51Z	2018-06-02T00:10:26Z	2018-06-02T00:10:26Z	MEMBER			Elsewhere in xarray we use CF conventions to help with automatic decoding of datasets. Here I propose we consider using CF metadata conventions to improve the automatic labelling of plots. If datasets declare `long_name`, `standard_name`, and `units` attributes, we could use these instead of the variable name to label the relevant axes / colorbars. This feature would have helped me avoid several past mistakes due to my failure to examine the `units` attribute (e.g. data given in cm when I assumed m). Code Sample, a copy-pastable example if possible Here I create some data with relevant attributes `python import xarray as xr import numpy as np ds = xr.Dataset({'foo': ('x', np.random.rand(10), {'long_name': 'height', 'units': 'm'})}, coords={'x': ('x', np.arange(10), {'long_name': 'distance', 'units': 'km'})}) ds.foo.plot()` Problem description We have neglected the variable attributes, which would provide better axis labels. Expected Output Consider this instead: `python def label_from_attrs(da): attrs = da.attrs if 'long_name' in attrs: name = attrs['long_name'] elif 'standard_name' in attrs: name = attrs['standard_name'] else: name = da.name if 'units' in da.attrs: units = ' [{}]'.format(da.attrs['units']) label = name + units return label ds.foo.plot() plt.xlabel(label_from_attrs(ds.x)) plt.ylabel(label_from_attrs(ds.foo))` I feel like this would be a sensible default. But it would be a breaking change. We could make it optional with a keyword like `labels_from_attrs=True`. Output of `xr.show_versions()` INSTALLED VERSIONS ------------------ commit: None python: 3.6.5.final.0 python-bits: 64 OS: Linux OS-release: 4.4.111+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.3+dev13.g98373f0 pandas: 0.22.0 numpy: 1.14.3 scipy: 1.0.1 netCDF4: 1.3.1 h5netcdf: 0.5.1 h5py: 2.7.1 Nio: None zarr: 2.2.1.dev2 bottleneck: 1.2.1 cyordereddict: None dask: 0.17.4 distributed: 1.21.8 matplotlib: 2.2.2 cartopy: None seaborn: None setuptools: 39.1.0 pip: 9.0.1 conda: 4.3.29 pytest: 3.5.1 IPython: 6.3.1 sphinx: None	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2135/reactions", "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 1, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
180516114	MDU6SXNzdWUxODA1MTYxMTQ=	1026	multidim groupby on dask arrays: dask.array.reshape error	rabernat 1197350	closed		17	2016-10-02T14:55:25Z	2018-05-24T17:59:31Z	2018-05-24T17:59:31Z	MEMBER			If I try to run a groupby operation using a multidimensional group, I get an error from dask about "dask.array.reshape requires that reshaped dimensions after the first contain at most one chunk". This error is arises with dask 0.11.0 but NOT dask 0.8.0. Consider the following test example: ``` python import dask.array as da import xarray as xr nz, ny, nx = (10,20,30) data = da.ones((nz,ny,nx), chunks=(5,ny,nx)) coord_2d = da.random.random((ny,nx), chunks=(ny,nx))>0.5 ds = xr.Dataset({'thedata': (('z','y','x'), data)}, coords={'thegroup': (('y','x'), coord_2d)}) this works fine ds.thedata.groupby('thegroup') ``` Now I rechunk one of the later dimensions and group again: `python ds.chunk({'x': 5}).thedata.groupby('thegroup')` This raises the following error and stack trace ``` ValueError Traceback (most recent call last) <ipython-input-16-1b0095ee24a0> in <module>() ----> 1 ds.chunk({'x': 5}).thedata.groupby('thegroup') /Users/rpa/RND/open_source/xray/xarray/core/common.pyc in groupby(self, group, squeeze) 343 if isinstance(group, basestring): 344 group = self[group] --> 345 return self.groupby_cls(self, group, squeeze=squeeze) 346 347 def groupby_bins(self, group, bins, right=True, labels=None, precision=3, /Users/rpa/RND/open_source/xray/xarray/core/groupby.pyc in init(self, obj, group, squeeze, grouper, bins, cut_kwargs) 170 # the copy is necessary here, otherwise read only array raises error 171 # in pandas: https://github.com/pydata/pandas/issues/12813> --> 172 group = group.stack({stacked_dim_name: orig_dims}).copy() 173 obj = obj.stack({stacked_dim_name: orig_dims}) 174 self._stacked_dim = stacked_dim_name /Users/rpa/RND/open_source/xray/xarray/core/dataarray.pyc in stack(self, dimensions) 857 DataArray.unstack 858 """ --> 859 ds = self._to_temp_dataset().stack(dimensions) 860 return self._from_temp_dataset(ds) 861 /Users/rpa/RND/open_source/xray/xarray/core/dataset.pyc in stack(self, dimensions) 1359 result = self 1360 for new_dim, dims in dimensions.items(): -> 1361 result = result._stack_once(dims, new_dim) 1362 return result 1363 /Users/rpa/RND/open_source/xray/xarray/core/dataset.pyc in _stack_once(self, dims, new_dim) 1322 shape = [self.dims[d] for d in vdims] 1323 exp_var = var.expand_dims(vdims, shape) -> 1324 stacked_var = exp_var.stack({new_dim: dims}) 1325 variables[name] = stacked_var 1326 else: /Users/rpa/RND/open_source/xray/xarray/core/variable.pyc in stack(self, *dimensions) 801 result = self 802 for new_dim, dims in dimensions.items(): --> 803 result = result._stack_once(dims, new_dim) 804 return result 805 /Users/rpa/RND/open_source/xray/xarray/core/variable.pyc in _stack_once(self, dims, new_dim) 771 772 new_shape = reordered.shape[:len(other_dims)] + (-1,) --> 773 new_data = reordered.data.reshape(new_shape) 774 new_dims = reordered.dims[:len(other_dims)] + (new_dim,) 775 /Users/rpa/anaconda/lib/python2.7/site-packages/dask/array/core.pyc in reshape(self, shape) 1101 if len(shape) == 1 and not isinstance(shape[0], Number): 1102 shape = shape[0] -> 1103 return reshape(self, shape) 1104 1105 @wraps(topk) /Users/rpa/anaconda/lib/python2.7/site-packages/dask/array/core.pyc in reshape(array, shape) 2585 2586 if any(len(c) != 1 for c in array.chunks[ndim_same+1:]): -> 2587 raise ValueError('dask.array.reshape requires that reshaped ' 2588 'dimensions after the first contain at most one chunk') 2589 ValueError: dask.array.reshape requires that reshaped dimensions after the first contain at most one chunk ``` I am using the latest xarray master and dask version 0.11.0. Note that the example works fine if I use an earlier version of dask (e.g. 0.8.0, the only other one I tested.) This suggests an upstream issue with dask, but I wanted to bring it up here first.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1026/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
317783678	MDU6SXNzdWUzMTc3ODM2Nzg=	2082	searching is broken on readthedocs	rabernat 1197350	closed		2	2018-04-25T20:34:13Z	2018-05-04T20:10:31Z	2018-05-04T20:10:31Z	MEMBER			Searches return no results for me. For example: http://xarray.pydata.org/en/latest/search.html?q=xarray&check_keywords=yes&area=default	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2082/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
312986662	MDExOlB1bGxSZXF1ZXN0MTgwNjUwMjc5	2047	Fix decode cf with dask	rabernat 1197350	closed		1	2018-04-10T15:56:20Z	2018-04-12T23:38:02Z	2018-04-12T23:38:02Z	MEMBER	0	pydata/xarray/pulls/2047	[x] Closes #1372 [x] Tests added [x] Tests passed [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API This was a very simple fix for an issue that has vexed me for quite a while. Am I missing something obvious here?	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2047/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
293913247	MDU6SXNzdWUyOTM5MTMyNDc=	1882	xarray tutorial at SciPy 2018?	rabernat 1197350	closed		17	2018-02-02T14:52:11Z	2018-04-09T20:30:13Z	2018-04-09T20:30:13Z	MEMBER			It would be great to hold an xarray tutorial at SciPy 2018. Xarray has matured a lot recently, and it would be great to raise awareness of what it can do among the broader scipy community. From the conference website: Tutorials should be focused on covering a well-defined topic in a hands-on manner. We want to see attendees coding! We encourage submissions to be designed to allow at least 50% of the time for hands-on exercises even if this means the subject matter needs to be limited. Tutorials will be 4 hours in duration. In your tutorial application, you can indicate what prerequisite skills and knowledge will be needed for your tutorial, and the approximate expected level of knowledge of your students (i.e., beginner, intermediate, advanced). I'm curious if anyone was already planning on submitting a tutorial. If not, let's put together a team. @jhamman has indicated interest in participating in, but not leading, the tutorial. Anyone else interested? xref pangeo-data/pangeo#97	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1882/reactions", "total_count": 4, "+1": 4, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
106562046	MDU6SXNzdWUxMDY1NjIwNDY=	575	1D line plot with data on the x axis	rabernat 1197350	closed		13	2015-09-15T13:56:51Z	2018-03-05T22:14:46Z	2018-03-05T22:14:46Z	MEMBER			Consider the following Dataset, representing a function f = cos(z) `python z = np.arange(10) ds = xray.Dataset( {'f': ('z', np.cos(z))}, coords={'z': z})` If I call `python ds.f.plot()` xray naturally puts "z" on the x-axis. However, since z represents the vertical dimension, it would be more natural do put it on the y-axis, i.e. `python plt.plot(ds.f, ds.z)` This is conventional in atmospheric science and oceanography for buoy data or balloon data. Is there an easy way to do this with xray's plotting functions? I scanned the code and didn't see an obvious solution, but maybe I missed it.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/575/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
295744504	MDU6SXNzdWUyOTU3NDQ1MDQ=	1898	zarr RTD docs broken	rabernat 1197350	closed	0.10.3 3008859	1	2018-02-09T03:35:05Z	2018-02-15T23:20:31Z	2018-02-15T23:20:31Z	MEMBER			This is what is getting rendered on RTD http://xarray.pydata.org/en/latest/io.html#zarr ``` In [26]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))}, ....: coords={'x': [10, 20, 30, 40], ....: 'y': pd.date_range('2000-01-01', periods=5), ....: 'z': ('x', list('abcd'))}) ....: In [27]: ds.to_zarr('path/to/directory.zarr') AttributeError Traceback (most recent call last) <ipython-input-27-8c5f1b00edbc> in <module>() ----> 1 ds.to_zarr('path/to/directory.zarr') /home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/core/dataset.py in to_zarr(self, store, mode, synchronizer, group, encoding) 1165 from ..backends.api import to_zarr 1166 return to_zarr(self, store=store, mode=mode, synchronizer=synchronizer, -> 1167 group=group, encoding=encoding) 1168 1169 def unicode(self): /home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/backends/api.py in to_zarr(dataset, store, mode, synchronizer, group, encoding) 752 # I think zarr stores should always be sync'd immediately 753 # TODO: figure out how to properly handle unlimited_dims --> 754 dataset.dump_to_store(store, sync=True, encoding=encoding) 755 return store /home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/core/dataset.py in dump_to_store(self, store, encoder, sync, encoding, unlimited_dims) 1068 1069 store.store(variables, attrs, check_encoding, -> 1070 unlimited_dims=unlimited_dims) 1071 if sync: 1072 store.sync() /home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/backends/zarr.py in store(self, variables, attributes, args, kwargs) 378 def store(self, variables, attributes, args, *kwargs): 379 AbstractWritableDataStore.store(self, variables, attributes, --> 380 args,** kwargs) 381 382 /home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, unlimited_dims) 275 variables, attributes = self.encode(variables, attributes) 276 --> 277 self.set_attributes(attributes) 278 self.set_dimensions(variables, unlimited_dims=unlimited_dims) 279 self.set_variables(variables, check_encoding_set, /home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/backends/zarr.py in set_attributes(self, attributes) 341 342 def set_attributes(self, attributes): --> 343 self.ds.attrs.put(attributes) 344 345 def encode_variable(self, variable): AttributeError: 'Attributes' object has no attribute 'put' ```	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1898/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
253136694	MDExOlB1bGxSZXF1ZXN0MTM3ODE5MTA0	1528	WIP: Zarr backend	rabernat 1197350	closed		103	2017-08-27T02:38:01Z	2018-02-13T21:35:03Z	2017-12-14T02:11:36Z	MEMBER	0	pydata/xarray/pulls/1528	[x] Closes #1223 [x] Tests added / passed [x] Passes `git diff upstream/master \| flake8 --diff` [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API I think that a zarr backend could be the ideal storage format for xarray datasets, overcoming many of the frustrations associated with netcdf and enabling optimal performance on cloud platforms. This is a very basic start to implementing a zarr backend (as proposed in #1223); however, I am taking a somewhat different approach. I store the whole dataset in a single zarr group. I encode the extra metadata needed by xarray (so far just dimension information) as attributes within the zarr group and child arrays. I hide these special attributes from the user by wrapping the attribute dictionaries in a "`HiddenKeyDict`", so that they can't be viewed or modified. I have no tests yet (:flushed:), but the following code works. ```python from xarray.backends.zarr import ZarrStore import xarray as xr import numpy as np ds = xr.Dataset( {'foo': (('y', 'x'), np.ones((100, 200)), {'myattr1': 1, 'myattr2': 2}), 'bar': (('x',), np.zeros(200))}, {'y': (('y',), np.arange(100)), 'x': (('x',), np.arange(200))}, {'some_attr': 'copana'} ).chunk({'y': 50, 'x': 40}) zs = ZarrStore(store='zarr_test') ds.dump_to_store(zs) ds2 = xr.Dataset.load_store(zs) assert ds2.equals(ds) ``` There is a very long way to go here, but I thought I would just get a PR started. Some questions that would help me move forward. What is "encoding" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr? Should we encode / decode CF for zarr stores? Do we want to always automatically align dask chunks with the underlying zarr chunks? What sort of public API should the zarr backend have? Should you be able to load zarr stores via `open_dataset`? Or do we need a new method? I think `.to_zarr()` would be quite useful. zarr arrays are extensible along all axes. What does this imply for unlimited dimensions? Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1528/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
287569331	MDExOlB1bGxSZXF1ZXN0MTYyMjI0MTg2	1817	fix rasterio chunking with s3 datasets	rabernat 1197350	closed		11	2018-01-10T20:37:45Z	2018-01-24T09:33:07Z	2018-01-23T16:33:28Z	MEMBER	0	pydata/xarray/pulls/1817	[x] Closes #1816 (remove if there is no corresponding issue, which should only be the case for minor changes) [x] Tests added (for all bug fixes or enhancements) [x] Tests passed (for all non-documentation changes) [x] Passes `git diff upstream/master */py \| flake8 --diff` (remove if you did not edit any Python files) [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later) This is a simple fix for token generation of non-filename targets for rasterio. The problem is that I have no idea how to test it without actually hitting s3 (which requires boto and aws credentials).	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1817/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
287566823	MDU6SXNzdWUyODc1NjY4MjM=	1816	rasterio chunks argument causes loading from s3 to fail	rabernat 1197350	closed		1	2018-01-10T20:28:40Z	2018-01-23T16:33:28Z	2018-01-23T16:33:28Z	MEMBER			Code Sample, a copy-pastable example if possible ```python This works url = 's3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF' ds = xr.open_rasterio(url) this doesn't ds = xr.open_rasterio(url, chunks=512) ``` The error is ``` FileNotFoundError Traceback (most recent call last) <ipython-input-17-8b55d7e920b8> in <module>() 6 # https://aws.amazon.com/public-datasets/landsat/ 7 # 512x512 chunking ----> 8 ds = xr.open_rasterio(url, chunks=512) 9 ds ~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray-0.10.0-py3.6.egg/xarray/backends/rasterio_.py in open_rasterio(filename, chunks, cache, lock) 172 from dask.base import tokenize 173 # augment the token with the file modification time --> 174 mtime = os.path.getmtime(filename) 175 token = tokenize(filename, mtime, chunks) 176 name_prefix = 'open_rasterio-%s' % token ~/miniconda3/envs/geo_scipy/lib/python3.6/genericpath.py in getmtime(filename) 53 def getmtime(filename): 54 """Return the last modification time of a file, reported by os.stat().""" ---> 55 return os.stat(filename).st_mtime 56 57 FileNotFoundError: [Errno 2] No such file or directory: 's3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF' ``` Problem description It is pretty clear that the current xarray code expects to receive a filename. (The name of the argument is `filename`.) But rasterio's `open` function accepts a much wider range of dataset identifiers. The tokenizing function should be updated to allow for this. Seems like it should be a pretty easy fix. Output of `xr.show_versions()` INSTALLED VERSIONS ------------------ commit: None python: 3.6.2.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.0 pandas: 0.20.3 numpy: 1.13.1 scipy: 0.19.1 netCDF4: 1.3.1 h5netcdf: 0.4.1 Nio: None bottleneck: 1.2.1 cyordereddict: None dask: 0.16.0 matplotlib: 2.1.0 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 36.3.0 pip: 9.0.1 conda: None pytest: 3.2.1 IPython: 6.1.0 sphinx: 1.6.5	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1816/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
281983819	MDU6SXNzdWUyODE5ODM4MTk=	1779	decode_cf destroys chunks	rabernat 1197350	closed		2	2017-12-14T05:12:00Z	2017-12-15T14:50:42Z	2017-12-15T14:50:41Z	MEMBER			Code Sample, a copy-pastable example if possible `python import numpy as np import xarray as xr xr.DataArray(np.random.rand(1000)).to_dataset(name='random').chunk(100) ds_cf = xr.decode_cf(ds) assert not ds_cf.chunks` Problem description Calling `decode_cf` causes variables whose data is dask arrays to be wrapped in two layers of abstractions: `DaskIndexingAdapter` and `LazilyIndexedArray`. In the example above ```python ds.random.variable._data dask.array<da.random.random_sample, shape=(1000,), dtype=float64, chunksize=(100,)> ds_cf.random.variable._data LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array<da.random.random_sample, shape=(1000,), dtype=float64, chunksize=(100,)>), key=BasicIndexer((slice(None, None, None),))) ``` At least part of the problem comes from this line: https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L1045 This is especially problematic if we want to concatenate several such datasets together with dask. Chunking the decoded dataset creates a nested dask-within-dask array which is sure to cause undesirable behavior down the line ```python dict(ds_cf.chunk().random.data.dask) {('xarray-random-bf5298b8790e93c1564b5dca9e04399e', 0): (<function dask.array.core.getter>, 'xarray-random-bf5298b8790e93c1564b5dca9e04399e', (slice(0, 1000, None),)), 'xarray-random-bf5298b8790e93c1564b5dca9e04399e': ImplicitToExplicitIndexingAdapter(array=LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array<da.random.random_sample, shape=(1000,), dtype=float64, chunksize=(100,)>), key=BasicIndexer((slice(None, None, None),))))} ``` Expected Output If we call `decode_cf` on a dataset made of dask arrays, it should preserve the chunks of the original dask arrays. Hopefully this can be addressed by #1752. Output of `xr.show_versions()` commit: 85174cda6440c2f6eed7860357e79897e796e623 python: 3.6.2.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.0-52-gd8842a6 pandas: 0.20.3 numpy: 1.13.1 scipy: 0.19.1 netCDF4: 1.2.9 h5netcdf: 0.4.1 Nio: None bottleneck: 1.2.1 cyordereddict: None dask: 0.16.0 matplotlib: 2.1.0 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 36.3.0 pip: 9.0.1 conda: None pytest: 3.2.1 IPython: 6.1.0 sphinx: 1.6.5	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1779/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
94328498	MDU6SXNzdWU5NDMyODQ5OA==	463	open_mfdataset too many files	rabernat 1197350	closed		47	2015-07-10T15:24:14Z	2017-11-27T12:17:17Z	2017-03-23T19:22:43Z	MEMBER			I am very excited to try xray. On my first attempt, I tried to use open_mfdataset on a set of ~8000 netcdf files. I hit a "RuntimeError: Too many open files". The ulimit on my system is 1024, so clearly that is the source of the error. I am curious whether this is the desired behavior for open_mfdataset. Does xray have to keep all the files open? If so, I will work with my sysadmin to increase the ulimit. It seems like the whole point of this function is to work with large collections of files, so this could be a significant limitation.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/463/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
229474101	MDExOlB1bGxSZXF1ZXN0MTIxMTQyODkw	1413	concat prealigned objects	rabernat 1197350	closed		11	2017-05-17T20:16:00Z	2017-07-17T21:53:53Z	2017-07-17T21:53:40Z	MEMBER	0	pydata/xarray/pulls/1413	[x] Closes #1385 [ ] Tests added / passed [ ] Passes `git diff upstream/master \| flake8 --diff` [ ] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API This is an initial PR to bypass index alignment and coordinate checking when concatenating datasets.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1413/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
229138906	MDExOlB1bGxSZXF1ZXN0MTIwOTAzMjY5	1411	fixed dask prefix naming	rabernat 1197350	closed		6	2017-05-16T19:10:30Z	2017-05-22T20:39:01Z	2017-05-22T20:38:56Z	MEMBER	0	pydata/xarray/pulls/1411	[x] Closes #1343 [x] Tests added / passed [x] Passes `git diff upstream/master \| flake8 --diff` [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API I am starting a new PR for this since the original one (#1345) was not branched of my own fork. As the discussion there stood, @shoyer suggested that `dataset.chunk` should also be updated to match the latest conventions in dask naming. The relevant code is here ```python def maybe_chunk(name, var, chunks): chunks = selkeys(chunks, var.dims) if not chunks: chunks = None if var.ndim > 0: token2 = tokenize(name, token if token else var._data) name2 = '%s%s-%s' % (name_prefix, name, token2) return var.chunk(chunks, name=name2, lock=lock) else: return var `variables = OrderedDict([(k, maybe_chunk(k, v, chunks)) for k, v in self.variables.items()])` ``` Currently, `chunk` has an optional keyword argument `name_prefix='xarray-'`. Do we want to keep this optional? IMO, the current naming logic in `chunk` is not a problem for dask and will not cause problems for the distributed bokeh dashboard (as `open_dataset` did).	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1411/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
218368855	MDExOlB1bGxSZXF1ZXN0MTEzNTU0Njk4	1345	new dask prefix	rabernat 1197350	closed		2	2017-03-31T00:56:24Z	2017-05-21T09:45:39Z	2017-05-16T19:11:13Z	MEMBER	0	pydata/xarray/pulls/1345	[x] closes #1343 [ ] tests added / passed [ ] passes `git diff upstream/master \| flake8 --diff` [ ] whatsnew entry	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1345/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
225482023	MDExOlB1bGxSZXF1ZXN0MTE4NDA4NDc1	1390	Fix groupby bins tests	rabernat 1197350	closed		1	2017-05-01T17:46:41Z	2017-05-01T21:52:14Z	2017-05-01T21:52:14Z	MEMBER	0	pydata/xarray/pulls/1390	[x] closes #1386 [x] tests added / passed [x] passes `git diff upstream/master \| flake8 --diff` [x] whatsnew entry	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1390/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
220078792	MDU6SXNzdWUyMjAwNzg3OTI=	1357	dask strict version check fails	rabernat 1197350	closed		1	2017-04-07T01:08:56Z	2017-04-07T01:43:53Z	2017-04-07T01:43:53Z	MEMBER			I am on xarray version 0.9.1-28-g1cad803 and dask version 0.14.1+39.g964b377 (both from recent github masters). I can't save chunked data to netcdf because of a failing dask version check. `python ds = xr.Dataset({'a': (['x'], np.random.rand(100)), 'b': (['x'], np.random.rand(100))}) ds = ds.chunk({'x': 20}) ds.to_netcdf('test.nc')` The relevant part of the stack trace is ``` /home/rpa/xarray/xarray/backends/common.pyc in sync(self) 165 import dask.array as da 166 import dask --> 167 if StrictVersion(dask.version) > StrictVersion('0.8.1'): 168 da.store(self.sources, self.targets, lock=GLOBAL_LOCK) 169 else: /home/rpa/.conda/envs/lagrangian_vorticity/lib/python2.7/distutils/version.pyc in init(self, vstring) 38 def init (self, vstring=None): 39 if vstring: ---> 40 self.parse(vstring) 41 42 def repr (self): /home/rpa/.conda/envs/lagrangian_vorticity/lib/python2.7/distutils/version.pyc in parse(self, vstring) 105 match = self.version_re.match(vstring) 106 if not match: --> 107 raise ValueError, "invalid version number '%s'" % vstring 108 109 (major, minor, patch, prerelease, prerelease_num) = \ ValueError: invalid version number '0.14.1+39.g964b377' ``` It appears that `StrictVersion` does not like the dask version numbering scheme.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1357/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
188537472	MDExOlB1bGxSZXF1ZXN0OTMxNzEyODE=	1104	add optimization tips	rabernat 1197350	closed		1	2016-11-10T15:26:25Z	2016-11-10T16:49:13Z	2016-11-10T16:49:06Z	MEMBER	0	pydata/xarray/pulls/1104	This adds some dask optimization tips from the mailing list (closes #1103).	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1104/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
188517316	MDU6SXNzdWUxODg1MTczMTY=	1103	add dask optimization tips to docs	rabernat 1197350	closed		0	2016-11-10T14:08:39Z	2016-11-10T16:49:06Z	2016-11-10T16:49:06Z	MEMBER			We should add the optimization tips that @shoyer describes in this mailing list thread to @karenamckinnon. https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/xarray/11lDGSeza78/lR1uj9yWDAAJ Specific things to try (we should add similar guidelines to xarray's docs): Do your spatial and temporal indexing with .sel() earlier in the pipeline, specifically before you resample. Resample triggers some computation on all the blocks, which in theory should commute with indexing, but we haven't implemented this optimization in dask yet: https://github.com/dask/dask/issues/746 Save the temporal mean to disk as a netCDF file (and then load it again with open_dataset) before subtracting it. Again, in theory, dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the dask scheduler, because it tries to keep every chunk of an array that it computes in memory: https://github.com/dask/dask/issues/874 Specify smaller chunks across space when using open_mfdataset, e.g., chunks={'latitude': 10, 'longitude': 10}. This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you do my suggestion 1).	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1103/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
180536861	MDExOlB1bGxSZXF1ZXN0ODc2NDc0MDk=	1027	Groupby bins empty groups	rabernat 1197350	closed		7	2016-10-02T21:31:32Z	2016-10-03T15:22:18Z	2016-10-03T15:22:15Z	MEMBER	0	pydata/xarray/pulls/1027	This PR fixes a bug in `groupby_bins` in which empty bins were dropped from the grouped results. Now `groupby_bins` restores any empty bins automatically. To recover the old behavior, one could apply `dropna` after a groupby operation. Fixes #1019	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1027/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
178359375	MDU6SXNzdWUxNzgzNTkzNzU=	1014	dask tokenize error with chunking	rabernat 1197350	closed		1	2016-09-21T14:14:10Z	2016-09-22T02:38:08Z	2016-09-22T02:38:08Z	MEMBER			I have hit a problem with my custom xarray store: https://github.com/xgcm/xgcm/blob/master/xgcm/models/mitgcm/mds_store.py Unfortunately it is hard for me to create a re-producible example, since this error is only coming up when I try to read a large binary dataset stored on my server. Nevertheless, I am opening an issue in hopes that someone can help me. I create an xarray dataset via a custom function `python ds = xgcm.open_mdsdataset(ddir, iters, delta_t=deltaT, prefix=['DiagLAYERS-diapycnal','DiagLAYERS-transport'])` This function creates a dataset object successfully and then calls `ds.chunk()`. Dask is unable to tokenize the variables and fails. I don't really understand why, but it seems to ultimately depend on the presence and value of the `filename` attribute in the data getting passed to dask. Any advice would be appreciated. The relevant stack trace is ``` python /home/rpa/xgcm/xgcm/models/mitgcm/mds_store.pyc in open_mdsdataset(dirname, iters, prefix, read_grid, delta_t, ref_date, calendar, geometry, grid_vars_to_coords, swap_dims, endian, chunks, ignore_unknown_vars) 154 # do we need more fancy logic (like open_dataset), or is this enough 155 if chunks is not None: --> 156 ds = ds.chunk(chunks) 157 158 return ds /home/rpa/xarray/xarray/core/dataset.py in chunk(self, chunks, name_prefix, token, lock) 863 864 variables = OrderedDict([(k, maybe_chunk(k, v, chunks)) --> 865 for k, v in self.variables.items()]) 866 return self._replace_vars_and_dims(variables) 867 /home/rpa/xarray/xarray/core/dataset.py in maybe_chunk(name, var, chunks) 856 chunks = None 857 if var.ndim > 0: --> 858 token2 = tokenize(name, token if token else var._data) 859 name2 = '%s%s-%s' % (name_prefix, name, token2) 860 return var.chunk(chunks, name=name2, lock=lock) /home/rpa/dask/dask/base.pyc in tokenize(args, kwargs) 355 if kwargs: 356 args = args + (kwargs,) --> 357 return md5(str(tuple(map(normalize_token, args))).encode()).hexdigest() /home/rpa/dask/dask/utils.pyc in call*(self, arg) 510 for cls in inspect.getmro(typ)[1:]: 511 if cls in lk: --> 512 return lkcls 513 raise TypeError("No dispatch for {0} type".format(typ)) 514 /home/rpa/dask/dask/base.pyc in normalize_array(x) 320 return (str(x), x.dtype) 321 if hasattr(x, 'mode') and hasattr(x, 'filename'): --> 322 return x.filename, os.path.getmtime(x.filename), x.dtype, x.shape 323 if x.dtype.hasobject: 324 try: /usr/local/anaconda/lib/python2.7/genericpath.pyc in getmtime(filename) 60 def getmtime(filename): 61 """Return the last modification time of a file, reported by os.stat().""" ---> 62 return os.stat(filename).st_mtime 63 64 TypeError: coercing to Unicode: need string or buffer, NoneType found ```	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1014/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
146182176	MDExOlB1bGxSZXF1ZXN0NjU0MDc4NzA=	818	Multidimensional groupby	rabernat 1197350	closed		61	2016-04-06T04:14:37Z	2016-07-31T23:02:59Z	2016-07-08T01:50:38Z	MEMBER	0	pydata/xarray/pulls/818	Many datasets have a two dimensional coordinate variable (e.g. longitude) which is different from the logical grid coordinates (e.g. nx, ny). (See #605.) For plotting purposes, this is solved by #608. However, we still might want to split / apply / combine over such coordinates. That has not been possible, because groupby only supports creating groups on one-dimensional arrays. This PR overcomes that issue by using `stack` to collapse multiple dimensions in the group variable. A minimal example of the new functionality is ``` python da = xr.DataArray([[0,1],[2,3]], coords={'lon': (['ny','nx'], [[30,40],[40,50]] ), 'lat': (['ny','nx'], [[10,10],[20,20]] )}, dims=['ny','nx']) da.groupby('lon').sum() <xarray.DataArray (lon: 3)> array([0, 3, 3]) Coordinates: * lon (lon) int64 30 40 50 ``` This feature could have broad applicability for many realistic datasets (particularly model output on irregular grids): for example, averaging non-rectangular grids zonally (i.e. in latitude), binning in temperature, etc. If you think this is worth pursuing, I would love some feedback. The PR is not complete. Some items to address are - [x] Create a specialized grouper to allow coarser bins. By default, if no `grouper` is specified, the `GroupBy` object uses all unique values to define the groups. With a high resolution dataset, this could balloon to a huge number of groups. With the latitude example, we would like to be able to specify e.g. 1-degree bins. Usage would be `da.groupby('lon', bins=range(-90,90))`. - [ ] Allow specification of which dims to stack. For example, stack in space but keep time dimension intact. (Currently it just stacks all the dimensions of the group variable.) - [x] A nice example for the docs.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/818/reactions", "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
162974170	MDExOlB1bGxSZXF1ZXN0NzU2ODI3NzM=	892	fix printing of unicode attributes	rabernat 1197350	closed		2	2016-06-29T16:47:27Z	2016-07-24T02:57:13Z	2016-07-24T02:57:13Z	MEMBER	0	pydata/xarray/pulls/892	fixes #834 I would welcome a suggestion of how to test this in a way that works with both python 2 and 3. This is somewhat outside my expertise.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/892/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
100055216	MDExOlB1bGxSZXF1ZXN0NDIwMTYyMDg=	524	Option for closing files with scipy backend	rabernat 1197350	closed		6	2015-08-10T12:49:23Z	2016-06-24T17:45:07Z	2016-06-24T17:45:07Z	MEMBER	0	pydata/xarray/pulls/524	This is the same as #468, which was accidentally closed. I just copied and pasted my comment below This addresses issue #463, in which open_mfdataset failed when trying to open a list of files longer than my system's ulimit. I tried to find a solution in which the underlying netcdf file objects are kept closed by default and only reopened "when needed". I ended up subclassing scipy.io.netcdf_file and overwriting the variable attribute with a property which first checks whether the file is open or closed and opens it if needed. That was the easy part. The hard part was figuring out when to close them. The problem is that a couple of different parts of the code (e.g. each individual variable and also the datastore object itself) keep references to the netcdf_file object. In the end I used the debugger to find out when during initialization the variables were actually being read and added some calls to close() in various different places. It is relatively easy to close the files up at the end of the initialization, but it was much harder to make sure that the whole array of files is never open at the same time. I also had to disable mmap when this option is active. This solution is messy and, moreover, extremely slow. There is a factor of ~100 performance penalty during initialization for reopening and closing the files all the time (but only a factor of 10 for the actual calculation). I am sure this could be reduced if someone who understands the code better found some judicious points at which to call close() on the netcdf_file. The loss of mmap also sucks. This option can be accessed with the close_files key word, which I added to api. Timing for loading and doing a calculation with close_files=True: `python count_open_files() %time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101.nc', engine='scipy', close_files=True) count_open_files() %time print float(mfds.variables['u'].mean()) count_open_files()` output: `3 open files CPU times: user 11.1 s, sys: 17.5 s, total: 28.5 s Wall time: 27.7 s 2 open files 0.0055650632367 CPU times: user 649 ms, sys: 974 ms, total: 1.62 s Wall time: 633 ms 2 open files` Timing for loading and doing a calculation with close_files=False (default, should revert to old behavior): `python count_open_files() %time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101.nc', engine='scipy', close_files=False) count_open_files() %time print float(mfds.variables['u'].mean()) count_open_files()` `3 open files CPU times: user 264 ms, sys: 85.3 ms, total: 349 ms Wall time: 291 ms 22 open files 0.0055650632367 CPU times: user 174 ms, sys: 141 ms, total: 315 ms Wall time: 56 ms 22 open files` This is not a very serious pull request, but I spent all day on it, so I thought I would share. Maybe you can see some obvious way to improve it...	{ "url": "https://api.github.com/repos/pydata/xarray/issues/524/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
111471076	MDU6SXNzdWUxMTE0NzEwNzY=	624	roll method	rabernat 1197350	closed		8	2015-10-14T19:14:36Z	2015-12-02T23:32:28Z	2015-12-02T23:32:28Z	MEMBER			I would like to pick up my idea to add a roll method. Among many uses, it could help with #623. The method is pretty simple. `python def roll(darr, n, dim): """Clone of numpy.roll for xray objects.""" left = darr.isel({dim: slice(None, -n)}) right = darr.isel({dim: slice(-n, None)}) return xray.concat([right, left], dim=dim, data_vars='minimal', coords='minimal')` I have already been using this function a lot (defined from outside xray) and find it quite useful. I would like to create a PR to add it, but I am having a little trouble understanding how to correctly "inject" it into the api. A few words of advice from @shoyer would probably save me a lot of trial and error.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/624/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
115897556	MDU6SXNzdWUxMTU4OTc1NTY=	649	error when using broadcast_arrays with coordinates	rabernat 1197350	closed		5	2015-11-09T15:16:32Z	2015-11-10T14:27:41Z	2015-11-10T14:27:41Z	MEMBER			I frequently use `broadcast_arrays` to to feed xray variables to non-xray libraries (e.g. gsw.) Often I need to broadcast the coordinates and variables in order to do call functions that take both as arguments. I have found that `broadcast_arrays` doesn't work as I expect with coordinates. For example `python import xray import numpy as np ds = xray.Dataset({'a': (['y','x'], np.ones((20,10)))}, coords={'x': (['x'], np.arange(10)), 'y': (['y'], np.arange(20))}) xbc, ybc, abc = xray.broadcast_arrays(ds.x, ds.y, ds.a)` This raises `ValueError: an index variable must be defined with 1-dimensional data`. If I change the last line to `python xbc, ybc, abc = xray.broadcast_arrays(1ds.x, 1ds.y, ds.a)` it works fine.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/649/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
101719623	MDExOlB1bGxSZXF1ZXN0NDI3MzE1NDg=	538	Fix contour color	rabernat 1197350	closed		25	2015-08-18T18:24:36Z	2015-09-01T17:48:12Z	2015-09-01T17:20:56Z	MEMBER	0	pydata/xarray/pulls/538	This fixes #537 by adding a check for the presence of the colors kwarg.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/538/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
101716715	MDU6SXNzdWUxMDE3MTY3MTU=	537	xray.plot.contour doesn't handle colors kwarg correctly	rabernat 1197350	closed		2	2015-08-18T18:11:55Z	2015-09-01T17:20:55Z	2015-09-01T17:20:55Z	MEMBER			I found this while playing around with the plotting functions. (Really nice work btw @clarkfitzg!) I know the plotting is still under heavy development, but I thought I would share this issue anyway. I might take a crack at fixing it myself... The goal is to make an unfilled contour plot with no colors. In matplotlib this is easy `python x, y = np.arange(20), np.arange(20) xx, yy = np.meshgrid(x, y) f = np.sqrt(xx2 + yy2) plt.contour(x, y, f, colors='k')` If I try the same thing in dask `python da = xray.DataArray(f, coords={'y': y, 'x': x}) plt.figure() xray.plot.contour(da, colors='k')` I get `ValueError: Either colors or cmap must be None`. I can't find any way around this (e.g. adding a `cmap=None` argument has no effect). If I remove the colors keyword, it works and makes colored contours, as expected. I think this could be fixed easily if you agree it is a bug...	{ "url": "https://api.github.com/repos/pydata/xarray/issues/537/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
99847237	MDExOlB1bGxSZXF1ZXN0NDE5NjI5MDg=	523	Fix datetime decoding when time units are 'days since 0000-01-01 00:00:00'	rabernat 1197350	closed		22	2015-08-09T00:12:00Z	2015-08-14T17:22:02Z	2015-08-14T17:22:02Z	MEMBER	0	pydata/xarray/pulls/523	This fixes #521 using the workaround described in Unidata/netcdf4-python#442.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/523/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
94508580	MDExOlB1bGxSZXF1ZXN0Mzk3NTI1MTQ=	468	Option for closing files with scipy backend	rabernat 1197350	closed		7	2015-07-11T21:24:24Z	2015-08-10T12:50:45Z	2015-08-09T00:04:12Z	MEMBER	0	pydata/xarray/pulls/468	This addresses issue #463, in which open_mfdataset failed when trying to open a list of files longer than my system's ulimit. I tried to find a solution in which the underlying netcdf file objects are kept closed by default and only reopened "when needed". I ended up subclassing scipy.io.netcdf_file and overwriting the variable attribute with a property which first checks whether the file is open or closed and opens it if needed. That was the easy part. The hard part was figuring out when to close them. The problem is that a couple of different parts of the code (e.g. each individual variable and also the datastore object itself) keep references to the netcdf_file object. In the end I used the debugger to find out when during initialization the variables were actually being read and added some calls to close() in various different places. It is relatively easy to close the files up at the end of the initialization, but it was much harder to make sure that the whole array of files is never open at the same time. I also had to disable mmap when this option is active. This solution is messy and, moreover, extremely slow. There is a factor of ~100 performance penalty during initialization for reopening and closing the files all the time (but only a factor of 10 for the actual calculation). I am sure this could be reduced if someone who understands the code better found some judicious points at which to call close() on the netcdf_file. The loss of mmap also sucks. This option can be accessed with the close_files key word, which I added to api. Timing for loading and doing a calculation with close_files=True: `python count_open_files() %time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101.nc', engine='scipy', close_files=True) count_open_files() %time print float(mfds.variables['u'].mean()) count_open_files()` output: `3 open files CPU times: user 11.1 s, sys: 17.5 s, total: 28.5 s Wall time: 27.7 s 2 open files 0.0055650632367 CPU times: user 649 ms, sys: 974 ms, total: 1.62 s Wall time: 633 ms 2 open files` Timing for loading and doing a calculation with close_files=False (default, should revert to old behavior): `python count_open_files() %time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101.nc', engine='scipy', close_files=False) count_open_files() %time print float(mfds.variables['u'].mean()) count_open_files()` `3 open files CPU times: user 264 ms, sys: 85.3 ms, total: 349 ms Wall time: 291 ms 22 open files 0.0055650632367 CPU times: user 174 ms, sys: 141 ms, total: 315 ms Wall time: 56 ms 22 open files` This is not a very serious pull request, but I spent all day on it, so I thought I would share. Maybe you can see some obvious way to improve it...	{ "url": "https://api.github.com/repos/pydata/xarray/issues/468/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
99844089	MDExOlB1bGxSZXF1ZXN0NDE5NjI0NDM=	522	Fix datetime decoding when time units are 'days since 0000-01-01 00:00:00'	rabernat 1197350	closed		1	2015-08-08T23:26:07Z	2015-08-09T00:10:18Z	2015-08-09T00:06:49Z	MEMBER	0	pydata/xarray/pulls/522	This fixes #521 using the workaround described in Unidata/netcdf4-python#442.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/522/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
96732359	MDU6SXNzdWU5NjczMjM1OQ==	489	problems with big endian DataArrays	rabernat 1197350	closed		4	2015-07-23T05:24:07Z	2015-07-23T20:28:00Z	2015-07-23T20:28:00Z	MEMBER			I have some MITgcm data in a custom binary format that I am trying to wedge into xray. I found that DataArray does not know how to handle big endian datatypes, at least on my system. `python x = xray.DataArray( np.ones(10, dtype='>f4')) print float(x.sum()), x.data.sum()` result: `4.60060298822e-40 10.0`	{ "url": "https://api.github.com/repos/pydata/xarray/issues/489/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
96185559	MDU6SXNzdWU5NjE4NTU1OQ==	484	segfault with hdf4 file	rabernat 1197350	closed		5	2015-07-20T23:15:06Z	2015-07-21T02:34:16Z	2015-07-21T02:34:16Z	MEMBER			I am trying to read data from the NASA MERRA reanalysis. An example file is: ftp://goldsmr3.sci.gsfc.nasa.gov/data/s4pa/MERRA/MAI3CPASM.5.2.0/2014/01/MERRA300.prod.assim.inst3_3d_asm_Cp.20140101.hdf The file format is hdf4 (NOT hdf5). (full file specification) This file can be read by netCDF4.Dataset `python from netCDF4 import Dataset fname = 'MERRA300.prod.assim.inst3_3d_asm_Cp.20140101.hdf' nc = Dataset(fname) nc.variables['SLP'][0]` No errors However, with xray `python import xray ds = xray.open_dataset(fname)` I get a segfault. Is this behavior unique to my system? Or is this a reproducible bug? Note: I am not using anaconda's netCDF package, because it does not have hdf4 file support. I had my sysadmin build us a custom netcdf and netCDF4 python.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/484/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue

Advanced export

JSON shape: default, array, newline-delimited, object

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);

issues

81 rows where state = "closed" and user = 1197350 sorted by updated_at descending

What is your issue?

Summary

Code example

Proposal

What happened?

how concat works: data are aligned

now do a Zarr append

we do not check that the coordinates are aligned--just that they have the same shape and dtype

coordinates data have been overwritten

...with the latest values

What did you expect to happen?

Anything else we need to know?

Environment

What is your issue?

Is your feature request related to a problem?

trying to access the same item will now trigger loading of all of ds1

yes I could use different chunks, but the point is that I should not have to

arbitrarily choose chunks to make this work

Describe the solution you'd like

Describe alternatives you've considered

Additional context

What is your issue?

now start an http server in a terminal in the same working directory

$ python -m http.server

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of xr.show_versions()

Code Sample, a copy-pastable example if possible

Output of xr.show_versions()

example needs to use distributed to reproduce the bug

Full ncdump of dataset

Output of xr.show_versions()

```

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

MCVE Code Sample

```

Problem Description

Output of xr.show_versions()

equivalent numpy code

Code Sample

Your code here

construct an example datatset chunked in time

monthly climatology

anomaly

Problem description

Expected Output

Output of xr.show_versions()

Code Sample

create dataset from Unidata's test opendap endpoint, chunked in time

all these work

this works too

but this does not

Output of xr.show_versions()

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of xr.show_versions()

Dask Only

connect to a local dask scheduler

create a big dask array

connect to gcs and create MutableMapping

create a zarr array to store into

write it

Xarray

wrap the dask array in an xarray

store to a different bucket

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of xr.show_versions()

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of xr.show_versions()

this works fine

Output of `xr.show_versions()`

Output of `xr.show_versions()`

Output of `xr.show_versions()`

Output of `xr.show_versions()`

Output of `xr.show_versions()`

Output of `xr.show_versions()`

Output of `xr.show_versions()`

Output of `xr.show_versions()`

Output of `xr.show_versions()`

Output of `xr.show_versions()`

Output of `xr.show_versions()`