issues


81 rows where repo = 13221727, state = "closed" and user = 1197350 sorted by updated_at descending

id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1503046820 I_kwDOAMm_X85Zlqyk 7388 Xarray does not support full range of netcdf-python compression options rabernat 1197350 closed 0     22 2022-12-19T14:21:17Z 2023-12-21T15:43:06Z 2023-12-21T15:24:17Z MEMBER      

What is your issue?

Summary

The netcdf4-python API docs say the following

If the optional keyword argument compression is set, the data will be compressed in the netCDF file using the specified compression algorithm. Currently zlib,szip,zstd,bzip2,blosc_lz,blosc_lz4,blosc_lz4hc, blosc_zlib and blosc_zstd are supported. Default is None (no compression). All of the compressors except zlib and szip use the HDF5 plugin architecture.

If the optional keyword zlib is True, the data will be compressed in the netCDF file using zlib compression (default False). The use of this option is deprecated in favor of compression='zlib'.

Although compression is considered a valid encoding option by Xarray

https://github.com/pydata/xarray/blob/bbe63ab657e9cb16a7cbbf6338a8606676ddd7b0/xarray/backends/netCDF4_.py#L232-L242

...it appears that xarray silently ignores the compression option when creating new netCDF4 variables:

https://github.com/pydata/xarray/blob/bbe63ab657e9cb16a7cbbf6338a8606676ddd7b0/xarray/backends/netCDF4_.py#L488-L501

Code example

```python
import numpy as np
import xarray as xr

shape = (10, 20)
chunksizes = (1, 10)

encoding = {
    'compression': 'zlib',
    'shuffle': True,
    'complevel': 8,
    'fletcher32': False,
    'contiguous': False,
    'chunksizes': chunksizes,
}

da = xr.DataArray(
    data=np.random.rand(*shape),
    dims=['y', 'x'],
    name="foo",
    attrs={"bar": "baz"},
)
da.encoding = encoding
ds = da.to_dataset()

fname = "test.nc"
ds.to_netcdf(fname, engine="netcdf4", mode="w")

with xr.open_dataset(fname, engine="netcdf4") as ds1:
    display(ds1.foo.encoding)  # display() assumes an IPython / Jupyter session
```

```
{'zlib': False, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False,
 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': False,
 'chunksizes': (1, 10), 'source': 'test.nc', 'original_shape': (10, 20),
 'dtype': dtype('float64'), '_FillValue': nan}
```

In addition to showing that compression is ignored, this also reveals several other encoding options that are not available when writing data from xarray (szip, zstd, bzip2, blosc).

Proposal

We should align with the recommendation from the netcdf4 docs and support compression=-style encoding when writing netCDF files, and we should deprecate the zlib=True syntax.
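A rough sketch of what such encoding could look like from the user side (the "zstd" value is illustrative; per this issue, current xarray accepts the key but silently drops it):

```python
# sketch only: compression-style encoding as proposed, not yet honored by xarray
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(10, 20), dims=["y", "x"], name="foo")
encoding = {"foo": {"compression": "zstd", "complevel": 5, "shuffle": True}}
# the proposal is for to_netcdf to pass "compression" through to
# netCDF4.Dataset.createVariable instead of ignoring it
da.to_dataset().to_netcdf("test_zstd.nc", engine="netcdf4", encoding=encoding)
```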

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7388/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1983894219 PR_kwDOAMm_X85e8V31 8428 Add mode='a-': Do not overwrite coordinates when appending to Zarr with `append_dim` rabernat 1197350 closed 0     3 2023-11-08T15:41:58Z 2023-12-01T04:21:57Z 2023-12-01T03:58:54Z MEMBER   0 pydata/xarray/pulls/8428

This implements option 1b described in #8427.

  • [x] Closes #8427
  • [x] Tests added
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8428/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1983891070 I_kwDOAMm_X852P8Z- 8427 Ambiguous behavior with coordinates when appending to Zarr store with append_dim rabernat 1197350 closed 0     4 2023-11-08T15:40:19Z 2023-12-01T03:58:56Z 2023-12-01T03:58:55Z MEMBER      

What happened?

There are two quite different scenarios covered by "append" with Zarr

  • Adding new variables to a dataset
  • Extending arrays along a dimension (via append_dim)

This issue is about what should happen when using append_dim with variables that do not contain append_dim.

Here's the current behavior.

```python
import numpy as np
import xarray as xr
import zarr

ds1 = xr.DataArray(
    np.array([1, 2, 3]).reshape(3, 1, 1),
    dims=('time', 'y', 'x'),
    coords={'x': [1], 'y': [2]},
    name="foo",
).to_dataset()

ds2 = xr.DataArray(
    np.array([4, 5]).reshape(2, 1, 1),
    dims=('time', 'y', 'x'),
    coords={'x': [-1], 'y': [-2]},
    name="foo",
).to_dataset()

# how concat works: data are aligned
ds_concat = xr.concat([ds1, ds2], dim="time")
assert ds_concat.dims == {"time": 5, "y": 2, "x": 2}

# now do a Zarr append
store = zarr.storage.MemoryStore()
ds1.to_zarr(store, consolidated=False)

# we do not check that the coordinates are aligned--just that they have
# the same shape and dtype
ds2.to_zarr(store, append_dim="time", consolidated=False)
ds_append = xr.open_zarr(store, consolidated=False)

# coordinate data have been overwritten...
assert ds_append.dims == {"time": 5, "y": 1, "x": 1}

# ...with the latest values
assert ds_append.x.data[0] == -1
```

Currently, we always write all data variables in this scenario. That includes overwriting the coordinates every time we append. That makes appending more expensive than it needs to be. I don't think that is the behavior most users want or expect.

What did you expect to happen?

There are a couple of different options we could consider for how to handle this "extending" situation (with append_dim)

  1. [current behavior] Do not attempt to align coordinates
     a. [current behavior] Overwrite coordinates with new data
     b. Keep original coordinates
     c. Force the user to explicitly drop the coordinates, as we do for region operations.
  2. Attempt to align coordinates
     a. Fail if coordinates don't match
     b. Extend the arrays to replicate the behavior of concat

We currently do 1a. I propose to switch to 1b. I think it is closer to what users want, and it requires less I/O.
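For reference, a user can approximate option 1c today by dropping the non-append coordinates before appending; this is only a sketch, it assumes the ds2/store objects from the example above, and the exact behavior may depend on the xarray version:

```python
# sketch: avoid rewriting 'x' and 'y' by dropping them from the appended dataset
ds2.drop_vars(["x", "y"]).to_zarr(store, append_dim="time", consolidated=False)
```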

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 5.10.176-157.645.amzn2.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.2 libnetcdf: 4.9.2 xarray: 2023.10.1 pandas: 2.1.2 numpy: 1.24.4 scipy: 1.11.3 netCDF4: 1.6.5 pydap: installed h5netcdf: 1.2.0 h5py: 3.10.0 Nio: None zarr: 2.16.0 cftime: 1.6.2 nc_time_axis: 1.4.1 PseudoNetCDF: None iris: None bottleneck: 1.3.7 dask: 2023.10.1 distributed: 2023.10.1 matplotlib: 3.8.0 cartopy: 0.22.0 seaborn: 0.13.0 numbagg: 0.6.0 fsspec: 2023.10.0 cupy: None pint: 0.22 sparse: 0.14.0 flox: 0.8.1 numpy_groupies: 0.10.2 setuptools: 68.2.2 pip: 23.3.1 conda: None pytest: 7.4.3 mypy: None IPython: 8.16.1 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8427/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
350899839 MDU6SXNzdWUzNTA4OTk4Mzk= 2368 Let's list all the netCDF files that xarray can't open rabernat 1197350 closed 0     32 2018-08-15T17:41:13Z 2023-11-30T04:36:42Z 2023-11-30T04:36:42Z MEMBER      

At the Pangeo developers meetings, I am hearing lots of reports from folks like @dopplershift and @rsignell-usgs about netCDF datasets that xarray can't open.

My expectation is that xarray doesn't have strong requirements on the contents of datasets. (It doesn't "enforce" CF compatibility, for example; that's optional.) Anything that can be written to netCDF should be readable by xarray.

I would like to collect examples of places where xarray fails. So far, I am only aware of one:

  • Self-referential multidimensional coordinates (#2233). Datasets which contain variables like siglay(siglay, node). Only siglay(siglay) would work.

Are there other distinct cases?

Please provide links / sample code of netCDF datasets that xarray can't read. Even better would be short code snippets to create such datasets in python using the netcdf4 interface.
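In that spirit, here is a minimal snippet (file name and sizes are made up) that reproduces the siglay(siglay, node) pattern from #2233 using the netcdf4 interface:

```python
import numpy as np
import netCDF4

# create a variable that shares its name with a dimension but is 2-D,
# i.e. the self-referential multidimensional coordinate case from #2233
nc = netCDF4.Dataset("self_referential_coord.nc", "w")
nc.createDimension("siglay", 3)
nc.createDimension("node", 5)
siglay = nc.createVariable("siglay", "f4", ("siglay", "node"))
siglay[:] = np.random.rand(3, 5)
nc.close()
```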

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2368/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1935984485 I_kwDOAMm_X85zZMdl 8290 Potential performance optimization for Zarr backend rabernat 1197350 closed 0     0 2023-10-10T18:41:19Z 2023-10-13T16:38:58Z 2023-10-13T16:38:58Z MEMBER      

What is your issue?

We have identified an inefficiency in the way the ZarrArrayWrapper works. This class currently stores a reference to a ZarrStore and a variable name

https://github.com/pydata/xarray/blob/75af56c33a29529269a73bdd00df2d3af17ee0f5/xarray/backends/zarr.py#L63-L68

When accessing the array, the parent group of the array is read and used to open a new Zarr array.

https://github.com/pydata/xarray/blob/75af56c33a29529269a73bdd00df2d3af17ee0f5/xarray/backends/zarr.py#L83-L84

This is a relatively metadata-intensive operation for Zarr. It requires reading both the group metadata and the array metadata. Because of how this wrapper works, these operations currently happen every time data is read from the array. If we have a dask array wrapping the zarr array with thousands of chunks, these metadata operations will happen within every single task. For high latency stores, this is really bad.

Instead, we should just reference the zarr.Array object directly within the ZarrArrayWrapper. It's lightweight and easily serializable. There is no need to re-open the array each time we want to read data from it. This change will lead to an immediate performance enhancement in all Zarr operations.
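To make the contrast concrete, here is a rough sketch in plain zarr-python (not xarray internals; names are illustrative) of the two access patterns:

```python
import zarr

# current pattern (sketch): re-resolve the array through its parent group on
# every access, which re-reads group and array metadata from the store
def get_array_each_time(store, name):
    group = zarr.open_group(store, mode="r")  # group metadata read
    return group[name]                        # array metadata read

# proposed pattern (sketch): open once and hold the zarr.Array itself;
# it is lightweight and serializable, so it can be reused across dask tasks
store = zarr.storage.MemoryStore()
zarr.open_group(store, mode="w").create_dataset("foo", shape=(4,), chunks=(2,))
arr = zarr.open_group(store, mode="r")["foo"]
data = arr[:]  # no further group-metadata reads needed
```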

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8290/reactions",
    "total_count": 6,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 2,
    "eyes": 0
}
  completed xarray 13221727 issue
357808970 MDExOlB1bGxSZXF1ZXN0MjEzNzM2NTAx 2405 WIP: don't create indexes on multidimensional dimensions rabernat 1197350 closed 0     7 2018-09-06T20:13:11Z 2023-07-19T18:33:17Z 2023-07-19T18:33:17Z MEMBER   0 pydata/xarray/pulls/2405
  • [x] Closes #2368, Closes #2233
  • [ ] Tests added (for all bug fixes or enhancements)
  • [ ] Tests passed (for all non-documentation changes)
  • [ ] Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

This is just a start to the solution proposed in #2368. A surprisingly small number of tests broke in my local environment.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2405/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
401874795 MDU6SXNzdWU0MDE4NzQ3OTU= 2697 read ncml files to create multifile datasets rabernat 1197350 closed 0     18 2019-01-22T17:33:08Z 2023-05-29T13:41:38Z 2023-05-29T13:41:38Z MEMBER      

This issue was motivated by a recent conversation with @jdha regarding how they are preparing inputs for regional ocean models. They are currently using ncml with netcdf-java to consolidate and homogenize diverse data sources. But this approach doesn't play well with the xarray / dask stack.

NcML is a standard developed by Unidata for use with their netCDF-Java library:

NcML is an XML representation of netCDF metadata, (approximately) the header information one gets from a netCDF file with the "ncdump -h" command.

In addition to describing individual netCDF files, ncml can be used to annotate modifications to netCDF metadata (attributes, dimension names, etc.) and also to aggregate multiple files into a single logical dataset. This is what such an aggregation over an existing dimension looks like in ncml:

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting">
    <netcdf location="jan.nc" />
    <netcdf location="feb.nc" />
  </aggregation>
</netcdf>
```

Obviously this maps very well to xarray's concat operation. Similar aggregations can be defined that map to merge operations.

I think it would be great if we could support the ncml spec in xarray, allowing us to write code like

```python
ds = xr.open_ncml('file.ncml')
```

This idea has been discussed before in #893. Perhaps its time has finally come.
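For illustration only, a minimal sketch of what such a reader could do for the joinExisting case above (the function name and behavior are hypothetical, not an xarray API):

```python
# hypothetical sketch: realize a joinExisting NcML aggregation with xarray
import xml.etree.ElementTree as ET
import xarray as xr

NCML_NS = "{http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2}"

def open_ncml_joinexisting(path):
    root = ET.parse(path).getroot()
    agg = root.find(f"{NCML_NS}aggregation")
    assert agg.attrib["type"] == "joinExisting"
    files = [el.attrib["location"] for el in agg.findall(f"{NCML_NS}netcdf")]
    datasets = [xr.open_dataset(f) for f in files]
    return xr.concat(datasets, dim=agg.attrib["dimName"])
```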

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2697/reactions",
    "total_count": 7,
    "+1": 7,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1231184996 I_kwDOAMm_X85JYmRk 6588 Support lazy concatenation *without dask* rabernat 1197350 closed 0     2 2022-05-10T13:40:20Z 2023-03-10T18:40:22Z 2022-05-10T15:38:20Z MEMBER      

Is your feature request related to a problem?

Right now, if I want to concatenate multiple datasets (e.g. as in open_mfdataset), I have two options:

  • Eagerly load the data as numpy arrays ➡️ xarray will dispatch to np.concatenate
  • Chunk each dataset ➡️ xarray will dispatch to dask.array.concatenate

In pseudocode:

```python
ds1 = xr.open_dataset("some_big_lazy_source_1.nc")
ds2 = xr.open_dataset("some_big_lazy_source_2.nc")
item1 = ds1.foo[0, 0, 0]  # lazily access a single item
ds = xr.concat([ds1.chunk(), ds2.chunk()], "time")  # only way to lazily concat

# trying to access the same item will now trigger loading of all of ds1
item1 = ds.foo[0, 0, 0]

# yes I could use different chunks, but the point is that I should not have to
# arbitrarily choose chunks to make this work
```

However, I am increasingly encountering scenarios where I would like to lazily concatenate datasets (without loading into memory), but also without the requirement of using dask. This would be useful, for example, for creating composite datasets that point back to an OpenDAP server, preserving the possibility of granular lazy access to any array element without the requirement of arbitrary chunking at an intermediate stage.

Describe the solution you'd like

I propose to extend our LazilyIndexedArray classes to support simple concatenation and stacking. The result of applying concat to such arrays will be a new LazilyIndexedArray that wraps the underlying arrays into a single object.

The main difficulty in implementing this will probably be with indexing: the concatenated array will need to understand how to map global indexes to the underling individual array indexes. That is a little tricky but eminently solvable.
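A rough sketch of the index-mapping idea (illustrative only; the class name is made up and this is not xarray's LazilyIndexedArray machinery):

```python
# sketch: map a global index along the concat axis to (source array, local index)
import numpy as np

class ConcatIndexMap:
    def __init__(self, arrays, axis=0):
        self.arrays = arrays
        self.axis = axis
        sizes = [a.shape[axis] for a in arrays]
        self.offsets = np.cumsum([0] + sizes)  # e.g. [0, n0, n0 + n1, ...]

    def locate(self, i):
        """Return (source array, index into that array) for global index i."""
        k = int(np.searchsorted(self.offsets, i, side="right")) - 1
        return self.arrays[k], i - self.offsets[k]
```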

Describe alternatives you've considered

The alternative is to structure your code in a way that avoids needing to lazily concatenate arrays. That is what we do now. It is not optimal.

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6588/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1260047355 I_kwDOAMm_X85LGsv7 6662 Obscure h5netcdf http serialization issue with python's http.server rabernat 1197350 closed 0     6 2022-06-03T15:28:15Z 2022-06-04T22:13:05Z 2022-06-04T22:13:05Z MEMBER      

What is your issue?

In Pangeo Forge, we try to test our ability to read data over http. This often surfaces edge cases involving xarray and fsspec. This is one such edge case. However, it is kind of important, because it affects our ability to reliably test http-based datasets using python's built-in http server.

Here is some code that:

  • Creates a tiny dataset on disk
  • Serves it over http via python -m http.server
  • Opens the dataset with fsspec and xarray with the h5netcdf engine
  • Pickles the dataset, loads it, and calls .load() to load the data into memory

As you can see, this works with a local file, but not with the http file, with h5py raising a checksum-related error.

```python
import fsspec
import xarray as xr
from pickle import dumps, loads

ds_orig = xr.tutorial.load_dataset('tiny')
ds_orig

fname = 'tiny.nc'
ds_orig.to_netcdf(fname, engine='netcdf4')

# now start an http server in a terminal in the same working directory
# $ python -m http.server

def open_pickle_and_reload(path):
    with fsspec.open(path, mode='rb') as fp:
        with xr.open_dataset(fp, engine='h5netcdf') as ds1:
            pass

    # pickle it and reload it
    ds2 = loads(dumps(ds1))
    ds2.load()

open_pickle_and_reload(fname)  # works
url = f'http://127.0.0.1:8000/{fname}'
open_pickle_and_reload(url)  # OSError: Unable to open file (incorrect metadata checksum after all read attempts)
```

full traceback ``` --------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~/Code/xarray/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock) 198 try: --> 199 file = self._cache[self._key] 200 except KeyError: ~/Code/xarray/xarray/backends/lru_cache.py in __getitem__(self, key) 52 with self._lock: ---> 53 value = self._cache[key] 54 self._cache.move_to_end(key) KeyError: [<class 'h5netcdf.core.File'>, (<File-like object HTTPFileSystem, http://127.0.0.1:8000/tiny.nc>,), 'r', (('decode_vlen_strings', True), ('invalid_netcdf', None))] During handling of the above exception, another exception occurred: OSError Traceback (most recent call last) <ipython-input-2-195ac3fcdb43> in <module> 24 open_pickle_and_reload(fname) # works 25 url = f'[http://127.0.0.1:8000/{fname}'](http://127.0.0.1:8000/%7Bfname%7D'%3C/span%3E) ---> 26 open_pickle_and_reload(url) # OSError: Unable to open file (incorrect metadata checksum after all read attempts) <ipython-input-2-195ac3fcdb43> in open_pickle_and_reload(path) 20 # pickle it and reload it 21 ds2 = loads(dumps(ds1)) ---> 22 ds2.load() # works 23 24 open_pickle_and_reload(fname) # works ~/Code/xarray/xarray/core/dataset.py in load(self, **kwargs) 687 for k, v in self.variables.items(): 688 if k not in lazy_data: --> 689 v.load() 690 691 return self ~/Code/xarray/xarray/core/variable.py in load(self, **kwargs) 442 self._data = as_compatible_data(self._data.compute(**kwargs)) 443 elif not is_duck_array(self._data): --> 444 self._data = np.asarray(self._data) 445 return self 446 ~/Code/xarray/xarray/core/indexing.py in __array__(self, dtype) 654 655 def __array__(self, dtype=None): --> 656 self._ensure_cached() 657 return np.asarray(self.array, dtype=dtype) 658 ~/Code/xarray/xarray/core/indexing.py in _ensure_cached(self) 651 def _ensure_cached(self): 652 if not isinstance(self.array, NumpyIndexingAdapter): --> 653 self.array = NumpyIndexingAdapter(np.asarray(self.array)) 654 655 def __array__(self, dtype=None): ~/Code/xarray/xarray/core/indexing.py in __array__(self, dtype) 624 625 def __array__(self, dtype=None): --> 626 return np.asarray(self.array, dtype=dtype) 627 628 def __getitem__(self, key): ~/Code/xarray/xarray/core/indexing.py in __array__(self, dtype) 525 def __array__(self, dtype=None): 526 array = as_indexable(self.array) --> 527 return np.asarray(array[self.key], dtype=None) 528 529 def transpose(self, order): ~/Code/xarray/xarray/backends/h5netcdf_.py in __getitem__(self, key) 49 50 def __getitem__(self, key): ---> 51 return indexing.explicit_indexing_adapter( 52 key, self.shape, indexing.IndexingSupport.OUTER_1VECTOR, self._getitem 53 ) ~/Code/xarray/xarray/core/indexing.py in explicit_indexing_adapter(key, shape, indexing_support, raw_indexing_method) 814 """ 815 raw_key, numpy_indices = decompose_indexer(key, shape, indexing_support) --> 816 result = raw_indexing_method(raw_key.tuple) 817 if numpy_indices.tuple: 818 # index the loaded np.ndarray ~/Code/xarray/xarray/backends/h5netcdf_.py in _getitem(self, key) 58 key = tuple(list(k) if isinstance(k, np.ndarray) else k for k in key) 59 with self.datastore.lock: ---> 60 array = self.get_array(needs_lock=False) 61 return array[key] 62 ~/Code/xarray/xarray/backends/h5netcdf_.py in get_array(self, needs_lock) 45 class H5NetCDFArrayWrapper(BaseNetCDF4Array): 46 def get_array(self, needs_lock=True): ---> 47 ds = self.datastore._acquire(needs_lock) 48 return ds.variables[self.variable_name] 49 
~/Code/xarray/xarray/backends/h5netcdf_.py in _acquire(self, needs_lock) 180 181 def _acquire(self, needs_lock=True): --> 182 with self._manager.acquire_context(needs_lock) as root: 183 ds = _nc4_require_group( 184 root, self._group, self._mode, create_group=_h5netcdf_create_group /opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/contextlib.py in __enter__(self) 117 del self.args, self.kwds, self.func 118 try: --> 119 return next(self.gen) 120 except StopIteration: 121 raise RuntimeError("generator didn't yield") from None ~/Code/xarray/xarray/backends/file_manager.py in acquire_context(self, needs_lock) 185 def acquire_context(self, needs_lock=True): 186 """Context manager for acquiring a file.""" --> 187 file, cached = self._acquire_with_cache_info(needs_lock) 188 try: 189 yield file ~/Code/xarray/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock) 203 kwargs = kwargs.copy() 204 kwargs["mode"] = self._mode --> 205 file = self._opener(*self._args, **kwargs) 206 if self._mode == "w": 207 # ensure file doesn't get overridden when opened again /opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/h5netcdf/core.py in __init__(self, path, mode, invalid_netcdf, phony_dims, **kwargs) 719 else: 720 self._preexisting_file = mode in {"r", "r+", "a"} --> 721 self._h5file = h5py.File(path, mode, **kwargs) 722 except Exception: 723 self._closed = True /opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, **kwds) 505 fs_persist=fs_persist, fs_threshold=fs_threshold, 506 fs_page_size=fs_page_size) --> 507 fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr) 508 509 if isinstance(libver, tuple): /opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr) 218 if swmr and swmr_support: 219 flags |= h5f.ACC_SWMR_READ --> 220 fid = h5f.open(name, flags, fapl=fapl) 221 elif mode == 'r+': 222 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl) h5py/_objects.pyx in h5py._objects.with_phil.wrapper() h5py/_objects.pyx in h5py._objects.with_phil.wrapper() h5py/h5f.pyx in h5py.h5f.open() OSError: Unable to open file (incorrect metadata checksum after all read attempts) (external_url) ```

Strangely, a similar workflow does work with http files hosted elsewhere, e.g.

```python
external_url = 'https://power-datastore.s3.amazonaws.com/v9/climatology/power_901_rolling_zones_utc.nc'
open_pickle_and_reload(external_url)
```

This suggests there is something peculiar about python's http.server as compared to other http servers that makes this break.

I would appreciate any thoughts or ideas about what might be going on here (pinging @martindurant and @shoyer)

xref: - https://github.com/pangeo-forge/pangeo-forge-recipes/pull/373 - https://github.com/pydata/xarray/issues/4242 - https://github.com/google/xarray-beam/issues/49

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6662/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
333312849 MDU6SXNzdWUzMzMzMTI4NDk= 2237 why time grouping doesn't preserve chunks rabernat 1197350 closed 0     30 2018-06-18T15:12:38Z 2022-05-15T02:44:06Z 2022-05-15T02:38:30Z MEMBER      

Code Sample, a copy-pastable example if possible

I am continuing my quest to obtain more efficient time grouping for calculation of climatologies and climatological anomalies. I believe this is one of the major performance bottlenecks facing xarray users today. I have raised this in other issues (e.g. #1832), but I believe I have narrowed it down here to a more specific problem.

The easiest way to summarize the problem is with an example. Consider the following dataset

```python
import xarray as xr
ds = xr.Dataset({'foo': (['x'], [1, 1, 1, 1])},
                coords={'x': (['x'], [0, 1, 2, 3]),
                        'bar': (['x'], ['a', 'a', 'b', 'b']),
                        'baz': (['x'], ['a', 'b', 'a', 'b'])})
ds = ds.chunk({'x': 2})
ds
```

```
<xarray.Dataset>
Dimensions:  (x: 4)
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
Data variables:
    foo      (x) int64 dask.array<shape=(4,), chunksize=(2,)>
```

One non-dimension coordinate (bar) is contiguous with respect to x while the other baz is not. This is important. baz is structured similar to the way that month would be distributed on a timeseries dataset.

Now let's do a trivial groupby operation on bar that does nothing, just returns the group unchanged:

```python
ds.foo.groupby('bar').apply(lambda x: x)
```

```
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(2,)>
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
```

This operation preserved the original chunks in foo. But if we group by baz we see something different:

```python
ds.foo.groupby('baz').apply(lambda x: x)
```

```
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(4,)>
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
```

Problem description

When grouping over a non-contiguous variable (baz) the result has no chunks. That means that we can't lazily access a single item without computing the whole array. This has major performance consequences that make it hard to calculate anomaly values in a more realistic case. What we really want to do is often something like

```python
ds = xr.open_mfdataset('lots/of/files/*.nc')
ds_anom = ds.groupby('time.month').apply(lambda x: x - x.mean(dim='time'))
```

It is currently impossible to do this lazily due to the issue described above.

Expected Output

We would like to preserve the original chunk structure of foo.

Output of xr.show_versions()

xr.show_versions() is triggering a segfault right now on my system for unknown reasons! I am using xarray 0.10.7.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2237/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
413589315 MDU6SXNzdWU0MTM1ODkzMTU= 2785 error decoding cftime time_bnds over opendap with pydap rabernat 1197350 closed 0     2 2019-02-22T21:38:24Z 2021-07-21T14:51:36Z 2021-07-21T14:51:36Z MEMBER      

Code Sample, a copy-pastable example if possible

I try to load the following dataset over opendap with the pydap engine. It only works if I do decode_times=False

```python
url = 'http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NOAA-GFDL/GFDL-AM4/amip/r1i1p1f1/Amon/ta/gr1/v20180807/ta_Amon_GFDL-AM4_amip_r1i1p1f1_gr1_198001-201412.nc'
ds = xr.open_dataset(url, decode_times=False, engine='pydap')
xr.decode_cf(ds)
```

raises ```


IndexError Traceback (most recent call last) <ipython-input-52-df985a95e29e> in <module>() 1 #ds.time_bnds.load() ----> 2 xr.decode_cf(ds)

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables) 459 vars, attrs, coord_names = decode_cf_variables( 460 vars, attrs, concat_characters, mask_and_scale, decode_times, --> 461 decode_coords, drop_variables=drop_variables) 462 ds = Dataset(vars, attrs=attrs) 463 ds = ds.set_coords(coord_names.union(extra_coords).intersection(vars))

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/conventions.py in decode_cf_variables(variables, attributes, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables) 392 k, v, concat_characters=concat_characters, 393 mask_and_scale=mask_and_scale, decode_times=decode_times, --> 394 stack_char_dim=stack_char_dim) 395 if decode_coords: 396 var_attrs = new_vars[k].attrs

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/conventions.py in decode_cf_variable(name, var, concat_characters, mask_and_scale, decode_times, decode_endianness, stack_char_dim) 298 for coder in [times.CFTimedeltaCoder(), 299 times.CFDatetimeCoder()]: --> 300 var = coder.decode(var, name=name) 301 302 dimensions, data, attributes, encoding = (

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/coding/times.py in decode(self, variable, name) 410 units = pop_to(attrs, encoding, 'units') 411 calendar = pop_to(attrs, encoding, 'calendar') --> 412 dtype = _decode_cf_datetime_dtype(data, units, calendar) 413 transform = partial( 414 decode_cf_datetime, units=units, calendar=calendar)

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/coding/times.py in _decode_cf_datetime_dtype(data, units, calendar) 116 values = indexing.ImplicitToExplicitIndexingAdapter( 117 indexing.as_indexable(data)) --> 118 example_value = np.concatenate([first_n_items(values, 1) or [0], 119 last_item(values) or [0]]) 120

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/formatting.py in first_n_items(array, n_desired) 94 from_end=False) 95 array = array[indexer] ---> 96 return np.asarray(array).flat[:n_desired] 97 98

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order) 529 530 """ --> 531 return array(a, dtype, copy=False, order=order) 532 533

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in array(self, dtype) 630 631 def array(self, dtype=None): --> 632 self._ensure_cached() 633 return np.asarray(self.array, dtype=dtype) 634

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in _ensure_cached(self) 627 def _ensure_cached(self): 628 if not isinstance(self.array, NumpyIndexingAdapter): --> 629 self.array = NumpyIndexingAdapter(np.asarray(self.array)) 630 631 def array(self, dtype=None):

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order) 529 530 """ --> 531 return array(a, dtype, copy=False, order=order) 532 533

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in array(self, dtype) 608 609 def array(self, dtype=None): --> 610 return np.asarray(self.array, dtype=dtype) 611 612 def getitem(self, key):

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order) 529 530 """ --> 531 return array(a, dtype, copy=False, order=order) 532 533

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in array(self, dtype) 514 def array(self, dtype=None): 515 array = as_indexable(self.array) --> 516 return np.asarray(array[self.key], dtype=None) 517 518 def transpose(self, order):

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/conventions.py in getitem(self, key) 43 44 def getitem(self, key): ---> 45 return np.asarray(self.array[key], dtype=self.dtype) 46 47

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order) 529 530 """ --> 531 return array(a, dtype, copy=False, order=order) 532 533

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in array(self, dtype) 514 def array(self, dtype=None): 515 array = as_indexable(self.array) --> 516 return np.asarray(array[self.key], dtype=None) 517 518 def transpose(self, order):

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/backends/pydap_.py in getitem(self, key) 24 def getitem(self, key): 25 return indexing.explicit_indexing_adapter( ---> 26 key, self.shape, indexing.IndexingSupport.BASIC, self._getitem) 27 28 def _getitem(self, key):

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in explicit_indexing_adapter(key, shape, indexing_support, raw_indexing_method) 785 if numpy_indices.tuple: 786 # index the loaded np.ndarray --> 787 result = NumpyIndexingAdapter(np.asarray(result))[numpy_indices] 788 return result 789

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray/core/indexing.py in getitem(self, key) 1174 def getitem(self, key): 1175 array, key = self._indexing_array_and_key(key) -> 1176 return array[key] 1177 1178 def setitem(self, key, value):

IndexError: too many indices for array ```

Strangely, I can overcome the error by first explicitly loading (or dropping) the time_bnds variable:

```python
ds.time_bnds.load()
xr.decode_cf(ds)
```

I wish this would work without the .load() step. I think it has something to do with the many layers of array wrappers involved in lazy opening. The problem does not occur with the netcdf4 engine.

I know this is a very obscure problem, but I thought I would open an issue to document.

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2 xarray: 0.11.3 pandas: 0.23.4 numpy: 1.13.1 scipy: 0.19.1 netCDF4: 1.4.2 pydap: installed h5netcdf: None h5py: None Nio: None zarr: 2.2.1.dev126+dirty cftime: 1.0.3.4 PseudonetCDF: None rasterio: None cfgrib: None iris: None bottleneck: 1.2.1 cyordereddict: None dask: 0.20.2 distributed: 1.24.2 matplotlib: 2.1.0 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 40.6.2 pip: 18.1 conda: None pytest: 4.0.0 IPython: 6.1.0 sphinx: 1.6.5
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2785/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
745801652 MDU6SXNzdWU3NDU4MDE2NTI= 4591 Serialization issue with distributed, h5netcdf, and fsspec (ImplicitToExplicitIndexingAdapter) rabernat 1197350 closed 0     12 2020-11-18T16:18:42Z 2021-06-30T17:53:54Z 2020-11-19T15:54:38Z MEMBER      

This was originally reported by @jkingslake at https://github.com/pangeo-data/pangeo-datastore/issues/116.

What happened:

I tried to open a netcdf file over http using fsspec and the h5netcdf engine and compute data using dask.distributed. It appears that our ImplicitToExplicitIndexingAdapter is [no longer?] serializable?

What you expected to happen:

Things would work. Indeed, I could swear this used to work with previous versions.

Minimal Complete Verifiable Example:

```python
import xarray as xr
import fsspec
from dask.distributed import Client

# example needs to use distributed to reproduce the bug
client = Client()

url = 'https://storage.googleapis.com/ldeo-glaciology/bedmachine/BedMachineAntarctica_2019-11-05_v01.nc'
with fsspec.open(url, mode='rb') as openfile:
    dsc = xr.open_dataset(openfile, chunks=3000)
    dsc.surface.mean().compute()
```

raises the following error:

```
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/core.py", line 50, in dumps
    data = {
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/core.py", line 51, in <dictcomp>
    key: serialize(
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 277, in serialize
    raise TypeError(msg, str(x)[:10000])
TypeError: ('Could not serialize object of type ImplicitToExplicitIndexingAdapter.', 'ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=<xarray.backends.h5netcdf_.H5NetCDFArrayWrapper object at 0x7ff8e3988540>, key=BasicIndexer((slice(None, None, None), slice(None, None, None))))))')
distributed.comm.utils - ERROR - ('Could not serialize object of type ImplicitToExplicitIndexingAdapter.', 'ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=<xarray.backends.h5netcdf_.H5NetCDFArrayWrapper object at 0x7ff8e3988540>, key=BasicIndexer((slice(None, None, None), slice(None, None, None))))))')
```

Anything else we need to know?:

One can work around this by using the netcdf4 library's new and undocumented ability to open files over http.

```python
url = 'https://storage.googleapis.com/ldeo-glaciology/bedmachine/BedMachineAntarctica_2019-11-05_v01.nc#mode=bytes'
ds = xr.open_dataset(url, engine='netcdf4', chunks=3000)
ds
```

However, the fsspec + h5netcdf path should work!

Environment:

Output of <tt>xr.show_versions()</tt> ``` INSTALLED VERSIONS ------------------ commit: None python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 4.19.112+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.16.1 pandas: 1.1.3 numpy: 1.19.2 scipy: 1.5.2 netCDF4: 1.5.4 pydap: installed h5netcdf: 0.8.1 h5py: 2.10.0 Nio: None zarr: 2.4.0 cftime: 1.2.1 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: 1.1.7 cfgrib: 0.9.8.4 iris: None bottleneck: 1.3.2 dask: 2.30.0 distributed: 2.30.0 matplotlib: 3.3.2 cartopy: 0.18.0 seaborn: None numbagg: None pint: 0.16.1 setuptools: 49.6.0.post20201009 pip: 20.2.4 conda: None pytest: 6.1.1 IPython: 7.18.1 sphinx: 3.2.1 ``` Also fsspec 0.8.4

cc @martindurant for fsspec integration.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4591/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
836391524 MDU6SXNzdWU4MzYzOTE1MjQ= 5056 Allow "unsafe" mode for zarr writing rabernat 1197350 closed 0     1 2021-03-19T21:57:47Z 2021-04-26T16:37:43Z 2021-04-26T16:37:43Z MEMBER      

Currently, Dataset.to_zarr will only write Zarr datasets in cases in which:

  • The Dataset arrays are in memory (no dask)
  • The arrays are chunked with dask with a one-to-many relationship between dask chunks and zarr chunks

If I try to violate the one-to-many condition, I get an error

```python
import xarray as xr
ds = xr.DataArray([0, 1., 2], name='foo').chunk({'dim_0': 1}).to_dataset()
d = ds.to_zarr('test.zarr', encoding={'foo': {'chunks': (3,)}}, compute=False)
```

```
/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/zarr.py in _determine_zarr_chunks(enc_chunks, var_chunks, ndim, name)
    148             for dchunk in dchunks[:-1]:
    149                 if dchunk % zchunk:
--> 150                     raise NotImplementedError(
    151                         f"Specified zarr chunks encoding['chunks']={enc_chunks_tuple!r} for "
    152                         f"variable named {name!r} would overlap multiple dask chunks {var_chunks!r}. "

NotImplementedError: Specified zarr chunks encoding['chunks']=(3,) for variable named 'foo' would overlap multiple dask chunks ((1, 1, 1),). This is not implemented in xarray yet. Consider either rechunking using chunk() or instead deleting or modifying encoding['chunks'].
```

In this case, the error is particularly frustrating because I'm not even writing any data yet. (Also related to #2300, #4046, #4380).

There are at least two scenarios in which we might want to have more flexibility:

  1. The case above, when we want to lazily initialize a Zarr array based on a Dataset, without actually computing anything.
  2. The more general case, where we actually write arrays with many-to-many dask-chunk <-> zarr-chunk relationships.

For 1, I propose we add a new option like safe_chunks=True to to_zarr. safe_chunks=False would permit just bypassing this check.

For 2, we could consider implementing locks. This probably has to be done at the Dask level. But is actually not super hard to deterministically figure out which chunks need to share a lock.
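A sketch of what option 1 could look like from the user side (safe_chunks is the keyword proposed in this issue, not an existing argument at the time of writing):

```python
# sketch of the proposed API: lazily initialize the store without the
# overlapping-chunk check
import xarray as xr

ds = xr.DataArray([0, 1., 2], name='foo').chunk({'dim_0': 1}).to_dataset()
delayed = ds.to_zarr(
    'test.zarr',
    encoding={'foo': {'chunks': (3,)}},
    compute=False,
    safe_chunks=False,
)
```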

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5056/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
837243943 MDExOlB1bGxSZXF1ZXN0NTk3NjA4NTg0 5065 Zarr chunking fixes rabernat 1197350 closed 0     32 2021-03-22T01:35:22Z 2021-04-26T16:37:43Z 2021-04-26T16:37:43Z MEMBER   0 pydata/xarray/pulls/5065
  • [x] Closes #2300, closes #5056
  • [x] Tests added
  • [x] Passes pre-commit run --all-files
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst

This PR contains two small, related updates to how Zarr chunks are handled.

  1. We now delete the encoding attribute at the Variable level whenever chunk is called. The persistence of chunk encoding has been the source of lots of confusion (see #2300, #4046, #4380, https://github.com/dcs4cop/xcube/issues/347)
  2. Added a new option called safe_chunks in to_zarr which allows for bypassing the requirement of the many-to-one relationship between Zarr chunks and Dask chunks (see #5056).

Both these touch the internal logic for how chunks are handled, so I thought it was easiest to tackle them with a single PR.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5065/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
859945463 MDU6SXNzdWU4NTk5NDU0NjM= 5172 Inconsistent attribute handling between netcdf4 and h5netcdf engines rabernat 1197350 closed 0     3 2021-04-16T15:54:03Z 2021-04-20T14:00:34Z 2021-04-16T17:13:26Z MEMBER      

I have found a netCDF file that cannot be decoded by xarray via the h5netcdf engine but CAN be decoded via netCDF4. This could be considered an h5netcdf bug, but I thought I would raise it first here for visibility.

This file will reproduce the bug:

```
! wget 'https://esgf-world.s3.amazonaws.com/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/r1i1p1f1/Lmon/cLeaf/gr/v20190118/cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc'
```

```python
import netCDF4
import h5netcdf.legacyapi as netCDF4_h5

local_path = "cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc"
with netCDF4_h5.Dataset(local_path, mode='r') as ncfile:
    print('h5netcdf:', ncfile['cLeaf'].getncattr("coordinates"))
with netCDF4.Dataset(local_path, mode='r') as ncfile:
    # assert "coordinates" not in ncfile['cLeaf'].attrs
    print('netCDF4:', ncfile['cLeaf'].getncattr("coordinates"))
```

```
h5netcdf: Empty(dtype=dtype('S1'))
netCDF4:
```

As we can see, we get an empty string '' in netCDF4 but a <class 'h5py._hl.base.Empty'> object from h5netcdf. This weird attribute prevents xarray from decoding the dataset.

We could:

  • Fix it in xarray, by adding special handling for this sort of Empty object
  • Fix it in h5netcdf
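For the first option, a minimal sketch of the kind of guard that could live on the xarray side (the helper name is made up; h5py.Empty is the class h5py uses for empty attributes):

```python
# sketch: normalize h5py "empty" attributes to an empty string, matching
# what netCDF4-python reports for this file
import h5py

def normalize_attr_value(value):
    if isinstance(value, h5py.Empty):
        return ""
    return value
```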

Environment:

Output of <tt>xr.show_versions()</tt> INSTALLED VERSIONS ------------------ commit: None python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 4.19.150+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.17.0 pandas: 1.2.3 numpy: 1.20.2 scipy: 1.6.2 netCDF4: 1.5.6 pydap: installed h5netcdf: 0.10.0 h5py: 3.1.0 Nio: None zarr: 2.7.0 cftime: 1.4.1 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: 1.2.1 cfgrib: 0.9.8.5 iris: None bottleneck: 1.3.2 dask: 2021.03.1 distributed: 2021.03.1 matplotlib: 3.3.4 cartopy: 0.18.0 seaborn: None numbagg: None pint: 0.17 setuptools: 49.6.0.post20210108 pip: 20.3.4 conda: None pytest: None IPython: 7.22.0 sphinx: None

xref https://github.com/pangeo-forge/pangeo-forge/issues/105

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5172/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
548607657 MDU6SXNzdWU1NDg2MDc2NTc= 3689 Decode CF bounds to coords rabernat 1197350 closed 0     5 2020-01-12T18:23:26Z 2021-04-19T03:32:26Z 2021-04-19T03:32:26Z MEMBER      

CF conventions define Cell Boundaries and specify how to encode the presence of cell boundary variables in dataset attributes.

To represent cells we add the attribute bounds to the appropriate coordinate variable(s). The value of bounds is the name of the variable that contains the vertices of the cell boundaries.

For example consider this dataset: http://esgf-data.ucar.edu/thredds/dodsC/esg_dataroot/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_200001-201412.nc

```python
url = 'http://esgf-data.ucar.edu/thredds/dodsC/esg_dataroot/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_200001-201412.nc'
ds = xr.open_dataset(url)
ds
```

gives

```
<xarray.Dataset>
Dimensions:    (lat: 192, lon: 288, nbnd: 2, time: 180)
Coordinates:
  * lat        (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
  * time       (time) object 2000-01-15 12:00:00 ... 2014-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    time_bnds  (time, nbnd) object ...
    lat_bnds   (lat, nbnd) float64 ...
    lon_bnds   (lon, nbnd) float64 ...
    tas        (time, lat, lon) float32 ...
```

Despite the presence of the bounds attributes

```python
>>> print(ds.time.bounds, ds.lat.bounds, ds.lon.bounds)
time_bnds lat_bnds lon_bnds
```

The variables time_bnds, lat_bnds, and lon_bnds are not decoded as coordinates but as data variables. I believe that this is not in accordance with CF conventions.

Instead, we should decode all bounds variables to coordinates.

I cannot think of a single use case where one would want to treat these variables as data variables rather than coordinates. It would be easy to implement, but it is a breaking change.

Note that this is just a proposal to move bounds variables to the coords part of the dataset. It does not address the more difficult / complex question of how to actually use the bounds for indexing or plotting operations (see e.g. #1475, #1613), although it could be a first step in that direction.
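For comparison, the effect of the proposal can be approximated manually today with set_coords; a one-line sketch, assuming the ds opened above:

```python
# sketch: manually promote the bounds variables to coordinates
ds = ds.set_coords(['time_bnds', 'lat_bnds', 'lon_bnds'])
```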

Full ncdump of dataset

``` xarray.Dataset { dimensions: lat = 192 ; lon = 288 ; nbnd = 2 ; time = 180 ; variables: float64 lat(lat) ; lat:axis = Y ; lat:bounds = lat_bnds ; lat:standard_name = latitude ; lat:title = Latitude ; lat:type = double ; lat:units = degrees_north ; lat:valid_max = 90.0 ; lat:valid_min = -90.0 ; lat:_ChunkSizes = 192 ; float64 lon(lon) ; lon:axis = X ; lon:bounds = lon_bnds ; lon:standard_name = longitude ; lon:title = Longitude ; lon:type = double ; lon:units = degrees_east ; lon:valid_max = 360.0 ; lon:valid_min = 0.0 ; lon:_ChunkSizes = 288 ; object time(time) ; time:axis = T ; time:bounds = time_bnds ; time:standard_name = time ; time:title = time ; time:type = double ; time:_ChunkSizes = 512 ; object time_bnds(time, nbnd) ; time_bnds:_ChunkSizes = [1 2] ; float64 lat_bnds(lat, nbnd) ; lat_bnds:units = degrees_north ; lat_bnds:_ChunkSizes = [192 2] ; float64 lon_bnds(lon, nbnd) ; lon_bnds:units = degrees_east ; lon_bnds:_ChunkSizes = [288 2] ; float32 tas(time, lat, lon) ; tas:cell_measures = area: areacella ; tas:cell_methods = area: time: mean ; tas:comment = near-surface (usually, 2 meter) air temperature ; tas:description = near-surface (usually, 2 meter) air temperature ; tas:frequency = mon ; tas:id = tas ; tas:long_name = Near-Surface Air Temperature ; tas:mipTable = Amon ; tas:out_name = tas ; tas:prov = Amon ((isd.003)) ; tas:realm = atmos ; tas:standard_name = air_temperature ; tas:time = time ; tas:time_label = time-mean ; tas:time_title = Temporal mean ; tas:title = Near-Surface Air Temperature ; tas:type = real ; tas:units = K ; tas:variable_id = tas ; tas:_ChunkSizes = [ 1 192 288] ; // global attributes: :Conventions = CF-1.7 CMIP-6.2 ; ... [truncated] ```

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:07:37) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.6.2 xarray: 0.14.0+19.gba48fbcd pandas: 0.25.1 numpy: 1.17.2 scipy: 1.3.1 netCDF4: 1.5.1.2 pydap: None h5netcdf: 0.7.4 h5py: 2.10.0 Nio: None zarr: 2.3.2 cftime: 1.0.3.4 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: None cfgrib: 0.9.7.1 iris: None bottleneck: 1.2.1 dask: 2.4.0 distributed: 2.4.0 matplotlib: 3.1.1 cartopy: 0.17.0 seaborn: 0.9.0 numbagg: None setuptools: 41.2.0 pip: 19.2.3 conda: None pytest: 5.1.2 IPython: 7.8.0 sphinx: 1.6.5
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3689/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
99836561 MDU6SXNzdWU5OTgzNjU2MQ== 521 time decoding error with "days since" rabernat 1197350 closed 0     20 2015-08-08T21:54:24Z 2021-03-29T14:12:38Z 2015-08-14T17:23:26Z MEMBER      

I am trying to use xray with some CESM POP model netCDF output, which supposedly follows CF-1.0 conventions. It is failing because the model's time units are "days since 0000-01-01 00:00:00". When calling open_dataset, I get the following error:

```
ValueError: unable to decode time units u'days since 0000-01-01 00:00:00' with the default calendar. Try opening your dataset with decode_times=False.
```

Full traceback:

```
Traceback (most recent call last):
  File "/home/rpa/xray/xray/conventions.py", line 372, in __init__
    # Otherwise, tracebacks end up swallowed by Dataset.__repr__ when users
  File "/home/rpa/xray/xray/conventions.py", line 145, in decode_cf_datetime
    dates = _decode_datetime_with_netcdf4(flat_num_dates, units, calendar)
  File "/home/rpa/xray/xray/conventions.py", line 97, in _decode_datetime_with_netcdf4
    dates = np.asarray(nc4.num2date(num_dates, units, calendar))
  File "netCDF4/_netCDF4.pyx", line 4522, in netCDF4._netCDF4.num2date (netCDF4/_netCDF4.c:50388)
  File "netCDF4/_netCDF4.pyx", line 4337, in netCDF4._netCDF4._dateparse (netCDF4/_netCDF4.c:48234)
ValueError: year is out of range
```

Full metadata for the time variable:

```
double time(time) ;
    time:long_name = "time" ;
    time:units = "days since 0000-01-01 00:00:00" ;
    time:bounds = "time_bound" ;
    time:calendar = "noleap" ;
```

I guess this is a problem with the underlying netCDF4 num2date package?
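For readers hitting this today: later xarray versions can decode this kind of axis via cftime, since the noleap calendar allows year 0. A hedged sketch (the file name is a placeholder):

```python
# sketch: decode a 'days since 0000-01-01' / noleap time axis with cftime
import xarray as xr

ds = xr.open_dataset("pop_output.nc", decode_times=False)
ds = xr.decode_cf(ds, use_cftime=True)  # yields cftime.DatetimeNoLeap values
```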

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/521/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
288184220 MDU6SXNzdWUyODgxODQyMjA= 1823 We need a fast path for open_mfdataset rabernat 1197350 closed 0     19 2018-01-12T17:01:49Z 2021-01-28T18:00:15Z 2021-01-27T17:50:09Z MEMBER      

It would be great to have a "fast path" option for open_mfdataset, in which all alignment / coordinate checking is bypassed. This would be used in cases where the user knows that many netCDF files all share the same coordinates (e.g. model output, satellite records from the same product, etc.). The coordinates would just be taken from the first file, and only the data variables would be read from all subsequent files. The only checking would be that the data variables have the correct shape.

Implementing this would require some refactoring. @jbusecke mentioned that he had developed a solution for this (related to #1704), so maybe he could be the one to add this feature to xarray.

This is also related to #1385.
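For context, later xarray releases grew combine/compat options that approximate this fast path; a hedged sketch of that usage (the argument values are the common pattern, not taken from this issue):

```python
# sketch: take coordinates from the first file and skip most alignment checks
import xarray as xr

ds = xr.open_mfdataset(
    "lots/of/files/*.nc",
    combine="nested",
    concat_dim="time",
    coords="minimal",
    data_vars="minimal",
    compat="override",
    parallel=True,
)
```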

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1823/reactions",
    "total_count": 9,
    "+1": 9,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
753965875 MDU6SXNzdWU3NTM5NjU4NzU= 4631 Decode_cf fails when scale_factor is a length-1 list rabernat 1197350 closed 0     4 2020-12-01T03:07:48Z 2021-01-15T18:19:56Z 2021-01-15T18:19:56Z MEMBER      

Some datasets I work with have scale_factor and add_offset encoded as length-1 lists. The following code worked as of Xarray 0.16.1

```python
import xarray as xr
ds = xr.DataArray([0, 1, 2], name='foo',
                  attrs={'scale_factor': [0.01],
                         'add_offset': [1.0]}).to_dataset()
xr.decode_cf(ds)
```

In 0.16.2 (just released) and current master, it fails with this error

```

AttributeError Traceback (most recent call last) <ipython-input-2-a0b01d6a314b> in <module> 2 attrs={'scale_factor': [0.01], 3 'add_offset': [1.0]}).to_dataset() ----> 4 xr.decode_cf(ds)

~/Code/xarray/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta) 587 raise TypeError("can only decode Dataset or DataStore objects") 588 --> 589 vars, attrs, coord_names = decode_cf_variables( 590 vars, 591 attrs,

~/Code/xarray/xarray/conventions.py in decode_cf_variables(variables, attributes, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta) 490 and stackable(v.dims[-1]) 491 ) --> 492 new_vars[k] = decode_cf_variable( 493 k, 494 v,

~/Code/xarray/xarray/conventions.py in decode_cf_variable(name, var, concat_characters, mask_and_scale, decode_times, decode_endianness, stack_char_dim, use_cftime, decode_timedelta) 333 variables.CFScaleOffsetCoder(), 334 ]: --> 335 var = coder.decode(var, name=name) 336 337 if decode_timedelta:

~/Code/xarray/xarray/coding/variables.py in decode(self, variable, name) 271 dtype = _choose_float_dtype(data.dtype, "add_offset" in attrs) 272 if np.ndim(scale_factor) > 0: --> 273 scale_factor = scale_factor.item() 274 if np.ndim(add_offset) > 0: 275 add_offset = add_offset.item()

AttributeError: 'list' object has no attribute 'item' ```

I'm very confused, because this feels quite similar to #4471, and I thought it was resolved by #4485. However, the behavior is different with 'scale_factor': np.array([0.01]). That works fine--no error.

How might I end up with a dataset with scale_factor as a python list? It happens when I open a netcdf file using the h5netcdf engine (documented by @gerritholl in https://github.com/pydata/xarray/issues/4471#issuecomment-702018925) and then write it to zarr. The numpy array gets encoded as a list in the zarr json metadata. 🙃

This problem would go away if we could resolve the discrepancies between the two engines' treatment of scalar attributes.
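One xarray-side fix that would cover both cases is to coerce the attribute with numpy before calling .item(); a minimal sketch (the helper name is hypothetical):

```python
# sketch: accept scale_factor/add_offset given as numpy arrays *or* python lists
import numpy as np

def attr_to_scalar(value):
    if np.ndim(value) > 0:
        value = np.asarray(value).item()  # works for [0.01] and np.array([0.01])
    return value
```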

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4631/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
753514595 MDU6SXNzdWU3NTM1MTQ1OTU= 4624 Release 0.16.2? rabernat 1197350 closed 0     6 2020-11-30T14:15:55Z 2020-12-02T00:24:31Z 2020-12-01T15:09:38Z MEMBER      

Looking at our what's new, we have quite a few important new features, as well as significant bug fixes.

I propose we move towards releasing ~0.17.0~ 0.16.2 asap. (I have selfish motives for this, as I want to use the new features in production.)

We can use this issue to track any PRs or issues we want to resolve before the next release. I personally am not aware of any major blockers, but other devs should feel free to edit this list.

  • [ ] #4461 - requires decisions
  • [x] #4618
  • [x] #4621

cc @pydata/xarray

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4624/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
375663610 MDU6SXNzdWUzNzU2NjM2MTA= 2528 display_width doesn't apply to dask-backed arrays rabernat 1197350 closed 0     3 2018-10-30T19:49:05Z 2020-09-30T06:17:17Z 2020-09-30T06:17:17Z MEMBER      

The representation of dask-backed arrays in xarray's `__repr__` methods results in very long lines which often overflow the desired line width. Unfortunately, this can't be controlled or overridden with `xr.set_options(display_width=...)`.

Code Sample, a copy-pastable example if possible

```python
import xarray as xr
xr.set_options(display_width=20)
ds = (xr.DataArray(range(100))
      .chunk({'dim_0': 10})
      .to_dataset(name='really_long_long_name'))
ds
```

```
<xarray.Dataset>
Dimensions:                (dim_0: 100)
Dimensions without coordinates: dim_0
Data variables:
    really_long_long_name  (dim_0) int64 dask.array<shape=(100,), chunksize=(10,)>
```

Problem description

Even though display_width is set to 20, the dask-backed variable is rendered on a single very long line, because the full dask array summary (shape and chunksize) is appended and is never wrapped or truncated to the configured width.

Expected Output

We need to decide how to abbreviate dask arrays with something more concise. I'm not sure of the best way to do this. Maybe `really_long_long_name (dim_0) int64 dask chunks=(10,)`

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2528/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
614814400 MDExOlB1bGxSZXF1ZXN0NDE1MjkyMzM3 4047 Document Xarray zarr encoding conventions rabernat 1197350 closed 0     3 2020-05-08T15:29:14Z 2020-05-22T21:59:09Z 2020-05-20T17:04:02Z MEMBER   0 pydata/xarray/pulls/4047

When we implemented the Zarr backend, we made some ad hoc choices about how to encode NetCDF data in Zarr. At this stage, it would be useful to explicitly document this encoding. I decided to put it on the "Xarray Internals" page, but I'm open to moving if folks feel it fits better elsewhere.

cc @jeffdlb, @WardF, @DennisHeimbigner

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4047/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
528884925 MDU6SXNzdWU1Mjg4ODQ5MjU= 3575 map_blocks output inference problems rabernat 1197350 closed 0     6 2019-11-26T17:56:11Z 2020-05-06T16:41:54Z 2020-05-06T16:41:54Z MEMBER      

I am excited about using map_blocks to overcome a long-standing challenge related to calculating climatologies / anomalies with dask arrays. However, I hit what feels like a bug. I don't love how the new map_blocks function does this:

The function will be first run on mocked-up data, that looks like ‘obj’ but has sizes 0, to determine properties of the returned object such as dtype, variable names, new dimensions and new indexes (if any).

The problem is that many functions will simply error on size 0 data. As in the example below

MCVE Code Sample

```python
import xarray as xr
ds = xr.tutorial.load_dataset('rasm').chunk({'y': 20})

def calculate_anomaly(ds):
    # needed to workaround xarray's check with zero dimensions
    # if len(ds['time']) == 0:
    #     return ds
    gb = ds.groupby("time.month")
    clim = gb.mean(dim='T')
    return gb - clim

xr.map_blocks(calculate_anomaly, ds)
```

Raises

```
KeyError                                  Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataset.py in _construct_dataarray(self, name)
   1145         try:
-> 1146             variable = self._variables[name]
   1147         except KeyError:

KeyError: 'time.month'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/parallel.py in infer_template(func, obj, *args, **kwargs)
     77     try:
---> 78         template = func(*meta_args, **kwargs)
     79     except Exception as e:

<ipython-input-40-d7b2b2978c29> in calculate_anomaly(ds)
      5     #     return ds
----> 6     gb = ds.groupby("time.month")
      7     clim = gb.mean(dim='T')

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/common.py in groupby(self, group, squeeze, restore_coord_dims)
    656         return self._groupby_cls(
--> 657             self, group, squeeze=squeeze, restore_coord_dims=restore_coord_dims
    658         )

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/groupby.py in __init__(self, obj, group, squeeze, grouper, bins, restore_coord_dims, cut_kwargs)
    298             )
--> 299             group = obj[group]
    300         if len(group) == 0:

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataset.py in __getitem__(self, key)
   1235         if hashable(key):
-> 1236             return self._construct_dataarray(key)
   1237         else:

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataset.py in _construct_dataarray(self, name)
   1148             _, name, variable = _get_virtual_variable(
-> 1149                 self._variables, name, self._level_coords, self.dims
   1150             )

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/dataset.py in _get_virtual_variable(variables, key, level_vars, dim_sizes)
    157     else:
--> 158         data = getattr(ref_var, var_name).data
    159         virtual_var = Variable(ref_var.dims, data)

AttributeError: 'IndexVariable' object has no attribute 'month'

The above exception was the direct cause of the following exception:

Exception                                 Traceback (most recent call last)
<ipython-input-40-d7b2b2978c29> in <module>
      8     return gb - clim
      9
---> 10 xr.map_blocks(calculate_anomaly, ds)

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/parallel.py in map_blocks(func, obj, *args, **kwargs)
    203     input_chunks = dataset.chunks
    204
--> 205     template: Union[DataArray, Dataset] = infer_template(func, obj, *args, **kwargs)
    206     if isinstance(template, DataArray):
    207         result_is_array = True

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/core/parallel.py in infer_template(func, obj, *args, **kwargs)
     80         raise Exception(
     81             "Cannot infer object returned from running user provided function."
---> 82         ) from e
     83
     84     if not isinstance(template, (Dataset, DataArray)):

Exception: Cannot infer object returned from running user provided function.
```

Problem Description

We should try to imitate what dask does in map_blocks: https://docs.dask.org/en/latest/array-api.html#dask.array.map_blocks

Specifically:
- We should allow the user to override the checks by explicitly specifying output dtype and shape, as sketched below
- Maybe the check should be on small, rather than zero-size, test data
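For reference, later xarray versions expose exactly this kind of override as a `template` argument to `map_blocks`; a minimal sketch of that usage (the template construction here is my assumption, since the anomaly keeps the input's dims and dtypes but gains a `month` coordinate):

```python
# Sketch: skip the zero-size trial run by handing map_blocks an object that
# already has the structure of the expected output. Add the "month"
# coordinate up front so the template matches what calculate_anomaly returns.
template = ds.assign_coords(month=ds['time.month'])
result = xr.map_blocks(calculate_anomaly, ds, template=template)
result.compute()
```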

Output of xr.show_versions()

# Paste the output here xr.show_versions() here INSTALLED VERSIONS ------------------ commit: None python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.14.138+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.6.2 xarray: 0.14.0 pandas: 0.25.3 numpy: 1.17.3 scipy: 1.3.2 netCDF4: 1.5.1.2 pydap: installed h5netcdf: 0.7.4 h5py: 2.10.0 Nio: None zarr: 2.3.2 cftime: 1.0.4.2 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: 1.0.25 cfgrib: None iris: 2.2.0 bottleneck: 1.3.0 dask: 2.7.0 distributed: 2.7.0 matplotlib: 3.1.2 cartopy: 0.17.0 seaborn: 0.9.0 numbagg: None setuptools: 41.6.0.post20191101 pip: 19.3.1 conda: None pytest: 5.3.1 IPython: 7.9.0 sphinx: None ​
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3575/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
499477363 MDU6SXNzdWU0OTk0NzczNjM= 3349 Implement polyfit? rabernat 1197350 closed 0     25 2019-09-27T14:25:14Z 2020-03-25T17:17:45Z 2020-03-25T17:17:45Z MEMBER      

Fitting a line (or curve) to data along a specified axis is a long-standing need of xarray users. There are many blog posts and SO questions about how to do it:

- http://atedstone.github.io/rate-of-change-maps/
- https://gist.github.com/luke-gregor/4bb5c483b2d111e52413b260311fbe43
- https://stackoverflow.com/questions/38960903/applying-numpy-polyfit-to-xarray-dataset
- https://stackoverflow.com/questions/52094320/with-xarray-how-to-parallelize-1d-operations-on-a-multidimensional-dataset
- https://stackoverflow.com/questions/36275052/applying-a-function-along-an-axis-of-a-dask-array

The main use case in my domain is finding the temporal trend on a 3D variable (e.g. temperature in time, lon, lat).

Yes, you can do it with apply_ufunc, but apply_ufunc is inaccessibly complex for many users. Much of our existing API could be removed and replaced with apply_ufunc calls, but that doesn't mean we should do it.

I am proposing we add a Dataarray method called polyfit. It would work like this:

```python
x_ = np.linspace(0, 1, 10)
y_ = np.arange(5)
a_ = np.cos(y_)

x = xr.DataArray(x_, dims=['x'], coords={'x': x_})
a = xr.DataArray(a_, dims=['y'])
f = a*x
p = f.polyfit(dim='x', deg=1)

# equivalent numpy code
p_ = np.polyfit(x_, f.values.transpose(), 1)
np.testing.assert_allclose(p_[0], a_)
```

Numpy's polyfit function is already vectorized in the sense that it accepts 1D x and 2D y, performing the fit independently over each column of y. To extend this to ND, we would just need to reshape the data going in and out of the function. We do this already in other packages. For dask, we could simply require that the dimension over which the fit is calculated be contiguous, and then call map_blocks.

Thoughts?
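For anyone landing here later: polyfit did get added, and the released API looks roughly like the sketch below (the `polyfit_coefficients` / `degree` names are my recollection of what landed, so double-check against the current docs).

```python
import numpy as np
import xarray as xr

x_ = np.linspace(0, 1, 10)
da = xr.DataArray(3.0 * x_ + 1.0, dims=["x"], coords={"x": x_})

fit = da.polyfit(dim="x", deg=1)                     # Dataset of coefficients
slope = fit.polyfit_coefficients.sel(degree=1)       # ~3.0
intercept = fit.polyfit_coefficients.sel(degree=0)   # ~1.0
```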

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3349/reactions",
    "total_count": 9,
    "+1": 9,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
361858640 MDU6SXNzdWUzNjE4NTg2NDA= 2423 manually specify chunks in open_zarr rabernat 1197350 closed 0     2 2018-09-19T17:52:31Z 2020-01-09T15:21:35Z 2020-01-09T15:21:35Z MEMBER      

Currently, open_zarr has two possible chunking behaviors: auto_chunk=True (the default) creates dask chunks corresponding to the zarr chunks, while auto_chunk=False creates no chunks. But what if you want to manually specify the chunks, as with open_dataset(chunks=...)? open_zarr could easily support this, but it does not currently.

Note that this is not the same as calling .chunk() post dataset creation. That operation is very inefficient, since it begins from a single global chunk for each variable.
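Concretely, the requested call would mirror `open_dataset` (a sketch of the proposed usage, not behavior that existed at the time; `store` stands in for whatever zarr store you have):

```python
import xarray as xr

# proposed: explicit dask chunks at open time, independent of the zarr chunks
ds = xr.open_zarr(store, chunks={'time': 10, 'lat': 180, 'lon': 360})
```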

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2423/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
396285440 MDU6SXNzdWUzOTYyODU0NDA= 2656 dataset info in .json format rabernat 1197350 closed 0     9 2019-01-06T19:13:34Z 2020-01-08T22:43:25Z 2019-01-21T23:25:56Z MEMBER      

I am exploring the world of Spatio Temporal Asset Catalogs (STAC), in which all datasets are described using json/ geojson:

The STAC specification aims to standardize the way geospatial assets are exposed online and queried.

I am thinking about how to put the sort of datasets that xarray deals with into STAC items (see https://github.com/radiantearth/stac-spec). This would be particular valuable in the context of Pangeo and the zarr-based datasets we have been putting in cloud storage.

For this purpose, it would be very useful to have a concise summary of an xarray dataset's contents (minus the actual data) in .json format. I'm talking about the kind of info we currently get from the .info() method, which is designed to mirror the CDL output of ncdump -h.

For example

```python
ds = xr.Dataset({'foo': ('x', np.ones(10, 'f8'), {'units': 'm s-1'})},
                {'x': ('x', np.arange(10), {'units': 'm'})},
                {'conventions': 'made up'})
ds.info()
```

```
xarray.Dataset {
dimensions:
	x = 10 ;

variables:
	float64 foo(x) ;
		foo:units = m s-1 ;
	int64 x(x) ;
		x:units = m ;

// global attributes:
	:conventions = made up ;
```

I would like to be able to do `ds.info(format='json')` and see something like this

```json
{
  "coords": {
    "x": {
      "dims": ["x"],
      "attrs": {"units": "m"}
    }
  },
  "attrs": {
    "conventions": "made up"
  },
  "dims": {
    "x": 10
  },
  "data_vars": {
    "foo": {
      "dims": ["x"],
      "attrs": {"units": "m s-1"}
    }
  }
}
```

Which is what I get by doing print(json.dumps(ds.to_dict(), indent=2)) and manually stripping out all the data fields. So an alternative api might be something like ds.to_dict(data=False).
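For illustration, the `to_dict(data=False)` form of the API would look like this in practice (a sketch; this keyword landed later via #2659):

```python
import json

summary = ds.to_dict(data=False)   # same structure as to_dict(), minus the data fields
print(json.dumps(summary, indent=2))
```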

If anyone is aware of an existing spec for expressing Common Data Language in json, we should probably use that instead of inventing our own. But I think some version of this would be a very useful addition to xarray.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2656/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
288785270 MDU6SXNzdWUyODg3ODUyNzA= 1832 groupby on dask objects doesn't handle chunks well rabernat 1197350 closed 0     22 2018-01-16T04:50:22Z 2019-11-27T16:45:14Z 2019-06-06T20:01:40Z MEMBER      

80% of climate data analysis begins with calculating the monthly-mean climatology and subtracting it from the dataset to get an anomaly. Unfortunately this is a fail case for xarray / dask with out-of-core datasets. This is becoming a serious problem for me.

Code Sample

```python
import xarray as xr
import dask.array as da
import pandas as pd

# construct an example dataset chunked in time
nt, ny, nx = 366, 180, 360
time = pd.date_range(start='1950-01-01', periods=nt, freq='10D')
ds = xr.DataArray(da.random.random((nt, ny, nx), chunks=(1, ny, nx)),
                  dims=('time', 'lat', 'lon'),
                  coords={'time': time}).to_dataset(name='field')

# monthly climatology
ds_mm = ds.groupby('time.month').mean(dim='time')

# anomaly
ds_anom = ds.groupby('time.month') - ds_mm
print(ds_anom)
```

```
<xarray.Dataset>
Dimensions:  (lat: 180, lon: 360, time: 366)
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1950-01-11 1950-01-21 ...
    month    (time) int64 1 1 1 1 2 2 3 3 3 4 4 4 5 5 5 5 6 6 6 7 7 7 8 8 8 ...
Dimensions without coordinates: lat, lon
Data variables:
    field    (time, lat, lon) float64 dask.array<shape=(366, 180, 360), chunksize=(366, 180, 360)>
```

Problem description

As we can see in the example above, the chunking has been lost. The dataset contains just one single huge chunk. This happens with any non-reducing operation on the groupby, even

```python
ds.groupby('time.month').apply(lambda x: x)
```

Say we wanted to compute some statistics of the anomaly, like the variance:

```python
(ds_anom.field**2).mean(dim='time').load()
```

This triggers the whole big chunk (with the whole timeseries) to be loaded into memory somewhere. For out-of-core datasets, this will crash our system.

Expected Output

It seems like we should be able to do this lazily, maintaining a chunk size of (1, 180, 360) for ds_anom.
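A quick way to see the regression directly is to compare the dask chunks before and after the groupby arithmetic (a sketch; the chunk tuples in the comments are what I expect given the repr above, not verified output):

```python
# per-timestep chunks on the input
print(ds.field.data.chunks)       # ((1, 1, ..., 1), (180,), (360,))
# one monolithic chunk on the anomaly
print(ds_anom.field.data.chunks)  # ((366,), (180,), (360,))
```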

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.2.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.0+dev27.g049cbdd pandas: 0.20.3 numpy: 1.13.1 scipy: 0.19.1 netCDF4: 1.3.1 h5netcdf: 0.4.1 Nio: None zarr: 2.2.0a2.dev91 bottleneck: 1.2.1 cyordereddict: None dask: 0.16.0 distributed: 1.20.1 matplotlib: 2.1.0 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 36.3.0 pip: 9.0.1 conda: None pytest: 3.2.1 IPython: 6.1.0 sphinx: 1.6.5

Possibly related to #392.

cc @mrocklin

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1832/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
467776251 MDExOlB1bGxSZXF1ZXN0Mjk3MzU0NTEx 3121 Allow other tutorial filename extensions rabernat 1197350 closed 0     3 2019-07-13T23:27:44Z 2019-07-14T01:07:55Z 2019-07-14T01:07:51Z MEMBER   0 pydata/xarray/pulls/3121
  • [x] Closes #3118
  • [ ] Tests added
  • [ ] Fully documented, including whats-new.rst for all changes and api.rst for new API

Together with https://github.com/pydata/xarray-data/pull/15, this allows us to generalize our tutorial datasets to non-netCDF files. But it is backwards compatible: if there is no file suffix, it will append .nc.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3121/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
467674875 MDExOlB1bGxSZXF1ZXN0Mjk3MjgyNzA1 3106 Replace sphinx_gallery with notebook rabernat 1197350 closed 0     3 2019-07-13T05:35:34Z 2019-07-13T14:03:20Z 2019-07-13T14:03:19Z MEMBER   0 pydata/xarray/pulls/3106

Today @jhamman and I discussed how to refactor our somewhat fragmented "examples". We decided to basically copy the approach of the dask-examples repo, but have it live here in the main xarray repo. Basically this approach is:
- all examples are notebooks
- examples are rendered during doc build by nbsphinx
- we will eventually have a binder that works with all of the same examples

This PR removes the dependency on sphinx_gallery and replaces the existing gallery with a standalone notebook called visualization_gallery.ipynb. However, not all of the links that worked in the gallery work here, since we are now using nbsphinx to render the notebooks (see https://github.com/spatialaudio/nbsphinx/issues/308).

Really important to get @dcherian's feedback on this, as he was the one who originally introduced the gallery. My view is that having everything as notebooks makes examples easier to maintain. But I'm curious to hear other views.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3106/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
467658326 MDExOlB1bGxSZXF1ZXN0Mjk3MjcwNjYw 3105 Switch doc examples to use nbsphinx rabernat 1197350 closed 0     4 2019-07-13T02:28:34Z 2019-07-13T04:53:09Z 2019-07-13T04:52:52Z MEMBER   0 pydata/xarray/pulls/3105

This is the beginning of the docs refactor we have in mind for the sprint tomorrow.

We will merge things first to the scipy19-docs branch so we can make sure things build on RTD.

http://xarray.pydata.org/en/scipy19-docs

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3105/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
218260909 MDU6SXNzdWUyMTgyNjA5MDk= 1340 round-trip performance with save_mfdataset / open_mfdataset rabernat 1197350 closed 0     11 2017-03-30T16:52:26Z 2019-05-01T22:12:06Z 2019-05-01T22:12:06Z MEMBER      

I have encountered some major performance bottlenecks in trying to write and then read multi-file netcdf datasets.

I start with an xarray dataset created by xgcm with the following repr: <xarray.Dataset> Dimensions: (XC: 400, XG: 400, YC: 400, YG: 400, Z: 40, Zl: 40, Zp1: 41, Zu: 40, layer_1TH_bounds: 43, layer_1TH_center: 42, layer_1TH_interface: 41, time: 1566) Coordinates: iter (time) int64 8294400 8294976 8295552 8296128 ... * time (time) int64 8294400 8294976 8295552 8296128 ... * XC (XC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ... * YG (YG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ... * XG (XG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ... * YC (YC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ... * Zu (Zu) >f4 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 -91.0 ... * Zl (Zl) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ... * Zp1 (Zp1) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ... * Z (Z) >f4 -5.0 -15.0 -25.0 -36.0 -49.0 -64.0 -81.5 ... rAz (YG, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... dyC (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... rAw (YC, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... dxC (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... dxG (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... dyG (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... rAs (YG, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... Depth (YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... rA (YC, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... PHrefF (Zp1) >f4 0.0 98.1 196.2 294.3 412.02 549.36 706.32 ... PHrefC (Z) >f4 49.05 147.15 245.25 353.16 480.69 627.84 ... drC (Zp1) >f4 5.0 10.0 10.0 11.0 13.0 15.0 17.5 20.5 ... drF (Z) >f4 10.0 10.0 10.0 12.0 14.0 16.0 19.0 22.0 ... hFacC (Z, YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... hFacW (Z, YC, XG) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... hFacS (Z, YG, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... * layer_1TH_bounds (layer_1TH_bounds) >f4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 ... * layer_1TH_interface (layer_1TH_interface) >f4 0.0 0.2 0.4 0.6 0.8 1.0 ... * layer_1TH_center (layer_1TH_center) float32 -0.1 0.1 0.3 0.5 0.7 0.9 ... Data variables: T (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... U (time, Z, YC, XG) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... V (time, Z, YG, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... S (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... Eta (time, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... W (time, Zl, YC, XC) float32 -0.0 -0.0 -0.0 -0.0 -0.0 ...

An important point to note is that there are lots of "non-dimension coordinates" corresponding to various parameters of the numerical grid.

I save this dataset to a multi-file netCDF dataset as follows:

```python
iternums, datasets = zip(*ds.groupby('time'))
paths = [outdir + 'xmitgcm_data.%010d.nc' % it for it in iternums]
xr.save_mfdataset(datasets, paths)
```

This takes many hours to run, since it has to read and write all the data. (I think there are some performance issues here too, related to how dask schedules the read / write tasks, but that is probably a separate issue.)

Then I try to re-load this dataset:

```python
ds_nc = xr.open_mfdataset('xmitgcm_data.*.nc')
```

This raises an error:

```
ValueError: too many different dimensions to concatenate: {'YG', 'Z', 'Zl', 'Zp1', 'layer_1TH_interface', 'YC', 'XC', 'layer_1TH_center', 'Zu', 'layer_1TH_bounds', 'XG'}
```

I need to specify concat_dim='time' in order to properly concatenate the data. It seems like this should be unnecessary, since I am reading back data that was just written with xarray, but I understand why (the dimensions of the Data Variables in each file are just Z, YC, XC, with no time dimension). Once I do that, it works, but it takes 18 minutes to load the dataset. I assume this is because it has to check the compatibility of all all the non-dimension coordinates.
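For anyone hitting the same slowdown, the usual mitigation in later xarray versions is to tell `open_mfdataset` not to compare the many invariant grid coordinates (a sketch; `combine='nested'` and `compat='override'` postdate the version this issue was filed against, so treat this as an assumption rather than a fix available at the time):

```python
ds_nc = xr.open_mfdataset(
    'xmitgcm_data.*.nc',
    combine='nested',
    concat_dim='time',
    data_vars='minimal',
    coords='minimal',
    compat='override',   # take coordinates from the first file instead of comparing all
)
```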

I just thought I would document this, because 18 minutes seems way too long to load a dataset.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1340/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
431199282 MDExOlB1bGxSZXF1ZXN0MjY4OTI3MjU0 2881 decreased pytest verbosity rabernat 1197350 closed 0     1 2019-04-09T21:12:50Z 2019-04-09T23:36:01Z 2019-04-09T23:34:22Z MEMBER   0 pydata/xarray/pulls/2881

This removes the --verbose flag from py.test in .travis.yml.

  • [x] Closes #2880
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2881/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
431156227 MDU6SXNzdWU0MzExNTYyMjc= 2880 pytest output on travis is too verbose rabernat 1197350 closed 0     1 2019-04-09T19:39:46Z 2019-04-09T23:34:22Z 2019-04-09T23:34:22Z MEMBER      

I have to scroll over an immense amount of passing tests on travis before I can get to the failures. (example) This is pretty annoying.

The number of tests in xarray has exploded recently. This is good! But maybe we should turn off --verbose on Travis.

What does @pydata/xarray think?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2880/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
373121666 MDU6SXNzdWUzNzMxMjE2NjY= 2503 Problems with distributed and opendap netCDF endpoint rabernat 1197350 closed 0     26 2018-10-23T17:48:20Z 2019-04-09T12:02:01Z 2019-04-09T12:02:01Z MEMBER      

Code Sample

I am trying to load a dataset from an opendap endpoint using xarray, netCDF4, and distributed. I am having a problem only with non-local distributed schedulers (KubeCluster specifically). This could plausibly be an xarray, dask, or pangeo issue, but I have decided to post it here.

```python
import xarray as xr
import dask

# create dataset from Unidata's test opendap endpoint, chunked in time
url = 'http://remotetest.unidata.ucar.edu/thredds/dodsC/testdods/coads_climatology.nc'
ds = xr.open_dataset(url, decode_times=False, chunks={'TIME': 1})

# all these work
with dask.config.set(scheduler='synchronous'):
    ds.SST.compute()
with dask.config.set(scheduler='processes'):
    ds.SST.compute()
with dask.config.set(scheduler='threads'):
    ds.SST.compute()

# this works too
from dask.distributed import Client
local_client = Client()
with dask.config.set(get=local_client):
    ds.SST.compute()

# but this does not
from dask_kubernetes import KubeCluster  # import added here for completeness
cluster = KubeCluster(n_workers=2)
kube_client = Client(cluster)
with dask.config.set(get=kube_client):
    ds.SST.compute()
```

In the worker log, I see the following sort of errors. distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 5, 0, 0) distributed.worker - INFO - Dependent not found: open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf 0 . Asking scheduler distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 3, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 0, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 1, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 7, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 6, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 2, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 9, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 8, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 11, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 10, 0, 0) distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 4, 0, 0) distributed.worker - WARNING - Compute Failed Function: getter args: (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=_ElementwiseFunctionArray(LazilyOuterIndexedArray(array=<xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x7f45d6fcbb38>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))), func=functools.partial(<function _apply_mask at 0x7f45d70507b8>, encoded_fill_values={-1e+34}, decoded_fill_value=nan, dtype=dtype('float32')), dtype=dtype('float32')), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))))), (slice(3, 4, None), slice(0, 90, None), slice(0, 180, None))) kwargs: {} Exception: RuntimeError('NetCDF: Not a valid ID',) Ultimately, the error comes from the netCDF library: RuntimeError('NetCDF: Not a valid ID',)

This seems like something to do with serialization of the netCDF store. The worker images have identical netcdf version (and all other package versions). I am at a loss for how to debug further.
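One way to narrow this down (a diagnostic sketch I would try, not something from the report above): round-trip the lazy dataset through cloudpickle locally and then compute, which mimics what happens when the graph is shipped to a remote worker.

```python
import cloudpickle

# if this fails locally, the problem is in (de)serializing the netCDF/opendap store
ds2 = cloudpickle.loads(cloudpickle.dumps(ds))
ds2.SST.compute()
```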

Output of xr.show_versions()

xr.show_versions() ``` INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.4.111+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.8 pandas: 0.23.2 numpy: 1.15.1 scipy: 1.1.0 netCDF4: 1.4.1 h5netcdf: None h5py: None Nio: None zarr: 2.2.0 bottleneck: None cyordereddict: None dask: 0.18.2 distributed: 1.22.1 matplotlib: 2.2.3 cartopy: None seaborn: None setuptools: 39.2.0 pip: 18.0 conda: 4.5.4 pytest: 3.8.0 IPython: 6.4.0 sphinx: None ``` `cube_client.get_versions(check=True)` ``` {'scheduler': {'host': (('python', '3.6.3.final.0'), ('python-bits', 64), ('OS', 'Linux'), ('OS-release', '4.4.111+'), ('machine', 'x86_64'), ('processor', 'x86_64'), ('byteorder', 'little'), ('LC_ALL', 'en_US.UTF-8'), ('LANG', 'en_US.UTF-8'), ('LOCALE', 'en_US.UTF-8')), 'packages': {'required': (('dask', '0.18.2'), ('distributed', '1.22.1'), ('msgpack', '0.5.6'), ('cloudpickle', '0.5.5'), ('tornado', '5.0.2'), ('toolz', '0.9.0')), 'optional': (('numpy', '1.15.1'), ('pandas', '0.23.2'), ('bokeh', '0.12.16'), ('lz4', '1.1.0'), ('blosc', '1.5.1'))}}, 'workers': {'tcp://10.20.8.4:36940': {'host': (('python', '3.6.3.final.0'), ('python-bits', 64), ('OS', 'Linux'), ('OS-release', '4.4.111+'), ('machine', 'x86_64'), ('processor', 'x86_64'), ('byteorder', 'little'), ('LC_ALL', 'en_US.UTF-8'), ('LANG', 'en_US.UTF-8'), ('LOCALE', 'en_US.UTF-8')), 'packages': {'required': (('dask', '0.18.2'), ('distributed', '1.22.1'), ('msgpack', '0.5.6'), ('cloudpickle', '0.5.5'), ('tornado', '5.0.2'), ('toolz', '0.9.0')), 'optional': (('numpy', '1.15.1'), ('pandas', '0.23.2'), ('bokeh', '0.12.16'), ('lz4', '1.1.0'), ('blosc', '1.5.1'))}}, 'tcp://10.21.177.254:42939': {'host': (('python', '3.6.3.final.0'), ('python-bits', 64), ('OS', 'Linux'), ('OS-release', '4.4.111+'), ('machine', 'x86_64'), ('processor', 'x86_64'), ('byteorder', 'little'), ('LC_ALL', 'en_US.UTF-8'), ('LANG', 'en_US.UTF-8'), ('LOCALE', 'en_US.UTF-8')), 'packages': {'required': (('dask', '0.18.2'), ('distributed', '1.22.1'), ('msgpack', '0.5.6'), ('cloudpickle', '0.5.5'), ('tornado', '5.0.2'), ('toolz', '0.9.0')), 'optional': (('numpy', '1.15.1'), ('pandas', '0.23.2'), ('bokeh', '0.12.16'), ('lz4', '1.1.0'), ('blosc', '1.5.1'))}}}, 'client': {'host': [('python', '3.6.3.final.0'), ('python-bits', 64), ('OS', 'Linux'), ('OS-release', '4.4.111+'), ('machine', 'x86_64'), ('processor', 'x86_64'), ('byteorder', 'little'), ('LC_ALL', 'en_US.UTF-8'), ('LANG', 'en_US.UTF-8'), ('LOCALE', 'en_US.UTF-8')], 'packages': {'required': [('dask', '0.18.2'), ('distributed', '1.22.1'), ('msgpack', '0.5.6'), ('cloudpickle', '0.5.5'), ('tornado', '5.0.2'), ('toolz', '0.9.0')], 'optional': [('numpy', '1.15.1'), ('pandas', '0.23.2'), ('bokeh', '0.12.16'), ('lz4', '1.1.0'), ('blosc', '1.5.1')]}}} ```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2503/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
209561985 MDU6SXNzdWUyMDk1NjE5ODU= 1282 description of xarray assumes knowledge of pandas rabernat 1197350 closed 0     4 2017-02-22T19:52:54Z 2019-02-26T19:01:47Z 2019-02-26T19:01:46Z MEMBER      

The first sentence a potential new user reads about xarray is

xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures.

Now imagine you had never heard of pandas (like most new Ph.D. students in physical sciences). You would have no idea how useful and powerful xarray was.

I would propose modifying these top-level descriptions to remove the assumption that the user understands pandas. Of course we can still refer to pandas, but a more self-contained description would serve us well.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1282/reactions",
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
396501063 MDExOlB1bGxSZXF1ZXN0MjQyNjY4ODEw 2659 to_dict without data rabernat 1197350 closed 0     14 2019-01-07T14:09:25Z 2019-02-12T21:21:13Z 2019-01-21T23:25:56Z MEMBER   0 pydata/xarray/pulls/2659

This PR provides the ability to export Datasets and DataArrays to a dictionary without the actual data. This could be useful for generating indices of dataset contents to expose to search indices or other automated data discovery tools.

In the process of doing this, I refactored the core dictionary export function to live in the Variable class, since the same code was duplicated in several places.

  • [x] Closes #2656
  • [x] Tests added
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2659/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
324740017 MDU6SXNzdWUzMjQ3NDAwMTc= 2164 holoviews / bokeh doesn't like cftime coords rabernat 1197350 closed 0     16 2018-05-20T20:29:03Z 2019-02-08T00:11:14Z 2019-02-08T00:11:14Z MEMBER      

Code Sample, a copy-pastable example if possible

Consider a simple working example of converting an xarray dataset to holoviews for plotting:

```python
ref_date = '1981-01-01'
ds = xr.DataArray([1, 2, 3],
                  dims=['time'],
                  coords={'time': ('time', [1, 2, 3],
                                   {'units': 'days since %s' % ref_date})}
                  ).to_dataset(name='foo')
with xr.set_options(enable_cftimeindex=True):
    ds = xr.decode_cf(ds)
print(ds)
hv_ds = hv.Dataset(ds)
hv_ds.to(hv.Curve)
```

This gives

```
<xarray.Dataset>
Dimensions:  (time: 3)
Coordinates:
  * time     (time) datetime64[ns] 1981-01-02 1981-01-03 1981-01-04
Data variables:
    foo      (time) int64 ...
```

and the curve renders as expected.

Problem description

Now change ref_date = '0181-01-01' (or anything outside of the valid range for regular pandas datetime index). We get a beautiful new cftimeindex

```
<xarray.Dataset>
Dimensions:  (time: 3)
Coordinates:
  * time     (time) object 0181-01-02 00:00:00 0181-01-03 00:00:00 ...
Data variables:
    foo      (time) int64 ...
```

but holoviews / bokeh doesn't like it ``` /opt/conda/lib/python3.6/site-packages/xarray/coding/times.py:132: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy cftime.datetime objects instead, reason: dates out of range enable_cftimeindex) /opt/conda/lib/python3.6/site-packages/xarray/coding/variables.py:66: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy cftime.datetime objects instead, reason: dates out of range return self.func(self.array[key])


TypeError Traceback (most recent call last) /opt/conda/lib/python3.6/site-packages/IPython/core/formatters.py in call(self, obj, include, exclude) 968 969 if method is not None: --> 970 return method(include=include, exclude=exclude) 971 return None 972 else:

/opt/conda/lib/python3.6/site-packages/holoviews/core/dimension.py in repr_mimebundle(self, include, exclude) 1229 combined and returned. 1230 """ -> 1231 return Store.render(self) 1232 1233

/opt/conda/lib/python3.6/site-packages/holoviews/core/options.py in render(cls, obj) 1287 data, metadata = {}, {} 1288 for hook in hooks: -> 1289 ret = hook(obj) 1290 if ret is None: 1291 continue

/opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in pprint_display(obj) 278 if not ip.display_formatter.formatters['text/plain'].pprint: 279 return None --> 280 return display(obj, raw_output=True) 281 282

/opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in display(obj, raw_output, **kwargs) 248 elif isinstance(obj, (CompositeOverlay, ViewableElement)): 249 with option_state(obj): --> 250 output = element_display(obj) 251 elif isinstance(obj, (Layout, NdLayout, AdjointLayout)): 252 with option_state(obj):

/opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in wrapped(element) 140 try: 141 max_frames = OutputSettings.options['max_frames'] --> 142 mimebundle = fn(element, max_frames=max_frames) 143 if mimebundle is None: 144 return {}, {}

/opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in element_display(element, max_frames) 186 return None 187 --> 188 return render(element) 189 190

/opt/conda/lib/python3.6/site-packages/holoviews/ipython/display_hooks.py in render(obj, kwargs) 63 renderer = renderer.instance(fig='png') 64 ---> 65 return renderer.components(obj, kwargs) 66 67

/opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/renderer.py in components(self, obj, fmt, comm, kwargs) 257 # Bokeh has to handle comms directly in <0.12.15 258 comm = False if bokeh_version < '0.12.15' else comm --> 259 return super(BokehRenderer, self).components(obj,fmt, comm, kwargs) 260 261

/opt/conda/lib/python3.6/site-packages/holoviews/plotting/renderer.py in components(self, obj, fmt, comm, **kwargs) 319 plot = obj 320 else: --> 321 plot, fmt = self._validate(obj, fmt) 322 323 widget_id = None

/opt/conda/lib/python3.6/site-packages/holoviews/plotting/renderer.py in _validate(self, obj, fmt, kwargs) 218 if isinstance(obj, tuple(self.widgets.values())): 219 return obj, 'html' --> 220 plot = self.get_plot(obj, renderer=self, kwargs) 221 222 fig_formats = self.mode_formats['fig'][self.mode]

/opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/renderer.py in get_plot(self_or_cls, obj, doc, renderer) 150 doc = Document() if self_or_cls.notebook_context else curdoc() 151 doc.theme = self_or_cls.theme --> 152 plot = super(BokehRenderer, self_or_cls).get_plot(obj, renderer) 153 plot.document = doc 154 return plot

/opt/conda/lib/python3.6/site-packages/holoviews/plotting/renderer.py in get_plot(self_or_cls, obj, renderer) 205 init_key = tuple(v if d is None else d for v, d in 206 zip(plot.keys[0], defaults)) --> 207 plot.update(init_key) 208 else: 209 plot = obj

/opt/conda/lib/python3.6/site-packages/holoviews/plotting/plot.py in update(self, key) 511 def update(self, key): 512 if len(self) == 1 and ((key == 0) or (key == self.keys[0])) and not self.drawn: --> 513 return self.initialize_plot() 514 item = self.getitem(key) 515 self.traverse(lambda x: setattr(x, '_updated', True))

/opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/element.py in initialize_plot(self, ranges, plot, plots, source) 729 if not self.overlaid: 730 self._update_plot(key, plot, style_element) --> 731 self._update_ranges(style_element, ranges) 732 733 for cb in self.callbacks:

/opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/element.py in _update_ranges(self, element, ranges) 498 if not self.drawn or xupdate: 499 self._update_range(x_range, l, r, xfactors, self.invert_xaxis, --> 500 self._shared['x'], self.logx, streaming) 501 if not self.drawn or yupdate: 502 self._update_range(y_range, b, t, yfactors, self.invert_yaxis,

/opt/conda/lib/python3.6/site-packages/holoviews/plotting/bokeh/element.py in _update_range(self, axis_range, low, high, factors, invert, shared, log, streaming) 525 updates = {} 526 if low is not None and (isinstance(low, util.datetime_types) --> 527 or np.isfinite(low)): 528 updates['start'] = (axis_range.start, low) 529 if high is not None and (isinstance(high, util.datetime_types)

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'' ```

Similar but slightly different errors arise for different holoviews types (e.g. hv.Image) and contexts (using time as a holoviews kdim).

Expected Output

This should work.

I'm not sure if this is really an xarray problem. Maybe it needs a fix in holoviews (or bokeh). But I'm raising it here first since clearly we have introduced this new wrinkle in the stack. Cc'ing @philippjfr since he is the expert on all things holoviews.

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.4.111+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.4 pandas: 0.23.0 numpy: 1.14.3 scipy: 1.1.0 netCDF4: 1.4.0 h5netcdf: None h5py: None Nio: None zarr: 2.2.0 bottleneck: None cyordereddict: None dask: 0.17.5 distributed: 1.21.8 matplotlib: 2.2.2 cartopy: None seaborn: None setuptools: 39.0.1 pip: 10.0.1 conda: 4.3.34 pytest: 3.5.1 IPython: 6.3.1 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2164/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
193657418 MDU6SXNzdWUxOTM2NTc0MTg= 1154 netCDF reading is not prominent in the docs rabernat 1197350 closed 0     7 2016-12-06T01:18:40Z 2019-02-02T06:33:44Z 2019-02-02T06:33:44Z MEMBER      

Just opening an issue to highlight what I think is a problem with the docs.

For me, the primary use of xarray is to read and process existing netCDF data files. @shoyer's popular blog post illustrates this use case extremely well.

However, when I open the docs, I have to dig quite deep before I can see how to read a netCDF file. This could be turning away many potential users. The stuff about netCDF reading is hidden under "Serialization and IO". Many potential users will have no idea what either of these words mean.

IMO the solution to this is to reorganize the docs to make reading netCDF much more prominent and obvious.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1154/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
225734529 MDU6SXNzdWUyMjU3MzQ1Mjk= 1394 autoclose with distributed doesn't seem to work rabernat 1197350 closed 0     9 2017-05-02T15:37:07Z 2019-01-13T19:35:10Z 2019-01-13T19:35:10Z MEMBER      

I am trying to analyze a very large netCDF dataset using xarray and distributed.

I open my dataset with the new autoclose option:

```python
ds = xr.open_mfdataset(ddir + '*.nc', decode_cf=False, autoclose=True)
```

However, when I try some reduction operation (e.g. `ds['Salt'].mean()`), I can see my open file count continue to rise monotonically. Eventually the dask worker dies with `OSError: [Errno 24] Too many open files: '/proc/65644/sta` once I hit the system ulimit.

Am I doing something wrong here? Why are the files not being closed? cc: @pwolfram

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1394/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
225774140 MDU6SXNzdWUyMjU3NzQxNDA= 1396 selecting a point from an mfdataset rabernat 1197350 closed 0     12 2017-05-02T18:02:50Z 2019-01-13T06:32:45Z 2019-01-13T06:32:45Z MEMBER      

Sorry to be opening so many vague performance issues. I am really having a hard time with my current dataset, which is exposing certain limitations of xarray and dask in a way none of my previous work has done.

I have a directory full of netCDF4 files. There are 1754 files, each 8.1GB in size, each representing a single model timestep. So there is ~14 TB of data total. (In addition to the time-dependent output, there is a single file with information about the grid.)

Imagine I want to extract a timeseries from a single point (indexed by k, j, i) in this simulation. Without xarray, I would do something like this:

```python
import netCDF4
ts = np.zeros(len(all_files))
for n, fname in enumerate(tqdm(all_files)):
    nc = netCDF4.Dataset(fname)
    ts[n] = nc.variables['Salt'][k, j, i]
    nc.close()
```

This goes reasonably quickly: tqdm gives `[02:38<00:00, 11.56it/s]`.

I could do the same sort of loop using xarray:

```python
import xarray as xr
ts = np.zeros(len(all_files))
for n, fname in enumerate(tqdm(all_files)):
    ds = xr.open_dataset(fname)
    ts[n] = ds['Salt'][k, j, i]
    ds.close()
```

This has a <50% performance overhead: `[03:29<00:00, 8.74it/s]`. Totally acceptable.

Of course, what I really want is to avoid a loop and deal with the whole dataset as a single self-contained object. python ds = xr.open_mfdataset(all_files, decode_cf=False, autoclose=True) This alone takes between 4-5 minutes to run (see #1385). If I want to print the repr, it takes another 3 minutes or so to print(ds). The full dataset looks like this: python <xarray.Dataset> Dimensions: (i: 2160, i_g: 2160, j: 2160, j_g: 2160, k: 90, k_l: 90, k_p1: 91, k_u: 90, time: 1752) Coordinates: * j (j) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ... * k (k) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ... * j_g (j_g) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ... * i (i) int64 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 ... * k_p1 (k_p1) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... * k_u (k_u) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... * i_g (i_g) int64 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 ... * k_l (k_l) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... * time (time) float64 2.592e+05 2.628e+05 2.664e+05 2.7e+05 2.736e+05 ... Data variables: face (time) int64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... PhiBot (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... oceQnet (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... SIvice (time, j_g, i) float32 0.0516454 0.0523205 0.0308559 ... SIhsalt (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... oceFWflx (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... V (time, k, j_g, i) float32 0.0491903 0.0496442 0.0276739 ... iter (time) int64 10368 10512 10656 10800 10944 11088 11232 11376 ... oceQsw (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... oceTAUY (time, j_g, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... Theta (time, k, j, i) float32 -1.31868 -1.27825 -1.21401 -1.17964 ... SIhsnow (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... U (time, k, j, i_g) float32 0.0281392 0.0203967 0.0075199 ... SIheff (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... SIuice (time, j, i_g) float32 -0.041163 -0.0487612 -0.0614498 ... SIarea (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... Salt (time, k, j, i) float32 33.7534 33.7652 33.7755 33.7723 ... oceSflux (time, j, i) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... W (time, k_l, j, i) float32 -2.27453e-05 -2.28018e-05 ... oceTAUX (time, j, i_g) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... Eta (time, j, i) float32 -1.28886 -1.28811 -1.2871 -1.28567 ... YC (j, i) float32 -57.001 -57.001 -57.001 -57.001 -57.001 -57.001 ... YG (j_g, i_g) float32 -57.0066 -57.0066 -57.0066 -57.0066 ... XC (j, i) float32 -15.4896 -15.4688 -15.4479 -15.4271 -15.4062 ... XG (j_g, i_g) float32 -15.5 -15.4792 -15.4583 -15.4375 -15.4167 ... Zp1 (k_p1) float32 0.0 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 ... Z (k) float32 -0.5 -1.57 -2.79 -4.185 -5.78 -7.595 -9.66 -12.01 ... Zl (k_l) float32 0.0 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 ... Zu (k_u) float32 -1.0 -2.14 -3.44 -4.93 -6.63 -8.56 -10.76 -13.26 ... rA (j, i) float32 1.5528e+06 1.5528e+06 1.5528e+06 1.5528e+06 ... rAw (j, i_g) float32 1.5528e+06 1.5528e+06 1.5528e+06 1.5528e+06 ... rAs (j_g, i) float32 9.96921e+36 9.96921e+36 9.96921e+36 ... rAz (j_g, i_g) float32 1.55245e+06 1.55245e+06 1.55245e+06 ... dxG (j_g, i) float32 1261.27 1261.27 1261.27 1261.27 1261.27 ... dyG (j, i_g) float32 1230.96 1230.96 1230.96 1230.96 1230.96 ... 
dxC (j, i_g) float32 1261.46 1261.46 1261.46 1261.46 1261.46 ... Depth (j, i) float32 4578.67 4611.09 4647.6 4674.88 4766.75 4782.64 ... dyC (j_g, i) float32 1230.86 1230.86 1230.86 1230.86 1230.86 ... PHrefF (k_p1) float32 0.0 9.81 20.9934 33.7464 48.3633 65.0403 ... drF (k) float32 1.0 1.14 1.3 1.49 1.7 1.93 2.2 2.5 2.84 3.21 3.63 ... PHrefC (k) float32 4.905 15.4017 27.3699 41.0549 56.7018 74.507 ... drC (k_p1) float32 0.5 1.07 1.22 1.395 1.595 1.815 2.065 2.35 2.67 ... hFacW (k, j, i_g) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... hFacS (k, j_g, i) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... hFacC (k, j, i) float32 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... Attributes: coordinates: face

Now, to extract the same timeseries, I would like to say

```python
ts = ds.Salt[:, k, j, i].load()
```

I monitor what is happening under the hood when I call this by using netdata and the dask.distributed dashboard, using only a single process and thread. First, all the files are opened (see #1394). Then they start getting read. Each read takes between 10 and 30 seconds, and the memory usage starts increasing steadily. My impression is that the entire dataset is being read into memory for concatenation. (I have dumped out the dask graph in case anyone can make sense of it.) I have never let this calculation complete, as it looks like it would eat up all the memory on my system...plus it's extremely slow.

To me, this seems like a failure of lazy indexing. I naively expected that the underlying file access would work similar to my loop, perhaps even in parallel.
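One workaround sketch worth noting (whether it actually avoids the memory blow-up in this case is an assumption on my part): use open_mfdataset's preprocess hook to subset each file down to the point of interest before concatenation.

```python
def select_point(ds):
    # keep only the variable and single point we care about
    return ds[['Salt']].isel(k=k, j=j, i=i)

ds_point = xr.open_mfdataset(all_files, decode_cf=False, autoclose=True,
                             preprocess=select_point)
ts = ds_point.Salt.load()
```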

Can anyone shed some light on what might be going wrong?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1396/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
108623921 MDU6SXNzdWUxMDg2MjM5MjE= 591 distarray backend? rabernat 1197350 closed 0     5 2015-09-28T09:49:52Z 2019-01-13T04:11:08Z 2019-01-13T04:11:08Z MEMBER      

This is probably a long shot, but I think a distarray backend could potentially be very useful in xray. Distarray implements the numpy interface, so it should be possible in principle.

Distarray has a different architecture from dask (using MPI for parallelization) and in this way is more similar to traditional HPC codes. The application I have in mind is very high resolution GCM output where one wants to tile the data spatially across multiple nodes on a cluster. (This is how a GCM itself works.)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/591/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
280626621 MDU6SXNzdWUyODA2MjY2MjE= 1770 slow performance when storing datasets in gcsfs-backed zarr stores rabernat 1197350 closed 0     11 2017-12-08T21:46:32Z 2019-01-13T03:52:46Z 2019-01-13T03:52:46Z MEMBER      

We are working on integrating zarr with xarray. In the process, we have encountered a performance issue that I am documenting here. At this point, it is not clear if the core issue is in zarr, gcsfs, dask, or xarray. I originally started posting this in zarr, but in the process, I became more convinced the issue was with xarray.

Dask Only

Here is an example using only dask and zarr.

```python
# connect to a local dask scheduler
from dask.distributed import Client
client = Client('tcp://129.236.20.45:8786')

# create a big dask array
import dask.array as dsa
shape = (30, 50, 1080, 2160)
chunkshape = (1, 1, 1080, 2160)
ar = dsa.random.random(shape, chunks=chunkshape)

# connect to gcs and create MutableMapping
import gcsfs
fs = gcsfs.GCSFileSystem(project='pangeo-181919')
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test999', gcs=fs, check=True, create=True)

# create a zarr array to store into
import zarr
za = zarr.create(ar.shape, chunks=chunkshape, dtype=ar.dtype, store=gcsmap)

# write it
ar.store(za, lock=False)
```

When you do this, it spends a long time serializing stuff before the computation starts.

For a more fine-grained look at the process, one can instead do

```python
delayed_obj = ar.store(za, compute=False, lock=False)
%prun future = client.compute(delayed_obj)
```

This reveals that the pre-compute step takes about 10s. Monitoring the distributed scheduler, I can see that, once the computation starts, it takes about 1:30 to store the array (27 GB). (This is actually not bad!)

Some debugging by @mrocklin revealed that the following step is quite slow

```python
import cloudpickle
%time len(cloudpickle.dumps(za))
```

On my system, this was taking close to 1s. In contrast, when the store behind `za` is a plain filesystem path rather than a GCSMap, it is in microsecond territory. So pickling GCSMap objects is relatively slow. I'm not sure whether this pickling happens when we call client.compute or during the task execution.

There is room for improvement here, but overall, zarr + gcsfs + dask seem to integrate well and give decent performance.

Xarray

This get much worse once xarray enters the picture. (Note that this example requires the xarray PR pydata/xarray#1528, which has not been merged yet.)

```python
# wrap the dask array in an xarray Dataset
import xarray as xr
import numpy as np
_, _, Ny, Nx = shape   # 1080, 2160
ds = xr.DataArray(ar, dims=['time', 'depth', 'lat', 'lon'],
                  coords={'lat': np.linspace(-90, 90, Ny),
                          'lon': np.linspace(0, 360, Nx)}).to_dataset(name='temperature')

# store to a different bucket
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test1', gcs=fs, check=True, create=True)
ds.to_zarr(store=gcsmap, mode='w')
```

Now the store step takes 18 minutes. Most of this time, is upfront, during which there is little CPU activity and no network activity. After about 15 minutes or so, it finally starts computing, at which point the writes to gcs proceed more-or-less at the same rate as with the dask-only example.

Profiling the to_zarr with snakeviz reveals that it is spending most of its time waiting for thread locks.

I don't understand this, since I specifically eliminated locks when storing the zarr arrays.
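For a finer-grained look at the xarray path, the same trick as above can be applied by asking to_zarr for a delayed object first (a sketch; whether `compute=False` is available on to_zarr in the development version used here is an assumption):

```python
# build the graph without executing it, then profile only the submission step
delayed_store = ds.to_zarr(store=gcsmap, mode='w', compute=False)
%prun future = client.compute(delayed_store)
```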

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1770/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
362866468 MDExOlB1bGxSZXF1ZXN0MjE3NDYzMTU4 2430 WIP: revise top-level package description rabernat 1197350 closed 0     10 2018-09-22T15:35:47Z 2019-01-07T01:04:19Z 2019-01-06T00:31:57Z MEMBER   0 pydata/xarray/pulls/2430

I have often complained that xarray's top-level package description assumes that the user knows all about pandas. I think this alienates many new users.

This is a first draft at revising that top-level description. Feedback from the community very needed here.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2430/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
389594572 MDU6SXNzdWUzODk1OTQ1NzI= 2597 add dayofyear to CFTimeIndex rabernat 1197350 closed 0     2 2018-12-11T04:41:59Z 2018-12-11T19:28:31Z 2018-12-11T19:28:31Z MEMBER      

I have noticed that CFTimeIndex does not provide the .dayofyear attribute, while pandas' DatetimeIndex does. Implementing this attribute would make certain grouping operations much easier on non-standard calendars.

Perhaps there are other similar attributes. I don't know if .dayofweek makes sense for non-standard calendars.
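For context, the kind of operation this would unlock (a sketch, assuming the accessor ends up behaving the same way it does for a pandas-backed time index):

```python
# daily climatology on a non-standard calendar
clim = ds.groupby("time.dayofyear").mean("time")
```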

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2597/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
382497709 MDExOlB1bGxSZXF1ZXN0MjMyMTkwMjg5 2559 Zarr consolidated rabernat 1197350 closed 0     19 2018-11-20T04:39:41Z 2018-12-05T14:58:58Z 2018-12-04T23:51:00Z MEMBER   0 pydata/xarray/pulls/2559

This PR adds support for reading and writing of consolidated metadata in zarr stores.

  • [x] Closes #2558 (remove if there is no corresponding issue, which should only be the case for minor changes)
  • [x] Tests added (for all bug fixes or enhancements)
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2559/reactions",
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
382043672 MDU6SXNzdWUzODIwNDM2NzI= 2558 how to incorporate zarr's new open_consolidated method? rabernat 1197350 closed 0     1 2018-11-19T03:28:40Z 2018-12-04T23:51:00Z 2018-12-04T23:51:00Z MEMBER      

Zarr has a new feature called consolidated metadata. This feature will make it much faster to open certain zarr datasets, because all the metadata needed to construct the xarray dataset will live in a single .json file.

To use this new feature, the new function zarr.open_consolidated needs to be called. So it won't work with xarray out of the box. We need to decide how to add support for this at the xarray level.

I am seeking feedback on what API people would like to see before starting a PR. My proposal is to add a new keyword argument to xarray.open_zarr called consolidated (default = False). An alternative would be to automatically try open_consolidated and fall back on the standard open_group function if that fails.
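Under the keyword-argument proposal, usage would look roughly like this (a sketch only; the store path is a placeholder and the keyword name is open for discussion):

```python
import xarray as xr

# read all metadata from the single consolidated .json instead of
# touching every .zarray / .zattrs key individually
ds = xr.open_zarr('path/to/dataset.zarr', consolidated=True)
```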

I played around with this a bit and realized that https://github.com/zarr-developers/zarr/issues/336 needs to be resolved before we can do the xarray side.

cc @martindurant, who might want to weigh in on what would be most convenient for intake.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2558/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
301891754 MDU6SXNzdWUzMDE4OTE3NTQ= 1955 Skipping / failing zarr tests rabernat 1197350 closed 0     3 2018-03-02T20:17:31Z 2018-10-29T00:25:34Z 2018-10-29T00:25:34Z MEMBER      

Zarr tests are currently getting skipped on our main testing environments (because the zarr version is less than 2.2): https://travis-ci.org/pydata/xarray/jobs/348350073#L1264

And failing in the py36-zarr-dev environment https://travis-ci.org/pydata/xarray/jobs/348350087#L4989

I'm not sure how this regression occurred, but the zarr tests have been failing for a long time, e.g. https://travis-ci.org/pydata/xarray/jobs/342651302

Possibly related to #1954

cc @jhamman

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1955/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
332762756 MDU6SXNzdWUzMzI3NjI3NTY= 2234 fillna error with distributed rabernat 1197350 closed 0     3 2018-06-15T12:54:54Z 2018-06-15T13:13:54Z 2018-06-15T13:13:54Z MEMBER      

Code Sample, a copy-pastable example if possible

The following code works with the default dask threaded scheduler.

```python
import numpy as np
import xarray as xr

da = xr.DataArray([1, 1, 1, np.nan]).chunk()
da.fillna(0.).mean().load()
```

It fails with distributed. I see the following error on the client side: ```


KilledWorker Traceback (most recent call last) <ipython-input-7-5ed3c292af2e> in <module>() ----> 1 da.fillna(0.).mean().load()

/opt/conda/lib/python3.6/site-packages/xarray/core/dataarray.py in load(self, kwargs) 631 dask.array.compute 632 """ --> 633 ds = self._to_temp_dataset().load(kwargs) 634 new = self._from_temp_dataset(ds) 635 self._variable = new._variable

/opt/conda/lib/python3.6/site-packages/xarray/core/dataset.py in load(self, kwargs) 489 490 # evaluate all the dask arrays simultaneously --> 491 evaluated_data = da.compute(*lazy_data.values(), kwargs) 492 493 for k, data in zip(lazy_data, evaluated_data):

/opt/conda/lib/python3.6/site-packages/dask/base.py in compute(args, kwargs) 398 keys = [x.dask_keys() for x in collections] 399 postcomputes = [x.dask_postcompute() for x in collections] --> 400 results = schedule(dsk, keys, kwargs) 401 return repack([f(r, a) for r, (f, a) in zip(results, postcomputes)]) 402

/opt/conda/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, **kwargs) 2157 try: 2158 results = self.gather(packed, asynchronous=asynchronous, -> 2159 direct=direct) 2160 finally: 2161 for f in futures.values():

/opt/conda/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous) 1560 return self.sync(self._gather, futures, errors=errors, 1561 direct=direct, local_worker=local_worker, -> 1562 asynchronous=asynchronous) 1563 1564 @gen.coroutine

/opt/conda/lib/python3.6/site-packages/distributed/client.py in sync(self, func, args, kwargs) 650 return future 651 else: --> 652 return sync(self.loop, func, args, **kwargs) 653 654 def repr(self):

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, args, kwargs) 273 e.wait(10) 274 if error[0]: --> 275 six.reraise(error[0]) 276 else: 277 return result[0]

/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb) 691 if value.traceback is not tb: 692 raise value.with_traceback(tb) --> 693 raise value 694 finally: 695 value = None

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in f() 258 yield gen.moment 259 thread_state.asynchronous = True --> 260 result[0] = yield make_coro() 261 except Exception as exc: 262 error[0] = sys.exc_info()

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self) 1097 1098 try: -> 1099 value = future.result() 1100 except Exception: 1101 self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self) 1105 if exc_info is not None: 1106 try: -> 1107 yielded = self.gen.throw(*exc_info) 1108 finally: 1109 # Break up a reference to itself

/opt/conda/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker) 1437 six.reraise(type(exception), 1438 exception, -> 1439 traceback) 1440 if errors == 'skip': 1441 bad_keys.add(key)

/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb) 691 if value.traceback is not tb: 692 raise value.with_traceback(tb) --> 693 raise value 694 finally: 695 value = None

KilledWorker: ("('isna-mean_chunk-where-mean_agg-aggregate-74ec0f30171c1c667640f1f18df5f84b',)", 'tcp://10.20.197.7:43357') While the worker logs show this: distributed.worker - ERROR - Can't get attribute 'isna' on <module 'pandas.core.dtypes.missing' from '/opt/conda/lib/python3.6/site-packages/pandas/core/dtypes/missing.py'> Traceback (most recent call last): File "/opt/conda/lib/python3.6/site-packages/distributed/worker.py", line 346, in handle_scheduler self.ensure_computing]) File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(exc_info) File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 361, in handle_stream msgs = yield comm.read() File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(exc_info) File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 203, in read deserializers=deserializers) File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/opt/conda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 307, in wrapper yielded = next(result) File "/opt/conda/lib/python3.6/site-packages/distributed/comm/utils.py", line 79, in from_frames res = _from_frames() File "/opt/conda/lib/python3.6/site-packages/distributed/comm/utils.py", line 65, in _from_frames deserializers=deserializers) File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/core.py", line 122, in loads value = _deserialize(head, fs, deserializers=deserializers) File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/serialize.py", line 236, in deserialize return loads(header, frames) File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/serialize.py", line 58, in pickle_loads return pickle.loads(b''.join(frames)) File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads return pickle.loads(x) AttributeError: Can't get attribute 'isna' on <module 'pandas.core.dtypes.missing' from '/opt/conda/lib/python3.6/site-packages/pandas/core/dtypes/missing.py'> ```

This could very well be a distributed issue. Or a pandas issue. I'm not too sure what is going on. Why is pandas even involved at all?

Problem description

This should not raise an error. It worked fine in previous versions, but something in our latest environment has caused it to break.

Expected Output

```
<xarray.DataArray ()>
array(0.75)
```

Output of xr.show_versions()

This is running in the latest pangeo.pydata.org environment (https://github.com/pangeo-data/helm-chart/pull/29). @mrocklin picked a custom set of dask / distributed commits to install.

``` INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.4.111+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.7 pandas: 0.23.1 numpy: 1.14.5 scipy: 1.1.0 netCDF4: 1.3.1 h5netcdf: None h5py: None Nio: None zarr: 2.2.0 bottleneck: None cyordereddict: None dask: 0.17.4+51.g0a7fe8de distributed: 1.21.8+54.g7909f27d matplotlib: 2.2.2 cartopy: None seaborn: None setuptools: 39.2.0 pip: 10.0.1 conda: 4.5.4 pytest: 3.6.1 IPython: 6.4.0 sphinx: None ```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2234/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
323359733 MDU6SXNzdWUzMjMzNTk3MzM= 2135 use CF conventions to enhance plot labels rabernat 1197350 closed 0     4 2018-05-15T19:53:51Z 2018-06-02T00:10:26Z 2018-06-02T00:10:26Z MEMBER      

Elsewhere in xarray we use CF conventions to help with automatic decoding of datasets. Here I propose we consider using CF metadata conventions to improve the automatic labelling of plots. If datasets declare long_name, standard_name, and units attributes, we could use these instead of the variable name to label the relevant axes / colorbars. This feature would have helped me avoid several past mistakes due to my failure to examine the units attribute (e.g. data given in cm when I assumed m).

Code Sample, a copy-pastable example if possible

Here I create some data with relevant attributes

```python
import xarray as xr
import numpy as np

ds = xr.Dataset({'foo': ('x', np.random.rand(10),
                         {'long_name': 'height', 'units': 'm'})},
                coords={'x': ('x', np.arange(10),
                              {'long_name': 'distance', 'units': 'km'})})
ds.foo.plot()
```

Problem description

We have neglected the variable attributes, which would provide better axis labels.

Expected Output

Consider this instead:

```python
import matplotlib.pyplot as plt

def label_from_attrs(da):
    attrs = da.attrs
    if 'long_name' in attrs:
        name = attrs['long_name']
    elif 'standard_name' in attrs:
        name = attrs['standard_name']
    else:
        name = da.name
    if 'units' in da.attrs:
        units = ' [{}]'.format(da.attrs['units'])
    else:
        units = ''
    label = name + units
    return label

ds.foo.plot()
plt.xlabel(label_from_attrs(ds.x))
plt.ylabel(label_from_attrs(ds.foo))
```

I feel like this would be a sensible default. But it would be a breaking change. We could make it optional with a keyword like labels_from_attrs=True.

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.5.final.0 python-bits: 64 OS: Linux OS-release: 4.4.111+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.3+dev13.g98373f0 pandas: 0.22.0 numpy: 1.14.3 scipy: 1.0.1 netCDF4: 1.3.1 h5netcdf: 0.5.1 h5py: 2.7.1 Nio: None zarr: 2.2.1.dev2 bottleneck: 1.2.1 cyordereddict: None dask: 0.17.4 distributed: 1.21.8 matplotlib: 2.2.2 cartopy: None seaborn: None setuptools: 39.1.0 pip: 9.0.1 conda: 4.3.29 pytest: 3.5.1 IPython: 6.3.1 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2135/reactions",
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
180516114 MDU6SXNzdWUxODA1MTYxMTQ= 1026 multidim groupby on dask arrays: dask.array.reshape error rabernat 1197350 closed 0     17 2016-10-02T14:55:25Z 2018-05-24T17:59:31Z 2018-05-24T17:59:31Z MEMBER      

If I try to run a groupby operation using a multidimensional group, I get an error from dask about "dask.array.reshape requires that reshaped dimensions after the first contain at most one chunk".

This error is arises with dask 0.11.0 but NOT dask 0.8.0.

Consider the following test example:

```python
import dask.array as da
import xarray as xr

nz, ny, nx = (10, 20, 30)
data = da.ones((nz, ny, nx), chunks=(5, ny, nx))
coord_2d = da.random.random((ny, nx), chunks=(ny, nx)) > 0.5
ds = xr.Dataset({'thedata': (('z', 'y', 'x'), data)},
                coords={'thegroup': (('y', 'x'), coord_2d)})

# this works fine
ds.thedata.groupby('thegroup')
```

Now I rechunk one of the later dimensions and group again:

```python
ds.chunk({'x': 5}).thedata.groupby('thegroup')
```

This raises the following error and stack trace

``` ValueError Traceback (most recent call last) <ipython-input-16-1b0095ee24a0> in <module>() ----> 1 ds.chunk({'x': 5}).thedata.groupby('thegroup')

/Users/rpa/RND/open_source/xray/xarray/core/common.pyc in groupby(self, group, squeeze) 343 if isinstance(group, basestring): 344 group = self[group] --> 345 return self.groupby_cls(self, group, squeeze=squeeze) 346 347 def groupby_bins(self, group, bins, right=True, labels=None, precision=3,

/Users/rpa/RND/open_source/xray/xarray/core/groupby.pyc in init(self, obj, group, squeeze, grouper, bins, cut_kwargs) 170 # the copy is necessary here, otherwise read only array raises error 171 # in pandas: https://github.com/pydata/pandas/issues/12813> --> 172 group = group.stack({stacked_dim_name: orig_dims}).copy() 173 obj = obj.stack({stacked_dim_name: orig_dims}) 174 self._stacked_dim = stacked_dim_name

/Users/rpa/RND/open_source/xray/xarray/core/dataarray.pyc in stack(self, dimensions) 857 DataArray.unstack 858 """ --> 859 ds = self._to_temp_dataset().stack(dimensions) 860 return self._from_temp_dataset(ds) 861

/Users/rpa/RND/open_source/xray/xarray/core/dataset.pyc in stack(self, **dimensions) 1359 result = self 1360 for new_dim, dims in dimensions.items(): -> 1361 result = result._stack_once(dims, new_dim) 1362 return result 1363

/Users/rpa/RND/open_source/xray/xarray/core/dataset.pyc in _stack_once(self, dims, new_dim) 1322 shape = [self.dims[d] for d in vdims] 1323 exp_var = var.expand_dims(vdims, shape) -> 1324 stacked_var = exp_var.stack(**{new_dim: dims}) 1325 variables[name] = stacked_var 1326 else:

/Users/rpa/RND/open_source/xray/xarray/core/variable.pyc in stack(self, **dimensions) 801 result = self 802 for new_dim, dims in dimensions.items(): --> 803 result = result._stack_once(dims, new_dim) 804 return result 805

/Users/rpa/RND/open_source/xray/xarray/core/variable.pyc in _stack_once(self, dims, new_dim) 771 772 new_shape = reordered.shape[:len(other_dims)] + (-1,) --> 773 new_data = reordered.data.reshape(new_shape) 774 new_dims = reordered.dims[:len(other_dims)] + (new_dim,) 775

/Users/rpa/anaconda/lib/python2.7/site-packages/dask/array/core.pyc in reshape(self, *shape) 1101 if len(shape) == 1 and not isinstance(shape[0], Number): 1102 shape = shape[0] -> 1103 return reshape(self, shape) 1104 1105 @wraps(topk)

/Users/rpa/anaconda/lib/python2.7/site-packages/dask/array/core.pyc in reshape(array, shape) 2585 2586 if any(len(c) != 1 for c in array.chunks[ndim_same+1:]): -> 2587 raise ValueError('dask.array.reshape requires that reshaped ' 2588 'dimensions after the first contain at most one chunk') 2589

ValueError: dask.array.reshape requires that reshaped dimensions after the first contain at most one chunk ```

I am using the latest xarray master and dask version 0.11.0. Note that the example works fine if I use an earlier version of dask (e.g. 0.8.0, the only other one I tested.) This suggests an upstream issue with dask, but I wanted to bring it up here first.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1026/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
317783678 MDU6SXNzdWUzMTc3ODM2Nzg= 2082 searching is broken on readthedocs rabernat 1197350 closed 0     2 2018-04-25T20:34:13Z 2018-05-04T20:10:31Z 2018-05-04T20:10:31Z MEMBER      

Searches return no results for me. For example: http://xarray.pydata.org/en/latest/search.html?q=xarray&check_keywords=yes&area=default

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2082/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
312986662 MDExOlB1bGxSZXF1ZXN0MTgwNjUwMjc5 2047 Fix decode cf with dask rabernat 1197350 closed 0     1 2018-04-10T15:56:20Z 2018-04-12T23:38:02Z 2018-04-12T23:38:02Z MEMBER   0 pydata/xarray/pulls/2047
  • [x] Closes #1372
  • [x] Tests added
  • [x] Tests passed
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API

This was a very simple fix for an issue that has vexed me for quite a while. Am I missing something obvious here?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2047/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
293913247 MDU6SXNzdWUyOTM5MTMyNDc= 1882 xarray tutorial at SciPy 2018? rabernat 1197350 closed 0     17 2018-02-02T14:52:11Z 2018-04-09T20:30:13Z 2018-04-09T20:30:13Z MEMBER      

It would be great to hold an xarray tutorial at SciPy 2018. Xarray has matured a lot recently, and it would be great to raise awareness of what it can do among the broader scipy community.

From the conference website:

Tutorials should be focused on covering a well-defined topic in a hands-on manner. We want to see attendees coding! We encourage submissions to be designed to allow at least 50% of the time for hands-on exercises even if this means the subject matter needs to be limited. Tutorials will be 4 hours in duration. In your tutorial application, you can indicate what prerequisite skills and knowledge will be needed for your tutorial, and the approximate expected level of knowledge of your students (i.e., beginner, intermediate, advanced).

I'm curious if anyone was already planning on submitting a tutorial. If not, let's put together a team. @jhamman has indicated interest in participating in, but not leading, the tutorial. Anyone else interested?

xref pangeo-data/pangeo#97

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1882/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
106562046 MDU6SXNzdWUxMDY1NjIwNDY= 575 1D line plot with data on the x axis rabernat 1197350 closed 0     13 2015-09-15T13:56:51Z 2018-03-05T22:14:46Z 2018-03-05T22:14:46Z MEMBER      

Consider the following Dataset, representing a function f = cos(z)

```python
import numpy as np
import xray

z = np.arange(10)
ds = xray.Dataset({'f': ('z', np.cos(z))}, coords={'z': z})
```

If I call

```python
ds.f.plot()
```

xray naturally puts "z" on the x-axis.

However, since z represents the vertical dimension, it would be more natural to put it on the y-axis, i.e.

```python
plt.plot(ds.f, ds.z)
```

This is conventional in atmospheric science and oceanography for buoy data or balloon data.

Is there an easy way to do this with xray's plotting functions? I scanned the code and didn't see an obvious solution, but maybe I missed it.
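For reference, a sketch of what an explicit interface for this could look like, reusing the ds defined above (recent xarray versions accept a y keyword on 1D line plots to put the named coordinate on the vertical axis; treat the exact keyword as an assumption here):

```python
# put the vertical coordinate z on the y-axis and the data on the x-axis
ds.f.plot(y='z')
```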

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/575/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
295744504 MDU6SXNzdWUyOTU3NDQ1MDQ= 1898 zarr RTD docs broken rabernat 1197350 closed 0   0.10.3 3008859 1 2018-02-09T03:35:05Z 2018-02-15T23:20:31Z 2018-02-15T23:20:31Z MEMBER      

This is what is getting rendered on RTD http://xarray.pydata.org/en/latest/io.html#zarr

``` In [26]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))}, ....: coords={'x': [10, 20, 30, 40], ....: 'y': pd.date_range('2000-01-01', periods=5), ....: 'z': ('x', list('abcd'))}) ....:

In [27]: ds.to_zarr('path/to/directory.zarr')

AttributeError Traceback (most recent call last) <ipython-input-27-8c5f1b00edbc> in <module>() ----> 1 ds.to_zarr('path/to/directory.zarr')

/home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/core/dataset.py in to_zarr(self, store, mode, synchronizer, group, encoding) 1165 from ..backends.api import to_zarr 1166 return to_zarr(self, store=store, mode=mode, synchronizer=synchronizer, -> 1167 group=group, encoding=encoding) 1168 1169 def unicode(self):

/home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/backends/api.py in to_zarr(dataset, store, mode, synchronizer, group, encoding) 752 # I think zarr stores should always be sync'd immediately 753 # TODO: figure out how to properly handle unlimited_dims --> 754 dataset.dump_to_store(store, sync=True, encoding=encoding) 755 return store

/home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/core/dataset.py in dump_to_store(self, store, encoder, sync, encoding, unlimited_dims) 1068 1069 store.store(variables, attrs, check_encoding, -> 1070 unlimited_dims=unlimited_dims) 1071 if sync: 1072 store.sync()

/home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/backends/zarr.py in store(self, variables, attributes, args, kwargs) 378 def store(self, variables, attributes, args, kwargs): 379 AbstractWritableDataStore.store(self, variables, attributes, --> 380 *args, kwargs) 381 382

/home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, unlimited_dims) 275 variables, attributes = self.encode(variables, attributes) 276 --> 277 self.set_attributes(attributes) 278 self.set_dimensions(variables, unlimited_dims=unlimited_dims) 279 self.set_variables(variables, check_encoding_set,

/home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.5/site-packages/xarray-0.10.0+dev55.g1d32399-py3.5.egg/xarray/backends/zarr.py in set_attributes(self, attributes) 341 342 def set_attributes(self, attributes): --> 343 self.ds.attrs.put(attributes) 344 345 def encode_variable(self, variable):

AttributeError: 'Attributes' object has no attribute 'put' ```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1898/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
253136694 MDExOlB1bGxSZXF1ZXN0MTM3ODE5MTA0 1528 WIP: Zarr backend rabernat 1197350 closed 0     103 2017-08-27T02:38:01Z 2018-02-13T21:35:03Z 2017-12-14T02:11:36Z MEMBER   0 pydata/xarray/pulls/1528
  • [x] Closes #1223
  • [x] Tests added / passed
  • [x] Passes git diff upstream/master | flake8 --diff
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API

I think that a zarr backend could be the ideal storage format for xarray datasets, overcoming many of the frustrations associated with netcdf and enabling optimal performance on cloud platforms.

This is a very basic start to implementing a zarr backend (as proposed in #1223); however, I am taking a somewhat different approach. I store the whole dataset in a single zarr group. I encode the extra metadata needed by xarray (so far just dimension information) as attributes within the zarr group and child arrays. I hide these special attributes from the user by wrapping the attribute dictionaries in a "HiddenKeyDict", so that they can't be viewed or modified.

I have no tests yet (:flushed:), but the following code works.

```python
from xarray.backends.zarr import ZarrStore
import xarray as xr
import numpy as np

ds = xr.Dataset(
    {'foo': (('y', 'x'), np.ones((100, 200)), {'myattr1': 1, 'myattr2': 2}),
     'bar': (('x',), np.zeros(200))},
    {'y': (('y',), np.arange(100)), 'x': (('x',), np.arange(200))},
    {'some_attr': 'copana'}
).chunk({'y': 50, 'x': 40})

zs = ZarrStore(store='zarr_test')
ds.dump_to_store(zs)
ds2 = xr.Dataset.load_store(zs)
assert ds2.equals(ds)
```

There is a very long way to go here, but I thought I would just get a PR started. Some questions that would help me move forward.

  1. What is "encoding" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr?
  2. Should we encode / decode CF for zarr stores?
  3. Do we want to always automatically align dask chunks with the underlying zarr chunks?
  4. What sort of public API should the zarr backend have? Should you be able to load zarr stores via open_dataset? Or do we need a new method? I think .to_zarr() would be quite useful.
  5. zarr arrays are extensible along all axes. What does this imply for unlimited dimensions?
  6. Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed.
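Regarding question 4, the public API could eventually look something like this, reusing the ds from the example above (a sketch only; to_zarr / open_zarr are the proposed names, not something implemented in this PR yet):

```python
# proposed round trip through a zarr directory store
ds.to_zarr('path/to/store.zarr', mode='w')
ds2 = xr.open_zarr('path/to/store.zarr')
```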
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1528/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
287569331 MDExOlB1bGxSZXF1ZXN0MTYyMjI0MTg2 1817 fix rasterio chunking with s3 datasets rabernat 1197350 closed 0     11 2018-01-10T20:37:45Z 2018-01-24T09:33:07Z 2018-01-23T16:33:28Z MEMBER   0 pydata/xarray/pulls/1817
  • [x] Closes #1816 (remove if there is no corresponding issue, which should only be the case for minor changes)
  • [x] Tests added (for all bug fixes or enhancements)
  • [x] Tests passed (for all non-documentation changes)
  • [x] Passes git diff upstream/master **/*py | flake8 --diff (remove if you did not edit any Python files)
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

This is a simple fix for token generation of non-filename targets for rasterio.

The problem is that I have no idea how to test it without actually hitting s3 (which requires boto and aws credentials).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1817/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
287566823 MDU6SXNzdWUyODc1NjY4MjM= 1816 rasterio chunks argument causes loading from s3 to fail rabernat 1197350 closed 0     1 2018-01-10T20:28:40Z 2018-01-23T16:33:28Z 2018-01-23T16:33:28Z MEMBER      

Code Sample, a copy-pastable example if possible

```python
# This works
url = 's3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF'
ds = xr.open_rasterio(url)

# this doesn't
ds = xr.open_rasterio(url, chunks=512)
```

The error is ```


FileNotFoundError Traceback (most recent call last) <ipython-input-17-8b55d7e920b8> in <module>() 6 # https://aws.amazon.com/public-datasets/landsat/ 7 # 512x512 chunking ----> 8 ds = xr.open_rasterio(url, chunks=512) 9 ds

~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/xarray-0.10.0-py3.6.egg/xarray/backends/rasterio_.py in open_rasterio(filename, chunks, cache, lock) 172 from dask.base import tokenize 173 # augment the token with the file modification time --> 174 mtime = os.path.getmtime(filename) 175 token = tokenize(filename, mtime, chunks) 176 name_prefix = 'open_rasterio-%s' % token

~/miniconda3/envs/geo_scipy/lib/python3.6/genericpath.py in getmtime(filename) 53 def getmtime(filename): 54 """Return the last modification time of a file, reported by os.stat().""" ---> 55 return os.stat(filename).st_mtime 56 57

FileNotFoundError: [Errno 2] No such file or directory: 's3://landsat-pds/L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B1.TIF' ```

Problem description

It is pretty clear that the current xarray code expects to receive a filename. (The name of the argument is filename.) But rasterio's open function accepts a much wider range of dataset identifiers. The tokenizing function should be updated to allow for this. Seems like it should be a pretty easy fix.
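One possible shape of the fix (a sketch, not necessarily the final implementation): only fold the modification time into the dask token when the identifier is an actual local file, and otherwise tokenize the identifier string itself.

```python
import os
from dask.base import tokenize

# `filename` and `chunks` are the variables already in scope in open_rasterio
if os.path.isfile(filename):
    mtime = os.path.getmtime(filename)
else:
    # e.g. s3:// or other non-local identifiers understood by rasterio
    mtime = None
token = tokenize(filename, mtime, chunks)
```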

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.2.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.0 pandas: 0.20.3 numpy: 1.13.1 scipy: 0.19.1 netCDF4: 1.3.1 h5netcdf: 0.4.1 Nio: None bottleneck: 1.2.1 cyordereddict: None dask: 0.16.0 matplotlib: 2.1.0 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 36.3.0 pip: 9.0.1 conda: None pytest: 3.2.1 IPython: 6.1.0 sphinx: 1.6.5
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1816/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
281983819 MDU6SXNzdWUyODE5ODM4MTk= 1779 decode_cf destroys chunks rabernat 1197350 closed 0     2 2017-12-14T05:12:00Z 2017-12-15T14:50:42Z 2017-12-15T14:50:41Z MEMBER      

Code Sample, a copy-pastable example if possible

```python
import numpy as np
import xarray as xr

ds = xr.DataArray(np.random.rand(1000)).to_dataset(name='random').chunk(100)
ds_cf = xr.decode_cf(ds)
assert not ds_cf.chunks
```

Problem description

Calling decode_cf causes variables whose data is dask arrays to be wrapped in two layers of abstraction: DaskIndexingAdapter and LazilyIndexedArray. In the example above

```python
ds.random.variable._data
dask.array<da.random.random_sample, shape=(1000,), dtype=float64, chunksize=(100,)>
ds_cf.random.variable._data
LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array<da.random.random_sample, shape=(1000,), dtype=float64, chunksize=(100,)>), key=BasicIndexer((slice(None, None, None),)))
```

At least part of the problem comes from this line: https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L1045

This is especially problematic if we want to concatenate several such datasets together with dask. Chunking the decoded dataset creates a nested dask-within-dask array which is sure to cause undesirable behavior down the line

```python
dict(ds_cf.chunk().random.data.dask)
{('xarray-random-bf5298b8790e93c1564b5dca9e04399e', 0): (<function dask.array.core.getter>,
  'xarray-random-bf5298b8790e93c1564b5dca9e04399e',
  (slice(0, 1000, None),)),
 'xarray-random-bf5298b8790e93c1564b5dca9e04399e': ImplicitToExplicitIndexingAdapter(array=LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array<da.random.random_sample, shape=(1000,), dtype=float64, chunksize=(100,)>), key=BasicIndexer((slice(None, None, None),))))}
```

Expected Output

If we call decode_cf on a dataset made of dask arrays, it should preserve the chunks of the original dask arrays. Hopefully this can be addressed by #1752.

Output of xr.show_versions()

commit: 85174cda6440c2f6eed7860357e79897e796e623 python: 3.6.2.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.0-52-gd8842a6 pandas: 0.20.3 numpy: 1.13.1 scipy: 0.19.1 netCDF4: 1.2.9 h5netcdf: 0.4.1 Nio: None bottleneck: 1.2.1 cyordereddict: None dask: 0.16.0 matplotlib: 2.1.0 cartopy: 0.15.1 seaborn: 0.8.1 setuptools: 36.3.0 pip: 9.0.1 conda: None pytest: 3.2.1 IPython: 6.1.0 sphinx: 1.6.5
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1779/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
94328498 MDU6SXNzdWU5NDMyODQ5OA== 463 open_mfdataset too many files rabernat 1197350 closed 0     47 2015-07-10T15:24:14Z 2017-11-27T12:17:17Z 2017-03-23T19:22:43Z MEMBER      

I am very excited to try xray.

On my first attempt, I tried to use open_mfdataset on a set of ~8000 netcdf files. I hit a "RuntimeError: Too many open files". The ulimit on my system is 1024, so clearly that is the source of the error.

I am curious whether this is the desired behavior for open_mfdataset. Does xray have to keep all the files open? If so, I will work with my sysadmin to increase the ulimit.

It seems like the whole point of this function is to work with large collections of files, so this could be a significant limitation.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/463/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
229474101 MDExOlB1bGxSZXF1ZXN0MTIxMTQyODkw 1413 concat prealigned objects rabernat 1197350 closed 0     11 2017-05-17T20:16:00Z 2017-07-17T21:53:53Z 2017-07-17T21:53:40Z MEMBER   0 pydata/xarray/pulls/1413
  • [x] Closes #1385
  • [ ] Tests added / passed
  • [ ] Passes git diff upstream/master | flake8 --diff
  • [ ] Fully documented, including whats-new.rst for all changes and api.rst for new API

This is an initial PR to bypass index alignment and coordinate checking when concatenating datasets.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1413/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
229138906 MDExOlB1bGxSZXF1ZXN0MTIwOTAzMjY5 1411 fixed dask prefix naming rabernat 1197350 closed 0     6 2017-05-16T19:10:30Z 2017-05-22T20:39:01Z 2017-05-22T20:38:56Z MEMBER   0 pydata/xarray/pulls/1411
  • [x] Closes #1343
  • [x] Tests added / passed
  • [x] Passes git diff upstream/master | flake8 --diff
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API

I am starting a new PR for this since the original one (#1345) was not branched of my own fork.

As the discussion there stood, @shoyer suggested that dataset.chunk should also be updated to match the latest conventions in dask naming. The relevant code is here

```python
def maybe_chunk(name, var, chunks):
    chunks = selkeys(chunks, var.dims)
    if not chunks:
        chunks = None
    if var.ndim > 0:
        token2 = tokenize(name, token if token else var._data)
        name2 = '%s%s-%s' % (name_prefix, name, token2)
        return var.chunk(chunks, name=name2, lock=lock)
    else:
        return var

variables = OrderedDict([(k, maybe_chunk(k, v, chunks))
                         for k, v in self.variables.items()])
```

Currently, chunk has an optional keyword argument name_prefix='xarray-'. Do we want to keep this optional?

IMO, the current naming logic in chunk is not a problem for dask and will not cause problems for the distributed bokeh dashboard (as open_dataset did).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1411/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
218368855 MDExOlB1bGxSZXF1ZXN0MTEzNTU0Njk4 1345 new dask prefix rabernat 1197350 closed 0     2 2017-03-31T00:56:24Z 2017-05-21T09:45:39Z 2017-05-16T19:11:13Z MEMBER   0 pydata/xarray/pulls/1345
  • [x] closes #1343
  • [ ] tests added / passed
  • [ ] passes git diff upstream/master | flake8 --diff
  • [ ] whatsnew entry
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1345/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
225482023 MDExOlB1bGxSZXF1ZXN0MTE4NDA4NDc1 1390 Fix groupby bins tests rabernat 1197350 closed 0     1 2017-05-01T17:46:41Z 2017-05-01T21:52:14Z 2017-05-01T21:52:14Z MEMBER   0 pydata/xarray/pulls/1390
  • [x] closes #1386
  • [x] tests added / passed
  • [x] passes git diff upstream/master | flake8 --diff
  • [x] whatsnew entry
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1390/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
220078792 MDU6SXNzdWUyMjAwNzg3OTI= 1357 dask strict version check fails rabernat 1197350 closed 0     1 2017-04-07T01:08:56Z 2017-04-07T01:43:53Z 2017-04-07T01:43:53Z MEMBER      

I am on xarray version 0.9.1-28-g1cad803 and dask version 0.14.1+39.g964b377 (both from recent github masters).

I can't save chunked data to netcdf because of a failing dask version check.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'a': (['x'], np.random.rand(100)),
                 'b': (['x'], np.random.rand(100))})
ds = ds.chunk({'x': 20})
ds.to_netcdf('test.nc')
```

The relevant part of the stack trace is ``` /home/rpa/xarray/xarray/backends/common.pyc in sync(self) 165 import dask.array as da 166 import dask --> 167 if StrictVersion(dask.version) > StrictVersion('0.8.1'): 168 da.store(self.sources, self.targets, lock=GLOBAL_LOCK) 169 else:

/home/rpa/.conda/envs/lagrangian_vorticity/lib/python2.7/distutils/version.pyc in init(self, vstring) 38 def init (self, vstring=None): 39 if vstring: ---> 40 self.parse(vstring) 41 42 def repr (self):

/home/rpa/.conda/envs/lagrangian_vorticity/lib/python2.7/distutils/version.pyc in parse(self, vstring) 105 match = self.version_re.match(vstring) 106 if not match: --> 107 raise ValueError, "invalid version number '%s'" % vstring 108 109 (major, minor, patch, prerelease, prerelease_num) = \

ValueError: invalid version number '0.14.1+39.g964b377' ```

It appears that StrictVersion does not like the dask version numbering scheme.
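One possible fix (a sketch, assuming we stay with distutils): compare with LooseVersion instead, which tolerates local-version suffixes like the one above.

```python
from distutils.version import LooseVersion

import dask

# LooseVersion parses '0.14.1+39.g964b377' without raising
new_enough = LooseVersion(dask.__version__) > LooseVersion('0.8.1')
```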

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1357/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
188537472 MDExOlB1bGxSZXF1ZXN0OTMxNzEyODE= 1104 add optimization tips rabernat 1197350 closed 0     1 2016-11-10T15:26:25Z 2016-11-10T16:49:13Z 2016-11-10T16:49:06Z MEMBER   0 pydata/xarray/pulls/1104

This adds some dask optimization tips from the mailing list (closes #1103).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1104/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
188517316 MDU6SXNzdWUxODg1MTczMTY= 1103 add dask optimization tips to docs rabernat 1197350 closed 0     0 2016-11-10T14:08:39Z 2016-11-10T16:49:06Z 2016-11-10T16:49:06Z MEMBER      

We should add the optimization tips that @shoyer describes in this mailing list thread (in reply to @karenamckinnon).

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/xarray/11lDGSeza78/lR1uj9yWDAAJ

Specific things to try (we should add similar guidelines to xarray's docs):

  1. Do your spatial and temporal indexing with .sel() earlier in the pipeline, specifically before you resample. Resample triggers some computation on all the blocks, which in theory should commute with indexing, but we haven't implemented this optimization in dask yet: https://github.com/dask/dask/issues/746
  2. Save the temporal mean to disk as a netCDF file (and then load it again with open_dataset) before subtracting it. Again, in theory, dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the dask scheduler, because it tries to keep every chunk of an array that it computes in memory: https://github.com/dask/dask/issues/874
  3. Specify smaller chunks across space when using open_mfdataset, e.g., chunks={'latitude': 10, 'longitude': 10}. This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you do my suggestion 1).
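A minimal sketch combining the three tips above (file names, variable and dimension names, and slice bounds are placeholders; the resample call uses the keyword syntax of recent xarray versions):

```python
import xarray as xr

# tip 3: chunk in space when opening, so spatial subsetting stays cheap
ds = xr.open_mfdataset('data_*.nc', chunks={'latitude': 10, 'longitude': 10})

# tip 1: select in space and time *before* resampling
subset = ds.sel(latitude=slice(30, 50), time=slice('1990-01-01', '1999-12-31'))

# tip 2: write the temporal mean to disk and reload it before subtracting,
# instead of keeping the whole pipeline lazy
mean = subset.mean('time')
mean.to_netcdf('temporal_mean.nc')
mean = xr.open_dataset('temporal_mean.nc')

anomalies = subset - mean
monthly_anomalies = anomalies.resample(time='1M').mean()
```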
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1103/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
180536861 MDExOlB1bGxSZXF1ZXN0ODc2NDc0MDk= 1027 Groupby bins empty groups rabernat 1197350 closed 0     7 2016-10-02T21:31:32Z 2016-10-03T15:22:18Z 2016-10-03T15:22:15Z MEMBER   0 pydata/xarray/pulls/1027

This PR fixes a bug in groupby_bins in which empty bins were dropped from the grouped results. Now groupby_bins restores any empty bins automatically. To recover the old behavior, one could apply dropna after a groupby operation.

Fixes #1019
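A minimal sketch of the new behavior and the dropna workaround (the x_bins dimension name follows xarray's default of appending '_bins' to the grouped variable's name):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(4), dims='x', coords={'x': [0, 1, 2, 3]})
bins = [0, 1, 2, 10, 20]  # the (10, 20] bin contains no data

binned = da.groupby_bins('x', bins).mean()  # empty bin now kept (as NaN)
old_style = binned.dropna('x_bins')         # recover the previous behavior
```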

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1027/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
178359375 MDU6SXNzdWUxNzgzNTkzNzU= 1014 dask tokenize error with chunking rabernat 1197350 closed 0     1 2016-09-21T14:14:10Z 2016-09-22T02:38:08Z 2016-09-22T02:38:08Z MEMBER      

I have hit a problem with my custom xarray store: https://github.com/xgcm/xgcm/blob/master/xgcm/models/mitgcm/mds_store.py

Unfortunately it is hard for me to create a re-producible example, since this error is only coming up when I try to read a large binary dataset stored on my server. Nevertheless, I am opening an issue in hopes that someone can help me.

I create an xarray dataset via a custom function

```python
ds = xgcm.open_mdsdataset(ddir, iters, delta_t=deltaT,
                          prefix=['DiagLAYERS-diapycnal', 'DiagLAYERS-transport'])
```

This function creates a dataset object successfully and then calls ds.chunk(). Dask is unable to tokenize the variables and fails. I don't really understand why, but it seems to ultimately depend on the presence and value of the filename attribute in the data getting passed to dask.

Any advice would be appreciated. The relevant stack trace is

``` python /home/rpa/xgcm/xgcm/models/mitgcm/mds_store.pyc in open_mdsdataset(dirname, iters, prefix, read_grid, delta_t, ref_date, calendar, geometry, grid_vars_to_coords, swap_dims, endian, chunks, ignore_unknown_vars) 154 # do we need more fancy logic (like open_dataset), or is this enough 155 if chunks is not None: --> 156 ds = ds.chunk(chunks) 157 158 return ds

/home/rpa/xarray/xarray/core/dataset.py in chunk(self, chunks, name_prefix, token, lock) 863 864 variables = OrderedDict([(k, maybe_chunk(k, v, chunks)) --> 865 for k, v in self.variables.items()]) 866 return self._replace_vars_and_dims(variables) 867

/home/rpa/xarray/xarray/core/dataset.py in maybe_chunk(name, var, chunks) 856 chunks = None 857 if var.ndim > 0: --> 858 token2 = tokenize(name, token if token else var._data) 859 name2 = '%s%s-%s' % (name_prefix, name, token2) 860 return var.chunk(chunks, name=name2, lock=lock)

/home/rpa/dask/dask/base.pyc in tokenize(args, *kwargs) 355 if kwargs: 356 args = args + (kwargs,) --> 357 return md5(str(tuple(map(normalize_token, args))).encode()).hexdigest()

/home/rpa/dask/dask/utils.pyc in call(self, arg) 510 for cls in inspect.getmro(typ)[1:]: 511 if cls in lk: --> 512 return lkcls 513 raise TypeError("No dispatch for {0} type".format(typ)) 514

/home/rpa/dask/dask/base.pyc in normalize_array(x) 320 return (str(x), x.dtype) 321 if hasattr(x, 'mode') and hasattr(x, 'filename'): --> 322 return x.filename, os.path.getmtime(x.filename), x.dtype, x.shape 323 if x.dtype.hasobject: 324 try:

/usr/local/anaconda/lib/python2.7/genericpath.pyc in getmtime(filename) 60 def getmtime(filename): 61 """Return the last modification time of a file, reported by os.stat().""" ---> 62 return os.stat(filename).st_mtime 63 64

TypeError: coercing to Unicode: need string or buffer, NoneType found ```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1014/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
146182176 MDExOlB1bGxSZXF1ZXN0NjU0MDc4NzA= 818 Multidimensional groupby rabernat 1197350 closed 0     61 2016-04-06T04:14:37Z 2016-07-31T23:02:59Z 2016-07-08T01:50:38Z MEMBER   0 pydata/xarray/pulls/818

Many datasets have a two dimensional coordinate variable (e.g. longitude) which is different from the logical grid coordinates (e.g. nx, ny). (See #605.) For plotting purposes, this is solved by #608. However, we still might want to split / apply / combine over such coordinates. That has not been possible, because groupby only supports creating groups on one-dimensional arrays.

This PR overcomes that issue by using stack to collapse multiple dimensions in the group variable. A minimal example of the new functionality is

```python
da = xr.DataArray([[0, 1], [2, 3]],
                  coords={'lon': (['ny', 'nx'], [[30, 40], [40, 50]]),
                          'lat': (['ny', 'nx'], [[10, 10], [20, 20]])},
                  dims=['ny', 'nx'])
da.groupby('lon').sum()
```
```
<xarray.DataArray (lon: 3)>
array([0, 3, 3])
Coordinates:
  * lon      (lon) int64 30 40 50
```

This feature could have broad applicability for many realistic datasets (particularly model output on irregular grids): for example, averaging non-rectangular grids zonally (i.e. in latitude), binning in temperature, etc.

If you think this is worth pursuing, I would love some feedback.

The PR is not complete. Some items to address are

  • [x] Create a specialized grouper to allow coarser bins. By default, if no grouper is specified, the GroupBy object uses all unique values to define the groups. With a high resolution dataset, this could balloon to a huge number of groups. With the latitude example, we would like to be able to specify e.g. 1-degree bins. Usage would be da.groupby('lon', bins=range(-90,90)).
  • [ ] Allow specification of which dims to stack. For example, stack in space but keep the time dimension intact. (Currently it just stacks all the dimensions of the group variable.)
  • [x] A nice example for the docs.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/818/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
162974170 MDExOlB1bGxSZXF1ZXN0NzU2ODI3NzM= 892 fix printing of unicode attributes rabernat 1197350 closed 0     2 2016-06-29T16:47:27Z 2016-07-24T02:57:13Z 2016-07-24T02:57:13Z MEMBER   0 pydata/xarray/pulls/892

fixes #834

I would welcome a suggestion of how to test this in a way that works with both python 2 and 3. This is somewhat outside my expertise.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/892/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
100055216 MDExOlB1bGxSZXF1ZXN0NDIwMTYyMDg= 524 Option for closing files with scipy backend rabernat 1197350 closed 0     6 2015-08-10T12:49:23Z 2016-06-24T17:45:07Z 2016-06-24T17:45:07Z MEMBER   0 pydata/xarray/pulls/524

This is the same as #468, which was accidentally closed. I just copied and pasted my comment below

This addresses issue #463, in which open_mfdataset failed when trying to open a list of files longer than my system's ulimit. I tried to find a solution in which the underlying netcdf file objects are kept closed by default and only reopened "when needed".

I ended up subclassing scipy.io.netcdf_file and overwriting the variable attribute with a property which first checks whether the file is open or closed and opens it if needed. That was the easy part. The hard part was figuring out when to close them. The problem is that a couple of different parts of the code (e.g. each individual variable and also the datastore object itself) keep references to the netcdf_file object. In the end I used the debugger to find out when during initialization the variables were actually being read and added some calls to close() in various different places. It is relatively easy to close the files up at the end of the initialization, but it was much harder to make sure that the whole array of files is never open at the same time. I also had to disable mmap when this option is active.

This solution is messy and, moreover, extremely slow. There is a factor of ~100 performance penalty during initialization for reopening and closing the files all the time (but only a factor of 10 for the actual calculation). I am sure this could be reduced if someone who understands the code better found some judicious points at which to call close() on the netcdf_file. The loss of mmap also sucks.

This option can be accessed with the close_files key word, which I added to api.

Timing for loading and doing a calculation with close_files=True:

```python
count_open_files()
%time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=True)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
```

output:

```
3 open files
CPU times: user 11.1 s, sys: 17.5 s, total: 28.5 s
Wall time: 27.7 s
2 open files
0.0055650632367
CPU times: user 649 ms, sys: 974 ms, total: 1.62 s
Wall time: 633 ms
2 open files
```

Timing for loading and doing a calculation with close_files=False (default, should revert to old behavior):

```python
count_open_files()
%time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=False)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
```

```
3 open files
CPU times: user 264 ms, sys: 85.3 ms, total: 349 ms
Wall time: 291 ms
22 open files
0.0055650632367
CPU times: user 174 ms, sys: 141 ms, total: 315 ms
Wall time: 56 ms
22 open files
```

This is not a very serious pull request, but I spent all day on it, so I thought I would share. Maybe you can see some obvious way to improve it...

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/524/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
111471076 MDU6SXNzdWUxMTE0NzEwNzY= 624 roll method rabernat 1197350 closed 0     8 2015-10-14T19:14:36Z 2015-12-02T23:32:28Z 2015-12-02T23:32:28Z MEMBER      

I would like to pick up my idea to add a roll method. Among many uses, it could help with #623.

The method is pretty simple.

```python
def roll(darr, n, dim):
    """Clone of numpy.roll for xray objects."""
    left = darr.isel(**{dim: slice(None, -n)})
    right = darr.isel(**{dim: slice(-n, None)})
    return xray.concat([right, left], dim=dim, data_vars='minimal', coords='minimal')
```

I have already been using this function a lot (defined from outside xray) and find it quite useful. I would like to create a PR to add it, but I am having a little trouble understanding how to correctly "inject" it into the api. A few words of advice from @shoyer would probably save me a lot of trial and error.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/624/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
115897556 MDU6SXNzdWUxMTU4OTc1NTY= 649 error when using broadcast_arrays with coordinates rabernat 1197350 closed 0     5 2015-11-09T15:16:32Z 2015-11-10T14:27:41Z 2015-11-10T14:27:41Z MEMBER      

I frequently use broadcast_arrays to feed xray variables to non-xray libraries (e.g. gsw). Often I need to broadcast the coordinates and variables in order to call functions that take both as arguments.

I have found that broadcast_arrays doesn't work as I expect with coordinates. For example

```python
import xray
import numpy as np

ds = xray.Dataset({'a': (['y', 'x'], np.ones((20, 10)))},
                  coords={'x': (['x'], np.arange(10)),
                          'y': (['y'], np.arange(20))})
xbc, ybc, abc = xray.broadcast_arrays(ds.x, ds.y, ds.a)
```

This raises ValueError: an index variable must be defined with 1-dimensional data.

If I change the last line to

```python
xbc, ybc, abc = xray.broadcast_arrays(1 * ds.x, 1 * ds.y, ds.a)
```

it works fine.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/649/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
101719623 MDExOlB1bGxSZXF1ZXN0NDI3MzE1NDg= 538 Fix contour color rabernat 1197350 closed 0     25 2015-08-18T18:24:36Z 2015-09-01T17:48:12Z 2015-09-01T17:20:56Z MEMBER   0 pydata/xarray/pulls/538

This fixes #537 by adding a check for the presence of the colors kwarg.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/538/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
101716715 MDU6SXNzdWUxMDE3MTY3MTU= 537 xray.plot.contour doesn't handle colors kwarg correctly rabernat 1197350 closed 0     2 2015-08-18T18:11:55Z 2015-09-01T17:20:55Z 2015-09-01T17:20:55Z MEMBER      

I found this while playing around with the plotting functions. (Really nice work btw @clarkfitzg!) I know the plotting is still under heavy development, but I thought I would share this issue anyway. I might take a crack at fixing it myself...

The goal is to make an unfilled contour plot with no colors. In matplotlib this is easy

```python
import numpy as np
import matplotlib.pyplot as plt

x, y = np.arange(20), np.arange(20)
xx, yy = np.meshgrid(x, y)
f = np.sqrt(xx**2 + yy**2)
plt.contour(x, y, f, colors='k')
```

If I try the same thing in xray

```python
da = xray.DataArray(f, coords={'y': y, 'x': x})
plt.figure()
xray.plot.contour(da, colors='k')
```

I get ValueError: Either colors or cmap must be None.

I can't find any way around this (e.g. adding a cmap=None argument has no effect). If I remove the colors keyword, it works and makes colored contours, as expected.

I think this could be fixed easily if you agree it is a bug...

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/537/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
99847237 MDExOlB1bGxSZXF1ZXN0NDE5NjI5MDg= 523 Fix datetime decoding when time units are 'days since 0000-01-01 00:00:00' rabernat 1197350 closed 0     22 2015-08-09T00:12:00Z 2015-08-14T17:22:02Z 2015-08-14T17:22:02Z MEMBER   0 pydata/xarray/pulls/523

This fixes #521 using the workaround described in Unidata/netcdf4-python#442.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/523/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
94508580 MDExOlB1bGxSZXF1ZXN0Mzk3NTI1MTQ= 468 Option for closing files with scipy backend rabernat 1197350 closed 0     7 2015-07-11T21:24:24Z 2015-08-10T12:50:45Z 2015-08-09T00:04:12Z MEMBER   0 pydata/xarray/pulls/468

This addresses issue #463, in which open_mfdataset failed when trying to open a list of files longer than my system's ulimit. I tried to find a solution in which the underlying netcdf file objects are kept closed by default and only reopened "when needed".

I ended up subclassing scipy.io.netcdf_file and overriding the variables attribute with a property that first checks whether the file is open and reopens it if needed. That was the easy part. The hard part was figuring out when to close the files. The problem is that several different parts of the code (e.g. each individual variable, and also the datastore object itself) keep references to the netcdf_file object. In the end I used the debugger to find out when the variables were actually being read during initialization and added calls to close() in various places. It is relatively easy to close the files at the end of initialization, but it was much harder to make sure that the full set of files is never open all at the same time. I also had to disable mmap when this option is active.
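
A stripped-down sketch of the open-on-demand idea (purely illustrative: it wraps scipy.io.netcdf_file rather than subclassing it, and the class and method names are invented, not part of this pull request):

```python
import scipy.io


class LazilyOpenedNetCDF(object):
    """Keep the file closed between accesses; open it only while reading."""

    def __init__(self, filename):
        self._filename = filename

    def read_variable(self, name):
        # mmap must stay off because the file is closed again before the
        # returned array is used
        f = scipy.io.netcdf_file(self._filename, mode='r', mmap=False)
        try:
            return f.variables[name][:].copy()
        finally:
            f.close()
```

Reopening the file around every read is what produces the large initialization penalty reported below.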

This solution is messy and, moreover, extremely slow. There is a factor of ~100 performance penalty during initialization for reopening and closing the files all the time (but only a factor of 10 for the actual calculation). I am sure this could be reduced if someone who understands the code better found some judicious points at which to call close() on the netcdf_file. The loss of mmap also sucks.

This option can be accessed with the close_files keyword, which I added to the API.

Timing for loading and doing a calculation with close_files=True:

```python
# count_open_files() is a helper defined elsewhere in my session; ddir is
# the path to the data directory
count_open_files()
%time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=True)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
```

output:

```
3 open files
CPU times: user 11.1 s, sys: 17.5 s, total: 28.5 s
Wall time: 27.7 s
2 open files
0.0055650632367
CPU times: user 649 ms, sys: 974 ms, total: 1.62 s
Wall time: 633 ms
2 open files
```

Timing for loading and doing a calculation with close_files=False (default, should revert to old behavior):

```python
count_open_files()
%time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=False)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
```

```
3 open files
CPU times: user 264 ms, sys: 85.3 ms, total: 349 ms
Wall time: 291 ms
22 open files
0.0055650632367
CPU times: user 174 ms, sys: 141 ms, total: 315 ms
Wall time: 56 ms
22 open files
```

This is not a very serious pull request, but I spent all day on it, so I thought I would share. Maybe you can see some obvious way to improve it...

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/468/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
99844089 MDExOlB1bGxSZXF1ZXN0NDE5NjI0NDM= 522 Fix datetime decoding when time units are 'days since 0000-01-01 00:00:00' rabernat 1197350 closed 0     1 2015-08-08T23:26:07Z 2015-08-09T00:10:18Z 2015-08-09T00:06:49Z MEMBER   0 pydata/xarray/pulls/522

This fixes #521 using the workaround described in Unidata/netcdf4-python#442.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/522/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
96732359 MDU6SXNzdWU5NjczMjM1OQ== 489 problems with big endian DataArrays rabernat 1197350 closed 0     4 2015-07-23T05:24:07Z 2015-07-23T20:28:00Z 2015-07-23T20:28:00Z MEMBER      

I have some MITgcm data in a custom binary format that I am trying to wedge into xray. I found that DataArray does not handle big-endian dtypes correctly, at least on my system.

```python
import numpy as np
import xray

x = xray.DataArray(np.ones(10, dtype='>f4'))
print float(x.sum()), x.data.sum()
```

result:

4.60060298822e-40 10.0
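
One possible workaround, not mentioned in the issue (the byte-swapping step is my own suggestion), is to convert the array to native byte order before constructing the DataArray:

```python
import numpy as np
import xray

big = np.ones(10, dtype='>f4')
# hypothetical workaround: cast to the platform's native byte order first
native = big.astype(big.dtype.newbyteorder('='))
x = xray.DataArray(native)
print(float(x.sum()))  # 10.0 as expected
```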

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/489/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
96185559 MDU6SXNzdWU5NjE4NTU1OQ== 484 segfault with hdf4 file rabernat 1197350 closed 0     5 2015-07-20T23:15:06Z 2015-07-21T02:34:16Z 2015-07-21T02:34:16Z MEMBER      

I am trying to read data from the NASA MERRA reanalysis. An example file is: ftp://goldsmr3.sci.gsfc.nasa.gov/data/s4pa/MERRA/MAI3CPASM.5.2.0/2014/01/MERRA300.prod.assim.inst3_3d_asm_Cp.20140101.hdf The file format is hdf4 (NOT hdf5). (full file specification)

This file can be read by netCDF4.Dataset

```python
from netCDF4 import Dataset

fname = 'MERRA300.prod.assim.inst3_3d_asm_Cp.20140101.hdf'
nc = Dataset(fname)
nc.variables['SLP'][0]
```

No errors

However, with xray

```python
import xray

ds = xray.open_dataset(fname)
```

I get a segfault.

Is this behavior unique to my system? Or is this a reproducible bug?

Note: I am not using anaconda's netCDF package, because it does not have hdf4 file support. I had my sysadmin build custom netCDF and netCDF4-python installations for us.
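
Until the segfault itself is tracked down, one possible stopgap (my own suggestion, not something proposed in the issue) is to read the variables with netCDF4 directly and wrap the resulting arrays by hand:

```python
import numpy as np
import xray
from netCDF4 import Dataset

fname = 'MERRA300.prod.assim.inst3_3d_asm_Cp.20140101.hdf'
nc = Dataset(fname)
slp = nc.variables['SLP']
# build the DataArray manually from the raw values and the variable's dimensions
da = xray.DataArray(np.asarray(slp[:]), dims=slp.dimensions, name='SLP')
```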

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/484/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);