issues


62 rows where repo = 13221727, type = "issue" and user = 6213168 sorted by updated_at descending

Columns: id, node_id, number, title, user, state, locked, assignee, milestone, comments, created_at, updated_at ▲, closed_at, author_association, active_lock_reason, draft, pull_request, body, reactions, performed_via_github_app, state_reason, repo, type
1678587031 I_kwDOAMm_X85kDTSX 7777 xarray minimum versions policy is more aggressive than NEP-29 crusaderky 6213168 closed 0     1 2023-04-21T14:06:15Z 2023-05-01T22:26:57Z 2023-05-01T22:26:57Z MEMBER      

What is your issue?

In #4179 / #4907, the xarray policy on minimum supported versions of dependencies was changed, with the reasoning that the previous policy (based on NEP-29) was too aggressive. Ironically, this caused xarray to drop Python 3.8 on Jan 26th (#7461), 3 months before what NEP-29 recommends (Apr 14th). This is hard to defend - and in fact it sparked discontent (see late comments in #7461).

Regardless of what policy xarray decides to use internally, it should never be more aggressive than NEP-29. The xarray documentation is also incorrect: it states "Python: 24 months (NEP-29)", which is not, in fact, what NEP-29 prescribes.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7777/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
309691307 MDU6SXNzdWUzMDk2OTEzMDc= 2028 slice using non-index coordinates crusaderky 6213168 closed 0     21 2018-03-29T09:53:33Z 2023-02-08T19:47:22Z 2022-10-03T10:38:57Z MEMBER      

It should be relatively straightforward to allow slicing on coordinates that are not backed by an IndexVariable, or in other words coordinates that are on a dimension with a different name, as long as they are 1-dimensional (unsure about the multidimensional case).

E.g. given this array:

```python
a = xarray.DataArray(
    [10, 20, 30], dims=['country'],
    coords={'country': ['US', 'Germany', 'France'],
            'currency': ('country', ['USD', 'EUR', 'EUR'])})
```
```
<xarray.DataArray (country: 3)>
array([10, 20, 30])
Coordinates:
  * country   (country) <U7 'US' 'Germany' 'France'
    currency  (country) <U3 'USD' 'EUR' 'EUR'
```

This is currently not possible:

```python
a.sel(currency='EUR')
```
```
ValueError: dimensions or multi-index levels ['currency'] do not exist
```

It should be interpreted as a shorthand for:

```python
a.sel(country=a.currency == 'EUR')
```
```
<xarray.DataArray (country: 2)>
array([20, 30])
Coordinates:
  * country   (country) <U7 'Germany' 'France'
    currency  (country) <U3 'EUR' 'EUR'
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2028/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
166441031 MDU6SXNzdWUxNjY0NDEwMzE= 907 unstack() treats string coords as objects crusaderky 6213168 closed 0     7 2016-07-19T21:33:28Z 2022-09-27T12:11:36Z 2022-09-27T12:11:35Z MEMBER      

unstack() should be smart enough to recognise that all labels in a coord are strings, and convert them to numpy strings. This is particularly relevant e.g. if you want to dump the xarray to netcdf and then read it with a non-python library.

```python
import xarray

a = xarray.DataArray(
    [[1, 2], [3, 4]], dims=['x', 'y'],
    coords={'x': ['x1', 'x2'], 'y': ['y1', 'y2']})
a
```
```
<xarray.DataArray (x: 2, y: 2)>
array([[1, 2],
       [3, 4]])
Coordinates:
  * y        (y) <U2 'y1' 'y2'
  * x        (x) <U2 'x1' 'x2'
```
```python
a.stack(s=['x', 'y']).unstack('s')
```
```
<xarray.DataArray (x: 2, y: 2)>
array([[1, 2],
       [3, 4]])
Coordinates:
  * x        (x) object 'x1' 'x2'
  * y        (y) object 'y1' 'y2'
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/907/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
264509098 MDU6SXNzdWUyNjQ1MDkwOTg= 1624 Improve documentation and error validation for set_options(arithmetic_join) crusaderky 6213168 closed 0     7 2017-10-11T09:05:49Z 2022-06-25T20:01:07Z 2022-06-25T20:01:07Z MEMBER      

The documentation for set_options laconically says:

    arithmetic_join: DataArray/Dataset alignment in binary operations. Default: 'inner'.

leaving the user to wonder what the other options are. Also, the set_options code does not perform any domain check on the possible values. By scanning the code I gathered that the valid values (and their meanings) should be the same as align(join=...), but I'd like confirmation on that...
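A minimal sketch of the kind of domain check being requested, assuming the valid values do mirror align(join=...) (the option names below are taken from align and are an assumption, not confirmed behaviour):

```python
# Hypothetical validator; assumes the legal values mirror align(join=...).
_VALID_ARITHMETIC_JOINS = ('inner', 'outer', 'left', 'right')

def _validate_arithmetic_join(value):
    # Reject anything that align(join=...) would not accept.
    if value not in _VALID_ARITHMETIC_JOINS:
        raise ValueError(
            'arithmetic_join must be one of %r, got %r'
            % (_VALID_ARITHMETIC_JOINS, value))
```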

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1624/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
309686915 MDU6SXNzdWUzMDk2ODY5MTU= 2027 square-bracket slice a Dataset with a DataArray crusaderky 6213168 open 0     4 2018-03-29T09:39:57Z 2022-04-18T03:51:25Z   MEMBER      

Given this:

```python
ds = xarray.Dataset(
    data_vars={'vote': ('pupil', [5, 7, 8]),
               'age': ('pupil', [15, 14, 16])},
    coords={'pupil': ['Alice', 'Bob', 'Charlie']})
```
```
<xarray.Dataset>
Dimensions:  (pupil: 3)
Coordinates:
  * pupil    (pupil) <U7 'Alice' 'Bob' 'Charlie'
Data variables:
    vote     (pupil) int64 5 7 8
    age      (pupil) int64 15 14 16
```

Why does this work:

```python
ds.age[ds.vote >= 6]
```
```
<xarray.DataArray 'age' (pupil: 2)>
array([14, 16])
Coordinates:
  * pupil    (pupil) <U7 'Bob' 'Charlie'
```

But this doesn't?

```python
ds[ds.vote >= 6]
```
```
KeyError: False
```

`ds.vote >= 6` is a DataArray with dims=('pupil',) and dtype=bool, so I can't think of any ambiguity in what I want to achieve.

Workaround:

```python
ds.sel(pupil=ds.vote >= 6)
```
```
<xarray.Dataset>
Dimensions:  (pupil: 2)
Coordinates:
  * pupil    (pupil) <U7 'Bob' 'Charlie'
Data variables:
    vote     (pupil) int64 7 8
    age      (pupil) int64 14 16
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2027/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
502130982 MDU6SXNzdWU1MDIxMzA5ODI= 3370 Hundreds of Sphinx errors crusaderky 6213168 closed 0     14 2019-10-03T15:17:09Z 2022-04-17T20:33:05Z 2022-04-17T20:33:05Z MEMBER      

sphinx-build emits a ton of errors that need to be polished out:

https://readthedocs.org/projects/xray/builds/ -> latest -> open last step

Options for the long term:

  • Change the "Docs" azure pipelines job to crash if there are new failures. From past experience though, this should come together with a sensible way to whitelist errors that can't be fixed; otherwise it will severely slow down development, as PRs will systematically fail on such a check.
  • Add a task to the release process where, immediately before closing a release, the maintainer manually goes through the sphinx-build log and fixes any new issues. This would be a major extra piece of work for the maintainer.

I am honestly not excited by either of the above. Alternative suggestions are welcome.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3370/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
666896781 MDU6SXNzdWU2NjY4OTY3ODE= 4279 intersphinx looks for implementation modules crusaderky 6213168 open 0     0 2020-07-28T08:55:12Z 2022-04-09T03:03:30Z   MEMBER      

This is a widespread issue caused by the pattern of defining objects in private modules and then exposing them to the final user by importing them in the top-level __init__.py, vs. how intersphinx works.

Exact same issue in different projects:

  • https://github.com/aio-libs/aiohttp/issues/3714
  • https://jira.mongodb.org/browse/MOTOR-338
  • https://github.com/tkem/cachetools/issues/178
  • https://github.com/AmphoraInc/xarray_mongodb/pull/22
  • https://github.com/jonathanslenders/asyncio-redis/issues/143

If a project

  1. uses xarray, intersphinx, and autodoc,
  2. subclasses any of the classes exposed by xarray/__init__.py and documents the new class with the :show-inheritance: flag, and
  3. starting from Sphinx 3, has any of the above classes anywhere in a type annotation,

Then Sphinx emits a warning and fails to create a hyperlink, because intersphinx uses the __module__ attribute to look up the object in objects.inv, but __module__ points to the implementation module while objects.inv points to the top-level xarray module.

Workaround

In conf.py:

```python
import xarray
xarray.DataArray.__module__ = "xarray"
```

Solution

Put the above hack in xarray/__init__.py

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4279/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
505550120 MDU6SXNzdWU1MDU1NTAxMjA= 3391 map_blocks doesn't work when dask isn't installed crusaderky 6213168 closed 0     1 2019-10-10T22:53:55Z 2021-11-24T17:25:24Z 2021-11-24T17:25:24Z MEMBER      

Iterative improvement on #3276 @dcherian

map_blocks crashes with ImportError if dask isn't installed, even if it's legal to run it on a DataArray/Dataset without any dask variables. This forces writers of extension libraries to either not use map_blocks, add dask as a strict requirement, or write a switch in their own code.

Please change the code so that it works without dask (you'll need to write a stub of dask.is_dask_collection that always returns False) and add relevant tests to be triggered in our py36-bare-minimum CI environment.
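A minimal sketch of the stub described above; the try/except import pattern is an assumption about how it would be wired in:

```python
try:
    from dask.base import is_dask_collection
except ImportError:
    def is_dask_collection(x):
        # Without dask installed, nothing can be a dask collection.
        return False
```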

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3391/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
193294569 MDU6SXNzdWUxOTMyOTQ1Njk= 1151 Scalar coords vs. concat crusaderky 6213168 open 0     11 2016-12-03T15:42:18Z 2021-07-08T17:42:18Z   MEMBER      

Why does this work:

```python
>>> import xarray
>>> a = xarray.DataArray([1, 2, 3], dims=['x'], coords={'y': 10})
>>> b = xarray.DataArray([4, 5, 6], dims=['x'])
>>> a + b
<xarray.DataArray (x: 3)>
array([5, 7, 9])
Coordinates:
    y        int64 10
```

But this doesn't?

```python
>>> xarray.concat([a, b], dim='x')
KeyError: 'y'
```

It doesn't seem coherent to me...

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1151/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
305757822 MDU6SXNzdWUzMDU3NTc4MjI= 1995 apply_ufunc support for chunks on input_core_dims crusaderky 6213168 open 0     13 2018-03-15T23:50:22Z 2021-05-17T18:59:18Z   MEMBER      

I am trying to optimize the following function:

```python
c = (a * b).sum('x', skipna=False)
```

where a and b are xarray.DataArray's, both with dimension x and both with dask backend.

I successfully obtained a 5.5x speedup with the following:

```python
@numba.guvectorize(['void(float64[:], float64[:], float64[:])'],
                   '(n),(n)->()', nopython=True, cache=True)
def mulsum(a, b, res):
    acc = 0
    for i in range(a.size):
        acc += a[i] * b[i]
    res.flat[0] = acc

c = xarray.apply_ufunc(
    mulsum, a, b,
    input_core_dims=[['x'], ['x']],
    dask='parallelized', output_dtypes=[float])
```

The problem is that this introduces a (quite problematic, in my case) constraint that a and b can't be chunked on dimension x - which is theoretically avoidable as long as the kernel function doesn't need interaction between x[i] and x[j] (e.g. it can't work for an interpolator, which would need to rely on dask ghosting).

Proposal

Add a parameter to apply_ufunc, reduce_func=None. reduce_func is a function which takes as input two parameters a, b that are the output of func. apply_ufunc will invoke it whenever there's chunking on an input_core_dim.

e.g. my use case above would simply become:

```python
c = xarray.apply_ufunc(
    mulsum, a, b,
    input_core_dims=[['x'], ['x']],
    dask='parallelized', output_dtypes=[float], reduce_func=operator.sum)
```

So if I have 2 chunks in a and b on dimension x, apply_ufunc will internally do

```python
c1 = mulsum(a1, b1)
c2 = mulsum(a2, b2)
c = operator.sum(c1, c2)
```

Note that reduce_func will be invoked exclusively in presence of dask='parallelized' and when there's chunking on one or more of the input_core_dims. If reduce_func is left to None, apply_ufunc will keep crashing like it does now.
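A minimal pure-numpy illustration of the proposed semantics; apply_with_reduce is a made-up helper standing in for what apply_ufunc would do internally, not xarray API:

```python
import numpy as np

def apply_with_reduce(func, reduce_func, a_chunks, b_chunks):
    # Apply the kernel to each pair of corresponding chunks, then fold
    # the partial results together with reduce_func.
    partials = [func(a, b) for a, b in zip(a_chunks, b_chunks)]
    result = partials[0]
    for p in partials[1:]:
        result = reduce_func(result, p)
    return result

# Two chunks along the core dim, reduced with addition:
a_chunks = [np.array([1.0, 2.0]), np.array([3.0])]
b_chunks = [np.array([4.0, 5.0]), np.array([6.0])]
assert apply_with_reduce(np.dot, np.add, a_chunks, b_chunks) == 32.0
```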

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1995/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
417356439 MDU6SXNzdWU0MTczNTY0Mzk= 2801 NaN-sized chunks crusaderky 6213168 open 0     2 2019-03-05T15:30:14Z 2021-04-24T02:41:34Z   MEMBER      

It would be nice to have support for NaN-sized dask chunks, e.g. x[x > 2]. There are two problems:

  1. x[x > 2] silently resolves the dask graph. It definitely shouldn't. There needs to be some discussion on what needs to happen to indices on the NaN-sized dimension; I can think of 3 options:
     • silently drop any index that would become undefined
     • drop any index that would become undefined and issue a warning
     • hard crash if there is any index that would become undefined
     A fourth option would be to redesign IndexVariable so that it can contain dask data (probably much more complicated than the 3 above). The above design decision is anyway for when there is an index; dims without indices should just work.

  2. This crashes:

     ```
     >>> a = xarray.DataArray([1, 2, 3, 4]).chunk(2)
     >>> xarray.DataArray(a.data[a.data > 2]).compute()
     ValueError: replacement data must match the Variable's shape
     ```

     I didn't investigate but I suspect it should be trivial to fix. I'm not sure why there is a check at all? Any such health check should be in dask only IMHO.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2801/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
311573817 MDU6SXNzdWUzMTE1NzM4MTc= 2039 open_mfdataset: skip loading for indexes and coordinates from all but the first file crusaderky 6213168 open 0     1 2018-04-05T11:32:02Z 2021-01-27T17:49:21Z   MEMBER      

This is a follow-up from #1521.

When invoking open_mfdataset, very frequently the user knows in advance that all of his coords that aren't on the concat_dim are already aligned, and may be willing to blindly trust that assumption in exchange for a huge performance boost.

My production data: 200x NetCDF files on a not very performant NFS file system, concatenated on the "scenario" dimension:

```python
xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario')
```
```
<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
    currency     (instr_id) object 'ZAR' 'EUR' 'EUR' 'EUR' 'EUR' 'EUR' 'GBP' ...
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    type         (instr_id) object 'American' 'Bond Future' 'Bond Future' ...
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 19.6 s, sys: 981 ms, total: 20.6 s
Wall time: 24.4 s
```

If I skip loading and comparing the non-index coords from all 200 files:

```python
xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario', coords='all')
```
```
<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    currency     (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
    type         (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 12.7 s, sys: 305 ms, total: 13 s
Wall time: 14.8 s
```

If I skip loading and comparing also the index coords from all 200 files:

```python
cube = xarray.open_mfdataset(
    sh.resolve_env(f'{dynamic}/mtf/{cubename}/nc/cube.*.nc'),
    engine='h5netcdf', concat_dim='scenario',
    drop_variables=['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type'])
```
```
<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Dimensions without coordinates: attribute, fx_id, instr_id, timestep
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 7.31 s, sys: 61 ms, total: 7.37 s
Wall time: 9.05 s
```

Proposed design

Add a new optional parameter to open_mfdataset, assume_aligned=None. It can be set to a list of variable names or "all", and requires concat_dim to be explicitly set. It causes open_mfdataset to use the first occurrence of every variable and blindly skip loading the subsequent ones.

Algorithm

  1. Perform the first invocation to the underlying open_dataset like it happens now
  2. if assume_aligned is not None: for each new NetCDF file, figure out which variables need to be aligned & compared (as opposed to concatenated), and add them to a drop_variables list.
  3. if assume_aligned != "all": drop_variables &= assume_aligned
  4. Pass the increasingly long drop_variables list to the underlying open_dataset
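A rough sketch of the algorithm above; open_mfdataset_sketch and its signature are illustrative only, not the real open_mfdataset:

```python
import xarray

def open_mfdataset_sketch(paths, concat_dim, assume_aligned=None, **kwargs):
    # Step 1: open the first file normally.
    datasets = []
    drop_variables = set()
    for i, path in enumerate(paths):
        ds = xarray.open_dataset(
            path, drop_variables=sorted(drop_variables), **kwargs)
        datasets.append(ds)
        if i == 0 and assume_aligned is not None:
            # Step 2: variables without the concat dim would normally be
            # loaded and compared; skip them in all subsequent files.
            skip = {name for name, var in ds.variables.items()
                    if concat_dim not in var.dims}
            # Step 3: restrict to the user-provided list unless "all".
            if assume_aligned != 'all':
                skip &= set(assume_aligned)
            # Step 4: the growing drop list feeds the next open_dataset.
            drop_variables |= skip
    return xarray.concat(datasets, dim=concat_dim)
```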
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2039/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
272004812 MDU6SXNzdWUyNzIwMDQ4MTI= 1699 apply_ufunc(dask='parallelized') output_dtypes for datasets crusaderky 6213168 open 0     8 2017-11-07T22:18:23Z 2020-04-06T15:31:17Z   MEMBER      

When a Dataset has variables with different dtypes, there's no way to tell apply_ufunc that the same function applied to different variables will produce different dtypes:

```python
ds1 = xarray.Dataset(data_vars={'a': ('x', [1, 2]), 'b': ('x', [3.0, 4.5])}).chunk()
ds2 = xarray.apply_ufunc(lambda x: x + 1, ds1, dask='parallelized', output_dtypes=[float])
ds2
```
```
<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    a        (x) float64 dask.array<shape=(2,), chunksize=(2,)>
    b        (x) float64 dask.array<shape=(2,), chunksize=(2,)>
```
```python
ds2.compute()
```
```
<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 2 3
    b        (x) float64 4.0 5.5
```

Proposed solution

When the output is a dataset, apply_ufunc could accept either output_dtypes=[t] (if all output variables will have the same dtype) or output_dtypes=[{var1: t1, var2: t2, ...}]. In the example above, it would be output_dtypes=[{'a': int, 'b': float}].

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1699/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
502082831 MDU6SXNzdWU1MDIwODI4MzE= 3369 Define a process to test the readthedocs CI before merging into master crusaderky 6213168 closed 0     3 2019-10-03T13:56:02Z 2020-01-22T15:40:34Z 2020-01-22T15:40:33Z MEMBER      

This is an offshoot of #3358.

The readthedocs CI has a bad habit of failing even after the Azure Pipelines job "Docs" has succeeded.

After major changes that impact the documentation, and before merging everything into master, it would be advisable to explicitly verify that RTD builds correctly.

So far I tried to:

  1. create my own readthedocs project, https://readthedocs.org/projects/crusaderky-xarray/
  2. point it to my fork https://github.com/crusaderky/xarray/
  3. enable build for the branch I want to merge

This is currently failing because of an issue with versioneer, which incorrectly sets xarray.__version__ to 0+untagged.111.g6d60700. This in turn causes a failure in a minimum version check in pandas.DataFrame.to_xarray() on pandas>=0.25.

In the master RTD project https://readthedocs.org/projects/xray/, I can instead read xarray: 0.13.0+20.gdd2b803a.

So far the only workaround I could find was to downgrade pandas to 0.24 in ci/requirements/doc.yml.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3369/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
510915725 MDU6SXNzdWU1MTA5MTU3MjU= 3434 v0.14.1 Release crusaderky 6213168 closed 0     18 2019-10-22T21:08:15Z 2019-11-19T23:44:52Z 2019-11-19T23:44:52Z MEMBER      

I think with the multiple recent breakages we've just had due to dependency upgrades, we should push out a patch release with some haste.

Please comment/add/object

Must have

  • [x] numpy 1.18 support #3409
  • [x] pseudonetcdf 3.1.0 support #3409, #3420
  • [x] require cftime != 1.0.4 #3463
  • [x] groupby reduce regression fix #3403
  • [x] pandas master support #3440

Nice to have

  • [x] ellipsis (...) work #1081, #3414, #3418, #3421, #3423, #3424
  • [x] HTML repr #3425 (really mouth-watering, but I'm unsure about how far it is from completion)
  • [x] groupby drop nan groups #3406
  • [x] deprecate allow_lazy #3435
  • [x] __dask_tokenize__ #3446
  • [x] dask name equality #3453
  • [x] Leave empty slot when not using accessors #3531
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3434/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
503983776 MDU6SXNzdWU1MDM5ODM3NzY= 3382 Improve indexing performance benchmarks crusaderky 6213168 open 0     0 2019-10-08T11:20:39Z 2019-11-14T15:52:33Z   MEMBER      

As discussed in #3375 - FYI @jhamman

asv_bench/benchmarks/indexing.py is currently missing some key use cases:

  • All tests in the above module use arrays with 2~6 million points. While this is important to spot any case where the numpy underlying functions start being unnecessarily called more than once, it also means any performance improvement or degradation in any of the pure-Python code will be completely drowned out. All tests should be run twice, once with the current nx = 3000; ny = 2000; nt = 1000 and again with nx = 15; ny = 10; nt = 5.
  • DataArray slicing (sel, isel, and square brackets)
  • Slicing when there are no IndexVariables (verify that we're not creating dummy variables, doing a full scan on them, and then discarding them)
  • other?
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3382/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
329251342 MDU6SXNzdWUzMjkyNTEzNDI= 2214 Simplify graph of DataArray.chunk() crusaderky 6213168 closed 0     2 2018-06-04T23:30:19Z 2019-11-10T04:34:58Z 2019-11-10T04:34:58Z MEMBER      

```python
>>> dict(xarray.DataArray([1, 2]).chunk().__dask_graph__())
{('xarray-<this-array>-7e885b8e329090da3fe58d4483c0cf8b', 0):
     (<function dask.array.core.getter(a, b, asarray=True, lock=None)>,
      'xarray-<this-array>-7e885b8e329090da3fe58d4483c0cf8b',
      (slice(0, 2, None),)),
 'xarray-<this-array>-7e885b8e329090da3fe58d4483c0cf8b':
     ImplicitToExplicitIndexingAdapter(array=NumpyIndexingAdapter(array=array([1, 2])))}
```

There is no reason why this should be any more complicated than da.from_array:

```python
>>> dict(da.from_array(np.array([1, 2]), chunks=2).__dask_graph__())
{('array-de932becc43e72c010bc91ffefe42af1', 0):
     (<function dask.array.core.getter(a, b, asarray=True, lock=None)>,
      'array-original-de932becc43e72c010bc91ffefe42af1',
      (slice(0, 2, None),)),
 'array-original-de932becc43e72c010bc91ffefe42af1': array([1, 2])}
```

da.from_array itself should be simplified - see twin issue https://github.com/dask/dask/issues/3556

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2214/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
272002705 MDU6SXNzdWUyNzIwMDI3MDU= 1698 apply_ufunc(dask='parallelized') to infer output_dtypes crusaderky 6213168 open 0     3 2017-11-07T22:11:11Z 2019-10-22T08:33:38Z   MEMBER      

If one doesn't provide the dtype parameter to dask.map_blocks(), it automatically infers it by running the kernel on trivial dummy data. It should be straightforward to make xarray.apply_ufunc(dask='parallelized') use the same functionality if the output_dtypes parameter is omitted.
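A minimal sketch of the inference trick described above; infer_output_dtype is a hypothetical helper, not dask or xarray API:

```python
import numpy as np

def infer_output_dtype(func, input_dtypes):
    # Mimic dask.array.map_blocks: call the kernel on tiny dummy
    # arrays and read the dtype off the result.
    dummies = [np.ones((1,), dtype=dt) for dt in input_dtypes]
    return np.asarray(func(*dummies)).dtype

assert infer_output_dtype(lambda x: x + 0.5, ['int64']) == np.float64
```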

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1698/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
506885041 MDU6SXNzdWU1MDY4ODUwNDE= 3397 "How Do I..." formatting issues crusaderky 6213168 closed 0     4 2019-10-14T21:32:27Z 2019-10-16T21:41:06Z 2019-10-16T21:41:06Z MEMBER      

@dcherian The new page http://xarray.pydata.org/en/stable/howdoi.html (#3357) is somewhat painful to read on readthedocs. The table goes out of the screen and one is forced to scroll left and right non stop.

Maybe a better alternative could be with Sphinx definitions syntax (which allows for automatic reflowing)?

```rst
How do I ...
============

Add variables from other datasets to my dataset?
    :py:meth:`Dataset.merge`
```

(that's a 4-space indent)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3397/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
481250429 MDU6SXNzdWU0ODEyNTA0Mjk= 3222 Minimum versions for optional libraries crusaderky 6213168 closed 0     12 2019-08-15T17:18:16Z 2019-10-08T21:23:47Z 2019-10-08T21:23:47Z MEMBER      

In CI there are:

  • tests for all the latest versions of all libraries, mandatory and optional (py36, py37, py37-windows)
  • tests for the minimum versions of the mandatory libraries only (py35-min)

There are no tests for legacy versions of the optional libraries.

Today I tried downgrading dask in the py37 environment to dask=1.1.2, which is 6 months old...

...it's a bloodbath. 383 errors of the most diverse kind.

In the codebase I found mentions to much older minimum versions: installing.rst mentions dask >=0.16.1, and Dataset.chunk() even asks for dask>=0.9.

I think we should add CI tests for old versions of the optional dependencies. What policy should we adopt when we find an incompatibility? How old should a library be before we stop fixing bugs and just require a newer version? I personally would go for an aggressive 6 months' worth of backwards compatibility; less if the time it takes to fix the issues is excessive. The tests should run on py36 because py35 builds are becoming very scarce in anaconda.

This has the outlook of being an exercise in extreme frustration. I'm afraid I personally hold zero interest towards packages older than the latest available in the anaconda official repo, so I'm not volunteering for this one (sorry).

I'd like to hear other people's opinions and/or offers of self-immolation... :)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3222/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
485708282 MDU6SXNzdWU0ODU3MDgyODI= 3268 Stateful user-defined accessors crusaderky 6213168 open 0     15 2019-08-27T09:54:28Z 2019-10-08T11:13:25Z   MEMBER      

If anybody decorates a stateful class with @register_dataarray_accessor or @register_dataset_accessor, the instance will lose its state on any method that invokes _to_temp_dataset, as well as on a shallow copy.

```python
In [1]: @xarray.register_dataarray_accessor('foo')
   ...: class Foo:
   ...:     def __init__(self, obj):
   ...:         self.obj = obj
   ...:         self.x = 1
   ...:

In [2]: a = xarray.DataArray()

In [3]: a.foo.x
Out[3]: 1

In [4]: a.foo.x = 2

In [5]: a.foo.x
Out[5]: 2

In [6]: a.roll().foo.x
Out[6]: 1

In [7]: a.copy(deep=False).foo.x
Out[7]: 1
```

While in the case of _to_temp_dataset it could be possible to spend (substantial) effort to retain the state, in the case of copy() it's impossible without modifying the accessor duck API, as one would need to tamper with the accessor instance in place and modify the pointer back to the DataArray/Dataset.

This issue is so glaring that it makes me strongly suspect that nobody saves any state in accessor classes. This kind of use would also be problematic in practical terms, as the accessor object would have a hard time realising when its own state is no longer coherent with the referenced DataArray/Dataset.

This design also carries the problem that it introduces a circular reference in the DataArray/Dataset. This means that, after someone invokes an accessor method on his DataArray/Dataset, the whole object - including the numpy buffers! - won't be instantly collected when it's dereferenced by the user; it will instead have to wait for the next gc pass. This could cause huge increases in RAM usage overnight in a user application, which would be very hard to trace back to a change that just added a custom method.

Finally, with https://github.com/pydata/xarray/pull/3250/, this statefulness forces us to increase the RAM usage of all datasets and dataarrays by an extra slot, for all users, even if this feature is quite niche.

Proposed solution

Get rid of accessor caching altogether, and just recreate the accessor object from scratch every time it is invoked. In the documentation, clarify that the __init__ method should not perform anything computationally intensive.
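A minimal sketch of the proposed cache-free behaviour; the descriptor below is illustrative, not the actual accessor implementation in xarray:

```python
class UncachedAccessor:
    # Descriptor that recreates the accessor on every attribute access,
    # instead of caching one instance per DataArray/Dataset.
    def __init__(self, name, accessor_cls):
        self._name = name
        self._accessor_cls = accessor_cls

    def __get__(self, obj, cls):
        if obj is None:
            # Accessed on the class: return the accessor type itself.
            return self._accessor_cls
        # No caching: __init__ runs again on each access, so it must
        # stay cheap -- exactly the documentation caveat proposed above.
        return self._accessor_cls(obj)
```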

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3268/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
470714103 MDU6SXNzdWU0NzA3MTQxMDM= 3154 pynio causes dependency conflicts in py36 CI build crusaderky 6213168 closed 0     9 2019-07-20T21:00:43Z 2019-10-03T15:22:17Z 2019-10-03T15:22:17Z MEMBER      

On Saturday night, all Python 3.6 CI builds started failing. Python 3.7 is unaffected. See https://dev.azure.com/xarray/xarray/_build/results?buildId=362&view=logs

MacOSX py36:

```
UnsatisfiableError: The following specifications were found to be in conflict:
  - pynio
  - python=3.6
  - rasterio
```

Linux py36:

```
UnsatisfiableError: The following specifications were found to be in conflict:
  - cfgrib[version='>=0.9.2']
  - h5netcdf
  - pynio
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3154/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
501461397 MDU6SXNzdWU1MDE0NjEzOTc= 3366 CI offline? crusaderky 6213168 closed 0     2 2019-10-02T12:35:00Z 2019-10-02T17:32:03Z 2019-10-02T17:32:03Z MEMBER      

Azure pipelines is not being triggered by PRs this morning. See https://github.com/pydata/xarray/pull/3358 and https://github.com/pydata/xarray/pull/3365.

Last run was 12 hours ago.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3366/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
478343417 MDU6SXNzdWU0NzgzNDM0MTc= 3191 DataArray.chunk() from sparse array produces malformed dask array crusaderky 6213168 closed 0     1 2019-08-08T09:08:56Z 2019-08-12T21:02:24Z 2019-08-12T21:02:24Z MEMBER      

#3117 by @nvictus introduces support for sparse in plain xarray.

dask already supports it.

Running with:

  • xarray git head
  • dask 2.2.0
  • numpy 1.16.4
  • sparse 0.7.0
  • NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1

```python
>>> import numpy, sparse, xarray, dask.array
>>> s = sparse.COO(numpy.array([1, 2]))
>>> da1 = dask.array.from_array(s)
>>> da1._meta
<COO: shape=(0,), dtype=int64, nnz=0, fill_value=0>
>>> da1.compute()
<COO: shape=(2,), dtype=int64, nnz=2, fill_value=0>
>>> da2 = xarray.DataArray(s).chunk().data
>>> da2._meta
array([], dtype=int64)  # Wrong
>>> da2.compute()
RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3191/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
202423683 MDU6SXNzdWUyMDI0MjM2ODM= 1224 fast weighted sum crusaderky 6213168 closed 0     5 2017-01-23T00:29:19Z 2019-08-09T08:36:11Z 2019-08-09T08:36:11Z MEMBER      

In my project I'm struggling with weighted sums of 2000-4000 dask-based xarrays. The time to reach the final dask-based array, the size of the final dask dict, and the time to compute the actual result are horrendous.

So I wrote the below which - as laborious as it may look - gives a performance boost nothing short of miraculous. At the bottom you'll find some benchmarks as well.

https://gist.github.com/crusaderky/62832a5ffc72ccb3e0954021b0996fdf

In my project, this deflated the size of the final dask dict from 5.2 million keys to 3.3 million and cut a 30% from the time required to define it.

I think it's generic enough to be a good addition to the core xarray module. Impressions?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1224/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
466750687 MDU6SXNzdWU0NjY3NTA2ODc= 3092 black formatting crusaderky 6213168 closed 0     14 2019-07-11T08:43:55Z 2019-08-08T22:34:53Z 2019-08-08T22:34:53Z MEMBER      

I, like many others, have irreversibly fallen in love with black. Can we apply it to the existing codebase and as an enforced CI test? The only (big) problem is that developers will need to manually apply it to any open branches and then merge from master - and even then, merging likely won't be trivial. How did the dask project tackle the issue?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3092/reactions",
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
475599589 MDU6SXNzdWU0NzU1OTk1ODk= 3174 CI failure downloading external data crusaderky 6213168 closed 0     2 2019-08-01T10:21:36Z 2019-08-07T08:41:13Z 2019-08-07T08:41:13Z MEMBER      

The 'Docs' ci project is failing because http://naciscdn.org is unresponsive:

https://dev.azure.com/xarray/xarray/_build/results?buildId=408&view=logs&jobId=7e620c85-24a8-5ffa-8b1f-642bc9b1fc36

Excerpt:

```
/usr/share/miniconda/envs/xarray-tests/lib/python3.7/site-packages/cartopy/io/__init__.py:260: DownloadWarning: Downloading: http://naciscdn.org/naturalearth/110m/physical/ne_110m_coastline.zip
  warnings.warn('Downloading: {}'.format(url), DownloadWarning)

Exception occurred:
  File "/usr/share/miniconda/envs/xarray-tests/lib/python3.7/urllib/request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
The full traceback has been saved in /tmp/sphinx-err-nq73diee.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at https://github.com/sphinx-doc/sphinx/issues. Thanks!

##[error]Bash exited with code '2'.
##[section]Finishing: Build HTML docs
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3174/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
466815556 MDU6SXNzdWU0NjY4MTU1NTY= 3094 REGRESSION: copy(deep=True) casts unicode indices to object crusaderky 6213168 closed 0     3 2019-07-11T10:46:28Z 2019-08-02T14:02:50Z 2019-08-02T14:02:50Z MEMBER      

Dataset.copy(deep=True) and DataArray.copy (deep=True/False) accidentally cast IndexVariable's with dtype='<U*' to object. Same applies to copy.copy() and copy.deepcopy().

This is a regression in xarray >= 0.12.2. xarray 0.12.1 and earlier are unaffected.

```python
In [1]: ds = xarray.Dataset(
   ...:     coords={'x': ['foo'], 'y': ('x', ['bar'])},
   ...:     data_vars={'z': ('x', ['baz'])})

In [2]: ds
Out[2]:
<xarray.Dataset>
Dimensions:  (x: 1)
Coordinates:
  * x        (x) <U3 'foo'
    y        (x) <U3 'bar'
Data variables:
    z        (x) <U3 'baz'

In [3]: ds.copy()
Out[3]:
<xarray.Dataset>
Dimensions:  (x: 1)
Coordinates:
  * x        (x) <U3 'foo'
    y        (x) <U3 'bar'
Data variables:
    z        (x) <U3 'baz'

In [4]: ds.copy(deep=True)
Out[4]:
<xarray.Dataset>
Dimensions:  (x: 1)
Coordinates:
  * x        (x) object 'foo'
    y        (x) <U3 'bar'
Data variables:
    z        (x) <U3 'baz'

In [5]: ds.z
Out[5]:
<xarray.DataArray 'z' (x: 1)>
array(['baz'], dtype='<U3')
Coordinates:
  * x        (x) <U3 'foo'
    y        (x) <U3 'bar'

In [6]: ds.z.copy()
Out[6]:
<xarray.DataArray 'z' (x: 1)>
array(['baz'], dtype='<U3')
Coordinates:
  * x        (x) object 'foo'
    y        (x) <U3 'bar'

In [7]: ds.z.copy(deep=True)
Out[7]:
<xarray.DataArray 'z' (x: 1)>
array(['baz'], dtype='<U3')
Coordinates:
  * x        (x) object 'foo'
    y        (x) <U3 'bar'
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3094/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
475244610 MDU6SXNzdWU0NzUyNDQ2MTA= 3171 distributed.Client.compute fails on DataArray crusaderky 6213168 closed 0     2 2019-07-31T16:33:01Z 2019-08-01T21:43:11Z 2019-08-01T21:43:11Z MEMBER      

As of:

  • dask 2.1.0
  • distributed 2.1.0
  • xarray 0.12.1 or git head (didn't try older versions)

```python
>>> import xarray
>>> import distributed
>>> client = distributed.Client(set_as_default=False)
>>> ds = xarray.Dataset({'d': ('x', [1, 2])}).chunk(1)
>>> client.compute(ds).result()
<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    d        (x) int64 1 2

>>> client.compute(ds.d).result()
distributed.worker - WARNING - Compute Failed
Function:  _dask_finalize
args:      ([[array([1]), array([2])]], <function Dataset._dask_postcompute at 0x316a1db70>,
            ([(True, <this-array>, (<function Variable._dask_finalize at 0x3168f7f28>,
            (<function finalize at 0x1166bb8c8>, (), ('x',), OrderedDict(), None)))],
            set(), {'x': 2}, None, None, None, None), 'd')
kwargs:    {}
Exception: KeyError(<this-array>)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-2dbfe1b2ff17> in <module>
----> 1 client.compute(ds.d).result()

/anaconda3/lib/python3.7/site-packages/distributed/client.py in result(self, timeout)
    226         result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
    227         if self.status == "error":
--> 228             six.reraise(*result)
    229         elif self.status == "cancelled":
    230             raise result

/anaconda3/lib/python3.7/site-packages/six.py in reraise(tp, value, tb)
    690             value = tp()
    691         if value.__traceback__ is not tb:
--> 692             raise value.with_traceback(tb)
    693         raise value
    694     finally:

~/PycharmProjects/xarray/xarray/core/dataarray.py in _dask_finalize()
    706     def _dask_finalize(results, func, args, name):
    707         ds = func(results, *args)
--> 708         variable = ds._variables.pop(_THIS_ARRAY)
    709         coords = ds._variables
    710         return DataArray(variable, coords, name=name, fastpath=True)
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3171/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
252548859 MDU6SXNzdWUyNTI1NDg4NTk= 1524 (trivial) xarray.quantile silently resolves dask arrays crusaderky 6213168 closed 0     9 2017-08-24T09:54:11Z 2019-07-23T00:18:06Z 2017-08-28T17:31:57Z MEMBER      

In variable.py, line 1116, you're missing a raise statement:

```python
if isinstance(self.data, dask_array_type):
    TypeError("quantile does not work for arrays stored as dask "
              "arrays. Load the data via .compute() or .load() prior "
              "to calling this method.")
```
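For reference, the corrected statement just adds the missing raise:

```python
if isinstance(self.data, dask_array_type):
    raise TypeError("quantile does not work for arrays stored as dask "
                    "arrays. Load the data via .compute() or .load() "
                    "prior to calling this method.")
```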

Currently looking into extending dask.percentile() to support more than 1D arrays, and then use it in xarray too.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1524/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
465984161 MDU6SXNzdWU0NjU5ODQxNjE= 3089 Python 3.5.0-3.5.1 support crusaderky 6213168 closed 0     5 2019-07-09T21:04:28Z 2019-07-13T21:58:31Z 2019-07-13T21:58:31Z MEMBER      

Python 3.5.0 has gone out of the conda-forge repository. 3.5.1 is still there... for now. The anaconda repository starts directly from 3.5.4. 3.5.0 and 3.5.1 are a colossal pain in the back for typing support. Is this a good time to increase the requirement to >= 3.5.2? I honestly can't think how anybody could be unable to upgrade to the latest available 3.5 with minimal effort...

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3089/reactions",
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
264517839 MDU6SXNzdWUyNjQ1MTc4Mzk= 1625 Option for arithmetics to ignore nans created by alignment crusaderky 6213168 closed 0     3 2017-10-11T09:33:34Z 2019-07-11T09:48:07Z 2019-07-11T09:48:07Z MEMBER      

Can anybody tell me if there is anybody who benefits from this behaviour? I can't think of any good use cases.

```python
wallet = xarray.DataArray([50, 70], dims=['currency'], coords={'currency': ['EUR', 'USD']})
restaurant_bill = xarray.DataArray([30], dims=['currency'], coords={'currency': ['USD']})
with xarray.set_options(arithmetic_join="outer"):
    print(wallet - restaurant_bill)
```
```
<xarray.DataArray (currency: 2)>
array([ nan,  40.])
Coordinates:
  * currency  (currency) object 'EUR' 'USD'
```

While it is fairly clear why it can be desirable to have nan + not nan = nan as a default in arithmetic when the nan is already present in one of the input arrays, when the nan is introduced as part of an automatic align things become much less intuitive.

Proposal:

  • add a parameter to xarray.align, fillvalue=numpy.nan, which determines what will appear in the newly created array elements
  • change __add__, __sub__ etc. to invoke xarray.align(fillvalue=0)
  • change __mul__, __truediv__ etc. to invoke xarray.align(fillvalue=1)

In theory the setting could be left as an opt-in as set_options(arithmetic_align_fillvalue='neutral'), yet I wonder who would actually want the current behaviour?
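A sketch of how the proposal would look from user code; the fillvalue parameter is the proposed, hypothetical API, not something align currently accepts:

```python
# Hypothetical: align on the outer join, padding with the neutral
# element of the upcoming operation instead of NaN.
wallet, restaurant_bill = xarray.align(
    wallet, restaurant_bill, join='outer', fillvalue=0)
print(wallet - restaurant_bill)
# Expected:
# <xarray.DataArray (currency: 2)>
# array([50., 40.])
# Coordinates:
#   * currency  (currency) object 'EUR' 'USD'
```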

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1625/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
341355638 MDU6SXNzdWUzNDEzNTU2Mzg= 2289 DataArray.to_csv() crusaderky 6213168 closed 0     6 2018-07-15T21:56:20Z 2019-03-12T15:01:18Z 2019-03-12T15:01:18Z MEMBER      

I'm using xarray to aggregate 38 GB worth of NetCDF data into a bunch of CSV reports. I have two problems:

  1. The reports are 500,000 rows by 2,000 columns. Before somebody says "if you're using CSV for this size of data you're doing it wrong" - yes, I know, but it was the only way to make the data accessible to a bunch of people that only know how to use Excel and VBA. :tired_face: The sheer size of the reports means that (1) it's unsavory to keep the whole thing in RAM (2) pandas to_csv will take ages to complete (as it's single-threaded). The slowness is compounded by the fact that I have to compress everything with gzip.
  2. I have to produce up to 40 reports from the exact same NetCDF files. I use dask to perform the computation, and different reports share a large amount of intermediate graph nodes. So I need to do everything in a single invocation to dask.compute() to allow the dask scheduler to de-duplicate the nodes.

To solve both problems, I wrote a new function: http://xarray-extras.readthedocs.io/en/latest/api/csv.html

And now my high level wrapper code looks like this:

```python
# DataSet from 200 .nc files, with a total of 500000 points on the 'row' dimension
nc = xarray.open_mfdataset('inputs.*.nc')

reports = [
    # DataArrays with shape (500000, 2000), with the rows split in 200 chunks
    gen_report0(nc),
    gen_report1(nc),
    ...
    gen_report39(nc),
]
futures = [
    # dask.delayed objects
    to_csv(reports[0], 'report0.csv.gz', compression='gzip'),
    to_csv(reports[1], 'report1.csv.gz', compression='gzip'),
    ...
    to_csv(reports[39], 'report39.csv.gz', compression='gzip'),
]
dask.compute(futures)
```

The function is currently production quality in xarray-extras, but it would be very easy to refactor it as a method of xarray.DataArray in the main library.

Opinions?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2289/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
166439490 MDU6SXNzdWUxNjY0Mzk0OTA= 906 unstack() sorts data alphabetically crusaderky 6213168 closed 0     14 2016-07-19T21:25:26Z 2019-02-23T12:47:00Z 2019-02-23T12:47:00Z MEMBER      

DataArray.unstack() sorts the data alphabetically by label. Besides being poor for performance, this is very problematic whenever the order matters, and the labels are not in alphabetical order to begin with.

```python
import xarray
import pandas

index = [
    ['x1', 'first' ],
    ['x1', 'second'],
    ['x1', 'third' ],
    ['x1', 'fourth'],
    ['x0', 'first' ],
    ['x0', 'second'],
    ['x0', 'third' ],
    ['x0', 'fourth'],
]
index = pandas.MultiIndex.from_tuples(index, names=['x', 'count'])
s = pandas.Series(list(range(8)), index)
a = xarray.DataArray(s)
a
```
```
<xarray.DataArray (dim_0: 8)>
array([0, 1, 2, 3, 4, 5, 6, 7], dtype=int64)
Coordinates:
  * dim_0    (dim_0) object ('x1', 'first') ('x1', 'second') ('x1', 'third') ...
```
```python
a.unstack('dim_0')
```
```
<xarray.DataArray (x: 2, count: 4)>
array([[4, 7, 5, 6],
       [0, 3, 1, 2]], dtype=int64)
Coordinates:
  * x        (x) object 'x0' 'x1'
  * count    (count) object 'first' 'fourth' 'second' 'third'
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/906/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
168469112 MDU6SXNzdWUxNjg0NjkxMTI= 926 stack() on dask array produces inefficient chunking crusaderky 6213168 closed 0     4 2016-07-30T14:12:34Z 2019-02-01T16:04:43Z 2019-02-01T16:04:43Z MEMBER      

When the stack() method is used on an xarray with dask backend, one would expect that every output chunk is produced by exactly 1 input chunk.

This is not the case, as stack() actually produces an extremely fragmented dask array: https://gist.github.com/crusaderky/07991681d49117bfbef7a8870e3cba67

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/926/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
193294729 MDU6SXNzdWUxOTMyOTQ3Mjk= 1152 Scalar coords seep into index coords crusaderky 6213168 closed 0     8 2016-12-03T15:43:53Z 2019-02-01T16:02:12Z 2019-02-01T16:02:12Z MEMBER      

Is this by design? I can't put any sense in it.

```python
>>> a = xarray.DataArray([1, 2, 3], dims=['x'], coords={'x': [1, 2, 3], 'y': 10})
>>> a.coords['x']
<xarray.DataArray 'x' (x: 3)>
array([1, 2, 3])
Coordinates:
  * x        (x) int64 1 2 3
    y        int64 10
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1152/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
172291585 MDU6SXNzdWUxNzIyOTE1ODU= 979 align() should align chunks crusaderky 6213168 closed 0     4 2016-08-20T21:25:01Z 2019-01-24T17:19:30Z 2019-01-24T17:19:30Z MEMBER      

In the xarray docs I read

With the current version of dask, there is no automatic alignment of chunks when performing operations between dask arrays with different chunk sizes. If your computation involves multiple dask arrays with different chunks, you may need to explicitly rechunk each array to ensure compatibility.

While chunk auto-alignment could be done within the dask library, that would be limited to arrays with the same dimensionality and same dims order. For example it would not be possible to have a dask library call to align the chunks on xarrays with the following dims:

  • (time, latitude, longitude)
  • (time)
  • (longitude, latitude)

even if it makes perfect sense in xarray.

I think xarray.align() should take care of it automatically.

A safe algorithm would be to always scale down the chunksize when in conflict. This would prevent having chunks larger than expected, and should minimise (in a greedy way) the number of operations. It's also a good idea on dask.distributed, where merging two chunks could cause one of them to travel on the network - which is very expensive.

e.g. to reconcile the chunksizes a: (5, 10, 6) and b: (5, 7, 9), the algorithm would rechunk both arrays to (5, 7, 3, 6).
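A minimal sketch of that greedy boundary-merging step (common_chunks is a hypothetical helper):

```python
def common_chunks(*chunk_tuples):
    # Merge the chunk boundaries of all inputs, so every input chunk
    # is only ever split, never enlarged.
    boundaries = set()
    for chunks in chunk_tuples:
        edge = 0
        for size in chunks:
            edge += size
            boundaries.add(edge)
    edges = sorted(boundaries)
    return tuple(b - a for a, b in zip([0] + edges, edges))

assert common_chunks((5, 10, 6), (5, 7, 9)) == (5, 7, 3, 6)
```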

Finally, when served with a numpy-based array and a dask-based array, align() should convert the numpy array to dask. The critical use case that would benefit from this behaviour is when align() is invoked inside a broadcast() between a tiny constant you just loaded from csv/pandas/pure python list/whatever - e.g. dims=(time, ) shape=(100, ) - and a huge dask-backed array e.g. dims=(time, scenario) shape=(100, 2**30) chunks=(25, 2**20).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/979/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
296927704 MDU6SXNzdWUyOTY5Mjc3MDQ= 1909 Failure in test_cross_engine_read_write_netcdf3 crusaderky 6213168 closed 0     3 2018-02-13T23:48:44Z 2019-01-13T20:56:14Z 2019-01-13T20:56:14Z MEMBER      

Two unit tests are failing in the latest git master:

  • GenericNetCDFDataTest.test_cross_engine_read_write_netcdf3
  • GenericNetCDFDataTestAutocloseTrue.test_cross_engine_read_write_netcdf3

Both with the message:

```
xarray/tests/test_backends.py:1558:
xarray/backends/api.py:286: in open_dataset
    autoclose=autoclose)
xarray/backends/netCDF4_.py:275: in open
    ds = opener()
xarray/backends/netCDF4_.py:199: in _open_netcdf4_group
    ds = nc4.Dataset(filename, mode=mode, **kwargs)
netCDF4/_netCDF4.pyx:2015: in netCDF4._netCDF4.Dataset.__init__
    ???

???
E   OSError: [Errno -36] NetCDF: Invalid argument: b'/tmp/tmpwp675lnc/temp-1069.nc'

netCDF4/_netCDF4.pyx:1636: OSError
```

Attaching conda list: conda.txt

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1909/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
311578894 MDU6SXNzdWUzMTE1Nzg4OTQ= 2040 to_netcdf() to automatically switch to fixed-length strings for compressed variables crusaderky 6213168 open 0     2 2018-04-05T11:50:16Z 2019-01-13T01:42:03Z   MEMBER      

When you have fixed-length numpy arrays of unicode characters (<U...) in a dataset, and you invoke to_netcdf() without any particular encoding, they are automatically stored as variable-length strings, unless you explicitly specify {'dtype': 'S1'}.

Is this in order to save disk space in case strings vary wildly in size? I may be able to see the point in this case. However, this approach is disastrous if variables are compressed, as any compression algorithm will reduce the zero-padding at the end of the strings to a negligible size.

My test data: a dataset with ~50 variables, of which half are strings of 10~100 english characters and the other half are floats, all on a single dimension with 12k points.

Test 1:

```python
ds.to_netcdf('uncompressed.nc')
```

Result: 45MB

Test 2:

```python
encoding = {k: {'gzip': True, 'shuffle': True} for k in ds.variables}
ds.to_netcdf('bad-compression.nc', encoding=encoding)
```

Result: 42MB

Test 3:

```python
encoding = {}
for k, v in ds.variables.items():
    encoding[k] = {'gzip': True, 'shuffle': True}
    if v.dtype.kind == 'U':
        encoding[k]['dtype'] = 'S1'
ds.to_netcdf('good-compression.nc', encoding=encoding)
```

Result: 5MB

Proposal

In case of string variables, if no dtype is explicitly defined, to_netcdf() should dynamically assign it to S1 if compression is enabled, str if disabled.
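A minimal sketch of the proposed rule (default_string_dtype is a hypothetical helper; which encoding keys signal compression may differ per backend):

```python
def default_string_dtype(encoding):
    # Proposed default: fixed-width 'S1' when the variable is compressed
    # (the zero padding compresses away), vlen str otherwise.
    compressed = bool(encoding.get('zlib') or encoding.get('complevel'))
    return 'S1' if compressed else str
```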

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2040/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
330473082 MDU6SXNzdWUzMzA0NzMwODI= 2219 to_netcdf broken encoding: dtype='S1' + chunksizes crusaderky 6213168 open 0     2 2018-06-07T23:46:13Z 2019-01-13T01:38:51Z   MEMBER      

```python
xarray.Dataset({'x': ['foo', 'bar', 'baz']}).to_netcdf(
    'foo.nc', engine='h5netcdf',
    encoding={'x': {'dtype': 'S1', 'zlib': True, 'chunksizes': (2,)}})
```
```
ValueError: "chunks" must have same rank as dataset shape
```

Same with `engine='netcdf4'`. The issue is present in 0.10.6 as well as in 0.10.3. The problem is obviously that dtype=S1 changes the shape of the variable before passing it to the backend, but while doing so doesn't also change an eventual chunksizes setting.

The workaround is to omit chunksizes or set it to True.
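A minimal sketch of the missing adjustment (adjust_chunksizes is a hypothetical helper; the real fix would live in the encoding-translation layer):

```python
def adjust_chunksizes(chunksizes, strlen):
    # dtype='S1' adds a trailing character dimension of length strlen,
    # so an explicit chunking must gain one rank as well.
    if chunksizes is None or chunksizes is True:
        return chunksizes
    return tuple(chunksizes) + (strlen,)
```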

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2219/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
339611449 MDU6SXNzdWUzMzk2MTE0NDk= 2273 to_netcdf uses deprecated and unnecessary dask call crusaderky 6213168 closed 0     4 2018-07-09T21:20:20Z 2018-07-31T20:03:41Z 2018-07-31T19:42:20Z MEMBER      

```python
>>> ds = xarray.Dataset({'x': 1})
>>> ds.to_netcdf('foo.nc')
dask/utils.py:1010: UserWarning: Deprecated, see dask.base.get_scheduler instead
```

Stack trace:

```
> xarray/backends/common.py(44)get_scheduler()
     43     from dask.utils import effective_get
---> 44     actual_get = effective_get(get, collection)
```

There are two separate problems here:

  • dask recently changed API from get(get=callable) to get(scheduler=str). Should we:
    • just increase the minimum version of dask (I doubt anybody will complain),
    • go through the hoops of dynamically invoking a different API depending on the dask version (see the sketch after this list) :sweat:, or
    • silence the warning now, and then increase the minimum version of dask the day that dask removes the old API entirely (risky)?
  • xarray is calling dask even when it's unnecessary, as none of the variables in the example Dataset has a dask backend. I don't think there are any CI suites for NetCDF without dask. I'm also wondering whether they would add any actual value, as dask is small, has no exotic dependencies, and is pure Python; so I doubt anybody will have problems installing it, whatever their setup is.
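A rough sketch of the version-dispatch option (the import-based cutover and the keyword names are assumptions based on the deprecation message above):

```
def get_scheduler_compat(get=None, collection=None):
    """Resolve the active dask scheduler with whichever API the
    installed dask version provides."""
    try:
        # newer dask: the replacement named in the warning
        from dask.base import get_scheduler
        return get_scheduler(get, collections=[collection])
    except ImportError:
        # older dask: the deprecated helper
        from dask.utils import effective_get
        return effective_get(get, collection)
```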

@shoyer opinion?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2273/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
324040111 MDU6SXNzdWUzMjQwNDAxMTE= 2149 [REGRESSION] to_netcdf doesn't accept dtype=S1 encoding anymore crusaderky 6213168 closed 0     5 2018-05-17T14:09:15Z 2018-06-01T01:09:38Z 2018-06-01T01:09:38Z MEMBER      

In xarray 0.10.4, the dtype encoding in to_netcdf has stopped working, for all engines:

```
import xarray
ds = xarray.Dataset({'x': ['foo', 'bar', 'baz']})
ds.to_netcdf('test.nc', encoding={'x': {'dtype': 'S1'}})
[...]
xarray/backends/netCDF4_.py in _extract_nc4_variable_encoding(variable, raise_on_invalid, lsd_okay, h5py_okay, backend, unlimited_dims)
    196         if invalid:
    197             raise ValueError('unexpected encoding parameters for %r backend: '
--> 198                              ' %r' % (backend, invalid))
    199     else:
    200         for k in list(encoding):

ValueError: unexpected encoding parameters for 'netCDF4' backend: ['dtype']
```

I'm still trying to figure out how the regression tests didn't pick it up and which change introduced it.

@shoyer I'm working on this as my top priority. Do you agree this is serious enough for an emergency re-release? (0.10.4.1 or 0.10.5, your choice)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2149/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
324410381 MDU6SXNzdWUzMjQ0MTAzODE= 2161 Regression: Dataset.update(Dataset) crusaderky 6213168 closed 0     0 2018-05-18T13:26:58Z 2018-05-29T04:34:47Z 2018-05-29T04:34:47Z MEMBER      

```
Dataset().update(Dataset())

FutureWarning: iteration over an xarray.Dataset will change in xarray v0.11 to only include data variables, not coordinates. Iterate over the Dataset.variables property instead to preserve existing behavior in a forwards compatible manner.
```

This is a regression in xarray 0.10.4. @shoyer this isn't serious enough to warrant an immediate release on its own, but we're already doing one so we might as well include it.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2161/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
324409064 MDU6SXNzdWUzMjQ0MDkwNjQ= 2160 pandas-0.23 breaks stack with duplicated indices crusaderky 6213168 closed 0     3 2018-05-18T13:23:26Z 2018-05-26T03:29:46Z 2018-05-26T03:29:46Z MEMBER      

In this script:

```
import pandas
import xarray

df = pandas.DataFrame(
    [[1, 2], [3, 4]],
    index=['foo', 'foo'],
    columns=['bar', 'baz'])
print(df.stack())

a = xarray.DataArray(df)
print(a.stack(s=a.dims))
```

The first part works with both pandas 0.22 and 0.23. The second part works in xarray 0.10.4 + pandas 0.22, and crashes with pandas 0.23:

```
File "/mnt/resource/tmp/anaconda_guido/lib/python3.6/site-packages/xarray/core/dataarray.py", line 1115, in stack
    ds = self._to_temp_dataset().stack(**dimensions)
File "/mnt/resource/tmp/anaconda_guido/lib/python3.6/site-packages/xarray/core/dataset.py", line 2123, in stack
    result = result._stack_once(dims, new_dim)
File "/mnt/resource/tmp/anaconda_guido/lib/python3.6/site-packages/xarray/core/dataset.py", line 2092, in _stack_once
    idx = utils.multiindex_from_product_levels(levels, names=dims)
File "/mnt/resource/tmp/anaconda_guido/lib/python3.6/site-packages/xarray/core/utils.py", line 96, in multiindex_from_product_levels
    return pd.MultiIndex(levels, labels, sortorder=0, names=names)
File "/mnt/resource/tmp/anaconda_guido/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 240, in __new__
    result._verify_integrity()
File "/mnt/resource/tmp/anaconda_guido/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 283, in _verify_integrity
    level=i))
ValueError: Level values must be unique: ['foo', 'foo'] on level 0
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2160/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
253476466 MDU6SXNzdWUyNTM0NzY0NjY= 1536 Better compression algorithms for NetCDF crusaderky 6213168 closed 0     28 2017-08-28T22:35:31Z 2018-05-08T02:25:40Z 2018-05-08T02:25:40Z MEMBER      

As of today, Dataset.to_netcdf() exclusively allows writing either uncompressed or zlib-compressed data. zlib was absolutely revolutionary when it was released... in 1995. Time has passed, and much better compression algorithms have appeared. The good news is that h5py supports LZF out of the box, and is extensible with plugins to support theoretically any other algorithm. h5netcdf exposes such an interface through its new (non-legacy) API; however, Dataset.to_netcdf(engine='h5netcdf') supports the legacy API exclusively.

I already tested that, once you manage to write to disk with LZF (using h5netcdf directly), open_dataset(engine='h5netcdf') transparently opens the compressed store.

Options:

  • Write a new engine for Dataset.to_netcdf() to support the new h5netcdf API.
  • Switch the whole engine='h5netcdf' to the new API and drop support for the old parameters in to_netcdf(). This is less bad than it sounds, as people can switch to another engine in case of trouble. This is the cleanest solution, but also the most disruptive one.
  • Switch the whole engine='h5netcdf' to the new API; have to_netcdf() accept both new and legacy parameters, and implement a translation layer from the legacy parameters to the new API. The benefit here is that, as long as the user sticks to the legacy API, they can hop between engines transparently. On the other hand, I have a hard time believing anybody would care.
  • ?
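To back the claim above that LZF already round-trips: a sketch of writing an LZF-compressed variable through the new h5netcdf API directly and reading it back with xarray (the keyword pass-through to h5py's create_dataset is the assumption here):

```
import h5netcdf
import numpy
import xarray

# Write with the new (non-legacy) h5netcdf API; extra keywords such as
# compression='lzf' are forwarded to h5py's create_dataset.
with h5netcdf.File('lzf.nc', 'w') as f:
    f.dimensions = {'x': 1000}
    v = f.create_variable('foo', ('x',), float, compression='lzf')
    v[:] = numpy.random.rand(1000)

# open_dataset transparently reads the LZF-compressed store back.
ds = xarray.open_dataset('lzf.nc', engine='h5netcdf')
```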

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1536/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
317421267 MDU6SXNzdWUzMTc0MjEyNjc= 2079 New feature: interp1d crusaderky 6213168 closed 0     8 2018-04-24T22:45:03Z 2018-05-06T19:30:32Z 2018-05-06T19:30:32Z MEMBER      

I've written a series of wrappers for the 1-dimensional scipy interpolators.

Prototype code and colourful demo plots: https://gist.github.com/crusaderky/b0aa6b8fdf6e036cb364f6f40476cc67

Features

  • Interpolate a ND array on any arbitrary dimension
  • Nearest-neighbour, linear, quadratic, cubic, Akima, PCHIP, and custom interpolators are supported
  • dask supported on both the interpolated array and x_new
  • Supports ND x_new arrays
  • The CPU-heavy interpolator generation (splrep) is executed only once and then can be applied to multiple x_new (splev)
  • Pickleable and distributed-friendly

Design hacks

  • Depends on the dask module, even when all inputs are based on plain numpy.
  • Abuses attrs and the ability to invoke a.attrname to get the user experience of a new DataArray method.
  • Abuses the fact that the chunks of a dask.array.Array can contain anything and you won't notice until you compute them.

Limitations

  • Can't dump to netcdf. Not solvable without hacking into the implementation details of scipy.
  • Datasets are not supported. Trivial to fix after solving #1699.
  • Chunks are not supported on x_new. Trivial to fix after solving #1995.
  • Chunks are not supported along the interpolated dimension. This is very complicated to solve. If x and x_new were always monotonic ascending, it would be (not trivially) solvable with dask.array.ghost.ghost. If you make no assumptions about monotonicity, things become way more complicated. A solution would need to go in the dask module, and then be invoked trivially from here with dask='allowed'.
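Not the gist code, but a rough sketch of the core idea for the numpy case (wrapping a scipy 1-d interpolator along an arbitrary dimension via apply_ufunc; the helper name and the '__new__' dimension label are made up for the illustration):

```
import numpy
import scipy.interpolate
import xarray

def interp1d_along(da, dim, x_new, kind='linear'):
    # input_core_dims moves `dim` to the last axis; fit the scipy
    # interpolator on (x, y) and evaluate it on the new coordinates.
    def _interp(y, x, xi):
        return scipy.interpolate.interp1d(x, y, kind=kind)(xi)

    return xarray.apply_ufunc(
        _interp, da, da[dim], x_new,
        input_core_dims=[[dim], [dim], ['__new__']],
        output_core_dims=[['__new__']])

a = xarray.DataArray(
    numpy.sin(numpy.linspace(0, 6, 20)), dims=['x'],
    coords={'x': numpy.linspace(0, 6, 20)})
x_new = xarray.DataArray(numpy.linspace(0, 6, 100), dims=['__new__'])
b = interp1d_along(a, 'x', x_new, kind='cubic')
```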
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2079/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
316618290 MDU6SXNzdWUzMTY2MTgyOTA= 2074 xarray.dot() dask problems crusaderky 6213168 closed 0     10 2018-04-22T22:18:10Z 2018-05-04T21:51:00Z 2018-05-04T21:51:00Z MEMBER      

xarray.dot() has comparable performance with numpy.einsum. However, when it uses a dask backend, it's much slower than the new dask.array.einsum function (https://github.com/dask/dask/pull/3412). The performance gap widens when the dimension upon which you are reducing is chunked.

Also, for some reason dot(a<s, t>, b<t>, dims=[t]) and dot(a<s,t>, a<s,t>, dims=[s,t]) do work (very slowly) when s and t are chunked, while dot(a<s, t>, a<s, t>, dims=[t]) crashes complaining it can't operate on a chunked core dim (related discussion: https://github.com/pydata/xarray/issues/1995).

The proposed solution is to simply wait for https://github.com/dask/dask/pull/3412 to reach the next release and then reimplement xarray.dot to use dask.array.einsum. This means that dask users will lose the ability to use xarray.dot if they upgrade their xarray version but not their dask version, but I believe that shouldn't be a big problem for most?
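A rough sketch of what the reimplementation could look like (simplified: it ignores alignment, coords, and broadcasting corner cases, dispatches on the first argument only, and the helper name is made up):

```
import dask.array
import numpy
import xarray

def dot_via_einsum(a, b, dims):
    """Build an einsum spec from the xarray dims, then dispatch to
    dask.array.einsum for dask inputs and numpy.einsum otherwise."""
    all_dims = list(dict.fromkeys(a.dims + b.dims))
    letters = {d: chr(ord('a') + i) for i, d in enumerate(all_dims)}
    out_dims = [d for d in all_dims if d not in dims]
    spec = '{},{}->{}'.format(
        ''.join(letters[d] for d in a.dims),
        ''.join(letters[d] for d in b.dims),
        ''.join(letters[d] for d in out_dims))
    einsum = (dask.array.einsum if hasattr(a.data, 'chunks')
              else numpy.einsum)
    return xarray.DataArray(einsum(spec, a.data, b.data), dims=out_dims)
```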

```
import numpy
import dask.array
import xarray

def bench(tchunk, a_by_a, dims, iis):
    print(f"\nbench({tchunk}, {a_by_a}, {dims}, {iis})")

    a = xarray.DataArray(
        dask.array.random.random((500000, 100), chunks=(50000, tchunk)),
        dims=['s', 't'])
    if a_by_a:
        b = a
    else:
        b = xarray.DataArray(
            dask.array.random.random((100, ), chunks=tchunk),
            dims=['t'])

    print("xarray.dot(numpy backend):")
    %timeit xarray.dot(a.compute(), b.compute(), dims=dims)
    print("numpy.einsum:")
    %timeit numpy.einsum(iis, a, b)
    print("xarray.dot(dask backend):")
    try:
        %timeit xarray.dot(a, b, dims=dims).compute()
    except ValueError as e:
        print(e)
    print("dask.array.einsum:")
    %timeit dask.array.einsum(iis, a, b).compute()

bench(100, False, ['t'], '...i,...i')
bench( 20, False, ['t'], '...i,...i')
bench(100, True, ['t'], '...i,...i')
bench( 20, True, ['t'], '...i,...i')
bench(100, True, ['s', 't'], '...ij,...ij')
bench( 20, True, ['s', 't'], '...ij,...ij')
```

Output:

```
bench(100, False, ['t'], ...i,...i)
xarray.dot(numpy backend):
195 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
numpy.einsum:
205 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
xarray.dot(dask backend):
356 ms ± 44.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
244 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, False, ['t'], ...i,...i)
xarray.dot(numpy backend):
297 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
254 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
732 ms ± 74.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
274 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(100, True, ['t'], ...i,...i)
xarray.dot(numpy backend):
438 ms ± 43.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
415 ms ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
633 ms ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
431 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, True, ['t'], ...i,...i)
xarray.dot(numpy backend):
457 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
463 ms ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
dimension 't' on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, rechunk into a single dask array chunk along this dimension, i.e., .rechunk({'t': -1}), but beware that this may significantly increase memory usage.
dask.array.einsum:
485 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(100, True, ['s', 't'], ...ij,...ij)
xarray.dot(numpy backend):
418 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
444 ms ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
384 ms ± 57.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
415 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, True, ['s', 't'], ...ij,...ij)
xarray.dot(numpy backend):
489 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
443 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
585 ms ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
455 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2074/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
320104170 MDU6SXNzdWUzMjAxMDQxNzA= 2103 An elegant way to guarantee single chunk along dim crusaderky 6213168 closed 0     2 2018-05-03T22:40:48Z 2018-05-04T20:11:30Z 2018-05-04T20:10:50Z MEMBER      

Algorithms that are wrapped by xarray.apply_ufunc(dask='parallelized'), and in general most algorithms that aren't embarrassingly parallel and for which there isn't a sophisticated dask function that allows for multiple chunks, cannot have multiple chunks on their core dimensions.

I have lost count of how many times I prefixed my invocations of apply_ufunc on a DataArray with the same blurb, over and over again:

```
if x.chunks:
    x = x.chunk({dim: x.shape[x.dims.index(dim)]})
```

The reason why it looks so awful is that DataArray.shape, DataArray.dims, Variable.shape and Variable.dims are positional.

I can see a few possible solutions to the problem:

Design 1

Change DataArray.chunk etc. to accept a special chunk size, e.g. -1, which means "whatever the size of that dim is". The above would become:

```
if x.chunks:
    x = x.chunk({dim: -1})
```

which is much more bearable. One could argue that the implementation would need to happen in dask.array.rechunk; on the other hand, in dask it would feel silly, because already today you can do it in a very synthetic way:

```
x = x.rechunk({axis: x.shape[axis]})
```

I'm not overly fond of this solution as it would be rather obscure for anybody who isn't super familiar with the API documentation.

Design 2

Add properties to DataArray and Variable, ddims and dshape (happy to hear suggestions about better names), which would return dims and shape as an OrderedDict, just like Dataset.dims and Dataset.shape.

The above would become:

```
if x.chunks:
    x = x.chunk({dim: x.dshape[dim]})
```

Design 3

Change dask.array.rechunk to accept numpy.inf / math.inf as the chunk size. This makes sense, as the function already accepts chunk sizes that are larger than the shape - however, it's currently limited to int. This is probably my personal favourite, and trivial to implement too.

The above would become:

```
if x.chunks:
    x = x.chunk({dim: np.inf})
```

Design 4

Introduce a convenience method for DataArray, Dataset, and Variable, ensure_single_chunk(*dims). Below is a prototype:

```
def ensure_single_chunk(a, *dims):
    """If a has dask backend and two or more chunks on dims, rechunk it so
    that they become single-chunked. This is typically a prerequisite for
    computing any algorithm along dim that is not embarrassingly parallel
    (short of sophisticated implementations such as those found in the
    dask module).

    :param a:
        any xarray object
    :param str dims:
        one or more dims of a to rechunk
    :returns:
        copy of a, where all listed dims are guaranteed to be on a single
        dask chunk. If a has numpy backend, return a shallow copy of it.
    """
    if isinstance(a, xarray.Dataset):
        dims = set(dims)
        unknown_dims = dims - a.dims.keys()
        if unknown_dims:
            raise ValueError("dim(s) %s not found" % ",".join(unknown_dims))
        a = a.copy(deep=False)
        for k, v in a.variables.items():
            if v.chunks:
                a[k] = ensure_single_chunk(v, *(set(v.dims) & dims))
        return a

    if not isinstance(a, (xarray.DataArray, xarray.Variable)):
        raise TypeError('a must be a DataArray, Dataset, or Variable')

    if not a.chunks:
        # numpy backend
        return a.copy(deep=False)

    return a.chunk({
        dim: a.shape[a.dims.index(dim)]
        for dim in dims
    })
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2103/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
271998358 MDU6SXNzdWUyNzE5OTgzNTg= 1697 apply_ufunc(dask='parallelized') won't accept scalar *args crusaderky 6213168 closed 0   0.10 2415632 1 2017-11-07T21:56:11Z 2017-11-10T16:46:26Z 2017-11-10T16:46:26Z MEMBER      

As of xarray-0.10-rc1:

Works:

```
import xarray
import scipy.stats
a = xarray.DataArray([1, 2], dims=['x'])

xarray.apply_ufunc(scipy.stats.norm.cdf, a, 0, 1)

<xarray.DataArray (x: 2)>
array([ 0.841345,  0.97725 ])
Dimensions without coordinates: x
```

Broken:

```
xarray.apply_ufunc(
    scipy.stats.norm.cdf, a.chunk(), 0, 1,
    dask='parallelized', output_dtypes=[a.dtype]
).compute()

IndexError                                Traceback (most recent call last)
<ipython-input-35-1d4025e1ebdb> in <module>()
----> 1 xarray.apply_ufunc(scipy.stats.norm.cdf, a.chunk(), 0, 1, dask='parallelized', output_dtypes=[a.dtype]).compute()

~/anaconda3/lib/python3.6/site-packages/xarray/core/computation.py in apply_ufunc(func, *args, **kwargs)
    913                 join=join,
    914                 exclude_dims=exclude_dims,
--> 915                 keep_attrs=keep_attrs)
    916     elif any(isinstance(a, Variable) for a in args):
    917         return variables_ufunc(*args)

~/anaconda3/lib/python3.6/site-packages/xarray/core/computation.py in apply_dataarray_ufunc(func, *args, **kwargs)
    210 
    211     data_vars = [getattr(a, 'variable', a) for a in args]
--> 212     result_var = func(*data_vars)
    213 
    214     if signature.num_outputs > 1:

~/anaconda3/lib/python3.6/site-packages/xarray/core/computation.py in apply_variable_ufunc(func, *args, **kwargs)
    561         raise ValueError('unknown setting for dask array handling in '
    562                          'apply_ufunc: {}'.format(dask))
--> 563     result_data = func(*input_data)
    564 
    565     if signature.num_outputs > 1:

~/anaconda3/lib/python3.6/site-packages/xarray/core/computation.py in <lambda>(*arrays)
    555         func = lambda *arrays: _apply_with_dask_atop(
    556             numpy_func, arrays, input_dims, output_dims, signature,
--> 557             output_dtypes, output_sizes)
    558     elif dask == 'allowed':
    559         pass

~/anaconda3/lib/python3.6/site-packages/xarray/core/computation.py in _apply_with_dask_atop(func, args, input_dims, output_dims, signature, output_dtypes, output_sizes)
    624                  for element in (arg, dims[-getattr(arg, 'ndim', 0):])]
    625     return da.atop(func, out_ind, *atop_args, dtype=dtype, concatenate=True,
--> 626                    new_axes=output_sizes)
    627 
    628 

~/anaconda3/lib/python3.6/site-packages/dask/array/core.py in atop(func, out_ind, *args, **kwargs)
   2231         raise ValueError("Must specify dtype of output array")
   2232 
-> 2233     chunkss, arrays = unify_chunks(*args)
   2234     for k, v in new_axes.items():
   2235         chunkss[k] = (v,)

~/anaconda3/lib/python3.6/site-packages/dask/array/core.py in unify_chunks(*args, **kwargs)
   2117         chunks = tuple(chunkss[j] if a.shape[n] > 1 else a.shape[n]
   2118                        if not np.isnan(sum(chunkss[j])) else None
-> 2119                        for n, j in enumerate(i))
   2120         if chunks != a.chunks and all(a.chunks):
   2121             arrays.append(a.rechunk(chunks))

~/anaconda3/lib/python3.6/site-packages/dask/array/core.py in <genexpr>(.0)
   2117         chunks = tuple(chunkss[j] if a.shape[n] > 1 else a.shape[n]
   2118                        if not np.isnan(sum(chunkss[j])) else None
-> 2119                        for n, j in enumerate(i))
   2120         if chunks != a.chunks and all(a.chunks):
   2121             arrays.append(a.rechunk(chunks))

IndexError: tuple index out of range
```

Workaround:

```
xarray.apply_ufunc(
    scipy.stats.norm.cdf, a, kwargs={'loc': 0, 'scale': 1},
    dask='parallelized', output_dtypes=[a.dtype]).compute()

<xarray.DataArray (x: 2)>
array([ 0.841345,  0.97725 ])
Dimensions without coordinates: x
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1697/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
252541496 MDU6SXNzdWUyNTI1NDE0OTY= 1521 open_mfdataset reads coords from disk multiple times crusaderky 6213168 closed 0     14 2017-08-24T09:29:57Z 2017-10-09T21:15:31Z 2017-10-09T21:15:31Z MEMBER      

I have 200x of the below dataset, split on the 'scenario' axis:

```
<xarray.Dataset>
Dimensions:      (fx_id: 39, instr_id: 16095, scenario: 2501)
Coordinates:
    currency     (instr_id) object 'GBP' 'USD' 'GBP' 'GBP' 'GBP' 'EUR' 'CHF' ...
  * fx_id        (fx_id) object 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' 'CAD' ...
  * instr_id     (instr_id) object 'property_standard_gbp' ...
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
    type         (instr_id) object 'Common Stock' 'Fixed Amortizing Bond' ...
Data variables:
    fx_rates     (fx_id, scenario) float64 1.236 1.191 1.481 1.12 1.264 ...
    instruments  (instr_id, scenario) float64 1.0 1.143 0.9443 1.013 1.176 ...
Attributes:
    base_currency:  GBP
```

I individually dump them to disk with Dataset.to_netcdf(fname, engine='h5netcdf'). Then I try loading them back up with open_mfdataset, but it's mortally slow:

```
%%time
xarray.open_mfdataset('*.nc', engine='h5netcdf')

Wall time: 30.3 s
```

The problem is caused by the coords being read from disk multiple times. Workaround:

```
%%time
def load_coords(ds):
    for coord in ds.coords.values():
        coord.load()
    return ds
xarray.open_mfdataset('*.nc', engine='h5netcdf', preprocess=load_coords)

Wall time: 12.3 s
```

Proposed solutions:

  1. Implement the above workaround directly inside open_mfdataset() (see the sketch below).
  2. Change open_dataset() to always eagerly load the coords to memory, regardless of the chunks parameter. Is there any valid use case where lazy coords are actually desirable?

An additional, more radical observation is that, very frequently, a user knows in advance that all coords are aligned. In this use case, the user could explicitly request that xarray blindly trust this assumption, and thus skip loading the coords not based on concat_dim in all datasets beyond the first.
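A minimal sketch of solution 1 (hypothetical wrapper, not current API), reusing the preprocess hook from the workaround above:

```
import xarray

def open_mfdataset_eager_coords(paths, preprocess=None, **kwargs):
    """Load every dataset's coords into memory exactly once, then chain
    any user-supplied preprocess function."""
    def _load_coords(ds):
        for coord in ds.coords.values():
            coord.load()
        return ds if preprocess is None else preprocess(ds)
    return xarray.open_mfdataset(paths, preprocess=_load_coords, **kwargs)
```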

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1521/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
259935100 MDU6SXNzdWUyNTk5MzUxMDA= 1586 Dataset.copy() drops encoding crusaderky 6213168 closed 0     6 2017-09-22T20:58:30Z 2017-10-08T16:01:20Z 2017-10-08T16:01:20Z MEMBER      

```
ds = Dataset()
ds.encoding = {"unlimited_dims": 'x'}
ds.copy().encoding
{}
```

Looking at dataset.py, there are a lot of calls to Dataset._construct_direct that omit the encoding. Is it correct to add it in all cases?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1586/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
253279298 MDU6SXNzdWUyNTMyNzkyOTg= 1531 @requires_pinio mass disables unrelated tests crusaderky 6213168 closed 0     3 2017-08-28T09:45:29Z 2017-10-04T23:12:48Z 2017-10-04T23:12:48Z MEMBER      

I think I'm losing my sanity here. I have an anaconda3 Python 3.6 environment with all required and optional dependencies of xarray installed and updated to the latest available version, except pyNio. If I run py.test on the latest xarray package from the git tip, the vast majority of the tests in test_backends.py are skipped, including those that have nothing to do with pyNio! e.g.

```
tests/test_backends.py::ScipyInMemoryDataTest::test_bytesio_pickle PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_coordinates_encoding SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_dataset_caching SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_dataset_compute SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_default_fill_value SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_encoding_kwarg SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_encoding_same_dtype SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_invalid_dataarray_names_raise SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_load SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_orthogonal_indexing PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_pickle SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_pickle_dataarray SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_None_variable SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_boolean_dtype SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_coordinates SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_datetime_data SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_endian SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_example_1_netcdf SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_float64_data SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_mask_and_scale SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_object_dtype SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_string_data SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_strings_with_fill_value SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_test_data SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_timedelta_data SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_to_netcdf_explicit_engine PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_unsigned_roundtrip_mask_and_scale SKIPPED
tests/test_backends.py::ScipyInMemoryDataTest::test_write_store PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_zero_dimensional_variable SKIPPED
```

If I comment out line 1462:

```
@requires_scipy
@requires_pynio
class TestPyNio(CFEncodedDataTest, Only32BitTypes, TestCase):
```

Then magically everything starts working again!

```
tests/test_backends.py::ScipyInMemoryDataTest::test_bytesio_pickle PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_coordinates_encoding PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_dataset_caching PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_dataset_compute PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_default_fill_value PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_encoding_kwarg PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_encoding_same_dtype PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_invalid_dataarray_names_raise PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_load PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_orthogonal_indexing PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_pickle PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_pickle_dataarray PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_None_variable PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_boolean_dtype PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_coordinates PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_datetime_data PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_endian PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_example_1_netcdf PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_float64_data PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_mask_and_scale PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_object_dtype PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_string_data PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_strings_with_fill_value PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_test_data PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_roundtrip_timedelta_data PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_to_netcdf_explicit_engine PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_unsigned_roundtrip_mask_and_scale PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_write_store PASSED
tests/test_backends.py::ScipyInMemoryDataTest::test_zero_dimensional_variable PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_bytesio_pickle PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_coordinates_encoding PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_dataset_caching PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_dataset_compute PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_default_fill_value PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_encoding_kwarg PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_encoding_same_dtype PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_invalid_dataarray_names_raise PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_load PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_orthogonal_indexing PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_pickle PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_pickle_dataarray PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_None_variable PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_boolean_dtype PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_coordinates PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_datetime_data PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_endian PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_example_1_netcdf PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_float64_data PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_mask_and_scale PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_object_dtype PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_string_data PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_strings_with_fill_value PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_test_data PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_roundtrip_timedelta_data PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_to_netcdf_explicit_engine PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_unsigned_roundtrip_mask_and_scale PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_write_store PASSED
tests/test_backends.py::ScipyInMemoryDataTestAutocloseTrue::test_zero_dimensional_variable PASSED
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1531/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
260097045 MDU6SXNzdWUyNjAwOTcwNDU= 1588 concat() loads dask arrays if the first array is numpy crusaderky 6213168 closed 0     0 2017-09-24T16:29:09Z 2017-09-25T00:55:36Z 2017-09-25T00:55:36Z MEMBER      

duck_array_ops.concatenate and duck_array_ops.stack load dask variables if the first one is numpy-based:

```
xarray.concat([
    xarray.DataArray([1]).chunk(),
    xarray.DataArray([1]),
], dim='dim_0')

Out[1]:
<xarray.DataArray (dim_0: 2)>
dask.array<shape=(2,), dtype=int64, chunksize=(1,)>
Dimensions without coordinates: dim_0

xarray.concat([
    xarray.DataArray([1]),
    xarray.DataArray([1]).chunk(),
], dim='dim_0')

Out[2]:
<xarray.DataArray (dim_0: 2)>
array([1, 1])
Dimensions without coordinates: dim_0
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1588/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
252543868 MDU6SXNzdWUyNTI1NDM4Njg= 1522 Dataset.__repr__ computes dask variables crusaderky 6213168 closed 0     8 2017-08-24T09:37:12Z 2017-09-21T20:55:43Z 2017-09-21T20:55:43Z MEMBER      

DataArray.__repr__ and Variable.__repr__ print a placeholder if the data uses the dask backend. Not so Dataset.__repr__, which tries computing the data before printing a tiny preview of it. This issue is extremely annoying when working in Jupyter, and particularly acute if the chunks are very big or are at the end of a very long chain of computation.

For data variables, the expected behaviour is to print a placeholder just like DataArray does. For coords, we could either:

  • print a placeholder (same treatment as data variables), or
  • automatically invoke load() when the coord is added to the dataset (see #1521 for discussion).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1522/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
252547273 MDU6SXNzdWUyNTI1NDcyNzM= 1523 Pass arguments to dask.compute() crusaderky 6213168 closed 0     5 2017-08-24T09:48:14Z 2017-09-05T19:55:46Z 2017-09-05T19:55:46Z MEMBER      

I work with a very large dask-based algorithm in xarray, and I do my optimization by hand before hitting compute(). In other cases, I need to use multiple dask schedulers at once (e.g. a multithreaded one for numpy-based work and a multiprocessing one for pure python work).

This change proposal (which I'm happy to implement) is about accepting *args, **kwds parameters in all .compute(), .load(), and .persist() xarray methods and passing them verbatim to the underlying dask compute() and persist() functions.
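A sketch of the call style this would enable (hypothetical at the time of writing; the keywords shown are the dask scheduler options of this era, forwarded untouched):

```
import dask.multiprocessing
import xarray

ds = xarray.Dataset({'x': ('d', list(range(4)))}).chunk({'d': 2})

# Scheduler options go straight through to dask.compute():
ds.compute(get=dask.multiprocessing.get, num_workers=2)
```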

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1523/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
184722754 MDU6SXNzdWUxODQ3MjI3NTQ= 1058 shallow copies become deep copies when pickling crusaderky 6213168 closed 0     10 2016-10-23T23:12:03Z 2017-02-05T21:13:41Z 2017-01-17T01:53:18Z MEMBER      

Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays. This design fails when the object is pickled.

Whenever a numpy view is pickled, it becomes a regular array:

```
>>> a = numpy.arange(2**26)
>>> print(len(pickle.dumps(a)) / 2**20)
256.00015354156494
>>> b = a.view()
>>> print(len(pickle.dumps((a, b))) / 2**20)
512.0001964569092
>>> b.base is a
True
>>> a2, b2 = pickle.loads(pickle.dumps((a, b)))
>>> b2.base is a2
False
```

This has devastating effects in my use case. I start from a dask-backed DataArray with a dimension of 500,000 elements and no coord, so the coord is auto-assigned by xarray as an incremental integer. Then, I perform ~3000 transformations and dump the resulting dask-backed array with pickle. However, I have to dump all intermediate steps for audit purposes as well. This means that xarray invokes numpy.arange to create (500k * 4 bytes) ~ 2MB worth of coord, then creates 3000 views of it, which, the moment they're pickled, expand to several GBs as they become 3000 independent copies.

I see a few possible solutions to this:

  1. Implement pandas range indexes in xarray. This would be nice as a general thing and would solve my specific problem, but anybody who does not fall in my very specific use case won't benefit from it.
  2. Do not auto-generate a coord with numpy.arange() if the user doesn't explicitly ask for it; just leave a None and maybe generate it on the fly when requested. Again, this would solve my specific problem but not other people's.
  3. Force the coord to be a dask.array.arange. Actually supporting unconverted dask arrays as coordinates would take a considerable amount of work; they would get converted to numpy several times, among other issues. Again, it wouldn't solve the general problem.
  4. Fix the issue upstream in numpy. I didn't look into it yet and it's definitely worth investigating, but I found out about it as early as 2012, so I suspect there might be some pretty good reason why it works like that...
  5. Whenever xarray performs a shallow copy, take the numpy array instead of creating a view.

I implemented (5) as a workaround in my __getstate__ method. Before:

```
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
2.535497265867889
Wall time: 33.3 s
```

Workaround:

```
def get_base(array):
    if not isinstance(array, numpy.ndarray):
        return array
    elif array.base is None:
        return array
    elif array.base.dtype != array.dtype:
        return array
    elif array.base.shape != array.shape:
        return array
    else:
        return array.base

for v in cache.values():
    if isinstance(v, xarray.DataArray):
        v.data = get_base(v.data)
        for coord in v.coords.values():
            coord.data = get_base(coord.data)
    elif isinstance(v, xarray.Dataset):
        for var in v.variables.values():
            var.data = get_base(var.data)
```

After:

```
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
0.9733252348378301
Wall time: 21.1 s
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1058/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
172290413 MDU6SXNzdWUxNzIyOTA0MTM= 978 broadcast() broken on dask backend crusaderky 6213168 closed 0     4 2016-08-20T20:56:33Z 2016-12-09T20:28:42Z 2016-12-09T20:28:42Z MEMBER      

```python
>>> a = xarray.DataArray([1, 2]).chunk(1)
>>> a
<xarray.DataArray (dim_0: 2)>
dask.array<xarray-..., shape=(2,), dtype=int64, chunksize=(1,)>
Coordinates:
  * dim_0    (dim_0) int64 0 1
>>> xarray.broadcast(a)
(<xarray.DataArray (dim_0: 2)>
array([1, 2])
Coordinates:
  * dim_0    (dim_0) int64 0 1,)
```

The problem is actually somewhere in the constructor of DataArray. In alignment.py:362, we have return DataArray(data, ...) where data is a Variable with dask backend. The returned DataArray object has a numpy backend. As a workaround, changing that line to return DataArray(data.data, ...) (thus passing a dask array) fixes the problem.

After that, however, there's a second issue: whenever broadcast adds a dimension to an array, it creates it as a single chunk, as opposed to copying the chunking of the other arrays. This can easily cause a host to run out of memory, and it makes it harder to work with the arrays afterwards because the chunks won't match.
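Until broadcast propagates chunking, a manual rechunk works around the second problem (a sketch, assuming the constructor bug above is fixed so that broadcast returns dask-backed arrays):

```
import dask.array
import xarray

a = xarray.DataArray(dask.array.ones((4,), chunks=2), dims=['x'])
b = xarray.DataArray(dask.array.ones((6,), chunks=3), dims=['y'])

a2, b2 = xarray.broadcast(a, b)
# The new 'y' dimension of a2 arrives as one big chunk; rechunk it to
# match b so that downstream operations see aligned chunks.
a2 = a2.chunk({'y': 3})
```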

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/978/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
188395497 MDU6SXNzdWUxODgzOTU0OTc= 1102 full_like, zeros_like, ones_like crusaderky 6213168 closed 0     2 2016-11-10T01:12:58Z 2016-11-28T03:42:39Z 2016-11-28T03:42:39Z MEMBER      

I'd like to add the following top-level functions to xarray:

```
# imports added for completeness
import dask.array
import numpy
import xarray

def const_like(array, value=0):
    """Return a new array with the same shape of array and the given
    constant value. If array is dask-backed, return a new dask-backed
    array with the same chunks.

    :param array:
        a numpy or dask-backed xarray.DataArray
    :param value:
        any scalar number
    """
    if isinstance(array.data, dask.array.Array):
        if value == 0:
            data = dask.array.zeros(
                array.data.shape,
                chunks=array.data.chunks,
                dtype=array.data.dtype)
        else:
            data = dask.array.ones(
                array.data.shape,
                chunks=array.data.chunks,
                dtype=array.data.dtype)
    else:
        if value == 0:
            data = numpy.zeros_like(array.data)
        else:
            data = numpy.ones_like(array.data)
    if value not in (0, 1):
        data = data * value

    return xarray.DataArray(data, dims=array.dims, coords=array.coords,
                            attrs=array.attrs)

def zeros_like(array):
    return const_like(array, 0)

def ones_like(array):
    return const_like(array, 1)
```

The above would need to be expanded to support Dataset and Variable objects. In Datasets, the data_vars would be constants whereas all other variables would be copied verbatim. Thoughts?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1102/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
166287789 MDU6SXNzdWUxNjYyODc3ODk= 902 Pickle and .value vs. dask backend crusaderky 6213168 closed 0     6 2016-07-19T09:34:30Z 2016-11-14T16:56:44Z 2016-11-14T16:56:44Z MEMBER      

Pickling an xarray.DataArray with dask backend will cause it to resolve the .data to a numpy array. This is not desirable, as there are legitimate use cases where you may want to e.g. save a computation for later, or send it somewhere across the network.

Analogously, auto-converting a dask xarray to a numpy xarray as soon as you invoke the .values property is probably nice when you are working on a jupyter terminal, but not in a general purpose situation, particularly when xarray is used at the foundation of a very complex framework. Most of my headaches so far have been caused by trying to figure out when, where and why the dask backend was replaced with numpy.

IMHO a module-wide switch to disable implicit dask->numpy conversion would be a nice solution. A new method, compute(), could explicitly convert in place from dask to numpy.
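A toy sketch of the proposed explicit conversion (simplified stand-in class, not xarray internals):

```
import dask.array

class LazyBox:
    """Stand-in for a DataArray-like wrapper whose .data may be numpy or dask."""
    def __init__(self, data):
        self.data = data

    def compute(self):
        # Explicit, opt-in dask -> numpy conversion, in place.
        if isinstance(self.data, dask.array.Array):
            self.data = self.data.compute()
        return self

box = LazyBox(dask.array.ones((4,), chunks=2))
box.compute()  # only now is the graph evaluated
```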

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/902/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
168470276 MDU6SXNzdWUxNjg0NzAyNzY= 927 align() and broadcast() before concat() crusaderky 6213168 closed 0     9 2016-07-30T14:35:33Z 2016-08-21T01:00:27Z 2016-08-21T01:00:27Z MEMBER      

I have two arrays with misaligned dimensions x and y, and I want to concatenate them on dimension y. I can't seem to find any way to do it, because:

  1. If I do not invoke align(), it will fail complaining that dimension x is not aligned.
  2. If I invoke align(), it will create unwanted elements on dimension y.

See example: https://gist.github.com/crusaderky/a96db5b59396d94fe1e22694bc091d55

Am I missing something obvious? Possibly align() should accept an optional parameter, e.g. exclude=['y']?
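A sketch of how the suggested parameter would read in use (exclude is the proposal here, not a documented argument at the time of writing):

```
import xarray

a = xarray.DataArray([[1, 2]], dims=['x', 'y'],
                     coords={'x': [0], 'y': [0, 1]})
b = xarray.DataArray([[3, 4]], dims=['x', 'y'],
                     coords={'x': [1], 'y': [2, 3]})

# Align x with an outer join, leave y alone, then concatenate on y.
a2, b2 = xarray.align(a, b, join='outer', exclude=['y'])
c = xarray.concat([a2, b2], dim='y')
```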

Thanks in advance

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/927/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
166286097 MDU6SXNzdWUxNjYyODYwOTc= 901 Pickle xarray.ufuncs crusaderky 6213168 closed 0     3 2016-07-19T09:26:06Z 2016-08-02T17:34:15Z 2016-08-02T17:34:15Z MEMBER      

It's currently impossible to pickle xarray.ufuncs.

```
import xarray.ufuncs, pickle
pickle.dumps(xarray.ufuncs.maximum)

AttributeError: Can't pickle local object '_create_op.<locals>.func'
```
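The error message points at the cause: the wrappers are closures created inside _create_op, and pickle can only serialize functions that are importable by qualified name. A sketch of a picklable alternative (illustrative class, not the actual fix):

```
import pickle
import numpy

class _UFuncWrapper:
    """A module-level class: instances pickle as the class plus the
    ufunc name, instead of as an unimportable closure."""
    def __init__(self, name):
        self.name = name

    def __call__(self, *args, **kwargs):
        return getattr(numpy, self.name)(*args, **kwargs)

    def __reduce__(self):
        return (_UFuncWrapper, (self.name,))

maximum = _UFuncWrapper('maximum')
assert pickle.loads(pickle.dumps(maximum))(1, 2) == 2
```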

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/901/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
159117442 MDU6SXNzdWUxNTkxMTc0NDI= 876 xarray.ufuncs.maximum() between constant and dask array crusaderky 6213168 closed 0     1 2016-06-08T09:23:01Z 2016-07-20T05:51:02Z 2016-07-20T05:51:02Z MEMBER      

Take a dask-backed array:

```
a = xarray.DataArray(dask.array.random.random(100 * 2**30, chunks=2**20))
```

This works:

```
b = xarray.ufuncs.maximum(a, 0)
```

This will cripple your computer and force you to reboot:

```
b = xarray.ufuncs.maximum(0, a)
```

In the second case, xarray.ufuncs.maximum is resolving the dask array; in other words, it's doing numpy.maximum(0, a.values).

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/876/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);