issues

12 rows where state = "open", type = "issue" and user = 6213168 sorted by updated_at descending

#2027 · square-bracket slice a Dataset with a DataArray · crusaderky · open · 4 comments · created 2018-03-29 · updated 2022-04-18

Given this:
```
ds = xarray.Dataset(
    data_vars={
        'vote': ('pupil', [5, 7, 8]),
        'age': ('pupil', [15, 14, 16])
    },
    coords={'pupil': ['Alice', 'Bob', 'Charlie']})

<xarray.Dataset>
Dimensions:  (pupil: 3)
Coordinates:
  * pupil    (pupil) <U7 'Alice' 'Bob' 'Charlie'
Data variables:
    vote     (pupil) int64 5 7 8
    age      (pupil) int64 15 14 16
```

Why does this work:
```
ds.age[ds.vote >= 6]

<xarray.DataArray 'age' (pupil: 2)>
array([14, 16])
Coordinates:
  * pupil    (pupil) <U7 'Bob' 'Charlie'
```

But this doesn't?
```
ds[ds.vote >= 6]

KeyError: False
```
`ds.vote >= 6` is a DataArray with dims=('pupil', ) and dtype=bool, so I can't think of any ambiguity in what I want to achieve?

Workaround:
```
ds.sel(pupil=ds.vote >= 6)

<xarray.Dataset>
Dimensions:  (pupil: 2)
Coordinates:
  * pupil    (pupil) <U7 'Bob' 'Charlie'
Data variables:
    vote     (pupil) int64 7 8
    age      (pupil) int64 14 16
```

#4279 · intersphinx looks for implementation modules · crusaderky · open · 0 comments · created 2020-07-28 · updated 2022-04-09

This is a widespread issue, caused by the pattern of defining objects in private modules and then exposing them to the final user by importing them in the top-level __init__.py, which clashes with how intersphinx works.

Exact same issue in different projects:
  • https://github.com/aio-libs/aiohttp/issues/3714
  • https://jira.mongodb.org/browse/MOTOR-338
  • https://github.com/tkem/cachetools/issues/178
  • https://github.com/AmphoraInc/xarray_mongodb/pull/22
  • https://github.com/jonathanslenders/asyncio-redis/issues/143

If a project
  1. uses xarray, intersphinx, and autodoc
  2. subclasses any of the classes exposed by xarray/__init__.py and documents the new class with the :show-inheritance: flag
  3. starting from Sphinx 3, has any of the above classes anywhere in a type annotation

Then Sphinx emits a warning and fails to create a hyperlink, because intersphinx uses the __module__ attribute to look up the object in objects.inv, but __module__ points to the implementation module while objects.inv points to the top-level xarray module.

Workaround

In conf.py:

```python
import xarray

xarray.DataArray.__module__ = "xarray"
```

Solution

Put the above hack in xarray/__init__.py
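For downstream projects, a broader variant of the conf.py workaround could loop over everything re-exported at the top level. The loop below is only a sketch of that idea, not code from the issue:

```python
# conf.py -- sketch only: re-point __module__ of every class re-exported by
# xarray at the top-level package, so intersphinx can match it in objects.inv.
import xarray

for _name in dir(xarray):
    _obj = getattr(xarray, _name)
    if isinstance(_obj, type) and _obj.__module__.startswith("xarray."):
        _obj.__module__ = "xarray"
```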

#1151 · Scalar coords vs. concat · crusaderky · open · 11 comments · created 2016-12-03 · updated 2021-07-08

Why does this work:
```
import xarray
a = xarray.DataArray([1, 2, 3], dims=['x'], coords={'y': 10})
b = xarray.DataArray([4, 5, 6], dims=['x'])
a + b

<xarray.DataArray (x: 3)>
array([5, 7, 9])
Coordinates:
    y        int64 10
```
But this doesn't?
```
xarray.concat([a, b], dim='x')

KeyError: 'y'
```
It doesn't seem coherent to me...
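A possible workaround, as a sketch (mine, not from the issue): give b the same scalar coordinate before concatenating, so the two arrays agree on y.

```python
# Sketch of a workaround: attach the scalar coordinate to `b` as well,
# then concat succeeds and keeps y as a scalar coordinate.
import xarray

a = xarray.DataArray([1, 2, 3], dims=['x'], coords={'y': 10})
b = xarray.DataArray([4, 5, 6], dims=['x'])
xarray.concat([a, b.assign_coords(y=10)], dim='x')
```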

#1995 · apply_ufunc support for chunks on input_core_dims · crusaderky · open · 13 comments · created 2018-03-15 · updated 2021-05-17

I am trying to optimize the following function:

```
c = (a * b).sum('x', skipna=False)
```

where a and b are xarray.DataArray's, both with dimension x and both with dask backend.

I successfully obtained a 5.5x speedup with the following:

```
import numba
import xarray

@numba.guvectorize(['void(float64[:], float64[:], float64[:])'], '(n),(n)->()', nopython=True, cache=True)
def mulsum(a, b, res):
    acc = 0
    for i in range(a.size):
        acc += a[i] * b[i]
    res.flat[0] = acc

c = xarray.apply_ufunc(
    mulsum, a, b,
    input_core_dims=[['x'], ['x']],
    dask='parallelized', output_dtypes=[float])
```

The problem is that this introduces a (quite problematic, in my case) constraint that a and b can't be chunked on dimension x - which is theoretically avoidable as long as the kernel function doesn't need interaction between x[i] and x[j] (e.g. it can't work for an interpolator, which would require to rely on dask ghosting).

Proposal

Add a parameter to apply_ufunc, reduce_func=None. reduce_func is a function which takes as input two parameters a, b that are the output of func. apply_ufunc will invoke it whenever there's chunking on an input_core_dim.

e.g. my use case above would simply become:

```
import operator

c = xarray.apply_ufunc(
    mulsum, a, b,
    input_core_dims=[['x'], ['x']],
    dask='parallelized', output_dtypes=[float], reduce_func=operator.add)
```

So if I have 2 chunks in a and b on dimension x, apply_ufunc will internally do:

```
c1 = mulsum(a1, b1)
c2 = mulsum(a2, b2)
c = operator.add(c1, c2)
```

Note that reduce_func will be invoked only in the presence of dask='parallelized' and when there's chunking on one or more of the input_core_dims. If reduce_func is left as None, apply_ufunc will keep crashing as it does now.
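Until something like reduce_func exists, a rough emulation is possible by hand. The sketch below is mine, not the proposed API; it assumes a recent xarray exposing DataArray.chunksizes, that a and b are chunked identically along x, and it reuses the mulsum kernel defined above.

```python
import numpy as np
import xarray

def chunked_mulsum(a, b, dim='x'):
    # Apply `mulsum` separately to each chunk along `dim`, then add the
    # partial results -- roughly what reduce_func=operator.add would do.
    bounds = np.cumsum((0,) + a.chunksizes[dim])
    partials = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        block = slice(int(lo), int(hi))
        partials.append(
            xarray.apply_ufunc(
                mulsum, a.isel({dim: block}), b.isel({dim: block}),
                input_core_dims=[[dim], [dim]],
                dask='parallelized', output_dtypes=[float]))
    return sum(partials)
```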

#2801 · NaN-sized chunks · crusaderky · open · 2 comments · created 2019-03-05 · updated 2021-04-24

It would be nice to have support for NaN-sized dask chunks, e.g. x[x > 2]. There are two problems:

  1. x[x > 2] silently resolves the dask graph. It definitely shouldn't. There needs to be some discussion on what needs to happen to indices on the NaN-sized dimension; I can think of 3 options:
     • silently drop any index that would become undefined
     • drop any index that would become undefined and issue a warning
     • hard crash if there is any index that would become undefined
     Alternatively, redesign IndexVariable so that it can contain dask data (probably much more complicated than the 3 above). The above design decision is anyway for when there is an index; dims without indices should just work.

  2. This crashes:

```
>>> a = xarray.DataArray([1, 2, 3, 4]).chunk(2)
>>> xarray.DataArray(a.data[a.data > 2]).compute()
ValueError: replacement data must match the Variable's shape
```

I didn't investigate but I suspect it should be trivial to fix. I'm not sure why there is a check at all? Any such health check should be in dask only IMHO.
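For context, this is what NaN-sized chunks already look like at the dask level (plain dask, no xarray involved):

```python
# Boolean indexing in dask produces chunks of unknown (NaN) size until computed.
import dask.array as da

x = da.arange(6, chunks=3)
y = x[x > 2]
print(y.chunks)     # ((nan, nan),) -- NaN-sized chunks
print(y.compute())  # [3 4 5]
```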

#2039 · open_mfdataset: skip loading for indexes and coordinates from all but the first file · crusaderky · open · 1 comment · created 2018-04-05 · updated 2021-01-27

This is a follow-up from #1521.

When invoking open_mfdataset, the user very frequently knows in advance that all of their coords that aren't on the concat_dim are already aligned, and may be willing to blindly trust that assumption in exchange for a huge performance boost.

My production data: 200x NetCDF files on a not very performant NFS file system, concatenated on the "scenario" dimension:

```
xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
    currency     (instr_id) object 'ZAR' 'EUR' 'EUR' 'EUR' 'EUR' 'EUR' 'GBP' ...
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    type         (instr_id) object 'American' 'Bond Future' 'Bond Future' ...
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 19.6 s, sys: 981 ms, total: 20.6 s
Wall time: 24.4 s
```

If I skip loading and comparing the non-index coords from all 200 files:

```
xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario', coords='all')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    currency     (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
    type         (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 12.7 s, sys: 305 ms, total: 13 s
Wall time: 14.8 s
```

If I skip loading and comparing also the index coords from all 200 files:

```
cube = xarray.open_mfdataset(sh.resolve_env(f'{dynamic}/mtf/{cubename}/nc/cube.*.nc'),
                             engine='h5netcdf', concat_dim='scenario',
                             drop_variables=['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type'])

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Dimensions without coordinates: attribute, fx_id, instr_id, timestep
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 7.31 s, sys: 61 ms, total: 7.37 s
Wall time: 9.05 s
```

Proposed design

Add a new optional parameter to open_mfdataset, assume_aligned=None. It can be set to a list of variable names or "all", and requires concat_dim to be explicitly set. It makes open_mfdataset use the first occurrence of every such variable and blindly skip loading the subsequent ones (a usage sketch follows the algorithm below).

Algorithm

  1. Perform the first invocation to the underlying open_dataset like it happens now
  2. if assume_aligned is not None: for each new NetCDF file, figure out which variables need to be aligned & compared (as opposed to concatenated), and add them to a drop_variables list.
  3. if assume_aligned != "all": drop_variables &= assume_aligned
  4. Pass the increasingly long drop_variables list to the underlying open_dataset
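In code, the proposed call would read as follows (hypothetical: assume_aligned does not exist in today's open_mfdataset; this only illustrates the design above):

```python
# Hypothetical usage of the proposed parameter -- not a current xarray API.
ds = xarray.open_mfdataset(
    'cube.*.nc', engine='h5netcdf', concat_dim='scenario',
    assume_aligned='all')  # trust that non-concatenated variables already match
```
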
#1699 · apply_ufunc(dask='parallelized') output_dtypes for datasets · crusaderky · open · 8 comments · created 2017-11-07 · updated 2020-04-06

When a Dataset has variables with different dtypes, there's no way to tell apply_ufunc that the same function applied to different variables will produce different dtypes:

```
ds1 = xarray.Dataset(data_vars={'a': ('x', [1, 2]), 'b': ('x', [3.0, 4.5])}).chunk()
ds2 = xarray.apply_ufunc(lambda x: x + 1, ds1, dask='parallelized', output_dtypes=[float])
ds2

<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    a        (x) float64 dask.array<shape=(2,), chunksize=(2,)>
    b        (x) float64 dask.array<shape=(2,), chunksize=(2,)>

ds2.compute()

<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 2 3
    b        (x) float64 4.0 5.5
```

Proposed solution

When the output is a dataset, apply_ufunc could accept either output_dtypes=[t] (if all output variables will have the same dtype) or output_dtypes=[{var1: t1, var2: t2, ...}]. In the example above, it would be output_dtypes=[{'a': int, 'b': float}].
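Spelled out, the proposed call would look like this (hypothetical usage; today output_dtypes only accepts a flat list of dtypes):

```python
# Hypothetical usage of the proposed per-variable output_dtypes -- not a current xarray API.
ds2 = xarray.apply_ufunc(
    lambda x: x + 1, ds1,
    dask='parallelized',
    output_dtypes=[{'a': int, 'b': float}])
```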

#3382 · Improve indexing performance benchmarks · crusaderky · open · 0 comments · created 2019-10-08 · updated 2019-11-14

As discussed in #3375 - FYI @jhamman

asv_bench/benchmarks/indexing.py is currently missing some key use cases:

  • All tests in the above module use arrays with 2~6 million points. While this is important for spotting any case where the underlying numpy functions start being unnecessarily called more than once, it also means that any performance improvement or degradation in the pure-Python code will be completely drowned out. All tests should be run twice, once with the current nx = 3000; ny = 2000; nt = 1000 and again with nx = 15; ny = 10; nt = 5 (a rough sketch of such a small-array benchmark follows this list).
  • DataArray slicing (sel, isel, and square brackets)
  • Slicing when there are no IndexVariables (verify that we're not creating dummy variables, doing a full scan on them, and then discarding them)
  • other?
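A rough sketch of what such a small-array benchmark could look like in asv_bench/benchmarks/indexing.py; the class and method names below are illustrative, not existing ones:

```python
# Sketch of an asv benchmark exercising pure-Python indexing overhead on tiny arrays.
import numpy as np
import xarray as xr

class SmallArrayIndexing:
    def setup(self):
        nx, ny, nt = 15, 10, 5
        self.ds = xr.Dataset(
            {'var1': (('x', 'y'), np.random.randn(nx, ny)),
             'var2': (('x', 't'), np.random.randn(nx, nt))},
            coords={'x': np.arange(nx), 'y': np.arange(ny), 't': np.arange(nt)})

    def time_sel_scalar(self):
        self.ds['var1'].sel(x=7, y=3)

    def time_isel_slice(self):
        self.ds['var1'].isel(x=slice(0, 10))

    def time_square_brackets(self):
        self.ds['var1'][5, 2]
```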
Reactions: +1 × 1
#1698 · apply_ufunc(dask='parallelized') to infer output_dtypes · crusaderky · open · 3 comments · created 2017-11-07 · updated 2019-10-22

If one doesn't provide the dtype parameter to dask.map_blocks(), it automatically infers it by running the kernel on trivial dummy data. It should be straightforward to make xarray.apply_ufunc(dask='parallelized') use the same functionality if the output_dtypes parameter is omitted.
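A small illustration of the dask behaviour referenced above (dtype inferred from a trial run on dummy data when dtype= is omitted):

```python
# When dtype= is omitted, dask infers it by calling the kernel on tiny dummy data.
import dask.array as da

x = da.zeros(4, chunks=2, dtype='i8')
y = x.map_blocks(lambda block: block.astype('f8') + 0.5)  # no dtype= given
print(y.dtype)  # float64, inferred without computing the real data
```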

Reactions: +1 × 1
#3268 · Stateful user-defined accessors · crusaderky · open · 15 comments · created 2019-08-27 · updated 2019-10-08

If anybody decorates a stateful class with @register_dataarray_accessor or @register_dataset_accessor, the instance will lose its state on any method that invokes _to_temp_dataset, as well as on a shallow copy.

```python
In [1]: @xarray.register_dataarray_accessor('foo')
   ...: class Foo:
   ...:     def __init__(self, obj):
   ...:         self.obj = obj
   ...:         self.x = 1
   ...:

In [2]: a = xarray.DataArray()

In [3]: a.foo.x
Out[3]: 1

In [4]: a.foo.x = 2

In [5]: a.foo.x
Out[5]: 2

In [6]: a.roll().foo.x
Out[6]: 1

In [7]: a.copy(deep=False).foo.x
Out[7]: 1
```

While in the case of _to_temp_dataset it could be possible to spend (substantial) effort to retain the state, in the case of copy() it's impossible without modifying the accessor duck API, as one would need to tamper with the accessor instance in place and modify its pointer back to the DataArray/Dataset.

This issue is so glaring that it makes me strongly suspect that nobody saves any state in accessor classes. This kind of use would also be problematic in practical terms, as the accessor object would have a hard time realising when its own state is no longer coherent with the referenced DataArray/Dataset.

This design also carries the problem that it introduces a circular reference in the DataArray/Dataset. This means that, after someone invokes an accessor method on their DataArray/Dataset, the whole object - including the numpy buffers! - won't be instantly collected when the user dereferences it, and will instead have to wait for the next gc pass. This could cause huge increases in RAM usage overnight in a user application, which would be very hard to logically link to a change that just added a custom method.

Finally, with https://github.com/pydata/xarray/pull/3250/, this statefulness forces us to increase the RAM usage of all datasets and dataarrays by an extra slot, for all users, even if this feature is quite niche.

Proposed solution

Get rid of accessor caching altogether, and just recreate the accessor object from scratch every time it is invoked. In the documentation, clarify that the __init__ method should not perform anything computationally intensive.
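Under that proposal, a well-behaved accessor would simply derive everything from the wrapped object on each access instead of storing state. A minimal sketch (the accessor name is illustrative):

```python
import xarray

@xarray.register_dataarray_accessor('example')  # 'example' is an illustrative name
class ExampleAccessor:
    def __init__(self, obj):
        # Keep __init__ cheap: just store the reference, no precomputed state.
        self._obj = obj

    @property
    def double_mean(self):
        # Derived on every access, so it can never go stale after
        # _to_temp_dataset round-trips or shallow copies.
        return 2 * float(self._obj.mean())
```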

Reactions: +1 × 4
#2040 · to_netcdf() to automatically switch to fixed-length strings for compressed variables · crusaderky · open · 2 comments · created 2018-04-05 · updated 2019-01-13

When you have fixed-length numpy arrays of unicode characters (<U...) in a dataset, and you invoke to_netcdf() without any particular encoding, they are automatically stored as variable-length strings, unless you explicitly specify {'dtype': 'S1'}.

Is this in order to save disk space in case strings vary wildly in size? I may be able to see the point in this case. However, this approach is disastrous if variables are compressed, as any compression algorithm will reduce the zero-padding at the end of the strings to a negligible size.

My test data: a dataset with ~50 variables, of which half are strings of 10~100 English characters and the other half are floats, all on a single dimension with 12k points.

Test 1:
```
ds.to_netcdf('uncompressed.nc')
```
Result: 45MB

Test 2:
```
encoding = {k: {'gzip': True, 'shuffle': True} for k in ds.variables}
ds.to_netcdf('bad-compression.nc', encoding=encoding)
```
Result: 42MB

Test 3:
```
encoding = {}
for k, v in ds.variables.items():
    encoding[k] = {'gzip': True, 'shuffle': True}
    if v.dtype.kind == 'U':
        encoding[k]['dtype'] = 'S1'
ds.to_netcdf('good-compression.nc', encoding=encoding)
```
Result: 5MB

Proposal

In case of string variables, if no dtype is explicitly defined, to_netcdf() should dynamically assign it to S1 if compression is enabled, str if disabled.

#2219 · to_netcdf broken encoding: dtype='S1' + chunksizes · crusaderky · open · 2 comments · created 2018-06-07 · updated 2019-01-13

```
xarray.Dataset({'x': ['foo', 'bar', 'baz']}).to_netcdf(
    'foo.nc', engine='h5netcdf',
    encoding={'x': {'dtype': 'S1', 'zlib': True, 'chunksizes': (2, )}})

ValueError: "chunks" must have same rank as dataset shape
```
Same with engine='netcdf4'. The issue is present in 0.10.6 as well as in 0.10.3. The problem is obviously that dtype='S1' changes the shape of the variable before passing it to the backend, but while doing so it doesn't also adjust any explicit chunksizes setting.

The workaround is to omit chunksizes or set it to True.
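Spelled out, the workaround is simply the same call without an explicit chunksizes (a sketch; the toy data mirrors the example above, with an explicit dimension name added):

```python
# Sketch of the workaround: let the backend choose chunks that match the
# expanded character dimension instead of passing chunksizes explicitly.
import xarray

xarray.Dataset({'x': ('dim0', ['foo', 'bar', 'baz'])}).to_netcdf(
    'foo.nc', engine='h5netcdf',
    encoding={'x': {'dtype': 'S1', 'zlib': True}})
```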


Table schema

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);