github: issues: 12 rows where repo = 13221727, state = "open" and user = 6213168 sorted by updated

12 rows where repo = 13221727, state = "open" and user = 6213168 sorted by updated_at descending

Search:

descending

id	node_id	number	title	user	state	comments	created_at	updated_at ▲	author_association	body	reactions	repo	type
309686915	MDU6SXNzdWUzMDk2ODY5MTU=	2027	square-bracket slice a Dataset with a DataArray	crusaderky 6213168	open	4	2018-03-29T09:39:57Z	2022-04-18T03:51:25Z	MEMBER	Given this: ``` ds = xarray.Dataset( data_vars={ 'vote': ('pupil', [5, 7, 8]), 'age': ('pupil', [15, 14, 16]) }, coords={ 'pupil': ['Alice', 'Bob', 'Charlie'] }) <xarray.Dataset> Dimensions: (pupil: 3) Coordinates: * pupil (pupil) <U7 'Alice' 'Bob' 'Charlie' Data variables: vote (pupil) int64 5 7 8 age (pupil) int64 15 14 16 ``` Why does this work: ``` ds.age[ds.vote >= 6] <xarray.DataArray 'age' (pupil: 2)> array([14, 16]) Coordinates: * pupil (pupil) <U7 'Bob' 'Charlie' ``` But this doesn't? ``` ds[ds.vote >= 6] KeyError: False `ds.vote >= 6`` is a DataArray with dims=('pupil', ) and dtype=bool, so I can't think of any ambiguity in what I want to achieve? Workaround: ``` ds.sel(pupil=ds.vote >= 6) <xarray.Dataset> Dimensions: (pupil: 2) Coordinates: * pupil (pupil) <U7 'Bob' 'Charlie' Data variables: vote (pupil) int64 7 8 age (pupil) int64 14 16 ```	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2027/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
666896781	MDU6SXNzdWU2NjY4OTY3ODE=	4279	intersphinx looks for implementation modules	crusaderky 6213168	open	0	2020-07-28T08:55:12Z	2022-04-09T03:03:30Z	MEMBER	This is a widespread issue caused by the pattern of defining objects in private module and then exposing them to the final user by importing them in the top-level `__init__.py`, vs. how intersphinx works. Exact same issue in different projects: - https://github.com/aio-libs/aiohttp/issues/3714 - https://jira.mongodb.org/browse/MOTOR-338 - https://github.com/tkem/cachetools/issues/178 - https://github.com/AmphoraInc/xarray_mongodb/pull/22 - https://github.com/jonathanslenders/asyncio-redis/issues/143 If a project 1. uses xarray, intersphinx, and autodoc 2. subclasses any of the classes exposed by `xarray/__init__.py` and documents the new class with the `:show-inheritance:` flag 3. Starting from Sphinx 3, has any of the above classes anywhere in a type annotation Then Sphinx emits a warning and fails to create a hyperlink, because intersphinx uses the `__module__` attribute to look up the object in objects.inv, but `__module__` points to the implementation module while objects.inv points to the top-level `xarray` module. Workaround In conf.py: `python import xarray xarray.DataArray.__module__ = "xarray"` Solution Put the above hack in `xarray/__init__.py`	{ "url": "https://api.github.com/repos/pydata/xarray/issues/4279/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
193294569	MDU6SXNzdWUxOTMyOTQ1Njk=	1151	Scalar coords vs. concat	crusaderky 6213168	open	11	2016-12-03T15:42:18Z	2021-07-08T17:42:18Z	MEMBER	Why does this work: ``` import xarray a = xarray.DataArray([1, 2, 3], dims=['x'], coords={'y': 10}) b = xarray.DataArray([4, 5, 6], dims=['x']) a + b <xarray.DataArray (x: 3)> array([5, 7, 9]) Coordinates: y int64 10 `But this doesn't?` xarray.concat([a, b], dim='x') KeyError: 'y' ``` It doesn't seem coherent to me...	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1151/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
305757822	MDU6SXNzdWUzMDU3NTc4MjI=	1995	apply_ufunc support for chunks on input_core_dims	crusaderky 6213168	open	13	2018-03-15T23:50:22Z	2021-05-17T18:59:18Z	MEMBER	I am trying to optimize the following function: `c = (a * b).sum('x', skipna=False)` where a and b are xarray.DataArray's, both with dimension x and both with dask backend. I successfully obtained a 5.5x speedup with the following: `@numba.guvectorize(['void(float64[:], float64[:], float64[:])'], '(n),(n)->()', nopython=True, cache=True) def mulsum(a, b, res): acc = 0 for i in range(a.size): acc += a[i] * b[i] res.flat[0] = acc c = xarray.apply_ufunc( mulsum, a, b, input_core_dims=[['x'], ['x']], dask='parallelized', output_dtypes=[float])` The problem is that this introduces a (quite problematic, in my case) constraint that a and b can't be chunked on dimension x - which is theoretically avoidable as long as the kernel function doesn't need interaction between x[i] and x[j] (e.g. it can't work for an interpolator, which would require to rely on dask ghosting). Proposal Add a parameter to apply_ufunc, `reduce_func=None`. reduce_func is a function which takes as input two parameters a, b that are the output of func. apply_ufunc will invoke it whenever there's chunking on an input_core_dim. e.g. my use case above would simply become: `c = xarray.apply_ufunc( mulsum, a, b, input_core_dims=[['x'], ['x']], dask='parallelized', output_dtypes=[float], reduce_func=operator.sum)` So if I have 2 chunks in a and b on dimension x, apply_ufunc will internally do `c1 = mulsum(a1, b1) c2 = mulsum(a2, b2) c = operator.sum(c1, c2)` Note that reduce_func will be invoked exclusively in presence of dask='parallelized' and when there's chunking on one or more of the input_core_dims. If reduce_func is left to None, apply_ufunc will keep crashing like it does now.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1995/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
417356439	MDU6SXNzdWU0MTczNTY0Mzk=	2801	NaN-sized chunks	crusaderky 6213168	open	2	2019-03-05T15:30:14Z	2021-04-24T02:41:34Z	MEMBER	It would be nice to have support for NaN-sized dask chunks, e.g. `x[x > 2]`. There are two problems: `x[x > 2]` silently resolves the dask graph. It definitely shouldn't. There needs to be some discussion on what needs to happen to indices on the NaN-sized dimension; I can think of 3 options: silently drop any index that would become undefined drop any index that would become undefined and issue a warning hard crash if there is any index that would become undefined redesign IndexVariable so that it can contain dask data (probably much more complicated than the 3 above). The above design decision is anyway for when there is an index; dims without indices should just work. This crashes: ```>>> a = xarray.DataArray([1, 2, 3, 4]).chunk(2) xarray.DataArray(a.data[a.data > 2]).compute() ValueError: replacement data must match the Variable's shape ``` I didn't investigate but I suspect it should be trivial to fix. I'm not sure why there is a check at all? Any such health check should be in dask only IMHO.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2801/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
311573817	MDU6SXNzdWUzMTE1NzM4MTc=	2039	open_mfdataset: skip loading for indexes and coordinates from all but the first file	crusaderky 6213168	open	1	2018-04-05T11:32:02Z	2021-01-27T17:49:21Z	MEMBER	This is a follow-up from #1521. When invoking open_mfdataset, very frequently the user knows in advance that all of his coords that aren't on the concat_dim are already aligned, and may be willing to blindly trust such assumption in exchange of a huge performance boost. My production data: 200x NetCDF files on a not very performant NFS file system, concatenated on the "scenario" dimension: ``` xarray.open_mfdataset('cube..nc', engine='h5netcdf', concat_dim='scenario') <xarray.Dataset> Dimensions: (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1) Coordinates: attribute (attribute) object 'THEO/Value' currency (instr_id) object 'ZAR' 'EUR' 'EUR' 'EUR' 'EUR' 'EUR' 'GBP' ... * fx_id (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ... * instr_id (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ... * timestep (timestep) datetime64[ns] 2016-12-31 type (instr_id) object 'American' 'Bond Future' 'Bond Future' ... * scenario (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ... Data variables: FX (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)> instruments (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)> CPU times: user 19.6 s, sys: 981 ms, total: 20.6 s Wall time: 24.4 s ``` If I skip loading and comparing the non-index coords from all 200 files: ``` xarray.open_mfdataset('cube..nc'), engine='h5netcdf', concat_dim='scenario', coords='all') <xarray.Dataset> Dimensions: (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1) Coordinates: attribute (attribute) object 'THEO/Value' * fx_id (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ... * instr_id (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ... * timestep (timestep) datetime64[ns] 2016-12-31 currency (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)> * scenario (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ... type (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)> Data variables: FX (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)> instruments (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)> CPU times: user 12.7 s, sys: 305 ms, total: 13 s Wall time: 14.8 s ``` If I skip loading and comparing also the index coords from all 200 files: ``` cube = xarray.open_mfdataset(sh.resolve_env(f'{dynamic}/mtf/{cubename}/nc/cube..nc'), engine='h5netcdf', concat_dim='scenario', drop_variables=['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type']) <xarray.Dataset> Dimensions: (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1) Coordinates: scenario (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ... Dimensions without coordinates: attribute, fx_id, instr_id, timestep Data variables: FX (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)> instruments (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)> CPU times: user 7.31 s, sys: 61 ms, total: 7.37 s Wall time: 9.05 s ``` Proposed design Add a new optional parameter to open_mfdataset, `assume_aligned=None`. It can be valued to a list of variable names or "all", and requires `concat_dim` to be explicitly set. It causes open_mfdataset to use the first occurrence of every variable and blindly skip loading the subsequent ones. Algorithm Perform the first invocation to the underlying open_dataset like it happens now if assume_aligned is not None: for each new NetCDF file, figure out which variables need to be aligned & compared (as opposed to concatenated), and add them to a drop_variables list. if assume_aligned != "all": drop_variables &= assume_aligned Pass the increasingly long drop_variables list to the underlying open_dataset	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2039/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
272004812	MDU6SXNzdWUyNzIwMDQ4MTI=	1699	apply_ufunc(dask='parallelized') output_dtypes for datasets	crusaderky 6213168	open	8	2017-11-07T22:18:23Z	2020-04-06T15:31:17Z	MEMBER	When a Dataset has variables with different dtypes, there's no way to tell apply_ufunc that the same function applied to different variables will produce different dtypes: ``` ds1 = xarray.Dataset(data_vars={'a': ('x', [1, 2]), 'b': ('x', [3.0, 4.5])}).chunk() ds2 = xarray.apply_ufunc(lambda x: x + 1, ds1, dask='parallelized', output_dtypes=[float]) ds2 <xarray.Dataset> Dimensions: (x: 2) Dimensions without coordinates: x Data variables: a (x) float64 dask.array<shape=(2,), chunksize=(2,)> b (x) float64 dask.array<shape=(2,), chunksize=(2,)> ds2.compute() <xarray.Dataset> Dimensions: (x: 2) Dimensions without coordinates: x Data variables: a (x) int64 2 3 b (x) float64 4.0 5.5 ``` Proposed solution When the output is a dataset, apply_ufunc could accept either `output_dtypes=[t]` (if all output variables will have the same dtype) or `output_dtypes=[{var1: t1, var2: t2, ...}]`. In the example above, it would be `output_dtypes=[{'a': int, 'b': float}]`.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1699/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
503983776	MDU6SXNzdWU1MDM5ODM3NzY=	3382	Improve indexing performance benchmarks	crusaderky 6213168	open	0	2019-10-08T11:20:39Z	2019-11-14T15:52:33Z	MEMBER	As discussed in #3375 - FYI @jhamman `asv_bench/benchmarks/indexing.py` is currently missing some key use cases: All tests in the above module use arrays with 2~6 million points. While this is important to spot any case where the numpy underlying functions start being unnecessarily called more than once, it also means any performance improvement or degradation in any of the pure-Python code will be completely drowned out. All tests should be run twice, once with the current `nx = 3000; ny = 2000; nt = 1000` and again with `nx = 15; ny = 10; nt = 5`. DataArray slicing (sel, isel, and square brackets) Slicing when there are no IndexVariables (verify that we're not creating dummy variables, doing a full scan on them, and then discarding them) other?	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3382/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
272002705	MDU6SXNzdWUyNzIwMDI3MDU=	1698	apply_ufunc(dask='parallelized') to infer output_dtypes	crusaderky 6213168	open	3	2017-11-07T22:11:11Z	2019-10-22T08:33:38Z	MEMBER	If one doesn't provide the `dtype` parameter to `dask.map_blocks()`, it automatically infers it by running the kernel on trivial dummy data. It should be straightforward to make `xarray.apply_ufunc(dask='parallelized')` use the same functionality if the `output_dtypes` parameter is omitted.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1698/reactions", "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
485708282	MDU6SXNzdWU0ODU3MDgyODI=	3268	Stateful user-defined accessors	crusaderky 6213168	open	15	2019-08-27T09:54:28Z	2019-10-08T11:13:25Z	MEMBER	If anybody decorates a stateful class with `@register_dataarray_accessor` or `@register_dataset_accessor`, the instance will lose its state on any method that invokes `_to_temp_dataset`, as well as on a shallow copy. ```python In [1]: @xarray.register_dataarray_accessor('foo') ...: class Foo: ...: def init(self, obj): ...: self.obj = obj ...: self.x = 1 ...: ...: In [2]: a = xarray.DataArray() In [3]: a.foo.x Out[3]: 1 In [4]: a.foo.x = 2 In [5]: a.foo.x Out[5]: 2 In [6]: a.roll().foo.x Out[6]: 1 In [7]: a.copy(deep=False).foo.x Out[7]: 1 ``` While in the case of `_to_temp_dataset` it could be possible to spend (substantial) effort to retain the state, on the case of copy() it's impossible without modifying the accessor duck API, as one would need to tamper with the accessor instance in place and modify the pointer back to the DataArray/Dataset. This issue is so glaring that it makes me strongly suspect that nobody saves any state in accessor classes. This kind of use would also be problematic in practical terms, as the accessor object would have a hard time realising when its own state is no longer coherent with the referenced DataArray/Dataset. This design also carries the problem that it introduces a circular reference in the DataArray/Dataset. This means that, after someone invokes an accessor method on his DataArray/Dataset, then the whole object - including the numpy buffers! - won't be instantly collected when it's dereferenced by the user, and it will have to instead wait for the next `gc` pass. This could cause huge increases in RAM usage overnight in a user application, which would be very hard to logically link to a change that just added a custom method. Finally, with https://github.com/pydata/xarray/pull/3250/, this statefulness forces us to increase the RAM usage of all datasets and dataarrays by an extra slot, for all users, even if this feature is quite niche. Proposed solution Get rid of accessor caching altogether, and just recreate the accessor object from scratch every time it is invoked. In the documentation, clarify that the `__init__` method should not perform anything computationally intensive.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3268/reactions", "total_count": 4, "+1": 4, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
311578894	MDU6SXNzdWUzMTE1Nzg4OTQ=	2040	to_netcdf() to automatically switch to fixed-length strings for compressed variables	crusaderky 6213168	open	2	2018-04-05T11:50:16Z	2019-01-13T01:42:03Z	MEMBER	When you have fixed-length numpy arrays of unicode characters (<U...) in a dataset, and you invoke to_netcdf() without any particular encoding, they are automatically stored as variable-length strings, unless you explicitly specify `{'dtype': 'S1'}`. Is this in order to save disk space in case strings vary wildly in size? I may be able to see the point in this case. However, this approach is disastrous if variables are compressed, as any compression algorithm will reduce the zero-panning at the end of the strings to a negligible size. My test data: a dataset with \~50 variables, of which half are strings of 10\~100 english characters and the other half are floats, all on a single dimension with 12k points. Test 1: `ds.to_netcdf('uncompressed.nc')` Result: 45MB Test 2: `encoding = {k: {'gzip': True, 'shuffle': True} for k in ds.variables} ds.to_netcdf('bad-compression.nc', encoding=encoding)` Result: 42MB Test 3: `encoding = {} for k, v in ds.variables.items(): encoding[k] = {'gzip': True, 'shuffle': True} if v.dtype.kind == 'U': encoding[k]['dtype'] = 'S1' ds.to_netcdf('good-compression.nc', encoding=encoding)` Result: 5MB Proposal In case of string variables, if no dtype is explicitly defined, to_netcdf() should dynamically assign it to S1 if compression is enabled, str if disabled.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2040/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
330473082	MDU6SXNzdWUzMzA0NzMwODI=	2219	to_netcdf broken encoding: dtype='S1' + chunksizes	crusaderky 6213168	open	2	2018-06-07T23:46:13Z	2019-01-13T01:38:51Z	MEMBER	``` xarray.Dataset({'x': ['foo', 'bar', 'baz']}).to_netcdf( 'foo.nc', engine='h5netcdf', encoding={'x': {'dtype': 'S1', 'zlib': True, 'chunksizes': (2, )}}) ValueError: "chunks" must have same rank as dataset shape ` Same withengine='netcdf4'``. The issue is present in 0.10.6 as well as in 0.10.3. The problem is obviously that dtype=S1 changes the shape of the variable before passing it to the backend, but while doing so doesn't also change an eventual chunksizes setting. The workaround is to omit chunksizes or set it to True.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2219/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue

Advanced export

JSON shape: default, array, newline-delimited, object

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);

issues

12 rows where repo = 13221727, state = "open" and user = 6213168 sorted by updated_at descending

Workaround

Solution

Proposal

Proposed design

Algorithm

Proposed solution

Proposal

Advanced export