issues

2 rows where user = 35689176 sorted by updated_at descending


Facets: type = issue (2) · state = open (2) · repo = xarray (2)
Issue #8473: Regular (linspace) Coordinates/Index
id: 2004250796 · node_id: I_kwDOAMm_X853dnCs · user: JulienBrn (35689176) · state: open · locked: 0 · comments: 9 · created_at: 2023-11-21T13:08:08Z · updated_at: 2024-04-18T22:11:39Z · author_association: NONE · repo: xarray (13221727) · type: issue

Is your feature request related to a problem?

Most of my dimension coordinates fall into three categories:

- Categorical coordinates
- Pandas multiindex
- Regular coordinates, that is, of the form start + np.arange(n)/fs for some start and fs

I feel the way the latter is currently handled in xarray is suboptimal (unless I'm misusing this great library), as it has the following drawbacks:

- Visually: it is not obvious that the coordinate is a linear space; when printing the dataset/array we only see some of the values.
- Computation usage: applying scipy functions that require regular sampling (for example scipy.signal.spectrogram) is very annoying, as one has to extract fs and check that the coordinate is indeed regularly sampled. I currently use step = np.diff(a)[0]; assert (np.abs(np.diff(a) - step) < epsilon).all(); fs = 1/step (see the sketch after this list).
- Rounding errors: sometimes one gets rounding errors in the coordinate values.
- Memory/disk performance: when storing a dataset with few arrays, storing the coordinate values takes up non-negligible space (I have an example where one of my raw data arrays is a one-dimensional time array of 3 GB, and I like adding a coordinate system as soon as possible, thus doubling its size).
- Speed: I would expect joins/alignment/rolling/... to be very fast on such coordinates.
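For reference, the regularity check I currently use, wrapped as a small helper (the function name and epsilon value are mine, purely illustrative):

```python
import numpy as np

def regular_fs(a, epsilon=1e-9):
    """Extract fs from a coordinate, asserting it is regularly sampled."""
    step = np.diff(a)[0]
    assert (np.abs(np.diff(a) - step) < epsilon).all(), "not regularly sampled"
    return 1.0 / step

fs = regular_fs(np.arange(100) / 500.0)  # a regular coordinate -> fs == 500.0
```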

Note: it is not obvious to me from the documentation whether this is more of a "coordinate" enhancement or an "index" enhancement (index being, to my knowledge, discussed only in this part of the documentation).

Describe the solution you'd like

A new type of index/coordinate where only the start and fs are stored. The _repr_inline_ might look like RegularIndex(start, end, step=1/fs).

Perhaps another, more generic possibility would be a type of coordinate system expressed as a transform from np.arange(s, e) by a bijective function f (with the inverse of f also provided). RegularIndex(start, end, fs) would then be an instance with f = lambda x: x/fs, inv(f) = lambda y: y*fs, s = round(start*fs), e = round(end*fs) + 1. The advantage of this approach is that joins/alignment/selection/... could be handled generically on the np.arange(s, e) side, and it would also work on non-linear spaces (for example log spaces).
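Concretely, with illustrative numbers (nothing below is an existing API; it only demonstrates the intended equivalence between the transform view and the materialized coordinate):

```python
import numpy as np

start, fs, n = 2.5, 1000.0, 10           # hypothetical: 10 samples at 1 kHz
end = start + (n - 1) / fs

f = lambda x: x / fs                     # integer index -> coordinate value
f_inv = lambda y: y * fs                 # coordinate value -> integer index
s, e = round(start * fs), round(end * fs) + 1

# The coordinate is never stored: it is f applied to np.arange(s, e).
assert np.allclose(f(np.arange(s, e)), start + np.arange(n) / fs)
```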

Describe alternatives you've considered

I have tried writing an Index subclass, but I struggle with the create_variables method. If I do not return a coordinate for the current dimension, then a.set_xindex(["t"], RegularIndex) keeps the previous coordinate, and if I do, then I need to provide a Variable backed by the full np.array that I am trying not to create (for memory efficiency). I have tried dropping the coordinate after setting my custom index, but that seems to remove the index as well...
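For reference, a minimal sketch of the kind of subclass I mean, assuming xarray's public xr.Index API (from_variables / create_variables); the class and its fields are illustrative, and the create_variables body shows exactly where the memory problem appears:

```python
import numpy as np
import xarray as xr

class RegularIndex(xr.Index):
    def __init__(self, start, fs, size, dim):
        # Only these scalars are stored -- no coordinate array.
        self.start, self.fs, self.size, self.dim = start, fs, size, dim

    @classmethod
    def from_variables(cls, variables, *, options):
        # Expect a single 1-D coordinate; infer start and fs from its values.
        (name, var), = variables.items()
        values = var.values
        step = values[1] - values[0]
        return cls(start=values[0], fs=1.0 / step, size=var.size, dim=var.dims[0])

    def create_variables(self, variables=None):
        # This is where I get stuck: returning a variable here means
        # materializing the very array I am trying not to create.
        data = self.start + np.arange(self.size) / self.fs
        return {self.dim: xr.Variable((self.dim,), data)}
```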

There may be many other problems, as I have only tried quickly. Should this be a viable approach, I may be open to writing a version myself and posting it for review. However, I am relatively new to xarray and would appreciate first knowing whether I am on the right track.

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8473/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Issue #8687: Add a filter option to stack
id: 2109989289 · node_id: I_kwDOAMm_X859w-Gp · user: JulienBrn (35689176) · state: open · locked: 0 · comments: 1 · created_at: 2024-01-31T12:28:15Z · updated_at: 2024-01-31T18:15:43Z · author_association: NONE · repo: xarray (13221727) · type: issue

Is your feature request related to a problem?

I currently have a dataset where one of my dimensions (let's call it x) is of size 10^5. Later in my analysis, I want to consider pairs of values of that dimension, but not all of them: considering all of them would lead to 10^10 entries (not viable memory usage), when in practice I only want around 10^6 of them.

Therefore, the final dataset should have a dimension x_pair which is the stacking of dimensions x_1 and x_2.

However, it seems I have no straightforward way of using stack for that purpose: whatever I do, it first creates a 10^8-element array that I then filter using where(drop=True).
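A toy version of what I mean (names hypothetical, sizes shrunk so it runs):

```python
import numpy as np
import xarray as xr

n = 4  # stands in for the real size of 10^5
ds = xr.Dataset(
    {"v": (("x_1", "x_2"), np.arange(n * n).reshape(n, n))},
    coords={"x_1": np.arange(n), "x_2": np.arange(n)},
)

# The straightforward route materializes the full n*n cross product
# before filtering -- exactly the memory blow-up described above.
pairs = ds.stack(x_pair=("x_1", "x_2"))
mask = pairs["x_1"] < pairs["x_2"]       # example: keep each unordered pair once
pairs = pairs.where(mask, drop=True)
```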

Should this problem be unclear, I could provide a minimal example, but hopefully the explanation of the issue is enough (and my current code is provided as additional context).

Describe the solution you'd like

Add a filter parameter to stack. The filter function should take a dataset and return the set of elements that should appear in the final multi-index.
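A hypothetical signature (filter is not an existing parameter of stack; the name and semantics are only a suggestion), reusing the toy ds from the sketch above:

```python
# Hypothetical API -- `filter` does not exist in xarray today. It would
# receive the (virtually) stacked dataset and return a boolean mask selecting
# which multi-index entries to build, so the full cross product is never
# materialized.
pairs = ds.stack(
    x_pair=("x_1", "x_2"),
    filter=lambda d: d["x_1"] < d["x_2"],
)
```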

Describe alternatives you've considered

Currently, I have solved my problem by dividing the dataset into many smaller datasets, stacking and filtering each of these datasets separately and then merging the filtered datasets together.

Note: the total stacking time for all the smaller datasets, without any parallelization, still feels very long (almost 2 h). I do not know whether this is sensible.

Additional context

Currently, my code looks like the following. I have three initial dimensions in my dataset: Contact, sig_preprocessing, and f. Both Contact and sig_preprocessing should be transformed into pairs.

```python
import concurrent.futures

import numpy as np
import tqdm
import xarray as xr

# Duplicate the dataset with "_1"/"_2" suffixes so pairs can be formed.
# Dataset.rename takes a single mapping, so coords and data_vars are merged.
signal_pairs = xr.merge([
    signals.rename({**{x: f"{x}_1" for x in signals.coords if x != "f"},
                    **{x: f"{x}_1" for x in signals.data_vars}}),
    signals.rename({**{x: f"{x}_2" for x in signals.coords if x != "f"},
                    **{x: f"{x}_2" for x in signals.data_vars}}),
])

def stack_dataset(dataset):
    dataset = dataset.copy()
    # Overlap between the two recordings: min(end) - max(start).
    dataset["common_duration"] = xr.where(
        dataset["start_time_1"] > dataset["start_time_2"],
        xr.where(
            dataset["end_time_1"] > dataset["end_time_2"],
            dataset["end_time_2"] - dataset["start_time_1"],
            dataset["end_time_1"] - dataset["start_time_1"],
        ),
        xr.where(
            dataset["end_time_1"] > dataset["end_time_2"],
            dataset["end_time_2"] - dataset["start_time_2"],
            dataset["end_time_1"] - dataset["start_time_2"],
        ),
    )
    dataset["relevant_pair"] = (
        (dataset["Session_1"] == dataset["Session_2"])
        & (dataset["Contact_1"] != dataset["Contact_2"])
        & (dataset["Structure_1"] == dataset["Structure_2"])
        & (dataset["sig_type_1"] == "bua")
        & (dataset["sig_type_2"] == "spike_times")
        & (~dataset["resampled_continuous_path_1"].isnull())
        & (~dataset["resampled_continuous_path_2"].isnull())
        & (dataset["common_duration"] > 10)
    )
    dataset = dataset.stack(
        sig_preprocessing_pair=("sig_preprocessing_1", "sig_preprocessing_2"),
        Contact_pair=("Contact_1", "Contact_2"),
    )
    dataset = dataset.where(dataset["relevant_pair"].any("sig_preprocessing_pair"), drop=True)
    dataset = dataset.where(dataset["relevant_pair"].any("Contact_pair"), drop=True)
    return dataset

# Split into 100x100 contact blocks, stack and filter each block in parallel,
# then merge the filtered results back together.
stack_size = 100
signal_pairs_split = [
    signal_pairs.isel(dict(
        Contact_1=slice(stack_size * i, stack_size * (i + 1)),
        Contact_2=slice(stack_size * j, stack_size * (j + 1)),
    ))
    for i in range(int(np.ceil(signal_pairs.sizes["Contact_1"] / stack_size)))
    for j in range(int(np.ceil(signal_pairs.sizes["Contact_2"] / stack_size)))
]

with concurrent.futures.ProcessPoolExecutor(max_workers=30) as executor:
    futures = [executor.submit(stack_dataset, dataset) for dataset in signal_pairs_split]
    signal_pairs_split_stacked = [
        future.result()
        for future in tqdm.tqdm(
            concurrent.futures.as_completed(futures),
            total=len(futures),
            desc="Stacking",
        )
    ]
signal_pairs = xr.merge(signal_pairs_split_stacked)
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8687/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}

Table schema:
CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);