pydata/xarray issue #8687: Add a filter option to stack

State: open · Comments: 1 · Created: 2024-01-31T12:28:15Z · Updated: 2024-01-31T18:15:43Z

Is your feature request related to a problem?

I currently have a dataset where one of my dimensions (let's call it x) has size 10^5. Later in my analysis, I want to consider pairs of values of that dimension, but not all of them: considering all of them would lead to 10^10 entries (not viable memory usage), when in practice I only want to consider around 10^6 of them.

Therefore, the final dataset should have a dimension x_pair, obtained by stacking dimensions x_1 and x_2 (two renamed copies of x).

However, it seems I have no straightforward way of using stack for that purpose: whatever I do, it will first create a 10^8-entry array that I then have to filter using where(drop=True).

Should this problem be unclear, I can provide a minimal example, but hopefully the explanation of the issue is enough (and my current code is provided as additional context below).
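
As a rough stand-in for such a minimal example (all names and sizes here are placeholders, with n scaled down from 10^5 so it runs quickly), the pattern looks like this: stack materializes the full Cartesian product before where(drop=True) can prune it.

```python
import numpy as np
import xarray as xr

n = 1_000  # stand-in for the 10^5-sized dimension
ds = xr.Dataset(
    {"v1": ("x_1", np.random.rand(n)), "v2": ("x_2", np.random.rand(n))},
    coords={"x_1": np.arange(n), "x_2": np.arange(n)},
)

# stack broadcasts every variable and allocates all n**2 pairs up front...
stacked = ds.stack(x_pair=("x_1", "x_2"))
# ...and only afterwards does where(drop=True) prune to the wanted pairs.
wanted = stacked.where(stacked["x_1"] < stacked["x_2"], drop=True)
```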

Describe the solution you'd like

Add a filter parameter to stack. The filter function should take a dataset and return the set of elements that should appear in the final MultiIndex.
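
A purely hypothetical sketch of what such an API could look like (the filter keyword does not exist in xarray's stack today, and the exact callable signature is an assumption):

```python
# Hypothetical: `filter=` is NOT an existing xarray argument.
# The callable would receive the dataset and return a boolean mask saying
# which pairs should appear in the resulting MultiIndex, so that the full
# Cartesian product is never allocated.
wanted = ds.stack(
    x_pair=("x_1", "x_2"),
    filter=lambda d: d["x_1"] < d["x_2"],  # keep only these pairs
)
```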

Describe alternatives you've considered

Currently, I have solved my problem by dividing the dataset into many smaller datasets, stacking and filtering each of these datasets separately and then merging the filtered datasets together.

Note: the total stacking time for all the smaller datasets, without any parallelization, still feels very long (almost 2 h). I do not know whether this is expected.
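
Another possible route, when the wanted pairs can be enumerated up front, is xarray's pointwise (vectorized) indexing: passing DataArray indexers that share a new dimension selects one element per position rather than the outer product. A minimal sketch, reusing the ds placeholder from the example above:

```python
import xarray as xr

# i1/i2 share the new "x_pair" dim, so isel picks element (i1[k], i2[k])
# for each k instead of forming the full outer product.
i1 = xr.DataArray([0, 2, 5], dims="x_pair")
i2 = xr.DataArray([1, 3, 4], dims="x_pair")
pairs = ds.isel(x_1=i1, x_2=i2)
```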

Additional context

Currently, my code looks like the following. My dataset has three initial dimensions, Contact, sig_preprocessing, and f; both Contact and sig_preprocessing should be transformed into pairs.

```python
import concurrent.futures

import numpy as np
import tqdm
import xarray as xr

# Build the pairwise dataset: two renamed copies of `signals`, one with "_1"
# suffixes and one with "_2" suffixes (the shared dimension f is left alone).
signal_pairs = xr.merge([
    signals.rename({
        **{x: f"{x}_1" for x in signals.coords if x != "f"},
        **{x: f"{x}_1" for x in signals.data_vars},
    }),
    signals.rename({
        **{x: f"{x}_2" for x in signals.coords if x != "f"},
        **{x: f"{x}_2" for x in signals.data_vars},
    }),
])


def stack_dataset(dataset):
    dataset = dataset.copy()
    # Overlap between the two recordings' time windows.
    dataset["common_duration"] = xr.where(
        dataset["start_time_1"] > dataset["start_time_2"],
        xr.where(
            dataset["end_time_1"] > dataset["end_time_2"],
            dataset["end_time_2"] - dataset["start_time_1"],
            dataset["end_time_1"] - dataset["start_time_1"],
        ),
        xr.where(
            dataset["end_time_1"] > dataset["end_time_2"],
            dataset["end_time_2"] - dataset["start_time_2"],
            dataset["end_time_1"] - dataset["start_time_2"],
        ),
    )
    # The pairs I actually want to keep.
    dataset["relevant_pair"] = (
        (dataset["Session_1"] == dataset["Session_2"])
        & (dataset["Contact_1"] != dataset["Contact_2"])
        & (dataset["Structure_1"] == dataset["Structure_2"])
        & (dataset["sig_type_1"] == "bua")
        & (dataset["sig_type_2"] == "spike_times")
        & (~dataset["resampled_continuous_path_1"].isnull())
        & (~dataset["resampled_continuous_path_2"].isnull())
        & (dataset["common_duration"] > 10)
    )
    dataset = dataset.stack(
        sig_preprocessing_pair=("sig_preprocessing_1", "sig_preprocessing_2"),
        Contact_pair=("Contact_1", "Contact_2"),
    )
    # Drop entries with no relevant pair along each stacked dimension.
    dataset = dataset.where(dataset["relevant_pair"].any("sig_preprocessing_pair"), drop=True)
    dataset = dataset.where(dataset["relevant_pair"].any("Contact_pair"), drop=True)
    return dataset


# Stack in 100x100 Contact blocks so each piece stays small, then parallelize.
stack_size = 100
signal_pairs_split = [
    signal_pairs.isel(
        Contact_1=slice(stack_size * i, stack_size * (i + 1)),
        Contact_2=slice(stack_size * j, stack_size * (j + 1)),
    )
    for i in range(int(np.ceil(signal_pairs.sizes["Contact_1"] / stack_size)))
    for j in range(int(np.ceil(signal_pairs.sizes["Contact_2"] / stack_size)))
]

with concurrent.futures.ProcessPoolExecutor(max_workers=30) as executor:
    futures = [executor.submit(stack_dataset, dataset) for dataset in signal_pairs_split]
    signal_pairs_split_stacked = [
        future.result()
        for future in tqdm.tqdm(
            concurrent.futures.as_completed(futures),
            total=len(futures),
            desc="Stacking",
        )
    ]
signal_pairs = xr.merge(signal_pairs_split_stacked)
```
