issues: 481838855

id: 481838855
node_id: MDU6SXNzdWU0ODE4Mzg4NTU=
number: 3224
title: Add "on"-parameter to "merge" method
user: 1200058
state: closed
locked: 0
comments: 2
created_at: 2019-08-17T02:44:46Z
updated_at: 2022-04-18T15:57:09Z
closed_at: 2022-04-18T15:57:09Z
author_association: NONE
assignee, milestone, active_lock_reason, draft, pull_request: (empty)
body:

I'd like to propose a change to the merge method.

I often run into cases where I'd like to merge subsets of the same dataset. Currently, this requires renaming all dimensions, changing the indices, and merging the subsets by hand.

As an example, please consider the following dataset:

    Dimensions:       (genes: 8787, observations: 8166)
    Coordinates:
      * observations  (observations) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-0005-SM-57WCN'
      * genes         (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
        individual    (observations) object 'GTEX-111CU' ... 'GTEX-ZXG5'
        subtissue     (observations) object 'Adipose_Subcutaneous' ... 'Whole_Blood'
    Data variables:
        cdf           (observations, genes) float32 0.18883839 ... 0.4876754
        l2fc          (observations, genes) float32 -0.21032093 ... -0.032540113
        padj          (observations, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0

For each subtissue and individual there is at most one observation.

Now, I'd like to plot all values in subtissue == "Whole_Blood" against subtissue == "Adipose_Subcutaneous". Therefore, I have to join all "Whole_Blood" observations with all "Adipose_Subcutaneous" observations by the "individual" coordinate.

To simplify this task, I'd like to have the following abstraction:

```python3
# select tissues
tissue_1 = ds.sel(observations=(ds.subtissue == "Whole_Blood"))
tissue_2 = ds.sel(observations=(ds.subtissue == "Adipose_Subcutaneous"))

# inner join by individual
merged = tissue_1.merge(tissue_2, on="individual", newdim="merge_dim", join="inner")

print(merged)
```

The result should look like this:

```
Dimensions:         (genes: 8787, merge_dim: 286)
Coordinates:
  * genes           (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
  * merge_dim       (merge_dim) object 'GTEX-111CU' ... 'GTEX-ZXG5'
    observations:1  (merge_dim) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-1826-SM-5GZYN'
    observations:2  (merge_dim) object 'GTEX-111CU-0005-SM-57WCN' ... 'GTEX-ZXG5-0005-SM-57WCN'
    subtissue:1     (merge_dim) object 'Whole_Blood' ... 'Whole_Blood'
    subtissue:2     (merge_dim) object 'Adipose_Subcutaneous' ... 'Adipose_Subcutaneous'
Data variables:
    cdf:1           (merge_dim, genes) float32 0.18883839 ... 0.4876754
    cdf:2           (merge_dim, genes) float32 ...
    l2fc:1          (merge_dim, genes) float32 -0.21032093 ... -0.032540113
    l2fc:2          (merge_dim, genes) float32 ...
    padj:1          (merge_dim, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0
    padj:2          (merge_dim, genes) float32 ...
```
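For comparison, here is a rough sketch of the manual steps this currently takes with the existing API, on a tiny made-up dataset. The helper `subset_by_individual` and the toy values are my own assumptions for illustration. Also note that the joined dimension stays named `individual` here instead of getting a new name like `merge_dim`, and that it relies on each individual appearing at most once per subtissue.

```python
import numpy as np
import xarray as xr

# Tiny stand-in for the dataset above (all values are made up).
ds = xr.Dataset(
    {"cdf": (("observations", "genes"),
             np.arange(8, dtype="float32").reshape(4, 2))},
    coords={
        "observations": ["A-blood", "A-fat", "B-blood", "C-fat"],
        "genes": ["g1", "g2"],
        "individual": ("observations", ["A", "A", "B", "C"]),
        "subtissue": ("observations", [
            "Whole_Blood", "Adipose_Subcutaneous",
            "Whole_Blood", "Adipose_Subcutaneous",
        ]),
    },
)


def subset_by_individual(ds, tissue, suffix):
    """Hypothetical helper: select one subtissue, index it by individual,
    and suffix every name that would otherwise clash in the merge."""
    mask = (ds["subtissue"] == tissue).values
    sub = ds.isel(observations=mask)
    # Make "individual" the dimension; "observations" stays as a plain coordinate.
    sub = sub.swap_dims({"observations": "individual"})
    shared = ("individual", "genes")
    return sub.rename(
        {name: f"{name}:{suffix}" for name in sub.variables if name not in shared}
    )


left = subset_by_individual(ds, "Whole_Blood", 1)
right = subset_by_individual(ds, "Adipose_Subcutaneous", 2)

# Inner join on the shared "individual" index.
merged = xr.merge([left, right], join="inner")
print(merged)
```

Only individual `A` has both tissues in this toy data, so the merged dataset keeps a single entry along `individual`.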


To summarize, I'd propose the following changes:

- Add a parameter `on: Union[str, List[str], Tuple[str], Dict[str, str]]` that specifies one or more coordinates to merge on.
  - Simple merge: a string => merge by `left[str]` and `right[str]`
  - Merge on multiple coords: a list or tuple of strings => merge by `left[str1, str2, ...]` and `right[str1, str2, ...]`
  - Merge on differently named coords: a dict, e.g. `{"str_left": "str_right"}` => merge by `left[str_left]` and `right[str_right]`
- Add a parameter such as `newdim` to name the newly created index dimension. If `on` specifies multiple coords, this new index dimension should be a multi-index of these coords.
- Rename all duplicate variables and coordinates not specified in `on` to unique names, e.g. `left["cdf"]` => `merged["cdf:1"]` and `right["cdf"]` => `merged["cdf:2"]`
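To make the three accepted forms of `on` concrete, here is a minimal sketch of how they could be normalized into (left, right) name pairs. The function name `normalize_on` is hypothetical and not part of xarray.

```python
from typing import Dict, List, Sequence, Tuple, Union

# The proposed type of the `on` parameter.
OnSpec = Union[str, Sequence[str], Dict[str, str]]


def normalize_on(on: OnSpec) -> List[Tuple[str, str]]:
    """Hypothetical helper: turn `on` into a list of (left_name, right_name) pairs."""
    if isinstance(on, str):
        # Simple merge: the same coordinate name on both sides.
        return [(on, on)]
    if isinstance(on, dict):
        # Differently named coords: keys are left names, values are right names.
        return list(on.items())
    # List or tuple of names shared by both sides.
    return [(name, name) for name in on]


print(normalize_on("individual"))
print(normalize_on(["individual", "subtissue"]))
print(normalize_on({"str_left": "str_right"}))
```

The string case has to be checked first, since a plain string is itself a sequence of characters.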

If the coordinates given in `on` do not uniquely identify each data point, the matching entries should be combined as a cross product. However, since this could require quadratic runtime and memory, I am not sure how this can be handled safely.
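One possible safeguard, sketched in plain Python: count the size of the joined result before building it and refuse to continue past a limit. The names `joined_row_count`, `inner_join_pairs` and the `max_rows` parameter are assumptions of mine, not existing xarray API.

```python
from collections import Counter, defaultdict


def joined_row_count(left_keys, right_keys):
    """Number of rows an inner join produces when duplicate keys are cross-multiplied."""
    left, right = Counter(left_keys), Counter(right_keys)
    # A Counter returns 0 for keys it has not seen, so unmatched keys add nothing.
    return sum(count * right[key] for key, count in left.items())


def inner_join_pairs(left_keys, right_keys, max_rows=None):
    """Return (left_position, right_position) pairs for every matching key."""
    size = joined_row_count(left_keys, right_keys)
    if max_rows is not None and size > max_rows:
        raise ValueError(f"join would produce {size} rows, more than max_rows={max_rows}")
    right_positions = defaultdict(list)
    for j, key in enumerate(right_keys):
        right_positions[key].append(j)
    return [
        (i, j)
        for i, key in enumerate(left_keys)
        for j in right_positions.get(key, [])
    ]


left = ["A", "A", "B"]
right = ["A", "A", "C"]
print(joined_row_count(left, right))  # 4: two "A" rows on each side
print(inner_join_pairs(left, right))
```

Counting first is cheap (linear in the number of keys), so the quadratic case can be detected before any memory is allocated for it.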

What do you think about this addition?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3224/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed
repo: 13221727
type: issue
