issues: 481838855

id	node_id	number	title	user	state	locked	assignee	milestone	comments	created_at	updated_at	closed_at	author_association	active_lock_reason	draft	pull_request	body	reactions	performed_via_github_app	state_reason	repo	type
481838855	MDU6SXNzdWU0ODE4Mzg4NTU=	3224	Add "on"-parameter to "merge" method	1200058	closed	0			2	2019-08-17T02:44:46Z	2022-04-18T15:57:09Z	2022-04-18T15:57:09Z	NONE				I'd like to propose a change to the merge method. I often encounter cases where I'd like to merge subsets of the same dataset. However, this currently requires renaming all dimensions, changing indices and merging them by hand.

As an example, please consider the following dataset:

    Dimensions:       (genes: 8787, observations: 8166)
    Coordinates:
      * observations  (observations) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-0005-SM-57WCN'
      * genes         (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
        individual    (observations) object 'GTEX-111CU' ... 'GTEX-ZXG5'
        subtissue     (observations) object 'Adipose_Subcutaneous' ... 'Whole_Blood'
    Data variables:
        cdf           (observations, genes) float32 0.18883839 ... 0.4876754
        l2fc          (observations, genes) float32 -0.21032093 ... -0.032540113
        padj          (observations, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0

There is at most one observation for each combination of `subtissue` and `individual`. Now, I'd like to plot all values in `subtissue == "Whole_Blood"` against `subtissue == "Adipose_Subcutaneous"`. Therefore, I have to join all "Whole_Blood" observations with all "Adipose_Subcutaneous" observations by the "individual" coordinate. To simplify this task, I'd like to have the following abstraction:

```python3
# select tissues
tissue_1 = ds.sel(observations=(ds.subtissue == "Whole_Blood"))
tissue_2 = ds.sel(observations=(ds.subtissue == "Adipose_Subcutaneous"))

# inner join by individual
merged = tissue_1.merge(tissue_2, on="individual", newdim="merge_dim", join="inner")
print(merged)
```

The result should look like this:

```
Dimensions:         (genes: 8787, merge_dim: 286)
Coordinates:
  * genes           (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
  * merge_dim       (merge_dim) object 'GTEX-111CU' ... 'GTEX-ZXG5'
    observations:1  (merge_dim) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-1826-SM-5GZYN'
    observations:2  (merge_dim) object 'GTEX-111CU-0005-SM-57WCN' ... 'GTEX-ZXG5-0005-SM-57WCN'
    subtissue:1     (merge_dim) object 'Whole_Blood' ... 'Whole_Blood'
    subtissue:2     (merge_dim) object 'Adipose_Subcutaneous' ... 'Adipose_Subcutaneous'
Data variables:
    cdf:1           (merge_dim, genes) float32 0.18883839 ... 0.4876754
    cdf:2           (merge_dim, genes) float32 ...
    l2fc:1          (merge_dim, genes) float32 -0.21032093 ... -0.032540113
    l2fc:2          (merge_dim, genes) float32 ...
    padj:1          (merge_dim, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0
    padj:2          (merge_dim, genes) float32 ...
```

To summarize, I'd propose the following changes:

- Add parameter `on: Union[str, List[str], Tuple[str], Dict[str, str]]`. This should specify one or multiple coordinates which should be merged.
  - Simple merge: string => merge by `left[str]` and `right[str]`
  - Merge of multiple coords: list or tuple of strings => merge by `left[str1, str2, ...]` and `right[str1, str2, ...]`
  - To merge differently named coords: dict, e.g. `{"str_left": "str_right"}` => merge by `left[str_left]` and `right[str_right]`
- Add some parameter like `newdim` to specify the newly created index dimension. If `on` specifies multiple coords, this new index dimension should be a multi-index of these coords.
- Rename all duplicate coordinates not specified in `on` to some unique name, e.g. `left["cdf"] => merged["cdf:1"]` and `right["cdf"] => merged["cdf:2"]`

If the coordinates given in the `on` parameter do not unambiguously describe each data point, they should be combined in a cross-product manner. However, since this could cause quadratic runtime and memory requirements, I am not sure how this can be handled in a safe manner.

What do you think about this addition?	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3224/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		completed	13221727	issue
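
The join semantics proposed in the issue body above can be sketched in plain Python, without xarray. This is only an illustration under stated assumptions: `normalize_on` and `inner_join` are hypothetical helpers, not part of xarray, and the records are toy data standing in for per-observation coordinates. The sketch shows how the three forms of `on` (string, list or tuple, dict) could reduce to pairs of left and right names, how clashing fields could receive the `:1` and `:2` suffixes, and why ambiguous keys lead to a cross product.

```python
from typing import Dict, List, Sequence, Tuple, Union

OnSpec = Union[str, Sequence[str], Dict[str, str]]
Record = Dict[str, object]


def normalize_on(on: OnSpec) -> List[Tuple[str, str]]:
    """Reduce the proposed `on` argument to (left_name, right_name) pairs."""
    if isinstance(on, str):
        return [(on, on)]
    if isinstance(on, dict):
        return list(on.items())
    return [(name, name) for name in on]


def inner_join(left: List[Record], right: List[Record], on: OnSpec) -> List[Record]:
    """Hypothetical inner join of two record lists on the given key fields."""
    pairs = normalize_on(on)
    left_keys = [l for l, _ in pairs]
    right_keys = [r for _, r in pairs]

    # Index the right-hand records by their key tuple.
    index: Dict[tuple, List[Record]] = {}
    for rec in right:
        index.setdefault(tuple(rec[k] for k in right_keys), []).append(rec)

    merged = []
    for lrec in left:
        key = tuple(lrec[k] for k in left_keys)
        # Each left record is paired with every right record sharing its key,
        # so non-unique keys produce a cross product per key.
        for rrec in index.get(key, []):
            lrest = {k: v for k, v in lrec.items() if k not in left_keys}
            rrest = {k: v for k, v in rrec.items() if k not in right_keys}
            clash = lrest.keys() & rrest.keys()
            row = dict(zip(left_keys, key))
            # Only fields present on both sides get a ":1" / ":2" suffix.
            row.update({(f"{k}:1" if k in clash else k): v for k, v in lrest.items()})
            row.update({(f"{k}:2" if k in clash else k): v for k, v in rrest.items()})
            merged.append(row)
    return merged


# Toy records (made up for illustration), one per observation.
blood = [
    {"individual": "A", "subtissue": "Whole_Blood", "cdf": 0.1},
    {"individual": "B", "subtissue": "Whole_Blood", "cdf": 0.2},
]
adipose = [
    {"individual": "B", "subtissue": "Adipose_Subcutaneous", "cdf": 0.3},
    {"individual": "C", "subtissue": "Adipose_Subcutaneous", "cdf": 0.4},
]

merged = inner_join(blood, adipose, on="individual")
print(merged)
```

Only individual "B" appears on both sides, so the inner join keeps a single row. If several records shared the same key on each side, the inner loop would emit one row per pairing, which is the quadratic growth the issue warns about.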

Links from other tables

1 row from issues_id in issues_labels
2 rows from issue in issue_comments