issues: 569176457

id: 569176457
node_id: MDU6SXNzdWU1NjkxNzY0NTc=
number: 3791
title: Self joins with non-unique indexes
user: 6130352
state: closed
locked: 0
assignee:
milestone:
comments: 5
created_at: 2020-02-21T20:47:35Z
updated_at: 2020-03-26T17:51:35Z
closed_at: 2020-03-05T19:32:38Z
author_association: NONE
active_lock_reason:
draft:
pull_request:
body, reactions, performed_via_github_app, state_reason, repo, type: shown below

Hi, is there a good way to self join arrays?

For example, given a dataset like this:

```python
import pandas as pd

df = pd.DataFrame(dict(
    x=[1, 1, 2, 2],
    y=['1', '1', '2', '2'],
    z=['a', 'b', 'c', 'd']))
df
```

I am not looking for the pandas concat behavior for alignment:

```python
pd.concat([
    df.set_index(['x', 'y'])[['z']].rename(columns={'z': 'z_x'}),
    df.set_index(['x', 'y'])[['z']].rename(columns={'z': 'z_y'})
], axis=1, join='inner')
```

but rather the merge behavior for a join by index:

```python
pd.merge(df, df, on=['x', 'y'])
```
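To make the requested join semantics concrete, here is a minimal pandas-only sketch using the example data above. The names `joined` and `expected_rows` are introduced only for illustration. It shows that a join on a non-unique key pairs every row with every row sharing that key, so the row count is the sum of squared group sizes rather than the original length.

```python
import pandas as pd

df = pd.DataFrame(dict(
    x=[1, 1, 2, 2],
    y=['1', '1', '2', '2'],
    z=['a', 'b', 'c', 'd']))

# Self-join on the non-unique key (x, y): each row is paired with every
# row that has the same key, including itself.
joined = pd.merge(df, df, on=['x', 'y'])

# Each key contributes (group size) ** 2 pairs to the result.
expected_rows = int((df.groupby(['x', 'y']).size() ** 2).sum())

print(joined)
print(len(joined), expected_rows)
```

With two keys of two rows each, both counts come out to 8, which is why this kind of join can grow much larger than its input.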

I tried using xarray.merge, but that seems to behave like concat (i.e. it aligns rather than joins). Even if a join is possible, this is a large dataset that I need to process out-of-core via dask. I have found that it takes some elbow grease to get this working with dask dataframes: the number of partitions has to be set well, and the divisions have to be known before joining by index. Should I expect this sort of operation to work well with xarray (if it is possible at all), given that it is hard enough to do directly with dask without hitting OOM errors?
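As a rough illustration of why partitioning matters for this kind of join, the sketch below splits the rows into hash buckets by key and self-joins one bucket at a time, so only one bucket's pairs are built at once. It is plain pandas, not dask or xarray, and it is not a claim about how either library implements joins. The helper `partitioned_self_join` and its `n_buckets` parameter are hypothetical names.

```python
import pandas as pd


def partitioned_self_join(frame, keys, n_buckets):
    """Self-join `frame` on `keys`, one hash bucket at a time (hypothetical helper)."""
    # Rows with equal keys hash to the same bucket, so no matching pair
    # is ever split across two buckets.
    bucket = pd.util.hash_pandas_object(frame[keys], index=False) % n_buckets
    pieces = [
        pd.merge(part, part, on=keys)
        for _, part in frame.groupby(bucket.to_numpy())
    ]
    return pd.concat(pieces, ignore_index=True)


df = pd.DataFrame(dict(
    x=[1, 1, 2, 2],
    y=['1', '1', '2', '2'],
    z=['a', 'b', 'c', 'd']))

bucketed = partitioned_self_join(df, ['x', 'y'], n_buckets=2)
direct = pd.merge(df, df, on=['x', 'y'])

# Compare against the single in-memory merge, ignoring row order.
cols = ['x', 'y', 'z_x', 'z_y']
same = (bucketed.sort_values(cols).reset_index(drop=True)
        .equals(direct.sort_values(cols).reset_index(drop=True)))
print(same)
```

In a real out-of-core setting each bucket would be read from and written to disk separately. A single heavily repeated key can still produce a bucket whose join does not fit in memory.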

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-28-generic
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.7.3

xarray: 0.15.0
pandas: 0.25.2
numpy: 1.17.2
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.1
dask: 2.11.0
distributed: 2.11.0
matplotlib: 3.1.1
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 45.2.0.post20200209
pip: 20.0.2
conda: None
pytest: None
IPython: 7.12.0
sphinx: None

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3791/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed
repo: 13221727
type: issue