
issue_comments


6 rows where issue = 201617371 sorted by updated_at descending



Issue: Using where() in datasets with dataarrays with different dimensions results in huge RAM consumption (201617371) · 6 comments
stale[bot] (NONE) · 2019-01-24T07:20:42Z · https://github.com/pydata/xarray/issues/1217#issuecomment-457093294

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically.

shoyer (MEMBER) · 2017-01-19T05:42:25Z (edited 2017-01-19T05:43:22Z) · https://github.com/pydata/xarray/issues/1217#issuecomment-273687248

For reference, it may be helpful to try your example on a smaller dataset:

```python
ds = xr.Dataset()
ds['data1'] = xr.DataArray(np.arange(100), coords={'t1': np.linspace(0, 1, 100)})
ds['data1b'] = xr.DataArray(np.arange(100, 200), coords={'t1': np.linspace(0, 1, 100)})
ds['data2'] = xr.DataArray(np.arange(200, 500), coords={'t2': np.linspace(0, 1, 300)})
ds['data2b'] = xr.DataArray(np.arange(600, 900), coords={'t2': np.linspace(0, 1, 300)})
```

This is what the data looks like:

```
>>> ds
<xarray.Dataset>
Dimensions:  (t1: 100, t2: 300)
Coordinates:
  * t1       (t1) float64 0.0 0.0101 0.0202 0.0303 0.0404 0.05051 0.06061 ...
  * t2       (t2) float64 0.0 0.003344 0.006689 0.01003 0.01338 0.01672 ...
Data variables:
    data1    (t1) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
    data1b   (t1) int64 100 101 102 103 104 105 106 107 108 109 110 111 112 ...
    data2    (t2) int64 200 201 202 203 204 205 206 207 208 209 210 211 212 ...
    data2b   (t2) int64 600 601 602 603 604 605 606 607 608 609 610 611 612 ...
```

and here's what happens with `where`:

```
>>> ds.where(ds.data1 < 50, drop=True)
<xarray.Dataset>
Dimensions:  (t1: 50, t2: 300)
Coordinates:
  * t1       (t1) float64 0.0 0.0101 0.0202 0.0303 0.0404 0.05051 0.06061 ...
  * t2       (t2) float64 0.0 0.003344 0.006689 0.01003 0.01338 0.01672 ...
Data variables:
    data1    (t1) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ...
    data1b   (t1) float64 100.0 101.0 102.0 103.0 104.0 105.0 106.0 107.0 ...
    data2    (t2, t1) float64 200.0 200.0 200.0 200.0 200.0 200.0 200.0 ...
    data2b   (t2, t1) float64 600.0 600.0 600.0 600.0 600.0 600.0 600.0 ...
```

I suspect this probably isn't really doing what you want, unless you really want two-dimensional versions of data2 and data2b. It probably makes more sense to subset out the relevant variables first.

Broadcasting producing gigantic arrays without any warning is really a NumPy issue, e.g., try `np.where(True, np.zeros((1000000, 1)), np.ones((1, 1000000)))` (this will probably crash your computer!).
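A scaled-down sketch of the same trap (shapes shrunk from 1,000,000 to 1,000 so it runs safely) shows the broadcast blowup without crashing anything:

```python
import numpy as np

# np.where broadcasts all three operands: the (1000, 1) and (1, 1000) arrays
# expand to a full (1000, 1000) result even though the condition is a scalar.
out = np.where(True, np.zeros((1000, 1)), np.ones((1, 1000)))
print(out.shape)   # (1000, 1000)
print(out.nbytes)  # 8000000 bytes (~8 MB); at length 1_000_000 it would be ~8 TB
```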

jacklovell (NONE) · 2017-01-18T16:43:03Z (edited 2017-01-19T05:15:52Z) · https://github.com/pydata/xarray/issues/1217#issuecomment-273529203

The problem isn't as bad with a smaller example (though the runtime is doubled). I've attached a minimum working example, which seems to suggest that maybe there was a problem with xarray creating a MultiIndex and duplicating all the data? (I've left in input() to allow checking memory usage before the program exits, but there isn't much difference in this example). xrmin.py.txt

Edit by @shoyer: added code from attachment inline:

```python
#!/usr/bin/env python3

import time
import sys

import numpy as np
import xarray as xr

ds = xr.Dataset()
ds['data1'] = xr.DataArray(np.arange(1000), coords={'t1': np.linspace(0, 1, 1000)})
ds['data1b'] = xr.DataArray(np.arange(1000, 2000), coords={'t1': np.linspace(0, 1, 1000)})
ds['data2'] = xr.DataArray(np.arange(2000, 5000), coords={'t2': np.linspace(0, 1, 3000)})
ds['data2b'] = xr.DataArray(np.arange(6000, 9000), coords={'t2': np.linspace(0, 1, 3000)})

if sys.argv[1] == "nodrop":
    now = time.time()
    print(ds.where(ds.data1 < 50, drop=True))
    print("Took {} seconds".format(time.time() - now))
elif sys.argv[1] == "drop":
    ds1 = ds.drop('t2')
    now = time.time()
    print(ds1.where(ds1.data1 < 50, drop=True))
    print("Took {} seconds".format(time.time() - now))

input("Press return to exit")
```

fmaussion (MEMBER) · 2017-01-18T17:34:13Z · https://github.com/pydata/xarray/issues/1217#issuecomment-273544152

> … So in my case extracting the data with the shared dimension using ds.drop() is appropriate. It would be nice to have xarray throw a warning or error to prevent me chomping up all the RAM in my system if I do try to do this sort of thing though.

I'll let @shoyer give a definitive answer here, but I don't think that `.where` is meant to check whether the input makes "sense" or not. What happens is related to how xarray chooses to broadcast non-matching dimensions:

```python
import xarray as xr
import numpy as np

d1 = xr.DataArray(np.arange(3), coords={'t1': np.linspace(0, 1, 3)}, dims='t1')
d2 = xr.DataArray(np.arange(4), coords={'t2': np.linspace(0, 1, 4)}, dims='t2')
```

```
>>> d2 * d1
<xarray.DataArray (t2: 4, t1: 3)>
array([[0, 0, 0],
       [0, 1, 2],
       [0, 2, 4],
       [0, 3, 6]])
Coordinates:
  * t2       (t2) float64 0.0 0.3333 0.6667 1.0
  * t1       (t1) float64 0.0 0.5 1.0

>>> d2.where(d1 == 1)
<xarray.DataArray (t2: 4, t1: 3)>
array([[ nan,   0.,  nan],
       [ nan,   1.,  nan],
       [ nan,   2.,  nan],
       [ nan,   3.,  nan]])
Coordinates:
  * t2       (t2) float64 0.0 0.3333 0.6667 1.0
  * t1       (t1) float64 0.0 0.5 1.0
```

which "makes sense", but is going to have a huge memory consumption if your arrays are large.
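The name-based broadcasting described above can be mimicked in plain NumPy by inserting size-1 axes by hand (a sketch of the behaviour, not of xarray internals):

```python
import numpy as np

# xarray broadcasts by dimension *name*: an array over t2 (length 4) and one
# over t1 (length 3) each get a size-1 axis for the other's dimension, and
# NumPy broadcasting then produces the full (t2, t1) cross product.
d2 = np.arange(4).reshape(4, 1)   # stand-in for the DataArray over t2
d1 = np.arange(3).reshape(1, 3)   # stand-in for the DataArray over t1

print(np.broadcast_shapes(d2.shape, d1.shape))  # (4, 3)
print((d2 * d1).shape)                          # (4, 3)

# The cost is multiplicative: two 1-D arrays of length n broadcast to an
# n x n result, so memory grows quadratically with array length.
n = 100_000
shape = np.broadcast_shapes((n, 1), (1, n))
print(f"{shape[0] * shape[1] * 8 / 1e9:.0f} GB")  # 80 GB of float64
```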

jacklovell (NONE) · 2017-01-18T16:25:19Z · https://github.com/pydata/xarray/issues/1217#issuecomment-273523770

data1 and data2 represent two stages of data acquisition within one "shot" of our experiment. I'd like to be able to group each shot's data into a single dataset.

I want to extract from the dataset only the values for which my where() condition is true, and I'll only be using DataArrays which share the same dimension as the one in the condition. For example, if I do `ds_low = ds.where(ds.data1 < 0.1, drop=True)`, I'll only use stuff in ds_low with the same dimension as ds.data1. So in my case extracting the data with the shared dimension using `ds.drop(<unused dim>)` is appropriate.
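The subsetting described here can be sketched with plain dictionaries of NumPy arrays (the variable names and the `dims` mapping below are hypothetical stand-ins, not xarray API):

```python
import numpy as np

# Sketch of the workaround: keep only the variables that share the condition's
# dimension before masking, so nothing gets broadcast against the other axis.
# `dims` maps each variable to its dimension name (stand-in for DataArray.dims).
data = {
    "data1":  np.arange(1000),
    "data1b": np.arange(1000, 2000),
    "data2":  np.arange(2000, 5000),   # lives on a different dimension, t2
}
dims = {"data1": "t1", "data1b": "t1", "data2": "t2"}

mask = data["data1"] < 50
subset = {name: arr[mask] for name, arr in data.items() if dims[name] == "t1"}

print(sorted(subset))         # ['data1', 'data1b']
print(subset["data1"].shape)  # (50,)
```

In xarray itself the same effect comes from selecting only the variables that share the condition's dimension, e.g. `ds[['data1', 'data1b']]`, before calling `where(..., drop=True)`.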

It would be nice to have xarray throw a warning or error to prevent me from chomping up all the RAM in my system if I do try to do this sort of thing, though. Or it could simply mask off with NaN everything in the DataArrays that have a different dimension.

Give me a second to provide a minimal working example.

fmaussion (MEMBER) · 2017-01-18T16:14:19Z · https://github.com/pydata/xarray/issues/1217#issuecomment-273520435

Thanks for the report! It would be great if you could be a bit more specific:
  • if data1 and data2 are unrelated, why do you want to apply where on both variables? What is your expectation on the output?
  • do you have the possibility to produce a minimal, self-contained working example?


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
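The "6 rows where issue = 201617371" view above corresponds to a query like the following, sketched here against an in-memory copy of the schema populated with a single made-up row (the id reuses a real comment id from this page):

```python
import sqlite3

# Build an in-memory table matching the issue_comments schema (foreign-key
# references omitted for self-containment) and run the page's row filter.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE [issue_comments] (
   [html_url] TEXT, [issue_url] TEXT, [id] INTEGER PRIMARY KEY,
   [node_id] TEXT, [user] INTEGER, [created_at] TEXT, [updated_at] TEXT,
   [author_association] TEXT, [body] TEXT, [reactions] TEXT,
   [performed_via_github_app] TEXT, [issue] INTEGER
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
""")
conn.execute(
    "INSERT INTO issue_comments (id, issue, updated_at) VALUES (?, ?, ?)",
    (457093294, 201617371, "2019-01-24T07:20:42Z"),
)
rows = conn.execute(
    "SELECT id FROM issue_comments WHERE issue = ? ORDER BY updated_at DESC",
    (201617371,),
).fetchall()
print(rows)  # [(457093294,)]
```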