issue_comments


10 rows where user = 34276374 sorted by updated_at descending


issue 6

  • Writing a a dataset to .zarr in a loop makes all the data NaNs 4
  • Better rolling reductions 2
  • How can I drop attribute of DataArray 1
  • Optimize ndrolling nanreduce 1
  • xarray.DataArray.str.cat() doesn't work on chunked data 1
  • Inconsistency between xr.where() and da.where() 1

user 1

  • tbloch1 · 10

author_association 1

  • NONE 10
id html_url issue_url node_id user created_at updated_at author_association body reactions performed_via_github_app issue
1514473763 https://github.com/pydata/xarray/issues/7767#issuecomment-1514473763 https://api.github.com/repos/pydata/xarray/issues/7767 IC_kwDOAMm_X85aRQkj tbloch1 34276374 2023-04-19T10:08:52Z 2023-04-19T10:08:52Z NONE

Thanks for the replies

So while `xr.where(cond, x, y)` semantically means "where the condition is true, x, else y", `da.where(cond, x)` means "where the condition is true, da, else x".

The latter feels quite unintuitive to me. Is the masking example you provided, where NaN is returned as the default 'x' value, the only reason they differ?
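For concreteness, a minimal sketch of the two call signatures (the toy array here is made up for illustration):

```python
import xarray as xr

da = xr.DataArray([1, 2, 3, 4])
cond = da > 2

# Function form: where cond is True take x, otherwise take y
print(xr.where(cond, da, -1).values)  # [-1 -1  3  4]

# Method form: where cond is True keep da, otherwise take the second argument
print(da.where(cond, -1).values)      # [-1 -1  3  4]

# With no second argument the method fills with NaN by default
print(da.where(cond).values)          # [nan nan  3.  4.]
```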

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Inconsistency between xr.where() and da.where() 1674532233
1507201606 https://github.com/pydata/xarray/issues/4325#issuecomment-1507201606 https://api.github.com/repos/pydata/xarray/issues/4325 IC_kwDOAMm_X85Z1hJG tbloch1 34276374 2023-04-13T15:48:31Z 2023-04-13T15:48:31Z NONE

I think I may have found a way to make the variance/standard deviation calculation more memory efficient, but I don't know enough about writing the sort of code that would be needed for a PR.

I basically wrote out the calculation for variance, trying to use only the functions that have already been optimised. Derived from:

$$ var = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 $$

$$ var = \frac{1}{n} \left( (x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + ... \right) $$

$$ var = \frac{1}{n} \left( x_1^2 - 2x_1\mu + \mu^2 + x_2^2 - 2x_2\mu + \mu^2 + x_3^2 - 2x_3\mu + \mu^2 + ... \right) $$

$$ var = \frac{1}{n} \left( \sum_{i=1}^{n} x_i^2 - 2\mu\sum_{i=1}^{n} x_i + n\mu^2 \right)$$

I coded this up and demonstrate below that it uses approximately 10% of the memory of the current `.var()` implementation:

```python
%load_ext memory_profiler

import numpy as np
import xarray as xr

temp = xr.DataArray(np.random.randint(0, 10, (5000, 500)), dims=("x", "y"))

def new_var(da, x=10, y=20):
    # Defining the re-used parts
    roll = da.rolling(x=x, y=y)
    mean = roll.mean()
    count = roll.count()
    # First term: sum of squared values
    term1 = (da**2).rolling(x=x, y=y).sum()
    # Second term: cross-term sum
    term2 = -2 * mean * roll.sum()
    # Third term: 'sum' of squared means
    term3 = count * mean**2
    # Combining into the variance
    var = (term1 + term2 + term3) / count
    return var

def old_var(da, x=10, y=20):
    roll = da.rolling(x=x, y=y)
    var = roll.var()
    return var

%memit new_var(temp)
%memit old_var(temp)
```

peak memory: 429.77 MiB, increment: 134.92 MiB
peak memory: 5064.07 MiB, increment: 4768.45 MiB

I wanted to double check that the calculation was working correctly:

```python
print((var_o.where(~np.isnan(var_o), 0) == var_n.where(~np.isnan(var_n), 0)).all().values)
print(np.allclose(var_o, var_n, equal_nan=True))
```

False
True

I think the difference here is just due to floating point errors, but maybe someone who knows how to check that in more detail could have a look.

The standard deviation can be trivially implemented from this if the approach works.
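As a minimal sketch of both checks (reusing `temp`, `old_var` and `new_var` from the snippet above; the variable names are just for illustration):

```python
var_o = old_var(temp)
var_n = new_var(temp)

# If the largest absolute difference is tiny relative to the variances
# themselves, the mismatch above is just floating-point error
print(float(np.abs(var_o - var_n).max()))

# The standard deviation follows directly from the variance
std_n = np.sqrt(var_n)
```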

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Optimize ndrolling nanreduce 675482176
1506749920 https://github.com/pydata/xarray/pull/4915#issuecomment-1506749920 https://api.github.com/repos/pydata/xarray/issues/4915 IC_kwDOAMm_X85Zzy3g tbloch1 34276374 2023-04-13T10:47:38Z 2023-04-13T10:47:38Z NONE

I think I may have found a way to make it more memory efficient, but I don't know enough about writing the sort of code that would be needed for a PR.

I basically wrote out the calculation for variance, trying to use only the functions that have already been optimised. Derived from:

$$ var = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 $$

$$ var = \frac{1}{n} \left( (x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + ... \right) $$

$$ var = \frac{1}{n} \left( x_1^2 - 2x_1\mu + \mu^2 + x_2^2 - 2x_2\mu + \mu^2 + x_3^2 - 2x_3\mu + \mu^2 + ... \right) $$

$$ var = \frac{1}{n} \left( \sum_{i=1}^{n} x_i^2 - 2\mu\sum_{i=1}^{n} x_i + n\mu^2 \right)$$

I coded this up and demonstrate below that it uses approximately 10% of the memory of the current `.var()` implementation:

```python
%load_ext memory_profiler

import numpy as np
import xarray as xr

temp = xr.DataArray(np.random.randint(0, 10, (5000, 500)), dims=("x", "y"))

def new_var(da, x=10, y=20):
    # Defining the re-used parts
    roll = da.rolling(x=x, y=y)
    mean = roll.mean()
    count = roll.count()
    # First term: sum of squared values
    term1 = (da**2).rolling(x=x, y=y).sum()
    # Second term: cross-term sum
    term2 = -2 * mean * roll.sum()
    # Third term: 'sum' of squared means
    term3 = count * mean**2
    # Combining into the variance
    var = (term1 + term2 + term3) / count
    return var

def old_var(da, x=10, y=20):
    roll = da.rolling(x=x, y=y)
    var = roll.var()
    return var

%memit new_var(temp)
%memit old_var(temp)
```

peak memory: 429.77 MiB, increment: 134.92 MiB
peak memory: 5064.07 MiB, increment: 4768.45 MiB

I wanted to double check that the calculation was working correctly:

```python
print((var_o.where(~np.isnan(var_o), 0) == var_n.where(~np.isnan(var_n), 0)).all().values)
print(np.allclose(var_o, var_n, equal_nan=True))
```

False
True

I think the difference here is just due to floating point errors, but maybe someone who knows how to check that in more detail could have a look.

The standard deviation can be trivially implemented from this if the approach works.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Better rolling reductions 809366777
1505575188 https://github.com/pydata/xarray/pull/4915#issuecomment-1505575188 https://api.github.com/repos/pydata/xarray/issues/4915 IC_kwDOAMm_X85ZvUEU tbloch1 34276374 2023-04-12T16:27:55Z 2023-04-12T16:27:55Z NONE

Has there been any progress on this for var/std?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Better rolling reductions 809366777
1196410247 https://github.com/pydata/xarray/issues/6828#issuecomment-1196410247 https://api.github.com/repos/pydata/xarray/issues/6828 IC_kwDOAMm_X85HT8WH tbloch1 34276374 2022-07-27T08:19:28Z 2022-07-27T08:19:28Z NONE

Thanks for the workaround @mathause!

Is there a benefit to your approach, rather than calling `compute()` on each DataArray? It seems like calling `compute()` twice is faster for the MVCE example (but maybe it won't scale that way).

Either way, it would be nice if the function threw a warning or error when given dask arrays!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray.DataArray.str.cat() doesn't work on chunked data 1318369110
1098856530 https://github.com/pydata/xarray/issues/6456#issuecomment-1098856530 https://api.github.com/repos/pydata/xarray/issues/6456 IC_kwDOAMm_X85BfzhS tbloch1 34276374 2022-04-14T08:37:11Z 2022-04-14T08:37:11Z NONE

@delgadom thanks! This did help with my actual code, and I've now done my processing.

But this bug report was more about the fact that overwriting was converting data to NaNs (in two different ways depending on the code apparently).

In my case there is no longer any need to overwrite, but this doesn't seem like the expected behaviour, and I'm sure there are valid reasons to overwrite data, which is why I opened the bug report.

If overwriting is supposed to convert data to NaNs then I guess we could close this issue, but I'm not sure that's intended?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301
1096382964 https://github.com/pydata/xarray/issues/6456#issuecomment-1096382964 https://api.github.com/repos/pydata/xarray/issues/6456 IC_kwDOAMm_X85BWXn0 tbloch1 34276374 2022-04-12T08:47:55Z 2022-04-12T08:48:48Z NONE

@max-sixty could you explain which bit isn't working for you? The initial example I shared works fine in Colab for me, so the problem may be on your end. The second one required specifying the chunks when making the datasets (I've edited the example above).

Here's a link to the colab (which has both examples).

It's worth noting that the way the dataset breaks does seem to differ slightly between the two examples: in the former all the data becomes NaN, while in the latter only the initially saved data becomes NaN.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301
1094583214 https://github.com/pydata/xarray/issues/6456#issuecomment-1094583214 https://api.github.com/repos/pydata/xarray/issues/6456 IC_kwDOAMm_X85BPgOu tbloch1 34276374 2022-04-11T06:01:44Z 2022-04-12T08:48:13Z NONE

@max-sixty - I've tried to slim it down below (no loop, and only one save). From the print statements it's clear that before overwriting the .zarr, `ds3` is working correctly, but once `ds3` is saved, the data corresponding to the initial save is broken (now all NaNs). I am guessing this is due to reading from and saving over the same data, but I wouldn't have expected that to be a problem if the chunks were loaded into memory during the save.

```python
import pandas as pd
import numpy as np
import glob
import xarray as xr
from tqdm import tqdm

# Creating pkl files
[pd.DataFrame(np.random.randint(0, 10, (1000, 500))).astype(object).to_pickle('df{}.pkl'.format(i)) for i in range(4)]

fnames = glob.glob('*.pkl')

df1 = pd.read_pickle(fnames[0])
df1.columns = np.arange(0, 500).astype(object)  # the real pkl files contain all objects
df1.index = np.arange(0, 1000).astype(object)
df1 = df1.astype(np.float32)

ds = xr.DataArray(df1.values, dims=['fname', 'res_dim'],
                  coords={'fname': df1.index.values, 'res_dim': df1.columns.values})
ds = ds.to_dataset(name='low_dim').chunk({'fname': 500, 'res_dim': 1})

ds.to_zarr('zarr_bug.zarr', mode='w')
ds1 = xr.open_zarr('zarr_bug.zarr', decode_coords="all")

df2 = pd.read_pickle(fnames[1])
df2.columns = np.arange(0, 500).astype(object)
df2.index = np.arange(0, 1000).astype(object)
df2 = df2.astype(np.float32)

ds2 = xr.DataArray(df2.values, dims=['fname', 'res_dim'],
                   coords={'fname': df2.index.values, 'res_dim': df2.columns.values})
ds2 = ds2.to_dataset(name='low_dim').chunk({'fname': 500, 'res_dim': 1})

ds3 = xr.concat([ds1, ds2], dim='fname')
ds3['fname'] = ds3.fname.astype(str)

print(ds3.low_dim.values)

ds3.to_zarr('zarr_bug.zarr', mode='w')

print(ds3.low_dim.values)
```

The output:

[[7. 8. 4. ... 9. 6. 7.]
 [0. 4. 5. ... 9. 7. 6.]
 [3. 4. 3. ... 1. 6. 1.]
 ...
 [4. 0. 4. ... 5. 6. 9.]
 [5. 2. 5. ... 1. 7. 1.]
 [8. 9. 7. ... 4. 4. 1.]]
[[nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [ 4. 0. 4. ... 5. 6. 9.]
 [ 5. 2. 5. ... 1. 7. 1.]
 [ 8. 9. 7. ... 4. 4. 1.]]
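If that guess is right, one possible sanity check (not something verified in this issue) would be to force `ds3` into memory before the second write, so the write no longer reads lazily from the store it is overwriting:

```python
# Hypothetical check: materialise ds3 before overwriting the store it still
# lazily references, then write and inspect the values again
ds3_loaded = ds3.load()
ds3_loaded.to_zarr('zarr_bug.zarr', mode='w')
print(ds3_loaded.low_dim.values)
```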

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301
1094587632 https://github.com/pydata/xarray/issues/6456#issuecomment-1094587632 https://api.github.com/repos/pydata/xarray/issues/6456 IC_kwDOAMm_X85BPhTw tbloch1 34276374 2022-04-11T06:07:06Z 2022-04-11T10:42:51Z NONE

@delgadom - In the example it's saving every iteration, but in my actual code it's much less frequent. I figured there was probably a better way to achieve the same thing, but it still doesn't seem like the expected behaviour, which is why I thought I should raise the issue here.

The files are just sequentially named (as in my example), but the indices of the resulting dataframes are a bunch of unique strings (file paths, not dates).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301
1090330555 https://github.com/pydata/xarray/issues/1437#issuecomment-1090330555 https://api.github.com/repos/pydata/xarray/issues/1437 IC_kwDOAMm_X85A_R-7 tbloch1 34276374 2022-04-06T14:21:15Z 2022-04-06T14:21:15Z NONE

Had the same issue; fixed it by using `del ds.my_var.attrs['attr_to_delete']` before trying to save my dataset.
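A minimal sketch of that workaround, with placeholder names (`my_var`, `attr_to_delete`) and a toy dataset standing in for the real one; saving to netCDF here is just an example:

```python
import numpy as np
import xarray as xr

# Toy dataset with a variable attribute that the writer may refuse to serialise
ds = xr.Dataset({"my_var": ("x", np.arange(3))})
ds.my_var.attrs["attr_to_delete"] = {"not": "serialisable"}

# Drop the offending attribute, then save
del ds.my_var.attrs["attr_to_delete"]
ds.to_netcdf("out.nc")
```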

{
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  How can I drop attribute of DataArray 232743076


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);