html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/7767#issuecomment-1514473763,https://api.github.com/repos/pydata/xarray/issues/7767,1514473763,IC_kwDOAMm_X85aRQkj,34276374,2023-04-19T10:08:52Z,2023-04-19T10:08:52Z,NONE,"Thanks for the replies
So while `xr.where(cond, x, y)` semantically means ""where the condition is true, x, else y"", `da.where(cond, x)` means ""where the condition is true, `da`, else x"".
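For example (a minimal sketch of what I mean; `da` and `cond` are illustrative):
```python
import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0])
cond = da > 1

xr.where(cond, da, 0)  # where cond is true -> da, else 0
da.where(cond, 0)      # keep da where cond is true, else 0
```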
The latter feels quite unintuitive to me. Is the reason they're different only for the mask example you provided, where NaN is returned as the default 'x' value?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1674532233
https://github.com/pydata/xarray/issues/4325#issuecomment-1507201606,https://api.github.com/repos/pydata/xarray/issues/4325,1507201606,IC_kwDOAMm_X85Z1hJG,34276374,2023-04-13T15:48:31Z,2023-04-13T15:48:31Z,NONE,"I think I may have found a way to make the variance/standard deviation calculation more memory efficient, but I don't know enough about writing the sort of code that would be needed for a PR.
I basically wrote out the calculation for variance, trying to use only functions that have already been optimised. Derived from:
$$ var = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 $$
$$ var = \frac{1}{n} \left( (x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \dots \right) $$
$$ var = \frac{1}{n} \left( (x_1^2 - 2x_1\mu + \mu^2) + (x_2^2 - 2x_2\mu + \mu^2) + (x_3^2 - 2x_3\mu + \mu^2) + \dots \right) $$
$$ var = \frac{1}{n} \left( \sum_{i=1}^{n} x_i^2 - 2\mu\sum_{i=1}^{n} x_i + n\mu^2 \right)$$
I coded this up, and it uses approximately 10% of the memory of the current `.var()` implementation:
```python
%load_ext memory_profiler
import numpy as np
import xarray as xr

temp = xr.DataArray(np.random.randint(0, 10, (5000, 500)), dims=(""x"", ""y""))

def new_var(da, x=10, y=20):
    # Defining the re-used parts
    roll = da.rolling(x=x, y=y)
    mean = roll.mean()
    count = roll.count()
    # First term: sum of squared values
    term1 = (da**2).rolling(x=x, y=y).sum()
    # Second term: the cross-term sum
    term2 = -2 * mean * roll.sum()
    # Third term: the 'sum' of squared means
    term3 = count * mean**2
    # Combining into the variance
    var = (term1 + term2 + term3) / count
    return var

def old_var(da, x=10, y=20):
    roll = da.rolling(x=x, y=y)
    var = roll.var()
    return var

%memit new_var(temp)
%memit old_var(temp)
```
```
peak memory: 429.77 MiB, increment: 134.92 MiB
peak memory: 5064.07 MiB, increment: 4768.45 MiB
```
I wanted to double-check that the calculation was working correctly:
```python
var_n = new_var(temp)
var_o = old_var(temp)
print((var_o.where(~np.isnan(var_o), 0) == var_n.where(~np.isnan(var_n), 0)).all().values)
print(np.allclose(var_o, var_n, equal_nan=True))
```
```
False
True
```
I think the difference here is just due to floating point errors, but maybe someone who knows how to check that in more detail could have a look.
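To quantify the discrepancy in a bit more detail, something like this might help (a minimal sketch, reusing `var_o` and `var_n` from above; xarray's `.max()`/`.mean()` skip NaNs by default):
```python
diff = np.abs(var_o - var_n)
print(float(diff.max()))   # largest absolute difference
print(float(diff.mean()))  # typical size of the discrepancy
```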
The standard deviation can be trivially implemented from this if the approach works.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,675482176
https://github.com/pydata/xarray/pull/4915#issuecomment-1506749920,https://api.github.com/repos/pydata/xarray/issues/4915,1506749920,IC_kwDOAMm_X85Zzy3g,34276374,2023-04-13T10:47:38Z,2023-04-13T10:47:38Z,NONE,"I think I may have found a way to make it more memory efficient, but I don't know enough about writing the sort of code that would be needed for a PR.
I basically wrote out the calculation for variance, trying to use only functions that have already been optimised. Derived from:
$$ var = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 $$
$$ var = \frac{1}{n} \left( (x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \dots \right) $$
$$ var = \frac{1}{n} \left( (x_1^2 - 2x_1\mu + \mu^2) + (x_2^2 - 2x_2\mu + \mu^2) + (x_3^2 - 2x_3\mu + \mu^2) + \dots \right) $$
$$ var = \frac{1}{n} \left( \sum_{i=1}^{n} x_i^2 - 2\mu\sum_{i=1}^{n} x_i + n\mu^2 \right)$$
I coded this up, and it uses approximately 10% of the memory of the current `.var()` implementation:
```python
%load_ext memory_profiler
import numpy as np
import xarray as xr

temp = xr.DataArray(np.random.randint(0, 10, (5000, 500)), dims=(""x"", ""y""))

def new_var(da, x=10, y=20):
    # Defining the re-used parts
    roll = da.rolling(x=x, y=y)
    mean = roll.mean()
    count = roll.count()
    # First term: sum of squared values
    term1 = (da**2).rolling(x=x, y=y).sum()
    # Second term: the cross-term sum
    term2 = -2 * mean * roll.sum()
    # Third term: the 'sum' of squared means
    term3 = count * mean**2
    # Combining into the variance
    var = (term1 + term2 + term3) / count
    return var

def old_var(da, x=10, y=20):
    roll = da.rolling(x=x, y=y)
    var = roll.var()
    return var

%memit new_var(temp)
%memit old_var(temp)
```
```
peak memory: 429.77 MiB, increment: 134.92 MiB
peak memory: 5064.07 MiB, increment: 4768.45 MiB
```
I wanted to double-check that the calculation was working correctly:
```python
var_n = new_var(temp)
var_o = old_var(temp)
print((var_o.where(~np.isnan(var_o), 0) == var_n.where(~np.isnan(var_n), 0)).all().values)
print(np.allclose(var_o, var_n, equal_nan=True))
```
```
False
True
```
I think the difference here is just due to floating point errors, but maybe someone who knows how to check that in more detail could have a look.
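Since the standard deviation is just the square root of the variance, the same trick extends directly. A minimal sketch (assuming `new_var` from above works as intended):
```python
def new_std(da, x=10, y=20):
    return np.sqrt(new_var(da, x=x, y=y))
```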
If the approach works, the standard deviation follows trivially from this, as sketched above.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,809366777
https://github.com/pydata/xarray/pull/4915#issuecomment-1505575188,https://api.github.com/repos/pydata/xarray/issues/4915,1505575188,IC_kwDOAMm_X85ZvUEU,34276374,2023-04-12T16:27:55Z,2023-04-12T16:27:55Z,NONE,Has there been any progress on this for var/std?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,809366777
https://github.com/pydata/xarray/issues/6828#issuecomment-1196410247,https://api.github.com/repos/pydata/xarray/issues/6828,1196410247,IC_kwDOAMm_X85HT8WH,34276374,2022-07-27T08:19:28Z,2022-07-27T08:19:28Z,NONE,"Thanks for the workaround @mathause!
Is there a benefit to your approach, rather than calling `compute()` on each DataArray? It seems like calling `compute()` twice is faster for the MVCE example (but maybe it won't scale that way).
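For reference, the alternative I had in mind (illustrative only; `da_a` and `da_b` stand in for the two dask-backed DataArrays from the MVCE):
```python
da_a = da_a.compute()  # materialise each array up front
da_b = da_b.compute()
```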
But either way, it would be nice if the function threw a warning/error when handling dask arrays!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1318369110
https://github.com/pydata/xarray/issues/6456#issuecomment-1098856530,https://api.github.com/repos/pydata/xarray/issues/6456,1098856530,IC_kwDOAMm_X85BfzhS,34276374,2022-04-14T08:37:11Z,2022-04-14T08:37:11Z,NONE,"@delgadom thanks! This did help with my actual code, and I've now done my processing.
But this bug report was more about the fact that overwriting was converting data to NaNs (in two different ways, apparently depending on the code).
In my case there is no longer any need to overwrite, but this doesn't seem like the expected behaviour, and I'm sure there are valid reasons to overwrite data - hence my opening the bug report.
If overwriting is supposed to convert data to NaNs then I guess we could close this issue, but I'm not sure that's intended?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1197117301
https://github.com/pydata/xarray/issues/6456#issuecomment-1096382964,https://api.github.com/repos/pydata/xarray/issues/6456,1096382964,IC_kwDOAMm_X85BWXn0,34276374,2022-04-12T08:47:55Z,2022-04-12T08:48:48Z,NONE,"@max-sixty could you explain which bit isn't working for you? The initial example I shared works fine in colab for me, so that might be a you problem. The second one required specifying the chunks when making the datasets (I've edited it above).
[Here's a link to the colab](https://colab.research.google.com/drive/1H6ugbz9Ug208x5fLpmvNxIBdKgASjz7V?usp=sharing) (which has both examples).
It's worth noting that the dataset seems to break slightly differently in each of these examples - in the former, all data becomes NaN; in the latter, only the initially saved data becomes NaN.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1197117301
https://github.com/pydata/xarray/issues/6456#issuecomment-1094583214,https://api.github.com/repos/pydata/xarray/issues/6456,1094583214,IC_kwDOAMm_X85BPgOu,34276374,2022-04-11T06:01:44Z,2022-04-12T08:48:13Z,NONE,"@max-sixty - I've tried to slim it down below (no loop, and only one save). From the print statements, it's clear that `ds3` works correctly before the .zarr is overwritten, but once `ds3` is saved, the data corresponding to the initial save breaks (it becomes all NaNs). I am guessing this is due to trying to read from and save over the same data, but I wouldn't have expected it to be a problem if it were loading the chunks into memory during the saving.
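If the simultaneous read/write is indeed the cause, one possible workaround (an untested sketch, not a verified fix) would be to force the data into memory before overwriting:
```python
ds3 = ds3.load()  # materialise the dask-backed data first
ds3.to_zarr('zarr_bug.zarr', mode='w')
```
The slimmed-down reproducer: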
```python
import pandas as pd
import numpy as np
import glob
import xarray as xr

# Creating pkl files
[pd.DataFrame(np.random.randint(0, 10, (1000, 500))).astype(object).to_pickle('df{}.pkl'.format(i)) for i in range(4)]
fnames = glob.glob('*.pkl')

df1 = pd.read_pickle(fnames[0])
df1.columns = np.arange(0, 500).astype(object)  # the real pkl files contain all objects
df1.index = np.arange(0, 1000).astype(object)
df1 = df1.astype(np.float32)
ds = xr.DataArray(df1.values, dims=['fname', 'res_dim'],
                  coords={'fname': df1.index.values, 'res_dim': df1.columns.values})
ds = ds.to_dataset(name='low_dim').chunk({'fname': 500, 'res_dim': 1})
ds.to_zarr('zarr_bug.zarr', mode='w')
ds1 = xr.open_zarr('zarr_bug.zarr', decode_coords=""all"")

df2 = pd.read_pickle(fnames[1])
df2.columns = np.arange(0, 500).astype(object)
df2.index = np.arange(0, 1000).astype(object)
df2 = df2.astype(np.float32)
ds2 = xr.DataArray(df2.values, dims=['fname', 'res_dim'],
                   coords={'fname': df2.index.values, 'res_dim': df2.columns.values})
ds2 = ds2.to_dataset(name='low_dim').chunk({'fname': 500, 'res_dim': 1})

ds3 = xr.concat([ds1, ds2], dim='fname')
ds3['fname'] = ds3.fname.astype(str)
print(ds3.low_dim.values)
ds3.to_zarr('zarr_bug.zarr', mode='w')
print(ds3.low_dim.values)
```
The output:
```
[[7. 8. 4. ... 9. 6. 7.]
[0. 4. 5. ... 9. 7. 6.]
[3. 4. 3. ... 1. 6. 1.]
...
[4. 0. 4. ... 5. 6. 9.]
[5. 2. 5. ... 1. 7. 1.]
[8. 9. 7. ... 4. 4. 1.]]
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[ 4. 0. 4. ... 5. 6. 9.]
[ 5. 2. 5. ... 1. 7. 1.]
[ 8. 9. 7. ... 4. 4. 1.]]
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1197117301
https://github.com/pydata/xarray/issues/6456#issuecomment-1094587632,https://api.github.com/repos/pydata/xarray/issues/6456,1094587632,IC_kwDOAMm_X85BPhTw,34276374,2022-04-11T06:07:06Z,2022-04-11T10:42:51Z,NONE,"@delgadom - In the example it's saving every iteration, but in my actual code it's much less frequent. I figured there was probably a better way to achieve the same thing, but it still doesn't seem like the expected behaviour, which is why I thought I should raise the issue here.
The files are just sequentially named (as in my example), but the indices of the resulting dataframes are a bunch of unique strings (file-paths, not dates).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1197117301
https://github.com/pydata/xarray/issues/1437#issuecomment-1090330555,https://api.github.com/repos/pydata/xarray/issues/1437,1090330555,IC_kwDOAMm_X85A_R-7,34276374,2022-04-06T14:21:15Z,2022-04-06T14:21:15Z,NONE,"Had the same issue, fixed it by using
`del ds.my_var.attrs['attr_to_delete']`
before I tried to save my dataset.","{""total_count"": 3, ""+1"": 3, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,232743076