html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/6733#issuecomment-1171130914,https://api.github.com/repos/pydata/xarray/issues/6733,1171130914,IC_kwDOAMm_X85Fzgoi,9569132,2022-06-30T11:59:07Z,2022-06-30T11:59:07Z,NONE,"I still see strange memory spikes that kill my jobs but the behaviour is not reproducible - the conversion will fail with > 4x memory use and then succeed the next time with the same inputs. My guess is that this isn't anything to do with `xarray`, but noting it just in case.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1286995366
https://github.com/pydata/xarray/issues/6733#issuecomment-1170900930,https://api.github.com/repos/pydata/xarray/issues/6733,1170900930,IC_kwDOAMm_X85FyofC,9569132,2022-06-30T08:05:47Z,2022-06-30T08:05:47Z,NONE,"Thanks @dcherian - completely agree that assuming 65535 is a fill can be confusing.
My question is basically solved, but the big memory increase still surprises me. If you cast first (when a cast is required), wouldn't you still have the user's data at the original precision as a reference for the filling step?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1286995366
https://github.com/pydata/xarray/issues/6733#issuecomment-1170015912,https://api.github.com/repos/pydata/xarray/issues/6733,1170015912,IC_kwDOAMm_X85FvQao,9569132,2022-06-29T13:57:51Z,2022-06-29T13:57:51Z,NONE,"Ah. I think I get it now. If you are setting `_FillValue` via encoding, that is an instruction to
represent `np.NaN` with a specific value. `xarray` does not want to alter the provided data, so
it does that substitution on a fresh copy, rather than doing it in place.
So, for _any_ `dtype`, setting `_FillValue` via the `encoding` argument always requires
twice the memory of the input data. If `encoding` also sets `dtype`, then more memory is
required to hold the cast data - hence 2.5 times for `float32` to `uint16`.
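To put rough numbers on that, using the array shape from the traceback in my earlier comment:
```python
import numpy as np

# Rough memory accounting for the array in the traceback (shape taken from there).
shape = (365, 3600, 7200)
n = np.prod(shape)

float_copy = n * np.dtype('float32').itemsize / 2**30  # ~35.2 GiB per float32 copy
int_copy = n * np.dtype('uint16').itemsize / 2**30      # ~17.6 GiB per uint16 copy

# fill-then-cast: original data + filled float copy + cast uint16 copy
print(2 * float_copy + int_copy)  # ~88 GiB, i.e. the ~2.5x figure
```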
Where a cast is specified in `encoding`, could `xarray` not cast the data _first_ to get that
isolated copy and then set the fill on the cast array?
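Something like the following is what I have in mind - just a sketch at the NumPy level, not a claim about how `xarray` is structured internally:
```python
import numpy as np

# Sketch of the cast-first-then-fill ordering, at the NumPy level only.
user_data = np.array([0.0, 1.0, np.nan, 3.0], dtype='float32')  # left untouched

mask = np.isnan(user_data)            # cheap boolean mask of the missing values
encoded = user_data.astype('uint16')  # the cast itself makes the isolated copy
encoded[mask] = 65535                 # fill applied in place on the cast copy
# (NumPy may warn about casting NaN to an integer, but those slots are
# overwritten by the fill, so the result is [0, 1, 65535, 3])

# Peak extra memory here is one uint16 copy plus a boolean mask, rather than a
# full float copy plus a uint16 copy.
```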
The manual encoding does indeed work as suggested - the only possible gotcha here for users is
that data stored in a netCDF file as an integer type but with a `_FillValue` is loaded as a float using
`np.nan`, because there is no `np.nan` equivalent for integer types.
```python
>>> import xarray
>>> import numpy as np
>>>
>>> data = np.arange(65536, dtype='uint16')
>>> xda = xarray.DataArray(data=data, attrs={'_FillValue': 65535})
>>> xda.to_netcdf('test.nc')
>>>
>>> roundtrip = xarray.load_dataarray('test.nc')
>>> roundtrip
<xarray.DataArray (dim_0: 65536)>
array([0.0000e+00, 1.0000e+00, 2.0000e+00, ..., 6.5533e+04, 6.5534e+04,
nan], dtype=float32)
Dimensions without coordinates: dim_0
```
There might be a consistency issue here with `ncdump`, though? If you do not set
`attrs={'_FillValue': 65535}`, then `xarray` loads the file as `uint16` and shows all the values:
```python
>>> data = np.arange(65536, dtype='uint16')
>>> xda = xarray.DataArray(data=data)
>>> xda.to_netcdf('test.nc')
>>>
>>> roundtrip = xarray.load_dataarray('test.nc')
>>> roundtrip
<xarray.DataArray (dim_0: 65536)>
array([ 0, 1, 2, ..., 65533, 65534, 65535], dtype=uint16)
Dimensions without coordinates: dim_0
```
However, `ncdump` on the same file interprets the last value as missing:
```bash
$ ncdump test.nc
netcdf test {
dimensions:
dim_0 = 65536 ;
variables:
ushort __xarray_dataarray_variable__(dim_0) ;
data:
__xarray_dataarray_variable__ = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
...
65528, 65529, 65530, 65531, 65532, 65533, 65534, _ ;
}
```
This is because 65535 is the default fill value for `u2` data in the netCDF4 library. It
seems like `ncdump` assumes there is always a missing data value - 65535 unless set
otherwise - whereas `xarray` uses the presence of the `_FillValue` attribute to signal that
the data contain missing values.
Using `netCDF4` to read the same file gives the same result as `ncdump`, reporting the assumed
fill value and using a masked array to show the missing values.
```python
>>> import netCDF4
>>> x = netCDF4.Dataset('test.nc')
>>> x['__xarray_dataarray_variable__']
<class 'netCDF4._netCDF4.Variable'>
uint16 __xarray_dataarray_variable__(dim_0)
unlimited dimensions:
current shape = (65536,)
filling on, default _FillValue of 65535 used
>>> x['__xarray_dataarray_variable__'][:]
masked_array(data=[0, 1, 2, ..., 65533, 65534, --],
mask=[False, False, False, ..., False, False, True],
fill_value=65535,
dtype=uint16)
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1286995366
https://github.com/pydata/xarray/issues/6733#issuecomment-1169175453,https://api.github.com/repos/pydata/xarray/issues/6733,1169175453,IC_kwDOAMm_X85FsDOd,9569132,2022-06-28T20:02:14Z,2022-06-28T20:02:14Z,NONE,"Thanks again for your help!
I think that is what I am doing. If I understand right:
### Using `to_netcdf` to handle the encoding
* I'm passing a DataArray containing `float32` data.
* The `np.ndarray` containing the data can simply use `np.nan` to represent missing data.
* Using `to_netcdf` with an `encoding` to `uint16` has a peak memory usage of 2 x float + 1 x int - the NaNs are set on a copy, which is then converted to int (a minimal sketch of this call is below, after this list).
* The other thing that is puzzling here is that 35GB * 2.5 (two `float32` copies + one `uint16` copy) is ~90GB, but many of the processes are using much more than that.
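To be concrete, the call in this case looks something like this (variable name and values are made up - just a minimal sketch):
```python
import numpy as np
import xarray

# Minimal sketch of the encoding-driven path: float32 data with np.nan marking
# missing values, packed to uint16 by to_netcdf itself.
xda = xarray.DataArray(
    np.array([0.0, 1.0, np.nan, 3.0], dtype='float32'),
    name='var',
)
encoding = {'var': {'dtype': 'uint16', '_FillValue': 65535}}
# The NaNs are filled on a float copy, which is then cast to uint16, hence the
# two float copies + one int copy at peak.
xda.to_netcdf('encoded.nc', encoding=encoding)
```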
### Manual encoding
* I'm passing a DataArray containing `uint16` data.
* However - as far as I can see - a DataArray _itself_ doesn't specify an alternative missing data value. Because `np.nan` is a float, you can't represent missing data within an integer DataArray?
* So, I am using `_FillValue=65535` in the `encoding` argument to `to_netcdf` (sketched below, after this list).
* But that still appears to trigger the encoding step - the traceback in my second comment was from a manually encoded `uint16` DataArray.
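For reference, the manual route looks roughly like this (again, the names and values are made up):
```python
import numpy as np
import xarray

# Minimal sketch of the manual route: do the fill substitution by hand before
# handing the data to xarray.
float_data = np.array([0.0, 1.0, np.nan, 3.0], dtype='float32')
packed = np.where(np.isnan(float_data), 65535, float_data).astype('uint16')

xda = xarray.DataArray(packed, name='var')
# _FillValue still needs to be declared so that 65535 is read back as missing,
# but passing it via encoding appears to re-trigger the fill/encode step.
xda.to_netcdf('manual.nc', encoding={'var': {'_FillValue': 65535}})
```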
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1286995366
https://github.com/pydata/xarray/issues/6733#issuecomment-1169128311,https://api.github.com/repos/pydata/xarray/issues/6733,1169128311,IC_kwDOAMm_X85Fr3t3,9569132,2022-06-28T19:20:00Z,2022-06-28T19:20:00Z,NONE,"Thanks for the quick response.
I don't quite follow the process for the `FillValue`. If I manually encode as integer, setting NaN to `65535` as above, how would I ensure that the correct fill value is set in the resulting file? If I leave the `_FillValue = 65535` out of the encoding for `DataArray.to_netcdf`, then those values won't be interpreted correctly.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1286995366
https://github.com/pydata/xarray/issues/6733#issuecomment-1168483200,https://api.github.com/repos/pydata/xarray/issues/6733,1168483200,IC_kwDOAMm_X85FpaOA,9569132,2022-06-28T09:37:59Z,2022-06-28T09:37:59Z,NONE,"I've also tried pre-converting the `float32` array to `uint16`:
```python
if pack:
out_data = np.round(base_grid * scale_factor, 0)
out_data[np.isnan(out_data)] = 65535
out_data = out_data.astype('uint16')
else:
out_data = base_grid
```
I expect that to add an extra ~17GB, for a total memory use of 53GB or so, but exporting to netCDF still shows unexpectedly variable peak memory use:
```bash
$ grep peak conversion_*
conversion_10.out: Used : 133 (peak) 0.53 (ave)
conversion_11.out: Used : 117 (peak) 0.73 (ave)
conversion_12.out: Used : 92 (peak) 0.93 (ave)
conversion_13.out: Used : 103 (peak) 0.75 (ave)
conversion_14.out: Used : 79 (peak) 0.64 (ave)
conversion_15.out: Used : 94 (peak) 0.66 (ave)
conversion_16.out: Used : 92 (peak) 0.95 (ave)
conversion_17.out: Used : 129 (peak) 0.66 (ave)
conversion_18.out: Used : 92 (peak) 0.91 (ave)
conversion_19.out: Used : 105 (peak) 0.67 (ave)
conversion_1.out: Used : 77 (peak) 0.94 (ave)
conversion_20.out: Used : 87 (peak) 0.65 (ave)
conversion_21.out: Used : 93 (peak) 0.63 (ave)
conversion_2.out: Used : 92 (peak) 0.95 (ave)
conversion_3.out: Used : 92 (peak) 0.94 (ave)
conversion_4.out: Used : 92 (peak) 0.93 (ave)
conversion_5.out: Used : 121 (peak) 0.47 (ave)
conversion_6.out: Used : 92 (peak) 0.94 (ave)
conversion_7.out: Used : 92 (peak) 0.96 (ave)
conversion_8.out: Used : 92 (peak) 0.93 (ave)
conversion_9.out: Used : 129 (peak) 0.47 (ave)
```
One thing I do see in the script output for _some_ failing files is this exception - the `to_netcdf` process appears to be allocating another ~35GB `float32` array?
```python
Data loaded; Memory usage: 35.70772171020508
Conversion complete; Memory usage: 53.329856872558594
Array created; Memory usage: 53.329856872558594
Traceback (most recent call last):
File ""/rds/general/project/lemontree/live/source/SNU_Ryu_FPAR_LAI/convert_SNU_Ryu_to_netcdf.py"", line 162, in
xds.to_netcdf(out_file, encoding=encoding)
File ""/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/core/dataarray.py"", line 2839, in to_netcdf
return dataset.to_netcdf(*args, **kwargs)
File ""/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/core/dataset.py"", line 1902, in to_netcdf
return to_netcdf(
File ""/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/backends/api.py"", line 1072, in to_netcdf
dump_to_store(
File ""/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/backends/api.py"", line 1119, in dump_to_store
store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
File ""/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/backends/common.py"", line 261, in store
variables, attributes = self.encode(variables, attributes)
File ""/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/backends/common.py"", line 350, in encode
variables, attributes = cf_encoder(variables, attributes)
File ""/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/conventions.py"", line 855, in cf_encoder
new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
File ""/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/conventions.py"", line 855, in
new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
File ""/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/conventions.py"", line 269, in encode_cf_variable
var = coder.encode(var, name=name)
File ""/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/coding/variables.py"", line 168, in encode
data = duck_array_ops.fillna(data, fill_value)
File ""/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/core/duck_array_ops.py"", line 298, in fillna
return where(notnull(data), data, other)
File ""/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/core/duck_array_ops.py"", line 285, in where
return _where(condition, *as_shared_dtype([x, y]))
File ""<__array_function__ internals>"", line 180, in where
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 35.2 GiB for an array with shape (365, 3600, 7200) and data type float32
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1286995366