Comment 1170015912 on pydata/xarray#6733 (https://github.com/pydata/xarray/issues/6733#issuecomment-1170015912), posted 2022-06-29T13:57:51Z

Ah, I think I get it now. If you set _FillValue via encoding, that is an instruction to represent np.nan with a specific value. xarray does not want to alter the provided data, so it performs that substitution on a fresh copy rather than in place.

So, for any dtype, setting _FillValue in the encoding argument always requires twice the memory of the input data. If encoding also sets dtype, additional memory is needed to hold the cast data, hence 2.5 times for float32 to uint16.
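That accounting can be sketched with plain numpy (this is an illustration of the memory cost, not xarray's actual code path; the fill value 65535 and the float32 to uint16 cast are taken from the example further down):

```python
import numpy as np

# Rough sketch of the memory cost of encoding _FillValue plus a dtype
# cast, e.g. float32 -> uint16 (not xarray's actual implementation).
data = np.array([0.0, 1.0, np.nan], dtype='float32')

# 1) The NaN -> fill-value substitution happens on a fresh copy, so the
#    caller's array is left untouched: +1x the input size.
filled = np.where(np.isnan(data), np.float32(65535), data)

# 2) The dtype cast allocates yet another array: +0.5x, since uint16 is
#    half the width of float32.
cast = filled.astype('uint16')

# Peak memory: input (1x) + filled copy (1x) + cast result (0.5x) = 2.5x.
```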

Where a cast is specified in encoding, could xarray not cast the data first to get that isolated copy and then set the fill on the cast array?
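The suggested cast-first order can also be sketched in numpy (hypothetical; this is not what xarray currently does):

```python
import numpy as np

# Hypothetical sketch of the cast-first order suggested above.
data = np.array([0.0, 1.0, np.nan], dtype='float32')
mask = np.isnan(data)          # remember where the NaNs were

# The cast itself already produces the isolated copy (+0.5x for uint16);
# NaN casts to an arbitrary integer here, but it is overwritten below.
cast = data.astype('uint16')

# The fill is then written in place on the smaller array.
cast[mask] = 65535

# Peak memory: input (1x) + cast copy (0.5x) = 1.5x instead of 2.5x.
```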

The manual encoding does indeed work as suggested. The only possible gotcha for users is that data stored in a netCDF file as an integer type but with a _FillValue is loaded as a float using np.nan, because integer types have no equivalent of np.nan.

```python
import xarray
import numpy as np

data = np.arange(65536, dtype='uint16')
xda = xarray.DataArray(data=data, attrs={'_FillValue': 65535})
xda.to_netcdf('test.nc')

roundtrip = xarray.load_dataarray('test.nc')
roundtrip
<xarray.DataArray (dim_0: 65536)>
array([0.0000e+00, 1.0000e+00, 2.0000e+00, ..., 6.5533e+04, 6.5534e+04,
       nan], dtype=float32)
Dimensions without coordinates: dim_0
```

There might be a consistency problem here with ncdump. If you do not set attrs={'_FillValue': 65535}, then xarray loads the file as uint16 and shows all the values:

```python
data = np.arange(65536, dtype='uint16')
xda = xarray.DataArray(data=data)
xda.to_netcdf('test.nc')

roundtrip = xarray.load_dataarray('test.nc')
roundtrip
<xarray.DataArray (dim_0: 65536)>
array([    0,     1,     2, ..., 65533, 65534, 65535], dtype=uint16)
Dimensions without coordinates: dim_0
```

However, ncdump on the same file interprets the last value as missing:

```bash
$ ncdump test.nc
netcdf test {
dimensions:
	dim_0 = 65536 ;
variables:
	ushort xarray_dataarray_variable(dim_0) ;
data:

 xarray_dataarray_variable = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
    ...
    65528, 65529, 65530, 65531, 65532, 65533, 65534, _ ;
}
```

This is because 65535 is the default fill value for u2 data in the netCDF4 library. It seems that ncdump assumes there must be a missing-data value, which is 65535 unless set otherwise, whereas xarray uses the presence of the _FillValue attribute to signal that missing data exists at all.

Using netCDF4 to read the same file gives the same result as ncdump: it reports the assumed fill value and returns a masked array to represent the missing value.
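The masked-array representation itself can be reproduced with plain numpy.ma (a sketch of the idea, not netCDF4's internals):

```python
import numpy as np

# Sketch of the representation netCDF4 returns: the raw uint16 values
# are kept exactly as stored, and a mask flags entries equal to the
# (default) fill value as missing.
raw = np.array([0, 1, 65534, 65535], dtype='uint16')
masked = np.ma.masked_equal(raw, 65535)

# masked.data still holds 65535; masked.mask is True only there, and
# np.ma.masked_equal also sets masked.fill_value to 65535.
```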

```python
import netCDF4

x = netCDF4.Dataset('test.nc')
x['xarray_dataarray_variable']
<class 'netCDF4._netCDF4.Variable'>
uint16 xarray_dataarray_variable(dim_0)
unlimited dimensions:
current shape = (65536,)
filling on, default _FillValue of 65535 used
x['xarray_dataarray_variable'][:]
masked_array(data=[0, 1, 2, ..., 65533, 65534, --],
             mask=[False, False, False, ..., False, False, True],
             fill_value=65535,
             dtype=uint16)
```
