Comment 1170015912 on pydata/xarray#6733 (https://github.com/pydata/xarray/issues/6733#issuecomment-1170015912), posted 2022-06-29T13:57:51Z

Ah, I think I get it now. If you set _FillValue via encoding, that is an instruction to represent np.nan with a specific value. xarray does not want to alter the provided data, so it performs that substitution on a fresh copy rather than in place.

So, for any dtype, setting _FillValue in the encoding argument always requires twice the memory of the input data. If encoding also sets dtype, additional memory is needed to hold the cast data, hence 2.5 times for float32 to uint16.
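That accounting can be sketched with plain numpy (this is an illustration of the memory cost, not xarray's actual code path; the fill value 65535 and the float32 to uint16 cast are taken from the example further down):

```python
import numpy as np

# Rough sketch of the memory cost of encoding _FillValue plus a dtype
# cast, e.g. float32 -> uint16 (not xarray's actual implementation).
data = np.array([0.0, 1.0, np.nan], dtype='float32')

# 1) The NaN -> fill-value substitution happens on a fresh copy, so the
#    caller's array is left untouched: +1x the input size.
filled = np.where(np.isnan(data), np.float32(65535), data)

# 2) The dtype cast allocates yet another array: +0.5x, since uint16 is
#    half the width of float32.
cast = filled.astype('uint16')

# Peak memory: input (1x) + filled copy (1x) + cast result (0.5x) = 2.5x.
```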

Where a cast is specified in encoding, could xarray not cast the data first to get that isolated copy and then set the fill on the cast array?
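The suggested cast-first order can also be sketched in numpy (hypothetical; this is not what xarray currently does):

```python
import numpy as np

# Hypothetical sketch of the cast-first order suggested above.
data = np.array([0.0, 1.0, np.nan], dtype='float32')
mask = np.isnan(data)          # remember where the NaNs were

# The cast itself already produces the isolated copy (+0.5x for uint16);
# NaN casts to an arbitrary integer here, but it is overwritten below.
cast = data.astype('uint16')

# The fill is then written in place on the smaller array.
cast[mask] = 65535

# Peak memory: input (1x) + cast copy (0.5x) = 1.5x instead of 2.5x.
```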

The manual encoding does indeed work as suggested. The only possible gotcha for users is that data stored in a netCDF file as an integer type but with a _FillValue is loaded as a float using np.nan, because integer types have no equivalent of np.nan.

```python
import xarray
import numpy as np

data = np.arange(65536, dtype='uint16')
xda = xarray.DataArray(data=data, attrs={'_FillValue': 65535})
xda.to_netcdf('test.nc')

roundtrip = xarray.load_dataarray('test.nc')
roundtrip
<xarray.DataArray (dim_0: 65536)>
array([0.0000e+00, 1.0000e+00, 2.0000e+00, ..., 6.5533e+04, 6.5534e+04,
       nan], dtype=float32)
Dimensions without coordinates: dim_0
```

There might be a consistency problem here with ncdump. If you do not set attrs={'_FillValue': 65535}, then xarray loads the file as uint16 and shows all the values:

```python
data = np.arange(65536, dtype='uint16')
xda = xarray.DataArray(data=data)
xda.to_netcdf('test.nc')

roundtrip = xarray.load_dataarray('test.nc')
roundtrip
<xarray.DataArray (dim_0: 65536)>
array([    0,     1,     2, ..., 65533, 65534, 65535], dtype=uint16)
Dimensions without coordinates: dim_0
```

However, ncdump on the same file interprets the last value as missing:

```bash
$ ncdump test.nc
netcdf test {
dimensions:
	dim_0 = 65536 ;
variables:
	ushort xarray_dataarray_variable(dim_0) ;
data:

 xarray_dataarray_variable = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
    ...
    65528, 65529, 65530, 65531, 65532, 65533, 65534, _ ;
}
```

This is because 65535 is the default fill value for u2 data in the netCDF4 library. It seems that ncdump assumes there must be a missing-data value, which is 65535 unless set otherwise, whereas xarray uses the presence of the _FillValue attribute to signal that missing data exists at all.

Using netCDF4 to read the same file gives the same result as ncdump: it reports the assumed fill value and returns a masked array to represent the missing value.
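The masked-array representation itself can be reproduced with plain numpy.ma (a sketch of the idea, not netCDF4's internals):

```python
import numpy as np

# Sketch of the representation netCDF4 returns: the raw uint16 values
# are kept exactly as stored, and a mask flags entries equal to the
# (default) fill value as missing.
raw = np.array([0, 1, 65534, 65535], dtype='uint16')
masked = np.ma.masked_equal(raw, 65535)

# masked.data still holds 65535; masked.mask is True only there, and
# np.ma.masked_equal also sets masked.fill_value to 65535.
```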

```python
import netCDF4

x = netCDF4.Dataset('test.nc')
x['xarray_dataarray_variable']
<class 'netCDF4._netCDF4.Variable'>
uint16 xarray_dataarray_variable(dim_0)
unlimited dimensions:
current shape = (65536,)
filling on, default _FillValue of 65535 used
x['xarray_dataarray_variable'][:]
masked_array(data=[0, 1, 2, ..., 65533, 65534, --],
             mask=[False, False, False, ..., False, False, True],
             fill_value=65535,
             dtype=uint16)
```
