issue_comments

6 rows where user = 9569132 (davidorme), sorted by updated_at descending. All six comments are on pydata/xarray issue #6733, "CFMaskCoder creates unnecessary copy for `uint16` variables".

Comment 1171130914 · davidorme (user 9569132) · author_association: NONE · 2022-06-30T11:59:07Z
https://github.com/pydata/xarray/issues/6733#issuecomment-1171130914

I still see strange memory spikes that kill my jobs but the behaviour is not reproducible - the conversion will fail with > 4x memory use and then succeed the next time with the same inputs. My guess is that this isn't anything to do with xarray, but noting it just in case.

Comment 1170900930 · davidorme (user 9569132) · author_association: NONE · 2022-06-30T08:05:47Z
https://github.com/pydata/xarray/issues/6733#issuecomment-1170900930

Thanks @dcherian - completely agree that assuming 65535 is a fill can be confusing.

My question is basically solved, but the big memory increase is surprising to me. If you cast first, when a cast is required, you would still have the user data at the original precision as a reference for the filling step?
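
A minimal sketch of the ordering being suggested here (illustrative only, not xarray's actual code path): cast to the target dtype first, so that the cast itself produces the isolated copy, then write the fill value into that copy in place, using the untouched user data to locate the NaNs.

```python
import numpy as np

# Illustrative stand-in for the user's float32 data.
data = np.array([0.0, 1.0, np.nan], dtype='float32')

# Cast first: this allocates the single new (half-size) uint16 copy.
# Casting NaN to an integer is undefined, but the garbage value is
# overwritten immediately below using the original data as the mask.
with np.errstate(invalid='ignore'):
    encoded = data.astype('uint16')
encoded[np.isnan(data)] = 65535  # fill in place on the copy

print(encoded)  # [    0     1 65535]
```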

Comment 1170015912 · davidorme (user 9569132) · author_association: NONE · 2022-06-29T13:57:51Z
https://github.com/pydata/xarray/issues/6733#issuecomment-1170015912

Ah. I think I get it now. If you are setting _FillValue via encoding, that is an instruction to represent np.nan with a specific value. xarray does not want to alter the provided data, so it does that substitution on a fresh copy, rather than doing it in place.

So, for any dtype, setting _FillValue via the encoding argument always requires twice the memory of the input data. If encoding also sets dtype, then more memory is required to hold the cast data, hence 2.5 times for float32 to uint16.

Where a cast is specified in encoding, could xarray not cast the data first to get that isolated copy and then set the fill on the cast array?

The manual encoding does indeed work as suggested. The only possible gotcha for users is that data stored in a netCDF file as an integer type but with a _FillValue is loaded as a float, using np.nan for the fill, because there is no np.nan equivalent for integer types.

```python
import xarray
import numpy as np

data = np.arange(65536, dtype='uint16')
xda = xarray.DataArray(data=data, attrs={'_FillValue': 65535})
xda.to_netcdf('test.nc')

roundtrip = xarray.load_dataarray('test.nc')
roundtrip
<xarray.DataArray (dim_0: 65536)>
array([0.0000e+00, 1.0000e+00, 2.0000e+00, ..., 6.5533e+04, 6.5534e+04,
       nan], dtype=float32)
Dimensions without coordinates: dim_0
```

There might be a problem here with consistency with ncdump? If you do not set attrs={'_FillValue': 65535} then xarray loads the file as uint16 and shows all the values:

```python
data = np.arange(65536, dtype='uint16')
xda = xarray.DataArray(data=data)
xda.to_netcdf('test.nc')

roundtrip = xarray.load_dataarray('test.nc')
roundtrip
<xarray.DataArray (dim_0: 65536)>
array([    0,     1,     2, ..., 65533, 65534, 65535], dtype=uint16)
Dimensions without coordinates: dim_0
```

However, ncdump on the same file interprets the last value as missing:

```bash
$ ncdump test.nc
netcdf test {
dimensions:
    dim_0 = 65536 ;
variables:
    ushort xarray_dataarray_variable(dim_0) ;
data:

 xarray_dataarray_variable = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
    ...
    65528, 65529, 65530, 65531, 65532, 65533, 65534, _ ;
}
```

This is because 65535 is the default fill value for u2 data in the netCDF4 library. It seems like ncdump asserts that there has to be a missing data value, which is 65535 unless set otherwise, but xarray is using the presence of the _FillValue attribute to signal the presence of missing data.
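
The per-dtype defaults are exposed by the netCDF4 Python module, which is one way to confirm where the 65535 comes from (a small illustrative check, not part of the original comment):

```python
import netCDF4

# The netCDF4 module publishes the library's default fill values,
# keyed by dtype code; 'u2' is unsigned 16-bit.
print(netCDF4.default_fillvals['u2'])  # 65535
```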

Using netCDF4 to read the same file gives the same result as ncdump, reporting the assumed fill value and using a masked array to show missing values.

```python
import netCDF4

x = netCDF4.Dataset('test.nc')
x['xarray_dataarray_variable']
<class 'netCDF4._netCDF4.Variable'>
uint16 xarray_dataarray_variable(dim_0)
unlimited dimensions:
current shape = (65536,)
filling on, default _FillValue of 65535 used
x['xarray_dataarray_variable'][:]
masked_array(data=[0, 1, 2, ..., 65533, 65534, --],
             mask=[False, False, False, ..., False, False, True],
             fill_value=65535,
             dtype=uint16)
```

Comment 1169175453 · davidorme (user 9569132) · author_association: NONE · 2022-06-28T20:02:14Z
https://github.com/pydata/xarray/issues/6733#issuecomment-1169175453

Thanks again for your help!

I think that is what I am doing. If I understand right:

Using to_netcdf to handle the encoding

  • I'm passing a DataArray containing float32 data.
  • The np.ndarray containing the data can simply use np.nan to represent missing data.
  • Using to_netcdf with an encoding to uint16 has a memory usage of 2 x float + 1 x int: the fill is set on a float copy, which is then converted to int.
  • The other thing that is puzzling here is that 35GB * 2.5 (two float32 copies + one uint16 copy) is ~90GB, but many of the processes are using much more than that (see the back-of-envelope check below).
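
A back-of-envelope check of that 2.5x figure, using the (365, 3600, 7200) array shape from the traceback in an earlier comment below (illustrative arithmetic only):

```python
# Two float32 copies (the original plus the fillna copy) and one
# uint16 copy (the cast) for a (365, 3600, 7200) array.
n = 365 * 3600 * 7200

float32_gb = 4 * n / 2**30  # ~35.2 GB per float32 copy
uint16_gb = 2 * n / 2**30   # ~17.6 GB for the uint16 copy
peak_gb = 2 * float32_gb + uint16_gb

print(round(peak_gb, 1))    # ~88.1 GB, i.e. 2.5 x the float32 input
```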

Manual encoding

  • I'm passing a DataArray containing uint16 data.
  • However, as far as I can see, DataArray itself doesn't provide a way to specify an alternative missing data value, and because np.nan is a float, you can't represent missing data within an integer DataArray?
  • So, I am using _FillValue=65535 in the encoding to to_netcdf.
  • But that still appears to be triggering the encoding step - the traceback in my second comment was from a manually encoded uint16 DataArray (the call is sketched below).
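
For concreteness, a minimal sketch of the manual-encoding call being described (the variable name and shape are made up for illustration):

```python
import numpy as np
import xarray

# uint16 data with 65535 already standing in for missing values.
data = np.full((4, 4), 65535, dtype='uint16')
xda = xarray.DataArray(data=data, name='var')

# _FillValue is supplied through the encoding argument, not attrs;
# per the comment above, this call still appears to trigger the
# CFMaskCoder encoding step.
xda.to_netcdf('manual.nc', encoding={'var': {'_FillValue': 65535}})
```
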
Comment 1169128311 · davidorme (user 9569132) · author_association: NONE · 2022-06-28T19:20:00Z
https://github.com/pydata/xarray/issues/6733#issuecomment-1169128311

Thanks for the quick response.

I don't quite follow the process for the _FillValue. If I manually encode as integer, setting NaN to 65535 as above, how would I ensure that the correct fill value is set in the resulting file? If I leave _FillValue = 65535 out of the encoding for DataArray.to_netcdf, then those values won't be interpreted correctly.
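
One way to check what actually ends up in the file (an illustrative check using the netCDF4 module, not something from the thread) is to read the attribute back directly:

```python
import netCDF4

# Read back whatever _FillValue attribute to_netcdf wrote, if any.
with netCDF4.Dataset('test.nc') as ds:
    var = ds['xarray_dataarray_variable']
    print(getattr(var, '_FillValue', 'no _FillValue attribute set'))
```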

Comment 1168483200 · davidorme (user 9569132) · author_association: NONE · 2022-06-28T09:37:59Z
https://github.com/pydata/xarray/issues/6733#issuecomment-1168483200

I've also tried pre-converting the float32 array to uint16:

```python
if pack:
    out_data = np.round(base_grid * scale_factor, 0)
    out_data[np.isnan(out_data)] = 65535
    out_data = out_data.astype('uint16')
else:
    out_data = base_grid
```

I expected that to add an extra 17GB, for a total memory footprint of 53GB or so, but exporting to netCDF still shows unexpectedly variable peak memory use:

```bash
$ grep peak conversion_*
conversion_10.out: Used : 133 (peak) 0.53 (ave)
conversion_11.out: Used : 117 (peak) 0.73 (ave)
conversion_12.out: Used : 92 (peak) 0.93 (ave)
conversion_13.out: Used : 103 (peak) 0.75 (ave)
conversion_14.out: Used : 79 (peak) 0.64 (ave)
conversion_15.out: Used : 94 (peak) 0.66 (ave)
conversion_16.out: Used : 92 (peak) 0.95 (ave)
conversion_17.out: Used : 129 (peak) 0.66 (ave)
conversion_18.out: Used : 92 (peak) 0.91 (ave)
conversion_19.out: Used : 105 (peak) 0.67 (ave)
conversion_1.out: Used : 77 (peak) 0.94 (ave)
conversion_20.out: Used : 87 (peak) 0.65 (ave)
conversion_21.out: Used : 93 (peak) 0.63 (ave)
conversion_2.out: Used : 92 (peak) 0.95 (ave)
conversion_3.out: Used : 92 (peak) 0.94 (ave)
conversion_4.out: Used : 92 (peak) 0.93 (ave)
conversion_5.out: Used : 121 (peak) 0.47 (ave)
conversion_6.out: Used : 92 (peak) 0.94 (ave)
conversion_7.out: Used : 92 (peak) 0.96 (ave)
conversion_8.out: Used : 92 (peak) 0.93 (ave)
conversion_9.out: Used : 129 (peak) 0.47 (ave)
```

One thing I do see for some failing files in the script reporting is this exception - the to_netcdf process appears to be creating another ~35GB float32 array?

```python
Data loaded; Memory usage: 35.70772171020508
Conversion complete; Memory usage: 53.329856872558594
Array created; Memory usage: 53.329856872558594
Traceback (most recent call last):
  File "/rds/general/project/lemontree/live/source/SNU_Ryu_FPAR_LAI/convert_SNU_Ryu_to_netcdf.py", line 162, in <module>
    xds.to_netcdf(out_file, encoding=encoding)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/core/dataarray.py", line 2839, in to_netcdf
    return dataset.to_netcdf(*args, **kwargs)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/core/dataset.py", line 1902, in to_netcdf
    return to_netcdf(
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/backends/api.py", line 1072, in to_netcdf
    dump_to_store(
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/backends/api.py", line 1119, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/backends/common.py", line 261, in store
    variables, attributes = self.encode(variables, attributes)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/backends/common.py", line 350, in encode
    variables, attributes = cf_encoder(variables, attributes)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/conventions.py", line 855, in cf_encoder
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/conventions.py", line 855, in <dictcomp>
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/conventions.py", line 269, in encode_cf_variable
    var = coder.encode(var, name=name)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/coding/variables.py", line 168, in encode
    data = duck_array_ops.fillna(data, fill_value)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 298, in fillna
    return where(notnull(data), data, other)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 285, in where
    return _where(condition, *as_shared_dtype([x, y]))
  File "<__array_function__ internals>", line 180, in where
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 35.2 GiB for an array with shape (365, 3600, 7200) and data type float32
```
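
The last two frames show why: xarray's fillna is implemented with np.where, which always allocates a full-size output array even when no values actually need filling. A toy-scale sketch of that allocation (mirroring `where(notnull(data), data, other)` from the traceback, with the shape shrunk for illustration):

```python
import numpy as np

# Even when nothing is filled, np.where returns a brand-new array,
# so fillna-style code always costs one extra full-size copy.
data = np.zeros((365, 36, 72), dtype='float32')  # toy-size analogue
filled = np.where(~np.isnan(data), data, np.float32(65535))

print(filled.dtype, filled.shape, filled is data)
# float32 (365, 36, 72) False -- a fresh copy of the whole array
```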


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);