issue_comments

10 rows where issue = 1286995366 (CFMaskCoder creates unnecessary copy for `uint16` variables), sorted by updated_at descending
davidorme (NONE) commented at 2022-06-30T11:59:07Z · https://github.com/pydata/xarray/issues/6733#issuecomment-1171130914

I still see strange memory spikes that kill my jobs but the behaviour is not reproducible - the conversion will fail with > 4x memory use and then succeed the next time with the same inputs. My guess is that this isn't anything to do with xarray, but noting it just in case.

davidorme (NONE) commented at 2022-06-30T08:05:47Z · https://github.com/pydata/xarray/issues/6733#issuecomment-1170900930

Thanks @dcherian - completely agree that assuming 65535 is a fill can be confusing.

My question is basically solved, but the big memory increase is surprising to me. If you cast first, when required, you still have the user data at the original precision as a reference for the filling step?
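A minimal numpy sketch of the ordering being proposed here (illustrative only, not xarray's actual code path): cast onto a fresh copy first, then fill that copy, using the original-precision data to locate the missing values.

```python
import numpy as np

data = np.array([0.0, 1.0, np.nan], dtype='float32')

# Cast first: this already produces the isolated copy (peak ~ 1x float + 1x int).
with np.errstate(invalid='ignore'):   # NaN has no uint16 value; silence the cast warning
    encoded = data.astype('uint16')

# Fill on the cast copy, deciding where to fill from the original float data.
encoded[np.isnan(data)] = 65535
```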

dcherian (MEMBER) commented at 2022-06-29T15:41:45Z · https://github.com/pydata/xarray/issues/6733#issuecomment-1170142019

> Where a cast is specified in encoding, could xarray not cast the data first to get that isolated copy and then set the fill on the cast array?

If you cast float to int you might lose information needed to accurately do the filling step.

> the only possible gotcha here for users is that data stored in a netcdf file as integer type data but with a _FillValue is loaded as a float using np.NaN because there is no np.nan equivalent for integer types.

Yes, but your data originated as floating point so this is correct.

> ncdump asserts that there has to be a missing data value, which is 65535 unless set otherwise, but xarray is using the presence of the _FillValue attribute to signal the presence of missing data.

You can specify missing_value and/or _FillValue attributes, so you could try that. Xarray does not follow the default "fill values" because it can be confusing; it is valid to store 65535 as u2, for example.
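To make the information-loss point concrete (my own illustration, not from the thread): NaN has no integer representation, so a bare cast leaves garbage where the missing values were.

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan], dtype='float32')
y = x.astype('uint16')   # RuntimeWarning: invalid value encountered in cast
print(y)                 # the NaN slot becomes an arbitrary, platform-dependent integer
```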

davidorme (NONE) commented at 2022-06-29T13:57:51Z · https://github.com/pydata/xarray/issues/6733#issuecomment-1170015912

Ah. I think I get it now. If you are setting _FillValue via encoding, that is an instruction to represent np.NaN with a specific value. xarray does not want to alter the provided data, so it does that substitution on a fresh copy, rather than doing it in place.

So, for any dtype, setting _FillValue in the encoding argument always requires twice the memory of the input data. If encoding also sets dtype, then more memory is required to hold the cast data - hence the 2.5 times for float32 to uint16.
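A back-of-envelope check of that figure (my own arithmetic, using the (365, 3600, 7200) float32 array from the traceback later in this thread):

```python
import numpy as np

n = 365 * 3600 * 7200                 # array shape from the failing job
float32_gib = n * 4 / 2**30           # ~35.2 GiB: the original float32 data
filled_gib = float32_gib              # the fresh float copy made by the fill step
uint16_gib = n * 2 / 2**30            # ~17.6 GiB: the cast uint16 copy
print(float32_gib + filled_gib + uint16_gib)  # ~88 GiB, i.e. roughly 2.5x the input
```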

Where a cast is specified in encoding, could xarray not cast the data first to get that isolated copy and then set the fill on the cast array?

The manual encoding does indeed work as suggested - the only possible gotcha here for users is that data stored in a netcdf file as integer type data but with a _FillValue is loaded as a float using np.NaN because there is no np.nan equivalent for integer types.

```python
import xarray
import numpy as np

data = np.arange(65536, dtype='uint16')
xda = xarray.DataArray(data=data, attrs={'_FillValue': 65535})
xda.to_netcdf('test.nc')

roundtrip = xarray.load_dataarray('test.nc')
roundtrip
<xarray.DataArray (dim_0: 65536)>
array([0.0000e+00, 1.0000e+00, 2.0000e+00, ..., 6.5533e+04, 6.5534e+04,
       nan], dtype=float32)
Dimensions without coordinates: dim_0
```

There might be a problem here with consistency with ncdump? If you do not set attrs={'_FillValue': 65535} then xarray loads the file as uint16 and shows all the values:

```python
data = np.arange(65536, dtype='uint16')
xda = xarray.DataArray(data=data)
xda.to_netcdf('test.nc')

roundtrip = xarray.load_dataarray('test.nc')
roundtrip
<xarray.DataArray (dim_0: 65536)>
array([    0,     1,     2, ..., 65533, 65534, 65535], dtype=uint16)
Dimensions without coordinates: dim_0
```

However, ncdump on the same file interprets the last value as missing:

```bash
$ ncdump test.nc
netcdf test {
dimensions:
	dim_0 = 65536 ;
variables:
	ushort xarray_dataarray_variable(dim_0) ;
data:

 xarray_dataarray_variable = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
    ...
    65528, 65529, 65530, 65531, 65532, 65533, 65534, _ ;
}
```

This is because 65535 is the default fill value for u2 data in the netCDF4 library. It seems like ncdump asserts that there has to be a missing data value, which is 65535 unless set otherwise, but xarray is using the presence of the _FillValue attribute to signal the presence of missing data.
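For reference (my addition), the netCDF4 Python library exposes these defaults directly:

```python
import netCDF4

# default_fillvals maps type codes to default fill values; 'u2' (unsigned 16-bit) is 65535
print(netCDF4.default_fillvals['u2'])
```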

Using netCDF4 to read the same file gives the same result as ncdump, reporting the assumed fill value and having to use a masked array to show missing values.

```python
import netCDF4

x = netCDF4.Dataset('test.nc')
x['xarray_dataarray_variable']
<class 'netCDF4._netCDF4.Variable'>
uint16 xarray_dataarray_variable(dim_0)
unlimited dimensions:
current shape = (65536,)
filling on, default _FillValue of 65535 used
x['xarray_dataarray_variable'][:]
masked_array(data=[0, 1, 2, ..., 65533, 65534, --],
             mask=[False, False, False, ..., False, False, True],
             fill_value=65535,
             dtype=uint16)
```

dcherian (MEMBER) commented at 2022-06-28T21:57:05Z · https://github.com/pydata/xarray/issues/6733#issuecomment-1169321303

> So, I am using _FillValue=65535 in the encoding to to_netcdf.

encoding is really an instruction to Xarray to encode the data. But you've already done that. So specify _FillValue in attrs instead of encoding. This will get written to the file and be interpreted on read. At least IIUC =)
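In code, this suggestion is the pattern davidorme confirms in the round-trip example earlier on this page; a minimal version (toy data, my sketch):

```python
import numpy as np
import xarray

data = np.array([0, 1, 65535], dtype='uint16')   # 65535 already encodes missing
xda = xarray.DataArray(data=data, attrs={'_FillValue': 65535})
xda.to_netcdf('test.nc')   # _FillValue is written to the file; per the suggestion,
                           # no encoding fill pass (and no extra copy) should run
```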

davidorme (NONE) commented at 2022-06-28T20:02:14Z · https://github.com/pydata/xarray/issues/6733#issuecomment-1169175453

Thanks again for your help!

I think that is what I am doing. If I understand right:

Using to_netcdf to handle the encoding

  • I'm passing a DataArray containing float32 data.
  • The np.ndarray containing the data can simply use np.nan to represent missing data.
  • Using to_netcdf with an encoding to uint16 has memory usage of 2 x float + 1 x int - set NaN on a copy and convert to int
  • The other thing that is puzzling here is that 35GB * 2.5 (two float32 copies + one uint16 copy) is ~ 90GB but many of the processes are using much more than that.

Manual encoding

  • I'm passing a DataArray containing uint16 data.
  • However - as far as I can see - DataArray itself doesn't specify an alternative missing data value. Because np.nan is a float, you can't represent missing data within an integer DataArray?
  • So, I am using _FillValue=65535 in the encoding to to_netcdf.
  • But that still appears to be triggering the encoding step - the Traceback in my second comment was from a manually encoded uint16 DataArray.
dcherian (MEMBER) commented at 2022-06-28T19:35:36Z · https://github.com/pydata/xarray/issues/6733#issuecomment-1169150139

Try setting _FillValue in attrs? I haven't tried this...

davidorme (NONE) commented at 2022-06-28T19:20:00Z · https://github.com/pydata/xarray/issues/6733#issuecomment-1169128311

Thanks for the quick response.

I don't quite follow the process for the FillValue. If I manually encode as integer, setting NaN to 65535 as above, how would I ensure that the correct fill value is set in the resulting file? If I leave the _FillValue = 65535 out of the encoding for DataArray.to_netcdf, then those values won't be interpreted correctly.

dcherian (MEMBER) commented at 2022-06-28T17:23:12Z (edited 2022-06-28T17:26:17Z) · https://github.com/pydata/xarray/issues/6733#issuecomment-1169014257

Yeah I think the issue is that the "CFMaskCoder" tries to replace NaNs regardless of the dtype of the variable. Doing this creates a copy in this step: `where(notnull(data), data, other)`.

https://github.com/pydata/xarray/blob/787a96c15161c9025182291b672b3d3c5548a6c7/xarray/coding/variables.py#L149

You should set `_FillValue` to None after manually encoding to ints to skip the extra copy.

We should probably raise an error, or at least a warning, for integer dtypes and a not-None `_FillValue`.


As for your initial question, we create a copy of the float array when replacing NaNs (does not happen in-place), then convert to int. So you'll need to account for 2x float array + 1x int array memory use.
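A sketch of that workaround (mine; the variable name `lai` is made up): fill and cast manually, then null out `_FillValue` in encoding so the CFMaskCoder fill pass, and its extra float copy, are skipped.

```python
import numpy as np
import xarray

out_data = np.array([0.0, 1.0, np.nan], dtype='float32')
out_data = np.where(np.isnan(out_data), 65535, out_data).astype('uint16')  # manual encode

xda = xarray.DataArray(out_data, name='lai')
# _FillValue=None tells xarray not to apply its own fill step on write
xda.to_netcdf('manual.nc', encoding={'lai': {'_FillValue': None}})
```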

davidorme (NONE) commented at 2022-06-28T09:37:59Z · https://github.com/pydata/xarray/issues/6733#issuecomment-1168483200

I've also tried pre-converting the float32 array to uint16:

```python
if pack:
    out_data = np.round(base_grid * scale_factor, 0)
    out_data[np.isnan(out_data)] = 65535
    out_data = out_data.astype('uint16')
else:
    out_data = base_grid
```

I expected that to add an extra 17GB for a total memory of 53GB or so, but exporting to netcdf still shows unexpectedly variable peak memory use:

```bash
$ grep peak conversion_*
conversion_10.out: Used : 133 (peak) 0.53 (ave)
conversion_11.out: Used : 117 (peak) 0.73 (ave)
conversion_12.out: Used : 92 (peak) 0.93 (ave)
conversion_13.out: Used : 103 (peak) 0.75 (ave)
conversion_14.out: Used : 79 (peak) 0.64 (ave)
conversion_15.out: Used : 94 (peak) 0.66 (ave)
conversion_16.out: Used : 92 (peak) 0.95 (ave)
conversion_17.out: Used : 129 (peak) 0.66 (ave)
conversion_18.out: Used : 92 (peak) 0.91 (ave)
conversion_19.out: Used : 105 (peak) 0.67 (ave)
conversion_1.out: Used : 77 (peak) 0.94 (ave)
conversion_20.out: Used : 87 (peak) 0.65 (ave)
conversion_21.out: Used : 93 (peak) 0.63 (ave)
conversion_2.out: Used : 92 (peak) 0.95 (ave)
conversion_3.out: Used : 92 (peak) 0.94 (ave)
conversion_4.out: Used : 92 (peak) 0.93 (ave)
conversion_5.out: Used : 121 (peak) 0.47 (ave)
conversion_6.out: Used : 92 (peak) 0.94 (ave)
conversion_7.out: Used : 92 (peak) 0.96 (ave)
conversion_8.out: Used : 92 (peak) 0.93 (ave)
conversion_9.out: Used : 129 (peak) 0.47 (ave)
```

One thing I do see in the script reporting for some failing files is this exception - the to_netcdf process appears to be allocating another ~35GB float32 array?

```python
Data loaded; Memory usage: 35.70772171020508
Conversion complete; Memory usage: 53.329856872558594
Array created; Memory usage: 53.329856872558594
Traceback (most recent call last):
  File "/rds/general/project/lemontree/live/source/SNU_Ryu_FPAR_LAI/convert_SNU_Ryu_to_netcdf.py", line 162, in <module>
    xds.to_netcdf(out_file, encoding=encoding)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/core/dataarray.py", line 2839, in to_netcdf
    return dataset.to_netcdf(*args, **kwargs)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/core/dataset.py", line 1902, in to_netcdf
    return to_netcdf(
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/backends/api.py", line 1072, in to_netcdf
    dump_to_store(
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/backends/api.py", line 1119, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/backends/common.py", line 261, in store
    variables, attributes = self.encode(variables, attributes)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/backends/common.py", line 350, in encode
    variables, attributes = cf_encoder(variables, attributes)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/conventions.py", line 855, in cf_encoder
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/conventions.py", line 855, in <dictcomp>
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/conventions.py", line 269, in encode_cf_variable
    var = coder.encode(var, name=name)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/coding/variables.py", line 168, in encode
    data = duck_array_ops.fillna(data, fill_value)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 298, in fillna
    return where(notnull(data), data, other)
  File "/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 285, in where
    return _where(condition, *as_shared_dtype([x, y]))
  File "<__array_function__ internals>", line 180, in where
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 35.2 GiB for an array with shape (365, 3600, 7200) and data type float32
```

