
issues


1 row where repo = 13221727, state = "closed" and user = 9569132 sorted by updated_at descending




id: 1286995366
node_id: I_kwDOAMm_X85Mtf2m
number: 6733
title: CFMaskCoder creates unnecessary copy for `uint16` variables
user: davidorme (9569132)
state: closed
locked: 0
comments: 11
created_at: 2022-06-28T08:38:34Z
updated_at: 2023-09-13T12:43:26Z
closed_at: 2023-09-13T12:43:25Z
author_association: NONE
state_reason: completed
repo: xarray (13221727)
type: issue

What is your issue?

Hi,

I have a bunch of global gridded data stored as a 20-year sequence of daily Matlab files, which I am consolidating into annual netCDF files using xarray. The dataset is a reasonably big DataArray (7200 x 3600 x 365, ~35GB as float32), but compiling the data into a numpy array, creating a DataArray from it and writing it out with to_netcdf works fine at float32. However, I have been encoding the data as uint16 to save disk space - the range and actual precision of the data mean that this seems reasonable. I've been using the docs as a reference:

https://docs.xarray.dev/en/stable/user-guide/io.html#scaling-and-type-conversions
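In essence, the packing described there stores rounded integer counts plus a fill value for missing data. Here is a minimal numpy sketch of that scale-factor packing (the scale_factor and fill value are the ones from the ncdump output further down; the helper functions are purely illustrative, not part of xarray or my script):

```python
import numpy as np

# Illustrative CF-style packing with no add_offset (a sketch, not the
# xarray/netCDF4 implementation): floats become uint16 counts, NaN becomes
# the declared _FillValue.
SCALE_FACTOR = 0.00015625
FILL_VALUE = np.uint16(65535)

def pack_to_uint16(values):
    packed = np.round(values / SCALE_FACTOR)
    packed = np.where(np.isnan(values), FILL_VALUE, packed)
    return packed.astype(np.uint16)

def unpack_from_uint16(packed):
    values = packed.astype(np.float32) * SCALE_FACTOR
    return np.where(packed == FILL_VALUE, np.nan, values)
```

The round trip is only accurate to within scale_factor, which is the precision trade-off mentioned above.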

The problem I then get is that the memory usage spikes unpredictably. I've been using psutil to track the process memory in the script:

```python
import sys

def report_mem(process, prefix=''):
    # Report the resident memory of the given psutil.Process in GiB
    mem = process.memory_info()[0] / float(2 ** 30)
    sys.stdout.write(f"{prefix}Memory usage: {mem}\n")
    sys.stdout.flush()
```
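The helper is called around the expensive steps, roughly like this (an illustrative sketch of the usage; the prefixes and ordering are not from the original script):

```python
import os
import psutil

# Illustrative use of report_mem() around the expensive steps
process = psutil.Process(os.getpid())
report_mem(process, prefix="after loading Matlab files: ")
# ... build the DataArray, write to netCDF ...
report_mem(process, prefix="after to_netcdf: ")
```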

The actual data ingestion and creation of the DataArray seems to be absolutely fine. With the DataArray created, the overall process memory is 35.69GB, as expected.

```python
xds = xarray.DataArray(
    base_grid,  # 365 x 3600 x 7200 np.array with dtype `float32`
    coords=[dates, latitude, longitude],
    dims=["time", "latitude", "longitude"],
    name=canonical_name,
    attrs={"units": unit},
)
```

The next bit of the script is then:

```python
if pack:
    encoding = {
        canonical_name: {
            "zlib": True,
            "complevel": 6,
            "dtype": "uint16",
            "scale_factor": scale_factor,
            "_FillValue": 65535,
        }
    }
else:
    encoding = {canonical_name: {"zlib": True, "complevel": 6}}

xds.to_netcdf(out_file, encoding=encoding)
```

When pack=False, the data is written out as float32 and the reported peak memory usage for the job is 37GB. However, when pack=True, I get an unpredictable increase in memory usage. I was anticipating that another ~17GB might be needed to hold a uint16 copy of the array, but what I actually see is a huge and variable increase in memory use.
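That ~17GB figure is just back-of-envelope arithmetic on the array shape:

```python
# Expected in-memory sizes for the 365 x 3600 x 7200 grid
n_values = 365 * 3600 * 7200        # ~9.46e9 values
float32_gib = n_values * 4 / 2**30  # ~35.2 GiB, close to the 35.69GB reported above
uint16_gib = n_values * 2 / 2**30   # ~17.6 GiB for a packed uint16 copy
print(f"float32: {float32_gib:.1f} GiB, uint16: {uint16_gib:.1f} GiB")
```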

The list below shows the reported memory usage for one test run of the script over 19 years; the first number in each line is the peak RAM usage in GB. I'm running these on an HPC cluster where anything over 96GB gets killed, so only a handful of these jobs actually complete - the true memory requirements might run even higher than the peaks shown, since the jobs are killed before finishing. Another odd thing is that which files complete is unpredictable: the memory usage is not stable for a particular year.

```
conversion_10.out: Used : 115 (peak) 0.82 (ave)
conversion_11.out: Used : 82 (peak) 0.88 (ave)
conversion_12.out: Used : 106 (peak) 0.50 (ave)
conversion_13.out: Used : 115 (peak) 0.50 (ave)
conversion_14.out: Used : 124 (peak) 0.56 (ave)
conversion_15.out: Used : 83 (peak) 0.84 (ave)
conversion_16.out: Used : 124 (peak) 0.58 (ave)
conversion_17.out: Used : 94 (peak) 0.87 (ave)
conversion_18.out: Used : 82 (peak) 0.83 (ave)
conversion_19.out: Used : 110 (peak) 0.82 (ave)
conversion_20.out: Used : 106 (peak) 0.72 (ave)
conversion_21.out: Used : 107 (peak) 0.80 (ave)
conversion_3.out: Used : 83 (peak) 0.93 (ave)
conversion_4.out: Used : 124 (peak) 0.70 (ave)
conversion_5.out: Used : 130 (peak) 0.77 (ave)
conversion_6.out: Used : 112 (peak) 0.68 (ave)
conversion_7.out: Used : 80 (peak) 0.52 (ave)
conversion_8.out: Used : 97 (peak) 0.85 (ave)
conversion_9.out: Used : 117 (peak) 0.59 (ave)
```

The files that do run end up with exactly the expected structure:

```bash
(python3.10) []$ ncdump -h LAI_cf_uint16/LAI_2007.nc
netcdf LAI_2007 {
dimensions:
    time = 365 ;
    latitude = 3600 ;
    longitude = 7200 ;
variables:
    int64 time(time) ;
        time:units = "days since 2007-01-01 00:00:00" ;
        time:calendar = "proleptic_gregorian" ;
    double latitude(latitude) ;
        latitude:_FillValue = NaN ;
    double longitude(longitude) ;
        longitude:_FillValue = NaN ;
    ushort leaf_area_index(time, latitude, longitude) ;
        leaf_area_index:_FillValue = 65535US ;
        leaf_area_index:units = "1" ;
        leaf_area_index:scale_factor = 0.00015625 ;
}
```
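(For what it's worth, the packed files can be checked by reading one back with xarray, which applies the mask and scale factor on decode; this is just a sanity-check sketch, not part of the conversion script.)

```python
import xarray

# Open one of the packed files; mask_and_scale is applied on read by default,
# so the stored uint16 counts come back as floats and the fill value as NaN.
ds = xarray.open_dataset("LAI_cf_uint16/LAI_2007.nc")
lai = ds["leaf_area_index"]
print(lai.dtype)                   # a float dtype after decoding
print(lai.encoding.get("dtype"))   # the on-disk dtype, uint16
```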

Any suggestions? It sounds like this should work!

Python version and package versions

```
(python3.10) $ python --version
Python 3.10.4
(python3.10) $ pip list
Package            Version
------------------ ----------
Bottleneck         1.3.4
certifi            2022.6.15
cftime             1.5.1.1
mkl-fft            1.3.1
mkl-random         1.2.2
mkl-service        2.4.0
netCDF4            1.5.7
numexpr            2.8.1
numpy              1.22.3
packaging          21.3
pandas             1.4.2
pip                21.2.4
psutil             5.8.0
pyparsing          3.0.4
python-dateutil    2.8.2
pytz               2022.1
scipy              1.7.3
setuptools         61.2.0
six                1.16.0
typing_extensions  4.1.1
wheel              0.37.1
xarray             0.20.1

(python3.10) [dorme@login-c SNU_Ryu_FPAR_LAI]$ python
Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xarray
>>> xarray.show_versions()
/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-348.20.1.el8_5.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.6.1

xarray: 0.20.1
pandas: 1.4.2
numpy: 1.22.3
scipy: 1.7.3
netCDF4: 1.5.7
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.4
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
setuptools: 61.2.0
pip: 21.2.4
conda: None
pytest: None
IPython: None
sphinx: None
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6733/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
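For reference, the filtered view above corresponds to a query along these lines against this schema (a sketch only; the SQLite file name is a placeholder):

```python
import sqlite3

# Placeholder file name for the SQLite database behind this table
conn = sqlite3.connect("github.db")
rows = conn.execute(
    """
    select id, number, title, state, updated_at
    from issues
    where repo = ? and state = ? and user = ?
    order by updated_at desc
    """,
    (13221727, "closed", 9569132),
).fetchall()
for row in rows:
    print(row)
```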