
issues


1 row where repo = 13221727, state = "closed" and user = 9569132 sorted by updated_at descending




id: 1286995366
node_id: I_kwDOAMm_X85Mtf2m
number: 6733
title: CFMaskCoder creates unnecessary copy for `uint16` variables
user: davidorme (9569132)
state: closed
locked: 0
comments: 11
created_at: 2022-06-28T08:38:34Z
updated_at: 2023-09-13T12:43:26Z
closed_at: 2023-09-13T12:43:25Z
author_association: NONE
state_reason: completed
repo: xarray (13221727)
type: issue

What is your issue?

Hi,

I have a bunch of global gridded data stored as a 20-year sequence of daily Matlab files, which I am consolidating into annual netCDF files using xarray. The dataset is a reasonably big DataArray (7200 x 3600 x 365, ~35GB as float32), but compiling the data into a numpy array, creating a DataArray from it and writing it out with to_netcdf works fine at float32. However, I have been encoding the data as uint16 to save disk space - the range and actual precision of the data mean that this seems reasonable. I've been using the docs as a reference:

https://docs.xarray.dev/en/stable/user-guide/io.html#scaling-and-type-conversions
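In essence, the packing described there stores rounded integer counts plus a fill value for missing data. Here is a minimal numpy sketch of that scale-factor packing (the scale_factor and fill value are the ones from the ncdump output further down; the helper functions are purely illustrative, not part of xarray or my script):

```python
import numpy as np

# Illustrative CF-style packing with no add_offset (a sketch, not the
# xarray/netCDF4 implementation): floats become uint16 counts, NaN becomes
# the declared _FillValue.
SCALE_FACTOR = 0.00015625
FILL_VALUE = np.uint16(65535)

def pack_to_uint16(values):
    packed = np.round(values / SCALE_FACTOR)
    packed = np.where(np.isnan(values), FILL_VALUE, packed)
    return packed.astype(np.uint16)

def unpack_from_uint16(packed):
    values = packed.astype(np.float32) * SCALE_FACTOR
    return np.where(packed == FILL_VALUE, np.nan, values)
```

The round trip is only accurate to within scale_factor, which is the precision trade-off mentioned above.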

The problem I then get is that the memory usage spikes unpredictably. I've been using psutil to track the process memory in the script:

```python
import sys

def report_mem(process, prefix=''):
    # Report the resident memory of the given psutil.Process in GiB
    mem = process.memory_info()[0] / float(2 ** 30)
    sys.stdout.write(f"{prefix}Memory usage: {mem}\n")
    sys.stdout.flush()
```
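The helper is called around the expensive steps, roughly like this (an illustrative sketch of the usage; the prefixes and ordering are not from the original script):

```python
import os
import psutil

# Illustrative use of report_mem() around the expensive steps
process = psutil.Process(os.getpid())
report_mem(process, prefix="after loading Matlab files: ")
# ... build the DataArray, write to netCDF ...
report_mem(process, prefix="after to_netcdf: ")
```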

The actual data ingestion and creation of the DataArray seems to be absolutely fine. With the DataArray created, the overall process memory is 35.69GB, as expected.

```python
xds = xarray.DataArray(
    base_grid,  # 365 x 3600 x 7200 np.array with dtype `float32`
    coords=[dates, latitude, longitude],
    dims=["time", "latitude", "longitude"],
    name=canonical_name,
    attrs={"units": unit},
)
```

The next bit of the script is then:

```python
if pack:
    encoding = {
        canonical_name: {
            "zlib": True,
            "complevel": 6,
            "dtype": "uint16",
            "scale_factor": scale_factor,
            "_FillValue": 65535,
        }
    }
else:
    encoding = {canonical_name: {"zlib": True, "complevel": 6}}

xds.to_netcdf(out_file, encoding=encoding)
```

When pack=False, the data is written out as float32 and the reported peak memory usage for the job is 37GB. However, when pack=True, I get an unpredictable increase in memory usage. I was anticipating that another ~17GB might be needed to hold a uint16 copy of the array, but what I actually see is a huge and variable increase in memory use.
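That ~17GB figure is just back-of-envelope arithmetic on the array shape:

```python
# Expected in-memory sizes for the 365 x 3600 x 7200 grid
n_values = 365 * 3600 * 7200        # ~9.46e9 values
float32_gib = n_values * 4 / 2**30  # ~35.2 GiB, close to the 35.69GB reported above
uint16_gib = n_values * 2 / 2**30   # ~17.6 GiB for a packed uint16 copy
print(f"float32: {float32_gib:.1f} GiB, uint16: {uint16_gib:.1f} GiB")
```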

The list below shows the reported memory usage for one test run of the script over 19 years; the first number in each line is the peak RAM usage in GB. I'm running these on an HPC cluster where anything over 96GB gets killed, so only a handful of these jobs actually complete - the true memory requirements might run even higher than the peaks shown, since the jobs are killed before finishing. Another odd thing is that which files complete is unpredictable: the memory usage is not stable for a particular year.

```
conversion_10.out: Used : 115 (peak) 0.82 (ave)
conversion_11.out: Used : 82 (peak) 0.88 (ave)
conversion_12.out: Used : 106 (peak) 0.50 (ave)
conversion_13.out: Used : 115 (peak) 0.50 (ave)
conversion_14.out: Used : 124 (peak) 0.56 (ave)
conversion_15.out: Used : 83 (peak) 0.84 (ave)
conversion_16.out: Used : 124 (peak) 0.58 (ave)
conversion_17.out: Used : 94 (peak) 0.87 (ave)
conversion_18.out: Used : 82 (peak) 0.83 (ave)
conversion_19.out: Used : 110 (peak) 0.82 (ave)
conversion_20.out: Used : 106 (peak) 0.72 (ave)
conversion_21.out: Used : 107 (peak) 0.80 (ave)
conversion_3.out: Used : 83 (peak) 0.93 (ave)
conversion_4.out: Used : 124 (peak) 0.70 (ave)
conversion_5.out: Used : 130 (peak) 0.77 (ave)
conversion_6.out: Used : 112 (peak) 0.68 (ave)
conversion_7.out: Used : 80 (peak) 0.52 (ave)
conversion_8.out: Used : 97 (peak) 0.85 (ave)
conversion_9.out: Used : 117 (peak) 0.59 (ave)
```

The files that do run end up with exactly the expected structure:

```bash
(python3.10) []$ ncdump -h LAI_cf_uint16/LAI_2007.nc
netcdf LAI_2007 {
dimensions:
    time = 365 ;
    latitude = 3600 ;
    longitude = 7200 ;
variables:
    int64 time(time) ;
        time:units = "days since 2007-01-01 00:00:00" ;
        time:calendar = "proleptic_gregorian" ;
    double latitude(latitude) ;
        latitude:_FillValue = NaN ;
    double longitude(longitude) ;
        longitude:_FillValue = NaN ;
    ushort leaf_area_index(time, latitude, longitude) ;
        leaf_area_index:_FillValue = 65535US ;
        leaf_area_index:units = "1" ;
        leaf_area_index:scale_factor = 0.00015625 ;
}
```
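(For what it's worth, the packed files can be checked by reading one back with xarray, which applies the mask and scale factor on decode; this is just a sanity-check sketch, not part of the conversion script.)

```python
import xarray

# Open one of the packed files; mask_and_scale is applied on read by default,
# so the stored uint16 counts come back as floats and the fill value as NaN.
ds = xarray.open_dataset("LAI_cf_uint16/LAI_2007.nc")
lai = ds["leaf_area_index"]
print(lai.dtype)                   # a float dtype after decoding
print(lai.encoding.get("dtype"))   # the on-disk dtype, uint16
```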

Any suggestions? It sounds like this should work!

Python version and package versions

```
(python3.10) $ python --version
Python 3.10.4
(python3.10) $ pip list
Package            Version
------------------ ----------
Bottleneck         1.3.4
certifi            2022.6.15
cftime             1.5.1.1
mkl-fft            1.3.1
mkl-random         1.2.2
mkl-service        2.4.0
netCDF4            1.5.7
numexpr            2.8.1
numpy              1.22.3
packaging          21.3
pandas             1.4.2
pip                21.2.4
psutil             5.8.0
pyparsing          3.0.4
python-dateutil    2.8.2
pytz               2022.1
scipy              1.7.3
setuptools         61.2.0
six                1.16.0
typing_extensions  4.1.1
wheel              0.37.1
xarray             0.20.1

(python3.10) [dorme@login-c SNU_Ryu_FPAR_LAI]$ python
Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xarray
>>> xarray.show_versions()
/rds/general/user/dorme/home/anaconda3/envs/python3.10/lib/python3.10/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-348.20.1.el8_5.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.6.1

xarray: 0.20.1
pandas: 1.4.2
numpy: 1.22.3
scipy: 1.7.3
netCDF4: 1.5.7
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.4
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
setuptools: 61.2.0
pip: 21.2.4
conda: None
pytest: None
IPython: None
sphinx: None
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6733/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
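For reference, the filtered view above corresponds to a query along these lines against this schema (a sketch only; the SQLite file name is a placeholder):

```python
import sqlite3

# Placeholder file name for the SQLite database behind this table
conn = sqlite3.connect("github.db")
rows = conn.execute(
    """
    select id, number, title, state, updated_at
    from issues
    where repo = ? and state = ? and user = ?
    order by updated_at desc
    """,
    (13221727, "closed", 9569132),
).fetchall()
for row in rows:
    print(row)
```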