issue_comments


6 rows where issue = 1506437087 and user = 720460 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1363988341 https://github.com/pydata/xarray/issues/7397#issuecomment-1363988341 https://api.github.com/repos/pydata/xarray/issues/7397 IC_kwDOAMm_X85RTM91 benoitespinola 720460 2022-12-23T14:15:25Z 2022-12-23T14:15:53Z NONE

Because I want worry-free holidays, I wrote a bit of code that basically creates a new NetCDF file from scratch. I load the data with xarray, convert it to NumPy arrays, and use the netCDF4 library to write the files (this does what I want).

In the process, I also slice the data and drop unwanted variables to keep just the bits I want (unlike my original post).
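
A minimal sketch of that kind of workaround, assuming hypothetical variable names and an arbitrary slice (the original script is not shown in the thread):

```python
import numpy as np
import xarray as xr
from netCDF4 import Dataset

# Open the source files lazily with xarray.
ds = xr.open_mfdataset('./data/data_*.nc')

# Hypothetical subset: keep only the variables and slice of interest.
subset = ds[['toce', 'soce']].isel(deptht=slice(0, 50))

# Write a fresh file with the plain netCDF4 library, one variable at a
# time, so only a single variable is materialized in memory at once.
with Dataset('out.nc', 'w') as out:
    for dim, size in subset.sizes.items():
        out.createDimension(dim, size)
    for name, var in subset.data_vars.items():
        v = out.createVariable(name, var.dtype, var.dims)
        v[:] = np.asarray(var.values)
```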

If I call .load() or .compute() on my xarray variable, memory usage goes through the roof (even though I am dropping unwanted variables, which I would expect to release memory). The same happens with slicing followed by .compute().
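
The pattern being described looks roughly like this sketch (variable names taken from the ncdump header quoted later in the thread; the slice itself is an arbitrary assumption):

```python
import xarray as xr

ds = xr.open_mfdataset('./data/data_*.nc')

# Drop unwanted variables and slice first...
trimmed = ds.drop_vars(['taum', 'wspd']).isel(time_counter=slice(0, 10))

# ...then force evaluation. The report above is that memory still spikes
# here, even though the dropped variables should no longer be needed.
trimmed = trimmed.compute()
```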

Unfortunately, the MCVE will have to wait until I am back from my holidays.

Happy holidays to all!

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory issue merging NetCDF files using xarray.open_mfdataset and to_netcdf 1506437087
1362583979 https://github.com/pydata/xarray/issues/7397#issuecomment-1362583979 https://api.github.com/repos/pydata/xarray/issues/7397 IC_kwDOAMm_X85RN2Gr benoitespinola 720460 2022-12-22T09:04:17Z 2022-12-22T09:04:17Z NONE

By the way, prior to writing this ticket, I also tried the following (which did not help): dropping the variables I do not care about, keeping only the dimensions plus toce and soce. I would expect to need less memory after that.
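
That step presumably looked something like this sketch (the exact call is not shown in the thread; the keep-set is inferred from the description):

```python
import xarray as xr

ds = xr.open_mfdataset('./data/data_*.nc')

# Keep only toce and soce; drop every other data variable.
keep = {'toce', 'soce'}
ds = ds.drop_vars([v for v in ds.data_vars if v not in keep])
```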

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory issue merging NetCDF files using xarray.open_mfdataset and to_netcdf 1506437087
1362564754 https://github.com/pydata/xarray/issues/7397#issuecomment-1362564754 https://api.github.com/repos/pydata/xarray/issues/7397 IC_kwDOAMm_X85RNxaS benoitespinola 720460 2022-12-22T08:44:06Z 2022-12-22T08:44:06Z NONE

Answering the question 'Did you do some processing with the data, changing attributes/encoding etc?': No processing. I ask xarray to load the data (I also tried loading + computing) and the final outcome is the same.

I am now trying to put together an MCVE with dummy data.
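
Such an MCVE might look like the following sketch (dimension sizes shrunk from the real files; all names and shapes are assumptions based on the ncdump header quoted below):

```python
import numpy as np
import xarray as xr

# Build five small dummy files shaped like the real data
# (the real files use x=754, y=277, deptht=200, 28 time steps each).
for i in range(5):
    ds = xr.Dataset(
        data_vars={
            'toce': (('time_counter', 'deptht', 'y', 'x'),
                     np.random.rand(28, 20, 27, 75).astype('float32')),
            'soce': (('time_counter', 'deptht', 'y', 'x'),
                     np.random.rand(28, 20, 27, 75).astype('float32')),
        },
        coords={'time_counter': np.arange(i * 28, (i + 1) * 28)},
    )
    ds.to_netcdf(f'./data/data_{i + 1}.nc')

# Reproduce the reported pipeline on the dummy files.
merged = xr.open_mfdataset('./data/data_*.nc')
merged.to_netcdf('./data/merged.nc')
```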

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory issue merging NetCDF files using xarray.open_mfdataset and to_netcdf 1506437087
1362562275 https://github.com/pydata/xarray/issues/7397#issuecomment-1362562275 https://api.github.com/repos/pydata/xarray/issues/7397 IC_kwDOAMm_X85RNwzj benoitespinola 720460 2022-12-22T08:41:21Z 2022-12-22T08:41:21Z NONE

Just tested with to_zarr and it goes through:

```
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:07:55
CPU Efficiency: 63.00% of 00:12:34 core-walltime
Job Wall-clock time: 00:06:17
Memory Utilized: 164.89 GB
Memory Efficiency: 44.56% of 370.00 GB
```

I did an extra run with a memory profiler:

```
import xarray as xr
import zarr
from memory_profiler import profile

@profile
def main():
    path = './data/data_*.nc'  # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc
    data = xr.open_mfdataset(path)

    data = data.load()
    data = data.compute()

    data.to_zarr()

if __name__ == '__main__':
    main()
```

The profiled code also completed successfully:

```
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:07:52
CPU Efficiency: 63.61% of 00:12:22 core-walltime
Job Wall-clock time: 00:06:11
Memory Utilized: 165.53 GB
Memory Efficiency: 44.74% of 370.00 GB
```

Here is the outcome of the memory profiling:

```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     5    156.9 MiB    156.9 MiB           1   @profile
     6                                         def main():
     7    156.9 MiB      0.0 MiB           1       path = './data/data_*.nc'  # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc
     8    209.3 MiB     52.4 MiB           1       data = xr.open_mfdataset(path)
     9
    10  82150.1 MiB  81940.8 MiB           1       data = data.load()
    11  82101.2 MiB    -49.0 MiB           1       data = data.compute()
    12
    13  90091.2 MiB   7990.0 MiB           1       data.to_zarr()
```

PS: in this test I just realized I loaded 8 files instead of 5.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory issue merging NetCDF files using xarray.open_mfdataset and to_netcdf 1506437087
1362544813 https://github.com/pydata/xarray/issues/7397#issuecomment-1362544813 https://api.github.com/repos/pydata/xarray/issues/7397 IC_kwDOAMm_X85RNsit benoitespinola 720460 2022-12-22T08:21:31Z 2022-12-22T08:21:31Z NONE

A single file (from ncdump -h):

```
dimensions:
    axis_nbounds = 2 ;
    x = 754 ;
    y = 277 ;
    deptht = 200 ;
    time_counter = UNLIMITED ; // (28 currently)
variables:
    float nav_lat(y, x) ;
        nav_lat:standard_name = "latitude" ;
        nav_lat:long_name = "Latitude" ;
        nav_lat:units = "degrees_north" ;
    float nav_lon(y, x) ;
        nav_lon:standard_name = "longitude" ;
        nav_lon:long_name = "Longitude" ;
        nav_lon:units = "degrees_east" ;
    float deptht(deptht) ;
        deptht:name = "deptht" ;
        deptht:long_name = "Vertical T levels" ;
        deptht:units = "m" ;
        deptht:positive = "down" ;
        deptht:bounds = "deptht_bounds" ;
    float deptht_bounds(deptht, axis_nbounds) ;
        deptht_bounds:units = "m" ;
    double time_centered(time_counter) ;
        time_centered:standard_name = "time" ;
        time_centered:long_name = "Time axis" ;
        time_centered:calendar = "gregorian" ;
        time_centered:units = "seconds since 1900-01-01 00:00:00" ;
        time_centered:time_origin = "1900-01-01 00:00:00" ;
        time_centered:bounds = "time_centered_bounds" ;
    double time_centered_bounds(time_counter, axis_nbounds) ;
    double time_counter(time_counter) ;
        time_counter:axis = "T" ;
        time_counter:standard_name = "time" ;
        time_counter:long_name = "Time axis" ;
        time_counter:calendar = "gregorian" ;
        time_counter:units = "seconds since 1900-01-01 00:00:00" ;
        time_counter:time_origin = "1900-01-01 00:00:00" ;
        time_counter:bounds = "time_counter_bounds" ;
    double time_counter_bounds(time_counter, axis_nbounds) ;
    float toce(time_counter, deptht, y, x) ;
        toce:standard_name = "sea_water_potential_temperature" ;
        toce:long_name = "temperature" ;
        toce:units = "degC" ;
        toce:online_operation = "average" ;
        toce:interval_operation = "60 s" ;
        toce:interval_write = "6 h" ;
        toce:cell_methods = "time: mean (interval: 60 s)" ;
        toce:_FillValue = 1.e+20f ;
        toce:missing_value = 1.e+20f ;
        toce:coordinates = "time_centered nav_lat nav_lon" ;
    float soce(time_counter, deptht, y, x) ;
        soce:standard_name = "sea_water_practical_salinity" ;
        soce:long_name = "salinity" ;
        soce:units = "1e-3" ;
        soce:online_operation = "average" ;
        soce:interval_operation = "60 s" ;
        soce:interval_write = "6 h" ;
        soce:cell_methods = "time: mean (interval: 60 s)" ;
        soce:_FillValue = 1.e+20f ;
        soce:missing_value = 1.e+20f ;
        soce:coordinates = "time_centered nav_lat nav_lon" ;
    float taum(time_counter, y, x) ;
        taum:standard_name = "magnitude_of_surface_downward_stress" ;
        taum:long_name = "wind stress module" ;
        taum:units = "N/m2" ;
        taum:online_operation = "average" ;
        taum:interval_operation = "120 s" ;
        taum:interval_write = "6 h" ;
        taum:cell_methods = "time: mean (interval: 120 s)" ;
        taum:_FillValue = 1.e+20f ;
        taum:missing_value = 1.e+20f ;
        taum:coordinates = "time_centered nav_lat nav_lon" ;
    float wspd(time_counter, y, x) ;
        wspd:standard_name = "wind_speed" ;
        wspd:long_name = "wind speed module" ;
        wspd:units = "m/s" ;
        wspd:online_operation = "average" ;
        wspd:interval_operation = "120 s" ;
        wspd:interval_write = "6 h" ;
        wspd:cell_methods = "time: mean (interval: 120 s)" ;
        wspd:_FillValue = 1.e+20f ;
        wspd:missing_value = 1.e+20f ;
        wspd:coordinates = "time_centered nav_lat nav_lon" ;
```

And after the merge, the only difference is in the time dimension, which goes from 28 to 280 (or so).
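
For scale, a back-of-the-envelope estimate of the in-memory footprint implied by that header (a sketch; assumes everything stays float32 once loaded and ignores the small coordinate variables):

```python
# Sizes from the ncdump header above; float32 = 4 bytes per value.
x, y, deptht, t = 754, 277, 200, 28      # one file, 28 time steps

field_3d = x * y * deptht * t * 4        # toce or soce in one file
field_2d = x * y * t * 4                 # taum or wspd in one file

per_file = 2 * field_3d + 2 * field_2d
print(per_file / 1e9)                    # ~9.4 GB per file

# Ten merged files (time_counter 28 -> 280), fully loaded:
print(10 * per_file / 1e9)               # ~94 GB
```

That order of magnitude is broadly consistent with the roughly 80 GiB increment the memory profiler reported above for eight files.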

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory issue merging NetCDF files using xarray.open_mfdataset and to_netcdf 1506437087
1361621826 https://github.com/pydata/xarray/issues/7397#issuecomment-1361621826 https://api.github.com/repos/pydata/xarray/issues/7397 IC_kwDOAMm_X85RKLNC benoitespinola 720460 2022-12-21T16:28:15Z 2022-12-21T16:28:15Z NONE

By the way,

Inspecting .encoding on my data shows 'complevel': 1.
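
For reference, that inspection is typically something like this sketch (the variable name is assumed from the ncdump header above):

```python
import xarray as xr

ds = xr.open_mfdataset('./data/data_*.nc')

# Per-variable encoding as read from the source files; settings such as
# zlib, complevel, and chunksizes show up here.
print(ds['toce'].encoding)
```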

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Memory issue merging NetCDF files using xarray.open_mfdataset and to_netcdf 1506437087

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
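
The query behind this page can be reproduced against a local copy of the database, e.g. with Python's sqlite3 (the file name github.db is an assumption):

```python
import sqlite3

conn = sqlite3.connect('github.db')

# Same filter and ordering as the page: 6 rows for this issue and user.
rows = conn.execute(
    """
    SELECT id, created_at, updated_at, body
    FROM issue_comments
    WHERE issue = 1506437087 AND [user] = 720460
    ORDER BY updated_at DESC
    """
).fetchall()

for comment_id, created, updated, body in rows:
    print(comment_id, updated)
```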