issues: 345715825


id: 345715825
node_id: MDU6SXNzdWUzNDU3MTU4MjU=
number: 2329
title: Out-of-core processing with dask not working properly?
user: 12278765
state: closed
locked: 0
comments: 16
created_at: 2018-07-30T11:19:41Z
updated_at: 2019-01-13T01:57:12Z
closed_at: 2019-01-13T01:57:12Z
author_association: NONE
assignee, milestone, active_lock_reason, draft, pull_request: (none)

Hi,

I have a bunch of GRIB files that amount to ~250 GB. I want to concatenate them and save to zarr. I concatenated them with CDO and saved to netCDF, so now I have a ~500 GB netCDF file that I want to convert to zarr. I want to convert to zarr because:

  • I plan to run the analysis on a cluster, and I understand that zarr is better suited for that.
  • By using float16 and lz4 compression, I believe I can reduce the size to ~100 GB and have faster access (I think the analysis will be I/O-bound).

The netCDF file:

```
<xarray.Dataset>
Dimensions:  (lat: 721, lon: 1440, time: 119330)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01T06:00:00 2000-01-01T06:00:00 ...
  * lon      (lon) float32 0.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 2.25 2.5 ...
  * lat      (lat) float32 90.0 89.75 89.5 89.25 89.0 88.75 88.5 88.25 88.0 ...
Data variables:
    mtpr     (time, lat, lon) float32 ...
Attributes:
    CDI:          Climate Data Interface version ?? (http://mpimet.mpg.de/cdi)
    Conventions:  CF-1.6
    history:      Fri Jul 27 16:35:19 2018: cdo -f nc4 mergetime 2000-01.grib...
    institution:  European Centre for Medium-Range Weather Forecasts
    CDO:          Climate Data Operators version 1.9.3 (http://mpimet.mpg.de/...
```
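For a rough sense of scale (my own back-of-the-envelope arithmetic, using only the dimensions in the repr above), the uncompressed size of `mtpr` works out as follows; the ~100 GB target then relies on lz4 compressing the float16 data roughly a further 2x:

```python
# Rough uncompressed size of the mtpr variable from the repr above.
nt, nlat, nlon = 119330, 721, 1440
n_values = nt * nlat * nlon                      # ~1.24e11 values

gib = 1024 ** 3
print(f"float32: {n_values * 4 / gib:.0f} GiB")  # ~462 GiB, i.e. the ~500 GB netCDF
print(f"float16: {n_values * 2 / gib:.0f} GiB")  # ~231 GiB before lz4 compression
```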

The code:

```python
import xarray as xr
import zarr


def netcdf2zarr(nc_path):
    # One chunk spanning all of time, automatic chunk sizes along lat/lon.
    chunks = {'time': -1, 'lat': 'auto', 'lon': 'auto'}
    # Store mtpr as float16 with lz4 compression; keep the coordinates as float32.
    encoding = {'mtpr': {'dtype': 'float16',
                         'compressor': zarr.Blosc(cname='lz4', clevel=9)},
                'lat': {'dtype': 'float32'},
                'lon': {'dtype': 'float32'}}
    ds = xr.open_dataset(nc_path).chunk(chunks)
    ds.to_zarr('myfile.zarr', encoding=encoding)
```
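As a variant (a sketch of an assumption on my part, not a confirmed fix), the chunks can also be passed to `open_dataset` itself so the dataset is dask-backed from the start, which makes it easy to inspect the planned chunking before writing anything. The path and chunk sizes below are placeholders:

```python
import xarray as xr

nc_path = 'merged.nc'  # hypothetical path, substitute the real file

# Illustrative chunk sizes only; sensible values depend on the file's layout.
ds = xr.open_dataset(nc_path, chunks={'time': 8760, 'lat': 721, 'lon': 1440})
print(ds.mtpr.chunks)   # tuple of chunk sizes per dimension
print(ds.mtpr.data)     # dask array summary (shape, dtype, chunks)
```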

Problem description

I left my code running over the weekend. After 63 h of processing, the zarr store was only 1 GB in size, and the system monitor indicated that the Python process had already read 17 TB from disk. At that rate it would take months to finish. Is there something I can do to increase the processing speed?
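Reading 17 TB to produce 1 GB of output from a ~500 GB file suggests the same bytes are being read many times over. One thing I would check (my own hypothesis, not something established in this thread) is how the requested dask chunks line up with the netCDF file's on-disk chunking, e.g.:

```python
import netCDF4

nc_path = 'merged.nc'  # hypothetical path, substitute the real file

# Hypothetical diagnostic: compare the on-disk chunking of mtpr with the
# dask chunks requested above. A large mismatch can turn one logical pass
# over the data into many repeated reads of the same on-disk chunks.
with netCDF4.Dataset(nc_path) as nc:
    var = nc.variables['mtpr']
    print('on-disk chunking:', var.chunking())  # 'contiguous' or [t, y, x] chunk sizes
    print('shape:           ', var.shape)
```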

I'm running Ubuntu 18.04 on a Core i7-6700 with 16 GB of RAM. The disk is an HDD with a speed of ~100 MB/s.

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: None.None

xarray: 0.10.8
pandas: 0.23.3
numpy: 1.14.2
scipy: 1.1.0
netCDF4: 1.3.1
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: 2.2.0
bottleneck: 1.2.1
cyordereddict: 1.0.0
dask: 0.18.2
distributed: 1.22.0
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: 0.9.0
setuptools: 39.1.0
pip: 10.0.1
conda: None
pytest: None
IPython: None
sphinx: None
```
reactions: none (total_count: 0; url: https://api.github.com/repos/pydata/xarray/issues/2329/reactions)
performed_via_github_app: (none)
state_reason: completed
repo: 13221727
type: issue

Links from other tables

  • 0 rows from issues_id in issues_labels
  • 16 rows from issue in issue_comments