issues: 1340474484


id: 1340474484
node_id: I_kwDOAMm_X85P5gR0
number: 6920
title: Writing a netCDF file is slow
user: 64621312
state: closed
locked: 1
comments: 3
created_at: 2022-08-16T14:48:37Z
updated_at: 2022-08-16T17:05:37Z
closed_at: 2022-08-16T17:05:37Z
author_association: NONE
state_reason: completed
repo: 13221727
type: issue
reactions: 0

What is your issue?

This has been discussed in another thread, but the proposed solution there (calling .load() to pull the dataset into memory before running to_netcdf) does not work for me, since my dataset is too large to fit into memory. The following code takes around 8 hours to run. You'll notice that I tried both xr.open_mfdataset and xr.concat in case it made a difference, but it doesn't. I also tried profiling the code following this example; the results are in this HTML report (Dropbox link), but I'm not really sure what I'm looking at.
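For reference, a minimal sketch (not from the original report) of the two things mentioned above: the load-into-memory workaround from the other thread, and dask's diagnostic profilers, which produce an HTML report like the linked one. File names here are illustrative; the full code is below.

```python
import xarray as xr
from dask.diagnostics import Profiler, ResourceProfiler, visualize

# Illustrative open; the actual call with preprocessing is in the code below.
ds = xr.open_mfdataset("data/*.nc", combine="nested", concat_dim="time")

# Workaround suggested in the other thread: load everything into memory first.
# Not an option here, because the combined dataset is larger than RAM.
# ds.load().to_netcdf("combined.nc")

# Profiling a lazy write with dask's diagnostics produces an HTML report.
with Profiler() as prof, ResourceProfiler() as rprof:
    ds.to_netcdf("combined.nc")
visualize([prof, rprof], filename="profile.html", show=False)
```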

Data: Dropbox link to 717 netCDF files containing radar rainfall data for June 28, 2014 over the United States, around 1 GB in total.

Code:
```python
# %% Import libraries
import xarray as xr
from glob import glob
import pandas as pd
import time
import dask

dask.config.set(**{'array.slicing.split_large_chunks': False})
files = glob("data/*.nc")

# %% Functions
def extract_file_timestep(fname):
    fname = fname.split('/')[-1]
    fname = fname.split(".")
    ftype = fname.pop(-1)
    fname = ''.join(fname)
    str_tstep = fname.split("_")[-1]
    if ftype == "nc":
        date_format = '%Y%m%d%H%M'
    if ftype == "grib2":
        date_format = '%Y%m%d-%H%M%S'

    tstep = pd.to_datetime(str_tstep, format=date_format)

    return tstep

def ds_preprocessing(ds):
    tstamp = extract_file_timestep(ds.encoding['source'])
    ds.coords["time"] = tstamp
    ds = ds.expand_dims({"time": 1})
    ds = ds.rename({"lon": "longitude", "lat": "latitude", "mrms_a2m": "rainrate"})
    ds = ds.chunk(chunks={"latitude": 3500, "longitude": 7000, "time": 1})
    return ds

# %% Loading and formatting data
lst_ds = []
start_time = time.time()
for f in files:
    ds = xr.open_dataset(f, chunks={"latitude": 3500, "longitude": 7000})
    ds = ds_preprocessing(ds)
    lst_ds.append(ds)

ds_comb_frm_lst = xr.concat(lst_ds, dim="time")
print("Time to load dataset using concat on list of datasets: {}".format(time.time() - start_time))

start_time = time.time()
ds_comb_frm_open_mfdataset = xr.open_mfdataset(files, chunks={"latitude": 3500, "longitude": 7000},
                                               concat_dim="time", preprocess=ds_preprocessing,
                                               combine="nested")
print("Time to load dataset using open_mfdataset: {}".format(time.time() - start_time))

# %% Exporting to netcdf
start_time = time.time()
ds_comb_frm_lst.to_netcdf("ds_comb_frm_lst.nc", encoding={"rainrate": {"zlib": True}})
print("Time to export dataset created using concat on list of datasets: {}".format(time.time() - start_time))

start_time = time.time()
ds_comb_frm_open_mfdataset.to_netcdf("ds_comb_frm_open_mfdataset.nc", encoding={"rainrate": {"zlib": True}})
print("Time to export dataset created using open_mfdataset: {}".format(time.time() - start_time))
```
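Not part of the original report, but one way to separate building the write graph from executing it when timing runs like the ones above: to_netcdf accepts compute=False and returns a dask Delayed object, which can then be run under dask's ProgressBar so the long write shows incremental progress instead of appearing to hang. The sketch below reuses ds_comb_frm_open_mfdataset from the snippet above.

```python
from dask.diagnostics import ProgressBar

# Build the write graph only; with compute=False, to_netcdf returns a
# dask Delayed object instead of writing immediately.
delayed_write = ds_comb_frm_open_mfdataset.to_netcdf(
    "ds_comb_frm_open_mfdataset.nc",
    encoding={"rainrate": {"zlib": True}},
    compute=False,
)

# Execute the write with a progress bar to see how it advances over time.
with ProgressBar():
    delayed_write.compute()
```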

