issue_comments: 1112552981
html_url: https://github.com/pydata/xarray/issues/2780#issuecomment-1112552981
issue_url: https://api.github.com/repos/pydata/xarray/issues/2780
id: 1112552981
node_id: IC_kwDOAMm_X85CUDYV
user: 22566757
created_at: 2022-04-28T18:57:26Z
updated_at: 2022-04-28T19:01:34Z
author_association: CONTRIBUTOR

body:

I found a way to get the sample dataset to save to a smaller netCDF:

```python
import os

import numpy as np
import numpy.testing as np_tst
import pandas as pd
import xarray as xr

# Original example
# Create pandas DataFrame
df = pd.DataFrame(
    np.random.randint(low=0, high=10, size=(100000, 5)),
    columns=["a", "b", "c", "d", "e"],
)

# Make 'e' a column of strings
df["e"] = df["e"].astype(str)

# Make 'f' a column of floats
DIGITS = 1
df["f"] = np.around(10 ** DIGITS * np.random.random(size=df.shape[0]), DIGITS)

# Save to csv
df.to_csv("df.csv")

# Convert to an xarray Dataset
ds = xr.Dataset.from_dataframe(df)

# Save NetCDF file
ds.to_netcdf("ds.nc")

# Additions
def dtype_for_int_array(arry: "array of integers") -> np.dtype:
    """Find the smallest integer dtype that will encode arry."""
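    # A minimal sketch of one way to do this (an assumption, not
    # necessarily the original implementation): return the narrowest
    # integer dtype whose range covers the data, preferring unsigned
    # types when all values are non-negative.
    low, high = int(arry.min()), int(arry.max())
    candidates = (
        [np.uint8, np.uint16, np.uint32, np.uint64]
        if low >= 0
        else [np.int8, np.int16, np.int32, np.int64]
    )
    for candidate in candidates:
        info = np.iinfo(candidate)
        if info.min <= low and high <= info.max:
            return np.dtype(candidate)
    # Fall back to the original dtype if nothing smaller fits.
    return np.dtype(arry.dtype)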

def dtype_for_str_array(
    arry: "xr.DataArray of strings", for_disk: bool = True
) -> np.dtype:
    """Find a good string dtype for encoding arry."""
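    # A minimal sketch of one way to do this (an assumption, not
    # necessarily the original implementation): size a fixed-width
    # string dtype to the longest value present.  netCDF stores
    # fixed-width strings as bytes, hence "S" rather than "U" on disk.
    longest = max(len(str(value)) for value in arry.values.ravel())
    return np.dtype(f"S{longest}" if for_disk else f"U{longest}")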

# Set up encoding for saving to netCDF
encoding = {}
for name, var in ds.items():
    encoding[name] = {}
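    # Only a sketch of what the rest of the loop might do (an assumption,
    # not the author's exact logic): pick a compact on-disk dtype per
    # variable and turn on zlib compression for the netCDF4 backend.
    if np.issubdtype(var.dtype, np.integer):
        encoding[name]["dtype"] = dtype_for_int_array(var)
    elif var.dtype == object:
        # Strings get a fixed-width bytes dtype instead of the default
        # variable-length string type.
        encoding[name]["dtype"] = dtype_for_str_array(var)
    encoding[name]["zlib"] = True
    encoding[name]["complevel"] = 9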
ds.to_netcdf("ds_encoded.nc", encoding=encoding) Display resultsstat_csv = os.stat("df.csv") stat_nc = os.stat("ds.nc") stat_enc = os.stat("ds_encoded.nc") sizes = pd.Series( index=["CSV", "default netCDF", "encoded netCDF"], data=[stats.st_size for stats in [stat_csv, stat_nc, stat_enc]], name="File sizes", ) print("File sizes (kB):", np.right_shift(sizes, 10), sep="\n", end="\n\n") print("Sizes relative to CSV:", sizes / sizes.iloc[0], sep="\n", end="\n\n") Check that I didn't break the floatsfrom_disk = xr.open_dataset("ds_encoded.nc")
np_tst.assert_allclose(ds["f"], from_disk["f"], rtol=10 ** -DIGITS, atol=10 ** -DIGITS)
```
```
Sizes relative to CSV:
CSV               1.000000
default netCDF    5.230366
encoded netCDF    0.708063
Name: File sizes, dtype: float64

 10M  ds.nc
1.9M  df.csv
1.4M  ds_encoded.nc
```

I added a column of floats with one digit before and after the decimal point to the example dataset, because why not. Does this satisfy your use-case? Should I turn the giant loop into a function to go into xarray somewhere? If so, I should probably tie the float handling in with the new
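On the float handling mentioned above, here is a minimal sketch, continuing the script, of how `f` could be tied into netCDF's existing lossy-compression hooks. This is an assumption about the direction rather than what the comment actually proposes, and the file name `ds_scaled.nc` is made up for illustration:

```python
# A sketch (assumption): store "f" as a scaled 16-bit integer using the
# CF scale_factor/add_offset convention, keeping DIGITS decimal digits.
encoding["f"] = {
    "dtype": "int16",
    "scale_factor": 10.0 ** -DIGITS,
    "add_offset": 0.0,
    "_FillValue": -32768,
    "zlib": True,
    "complevel": 9,
}
ds.to_netcdf("ds_scaled.nc", encoding=encoding)

# Alternative for the netCDF4 backend: keep float storage but quantize,
# so zlib can compress the discarded trailing bits.
# encoding["f"] = {"zlib": True, "complevel": 9, "least_significant_digit": DIGITS}
```

Either route trades exactness for file size; the `assert_allclose` tolerance of `10 ** -DIGITS` above is what bounds the acceptable round-trip error.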
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
issue: 412180435