issues: 336458472
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
336458472 | MDU6SXNzdWUzMzY0NTg0NzI= | 2256 | xarray to zarr | 4338975 | closed | 0 | 16 | 2018-06-28T03:17:51Z | 2018-12-20T17:49:13Z | 2018-12-20T17:49:13Z | NONE | @jhamman Hi, I've been experimenting with converting Argo float profile data (http://www.argo.ucsd.edu/About_Argo.html) to zarr as a cache for cloud processing of Argo data. One thing I've noticed is that each Argo float cycle (one trip up and down the water column) samples at depths that are not consistent across cycles, and each cycle file contains a lot of single-value attributes, e.g. latitude. I loaded 250 cycle files from a single float and pushed them into a zarr store, calling .to_zarr on each file and putting each cycle into its own group: cache/123456 (float id)/1 (cycle). This resulted in over 70k small files being created. Small files are very inefficient in disk utilisation: my data went from 10 MB to over 100 MB on disk. With a straight pickle to a zarr array, the compression brought the whole data series down to < 1 MB! |
{ "url": "https://api.github.com/repos/pydata/xarray/issues/2256/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | 13221727 | issue |
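The root of the blow-up described above is one zarr group (and therefore many tiny chunk files) per cycle. A common workaround is to pad each cycle's variable-length profile to a common level count and store all cycles as a single 2-D (cycle, level) array, which can then be written to one zarr store. Below is a minimal sketch of that padding step, using only NumPy; the function name `pad_profiles` and the sample data are hypothetical, not from the issue:

```python
import numpy as np

def pad_profiles(profiles, fill=np.nan):
    """Pad variable-length 1-D profiles to a common length.

    Each Argo cycle samples different depths, so per-cycle arrays
    differ in length. Padding them with NaN lets all cycles live in
    one 2-D (cycle, level) array that can be written to a single
    zarr store instead of one group (and many files) per cycle.
    """
    n_levels = max(len(p) for p in profiles)
    out = np.full((len(profiles), n_levels), fill)
    for i, p in enumerate(profiles):
        out[i, : len(p)] = p
    return out

# Three hypothetical cycles with different numbers of sampled levels.
cycles = [np.array([5.0, 10.0, 20.0]),
          np.array([5.0, 15.0]),
          np.array([2.0, 8.0, 12.0, 30.0])]
stacked = pad_profiles(cycles)
print(stacked.shape)  # → (3, 4)
```

The resulting dense array compresses well (runs of NaN fill cost almost nothing), and writing it as one array with a sensible chunk size avoids the per-group file explosion.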