issues: 730792268

id: 730792268
node_id: MDU6SXNzdWU3MzA3OTIyNjg=
number: 4544
title: Writing subset of array with to_zarr
user: 58827984
state: closed
locked: 0
comments: 2
created_at: 2020-10-27T20:31:18Z
updated_at: 2022-04-11T13:31:13Z
closed_at: 2022-04-11T13:31:13Z
author_association: NONE
assignee, milestone, active_lock_reason, draft, pull_request, performed_via_github_app: (empty)

Related to #4035, just using 'region' might be the answer for me.

Within my system, I reprocess a subset of an already written dataset (written to zarr) that I would then like to write back to zarr, overwriting the stored data. It seems like the only way to do that currently is to load the zarr array in memory, replace the changed bit, and then write the full array back with mode='w'.

I have a hacky way of doing this (outside of to_zarr) that roughly mirrors how append_dim works in to_zarr. I specify an overwrite_dim, find the positions along that dimension in the written zarr array whose coordinate values match the incoming data, and use numpy-style indexing to overwrite just that part of the array.

```
import numpy as np
import xarray as xr
import zarr

# initialize zarr array
ds = xr.Dataset(
    {'arr': (('time', 'data_dim'), np.ones((10, 3)))},
    coords={'time': np.arange(100, 110, 1), 'data_dim': np.arange(3)},
)
ans = ds.to_zarr(r'C:\collab\dasktest\data_dir\test', mode='w')
ans.get_variables()

# this would be the result of the first write
#
# Frozen({'arr': <xarray.Variable (time: 10, data_dim: 3)>
# array([[1., 1., 1.],
#        [1., 1., 1.],
#        [1., 1., 1.],
#        [1., 1., 1.],
#        [1., 1., 1.],
#        [1., 1., 1.],
#        [1., 1., 1.],
#        [1., 1., 1.],
#        [1., 1., 1.],
#        [1., 1., 1.]])

# what I'd like to use to_zarr to do: just overwrite part of the first write,
# leaving the rest intact
overwrite_dim = 'time'
rg = zarr.open(r'C:\collab\dasktest\data_dir\test')
overwrite_subset_ds = xr.Dataset(
    {'arr': (('time', 'data_dim'), np.full((2, 3), 2))},
    coords={'time': np.array([103, 104]), 'data_dim': np.arange(3)},
)
# True where the stored coordinate matches a value in the incoming data
overwrite_index = np.isin(rg[overwrite_dim][:], overwrite_subset_ds[overwrite_dim].values)

if overwrite_index.any():
    for darray in overwrite_subset_ds:
        d_dims = overwrite_subset_ds[darray].dims
        if overwrite_dim in d_dims:
            d_dims_loc = d_dims.index(overwrite_dim)
            # boolean mask over the stored array, True along the matching positions
            msk = np.zeros(rg[darray].shape, dtype=bool)
            sel = [slice(None)] * msk.ndim
            sel[d_dims_loc] = overwrite_index
            msk[tuple(sel)] = True
            rg[darray].set_mask_selection(
                msk, overwrite_subset_ds[darray].stack({'zwrite': d_dims}).values
            )

rg[darray][:]

# array([[1., 1., 1.],
#        [1., 1., 1.],
#        [1., 1., 1.],
#        [2., 2., 2.],
#        [2., 2., 2.],
#        [1., 1., 1.],
#        [1., 1., 1.],
#        [1., 1., 1.],
#        [1., 1., 1.],
#        [1., 1., 1.]])
```

So the old zarr array remains except for the portion I wanted changed. It would be nice if I could just do:

overwrite_subset_ds.to_zarr(r'C:\collab\dasktest\data_dir\test', mode='w', overwrite_dim='time')

And it would just do it.

Would this be in the scope of to_zarr? It seems almost necessary if you are going to use xarray/zarr as a working system, with data changing as it is processed.

Or maybe this is already possible and I need to RTFM?
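
For reference, a minimal sketch of how the 'region' option mentioned at the top might cover this case. to_zarr's region argument takes integer slices rather than coordinate values, so the matching step would have to be done first. The helper name region_for is hypothetical (it is not part of xarray or zarr), and the sketch assumes the incoming values form one contiguous block of the stored coordinate, in the same order.

```python
import numpy as np


def region_for(existing_coord, incoming_coord):
    """Return the integer slice of existing_coord covered by incoming_coord.

    Hypothetical helper, not an xarray or zarr API. Assumes the incoming
    values form one contiguous block of the stored coordinate, in stored order.
    """
    existing = np.asarray(existing_coord)
    incoming = np.asarray(incoming_coord)
    if incoming.size == 0:
        raise ValueError("incoming coordinate is empty")
    # positions in the stored coordinate whose values appear in the incoming data
    positions = np.flatnonzero(np.isin(existing, incoming))
    if positions.size != incoming.size:
        raise ValueError("each incoming value must match exactly one stored value")
    start, stop = int(positions[0]), int(positions[-1]) + 1
    if stop - start != positions.size or not np.array_equal(existing[start:stop], incoming):
        raise ValueError("incoming values must form one contiguous block in stored order")
    return slice(start, stop)


stored_time = np.arange(100, 110)     # 'time' coordinate from the first write
incoming_time = np.array([103, 104])  # 'time' coordinate of the reprocessed subset
region = region_for(stored_time, incoming_time)
print(region)  # slice(3, 5, None)
```

With that slice, the write might look like overwrite_subset_ds.drop_vars('data_dim').to_zarr(r'C:\collab\dasktest\data_dir\test', region={'time': region}). The drop_vars call assumes that variables sharing no dimension with the region have to be left out of a region write, which is worth checking against the installed xarray version.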

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4544/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed
repo: 13221727
type: issue