html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/5961#issuecomment-965803186,https://api.github.com/repos/pydata/xarray/issues/5961,965803186,IC_kwDOAMm_X845kPyy,35968931,2021-11-10T22:26:12Z,2021-11-11T01:08:14Z,MEMBER,"Update: I tried making a custom mapping class (code in drop-down below), then swapping out `._variables = dict(variables)` for `._variables = DataManifest(variables=variables)` in the `Dataset` constructors to see if that would break anything, as a first step towards that kind of integration. (At some point it would be good to be able to automatically run the `test_dataset.py` tests again for the manifest case.)
It kind of works?
> From a xarray.Dataset perspective, Dataset._variables just needs to be a MutableMapping of xarray.Variable objects.
It's not quite as simple as this - you need a `.copy` method (fine), a repr (okay), and there are several places inside `Dataset` and `DataArray` that explicitly check that the type of `._variables` is a dict.
To get tests to pass I can either relax those type constraints (which leads to >2/3 of `test_dataset.py` passing immediately) or maybe try making `DataManifest` inherit from `dict` so that it passes `isinstance(ds._variables, dict)`? This probably deserves a new PR...
(EDIT: Though maybe [inheriting from dict is more trouble than it's worth](https://treyhunner.com/2019/04/why-you-shouldnt-inherit-from-list-and-dict-in-python/))
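The type-check problem is easy to reproduce without xarray. A minimal sketch, assuming a hypothetical `PlainMapping` class standing in for `DataManifest`: a `MutableMapping` subclass satisfies the abstract interface, but fails any explicit `isinstance(..., dict)` check and gets no `.copy` method for free.

```python
from collections.abc import MutableMapping


class PlainMapping(MutableMapping):
    # Hypothetical stand-in for DataManifest: wraps a dict without inheriting from it
    def __init__(self, data=None):
        self._data = dict(data or {})

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


m = PlainMapping({'a': 1})
is_mapping = isinstance(m, MutableMapping)  # True: the abstract interface is satisfied
is_dict = isinstance(m, dict)  # False: explicit dict checks reject it
has_copy = hasattr(m, 'copy')  # False: MutableMapping does not provide .copy
```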
Code for custom mapping class
```python
from collections.abc import MutableMapping
from typing import Dict, Hashable, Iterator, Optional

from xarray.core.variable import Variable

# from xarray.tree.datatree import DataTree


class DataTree:
    """"""Purely for type hinting purposes for now (and to avoid a circular import)""""""

    ...


class DataManifest(MutableMapping):
    """"""
    Stores variables like a dict, but also stores children alongside in a hidden manner, to check against.

    Acts like a dict of keys to variables, but prevents setting variables to same key as any children. It prevents name
    collisions by acting as a common record of stored items for both the DataTree instance and its wrapped Dataset instance.
    """"""

    def __init__(
        self,
        variables: Optional[Dict[Hashable, Variable]] = None,
        children: Optional[Dict[Hashable, DataTree]] = None,
    ):
        # Use None defaults so that instances never share one mutable default dict
        variables = {} if variables is None else variables
        children = {} if children is None else children
        keys_in_both = set(variables.keys()) & set(children.keys())
        if keys_in_both:
            raise KeyError(
                f""The keys {keys_in_both} exist in both the variables and child nodes""
            )
        self._variables = variables
        self._children = children

    @property
    def children(self) -> Dict[Hashable, DataTree]:
        """"""Stores the node's children, keyed by name""""""
        return self._children

    @children.setter
    def children(self, children: Dict[Hashable, DataTree]):
        for key, child in children.items():
            if key in self._variables:
                raise KeyError(f""Cannot add child under key {key} because a variable is already stored under that key"")
            if not isinstance(child, DataTree):
                raise TypeError(f""Cannot store object of type {type(child)} as a child"")
        self._children = children

    def __getitem__(self, key: Hashable) -> Variable:
        """"""Forward to the variables here so the manifest acts like a normal dict of variables""""""
        return self._variables[key]

    def __setitem__(self, key: Hashable, value: Variable):
        """"""Allow adding new variables, but first check if they conflict with children""""""
        if key in self._children:
            raise KeyError(
                f""key {key} already in use to denote a child ""
                ""node in wrapping DataTree node""
            )
        if isinstance(value, Variable):
            self._variables[key] = value
        else:
            raise TypeError(f""Cannot store object of type {type(value)}"")

    def __delitem__(self, key: Hashable):
        """"""Delete a variable, falling back to deleting a child stored under that key""""""
        if key in self._variables:
            del self._variables[key]
        elif key in self._children:
            # TODO might be better not to del children here?
            del self._children[key]
        else:
            raise KeyError(f""Cannot remove item because nothing is stored under {key}"")

    def __contains__(self, item: object) -> bool:
        """"""Forward to the variables here so the manifest acts like a normal dict of variables""""""
        return item in self._variables

    def __iter__(self) -> Iterator:
        """"""Forward to the variables here so the manifest acts like a normal dict of variables""""""
        return iter(self._variables)

    def __len__(self) -> int:
        """"""Forward to the variables here so the manifest acts like a normal dict of variables""""""
        return len(self._variables)

    def copy(self) -> ""DataManifest"":
        """"""Required for consistency with dict""""""
        return DataManifest(variables=self._variables.copy(), children=self._children.copy())

    # TODO __repr__
```
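A usage sketch of the collision checks, written with stand-in classes so that it runs without xarray. `FakeVariable`, `FakeTree`, `Manifest` and its `add_child` helper are hypothetical simplifications of the classes above, not the real implementation.

```python
from collections.abc import MutableMapping


class FakeVariable:
    # Hypothetical stand-in for xarray.Variable
    pass


class FakeTree:
    # Hypothetical stand-in for DataTree
    pass


class Manifest(MutableMapping):
    # Simplified version of DataManifest: variables are exposed, children are hidden
    def __init__(self, variables=None, children=None):
        self._variables = {} if variables is None else variables
        self._children = {} if children is None else children

    def add_child(self, key, child):
        # Hypothetical helper: refuse a child whose key is already a variable
        if key in self._variables:
            raise KeyError(f'{key!r} is already a variable')
        self._children[key] = child

    def __getitem__(self, key):
        return self._variables[key]

    def __setitem__(self, key, value):
        if key in self._children:
            raise KeyError(f'{key!r} is already a child node')
        self._variables[key] = value

    def __delitem__(self, key):
        del self._variables[key]

    def __iter__(self):
        return iter(self._variables)

    def __len__(self):
        return len(self._variables)


manifest = Manifest()
manifest['a'] = FakeVariable()
manifest.add_child('group1', FakeTree())

errors = []
try:
    manifest['group1'] = FakeVariable()  # key collides with a child
except KeyError:
    errors.append('variable over child')
try:
    manifest.add_child('a', FakeTree())  # key collides with a variable
except KeyError:
    errors.append('child over variable')

visible_keys = list(manifest)  # only variables are exposed
```

Both directions of the name collision raise, while iteration and `len` still see only the variables, which is what `Dataset` would need from `._variables`.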
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1048697792
https://github.com/pydata/xarray/pull/5961#issuecomment-965861064,https://api.github.com/repos/pydata/xarray/issues/5961,965861064,IC_kwDOAMm_X845kd7I,35968931,2021-11-11T00:08:44Z,2021-11-11T00:08:44Z,MEMBER,"Question: Does this change to `._variables` need to propagate down to `DatasetCoordinates` and `DataArrayCoordinates`? I'm not sure what the intended behaviour is if the user alters `.coords` directly.
1) It seems it is possible to alter a `ds` via `ds.coords[new_coord_name] = new_coord`:
```python
In [34]: ds = xr.Dataset({'a': 0})

In [35]: ds.coords['c'] = 2

In [36]: ds
Out[36]:
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    c        int64 2
Data variables:
    a        int64 0
```
(That's a bit weird given that the docstring of `DatasetCoordinates` describes it as an ""immutable dictionary"" :confused:)
2) It also seems it is possible to similarly alter a `da` via `da.coords[new_coord_name] = new_coord`:
```python
In [30]: da = xr.DataArray(0)

In [31]: da.coords['c'] = 1

In [32]: da
Out[32]:
<xarray.DataArray ()>
array(0)
Coordinates:
    c        int64 1
```
3) However, is it meant to be possible to alter a `ds` via `ds[var].coords[new_coord_name] = new_coord`? Because that currently fails silently, without updating anything:
```python
In [37]: ds = xr.Dataset({'a': 0})

In [38]: ds['a'].coords['c'] = 2

In [39]: ds
Out[39]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    a        int64 0

In [40]: ds['a']
Out[40]:
<xarray.DataArray 'a' ()>
array(0)

In [41]: ds['a'].coords
Out[41]:
Coordinates:
    *empty*
```
Bizarrely, the coordinates object itself does update if you hold a reference to it:
```python
In [42]: coords = ds['a'].coords

In [43]: coords['c'] = 2

In [44]: coords
Out[44]:
Coordinates:
    c        int64 2
```
If altering `.coords` is intended behaviour, though, then my `DataManifest` also has to be accessible from `DataArrayCoordinates`, so that the wrapping `DataTree` node can know about any changes to `dt.ds[var].coords`.
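One possible explanation for case 3, offered as an assumption rather than something confirmed in this thread: if each `ds[var]` access builds a fresh `DataArray` with its own coordinate mapping, then writing to `ds[var].coords` mutates a temporary object that is discarded straight away. A minimal sketch with hypothetical stand-in classes (`FakeDataset`, `FakeArray`) reproduces the same pattern:

```python
class FakeArray:
    # Hypothetical stand-in for DataArray: owns its own coords dict
    def __init__(self, value):
        self.value = value
        self.coords = {}


class FakeDataset:
    # Hypothetical stand-in for Dataset: builds a new FakeArray on every access
    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return FakeArray(self._data[key])


ds = FakeDataset({'a': 0})
ds['a'].coords['c'] = 2  # mutates a temporary FakeArray
lost = ds['a'].coords  # comes from a fresh FakeArray, so the write is gone

coords = ds['a'].coords  # keep a reference to one temporary's coords
coords['c'] = 2
kept = coords  # the held reference sees the write, but ds never does
```

If that is what happens, the manifest would only need to reach `DataArrayCoordinates` if such writes were meant to propagate back to the dataset.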
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1048697792
https://github.com/pydata/xarray/pull/5961#issuecomment-965544051,https://api.github.com/repos/pydata/xarray/issues/5961,965544051,IC_kwDOAMm_X845jQhz,35968931,2021-11-10T16:57:08Z,2021-11-10T17:06:13Z,MEMBER,"> From a xarray.Dataset perspective, Dataset._variables just needs to be a MutableMapping of xarray.Variable objects. And in most cases (when not using DataTree), _variables would still be a plain dictionary, which means adding DataTree support would have no performance implications for normal Dataset objects.
That sounds nice, and might not require any changes to `Dataset` at all!
> My tentative suggestion would be to use a mixed dictionary with either xarray.Variable or nested dictionaries as entries for the data in DataTree.
I think it's a lot easier to have a dict of DataTree objects rather than a nested dict of data, as then each node just points to its child nodes instead of having a node which knows about all the data in the whole tree (if that's what you meant).
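To make the comparison concrete, here is a rough sketch of the node-based layout, assuming a hypothetical `Node` class (not the actual `DataTree` API): each node holds only its own variables plus a dict of child nodes, and a path lookup walks from node to node.

```python
class Node:
    # Hypothetical tree node: knows its own variables and its direct children only
    def __init__(self, name, variables=None):
        self.name = name
        self.variables = dict(variables or {})
        self.children = {}

    def add_child(self, child):
        self.children[child.name] = child
        return child

    def get(self, path):
        # Walk a slash-separated path one node at a time
        node = self
        *groups, leaf = path.split('/')
        for group in groups:
            node = node.children[group]
        return node.variables[leaf]


root = Node('root', {'a': 0})
group1 = root.add_child(Node('group1', {'b': 1}))
group1.add_child(Node('sub', {'c': 2}))

top = root.get('a')
nested = root.get('group1/sub/c')
```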
> How about making custom Mapping for use as Dataset._variables directly, which directly is a mapping of dataset variables?
So this is my understanding of what you're suggesting - I'm just not sure if it solves all the requirements:
```python
class DataManifest(MutableMapping):
    """"""
    Acts like a dict of keys to variables, but
    prevents setting variables to same key as any
    children
    """"""

    def __init__(self, variables=None, children=None):
        # check for collisions here
        self._variables = {} if variables is None else variables
        self._children = {} if children is None else children

    def __getitem__(self, key):
        # only expose the variables so this acts like a normal dict of variables
        return self._variables[key]

    def __setitem__(self, key, var):
        if key in self._children:
            raise KeyError(
                ""key already in use to denote a child ""
                ""node in wrapping DataTree node""
            )
        self._variables[key] = var


class Dataset:
    _variables: Mapping[Any, Variable]
    # in standard case just use dict of vars as before

    # Use ._construct_direct as the constructor
    # as it allows for setting ._variables directly

    # therefore no changes to Dataset required!


class DataTree:
    def __init__(self, name, data, parent, children):
        self._children
        self._variables
        self._coord_names
        self._dims
        ...

    @property
    def ds(self):
        manifest = DataManifest(self._variables, self._children)
        return Dataset._from_treenode(
            variables=manifest,
            coord_names=self._coord_names,
            dims=self._dims,
            ...
        )

    @ds.setter
    def ds(self, ds):
        # check for collisions between ds.data_vars and self.children
        ...


# ----------------

ds = Dataset({'a': 0})
subtree1 = DataTree('group1')
dt = DataTree('root', data=ds, children=[subtree1])

wrapped_ds = dt.ds
wrapped_ds['group1'] = 1  # raises KeyError - good!

subtree2 = DataTree('b')
dt.ds['b'] = 2  # this will happily add a variable to the dataset
dt.add_child(subtree2)  # want to ensure this raises a KeyError as it conflicts with the new variable, but with this design I'm not sure if it will...
```
EDIT: Actually maybe this would work? So long as in `DataTree` we have
```python
class DataTree:
    self._variables = manifest
    self._children = manifest.children
```
Then adding a new child node would also update the manifest, meaning that the linked dataset should know about it too...","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1048697792
https://github.com/pydata/xarray/pull/5961#issuecomment-964477183,https://api.github.com/repos/pydata/xarray/issues/5961,964477183,IC_kwDOAMm_X845fMD_,1217238,2021-11-09T19:43:07Z,2021-11-09T19:43:07Z,MEMBER,"From a xarray.Dataset perspective, `Dataset._variables` just needs to be a MutableMapping of xarray.Variable objects. And in most cases (when not using DataTree), `_variables` would still be a plain dictionary, which means adding DataTree support would have no performance implications for normal Dataset objects.
My tentative suggestion would be to use a mixed dictionary with either `xarray.Variable` or nested dictionaries as entries for the data in `DataTree`. Then you could make a proxy mapping object that only exposes Variable objects for use in `Dataset`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1048697792
https://github.com/pydata/xarray/pull/5961#issuecomment-964450835,https://api.github.com/repos/pydata/xarray/issues/5961,964450835,IC_kwDOAMm_X845fFoT,35968931,2021-11-09T19:08:09Z,2021-11-09T19:24:32Z,MEMBER,"It just seems confusing to have an object referred to as `._variables`
which actually is just as much a container of child `DataTree` nodes as it
is of variables...
Also the `DataTree` class needs to wrap this ""manifest"" too (otherwise I'm
storing children in multiple places at once), and so I want it to
make sense in that context too.
On Tue, 9 Nov 2021, 13:46 Stephan Hoyer wrote:
> How about making custom Mapping for use as Dataset._variables directly,
> which directly is a mapping of dataset variables? You could still be
> storing the underlying variables in a different way.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1048697792
https://github.com/pydata/xarray/pull/5961#issuecomment-964433424,https://api.github.com/repos/pydata/xarray/issues/5961,964433424,IC_kwDOAMm_X845fBYQ,1217238,2021-11-09T18:46:11Z,2021-11-09T18:46:11Z,MEMBER,"How about making custom `Mapping` for use as `Dataset._variables` directly, which directly is a mapping of dataset variables? You could still be storing the underlying variables in a different way.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1048697792