html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/5961#issuecomment-965803186,https://api.github.com/repos/pydata/xarray/issues/5961,965803186,IC_kwDOAMm_X845kPyy,35968931,2021-11-10T22:26:12Z,2021-11-11T01:08:14Z,MEMBER,"Update: I tried making a custom mapping class (code in drop-down below), then swapping out `._variables = dict(variables)` for `._variables = DataManifest(variables=variables)` in the `Dataset` constructors to see if that would break anything, as a first step towards that kind of integration. (At some point it would be good to be able to automatically run the `test_dataset.py` tests again for the manifest case.) It kind of works?

> From a xarray.Dataset perspective, Dataset._variables just needs to be a MutableMapping of xarray.Variable objects.

It's not quite as simple as this - you need a `.copy` method (fine), a repr (okay), and there are several places inside `Dataset` and `DataArray` that explicitly check that the type of `._variables` is a dict. To get tests to pass I can either relax those type constraints (which leads to >2/3 of `test_dataset.py` passing immediately) or maybe try making `DataManifest` inherit from `dict` so that it passes `isinstance(ds._variables, dict)`? This probably deserves a new PR...

(EDIT: Though maybe [inheriting from dict is more trouble than it's worth](https://treyhunner.com/2019/04/why-you-shouldnt-inherit-from-list-and-dict-in-python/))
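
To make the trade-off between the two options concrete, here is a minimal standard-library sketch. The names `CheckedMapping`, `CheckedDict`, `RESERVED` and `update_is_checked` are hypothetical stand-ins, not xarray code: one class builds on `MutableMapping`, the other subclasses `dict`, and both try to reject a reserved key in `__setitem__`.

```python
from collections.abc import MutableMapping

RESERVED = {'group1'}  # hypothetical keys already taken by child nodes


class CheckedMapping(MutableMapping):
    # Built on the ABC: the inherited update() writes through __setitem__
    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        if key in RESERVED:
            raise KeyError(f'{key!r} is reserved')
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


class CheckedDict(dict):
    # dict subclass: passes isinstance checks, but dict.update()
    # does not call this overridden __setitem__ in CPython
    def __setitem__(self, key, value):
        if key in RESERVED:
            raise KeyError(f'{key!r} is reserved')
        super().__setitem__(key, value)


def update_is_checked(mapping):
    # Return True if .update() refuses the reserved key
    try:
        mapping.update({'group1': 1})
    except KeyError:
        return True
    return False


abc_based = CheckedMapping()
dict_based = CheckedDict()

print(isinstance(abc_based, dict), update_is_checked(abc_based))
print(isinstance(dict_based, dict), update_is_checked(dict_based))
```

In this sketch the ABC-based class fails the `isinstance(..., dict)` check but keeps its key check when `update` is called, whereas the `dict` subclass passes the `isinstance` check but lets the reserved key slip in through `update`. That is the kind of surprise the linked article warns about.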
Code for custom mapping class

```python
from collections.abc import MutableMapping
from typing import Dict, Hashable, Iterator, Optional

from xarray.core.variable import Variable

# from xarray.tree.datatree import DataTree


class DataTree:
    """"""Purely for type hinting purposes for now (and to avoid a circular import)""""""

    ...


class DataManifest(MutableMapping):
    """"""
    Stores variables like a dict, but also stores children alongside in a hidden manner, to check against.

    Acts like a dict of keys to variables, but prevents setting variables to same key as any children. It prevents name
    collisions by acting as a common record of stored items for both the DataTree instance and its wrapped Dataset instance.
    """"""

    def __init__(
        self,
        variables: Optional[Dict[Hashable, Variable]] = None,
        children: Optional[Dict[Hashable, DataTree]] = None,
    ):
        # Default to None rather than {} so instances never share one mutable default dict
        variables = {} if variables is None else variables
        children = {} if children is None else children

        if variables and children:
            keys_in_both = set(variables.keys()) & set(children.keys())
            if keys_in_both:
                raise KeyError(
                    f""The keys {keys_in_both} exist in both the variables and child nodes""
                )

        self._variables = variables
        self._children = children

    @property
    def children(self) -> Dict[Hashable, DataTree]:
        """"""Stores the node's children, keyed by name""""""
        return self._children

    @children.setter
    def children(self, children: Dict[Hashable, DataTree]):
        for key, child in children.items():
            if key in self.keys():
                # f-prefix added so that {key} is actually interpolated
                raise KeyError(
                    f""Cannot add child under key {key} because a variable is already stored under that key""
                )
            if not isinstance(child, DataTree):
                raise TypeError
        self._children = children

    def __getitem__(self, key: Hashable) -> Variable:
        """"""Forward to the variables here so the manifest acts like a normal dict of variables""""""
        return self._variables[key]

    def __setitem__(self, key: Hashable, value: Variable):
        """"""Allow adding new variables, but first check if they conflict with children""""""
        if key in self._children:
            raise KeyError(
                f""key {key} already in use to denote a child ""
                ""node in wrapping DataTree node""
            )
        if isinstance(value, Variable):
            self._variables[key] = value
        else:
            raise TypeError(f""Cannot store object of type {type(value)}"")

    def __delitem__(self, key: Hashable):
        """"""Forward to the variables here so the manifest acts like a normal dict of variables""""""
        if key in self._variables:
            del self._variables[key]
        elif key in self.children:
            # TODO might be better not to del children here?
            del self._children[key]
        else:
            raise KeyError(f""Cannot remove item because nothing is stored under {key}"")

    def __contains__(self, item: object) -> bool:
        """"""Forward to the variables here so the manifest acts like a normal dict of variables""""""
        return item in self._variables

    def __iter__(self) -> Iterator:
        """"""Forward to the variables here so the manifest acts like a normal dict of variables""""""
        return iter(self._variables)

    def __len__(self) -> int:
        """"""Forward to the variables here so the manifest acts like a normal dict of variables""""""
        return len(self._variables)

    def copy(self) -> ""DataManifest"":
        """"""Required for consistency with dict""""""
        return DataManifest(variables=self._variables.copy(), children=self._children.copy())

    # TODO __repr__
```
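
The collision behaviour can be sanity-checked without xarray installed by using a cut-down stand-in. Everything in this sketch is hypothetical: `MiniManifest` keeps only the parts of `DataManifest` needed for the check, plain Python values stand in for `Variable` and `DataTree` objects, and `add_child` and `raises_key_error` are invented helpers for the illustration.

```python
from collections.abc import MutableMapping


class MiniManifest(MutableMapping):
    # One shared record of variable keys and child keys
    def __init__(self, variables=None, children=None):
        self._variables = dict(variables or {})
        self._children = dict(children or {})
        clash = set(self._variables) & set(self._children)
        if clash:
            raise KeyError(f'keys {clash} are used for both variables and children')

    def add_child(self, key, child):
        # Hypothetical helper: refuse a child whose key is already a variable
        if key in self._variables:
            raise KeyError(f'{key!r} is already a variable')
        self._children[key] = child

    def __getitem__(self, key):
        return self._variables[key]

    def __setitem__(self, key, value):
        if key in self._children:
            raise KeyError(f'{key!r} is already a child node')
        self._variables[key] = value

    def __delitem__(self, key):
        del self._variables[key]

    def __iter__(self):
        return iter(self._variables)

    def __len__(self):
        return len(self._variables)


def raises_key_error(func, *args):
    # Return True if calling func(*args) raises KeyError
    try:
        func(*args)
    except KeyError:
        return True
    return False


manifest = MiniManifest(variables={'a': 0}, children={'group1': object()})
manifest['b'] = 2  # fine: 'b' is not used by any child

print(raises_key_error(manifest.__setitem__, 'group1', 1))  # variable vs existing child
print(raises_key_error(manifest.add_child, 'b', object()))  # child vs existing variable
print(sorted(manifest))  # iteration only exposes the variables
```

Because both kinds of key live in the same object, the check works in both directions, whichever side (the dataset or the tree node) tries to write first.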
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1048697792
https://github.com/pydata/xarray/pull/5961#issuecomment-965861064,https://api.github.com/repos/pydata/xarray/issues/5961,965861064,IC_kwDOAMm_X845kd7I,35968931,2021-11-11T00:08:44Z,2021-11-11T00:08:44Z,MEMBER,"Question: Does this change to `._variables` need to propagate down to `DatasetCoordinates` and `DataArrayCoordinates`? I'm not sure what the intended behaviour is if the user alters `.coords` directly.

1) It seems it is possible to alter a `ds` via `ds.coords[new_coord_name] = new_coord`:

```python
In [34]: ds = xr.Dataset({'a': 0})

In [35]: ds.coords['c'] = 2

In [36]: ds
Out[36]:
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    c        int64 2
Data variables:
    a        int64 0
```

(That's a bit weird given that the docstring of `DatasetCoordinates` describes it as an ""immutable dictionary"" :confused:)

2) It also seems it is possible to similarly alter a `da` via `da.coords[new_coord_name] = new_coord`:

```python
In [30]: da = xr.DataArray(0)

In [31]: da.coords['c'] = 1

In [32]: da
Out[32]:
<xarray.DataArray ()>
array(0)
Coordinates:
    c        int64 1
```

3) However, is it meant to be possible to alter a `ds` via `ds[var].coords[new_coord_name] = new_coord`? Because that currently silently fails to update:

```python
In [37]: ds = xr.Dataset({'a': 0})

In [38]: ds['a'].coords['c'] = 2

In [39]: ds
Out[39]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    a        int64 0

In [40]: ds['a']
Out[40]:
<xarray.DataArray 'a' ()>
array(0)

In [41]: ds['a'].coords
Out[41]:
Coordinates:
    *empty*
```

Bizarrely this does change though:

```python
In [42]: coords = ds['a'].coords

In [43]: coords['c'] = 2

In [44]: coords
Out[44]:
Coordinates:
    c        int64 2
```

If altering `.coords` is intended behaviour, though, that means my `DataManifest` also has to be accessible from `DataArrayCoordinates`, so that the wrapping `DataTree` node can know about any changes to `dt.ds[var].coords`.
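
On point 3, one possible explanation (an assumption about how indexing behaves, not something checked against the xarray source here) is that each `ds['a']` lookup builds a fresh object with its own coordinate mapping, so the assignment lands on a temporary that is thrown away. A toy model with the hypothetical classes `ToyDataset` and `ToyArray` reproduces the same pattern:

```python
class ToyArray:
    # Stand-in for a DataArray: owns its own coords dict
    def __init__(self, coords):
        self.coords = coords


class ToyDataset:
    # Stand-in for a Dataset: stores coords centrally
    def __init__(self):
        self._coords = {}

    def __getitem__(self, name):
        # Assumption: every lookup wraps a *copy* of the coords in a new object
        return ToyArray(dict(self._coords))


ds = ToyDataset()

ds['a'].coords['c'] = 2   # mutates a temporary that is then discarded
print(ds['a'].coords)     # a fresh object, so the change is gone

coords = ds['a'].coords   # keep a reference to one temporary
coords['c'] = 2
print(coords)             # the change is visible on that object only
print(ds['a'].coords)     # the dataset itself never saw it
```

If something like this is what happens, then propagating changes from `ds[var].coords` back to the parent would need the coordinates object to hold a reference to shared state rather than a copy.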
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1048697792
https://github.com/pydata/xarray/pull/5961#issuecomment-965544051,https://api.github.com/repos/pydata/xarray/issues/5961,965544051,IC_kwDOAMm_X845jQhz,35968931,2021-11-10T16:57:08Z,2021-11-10T17:06:13Z,MEMBER,"> From a xarray.Dataset perspective, Dataset._variables just needs to be a MutableMapping of xarray.Variable objects. And in most cases (when not using DataTree), _variables would still be a plain dictionary, which means adding DataTree support would have no performance implications for normal Dataset objects.

That sounds nice, and might not require any changes to `Dataset` at all!

> My tentative suggestion would be to use a mixed dictionary with either xarray.Variable or nested dictionaries as entries for the data in DataTree.

I think it's a lot easier to have a dict of DataTree objects rather than a nested dict of data, as then each node just points to its child nodes instead of having a node which knows about all the data in the whole tree (if that's what you meant).

> How about making custom Mapping for use as Dataset._variables directly, which directly is a mapping of dataset variables?

So this is my understanding of what you're suggesting - I'm just not sure if it solves all the requirements:

```python
# Design sketch only, not runnable as-is
from collections.abc import MutableMapping
from typing import Any, Mapping

from xarray import Variable


class DataManifest(MutableMapping):
    """"""
    Acts like a dict of keys to variables, but
    prevents setting variables to same key as any children
    """"""

    def __init__(self, variables=None, children=None):
        # check for collisions here
        self._variables = {} if variables is None else variables
        self._children = {} if children is None else children

    def __getitem__(self, key):
        # only expose the variables so this acts like a normal dict of variables
        return self._variables[key]

    def __setitem__(self, key, var):
        if key in self._children:
            raise KeyError(
                ""key already in use to denote a child ""
                ""node in wrapping DataTree node""
            )
        self._variables[key] = var

    # (__delitem__, __iter__ and __len__ omitted from the sketch)


class Dataset:
    _variables: Mapping[Any, Variable]
    # in standard case just use dict of vars as before

    # Use ._construct_direct as the constructor
    # as it allows for setting ._variables directly

    # therefore no changes to Dataset required!


class DataTree:
    def __init__(self, name, data, parent, children):
        # attributes the node would hold
        self._children
        self._variables
        self._coord_names
        self._dims
        ...

    @property
    def ds(self):
        manifest = DataManifest(self._variables, self._children)
        return Dataset._from_treenode(
            variables=manifest,
            coord_names=self._coord_names,
            dims=self._dims,
            # ...
        )

    @ds.setter
    def ds(self, ds):
        # check for collisions between ds.data_vars and self.children
        ...


# ----------------

ds = Dataset({'a': 0})
subtree1 = DataTree('group1')
dt = DataTree('root', data=ds, children=[subtree1])

wrapped_ds = dt.ds
wrapped_ds['group1'] = 1  # raises KeyError - good!

subtree2 = DataTree('b')
dt.ds['b'] = 2  # this will happily add a variable to the dataset
# want to ensure this raises a KeyError as it conflicts with the new variable,
# but with this design I'm not sure if it will...
dt.add_child(subtree2)
```

EDIT: Actually maybe this would work?
So long as in `DataTree` we have

```python
class DataTree:
    def __init__(self, manifest):
        # sketch: the node and its wrapped Dataset share one DataManifest
        self._variables = manifest
        self._children = manifest.children
```

Then adding a new child node would also update the manifest, meaning that the linked dataset should know about it too...","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1048697792
https://github.com/pydata/xarray/pull/5961#issuecomment-964477183,https://api.github.com/repos/pydata/xarray/issues/5961,964477183,IC_kwDOAMm_X845fMD_,1217238,2021-11-09T19:43:07Z,2021-11-09T19:43:07Z,MEMBER,"From a xarray.Dataset perspective, `Dataset._variables` just needs to be a MutableMapping of xarray.Variable objects. And in most cases (when not using DataTree), `_variables` would still be a plain dictionary, which means adding DataTree support would have no performance implications for normal Dataset objects.

My tentative suggestion would be to use a mixed dictionary with either `xarray.Variable` or nested dictionaries as entries for the data in `DataTree`. Then you could make a proxy mapping object that only exposes Variable objects for use in `Dataset`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1048697792
https://github.com/pydata/xarray/pull/5961#issuecomment-964450835,https://api.github.com/repos/pydata/xarray/issues/5961,964450835,IC_kwDOAMm_X845fFoT,35968931,2021-11-09T19:08:09Z,2021-11-09T19:24:32Z,MEMBER,"It just seems confusing to have an object referred to as `._variables` which actually is just as much a container of child `DataTree` nodes as it is of variables... Also the `DataTree` class needs to wrap this ""manifest"" too (otherwise I'm storing children in multiple places at once), and so I want it to make sense in that context too.
On Tue, 9 Nov 2021, 13:46 Stephan Hoyer wrote:

> How about making custom Mapping for use as Dataset._variables directly,
> which directly is a mapping of dataset variables? You could still be
> storing the underlying variables in a different way.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1048697792
https://github.com/pydata/xarray/pull/5961#issuecomment-964433424,https://api.github.com/repos/pydata/xarray/issues/5961,964433424,IC_kwDOAMm_X845fBYQ,1217238,2021-11-09T18:46:11Z,2021-11-09T18:46:11Z,MEMBER,"How about making custom `Mapping` for use as `Dataset._variables` directly, which directly is a mapping of dataset variables? You could still be storing the underlying variables in a different way.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1048697792