html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/3959#issuecomment-612598462,https://api.github.com/repos/pydata/xarray/issues/3959,612598462,MDEyOklzc3VlQ29tbWVudDYxMjU5ODQ2Mg==,14808389,2020-04-12T11:11:26Z,2020-04-12T22:18:31Z,MEMBER,"> Is there any reason not to put the name of the type into `attrs` and just switch on that rather than the keys in `data_vars`?

Not really, I just thought the variables in the dataset were a way to uniquely identify its variant (i.e. to validate the dataset's structure). If you have other means of doing so, you can of course use those instead.

Re `TypedDict`: the PEP introducing `TypedDict` specifically mentions that it is only intended for `Dict[str, Any]` (so no subclasses of `Dict` for `TypedDict`). However, looking at the [code of `TypedDict`](https://github.com/python/cpython/blob/3e0dd3730b5eff7e9ae6fb921aa77cd26efc9e3a/Lib/typing.py#L1795-L1905), we should be able to do something similar for `Dataset`. Edit: we'd still need to convince `mypy` that the custom `TypedDict` is a type...

> so I'm curious if that has been discussed much

I don't think so? There were a few discussions about subclassing, but I couldn't find anything about static type analysis.
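As a minimal sketch of what switching on an explicit tag in `attrs` could look like (plain Python only -- `SimpleDataset`, the `dataset_type` attribute name, and the variable names are all made up here, not part of xarray's API):

```python
# Sketch: dispatch on an explicit type tag stored in attrs, falling back to
# inferring the variant from the set of data variables.
# `SimpleDataset` is a hypothetical stand-in for xarray.Dataset.
class SimpleDataset:
    def __init__(self, data_vars, attrs=None):
        self.data_vars = data_vars
        self.attrs = attrs or {}

VARIANTS_BY_VARS = {
    frozenset({'variable1', 'variable2'}): 'type1',
    frozenset({'variable2', 'variable3'}): 'type2',
}

def dataset_variant(ds):
    # Prefer the explicit tag if present.
    tag = ds.attrs.get('dataset_type')
    if tag is not None:
        return tag
    return VARIANTS_BY_VARS.get(frozenset(ds.data_vars))

tagged = SimpleDataset({'variable1': None, 'variable2': None},
                       attrs={'dataset_type': 'type1'})
untagged = SimpleDataset({'variable2': None, 'variable3': None})
print(dataset_variant(tagged), dataset_variant(untagged))  # -> type1 type2
```
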
It's definitely worth having this discussion, either here (repurposing this issue) or in a new issue.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,597475005
https://github.com/pydata/xarray/issues/3959#issuecomment-612076605,https://api.github.com/repos/pydata/xarray/issues/3959,612076605,MDEyOklzc3VlQ29tbWVudDYxMjA3NjYwNQ==,14808389,2020-04-10T15:23:08Z,2020-04-10T15:56:08Z,MEMBER,"you could emulate the availability of the accessors by checking your variables in the constructor of the accessor:

```python
dataset_types = {
    frozenset({""variable1"", ""variable2""}): ""type1"",
    frozenset({""variable2"", ""variable3""}): ""type2"",
    frozenset({""variable1"", ""variable3""}): ""type3"",
}

def _dataset_type(ds):
    data_vars = frozenset(ds.data_vars.keys())
    return dataset_types[data_vars]

@xr.register_dataset_accessor(""type1"")
class Type1Accessor:
    def __init__(self, ds):
        if _dataset_type(ds) != ""type1"":
            raise AttributeError(""not a type1 dataset"")
        self.dataset = ds
```

though now that we have a ""type"" registry, we could also have one accessor, and pass a `kind` parameter to your `analyze` function:

```python
def analyze(self, kind=""auto""):
    analyzers = {
        ""type1"": _analyze_type1,
        ""type2"": _analyze_type2,
    }
    if kind == ""auto"":
        kind = self.dataset_type
    return analyzers[kind](self.dataset)
```

If you just wanted static code analysis using e.g. `mypy`, consider using `TypedDict`. I don't know anything about `mypy`, though, so I wasn't able to get it to accept `Dataset` objects instead of `dict`.
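For reference, here is a minimal `TypedDict` sketch (the class and variable names are placeholders; this also illustrates the limitation above, since `mypy` checks this for plain dicts but not for `Dataset` objects):

```python
from typing import TypedDict

# A TypedDict declares the expected keys statically; mypy can verify call
# sites that pass plain dicts, but not xarray.Dataset objects.
class Type1Vars(TypedDict):
    variable1: float
    variable2: float

def analyze_type1(data: Type1Vars) -> float:
    return data['variable1'] + data['variable2']

print(analyze_type1({'variable1': 1.0, 'variable2': 2.0}))  # -> 3.0
```
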
If someone actually gets this to work, we might be able to provide a `xarray.typing` module to allow something like this (but depending on the amount of code needed, this could also fit in the `Cookbook` docs section):

```python
from xarray.typing import DatasetType, Coordinate, ArrayType, Float64Type

class Dataset1(DatasetType):
    longitude: Coordinate[ArrayType[Float64Type]]
    latitude: Coordinate[ArrayType[Float64Type]]
    temperature: ArrayType[Float64Type]

def function(ds: Dataset1):
    # ...
    return ds
```

and have the type checker validate the structure of the dataset.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,597475005
https://github.com/pydata/xarray/issues/3959#issuecomment-611997039,https://api.github.com/repos/pydata/xarray/issues/3959,611997039,MDEyOklzc3VlQ29tbWVudDYxMTk5NzAzOQ==,14808389,2020-04-10T11:49:32Z,2020-04-10T11:49:32Z,MEMBER,"do you have any control over how the datasets are created? If so, you could provide a factory function (maybe pass in arrays via required kwargs?) that does the checks and describes the required dataset structure in its docstring.

> > If you have other questions about dtypes in xarray then please feel free to raise another issue about that.
>
> Will do.

This probably won't happen in the near future, though, since the custom dtypes for `numpy` are still a work in progress ([NEP-40](https://numpy.org/neps/nep-0040-legacy-datatype-impl.html), etc.)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,597475005
https://github.com/pydata/xarray/issues/3959#issuecomment-611967822,https://api.github.com/repos/pydata/xarray/issues/3959,611967822,MDEyOklzc3VlQ29tbWVudDYxMTk2NzgyMg==,35968931,2020-04-10T10:02:39Z,2020-04-10T10:02:39Z,MEMBER,"> A docstring on a constructor is great -- is there a way to do something like that with accessors?
There surely must be some way to do that, but I'm afraid I'm not a docs wizard. However, the accessor is still just a class whose methods you want to document - would it be too unclear to hang the documentation off each `HaploAccessor.specific_method()`?

> Is there a way to avoid running check_* methods multiple times?

There is some caching, but you shouldn't rely on it. In #3268 @crusaderky said ""The more high level discussion is that the statefulness of the accessor is something that is OK to use for caching and performance improvements, and not OK for storing functional information like yours.""

> I think those checks could be expensive

> those arrays should meet different dtype and dimensionality constraints

Checking dtype and dimensions shouldn't be expensive though, or is it more than that?

> Well, we do actually have that problem in trying to find some way to represent 2-bit integers with sub-byte data types but I wasn't trying to get into that on this thread. I'll make the title better.

If you have other questions about dtypes in xarray then please feel free to raise another issue about that.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,597475005
https://github.com/pydata/xarray/issues/3959#issuecomment-611719548,https://api.github.com/repos/pydata/xarray/issues/3959,611719548,MDEyOklzc3VlQ29tbWVudDYxMTcxOTU0OA==,35968931,2020-04-09T19:46:50Z,2020-04-09T19:47:45Z,MEMBER,"> All that said, is it still a bad idea to try to subclass Xarray data structures even if the intent was never to touch any part of the internal APIs?

One of the more immediate problems you'll find if you subclass is that xarray internally uses methods like `self._construct_dataarray(dims, values, coords, attrs)` to construct return values, so for many of the methods you call you will likely get back a bare `DataArray`, not the subclass you put in.
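To illustrate the pitfall (a plain-Python sketch, not xarray's actual code -- `DataArrayLike` and `doubled` are invented names):

```python
# When a base class builds its results by naming itself (as xarray's internal
# constructors effectively do), subclass instances 'decay' to the base class.
class DataArrayLike:
    def __init__(self, values):
        self.values = values

    def doubled(self):
        # Constructed as DataArrayLike, not type(self).
        return DataArrayLike([v * 2 for v in self.values])

class MyArray(DataArrayLike):
    pass

result = MyArray([1, 2]).doubled()
print(type(result).__name__)  # -> DataArrayLike
```
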
You could make custom accessors which perform checks on the input arrays when they get used?

```python
@xr.register_dataset_accessor('haplo')
class HaploDatasetAccessor:
    def __init__(self, ds):
        check_conforms_to_haplo_requirements(ds)
        self.data = ds

    def analyse(self):
        ...

ds.haplo.analyse()
```

I'm also wondering: given that the only real difference (not just by convention) between your desired data structures and xarray's is the dtype, would something akin to pandas' [`ExtensionDtype`](https://pandas.pydata.org/docs/development/extending.html#extensiondtype) (if xarray actually offered it) solve your problem?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,597475005