issue_comments
10 rows where issue = 597475005 ("Extending Xarray for domain-specific toolkits"), sorted by updated_at descending
comment 612978261 · eric-czech (NONE) · 2020-04-13T16:36:32Z
https://github.com/pydata/xarray/issues/3959#issuecomment-612978261

Thanks again @keewis! I moved the static typing discussion to https://github.com/pydata/xarray/issues/3967. This is closed out now as far as I'm concerned.

Reactions: +1 (1)
comment 612598462 · keewis (MEMBER) · 2020-04-12T11:11:26Z (edited 2020-04-12T22:18:31Z)
https://github.com/pydata/xarray/issues/3959#issuecomment-612598462

Not really, I just thought the variables in the dataset were a way to uniquely identify its variant (i.e. do the validation of the dataset's structure). If you have different means to do so, you can of course use those instead. Re Edit: we'd still need to convince

I don't think so? There were a few discussions about subclassing, but I couldn't find anything about static type analysis. It's definitely worth having this discussion, either here (repurposing this issue) or in a new issue.
comment 612513722 · eric-czech (NONE) · 2020-04-11T21:07:07Z (edited 2020-04-11T21:39:42Z)
https://github.com/pydata/xarray/issues/3959#issuecomment-612513722

Thanks @keewis! I like those ideas, so I experimented a bit and found a few things.

Is there any reason not to put the name of the type into

I would love to try to use something like that. I couldn't get it to work either when trying to have a TypedDict that represents entire datasets, so I tried creating them for

```python
MyDict = TypedDict('MyDict', {'x': str})
v1: MyDict = MyDict(x='x')

# This is fine:
v2: Mapping = v1

# But this doesn't work:
v2: Mapping[Hashable, Any] = v1  # A notable example since it's used in xr.Dataset
# error: Incompatible types in assignment (expression has type "MyDict",
#        variable has type "Mapping[Hashable, Any]")

# And neither do any of these:
v2: dict = v1
# error: Incompatible types in assignment (expression has type "MyDict",
#        variable has type "Dict[Any, Any]")
v2: Mapping[str, str] = v1
# error: Incompatible types in assignment (expression has type "MyDict",
#        variable has type "Mapping[str, str]")
```

Going the other direction isn't possible at all (i.e. from

```python
ds = xr.Dataset(data_vars=MyTypedDict(data=...))

# Now assume a user wants to use data_vars/coords with type safety:
data_vars: MyTypedDict = ds.data_vars  # This doesn't work
```

Generics seem like a decent solution to all these problems, but it would obviously involve a lot of type annotation changes:

```python
# Ideally, xarray.typing would help specify more specific constraints,
# but this works with what exists today:
GenotypeDataVars = TypedDict('GenotypeDataVars', {'data': DataArray, 'mask': DataArray})
GenotypeCoords = TypedDict('GenotypeCoords', {'variant': DataArray, 'sample': DataArray})

D = TypeVar('D', bound=Mapping)
C = TypeVar('C', bound=Mapping)

# Assume xr.Dataset was written something like this instead:
class Dataset(Generic[D, C]):
    ...

ds1: Dataset[GenotypeDataVars, GenotypeCoords] = Dataset(
    GenotypeDataVars(data=xr.DataArray(), mask=xr.DataArray()),
    GenotypeCoords(variant=xr.DataArray(), sample=xr.DataArray()),
)

# Types should then be preserved even if xarray is constantly redefining
# new instances in internal functions:
ds2: Dataset[GenotypeDataVars, GenotypeCoords] = type(ds1)(ds1.data_vars, ds1.coords)  # This is OK
```

Anyways, my takeaways from everything on the thread so far are:
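Since the mypy errors quoted in this comment are purely static, a small runnable sketch (not from the thread) shows why nothing fails before type checking: a TypedDict instance is an ordinary dict at runtime, and the one widening that PEP 589 does declare compatible is `Mapping[str, object]`:

```python
from typing import Mapping, TypedDict

MyDict = TypedDict('MyDict', {'x': str})

v1: MyDict = MyDict(x='x')

# At runtime a TypedDict is just a dict, so none of the rejected
# assignments above would actually fail when executed:
print(type(v1))  # <class 'dict'>

# Per PEP 589, a TypedDict is consistent with Mapping[str, object],
# so this particular widening is accepted by mypy:
v2: Mapping[str, object] = v1
print(v2['x'])  # x
```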
comment 612076605 · keewis (MEMBER) · 2020-04-10T15:23:08Z (edited 2020-04-10T15:56:08Z)
https://github.com/pydata/xarray/issues/3959#issuecomment-612076605

you could emulate the availability of the accessors by checking your variables in the constructor of the accessor:

```python
dataset_types = {
    frozenset({"variable1", "variable2"}): "type1",
    frozenset({"variable2", "variable3"}): "type2",
    frozenset({"variable1", "variable3"}): "type3",
}

def _dataset_type(ds):
    data_vars = frozenset(ds.data_vars.keys())
    return dataset_types[data_vars]

@xr.register_dataset_accessor("type1")
class Type1Accessor:
    def __init__(self, ds):
        if _dataset_type(ds) != "type1":
            raise AttributeError("not a type1 dataset")
        self.dataset = ds
```

If you just wanted to use static code analysis using e.g.

```python
class Dataset1(DatasetType):
    longitude: Coordinate[ArrayType[Float64Type]]
    latitude: Coordinate[ArrayType[Float64Type]]

def function(ds: Dataset1):
    # ...
    return ds
```

and have the type checker validate the structure of the dataset.
comment 612050871 · eric-czech (NONE) · 2020-04-10T14:23:20Z (edited 2020-04-10T14:24:35Z)
https://github.com/pydata/xarray/issues/3959#issuecomment-612050871

Thanks @keewis, that would work, though I think it leads to an awkward result if I'm understanding correctly. Here's what I'm imagining:

```python
from genetics import api

# These are different types of data structures I originally wanted to model as classes
ds1 = api.create_genotype_call_dataset(...)
ds2 = api.create_genotype_probability_dataset(...)
ds3 = api.create_haplotype_call_dataset(...)
# ds1, ds2, and ds3 are now just xr.Dataset instances

# For each of these different types of datasets I have separate accessors
# that expose dataset-type-specific analytical methods:
@xr.register_dataset_accessor("genotype_calls")
class GenotypeCallAccessor:
    def __init__(self, ds):
        self.ds = ds

@xr.register_dataset_accessor("genotype_probabilities")
class GenotypeProbabilityAccessor:
    ???  # This also has some "analyze" method

@xr.register_dataset_accessor("haplotype_calls")
class HaplotypeCallAccessor:
    ???  # This also has some "analyze" method

# *** Now, how do I prevent this? ***
ds1.haplotype_calls.analyze()
# ds1 is really genotype call data, so it shouldn't be
# possible to do a haplotype analysis on it
```

Is there a way to make accessors available on an xr.Dataset based on some conditions about the dataset itself? That still seems like a bad solution, but I think it would help me here. I was trying to think of some way to use static structural subtyping, but I don't see how that could ever work with accessors, given that 1) they're attached at runtime and 2) all accessors are available on ALL Dataset instances, regardless of whether I know only certain things should be possible based on their content. If accessors are the only way Xarray plans to facilitate extension, has anyone managed to enable static type analysis on their extensions? In my case, I'd be happy to have any kind of safety, whether it's static or monkey-patched in at runtime, but I'm curious whether making static analysis impossible was part of the discussion in deciding on accessors.
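One answer to "make accessors available based on conditions about the dataset itself" is the pattern keewis sketches above: raise AttributeError from the accessor's constructor. Because `hasattr` is defined in terms of AttributeError, the accessor then effectively disappears on non-conforming data. A plain-Python sketch (no xarray; class names and the "haplotype variable present" condition are illustrative):

```python
class HaplotypeCallAccessor:
    def __init__(self, ds):
        if "haplotype" not in ds.data_vars:  # illustrative gating condition
            raise AttributeError("not a haplotype call dataset")
        self.ds = ds

    def analyze(self):
        return "haplotype analysis"

class Dataset:
    """Toy stand-in for xr.Dataset with one property-based accessor."""
    def __init__(self, **data_vars):
        self.data_vars = data_vars

    @property
    def haplotype_calls(self):
        return HaplotypeCallAccessor(self)

# hasattr() swallows the AttributeError, so the accessor is simply
# "not there" on a genotype dataset:
print(hasattr(Dataset(genotype=[0, 1]), "haplotype_calls"))   # False
print(hasattr(Dataset(haplotype=[0, 1]), "haplotype_calls"))  # True
```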
comment 611997039 · keewis (MEMBER) · 2020-04-10T11:49:32Z
https://github.com/pydata/xarray/issues/3959#issuecomment-611997039

do you have any control over how the datasets are created? If so, you could provide a factory function (maybe pass in arrays via required kwargs?) that does the checks and describes the required dataset structure in its docstring.

This probably won't happen in the near future, though, since the custom dtypes for
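A sketch of the factory-function suggestion above. The function name echoes the one used later in the thread, but the signature and checks are invented for illustration, and a plain dict stands in for the `xr.Dataset` a real version would return:

```python
def create_genotype_call_dataset(*, variant_ids, sample_ids, calls):
    """Build a 'genotype call' dataset, validating structure up front.

    Per the suggestion above, this docstring is where the required
    dataset structure gets documented for users.
    """
    if len(calls) != len(variant_ids):
        raise ValueError("calls needs one row per variant")
    if any(len(row) != len(sample_ids) for row in calls):
        raise ValueError("each calls row needs one entry per sample")
    # Stand-in for: xr.Dataset({"call": (("variant", "sample"), calls)}, coords=...)
    return {"variant": list(variant_ids), "sample": list(sample_ids), "call": calls}

ds = create_genotype_call_dataset(
    variant_ids=["v1", "v2"], sample_ids=["s1"], calls=[[0], [1]]
)
print(sorted(ds))  # ['call', 'sample', 'variant']
```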
comment 611973587 · eric-czech (NONE) · 2020-04-10T10:20:37Z (edited 2020-04-10T10:22:16Z)
https://github.com/pydata/xarray/issues/3959#issuecomment-611973587

That works for documenting the methods, but I'm more concerned with documenting how to build the Dataset in the first place. Specifically, this would mean describing how to construct several arrays relating to genotype calls, phasing information, variant call quality scores, individual pedigree info, and so on. All of these domain-specific things can have some pretty nuanced relationships, so I think describing how to create a sensible Dataset with them will be a big part of the learning curve for users. I want to essentially override the constructor docs for Dataset and make them more specific to our use cases. I can't see a good way to do that with accessors, since the dataset would already need to have been created.

It is, or at least I'd like not to preclude the checks from doing things like checking min/max values and asserting conditions along axes (i.e. that they sum to 1).

Will do.
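The kind of check described above, a bounds check plus a "sums to 1 along an axis" condition, fits in a few lines. `check_genotype_probabilities` is a hypothetical helper, not code from the thread:

```python
import math

def check_genotype_probabilities(rows, tol=1e-8):
    # Bounds check plus an "axis sums to 1" assertion, per the comment above;
    # rows is a sequence of per-variant probability vectors.
    for i, row in enumerate(rows):
        if any(not 0.0 <= p <= 1.0 for p in row):
            raise ValueError(f"row {i}: probabilities must lie in [0, 1]")
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            raise ValueError(f"row {i}: probabilities must sum to 1")

check_genotype_probabilities([[0.25, 0.5, 0.25], [1.0, 0.0, 0.0]])  # passes silently
```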
comment 611967822 · TomNicholas (MEMBER) · 2020-04-10T10:02:39Z
https://github.com/pydata/xarray/issues/3959#issuecomment-611967822

There surely must be some way to do that, but I'm afraid I'm not a docs wizard. However, the accessor is still just a class whose methods you want to document; would it be too unclear for them to hang off each

There is some caching, but you shouldn't rely on it. In #3268 @crusaderky said: "The more high level discussion is that the statefulness of the accessor is something that is OK to use for caching and performance improvements, and not OK for storing functional information like yours."

Checking dtype and dimensions shouldn't be expensive, though; or is it more than that?

If you have other questions about dtypes in xarray then please feel free to raise another issue about that.
comment 611950517 · eric-czech (NONE) · 2020-04-10T09:08:06Z
https://github.com/pydata/xarray/issues/3959#issuecomment-611950517

Thanks @TomNicholas, some thoughts on your points:

I'm OK with the subtype being lost after running some methods. I saw that, so I'm assuming all functions that do anything with the data structures take and return Xarray objects alone.

Accessors could work, but the issues I see with them are:

```python
ds.haplo.do_custom_analysis_1()

# Do something with coords/indexes that causes a new Dataset to be created;
# e.g. ds.reset_index ultimately hits
# https://github.com/pydata/xarray/blob/1eedc5c146d9e6ebd46ab2cc8b271e51b3a25959/xarray/core/dataset.py#L882
# which creates a new Dataset
ds = ds.reset_index()
```
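The pitfall in the snippet above can be reproduced without xarray: when a method returns a brand-new object, any state the accessor stored on the old instance is gone. A toy sketch (the caching here loosely mimics xarray's per-instance accessor cache; names are illustrative):

```python
class HaploAccessor:
    def __init__(self, ds):
        self.ds = ds
        self.state = {}  # anything stored here lives only on this Dataset instance

class Dataset:
    @property
    def haplo(self):
        # Cache one accessor per Dataset instance:
        if "_haplo" not in self.__dict__:
            self._haplo = HaploAccessor(self)
        return self._haplo

    def reset_index(self):
        # Like many xarray methods, returns a *new* Dataset:
        return Dataset()

ds = Dataset()
ds.haplo.state["fitted"] = True
ds2 = ds.reset_index()
print("fitted" in ds2.haplo.state)  # False: the new Dataset gets a fresh accessor
```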
comment 611719548 · TomNicholas (MEMBER) · 2020-04-09T19:46:50Z (edited 2020-04-09T19:47:45Z)
https://github.com/pydata/xarray/issues/3959#issuecomment-611719548

One of the more immediate problems you'll find if you subclass is that xarray internally uses methods like

You could make custom accessors which perform checks on the input arrays when they get used?

```python
@xr.register_dataset_accessor('haplo')
class HaploDatasetAccessor:
    def __init__(self, ds):
        check_conforms_to_haplo_requirements(ds)
        self.data = ds

ds.haplo.analyse()
```

I'm also wondering whether, given that the only real difference (not just by convention) of your desired data structures from xarray's is the dtype, then (if xarray actually offered it) would something akin to pandas'
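The subclassing problem mentioned at the top of this comment can be shown in miniature: when internal methods construct the base class directly, the subclass type silently disappears. These are toy classes, not xarray internals:

```python
class Dataset:
    def __init__(self, data):
        self.data = data

    def reset_index(self):
        # Internals that hard-code the base class drop any subclass:
        return Dataset(self.data)

class HaploDataset(Dataset):
    def analyse(self):
        return "haplo analysis"

hd = HaploDataset([1, 2, 3])
out = hd.reset_index()
print(type(out).__name__)  # Dataset, not HaploDataset
# out.analyse() would now raise AttributeError
```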
```sql
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);
```