issue_comments

10 rows where user = 4441338 sorted by updated_at descending

issue 9

  • Feature Request: Hierarchical storage and processing in xarray 2
  • Dataset groups 1
  • zarr and xarray chunking compatibility and `to_zarr` performance 1
  • Lazy concatenation of arrays 1
  • ENH: Compute hash of xarray objects 1
  • output_dtypes needs to be a tuple, not a sequence 1
  • sel(dim=slice(a,b,c)) only accepts integers for c, uses c as isel does 1
  • Allow chunks=None for dask-less lazy loading 1
  • diff('non existing dimension') does not raise exception 1

user 1

  • LunarLanding · 10 ✖

author_association 1

  • NONE 10
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1506592074 https://github.com/pydata/xarray/issues/7748#issuecomment-1506592074 https://api.github.com/repos/pydata/xarray/issues/7748 IC_kwDOAMm_X85ZzMVK LunarLanding 4441338 2023-04-13T08:51:47Z 2023-04-13T08:51:47Z NONE

My user story: up to some point I had a DataArray with index 'a'; later the code changed and I used 'b' as the index, but some code was still operating on 'a', and I expected that code to raise an error when accessing 'a'.

If I do np.diff(x,axis=(some non-existent axis)), numpy errors.

I might be generalizing, please correct me if I'm wrong, but afaik broadcasting happens in operations with 2 or more arrays, and the resulting dimensions are the union of the dimensions of each input array; every dimension operated on is present in at least one of the arrays involved. Broadcasting does not happen if the arrays have the same shape, as is the case for the diff operation.

So, following the same logic, when using a unary operator on a Dataset, an error should be raised if none of the contained variables has the requested dimension. Since xarray allows scalar-sized dimensions (where the index shape is () ), those could be used to keep the current behavior.

I read the linked issue. imho, the operation on a DataArray should raise an error if the dimension is not there; on a Dataset the operation should be applied to each DataArray that contains the dimension, and if there are none, an error should be raised. I.e. (set of dimensions allowed as arguments to operators on a set of containers) = (union of the dimensions in each container).
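The rule proposed above can be sketched in plain Python. This is only an illustration of the proposal, not xarray's behavior: the dict-based "dataset" and the helper names `allowed_dims` and `variables_to_operate_on` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for a Dataset: variable name -> tuple of dimension names.
Container = Dict[str, Tuple[str, ...]]


def allowed_dims(container: Container) -> Set[str]:
    """Union of the dimensions of every variable in the container."""
    dims: Set[str] = set()
    for var_dims in container.values():
        dims.update(var_dims)
    return dims


def variables_to_operate_on(container: Container, dim: str) -> List[str]:
    """Return the variables that have `dim`; raise if no variable has it."""
    available = allowed_dims(container)
    if dim not in available:
        raise ValueError(
            f"dimension {dim!r} not found; available: {sorted(available)}"
        )
    return [name for name, var_dims in container.items() if dim in var_dims]


ds = {"v": ("a", "time"), "w": ("time",), "scalar": ()}
print(variables_to_operate_on(ds, "a"))     # only "v" has dimension "a"
print(variables_to_operate_on(ds, "time"))  # "v" and "w"
```

Asking for a dimension that no variable has (for example `"b"`) raises `ValueError` in this sketch, which is the behavior I expected from `diff`.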

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  diff('non existing dimension') does not raise exception 1664193419
1139544716 https://github.com/pydata/xarray/issues/4738#issuecomment-1139544716 https://api.github.com/repos/pydata/xarray/issues/4738 IC_kwDOAMm_X85D7BKM LunarLanding 4441338 2022-05-27T11:48:14Z 2022-05-27T11:48:14Z NONE

This looks like a bug in my opinion...

@andersy005

This runs with no issues atm.

    import dask
    import xarray as xr

    with dask.config.set({"tokenize.ensure-deterministic": True}):
        ds = xr.tutorial.open_dataset('rasm')
        b = ds.isel(y=0)
        assert dask.base.tokenize(b) == dask.base.tokenize(b)

With:

    xarray  2022.3.0  pyhd8ed1ab_0  conda-forge
    dask    2022.5.0  pyhd8ed1ab_0  conda-forge
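For comparison, here is a minimal sketch of a content hash that does not depend on dask at all. It is not how dask or xarray compute tokens; `content_hash` is a hypothetical helper that hashes the dimension names, dtype, shape, raw bytes and attributes of one array.

```python
import hashlib

import numpy as np


def content_hash(dims, data, attrs=None):
    """Hypothetical helper: SHA-256 over dims, dtype, shape, raw bytes and attrs."""
    arr = np.ascontiguousarray(data)
    h = hashlib.sha256()
    h.update(repr(tuple(dims)).encode())
    h.update(str(arr.dtype).encode())
    h.update(repr(arr.shape).encode())
    h.update(arr.tobytes())
    h.update(repr(sorted((attrs or {}).items())).encode())
    return h.hexdigest()


a = np.arange(6, dtype="float64").reshape(2, 3)
# Equal content gives an equal hash, even for a separate copy in memory.
same = content_hash(("y", "x"), a) == content_hash(("y", "x"), a.copy())
# Renaming the dimensions changes the hash.
different = content_hash(("y", "x"), a) != content_hash(("x", "y"), a)
print(same, different)
```

A scheme like this reads every byte of the data, so it would load lazy arrays into memory; that trade-off is not addressed here.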

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  ENH: Compute hash of xarray objects 775502974
1059382908 https://github.com/pydata/xarray/issues/4118#issuecomment-1059382908 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X84_JOZ8 LunarLanding 4441338 2022-03-04T17:46:03Z 2022-03-04T18:14:19Z NONE

> Often I run a function over a dataset, with each call outputting a hierarchical data structure, containing fixed dimensions in the best cases and variable length in the worst.

> By "variable" length, do you mean that the length of dimensions differs between variables in the same group, or just that you don't know the length of the dimension in advance?

I mean that I might have, for instance, a map from 2 variables to data, i.e. (x,y)->c, that I can write as a DataArray XY with two dimensions x and y and the values being c. Then I have a function f such that f(c)->d[g(c)], i.e. it yields an array whose length depends on c. I wish I could say: apply f to XY, building a variable-length array as you get the output. It could be stored as a sparse matrix (X,Y,G). This is a bit out of scope for this discussion, but it is related, since creating a differently named group per dimension length is often mentioned as a workaround (which does not scale when you have 1000x (variable-length dimension) data).
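A minimal sketch of that idea in plain Python, assuming the sparse result is stored as a coordinate dictionary keyed by (x, y, g). The names `map_ragged` and the particular `f` are hypothetical and only illustrate the shape of the problem.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def map_ragged(
    xy: Dict[Tuple[int, int], float],
    f: Callable[[float], Sequence[float]],
) -> Dict[Tuple[int, int, int], float]:
    """Apply f to every value c of XY and store each output element under (x, y, g)."""
    out: Dict[Tuple[int, int, int], float] = {}
    for (x, y), c in xy.items():
        for g, d in enumerate(f(c)):
            out[(x, y, g)] = d
    return out


def f(c: float) -> List[float]:
    # Hypothetical f: the output length g(c) = int(c) depends on the value c.
    return [c * k for k in range(int(c))]


XY = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 0.0}
sparse = map_ragged(XY, f)
print(len(sparse))  # 1 + 3 + 0 stored entries
```

Only the entries that exist are stored, so no padding to the longest output is needed.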

> Is there a specific use case which you think would require explicit dimensions to solve?

The use-case is iteratively adding values to a dataset by mapping functions over multiple variables / dimensions in arbitrary compositions. This happens in the context of data analysis, where you start with some source data and then iteratively create analysis functions, and then want to query / display / do statistics/reductions on the set of original data + analysis. Explicit hierarchical dimensions allow for merging and referring to data with no collisions in a single datatree/group.

PS: in netcdf-4, dimensions are seen by children, which matches what I previously posted; in HDF5, nodes are hard links to the actual data, which might be exactly the xarray-datagroup posted above.

Example of ideal datastructure

The data structure that is more useful for this kind of analysis is an arbitrary graph of n-dimensional arrays; forcing the graph to have hierarchical access allows optional organization; the graph itself can exist as Python objects for nodes and references for edges.

If the tree is not necessary/required, everything can be placed on the first level, as is done in a Dataset.

# Example:

## Notation

- `a:b` value `a` has type `b`
- `t[...,n,...]` : type of data array of values of type `t`, with axis of length `n`
- `D(n(,l))` dimension of size `n` with optional labels `l`
- `A(t,*(dims:tuple[D]))` : type of data array of values of type `t`, with dimensions `dims`
- a tree node `T` is either:
  - a dict from hashables to tree nodes, `dict[Hashable,T]`
  - a dimension `D`
  - a data array `A`
- `a[*tags]:=a[tag[0]][tag[1]]...[tag[len(tag)-1]]`
- `map(f,*args:A,dims:tuple[D])` maps `f` over `args` broadcasting over `dims`

Start with a 2-dimensional DataArray:

```
d0
(
    Graph :
    (
        x->D(x_n,float[x_n])
        y->D(y_n)
        v->A(float,x,y)
    )
    Tree :
    (
        {
            'x':x,
            'y':y,
            'v':v,
        }
    )
)
```

Map a function `f` that introduces a new dimension `w` with constant labels `f_w_l:int[f_w_n]` (through map_blocks or apply_ufunc) and add it to d0:

```
f : x:float->(
    Graph:
        f_w->D(f_w_n,f_w_l)
        a->A(float,f_w)
        b->A(float)
    Tree:
        {
            'w':f_w,
            'a':a,
            'b':b,
        })

d1=d0.copy()
d1['f']=map(
    f,
    d0['v'],
    (d0['x'],d0['y'])
)
d1
(
    Graph :
        x->D(x_n,float[x_n])
        y->D(y_n)
        v->A(float,x,y)
        f_w->D(f_w_n,f_w_l)
        f_a->A(float,x,y,f_w)
        f_b->A(float,x,y)
    Tree :
        {
            'x':x,
            'y':y,
            'v':v,
            'f':{
                'w':f_w,
                'a':f_a,
                'b':f_b,
            }
        }
)
```

Map a function `g` that has a dimension of the same name but a different meaning, and therefore a possibly different length `g_w_n` and labels `g_w_l`:

```
g : x:float->(
    Graph:
        g_w->D(g_w_n,g_w_l)
        a->A(float,g_w)
        b->A(float)
    Tree:
        {
            'w':g_w,
            'a':a,
            'b':b,
        })

d2=d1.copy()
d2['g']=map(
    g,
    d1['v'],
    (d1['x'],d1['y'])
)
d2
(
    Graph :
        x->D(x_n,float[x_n])
        y->D(y_n)
        v->A(float,x,y)
        f_w->D(f_w_n,f_w_l)
        f_a->A(float,x,y,f_w)
        f_b->A(float,x,y)
        g_w->D(g_w_n,g_w_l)
        g_a->A(float,x,y,g_w)
        g_b->A(float,x,y)
    Tree :
        {
            'x':x,
            'y':y,
            'v':v,
            'f':{
                'w':f_w,
                'a':f_a,
                'b':f_b,
            },
            'g':{
                'w':g_w,
                'a':g_a,
                'b':g_b,
            }
        }
)
```

Notice that both `f` and `g` output a dimension named 'w', but that they have different lengths and possibly different meanings.

Suppose I now want to run an analysis on f's and g's output, with a function that takes two a's and outputs a float. Then d3 looks like:

```
h : a1:float,a2:float->(
    Graph:
        r->A(float)
    Tree:
        r
    )

d3=d2.copy()
d3['f_g_aa']=map(
    h,
    d2['f','a'],d2['g','a'],
    (d2['x'],d2['y'],d2['f','w'],d2['g','w'])
)
d3
(
    Graph :
        x->D(x_n,float[x_n])
        y->D(y_n)
        v->A(float,x,y)
        f_w->D(f_w_n,f_w_l)
        f_a->A(float,x,y,f_w)
        f_b->A(float,x,y)
        g_w->D(g_w_n,g_w_l)
        g_a->A(float,x,y,g_w)
        g_b->A(float,x,y)
        f_g_aa->A(float,x,y,f_w,g_w)
    Tree :
        {
            'x':x,
            'y':y,
            'v':v,
            'f':{
                'w':f_w,
                'a':f_a,
                'b':f_b,
            },
            'g':{
                'w':g_w,
                'a':g_a,
                'b':g_b,
            },
            'f_g_aa': f_g_aa
        }
)
```

Compared to what I posted before, I dropped resolving the dimensions of an array by its position in the hierarchy, since that would be inapplicable when a variable refers to dimensions in a different branch of the tree.
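The tuple indexing `a[*tags]` from the notation, and the point that two dimensions named 'w' can coexist, can be sketched with nested dicts. This is a toy model under my own assumptions: `Dim` and `get_path` are hypothetical names, and arrays are represented only by the tuple of dimension objects they refer to.

```python
from functools import reduce
from typing import Any, Hashable, Mapping


def get_path(tree: Mapping[Hashable, Any], *tags: Hashable) -> Any:
    """a[*tags] := a[tags[0]][tags[1]]...[tags[-1]]"""
    return reduce(lambda node, tag: node[tag], tags, tree)


class Dim:
    """A dimension is an object in the graph; its name lives only in the tree."""

    def __init__(self, size: int) -> None:
        self.size = size


# Two dimensions that both appear under the key 'w', with different lengths.
f_w, g_w = Dim(3), Dim(5)
d2 = {
    "f": {"w": f_w, "a": (f_w,)},  # array 'a' under f refers to f's w
    "g": {"w": g_w, "a": (g_w,)},  # array 'a' under g refers to g's w
}

print(get_path(d2, "f", "w").size, get_path(d2, "g", "w").size)
```

Because arrays hold references to dimension objects rather than names, `d2['f','a']` and `d2['g','a']` never collide even though both branches use the label 'w'.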

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
1047915016 https://github.com/pydata/xarray/issues/4118#issuecomment-1047915016 https://api.github.com/repos/pydata/xarray/issues/4118 IC_kwDOAMm_X84-deoI LunarLanding 4441338 2022-02-22T15:30:00Z 2022-02-22T15:38:52Z NONE

Often I run a function over a dataset, with each call outputting a hierarchical data structure, containing fixed dimensions in the best cases and variable length in the worst. For this, it would make more sense to be able to have dimensions (with optional labels and coordinates) assigned to nodes (and these would be inherited by any descendants). Leaf nodes would hold data. On merge, dimensions could be bubbled up as long as their lengths (and labels) matched. Operations with dimensions would then go down to the corresponding dimension level before applying the operator, i.e. container['A/B'].mean('time') would be different from container['A'].mean('time')['B'].
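A small sketch of the "bubble up on merge" step, under the assumption that a dimension may move to the parent only when every child declares it with the same length (labels are ignored here). `bubble_up` is a hypothetical helper, not an existing xarray function.

```python
from typing import Dict


def bubble_up(children: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Dimensions (name -> length) that every child declares with the same length."""
    if not children:
        return {}
    first, *rest = children.values()
    return {
        name: length
        for name, length in first.items()
        if all(other.get(name) == length for other in rest)
    }


children = {"A": {"time": 10, "x": 4}, "B": {"time": 10, "x": 7}}
print(bubble_up(children))  # 'time' matches in both children, 'x' does not
```

In this example 'time' could be hoisted to the parent, while 'x' has to stay on each child because the lengths differ.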

Datagroup and Datatree are subcases of this general structure, which could be enforced via flags/checks. Option 1 is where the extremities of the tree are a node with two sets of child nodes, dimension labels and n-dimensional arrays. Option 2 is where the extremities of the tree are a node with a child node for an n-dimensional array A, and a sibling node for each dimension of A, containing the corresponding labels.

I'm sure I'm missing some big issue with the mental model I have; for instance, I haven't thought about transformations or coordinates at all. But for clarity I tried to write it down below.

The most general structure for a dataset I can think of is a directed graph. Each node A is an n-dimensional (sparse) array, where each dimension D optionally points to a one-dimensional node B with the same length.

To get a hierarchical structure, we:

  • add edges of a different color, each with a label
  • restrict their graph to a tree T
  • add labels to each dimension D

We can resolve D's target by (A) checking for a sibling in T with the same name and, if there is none, going up one level and repeating (A).
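The lookup can be sketched as a walk from the array's parent towards the root. This is only an illustration of the rule as I described it; the `Node` class and `resolve_dim` are hypothetical.

```python
from typing import Dict, Optional


class Node:
    """A tree node that knows its parent and its named children."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: Dict[str, "Node"] = {}
        if parent is not None:
            parent.children[name] = self


def resolve_dim(array_node: Node, dim: str) -> Optional[Node]:
    """Look for a node named `dim` among the siblings, then one level up, and so on."""
    level = array_node.parent
    while level is not None:
        if dim in level.children:
            return level.children[dim]
        level = level.parent
    return None


# Tree: h/z, h/j/x, h/j/a where array a has dimensions x and z.
root = Node("root")
h = Node("h", root)
z = Node("z", h)
j = Node("j", h)
x = Node("x", j)
a = Node("a", j)

print(resolve_dim(a, "x").name, resolve_dim(a, "z").name)
```

Here `x` is found as a sibling of `a`, while `z` is found one level above; a name that appears nowhere on the way to the root resolves to `None`.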

MultiIndexes (multi-dimensional (sparse) labels) generalize this model, but require tuple labels in T's edges, i.e. h/j/a[x,y,z] has a sibling h/j/(x,y)[x,y], with z's labels being one level above, i.e. h/z[z] (the notation a[b] means a map from index b to value a).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
979412822 https://github.com/pydata/xarray/issues/4628#issuecomment-979412822 https://api.github.com/repos/pydata/xarray/issues/4628 IC_kwDOAMm_X846YKdW LunarLanding 4441338 2021-11-25T18:23:28Z 2021-11-25T18:23:28Z NONE

Any pointers regarding where to start / modules involved to implement this? I would like to have a try.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Lazy concatenation of arrays 753852119
979320921 https://github.com/pydata/xarray/pull/6028#issuecomment-979320921 https://api.github.com/repos/pydata/xarray/issues/6028 IC_kwDOAMm_X846X0BZ LunarLanding 4441338 2021-11-25T15:54:40Z 2021-11-25T15:54:40Z NONE

@keewis thanks, it is a duplicate. It must have worked in my specific case because I was testing with just one file.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow chunks=None for dask-less lazy loading 1063661422
828355557 https://github.com/pydata/xarray/issues/5228#issuecomment-828355557 https://api.github.com/repos/pydata/xarray/issues/5228 MDEyOklzc3VlQ29tbWVudDgyODM1NTU1Nw== LunarLanding 4441338 2021-04-28T10:46:59Z 2021-04-28T10:46:59Z NONE

I think I should not have expected a simple slice to make the choices that are necessary for evenly sampling a possibly irregular index. Assuming most users won't, I'm closing this. Below is what I'm using now.

    import numpy as np

    def make_intervals(a,b,c):
        return a+np.arange(0,1+int(np.floor((b-a)/c)))*c

    # d is the DataArray from the issue
    print(d.sel(dim0=make_intervals(0,1,.1),method='nearest',tolerance=.01))

    <xarray.DataArray (dim0: 11)>
    array([ 0, 10, 20, 30, 40, 49, 59, 69, 79, 89, 99])
    Coordinates:
      * dim0     (dim0) float64 0.0 0.101 0.202 0.303 ... 0.697 0.798 0.899 1.0
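The same selection can be sketched with NumPy alone, which shows what "nearest with a tolerance" amounts to. The 100-point index is an assumption inferred from the printed coordinates, and `nearest_indices` is a hypothetical helper, not xarray's implementation.

```python
import numpy as np


def make_intervals(a, b, c):
    # Evenly spaced targets a, a+c, a+2c, ... up to (at most) b.
    return a + np.arange(0, 1 + int(np.floor((b - a) / c))) * c


def nearest_indices(index, targets, tolerance):
    """Position of the index value closest to each target, within `tolerance`."""
    index = np.asarray(index)
    targets = np.asarray(targets)
    idx = np.abs(index[None, :] - targets[:, None]).argmin(axis=1)
    within = np.abs(index[idx] - targets) <= tolerance
    if not within.all():
        raise KeyError("no index value within tolerance for some targets")
    return idx


index = np.linspace(0, 1, 100)  # assumed: a 100-point index on [0, 1]
targets = make_intervals(0, 1, 0.1)
idx = nearest_indices(index, targets, tolerance=0.01)
print(len(targets), idx[0], idx[-1])
```

Targets that fall exactly halfway between two index values (0.5 here) can resolve to either neighbour depending on floating-point rounding, which is one of the choices a plain slice cannot make for you.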

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  sel(dim=slice(a,b,c)) only accepts integers for c, uses c as isel does 869786882
822729940 https://github.com/pydata/xarray/issues/5034#issuecomment-822729940 https://api.github.com/repos/pydata/xarray/issues/5034 MDEyOklzc3VlQ29tbWVudDgyMjcyOTk0MA== LunarLanding 4441338 2021-04-19T19:33:14Z 2021-04-19T19:33:14Z NONE

@dcherian I tried to reproduce it, but with this minimal example I couldn't, so I'm closing the issue.

```python
import xarray as xr
import numpy as np

n0 = 10
n1 = 3
x1 = xr.DataArray(np.empty((n0,n1),dtype=np.float64),dims=('dim0','dim1')).chunk({'dim0':2})
x2 = xr.DataArray(np.empty(n0,dtype=bool),dims=('dim0',)).chunk({'dim0':2})

n2 = 10

def f(x1,x2):
    # Returns two arrays of length n2 along the new core dimension 'dim2'.
    return np.empty(n2,dtype=x1.dtype),np.empty(n2,dtype=np.min_scalar_type(n2))

m,w = xr.apply_ufunc(
    f,
    x1,x2,
    input_core_dims=[('dim0','dim1'),('dim0',)],
    output_core_dims=[('dim2',),('dim2',)],
    vectorize=True,
    dask='parallelized',
    dask_gufunc_kwargs={
        'output_sizes':{'dim2':n2},
        'allow_rechunk':True,
        # 'meta':(np.empty((M,),dtype=p.dtype),np.empty((M,),dtype=np.min_scalar_type(M)))
        # 'output_dtypes':[p.dtype,np.min_scalar_type(M)],
    },
    output_dtypes=[x1.dtype,np.min_scalar_type(n2)] # now works
    # output_dtypes=(x.dtype,np.min_scalar_type(ny)) # works
)
m.compute(),w.compute()
# Output:
# (<xarray.DataArray (dim2: 10)>
#  array([1.e-323, 2.e-323, 3.e-323, 4.e-323, 5.e-323, 6.e-323, 7.e-323, 8.e-323, 9.e-323, 1.e-322])
#  Dimensions without coordinates: dim2,
#  <xarray.DataArray (dim2: 10)>
#  array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=uint8)
#  Dimensions without coordinates: dim2)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  output_dtypes needs to be a tuple, not a sequence 831148018
683824965 https://github.com/pydata/xarray/issues/1092#issuecomment-683824965 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDY4MzgyNDk2NQ== LunarLanding 4441338 2020-08-31T14:45:22Z 2020-08-31T15:15:20Z NONE

I did a ctrl-f for zarr in this issue, found nothing, so here's my two cents: it should be possible to write a Datagroup with either zarr or netcdf. I wonder if @emilbiju (posted https://github.com/pydata/xarray/issues/4118 ) has any of that code lying around; it could be a starting point. In general, a tree structure to which I can broadcast operations in the same dimensions to different datasets that do not necessarily share dimension lengths would solve my use case. This corresponds to bullet point number 3 in https://github.com/pydata/xarray/issues/1092#issuecomment-290159834 . My use case is a set of experiments that have: the same parameter variables, with different values; the same dimensions, with different lengths for their data. The parameters and data would benefit from having a hierarchical naming structure. Currently I build a master dataset containing experiment datasets, with a coordinate for each parameter. Then I map functions over it.
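As a toy illustration of that use case (plain Python, no xarray), assume each experiment stores one variable `signal` along a dimension 'time' whose length differs between experiments. The layout and the helper `map_over` are hypothetical.

```python
from statistics import mean

# Hypothetical experiments: same parameter names, same variable and dimension,
# but a different number of samples along 'time' in each one.
experiments = {
    "exp1": {"params": {"gain": 1.0}, "signal": [0.0, 1.0, 2.0]},
    "exp2": {"params": {"gain": 2.0}, "signal": [4.0, 6.0]},
}


def map_over(tree, func, variable):
    """Apply func to the same variable of every experiment, one result per experiment."""
    return {name: func(node[variable]) for name, node in tree.items()}


print(map_over(experiments, mean, "signal"))
```

The operation is broadcast over the experiments by name, so the differing lengths never have to be aligned or padded.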

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
673565228 https://github.com/pydata/xarray/issues/2300#issuecomment-673565228 https://api.github.com/repos/pydata/xarray/issues/2300 MDEyOklzc3VlQ29tbWVudDY3MzU2NTIyOA== LunarLanding 4441338 2020-08-13T16:04:04Z 2020-08-13T16:04:04Z NONE

I arrived here due to a different use case / problem, which I ultimately solved, but I think there's value in documenting it here. My use case is the following workflow:

1. take raw data, build a dataset, append it to a zarr store Z
2. analyze the data on Z, then maybe go to 1.

Step 2's performance is much better when the data on Z is chunked properly along the appending dimension 'frame' (chunks of size 50); however, step 1 only adds 1 element along it, so I end up with Z having chunks (1,1,1,1,1...) on 'frame'. On xarray 0.16.0, this seems solvable via the encoding parameter, if we take care to only use it on store creation. Before that version, I was using something like the monkey patch posted by @chrisbarber.

Code:

```python
import shutil
import xarray as xr
import numpy as np
import tempfile
zarr_path = tempfile.mkdtemp()

def append_test(ds,chunks):
    # Start from an empty store, then append one frame at a time.
    shutil.rmtree(zarr_path)
    for i in range(21):
        d = ds.isel(frame=slice(i,i+1))
        d = d.chunk(chunks)
        d.to_zarr(zarr_path,consolidated=True,**(dict(mode='a',append_dim='frame') if i>0 else {}))
    dsa = xr.open_zarr(str(zarr_path),consolidated=True)
    print(dsa.chunks,dsa.dims)

# sometime before 0.16.0
import contextlib
@contextlib.contextmanager
def change_determine_zarr_chunks(chunks):
    # Temporarily monkey-patch the private chunk-selection function.
    orig_determine_zarr_chunks = xr.backends.zarr._determine_zarr_chunks
    try:
        def new_determine_zarr_chunks(
            enc_chunks, var_chunks, ndim, name):
            da = ds[name]
            zchunks = tuple(chunks[dim] if (dim in chunks and chunks[dim] is not None) else da.shape[i] for i,dim in enumerate(da.dims))
            return zchunks
        xr.backends.zarr._determine_zarr_chunks = new_determine_zarr_chunks
        yield
    finally:
        xr.backends.zarr._determine_zarr_chunks = orig_determine_zarr_chunks

chunks = {'frame':10,'other':50}
ds = xr.Dataset({'data':xr.DataArray(data=np.random.rand(100,100),dims=('frame','other'))})

append_test(ds,chunks)
with change_determine_zarr_chunks(chunks):
    append_test(ds,chunks)

# with 0.16.0
def append_test_encoding(ds,chunks):
    shutil.rmtree(zarr_path)
    # Fix the on-disk chunking once, when the store is created.
    encoding = {}
    for k,v in ds.variables.items():
        encoding[k]={'chunks':tuple(chunks[dk] if dk in chunks else v.shape[i] for i,dk in enumerate(v.dims))}

    for i in range(21):
        d = ds.isel(frame=slice(i,i+1))
        d = d.chunk(chunks)
        d.to_zarr(zarr_path,consolidated=True,**(dict(mode='a',append_dim='frame') if i>0 else dict(encoding = encoding)))
    dsa = xr.open_zarr(str(zarr_path),consolidated=True)
    print(dsa.chunks,dsa.dims)

append_test_encoding(ds,chunks)
```

    Frozen(SortedKeysDict({'frame': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 'other': (50, 50)})) Frozen(SortedKeysDict({'frame': 21, 'other': 100}))
    Frozen(SortedKeysDict({'frame': (10, 10, 1), 'other': (50, 50)})) Frozen(SortedKeysDict({'frame': 21, 'other': 100}))
    Frozen(SortedKeysDict({'frame': (10, 10, 1), 'other': (50, 50)})) Frozen(SortedKeysDict({'frame': 21, 'other': 100}))
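The chunk tuples in that output can be reproduced without xarray or zarr. The sketch below isolates the two pieces of arithmetic involved: the per-variable `encoding` chunk tuple, and how an axis of a given length splits into chunks. Both helper names are hypothetical.

```python
def encoding_chunks(dims, shape, chunks):
    """Chunk tuple for one variable: the requested size, else the full axis length."""
    return tuple(chunks[d] if d in chunks else shape[i] for i, d in enumerate(dims))


def chunk_layout(length, chunk_size):
    """Sizes of the chunks that cover an axis of `length` elements."""
    full, rest = divmod(length, chunk_size)
    return (chunk_size,) * full + ((rest,) if rest else ())


chunks = {"frame": 10, "other": 50}
print(encoding_chunks(("frame", "other"), (100, 100), chunks))
print(chunk_layout(21, 10))   # 21 appended frames with chunk size 10
print(chunk_layout(100, 50))  # the 'other' axis
```

With a chunk size of 1 on 'frame', which is what appending one frame at a time produces without the encoding, the same arithmetic gives 21 chunks of length 1.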

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  zarr and xarray chunking compatibility and `to_zarr` performance 342531772

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 15.324ms · About: xarray-datasette