
issue_comments

10 rows where user = 25071375 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at author_association body reactions performed_via_github_app issue
1546942397 https://github.com/pydata/xarray/issues/5511#issuecomment-1546942397 https://api.github.com/repos/pydata/xarray/issues/5511 IC_kwDOAMm_X85cNHe9 josephnowak 25071375 2023-05-14T16:41:38Z 2023-05-14T17:03:57Z CONTRIBUTOR

Hi @shoyer, sorry for bothering you with this issue again. I know it is old by now, but I have been dealing with it again over the last few days, and I have also noticed the same problem when using the region parameter. Based on the issue I opened on Zarr (https://github.com/zarr-developers/zarr-python/issues/1414), I think it would be good to implement one of these options to solve the problem:

  1. A warning in the docs indicating that it is necessary to add a synchronizer if you want to append or update data in a Zarr store, or that you need to manually align the chunks based on the size of the missing data in the last chunk to get independent writes.

  2. Automatically align the chunks to get independent writes (which I think can produce slower writes due to the modification of the chunks); see the sketch after this list.

  3. Raise an error if there is no synchronizer and the chunks are not properly aligned. I think the error can be controlled with the safe_chunks parameter that the to_zarr method already offers.
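
A minimal sketch of the alignment in option 2, assuming a 1-D append (align_append_chunks is a hypothetical helper, not existing xarray or Zarr API): rechunk the appended Dask array so that every Dask chunk boundary falls on a Zarr chunk boundary, which makes the writes independent.

```py
import dask.array as da


def align_append_chunks(append_data: da.Array, stored_size: int, zarr_chunk: int) -> da.Array:
    # Elements still free in the last (possibly partial) Zarr chunk of the store.
    remainder = -stored_size % zarr_chunk
    n = append_data.shape[0]
    chunks = []
    if remainder:
        # First Dask chunk just fills up the partial Zarr chunk.
        chunks.append(min(remainder, n))
        n -= chunks[-1]
    # Then full Zarr-sized chunks, plus whatever is left at the end.
    chunks += [zarr_chunk] * (n // zarr_chunk)
    if n % zarr_chunk:
        chunks.append(n % zarr_chunk)
    return append_data.rechunk({0: tuple(chunks)})
```

For example, with 152 elements already stored and Zarr chunks of 30, appending 156 elements yields Dask chunks (28, 30, 30, 30, 30, 8), so no two write tasks touch the same Zarr chunk.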

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Appending data to a dataset stored in Zarr format produce PermissonError or NaN values in the final result 927617256
1002302800 https://github.com/pydata/xarray/pull/6118#issuecomment-1002302800 https://api.github.com/repos/pydata/xarray/issues/6118 IC_kwDOAMm_X847ve1Q josephnowak 25071375 2021-12-28T22:15:36Z 2021-12-28T22:15:36Z CONTRIBUTOR

@dcherian I think that with the last changes everything is ready.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  New algorithm for forward filling 1089504942
1001786877 https://github.com/pydata/xarray/pull/6118#issuecomment-1001786877 https://api.github.com/repos/pydata/xarray/issues/6118 IC_kwDOAMm_X847tg39 josephnowak 25071375 2021-12-27T22:42:07Z 2021-12-28T15:09:14Z CONTRIBUTOR

This fixes #6112 and also enables the limit option (I did not find an existing issue about this).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  New algorithm for forward filling 1089504942
1001787425 https://github.com/pydata/xarray/issues/6112#issuecomment-1001787425 https://api.github.com/repos/pydata/xarray/issues/6112 IC_kwDOAMm_X847thAh josephnowak 25071375 2021-12-27T22:44:43Z 2021-12-27T22:45:04Z CONTRIBUTOR

I will be on the lookout for any changes that may be required.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Forward Fill not working when there are all-NaN chunks 1088893989
1001740657 https://github.com/pydata/xarray/issues/6112#issuecomment-1001740657 https://api.github.com/repos/pydata/xarray/issues/6112 IC_kwDOAMm_X847tVlx josephnowak 25071375 2021-12-27T20:27:16Z 2021-12-27T20:27:16Z CONTRIBUTOR

Two questions:

  1. Is it possible to set the array used for test_push_dask to np.array([np.nan, 1, 2, 3, np.nan, np.nan, np.nan, np.nan, 4, 5, np.nan, 6])? Using that array you can validate the test case that I put on this issue without creating another array (it's the original array but permuted).

  2. Can I erase the conditional that checks for the case where all the chunks have size 1? I think that with the new method it is no longer necessary:

```py
# I think this is only necessary due to the use of map_overlap in the previous method.
if all(c == 1 for c in array.chunks[axis]):
    array = array.rechunk({axis: 2})
```
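
A minimal sketch of how that test case could be checked (hypothetical; with chunks of size 2 two consecutive chunks are all-NaN, which is exactly the case the new algorithm must handle):

```py
import numpy as np
import xarray as xr

data = np.array([np.nan, 1, 2, 3, np.nan, np.nan, np.nan, np.nan, 4, 5, np.nan, 6])
arr = xr.DataArray(data, dims="x").chunk({"x": 2})  # chunks 3 and 4 are all-NaN

# Expected result of a forward fill along "x" (requires bottleneck).
expected = np.array([np.nan, 1, 2, 3, 3, 3, 3, 3, 4, 5, 5, 6])
np.testing.assert_array_equal(arr.ffill("x").compute().values, expected)
```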

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Forward Fill not working when there are all-NaN chunks 1088893989
1001676665 https://github.com/pydata/xarray/issues/6112#issuecomment-1001676665 https://api.github.com/repos/pydata/xarray/issues/6112 IC_kwDOAMm_X847tF95 josephnowak 25071375 2021-12-27T17:53:07Z 2021-12-27T17:59:57Z CONTRIBUTOR

Yes, of course. By the way, would it be possible to add something like the following code for the case where there is a limit? I know this code generates about 4x more tasks, but at least it does the job, so probably a warning would be sufficient. (If it is not good enough to be added, no problem; building the graph manually would probably be a better option than using this algorithm for forward filling with limits.)

```py
import numpy as np
import xarray as xr
import dask.array as da
from bottleneck import push


def ffill(x: xr.DataArray, dim: str, limit=None):
    def _fill_with_last_one(a, b):
        # cumreduction applies the push func over all the blocks first, so
        # the only missing part is filling the missing values using
        # the last data of every one of them
        if isinstance(a, np.ma.masked_array) or isinstance(b, np.ma.masked_array):
            a = np.ma.getdata(a)
            b = np.ma.getdata(b)
            values = np.where(~np.isnan(b), b, a)
            return np.ma.masked_array(values, mask=np.ma.getmaskarray(b))

        return np.where(~np.isnan(b), b, a)

    def _ffill(arr):
        return xr.DataArray(
            da.reductions.cumreduction(
                func=push,
                binop=_fill_with_last_one,
                ident=np.nan,
                x=arr.data,
                axis=arr.dims.index(dim),
                dtype=arr.dtype,
                method="sequential",
            ),
            dims=arr.dims,
            coords=arr.coords,
        )

    if limit is not None:
        axis = x.dims.index(dim)
        # Position index along `dim`, broadcast to the shape and chunks of x.
        arange = xr.DataArray(
            da.broadcast_to(
                da.arange(
                    x.shape[axis],
                    chunks=x.chunks[axis],
                    dtype=x.dtype,
                ).reshape(
                    tuple(size if i == axis else 1 for i, size in enumerate(x.shape))
                ),
                x.shape,
                x.chunks,
            ),
            coords=x.coords,
            dims=x.dims,
        )
        # Distance to the last valid value; only fill within `limit` steps.
        valid_limits = (arange - _ffill(arange.where(x.notnull(), np.nan))) <= limit
        return _ffill(x).where(valid_limits, np.nan)

    return _ffill(x)
```
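
A quick check of the sketch above (hypothetical data; with limit=1 the fill should stop one step after each valid value):

```py
import numpy as np
import xarray as xr
import dask.array as da

# Uses the ffill defined above.
arr = xr.DataArray(
    da.from_array(np.array([1.0, np.nan, np.nan, np.nan, 5.0, np.nan]), chunks=2),
    dims="t",
)
print(ffill(arr, "t", limit=1).compute().values)
# -> [ 1.  1. nan nan  5.  5.]
```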

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Forward Fill not working when there are all-NaN chunks 1088893989
1001656569 https://github.com/pydata/xarray/issues/6112#issuecomment-1001656569 https://api.github.com/repos/pydata/xarray/issues/6112 IC_kwDOAMm_X847tBD5 josephnowak 25071375 2021-12-27T17:00:53Z 2021-12-27T17:00:53Z CONTRIBUTOR

Probably you can implement the forward fill using the same logic as dask's cumsum and cumprod. I checked the dask code that is in Xarray a little, and apparently none of it uses HighLevelGraph, so if the idea is to avoid building the graph manually, I think you can use dask's cumreduction function to do the work (there is probably a better dask function for this kind of computation, but I haven't found it).

```py
import numpy as np
import xarray as xr
import dask.array as da
from bottleneck import push


def ffill(x: xr.DataArray, dim: str):
    def _fill_with_last_one(a, b):
        # cumreduction applies the push func over all the blocks first, so
        # the only missing part is filling the missing values using
        # the last data of every one of them
        if isinstance(a, np.ma.masked_array) or isinstance(b, np.ma.masked_array):
            a = np.ma.getdata(a)
            b = np.ma.getdata(b)
            values = np.where(~np.isnan(b), b, a)
            return np.ma.masked_array(values, mask=np.ma.getmaskarray(b))

        return np.where(~np.isnan(b), b, a)

    return xr.DataArray(
        da.reductions.cumreduction(
            func=push,
            binop=_fill_with_last_one,
            ident=np.nan,
            x=x.data,
            axis=x.dims.index(dim),
            dtype=x.dtype,
            method="sequential",
        ),
        dims=x.dims,
        coords=x.coords,
    )
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Forward Fill not working when there are all-NaN chunks 1088893989
973623524 https://github.com/pydata/xarray/issues/3810#issuecomment-973623524 https://api.github.com/repos/pydata/xarray/issues/3810 IC_kwDOAMm_X846CFDk josephnowak 25071375 2021-11-19T01:00:11Z 2021-11-19T15:09:10Z CONTRIBUTOR

Is it possible to add an option to control what happens when there is a tie in the rank? (If you want, I can create a separate issue for this.)

I think this can be done using scipy's rankdata function instead of bottleneck's rank (though adding a method option to the bottleneck package should also be possible).

Small example:

```py
import dask.array
import xarray
from scipy.stats import rankdata

arr = xarray.DataArray(
    dask.array.random.random((11, 10), chunks=(3, 2)),
    coords={'a': list(range(11)), 'b': list(range(10))}
)


def rank(x: xarray.DataArray, dim: str, method: str):
    # This option generates fewer tasks, I don't know why
    axis = x.dims.index(dim)
    return xarray.DataArray(
        dask.array.apply_along_axis(
            rankdata,
            axis,
            x.data,
            dtype=float,
            shape=(x.sizes[dim],),
            method=method
        ),
        coords=x.coords,
        dims=x.dims
    )


def rank2(x: xarray.DataArray, dim: str, method: str):
    axis = x.dims.index(dim)
    return xarray.apply_ufunc(
        rankdata,
        x.chunk({dim: x.sizes[dim]}),
        dask='parallelized',
        kwargs={'method': method, 'axis': axis},
        meta=x.data._meta
    )


arr_rank1 = rank(arr, 'a', 'ordinal')
arr_rank2 = rank2(arr, 'a', 'ordinal')

assert arr_rank1.equals(arr_rank2)
```

Probably something like this can work for ranking arrays with NaN values:

```py
import numpy as np
from scipy.stats import rankdata


def _nanrankdata1(a, method):
    y = np.empty(a.shape, dtype=np.float64)
    y.fill(np.nan)
    idx = ~np.isnan(a)
    y[idx] = rankdata(a[idx], method=method)
    return y
```
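
A hypothetical combination of that helper with the dask.array.apply_along_axis approach from the first snippet (nanrank is not existing xarray API), ranking along dim while leaving NaNs in place:

```py
def nanrank(x: xarray.DataArray, dim: str, method: str):
    axis = x.dims.index(dim)
    return xarray.DataArray(
        dask.array.apply_along_axis(
            _nanrankdata1,   # ranks the non-NaN values of each 1-D slice
            axis,
            x.data,
            dtype=float,
            shape=(x.sizes[dim],),
            method=method,
        ),
        coords=x.coords,
        dims=x.dims,
    )
```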

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  {DataArray,Dataset}.rank() should support an optional list of dimensions 572875480
869196682 https://github.com/pydata/xarray/issues/5511#issuecomment-869196682 https://api.github.com/repos/pydata/xarray/issues/5511 MDEyOklzc3VlQ29tbWVudDg2OTE5NjY4Mg== josephnowak 25071375 2021-06-27T17:15:20Z 2021-06-27T17:15:20Z CONTRIBUTOR

Hi again. I checked the behavior of Zarr and Dask a little more, and I found that the problem only occurs when the lock option of dask's store method is set to None or False; below you can find an example:

```py
import numpy as np
import zarr
import dask.array as da

# Writing a small Zarr array with 42.2 as the value
z1 = zarr.open('data/example.zarr', mode='w', shape=(152,), chunks=(30,), dtype='f4')
z1[:] = 42.2

# Resizing the array
z2 = zarr.open('data/example.zarr', mode='a')
z2.resize(308)

# New data to append
append_data = da.from_array(np.array([50.3] * 156), chunks=(30,))

# If you pass None or False to the lock parameter you will get the
# PermissionError or some 0s in the final result, so I think this is the
# problem when Xarray writes to Zarr with Dask (I saw in the code that it
# uses lock=None by default). If you set lock=True all the problems disappear.
da.store(append_data, z2, regions=[tuple([slice(152, 308)])], lock=None)

# The result can contain many 0s or throw an error
print(z2[:])
```

Hope this helps to fix the bug.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Appending data to a dataset stored in Zarr format produce PermissonError or NaN values in the final result 927617256
867715379 https://github.com/pydata/xarray/issues/5511#issuecomment-867715379 https://api.github.com/repos/pydata/xarray/issues/5511 MDEyOklzc3VlQ29tbWVudDg2NzcxNTM3OQ== josephnowak 25071375 2021-06-24T15:08:47Z 2021-06-24T15:08:47Z CONTRIBUTOR

Hi (sorry if this sounds annoying). I checked the code used to append data to Zarr stores a little, and from my perspective the logic is correct: it takes into account the case where the last chunks have different shapes, because it works with the shape of the unmodified array and then resizes it to write in regions with Dask.

I ran the same code that I left in the previous comment, but passed a synchronizer to the 'to_zarr' method (synchronizer=zarr.ThreadSynchronizer()), and all the problems related to the NaNs and the PermissionErrors disappeared, so this looks more like a synchronization problem between Zarr and Dask.
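
For reference, a minimal sketch of that workaround (the dataset contents and path here are placeholders, not from the original report):

```py
import xarray as xr
import zarr

ds = xr.Dataset({"x": ("time", [50.3] * 156)})  # placeholder data
ds.to_zarr(
    "data/example.zarr",
    append_dim="time",
    synchronizer=zarr.ThreadSynchronizer(),  # serializes concurrent writes to shared chunks
)
```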

Hope this helps to fix the bug.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Appending data to a dataset stored in Zarr format produce PermissonError or NaN values in the final result 927617256

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);