
issue_comments


24 rows where author_association = "CONTRIBUTOR" and user = 8382834 sorted by updated_at descending


id · html_url · issue_url · node_id · user · created_at · updated_at ▲ · author_association · body · reactions · performed_via_github_app · issue
1523870704 https://github.com/pydata/xarray/issues/7789#issuecomment-1523870704 https://api.github.com/repos/pydata/xarray/issues/7789 IC_kwDOAMm_X85a1Gvw jerabaul29 8382834 2023-04-26T18:30:58Z 2023-04-26T18:32:33Z CONTRIBUTOR

Just found the solution (ironically, I had been bumping my head against this for quite a while before writing this issue, and then found the solution right after writing it): one needs to provide both account_name and sas_token together. The adlfs exception actually points to the right problem; I was just confused. I.e., this works:

xr.open_mfdataset([filename], engine="zarr", storage_options={'account_name':AZURE_STORAGE_ACCOUNT_NAME, 'sas_token': AZURE_STORAGE_SAS})
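For reference, here is a self-contained restatement of the call with placeholder values; the container URL, account name, and token below are illustrative, not taken from this thread:

```python
import xarray as xr

# Both credentials are forwarded to the adlfs filesystem through storage_options.
# All values below are placeholders.
ds = xr.open_mfdataset(
    ["abfs://my-container/path/to/store.zarr"],  # illustrative container path
    engine="zarr",
    storage_options={
        "account_name": "mystorageaccount",      # placeholder account name
        "sas_token": "?sv=2021-06-08&sig=...",   # placeholder SAS token
    },
)
```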

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Cannot access zarr data on Azure using shared access signatures (SAS) 1685503657
1373697191 https://github.com/pydata/xarray/issues/7421#issuecomment-1373697191 https://api.github.com/repos/pydata/xarray/issues/7421 IC_kwDOAMm_X85R4PSn jerabaul29 8382834 2023-01-06T14:13:52Z 2023-01-06T14:13:52Z CONTRIBUTOR

Creating a conda environment as you suggest, I am fully able to read the file etc., so this solves my issue. Many thanks! Then I guess this means there is some weird issue leading to segfaults on this file with some of the older libnetcdf versions. Closing, since using a conda env and a more recent stack fixes things.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Opening a variable as several chunks works fine, but opening it "fully" crashes 1520760951
1373600032 https://github.com/pydata/xarray/issues/7421#issuecomment-1373600032 https://api.github.com/repos/pydata/xarray/issues/7421 IC_kwDOAMm_X85R33kg jerabaul29 8382834 2023-01-06T13:09:29Z 2023-01-06T13:09:29Z CONTRIBUTOR

Ok, thanks, this crashes on my machine too. Then it is likely something to do with my software stack somewhere; I will try a new mamba / conda environment and check whether this fixes things.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Opening a variable as several chunks works fine, but opening it "fully" crashes 1520760951
1373587759 https://github.com/pydata/xarray/issues/7421#issuecomment-1373587759 https://api.github.com/repos/pydata/xarray/issues/7421 IC_kwDOAMm_X85R30kv jerabaul29 8382834 2023-01-06T12:57:38Z 2023-01-06T12:57:38Z CONTRIBUTOR

@keewis regarding the engine: I have netCDF4 installed and I do not provide a dedicated engine in the open_dataset command, so I guess this is using the netcdf4 engine by default? Is there a command I can run to double-check and confirm to you? :)
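A possible way to check (a sketch, assuming a recent xarray where `xr.backends.list_engines()` is available; `input_file` is the file from the reproduction in this thread):

```python
import xarray as xr

# List the backends xarray has detected (netcdf4, h5netcdf, scipy, zarr, ...).
print(xr.backends.list_engines())

# Forcing the engine explicitly removes the guesswork about which backend is used.
ds = xr.open_dataset(input_file, decode_times=False, engine="netcdf4")
```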

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Opening a variable as several chunks works fine, but opening it "fully" crashes 1520760951
1373584045 https://github.com/pydata/xarray/issues/7421#issuecomment-1373584045 https://api.github.com/repos/pydata/xarray/issues/7421 IC_kwDOAMm_X85R3zqt jerabaul29 8382834 2023-01-06T12:53:18Z 2023-01-06T12:53:18Z CONTRIBUTOR

@keewis interesting. Just to be sure: I am able to open the dataset just fine too, the issue arises when trying to actually read the field, i.e.:

xr_file = xr.open_dataset(input_file, decode_times=False)

is just fine, but

xr_file["accD"][0, 0:3235893].data

is what segfaults. Just to be sure there is no misunderstanding: are you actually able to run the last command without issue? :)
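For completeness, a sketch of the chunked access that, per the issue title, does not crash (assuming dask is installed; `chunks="auto"` is just one way to request chunked reads):

```python
import xarray as xr

# With dask chunks, the variable is read from libnetcdf in pieces rather than
# as one large contiguous request.
xr_file = xr.open_dataset(input_file, decode_times=False, chunks="auto")
data = xr_file["accD"][0, 0:3235893].data.compute()
```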

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Opening a variable as several chunks works fine, but opening it "fully" crashes 1520760951
1373580808 https://github.com/pydata/xarray/issues/7421#issuecomment-1373580808 https://api.github.com/repos/pydata/xarray/issues/7421 IC_kwDOAMm_X85R3y4I jerabaul29 8382834 2023-01-06T12:49:05Z 2023-01-06T12:49:05Z CONTRIBUTOR

I got help to extract more information in gdb; converting the ipynb to a py file and running it in gdb context:

```
jupyter nbconvert --to script issue_opening_2018_03_b.ipynb
[NbConvertApp] Converting notebook issue_opening_2018_03_b.ipynb to script
[NbConvertApp] Writing 1313 bytes to issue_opening_2018_03_b.py

gdb --args python3 issue_opening_2018_03_b.py
[...]
(gdb) run
Starting program: /usr/bin/python3 issue_opening_2018_03_b.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[...]
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:314
314     ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:314
#1  0x00007ffff6af4bdc in NC4_get_vars () from /home/jrmet/.local/lib/python3.8/site-packages/netCDF4/.libs/libnetcdf-5e98d7e6.so.15.0.0
#2  0x00007ffff6af337d in NC4_get_vara () from /home/jrmet/.local/lib/python3.8/site-packages/netCDF4/.libs/libnetcdf-5e98d7e6.so.15.0.0
#3  0x00007ffff6a959aa in NC_get_vara () from /home/jrmet/.local/lib/python3.8/site-packages/netCDF4/.libs/libnetcdf-5e98d7e6.so.15.0.0
#4  0x00007ffff6a96b9b in nc_get_vara () from /home/jrmet/.local/lib/python3.8/site-packages/netCDF4/.libs/libnetcdf-5e98d7e6.so.15.0.0
#5  0x00007ffff6ec24bc in ?? () from /home/jrmet/.local/lib/python3.8/site-packages/netCDF4/_netCDF4.cpython-38-x86_64-linux-gnu.so
#6  0x00000000005f5b39 in PyCFunction_Call ()
```

which seems to be originating in libnetcdf?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Opening a variable as several chunks works fine, but opening it "fully" crashes 1520760951
1340951101 https://github.com/pydata/xarray/issues/7363#issuecomment-1340951101 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P7Uo9 jerabaul29 8382834 2022-12-07T13:14:56Z 2022-12-07T13:14:56Z CONTRIBUTOR

(Really feeling bad about missing your nice suggestion @headtr1ck, I must find a better way to jump between computer / smartphone / tablet without missing comments :see_no_evil:. Again, thanks for all the help.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1340947035 https://github.com/pydata/xarray/issues/7363#issuecomment-1340947035 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P7Tpb jerabaul29 8382834 2022-12-07T13:12:28Z 2022-12-07T13:12:28Z CONTRIBUTOR

Oooh I am so sorry @headtr1ck, apologies. I rely a lot on the email notifications to check things, and your message and the one from @keewis arrived at the same time, so I missed yours. Really sorry, and many thanks for pointing to this first; my bad.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1340939982 https://github.com/pydata/xarray/issues/7363#issuecomment-1340939982 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P7R7O jerabaul29 8382834 2022-12-07T13:07:10Z 2022-12-07T13:07:10Z CONTRIBUTOR

Following the pointer by @keewis, I just did:

extended_observations = previous_observations.pad(pad_width={"time": (0, needed_padding)}, mode="constant", constant_values=-999)

This runs nearly instantaneously and does exactly what I need. Many thanks to all for your help, and sorry for missing that the pad function existed. I will close for now (the only remaining question is why the call to reindex is so costly on my machine; I wonder whether some old version of an underlying library is at fault).
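For readers landing here, a minimal toy version of the pad call above (sizes shrunk for illustration; note that the padded part of the `time` coordinate itself appears to be filled with NaN, so new coordinate values still need to be assigned separately):

```python
import numpy as np
import xarray as xr

# Toy dataset: 3 stations x 10 time steps.
ds = xr.Dataset(
    {"test": (("station", "time"), np.random.rand(3, 10))},
    coords={"time": np.arange(10)},
)

# Extend the time dimension by 4 entries at the end, filling the data with -999.
extended = ds.pad(pad_width={"time": (0, 4)}, mode="constant", constant_values=-999)
print(extended.sizes)  # -> station: 3, time: 14
```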

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1340816132 https://github.com/pydata/xarray/issues/7363#issuecomment-1340816132 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P6zsE jerabaul29 8382834 2022-12-07T11:14:03Z 2022-12-07T11:14:03Z CONTRIBUTOR

Aaah, you are right @keewis, pad should do exactly what I need :) . Many thanks. Interestingly, I did spend a bit of time looking for this and somehow could not find it; it is always hard to find the right function when you do not know in advance what name to look for :) .

Then I will check the use of pad this afternoon and I think this will fit my need. Still not sure why reindex was so problematic on my machine.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1340806519 https://github.com/pydata/xarray/issues/7363#issuecomment-1340806519 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P6xV3 jerabaul29 8382834 2022-12-07T11:06:21Z 2022-12-07T11:06:21Z CONTRIBUTOR

Yes, this is representative of my dataset :) .

Ok, interesting. I ran this on my machine (Ubuntu 20.04, with 16 GB of RAM, 15.3 GB reported by the system as maximum available memory).

  • I start at around 6 GB used, i.e. 9.3 GB available
  • I run the script in ipython3; after a few seconds my machine exhausts its RAM and freezes, then the process gets killed:

```
[ins] In [1]: import numpy as np
         ...: import xarray as xr
         ...: import datetime
         ...:
         ...: # create two timeseries', second is for reindex
         ...: itime = np.arange(0, 3208464).astype("<M8[s]")
         ...: itime2 = np.arange(0, 4000000).astype("<M8[s]")
         ...:
         ...: # create two dataset with the time only
         ...: ds1 = xr.Dataset({"time": itime})
         ...: ds2 = xr.Dataset({"time": itime2})
         ...:
         ...: # add random data to ds1
         ...: ds1 = ds1.expand_dims("station")
         ...: ds1 = ds1.assign({"test": (["station", "time"], np.random.rand(106, 3208464))})

[ins] In [2]: %%time
         ...: ds3 = ds1.reindex(time=ds2.time)
         ...:
Killed
```

I will try again later with fewer things open, so that I can start from lower RAM usage / more available RAM, and see whether this helps.

Could this be a difference in performance due to different versions? What kind of machine are you running on? Still, not being able to do this with over 9 GB of RAM available feels a bit limiting :) .
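One hedged mitigation, not suggested in the thread (assuming dask is installed): chunk ds1 before the reindex so the operation stays lazy and memory is only needed per chunk when the result is computed or written.

```python
# Continuing the reproduction above: a dask-backed reindex builds a task graph
# instead of materialising the full extended array in RAM at once.
ds1_lazy = ds1.chunk({"time": 500_000})
ds3 = ds1_lazy.reindex(time=ds2.time, fill_value=-999.0)  # lazy
ds3.to_netcdf("extended.nc")  # illustrative output path; evaluated chunk by chunk
```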

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1340744559 https://github.com/pydata/xarray/issues/7363#issuecomment-1340744559 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P6iNv jerabaul29 8382834 2022-12-07T10:38:45Z 2022-12-07T10:39:10Z CONTRIBUTOR

Good point that a timeseries is usually already ordered (and that the corresponding type should support comparison / weak ordering), but I wonder (only speculating): is it possible that xarray makes no assumptions of this kind, in order to be as general as possible (not all xarray datasets are timeseries :) ), in which case there may or may not be a dichotomy search under the hood?
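For what it is worth, a rough illustration of the kind of lookup this relies on (my understanding, to be confirmed: xarray's reindex delegates to pandas indexes, whose `get_indexer` is hash-table based rather than a linear scan, so no ordering is assumed):

```python
import numpy as np
import pandas as pd

old = pd.Index(np.arange(3_000_000))  # existing time steps
new = np.arange(3_500_000)            # extended time axis
pos = old.get_indexer(new)            # position in `old` for each new label
print((pos == -1).sum())              # 500000 labels absent from the old index
```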

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1340701570 https://github.com/pydata/xarray/issues/7363#issuecomment-1340701570 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P6XuC jerabaul29 8382834 2022-12-07T10:12:07Z 2022-12-07T10:12:20Z CONTRIBUTOR

Regarding your questions @kmuehlbauer :

  • the dataset does fit in memory; dumped as netCDF to the hard drive, it takes about 2.7 GB, as you say :) .
  • I am not using dask-backed arrays
  • I think you are right, the typical RAM consumption due to this is probably around the 6 GB you mention :)
  • I did a bit of testing and it looks like VS Code (which I am using to run the notebooks) wastes a lot of RAM, in particular when there are large cell outputs, so that may be the biggest culprit here...

Regarding your question @keewis : nothing special here:

In []  repr(timestamps_extended_basis)
Out [] 'array([-315619200, -315618600, -315618000, ..., 1667259000, 1667259600,\n 1667260200])'

but it is quite big:

In []  len(timestamps_extended_basis)
Out [] 3304800

Regarding the points of discussion / suggestions:

  • I think that the suggestion of @keewis to use concat is nice. This is probably how I will solve things for now :) .
  • @kmuehlbauer is it so surprising that the call to reindex is really slow? :) I am not sure how reindex finds the match between new and previous indexes, but if it does a lookup for each new index of "does this new index exist in the old indexes" by just walking through them, that is potentially a heavy computational cost, right? (i.e. trying 3.5e6 times to see whether an element is present among 3e6 elements). I do not know how this is implemented in practice (for example, is it possible that reindex first sorts the previous indexes (which would require an ordering relation) and then uses a dichotomy search rather than a naive one? That would cut the complexity down quite a bit.) But in any case, when just adding new indexes at the end of the existing ones and keeping the old indexes unchanged, this will always be quite a lot more work than just concatenating / extending the arrays, right? :)

My feeling is that, while concat works, this operation may be common enough that there could be interest in implementing a "grow_coordinate" function to grow / re-allocate larger arrays, copying the previous data along a coordinate, as a usability / convenience feature. Something like:

xr_dataset.grow_coordinate(coordinate_grow_along="time", location_extra_capacity="end", default_fill=1.0e37)

which would grow the coordinate "time" itself and all data variables that depend on it, adding the default-filled extra entries at the end. I am not sure whether this should operate on coordinates or on dimensions; I am a bit of a n00b on this and always confused about coordinates vs. dimensions.
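A hypothetical sketch of such a helper, built on the pad call discussed earlier in this thread (`grow_coordinate` is not an xarray API; the name and parameters just mirror the suggestion above, with the growth fixed at the end of the dimension):

```python
import xarray as xr

def grow_coordinate(ds: xr.Dataset, coordinate_grow_along: str,
                    extra_entries: int, default_fill: float) -> xr.Dataset:
    """Hypothetical helper: extend the given dimension by `extra_entries` at the
    end of every variable that uses it, filling new data entries with
    `default_fill`. Thin wrapper around Dataset.pad; the coordinate values for
    the new entries would still need to be assigned by the caller."""
    return ds.pad(pad_width={coordinate_grow_along: (0, extra_entries)},
                  mode="constant", constant_values=default_fill)

# Usage mirroring the suggestion above (96_336 = 3_304_800 - 3_208_464):
# grown = grow_coordinate(xr_dataset, "time", extra_entries=96_336, default_fill=1.0e37)
```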

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1339973248 https://github.com/pydata/xarray/issues/7363#issuecomment-1339973248 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P3l6A jerabaul29 8382834 2022-12-06T20:33:38Z 2022-12-06T20:33:38Z CONTRIBUTOR

(and I guess this pattern of appending at the end of time dimension is quite common)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1339972500 https://github.com/pydata/xarray/issues/7363#issuecomment-1339972500 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P3luU jerabaul29 8382834 2022-12-06T20:32:49Z 2022-12-06T20:32:49Z CONTRIBUTOR

@keewis I will be back at my computer tomorrow, but the basis is big: going from about 3 million time points before growing to 3.5 million after, with 100 'stations' each having this number of time points. So if reindexing does a search without dichotomy for each station and each time point, that may take some time. The specific situation here is that the first 3 million time points are unchanged and the new 500k are just empty by default, but I guess reindex has no way to know that if it is written to be general?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1339607893 https://github.com/pydata/xarray/issues/7363#issuecomment-1339607893 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P2MtV jerabaul29 8382834 2022-12-06T16:07:17Z 2022-12-06T16:07:17Z CONTRIBUTOR

The call to reindex is eating up my RAM and has not finished after 15 minutes, so I am killing it. I will apply the "allocate larger np arrays, block-copy pre-existing data, create new dataset" approach instead; it could be useful to have a turnkey function for doing so :) .

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1339595819 https://github.com/pydata/xarray/issues/7363#issuecomment-1339595819 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P2Jwr jerabaul29 8382834 2022-12-06T15:59:19Z 2022-12-06T16:00:51Z CONTRIBUTOR

This has been running for 10 minutes now; if there is a "naive", non-searchsorted lookup for every entry (which would make sense, as there is no reason to make assumptions about what the index looks like), reindex may take a reeeeeally long time. I think I will drop this in a few minutes and instead: i) create extended numpy arrays, ii) extract the xarray data as numpy arrays, iii) block-copy the data that is not modified, iv) block-fill the data that is modified.
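A sketch of steps i)-iv) in plain numpy, with toy sizes standing in for the real ones (in this thread, roughly 100 stations by ~3.2e6 time steps grown to ~3.3e6):

```python
import numpy as np

n_station, n_time_old, n_time_new = 106, 32_084, 33_048   # toy sizes
old = np.random.rand(n_station, n_time_old)               # i)/ii) existing data as numpy

new = np.empty((n_station, n_time_new), dtype=old.dtype)  # i) allocate the extended array
new[:, :n_time_old] = old                                 # iii) block-copy unchanged data
new[:, n_time_old:] = -999.0                              # iv) block-fill the new entries
```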

So this discussion may still be relevant for adding a new way of extending: re-allocate with more memory at the end of a dimension, copy the previously existing data up to the previous size, and fill the newly created entries with a user-provided value, as this will be much faster than using reindex with a lookup for every entry.

I think this is a quite typical workflow needed when working in geosciences and adding some new observations to an aggregated dataset, so this may be useful for quite many people :) .

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1339588658 https://github.com/pydata/xarray/issues/7363#issuecomment-1339588658 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P2IAy jerabaul29 8382834 2022-12-06T15:53:53Z 2022-12-06T15:53:53Z CONTRIBUTOR

You are right, many thanks, applying the first solution works fine :) .

New "issue": the call to reindex seems to take a lot of time (I guess because there is a lookup for every single entry), while extending a numpy array would be close to instantaneous from my point of view.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1339549627 https://github.com/pydata/xarray/issues/7363#issuecomment-1339549627 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P1-e7 jerabaul29 8382834 2022-12-06T15:25:33Z 2022-12-06T15:25:33Z CONTRIBUTOR

A bit of context (sorry in advance for screenshots rather than snippets; I could generate snippets if needed, it would just be a bit of extra work): my dataset initially looks like this (from a netCDF file):

I add a coord so that it fits the documentation above:

however the reindex then fails (either I use time or timestamps):

If you have an idea why (I googled the error message, could not find much, though I may have missed something), this could be great :) .

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1339540315 https://github.com/pydata/xarray/issues/7363#issuecomment-1339540315 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P18Nb jerabaul29 8382834 2022-12-06T15:18:26Z 2022-12-06T15:18:26Z CONTRIBUTOR

Sorry, actually it does seem to work when following the example from the documentation above and adding a station... Then I need to understand why it does not work in my case.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1339533763 https://github.com/pydata/xarray/issues/7363#issuecomment-1339533763 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P16nD jerabaul29 8382834 2022-12-06T15:13:48Z 2022-12-06T15:13:48Z CONTRIBUTOR

Ahh, actually it seems like reindexing only works if the size remains the same (?). Getting:

ValueError: cannot reindex or align along dimension 'time' without an index because its size 3208464 is different from the size of the new index 3304800

so the reindex solution would not work as-is.
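A sketch of what that error seems to mean (my reading: the dimension has no index/coordinate attached, so xarray cannot align labels when the sizes differ; attaching a coordinate to the dimension first lets reindex change its size):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"test": (("station", "time"), np.random.rand(2, 5))})
new_time = np.arange(8)

# ds.reindex(time=new_time)  # raises: cannot reindex ... 'time' without an index ...

ds = ds.assign_coords(time=np.arange(5))               # attach an index to "time"
extended = ds.reindex(time=new_time, fill_value=-999)  # now the size can change
print(extended.sizes)  # -> station: 2, time: 8
```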

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1339499858 https://github.com/pydata/xarray/issues/7363#issuecomment-1339499858 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P1yVS jerabaul29 8382834 2022-12-06T14:50:08Z 2022-12-06T14:50:08Z CONTRIBUTOR

I will do the following: use only int types for these "critical" dimensions (I should do so anyway); this way there will be no issues with numerical-equality round-off. I will keep this open so that maintainers can see it, but feel free to close if you feel the initial suggestion is too close to reindex :) .

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1339455353 https://github.com/pydata/xarray/issues/7363#issuecomment-1339455353 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P1nd5 jerabaul29 8382834 2022-12-06T14:16:36Z 2022-12-06T14:16:36Z CONTRIBUTOR

Yes, this is exactly what I plan on doing; I will find my way around this by myself, no worries :) . I just wonder whether there may be, for example, some float rounding issues when "matching" values that could lead to silent problems; "re-allocate with a bunch more memory and default-initialize the new entries with a given value" just feels a bit safer to me, but of course I may just be a bit paranoid :) .

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713
1339440638 https://github.com/pydata/xarray/issues/7363#issuecomment-1339440638 https://api.github.com/repos/pydata/xarray/issues/7363 IC_kwDOAMm_X85P1j3- jerabaul29 8382834 2022-12-06T14:06:02Z 2022-12-06T14:12:04Z CONTRIBUTOR

@kmuehlbauer many thanks for your answer :) . I think this is a very good fit indeed. The only drawback I see is the need to create the time array in advance (i.e. I have to tell xarray "use this time array instead" and trust that the right "matching" is done on the existing data, rather than just saying "keep the existing arrays as they are, just extend their size and fill the new entries with a default value"), but I agree this is otherwise equivalent to what I am asking for :) .

I will update the SO thread with your suggestion, pointing to this issue and giving credit to you of course, if this is ok :) .


edit: re-reading the SO answer, I think it is not exactly what we are discussing here, so I will wait for now.

edit: with your help, I am able to search SO better, and this is well described at https://stackoverflow.com/questions/70370667/how-do-i-expand-a-data-variable-along-a-time-dimension-using-xarray :) . Keeping this open just so that maintainers can decide whether the "size extension" is so close to "reindex" that they simply want to direct users to "reindex", or whether they want to add a "growdim" or similar.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  expand dimension by re-allocating larger arrays with more space "at the end of the corresponding dimension", block copying previously existing data, and autofill newly created entry by a default value (note: alternative to reindex, but much faster for extending large arrays along, for example, the time dimension) 1479121713

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);