
issue_comments


22 rows where issue = 180451196 sorted by updated_at descending




user 4

  • crusaderky 11
  • shoyer 7
  • kynan 2
  • benbovy 2

author_association 2

  • MEMBER 20
  • NONE 2

issue 1

  • Disable automatic cache with dask · 22
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
261047576 https://github.com/pydata/xarray/pull/1024#issuecomment-261047576 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI2MTA0NzU3Ng== kynan 346079 2016-11-16T19:33:08Z 2016-11-16T19:33:08Z NONE

@shoyer Great, thanks, I'll give that a try.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Disable automatic cache with dask 180451196
261046945 https://github.com/pydata/xarray/pull/1024#issuecomment-261046945 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI2MTA0Njk0NQ== shoyer 1217238 2016-11-16T19:30:53Z 2016-11-16T19:30:53Z MEMBER

@kynan I think this is fixed in #1128, which has a slightly more robust solution.

260818311 https://github.com/pydata/xarray/pull/1024#issuecomment-260818311 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI2MDgxODMxMQ== kynan 346079 2016-11-16T00:45:51Z 2016-11-16T00:45:51Z NONE

@crusaderky @shoyer There are still cases where dask arrays are converted to ndarrays where I think they shouldn't be: if you create a Variable with a dask array (e.g. in a custom data store), it gets wrapped into a LazilyIndexedArray at the [end of decode_cf_variable](https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L833). This will subsequently be cast to an ndarray in `_data_cast`.
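The failure mode described here can be sketched in a few lines of plain Python. Everything below (DaskLike, LazilyIndexedArray.as_eager) is a hypothetical stand-in, not actual xarray code: the point is that wrapping an already-lazy dask-like array in a lazy-indexing adapter does nothing to stop it being forced eager at the later cast step.

```python
# Illustrative stand-ins only -- DaskLike, LazilyIndexedArray and as_eager()
# are hypothetical names, not real xarray internals.

class DaskLike:
    """Pretend dask array that records whether it was ever computed."""
    def __init__(self, values):
        self.values = values
        self.computed = False

    def compute(self):
        self.computed = True
        return list(self.values)


class LazilyIndexedArray:
    """Adapter intended for on-disk arrays; wrapping a dask-like array
    in it does not protect the data from an eager cast."""
    def __init__(self, array):
        self.array = array

    def as_eager(self):
        # Stands in for the cast-to-ndarray step the comment mentions.
        if isinstance(self.array, DaskLike):
            return self.array.compute()  # the lazy graph is forced here
        return list(self.array)


dask_arr = DaskLike([1, 2, 3])
wrapped = LazilyIndexedArray(dask_arr)   # the problematic wrapping
assert wrapped.as_eager() == [1, 2, 3]
assert dask_arr.computed                 # lazy data was evaluated eagerly
```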

260436147 https://github.com/pydata/xarray/pull/1024#issuecomment-260436147 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI2MDQzNjE0Nw== crusaderky 6213168 2016-11-14T19:28:38Z 2016-11-14T19:28:38Z MEMBER

Happy to contribute!

On 14 Nov 2016 16:58, "Stephan Hoyer" notifications@github.com wrote:

Thanks for your patience! This is a nice improvement.

I have an idea for a variation that might make for a cleaner (less dask specific) way to handle the remaining caching logic -- I'll add you a reviewer on that PR.


260393416 https://github.com/pydata/xarray/pull/1024#issuecomment-260393416 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI2MDM5MzQxNg== shoyer 1217238 2016-11-14T16:57:59Z 2016-11-14T16:57:59Z MEMBER

Thanks for your patience! This is a nice improvement.

I have an idea for a variation that might make for a cleaner (less dask specific) way to handle the remaining caching logic -- I'll add you a reviewer on that PR.

260202467 https://github.com/pydata/xarray/pull/1024#issuecomment-260202467 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI2MDIwMjQ2Nw== crusaderky 6213168 2016-11-13T18:21:26Z 2016-11-13T18:21:26Z MEMBER

Changed to cache IndexVariable._data on init. Please review...

260137073 https://github.com/pydata/xarray/pull/1024#issuecomment-260137073 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI2MDEzNzA3Mw== crusaderky 6213168 2016-11-12T17:47:10Z 2016-11-12T17:47:10Z MEMBER

Finished and waiting for code review

258679497 https://github.com/pydata/xarray/pull/1024#issuecomment-258679497 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1ODY3OTQ5Nw== shoyer 1217238 2016-11-06T13:01:50Z 2016-11-06T13:01:50Z MEMBER

Awesome, thanks for your help!

On Sat, Nov 5, 2016 at 6:56 PM crusaderky notifications@github.com wrote:

roger that, getting to work :)


258647829 https://github.com/pydata/xarray/pull/1024#issuecomment-258647829 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1ODY0NzgyOQ== crusaderky 6213168 2016-11-05T22:56:27Z 2016-11-05T22:56:27Z MEMBER

roger that, getting to work :)

258620328 https://github.com/pydata/xarray/pull/1024#issuecomment-258620328 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1ODYyMDMyOA== shoyer 1217238 2016-11-05T15:53:06Z 2016-11-05T15:53:06Z MEMBER

Anyway, I can open a new issue to discuss more about this if you think it's worth it.

Yes, please do!

@crusaderky I think we are OK going ahead here with Option D. If we do eventually extend xarray with out of core indexes, that will be done with a separate layer (not in IndexVariable).

258619392 https://github.com/pydata/xarray/pull/1024#issuecomment-258619392 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1ODYxOTM5Mg== benbovy 4160723 2016-11-05T15:37:46Z 2016-11-05T15:37:46Z MEMBER

we already cache an index in memory for any labeled indexing operations

Oh yes, true!

So at best, you could do something like ds.isel(mesh_edge=slice(int(1e6)))

Indeed, that doesn't look very nice.

For out-of-core operations with labels on big unstructured meshes, you really need a generalization of the pandas.Index that doesn't need to live in memory

From what I intend to do next with xarray, I'd say that extending its support for out-of-core operations to big indexes would be a great feature! I haven't yet seen how dask.Dataframe works internally (including dask.Dataframe.index and dask.Dataframe.loc), but I guess maybe this could be transposed in some way to the indexing logic in xarray? Though I'm certainly missing a lot of potential issues here... Anyway, I can open a new issue to discuss more about this if you think it's worth it.

258524115 https://github.com/pydata/xarray/pull/1024#issuecomment-258524115 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1ODUyNDExNQ== shoyer 1217238 2016-11-04T19:19:00Z 2016-11-04T19:19:00Z MEMBER

I admit that currently xarray is perhaps not very suited for handling unstructured meshes, but IMO it has great potential (especially considering multi-index support) and I'd love to use it here.

Right now, xarray is not going to be a great fit for such cases, because we already cache an index in memory for any labeled indexing operations. So at best, you could do something like ds.isel(mesh_edge=slice(int(1e6))). Maybe people already do this?

I doubt very many people are relying on this, though indeed, this would include some users of an array database we wrote at my former employer, which did not support different chunking schemes for different variables (which could make coordinate lookup very slow). CC @ToddSmall in case he has opinions here.

For out-of-core operations with labels on big unstructured meshes, you really need a generalization of the pandas.Index that doesn't need to live in memory (or maybe that lives in memory on some remote server).

258496425 https://github.com/pydata/xarray/pull/1024#issuecomment-258496425 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1ODQ5NjQyNQ== benbovy 4160723 2016-11-04T17:29:54Z 2016-11-04T17:29:54Z MEMBER

Option D seems indeed the cleanest and safest option, but

Even eagerly loading indexes is potentially problematic, if loading the index values is expensive.

I can see use cases where this might happen. For example, it is common for 1-d, 2-d, or higher-dimensional unstructured meshes that the coordinates x, y, z are arranged as 1-d arrays whose length equals the number of nodes (which can be very high!). See for example ugrid conventions.

I admit that currently xarray is perhaps not very suited for handling unstructured meshes, but IMO it has great potential (especially considering multi-index support) and I'd love to use it here.

256125722 https://github.com/pydata/xarray/pull/1024#issuecomment-256125722 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1NjEyNTcyMg== shoyer 1217238 2016-10-25T18:25:30Z 2016-10-25T18:25:30Z MEMBER

I'm going to ping the mailing list for input, but I think it would be pretty safe.

On Tue, Oct 25, 2016 at 11:11 AM, crusaderky notifications@github.com wrote:

Hi Stephan, thank you for your thinking. IMHO option D is the cleanest and safest. Could you come up with any example where it may be problematic?

On 21 Oct 2016 4:36 am, "Stephan Hoyer" notifications@github.com wrote:

I've been thinking about this... Maybe the simple, clean solution is to simply invoke compute() on all coords as soon as they are assigned to the DataArray / Dataset?

I'm nervous about eager loading, especially for non-index coordinates. They can have more than one dimension, and thus can contain a lot of data. So potentially eagerly loading non-index coordinates could break existing use cases.

On the other hand, non-index coordinates are indeed checked for equality in most xarray operations (e.g., for the coordinate merge in align). So it is indeed useful not to have to recompute them all the time.

Even eagerly loading indexes is potentially problematic, if loading the index values is expensive.

So I'm conflicted: - I like the current caching behavior for coords and indexes - But I also want to avoid implicit conversions from dask to numpy, which is problematic for all the reasons you pointed out earlier

I'm going to start throwing out ideas for how to deal with this:

Option A

Add two new (public?) methods, something like .load_coords() and .load_indexes(). We would then insert calls to these methods at the start of each function that uses coordinates: - .load_indexes(): reindex, reindex_like, align and sel - .load_coords(): merge and anything that calls the functions in core/merge.py (this indirectly includes Dataset.__init__ and Dataset.__setitem__)

Hypothetically, we could even have options for turning this caching systematically on/off (e.g., with xarray.set_options(cache_coords=False, cache_indexes=True): ...).

Your proposal is basically an extreme version of this, where we call .load_coords() immediately after constructing every new object.

Advantages: - It's fairly predictable when caching happens (especially if we opt for calling .load_coords() immediately, as you propose). - Computing variables is all done at once, which is much more performant than what we currently do, e.g., loading variables as needed for .equals() checks in merge_variables one at a time.

Downsides: - Caching is more aggressive than necessary -- we cache indexes even if that coord isn't actually indexed.

Option B

Like Option A, but somehow infer the full set of variables that need to be cached (e.g., in a .merge() operation) before it's actually done. This seems hard, but maybe is possible using a variation of merge_variables.

This solves the downside of A, but diminishes the predictability. We're basically back to how things work now.

Option C

Cache dask.array in IndexVariable but not Variable. This preserves performance for repeated indexing, because the hash table behind the pandas.Index doesn't get thrown away.

Advantages: - Much simpler and easier to implement than the alternatives. - Implicit conversions are greatly diminished.

Downsides: - Non-index coordinates get thrown away after being evaluated once. If you're doing lots of operations of the form [ds + other for ds in datasets] where ds and other have conflicting coordinates, this would probably make you unhappy.

Option D

Load the contents of an IndexVariable immediately and eagerly. They no longer cache data or use lazy loading.

This has the most predictable performance, but might cause trouble for some edge use cases?

I need to think about this a little more, but right now I am leaning towards Option C or D.


256114879 https://github.com/pydata/xarray/pull/1024#issuecomment-256114879 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1NjExNDg3OQ== crusaderky 6213168 2016-10-25T18:11:35Z 2016-10-25T18:11:35Z MEMBER

Hi Stephan, thank you for your thinking. IMHO option D is the cleanest and safest. Could you come up with any example where it may be problematic?

On 21 Oct 2016 4:36 am, "Stephan Hoyer" notifications@github.com wrote:

I've been thinking about this... Maybe the simple, clean solution is to simply invoke compute() on all coords as soon as they are assigned to the DataArray / Dataset?

I'm nervous about eager loading, especially for non-index coordinates. They can have more than one dimension, and thus can contain a lot of data. So potentially eagerly loading non-index coordinates could break existing use cases.

On the other hand, non-index coordinates are indeed checked for equality in most xarray operations (e.g., for the coordinate merge in align). So it is indeed useful not to have to recompute them all the time.

Even eagerly loading indexes is potentially problematic, if loading the index values is expensive.

So I'm conflicted: - I like the current caching behavior for coords and indexes - But I also want to avoid implicit conversions from dask to numpy, which is problematic for all the reasons you pointed out earlier

I'm going to start throwing out ideas for how to deal with this:

Option A

Add two new (public?) methods, something like .load_coords() and .load_indexes(). We would then insert calls to these methods at the start of each function that uses coordinates: - .load_indexes(): reindex, reindex_like, align and sel - .load_coords(): merge and anything that calls the functions in core/merge.py (this indirectly includes Dataset.__init__ and Dataset.__setitem__)

Hypothetically, we could even have options for turning this caching systematically on/off (e.g., with xarray.set_options(cache_coords=False, cache_indexes=True): ...).

Your proposal is basically an extreme version of this, where we call .load_coords() immediately after constructing every new object.

Advantages: - It's fairly predictable when caching happens (especially if we opt for calling .load_coords() immediately, as you propose). - Computing variables is all done at once, which is much more performant than what we currently do, e.g., loading variables as needed for .equals() checks in merge_variables one at a time.

Downsides: - Caching is more aggressive than necessary -- we cache indexes even if that coord isn't actually indexed.

Option B

Like Option A, but somehow infer the full set of variables that need to be cached (e.g., in a .merge() operation) before it's actually done. This seems hard, but maybe is possible using a variation of merge_variables.

This solves the downside of A, but diminishes the predictability. We're basically back to how things work now.

Option C

Cache dask.array in IndexVariable but not Variable. This preserves performance for repeated indexing, because the hash table behind the pandas.Index doesn't get thrown away.

Advantages: - Much simpler and easier to implement than the alternatives. - Implicit conversions are greatly diminished.

Downsides: - Non-index coordinates get thrown away after being evaluated once. If you're doing lots of operations of the form [ds + other for ds in datasets] where ds and other have conflicting coordinates, this would probably make you unhappy.

Option D

Load the contents of an IndexVariable immediately and eagerly. They no longer cache data or use lazy loading.

This has the most predictable performance, but might cause trouble for some edge use cases?

I need to think about this a little more, but right now I am leaning towards Option C or D.


255286001 https://github.com/pydata/xarray/pull/1024#issuecomment-255286001 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1NTI4NjAwMQ== shoyer 1217238 2016-10-21T03:36:01Z 2016-10-21T03:36:01Z MEMBER

I've been thinking about this... Maybe the simple, clean solution is to simply invoke compute() on all coords as soon as they are assigned to the DataArray / Dataset?

I'm nervous about eager loading, especially for non-index coordinates. They can have more than one dimension, and thus can contain a lot of data. So potentially eagerly loading non-index coordinates could break existing use cases.

On the other hand, non-index coordinates are indeed checked for equality in most xarray operations (e.g., for the coordinate merge in align). So it is indeed useful not to have to recompute them all the time.

Even eagerly loading indexes is potentially problematic, if loading the index values is expensive.

So I'm conflicted:

  • I like the current caching behavior for coords and indexes
  • But I also want to avoid implicit conversions from dask to numpy, which is problematic for all the reasons you pointed out earlier

I'm going to start throwing out ideas for how to deal with this:

Option A

Add two new (public?) methods, something like .load_coords() and .load_indexes(). We would then insert calls to these methods at the start of each function that uses coordinates:

  • .load_indexes(): reindex, reindex_like, align and sel
  • .load_coords(): merge and anything that calls the functions in core/merge.py (this indirectly includes Dataset.__init__ and Dataset.__setitem__)

Hypothetically, we could even have options for turning this caching systematically on/off (e.g., with xarray.set_options(cache_coords=False, cache_indexes=True): ...).

Your proposal is basically an extreme version of this, where we call .load_coords() immediately after constructing every new object.

Advantages:

  • It's fairly predictable when caching happens (especially if we opt for calling .load_coords() immediately, as you propose).
  • Computing variables is all done at once, which is much more performant than what we currently do, e.g., loading variables as needed for .equals() checks in merge_variables one at a time.

Downsides:

  • Caching is more aggressive than necessary -- we cache indexes even if that coord isn't actually indexed.

Option B

Like Option A, but somehow infer the full set of variables that need to be cached (e.g., in a .merge() operation) before it's actually done. This seems hard, but maybe is possible using a variation of merge_variables.

This solves the downside of A, but diminishes the predictability. We're basically back to how things work now.

Option C

Cache dask.array in IndexVariable but not Variable. This preserves performance for repeated indexing, because the hash table behind the pandas.Index doesn't get thrown away.

Advantages:

  • Much simpler and easier to implement than the alternatives.
  • Implicit conversions are greatly diminished.

Downsides:

  • Non-index coordinates get thrown away after being evaluated once. If you're doing lots of operations of the form [ds + other for ds in datasets] where ds and other have conflicting coordinates, this would probably make you unhappy.

Option D

Load the contents of an IndexVariable immediately and eagerly. They no longer cache data or use lazy loading.

This has the most predictable performance, but might cause trouble for some edge use cases?


I need to think about this a little more, but right now I am leaning towards Option C or D.
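The difference between Options C and D can be sketched in plain Python. This is a hypothetical illustration, not xarray's actual classes: LazyData stands in for a dask array, and the two variable classes contrast cache-on-first-access (Option C) with eager-load-on-init (Option D).

```python
# Hypothetical sketch of Options C and D -- none of these names are real
# xarray classes; LazyData stands in for a dask array.

class LazyData:
    """Pretend-lazy array that counts how often it is computed."""
    def __init__(self, values):
        self._values = values
        self.compute_count = 0

    def compute(self):
        self.compute_count += 1
        return list(self._values)


class CachingIndexVariable:
    """Option C: keep the lazy handle, cache the result on first access."""
    def __init__(self, data):
        self._data = data
        self._cache = None

    @property
    def values(self):
        if self._cache is None:       # compute once, reuse afterwards
            self._cache = self._data.compute()
        return self._cache


class EagerIndexVariable:
    """Option D: load immediately on construction; no laziness remains."""
    def __init__(self, data):
        self._data = data.compute()   # pays the full cost up front

    @property
    def values(self):
        return self._data


lazy_c = LazyData(range(3))
var_c = CachingIndexVariable(lazy_c)
assert var_c.values == [0, 1, 2] and var_c.values == [0, 1, 2]
assert lazy_c.compute_count == 1      # Option C computed exactly once

lazy_d = LazyData(range(3))
var_d = EagerIndexVariable(lazy_d)
assert lazy_d.compute_count == 1      # Option D paid the cost at init time
assert var_d.values == [0, 1, 2]
```

Option D's cost shows up even when values is never read, which is exactly the "expensive index loading" edge case flagged earlier in the comment.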

255066352 https://github.com/pydata/xarray/pull/1024#issuecomment-255066352 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1NTA2NjM1Mg== crusaderky 6213168 2016-10-20T10:14:30Z 2016-10-20T10:14:30Z MEMBER

ping - how do you prefer me to proceed?

253133290 https://github.com/pydata/xarray/pull/1024#issuecomment-253133290 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1MzEzMzI5MA== crusaderky 6213168 2016-10-12T06:49:23Z 2016-10-12T06:49:23Z MEMBER

I've been thinking about this... Maybe the simple, clean solution is to simply invoke compute() on all coords as soon as they are assigned to the DataArray / Dataset?

On 12 Oct 2016 02:18, "Stephan Hoyer" notifications@github.com wrote:

@shoyer commented on this pull request.

Apologies for the delay here -- my comments were stuck as a "pending" GitHub review.

I am still wondering what the right behavior is for variables used as indexes. (These can be dask arrays, too.)

I think there is a good case for skipping these variables in .chunk(), but we probably do want to make indexes still cache as pandas.Index objects, because otherwise repeated evaluation of dask arrays to build the index for alignment or indexing gets expensive.

In xarray/core/dataset.py https://github.com/pydata/xarray/pull/1024#pullrequestreview-3794240:

@@ -792,13 +806,19 @@ def chunks(self):
         array.
         """
         chunks = {}
-        for v in self.variables.values():
+        for v in self.data_vars.values():

I am concerned about skipping non-data_vars here. Coordinates could still be chunked, e.g., if they were loaded from a file, or created directly from dask arrays.

In xarray/core/dataset.py https://github.com/pydata/xarray/pull/1024#pullrequestreview-3794240:

         if v.chunks is not None:
             new_chunks = list(zip(v.dims, v.chunks))
             if any(chunk != chunks[d] for d, chunk in new_chunks if d in chunks):
                 raise ValueError('inconsistent chunks')
             chunks.update(new_chunks)
-        if chunks:

I guess this method is inconsistent with Variable.chunks, but it currently always returns a dict.

I would either skip this change or use something like my version.

In xarray/core/dataset.py https://github.com/pydata/xarray/pull/1024#pullrequestreview-3794240:

@@ -851,6 +871,9 @@ def selkeys(dict_, keys):
         return dict((d, dict_[d]) for d in keys if d in dict_)

     def maybe_chunk(name, var, chunks):
-        if name not in self.data_vars:

I see your point about performance, but I think that mostly holds true for indexes. So I would be inclined to adjust this to only skip variables in self.dims (aka indexes used for alignment).

I am still concerned about skipping coords if they are already dask arrays. If they are already dask arrays, then .chunk() should probably adjust their chunks anyways.

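The suggestion above (skip only variables in self.dims, i.e. indexes used for alignment, when chunking) can be sketched as follows. Variable and maybe_chunk here are simplified hypothetical stand-ins, not the real xarray implementations:

```python
# Hypothetical sketch of "only skip dimension coordinates in .chunk()".
# Variable and maybe_chunk are illustrative stand-ins, not xarray's code.

class Variable:
    def __init__(self, name, data, chunked=False):
        self.name = name
        self.data = data
        self.chunked = chunked

    def chunk(self):
        # Return a chunked ("dask-backed") copy of this variable.
        return Variable(self.name, self.data, chunked=True)


def maybe_chunk(name, var, dims):
    # Skip only variables that back an index (name matches a dimension),
    # so their values stay eagerly available for alignment; everything
    # else, including multi-dimensional coords, gets chunked.
    if name in dims:
        return var
    return var.chunk()


dims = {"x"}
index = maybe_chunk("x", Variable("x", [0, 1, 2]), dims)
coord = maybe_chunk("lat", Variable("lat", [10, 20, 30]), dims)
assert not index.chunked   # index variable left eager
assert coord.chunked       # non-index coordinate gets chunked
```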

252033741 https://github.com/pydata/xarray/pull/1024#issuecomment-252033741 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1MjAzMzc0MQ== crusaderky 6213168 2016-10-06T17:34:01Z 2016-10-06T17:34:01Z MEMBER

I can't reproduce the above test failure test.test_conventions.TestEncodeCFVariable.testMethod=test_missing_fillvalue. I suspect it might be a random failure, because:

  • it used to succeed until my latest commit, which exclusively changed the readme
  • it succeeds on python 3

251983000 https://github.com/pydata/xarray/pull/1024#issuecomment-251983000 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1MTk4MzAwMA== crusaderky 6213168 2016-10-06T14:39:47Z 2016-10-06T14:39:47Z MEMBER

I added the disclaimer in the release notes. Is there any other outstanding issue? Thanks

251209654 https://github.com/pydata/xarray/pull/1024#issuecomment-251209654 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1MTIwOTY1NA== crusaderky 6213168 2016-10-03T19:58:53Z 2016-10-03T20:31:48Z MEMBER

What happened before this PR was that all coords were blindly converted to dask on chunk(). Then, the first time anything invoked the values property, e.g. something as simple as DataArray.__str__, they were silently converted back to numpy. It wasn't easy to accidentally get them in dask format; in fact no unit test noticed before my last commit.

If you deliberately use a dask array as a coord, it won't be converted to numpy. However, I can't think of any reason why anybody would want to do that in practice. I'll add it to the breaking changes: if somebody did do the above, the performance of their program will degrade with this release, as their coord risks being evaluated multiple times.
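The pre-PR round-trip described in this comment can be sketched in plain Python. All names here are illustrative stand-ins, not xarray internals: chunk() wraps the coord's data in a lazy object, and the first .values access silently converts it back to eager data.

```python
# Pure-Python sketch of the pre-PR behaviour described above. LazyChunks
# and Coord are hypothetical stand-ins, not actual xarray classes.

class LazyChunks:
    """Stands in for a dask array produced by chunk()."""
    def __init__(self, values):
        self._values = values

    def compute(self):
        return list(self._values)


class Coord:
    def __init__(self, data):
        self._data = data

    def chunk(self):
        # Pre-PR: coords were blindly wrapped into lazy chunks too.
        self._data = LazyChunks(self._data)
        return self

    @property
    def values(self):
        # The silent round-trip: first access converts back to eager data,
        # so code (and tests) rarely ever observed a coord in lazy form.
        if isinstance(self._data, LazyChunks):
            self._data = self._data.compute()
        return self._data


c = Coord([1, 2, 3]).chunk()
assert isinstance(c._data, LazyChunks)      # lazy right after chunk()
assert c.values == [1, 2, 3]                # reading values forces it...
assert not isinstance(c._data, LazyChunks)  # ...and silently makes it eager
```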

250922373 https://github.com/pydata/xarray/pull/1024#issuecomment-250922373 https://api.github.com/repos/pydata/xarray/issues/1024 MDEyOklzc3VlQ29tbWVudDI1MDkyMjM3Mw== crusaderky 6213168 2016-10-01T16:41:04Z 2016-10-01T17:37:39Z MEMBER

Well, crud. This introduces a regression where DataArray.chunk() converts the data and the coords to dask. This becomes enormously problematic later on as pretty much nothing expects a dask-based coord.

[edit] fixed below


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 15.097ms · About: xarray-datasette