larry-discuss team mailing list archive

Re: A new proposal for indexing with labels

To: Keith Goodman <kwgoodman@xxxxxxxxx>
From: josef.pktd@xxxxxxxxx
Date: Sun, 7 Feb 2010 21:52:59 -0500
Cc: larry-discuss@xxxxxxxxxxxxxxxxxxx
In-reply-to: <f4f93d421002071835m5ba687br2a3789e7bbd948e@mail.gmail.com>

On Sun, Feb 7, 2010 at 9:35 PM, Keith Goodman <kwgoodman@xxxxxxxxx> wrote:
> On Sun, Feb 7, 2010 at 6:28 PM,  <josef.pktd@xxxxxxxxx> wrote:
>> On Sun, Feb 7, 2010 at 8:46 PM, Keith Goodman <kwgoodman@xxxxxxxxx> wrote:
>>> On Sun, Feb 7, 2010 at 5:26 PM, Keith Goodman <kwgoodman@xxxxxxxxx> wrote:
>>>> On Sat, Feb 6, 2010 at 6:53 PM,  <josef.pktd@xxxxxxxxx> wrote:
>>>>> On Sat, Feb 6, 2010 at 9:48 PM, Keith Goodman <kwgoodman@xxxxxxxxx> wrote:
>>>>>> On Sat, Feb 6, 2010 at 5:38 PM, Keith Goodman <kwgoodman@xxxxxxxxx> wrote:
>>>>>>> In a blueprint titled "index-by-label" I proposed a way to index
>>>>>>> larrys by lists of label elements. Here's a simpler, but less
>>>>>>> versatile, proposal. On the whole, due to its simplicity, I think it
>>>>>>> is more powerful.
>>>>>>
>>>>>> I committed this proposal in r187. Please give it a try.
>>>>>
>>>>> I will try it tomorrow and look at the implementation.
>>>>> My first reaction: very convenient but potentially fragile for arbitrary labels.
>>>>
>>>> The rule is simple for indexing with a string S:
>>>>
>>>> 1. Look for string S in the label. If found you are done. If not found...
>>>> 2. Map the labels to strings and look again
>>>>
>>>> Although the rule is simple, the result can be unexpected in corner
>>>> cases. For example, you may try to index with str(1) to access the
>>>> label integer 1 but the label could also contain string '1'. So in
>>>> that case you'd get an unexpected result even though the rule is
>>>> simple.
>>>>
>>>> I could add a check: len(set(strlabel)) == len(set(label)). And raise
>>>> an IndexError (or is that ValueError?) if they are not equal. That
>>>> will slow things down but only for indexing by strings.
>>>>
>>>> Would that address your fragile comment? Or do you have something else in mind?
>>>
>>> Wait, that's being too restrictive. We don't care if there are
>>> duplicates in strlabel. We only care if S appears more than once in
>>> strlabel. For example, if we are indexing with str(1) and the label is
>>> [2, str(2), 1], then we don't care that strlabel = [str(2), str(2),
>>> str(1)] has duplicates; we only care that str(1) only appears once. If
>>> we were indexing with str(2), on the other hand, then there would be a
>>> problem and we'd raise a ValueError.
>>>
>>> I can add that check and then you can take a look.
>>>
>>
>> I just started to look at it. I saw that in str2labelindex you use
>> str(labelobject) to identify the label.
>> I don't think __str__ is very safe to use in general; I don't think
>> it is guaranteed to remain unchanged. E.g. in numpy you can affect the
>> str result with the print options for numbers in arrays, e.g.
>> np.set_printoptions(precision=2).
>>
>> Another example: objects that don't define a unique string or that use
>> a default string:
>>
>> >>> class MyA(object): pass
>> >>> aaa = MyA()
>> >>> str(aaa)
>> '<__main__.MyA object at 0x01A57DD0>'
>>
>> I'm not very familiar with datetime. Is the string representation
>> locale or timezone dependent?
>> The decimal point is locale dependent: from some messages on the
>> mailing lists, I assume that in some cases the default in German is
>> 5,4 instead of 5.4.
>>
>> So, relying on the string representation imposes quite a lot of
>> restrictions on which types of labels this would work for.
>>
>> I'll look some more.
>
> Sure, indexing with things like '(3,4)' will be a problem since
> str((3,4)) is '(3, 4)' (note the space). So the safe way to index is,
> for example, y[str(1)].
>
> I like the general idea of using __getitem__ to index both the regular
> and the label way. One thing I am wondering about is if there is
> another way to signify indexing by labels other than with strings. It
> would have to be something that numpy arrays can't be indexed by.
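
To make the string rule and the "S appears more than once" check from
the quoted messages concrete, here is a minimal sketch. The name
find_label is made up for illustration and this is not the code in
r187; it only assumes that the labels of one axis are kept in a plain
list.

```python
def find_label(label, s):
    """Return the position of the label whose str() equals s.

    Hypothetical helper, not larry's str2labelindex.
    """
    # Map every label to its string form and collect all matches.
    strlabel = [str(x) for x in label]
    hits = [i for i, x in enumerate(strlabel) if x == s]
    if not hits:
        raise ValueError("%r not found in label" % (s,))
    if len(hits) > 1:
        # Only duplicates of s itself matter, not duplicates elsewhere.
        raise ValueError("%r is ambiguous, matches positions %r" % (s, hits))
    return hits[0]


label = [2, str(2), 1]
print(find_label(label, str(1)))   # 2, duplicates of '2' do not matter here

try:
    find_label(label, str(2))      # matches both 2 and '2'
except ValueError as err:
    print(err)

# The string form has to match exactly, including whitespace:
print(str((3, 4)) == '(3,4)')      # False, str((3, 4)) is '(3, 4)'
```

The last line is the reason why indexing with str(obj) is safer than
typing the string by hand.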

I had two thoughts on alternatives, but I no longer think the first one helps:

1) Require either label or array indexing per axis, e.g.
   lar1[1:4, ['a','b','c']]  or  lar1[1:4, [datetime(..), ...]]
   But given the structure of getitem it might not be so easy, and it
   would require verifying the entire index for this axis.
2) Use a special slice type/class to indicate label indexing, which
   would be a bit more writing, e.g.
   lar1[1:4, lix(['a','b','c'])]  or  lar1[1:4, lix('msft')]
   where lix is a label-index class that just signals that
   this axis is indexed by label.
   This would be unambiguous, I think, with isinstance(index,
   lixclass) or something like this.
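
A rough sketch of how the lix idea in 2) could look. Labeled2D is a toy
stand-in built on nested lists, not larry's class, and the details
(a list inside lix means several labels, anything else is one label)
are my assumptions.

```python
class lix(object):
    """Marker wrapper: the wrapped value names labels, not positions."""

    def __init__(self, labels):
        self.labels = labels


class Labeled2D(object):
    """Toy 2-d labeled container (hypothetical, for illustration only)."""

    def __init__(self, x, label):
        self.x = x          # list of rows
        self.label = label  # [row_labels, column_labels]

    def _positions(self, index, axis):
        n = len(self.label[axis])
        if isinstance(index, lix):
            # Label indexing is requested explicitly, so no guessing.
            wanted = index.labels
            if not isinstance(wanted, list):
                wanted = [wanted]
            return [self.label[axis].index(w) for w in wanted]
        if isinstance(index, slice):
            return list(range(*index.indices(n)))
        if isinstance(index, int):
            return [index]
        return list(index)  # sequence of integer positions

    def __getitem__(self, key):
        rows = self._positions(key[0], 0)
        cols = self._positions(key[1], 1)
        return [[self.x[r][c] for c in cols] for r in rows]


lar1 = Labeled2D([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                 [[10, 20, 30], ['a', 'msft', 1]])
print(lar1[0:2, lix(['a', 'msft'])])  # [[1, 2], [4, 5]]
print(lar1[0:2, lix(1)])              # [[3], [6]], label 1 is column 2
print(lar1[0:2, 1])                   # [[2], [5]], position 1
```

The last two lines show the point: the integer 1 as a label and the
integer 1 as a position never get mixed up, and no str() is involved.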

Another option, following your idea, would be to try/except different
possibilities until a valid interpretation is found. I would check for
labels last to keep standard array indexing fast and dominant. There
might still be some ambiguities that are resolved by the order of the
checks (maybe).
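
A 1-d sketch of the try/except idea, with positional indexing tried
first and labels last. getitem_fallback is a made-up name and plain
lists stand in for the array and the label; nothing here is taken from
larry.

```python
def getitem_fallback(x, label, index):
    """Index x by position if possible, otherwise by label.

    Hypothetical helper for illustration only.
    """
    # 1. Standard positional indexing first (fast path).
    try:
        return x[index]
    except (TypeError, IndexError):
        pass
    # 2. Only if that fails, interpret index as a label.
    try:
        return x[label.index(index)]
    except ValueError:
        raise KeyError(index)


x = [1.5, 2.5, 3.5]
label = ['a', 2, 10]
print(getitem_fallback(x, label, 'a'))  # 1.5, not a valid position
print(getitem_fallback(x, label, 10))   # 3.5, out of range as a position
print(getitem_fallback(x, label, 2))    # 3.5, position 2 wins over label 2
```

The last call is the kind of ambiguity I mean: label 2 sits at position
1, but the order of the checks decides that the positional reading is
used.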

Just some thoughts while I was reading your messages and looking at
the changes; I haven't looked at either version more carefully yet.

Josef
