
larry-discuss team mailing list archive

Re: [pystatsmodels] Re: Bootstrap and cross validation iterators

 

On Sun, May 23, 2010 at 11:22 AM, Keith Goodman <kwgoodman@xxxxxxxxx> wrote:
> On Sat, May 22, 2010 at 8:24 PM,  <josef.pktd@xxxxxxxxx> wrote:
>> On Sat, May 22, 2010 at 8:21 PM, Keith Goodman <kwgoodman@xxxxxxxxx> wrote:
>>> I added a resample module to the la package (la.util.resample). It
>>> currently contains bootstrap and k-fold cross validation iterators.
>>> The index iterators are not specific to larrys; they return lists of
>>> indices. Just thought I'd mention it since they can be used most
>>> anywhere. Lots of unit tests are in la.util.tests.resample_test.py.
>>>
>>> You can optionally set the state of your random number generator
>>> outside of the index iterators and pass in shuffle to cv and randint
>>> to boot.
>>>
>>> K-fold cross validation indices for 5 elements and 2 folds:
>>>
>>>    >>> from la.util.resample import cv
>>>    >>> for train, test in cv(5,2):
>>>    ...     print
>>>    ...     print 'train: ', train
>>>    ...     print 'test:  ', test
>>>    ...
>>>
>>>    train:  [4, 3, 1]
>>>    test:   [0, 2]
>>>
>>>    train:  [0, 2]
>>>    test:   [4, 3, 1]
>>>
>>> Three bootstrap samples taken with replacement from four elements:
>>>
>>>    >>> from la.util.resample import boot
>>>    >>> for train, test in boot(4, 3):
>>>    ...     print
>>>    ...     print 'train: ', train
>>>    ...     print 'test:  ', test
>>>    ...
>>>
>>>    train:  [2 1 3 1]
>>>    test:   [0]
>>>
>>>    train:  [1 1 2 1]
>>>    test:   [0, 3]
>>>
>>>    train:  [1 3 0 0]
>>>    test:   [2]
>>>
>>> http://bazaar.launchpad.net/~kwgoodman/larry/trunk/annotate/head:/la/util/resample.py
>>> http://bazaar.launchpad.net/~kwgoodman/larry/trunk/annotate/head:/la/util/tests/resample_test.py
>>
>> Two design questions:
>>
>> Why did you choose to use a fixed seed by default? (I'm not completely
>> sure how using RandomState directly works; I usually just use
>> random.seed.)
>
> Resetting the random number generator to a fixed state was a bad
> choice. I'll remove it. Alan gave an answer to the other question. It
> was just the way I learned how to do it.
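
For reference, this is roughly how I picture the RandomState route
(untested; the exact signatures that cv and boot expect for shuffle
and randint are my guess):

    import numpy as np
    from la.util.resample import cv, boot

    # a dedicated, seeded generator instead of the global numpy state
    rs = np.random.RandomState(12345)

    for train, test in cv(5, 2, shuffle=rs.shuffle):
        print 'cv train: ', train, 'test: ', test

    for train, test in boot(4, 3, randint=rs.randint):
        print 'boot train: ', train, 'test: ', test
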
>
>> In some early leave one out loops, I also used indices to select. The
>> scikits.learn cross_val iterators use boolean index arrays. Do you
>> have any idea whether integer or boolean indices are faster?
>
> A bool index would not work for a bootstrap since the training set
> has repeats. As for speed, there's also the issue that boot returns
> both a training and a testing set of indices, yet the testing set is
> often not used in bootstraps, and calculating it is probably the
> slowest part. For indexing into arrays I don't know which is faster,
> integer indices or a bool index.
>
>> Does boot work if nboot=n  (no testsample) ?
>
> Yes, nboot=n works. And boot only returns samples that have test
> data; empty test index lists are rejected. kfold=n (leave one out)
> also works in cv.

I misinterpreted what nboot means: I thought it was the sample size
and not the number of replications. But since the draws are with
replacement, the bootstrap sample size doesn't affect the existence of
test indices anyway (except probabilistically).
So far I have only done pure bootstrap, not mixed with
cross-validation, so I had a different use case in mind than what the
function does.
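
To illustrate what I mean by pure bootstrap, a rough sketch (assuming
boot(n, nboot) yields nboot (train, test) index pairs):

    import numpy as np
    from la.util.resample import boot

    # nboot replications of a statistic; the test indices are ignored
    x = np.random.randn(100)
    nboot = 500
    means = np.empty(nboot)
    for i, (train, test) in enumerate(boot(len(x), nboot)):
        means[i] = x[train].mean()

    # crude percentile interval for the mean, just as an illustration
    means.sort()
    lower, upper = means[int(0.025 * nboot)], means[int(0.975 * nboot)]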

crossvalidation(..., shuffle=lambda x:x) can do leave-one-out on the
original sample order. It takes consecutive folds of the shuffled
array, like KFold in scikits.learn.
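
In other words, something like this (whether cv uses the shuffle
argument in place or through its return value is a guess on my part):

    from la.util.resample import cv

    # leave-one-out on the original, unshuffled sample: kfold = n and
    # an identity "shuffle"
    n = 5
    for train, test in cv(n, n, shuffle=lambda x: x):
        print 'train: ', train, 'test: ', test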

But for leave-k-out (which would be similar to kfold = n - 2),
scikits.learn uses all combinations.
I find your idea of doing a random shuffle first also useful for an
all-combinations leave-k-out, because if we stop early with a large
dataset the training/test splits we actually use would be random
rather than deterministic.
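
A hypothetical sketch of that idea (leavekout is my own name, not
something in la or scikits.learn):

    import itertools
    import numpy as np

    # all-combinations leave-k-out over a randomly shuffled index, so
    # stopping early still gives a random selection of splits
    def leavekout(n, k, shuffle=np.random.shuffle):
        idx = np.arange(n)
        shuffle(idx)
        for test in itertools.combinations(idx, k):
            test = list(test)
            train = [i for i in idx if i not in test]
            yield train, test

    for train, test in leavekout(5, 2):
        print 'train: ', train, 'test: ', test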


>
>> I find the function names, especially cv (crossval_random_kfold?), a
>> bit too short and unspecific.
>
> How about cv --> cross_validation and boot --> bootstrap?

In statsmodels I would add index or iter to the name, because we will
have functions or classes that will do the actual bootstrap analysis.

>
>> I think we will have more design questions when we start to use this
>> (or something similar) more systematically than just the scattered
>> examples of bootstrap we have had until now.
>
> I'm thinking of adding two more (convenience) functions to the
> resample module. Good design?
>
> bootstrap(arr, nboot, axis=0, randint=None) --> numpy array iterator
> bootstrap_index(n, nboot, randint=None) --> index iterator
>
> cross_validation(arr, kfold, axis=0, shuffle=None) --> numpy array iterator
> cross_validation_index(n, kfold, shuffle=None) --> index iterator

I don't know what the numpy array iterator would do. For a 2d dataset,
the cross-validation or bootstrap index would only be used along one
axis; the second axis (if it exists) is left alone.
This can easily be done with your current index functions.
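
For example (a sketch using the current cv index iterator; np.take
along axis 0 leaves the column axis untouched):

    import numpy as np
    from la.util.resample import cv

    x = np.random.randn(6, 3)   # 6 observations, 3 variables
    for train, test in cv(x.shape[0], 3):
        xtrain = np.take(x, train, axis=0)
        xtest = np.take(x, test, axis=0)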

Josef


