larry-discuss team mailing list archive

Re: [pystatsmodels] Re: Bootstrap and cross validation iterators

 

On Sat, May 22, 2010 at 8:24 PM,  <josef.pktd@xxxxxxxxx> wrote:
> On Sat, May 22, 2010 at 8:21 PM, Keith Goodman <kwgoodman@xxxxxxxxx> wrote:
>> I added a resample module to the la package (la.util.resample). It
>> currently contains bootstrap and k-fold cross validation iterators.
>> The index iterators are not specific to larrys; they return lists of
>> indices. Just thought I'd mention it since they can be used most
>> anywhere. Lots of unit tests are in la.util.tests.resample_test.py.
>>
>> You can optionally set the state of your random number generator
>> outside of the index iterators and pass in shuffle to cv and randint
>> to boot.
>>
>> K-fold cross validation indices for 5 elements and 2 folds:
>>
>>    >>> from la.util.resample import cv
>>    >>> for train, test in cv(5,2):
>>    ...     print
>>    ...     print 'train: ', train
>>    ...     print 'test:  ', test
>>    ...
>>
>>    train:  [4, 3, 1]
>>    test:   [0, 2]
>>
>>    train:  [0, 2]
>>    test:   [4, 3, 1]
>>
>> Three bootstrap samples taken with replacement from four elements:
>>
>>    >>> from la.util.resample import boot
>>    >>> for train, test in boot(4, 3):
>>    ...     print
>>    ...     print 'train: ', train
>>    ...     print 'test:  ', test
>>    ...
>>
>>    train:  [2 1 3 1]
>>    test:   [0]
>>
>>    train:  [1 1 2 1]
>>    test:   [0, 3]
>>
>>    train:  [1 3 0 0]
>>    test:   [2]
>>
>> http://bazaar.launchpad.net/~kwgoodman/larry/trunk/annotate/head:/la/util/resample.py
>> http://bazaar.launchpad.net/~kwgoodman/larry/trunk/annotate/head:/la/util/tests/resample_test.py
>
> Two design questions
>
> why did you choose to use a fixed seed by default? (I'm not completely
> sure how using RandomState directly works, I usually just use
> random.seed)

Resetting the random number generator to a fixed state was a bad
choice. I'll remove it. Alan gave an answer to the other question. It
was just the way I learned how to do it.
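
A minimal sketch of the alternative: seed a generator outside the iterator and pass its shuffle in, so nothing is reset behind the caller's back. Everything here is an assumption for illustration: cv_index is a hypothetical stand-in and not the code in la.util.resample, and the standard library's random.Random is used in place of numpy's RandomState.

```python
import random


def cv_index(n, kfold, shuffle=None):
    """Yield (train, test) index lists for k-fold cross validation.

    Hypothetical stand-in for illustration, not the la implementation.
    """
    if not 2 <= kfold <= n:
        raise ValueError("kfold must satisfy 2 <= kfold <= n")
    if shuffle is None:
        # Default: the module-level generator, left unseeded.
        shuffle = random.shuffle
    idx = list(range(n))
    shuffle(idx)  # shuffles in place
    for k in range(kfold):
        test = idx[k::kfold]  # every kfold-th shuffled index
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        yield train, test


# The caller owns the seed, so two runs with the same seed agree.
rng = random.Random(0)
folds_a = list(cv_index(5, 2, shuffle=rng.shuffle))
rng = random.Random(0)
folds_b = list(cv_index(5, 2, shuffle=rng.shuffle))
```

Calling cv_index(5, 2) with no shuffle argument gives a different split on each run, which is the behavior that removing the fixed seed is meant to restore.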

> In some early leave one out loops, I also used indices to select. The
> scikits.learn cross_val iterators use boolean index arrays. Do you
> have any idea whether integer or boolean indices are faster?

A bool index would not work for a bootstrap since the training set has
repeats. As for speed, there is also the issue that boot returns a
training and testing set of indices. Yet a testing set is often not
used in bootstraps. And calculating the testing set is probably the
slowest part. For indexing into arrays I don't know which is faster,
indices or bool index.
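
The point about repeats can be shown with a small numpy sketch. The array values are made up for illustration; the index list is the first bootstrap draw from the example output quoted above.

```python
import numpy as np

arr = np.array([10.0, 20.0, 30.0, 40.0])  # made-up data
train = np.array([2, 1, 3, 1])  # bootstrap draw; index 1 appears twice

# Integer indexing keeps the repeated element.
by_index = arr[train]

# A boolean mask can only mark an element as in or out, so the
# repeat collapses to a single occurrence.
mask = np.zeros(arr.shape[0], dtype=bool)
mask[train] = True
by_mask = arr[mask]

# The out-of-sample (test) indices never repeat, so a mask does
# work for that half.
test = np.flatnonzero(~mask)
```

Here by_index has four elements and by_mask only three. Which form indexes faster is not settled by this sketch; timing arr[train] against arr[mask] with timeit on arrays of realistic size would answer that.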

> Does boot work if nboot=n (no test sample)?

Yes, nboot=n works. And boot only returns samples that have test data.
Empty test data indices are rejected. kfold=n, leave one out, also
works in cv.
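
The rejection rule can be sketched as follows. This is a hypothetical illustration under stated assumptions and not the code in la.util.resample: boot_index is an invented name, and the randint argument is assumed to be a callable that takes n and returns an integer in [0, n), such as random.Random().randrange.

```python
import random


def boot_index(n, nboot, randint=None):
    """Yield nboot (train, test) index lists; test is never empty.

    Hypothetical sketch, not the la implementation.
    """
    if n < 2:
        # With one element every draw uses it, so test is always empty.
        raise ValueError("need n >= 2 to have out-of-sample indices")
    if randint is None:
        randint = random.Random().randrange
    accepted = 0
    while accepted < nboot:
        train = [randint(n) for _ in range(n)]  # n draws with replacement
        drawn = set(train)
        test = [i for i in range(n) if i not in drawn]
        if not test:
            continue  # reject draws that leave no test data
        accepted += 1
        yield train, test


samples = list(boot_index(4, 3, randint=random.Random(0).randrange))
```

The set difference used to build test is extra work on every draw, which is the cost mentioned above for callers who only need the training indices.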

> I find the function names, especially cv (crossval_random_kfold?), a
> bit too short and unspecific.

How about cv --> cross_validation and boot --> bootstrap?

> I think we will have more design questions, when we start to use this
> (or similar) more systematically than just some eclectic examples of
> bootstrap as we have until now.

I'm thinking of adding two more (convenience) functions to the
resample module. Good design?

bootstrap(arr, nboot, axis=0, randint=None) --> numpy array iterator
bootstrap_index(n, nboot, randint=None) --> index iterator

cross_validation(arr, kfold, axis=0, shuffle=None) --> numpy array iterator
cross_validation_index(n, kfold, shuffle=None) --> index iterator
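
One way the array iterators could be layered on top of the index iterators is sketched below. Both function names are hypothetical and the signatures differ from the proposal: array_iterator takes an already built index iterator, and leave_one_out_index is a small deterministic stand-in (the kfold=n case) so that the example is self-contained.

```python
import numpy as np


def leave_one_out_index(n):
    """Deterministic stand-in index iterator: hold out one index at a time."""
    for i in range(n):
        train = [j for j in range(n) if j != i]
        yield train, [i]


def array_iterator(arr, index_iter, axis=0):
    """Turn (train, test) index pairs into (train, test) array pairs."""
    for train, test in index_iter:
        # np.take selects along the requested axis and allows repeats.
        yield np.take(arr, train, axis=axis), np.take(arr, test, axis=axis)


arr = np.arange(12).reshape(4, 3)
pairs = list(array_iterator(arr, leave_one_out_index(arr.shape[0]), axis=0))
```

With this split the array functions stay thin: all of the sampling logic lives in the index iterators, and the wrappers only do the indexing.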


