larry-discuss team mailing list archive
Message #00151
Re: [pystatsmodels] Re: Bootstrap and cross validation iterators
On Sat, May 22, 2010 at 8:24 PM, <josef.pktd@xxxxxxxxx> wrote:
> On Sat, May 22, 2010 at 8:21 PM, Keith Goodman <kwgoodman@xxxxxxxxx> wrote:
>> I added a resample module to the la package (la.util.resample). It
>> currently contains bootstrap and k-fold cross validation iterators.
>> The index iterators are not specific to larrys; they return lists of
>> indices. Just thought I'd mention it since they can be used most
>> anywhere. Lots of unit tests are in la.util.tests.resample_test.py.
>>
>> You can optionally set the state of your random number generator
>> outside of the index iterators and pass in shuffle to cv and randint
>> to boot.
>>
>> K-fold cross validation indices for 5 elements and 2 folds:
>>
>> >>> from la.util.resample import cv
>> >>> for train, test in cv(5,2):
>> ...     print
>> ...     print 'train: ', train
>> ...     print 'test: ', test
>> ...
>>
>> train: [4, 3, 1]
>> test: [0, 2]
>>
>> train: [0, 2]
>> test: [4, 3, 1]
>>
>> Three bootstrap samples taken with replacement from four elements:
>>
>> >>> from la.util.resample import boot
>> >>> for train, test in boot(4, 3):
>> ...     print
>> ...     print 'train: ', train
>> ...     print 'test: ', test
>> ...
>>
>> train: [2 1 3 1]
>> test: [0]
>>
>> train: [1 1 2 1]
>> test: [0, 3]
>>
>> train: [1 3 0 0]
>> test: [2]
>>
>> http://bazaar.launchpad.net/~kwgoodman/larry/trunk/annotate/head:/la/util/resample.py
>> http://bazaar.launchpad.net/~kwgoodman/larry/trunk/annotate/head:/la/util/tests/resample_test.py
>
> 2 design questions
>
> why did you choose to use a fixed seed by default? (I'm not completely
> sure how using RandomState directly works, I usually just use
> random.seed)
Resetting the random number generator to a fixed state was a bad
choice. I'll remove it. Alan gave an answer to the other question. It
was just the way I learned how to do it.
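For anyone unsure how RandomState differs from seeding the global generator, here is a minimal sketch using only NumPy (it does not call la). A RandomState instance carries its own state, so its randint and shuffle methods can be passed around without touching the global generator. The seed values are arbitrary.

```python
import numpy as np

# Two independent generators created with the same seed.
rs_a = np.random.RandomState(12345)
rs_b = np.random.RandomState(12345)

draw_a = rs_a.randint(0, 4, size=4)

# Reseeding the global generator does not disturb rs_b.
np.random.seed(999)
draw_b = rs_b.randint(0, 4, size=4)

# Same seed, same private state, so the draws agree.
same = bool((draw_a == draw_b).all())
```

A bound method such as rs_a.randint or rs_a.shuffle is what one would hand to an index iterator that accepts a randint or shuffle argument.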
> In some early leave one out loops, I also used indices to select. The
> scikits.learn cross_val iterators use boolean index arrays. Do you
> have any idea whether integer or boolean indices are faster?
A bool index would not work for a bootstrap since the training set has
repeats. As for speed, there is also the issue that boot returns a
training and testing set of indices. Yet a testing set is often not
used in bootstraps. And calculating the testing set is probably the
slowest part. For indexing into arrays I don't know which is faster,
indices or bool index.
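To illustrate the point about repeats, a small sketch with plain NumPy (the data values and index arrays are made up for the example):

```python
import numpy as np

data = np.array([10.0, 20.0, 30.0, 40.0])

# A bootstrap training sample drawn with replacement: index 1 appears three times.
train_idx = np.array([1, 1, 2, 1])
train = data[train_idx]          # integer indexing keeps the repeats

# A boolean mask can only record whether an element is in or out.
mask = np.zeros(data.size, dtype=bool)
mask[train_idx] = True
masked = data[mask]              # the repeats are lost
```

Here train holds four values while masked holds only the two distinct ones, so a bool index cannot represent the bootstrap training set.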
> Does boot work if nboot=n (no testsample) ?
Yes, nboot=n works. And boot only returns samples that have test data.
Empty test data indices are rejected. kfold=n, leave one out, also
works in cv.
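The rejection of empty test sets could be sketched roughly as follows. This is not the code in la.util.resample; the function name bootstrap_index_sketch and the rng argument are hypothetical, and the sketch uses the standard library only.

```python
import random

def bootstrap_index_sketch(n, nboot, rng=None):
    """Yield nboot (train, test) index lists for n elements.

    Hypothetical illustration, not la's implementation.
    """
    if n < 2:
        # With a single element the test set is always empty.
        raise ValueError("n must be at least 2")
    rng = rng or random.Random()
    count = 0
    while count < nboot:
        # Draw n indices with replacement for the training set.
        train = [rng.randrange(n) for _ in range(n)]
        chosen = set(train)
        # The test set is whatever was never drawn.
        test = [i for i in range(n) if i not in chosen]
        if not test:
            continue  # reject samples that leave no test data
        count += 1
        yield train, test

samples = list(bootstrap_index_sketch(4, 3, random.Random(0)))
```

Finding the undrawn indices costs an extra pass over the sample, which is consistent with the guess above that the test set is the slow part.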
> I find the function names, especially cv (crossval_random_kfold?), a
> bit too short and unspecific.
How about cv --> cross_validation and boot --> bootstrap?
> I think we will have more design questions, when we start to use this
> (or similar) more systematically than just some eclectic examples of
> bootstrap as we have until now.
I'm thinking of adding two more (convenience) functions to the
resample module. Good design?
bootstrap(arr, nboot, axis=0, randint=None) --> numpy array iterator
bootstrap_index(n, nboot, randint=None) --> index iterator
cross_validation(arr, kfold, axis=0, shuffle=None) --> numpy array iterator
cross_validation_index(n, kfold, shuffle=None) --> index iterator
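One way the array iterator could sit on top of the index iterator is sketched below. The names and signatures follow the proposal above, but the bodies are assumptions for illustration and are not la's code. In particular, the fold assignment by striding and the fallback to random.shuffle are arbitrary choices.

```python
import random
import numpy as np

def cross_validation_index(n, kfold, shuffle=None):
    """Yield (train, test) index lists for k-fold cross validation."""
    if not 2 <= kfold <= n:
        raise ValueError("kfold must satisfy 2 <= kfold <= n")
    if shuffle is None:
        shuffle = random.shuffle  # assumed default for this sketch
    idx = list(range(n))
    shuffle(idx)  # shuffles the list in place
    # Deal the shuffled indices into kfold groups.
    folds = [idx[i::kfold] for i in range(kfold)]
    for k in range(kfold):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

def cross_validation(arr, kfold, axis=0, shuffle=None):
    """Yield (train, test) arrays taken along the given axis."""
    arr = np.asarray(arr)
    for train, test in cross_validation_index(arr.shape[axis], kfold, shuffle):
        yield arr.take(train, axis=axis), arr.take(test, axis=axis)

# Leave one out on 5 elements, with a seeded shuffle passed in.
loo = list(cross_validation_index(5, 5, random.Random(0).shuffle))
```

Keeping the index iterator separate means the array version is only a thin wrapper, and the same indices can still be used on objects other than NumPy arrays.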