larry-discuss team mailing list archive
Message #00152
Re: [pystatsmodels] Re: Bootstrap and cross validation iterators
On Sun, May 23, 2010 at 8:22 AM, Keith Goodman <kwgoodman@xxxxxxxxx> wrote:
> On Sat, May 22, 2010 at 8:24 PM, <josef.pktd@xxxxxxxxx> wrote:
>> On Sat, May 22, 2010 at 8:21 PM, Keith Goodman <kwgoodman@xxxxxxxxx> wrote:
>>> I added a resample module to the la package (la.util.resample). It
>>> currently contains bootstrap and k-fold cross validation iterators.
>>> The index iterators are not specific to larrys; they return lists of
>>> indices. Just thought I'd mention it since they can be used most
>>> anywhere. Lots of unit tests are in la.util.tests.resample_test.py.
>>>
>>> You can optionally set the state of your random number generator
>>> outside of the index iterators and pass in shuffle to cv and randint
>>> to boot.
>>>
>>> K-fold cross validation indices for 5 elements and 2 folds:
>>>
>>> >>> from la.util.resample import cv
>>> >>> for train, test in cv(5,2):
>>> ...     print
>>> ...     print 'train: ', train
>>> ...     print 'test: ', test
>>> ...
>>>
>>> train: [4, 3, 1]
>>> test: [0, 2]
>>>
>>> train: [0, 2]
>>> test: [4, 3, 1]
>>>
>>> Three bootstrap samples taken with replacement from four elements:
>>>
>>> >>> from la.util.resample import boot
>>> >>> for train, test in boot(4, 3):
>>> ...     print
>>> ...     print 'train: ', train
>>> ...     print 'test: ', test
>>> ...
>>>
>>> train: [2 1 3 1]
>>> test: [0]
>>>
>>> train: [1 1 2 1]
>>> test: [0, 3]
>>>
>>> train: [1 3 0 0]
>>> test: [2]
>>>
>>> http://bazaar.launchpad.net/~kwgoodman/larry/trunk/annotate/head:/la/util/resample.py
>>> http://bazaar.launchpad.net/~kwgoodman/larry/trunk/annotate/head:/la/util/tests/resample_test.py
>>
>> Two design questions:
>>
>> why did you choose to use a fixed seed by default? (I'm not completely
>> sure how using RandomState directly works, I usually just use
>> random.seed)
>
> Resetting the random number generator to a fixed state was a bad
> choice. I'll remove it. Alan gave an answer to the other question. It
> was just the way I learned how to do it.
>
>> In some early leave one out loops, I also used indices to select. The
>> scikits.learn cross_val iterators use boolean index arrays. Do you
>> have any idea whether integer or boolean indices are faster?
>
> A bool index would not work for a bootstrap since the training set has
> repeats. As for speed, there is also the issue that boot returns a
> training and testing set of indices. Yet a testing set is often not
> used in bootstraps. And calculating the testing set is probably the
> slowest part. For indexing into arrays I don't know which is faster,
> indices or bool index.
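
To illustrate the point about repeats, here is a minimal sketch in plain Python. It does not use la or NumPy, and the helper names to_mask and from_mask are made up for this example. A boolean mask records only whether a position is selected, so converting a bootstrap sample to a mask and back loses the duplicates:

```python
def to_mask(indices, n):
    """Return a boolean mask of length n that is True at every selected index."""
    selected = set(indices)
    return [i in selected for i in range(n)]


def from_mask(mask):
    """Return the positions where the mask is True, in ascending order."""
    return [i for i, flag in enumerate(mask) if flag]


data = ['a', 'b', 'c', 'd']
train = [2, 1, 3, 1]  # first bootstrap sample from the example above; index 1 appears twice

# Integer indices keep both the repeat and the sampling order.
by_index = [data[i] for i in train]

# A mask can only say "in" or "out", so the repeat and the order are lost.
by_mask = [data[i] for i in from_mask(to_mask(train, len(data)))]

print(by_index)  # ['c', 'b', 'd', 'b']
print(by_mask)   # ['b', 'c', 'd']
```

The same limitation applies to a NumPy boolean index array, which selects each element at most once. This says nothing about which kind of index is faster for cross validation, where no repeats occur.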
>
>> Does boot work if nboot=n (no testsample) ?
>
> Yes, nboot=n works. And boot only returns samples that have test data.
> Empty test data indices are rejected. kfold=n, leave one out, also
> works in cv.
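
The rejection behavior described above could look roughly like the following. This is only a sketch of the idea, not the code in la.util.resample: the name bootstrap_index_sketch and the seed parameter are assumptions (the real boot takes a randint function instead), and it uses the standard library random module rather than NumPy.

```python
import random


def bootstrap_index_sketch(n, nboot, seed=None):
    """Yield nboot (train, test) pairs of index lists for n elements.

    train: n indices drawn with replacement.
    test:  the indices that were never drawn.
    Samples whose test list is empty are rejected and redrawn.
    """
    if n < 2:
        # With a single element the test list is always empty,
        # so the rejection loop below would never finish.
        raise ValueError("n must be at least 2 so that a test set can exist")
    rng = random.Random(seed)
    count = 0
    while count < nboot:
        train = [rng.randrange(n) for _ in range(n)]
        drawn = set(train)
        test = [i for i in range(n) if i not in drawn]
        if not test:
            continue  # every element was drawn; reject this sample
        count += 1
        yield train, test


for train, test in bootstrap_index_sketch(4, 3, seed=0):
    print('train:', train, 'test:', test)
```

Building the test list costs an extra pass over range(n) for every sample, which fits the remark that it is probably the slowest part when the test set is not needed.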
>
>> I find the function names, especially cv (crossval_random_kfold?), a
>> bit too short and unspecific.
>
> How about cv --> cross_validation and boot --> bootstrap?
>
>> I think we will have more design questions, when we start to use this
>> (or similar) more systematically than just some eclectic examples of
>> bootstrap as we have until now.
>
> I'm thinking of adding two more (convenience) functions to the
> resample module. Good design?
>
> bootstrap(arr, nboot, axis=0, randint=None) --> numpy array iterator
> bootstrap_index(n, nboot, randint=None) --> index iterator
>
> cross_validation(arr, kfold, axis=0, shuffle=None) --> numpy array iterator
> cross_validation_index(n, kfold, shuffle=None) --> index iterator
I forgot the most important part: Thank you for the improvements, Josef!
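
For reference, one way the proposed index iterator for cross validation could behave is sketched below. This is an illustration under stated assumptions, not the la implementation: the name cross_validation_index_sketch is made up, shuffle is assumed to be a function that shuffles a list in place, and the folds are formed by taking every kfold-th shuffled index.

```python
import random


def cross_validation_index_sketch(n, kfold, shuffle=None):
    """Yield kfold (train, test) pairs of index lists for n elements.

    shuffle: optional function that shuffles a list in place. It defaults
    to random.shuffle; pass random.Random(0).shuffle for reproducible folds.
    """
    if not 2 <= kfold <= n:
        raise ValueError("kfold must satisfy 2 <= kfold <= n")
    if shuffle is None:
        shuffle = random.shuffle
    index = list(range(n))
    shuffle(index)
    for k in range(kfold):
        # Fold k tests every kfold-th shuffled index, starting at position k.
        test = index[k::kfold]
        # Everything else in the shuffled list is training data.
        train = [idx for pos, idx in enumerate(index) if pos % kfold != k]
        yield train, test


rng = random.Random(0)  # state is set outside the iterator, as in the original design
for train, test in cross_validation_index_sketch(5, 2, shuffle=rng.shuffle):
    print('train:', train, 'test:', test)
```

With kfold equal to n, every test list has exactly one index, which is the leave-one-out case mentioned above.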