larry-discuss team mailing list archive
Message #00156
Re: [pystatsmodels] Re: Bootstrap and cross validation iterators
On Tue, May 25, 2010 at 8:20 AM, <josef.pktd@xxxxxxxxx> wrote:
>
> crossvalidation(..., shuffle=lambda x:x) can do leave 1 out on the
> original sample. It considers consecutive folds in the shuffled array
> like KFold in scikits.learn.
That's clever. But cross_validation(n, n) will already give you each
data point exactly once in the test index:
>>> [idx for idx in cross_validation(3,3)]
[([2, 1], [0]), ([0, 1], [2]), ([0, 2], [1])]
>>> [idx for idx in cross_validation(3,3)]
[([0, 2], [1]), ([1, 2], [0]), ([1, 0], [2])]
The order is not the same each time, but I guess that is OK.
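To make the behavior concrete, here is a minimal sketch of such an index iterator. It is not the actual resample.py code, just an illustration of the semantics above: the indices are shuffled (pass shuffle=lambda x: x to keep the original order, as suggested), then consecutive folds of the shuffled index are used like KFold in scikits.learn.

```python
import random

def cross_validation(n, kfold, shuffle=None):
    """Yield (train_idx, test_idx) pairs; a sketch, not resample.py itself.

    `shuffle` takes and returns a list of indices, so shuffle=lambda x: x
    does leave-one-out on the original sample order when kfold == n.
    """
    idx = list(range(n))
    if shuffle is None:
        random.shuffle(idx)  # default: random fold assignment
    else:
        idx = list(shuffle(idx))
    # consecutive folds in the shuffled index, like KFold in scikits.learn
    bounds = [round(k * n / kfold) for k in range(kfold + 1)]
    folds = [idx[bounds[k]:bounds[k + 1]] for k in range(kfold)]
    for k in range(kfold):
        test = folds[k]
        train = [i for i in idx if i not in test]
        yield train, test
```

With kfold=n, each data point lands in the test index exactly once, in shuffled order.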
> But for leave k out (which would be similar to kfold=n-2),
> scikits.learn uses all combinations.
Yes, that's a nice one. There are some special functions that I could
add to resample.py, which I would do if I began to see some use.
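The all-combinations leave-k-out that scikits.learn uses can be sketched with itertools (the function name here is hypothetical, not an existing resample.py function):

```python
from itertools import combinations

def leave_k_out_index(n, k):
    """Yield (train_idx, test_idx) for every possible size-k test set.

    A sketch: enumerates all C(n, k) train/test splits.
    """
    all_idx = set(range(n))
    for test in combinations(range(n), k):
        yield sorted(all_idx - set(test)), list(test)
```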
I'd like a function that does a random partition of the data. Something like:
partition(n, [0.2, 0.8])
which would randomly partition the data into two samples of (approx)
20% and 80%. Trivial to write. But I didn't want to fill
up the module without using it first and without getting the great
feedback you are providing.
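A sketch of that partition function, assuming the fractions sum to one (the last group absorbs any rounding error):

```python
import random

def partition(n, fractions):
    """Randomly split the indices range(n) into len(fractions) groups.

    A sketch of the proposed convenience function, not resample.py code.
    """
    idx = list(range(n))
    random.shuffle(idx)
    groups, start = [], 0
    for frac in fractions[:-1]:
        stop = start + int(round(frac * n))
        groups.append(idx[start:stop])
        start = stop
    groups.append(idx[start:])  # last group takes the remainder
    return groups
```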
> I find your idea of doing a random shuffle first, also useful for an
> all combinations leavekout, because if we stop early with a large
> dataset the used training/test datasets would be random not
> deterministic.
Yes, in CV the partitioning of the data into the kfolds can be the
biggest source of noise. Sometimes you want to run more than once.
>>> I find the function names, especially cv (crossval_random_kfold?), a
>>> bit too short and unspecific.
>>
>> How about cv --> cross_validation and boot --> bootstrap?
>
> In statsmodels I would add index or iter to the name, because we will
> have functions or classes that will do the actual bootstrap analysis.
Admit it! You want cross_validation_index_iterator. Both index and
iter give useful info. I think index gives more info.
>> I'm thinking of adding two more (convenience) functions to the
>> resample module. Good design?
>>
>> bootstrap(arr, nboot, axis=0, randint=None) --> numpy array iterator
>> bootstrap_index(n, nboot, randint=None) --> index iterator
>>
>> cross_validation(arr, kfold, axis=0, shuffle=None) --> numpy array iterator
>> cross_validation_index(n, kfold, shuffle=None) --> index iterator
>
> I don't know what the numpy array iterator would do. For a 2d dataset,
> the cross-validation or bootstrap index would only be used along one
> axis, the second axis (if it exists) would be arbitrary.
> This can be easily done with your current index functions.
For CV, if the input is a 2d array and axis=0, I was thinking of
returning two 2d arrays: ar[idx_train,:] and ar[idx_test,:]. It would
be a convenience function.
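A sketch of that convenience function, using numpy.take so the same code handles any axis; the index generation is inlined here for illustration and is not the actual resample.py implementation:

```python
import numpy as np

def cross_validation_array(arr, kfold, axis=0):
    """Yield (train, test) arrays split along `axis` (a sketch)."""
    n = arr.shape[axis]
    idx = np.arange(n)
    np.random.shuffle(idx)
    folds = np.array_split(idx, kfold)  # consecutive folds of the shuffled index
    for k in range(kfold):
        train_idx = np.concatenate(folds[:k] + folds[k + 1:])
        # equivalent to ar[idx_train, :] / ar[idx_test, :] when axis=0
        yield (np.take(arr, train_idx, axis=axis),
               np.take(arr, folds[k], axis=axis))
```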