larry-discuss team mailing list archive

Re: [pystatsmodels] Re: Bootstrap and cross validation iterators

 

On Tue, May 25, 2010 at 12:07 PM, Keith Goodman <kwgoodman@xxxxxxxxx> wrote:
> On Tue, May 25, 2010 at 8:20 AM,  <josef.pktd@xxxxxxxxx> wrote:
>>
>> crossvalidation(..., shuffle=lambda x: x) can do leave-one-out on the
>> original sample. It uses consecutive folds of the shuffled array,
>> like KFold in scikits.learn.
>
> That's clever. But cross_validation(n, n) will already give you each
> data point exactly once in the test index:
>
> >>> [idx for idx in cross_validation(3,3)]
>   [([2, 1], [0]), ([0, 1], [2]), ([0, 2], [1])]
> >>> [idx for idx in cross_validation(3,3)]
>   [([0, 2], [1]), ([1, 2], [0]), ([1, 0], [2])]
>
> The order is not the same each time, but I guess that is OK.
>
>> But for leave k out (which would be similar to kfold=n-2),
>> scikits.learn uses all combinations.
>
> Yes, that's a nice one. There are some special functions that I could
> add to resample.py, which I would do if I began to see some use.
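
(A minimal sketch of such an all-combinations leave-k-out index
iterator; the name leave_k_out_index is just a placeholder:)

import itertools
import numpy as np

def leave_k_out_index(n, k):
    # yield (train_index, test_index) pairs, one pair for every
    # possible size-k test set drawn from n observations
    indices = np.arange(n)
    for test in itertools.combinations(indices, k):
        test = np.array(test)
        yield np.setdiff1d(indices, test), test

>>> [(tr.tolist(), te.tolist()) for tr, te in leave_k_out_index(3, 2)]
[([2], [0, 1]), ([1], [0, 2]), ([0], [1, 2])]
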
>
> I'd like a function that does a random partition of the data. Something like:
>
> partition(n, [0.2, 0.8])
>
> which would randomly partition the data into two samples of (approx)
> 20% and 80%. Trivial to write, but I didn't want to fill up the
> module without using it first and without getting the great feedback
> you are providing.
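
(A sketch of what that could look like, assuming the fractions sum to
one; partition is just the name proposed above:)

import numpy as np

def partition(n, fractions):
    # randomly split the indices 0..n-1 into len(fractions) disjoint
    # groups with approximately the given proportions
    idx = np.random.permutation(n)
    bounds = np.round(np.cumsum(fractions) * n).astype(int)
    return np.split(idx, bounds[:-1])

>>> [len(p) for p in partition(10, [0.2, 0.8])]
[2, 8]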

I agree with waiting on the details and a lot of the enhancements until
we actually have cases where we use them. I looked at it recently
because of the discussion of Principal Component Regression with
cross-validation, but that's just an example so far.

(One of the main applications that also requires different iterators
will be bootstrap and cross-validation for time series analysis.)

>
>> I find your idea of doing a random shuffle first also useful for an
>> all-combinations leave-k-out, because if we stop early with a large
>> dataset, the training/test datasets used would be random, not
>> deterministic.
>
> Yes, in CV the partitioning of the data into the k folds can be the
> biggest source of noise. Sometimes you want to run more than once.
>
>>>> I find the function names, especially cv (crossval_random_kfold?), a
>>>> bit too short and unspecific.
>>>
>>> How about cv --> cross_validation and boot --> bootstrap?
>>
>> In statsmodels I would add index or iter to the name, because we will
>> have functions or classes that will do the actual bootstrap analysis.
>
> Admit it! You want cross_validation_index_iterator. Both index and
> iter give useful info. I think index gives more info.

Actually, I prefer just _iterator; whether it's index or boolean
indexing won't be relevant for standard usage.

>>> I'm thinking of adding two more (convenience) functions to the
>>> resample module. Good design?
>>>
>>> bootstrap(arr, nboot, axis=0, randint=None) --> numpy array iterator
>>> bootstrap_index(n, nboot, randint=None) --> index iterator
>>>
>>> cross_validation(arr, kfold, axis=0, shuffle=None) --> numpy array iterator
>>> cross_validation_index(n, kfold, shuffle=None) --> index iterator
>>
>> I don't know what the numpy array iterator would do. For a 2d dataset,
>> the cross-validation or bootstrap index would only be used along one
>> axis; the second axis (if it exists) would be arbitrary.
>> This can be easily done with your current index functions.
>
> For CV, if the input is a 2d array and axis=0, I was thinking of
> returning two 2d arrays: ar[idx_train,:] and ar[idx_test,:]. It would
> be a convenience function.

This was briefly discussed on the scikits.learn mailing list, but
since indexing into an array is so easy, it was considered
unnecessary; I also think there is not much gain in it.
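
(For comparison, a sketch of the direct indexing, assuming the
cross_validation_index iterator proposed above and a 2d array x:)

for idx_train, idx_test in cross_validation_index(x.shape[0], kfold):
    # slice the 2d array along axis 0 with the generated indices
    x_train, x_test = x[idx_train, :], x[idx_test, :]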

Josef
