larry-discuss team mailing list archive
Message #00156
Re: [pystatsmodels] Re: Bootstrap and cross validation iterators
On Tue, May 25, 2010 at 8:20 AM, <josef.pktd@xxxxxxxxx> wrote:
>
> crossvalidation(..., shuffle=lambda x:x) can do leave 1 out on the
> original sample. It considers consecutive folds in the shuffled array
> like KFold in scikits.learn.
That's clever. But cross_validation(n, n) will already give you each
data point exactly once in the test index:
>>> [idx for idx in cross_validation(3,3)]
[([2, 1], [0]), ([0, 1], [2]), ([0, 2], [1])]
>>> [idx for idx in cross_validation(3,3)]
[([0, 2], [1]), ([1, 2], [0]), ([1, 0], [2])]
The order is not the same each time, but I guess that is OK.
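To make the behavior concrete, here is a minimal sketch of such an index iterator. It is not the actual resample.py code, just an illustration of the semantics above: the indices are shuffled (pass shuffle=lambda x: x to keep the original order, as suggested), then consecutive folds of the shuffled index are used like KFold in scikits.learn.

```python
import random

def cross_validation(n, kfold, shuffle=None):
    """Yield (train_idx, test_idx) pairs; a sketch, not resample.py itself.

    `shuffle` takes and returns a list of indices, so shuffle=lambda x: x
    does leave-one-out on the original sample order when kfold == n.
    """
    idx = list(range(n))
    if shuffle is None:
        random.shuffle(idx)  # default: random fold assignment
    else:
        idx = list(shuffle(idx))
    # consecutive folds in the shuffled index, like KFold in scikits.learn
    bounds = [round(k * n / kfold) for k in range(kfold + 1)]
    folds = [idx[bounds[k]:bounds[k + 1]] for k in range(kfold)]
    for k in range(kfold):
        test = folds[k]
        train = [i for i in idx if i not in test]
        yield train, test
```

With kfold=n, each data point lands in the test index exactly once, in shuffled order.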
> But for leave k out (which would be similar to kfold=n-2),
> scikits.learn uses all combinations.
Yes, that's a nice one. There are some special functions that I could
add to resample.py, which I would do if I began to see some use.
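The all-combinations leave-k-out that scikits.learn uses can be sketched with itertools (the function name here is hypothetical, not an existing resample.py function):

```python
from itertools import combinations

def leave_k_out_index(n, k):
    """Yield (train_idx, test_idx) for every possible size-k test set.

    A sketch: enumerates all C(n, k) train/test splits.
    """
    all_idx = set(range(n))
    for test in combinations(range(n), k):
        yield sorted(all_idx - set(test)), list(test)
```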
I'd like a function that does a random partition of the data. Something like:
partition(n, [0.2, 0.8])
which would randomly partition the data into two samples of (approx)
20% and 80%. Trivial to write. But I didn't want to fill
up the module without using it first and without getting the great
feedback you are providing.
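A sketch of that partition function, assuming the fractions sum to one (the last group absorbs any rounding error):

```python
import random

def partition(n, fractions):
    """Randomly split the indices range(n) into len(fractions) groups.

    A sketch of the proposed convenience function, not resample.py code.
    """
    idx = list(range(n))
    random.shuffle(idx)
    groups, start = [], 0
    for frac in fractions[:-1]:
        stop = start + int(round(frac * n))
        groups.append(idx[start:stop])
        start = stop
    groups.append(idx[start:])  # last group takes the remainder
    return groups
```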
> I find your idea of doing a random shuffle first, also useful for an
> all combinations leavekout, because if we stop early with a large
> dataset the used training/test datasets would be random not
> deterministic.
Yes, in CV the partitioning of the data into the kfolds can be the
biggest source of noise. Sometimes you want to run more than once.
>>> I find the function names, especially cv (crossval_random_kfold?), a
>>> bit too short and unspecific.
>>
>> How about cv --> cross_validation and boot --> bootstrap?
>
> In statsmodels I would add index or iter to the name, because we will
> have functions or classes that will do the actual bootstrap analysis.
Admit it! You want cross_validation_index_iterator. Both index and
iter give useful info. I think index gives more info.
>> I'm thinking of adding two more (convenience) functions to the
>> resample module. Good design?
>>
>> bootstrap(arr, nboot, axis=0, randint=None) --> numpy array iterator
>> bootstrap_index(n, nboot, randint=None) --> index iterator
>>
>> cross_validation(arr, kfold, axis=0, shuffle=None) --> numpy array iterator
>> cross_validation_index(n, kfold, shuffle=None) --> index iterator
>
> I don't know what the numpy array iterator would do. For a 2d dataset,
> the cross-validation or bootstrap index would only be used along one
> axis, the second axis (if it exists) would be arbitrary.
> This can be easily done with your current index functions.
For CV, if the input is a 2d array and axis=0, I was thinking of
returning two 2d arrays: ar[idx_train,:] and ar[idx_test,:]. It would
be a convenience function.
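A sketch of that convenience function, using numpy.take so the same code handles any axis; the index generation is inlined here for illustration and is not the actual resample.py implementation:

```python
import numpy as np

def cross_validation_array(arr, kfold, axis=0):
    """Yield (train, test) arrays split along `axis` (a sketch)."""
    n = arr.shape[axis]
    idx = np.arange(n)
    np.random.shuffle(idx)
    folds = np.array_split(idx, kfold)  # consecutive folds of the shuffled index
    for k in range(kfold):
        train_idx = np.concatenate(folds[:k] + folds[k + 1:])
        # equivalent to ar[idx_train, :] / ar[idx_test, :] when axis=0
        yield (np.take(arr, train_idx, axis=axis),
               np.take(arr, folds[k], axis=axis))
```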