larry-discuss team mailing list archive

Missing values in larry

To: larry-discuss@xxxxxxxxxxxxxxxxxxx
From: Keith Goodman <kwgoodman@xxxxxxxxx>
Date: Thu, 4 Feb 2010 08:27:36 -0800

larry currently considers NaNs as missing values. That statement is
true in most, but not all, cases. Some larry methods use
~np.isfinite() as the missing mask; others use np.isnan(). Each larry
method is responsible for creating its own mask.
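
To make the difference between the two masks concrete, here is a small
sketch in plain numpy (not larry code). The array contents are made up
for illustration. The two masks agree on NaN but disagree on inf:

```python
import numpy as np

# Made-up data: one ordinary value, one NaN, and both infinities.
x = np.array([1.0, np.nan, np.inf, -np.inf])

nan_mask = np.isnan(x)            # flags only NaN
nonfinite_mask = ~np.isfinite(x)  # flags NaN and +/-inf

print(nan_mask)
print(nonfinite_mask)
```

So a method that uses ~np.isfinite() treats inf as missing, while a
method that uses np.isnan() does not.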

Some methods, like merge and morph, convert int input to float because
they can create missing values and an int array cannot hold NaN.
Methods that use morph internally do the same int-to-float conversion.
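
A quick illustration of why the conversion is needed, again in plain
numpy rather than larry itself. The variable names are only for this
example:

```python
import numpy as np

x = np.array([1, 2, 3])  # int dtype

# Assigning NaN into an int array fails: NaN has no integer representation.
try:
    x[0] = np.nan
    stored = True
except ValueError:
    stored = False

# Casting to float first makes room for NaN.
xf = x.astype(float)
xf[0] = np.nan

print(stored, xf)
```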

A couple of functions have specialized code to handle missing values
for str and object dtype; others don't have any code for it.

I wonder if it would simplify things to add a third piece of data to a
larry. Right now a larry consists of a numpy array (x) and a list
(label). We could add a third item: a bool numpy array (m) to mark
missing values. So:

lar.label
lar.x
lar.m

That would centralize the missing value handling.

If no mask was specified upon creating a larry then a mask would be
created based on dtype:

int             -> no missing values
float           -> isnan
str             -> == ''
object          -> is None
any other dtype -> no missing values

That would preserve compatibility with la 0.1.
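
A rough sketch of how the default mask could be built from the dtype
rules above. The function name default_mask is hypothetical and nothing
like it exists in larry today. I am also assuming that both unicode and
byte strings count as "str":

```python
import numpy as np

def default_mask(x):
    """Hypothetical helper: bool mask of missing values, chosen by dtype."""
    x = np.asarray(x)
    kind = x.dtype.kind
    if kind == 'f':
        # float -> isnan
        return np.isnan(x)
    if kind == 'U':
        # str -> == ''
        return x == ''
    if kind == 'S':
        # byte strings (assumption: treated like str) -> == b''
        return x == b''
    if kind == 'O':
        # object -> is None, checked element by element
        flags = [v is None for v in x.ravel()]
        return np.array(flags, dtype=bool).reshape(x.shape)
    # int and any other dtype -> no missing values
    return np.zeros(x.shape, dtype=bool)

print(default_mask(np.array([1.0, np.nan])))
print(default_mask(np.array(['a', ''])))
```

A mask passed in by the user would simply bypass this function.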

I don't think we'd need to change any of the unary functions. For the
binary functions, the output mask would just be lar1.m | lar2.m. I
don't think the change would be that hard, though it is hard to guess.
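
For example, a binary add might combine masks like this. This is only a
sketch: it uses bare arrays in place of lar1.x, lar1.m and so on, and it
assumes the two larrys are already aligned on the same labels:

```python
import numpy as np

# Stand-ins for lar1.x / lar1.m and lar2.x / lar2.m (made-up data).
# Values sitting under a True mask entry are placeholders.
x1 = np.array([1, 2, 3])
m1 = np.array([False, True, False])
x2 = np.array([10, 20, 30])
m2 = np.array([False, False, True])

x_out = x1 + x2  # elementwise result
m_out = m1 | m2  # missing if missing in either input

print(x_out, m_out)
```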

I think it would simplify maintenance by making missing value handling
more explicit. And it would give the user full control if they wanted
it.