dhis2-devs team mailing list archive
Message #39989
Re: [Bug 1065014] Re: Min/Max generation goes into negative
Truncation was a poor choice of words. Let me explain what I mean.
From MinMaxOutlierAnalysisService.java we can see what is happening.
...
    for ( Integer unit : averages.keySet() )
    {
        Double stdDev = standardDeviations.get( unit );
        Double avg = averages.get( unit );

        if ( stdDev != null && avg != null )
        {
            int min = (int) MathUtils.getLowBound( stdDev, stdDevFactor, avg );
            int max = (int) MathUtils.getHighBound( stdDev, stdDevFactor, avg );

            if ( ValueType.INTEGER_POSITIVE == valueType ||
                ValueType.INTEGER_ZERO_OR_POSITIVE == valueType )
            {
                min = Math.max( 0, min ); // Cannot be < 0
            }

            if ( ValueType.INTEGER_NEGATIVE == valueType )
            {
                max = Math.min( 0, max ); // Cannot be > 0
            }
...
As you can see, the standard deviation is used together with the mean to
calculate the bounds. If the data element is a positive (or zero-or-positive)
integer, a lower bound below zero is raised to zero; if it is a negative
integer, an upper bound above zero is lowered to zero. I am not sure what
the theoretical basis for this really is, since up to this point it looks
like we are effectively treating the data as normally distributed, given
that the bounds are derived from the mean and standard deviation.
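To make the calculation concrete, here is a minimal sketch of what the snippet above appears to do. The bodies of MathUtils.getLowBound and MathUtils.getHighBound are not shown in this thread, so the sketch assumes they return the mean minus/plus the factor times the standard deviation, and it assumes a population standard deviation (dividing by n). The class MinMaxSketch, its methods and the sample data are hypothetical and only for illustration.

```java
import java.util.Arrays;

class MinMaxSketch
{
    // Arithmetic mean of the sample.
    static double mean( double[] values )
    {
        return Arrays.stream( values ).average().orElse( Double.NaN );
    }

    // Population standard deviation (divides by n). This is an assumption;
    // the thread does not say which variant the real code uses.
    static double stdDev( double[] values )
    {
        double avg = mean( values );
        double sumSq = 0.0;
        for ( double v : values )
        {
            sumSq += ( v - avg ) * ( v - avg );
        }
        return Math.sqrt( sumSq / values.length );
    }

    // Returns { min, max } as mean -/+ factor * stdDev, cast to int as in
    // the quoted snippet, with the lower bound raised to zero on request.
    static int[] bounds( double[] values, double factor, boolean zeroOrPositive )
    {
        double avg = mean( values );
        double sd = stdDev( values );
        int min = (int) ( avg - factor * sd );
        int max = (int) ( avg + factor * sd );
        if ( zeroOrPositive )
        {
            min = Math.max( 0, min ); // Cannot be < 0
        }
        return new int[] { min, max };
    }

    public static void main( String[] args )
    {
        // Ten ordinary months of zero and two "campaign" months of 60.
        double[] monthly = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 60, 60 };
        System.out.println( Arrays.toString( bounds( monthly, 2.0, false ) ) );
        System.out.println( Arrays.toString( bounds( monthly, 2.0, true ) ) );
    }
}
```

With this made-up sample and a factor of 2, the unclamped bounds come out as [-34, 54], which is the negative minimum reported in the bug. For a zero-or-positive type the lower bound is raised to 0 while the upper bound stays at 54, so the interval is no longer symmetric around the mean.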
However, there are some issues with this:
1) We are likely dealing with discrete data, not continuous.
2) We are dealing with small samples of the population (perhaps the
last 12 months of data).
3) We are not checking whether the data actually is normally
distributed.
4) We are not excluding any outliers in the calculation of the mean and
standard deviation.
So, problems one through four.
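On point 3, a very simple first check is the sample skewness, which is zero for any symmetric distribution (including the normal) and positive when there is a long tail of high values. A skewness near zero does not prove normality, but a clearly non-zero value is a warning sign. The sketch below uses the moment-based definition m3 / m2^(3/2); the class SkewnessSketch and the sample data are hypothetical.

```java
class SkewnessSketch
{
    // Moment-based sample skewness: m3 / m2^(3/2), where m2 and m3 are the
    // second and third central moments (both divided by n).
    static double skewness( double[] values )
    {
        int n = values.length;
        double sum = 0.0;
        for ( double v : values )
        {
            sum += v;
        }
        double avg = sum / n;

        double m2 = 0.0;
        double m3 = 0.0;
        for ( double v : values )
        {
            double d = v - avg;
            m2 += d * d;
            m3 += d * d * d;
        }
        m2 /= n;
        m3 /= n;

        if ( m2 == 0.0 )
        {
            return Double.NaN; // All values identical, skewness undefined
        }
        return m3 / Math.pow( m2, 1.5 );
    }

    public static void main( String[] args )
    {
        double[] symmetric = { 1, 2, 3, 4, 5 };
        // Ten ordinary months of zero and two "campaign" months of 60.
        double[] monthly = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 60, 60 };
        System.out.println( skewness( symmetric ) );
        System.out.println( skewness( monthly ) );
    }
}
```

The symmetric sample gives a skewness of zero, while the campaign-style sample gives a clearly positive value (about 1.79), which is the kind of positive skew one would expect for many routine data elements.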
I assume that the original idea behind using standard deviations and means
was to establish something like a confidence interval: that about 95% of
values would be observed within two standard deviations of the mean. This of
course assumes that we know the population distribution, which we don't,
since typically we are dealing with small sample sizes and have no idea what
the actual population distribution is.
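As a rough check of the 95% expectation, one can count how many of the observed values actually fall within the given number of standard deviations of the mean. This is only a sketch under assumptions: the class CoverageSketch and the sample data are hypothetical, and the population standard deviation (dividing by n) is my choice, not something confirmed by the code quoted above.

```java
class CoverageSketch
{
    // Fraction of values lying within mean -/+ factor * stdDev, using the
    // population standard deviation (divides by n).
    static double coverage( double[] values, double factor )
    {
        int n = values.length;
        double sum = 0.0;
        for ( double v : values )
        {
            sum += v;
        }
        double avg = sum / n;

        double sumSq = 0.0;
        for ( double v : values )
        {
            sumSq += ( v - avg ) * ( v - avg );
        }
        double sd = Math.sqrt( sumSq / n );

        double low = avg - factor * sd;
        double high = avg + factor * sd;

        int inside = 0;
        for ( double v : values )
        {
            if ( v >= low && v <= high )
            {
                inside++;
            }
        }
        return (double) inside / n;
    }

    public static void main( String[] args )
    {
        // Ten ordinary months of zero and two "campaign" months of 60.
        double[] monthly = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 60, 60 };
        System.out.println( coverage( monthly, 2.0 ) );
    }
}
```

For this made-up sample only 10 of the 12 values (about 83%) fall inside two standard deviations, because both campaign months land above the upper bound. A single small sample proves nothing in general, but it shows that the 95% figure cannot simply be taken for granted.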
Next, the method "truncates" (I should have said "modifies") the min/max
depending on the value type, when the calculated min/max fall outside of
the allowable range of that type. Again, this seems sort of logical, except
that we have already calculated what the theoretical min and max should be
based on a normal distribution. Once the upper and/or lower bound is rather
arbitrarily changed, our simple test for a valid value (i.e. the confidence
interval) is thrown out the door. So, problem number five.
So, I would argue that this is in fact a series of assumptions with no real
theoretical basis. It would at least be good to know what the theory is.
However, for the purpose of this bug report, the method is doing what it
has been programmed to do. :)
In the end, as long as people know what the method is doing, and it works
for them, great. But if you look at real data even superficially, you will
notice immediately that many of the assumptions which seem to be made in
this code do not really hold up. Perhaps this was the "design flaw."
Regards,
Jason
On Sun, Sep 20, 2015 at 9:45 AM, Calle Hedberg <calle.hedberg@xxxxxxxxx>
wrote:
> Jason,
>
> "with all the truncation of data going on" - ??
>
> Not sure what you mean by that, but users don't regard min-max values as
> a kind of "hard" range - it has never been intended to be that, except (at
> least in DHIS 1.4) where you specify a min-max to be ABSOLUTE. For
> everything else, it is a simple method to highlight possible outliers - for
> typing mistakes an easy fix, for collection/collation/transcribing mistakes
> often a more involved process (query sent back to staff etc).
>
> The complexity of correcting mistakes made during the manual data
> collection and collation process - often made worse by people tending to be
> stubborn about not admitting mistakes - is one reason for moving electronic
> data capture closer to the actual patient encounters. A typical example is
> South Africa's move from capturing monthly data per facility to capturing
> data into the DHIS on a daily basis per consulting room.
>
> Regards
> Calle
>
> On 19 September 2015 at 17:26, Jason Pickering <
> jason.p.pickering@xxxxxxxxx> wrote:
>
>> Hi Calle,
>> The problem is that the premise upon which this algorithm is based is
>> flawed, I would say. There is really no reason to believe that the data is
>> normally distributed, or should be, unless of course that has been shown to
>> be a reliable and appropriate model. What we are seeking to do is to
>> eliminate outliers, based on a certain statistical model (i.e. the normal
>> distribution). The problem is that the data is often not normally
>> distributed. As a quick example, I prepared a density plot of the skewness
>> of all OU/DE/COC combinations for a real database with a significant amount
>> of data over time, which should be fairly representative of a "real" DHIS
>> database. As a very trivial test of normality, we can examine the skewness
>> and see that the tendency for the database is in fact towards positive
>> skew, which is somewhat expected, as there are probably going to be fewer
>> "high" values than "low" values for many data elements. A perfectly normal
>> distribution would have zero skewness.
>>
>> I still think we need to carefully document what the min-max generation
>> function is actually doing. If it works for people, great, but with all of
>> the truncation of data going on, it may not really be clear to people how
>> these values are actually generated, nor what their limitations may be. We
>> should also introduce an API endpoint for the min-max values, to allow
>> people to generate these outside of the system, based on perhaps more
>> appropriate models than the normal distribution.
>>
>> Regards,
>> Jason
>>
>>
>> On Fri, Sep 18, 2015 at 9:02 PM, Calle Hedberg <calle.hedberg@xxxxxxxxx>
>> wrote:
>>
>>> Hi,
>>>
>>> Ah - bugger, I completely forgot about the zero or positive type, which
>>> provides the same effect (if set). My bad.
>>>
>>> Jason's point is correct, but in my opinion less important for most
>>> types of routine data where the primary function of the min-max values is
>>> to highlight likely data capturing mistakes.
>>>
>>> Regards
>>> Calle
>>>
>>> On 18 September 2015 at 13:10, jason.p.pickering <
>>> 1065014@xxxxxxxxxxxxxxxxxx> wrote:
>>>
>>>> Hi there. The current design is to take the mean, and calculate
>>>> n standard deviations away from the mean, for a given data
>>>> element/orgunit/catcombo set of data values. If the data value is set
>>>> to be a zero or positive integer, can never have a negative value, and
>>>> does not follow a normal distribution, then flooring the projected
>>>> min/max at zero makes little sense. Another distribution would be
>>>> required to determine what the accepted min/max actually are
>>>> (logistical, zero-inflated model, etc.) if the actual distribution is
>>>> not normal.
>>>>
>>>> But per the bug report, the application does what it is supposed to do,
>>>> namely calculate the theoretical min/max based on a statistical routine,
>>>> which itself may not be valid without confirming whether the
>>>> distribution in question actually is normal.
>>>>
>>>> Regards,
>>>> Jason
>>>>
>>>>
>>>> On Fri, Sep 18, 2015 at 11:57 AM, Lars Helge Øverland <
>>>> larshelge@xxxxxxxxx>
>>>> wrote:
>>>>
>>>> > This is not a design flaw. It depends on the data element value type
>>>> > property. The default value type is "number", for which negative
>>>> > values are perfectly valid. One can set the value type to "Positive
>>>> > number", in which case the min-max values will never be less than zero.
>>>> >
>>>> > ** Changed in: dhis2
>>>> > Status: Opinion => Invalid
>>>> >
>>>> > --
>>>> > You received this bug notification because you are a member of DHIS 2
>>>> > developers, which is subscribed to DHIS.
>>>> > https://bugs.launchpad.net/bugs/1065014
>>>> >
>>>> > Title:
>>>> > Min/Max generation goes into negative
>>>> >
>>>> > Status in DHIS:
>>>> > Invalid
>>>> >
>>>> > Bug description:
>>>> > A very minor bug, but the min/max generation algorithm (which I
>>>> > assume is some std. dev) sometimes leads the minimum to be a negative
>>>> > number. Probably not an issue per se for data quality, as the
>>>> > alternative would be to set it to 0 (unless there is a reason why you
>>>> > would enter negative numbers), but the chart you get when you
>>>> > double-click a data entry field is then skewed and does not look very
>>>> > sensible. In extreme cases, with a few very high values and a few
>>>> > months with very low (as when you have campaigns or hand-outs), the
>>>> > minimum can be down to minus a lot.
>>>> >
>>>> > To manage notifications about this bug go to:
>>>> > https://bugs.launchpad.net/dhis2/+bug/1065014/+subscriptions
>>>> >
>>>> > _______________________________________________
>>>> > Mailing list: https://launchpad.net/~dhis2-devs
>>>> > Post to : dhis2-devs@xxxxxxxxxxxxxxxxxxx
>>>> > Unsubscribe : https://launchpad.net/~dhis2-devs
>>>> > More help : https://help.launchpad.net/ListHelp
>>>> >
>>>>
>>>>
>>>> --
>>>> Jason P. Pickering
>>>> email: jason.p.pickering@xxxxxxxxx
>>>> tel:+46764147049
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> *******************************************
>>>
>>> Calle Hedberg
>>>
>>> 46D Alma Road, 7700 Rosebank, SOUTH AFRICA
>>>
>>> Tel/fax (home): +27-21-685-6472
>>>
>>> Cell: +27-82-853-5352
>>>
>>> Iridium SatPhone: +8816-315-19119
>>>
>>> Email: calle.hedberg@xxxxxxxxx
>>>
>>> Skype: calle_hedberg
>>>
>>> *******************************************
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
>
>
--
Jason P. Pickering
email: jason.p.pickering@xxxxxxxxx
tel:+46764147049