dhis2-devs team mailing list archive

Re: The use of dimensions in data entry and data analysis (was: commit message for Rev 938)

 

Bob, I think you hit the nail on the head, and this is sort of what I
was getting at in my previous mails, but will try and explain more
here.

First, I think the current implementation of the DHIS aggregation
engine is clearly the way to go when it comes to materializations of
data values, but as Bob hints, we are not quite there yet.
Calculated data elements and indicators can follow complex aggregation
rules that OLAP does not understand well. This was one of the hard
lessons we learned from the OpenHealth functional prototype. Not all
"multidimensional" data elements follow multidimensional aggregation
rules that OLAP engines are familiar with.  The calculated data
element/indicator functionality allows one to define complex rules for
how data elements should be calculated, which OLAP generally is not
capable of handling. SUM, AVG, COUNT work well with OLAP, but factors,
aggregation start levels, and some of the other features of the
aggregation engine have necessitated a custom solution. Mind you,
there is still room for improvement here. I would personally like
more operators (COUNT, STDEV) and ideally integrated scripting
support to define highly complex indicators/data elements.

The aggregateddatavalue/indicator/report tables are very useful
artefacts to report builders/analysts. I would hate to have to
replicate in SQL/OLAP what the data aggregation service does for me.
This is what procedural languages are for after all.  These tables
provide a very useful, albeit one could argue bulky, materialized view
of the data in the routine data/semipermanent data tables. But disk
space is cheap. With proper definition of calculated data elements and
indicators, followed by materialization into report tables via the data
mart, report builders and analysts have very useful, simple tables
they can readily work with.

Get to the point, Jason, you say! Right. The point is that data element
group sets (at least as I am seeing them in the blueprints) have an
assumed and implicit aggregation pathway between dimension group set
element/category options.  As an example, (Under 1) + (1-<5) + (Over
5) = Total for "Age" perhaps. Why would I not want to define this
relationship explicitly in the form of a calculated data element
instead, when the logic and procedures already exist? How about if I
want the population for the "Under 5" age group? This would be the
sum of "Under 1" and "1-<5", right? If I need this value in a report, how
would I get it? Would DHIS automagically know that the "Under 5"
category is a result of the aggregation of two other category options?
It would not seem that it could know, without me defining a calculated
data element and assigning it a category option of "Under 5". Perhaps
I would not want to show this dimension option in the data entry form,
but I might be interested in having it in a report or other table for
analysis. Are we going to require that people pull out "Under 1" and
"1-<5" into a PivotTable, perform the aggregation, and import it back
into the DB? No,  I would not think this would be the right solution.
So, it would seem to me for this use case, we would need to define
explicitly a calculated data element, with specific aggregation
operations, that would tell me how to add "Under 1" and "1-<5" in
order to get the "Under 5" age group.  Thus, I am not sure that the
category options go far enough in allowing me to explicitly define how
totals are calculated. I agree that in many cases, the total will be
the sum of the parts, but not always.
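
To make the "Under 5" case concrete, here is a minimal Python sketch
of what I mean by defining the relationship explicitly. The names
(DERIVED_OPTIONS, derive_option) and the figures are hypothetical and
are not anything that exists in DHIS2; it only illustrates that the
system has to be told which options make up a derived option.

```python
# Hypothetical sketch: nothing here is DHIS2 code or real data.
# Values entered for one data element, keyed by "Age" category option.
entered = {"Under 1": 120, "1-<5": 340, "Over 5": 910}

# Explicit definitions of options that are derived rather than entered.
DERIVED_OPTIONS = {
    "Under 5": ["Under 1", "1-<5"],
    "Total": ["Under 1", "1-<5", "Over 5"],
}


def derive_option(option, values):
    """Sum the component options listed for a derived option."""
    components = DERIVED_OPTIONS[option]
    missing = [c for c in components if c not in values]
    if missing:
        raise KeyError(f"no value entered for {missing}")
    return sum(values[c] for c in components)


under_5 = derive_option("Under 5", entered)
total = derive_option("Total", entered)
print(under_5, total)
```

Without the DERIVED_OPTIONS entry there is nothing that tells the
system "Under 5" is made of those two options and not the third.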

Let me take another example to further clarify my point. Disregard my
previous paragraph for a moment in regards to the age groups. Let us
assume we have population values provided to us by official sources
for "Pop. Under 1" ,  "Pop. Under 5", "Population 5-15", "Population
15-49", "Population Over 49". Let us assume I would define a category
"Age" and specify options "Under 1", "1-<5", "5-15", "15-49", and
"Over 49".   I need "1-<5" for calculation of certain indicators,
although it has not been provided to me. Let us further assume that I
routinely collect data for three age groups: "Under 1", "1-<5" and "Over
5". This would imply that if I define a multidimensional data element
for "Total Population" I would need the following rule:

"Total Population" = "Population Under 5" + "Population 5-15" +
"Population 15-49" + "Population Over 49"

In this case, the "Total" would not be the sum of the component parts.
Does this exclude me from using the "category" functionality in this
case? Or would I need to somehow exclude the "1-<5" age group from the
category, as it is not used in data entry? If so, would I need to
define it as a plain old non-multidimensional calculated data element?
It feels like we are missing something here.

In order to calculate the "1-<5" population group coverage rate for a
particular data element, I need to define a calculated data element in
order to get the proper denominator:

 "Population 1-<5" = "Population Under 5" - "Population Under 1".


Note the minus sign there. It must be defined explicitly, which says
to me we cannot assume that the operator between category/dimension
elements is always a "+".  Thus, we cannot simply assume that we can
add category options up in order to get
another data element. Even "Total" is not a safe bet, as in my
example, I would enter a value that would not be aggregated in order
to obtain the "Total".


How then do we handle the issue of dimensional hierarchies?  Well,
with OrgUnit hierarchies, I have the ability to decide how the
aggregation takes place, to some degree. Here in Zambia, we allow
districts to enter facility catchment populations, which allow them to
calculate facility coverage rates. The sum of all the catchment
populations of all the facilities in a given district, does not
necessarily add up to the "official" district population figures,
which according to government policy, must be used to calculate
district coverage rates. DHIS allows me to define this explicitly by
deciding where the  aggregation of the population figures start, in
our case at the district level.
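
A simplified sketch of how I think about the aggregation start level
follows. The org units, level numbers and figures are invented, and
the function is my own reading of the behaviour, not the DHIS
implementation: units at or below the start level keep the value that
was entered for them, and units above it are summed from their
children.

```python
# Hypothetical sketch of an "aggregation start level"; not DHIS2 code.
# Levels: 1 = province, 2 = district, 3 = facility.
ORG_UNITS = {
    "Province A": {"level": 1, "parent": None},
    "District X": {"level": 2, "parent": "Province A"},
    "District Y": {"level": 2, "parent": "Province A"},
    "Clinic 1": {"level": 3, "parent": "District X"},
    "Clinic 2": {"level": 3, "parent": "District X"},
}

# Population values as entered at each unit (made-up figures).
entered = {
    "District X": 50_000,  # official district figure
    "District Y": 42_000,
    "Clinic 1": 21_000,    # facility catchment populations
    "Clinic 2": 26_000,
}


def population(unit, start_level):
    """Return the population for a unit given the aggregation start level."""
    if ORG_UNITS[unit]["level"] >= start_level:
        # At or below the start level: use the value entered for this unit.
        return entered[unit]
    # Above the start level: sum the values of the child units.
    children = [u for u, info in ORG_UNITS.items() if info["parent"] == unit]
    return sum(population(child, start_level) for child in children)


# Start at district level: the province total uses the official figures.
print(population("Province A", start_level=2))
# The catchment populations still serve facility coverage rates.
print(population("Clinic 1", start_level=2))
```

Here the two clinics add up to 47,000 while the official district
figure is 50,000, and the start level decides which one feeds the
province total.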

What about the period hierarchy? Where can I define explicit rules
about how to derive quarterly figures from monthly figures?
Well, in this case, I would need to define a rule that says that any
data value that has a period attribute of "Jan", "Feb" or "Mar" would
fall into "1st quarter". What about if I use financial quarters
instead of calendar quarters? It feels again that I need the ability
to define aggregation rules within a dimension, to derive either the
total or other values that may not be entered, as well as between
dimensions themselves.
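
For the period case, a rule like this could be as small as the sketch
below. The function names and the first_month parameter are
assumptions of mine, the monthly values are made up, and the sketch
ignores which year a quarter belongs to.

```python
# Hypothetical sketch; the rule format is an assumption, not a DHIS2 feature.
from collections import defaultdict


def quarter_of(month, first_month=1):
    """Return the 1-based quarter for a month (1-12).

    first_month is the month in which the reporting year starts:
    1 for calendar quarters, 4 for a financial year starting in April.
    """
    return ((month - first_month) % 12) // 3 + 1


def aggregate_by_quarter(monthly, first_month=1):
    """Sum monthly values into quarters (the year is ignored here)."""
    totals = defaultdict(int)
    for month, value in monthly.items():
        totals[quarter_of(month, first_month)] += value
    return dict(totals)


# Made-up monthly values for January to June.
monthly = {1: 10, 2: 12, 3: 9, 4: 14, 5: 11, 6: 13}

calendar = aggregate_by_quarter(monthly)                   # Jan-Mar is Q1
financial = aggregate_by_quarter(monthly, first_month=4)   # Apr-Jun is Q1
print(calendar, financial)
```

The same monthly values land in different quarters depending on the
rule, so the rule has to live somewhere explicit.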

These examples are not completely fabricated. There is a need to be
able to define, explicitly, operators regarding how aggregation takes
place within a dimension/category. When an analyst pulls the data into
a PivotTable, s/he is defining the rules dynamically. However for
reports and other materialized tables, how are we going to materialize
the values and present them in a format that is usable to people not
using external OLAP/analysis tools?

This mail turned out to be a bit longish, but I agree with Bob. We
are close and I think the generalization of the dimension concept is a
definite step in the right direction, but it feels we need to make the
extra push and see if we can get it right.

Best regards,
Jason




On Fri, Oct 30, 2009 at 7:39 PM, Bob Jolliffe <bobjolliffe@xxxxxxxxx> wrote:
> Don't really have much time to contribute to this discussion right now, but
> ...
>
> 2009/10/30 Ola Hodne Titlestad <olatitle@xxxxxxxxx>
>>
>> 2009/10/30 Jason Pickering <jason.p.pickering@xxxxxxxxx>
>>>
>>> Perhaps it is a bad example but it raises a good point, and we might
>>> should move this to a new thread if it continues to balloon.
>>>
>> I changed the name of the subject, might be too general, but still better
>> than a reply to a commit message.
>>
>>>
>>>  My understanding was the category options would be used for data
>>> entry.  This is not really an issue about 1.4, it is really an issue
>>> about whether people will enter totals or not. There is nothing to
>>> prevent people from defining a category , Gender, with three (or more)
>>> options, "Male" "Female" and "Total", and it may be necessary. Let me
>>> explain.  On the paper tools used here in Zambia, there is a separate
>>> column "Total" which is the sum of three age groups (Under 1, 1-5 and
>>> Over 5). If I was going to implement the multidimensional data
>>> elements here, if I wanted to replicate the paper tool exactly, I
>>> would need a separate column for totals. This is what we have now, and
>>> it serves a good purpose, as the data entry personnel can see if the
>>> totals provided by the facility actually match the calculated totals.
>>
>> This raises an interesting point related to the discussion we have had
>> about the role of data sets and data entry forms. To me such a control
>> column like "total" is simply a GUI feature and I don't think it should be
>> reflected in the data model or persisted.
>>
>> It would be great if we could add this feature to our data entry module.
>> What I see here is a need for an option to add a total column to each
>> categorycombination and then to automatically populate this field as the
>> other fields of the row get filled. This is not a new request as it has
>> been mentioned several times (I remember a quite heated discussion about the
>> use of calculated data elements a few years ago), but with a new take on the
>> data set and form relation and a refined multidimensional model this might
>> be a better time to look at this.
>>
>> And I agree with Bob, to get these totals in a report is a matter of
>> adding this to the GUI somehow, the ability to add total columns for data
>> elements + category combos.
>>
>>
>>
>>>
>>> No idea if this is how the categories work in DHIS2. But from the
>>> analysis standpoint, it would seem that you would need some calculated
>>> data element as well that would calculate the total from the
>>> multidimensional components of the data element, unless as you say,
>>> you are going to rely on OLAP or PivotTables to always do this
>>> aggregation for you.
>>
>> At least for categories and options there should be no need to go to OLAP
>> to get this.
>> And although more complicated, I would think it should be possible to also
>> extract totals from a data element group set model with a similar logic to
>> what I described earlier. I guess that is the point of the new dimension
>> service which abstracts away the difference between categories and group
>> sets, is that correct Lars/Bob?
>
> My (radical) idea on this is that a GroupSet should actually "BE" a
> dataelement.  Reason comes down to the fact that values have dimensions.
> And those dimensions can be different depending on the dataelement used.
>
> eg (using shorthand)
>
> Here's a datavalue in its "raw" form
> <dv de="Immunization_Male_Under5"  Value="5"/>
> Now let's say there are groups gender and age defined of which the above is a
> member.  And a groupset Immunization.  Then here's the same datavalue
> <dv de="Immunization" gender="M" Age="<5" Value="5"/>
> Now what about that same de, but without the dimensions:
> <dv de="Immunization"  Value="105"/>
>
> where I guess 105 would be the Total of all the underlying datavalues.
>
> In fact what would be very nice would be to do away with groups/groupsets
> entirely.  Less is more.  Just have (calculated?) dataelements which can
> form hierarchies (like orgunits).  We're not too far from here at the
> moment.  Another little step and we'll be over the edge.
>
> I'll think more about this later.  Right now in a rush to implement dxf2
> parser ...
>
> Cheers
> Bob
>
>
>>
>>>
>>> I would think that actually having the ability to
>>> persist and store the data value, as a calculated data element (Save
>>> calculated) and assign it a Category option of "Total" (which might be
>>> implicit anyway in the system) would make sense, since you might need
>>> it directly in a report or something and do not want to have to revert
>>> to OLAP or custom SQL to get this. But again, I am looking at this
>>> from the perspective of a bunch of data elements which do not use
>>> category options.
>>>
>>> You would get the totals as you state, but only by using OLAP. What
>>> about if I want to create an Excel report with only Totals? Now if the
>>> new model will automatically give me the totals from the component
>>> dimensions, great, but I did not see this in the blueprint.
>>
>> You are right, getting total from the group set/groups part of
>> dimension/dimensionoptions was not covered I think.
>> We need to add this to the blueprint. The idea was to abstract away the
>> difference between categories and group sets at the point of data analysis,
>> e.g. when defining new report tables, so I guess this means more complexity
>> to the dimension service Lars is working on.
>>
>> Ola
>> ---------
>>
>>
>>> I was
>>> assuming that I would need to explicitly define a separate, calculated
>>> element for this.
>>>
>>> Regards,
>>> Jason
>>>
>>>
>>> On Fri, Oct 30, 2009 at 5:34 PM, Ola Hodne Titlestad <olatitle@xxxxxxxxx>
>>> wrote:
>>> > 2009/10/30 Jason Pickering <jason.p.pickering@xxxxxxxxx>
>>> >>
>>> >> OK, I took a walk around the block to think about this a bit more. I
>>> >> think it does, make sense, sort of. Lets look at  "Total", which might
>>> >> be defined as a calculated data element, say composed of different age
>>> >> groups. But the "Total" in this category, would not be the same as the
>>> >> "Total" that might be defined in a different category, or would it?
>>> >>
>>> >
>>> > I thought the whole point of the category/categoryoption/categorycombo
>>> > model
>>> > was that the total would be the data element itself without any
>>> > categoryoption? The "total" should then not be defined as one of the
>>> > options, but always be derived from the sum of all the options.
>>> >
>>> > Your example Jason is from a 1.4 design point of view where you are not
>>> > using this model, but normally need calculated data elements to get to
>>> > a
>>> > total (since the categoryoptions are part of the data element names).
>>> > With
>>> > the new data element group set model I guess you can derive the total
>>> > for
>>> > e.g.  "Malaria new cases OPD" e.g. by filtering on the data element
>>> > group
>>> > "Malaria" in the group set "Diseases" plus the group called "New cases"
>>> > in
>>> > the group set "Patient status" and then simply sum up all the data
>>> > elements
>>> > in the two group sets "Gender" and "Morbidity age group". Wouldn't such
>>> > an
>>> > approach give you the totals you need?
>>> >
>>> > As in exactly how we could accommodate that within DHIS2 e.g in a
>>> > report
>>> > table GUI I am not sure. Seems complicated and something for an OLAP
>>> > tool to
>>> > take care of.
>>> >
>>> > Ola
>>> > -----------
>>> >
>>> >>  Having a single categoryoption "Total" would allow one to slice out
>>> >> particular groups of dimensional elements, which is a fairly common
>>> >> operation as Ola mentions, with a single filter statement. Otherwise,
>>> >> you would need to collect all of the "Total"s for different categories
>>> >> through another table and perform an inner join, as opposed to a
>>> >> filter. For multiple category options, I guess there would need to be
>>> >> a decision made whether to perform an inner join or loop through a
>>> >> filter, but I guess an inner join would actually be better for either
>>> >> one or many category options (have not looked at the code). If the
>>> >> uniqueness constraint is not there, the user would need to select in a
>>> >> separate step to select all "Total"s and then perform an inner join,
>>> >> as there would be no intrinsic relationship between "Total" in the
>>> >> "Age" category and the "Total" in the "Gender" category. This might be
>>> >> very tedious if there are many categories to select from. Having
>>> >> multiple category options with the same name does not make sense in
>>> >> this case, and I think this is what everyone is saying?
>>> >>
>>> >>
>>> >>
>>> >> Obviously  there should not be two category options called "Total" to
>>> >> be within a single category/data element group set. However,I am not
>>> >> sure I understand completely your point Ola. To me, the use case you
>>> >> describe is very typical. "Give me all data for the under 1 age
>>> >> group", "Give me all data on in patient discharges". Having to define
>>> >> multiple "under 1" and "IPD" for each category seems to be very
>>> >> inefficient, as well as painful.
>>> >>
>>> >> So, I guess maybe I am answering my own mail...I think.
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> 2009/10/30 Lars Helge Øverland <larshelge@xxxxxxxxx>:
>>> >> >
>>> >> >
>>> >> > On Fri, Oct 30, 2009 at 2:43 PM, Jason Pickering
>>> >> > <jason.p.pickering@xxxxxxxxx> wrote:
>>> >> >>
>>> >> >> Could some one remind me once again what the point of having a
>>> >> >> category option in two separate categories is? is there a use case
>>> >> >> here? It does not seem totally obvious, but maybe I am missing
>>> >> >> something.
>>> >> >>
>>> >> >
>>> >> > It might be that there are none. This could be useful in the sense
>>> >> > that
>>> >> > if
>>> >> > nobody asks for removing the constraint - we won't.
>>> >> >
>>> >> >
>>> >> >
>>> >>
>>> >> _______________________________________________
>>> >> Mailing list: https://launchpad.net/~dhis2-devs
>>> >> Post to     : dhis2-devs@xxxxxxxxxxxxxxxxxxx
>>> >> Unsubscribe : https://launchpad.net/~dhis2-devs
>>> >> More help   : https://help.launchpad.net/ListHelp
>>> >
>>> >
>>
>>
>>
>
>


