dhis2-devs team mailing list archive

Fwd: On categories and dimensions and zooks

 

Hi all. I am forwarding a "side" discussion I was having with Bob on
this topic. I wanted to bounce it off him, before I exposed myself too
openly. :) I have been trying to conceptualize my thoughts on this
topic in words, and thought we should maybe try a drawing as well. It
seems that it has more or less been agreed that data element set
groups should be implemented. The key question for me is how far we
can realistically go, as slipping down the multidimensional slope
might get messy quickly. If we think about a data value, it may be
composed of multiple internal dimensions (as Bob points out below).
DHIS deals with this by the use of "calculated data elements", whereby
users can define essentially any combination of data elements.
Currently, it would seem that the "dimensionality" of a
calculated data element sort of gets lost during a data mart export.
For instance, we can define data elements like "Total cases of
confirmed malaria" and "Total cases of clinical malaria" and then add
them up to get "Total cases of malaria". However, total clinical
cases may be the combination of various other data elements as well,
such as the Under 1, 1-5 and Over 5 age groups. It seems we have a
situation where certain data elements can be defined in the application
itself, and then others need to be derived through ad hoc means, such
as an SQL query or PivotTable.

I have put together a few "mindmaps" with the tool "freemind" which is
freely available for many platforms, to try and conceptualize to some
degree how I see a "data value" or a measure. I think Bob, below,
provides a good play-by-play of the diagrams. I have started with the
concept of a data value, which may be composed of several dimensions:
DataElement (what), Period (when), OrganizationalUnit (where), and
Source (how). I faintly remember there being some discussion, as Bob
alludes to, about the "source". I see the source in a slightly
different light than Bob. If you look at a system like DevInfo, their
concept of source is essentially a difference in how the data is
obtained, and relates to its methodology (I think this is
essentially the "how"). Here in Zambia, there are two sources of
denominators: the official census and facility catchment populations.
The sum of catchment populations for all facilities in a given
district does not necessarily add up to the total official population.
It may be desirable to use this alternative denominator to calculate
things like coverage rates. Another example would be routine
data and population-based data. Oftentimes, it may be useful to
compare two values for the same indicator, for instance HIV
prevalence among pregnant women. This could be obtained through routine
data, or through population-based surveys. Again, I have seen such
graphs here, comparing results from the routine HMIS data and results
from the DHS. So, this type of analysis may be desirable.

I think the dimensions of OrgUnit and Time are pretty straightforward.
Bob highlights a few points here that are quite valid. Each country
has their own sense of what "time" is and how it is implemented in
their HMIS system. Here, daily patient registers get tallied once a
month, and then the management wants to see quarterly figures. The
situation is of course different elsewhere, which highlights the fact
that dimensions need to be 1) hierarchical and 2) flexible. The
implementation of the OrgUnit hierarchy in DHIS is a good example,
which allows countries or organizations total flexibility in
configuration of some type of "place" hierarchy.

I personally think that the "source" dimension is one of lesser
importance. I guess the question would be whether DHIS is for routine
data only. It is not a priority for me at least, although it would be
nice to have.

Data elements seem to be much more complex. As Bob points out, in my
mind, they are intrinsically recursive and may be composed of other
data elements (which has been implemented as calculated data elements
with a fixed definition in DHIS). When I started mapping out some of
the data elements here, a couple of things occurred to me, which have
implications for the implementation of "data element set groups".

First, there appear to be "primary data elements" which are things that
actually get recorded and entered into the system. Here, they are
taken from patient registers, tallied and then entered onto an
aggregation form (which I sent along a few emails ago). They might
also come from a medical record system, such as OpenMRS. There may be
many folded and hidden dimensions wrapped up inside of this data
element, but for the aggregate system, we do not really care about
them. In my diagram, this corresponds to what is referred to in
DHIS as "data elements".

Second, there appear to be implicit default operators about how
operations at nodes should be handled. Sometimes, it would appear to
make sense to "sum" data elements (whether they are primary or
derived). Other times, it would not make sense. We have some data
elements on the number of Doctors who have been lost, recruited and
who are on-site. It does not make sense to sum these values up to
arrive at "Total number of doctors". So, it would seem we need a bit
more logic to be built in somehow, or we simply leave it up to the users
to decide how certain aggregate data values should be handled.
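
The idea of a default operator attached to each node can be sketched
roughly as follows. This is only an illustration in Python, not DHIS
code: the Node class, its operator field, the aggregate function and
all of the numbers are hypothetical and made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A data element that is either primary (has a value) or derived."""
    name: str
    value: Optional[float] = None          # set only for primary data elements
    children: List["Node"] = field(default_factory=list)
    operator: Optional[str] = None         # "sum", "avg", or None (no roll-up)


def aggregate(node: Node) -> Optional[float]:
    """Roll a node up using its own operator; None means 'do not aggregate'."""
    if node.value is not None:
        return node.value
    if node.operator is None:
        return None
    values = [aggregate(child) for child in node.children]
    if not values or any(v is None for v in values):
        return None
    if node.operator == "sum":
        return sum(values)
    if node.operator == "avg":
        return sum(values) / len(values)
    raise ValueError(f"unknown operator: {node.operator}")


# Illustrative numbers only.
malaria = Node("Total cases of malaria", operator="sum", children=[
    Node("Total cases of confirmed malaria", value=120),
    Node("Total cases of clinical malaria", value=300),
])
doctors = Node("Doctors", operator=None, children=[
    Node("Recruited", value=3),
    Node("Lost", value=2),
    Node("On-site", value=14),
])

print(aggregate(malaria))  # the two malaria elements are summed
print(aggregate(doctors))  # None: no sensible total, show as a cross-tab
```

Leaving the operator empty is one way of saying "leave it up to the
user": the children are still available for display, but no total is
invented for them.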

Third, there appear to be different levels of dimensionality for
different data element hierarchies. Some data elements may be more
recursive than others, which is fairly typical. This highlights that
the dimensional hierarchy may be dynamic for each branch of a
particular derived data element. This also raises potential issues
with what a crosstab table would look like. What happens if we have 20
dimensions? Well, this would probably be OK, but what if it balloons
somehow to 200?

Fourth, on categories versus data element group sets: I am
thinking that, in terms of best practice, one way of distinguishing
between the two concepts would be as follows. 1) Categories provide
a dimensionality construct for data elements that can be aggregated,
and should be restricted to relatively few dimensional levels (Age +
Gender, for example). In my second point, I highlight what could be a
multidimensional data element (Doctors) with three category options
(Recruitments, Losses, On-site) which do not seem to be able to be
aggregated through standard operators, and which one would probably
only want to visualize in some sort of cross-tab table instead. 2) Data
element set groups provide a grouping of data elements with no preset
aggregation path (at this point either SUM or AVG). Perhaps
eventually we could define what the default aggregation path would
be, but at this point we can leave it up to the user to decide how to
handle slicing and dicing in a PivotTable or OLAP engine.

So, in conclusion for this mail, I think that data element set groups
would go a long way to providing some multidimensional analysis
capacity, but it feels to me like we are missing something, especially
as it applies to calculated data elements. Perhaps this is best left
up to analysts to decide, and we should define what the goalposts are
in terms of what is achievable with the current model and our level of
resources.

Apologies once again for the long mail, but maybe this can be brought
into the eventual documentation on this subject!

Regards,
jason






---------- Forwarded message ----------
From: Bob Jolliffe <bobjolliffe@xxxxxxxxx>
Date: Tue, Oct 6, 2009 at 1:10 PM
Subject: Re: [Dhis2-devs] On categories and dimensions and zooks
To: Jason Pickering <jason.p.pickering@xxxxxxxxx>


Hi

2009/10/6 Jason Pickering <jason.p.pickering@xxxxxxxxx>
>
> I thought I would mail this to you first. I have been trying to
> conceptualize my thinking a bit more, and thought a picture may
> represent a thousand words. I created these "mind maps" with freemind.
> I think it should run on any system.

Yes freemind is cool.  I introduced it to Sundeep in Goa and he is now
an avid fan.

>
> Take a look at them and let me know if they make any sense in terms of
> our discussion. They are really not complete, and I have purposefully
> left out a lot of possibilities, but have tried to give enough
> examples to make my points clear. These simple diagrams obviously do
> not have the rigor of something like UML, but help me to try and
> visualize the concepts a bit more clearly.
>
> Let me know what you think and if this is in line with your thoughts.

First observation is that I see you are *really* talking hierarchical
dataelements here - rather than just a single level of grouping.  The
SDMX model also expects these (I'll send you some sample files).  I
think if this is a requirement then you should highlight it.  I think
it is but then again I have a different brief - how to deal with
importing an sdmx metadata file which has hierarchical indicators.
But this is quite a fundamental paradigm shift which we should
probably look at proposing for a dhis2-ng requirements gathering
exercise.

Are we saying that:
1.  each dataelement can be thought of as a composite thing;
2.  it might be composed of other dataelements (recursion);
3.  it might be composed of "internal" dimensions;
4.  it might be associated with simple datavalues.

And it can get more complex :-(

5.  If it doesn't have simple datavalues then it must be possible to
return an aggregated value calculated by summing a slice along any
axis below it.
6.  And it should be able to return the slice (or dice) of
datavalues associated with any axes below it.
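
Points 1 to 6 can be made a little more concrete with a small sketch.
It is written in Python purely for illustration and assumes that
datavalues are plain numbers carrying tags; the names composition,
leaves, total and slice_by, and all of the figures, are hypothetical
and are not part of DHIS or SDMX.

```python
from collections import defaultdict

# Each datavalue is a number tagged with a simple dataelement and one
# "internal" dimension (age). The figures are invented for the example.
values = [
    {"element": "Clinical malaria", "age": "Under 1", "value": 10},
    {"element": "Clinical malaria", "age": "1-5", "value": 25},
    {"element": "Clinical malaria", "age": "Over 5", "value": 40},
    {"element": "Confirmed malaria", "age": "Under 1", "value": 4},
    {"element": "Confirmed malaria", "age": "1-5", "value": 6},
    {"element": "Confirmed malaria", "age": "Over 5", "value": 15},
]

# Composite dataelements are defined in terms of other dataelements.
composition = {
    "Total malaria": ["Clinical malaria", "Confirmed malaria"],
}


def leaves(element):
    """Recursively expand a dataelement into its simple dataelements."""
    children = composition.get(element)
    if not children:
        return {element}
    result = set()
    for child in children:
        result |= leaves(child)
    return result


def total(element):
    """Point 5: aggregate by summing everything below the element."""
    members = leaves(element)
    return sum(v["value"] for v in values if v["element"] in members)


def slice_by(element, axis):
    """Point 6: return the datavalues below the element, sliced along an axis."""
    members = leaves(element)
    result = defaultdict(int)
    for v in values:
        if v["element"] in members:
            result[v[axis]] += v["value"]
    return dict(result)


print(total("Total malaria"))
print(slice_by("Total malaria", "age"))
print(slice_by("Total malaria", "element"))
```

The datavalues themselves stay flat; only the composition table is
hierarchical, which is roughly the split between structural metadata
and value storage.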

I don't know if I can draw this.

But assuming we could do all the above, can we then generalize these
requirements so that they can be applied to other
hierarchical entities?  I suspect we can.  Datavalues are fairly
simple beasts which simply have tags associated with entities in all
of these trees.

One problem with hierarchical models is that relational databases are
really not that clever at representing them.  Queries into tree-like
structures tend to get needlessly complex for what should be flowing
with the logic of the model.  One nice alternative to a relational
database for representing the structural metadata is an XML database
like eXist.  These beasts are designed to efficiently and intuitively
handle tree-like data.  I can (if I close my eyes) see a situation
where structural metadata is stored in eXist and the grunt work of
datavalue storage is handled in a relational database.  The eXist
query would generate XML output representing the particular
tree view you required - using identified simple dataelements,
dataelement internal dimensions, sources, and periods.  Pulling
datavalues out of the database to match these identifying tags should
not be a complex query.

>
>
> I have also added an additional "uber-dimension", Source, which is
> distinct from the DHIS concept of source, but is more in line with what
> DevInfo considers a source. The thinking here would be that we would
> eventually like to be able to have two sources of the same
> data element, measured through different means, such as population
> based surveys and routine data systems.

I am not sure if this is too far away from what the original idea of
dhis2 "source" was - using inheritance, an orgunit is just one kind of
source.  It was always envisaged that there could be others.  Mind you,
the others have not materialized as yet, so I recall Lars has been
considering removing the inheritance relationship.  It's been there for
a few years and no one has suggested a use yet.  Maybe there is an
argument for maintaining it ...

Your hierarchies of periods are not quite so straightforward.  In
particular Weeks do not sit neatly under months.  In fact they don't
even sit uniformly under years unless you agree to some standard like
ISO 8601.  Different countries and regions have different conventions
regarding the first day of the week and the first week of the year
which makes for a horrible mess of complexities.  You can't reasonably
aggregate weekly data to monthly.  But you can aggregate both to
yearly.

Bottom line is not that dissimilar to dataelement hierarchies mind you
- trees of weeks won't necessarily sit neatly in a single period tree
structure.  Different trees must be able to co-exist in parallel under
the same root nodes, e.g. a tree of 52 weeks under year plus a tree of 12
months under year.  Quarters, Decades etc are easier to fit in.
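
The week/month mismatch is easy to check with a few lines of Python.
This sketch assumes ISO 8601 week numbering (weeks start on Monday)
and Python 3.8 or later for date.fromisocalendar; the helper names
iso_week_days and months_touched are made up for the illustration.

```python
from datetime import date, timedelta


def iso_week_days(year, week):
    """Return the seven dates of an ISO 8601 week, Monday first."""
    monday = date.fromisocalendar(year, week, 1)
    return [monday + timedelta(days=i) for i in range(7)]


def months_touched(year, week):
    """Return the (year, month) pairs that an ISO week overlaps."""
    return sorted({(d.year, d.month) for d in iso_week_days(year, week)})


# ISO week 40 of 2009 runs from Monday 28 September to Sunday 4 October,
# so it has no single parent month in a period tree.
print(iso_week_days(2009, 40)[0], iso_week_days(2009, 40)[-1])
print(months_touched(2009, 40))
```

A week that touches two months is the reason weekly data cannot simply
be rolled up to monthly, and why a week tree and a month tree need to
sit side by side rather than one under the other.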

But all in all I think you are thinking in a similar direction to me.
I just don't fancy descending into RDBMS hell trying to model these
things.  Probably we need to plan a design fest ...

Regards
Bob

Attachment: dhis_mind_maps.zip
Description: Zip archive

