← Back to team overview

dhis2-devs team mailing list archive

Multidimensional data analysis and Postgresql scripts

 

Hi there. Lars set me off on a winding path the other day, and asked me to
whip up some SQL for the representation of a specific structure of data.

I have added two Postgresql scripts to the resources/sql folder which
contain a partial solution to two separate problems. Lars and I have had
many mails back and forth privately, but now that I have some results, I
thought I would try and describe what has been done.

First, the  ct_source_period_from_deid.sql script will transform the
datavalue table for a given data element into a different form, with
categoryoptioncombos on a seperate column. . I needed to constrain it to
some dimension, as there is no way to know how many columns we would end up
with, as the table is crosstabbed now essentially on the category options.
Without this contrstraint, if you tried to crosstab the entire table, you
might end up with a very "wide" table with too many columns. Anyway,
hopefully this meets Lars needs, but this procedure would of course need to
be generalized to Java/Hibernate.

As as example of how to execute this script, just so "SELECT * FROM
create_ct_source_period(36) where 36 is a given data element id. A new table
will be created _ct_source_period_36 which will contain all datavalues for
each source/period combination with a seperate column for each category
option.

The second (interesting) by product of this little journey into the depths
of the DHIS2 data model, is a second script to decompose the datavalue table
into an entity-attribute-value representation. This has been a nagging
problem with me for several years when dealing with data models such as the
one that DHIS employs, namely how to decompose the data into a format that
can then be pulled into an aribtrary crosstab function. This script will do
this for a given dataelement, sourceid combination. Again, I have employed
somewhat of an artifical contraint here, as of course you would likely want
to combine multiple dataelements, orgunits and periods, likely with some
sort of filters as well. I think this could be achieved fairly easily with
some modifications to the script or better yet through some other procedural
login, but since we are staying to try database agnostic here, I really
wanted to make this more as  an example of we can achieve this
representation of the data. Once this script is executed with the right
parameters, a separate data table is generated _eav_dataset that looks
something like this..

"0";"dataelementname";"Fully Immunized child"
"0";"enddatte";"2008-01-31"
"0";"Fixed+Outreach";"Outreach"
"0";"orgunitname";"Gbonkobana CHP"
"0";"startdate";"2008-01-01"
"0";"Under12mths+12to59mths";"12 to 59 mths"
"1";"dataelementname";"Fully Immunized child"
"1";"enddatte";"2008-01-31"
"1";"Fixed+Outreach";"Outreach"
"1";"orgunitname";"Gbonkobana CHP"
"1";"startdate";"2008-01-01"
"1";"Under12mths+12to59mths";"< 12 mths"
"1";"value";"15"

So, various "dimensions" have been transformed into separate attributes and
value pairs, and assigned an arbitrary ObjectID. EAV is a useful data
representation, with advantages and disadvantages.

Why would anyone want to see the data like this? Well, the reason is that
the "crosstab" function of Postgresql likes this type of data, and can
generate an crosstab table on any dimension once the data is represented in
this format.

Once I get my data looking like this, I can utilize the crosstab function of
postgres like this..

SELECT * from crosstab( 'select objectid::text, attribute, value
from _eav_dataset where attribute = ''orgunitname''
or attribute = ''dataelementname''
or attribute = ''startdate''
or attribute = ''Under12mths+12to59mths''
or attribute = ''value''
ORDER BY 1,2,3')
as ct(row_name text, dataelementname text, orgunit text, startdate text, age
text, value text)

which returns this

"0";"Fully Immunized child";"Gbonkobana CHP";"2008-01-01";"12 to 59 mths";""
"1";"Fully Immunized child";"Gbonkobana CHP";"2008-01-01";"< 12 mths";"15"
"10";"Fully Immunized child";"Gbonkobana CHP";"2008-02-01";"12 to 59
mths";"2"
"11";"Fully Immunized child";"Gbonkobana CHP";"2008-02-01";"< 12 mths";"4"

Of course, the crosstab query could be rewritten in essentially any way,
with different dimensions as columns as needed.

So, in conclusion, I just wanted to bring this up the the devs/community for
more discussion.

Best regards,
Jason