dhis2-devs team mailing list archive

Thread
Date

Reporting based on tracker data

To: dhis2-devs <dhis2-devs@xxxxxxxxxxxxxxxxxxx>
From: Olav Poppe <olav.poppe@xxxxxx>
Date: Thu, 29 Jan 2015 11:35:19 +0100

Hi devs,
I’m having some problems with making reports/dashboards/output based on tracker data.

We have made a «Cause of death» program, quite similar to the inpatient program on the demo instance and what is currently being used several places. Each case will have as a minimum age, sex and a cause of death (based on an ICD-10 based short list of about 120 diagnosis).

Based on this data, we want to make reports that include as a minimum:
- number of cases by gender => can be done in ER/EV
- number of cases by diagnosis => can be done in ER/EV
- number of cases by age group => can NOT be done in ER/EV (unless I’ve missed something)
- number of cases by diagnosis group => can NOT be done in ER/EV (unless I’ve missed something)
- combinations of the above, e.g. case by diagnosis and age group/sex => can NOT be done in ER/EV (unless I’ve missed something)

Since this output can quite easily be created based on aggregate data (it is similar to the OPD datasets), I decided to set up aggregation queries to make the reports we need. I therefore made a categorycombo with gender and age group, and one data element for each diagnosis. I then grouped the diagnosis into data element groups to be able make reports based on disease group. I tested this with a couple of queries, and it worked fine.

The problem with this approach is that 118 diagnosis * 2 sexes * 22 age group = 4 956 aggregation queries, and this causes the aggregation to take a very long time. When I try to run manual aggregation for just one week with 4-500 cases, it takes about 10 minutes (4/8 core CPU, SSD, 3/6 GB memory for tomcat/postgres). Running the aggregation for the last 6 months would take a several hours. This might work for a cause of death program where the number of cases is relatively low, but to do similar analysis on an admission program (where I know some countries have 19000+ cases some weeks) is just not possible.

So I guess my question is whether there are any ideas on how this can be made to work. One obvious thing I see is that the aggregation only uses one CPU (as opposed to the analytics), could that be made more efficient? Or what would be even better was if ER/EV could support this type of analysis - the general requirement would be some sort of grouping functionality? That is something I think would be useful in many/most programs (e.g. grouping mothers by age group in ANC program)

Olav

Follow ups

Re: Reporting based on tracker data
From: Jason Pickering, 2015-01-29