dhis2-devs team mailing list archive
-
dhis2-devs team
-
Mailing list archive
-
Message #35421
Re: Reporting based on tracker data
Hi,
I am really hoping to avoid custom scripts for this. What we’re trying to do is to set up a reusable meta data «packages» (input and output) for recording this data, and having to make «hacks» outside of DHIS (that then needs to be maintained) to make it work does not look great.. Especially because all the functionality is almost there: 1) the aggregation queries works, but isn’t optimized, whilst 2) ER and EV still miss a few things for aggregated analysis (which is available in PT and DV).
I’ve been experimenting a bit more with the aggregation queries, including deleting 90% of the cases (but keeping all the ≈ 5000 queries). That didn’t really speed up things, but looking at the CPU use, the java process was now using 100+% CPU whilst postgresql stayed around 40-50% - so most of the time seems to be spent preparing the queries rather than executing them. Also did a few stack traces (jstack), and about half of the time, it was doing something in HibernatePeriodStore.java:
at org.hibernate.internal.CriteriaImpl.uniqueResult(CriteriaImpl.java:396)
at org.hisp.dhis.period.hibernate.HibernatePeriodStore.getPeriod(HibernatePeriodStore.java:73)
at org.hisp.dhis.period.hibernate.HibernatePeriodStore.reloadPeriod(HibernatePeriodStore.java:163)
at org.hisp.dhis.period.hibernate.HibernatePeriodStore.reloadForceAddPeriod(HibernatePeriodStore.java:171)
at org.hisp.dhis.period.DefaultPeriodService.reloadPeriod(DefaultPeriodService.java:278)
So perhaps (I’m only guessing here of course) some small optimizations is all it takes to make the aggregation feasible.
Olav
> 29. jan. 2015 kl. 13.16 skrev Jason Pickering <jason.p.pickering@xxxxxxxxx>:
>
> Hi Olav,
> I came across a similar case,but even worse with many more disaggregates and and data elements . Short story is that's it was simply not feasible with the aggregation query builder,but rather trivial with a script to pull required data out of the database,aggregate it and then reimport it through the API. Not surprisingly, I chose to use R,but am sure it could be done with other languages. A map between the event data elements and aggregate data elements was used to transform and aggregate from one domain to the other. Robin brought up a similar request a while ago about this ,so looks like something we should really have in the core as opposed to relying on scripts.
>
> Regards,
> Jason
>
> On Jan 29, 2015 1:36 PM, "Olav Poppe" <olav.poppe@xxxxxx <mailto:olav.poppe@xxxxxx>> wrote:
> Hi devs,
> I’m having some problems with making reports/dashboards/output based on tracker data.
>
> We have made a «Cause of death» program, quite similar to the inpatient program on the demo instance and what is currently being used several places. Each case will have as a minimum age, sex and a cause of death (based on an ICD-10 based short list of about 120 diagnosis).
>
> Based on this data, we want to make reports that include as a minimum:
> - number of cases by gender => can be done in ER/EV
> - number of cases by diagnosis => can be done in ER/EV
> - number of cases by age group => can NOT be done in ER/EV (unless I’ve missed something)
> - number of cases by diagnosis group => can NOT be done in ER/EV (unless I’ve missed something)
> - combinations of the above, e.g. case by diagnosis and age group/sex => can NOT be done in ER/EV (unless I’ve missed something)
>
> Since this output can quite easily be created based on aggregate data (it is similar to the OPD datasets), I decided to set up aggregation queries to make the reports we need. I therefore made a categorycombo with gender and age group, and one data element for each diagnosis. I then grouped the diagnosis into data element groups to be able make reports based on disease group. I tested this with a couple of queries, and it worked fine.
>
> The problem with this approach is that 118 diagnosis * 2 sexes * 22 age group = 4 956 aggregation queries, and this causes the aggregation to take a very long time. When I try to run manual aggregation for just one week with 4-500 cases, it takes about 10 minutes (4/8 core CPU, SSD, 3/6 GB memory for tomcat/postgres). Running the aggregation for the last 6 months would take a several hours. This might work for a cause of death program where the number of cases is relatively low, but to do similar analysis on an admission program (where I know some countries have 19000+ cases some weeks) is just not possible.
>
> So I guess my question is whether there are any ideas on how this can be made to work. One obvious thing I see is that the aggregation only uses one CPU (as opposed to the analytics), could that be made more efficient? Or what would be even better was if ER/EV could support this type of analysis - the general requirement would be some sort of grouping functionality? That is something I think would be useful in many/most programs (e.g. grouping mothers by age group in ANC program)
>
> Olav
>
>
>
> _______________________________________________
> Mailing list: https://launchpad.net/~dhis2-devs <https://launchpad.net/~dhis2-devs>
> Post to : dhis2-devs@xxxxxxxxxxxxxxxxxxx <mailto:dhis2-devs@xxxxxxxxxxxxxxxxxxx>
> Unsubscribe : https://launchpad.net/~dhis2-devs <https://launchpad.net/~dhis2-devs>
> More help : https://help.launchpad.net/ListHelp <https://help.launchpad.net/ListHelp>
>
Follow ups
References