dhis2-devs team mailing list archive

Re: [Dhis2-users] Creation of CategoryOptionCombinations

 

Hi Jason,

just to clarify: it's 1 CategoryCombo with ten Categories, resulting in 50 million
CategoryOptionCombos (I misstated this before). In theory this must be
multiplied by the number of data elements in the data set, the number of org units
and the number of periods (daily over 50 years) to get the number of expected
dataValues.
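As a rough illustration of the multiplication described above, here is a minimal sketch. Only the 50 million COCs and the daily periods over 50 years come from this message; the number of data elements, the number of org units and the concrete date range are made-up placeholders.

```python
from datetime import date


def max_data_values(n_cocs, n_data_elements, n_org_units, n_periods):
    """Theoretical upper bound: every COC filled for every element, unit and period."""
    return n_cocs * n_data_elements * n_org_units * n_periods


# Daily periods over 50 years (the date range is an assumption for illustration).
n_days = (date(2016, 1, 1) - date(1966, 1, 1)).days

# 50 million COCs is from the message; 10 data elements and 100 org units are hypothetical.
upper_bound = max_data_values(50_000_000, 10, 100, n_days)
print(f"{n_days} daily periods, upper bound of {upper_bound:.2e} data values")
```

The point is only the order of magnitude: the bound grows multiplicatively with every factor, so it says little about how many values will actually exist.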

In reality this number of dataValues will not be reached, as there are functional
dependencies between options, which leave lots of combinations empty. Actually,
I cannot predict just how many combinations (aka records) will come out of the
GROUP BY SQL on the source system. In our current prototype with 5 Categories in
the CatCombo we are getting 4 million values in total, of which only 10,000 have
to be updated every day, which is a very reasonable number. I am actually
hoping for similar numbers with the extended 10-dimension version because of those
functional dependencies.
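To show how functional dependencies shrink the set of combinations that can ever carry data, here is a small sketch. The categories, options and the dependency rule are invented for illustration and are not taken from the actual data model.

```python
from itertools import product

# Hypothetical categories and options (not from the real data model).
categories = {
    "sex": ["male", "female"],
    "age_group": ["child", "adult"],
    "member_type": ["dependent_child", "insured", "spouse"],
}


def is_feasible(combo):
    """Hypothetical dependency: member_type determines age_group."""
    _sex, age_group, member_type = combo
    if member_type == "dependent_child":
        return age_group == "child"
    return age_group == "adult"


# Full Cartesian product, which is what COC generation has to create.
all_combos = list(product(*categories.values()))
# Combinations that can actually occur in the source data.
feasible = [combo for combo in all_combos if is_feasible(combo)]

print(len(all_combos), "generated,", len(feasible), "can carry data")
```

In this toy case half of the generated combinations can never be filled; with ten categories the share of empty combinations could be far larger, but that depends entirely on the real dependencies.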

The idea of using the tracker is interesting, although I'd have to get used to
the idea of using a granular level to upload aggregated data and rethink the
whole model. I think I'd rather try to reduce the number of categories first (I
am currently down to 10 million COCs and it seems to work).

How do you rate the chances of getting rid of some of the heavy things in
DHIS2 core when generating categoryOptionCombinations? I am especially thinking
of the extraordinarily long names and the huge log entries for every new
categoryOptionCombination (currently over 3000 characters of log for each). This
would already take a lot of data volume out of the generation process.
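A back-of-the-envelope sketch of the log volume implied by these numbers. It assumes one byte per logged character and uses the 3000 characters per COC and the 10 million COCs mentioned in this message; the helper name is made up.

```python
def log_volume_gb(n_cocs, chars_per_entry, bytes_per_char=1):
    """Approximate log size in gigabytes (10**9 bytes), assuming a fixed entry length."""
    return n_cocs * chars_per_entry * bytes_per_char / 1e9


# 10 million COCs at roughly 3000 characters of log each.
current_gb = log_volume_gb(10_000_000, 3000)
# The original 50 million COCs at the same entry length.
full_gb = log_volume_gb(50_000_000, 3000)

print(f"~{current_gb:.0f} GB for 10 million COCs, ~{full_gb:.0f} GB for 50 million")
```

Since "over 3000 characters" is a lower bound, these figures are lower bounds as well.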

Regards, Uwe


> Jason Pickering <jason.p.pickering@xxxxxxxxx> wrote on 8 June 2016 at 15:44:
> 
> 
> It just seems like if you have five million cat combos, you would need many
> more orders of magnitude of data to support them. If the data was imported
> as events, instead of aggregate, you would not need to explicitly create
> all of those dimensions, but could still create aggregate figures from
> them.
> 
> It just feels like there is no way all of those cat combos are going to be
> filled, unless you really have a TON of data.
> 
> Regards,
> Jason
> 
> 
> 
> On Wed, Jun 8, 2016 at 2:36 PM, Uwe Wahser <uwe@xxxxxxxxx> wrote:
> 
> > Hi Jason,
> >
> > importing aggregate data into data sets (see my reply to Lars yesterday
> > evening:
> > https://lists.launchpad.net/dhis2-users/msg10452.html)
> >
> > Again: the problem is not the import, but the combination of category
> > options.
> > Maybe it would already help a lot, if those bombastic strings for the names
> > wouldn't be created for categoryOptionCombinations.
> >
> > Thanks for good ideas,
> >
> > Uwe
> >
> > ---
> > > Jason Pickering <jason.p.pickering@xxxxxxxxx> wrote on 8 June 2016 at 09:09:
> > >
> > >
> > > Hi Uwe,
> > >
> > > Are you importing this as aggregate data or as events?
> > >
> > > Regards,
> > > Jason
> > >
> > >
> > > On Wed, Jun 8, 2016 at 2:27 AM, Morten Olav Hansen <morten@xxxxxxxxx> wrote:
> > >
> > > >> Just to make sure we are talking about the same thing: the problem does
> > > >> not appear during import, but when generating all possible combinations
> > > >> (when saving the CategoryCombination or when manually invoking the update
> > > >> of categoryOptionCombinations)
> > > >>
> > > >
> > > > > Ah, sorry.. I was thinking it was the import that was slow.. so that part
> > > > > is ok?
> > > >
> > > >
> > > >> so I can still use /api/metadata without version to call the current
> > > >> api-version?
> > > >>
> > > >
> > > > > That will give you the legacy importer, so going forward you would need to
> > > > > use /api/{version}/{endpoint}; we will have more info about it in the
> > > > > release notes.
> > > > >
> > > > > And no, the UI is not switched to the new importer yet (in 2.24), not 100%
> > > > > sure it will...
> > > >
> > > >
> > > >>
> > > >> Thanks for your replies at this time of the day :-)
> > > >>
> > > >> Regards, Uwe
> > > >>
> > > >> ---
> > > >>
> > > >>
> > > >> > Morten Olav Hansen <morten@xxxxxxxxx> wrote on 7 June 2016 at 19:28:
> > > >> >
> > > >> >
> > > >> > Hi Uwe
> > > >> >
> > > >> > The improvements are mainly for speed and validation. Yes, we are now (in
> > > >> > 2.24) introducing a versioned web-api, so that endpoint importer will be
> > > >> > available until 2.26 (we will support 3 versions). In 2.24, the same
> > > >> > endpoint is available at /api/24/metadata.
> > > >> >
> > > >> > If you are using cURL or another utility, the import part would be the
> > > >> > same, but the UI in 2.23 cannot be used, as it's hardcoded to the legacy
> > > >> > importer.
> > > >> >
> > > >> > --
> > > >> > Morten Olav Hansen
> > > >> > Senior Engineer, DHIS 2
> > > >> > University of Oslo
> > > >> > http://www.dhis2.org
> > > >> >
> > > >> > On Tue, Jun 7, 2016 at 11:25 PM, Uwe Wahser <uwe@xxxxxxxxx> wrote:
> > > >> >
> > > >> > > Hi Morten,
> > > >> > >
> > > >> > > no, I didn't. What would be the procedure for that? Importing Categories,
> > > >> > > Options and CategoryCombinations via the api and having DHIS2 generate the
> > > >> > > CategoryOptionCombinations? Would that bring about any change at all, or
> > > >> > > does the importer use different libs for generating the COCs?
> > > >> > >
> > > >> > > btw. is the 23 in the api link valid for future dhis2 versions? I noticed
> > > >> > > it in a few api descriptions recently ...
> > > >> > >
> > > >> > > Regards, Uwe
> > > >> > >
> > > >> > > > Morten Olav Hansen <morten@xxxxxxxxx> wrote on 7 June 2016 at 18:50:
> > > >> > > >
> > > >> > > >
> > > >> > > > Hi Uwe
> > > >> > > >
> > > >> > > > Did you try out the new importer? Available as /api/23/metadata in 2.23
> > > >> > > >
> > > >> > > > On Tuesday, 7 June 2016, Uwe Wahser <uwe@xxxxxxxxx> wrote:
> > > >> > > >
> > > >> > > > > Dear devs,
> > > >> > > > >
> > > >> > > > > I am experiencing problems when handling category combinations. Our
> > > >> > > > > prototype with 5 dimensions went through the process of generating
> > > >> > > > > categoryOptionCombinations (~20,000 records) quite well. 7 dimensions
> > > >> > > > > (~400,000) worked as well, although it took a very long time.
> > > >> > > > >
> > > >> > > > > Now we defined the next data model with 10 dimensions (expecting
> > > >> > > > > ~5 million categoryOptionCombinations) and the process dies without
> > > >> > > > > further notice. Last words in catalina.out:
> > > >> > > > > * INFO  2016-06-07 13:29:33,783 Building object-bridge maps (preheatCache: true, 3 classes). (DefaultObjectBridge.java [http-bio-8180-exec-15])
> > > >> > > > > * INFO  2016-06-07 13:29:36,779 Building object-bridge maps took 2.99 seconds. (DefaultObjectBridge.java [http-bio-8180-exec-15])
> > > >> > > > > * INFO  2016-06-07 13:29:36,896 'admin' update org.hisp.dhis.dataelement.DataElementCategoryCombo, name: Membership, uid: SCgLXYHqVzz (AuditLogUtil.java [http-bio-8180-exec-15])
> > > >> > > > >
> > > >> > > > > Ten dimensions with not extraordinarily big option sets is actually
> > > >> > > > > not unusual and rather slim for multi-dimensional data models in data
> > > >> > > > > warehouses, so I'd expect DHIS2 to be able to handle this easily.
> > > >> > > > >
> > > >> > > > > Could of course be a memory problem (tried up to 14g for tomcat on a
> > > >> > > > > 4-core Ubuntu 14.04 server, DHIS 2.23). Before I start experimenting
> > > >> > > > > with other parameters, I am hoping to get some hints on known
> > > >> > > > > limitations or workarounds from you (not allowed: reducing the number
> > > >> > > > > of options or categories, sql-hacks :-) ). Is there any info on
> > > >> > > > > whether optimizations on this process are being planned in the kernel?
> > > >> > > > >
> > > >> > > > > Some observations on the process:
> > > >> > > > >
> > > >> > > > > * during generation (either when saving the categoryCombination or in
> > > >> > > > >   the data maintenance menu):
> > > >> > > > > - long names - cOCs are generated with names that are getting
> > > >> > > > >   extremely long, as they are mere concatenations of the involved
> > > >> > > > >   categoryOptions. Could there be an option to just use the codes as
> > > >> > > > >   a basis or to leave out the names completely? Could be one reason
> > > >> > > > >   for a memory problem and performance issues.
> > > >> > > > > - long log entries - every single entry is logged in catalina.out
> > > >> > > > >   with several lines of text, causing catalina.out to become
> > > >> > > > >   extremely big.
> > > >> > > > > - during execution lots of Java memory is being used and no DB
> > > >> > > > >   memory, which looks to me as if all the logic is happening in the
> > > >> > > > >   Java machine. It might be more useful to transfer more logic into
> > > >> > > > >   SQL on the DB (e.g. use DB cross joins for combining options), as
> > > >> > > > >   the DB will be more efficient.
> > > >> > > > > - because of the log entries I assume that every single combination
> > > >> > > > >   is being persisted into the DB with a single SQL statement, causing
> > > >> > > > >   millions of single SQL requests. Prefer batch SQL instead of single
> > > >> > > > >   record processing.
> > > >> > > > >
> > > >> > > > > * during import/export of categoryOptionCombinations:
> > > >> > > > > - prefer batch SQL instead of single record processing
> > > >> > > > > - huge log entries in catalina.out due to several lines of text per
> > > >> > > > >   combination
> > > >> > > > >
> > > >> > > > > I'd be very happy about comments.
> > > >> > > > >
> > > >> > > > > Thanks in advance,
> > > >> > > > >
> > > >> > > > > Uwe
> > > >> > > > >
> > > >> > > > > _______________________________________________
> > > >> > > > > Mailing list: https://launchpad.net/~dhis2-users
> > > >> > > > > Post to     : dhis2-users@xxxxxxxxxxxxxxxxxxx
> > > >> > > > > Unsubscribe : https://launchpad.net/~dhis2-users
> > > >> > > > > More help   : https://help.launchpad.net/ListHelp
> > > >> > > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Morten Olav Hansen
> > > >> > > > Senior Engineer, DHIS 2
> > > >> > > > University of Oslo
> > > >> > > > http://www.dhis2.org
> > > >> > >
> > > >>
> > > >
> > > >
> > > > _______________________________________________
> > > > Mailing list: https://launchpad.net/~dhis2-devs
> > > > Post to     : dhis2-devs@xxxxxxxxxxxxxxxxxxx
> > > > Unsubscribe : https://launchpad.net/~dhis2-devs
> > > > More help   : https://help.launchpad.net/ListHelp
> > > >
> > > >
> > >
> > >
> > > --
> > > Jason P. Pickering
> > > email: jason.p.pickering@xxxxxxxxx
> > > tel:+46764147049
> >
> 
> 
> 
> -- 
> Jason P. Pickering
> email: jason.p.pickering@xxxxxxxxx
> tel:+46764147049
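
The two suggestions from the original post quoted above (let the database build the combinations with a cross join, and write rows in batches rather than one statement per record) can be sketched as follows. This is only an illustration under stated assumptions: SQLite stands in for the real database, and the table and column names are invented and are not the DHIS2 schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical tables, not the DHIS2 schema.
    CREATE TABLE category_option (category TEXT, option TEXT);
    CREATE TABLE option_combo (sex TEXT, age_group TEXT);
    """
)

options = [
    ("sex", "male"),
    ("sex", "female"),
    ("age_group", "0-4"),
    ("age_group", "5-14"),
    ("age_group", "15+"),
]
# Batch write: one prepared statement reused for all rows.
conn.executemany("INSERT INTO category_option VALUES (?, ?)", options)

# The database combines the options itself; no combination is built in application memory.
conn.execute(
    """
    INSERT INTO option_combo (sex, age_group)
    SELECT s.option, a.option
    FROM category_option AS s
    CROSS JOIN category_option AS a
    WHERE s.category = 'sex' AND a.category = 'age_group'
    """
)
conn.commit()

n_combos = conn.execute("SELECT COUNT(*) FROM option_combo").fetchone()[0]
print(n_combos, "combinations generated inside the database")
```

Whether this would actually be faster in DHIS2 depends on how the core persists COCs, which the thread does not settle.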

