dhis2-devs team mailing list archive

Re: [Dhis-dev] DataElement -> PeriodType association

 

On Sun, May 23, 2010 at 9:36 AM, Ola Hodne Titlestad <olatitle@xxxxxxxxx> wrote:
> On 23 May 2010 03:32, Bob Jolliffe <bobjolliffe@xxxxxxxxx> wrote:
>>
>> On 22 May 2010 19:51, Ola Hodne Titlestad <olatitle@xxxxxxxxx> wrote:
>> > On 20 May 2010 18:39, Bob Jolliffe <bobjolliffe@xxxxxxxxx> wrote:
>> >>
>> >> On 20 May 2010 15:56, Bob Jolliffe <bobjolliffe@xxxxxxxxx> wrote:
>> >> > 2010/5/20 Ola Hodne Titlestad <olatitle@xxxxxxxxx>:
>> >> >>
>> >> >> 2010/5/20 Lars Helge Øverland <larshelge@xxxxxxxxx>
>> >> >>>
>> >> >>> Data elements derive their period type from the data sets they are
>> >> >>> members
>> >> >>> of.
>> >> >
>> >> > Restated (what I just sent Lars only by mistake):  a datavalue
>> >> > derives
>> >> > its period type from the data set of
>> >> > which its data element is a member  :-)
>> >> >
>> >> >>
>> >> >> And when they are members of two datasets with different period
>> >> >> types
>> >> >> they
>> >> >> have multiple period types right?
>> >> >
>> >> > It's important to remain aware that it is values ultimately which
>> >> > have
>> >> > periods (and hence period types).
>> >> >
>> >> > And when you look at a value you can derive its period type in one of
>> >> > two ways - via dataset or via period.  Potentially these could
>> >> > disagree.  The one which derives from its period should be considered
>> >> > authoritative, i.e. if the period is 2009-Jan then regardless of what
>> >> > the dataset might say this really must be monthly.  Of course we hope
>> >> > these always agree.  Incidentally the lookup from
>> >> > dataelement-to-dataset-to-period looks like a greater complexity than
>> >> > the lookup from period->periodType.
>> >> >
>> >> >>
>> >> >> The key thing to look out for in data entry and data import is to
>> >> >> avoid
>> >> >> overlaps in data values that will cause duplication when aggregating
>> >> >> data
>> >> >> periods.
>> >> >> E.g. if the SAME ORGUNIT registers values for the same data element
>> >> >> for
>> >> >> two
>> >> >> different period types that have overlapping periods, e.g. Jan-10
>> >> >> and
>> >> >> Q1-10.
>> >> >> Then the aggregate values for Q1-10, Jan-June 2010, and 2010 will
>> >> >> all
>> >> >> show
>> >> >> an incorrect value since the value for Jan-10 is counted twice.
>> >> >
>> >> > OK.  That's a good concrete constraint to have.
>> >> >
>> >> >>
>> >> >> One way to enforce this constraint is to monitor which datasets an
>> >> >> orgunit
>> >> >> is assigned to, and not allow orgunits to be assigned to two
>> >> >> datasets
>> >> >> that
>> >> >> have the same data element AND different period types.
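[Editor's sketch] The assignment check described above could look roughly like the following. This is a minimal illustration, not DHIS code: the `DataSet` class, the `conflicting_pairs` helper and the example data set names are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class DataSet:
    """Hypothetical stand-in for a data set: name, period type, member data elements."""
    name: str
    period_type: str
    data_elements: frozenset


def conflicting_pairs(assigned):
    """Return pairs of data sets that share a data element but differ in period type."""
    conflicts = []
    for a, b in combinations(assigned, 2):
        shared = a.data_elements & b.data_elements
        if shared and a.period_type != b.period_type:
            conflicts.append((a.name, b.name, sorted(shared)))
    return conflicts


# Data sets one orgunit might be assigned to (invented example data).
monthly_epi = DataSet("EPI monthly", "Monthly", frozenset({"BCG doses", "Measles doses"}))
quarterly_epi = DataSet("EPI quarterly", "Quarterly", frozenset({"BCG doses"}))
monthly_opd = DataSet("OPD monthly", "Monthly", frozenset({"OPD visits"}))

print(conflicting_pairs([monthly_epi, quarterly_epi, monthly_opd]))
# [('EPI monthly', 'EPI quarterly', ['BCG doses'])]
```

An orgunit assignment would be rejected whenever the returned list is non-empty. The check only looks at metadata, so it does not need to touch stored data values.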
>> >> >
>> >> > Agreed.  Though this constraint should probably be imposed on forms
>> >> > rather than datasets.
>> >> >
>> >> >>As far as I am aware,
>> >> >> we are not checking for this today. During data import it could be
>> >> >> checked
>> >> >> on data element level by looking up the period type the way Bob has
>> >> >> shown,
>> >> >> but that sounds like a lot of look ups and time consuming
>> >> >> validation,
>> >> >> or?
>> >> >
>> >> > On data import we don't really validate at all, beyond whatever
>> >> > constraints the db imposes. For efficiency we simply pop the values
>> >> > in
>> >> > with multiple insert statements.  So this validation would have to
>> >> > happen as a stage before the actual import or would have to be
>> >> > constrained within the db.  In fact it can't be validated easily
>> >> > before the import as it is dependent on existing values within the
>> >> > db.
>> >> >
>> >> >>
>> >> >> A relatively normal use case that we probably have to find a way to
>> >> >> support,
>> >> >> and I think they are struggling with in Vietnam, is that different
>> >> >> provinces
>> >> >> can use different period types for the same data elements (even for
>> >> >> complete
>> >> >> data sets). E.g. if the national data flow policy says to report on
>> >> >> immunisation data every quarter, so that becomes the minimum
>> >> >> requirement for
>> >> >> all provinces. Then some of the provinces decide that all their
>> >> >> facilities
>> >> >> have to collect this data monthly anyway, and then at the province
>> >> >> level
>> >> >> they simply send the quarterly aggregates to national level (in the
>> >> >> paper-based or Excel world). At the same time other provinces just
>> >> >> collect
>> >> >> quarterly data at the facility level as in the minimum national
>> >> >> requirement.
>> >> >> At the national level there is a need to consolidate all this data,
>> >> >> even
>> >> >> data by the facility level, so ideally a national DHIS database
>> >> >> should
>> >> >> be
>> >> >> able to store both monthly and quarterly raw data values for the
>> >> >> same
>> >> >> data
>> >> >> elements, but for different orgunits. The national information users
>> >> >> can
>> >> >> then easily generate quarterly reports on immunisation for all
>> >> >> provinces,
>> >> >> while in some provinces they can do monthly data analysis if they
>> >> >> want
>> >> >> to
>> >> >> collect data using that frequency.
>> >> >>
>> >> >> We support the above scenario by allowing the same data elements to
>> >> >> be
>> >> >> assigned to different data sets with different period types, but we
>> >> >> don't
>> >> >> control for misuse of this flexibility which can lead to duplication
>> >> >> and
>> >> >> inconsistent aggregated data values as pointed out above.
>> >> >
>> >> > Thinking further ... I really think the problem arises because we
>> >> > have a dataset concept which represents a form and is also used to
>> >> > constrain periodtypes on dataelements.  Thinking of the use case you
>> >> > have just described, it should be the case that one can have a paper
>> >> > form which national level expect to collect quarterly, and the same
>> >> > form be used at a lower level to collect data monthly.  If we wanted
>> >> > to mirror that use case electronically we would have to divorce the
>> >> > form from the periodtype - ie a form would collect datavalues of a
>> >> > certain period, but the same form could be used in different orgunits
>> >> > for collecting data at a different frequency.
>> >> >
>> >> > So (leaving dataset aside for the moment) if we can't assign a
>> >> > periodtype to a form and we can't assign one to a dataelement and it's too
>> >> > inefficient to validate on a one by one datavalue basis what is a
>> >> > girl
>> >> > to do?
>> >> >
>> >> > I suspect the correct answer is to refactor datavalue and create a
>> >> > datavalueset type - note: a set of datavalues rather than a set of
>> >> > dataelements.  Designing out loud, a datavalueset would have the
>> >> > following fields/attributes:
>> >> >
>> >> > 1.  a formid - the collection instrument used - roughly corresponds
>> >> > to
>> >> > current dataset
>> >> > 2.  an orgunitid - where the datavalues come from
>> >> > 3.  a periodid - the period of all the datavalues
>> >> > plus a couple of other useful attributes I can think of
>> >> >
>> >> > Datavalue now becomes slightly simpler (which is always a good
>> >> > thing).
>> >> >  It only has:
>> >> > value, dataelementid, categorycombooption, datasetid
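[Editor's sketch] The proposed split between datavalueset and datavalue might be modelled as below. This is only an illustration under stated assumptions: the class and field names are hypothetical, the `id` field on the set is added so values can point at it, and the last field listed for datavalue above ("datasetid") is assumed here to refer to the owning datavalueset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataValueSet:
    """One submission: which form, which orgunit, which period."""
    id: int            # assumed surrogate key, not named in the proposal
    form_id: int       # the collection instrument used
    orgunit_id: int    # where the data values come from
    period_id: str     # the period shared by all values in the set


@dataclass(frozen=True)
class DataValue:
    """A single value; orgunit and period are reached through its set."""
    value: float
    dataelement_id: int
    categoryoptioncombo_id: int
    datavalueset_id: int  # assumption: the proposal's "datasetid"


def period_of(data_value, sets_by_id):
    """Look up a value's period through its data value set."""
    return sets_by_id[data_value.datavalueset_id].period_id


# Invented example: one monthly submission holding one value.
jan_set = DataValueSet(id=1, form_id=10, orgunit_id=200, period_id="2010-01")
sets_by_id = {jan_set.id: jan_set}
deaths = DataValue(value=4.0, dataelement_id=7, categoryoptioncombo_id=3, datavalueset_id=1)

print(period_of(deaths, sets_by_id))  # 2010-01
```

Because the period lives on the set rather than on each value, a period-type rule can be checked once per submitted set instead of once per data value.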
>> >>
>> >> Afterthought:
>> >> At the risk of adding complexity to what is otherwise a
>> >> simplification, my life could become even simpler if datavalueset also
>> >> had a categorycombo attribute, which would imply that a dataset was
>> >> linked to a formsectionid rather than a formid.
>> >>
>> >> So a form has sections.  sections have dataelements.  And sections
>> >> have a datavalueset as a model - which implies a uniform categorycombo
>> >> within the section.
>> >>
>> >> There isn't really a need for dataelements to have a categorycombo.
>> >> And in lots of ways it's good that they don't. Then I am reducing
>> >> complexity rather than adding to it :-)
>> >>
>> >> Consider one orgunit has collected malaria deaths disaggregated by
>> >> age.  Another has collected values for the same dataelement, but
>> >> not disaggregated by age.  The datavalues will come from a
>> >> datavalueset so will have a categorycombo.  It is possible to
>> >> aggregate or compare these datavalues, from different datavaluesets,
>> >> but using the lowest common denominator of categorycombo, i.e. in both
>> >> cases you have access to malaria deaths - in the one case you have to
>> >> "roll-up" the categorycombo which does of course assume that the sum
>> >> of category options make a sensible whole, but Ola has mentioned this
>> >> one many times.
>> >>
>> >
>> > Some really interesting ideas you are bringing up here Bob. I like the
>> > kind
>> > of flexibility and yet structure this would bring to the data model.
>> >
>> > One quick question though:
>> > How would this fit with the use of data elements and
>> > categorycombooptions in
>> > metadata expressions like indicators and validation rules that are (and
>> > should be) completely independent from data collection structures? E.g.
>> > which categories and options should be available for a given data
>> > element
>> > when setting up an indicator formula? All?
>>
>> I think it's a question of the "lowest common denominator" of the
>> datavalues that you have.  Indicators are calculated from datavalues
>> even though we express the calculation in terms of dataelements.
>>
>> Ivalue = f(de1,de2,de3...)/g(de4, de5 ..)
>>
>> Looking just at the numerator - if the set of datavalues you have
>> corresponding to de1, de2 and de3 share the same categorycombo (and
>> note that datavalues do have a categorycombo from which their
>> categoryoptioncombo is derived) , then you can also produce a
>> similarly disaggregated indicator value.
>>
>> If they use different categorycombos (some have age+sex, some have
>> hiv_age+sex, and some have just sex), but each of these have at least
>> the sex category, then you could produce an indicator value
>> disaggregated by sex.
>>
>> If the categorycombos are a jumble of apples and pears then you can
>> produce just the rolled up calculation.
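[Editor's sketch] One way to read the "lowest common denominator" idea is as a set intersection over the categories of each categorycombo, followed by a roll-up of every category outside that intersection. The sketch below assumes category option values can simply be summed, as the thread notes. The function names and sample figures are hypothetical.

```python
def common_categories(category_combos):
    """Intersect the category sets; an empty result means only a full roll-up is possible."""
    combos = [frozenset(c) for c in category_combos]
    if not combos:
        return frozenset()
    return frozenset.intersection(*combos)


def roll_up(values, keep):
    """Sum values, keeping only the categories in `keep` as the disaggregation key."""
    totals = {}
    for options, value in values:
        key = tuple(sorted((cat, opt) for cat, opt in options.items() if cat in keep))
        totals[key] = totals.get(key, 0) + value
    return totals


# Invented values: some disaggregated by age+sex, some by sex only.
values = [
    ({"age": "<5", "sex": "F"}, 3),
    ({"age": "5+", "sex": "F"}, 4),
    ({"age": "<5", "sex": "M"}, 2),
    ({"sex": "F"}, 10),
    ({"sex": "M"}, 6),
]

keep = common_categories([{"age", "sex"}, {"hiv_age", "sex"}, {"sex"}])
print(sorted(keep))            # ['sex']
print(roll_up(values, keep))   # {(('sex', 'F'),): 17, (('sex', 'M'),): 8}
```

When the combos share no category ("apples and pears"), `common_categories` returns an empty set and `roll_up` collapses everything into a single total under the empty key.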
>>
> I like this idea.
>
>
>>
>> What is  the implication?  At design time, when you are coding the
>> expression, you probably should not include the categoryoptioncombo at
>> all.  The indicator is just expressed in terms of dataelements (I
>> guess traditional DHIS14 style).  But when you are generating for
>> example, the reporttable, the first pass analyzes the data you have
>> selected and suggests - would you like the indicator data
>> disaggregated by sex? Or age+sex?  Or no disaggregation.  So what you
>> can report on is determined by the data you've got.  I think that's a
>> sound principle.
>>
> I can see a few challenges with this principle. In typical implementations
> of DHIS you would design forms and canned/fixed reports at the same time
> before rolling out the installations. If it is impossible to design reports
> before you have any data values I can see a problem with this approach. But
> I guess you would know, from the forms information the potential
> datavaluesets and therefore could allow some disaggregated reports to be
> prepared even before you have any data values?
>
> Another issue I would like to bring up is performance. In the past we have
> struggled with and spent a lot of time on improving the performance of the
> datamart, the aggregation of data values. To me it sounds more complicated
> to have a floating set of disaggregations that needs to be looked up in a
> potentially huge storage of datavalues compared to working with a fixed set.
> Any thoughts on data mart service performance with this proposed design
> compared to the existing one?
>
>> And I think all of this is completely independent of data collection
>> structures.
>>
>> Of course in practice you will have designed and deployed your
>> collection instruments such that all your datavalues for a given
>> dataelement will have the same categorycombo.  But if you want to
>> compare data over the past five years, and the ministry decided only
>> in year two that they wanted to disaggregate by sex and in year 4
>> decided to introduce a third sex category, then you could still
>> calculate an indicator from all of those datavalues - but by rolling
>> up sex category.
>>
>> I think what we do currently - specifying the categorycombo in the
>> indicator expression - is more rigid and more fragile.
>>
> Agree, and I think most indicator analysis will be on the data element
> level anyway (without any disaggregations), so the current design is too
> complicated and cumbersome to work with.

We definitely need something that is manageable, both in terms of
understanding and performance. But looking at the GHO and thinking
about National Health Observatories (for which I think DHIS2 is quite
suited), people definitely want breakdowns by at least the "standard"
dimensions of age and sex.

Knut

PS: I think it can be a good practice in these long threads to snip
out some parts of the emails that are no longer needed for where the
discussion has gone (such as everything below your signature),
otherwise it becomes hard to reply and read, even in good clients like
Gmail.


