
dhis2-devs team mailing list archive

Re: Fwd: On categories and dimensions and zooks

 

Hi Lars

I think your suggestion might adequately cover the analysis use case, but
there remains a missing piece to the puzzle regarding SDMX export.  I am
particularly thinking of the challenge Ola and Knut will shortly face of
presenting DHIS as a consumer of WHO MDG Indicator metadata and a producer
of SDMX MDG reports.  Comments inline below:

2009/10/10 Lars Helge Øverland <larshelge@xxxxxxxxx>

>
> Here comes my shot at this issue. I'm gonna use Ola's example as a basis.
>
> <!-- start -->
>
> *
> *The flat data element names:
> "Malaria death <5 year"
> "Malaria death >5 year"
> "Malaria in OPD 1st attendance <5 year"
> "Malaria in OPD 1st attendance >5 year"
> "Malaria IP discharge <5 year"
> "Malaria IP discharge >5 year"
> "Typhoid death <5 year"
> "Typhoid death >5 year"
> etc.
> (OPD is outpatient, meaning patients treated at the clinic; IP is
> inpatient, meaning patients who were admitted to a hospital).
>
> There are three dimensions in the data elements above, so I define three
> data element group sets:
> Disease, Patient Status, and Age.
> I also define 7 new data element groups (Malaria, Typhoid, <5, >5, Death,
> OPD, IP) and assign these groups to the group set they belong to:
> Disease (Malaria, Typhoid)
> Patient Status (Death, OPD, IP)
> Age (<5, >5)
>
> I then assign the data element groups to the data elements
> "Malaria death <5 year" assigned to "Malaria", "Death", and "<5".
> etc.
>
> All these groupings can exist completely independently of data entry and be
> changed at any time.
> From this I can generate a new resource table for my data analysis
> (similar to the one we already have for orgunit group sets) that provides:
> Data Element Group Set, Data Element Group, Data Element
> "Disease", "Malaria", "Malaria death <5 year",
> "Disease", "Typhoid", "Typhoid death <5 year"
> "Patient Status", "Death", "Malaria death <5 year"
> etc.
>
> When joining the above table with an aggregated data value table you can
> define a pivot table with your three data element group sets as columns
> (pivot fields) and analyse the data across these three dimensions. The data
> element name dimension can then be completely hidden in the analysis.
>
> <!-- end -->
>
>
> Some observations:
>
>
> a) From this we can derive that a GroupSet corresponds to a Dimension and
> that a Group corresponds to a DimensionOption.
>
> Dimension = GroupSet
> DimensionOption = Group
>
>
> b) The current Category model and the suggested simplified version both
> generate CategoryOptionCombos/DimensionElementCombinations which are linked
> to DataValue and constitute all possible combinations of their associated
> CategoryOptions/DimensionOptions. This means that once those
> CategoryOptionCombos/ DimensionElementCombinations are generated and
> DataValues are registered for them, they cannot change. Also, once a data
> entry grid is defined, the underlying model cannot change. According to Ola
> and Jason we must be able to assign "any dimension to a DataElement" at any
> time.


I think here is the snag.  In the proposed scheme you are not really
assigning dimensions to a dataelement at all.  In fact you do the reverse -
you assign dataelements to a dimension.  I still need to end up with a
resulting indicator/dataelement which has a name and which has these
dimensions.  I'll try a snippet of Patrick's sample sdmx inline here to
illustrate the point (Best viewed by making your font size very small).

Here is an example of some indicators:
<CodeLists>
        <structure:CodeList id="CL_INDICATOR" agencyID="SDMX-HD"
version="1.0" isFinal="false"
urn="urn:sdmx:org.sdmx.infomodel.codelist=SDMX-HD:CL_INDICATOR" >
            <structure:Name xml:lang="en">Indicator</structure:Name>
            <structure:Code value="0"
urn="urn:sdmx:org.sdmx.infomodel.codelist.Code=SDMX-HD:INDICATOR[1.0].0">
                <structure:Description xml:lang="en">Neonatal mortality rate
(per 1000 live births)</structure:Description>
            </structure:Code>
            <structure:Code value="1"
urn="urn:sdmx:org.sdmx.infomodel.codelist.Code=SDMX-HD:INDICATOR[1.0].1">
                <structure:Description xml:lang="en">Number of deaths during
first 28 completed days of life</structure:Description>
            </structure:Code>
            <structure:Code value="2"
urn="urn:sdmx:org.sdmx.infomodel.codelist.Code=SDMX-HD:INDICATOR[1.0].2">
                <structure:Description xml:lang="en">1000 live births in a
given year</structure:Description>
            </structure:Code>
            <structure:Code value="3"
urn="urn:sdmx:org.sdmx.infomodel.codelist.Code=SDMX-HD:INDICATOR[1.0].3">
                <structure:Description xml:lang="en">Life expectancy at
birth</structure:Description>
            </structure:Code>
            <structure:Code value="4"
urn="urn:sdmx:org.sdmx.infomodel.codelist.Code=SDMX-HD:INDICATOR[1.0].4">
                <structure:Description xml:lang="en">Adults aged ≥ 15 years
who are obese</structure:Description>
            </structure:Code>
        </structure:CodeList>
  </CodeLists>

(The last one strikes me as a bit odd.  I would have thought the indicator
would be "Number of people who are obese" and the age stuff would be in a
dimension.  But anyway ... best not to get obsessed with dimensions.)

Here is an example of a dimension:
  <CodeLists>
    <structure:CodeList id="CL_GENDER" agencyID="SDMX-HD" version="1.0"
isFinal="true" urn="urn:sdmx:org.sdmx.infomodel.codelist=SDMX-HD:CL_GENDER">
      <structure:Name xml:lang="en">Gender</structure:Name>
      <structure:Description xml:lang="en">Gender.</structure:Description>
      <structure:Code value="1"
urn="urn:sdmx:org.sdmx.infomodel.codelist.Code=SDMX-HD:CL_GENDER[1.0].1">
        <structure:Description xml:lang="en">Male</structure:Description>
      </structure:Code>
      <structure:Code value="2"
urn="urn:sdmx:org.sdmx.infomodel.codelist.Code=SDMX-HD:CL_GENDER[1.0].2">
        <structure:Description xml:lang="en">Female</structure:Description>
      </structure:Code>
      <structure:Code value="3"
urn="urn:sdmx:org.sdmx.infomodel.codelist.Code=SDMX-HD:CL_GENDER[1.0].3">
        <structure:Description
xml:lang="en">Transgender</structure:Description>
      </structure:Code>
      <structure:Code value="_NA"
urn="urn:sdmx:org.sdmx.infomodel.codelist.Code=SDMX-HD:CL_GENDER[1.0]._NA">
        <structure:Description xml:lang="en">Not
Applicable</structure:Description>
      </structure:Code>
      <structure:Code value="_ALL"
urn="urn:sdmx:org.sdmx.infomodel.codelist.Code=SDMX-HD:CL_GENDER[1.0]._ALL">
        <structure:Description xml:lang="en">All</structure:Description>
      </structure:Code>
      <structure:Code value="_UNK"
urn="urn:sdmx:org.sdmx.infomodel.codelist.Code=SDMX-HD:CL_GENDER[1.0]._UNK">
        <structure:Description xml:lang="en">Unknown</structure:Description>
      </structure:Code>
    </structure:CodeList>
  </CodeLists>

Note that both the indicator and the dimension are represented by a common
element (structure:CodeList).  This is not purely coincidental.  In terms of
the DataValue, the indicator and the dimension are treated the same way - as
an attribute.  So in this sense the Indicator (like the period and orgunit)
is like a compulsory dimension.

        <ns:Series DISEASE="1" PROG="0" GEOGRAPHIC_PLACE_NAME="CH-GE"
ORGANIZATION="1" INDICATOR="4" VALUE_TYPE="1" GENDER="_ALL" AGROUP="5"
GLOCATION="3" PERIODICITY="4" UNIT="_NA" REPEATS="0"  >
            <ns:Obs OBS_VALUE="400" TIME_PERIOD="2008"
DATE_COLLECT="2009-03-20" />
        </ns:Series>

(Series is just used to group datavalues in a time series.  DISEASE might be
for example Malaria)

What would (or what could) the Indicator be in our sample scenario?  This is
where it would be really useful to get hold of the actual MDG indicator
definitions that we apparently won't see till the 20th.  Having said that we
can get a pretty good idea of what they will look like from here:
http://mdgs.un.org/unsd/mdg/Host.aspx?Content=Indicators/OfficialList.htm.

Anyway, I hope you see my point.  Whereas we do need to be able to group
indicators/dataelements into dimensions, those dimensions still have to be a
dimension of something.  Is it a dimension of the Indicator?  Well almost,
but not quite.  It's interesting that, if you look at the indicator list
above, there is no mention of dimensions.  I think - and I don't want to
confuse things by bringing in further terminology - it is actually a
dimension of the "measure".  In some recent discussions we (myself included)
thought that dataelement might be equivalent to what some people call
measure.  This is not the case, as Jørn quickly and vigorously pointed out.
The "measure" is the type of data value (or series of datavalues), which
might be something like "percentage of population" or "proportion of
population per 1000".

And the measure would have dimensions, including compulsory ones like
Indicator, Period, OrganisationUnit as well as optional ones like Disease,
Gender, Age etc.
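
To make the compulsory/optional split concrete, here is a minimal Java
sketch.  It assumes an observation is simply a value plus a map from
dimension name to code, in the spirit of the Series sample above.  The class
name Observation, the dimension names PERIOD and ORGUNIT, and the "_NA"
fallback are hypothetical choices for illustration, not part of DHIS or of
the SDMX-HD samples.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one value of a measure, keyed by its dimensions.
final class Observation
{
    // Assumed compulsory dimensions, following the paragraph above.
    static final List<String> COMPULSORY = List.of( "INDICATOR", "PERIOD", "ORGUNIT" );

    private final Map<String, String> dimensions;
    private final double value;

    Observation( Map<String, String> dimensions, double value )
    {
        // Reject observations that lack any compulsory dimension.
        for ( String name : COMPULSORY )
        {
            if ( !dimensions.containsKey( name ) )
            {
                throw new IllegalArgumentException( "Missing compulsory dimension: " + name );
            }
        }
        this.dimensions = new LinkedHashMap<>( dimensions );
        this.value = value;
    }

    // Returns the code for a dimension, or "_NA" when an optional one is not set.
    String get( String name )
    {
        return dimensions.getOrDefault( name, "_NA" );
    }

    double getValue()
    {
        return value;
    }
}
```

Optional dimensions such as GENDER or DISEASE can be present or absent, while
a missing INDICATOR is rejected when the observation is constructed.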

But in practice, because the Indicator is a compulsory dimension,  a
particular instance of a measure (an OBS_VALUE in SDMX) would be associated
with a particular Indicator + its other dimensions.  So I think, besides the
Indicators which make up the dimensions as per the groupset idea, we must
also have an Indicator which *has* these dimensions.  A recursion I know.

So, in addition to Lars' model, I would propose an Indicator (and
DataElement) interface as follows:

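import java.util.List;

interface MultiDimensionalElement
{
   // java.util.List keeps insertion order, so it can serve as the ordered list
   List<Dimension> getDimensions();
   void setDimensions(List<Dimension> dimensions);
   void addDimension(Dimension dimension);
   // etc.
}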

and Indicator implements MultiDimensionalElement; and DataElement implements
MultiDimensionalElement.

And of course getDimensions() can (and in many or most cases will) return null.
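
A minimal sketch of how an implementation might look, under a few
assumptions: java.util.List stands in for the ordered list, the Dimension
class here is only a hypothetical placeholder for the proposed Dimension
object, and the Indicator shown is an illustration rather than the existing
DHIS class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical placeholder for the Dimension (GroupSet) object discussed above.
final class Dimension
{
    private final String name;

    Dimension( String name ) { this.name = name; }

    String getName() { return name; }
}

interface MultiDimensionalElement
{
    List<Dimension> getDimensions();
    void setDimensions( List<Dimension> dimensions );
    void addDimension( Dimension dimension );
}

// Sketch of an Indicator that carries its own ordered dimensions.
final class Indicator implements MultiDimensionalElement
{
    private final String name;
    private List<Dimension> dimensions; // stays null until dimensions are assigned

    Indicator( String name ) { this.name = name; }

    String getName() { return name; }

    public List<Dimension> getDimensions()
    {
        // Null means "no dimensions", as described in the text above.
        return dimensions == null ? null : Collections.unmodifiableList( dimensions );
    }

    public void setDimensions( List<Dimension> dimensions )
    {
        this.dimensions = dimensions == null ? null : new ArrayList<>( dimensions );
    }

    public void addDimension( Dimension dimension )
    {
        if ( dimensions == null )
        {
            dimensions = new ArrayList<>();
        }
        dimensions.add( dimension );
    }
}
```

A DataElement could implement the same interface in the same way.  Returning
an empty list instead of null would spare callers a null check, but that is
a design choice left open here.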

Remaining thoughts:
(i)  an Indicator, even a multidimensional one, still needs a value.  I
suspect in most cases this will be the aggregation of its dimension values.
For example, taking MDG indicator number 4.1 (Under-five mortality rate),
this will probably have a Gender dimension which we will implement using
groups and groupsets, but it will also have an aggregate value.
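
As a rough sketch of that aggregation, assuming the per-option values are
plain counts that can simply be summed (a rate would instead need its
numerator and denominator aggregated separately).  The class name
DimensionalValues is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: values of one indicator split along a single dimension.
final class DimensionalValues
{
    private final Map<String, Double> valueByOption = new LinkedHashMap<>();

    // Registers the value for one dimension option, e.g. "Male" or "Female".
    void put( String option, double value )
    {
        valueByOption.put( option, value );
    }

    // The aggregate value is the sum over all options of the dimension.
    double total()
    {
        double sum = 0;
        for ( double v : valueByOption.values() )
        {
            sum += v;
        }
        return sum;
    }
}
```

With made-up counts of 120 for "Male" and 95 for "Female", total() gives the
aggregate across the Gender dimension.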

(ii)  Medium term: I don't think it makes any sense to continue to support
two methods of implementing multidimensionality.  The revised model of Lars
(with additions) should eventually also be usable to implement the grid data
entry requirement.  But we can suspend that discussion for now.

Sorry for the long mail.  Lars do you think it makes sense to extend your
model this way?  I know we need to come up with a solution pretty quickly on
this.

Regards
Bob



> To me this rules out re-using the same dimensional attributes for data entry
> and analysis - we must in any case have one set of dimensions for data entry
> and one set of dimensions for analysis.
>
>
> c) Ola's suggested solution supports this. It is powerful in the ability to
> assign "raw" DataElements to Dimensions/GroupSets through
> DimensionOptions/Groups, completely independent of which Categories the
> DataElement was assigned to for data entry. The weakness is that it is based
> on flat data elements, not Categorized data elements, which we must include
> if we are to justify the Categorized data entry.
>
>
> d) The Category model is pretty good at what it currently does -
> facilitating grid-based dataentry and cutting down on the number of data
> elements (as well as making the data element naming more elegant).
>
>
> Based on this I suggest we do the following:
>
> 1) We continue to use the Category model as it is, not for analysis - but
> for data entry.
>
> 2) Taken from Bob's suggestion - we phase out the existing Group and
> replace it with a new DimensionOption object. We introduce a new Dimension
> object which will work similarly to a GroupSet. We use this model for
> analysis.
>
> 3) We go for Ola's mentioned suggestion for analysis, with one exception:
> Rather than assigning DataElements to a Group/DimensionOption, we assign a
> combination of DataElement and CategoryOptionCombo (We create a new object
> for this for every assignment - and remove it for every de-assignment). If
> we want to see the total, we can assign a DataElement with the "default"
> CategoryOptionCombo, or create a DimensionOption where the elements make a
> total when summarized.
>
> 4) We use the same thing for Indicators.
>
>
> The resource table Ola mentions will then look like this:
>
> Group Set -Group - Data Element - CategoryOptionCombo
>
> "Disease" - "Malaria" - "Malaria" - "(death, <5 year)"
> "Disease" - "Typhoid" - "Typhoid" - "(death, >5 year)"
>
>
> This way we can assign dimensions as we like without losing the fine
> granularity of the captured categorized data. We can improve the report
> table functionality in order to utilize this. This will be feasible with the
> time and resource constraints we are operating with. It also alleviates the
> challenge regarding Indicators and SDMX.
>
>
> Additionally, one could expand the quotation from a) to:
>
> Dimension = GroupSet = Category
> DimensionOption = Group = CategoryOption
>
> which means there is potential in merging those objects/making them
> implement a common interface. But I don't see the value if b) is valid.
>
>
> Waiting for your replies/slaughter.
>
>
> Lars
