dhis2-devs team mailing list archive
-
dhis2-devs team
-
Mailing list archive
-
Message #16448
[Branch ~dhis2-documenters/dhis2/dhis2-docbook-docs] Rev 463: Updated data warehouse chapter
------------------------------------------------------------
revno: 463
committer: Lars Helge Overland <larshelge@xxxxxxxxx>
branch nick: dhis2-docbook-docs
timestamp: Tue 2012-03-13 13:55:12 +0100
message:
Updated data warehouse chapter
modified:
src/docbkx/en/dhis2_implementation_guide_data_warehouse.xml
--
lp:~dhis2-documenters/dhis2/dhis2-docbook-docs
https://code.launchpad.net/~dhis2-documenters/dhis2/dhis2-docbook-docs
Your team DHIS 2 developers is subscribed to branch lp:~dhis2-documenters/dhis2/dhis2-docbook-docs.
To unsubscribe from this branch go to https://code.launchpad.net/~dhis2-documenters/dhis2/dhis2-docbook-docs/+edit-subscription
=== modified file 'src/docbkx/en/dhis2_implementation_guide_data_warehouse.xml'
--- src/docbkx/en/dhis2_implementation_guide_data_warehouse.xml 2012-02-08 22:35:01 +0000
+++ src/docbkx/en/dhis2_implementation_guide_data_warehouse.xml 2012-03-13 12:55:12 +0000
@@ -7,7 +7,7 @@
<title>Data warehouses and operational systems</title>
<para>A <emphasis role="italic">data warehouse</emphasis> is commonly understood as a database used for analysis. Typically data is uploaded from various operational / transactional systems. Before data is loaded into the data warehouse it usually goes through various stages where it is cleaned for anomalies and redundancy and transformed to conform with the overall structure of the integrated database. Data is then made available for use by analysis, also known under terms such as<emphasis role="italic"> data mining </emphasis>and <emphasis role="italic">online analytical processing</emphasis>. The data warehouse design is optimized for speed of data retrieval and analysis. To improve performance the data storage is often redundant in the sense that the data is stored both in its most granular form and in an aggregated (summarized) form.</para>
<para>A <emphasis role="italic">transactional system</emphasis> (or <emphasis role="italic">operational system</emphasis> from a data warehouse perspective) is a system that collects, stores and modifies low level data. This system is typically used on a day-to-day basis for data entry and validation. The design is optimized for fast insert and update performance.</para>
- <graphic fileref="resources/images/implementation_guide/data_warehouse.png" width="80%" format="PNG" align="center"/>
+ <graphic fileref="resources/images/implementation_guide/data_warehouse.png" format="PNG" width="80%" align="center"/>
<para>There are several benefits of maintaining a data warehouse, some of them being:</para>
<itemizedlist>
<listitem>
@@ -46,16 +46,15 @@
<graphic fileref="resources/images/implementation_guide/dhis_data_warehouse.png" format="PNG" width="80%" align="center"/>
</section>
<section>
- <title>Aggregation strategies in DHIS 2</title>
- <para>DHIS 2 is designed to run in low-end environments which puts certain restrictions on the performance. Two strategies for aggregation of data is offered:<emphasis role="italic"> Real-time aggregation</emphasis> means that the system will generate aggregated data on-the-fly based on the low-level data every time a report is requested. This implies that the aggregate data will reflect the the very latest captured data and is useful if producing reports immediately after data entry has been done is a priority. The downside is that this will not perform adequately on an online server where the database contains a large number of records and there is high user concurrency.</para>
- <para><emphasis role="italic">Batch aggregation</emphasis> means that the system will generate aggregated data every night for a defined time-span (typically the last two years) based on the low-level data and write this data to a data mart. A data mart is a data store optimized for meeting the most common user requests for data analysis. The DHIS 2 data mart contains data aggregated in the<emphasis role="italic"> space dimension</emphasis> (the organisation unit hierarchy), <emphasis role="italic">time dimension</emphasis> (over multiple periods) and for <emphasis role="italic">indicator formulas</emphasis> (mathematical expressions including data elements). This strategy for aggregation provides great performance even in high-concurrency environments since most requests for analysis can be served with a single, simple database query against the data mart. The aggregation engine in DHIS 2 is capable of processing low-level data in the multi-millions and manage most national-level databases, and it can be said to provide <emphasis role="italic">near real-time access</emphasis> to aggregate data. The downside of this approach is that captured data will not be available for aggregated analysis until the next day. However, for a routine system like DHIS 2 where data is typically collected with a monthly periodicity this is not a significant problem.</para>
- <para><emphasis role="italic">Hint</emphasis>: The aggregation strategy can be set in “Settings” - “System settings”, while scheduling of data mart exports can be enabled in “Reporting” - “Scheduling”.</para>
+ <title>Aggregation strategy in DHIS 2</title>
+ <para>The analysis tools in DHIS 2 reads aggregated data from <emphasis role="italic">data mart</emphasis> tables. A data mart is a data store optimized for meeting the most common user requests for data analysis. The DHIS 2 data mart contains data aggregated in the<emphasis role="italic"> space dimension</emphasis> (the organisation unit hierarchy), <emphasis role="italic">time dimension</emphasis> (over multiple periods) and for <emphasis role="italic">indicator formulas</emphasis> (mathematical expressions including data elements). Retrieving data directly from data marts provides good performance even in high-concurrency environments since most requests for analysis can be served with a single, simple database query against the data mart. The aggregation engine in DHIS 2 is capable of processing low-level data in the multi-millions and manage most national-level databases, and it can be said to provide <emphasis role="italic">near real-time access</emphasis> to aggregate data.</para>
+ <para>DHIS 2 allows for setting up scheduled aggregation tasks which typically will refresh and populate the data mart with aggregated data every night. You can choose between aggregating data for the last 12 months every night, or aggregate data for the last 6 months everry night and the last 6 to 12 months every Saturday. The scheduled tasks can be configured under "Scheduling" in "Data administration" module. It is also possible to execute arbitrary data mart tasks under "Data mart" in "Reports" module.</para>
</section>
<section>
<title>Data storage approach</title>
<para>There are two leading approaches for storing data in a data warehouse, namely the <emphasis role="italic">normalized</emphasis> and <emphasis role="italic">dimensional</emphasis> approach. DHIS 2 lends a bit from the former but mostly from the latter. In the dimensional approach the data is partitioned into <emphasis role="italic">dimensions</emphasis> and <emphasis role="italic">facts</emphasis>. Facts generally refers to transactional numeric data while dimensions are the reference data that gives context and meaning to the data. The strict rules of this approach makes it easy for users to understand the data warehouse structure and provides for good performance since few tables must be combined to produce meaningful analysis, while it on the other hand might make the system less flexible and harder to change.</para>
<para>
In DHIS the facts corresponds to the data value object in the data model. The data value captures data as numbers, yes/no or text. The <emphasis role="italic">compulsory dimensions</emphasis> which give meaning to the facts are the <emphasis role="italic">data element</emphasis>, <emphasis role="italic">organisation unit hierarchy</emphasis> and <emphasis role="italic">period</emphasis> dimensions. These dimensions are referred to as compulsory since they must be provided for all stored data records. DHIS 2 also has a custom dimensional model which makes it possible to represent any kind of dimensionality. This model must be defined prior to data capture. DHIS 2 also has a flexible model of groups and group sets which makes it possible to add custom dimensionality to the compulsory dimensions after data capture has taken place. You can read more about dimensionality in DHIS 2 in the chapter by the same name.</para>
- <graphic fileref="resources/images/implementation_guide/dimensional_approach.png" width="80%" format="PNG" align="center"/>
+ <graphic fileref="resources/images/implementation_guide/dimensional_approach.png" format="PNG" width="80%" align="center"/>
</section>
</chapter>