dhis2-devs team mailing list archive
Message #16276
[Branch ~dhis2-documenters/dhis2/dhis2-docbook-docs] Rev 457: Minor fix.
------------------------------------------------------------
revno: 457
committer: Jason P. Pickering <jason.p.pickering@xxxxxxxxx>
branch nick: dhis2-docbook-docs
timestamp: Sun 2012-03-04 07:25:33 +0200
message:
Minor fix.
modified:
src/docbkx/en/dhis2_r.xml
--
lp:~dhis2-documenters/dhis2/dhis2-docbook-docs
https://code.launchpad.net/~dhis2-documenters/dhis2/dhis2-docbook-docs
=== modified file 'src/docbkx/en/dhis2_r.xml'
--- src/docbkx/en/dhis2_r.xml 2012-03-03 09:23:21 +0000
+++ src/docbkx/en/dhis2_r.xml 2012-03-04 05:25:33 +0000
@@ -4,12 +4,12 @@
<title>DHIS2 and R integration</title>
<section id="dhis2_r_example">
<title>DHIS2 and R</title>
- <para>R is freely available, open source statistical computing environment. R refer to both the computer programming language, as well as the software which can be used to create and run R scripts. There are <ulink url="http://cran.r-project.org/">numerous sources on the web</ulink> which describe the extensive set of features of R. </para>
- <para>R is a natural extension to DHIS2, as it provides powerful statistical routines, data manipulation functions, and visualization tools. This chapter will describe how to setup R and DHIS2 on the same server, and will provide a simple example of how to retrieve data from the DHIS2 database into an R data frame. </para>
+ <para>R is a freely available, open source statistical computing environment. R refers to both the programming language and the software used to create and run R scripts. There are <ulink url="http://cran.r-project.org/">numerous sources on the web</ulink> which describe the extensive set of features of R. </para>
+ <para>R is a natural extension to DHIS2, as it provides powerful statistical routines, data manipulation functions, and visualization tools. This chapter will describe how to set up R and DHIS2 on the same server, and will provide a simple example of how to retrieve data from the DHIS2 database into an R data frame and perform some basic calculations.</para>
<para>In this example, we will use a system-wide ODBC connector to retrieve data from the DHIS2 database. There are some disadvantages with this approach, as ODBC is slower than other methods and it does raise some security concerns by providing a system-wide connector to all users. However, it is a convenient method to provide a connection to multiple users. The R package RODBC will be used in this case. An alternative would be the <ulink url="http://dirk.eddelbuettel.com/code/rpostgresql.html">RPostgreSQL</ulink> package, which can interface directly through the PostgreSQL driver.</para>
<para>First, we will install R and some other required and useful packages. Invoke the following command:</para>
<para><command>apt-get install r-base r-cran-odbc r-cran-lattice odbc-postgresql</command> </para>
- <para>Next, we need to configure the ODBC connection. Edit the file to suit your local situation using the following template as a guide. Edit a file called odbc.ini</para>
+ <para>Next, we need to configure the ODBC connection. Let's create a file called odbc.ini, using the following template as a guide, and edit it to suit your local situation.</para>
<para><screen>[dhis2]
Description = DHIS2 Database
Driver = /usr/lib/odbc/psqlodbcw.so
@@ -83,23 +83,21 @@
and de.name ~*('Attendance OPD')
GROUP BY p.startdate, de.name;")</screen></para>
<para>We have stored the result of the SQL query in an R data frame called OPD. Let's take a look at what the data looks like. </para>
- <para><screen>> str(OPD.ct)
-List of 7
- $ startdate : Date[1:12], format: "2011-01-01" "2011-02-01" "2011-03-01" ...
- $ Attendance OPD 12-59 months female: int [1:12] 208879 237521 268141 232637 206140 179559 161946 159530 144090 138224 ...
- $ Attendance OPD 12-59 months male : int [1:12] 200734 225217 252989 222649 195315 168896 150998 150014 137925 130591 ...
- $ Attendance OPD <12 months female : int [1:12] 116005 127485 140947 125511 110515 107205 100424 102100 93548 86301 ...
- $ Attendance OPD <12 months male : int [1:12] 109745 118643 131398 118729 105303 99383 94239 96428 88538 82174 ...
- $ Attendance OPD >5 years female : int [1:12] 550302 593682 656577 606291 553018 500631 458789 483245 458325 412032 ...
- $ Attendance OPD >5 years male : int [1:12] 409310 433319 489064 448069 409164 374119 347728 348012 325802 303556 ...
- - attr(*, "row.names")= int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
- - attr(*, "idvars")= chr "startdate"
- - attr(*, "rdimnames")=List of 2
- ..$ :'data.frame':12 obs. of 1 variable:
- .. ..$ startdate: Date[1:12], format: "2011-01-01" "2011-02-01" "2011-03-01" ...
- ..$ :'data.frame':6 obs. of 1 variable:
- .. ..$ de: Factor w/ 6 levels "Attendance OPD 12-59 months female",..: 1 2 3 4 5 6
-> </screen>
+ <para><screen>> head(OPD)
+ startdate de sum
+1 2011-12-01 Attendance OPD <12 months female 42557
+2 2011-02-01 Attendance OPD <12 months female 127485
+3 2011-01-01 Attendance OPD 12-59 months male 200734
+4 2011-04-01 Attendance OPD 12-59 months male 222649
+5 2011-06-01 Attendance OPD 12-59 months male 168896
+6 2011-03-01 Attendance OPD 12-59 months female 268141
+> unique(OPD$de)
+[1] Attendance OPD <12 months female Attendance OPD 12-59 months male
+[3] Attendance OPD 12-59 months female Attendance OPD >5 years male
+[5] Attendance OPD <12 months male Attendance OPD >5 years female
+6 Levels: Attendance OPD 12-59 months female ... Attendance OPD >5 years male
+>
+ </screen>
</para>
<para>We can see that we need to aggregate the two age groups (< 12 months and 12-59 months) into a single variable for each gender. Let's reshape the data into a cross-tabulated table to make this easier to visualize and to calculate the summaries.</para>
<para><screen>>OPD.ct<-cast(OPD,startdate ~ de)
@@ -109,16 +107,16 @@
[5] "Attendance OPD <12 months male" "Attendance OPD >5 years female"
[7] "Attendance OPD >5 years male" </screen>
</para>
- <para>It looks like we need to aggregate the second and fourth columns together to get the female attendance, and then the third and fifth columns to get the male under 5 attendance.After this, lets subset the data into a new data frame just to get the required information and display the results.</para>
+ <para>We have reshaped the data so that the data elements are individual columns. It looks like we need to aggregate the second and fourth columns together to get the under 5 female attendance, and then the third and fifth columns to get the under 5 male attendance. After this, let's subset the data into a new data frame just to get the required information and display the results.</para>
<para><screen>> OPD.ct$OPDUnder5Female<-OPD.ct[,2]+OPD.ct[,4]#Females
> OPD.ct$OPDUnder5Male<-OPD.ct[,3]+OPD.ct[,5]#males
> OPD.ct.summary<-OPD.ct[,c(1,8,9)]#new summary data frame
-> OPD.ct.summary$FemalePercent<-OPD.ct.summary$OPDUnder5Female/(OPD.ct.summary$OPDUnder5Female + OPD.ct.summary$OPDUnder5Male)
-> OPD.ct.summary$FemalePercent<-OPD.ct.summary$OPDUnder5Female/(OPD.ct.summary$OPDUnder5Female + OPD.ct.summary$OPDUnder5Male)*100
-> OPD.ct.summary$MalePercent<-OPD.ct.summary$OPDUnder5Male/(OPD.ct.summary$OPDUnder5Female + OPD.ct.summary$OPDUnder5Male)*100
-
-
-</screen></para>
+>OPD.ct.summary$FemalePercent<-
+OPD.ct.summary$OPDUnder5Female/
+(OPD.ct.summary$OPDUnder5Female + OPD.ct.summary$OPDUnder5Male)*100#Females
+>OPD.ct.summary$MalePercent<-
+OPD.ct.summary$OPDUnder5Male/
+(OPD.ct.summary$OPDUnder5Female + OPD.ct.summary$OPDUnder5Male)*100#Males </screen></para>
<para>Of course, this could be accomplished much more elegantly, but for the purpose of the illustration, this code is rather verbose. Finally, let's display the required information.</para>
<para><screen>> OPD.ct.summary[,c(1,4,5)]
startdate FemalePercent MalePercent
@@ -134,6 +132,7 @@
10 2011-10-01 51.34465 48.65535
11 2011-11-01 51.42526 48.57474
12 2011-12-01 50.68933 49.31067</screen></para>
- <para/>
+ <para>We can see that the male and female attendances are very similar for each month of the year. The male share is seemingly somewhat higher in December than in October and November, although female attendance still remains slightly higher than male attendance.</para>
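<para>As noted above, the percentage calculation could be written more compactly. The following is only a minimal sketch of one possibility, which selects the under 5 columns by name rather than by position. The small data frame opd is a stand-in for OPD.ct, built by hand from the January and February 2011 values of the under 5 data elements so that the example runs without a database connection. The object names opd, under5, female, male and opd.summary are illustrative assumptions and are not part of the original example.</para>
<para><screen># Stand-in for OPD.ct, limited to two months and the under 5 data elements
opd <- data.frame(
  startdate = as.Date(c("2011-01-01", "2011-02-01")),
  "Attendance OPD 12-59 months female" = c(208879, 237521),
  "Attendance OPD 12-59 months male"   = c(200734, 225217),
  "Attendance OPD <12 months female"   = c(116005, 127485),
  "Attendance OPD <12 months male"     = c(109745, 118643),
  check.names = FALSE  # keep the data element names exactly as they are
)

# Identify the under 5 columns by name instead of by column index
under5 <- grepl("<12 months|12-59 months", names(opd))

# Sum the two age groups for each gender, row by row.
# " male$" starts with a space so that it does not match "female".
female <- rowSums(opd[, under5 & grepl("female$", names(opd)), drop = FALSE])
male   <- rowSums(opd[, under5 & grepl(" male$", names(opd)), drop = FALSE])

# Percentage of under 5 attendance by gender for each month
opd.summary <- data.frame(
  startdate     = opd$startdate,
  FemalePercent = 100 * female / (female + male),
  MalePercent   = 100 * male / (female + male)
)
print(opd.summary)</screen></para>
<para>Selecting columns by name means the calculation should not silently break if the column order of the cross-tabulated table changes, which is the main weakness of the index-based version.</para>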
+ <para>In this example, we show how to retrieve data from the DHIS2 database and manipulate it with some simple R commands. The basic pattern for using DHIS2 and R together will be the retrieval of data from the DHIS2 database with an SQL query into an R data frame, followed by whatever routines (statistical analysis, plotting, etc.) may be required. </para>
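<para>The retrieval step of this pattern could be wrapped in a small helper function. This is only a sketch under stated assumptions: it presumes that the RODBC package is installed and that the system-wide DSN called dhis2 from odbc.ini is working. The function name fetch_dhis2 is hypothetical, and the query shown is a trivial placeholder which does not depend on any DHIS2 table, so it would need to be replaced with a real query against your own database.</para>
<para><screen>library(RODBC)

# Hypothetical helper: run one SQL query against the DHIS2 DSN and
# return the result as a data frame
fetch_dhis2 <- function(query, dsn = "dhis2") {
  channel <- odbcConnect(dsn)
  # odbcConnect() returns -1 rather than an RODBC object on failure
  if (!inherits(channel, "RODBC")) {
    stop("Could not connect to DSN: ", dsn)
  }
  # Always close the connection, even if the query fails
  on.exit(odbcClose(channel))

  result <- sqlQuery(channel, query)
  # On error, sqlQuery() returns the error messages instead of a data frame
  if (!is.data.frame(result)) {
    stop(paste(result, collapse = "\n"))
  }
  result
}

# Placeholder query, used only to check that the connection works
test <- fetch_dhis2("SELECT 1 AS test")
str(test)</screen></para>
<para>Once the data frame has been returned, any further routines (reshaping, summaries, plots) can be applied to it without keeping the database connection open.</para>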
</section>
</chapter>