dhis2-devs team mailing list archive - Message #08124
Re: Pivot tables
On Fri, Oct 22, 2010 at 2:32 PM, Bob Jolliffe <bobjolliffe@xxxxxxxxx> wrote:
> hi
>
> 2010/10/22 Ola Hodne Titlestad <olati@xxxxxxxxxx>:
> > 2010/10/22 Lars Helge Øverland <larshelge@xxxxxxxxx>
> >>
> >> Hi Bob
> >> Sorry for not returning to this earlier, been battling with Windows
> >> 64-bit ODBC drivers.
> >>
> >> 2010/10/22 Bob Jolliffe <bobjolliffe@xxxxxxxxx>
> >>>
> >>> Hi Lars, Ola and all
> >>>
> >>> Thinking about exporting pivot tables .. we need to pen up some
> >>> requirements which we can get into a blueprint. While we are
> >>> considering that, I've given some thoughts below.
> >>>
> >>> My initial thought was that we would construct a pivot table with a
> >>> jdbc connection like we would normally do manually, and then simply
> >>> populate the pivot cache so that the spreadsheet would be
> >>> "free-standing". On reflection that is probably neither the best nor
> >>> the simplest thing to do. Playing around with Excel (in Wine :-) I
> >>> figured it is actually much more straightforward to have two sheets;
> >>> one for the data and one for the pivot table. The pivot table would
> >>> operate on the data in the other worksheet. This is more explicit
> >>> than simply hiding the data in the pivot cache. Though I need to test
> >>> this a bit more. If the pivotcache is an obligatory data store then
> >>> it might not make sense to carry the data twice. The pivot table
> >>> constructs are fairly complex but not outlandishly so - once I'm done
> >>> unpicking, I will write up a short description of the anatomy of a
> >>> pivot table so we can see what needs to be done.
> >>>
> >>
> >> This sounds just fine to me.
> >
> > This might work, just keep in mind that we will then need several pairs
> > of data+pivot worksheets, as we want to have multiple pivot tables in an
> > Excel file.
> > Also keep in mind that we can be talking about up to around 1,500,000
> > rows of data here (that's when my laptop starts to complain, at least),
> > so these data sheets will be large. Typically, the number of rows of
> > data will be somewhere around 200-600K.
> >
> >
> >>
> >>
> >>>
> >>> One possibility would be that the pivot table could be generated as
> >>> part of the export of a report table, i.e. optionally export to Excel or
> >>> export to excel-with-pivot.
> >>>
> >>> Given that there is no advance indication of the columns in a report
> >>> table, a challenge will be how and when to define the pivot model -
> >>> i.e. the pivot fields - and how to persist that model (we don't want
> >>> users to have to recreate the pivot model each time). This will
> >>> probably require an extra object in our data model (PivotModel) which
> >>> defines the row and column pivot fields and the data field, where a
> >>> ReportTable can have a 0..* relationship with PivotModels. A
> >>> possible beneficial side effect of this could be that we also leave
> >>> open the ability to generate the pivot table rendition with things
> >>> other than excel. I don't see this as an immediate requirement but
> >>> useful to have the pivotmodel abstraction anyway. In reality the
> >>> design of the pivotmodel will be based on what excel requires.
> >>>
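A minimal sketch of what such a PivotModel abstraction could look like in
Java; the class and field names below are illustrative only, not an agreed
design:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: defines which fields of a report table go on the
    // pivot rows, columns and data area.
    public class PivotModel
    {
        // Fields placed on the pivot table rows, e.g. "orgunit", "period"
        private List<String> rowFields = new ArrayList<String>();

        // Fields placed on the pivot table columns, e.g. "dataelement"
        private List<String> columnFields = new ArrayList<String>();

        // The field holding the values to aggregate, e.g. "value"
        private String dataField;

        // getters and setters omitted
    }

    // A ReportTable would then hold a List<PivotModel> (0..*), so each
    // report table can carry zero or more pivot table definitions.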
> >>> Generating the excel spreadsheet off the reporttable+pivotmodel will
> >>> produce an OOXML xlsx file - actually a bundle of xml streams which
> >>> need to be zipped. It's a bit unfortunate that M$ defines a flat,
> >>> single-file XML for other OPC office documents, but sadly not Excel.
> >>> Dealing with a single flat file is much easier than zip containers
> >>> with multiple streams, but so be it. At least they end up smaller.
> >>> Though given that these reporttables can be very large, I don't think
> >>> an XSLT approach is really the most efficient, at least for producing
> >>> the data worksheet. It might make sense if the source data were XML,
> >>> but even then probably not. So a simple iteration through the table
> >>> rows with a StAX writer will work best. The second sheet (the pivot
> >>> sheet) would just be a serialization of the pivot model.
> >>>
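As a rough illustration of the "iterate the table rows with a StAX writer"
idea, the data sheet XML (one of the streams inside the xlsx zip) could be
produced along these lines. The SheetWriter name is hypothetical, cell
styles and the zip packaging are left out, and all cells are written as
inline strings for simplicity:

    import java.io.OutputStream;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamWriter;

    // Sketch: stream report rows into a SpreadsheetML worksheet
    // (xl/worksheets/sheet1.xml inside the xlsx package).
    public class SheetWriter
    {
        private static final String NS =
            "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

        public void writeSheet( OutputStream out, Iterable<String[]> rows )
            throws XMLStreamException
        {
            XMLStreamWriter w = XMLOutputFactory.newInstance()
                .createXMLStreamWriter( out, "UTF-8" );

            w.writeStartDocument( "UTF-8", "1.0" );
            w.writeStartElement( "worksheet" );
            w.writeDefaultNamespace( NS );
            w.writeStartElement( "sheetData" );

            int r = 1;
            for ( String[] row : rows )
            {
                w.writeStartElement( "row" );
                w.writeAttribute( "r", String.valueOf( r++ ) );

                for ( String value : row )
                {
                    w.writeStartElement( "c" );
                    w.writeAttribute( "t", "inlineStr" ); // inline string cell
                    w.writeStartElement( "is" );
                    w.writeStartElement( "t" );
                    w.writeCharacters( value );
                    w.writeEndElement(); // t
                    w.writeEndElement(); // is
                    w.writeEndElement(); // c
                }
                w.writeEndElement(); // row
            }

            w.writeEndElement(); // sheetData
            w.writeEndElement(); // worksheet
            w.writeEndDocument();
            w.close();
        }
    }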
> >>> We will probably have to implement the Excel optimization of
> >>> sharedStrings (where literal strings are not written into the
> >>> spreadsheet - just index values into a shared table). This adds
> >>> complexity, but I think we are talking pretty large files here. If we
> >>> don't do the sharedStrings optimization, Excel will do it anyway the
> >>> first time it opens and saves the file, but I suspect we will have to
> >>> do it up front.
> >>>
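A minimal sketch of the sharedStrings bookkeeping, assuming a hypothetical
helper class: each distinct string gets an index the first time it is seen,
cells of type "s" then reference that index, and xl/sharedStrings.xml lists
the strings in index order:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative only: tracks the shared string table for an xlsx export.
    public class SharedStrings
    {
        private final Map<String, Integer> index =
            new LinkedHashMap<String, Integer>();

        private int totalCount = 0; // every occurrence, including duplicates

        // Returns the index to write into the cell's <v> element.
        public int indexOf( String value )
        {
            totalCount++;
            Integer i = index.get( value );
            if ( i == null )
            {
                i = index.size();
                index.put( value, i );
            }
            return i;
        }

        // When writing xl/sharedStrings.xml: <sst count="totalCount"
        // uniqueCount="index.size()"> followed by one <si><t>...</t></si>
        // per key in insertion order.
    }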
> >>> Regarding UI and PivotModel I want to take a look at our existing work
> >>> on web pivot tables to see whether we can't leverage some of this. If
> >>> we follow roughly the outline above the user would have the option to
> >>> define 0 or more pivottable definitions at the time of creating, or
> >>> editing, a reporttable. What do you think? Is the above close to
> >>> what you guys have in mind? Let's hammer out some requirements and
> >>> create the blueprint.
> >>>
> >>
> >> We won't base this on report tables as (at least I) can't see any huge
> >> benefits. The Excel pivot tables themselves are more capable of having
> >> dimensions on columns, rows and filters than our report tables. In fact
> >> we usually use a SQL view as the datasource for the pivot tables - which
> >> has a fixed number of columns. The view is based on the
> >> aggregateddatavalue table and joins in the dataelement, orgunit, period
> >> and periodtype tables. In addition we join in a series of resource
> >> tables to get information about the orgunit structure, dataelement
> >> groupset structure and orgunit groupset structure (additional
> >> dimensional information!) to make the pivot table more valuable for
> >> analysis. (This also goes for indicators.)
> >> Since this should be more of an offline analysis tool, I think exporting
> >> all dataelements/indicators will be most appropriate. The user interface
> >> could simply request a start and end date (and/or relative periods),
> >> indicators vs dataelements, parent organisation unit and organisation
> >> unit level. Ola will maybe have some views here...
> >
> > Agree with Lars, I don't think report tables should be used as the source
> > for pivot tables, at least not for the standard "ALL data" pivot tables.
>
> Yes I understand that now
>
> > For more ad-hoc use, the pivots generated off a report table might be
> > useful as an export option, but let's start with the most common use
> > cases: the big pivot tables with all the data for browsing and data
> > analysis.
> > The pivot source views (SQL) that we have used, e.g. in the newly
> > distributed pivots for the demo database, can be found in an email to the
> > list on Sep 21, together with the pivots and the database dumps:
> > http://www.mail-archive.com/dhis2-devs@xxxxxxxxxxxxxxxxxxx/msg07341.html
> > These views have been stable for a long time and describe the fields of
> > the basic pivot tables that are needed for DHIS data analysis.
> > This set of fields goes back as far as DHIS 1.3, so the basic pivot
> > tables are stable and we know their structure: basically the core data
> > dimensions of DHIS2 (see chapter 2 in the user manual).
> > What will always be variable are the orgunit level of aggregation for the
> > data or indicator values and the period type. You will see these two are
> > specified at the end of every where clause in the pivot views. These are
> > used as filters when fetching data from the data mart.
> > If you open the pivot table for download in that email you can follow my
> > reasoning below:
> > Typically we have two tables for monthly data (raw data + indicator
> > data) for the lowest level, plus at least two more for a higher level.
> > E.g. in the demo based on SL we use the health facility level (level 4)
> > and the Chiefdom level (level 3). The reason for using the Chiefdom level
> > as well is that these tables are much faster to manipulate, since there
> > is a lot less data when shaving away the lowest level. How many levels
> > and which levels to put in the pivot table will vary from place to place.
> > In addition to these at least four tables, there is often at least one
> > table for population data, which has a yearly period type and therefore
> > needs a separate view and table. So for the demo database we have 5 basic
> > tables which cover all the data for data elements and indicators. This
> > set of pivot tables is what I would define as a minimum pivot table setup
> > for a DHIS database. Over time (depending on the amount of data
> > collected) it might be necessary to split up these basic tables by year,
> > since they can get too big for Excel to handle. Such a split can be
> > facilitated by a simple from and to filter on the start date in the
> > period field in the SQL.
> > Then there can be many more customised pivot tables which make use of
> > additional dimensions to the data, like the data element categories, and
> > data element and orgunit group sets. The simplest approach there is to
> > (still talking SQL) do a select * from the various resource tables. Then
> > you will have a data source for all other customised tables (using the
> > same period type and aggregation level). When working with data element
> > categories in pivot tables it doesn't make sense to use too many at the
> > same time; rather work on data elements that share the same categories
> > (within or across datasets). You can see in the demo pivot table file
> > that we have set up multiple tables making use of data element categories
> > (EPI, RCH, HIV etc.). These are all using the same source data (chiefdom
> > level monthly raw data), but are using different pivot fields (columns in
> > the result set). The total number of available fields is the same for all
> > these custom chiefdom tables, but we are only using a small subset of
> > these fields in each table. This means that the same data source could
> > also be used for the basic tables (the two chiefdom tables at least), and
> > just select even fewer fields in the table. The users can also easily add
> > more fields to a table in Excel, using drag and drop functionality.
> > It seems to me that there is a need to separate the source data from the
> > pivot structure, to reuse the source data in multiple tables. I think
> > that is how Excel is thinking too, but it doesn't always seem to work
> > that way, e.g. when creating a copy of an existing worksheet with a pivot
> > table.
>
> Yes, we can ask Excel pivot tables to use a common pivot table cache.
>
> > I am not sure how much of this customisation we need to support inside
> > DHIS2 and how much we can leave to Excel (which has nice tools for
> > copying tables, drag and drop of pivot fields etc.). First we need to
> > come up with something that works for the basic tables and then look at
> > these more custom use cases.
> > When it comes to pivot tables using period types other than monthly or
> > yearly (both covered in the basic tables listed above), like six-monthly,
> > quarterly or weekly, these are also important to support. The pivot
> > fields are the same as in the basic tables; it is just a different
> > periodtype filter on the data source, so these might be easier to
> > accommodate.
> > Looking at the next steps, to me the key new features that I would like
> > to see with regards to pivot tables are:
> > 1) a pivot table update service, where users that are on the online
> > server (no local install) can update their local pivot tables against
> > the online DHIS server. Since a pivot table can contain months and years
> > of data, we do not want to download the full tables every month, but
> > simply append the latest month of data to the existing local pivot
> > tables.
> > 2) a pivot generator, where the users can download a new pivot table
> > file directly off a DHIS2 server without having to do any local
> > configuration etc., and then later use the update service to update
> > their tables,
> > with 1) being the most critical and urgent, but I realise that we might
> > need 2) first, in order to deal with a standard set of tables and pivot
> > data.
>
> Yes, I think 1 can only really work in concert with 2. Since we can't
> ask DHIS to update any old pivot table, it will need to be a pivot
> table which it is intimately familiar with - which is either a fixed
> format one as you have discussed, or one that is created through
> DHIS. Naturally we start with the former. BTW, looking at the size of
> the tables, I have dropped the idea of holding the data in a worksheet.
> This does in fact produce duplication of the data which is also held in
> the cache, and both are unfortunately also held in memory. So we work
> only with the hidden pivot table cache as the local store.
>
> Thinking about how and where data is being moved, it might not make
> sense to pass the (massive) Excel file back and forwards to DHIS in
> order to get updates injected. That might defeat the purpose. Which
> raises some questions. Either we have a standalone tool which grabs an
> incremental update from DHIS and appends it to the Excel pivot table
> cache, or we have some sort of VBA macro defined within Excel. I think
> the latter might prove problematic - like performing circumcision on
> oneself whilst awake, it might cause Excel to break down. So probably
> a standalone tool ..
>
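A rough sketch of what the client side of such a standalone tool could look
like. The /api/pivotData endpoint and its parameters are purely hypothetical
(no such service exists at this point), and the step that splices the
downloaded rows into the pivot cache is only indicated in a comment:

    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Illustrative only: pull the data recorded since the last update from
    // the server, then append it to the local workbook's pivot cache.
    public class PivotUpdateClient
    {
        public void downloadIncrement( String serverUrl, String lastUpdated,
            String outFile )
            throws Exception
        {
            // Hypothetical incremental-data endpoint
            URL url = new URL( serverUrl + "/api/pivotData?since=" + lastUpdated );
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty( "Accept", "text/csv" );

            InputStream in = conn.getInputStream();
            OutputStream out = new FileOutputStream( outFile );

            byte[] buf = new byte[8192];
            int n;
            while ( ( n = in.read( buf ) ) != -1 )
            {
                out.write( buf, 0, n );
            }
            out.close();
            in.close();
            conn.disconnect();

            // Next step (not shown): rewrite the pivotCacheRecords part inside
            // the xlsx zip, appending the new rows, and bump the record count
            // in pivotCacheDefinition so Excel accepts the updated cache.
        }
    }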
> I haven't worked through the samples yet. I think I will buy Excel on
> the weekend and invoice Jorn.
>
>
Definitely let's start with 2). We can think about 1) later; maybe even a
script will do the job. Bob, I will put up Excel with a key in a separate
mail.
Did some quick investigation on size and it seems Excel is using around 30
KB per 1000 data values. The file seems to be compressed already, btw. If a
typical pivot file has 600,000 values, that means about 18 MB.
Lars