dhis2-devs team mailing list archive

Re: Pivot tables

 

2010/10/22 Lars Helge Øverland <larshelge@xxxxxxxxx>:
>
>
> On Fri, Oct 22, 2010 at 2:32 PM, Bob Jolliffe <bobjolliffe@xxxxxxxxx> wrote:
>>
>> hi
>>
>> 2010/10/22 Ola Hodne Titlestad <olati@xxxxxxxxxx>:
>> > 2010/10/22 Lars Helge Øverland <larshelge@xxxxxxxxx>
>> >>
>> >> Hi Bob
>> >> Sorry for not returning to this earlier, been battling with windows 64
>> >> bit
>> >> odbc drivers.
>> >>
>> >> 2010/10/22 Bob Jolliffe <bobjolliffe@xxxxxxxxx>
>> >>>
>> >>> Hi Lars, Ola and all
>> >>>
>> >>> Thinking about exporting pivot tables .. we need to pen up some
>> >>> requirements which we can get into a blueprint.  While we are
>> >>> considering that, I've given some thoughts below.
>> >>>
>> >>> My initial thought was that we would construct a pivot table with a
>> >>> jdbc connection like we would normally do manually, and then simply
>> >>> populate the pivot cache so that the spreadsheet would be
>> >>> "free-standing".  On reflection that is probably neither the best nor
>> >>> the simplest thing to do.  Playing around with excel (in wine :-)  I
>> >>> figured it is actually much more straightforward to have two sheets;
>> >>> one for the data and one for the pivot table.  The pivot table would
>> >>> operate on the data in the other worksheet.  This is more explicit
>> >>> than simply hiding the data in the pivot cache.  Though I need to test
>> >>> this a bit more.  If the pivotcache is an obligatory data store then
>> >>> it might not make sense to carry the data twice.  The pivot table
>> >>> constructs are fairly complex but not outlandishly so - once I'm done
>> >>> unpicking, I will write up a short description of the anatomy of a
>> >>> pivot table so we can see what needs to be done.
>> >>>
>> >>
>> >> This sounds just fine to me.
>> >
>> > This might work, just have in mind that we will then need several pairs
>> > of
>> > data+pivot worksheets as we want to have multiple pivot tables in an
>> > Excel
>> > file.
>> > Also have in mind that we can be talking about up to around 1,500,000 rows
>> > of data here (that's when my laptop starts to complain, at least), so
>> > these data sheets will be large. Typically, the number of rows of data
>> > will be somewhere around 200-600K.
>> >
>> >
>> >>
>> >>
>> >>>
>> >>> One possibility would be that the pivot table could be generated as
>> >>> part of the export of a report table ie. optionally export to excel or
>> >>> export to excel-with-pivot.
>> >>>
>> >>> Given that there is no advance indication of the columns in a report
>> >>> table, a challenge will be how and when to define the pivot model -
>> >>> ie. the pivot fields -  and how to persist that model (we don't want
>> >>> users to have to recreate the pivot model each time).  This will
>> >>> probably require an extra object in our data model (PivotModel) which
>> >>> defines the row and column pivot fields and the data field, where a
>> >>> ReportTable can have a relationship with 0 or more PivotModels.  A
>> >>> possible beneficial side effect of this could be that we also leave
>> >>> open the ability to generate the pivot table rendition with things
>> >>> other than excel.  I don't see this as an immediate requirement but
>> >>> useful to have the pivotmodel abstraction anyway.  In reality the
>> >>> design of the pivotmodel will be based on what excel requires.
>> >>>
>> >>> Generating the excel spreadsheet off the reporttable+pivotmodel will
>> >>> produce an OOXML xlsx file - actually a bundle of xml streams which
>> >>> need to be zipped.  It's a bit unfortunate that M$ defines a flat
>> >>> single-xml file for other OPC office documents, but sadly not excel.
>> >>> Dealing with a single flat file is much easier than zip containers
>> >>> with multiple streams but so be it.  At least they end up smaller.
>> >>> Though given that these reporttables can be very large I don't think
>> >>> an xslt approach is really the most efficient at least for producing
>> >>> the data worksheet.  It might make sense if the source data was xml,
>> >>> but even then probably not.  So simple iteration through the table
>> >>> rows with a stax writer will work best.  The second sheet (the pivot
>> >>> sheet) would just be a serialization of the pivot model.
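
As a rough sketch of that stax-writer iteration (illustrative only - the class
and method names and the row source are assumptions, inline strings are used for
simplicity, and the other required xlsx parts such as [Content_Types].xml,
workbook.xml and the .rels files are left out):

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import java.io.FileOutputStream;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class DataSheetWriter {

    private static final String NS =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Streams one <row> element per resultset row into xl/worksheets/sheet1.xml,
    // writing numbers as plain <v> values and strings as inline strings.
    public static void writeDataSheet(String xlsxPath, List<Object[]> rows) throws Exception {
        ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(xlsxPath));
        zip.putNextEntry(new ZipEntry("xl/worksheets/sheet1.xml"));
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(zip, "UTF-8");

        w.writeStartDocument("UTF-8", "1.0");
        w.writeStartElement("worksheet");
        w.writeDefaultNamespace(NS);
        w.writeStartElement("sheetData");

        int r = 1;
        for (Object[] row : rows) {
            w.writeStartElement("row");
            w.writeAttribute("r", String.valueOf(r++));
            for (Object value : row) {
                w.writeStartElement("c");                 // one cell per field
                if (value instanceof Number) {
                    w.writeStartElement("v");             // numeric cell
                    w.writeCharacters(value.toString());
                    w.writeEndElement();
                } else {
                    w.writeAttribute("t", "inlineStr");   // string cell, not yet shared
                    w.writeStartElement("is");
                    w.writeStartElement("t");
                    w.writeCharacters(String.valueOf(value));
                    w.writeEndElement();
                    w.writeEndElement();
                }
                w.writeEndElement();                      // </c>
            }
            w.writeEndElement();                          // </row>
        }
        w.writeEndElement();                              // </sheetData>
        w.writeEndElement();                              // </worksheet>
        w.writeEndDocument();
        w.flush();
        zip.closeEntry();
        zip.close();
    }
}

Whether strings go inline like this or into a sharedStrings table is exactly the
optimization discussed next.
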
>> >>>
>> >>> We will probably have to implement the excel optimization of
>> >>> sharedstrings (where literal strings are not written into the
>> >>> spreadsheet - just index values from a table).  This adds complexity
>> >>> but I think we are talking pretty large files here.  If we don't do
>> >>> the sharedstrings optimization, Excel will do it anyway the first time
>> >>> it opens and saves the file, but I suspect we will have to do it up
>> >>> front.
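
The bookkeeping behind that optimization is roughly the following (an
illustrative sketch only; the class name is an assumption and the serialization
skips XML escaping):

import java.util.LinkedHashMap;
import java.util.Map;

public class SharedStrings {

    // insertion-ordered: distinct string -> its index in xl/sharedStrings.xml
    private final Map<String, Integer> table = new LinkedHashMap<String, Integer>();
    private int references = 0;

    // Returns the shared-string index for s, adding it to the table on first use.
    public int indexOf(String s) {
        references++;
        Integer index = table.get(s);
        if (index == null) {
            index = table.size();
            table.put(s, index);
        }
        return index;
    }

    // Serialises the table as the content of xl/sharedStrings.xml.
    // NB: a real implementation must XML-escape the strings.
    public String toXml() {
        StringBuilder sb = new StringBuilder();
        sb.append("<sst xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\"")
          .append(" count=\"").append(references)
          .append("\" uniqueCount=\"").append(table.size()).append("\">");
        for (String s : table.keySet()) {
            sb.append("<si><t>").append(s).append("</t></si>");
        }
        return sb.append("</sst>").toString();
    }
}

Cells in the worksheet then carry t="s" and just the index in their <v> element
instead of the literal text, and the table itself becomes the xl/sharedStrings.xml part.
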
>> >>>
>> >>> Regarding UI and PivotModel I want to take a look at our existing work
>> >>> on web pivot tables to see whether we can't leverage some of this.  If
>> >>> we follow roughly the outline above the user would have the option to
>> >>> define 0 or more pivottable definitions at the time of creating, or
>> >>> editing, a reporttable.  What do you think?  Is the above close to
>> >>> what you guys have in mind?  Let's hammer out some requirements and
>> >>> create the blueprint.
>> >>>
>> >>
>> >> We won't base this on report tables as (at least) I can't see any huge
>> >> benefits. The excel pivot tables themselves are more capable than our
>> >> report tables of having dimensions on columns, rows and filters. In fact we
>> >> usually use a SQL view as datasource for the pivot tables - which has a
>> >> fixed number of columns. The view is based on the aggregateddatavalue
>> >> table
>> >> and joins in the dataelement, orgunit, period and periodtype tables. In
>> >> addition we join in a series of resource tables to get information
>> >> about the
>> >> orgunit structure, dataelement groupset structure, orgunit groupset
>> >> structure (additional dimensional information!) to make the pivot table
>> >> more
>> >> valuable for analysis. (This also goes for indicators.)
>> >> Since this should be more of an offline analysis tool I think exporting
>> >> all dataelements/indicators will be most appropriate. The user interface
>> >> could simply request a start and end date (and/or relative periods), indicators
>> >> vs
>> >> dataelements, parent organisation unit and organisation unit level. Ola
>> >> will
>> >> maybe have some views here...
>> >
>> > Agree with Lars, I don't think report tables should be used as the source
>> > for pivot tables, at least not for the standard "ALL data" pivot tables.
>>
>> Yes I understand that now
>>
>> > For
>> > more ad-hoc use the pivots generated off a report table might be useful
>> > as
>> > an export option, but let's start with the most common use cases, the
>> > big
>> > pivot tables with all the data for browsing and data analysis.
>> > The pivot source views (sql) that we have used e.g. in the newly
>> > distributed
>> > pivots for the demo database can be found in an email to the list on Sep
>> > 21,
>> > together with the pivots and the database dumps:
>> >
>> > http://www.mail-archive.com/dhis2-devs@xxxxxxxxxxxxxxxxxxx/msg07341.html
>> > (from Sep 21)
>> > These views have been stable for a long time and describe the fields of
>> > the
>> > basic pivot tables that are needed for DHIS data analysis.
>> > This set of fields goes back as far as DHIS 1.3, so the basic pivot
>> > tables
>> > are stable and we know their structure, basically the core data
>> > dimensions
>> > of DHIS2 (see chapter 2 in user manual).
>> > What will always vary are the orgunit level of aggregation for the
>> > data or indicator values and the period type. You will see these two are
>> > specified at the end of every where clause in the pivot views. These are
>> > used as filters when fetching data from the data mart.
>> > If you open the pivot table for download in that email you can follow my
>> > reasoning below:
>> > Typically we have two monthly tables (raw data + indicator data) for
>> > the lowest level plus at least two more for a higher level. E.g. in the demo
>> > based on SL we use the health facility level (level 4) and the Chiefdom
>> > level (3). The reason for using Chiefdom level as well is that these
>> > tables
>> > are much faster to manipulate since there is a lot less data when
>> > shaving
>> > away the lowest level. How many levels and which levels to put in the
>> > pivot
>> > table will vary from place to place. In addition to these at least four
>> > tables,  there is often at least one table for population data, which
>> > has a
>> > yearly period type and therefore needs a separate view and table. So for
>> > the
>> > demo database we have 5 basic tables which cover all the data for data
>> > elements and indicators. This set of pivot tables is what I would define
>> > as
>> > a minimum pivot table setup for a DHIS database. Over time (depending on
>> > the
>> > amount of data collected) it might be necessary to split up these basic
>> > tables by year since they can get too big for excel to handle. Such a
>> > split
>> > can be facilitated by a simple from and to filter on the start date in
>> > the
>> > period field in the sql.
>> > Then there can be many more customised pivot tables which make use of
>> > additional dimensions to the data like the data element categories, and
>> > data
>> > element and orgunit group sets. The simplest approach there is to (still
>> > talking sql) do a select * from the various resource tables. Then you
>> > will
>> > have a data source for all other customised tables (using the same
>> > period
>> > type and aggregation level). When working with data element categories in
>> > pivot tables it doesn't make sense to use too many at the same time;
>> > rather, work on data elements that share the same categories (within or
>> > across datasets). You can see in the demo pivot table file that we have set up
>> > multiple tables making use of data element categories (EPI, RCH, HIV
>> > etc.).
>> > These are all using the same source data (chiefdom level monthly raw
>> > data),
>> > but are using different pivot fields (columns in the result set). The
>> > total
>> > number of available fields is the same for all these custom chiefdom
>> > tables, but we are only using a small subset of these fields in each
>> > table.
>> > This means that the same data source could also be used for the basic
>> > tables
>> > (the two chiefdom tables at least), and just select even fewer fields in
>> > the
>> > table. The users can also easily add more fields to a table in Excel,
>> > using
>> > drag and drop functionality.
>> > It seems to me that there is a need to separate the source data from
>> > the pivot structure, so that the source data can be reused in multiple tables. I
>> > think
>> > that is how Excel is thinking too, but it doesn't always seem to work
>> > that
>> > way, e.g. when creating a copy of an existing worksheet with a pivot
>> >  table.
>>
>> Yes we can ask excel pivot tables to use a common pivot table cache.
>>
>> > I am not sure how much of this customisation we need to support inside
>> > DHIS2
>> > and how much we can leave to Excel (which has nice tools for copying
>> > tables,
>> > drag and drop pivot fields etc.). First we need to come up with
>> > something
>> > that works for the basic tables and then look at these more custom use
>> > cases.
>> > When it comes to pivot tables using period types other than monthly or
>> > yearly (both covered in the basic tables listed above), like
>> > six-monthly,
>> > quarterly or weekly, these are also important to support. The pivot
>> > fields
>> > are the same as in the basic tables; it is just a different periodtype
>> > filter on the data source, so these might be easier to accommodate.
>> > Looking at the next steps, to me the key new features that I would like
>> > to
>> > see with regards to pivot tables are:
>> > 1) a pivot table update service, where users that are on the online
>> > server
>> > (no local install) can update their local pivot tables against the
>> > online
>> > DHIS server
>> > Since a pivot table can contain months and years of data we do not want
>> > to
>> > download the full tables every month, but simply append the latest month
>> > of
>> > data to the existing local pivot tables
>> > 2) a pivot generator, where the users can download a new pivot table
>> > file
>> > directly off a DHIS2 server without having to do any local configuration
>> > etc., and then later use the update service to update their tables
>> > with 1) being the most critical and urgent, but I realise that we might
>> > need
>> > 2) first, in order to deal with a standard set of tables and pivot data
>>
>> Yes I think 1 can only really work in concert with 2.  Since we can't
>> ask dhis to update any old pivot table, it will need to be a
>> pivottable which it is intimately familiar with - which is either a
>> fixed format one as you have discussed or one that is created through
>> dhis.  Naturally we start with the former.  BTW, looking at the size of
>> the tables I have dropped the idea of holding the data in a worksheet.
>> This does in fact duplicate data which is also held in the cache, and
>> both copies unfortunately end up in memory.  So we work only with the
>> hidden pivot table cache as the local store.
>>
>> Thinking about how and where data is being moved, it might not make sense
>> to pass the (massive) excel file back and forth to dhis in order to
>> get updates injected.  That might defeat the purpose, which raises
>> some questions.  Either we have a standalone tool which grabs
>> incremental updates from dhis and appends them to the excel pivottable
>> cache, or we have some sort of VBA macro defined within excel. I think the
>> latter might prove problematic - like performing circumcision on
>> oneself whilst awake, it might cause excel to break down.  So probably
>> a standalone tool ..
>>
>> I haven't worked through the samples yet.  I think I will buy excel on
>> the weekend and invoice Jorn.
>>
>
> Definitely let's start with 2). We can think of 1) later, maybe even a script
> will do the job. Bob, I will put up excel with a key in a separate mail.
> Did some quick investigations on size and it seems excel is using around 30
> KB / 1000 datavalues. The file seems already compressed btw. If a typical
> pivot file has 600 000 values that means 18 MB.

Ok.  The uncompressed stream we will be working on will then be
considerably greater than that (100MB?).  And at up to 1.5 million rows
maybe closer to 300MB.

A few thoughts stemming from that:
(i)  If you are populating a pivot table from a jdbc connection backed
by a query, then the fields which are returned in that resultset
should be carefully and minimally (optimally) trimmed.  All the data
is retained in the pivot cache of the excel file and all of it is loaded
into memory when excel opens the file (great though excel might be,
and much though we all love it(!) it's not a database).  So for example
in the xlsx case, two important streams in the representation of the
pivotcache are the pivotCacheDefinition(n).xml and the actual
pivotCacheRecords(n).xml.  The cache definition is something like the
database metadata definitions, but includes "selection options" for
common strings, e.g.:

<cacheField name="year" numFmtId="0" sqlType="-10">
            <sharedItems count="2">
                <s v="2009"/>
                <s v="2010"/>
            </sharedItems>
        </cacheField>
        <cacheField name="month" numFmtId="0" sqlType="-10">
            <sharedItems count="12">
                <s v="Jan"/>
                <s v="Feb"/>
                <s v="Mar"/>
                <s v="Apr"/>
                <s v="May"/>
                <s v="Jun"/>
                <s v="Jul"/>
                <s v="Aug"/>
                <s v="Sep"/>
                <s v="Oct"/>
                <s v="Nov"/>
                <s v="Dec"/>
            </sharedItems>
        </cacheField>
        <cacheField name="period" numFmtId="0" sqlType="-10">
            <sharedItems count="17">
                <s v="Jan-09"/>
                <s v="Feb-09"/>
                <s v="Mar-09"/>
                <s v="Apr-09"/>
                <s v="May-09"/>
                <s v="Jun-09"/>
                <s v="Jul-09"/>
                <s v="Aug-09"/>
                <s v="Sep-09"/>
                <s v="Oct-09"/>
                <s v="Nov-09"/>
                <s v="Dec-09"/>
                <s v="Jan-10"/>
                <s v="Feb-10"/>
                <s v="Mar-10"/>
                <s v="Apr-10"/>
                <s v="May-10"/>
            </sharedItems>
        </cacheField>

So a resultset of a query with 13 fields will be represented by 13
cacheField elements in the pivotCacheDefinition.  This in turn will
translate into each row in the resultset being represented by a record
in the pivotCacheRecords with one child element per field - an <x> is an
index into the corresponding cacheField's sharedItems, while <n> and <s>
carry literal numeric and string values - e.g. this record, taken from a
query with rather more fields:

<r>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <s v="ARI/Pneu.DeathM"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="16"/>
        <n v="4000"/>
        <x v="0"/>
        <n v="8318"/>
        <s v="ARI/Pneumonia death male (2009)"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
    </r>

So if we are talking about really, really large resultsets (as it
seems we are) then we should pay great attention to eliminating
excess fields in the query, as these will get replicated potentially
1.5 million times in the pivotCacheRecords, all of which has to be loaded
into memory when excel loads the file.  That's just a word of caution in
case folk haven't realized that.  So for example having year, month
and period might be excessive.  Similarly I have seen the dataelement name
being duplicated in some of the sample queries sent to me.  I guess
this is probably a matter of getting the query right first and then fine
tuning ruthlessly.  People should be aware that everything (not
just the fields actually used in the pivot) is stored in the
pivotCacheRecords.

(ii)  If I understand the use cases correctly, the main issue is that
when we want to refresh a pivottable which uses an odbc connection,
the pivotCacheRecords are flushed and repopulated with the resultset
of the latest query.  Given the size of the pivotCacheRecords
this is a non-trivial bulk of data movement.  And if it's over a
network it's likely to be intolerable.  Particularly if all we want to
do is append new data from, say, the next month.

Looking at the example of a pivotcache record above, it should be
obvious that it is quite trivial to generate new records and append
them to the cache.  (Though if we are talking about a cache of a couple
of hundred MB I'd be reluctant to do this in java, tcl or python - some c++
code using libxerces-c is probably going to make the most sense.)  I have
succeeded so far in adding records into the pivot cache without
causing excel to complain or die, which is good, but I haven't yet
persuaded it to update the displayed table based on the modified
cache.  I hope to play some more with this, but tomorrow is a public
holiday here in Ireland so it might be delayed till Tuesday.
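
To make the append idea concrete, a rough streaming sketch (java used here just
for brevity - as noted above, c++/xerces-c may suit a multi-hundred-MB cache
better; the part paths and the record shape are assumptions, and updating the
count attribute on <pivotCacheRecords> and the recordCount in the cache
definition is left out):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Collections;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

public class PivotCacheAppender {

    private static final String NS =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Streams an extracted pivotCacheRecords part from inPart to outPart and
    // injects extra <r> records just before the closing tag.  Each new record
    // is given as an array of shared-item indexes plus one numeric data value.
    public static void append(String inPart, String outPart,
            List<int[]> indexRows, List<Double> values) throws Exception {

        XMLEventReader in = XMLInputFactory.newInstance()
            .createXMLEventReader(new FileInputStream(inPart));
        XMLEventWriter out = XMLOutputFactory.newInstance()
            .createXMLEventWriter(new FileOutputStream(outPart), "UTF-8");
        XMLEventFactory ef = XMLEventFactory.newInstance();
        QName root = new QName(NS, "pivotCacheRecords");

        while (in.hasNext()) {
            XMLEvent event = in.nextEvent();
            if (event.isEndElement() && event.asEndElement().getName().equals(root)) {
                // inject the new records before </pivotCacheRecords>
                for (int i = 0; i < indexRows.size(); i++) {
                    out.add(ef.createStartElement("", NS, "r"));
                    for (int index : indexRows.get(i)) {
                        writeValueElement(out, ef, "x", String.valueOf(index));
                    }
                    writeValueElement(out, ef, "n", String.valueOf(values.get(i)));
                    out.add(ef.createEndElement("", NS, "r"));
                }
            }
            out.add(event);   // copy every original event through unchanged
        }
        out.flush();
        in.close();
        out.close();
    }

    // Writes an element equivalent to <x v="16"/> or <n v="4000"/>.
    private static void writeValueElement(XMLEventWriter out, XMLEventFactory ef,
            String name, String v) throws Exception {
        List<Attribute> attr = Collections.singletonList(ef.createAttribute("v", v));
        out.add(ef.createStartElement("", NS, name, attr.iterator(), null));
        out.add(ef.createEndElement("", NS, name));
    }
}

The rewritten part would then go back into the zip; getting excel to redraw the
table from the modified cache is the bit still to be proven.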

If it proves difficult or impossible to make this happen then we might
need to think some more.  An odbc connection is only one way (and
potentially a slow way) for a pivottable to get hold of its data,
particularly if it has to do it to, say, a postgres instance through
windoze's crappy tcp layer.  Pivottables can also read from excel data
sources and I think even csv (using the odbc csv driver).  If we are
thinking about offline use, then it might make sense, instead of doing
surgery on missing db connections, to explicitly pass the data
in an excel or csv file (starting to look a bit like my earlier
reporttable export).  The point being that there might be an initial
enormous transfer, but subsequent appended amendments would be more
reasonable.  Then - contrary to my earlier suggestion - the pivottable
should be in a separate workbook from the data, because if a pivottable
makes use of excel data in the same workbook, it is going to be transferred
to the pivotcache and you will end up with two copies of the data in the
workbook, both of which excel will also try to hold in memory.  Which can
be bad.  (I would have to look into whether it's more efficient to
represent the data as compressed csv or excel, but the principle is the
same.)

But if pivottable and data are separate then the process of refreshing
the pivottable will involve copying the data from the external data
source workbook into the pivotcache of the pivottable.  This might
still turn out to be unacceptably slow if we are talking hundreds of MB
of data, but at least we can update the data workbook incrementally
from dhis.  Then transferring into the pivotcache will take as long as it
takes :-)

I still think directly appending rows to the pivot cache will be the
optimal approach, but I'm talking through alternatives out loud in case
it doesn't work out.  I hope to answer this question definitively by Tuesday.
Meanwhile I'm sharing more info on excel internals than you probably
are interested in knowing, partly because it's probably useful for you
when designing queries and partly because it helps me straighten out my
thoughts  :-)

Regards
Bob



> Lars
>


