
dhis2-devs team mailing list archive

Re: Pivot tables

 

Attached is a simple skeleton overview diagram of how it *can* be
done.  I'm not sure if I'm 100% comfortable with the approach yet but
I can verify that it works (caveat below).

From the diagram, the point ultimately is to populate the
pivottablecache of the pivottable within the local excel spreadsheet
with the data from the view in the remote database.  This is what you
would normally do (in a connected setting) by setting up an odbc-type
connection to the postgres database.

What I have done instead is to create a connection to a local csv
file.  This is done by creating an odc connection description file and
pointing the pivottable at that.  Selected snippet from an example odc
connection file:
<xml id='msodc'>
  <odc:OfficeDataConnection
  xmlns:odc="urn:schemas-microsoft-com:office:odc"
  xmlns="http://www.w3.org/TR/REC-html40">
  <odc:Connection odc:Type="OLEDB">
   <odc:ConnectionString>Provider=MSDASQL.1;Persist Security
Info=False;Data Source=csv;Extended
Properties=&quot;DSN=csv;DefaultDir=C:\Users\bobj\Documents\excel;DriverId=27;FIL=text;MaxBufferSize=2048;PageTimeout=5;&quot;;Initial
Catalog=C:\Users\bobj\Documents\excel</odc:ConnectionString>
   <odc:CommandType>Table</odc:CommandType>
   <odc:CommandText>`C:\Users\bobj\Documents\excel\test.csv`</odc:CommandText>
  </odc:Connection>
 </odc:OfficeDataConnection>
</xml>
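
For what it's worth, an odc file like the above is simple enough to
generate programmatically when the csv path changes.  A sketch,
assuming the file layout shown above (the helper name and parameters
are mine, not anything fixed):

```python
# Sketch: generate a minimal ODC connection file pointing a pivot table
# at a CSV file via the ODBC text driver.  The DSN name "csv" and the
# directory layout mirror the example above and are assumptions.

ODC_TEMPLATE = """<xml id='msodc'>
  <odc:OfficeDataConnection
  xmlns:odc="urn:schemas-microsoft-com:office:odc"
  xmlns="http://www.w3.org/TR/REC-html40">
  <odc:Connection odc:Type="OLEDB">
   <odc:ConnectionString>Provider=MSDASQL.1;Persist Security Info=False;Data Source={dsn};Extended Properties=&quot;DSN={dsn};DefaultDir={dir};DriverId=27;FIL=text;MaxBufferSize=2048;PageTimeout=5;&quot;;Initial Catalog={dir}</odc:ConnectionString>
   <odc:CommandType>Table</odc:CommandType>
   <odc:CommandText>`{csv}`</odc:CommandText>
  </odc:Connection>
 </odc:OfficeDataConnection>
</xml>"""

def make_odc(dsn: str, directory: str, csv_name: str) -> str:
    """Fill the template with a DSN, data directory and CSV file name."""
    return ODC_TEMPLATE.format(dsn=dsn, dir=directory,
                               csv=directory + "\\" + csv_name)

print(make_odc("csv", r"C:\Users\bobj\Documents\excel", "test.csv"))
```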

I've taken a 30MB csv file dumped from the
_VIEW_PIVOTSOURCEROUTINEDATAOU3ALL view of Ola's sample data.
(Grabbed from h2 with: CALL CSVWRITE('test.csv', 'SELECT * FROM
_VIEW_PIVOTSOURCEROUTINEDATAOU3ALL');  )

Using the odc above, refreshing the pivottable causes a copy from the
csv file into the pivot cache, which seems pretty quick (4-5 seconds) -
certainly quicker than pulling it via tcp-odbc-jdbc or whatever.

Note that a worksheet can have more than one pivottable and more than
one pivotcache, and that more than one pivottable can share the same
pivot cache.  I haven't gone into those details.

So the upside is that local pivottable refresh is quick and appending
monthly csv snippets is trivial.  Another positive spinoff of having
the dataset as a local csv is that you can also pivot with all kinds of
tools other than excel (ranging from primitive awk to fancy 'r'
reshaping - both of which I think I prefer to excel :-) ).  Managing
this setup is fairly straightforward - 3 files:  the excel
file, the odc and the csv files, which grow with monthly updates (I
know there must also be a facility to amend changed data but we'll
cross that bridge ..).   And there are of course downsides.
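
Appending a monthly chunk really is trivial to script.  A minimal
sketch (the file names and the skip-the-header assumption are mine,
not part of the setup above):

```python
import csv

def append_month(master_path: str, chunk_path: str) -> int:
    """Append a monthly CSV chunk to the master file, skipping the
    chunk's header row (assumed identical to the master's header).
    Returns the number of data rows appended."""
    with open(chunk_path, newline="") as chunk, \
         open(master_path, "a", newline="") as master:
        reader = csv.reader(chunk)
        writer = csv.writer(master)
        next(reader)                 # skip the repeated header line
        count = 0
        for row in reader:
            writer.writerow(row)
            count += 1
    return count
```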

These csv files are really pretty big and hugely suboptimal.  The
normalization of the database has been flattened out for convenience
of pivoting, so we are dealing with tens of thousands of repeated
strings (dataelement names, ou names etc).  From a transport
perspective this is maybe not too bad - I'm guessing a reasonable
compression might factor out some of this.   And appending monthly
chunks wouldn't be too bad.  But they are pretty big files to manage.
 Still, I think they are maybe more robust than depending solely on the
pivotcache as a local primary store.
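
That compression guess is easy to sanity-check.  The sketch below
gzips a synthetic flattened extract full of repeated
dataelement/orgunit strings (the names and row counts are made up,
just to mimic the shape of the pivot source view):

```python
import gzip

# Build a synthetic flattened extract: the same data element and
# orgunit names repeated on every row, as in the pivot source view.
header = "dataelement,orgunit,period,value\n"
rows = [
    "BCG doses given,Ngelehun CHC,2010%02d,%d\n" % (m, m * 3)
    for m in range(1, 13)
] * 2000                              # 24,000 rows of repeated strings

raw = (header + "".join(rows)).encode("ascii")
packed = gzip.compress(raw)

print("raw %d bytes, gzipped %d bytes, ratio %.0fx"
      % (len(raw), len(packed), len(raw) / len(packed)))
```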

That's it for now.  A thought to sleep on is whether we can't somehow
just maintain a local database which replicates the necessary tables
from the online server.  But it's complicated.  Maintaining a couple
(hopefully not too many) csv files might be easier.

Bob

PS.  The caveat.  My csv source for pivottables sort of works, but all
my datavalues are being treated as zero :-(  I think this is probably
to do with setting the datatype of the field somehow.  I think it's
fixable.
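
If it is a datatype problem, one possible angle is the Microsoft text
driver's schema.ini mechanism, which pins column types down explicitly
instead of letting the driver guess them by scanning rows.  A
hypothetical schema.ini placed alongside test.csv (the column names
here are assumptions, not the actual view columns):

```ini
[test.csv]
ColNameHeader=True
Format=CSVDelimited
Col1=dataelement Char Width 230
Col2=orgunit Char Width 230
Col3=period Char Width 10
Col4=value Float
```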

2010/10/26 Lars Helge Øverland <larshelge@xxxxxxxxx>:
>
>
> 2010/10/25 Bob Jolliffe <bobjolliffe@xxxxxxxxx>
>>
>> 2010/10/25 Lars Helge Øverland <larshelge@xxxxxxxxx>:
>> >
>> > Hi thanks for the nice update, interesting. The thinking sounds very
>> > reasonable to me.
>> > Ola and I will work a bit on the optimization of SQL view. One issue
>> > here is
>> > that when writing "generic" sql queries for tables like
>> > orgunitgroupsetstructure, one needs to select * (all) columns in order
>> > to be
>> > portable because the table has each column per groupset, ie. its
>> > dynamic.
>> > But I guess we can fix these things when generating the SQL from Java /
>> > DHIS.
>>
>> Yes. The more we optimize these (in terms of number of fields) the
>> less enormous the excel file will be.  And given that excel holds all
>> in memory we must do the best we can.
>>
>> > Re size, a good thing is that a typical pivot table won't contain the
>> > whole
>> > database, rather the data for a district and its facilities. We want to
>> > be
>> > able to compare with other districts (maybe inside the same province
>> > only)
>> > but for those we only need the aggregated data at the district level
>> > (not
>> > facility data) - which means a lot less data.
>>
>> I'm trying to focus on the particular problems related to enormous
>> spreadsheets.  If they are small we don't really have much of a
>> problem either way.  But if we can deal with the really large ones
>> then we can maybe deal with medium large ones quite snappily.
>>
>
> Yes definitely.
>
>>
>> > Its also interesting to see how excel uses a "minimal" xml (<r><x
>> > v="0"/></r> etc). We could consider that for DXF 2 datavalues.
>> > Size/performance vs readability.
>>
>> This is an OOXML "feature" which has been heavily criticized by almost
>> all in the xml world.  Not being religious, I am in two minds about
>> it.  I think it might be reasonable for the datavalues in dxf, but I'd
>> stick my heels in if we started doing the same thing for metadata.  In
>> fact I think there is just such a minimal syntax which I have
>> suggested to Jo some time back as being maybe useful for a lightweight
>> mobile version of sdmx.  I wouldn't have a problem with:
>>
>> <dv de='32' o='3' p='4' d='2' />  <!-- where 'd' = disaggregation -->
>>
>> (I'd still like to see the other attributes, comment, createdby etc
>> being aggregated up to a 'datavalueset' but that's another story)
>>
>> Regarding ooxml I think there is a better justification doing it for a
>> data oriented document format like a spreadsheet, than there is for
>> word docs and presentations.  Certainly M$ has achieved significant
>> performance benefits to Excel in terms of load times.  File size is
>> not that significant a benefit because the compressed files shouldn't
>> be that different.  What also really helps is the use of
>> sharedstrings.  You can see that they do a similar shared strings type
>> optimization in the pivotcache.  All of which allows us to stretch the
>> use of spreadsheets into domains where they shouldn't really go.
>>
>
> Yes, was only thinking about data values. We can leave this for later.

Attachment: excel_pivot_csv.svg
Description: image/svg

