← Back to team overview

dhis2-devs team mailing list archive

Re: Analytics and disk space

 

+1 to Calle's idea of staggering analytics year by year

I also like Jason's suggestion of being able to configure the time period
for which analytics is regenerated. If the general use-case has data being
entered only for the current year, then is it perhaps unnecessary to
regenerate data for previous years?

Cheers,

-doh

On Tue, Jul 26, 2016 at 2:36 PM, Calle Hedberg <calle.hedberg@xxxxxxxxx>
wrote:

> Hi,
>
> One (presumably) simple solution is to stagger analytics on a year by year
> basis - i.e. run and complete 2009 before processing 2010. That would
> reduce temp disk space requirements significantly while (presumably) not
> changing the general design.
>
> Regards
> Calle
>
> On 26 July 2016 at 10:24, Jason Pickering <jason.p.pickering@xxxxxxxxx>
> wrote:
>
>> Hi Devs,
>> I am seeking some advice on how to try and decrease the amount of disk
>> usage with DHIS2.
>>
>> Here is a list of the biggest tables in the system.
>>
>>  public.datavalue                                   | 2316 MB
>>  public.datavalue_pkey                         | 1230 MB
>>  public.in_datavalue_lastupdated          | 680 MB
>>
>>
>> There are a lot more tables, and all in all, the database occupies about
>> 5.4 GB without analytics.
>>
>> This represents about 30 million data rows, so not that big of a database
>> really. This server is being run off of a Digital Ocean virtual server with
>> 60 GB of disk space. The only thing on the server really is Linux,
>> Postgresql and Tomcat. Nothing else. With out analytics and everything
>> installed for the system, we have about 23% of that 60 GB free.
>>
>> When analytics runs, it maintains a copy of the main analytics tables (
>> analytics_XXXX) and creates temp tables like analytics_temp_2004. When
>> things are finished and the indexes are built, the tables are swapped. This
>> ensures that analytics resources are available while analytics are being
>> built, but the downside of this is that A LOT more disk space is required,
>> as now we effectively have two copies of the tables along with all their
>> indexes, which are quite large themselves (up to 60% the size of the table
>> itself).  Here's what happens when analytics is run
>>
>>  public.analytics_temp_2015              | 1017 MB
>>  public.analytics_temp_2014              | 985 MB
>>  public.analytics_temp_2011              | 952 MB
>>  public.analytics_temp_2010              | 918 MB
>>  public.analytics_temp_2013              | 885 MB
>>  public.analytics_temp_2012              | 835 MB
>>  public.analytics_temp_2009              | 804 MB
>>
>> Now each analytics table is taking about 1 GB of space. In the end, it
>> adds up to more than 60 GB and analytics fails to complete.
>>
>> So, while I understand the need for this functionality, I am wondering if
>> we need a system option to allow the analytics tables to be dropped prior
>> to regenerating them, or to have more control over the order in which they
>> are generated (for instance to generate specific periods). I realize this
>> can be done from the API or the scheduler, but only for the past three
>> relative years.
>>
>>  The reason I am asking for this is because its a bit of a pain (at the
>> moment) when using Digital Ocean as a service provider, since their stock
>> disk storage is 60 GB. With other VPS providers (Amazon, Linode), its a bit
>> easier, but DigitalOcean only supports block storage in two regions at the
>> moment. Regardless, it would seem somewhat wasteful to have to have such a
>> large amount of disk space, for such a relatively small database.
>>
>> Is this something we just need to plan for and maybe provide better
>> documentation on, or should we think about trying to offer better
>> functionality for people running smaller servers?
>>
>> Regards,
>> Jason
>>
>> _______________________________________________
>> Mailing list: https://launchpad.net/~dhis2-devs
>> Post to     : dhis2-devs@xxxxxxxxxxxxxxxxxxx
>> Unsubscribe : https://launchpad.net/~dhis2-devs
>> More help   : https://help.launchpad.net/ListHelp
>>
>>
>
>
> --
>
> *******************************************
>
> Calle Hedberg
>
> 46D Alma Road, 7700 Rosebank, SOUTH AFRICA
>
> Tel/fax (home): +27-21-685-6472
>
> Cell: +27-82-853-5352
>
> Iridium SatPhone: +8816-315-19119
>
> Email: calle.hedberg@xxxxxxxxx
>
> Skype: calle_hedberg
>
> *******************************************
>
>
> _______________________________________________
> Mailing list: https://launchpad.net/~dhis2-devs
> Post to     : dhis2-devs@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~dhis2-devs
> More help   : https://help.launchpad.net/ListHelp
>
>

Follow ups

References