← Back to team overview

dhis2-devs team mailing list archive

Analytics and disk space

 

Hi Devs,
I am seeking some advice on how to try and decrease the amount of disk
usage with DHIS2.

Here is a list of the biggest tables in the system.

 public.datavalue                                   | 2316 MB
 public.datavalue_pkey                         | 1230 MB
 public.in_datavalue_lastupdated          | 680 MB


There are a lot more tables, and all in all, the database occupies about
5.4 GB without analytics.

This represents about 30 million data rows, so not that big of a database
really. This server is being run off of a Digital Ocean virtual server with
60 GB of disk space. The only thing on the server really is Linux,
Postgresql and Tomcat. Nothing else. With out analytics and everything
installed for the system, we have about 23% of that 60 GB free.

When analytics runs, it maintains a copy of the main analytics tables (
analytics_XXXX) and creates temp tables like analytics_temp_2004. When
things are finished and the indexes are built, the tables are swapped. This
ensures that analytics resources are available while analytics are being
built, but the downside of this is that A LOT more disk space is required,
as now we effectively have two copies of the tables along with all their
indexes, which are quite large themselves (up to 60% the size of the table
itself).  Here's what happens when analytics is run

 public.analytics_temp_2015              | 1017 MB
 public.analytics_temp_2014              | 985 MB
 public.analytics_temp_2011              | 952 MB
 public.analytics_temp_2010              | 918 MB
 public.analytics_temp_2013              | 885 MB
 public.analytics_temp_2012              | 835 MB
 public.analytics_temp_2009              | 804 MB

Now each analytics table is taking about 1 GB of space. In the end, it adds
up to more than 60 GB and analytics fails to complete.

So, while I understand the need for this functionality, I am wondering if
we need a system option to allow the analytics tables to be dropped prior
to regenerating them, or to have more control over the order in which they
are generated (for instance to generate specific periods). I realize this
can be done from the API or the scheduler, but only for the past three
relative years.

 The reason I am asking for this is because its a bit of a pain (at the
moment) when using Digital Ocean as a service provider, since their stock
disk storage is 60 GB. With other VPS providers (Amazon, Linode), its a bit
easier, but DigitalOcean only supports block storage in two regions at the
moment. Regardless, it would seem somewhat wasteful to have to have such a
large amount of disk space, for such a relatively small database.

Is this something we just need to plan for and maybe provide better
documentation on, or should we think about trying to offer better
functionality for people running smaller servers?

Regards,
Jason

Follow ups