dhis2-devs team mailing list archive
-
dhis2-devs team
-
Mailing list archive
-
Message #46734
Re: Analytics and disk space
Hi,
It's not only analytics that would benefit from segmented/staggered
processing: I exported around 100 mill data values yesterday from a number
of instance, and found that the export process was (seemingly)
exponentially slower with increasing number of records exported. Most of
the export files contained well under 10 mill records, which was pretty
fast. In comparison, the largest export file with around 30 mill data
values probably took 20 times as much time as an 8 mill value export. Based
on just keeping an eye on the "progress bar", it seemed like some kind of
cache staggering was taking place - the amount exported would increase
quickly by 2-3mb, then "hang" for a good while, then increase quickly by
2-3mb again.
Note also that there are several fundamental strategies one could use to
reducing heavy work processes like analytics, exports (and thus imports),
etc:
- to be able to specify a sub-period as Jason's suggest
- to be able to specify the "dirty" part of the instance by using e.g.
LastUpdated >= xxxxx
- to be able to specify a sub-OrgUnit-area
These partial strategies are of course mostly relevant for very large
instances, but such large instances are also the ones where you typically
only have changes made to a small segment of the total - like if you have
data for 30 years, 27 of those might be locked down and no longer available
for updates.
Regards
Calle
On 11 September 2016 at 15:47, David Siang Fong Oh <doh@xxxxxxxxxxxxxxxx>
wrote:
> +1 to Calle's idea of staggering analytics year by year
>
> I also like Jason's suggestion of being able to configure the time period
> for which analytics is regenerated. If the general use-case has data being
> entered only for the current year, then is it perhaps unnecessary to
> regenerate data for previous years?
>
> Cheers,
>
> -doh
>
> On Tue, Jul 26, 2016 at 2:36 PM, Calle Hedberg <calle.hedberg@xxxxxxxxx>
> wrote:
>
>> Hi,
>>
>> One (presumably) simple solution is to stagger analytics on a year by
>> year basis - i.e. run and complete 2009 before processing 2010. That would
>> reduce temp disk space requirements significantly while (presumably) not
>> changing the general design.
>>
>> Regards
>> Calle
>>
>> On 26 July 2016 at 10:24, Jason Pickering <jason.p.pickering@xxxxxxxxx>
>> wrote:
>>
>>> Hi Devs,
>>> I am seeking some advice on how to try and decrease the amount of disk
>>> usage with DHIS2.
>>>
>>> Here is a list of the biggest tables in the system.
>>>
>>> public.datavalue | 2316 MB
>>> public.datavalue_pkey | 1230 MB
>>> public.in_datavalue_lastupdated | 680 MB
>>>
>>>
>>> There are a lot more tables, and all in all, the database occupies about
>>> 5.4 GB without analytics.
>>>
>>> This represents about 30 million data rows, so not that big of a
>>> database really. This server is being run off of a Digital Ocean virtual
>>> server with 60 GB of disk space. The only thing on the server really is
>>> Linux, Postgresql and Tomcat. Nothing else. With out analytics and
>>> everything installed for the system, we have about 23% of that 60 GB free.
>>>
>>> When analytics runs, it maintains a copy of the main analytics tables (
>>> analytics_XXXX) and creates temp tables like analytics_temp_2004. When
>>> things are finished and the indexes are built, the tables are swapped. This
>>> ensures that analytics resources are available while analytics are being
>>> built, but the downside of this is that A LOT more disk space is required,
>>> as now we effectively have two copies of the tables along with all their
>>> indexes, which are quite large themselves (up to 60% the size of the table
>>> itself). Here's what happens when analytics is run
>>>
>>> public.analytics_temp_2015 | 1017 MB
>>> public.analytics_temp_2014 | 985 MB
>>> public.analytics_temp_2011 | 952 MB
>>> public.analytics_temp_2010 | 918 MB
>>> public.analytics_temp_2013 | 885 MB
>>> public.analytics_temp_2012 | 835 MB
>>> public.analytics_temp_2009 | 804 MB
>>>
>>> Now each analytics table is taking about 1 GB of space. In the end, it
>>> adds up to more than 60 GB and analytics fails to complete.
>>>
>>> So, while I understand the need for this functionality, I am wondering
>>> if we need a system option to allow the analytics tables to be dropped
>>> prior to regenerating them, or to have more control over the order in which
>>> they are generated (for instance to generate specific periods). I realize
>>> this can be done from the API or the scheduler, but only for the past three
>>> relative years.
>>>
>>> The reason I am asking for this is because its a bit of a pain (at the
>>> moment) when using Digital Ocean as a service provider, since their stock
>>> disk storage is 60 GB. With other VPS providers (Amazon, Linode), its a bit
>>> easier, but DigitalOcean only supports block storage in two regions at the
>>> moment. Regardless, it would seem somewhat wasteful to have to have such a
>>> large amount of disk space, for such a relatively small database.
>>>
>>> Is this something we just need to plan for and maybe provide better
>>> documentation on, or should we think about trying to offer better
>>> functionality for people running smaller servers?
>>>
>>> Regards,
>>> Jason
>>>
>>> _______________________________________________
>>> Mailing list: https://launchpad.net/~dhis2-devs
>>> Post to : dhis2-devs@xxxxxxxxxxxxxxxxxxx
>>> Unsubscribe : https://launchpad.net/~dhis2-devs
>>> More help : https://help.launchpad.net/ListHelp
>>>
>>>
>>
>>
>> --
>>
>> *******************************************
>>
>> Calle Hedberg
>>
>> 46D Alma Road, 7700 Rosebank, SOUTH AFRICA
>>
>> Tel/fax (home): +27-21-685-6472
>>
>> Cell: +27-82-853-5352
>>
>> Iridium SatPhone: +8816-315-19119
>>
>> Email: calle.hedberg@xxxxxxxxx
>>
>> Skype: calle_hedberg
>>
>> *******************************************
>>
>>
>> _______________________________________________
>> Mailing list: https://launchpad.net/~dhis2-devs
>> Post to : dhis2-devs@xxxxxxxxxxxxxxxxxxx
>> Unsubscribe : https://launchpad.net/~dhis2-devs
>> More help : https://help.launchpad.net/ListHelp
>>
>>
>
--
*******************************************
Calle Hedberg
46D Alma Road, 7700 Rosebank, SOUTH AFRICA
Tel/fax (home): +27-21-685-6472
Cell: +27-82-853-5352
Iridium SatPhone: +8816-315-19119
Email: calle.hedberg@xxxxxxxxx
Skype: calle_hedberg
*******************************************
References