dhis2-devs team mailing list archive

Re: Analytics and disk space

 

Hi there,

thanks for the feedback. Most of what's requested is available in the API.
It's on our list to rewrite the import-export app and write a better
scheduling manager for background tasks such as analytics generation.

In the meantime:

- Analytics tables generation for last x years
  <http://dhis2.github.io/dhis2-docs/master/en/developer/html/webapi_generating_resource_analytics_tables.html>
- Data value export (lastUpdated, lastUpdatedDuration, orgUnit params)
  <http://dhis2.github.io/dhis2-docs/master/en/developer/html/webapi_data_values.html#d0e3600>
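
As a rough illustration of the two calls listed above, here is a minimal Python sketch that only builds the request URLs. The lastYears, lastUpdatedDuration and orgUnit parameters come from this thread; the base URL, the /api/dataValueSets path, the dataSet parameter, the "7d" duration value and the placeholder UIDs are assumptions, so check them against the documentation for your version.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical base URL; it must end with "/" for urljoin to append paths.
BASE_URL = "https://dhis2.example.org/"


def analytics_update_url(base_url, last_years):
    """URL for regenerating analytics tables for the last N years only."""
    if last_years < 1:
        raise ValueError("last_years must be at least 1")
    query = urlencode({"lastYears": last_years})
    return urljoin(base_url, "api/resourceTables/analytics") + "?" + query


def data_value_export_url(base_url, data_set, org_unit, last_updated_duration):
    """URL for exporting only data values changed within a recent duration.

    Assumption: the export lives at api/dataValueSets and takes a dataSet
    parameter; orgUnit and lastUpdatedDuration are the params named above.
    """
    query = urlencode({
        "dataSet": data_set,
        "orgUnit": org_unit,
        "lastUpdatedDuration": last_updated_duration,
    })
    return urljoin(base_url, "api/dataValueSets") + "?" + query


if __name__ == "__main__":
    # Placeholder identifiers, not real UIDs.
    print(analytics_update_url(BASE_URL, 2))
    print(data_value_export_url(BASE_URL, "DATASET_UID", "ORGUNIT_UID", "7d"))
```

A cron job could send these requests with any HTTP client and suitable credentials; the sketch stops at URL construction so it can be checked without a running server.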


regards,

Lars



On Sun, Sep 11, 2016 at 5:20 PM, David Siang Fong Oh <doh@xxxxxxxxxxxxxxxx>
wrote:

> I think Jason also pointed out that this could be achieved from the API,
> but the question is whether it needs to be more user-friendly, i.e.
> customisable using the web application as opposed to requiring a custom
> script triggered by a cron job.
>
> Cheers,
>
> -doh
>
> On Sun, Sep 11, 2016 at 8:36 PM, Dan Cocos <dcocos@xxxxxxxxx> wrote:
>
>> Hi All,
>>
>> You could run this
>> /api/24/maintenance/analyticsTablesClear
>> and possibly this
>> /api/24/maintenance/periodPruning
>>
>> I don't see it in the documentation, but we call
>> /api/resourceTables/analytics?lastYears=2 quite often for clients with
>> a lot of historical data.
>>
>> Good luck,
>> Dan
>>
>> *Dan Cocos*
>> Principal, BAO Systems
>> dcocos@xxxxxxxxxxxxxx <nhobby@xxxxxxxxxxxxxx> | http://www.baosystems.com
>>  |  2900 K Street, Suite 404, Washington D.C. 20007
>>
>>
>>
>>
>>
>> On Sep 11, 2016, at 10:05 AM, Calle Hedberg <calle.hedberg@xxxxxxxxx>
>> wrote:
>>
>> Hi,
>>
>> It's not only analytics that would benefit from segmented/staggered
>> processing: I exported around 100 million data values yesterday from a
>> number of instances, and found that the export process became (seemingly)
>> exponentially slower as the number of records exported increased. Most of
>> the export files contained well under 10 million records, and those were
>> pretty fast. In comparison, the largest export file, with around 30 million
>> data values, probably took 20 times as long as an 8 million value export.
>> Based on just keeping an eye on the "progress bar", it seemed like some
>> kind of cache staggering was taking place - the amount exported would
>> increase quickly by 2-3 MB, then "hang" for a good while, then increase
>> quickly by 2-3 MB again.
>>
>> Note also that there are several fundamental strategies one could use to
>> reduce heavy work processes like analytics, exports (and thus imports),
>> etc:
>> - to be able to specify a sub-period, as Jason suggests
>> - to be able to specify the "dirty" part of the instance by using e.g.
>> LastUpdated >= xxxxx
>> - to be able to specify a sub-OrgUnit-area
>>
>> These partial strategies are of course mostly relevant for very large
>> instances, but such large instances are also the ones where you typically
>> only have changes made to a small segment of the total - like if you have
>> data for 30 years, 27 of those might be locked down and no longer available
>> for updates.
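
The sub-period strategy described above can be sketched as a small Python helper that splits a span of years into one window per calendar year and then drops the windows that are locked. The function names and the example years are hypothetical, and how each window is passed to an export or analytics request is left open.

```python
from datetime import date


def yearly_windows(first_year, last_year):
    """Return one (start, end) date pair per calendar year, oldest first."""
    if first_year > last_year:
        raise ValueError("first_year must not exceed last_year")
    return [
        (date(year, 1, 1), date(year, 12, 31))
        for year in range(first_year, last_year + 1)
    ]


def open_windows(windows, locked_before):
    """Keep only the windows that end on or after the lock date."""
    return [(start, end) for (start, end) in windows if end >= locked_before]


if __name__ == "__main__":
    # Illustrative numbers only: 30 years of data, the oldest 27 locked.
    all_years = yearly_windows(1987, 2016)
    still_open = open_windows(all_years, date(2014, 1, 1))
    for start, end in still_open:
        print(start.isoformat(), end.isoformat())
```

Each remaining window can then be processed on its own, so a single run never has to touch the locked years.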
>>
>> Regards
>> Calle
>>
>> On 11 September 2016 at 15:47, David Siang Fong Oh <doh@xxxxxxxxxxxxxxxx>
>>  wrote:
>>
>>> +1 to Calle's idea of staggering analytics year by year
>>>
>>> I also like Jason's suggestion of being able to configure the time
>>> period for which analytics is regenerated. If the general use-case has data
>>> being entered only for the current year, then is it perhaps unnecessary to
>>> regenerate data for previous years?
>>>
>>> Cheers,
>>>
>>> -doh
>>>
>>> On Tue, Jul 26, 2016 at 2:36 PM, Calle Hedberg <calle.hedberg@xxxxxxxxx>
>>>  wrote:
>>>
>>>> Hi,
>>>>
>>>> One (presumably) simple solution is to stagger analytics on a year by
>>>> year basis - i.e. run and complete 2009 before processing 2010. That would
>>>> reduce temp disk space requirements significantly while (presumably) not
>>>> changing the general design.
>>>>
>>>> Regards
>>>> Calle
>>>>
>>>> On 26 July 2016 at 10:24, Jason Pickering <jason.p.pickering@xxxxxxxxx>
>>>>  wrote:
>>>>
>>>>> Hi Devs,
>>>>> I am seeking some advice on how to try and decrease the amount of disk
>>>>> usage with DHIS2.
>>>>>
>>>>> Here is a list of the biggest tables in the system.
>>>>>
>>>>>  public.datavalue                 | 2316 MB
>>>>>  public.datavalue_pkey            | 1230 MB
>>>>>  public.in_datavalue_lastupdated  |  680 MB
>>>>>
>>>>>
>>>>> There are a lot more tables, and all in all, the database occupies
>>>>> about 5.4 GB without analytics.
>>>>>
>>>>> This represents about 30 million data rows, so not that big of a
>>>>> database really. This server is being run off of a Digital Ocean virtual
>>>>> server with 60 GB of disk space. The only thing on the server really is
>>>>> Linux, PostgreSQL and Tomcat. Nothing else. Without analytics and
>>>>> everything installed for the system, we have about 23% of that 60 GB free.
>>>>>
>>>>> When analytics runs, it maintains a copy of the main analytics tables
>>>>> ( analytics_XXXX) and creates temp tables like analytics_temp_2004. When
>>>>> things are finished and the indexes are built, the tables are swapped. This
>>>>> ensures that analytics resources are available while analytics are being
>>>>> built, but the downside of this is that A LOT more disk space is required,
>>>>> as now we effectively have two copies of the tables along with all their
>>>>> indexes, which are quite large themselves (up to 60% the size of the table
>>>>> itself).  Here's what happens when analytics is run:
>>>>>
>>>>>  public.analytics_temp_2015              | 1017 MB
>>>>>  public.analytics_temp_2014              | 985 MB
>>>>>  public.analytics_temp_2011              | 952 MB
>>>>>  public.analytics_temp_2010              | 918 MB
>>>>>  public.analytics_temp_2013              | 885 MB
>>>>>  public.analytics_temp_2012              | 835 MB
>>>>>  public.analytics_temp_2009              | 804 MB
>>>>>
>>>>> Now each analytics table is taking about 1 GB of space. In the end, it
>>>>> adds up to more than 60 GB and analytics fails to complete.
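
To make the arithmetic above concrete, here is a small Python sketch using only the figures quoted in this message. The seven temp tables listed add up to 6396 MB before indexes. Treating the "up to 60%" index figure as a worst case is an assumption, as is counting the free space in decimal megabytes. The listing covers only some of the years, so the real total is larger.

```python
# Temp table sizes in MB, copied from the listing above (year -> size).
TEMP_TABLE_MB = {
    2015: 1017, 2014: 985, 2011: 952, 2010: 918,
    2013: 885, 2012: 835, 2009: 804,
}

# Assumption: take the "up to 60%" index figure as a worst case.
INDEX_OVERHEAD = 0.60


def all_at_once_mb(table_mb, index_overhead=INDEX_OVERHEAD):
    """Extra space if every temp table and its indexes exist at once."""
    return sum(table_mb.values()) * (1 + index_overhead)


def staggered_peak_mb(table_mb, index_overhead=INDEX_OVERHEAD):
    """Extra space if only one year's temp table exists at a time."""
    return max(table_mb.values()) * (1 + index_overhead)


if __name__ == "__main__":
    free_mb = 0.23 * 60 * 1000  # about 23% of 60 GB, in decimal MB
    print(f"free before analytics: {free_mb:.0f} MB")
    print(f"all years at once:     {all_at_once_mb(TEMP_TABLE_MB):.0f} MB")
    print(f"one year at a time:    {staggered_peak_mb(TEMP_TABLE_MB):.0f} MB")
```

The second function models the year-by-year staggering suggested elsewhere in this thread: the peak is then set by the largest single year rather than by the sum of all years.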
>>>>>
>>>>> So, while I understand the need for this functionality, I am wondering
>>>>> if we need a system option to allow the analytics tables to be dropped
>>>>> prior to regenerating them, or to have more control over the order in which
>>>>> they are generated (for instance to generate specific periods). I realize
>>>>> this can be done from the API or the scheduler, but only for the past three
>>>>> relative years.
>>>>>
>>>>>  The reason I am asking for this is because it's a bit of a pain (at
>>>>> the moment) when using Digital Ocean as a service provider, since their
>>>>> stock disk storage is 60 GB. With other VPS providers (Amazon, Linode), it's
>>>>> a bit easier, but DigitalOcean only supports block storage in two regions
>>>>> at the moment. Regardless, it would seem somewhat wasteful to have to have
>>>>> such a large amount of disk space, for such a relatively small database.
>>>>>
>>>>> Is this something we just need to plan for and maybe provide better
>>>>> documentation on, or should we think about trying to offer better
>>>>> functionality for people running smaller servers?
>>>>>
>>>>> Regards,
>>>>> Jason
>>>>>
>>>>> _______________________________________________
>>>>> Mailing list: https://launchpad.net/~dhis2-devs
>>>>> Post to     : dhis2-devs@xxxxxxxxxxxxxxxxxxx
>>>>> Unsubscribe : https://launchpad.net/~dhis2-devs
>>>>> More help   : https://help.launchpad.net/ListHelp
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> *******************************************
>>>>
>>>> Calle Hedberg
>>>>
>>>> 46D Alma Road, 7700 Rosebank, SOUTH AFRICA
>>>>
>>>> Tel/fax (home): +27-21-685-6472
>>>>
>>>> Cell: +27-82-853-5352
>>>>
>>>> Iridium SatPhone: +8816-315-19119
>>>>
>>>> Email: calle.hedberg@xxxxxxxxx
>>>>
>>>> Skype: calle_hedberg
>>>>
>>>> *******************************************
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>
>


-- 
Lars Helge Øverland
Lead developer, DHIS 2
University of Oslo
Skype: larshelgeoverland
lars@xxxxxxxxx
http://www.dhis2.org <https://www.dhis2.org/>
