dhis2-devs team mailing list archive

Re: Analytics and disk space

To: Lars Helge Øverland <lars@xxxxxxxxx>
From: Jason Pickering <jason.p.pickering@xxxxxxxxx>
Date: Mon, 12 Sep 2016 12:24:03 +0200
Cc: Dan Cocos <dcocos@xxxxxxxxx>, dhis2-devs <dhis2-devs@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CAD_DPKz1D49y0vFviWbk7i1qn56aPDksFRs-5YOE6VgDAUnRJA@mail.gmail.com>
Hi Lars,

You are right that most of this is available via the API, but making it a
little easier for people through the UI would seem to be a quick win.
Previously, when the datamart was in use, one could choose specific
time periods to regenerate data for, but that is no longer possible. Currently,
users who use the UI can only execute a full analytics run, which is really the
basis of this request. In this particular situation, a full analytics run caused
the server to choke due to lack of disk space and the manner in which the
analytics tables are (re)generated.
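
To put rough numbers on the disk pressure, here is a small sketch using the
seven analytics_temp table sizes from the original message quoted below. It
assumes those figures exclude indexes, that indexes add the "up to 60%"
mentioned there, and that a year-by-year run would swap in each year's table
before building the next one. The function names are made up for illustration,
and this is an estimate, not a description of how DHIS2 actually allocates space.

```python
# Table sizes in MB as listed in the original message quoted below.
# Assumption: these are table data only, without indexes.
TEMP_TABLE_MB = {
    2009: 804, 2010: 918, 2011: 952, 2012: 835,
    2013: 885, 2014: 985, 2015: 1017,
}

# Assumption: indexes add up to 60% on top of each table (upper bound from the thread).
INDEX_RATIO = 0.6


def with_indexes(table_mb: float) -> float:
    """Size of one table plus its indexes, in MB."""
    return table_mb * (1 + INDEX_RATIO)


def peak_full_run_mb(sizes: dict) -> float:
    """Live tables plus a complete set of temp tables existing at the same time."""
    one_copy = sum(with_indexes(mb) for mb in sizes.values())
    return 2 * one_copy


def peak_staggered_mb(sizes: dict) -> float:
    """Live tables plus only the largest single temp table at any one moment."""
    one_copy = sum(with_indexes(mb) for mb in sizes.values())
    return one_copy + max(with_indexes(mb) for mb in sizes.values())


if __name__ == "__main__":
    print(f"full run:  {peak_full_run_mb(TEMP_TABLE_MB):.0f} MB")
    print(f"staggered: {peak_staggered_mb(TEMP_TABLE_MB):.0f} MB")
```

Under these assumptions the seven listed years alone peak at about 20,467 MB
for a full run against about 11,861 MB for a year-by-year run. The real
instance has more yearly tables than the seven shown, so the actual totals are
higher.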

Hopefully the new scheduling functionality will allow for better control
over analytics, without users always having to resort to a curl call to the
API.
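
For anyone who would rather script this than type a curl command, the
following is a minimal sketch of the lastYears call Dan mentions in the quoted
message below. The endpoint path and the lastYears parameter come from that
message. The POST method, HTTP basic authentication, the server URL, the
credentials and the function names are all assumptions made for illustration,
so check them against your own instance before relying on this.

```python
import base64
import urllib.parse
import urllib.request


def build_analytics_request(base_url, last_years, username, password):
    """Build (but do not send) a request to resourceTables/analytics for the last N years."""
    if last_years < 1:
        raise ValueError("last_years must be at least 1")
    query = urllib.parse.urlencode({"lastYears": last_years})
    url = f"{base_url.rstrip('/')}/api/resourceTables/analytics?{query}"
    # Assumption: the server accepts HTTP basic authentication.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        method="POST",  # Assumption: the endpoint is triggered with POST.
        headers={"Authorization": f"Basic {token}"},
    )


def trigger_analytics(request, timeout=30):
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical server and credentials; replace before use.
    req = build_analytics_request("https://dhis2.example.org", 2, "admin", "district")
    print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes it possible to inspect
the URL before anything touches the server. Calling trigger_analytics(req)
would then start the run, and the same script could be dropped into a cron job.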

Regards,
Jason


On Mon, Sep 12, 2016 at 8:42 AM, Lars Helge Øverland <lars@xxxxxxxxx> wrote:

> Hi there,
>
> thanks for the feedback. Most of what's requested is available in the API.
> It's on our list to rewrite the import-export app and write a better
> scheduling manager for background tasks such as analytics generation.
>
> In the meantime:
>
> - Analytics tables generation
> <http://dhis2.github.io/dhis2-docs/master/en/developer/html/webapi_generating_resource_analytics_tables.html> for
> last x years
> - Data value export
> <http://dhis2.github.io/dhis2-docs/master/en/developer/html/webapi_data_values.html#d0e3600>
> (lastUpdated, lastUpdatedDuration, orgUnit params)
>
>
> regards,
>
> Lars
>
>
>
> On Sun, Sep 11, 2016 at 5:20 PM, David Siang Fong Oh <doh@xxxxxxxxxxxxxxxx
> > wrote:
>
>> I think Jason also pointed out that this could be achieved from the API,
>> but the question is whether it needs to be more user-friendly, i.e.
>> customisable using the web application as opposed to requiring a custom
>> script triggered by a cron job.
>>
>> Cheers,
>>
>> -doh
>>
>> On Sun, Sep 11, 2016 at 8:36 PM, Dan Cocos <dcocos@xxxxxxxxx> wrote:
>>
>>> Hi All,
>>>
>>> You could run this
>>> /api/24/maintenance/analyticsTablesClear
>>> and possibly this
>>> /api/24/maintenance/periodPruning
>>>
>>> I don't see it in the documentation, but we call
>>> /api/resourceTables/analytics?lastYears=2 quite often for clients with
>>> a lot of historical data.
>>>
>>> Good luck,
>>> Dan
>>>
>>> *Dan Cocos*
>>> Principal, BAO Systems
>>> dcocos@xxxxxxxxxxxxxx <nhobby@xxxxxxxxxxxxxx> | http://www.baosystems.com |
>>> 2900 K Street, Suite 404, Washington D.C. 20007
>>>
>>>
>>>
>>>
>>>
>>> On Sep 11, 2016, at 10:05 AM, Calle Hedberg <calle.hedberg@xxxxxxxxx>
>>> wrote:
>>>
>>> Hi,
>>>
>>> It's not only analytics that would benefit from segmented/staggered
>>> processing: I exported around 100 million data values yesterday from a
>>> number of instances, and found that the export process was (seemingly)
>>> exponentially slower as the number of records exported increased. Most of
>>> the export files contained well under 10 million records, which was pretty
>>> fast. In comparison, the largest export file, with around 30 million data
>>> values, probably took 20 times as long as an 8 million value export. Based
>>> on just keeping an eye on the "progress bar", it seemed like some kind of
>>> cache staggering was taking place - the amount exported would increase
>>> quickly by 2-3 MB, then "hang" for a good while, then increase quickly by
>>> 2-3 MB again.
>>>
>>> Note also that there are several fundamental strategies one could use to
>>> reduce heavy work processes like analytics, exports (and thus imports),
>>> etc:
>>> - to be able to specify a sub-period, as Jason suggests
>>> - to be able to specify the "dirty" part of the instance by using e.g.
>>> LastUpdated >= xxxxx
>>> - to be able to specify a sub-OrgUnit-area
>>> These partial strategies are of course mostly relevant for very large
>>> instances, but such large instances are also the ones where you typically
>>> only have changes made to a small segment of the total - like if you have
>>> data for 30 years, 27 of those might be locked down and no longer available
>>> for updates.
>>>
>>> Regards
>>> Calle
>>>
>>> On 11 September 2016 at 15:47, David Siang Fong Oh <doh@xxxxxxxxxxxxxxxx
>>> > wrote:
>>>
>>>> +1 to Calle's idea of staggering analytics year by year
>>>>
>>>> I also like Jason's suggestion of being able to configure the time
>>>> period for which analytics is regenerated. If the general use-case has data
>>>> being entered only for the current year, then is it perhaps unnecessary to
>>>> regenerate data for previous years?
>>>>
>>>> Cheers,
>>>>
>>>> -doh
>>>>
>>>> On Tue, Jul 26, 2016 at 2:36 PM, Calle Hedberg <calle.hedberg@xxxxxxxxx
>>>> > wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> One (presumably) simple solution is to stagger analytics on a year by
>>>>> year basis - i.e. run and complete 2009 before processing 2010. That would
>>>>> reduce temp disk space requirements significantly while (presumably) not
>>>>> changing the general design.
>>>>>
>>>>> Regards
>>>>> Calle
>>>>>
>>>>> On 26 July 2016 at 10:24, Jason Pickering <jason.p.pickering@xxxxxxxxx
>>>>> > wrote:
>>>>>
>>>>>> Hi Devs,
>>>>>> I am seeking some advice on how to try and decrease the amount of
>>>>>> disk usage with DHIS2.
>>>>>>
>>>>>> Here is a list of the biggest tables in the system.
>>>>>>
>>>>>>  public.datavalue                | 2316 MB
>>>>>>  public.datavalue_pkey           | 1230 MB
>>>>>>  public.in_datavalue_lastupdated | 680 MB
>>>>>>
>>>>>>
>>>>>> There are a lot more tables, and all in all, the database occupies
>>>>>> about 5.4 GB without analytics.
>>>>>>
>>>>>> This represents about 30 million data rows, so not that big of a
>>>>>> database really. This server is being run off of a Digital Ocean virtual
>>>>>> server with 60 GB of disk space. The only thing on the server really is
>>>>>> Linux, PostgreSQL and Tomcat. Nothing else. Without analytics, and with
>>>>>> everything installed for the system, we have about 23% of that 60 GB free.
>>>>>>
>>>>>> When analytics runs, it maintains a copy of the main analytics tables
>>>>>> ( analytics_XXXX) and creates temp tables like analytics_temp_2004. When
>>>>>> things are finished and the indexes are built, the tables are swapped. This
>>>>>> ensures that analytics resources are available while analytics are being
>>>>>> built, but the downside of this is that A LOT more disk space is required,
>>>>>> as now we effectively have two copies of the tables along with all their
>>>>>> indexes, which are quite large themselves (up to 60% the size of the table
>>>>>> itself). Here's what happens when analytics is run:
>>>>>>
>>>>>>  public.analytics_temp_2015              | 1017 MB
>>>>>>  public.analytics_temp_2014              | 985 MB
>>>>>>  public.analytics_temp_2011              | 952 MB
>>>>>>  public.analytics_temp_2010              | 918 MB
>>>>>>  public.analytics_temp_2013              | 885 MB
>>>>>>  public.analytics_temp_2012              | 835 MB
>>>>>>  public.analytics_temp_2009              | 804 MB
>>>>>>
>>>>>> Now each analytics table is taking about 1 GB of space. In the end,
>>>>>> it adds up to more than 60 GB and analytics fails to complete.
>>>>>>
>>>>>> So, while I understand the need for this functionality, I am
>>>>>> wondering if we need a system option to allow the analytics tables to be
>>>>>> dropped prior to regenerating them, or to have more control over the order
>>>>>> in which they are generated (for instance to generate specific periods). I
>>>>>> realize this can be done from the API or the scheduler, but only for the
>>>>>> past three relative years.
>>>>>>
>>>>>> The reason I am asking for this is because it's a bit of a pain (at
>>>>>> the moment) when using Digital Ocean as a service provider, since their
>>>>>> stock disk storage is 60 GB. With other VPS providers (Amazon, Linode), it's
>>>>>> a bit easier, but DigitalOcean only supports block storage in two regions
>>>>>> at the moment. Regardless, it would seem somewhat wasteful to need
>>>>>> such a large amount of disk space for such a relatively small database.
>>>>>>
>>>>>> Is this something we just need to plan for and maybe provide better
>>>>>> documentation on, or should we think about trying to offer better
>>>>>> functionality for people running smaller servers?
>>>>>>
>>>>>> Regards,
>>>>>> Jason
>>>>>>
>>>>>> _______________________________________________
>>>>>> Mailing list: https://launchpad.net/~dhis2-devs
>>>>>> Post to     : dhis2-devs@xxxxxxxxxxxxxxxxxxx
>>>>>> Unsubscribe : https://launchpad.net/~dhis2-devs
>>>>>> More help   : https://help.launchpad.net/ListHelp
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> *******************************************
>>>>>
>>>>> Calle Hedberg
>>>>>
>>>>> 46D Alma Road, 7700 Rosebank, SOUTH AFRICA
>>>>>
>>>>> Tel/fax (home): +27-21-685-6472
>>>>>
>>>>> Cell: +27-82-853-5352
>>>>>
>>>>> Iridium SatPhone: +8816-315-19119
>>>>>
>>>>> Email: calle.hedberg@xxxxxxxxx
>>>>>
>>>>> Skype: calle_hedberg
>>>>>
>>>>> *******************************************
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
> --
> Lars Helge Øverland
> Lead developer, DHIS 2
> University of Oslo
> Skype: larshelgeoverland
> lars@xxxxxxxxx
> http://www.dhis2.org <https://www.dhis2.org/>
>
>
>
>


-- 
Jason P. Pickering
email: jason.p.pickering@xxxxxxxxx
tel:+46764147049
References

Analytics and disk space
From: Jason Pickering, 2016-07-26
Re: Analytics and disk space
From: Calle Hedberg, 2016-07-26
Re: Analytics and disk space
From: David Siang Fong Oh, 2016-09-11
Re: Analytics and disk space
From: Calle Hedberg, 2016-09-11
Re: Analytics and disk space
From: David Siang Fong Oh, 2016-09-11
Re: Analytics and disk space
From: Lars Helge Øverland, 2016-09-12