dhis2-devs team mailing list archive
Message #46735
Re: Analytics and disk space
I think Jason also pointed out that this could be achieved from the API,
but the question is whether it needs to be more user-friendly, i.e.
customisable using the web application as opposed to requiring a custom
script triggered by a cron job.
Cheers,
-doh
On Sun, Sep 11, 2016 at 8:36 PM, Dan Cocos <dcocos@xxxxxxxxx> wrote:
> Hi All,
>
> You could run this
> /api/24/maintenance/analyticsTablesClear
> and possibly this
> /api/24/maintenance/periodPruning
>
> I don't see it in the documentation, but we call this
> /api/resourceTables/analytics?lastYears=2 quite often for clients with a
> lot of historical data.
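Taken together, those calls could be wrapped in a small cron-able script along these lines; a minimal sketch, where the base URL and credentials are placeholders and DRY_RUN=1 only prints the commands instead of sending the requests:

```shell
#!/bin/sh
# Sketch of the maintenance calls above as a cron-able script.
# BASE and AUTH are placeholders; set DRY_RUN=0 to actually send requests.
BASE="https://dhis2.example.org"
AUTH="admin:district"
DRY_RUN=1

call() {
  # POST to a maintenance/resourceTables endpoint, or print the command
  # in dry-run mode so the script can be checked before scheduling it.
  if [ "$DRY_RUN" = "1" ]; then
    echo "curl -u $AUTH -X POST $BASE$1"
  else
    curl -s -u "$AUTH" -X POST "$BASE$1"
  fi
}

call "/api/24/maintenance/analyticsTablesClear"
call "/api/24/maintenance/periodPruning"
call "/api/resourceTables/analytics?lastYears=2"
```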
>
> Good luck,
> Dan
>
> *Dan Cocos*
> Principal, BAO Systems
> dcocos@xxxxxxxxxxxxxx | http://www.baosystems.com
> | 2900 K Street, Suite 404, Washington D.C. 20007
>
>
>
>
>
> On Sep 11, 2016, at 10:05 AM, Calle Hedberg <calle.hedberg@xxxxxxxxx>
> wrote:
>
> Hi,
>
> It's not only analytics that would benefit from segmented/staggered
> processing: I exported around 100 mill data values yesterday from a number
> of instances, and found that the export process was (seemingly)
> exponentially slower with an increasing number of records exported. Most of
> the export files contained well under 10 mill records, which was pretty
> fast. In comparison, the largest export file with around 30 mill data
> values probably took 20 times as long as an 8 mill value export. Based
> on just keeping an eye on the "progress bar", it seemed like some kind of
> cache staggering was taking place - the amount exported would increase
> quickly by 2-3 MB, then "hang" for a good while, then increase quickly by
> 2-3 MB again.
>
> Note also that there are several fundamental strategies one could use to
> reduce heavy work processes like analytics, exports (and thus imports),
> etc:
> - to be able to specify a sub-period, as Jason suggests
> - to be able to specify the "dirty" part of the instance by using e.g.
> LastUpdated >= xxxxx
> - to be able to specify a sub-OrgUnit-area
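As a rough illustration, the three filters could be combined on a dataValueSets export; a sketch, where the base URL and the dataSet/orgUnit UIDs are made-up placeholders and the parameter names reflect my reading of the dataValueSets endpoint:

```shell
#!/bin/sh
# Build a partial-export URL combining the three strategies above:
# a sub-period (startDate/endDate), a lastUpdated cutoff, and a
# sub-OrgUnit (with its children).  All UIDs below are placeholders.
BASE="https://dhis2.example.org"

export_url() {
  # $1 dataSet UID, $2 orgUnit UID, $3 startDate, $4 endDate, $5 lastUpdated
  echo "$BASE/api/dataValueSets.json?dataSet=$1&orgUnit=$2&children=true&startDate=$3&endDate=$4&lastUpdated=$5"
}

export_url "pBOMPrpg1QX" "ImspTQPwCqd" "2016-01-01" "2016-12-31" "2016-09-01"
```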
>
> These partial strategies are of course mostly relevant for very large
> instances, but such large instances are also the ones where you typically
> only have changes made to a small segment of the total - like if you have
> data for 30 years, 27 of those might be locked down and no longer available
> for updates.
>
> Regards
> Calle
>
> On 11 September 2016 at 15:47, David Siang Fong Oh <doh@xxxxxxxxxxxxxxxx>
> wrote:
>
>> +1 to Calle's idea of staggering analytics year by year
>>
>> I also like Jason's suggestion of being able to configure the time period
>> for which analytics is regenerated. If the general use-case has data being
>> entered only for the current year, then is it perhaps unnecessary to
>> regenerate data for previous years?
>>
>> Cheers,
>>
>> -doh
>>
>> On Tue, Jul 26, 2016 at 2:36 PM, Calle Hedberg <calle.hedberg@xxxxxxxxx>
>> wrote:
>>
>>> Hi,
>>>
>>> One (presumably) simple solution is to stagger analytics on a year by
>>> year basis - i.e. run and complete 2009 before processing 2010. That would
>>> reduce temp disk space requirements significantly while (presumably) not
>>> changing the general design.
>>>
>>> Regards
>>> Calle
>>>
>>> On 26 July 2016 at 10:24, Jason Pickering <jason.p.pickering@xxxxxxxxx>
>>> wrote:
>>>
>>>> Hi Devs,
>>>> I am seeking some advice on how to try and decrease the amount of disk
>>>> usage with DHIS2.
>>>>
>>>> Here is a list of the biggest tables in the system.
>>>>
>>>> public.datavalue | 2316 MB
>>>> public.datavalue_pkey | 1230 MB
>>>> public.in_datavalue_lastupdated | 680 MB
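(A listing like the one above can be reproduced from the Postgres system catalogs; a sketch, where the database name is a placeholder and the SQL is printed rather than executed:)

```shell
#!/bin/sh
# Reproduce a "biggest relations" listing (tables and indexes) from the
# Postgres system catalogs.  The SQL is printed here; pipe it into psql
# against a real database to run it.
SQL="SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS size
FROM pg_class
WHERE relkind IN ('r', 'i')
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;"

echo "$SQL"
# echo "$SQL" | psql -d dhis2   # run against a real database
```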
>>>>
>>>>
>>>> There are a lot more tables, and all in all, the database occupies
>>>> about 5.4 GB without analytics.
>>>>
>>>> This represents about 30 million data rows, so not that big of a
>>>> database really. This server is being run off of a Digital Ocean virtual
>>>> server with 60 GB of disk space. The only thing on the server really is
>>>> Linux, Postgresql and Tomcat. Nothing else. Without analytics and with
>>>> everything installed for the system, we have about 23% of that 60 GB free.
>>>>
>>>> When analytics runs, it maintains a copy of the main analytics tables (
>>>> analytics_XXXX) and creates temp tables like analytics_temp_2004. When
>>>> things are finished and the indexes are built, the tables are swapped. This
>>>> ensures that analytics resources are available while analytics are being
>>>> built, but the downside of this is that A LOT more disk space is required,
>>>> as now we effectively have two copies of the tables along with all their
>>>> indexes, which are quite large themselves (up to 60% the size of the table
>>>> itself). Here's what happens when analytics is run
>>>>
>>>> public.analytics_temp_2015 | 1017 MB
>>>> public.analytics_temp_2014 | 985 MB
>>>> public.analytics_temp_2011 | 952 MB
>>>> public.analytics_temp_2010 | 918 MB
>>>> public.analytics_temp_2013 | 885 MB
>>>> public.analytics_temp_2012 | 835 MB
>>>> public.analytics_temp_2009 | 804 MB
>>>>
>>>> Now each analytics table is taking about 1 GB of space. In the end, it
>>>> adds up to more than 60 GB and analytics fails to complete.
>>>>
>>>> So, while I understand the need for this functionality, I am wondering
>>>> if we need a system option to allow the analytics tables to be dropped
>>>> prior to regenerating them, or to have more control over the order in which
>>>> they are generated (for instance to generate specific periods). I realize
>>>> this can be done from the API or the scheduler, but only for the past three
>>>> relative years.
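A manual workaround in the meantime might be to drop the per-year analytics tables before a full rebuild; a sketch, assuming the analytics_YYYY naming shown above, which emits the DROP statements for review rather than executing them:

```shell
#!/bin/sh
# Emit DROP statements for the per-year analytics tables so disk space is
# freed before a full rebuild.  Table naming follows the analytics_YYYY
# pattern above; review the output before piping it into psql.
drop_sql() {
  for y in "$@"; do
    echo "DROP TABLE IF EXISTS analytics_$y;"
  done
}

drop_sql 2009 2010 2011 2012 2013 2014 2015
# drop_sql 2009 2010 2011 | psql -d dhis2   # uncomment to execute for real
```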
>>>>
>>>> The reason I am asking for this is because it's a bit of a pain (at the
>>>> moment) when using Digital Ocean as a service provider, since their stock
>>>> disk storage is 60 GB. With other VPS providers (Amazon, Linode), it's a bit
>>>> easier, but DigitalOcean only supports block storage in two regions at the
>>>> moment. Regardless, it would seem somewhat wasteful to require such a
>>>> large amount of disk space for such a relatively small database.
>>>>
>>>> Is this something we just need to plan for and maybe provide better
>>>> documentation on, or should we think about trying to offer better
>>>> functionality for people running smaller servers?
>>>>
>>>> Regards,
>>>> Jason
>>>>
>>>> _______________________________________________
>>>> Mailing list: https://launchpad.net/~dhis2-devs
>>>> Post to : dhis2-devs@xxxxxxxxxxxxxxxxxxx
>>>> Unsubscribe : https://launchpad.net/~dhis2-devs
>>>> More help : https://help.launchpad.net/ListHelp
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> *******************************************
>>>
>>> Calle Hedberg
>>>
>>> 46D Alma Road, 7700 Rosebank, SOUTH AFRICA
>>>
>>> Tel/fax (home): +27-21-685-6472
>>>
>>> Cell: +27-82-853-5352
>>>
>>> Iridium SatPhone: +8816-315-19119
>>>
>>> Email: calle.hedberg@xxxxxxxxx
>>>
>>> Skype: calle_hedberg
>>>
>>> *******************************************
>>>
>>>
>>>
>>
>