
dhis2-users team mailing list archive

Re: Cross database querying for updating data

 

Hi Jason

Yes, DELETEs are a problem alright.

Having a complete copy of the remote database would in general defeat
the purpose of the exercise :-)  Which is to maintain such a copy with
increments rather than downloading the whole thing ...

A while back I thought about using Postgres's native replication
feature for this, but the problem is that the datamart/analytics
process breaks any sane scheme of following incremental updates.  And
there is no way to do table-selective replication.

Of course this is one of a number of quite good architectural reasons
to break the data collection database away from the analytics database
entirely, but that is not something that will happen anytime soon.  A
simple data collection database that grew (and occasionally shrank)
incrementally would be quite easy to track.

A slightly roundabout route might be to maintain a separate database
on the server which didn't have any of the analytics stuff (i.e.
acting as if it were just a data collection database, but back to
front/upside down).  This database could track the data tables in the
production database using some of the diff techniques you talked of
earlier, or even by putting triggers on the datavalue table - though
that could certainly impact the speed of large data updates.  In fact
the triggers would probably just feed updates to a "transaction" table
in the same database, recording deletes, updates and inserts.  There
are quite a few systems I can think of where this would probably not
be a big problem at all.
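
To make the trigger idea concrete, here is a minimal sketch of what it
might look like.  It is illustrative only: the audit table and function
names are hypothetical, and the column list assumes a typical DHIS2 2.x
datavalue schema, so it would need checking against the actual version.

CREATE TABLE datavalue_audit (
    auditid     bigserial PRIMARY KEY,
    operation   char(1) NOT NULL,                -- 'I', 'U' or 'D'
    changed_at  timestamptz NOT NULL DEFAULT now(),
    dataelementid          integer,
    periodid               integer,
    sourceid               integer,
    categoryoptioncomboid  integer,
    attributeoptioncomboid integer,
    value                  text
);

CREATE FUNCTION log_datavalue_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO datavalue_audit (operation, dataelementid, periodid,
            sourceid, categoryoptioncomboid, attributeoptioncomboid, value)
        VALUES ('D', OLD.dataelementid, OLD.periodid, OLD.sourceid,
            OLD.categoryoptioncomboid, OLD.attributeoptioncomboid, OLD.value);
        RETURN OLD;
    END IF;
    -- left(TG_OP, 1) yields 'I' for INSERT and 'U' for UPDATE
    INSERT INTO datavalue_audit (operation, dataelementid, periodid,
        sourceid, categoryoptioncomboid, attributeoptioncomboid, value)
    VALUES (left(TG_OP, 1), NEW.dataelementid, NEW.periodid, NEW.sourceid,
        NEW.categoryoptioncomboid, NEW.attributeoptioncomboid, NEW.value);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER datavalue_change_log
AFTER INSERT OR UPDATE OR DELETE ON datavalue
FOR EACH ROW EXECUTE PROCEDURE log_datavalue_change();

The remote end would then only ever need to ask for audit rows with an
auditid higher than the last one it pulled, and replay them in order.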

Whichever way it was accomplished, something like this would then be
more easily trackable from the remote end across a low-bandwidth,
expensive or unreliable connection.

Currently I know of at least one country system in the cloud where
this is a real issue.  The system is running well, but there is some
discomfort that they are not able to realistically pull nightly
snapshots back because of the size of the data.  And given that 95%
(and growing) of the data is data that they already have, it does seem
remarkably inefficient.


On 21 April 2015 at 07:21, Jason Pickering <jason.p.pickering@xxxxxxxxx> wrote:
> Yeah maybe, or simply "NOT EXISTS".
>
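> For instance, assuming the relevant slice of B's datavalue table had
> first been staged locally into a (hypothetical) table datavalue_b, the
> diff of additions might look like:
>
> -- rows present in B's copy but missing from A, compared on the
> -- datavalue composite key
> SELECT b.*
> FROM   datavalue_b b
> WHERE  NOT EXISTS (
>          SELECT 1
>          FROM   datavalue a
>          WHERE  a.dataelementid          = b.dataelementid
>          AND    a.periodid               = b.periodid
>          AND    a.sourceid               = b.sourceid
>          AND    a.categoryoptioncomboid  = b.categoryoptioncomboid
>          AND    a.attributeoptioncomboid = b.attributeoptioncomboid);
>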
> However, cross-database queries in Postgres are in general not supported
> without an extension, so you would need to deal with that, and obviously
> have a complete, consistent copy of the remote database on the local machine.
>
> Seems like a lot of moving parts, but it could be done. Whether or not it
> would save you any bandwidth is another issue.
>
> Anthony, how big is the database which you are trying to replicate, after
> compression?  Are you working over a very slow link, and thus trying to
> decrease the size of what you need to download to the remote machine?
>
> Regards,
> Jason
>
>
>
>
>
> On Tue, Apr 21, 2015 at 2:44 AM Knut Staring <knutst@xxxxxxxxx> wrote:
>>
>> Couldn't two "left joins with null" give you the diff?
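>>
>> Something like the following, perhaps (a sketch only; datavalue_b is
>> a hypothetical locally staged copy of B's datavalue table):
>>
>> -- one direction of the diff: values in B that are missing from A.
>> -- Swapping the two tables gives the other direction.
>> SELECT b.*
>> FROM   datavalue_b b
>> LEFT JOIN datavalue a
>>        ON  a.dataelementid          = b.dataelementid
>>        AND a.periodid               = b.periodid
>>        AND a.sourceid               = b.sourceid
>>        AND a.categoryoptioncomboid  = b.categoryoptioncomboid
>>        AND a.attributeoptioncomboid = b.attributeoptioncomboid
>> WHERE  a.dataelementid IS NULL;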
>>
>> On Apr 21, 2015 12:36 AM, "Jason Pickering" <jason.p.pickering@xxxxxxxxx>
>> wrote:
>>>
>>> Hi there. I think the big challenge here is what is not there, as opposed
>>> to what needs to be added.  DHIS2 has no functionality for "soft deletes",
>>> i.e. marking a record as deleted with a timestamp in a column like
>>> "deletedon" rather than actually removing it from the database.
>>> This is a common problem with DXF synchronizations, as the upstream system
>>> does not transmit DELETEs; it only transmits INSERTs and UPDATEs.
>>>
>>> At first glance your case would seem pretty easy, since you can use the
>>> "lastupdated" column of the data value table to parse out values which
>>> have been added since your last synchronization.  The problem, however,
>>> is determining what has been DELETED from the data value table.  That is
>>> more tricky, and would require you to diff both data value tables to
>>> determine what changes you need to transmit to your downstream system.
>>>
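>>> For the additions side, that query is simple enough (a sketch; the
>>> cutoff timestamp is just an example):
>>>
>>> -- values added or changed since the last synchronization
>>> SELECT * FROM datavalue WHERE lastupdated >= '2015-04-20';
>>>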
>>> There are tools out there to help determine the "diff" between two
>>> databases, which I think in your case would require you to have a copy
>>> of your remote DB and then diff it against your local DB.  You would
>>> then prepare a diff, apply the update to your local copy and then to
>>> your remote copy.
>>>
>>> I think that would really be the only realistic way of doing this: using
>>> a tool like perhaps these
>>> (http://stackoverflow.com/questions/4804779/how-to-check-difference-between-two-databases-in-p)
>>> or other commercial alternatives to diff the two databases, and prepare a
>>> series of INSERT and DELETE statements.  Never attempted it myself, but it
>>> might work.
>>>
>>> Does not really seem that trivial. :(
>>>
>>> Regards,
>>> Jason
>>>
>>>
>>> On Mon, Apr 20, 2015 at 6:59 PM Knut Staring <knutst@xxxxxxxxx> wrote:
>>>>
>>>> I think I would trust UIDs or Codes more than database IDs
>>>>
>>>> On Mon, Apr 20, 2015 at 10:48 PM, Muyepa A. <muyepaa@xxxxxxxxx> wrote:
>>>>>
>>>>> Indeed, database A has extra data not available in B, while B has data
>>>>> missing in A.  No changes have been made to orgunits, dataelements,
>>>>> periods or users; the delta is in data values.  I would really like
>>>>> views on whether the operation described (all other parameters being
>>>>> equal) will bring the missing data into A, and whether it may affect
>>>>> report summary rates.
>>>>> I cannot easily download the big databases.  Something on the command
>>>>> line would be very suitable for this circumstance, and transparent to users.
>>>>>
>>>>> On Mon, Apr 20, 2015 at 5:08 PM, Bob Jolliffe <bobjolliffe@xxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> If database B is restored off a backup of database A then it will
>>>>>> probably work, with some caveats.  Clearly they are not exactly the
>>>>>> same, because you've got extra data in A which you are trying to push
>>>>>> into B.  So it really depends on what else might have changed in A.
>>>>>>
>>>>>> If you have made any changes to orgunits, dataelements or periods then
>>>>>> that will, at best, cause a database integrity violation when you try
>>>>>> to import that way.  At worst it will cause a very horrible mess.
>>>>>>
>>>>>> Even though the process is fragile, there is a related use case where
>>>>>> one might need to do such a thing or something similar.  If a country
>>>>>> system is hosted somewhere in the cloud (because they have poor or
>>>>>> expensive local bandwidth) then there is a real problem to solve about
>>>>>> getting data backups from the server back into the country.
>>>>>>
>>>>>> In particular we'd like to be able to pull not all the datavalues but
>>>>>> only those which have changed, i.e. do an incremental backup of the
>>>>>> datavalue table.  The same could apply to events.
>>>>>>
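>>>>>> As a rough sketch of what that pull could look like on the server
>>>>>> side (the database name and cutoff date are placeholders):
>>>>>>
>>>>>> psql dhis2 -c "\copy (SELECT * FROM datavalue WHERE lastupdated >= '2015-04-20') TO STDOUT" | gzip > datavalue-delta.tsv.gz
>>>>>>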
>>>>>> One possible approach might be to first do a metadata backup and
>>>>>> restore, and then pull the changed datavalues - though you would have
>>>>>> to do this through the API with UIDs or codes, as the metadata backup
>>>>>> and restore would have ignored the database identifiers.
>>>>>>
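>>>>>> For example, something like this - assuming the running version
>>>>>> supports the lastUpdated filter on the dataValueSets resource, and
>>>>>> with the UIDs, dates and credentials as placeholders to be checked
>>>>>> against the Web API documentation for your version:
>>>>>>
>>>>>> curl -u user:pass "https://server/api/dataValueSets.json?dataSet=DATASET_UID&orgUnit=ORGUNIT_UID&children=true&startDate=2015-01-01&endDate=2015-04-20&lastUpdated=2015-04-01"
>>>>>>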
>>>>>> Yet another approach would be to take a reduced Postgres backup
>>>>>> excluding the datavalue table, restore that, and then try to pull in
>>>>>> the delta of the datavalue table, something like you describe.
>>>>>>
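>>>>>> The reduced backup itself is straightforward with pg_dump's table
>>>>>> exclusion, e.g. (the database name is a placeholder):
>>>>>>
>>>>>> pg_dump -Fc -T datavalue -T 'analytics*' dhis2 > dhis2-reduced.dump
>>>>>>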
>>>>>> It's a tough problem.  Jason, what do you suggest?  Downloading the
>>>>>> full Postgres backup (with ephemeral tables like analytics excluded)
>>>>>> can be a serious challenge for many countries to do with regularity
>>>>>> and reliability.  So how best to go about incremental backup and
>>>>>> restore of data?
>>>>>>
>>>>>> On 20 April 2015 at 15:36, Muyepa A. <muyepaa@xxxxxxxxx> wrote:
>>>>>> > I had thought of the data import/export functionality; however,
>>>>>> > this will require two instances of Tomcat to be running (or one
>>>>>> > instance with one database at a time, with an associated downtime).
>>>>>> > Since upgrading to 2.18, when exported, the resultant download zip
>>>>>> > file says invalid for all formats (xml, json, excel) and is only
>>>>>> > 2kb in size regardless of the hierarchy I choose.
>>>>>> >
>>>>>> > Since one database is a copy of the other, I thought I could easily
>>>>>> > update one from the other as described.
>>>>>> >
>>>>>> > On Mon, Apr 20, 2015 at 3:48 PM, Jason Pickering
>>>>>> > <jason.p.pickering@xxxxxxxxx> wrote:
>>>>>> >>
>>>>>> >> It may work, but in general it will not, because the internal IDs
>>>>>> >> of different DHIS2 databases are not the same.  It requires both
>>>>>> >> databases to be in essentially the same state.  Better to use the
>>>>>> >> data import/export functionality if you can.
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >> Jason
>>>>>> >>
>>>>>> >>
>>>>>> >> On Mon, Apr 20, 2015 at 3:28 PM Muyepa A. <muyepaa@xxxxxxxxx>
>>>>>> >> wrote:
>>>>>> >>>
>>>>>> >>> Can I safely update DHIS2 database A with data from DHIS2
>>>>>> >>> database B, both on the same PostgreSQL server, by a cross query?
>>>>>> >>> B is a backup of A.
>>>>>> >>>
>>>>>> >>> From database B, I can retrieve and save the data that is required
>>>>>> >>> using:
>>>>>> >>>
>>>>>> >>> psql B -c "\copy (select * from datavalue where periodid='xxxx') TO STDOUT" > /tmp/data.tsv;
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> And to restore:
>>>>>> >>> psql A -c "\copy datavalue (x, y, z, ...) FROM /tmp/data.tsv"
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Is this recommended, and what else should be cross-transferred?
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Knut Staring
>>>> Dept. of Informatics, University of Oslo
>>>> Norway: +4791880522
>>>> Skype: knutstar
>>>> http://dhis2.org
>
>
>

