
dhis2-devs-core team mailing list archive

Re: ADX data import proposal

 

OK I relented and used a pipe.  Seems to work but not fully tested yet.

A few thoughts:

1.  The PipeImporter class could and probably should be generalized
with an interface, and could conceivably be used for other similar
tasks.  I am thinking for example of the gml job Halvdan referred to,
though I think there are deeper issues there which probably require
the gml import to be more nuanced than just piping to dxf.  It was
only initially designed for a one-off bootstrapping of orgunits.

2.  Currently I am constraining it to use a single-thread
executor.  There might be some value in expanding this but it's not
urgent.

3.  I have a method in DefaultADXDataService which dips from staxwax
reader to the underlying stream reader to gather all the attributes
for an element (and only the attributes).  If this method could be
part of the next staxwax library release I will remove it from here,
where it sits a bit ugly.

4.  It took me a while to figure out how to set up the unit test
scaffolding to test the import.  It is based on the DataValueSet test
code.  Making these tests more comprehensive would require
duplicating a lot of that metadata setup, which I haven't done.
Obviously it is a simple enough copy and paste, but does anyone have
any idea of a cleaner way to do this?  I'm thinking of a sort of
general dummy test store which could be set up once and reused across
tests.
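
For anyone following along, here is a minimal sketch of the pipe
arrangement from points 1 and 2: a PipedOutputStream/PipedInputStream
pair with the reading side on a single-thread executor.  This is not
the committed PipeImporter code.  The names PipeSketch, StreamProducer,
StreamConsumer and pipe are hypothetical, and running the consumer
(rather than the producer) on the worker thread is an assumption.  It
uses InputStream.readAllBytes, so it needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PipeSketch {

    /** Hypothetical stand-in for the step that writes dxf2 output. */
    interface StreamProducer {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Hypothetical stand-in for a service that imports from an InputStream. */
    interface StreamConsumer<T> {
        T readFrom(InputStream in) throws IOException;
    }

    /** Connects producer to consumer through a pipe and returns the consumer's result. */
    static <T> T pipe(StreamProducer producer, StreamConsumer<T> consumer) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try (PipedOutputStream out = new PipedOutputStream();
             PipedInputStream in = new PipedInputStream(out)) {
            // The consumer runs on the worker thread.  Closing the read end
            // when it finishes (or fails) stops the writer blocking forever
            // on a full pipe buffer.
            Future<T> result = executor.submit(() -> {
                try (InputStream source = in) {
                    return consumer.readFrom(source);
                }
            });
            try {
                // The producer runs on the calling thread.
                producer.writeTo(out);
            } finally {
                // Closing the write end signals end-of-stream to the reader.
                out.close();
            }
            return result.get();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        String dxf = "<dataValueSet/>";
        String received = pipe(
            out -> out.write(dxf.getBytes(StandardCharsets.UTF_8)),
            in -> new String(in.readAllBytes(), StandardCharsets.UTF_8));
        System.out.println(received); // prints <dataValueSet/>
    }
}
```

Memory use is bounded by the pipe's internal buffer rather than by the
size of the import.  One limitation of the sketch: if the consumer
fails early, the caller sees the writer's "Pipe closed" IOException
rather than the consumer's original exception.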

Bob

On 19 June 2015 at 09:20, Bob Jolliffe <bobjolliffe@xxxxxxxxx> wrote:
> Sure that will be an easy enough thing to refactor later.  There are a
> couple of sensible options.  For the moment I want to get it
> functional and I'll ensure the memory doesn't go pop.
>
>
> On 18 June 2015 at 20:54, Lars Helge Øverland <larshelge@xxxxxxxxx> wrote:
>> Hi okay, yes it's maybe not an ideal solution here. I think I would favor a
>> PipedOutputStream/PipedInputStream pair with a separate thread over an
>> in-memory DOM.
>>
>> Do we really need a separate threadpool? We fork off threads in many places
>> in the system already, e.g. with parallel analytics queries. I thought as long
>> as it's limited to one or a few per process it should be handled by the JVM.
>> But I might be wrong.
>>
>>
>>
>>
>>
>> On Thu, Jun 18, 2015 at 8:46 PM, Bob Jolliffe <bobjolliffe@xxxxxxxxx> wrote:
>>>
>>> Hi Lars
>>>
>>> The problem is the dataValueSetService requires an inputstream to
>>> feed off.  There are only 2 ways to provide an inputstream that I can
>>> think of: either create a pipe or buffer (eg with a string).
>>>
>>> Creating a pipe is doable but then you also need to create a separate
>>> thread to read it, which is another resource to manage (eg with a pool),
>>> and that seemed like more effort than it is worth.
>>>
>>> What I can do short term as a defensive measure is to place a limit on
>>> the number of datavalues which can be buffered for a single
>>> datavalueset.  That way it should not be possible to explode the
>>> memory.  I'll do that soon.
>>>
>>> Note that in "normal" use this should not be a problem as a single adx
>>> group corresponds to the data for one orgunit, for one period - what
>>> is envisaged typically is a single dataset's worth.
>>>
>>> The other "alternative" is not to use the datavalueSetService at all
>>> but just duplicate the code.
>>>
>>> Bob
>>>
>>> On 18 June 2015 at 15:22, Lars Helge Øverland <larshelge@xxxxxxxxx> wrote:
>>> > Hi Bob,
>>> >
>>> > as you say this creates a hard limit on memory. Now all it will take to
>>> > bring down a DHIS 2 instance is to submit a sufficiently large import
>>> > file. Seems like this will provide headaches for server admins ;) Can we
>>> > find a stream-based solution which scales well?
>>> >
>>> > Lars
>>> >
>>> >
>>> > On Thu, Jun 18, 2015 at 2:49 PM, Bob Jolliffe <bobjolliffe@xxxxxxxxx>
>>> > wrote:
>>> >>
>>> >> WIP committed and slight adjustment of strategy ...
>>> >>
>>> >> I was not comfortable with creating a new thread just to pipe from adx
>>> >> to dxf.
>>> >>
>>> >> So instead, for each adx group corresponding to a dataValueSet with
>>> >> orgUnit, period (and potentially attributeOptionCombo), I create a
>>> >> dataValueSet DOM document and present that to the dxf2 stream importer
>>> >> as a stream.  Given that this data is bound by a single orgunit and
>>> >> period I don't think the DOM document is going to break the memory
>>> >> bank.
>>> >>
>>> >> Basic conversion to dxf2 is working fine.
>>> >>
>>> >> Next task is to "implode" the categories.
>>> >>
>>> >> A luta Continua.
>>> >>
>>> >> On 12 June 2015 at 13:40, Bob Jolliffe <bobjolliffe@xxxxxxxxx> wrote:
>>> >> > Hi
>>> >> >
>>> >> > As you have seen I have already started to commit a few bits of code
>>> >> > in support of the ADX implementation.  I hadn't been planning to do
>>> >> > this so will proceed quite slowly, but let me outline the approach I
>>> >> > am considering for your comment and suggestion.
>>> >> >
>>> >> > 1.  Currently we have a datavalueset service which can import dxf2
>>> >> > data from an inputstream.
>>> >> >
>>> >> > 2.  I would like to use that existing service and place the adx
>>> >> > service as a thin veneer above it rather than create a lot of
>>> >> > duplicated code.
>>> >> >
>>> >> > 3.  The adx data importer would read its adx input from a stream and
>>> >> > convert that into a dxf2 stream.  The main tasks it would need to
>>> >> > perform are:
>>> >> > (i)  convert periods into dxf2 format
>>> >> > (ii) lookup catoptcombos and attributeoptioncombos for the dimensions
>>> >> > in the adx message
>>> >> > All other attributes and ImportOptions would be passed through
>>> >> > directly to the dxf2 datavalueset service.
>>> >> >
>>> >> > 4.  In order to present the resulting dxf2 to the service as an
>>> >> > InputStream it would have to use PipeReader/PipeWriter combination
>>> >> > (Something Lars will recall from earlier dxf1 code).  The equivalent
>>> >> > alternative would be to post the dxf2 datasets back out to the REST
>>> >> > endpoint but that seems wasteful and more awkward.
>>> >> >
>>> >> > Does that approach sound reasonable?
>>> >> >
>>> >> > I have some lingering uncertainty about the best way to deal with
>>> >> > ImportSummary.  The adx data is naturally grouped by orgunit/period.
>>> >> > So I would likely split the stream and post each as a separate dxf2
>>> >> > datavalueset.  So probably this would imply collecting the results
>>> >> > into an <ImportSummaries ... /> element.  ADX is currently silent on
>>> >> > the result message as it deliberately does not define the transaction
>>> >> > (just the message) so we have some latitude here to do whatever is
>>> >> > best.  The above is my best suggestion.
>>> >> >
>>> >> > Cheers
>>> >> > Bob
>>> >>
>>> >> --
>>> >> Mailing list: https://launchpad.net/~dhis2-devs-core
>>> >> Post to     : dhis2-devs-core@xxxxxxxxxxxxxxxxxxx
>>> >> Unsubscribe : https://launchpad.net/~dhis2-devs-core
>>> >> More help   : https://help.launchpad.net/ListHelp
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Lars Helge Øverland
>>> > Lead developer, DHIS 2
>>> > University of Oslo
>>> > Skype: larshelgeoverland
>>> > http://www.dhis2.org
>>> >
>>
>>
>>
>>
>>

