dhis2-devs team mailing list archive
Message #04256
Re: Regular expressions in data validation rules
Very good points. I was thinking, initially at least, to start with 4
and 1, in that order.
There are already many checks in place in the UI, but somehow
it feels that it should be possible to extend them and make them more
generic, to suit a particular implementation's needs. Could the rules
defined in the data integrity checks be reused at the UI level (and
other levels)? It feels like it is possible, although there may be
complications due to different regex flavors. The fourth alternative
seems like a quick win.
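On the regex flavor concern, one partial safeguard is to test-compile
a rule in each engine that will use it before the rule is saved. Below
is a minimal sketch for the Java side only, using java.util.regex. The
class and method names are hypothetical and not part of DHIS2. It
catches syntax-level incompatibilities only, not patterns that compile
in two engines but behave differently.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not an existing DHIS2 class.
final class RegexFlavorCheck {

    /** Returns true if the expression compiles under java.util.regex. */
    static boolean compilesInJava(String expression) {
        try {
            Pattern.compile(expression);
            return true;
        } catch (PatternSyntaxException e) {
            // The expression is not valid in the Java flavor.
            return false;
        }
    }

    public static void main(String[] args) {
        // Trailing-space pattern from this thread.
        System.out.println(compilesInJava("\\s+$"));
        // Deliberately broken pattern (unclosed group).
        System.out.println(compilesInJava("(ce|co"));
    }
}
```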
Data integrity checks serve a useful purpose, by allowing people to
enter some data, even if it may not be 100% correct. This is a
property of HMIS systems that I think we all face, namely that some
information is better than no information at all. For instance, you
can enter values beyond the min/max values, but there are checks there
to warn you. The same could be said of the functionality of regular
expressions in the data validation process. Allow people to enter
data, even though it may not be entirely correct (e.g. does not follow
the country's naming conventions, includes decimal places where there
should not be any, etc). These rules are often highly specific
to implementations. Placing regular expressions in the data integrity
checks, as a start, would seem fairly simple to implement, and would
offer some quick wins for better data quality.
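To make the "warn, do not reject" idea concrete, here is a minimal
sketch of a regex-based integrity rule in Java. It is an illustration
under stated assumptions, not an actual DHIS2 implementation: the class
name, the mustMatch flag and the sample org unit names are all invented
here. The two expressions are the ones quoted further down in this
thread. The check only reports offending values, so the data itself
stays in the database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical rule object, loosely following the fields proposed in the thread.
final class RegexIntegrityRule {
    final String name;
    final String description;
    final String resolution; // tells the user how to solve the problem
    final Pattern pattern;
    final boolean mustMatch; // true: value must contain a match; false: it must not

    RegexIntegrityRule(String name, String expression, String description,
                       String resolution, boolean mustMatch) {
        this.name = name;
        this.description = description;
        this.resolution = resolution;
        this.pattern = Pattern.compile(expression);
        this.mustMatch = mustMatch;
    }

    /** Returns the values that break the rule; nothing is modified or rejected. */
    List<String> findViolations(List<String> values) {
        List<String> violations = new ArrayList<>();
        for (String value : values) {
            // find() looks for a match anywhere in the value, like PostgreSQL's ~ operator.
            boolean found = pattern.matcher(value).find();
            if (found != mustMatch) {
                violations.add(value);
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        // Invented sample names, for illustration only.
        List<String> names = Arrays.asList(
                "lu Example Clinic ", "co Example Hospital", "Example Health Post");

        RegexIntegrityRule trailing = new RegexIntegrityRule(
                "Trailing spaces are not allowed", "\\s+$",
                "Name ends with whitespace", "Remove the trailing spaces", false);
        RegexIntegrityRule prefix = new RegexIntegrityRule(
                "Naming convention prefix", "^(ce|co|ea|ls|lu|no|nw|so|we) ",
                "Name lacks the expected prefix", "Rename the org unit", true);

        System.out.println(trailing.name + ": " + trailing.findViolations(names));
        System.out.println(prefix.name + ": " + prefix.findViolations(names));
    }
}
```

The mustMatch flag is a design choice of this sketch: it lets one rule
type cover both "this pattern is forbidden" (trailing spaces) and "this
pattern is required" (naming convention prefix).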
I agree that intercepting problems at the import level is important,
but as Bob highlights, it is costly in terms of processing. At a
personal level, I tend to want to get the data in the DB first, and
then try and clean it up, rather than trying to analyze all the
possible problems prior to a data import. I think there are good
arguments both ways, but in many cases, we have no control, except
when we do the import ourselves, over whether data has been
properly imported or not. 90% of the time here in Zambia, imported data
is pretty good, but it is that 10% that can often be resolved most
efficiently only by a human, at least when one thinks about the code
required to try and correct every single issue that may arise from a
particular naming convention, and whether someone follows it or not.
Regards,
Jason
On Mon, Feb 8, 2010 at 12:34 PM, Bob Jolliffe <bobjolliffe@xxxxxxxxx> wrote:
> Hi
>
> There are 4 places one could use these regex's:
> 1. in the browser - client side validation
> 2. in the framework action/interceptors
> (http://struts.apache.org/2.1.8.1/docs/validation.html)
> 3. in the object persist methods
> 4. post fact validation checks.
>
> There are lots of examples of validation with regex using javascript. Not
> much to say.
>
> Regarding 2 it is a natural way to proceed but it won't affect import which
> doesn't use the web interface.
>
> Regarding 3 we do need to be aware of those places where we bypass the
> object model. But where the object model is being used it is not difficult
> to validate with a regex on save. Of course we have to find the
> corresponding regex. That is really the first problem to solve. Where to
> find the regex within the model.
>
> Leaving values out of the picture for a while it might make sense to start
> with names. We have many named objects and the way we name them is
> frequently very important as the names also act as primary identifiers. We
> need somehow to add a class-wide string regex field for descendants of
> NamedObjects (you might want two - one for name and one for shortName, but
> maybe start with name). This way the regex should be available to clients of
> orgunit, dataelement, category etc.
>
> On importing from XML it is very natural and easy to do regular expression
> based validation using something like schematron which can validate against
> any xpath expression - but regex is only available in XPath2 which means
> using saxon and there are some concerns about introducing a saxon
> dependency. (We might re-look at that). Though there is also another
> reason to perhaps not use regex validation on dataValues. It will slow
> things enormously for large imports.
>
> It is also possible to do regular expression matching at the schema level
> (using either RelaxNG or XSD) and validate via schema. This might be the
> most viable way to go though it would imply that the Zambia dxf schema would
> have slightly different constraints to say the Tajik one. And these schema
> variations would have to be auto-generated somehow based on the local
> database.
>
> Regards
> Bob
>
> On 8 February 2010 08:57, Jason Pickering <jason.p.pickering@xxxxxxxxx>
> wrote:
>>
>> Hi Murod,
>>
>> This, of course, is one particular trivial example and was provided
>> to illustrate a point.
>>
>> I totally agree, this particular example could be solved through
>> JavaScript validation on the client, and it may already be there in
>> 2.0. I have found this particular example by importing data from 1.4,
>> where organization units are allowed to have trailing spaces. I think
>> this is not really a one-off issue, as many people may need to import
>> data from external systems, which may or may not have this particular
>> validation enforced.
>>
>> What I am trying to get at is that regular expressions could be used
>> to expand the scope of the current data integrity checks, by enforcing
>> certain patterns on the data (which in some cases could also be
>> enforced through JavaScript in the UI). Of course, if we can
>> do it at the UI level, great, but it may not work in all cases,
>> especially when receiving data from external systems. This is where I
>> think the data integrity checks come into play. For instance, as I
>> mentioned in the specs, I need to find all organizational units that
>> do not correspond to the naming conventions here in Zambia. I can do
>> this with this...
>>
>> SELECT name from organisationunit where name !~
>> '^(ce|co|ea|ls|lu|no|nw|so|we) '
>>
>> Well, I found 47 which do not correspond to the naming convention. I
>> have made my dislike of the supposed best practice naming conventions
>> known in earlier threads, but with the implementation of regex for
>> checking these conventions, at least we could enforce them, even if it
>> is ex post facto.
>>
>> Again, these are all examples, and it is really impossible to
>> predict what they may be, thus the need for flexible rules, built by
>> administrators/users, and then applied during data integrity checks
>> (and/or during data entry).
>>
>>
>>
>> Regards,
>> Jason
>>
>>
>>
>>
>> On Mon, Feb 8, 2010 at 9:55 AM, Murodullo Latifov
>> <murodlatifov@xxxxxxxxx> wrote:
>> > Hi Jason,
>> >
>> > Looks like a one-time task, if I understood you correctly? You want
>> > to clean data already in the database, like data integrity checking. Why
>> > not make it clean at the very beginning, when a particular record is being
>> > captured? For this one could use regexp in JavaScript on the client side
>> > too. As for leading and trailing spaces, " string ".trim() should do
>> > before passing to the database.
>> >
>> > regards,
>> > murod
>> >
>> >
>> >
>> > ----- Original Message ----
>> > From: Jason Pickering <jason.p.pickering@xxxxxxxxx>
>> > To: Hieu Dang Duy <hieu.hispvietnam@xxxxxxxxx>
>> > Cc: dhis2-devs <dhis2-devs@xxxxxxxxxxxxxxxxxxx>
>> > Sent: Mon, February 8, 2010 1:05:27 PM
>> > Subject: Re: [Dhis2-devs] Regular expressions in data validation rules
>> >
>> > Hi Hieu,
>> > Yes, I am actively fishing for a developer to implement this, as it
>> > will really save me a huge amount of work in trying to clean up data.
>> >
>> > I have no idea really how it would be implemented, other than that
>> > java.util.regex should be able to be used, but let me give it a try at
>> > a better specification. I do not think it should be so difficult
>> > either.
>> >
>> > I am thinking of something like this....
>> >
>> > The user would create a regular expression for later assignment to a
>> > database object. The user would select a database table (object) and
>> > field for validation. For instance, let's say we want to validate that
>> > there are no trailing spaces in an organization name.
>> >
>> > So, we would create a rule called "Trailing spaces are not allowed"
>> >
>> > We would create this rule, and assign a description and a regular
>> > expression to it.
>> >
>> > in this case, it would probably be something really simple like '\s+$'
>> >
>> > Now, I have no idea how to do this in java, but I assume this would be
>> > really simple, something like this query in Postgresql.
>> >
>> > SELECT name from organisationunit where name ~*('\s+$')
>> >
>> > Wow, I found 571 orgunits in my organisationunit table with trailing
>> > spaces. Cool.
>> >
>> > So, I think we need two objects.
>> >
>> > 1) A persistence object that stores the following fields for the
>> > RegexExpression
>> >
>> > a) regexid
>> > b) name
>> > c) expression
>> > d) description
>> > e) resolution description (telling the user how to solve this problem)
>> >
>> > 2) A table to assign regular expressions to database objects.
>> >
>> > a) regexid
>> > b) table
>> > c) field
>> >
>> > We could maybe reuse this rule on the datavalue table, to determine if
>> > any values have been stored with trailing spaces.
>> >
>> > Yeah, it's very easy I think. I would do it myself if I knew a lick of
>> > Java. :)
>> >
>> > Best regards,
>> > Jason
>> >
>> >
>> > On Sun, Feb 7, 2010 at 7:36 PM, Hieu Dang Duy
>> > <hieu.hispvietnam@xxxxxxxxx> wrote:
>> >> Hi all,
>> >>
>> >> I've no idea about using RegEx for validating data in DHIS2. Just a
>> >> small comment: I use RegEx often myself, and my feeling is that
>> >> applying it in your code, i.e. in JavaScript or Java, is not easy
>> >> but not too difficult either.
>> >> With RegEx, we can easily control anything we want to force the
>> >> user to enter, whether data (text, numbers) or something else (a
>> >> file name is an example).
>> >> Let's try!
>> >>
>> >> Thanks !
>> >>
>> >> On Sun, Feb 7, 2010 at 10:24 PM, Jason Pickering
>> >> <jason.p.pickering@xxxxxxxxx> wrote:
>> >>>
>> >>> https://blueprints.launchpad.net/dhis2/+spec/regex-validation
>> >>>
>> >>> I have updated the blueprint on regular expression use in data
>> >>> validation rules. This would really make my life (and, I suspect,
>> >>> others' lives) a lot easier. As long as we are using naming
>> >>> conventions, let's at least enforce them somehow.
>> >>>
>> >>> For discussion.
>> >>>
>> >>> Jason
>> >>>
>> >>> _______________________________________________
>> >>> Mailing list: https://launchpad.net/~dhis2-devs
>> >>> Post to : dhis2-devs@xxxxxxxxxxxxxxxxxxx
>> >>> Unsubscribe : https://launchpad.net/~dhis2-devs
>> >>> More help : https://help.launchpad.net/ListHelp
>> >>
>> >>
>> >>
>> >> --
>> >> Hieu.HISPVietnam
>> >> Good Health !
>> >>
>> >
>> >
>> >
>> >
>> >
>> >
>>
>
>