launchpad-dev team mailing list archive

Re: RFC: Bug import ideas

To: Stuart Bishop <stuart.bishop@xxxxxxxxxxxxx>
From: Gavin Panella <gavin.panella@xxxxxxxxxxxxx>
Date: Thu, 11 Mar 2010 13:35:29 +0000
Cc: launchpad-dev@xxxxxxxxxxxxxxxxxxx, Graham Binns <graham.binns@xxxxxxxxxxxxx>
In-reply-to: <g6lsthu5ky7egusigbUYAxe124vaj_firegpg@mail.gmail.com>
Sender: gavinpanella@xxxxxxxxx

On 10 March 2010 07:18, Stuart Bishop <stuart.bishop@xxxxxxxxxxxxx> wrote:
...
> If it is just inserting new data rather than modifying existing rows it
> should be ok at the moment. You say 'almost all new data' though, which is
> the catch. Even if it is all new data, that doesn't mean it will be fine in
> the future (eg. we add an ON INSERT trigger to update some cache
> information). It also doesn't protect us from long running imports, which we
> will kill off to avoid causing database bloat (garbage cannot be cleared up
> in the database by VACUUM until it is older than the longest running
> transaction).

Looking through bugimport.py, the only occurrence I see of
manipulating existing data is a call to
email_address.account.createPerson(), when a user has an account in
SSO (?) but not in Launchpad.

If this is a problem, we could, for example, identify all bugs to be
imported that refer to users with SSO and not LP, then process these
last of all. Or we could say that dry runs are not possible when an
import contains such users.
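
A rough sketch of the "process these last" idea, in case it helps. The
names are made up for illustration and are not from bugimport.py: each
bug is assumed to be a dict with a `people` list of email addresses, and
`needs_person_creation` is a hypothetical predicate that says whether an
address has an SSO account but no Launchpad person.

```python
def order_for_import(bugs, needs_person_creation):
    """Return bugs with those that would create a person moved to the end.

    Bugs that only insert new rows come first; bugs that would trigger
    person creation come last. Order within each group is preserved.
    """
    insert_only = []
    creates_person = []
    for bug in bugs:
        # `people` is an assumed field name, not the real import schema.
        if any(needs_person_creation(email) for email in bug["people"]):
            creates_person.append(bug)
        else:
            insert_only.append(bug)
    return insert_only + creates_person
```

The same partition would also tell us up front whether a dry run should
be refused, since a non-empty second group means existing data would be
touched.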

> If the goal here is to avoid writing the cache file, I'd suggest just using
> another method to detect an already imported bug (eg. the bug nickname is
> set by the importer to allow old bug ids to map to launchpad bug ids).

Avoiding the cache file is one thing, certainly. The bug importer does
set the bug nickname, so we should change it to check for that
instead of using the cache file.
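
Something along these lines, as a sketch only. I'm assuming a nickname
of the form "<product>-<old id>", which may not match what the importer
really generates, and `nickname_exists` is a hypothetical callable
standing in for the database lookup.

```python
def nickname_for(product_name, old_bug_id):
    """Build the nickname for an imported bug.

    Assumed scheme for illustration; the importer's real one may differ.
    """
    return "%s-%s" % (product_name, old_bug_id)


def already_imported(product_name, old_bug_id, nickname_exists):
    """Return True if a bug with the expected nickname already exists.

    `nickname_exists` takes a nickname string and returns a bool. In the
    importer it would query the database rather than a cache file.
    """
    return nickname_exists(nickname_for(product_name, old_bug_id))
```

Passing the lookup in as a callable keeps the check easy to test without
a database.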

The other reason is to allow anyone to do dry runs, so that we don't
have to, and as a step towards completely self-service bug imports.
Allowing trial runs feels like it's important to that, but maybe it's
not.

> The other points are valid rationales though. Perhaps we should import into
> temporary tables and, on success, move all the data from the temporary
> tables into the real ones. I'd suggest not worrying about these issues
> though - better validation of the import file before attempting the import
> would seem to be a better approach. For the database import to fail, you
> would need to violate database constraints or attempt to link to a
> non-existent row, and there are not that many constraints to check and I
> don't think there are any foreign key references that might get removed
> mid-run.

I don't think temporary tables are the way to go, because it's the
constraints and foreign keys that need to be exercised. The temp
tables could have those same constraints I guess... but it seems like
a lot of work.

I agree that validation is the way to go. Two validators might work
well: one static, and one that runs when the database is available. The
static validator could be packaged stand-alone so that it's easy for
the developers of bug exporters to reuse.
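
To give an idea of what I mean by the static one, here is a minimal
sketch. It assumes the import file has already been parsed into a list
of dicts; the field names in REQUIRED_FIELDS are hypothetical and not
taken from the real import format. It needs no database, so it could
ship on its own.

```python
# Hypothetical required fields; the real import format may differ.
REQUIRED_FIELDS = ("id", "title", "reporter")


def validate_static(bugs):
    """Check parsed bugs without touching the database.

    Returns a list of error strings; an empty list means no problems
    were found by these checks.
    """
    errors = []
    seen_ids = set()
    for index, bug in enumerate(bugs):
        # Every required field must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if bug.get(field) in (None, ""):
                errors.append("bug #%d: missing %s" % (index, field))
        # Old bug ids must be unique within one import file.
        bug_id = bug.get("id")
        if bug_id is not None:
            if bug_id in seen_ids:
                errors.append("bug #%d: duplicate id %r" % (index, bug_id))
            seen_ids.add(bug_id)
    return errors
```

The database-backed validator would then cover what this cannot see,
such as constraints and references to existing rows.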

Gavin.

References

Re: RFC: Bug import ideas
From: Francis J. Lacoste, 2010-03-09
Re: RFC: Bug import ideas
From: Stuart Bishop, 2010-03-10