launchpad-dev team mailing list archive
Message #03633
Re: Suggestions for searching bug attachments
On Tue, Jun 22, 2010 at 12:07 AM, Kamran Riaz Khan
<krkhan@xxxxxxxxxxxxxx> wrote:
> Hello all,
>
> One of the areas I am working for Summer of Code is attachment
> searching. Basically the aim is to let an Arsenal user do something like
> this:
>
> "Search for <text> in attachments of all bugs that exist in source
> packages subscribed by <team>"
This pretty much has to be done using some sort of index. 'Attachments
of all bugs that exist in source packages subscribed by <team>' can run
to tens of thousands of attachments totaling gigabytes of data. Even if
you only look at the first few kilobytes of each attachment, that is
still hundreds or maybe thousands of librarian requests that need to
be made. For some real numbers: for the ubuntu-bugs team, that query
matches 32000 attachments averaging 128 KB in size, totaling about
4 GB of data.
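As a quick back-of-the-envelope check of those figures, here is a
minimal sketch. It assumes binary units (1 GB = 1024 * 1024 KB), and
the variable names are illustrative only.

```python
# Rough size estimate for the ubuntu-bugs query described above.
ATTACHMENTS = 32_000   # attachments matched by the query
AVG_SIZE_KB = 128      # average attachment size in KB

total_kb = ATTACHMENTS * AVG_SIZE_KB
total_gb = total_kb / (1024 * 1024)  # KB -> GB, binary units (an assumption)

print(f"total data: roughly {total_gb:.1f} GB")
```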
I don't think our existing database full text search will be useful
here. It is designed for finding whole words in text, but your examples
need some sort of substring search. An external search engine might be
better, such as Google or a Google appliance, but those are still word
searches to some extent.
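To illustrate the difference between word search and substring search,
here is a small sketch. The trigram index is one common technique for
substring search; it is not something proposed in this thread, and the
sample documents and function names are made up for the example.

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of 3-character substrings of text (lowercased)."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def build_word_index(docs):
    """Word-based inverted index: whitespace-separated word -> doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def build_trigram_index(docs):
    """Trigram inverted index: 3-character substring -> doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index

def substring_search(query, docs, tri_index):
    """Find docs containing query as a substring, using trigrams to prefilter."""
    grams = trigrams(query)
    if grams:
        # Only docs containing every trigram of the query can match.
        candidates = set.intersection(*(tri_index.get(g, set()) for g in grams))
    else:
        # Query shorter than 3 characters: fall back to scanning everything.
        candidates = set(docs)
    q = query.lower()
    # Verify candidates, since trigram overlap alone can give false positives.
    return sorted(d for d in candidates if q in docs[d].lower())

# Made-up attachment contents, for illustration only.
docs = {
    1: "Xorg.0.log: (EE) intel(0): Failed to load module",
    2: "dmesg: segfault at 0 ip",
}
word_index = build_word_index(docs)
tri_index = build_trigram_index(docs)

word_hits = sorted(word_index.get("intel(0", set()))   # no whole word matches
substring_hits = substring_search("intel(0", docs, tri_index)
print(word_hits, substring_hits)
```

The word index finds nothing for "intel(0" because the indexed token is
"intel(0):", while the trigram lookup narrows the candidates and the
final check confirms the substring match.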
I'm really not sure of the best way to tackle this problem. The
Librarian data is not stored in the database because there are
multiple TB of files. The team membership information is in the
relational database. There are no indexes anywhere on the contents of
the Librarian files. I think we need some sort of external search
engine (I don't think we want to integrate this into the Librarian
core). Ideally we could feed it subscriber information, allowing it to
determine for itself the set of 32000 attachments that ubuntu-bugs has
access to, rather than us having to calculate this from the relational
db and then feed the ids to the search engine.
Whatever approach is chosen certainly needs sign-off from the LP team
leads, as the resource requirements are non-trivial and someone needs
to pay for the hardware.
An alternative approach would be to keep the search separate from
Launchpad. A team would host their own search engine somewhere. A
Launchpad API script would be run regularly, pulling new attachments
meeting the team's criteria from the Librarian and feeding them into
the search engine.
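The shape of such a script might look like the sketch below. The real
Launchpad API calls are deliberately left out: list_attachment_ids,
fetch_attachment and index_document are hypothetical placeholders for
"ask Launchpad which attachments match the team's criteria", "download
one from the Librarian" and "hand it to the team's search engine".

```python
def sync_attachments(list_attachment_ids, fetch_attachment,
                     index_document, seen_ids):
    """Index attachments not seen on earlier runs; return how many were added.

    All three callables are hypothetical stand-ins, not real Launchpad
    or search engine APIs.
    """
    added = 0
    for att_id in list_attachment_ids():
        if att_id in seen_ids:
            continue  # already indexed on a previous run
        index_document(att_id, fetch_attachment(att_id))
        seen_ids.add(att_id)
        added += 1
    return added

# In-memory stand-ins so the sketch runs without any network access.
librarian = {101: "first attachment text", 102: "second attachment text"}
search_engine = {}
seen = set()

def run_once():
    return sync_attachments(
        list_attachment_ids=lambda: list(librarian),
        fetch_attachment=librarian.__getitem__,
        index_document=search_engine.__setitem__,
        seen_ids=seen,
    )

first_run = run_once()    # indexes both existing attachments
second_run = run_once()   # nothing new, so nothing is fetched again
librarian[103] = "a newly uploaded attachment"
third_run = run_once()    # picks up only the new attachment
print(first_run, second_run, third_run)
```

Tracking the ids already seen keeps each regular run cheap, since only
new attachments are fetched from the Librarian.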
--
Stuart Bishop <stuart@xxxxxxxxxxxxxxxx>
http://www.stuartbishop.net/