← Back to team overview

launchpad-dev team mailing list archive

Re: "subscribe to search": implementation questions

 

On Aug 12, 2010, at 7:55 AM, Abel Deuring wrote:

> Hi Robert, hi Stuart,
> 
> I started to work on
> https://dev.launchpad.net/LEP/BetterBugSubscriptionsAndNotifications .
> The basic idea is to allow filtering of emails from bug subscriptions by
> the same search crietria which we already use here:
> https://bugs.edge.launchpad.net/malone/+bugs?advanced=1
> 
> Users could for example click a button "subscribe to this search" on web
> pages that show bug search results; notifications for any bugs matching
> the given search would be sent to the subscriber.
> 
> I could come up with two implementation ideas, but I don't like both of
> them very much...
> 
> (1) Add a table SubscriptionFilter with columns like bugtask_status,
> bugtask_importance, bugtask_assignee, bugtask_reporter, milestone etc;
> bug notifications would be sent if all non-null values of a
> SubscriptionFilter row match those of the given bug[task]. The main
> problem this approach: To represent a subscription like "all valid open
> bugs with importance High or Critical for milestones A or B" we would
> need 12 rows for all combinations of the status values Confirmed,
> Triaged, In Progress, two importance values and two milestone values...
> An alternative is to use several tables, like SubscriptionFilterStatus,
> SubscriptionFilterImportance etc. Another problem: If the filter
> specifies a full text search, we have the interesting query "give me all
> rows which define a full text search that matches the current bug".
> 
> (2) Just store some representation of the search parameters, for example
> the query part of the regular search URL, in a text column of a table
> SubscriptionFilter. If a subscription is associated with a filter, we
> call BugTaskSet.search() with the filter parameters and we pass
> additionally the ID of the "current" bug; if the the result set is not
> empty, the bug notification mail is sent. So, one or more SQL queries
> for each "filtered subscription", but since we have a WHERE clause like
> "bug.id=N", it should be quite fast. Also, we can be fairly sure that
> subscribers receive mails about exactly those bugs they would see on a
> search results page. (A crazy variant would be to squeeze
> BugTaskSet.search() into a stored procedure so that we can call it
> directly in the query "find all subscriptions for the current bug
> notification"...)
> 

Having recently come from a search company (job searches are not all that
different than bug/project/etc. searches. :), I can offer some insight,
as over 6 years, we solved this problem 3 different ways.

Method 1 above is pretty similar to the final solution we settled on,
though it was fraught with one big problem. Every time we changed search,
we had to go through all of the distinct search keys and make sure we
didn't break peoples' search. Or we'd forget that step (often) and just
break lots of saved searches.

The biggest pitfall here was that if we changed a field from structured to
free form, we had to go generate free-form representations of the old
structured search. One example of this was location. For a long time,
location was City, State/Province, Country, Zip/PostalCode. This became 
"Location" and search location providers (such as google maps, and others)
were used to interpret it. The re-interpretations can also be done at 
run-time, whenever one of the old fields is encountered, but I prefer to 
keep code simple, and the simplest way to do that is to have less
variation in your stored data.

Method 2 has the same problem, but now instead of having things in
predictible, structured storage where its easy to find any rows you may 
break, you now have to go digging/regexing through all the URL query parts.
One thing about this that mitigates the problem, is that often times you
will keep "old style" searches working for a while, so that links to old
searches continue to work, so you can simply use whatever method you use
there to re-process these searches.

A third method, which was only done as an experiment and never rolled out, 
was to store searches as documents in CouchDB. This allowed flexibility
in the schema, so while it resembled method 2, instead of a query part,
it was still "structured" and had indexes for querying. It also had a single
read for every search, rather than, as you put it, 12 rows for a single
search. It was also more logical to store and retrieve a full data structure,
rather than try to break it up and reassemble it from relational rows.

The hottest part of this was of course that you could cache the actual result
in the couchdb document, and simply tag it with a date/time stamp, and then
only refresh the result when it was out of date. This is cool because for a
user viewing their subscriptions its a very low cost to show them the results.

The only reason it wasn't pursued further was CouchDB's, at the time, dismal 
performance. This was almost 2 years ago, and I'm sure by now CouchDB has
gotten much better.

The "12 rows for a single search" mentioned above isn't all that bad. As I
understand it, if you do 12 inserts in a row to a single table in PostgreSQL,
at that point, those 12 rows are physically stored in serial. So at least
that row will remain fast until the table is re-clustered (apologies for
my terminology, I am more familiar w/ mysql than postgres).


Follow ups

References