
launchpad-dev team mailing list archive

Re: "subscribe to search": implementation questions


On 12.08.2010 18:16, Clint Byrum wrote:
> 
> On Aug 12, 2010, at 7:55 AM, Abel Deuring wrote:
[...]
> Having recently come from a search company (job searches are not all that
> different from bug/project/etc. searches. :), I can offer some insight,
> as over 6 years, we solved this problem 3 different ways.

Cool, I suspected that I had missed something obvious while thinking
about the "search subscriptions", but now it seems that this is a
problem for which it is hard to come up with a really good solution.

> 
> Method 1 above is pretty similar to the final solution we settled on,
> though it was fraught with one big problem. Every time we changed search,
> we had to go through all of the distinct search keys and make sure we
> didn't break peoples' search. Or we'd forget that step (often) and just
> break lots of saved searches.

yes, that's what I thought too; OTOH we don't change the search
parameters (or the search behaviour) that often.
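For illustration, the "make sure we didn't break peoples' searches" step Clint describes could be a check along these lines. This is only a sketch; the key names and data layout are invented, not Launchpad code:

```python
# Hypothetical consistency check run after a search-form change (Method 1):
# walk every stored search and flag the keys the new code no longer
# understands, instead of silently breaking saved searches.
SUPPORTED_KEYS = {"importance", "status", "tag", "assignee"}  # invented

saved_searches = {
    "sub-1": {"importance": "High", "tag": "ui"},
    "sub-2": {"milestone": "1.0"},  # a key dropped by the new search form
}

def find_broken(searches):
    """Map each subscription id to its no-longer-supported keys."""
    return {sub_id: sorted(set(params) - SUPPORTED_KEYS)
            for sub_id, params in searches.items()
            if set(params) - SUPPORTED_KEYS}

print(find_broken(saved_searches))  # {'sub-2': ['milestone']}
```

Running such a check as part of every change to the search form would at least turn "we'd forget that step" into a hard failure.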

> 
> The biggest pitfall here was that if we changed a field from structured to
> free form, we had to go generate free-form representations of the old
> structured search. One example of this was location. For a long time,
> location was City, State/Province, Country, Zip/PostalCode. This became 
> "Location" and search location providers (such as google maps, and others)
> were used to interpret it. The re-interpretations can also be done at 
> run-time, whenever one of the old fields is encountered, but I prefer to 
> keep code simple, and the simplest way to do that is to have less
> variation in your stored data.
> 
> Method 2 has the same problem, but now instead of having things in
> predictable, structured storage where it's easy to find any rows you may
> break, you now have to go digging/regexing through all the URL query parts.
> One thing that mitigates the problem is that oftentimes you will keep
> "old style" searches working for a while, so that links to old searches
> continue to work; you can then simply use whatever method you use there
> to re-process the stored searches.

Right, there is a risk that we will have to do some maintenance work on
stored queries. On the other hand, that would be a good reminder that
updating the stored query parameters would also break URLs for bug
searches which people have bookmarked ;)
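The re-processing Clint describes for Method 2 could look roughly like this minimal sketch; the parameter rename is invented purely for illustration:

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical rename: an old "field.statusexplanation" parameter becomes
# "field.status_note" in a new search UI, so stored query strings (and
# bookmarked URLs) must be reinterpreted.
RENAMED = {"field.statusexplanation": "field.status_note"}

def upgrade_query(stored):
    """Re-process an old-style stored query string so that stored
    searches and bookmarked URLs keep working after a rename."""
    params = parse_qs(stored)
    upgraded = {RENAMED.get(key, key): values
                for key, values in params.items()}
    return urlencode(upgraded, doseq=True)

old = "field.statusexplanation=timeout&field.tag=ui&field.tag=crash"
print(upgrade_query(old))
```

The same shim can serve both the stored subscriptions and incoming requests on old-style URLs, which is exactly the mitigation mentioned above.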

> 
> A third method, which was only done as an experiment and never rolled out, 
> was to store searches as documents in CouchDB. This allowed flexibility
> in the schema, so while it resembled method 2, instead of a query part,
> it was still "structured" and had indexes for querying. It also had a single
> read for every search, rather than, as you put it, 12 rows for a single
> search. It was also more logical to store and retrieve a full data structure,
> rather than try to break it up and reassemble it from relational rows.
> 
> The hottest part of this was of course that you could cache the actual result
> in the CouchDB document, and simply tag it with a date/time stamp, and then
> only refresh the result when it was out of date. This is cool because for a
> user viewing their subscriptions it's a very low cost to show them the results.
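The caching idea can be sketched without CouchDB itself, using a plain dict as a stand-in for the document (the id, field names, and refresh interval are all invented):

```python
import time

# Stand-in for a CouchDB document: the search stays structured, and the
# last result is cached alongside it with a timestamp.
doc = {
    "_id": "subscription-42",  # invented document id
    "criteria": {"importance": ["High"], "tag": ["ui"]},
    "cached_result": None,
    "cached_at": 0.0,
}

MAX_AGE = 300  # seconds; refresh the cached result after 5 minutes

def run_search(criteria):
    # Placeholder for the real (expensive) search.
    return ["bug #1", "bug #7"]

def get_results(doc, now=None):
    """Serve the cached result if it is fresh, else re-run the search
    and update the document."""
    now = time.time() if now is None else now
    if doc["cached_result"] is None or now - doc["cached_at"] > MAX_AGE:
        doc["cached_result"] = run_search(doc["criteria"])
        doc["cached_at"] = now
    return doc["cached_result"]
```

Viewing one's subscriptions then costs a single document read per search, as described above.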

Could you explain a bit more how CouchDB would be used? Would query
parameters be mapped to search results? Perhaps I am somewhat slow, but
I don't see how this would help for the case "does bug notification X
match the search criteria specified by subscriber Y".

> 
> The only reason it wasn't pursued further was CouchDB's, at the time, dismal 
> performance. This was almost 2 years ago, and I'm sure by now CouchDB has
> gotten much better.
> 
> The "12 rows for a single search" mentioned above isn't all that bad. As I
> understand it, if you do 12 inserts in a row to a single table in PostgreSQL,
> at that point, those 12 rows are physically stored in serial. So at least
> that row will remain fast until the table is re-clustered (apologies for
> my terminology, I am more familiar w/ mysql than postgres).

The expansion of three values for parameter 1 and two values each for
parameters 2 and 3 into twelve rows is simply unaesthetic ;) But storing
the selected values of parameter 1 in table 1, the values of parameter 2
in table 2, etc., and joining all these tables may also be sufficiently
fast in queries.
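A minimal sketch of that per-parameter layout, using SQLite in place of PostgreSQL (all table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per search parameter instead of a 3 x 2 x 2 = 12-row
    -- cross product.
    CREATE TABLE sub_importance (sub_id INTEGER, importance TEXT);
    CREATE TABLE sub_status     (sub_id INTEGER, status TEXT);
    CREATE TABLE sub_tag        (sub_id INTEGER, tag TEXT);
""")
conn.executemany("INSERT INTO sub_importance VALUES (1, ?)",
                 [("Critical",), ("High",), ("Medium",)])
conn.executemany("INSERT INTO sub_status VALUES (1, ?)",
                 [("New",), ("Triaged",)])
conn.executemany("INSERT INTO sub_tag VALUES (1, ?)",
                 [("ui",), ("crash",)])

# Find subscriptions matching one concrete bug notification by joining
# the per-parameter tables: 3 + 2 + 2 = 7 rows stored instead of 12.
matching = conn.execute("""
    SELECT DISTINCT i.sub_id
      FROM sub_importance i
      JOIN sub_status s ON s.sub_id = i.sub_id
      JOIN sub_tag t    ON t.sub_id = i.sub_id
     WHERE i.importance = ? AND s.status = ? AND t.tag = ?
""", ("High", "Triaged", "ui")).fetchall()
print(matching)  # [(1,)]
```

The row count now grows with the sum of the selected values per parameter rather than their product.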

Abel


