zeitgeist team mailing list archive

Re: Fwd: RFC: Database Schema Changes Blueprint

To: Siegfried Gevatter <rainct@xxxxxxxxxx>
From: Mikkel Kamstrup Erlandsen <mikkel.kamstrup@xxxxxxxxx>
Date: Sun, 11 Oct 2009 15:26:04 +0200
Cc: zeitgeist@xxxxxxxxxxxxxxxxxxx
In-reply-to: <357b51820910100447t75bde2aetc0dec4964ed6a955@mail.gmail.com>

2009/10/10 Siegfried Gevatter <rainct@xxxxxxxxxx>:
> ---------- Forwarded message ----------
> From: Siegfried Gevatter <rainct@xxxxxxxxxx>
> Date: 2009/10/10
> Subject: Re: [Zeitgeist] RFC: Database Schema Changes Blueprint
> To: Mikkel Kamstrup Erlandsen <mikkel.kamstrup@xxxxxxxxx>
>
>
> 2009/10/10 Mikkel Kamstrup Erlandsen <mikkel.kamstrup@xxxxxxxxx>:
>> Ok. I get the idea now - the original intent of origin was not what
>> you describe though.
>>
>> The idea with origin was as follows: If I visit
>> http://youtube.com/v?7da84bdksy this would also be the URI of the item
>> in question, but the origin of the item would be http://youtube.com.
>> The reason we want to extract the origin in a rigorous manner (and not
>> simply use some prefix-matching on query time) is that we want to be
>> able to cluster events based on their origins. "Which youtube videos
>> have I watched lately?". Or ask the more general question "what do I
>> usually do after watching a youtube video?".
>
> Ah, I see now, but I'm still not convinced we need it (looks like data
> duplication). The first use case can already be achieved using
> "http://youtube.com/%" as URI filter in FindEvents. That's more
> flexible than just having the host name.

Here you are assuming that URIs behave in a tree/prefix-friendly way.
This cannot be assumed in general. For our purposes URIs should be
treated as opaque identifiers (i.e. we cannot assume any a priori
structure in them). Hence we need to store the URI of the origin
explicitly.
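
To illustrate the difference between a prefix filter and a stored
origin, here is a minimal sketch using Python's sqlite3 module. The
"event" table, its two columns, and the sample rows are hypothetical
and much simpler than the real schema; the origin values are assumed
to be supplied by whoever logs the event.

```python
import sqlite3

# Hypothetical, simplified table; not the actual Zeitgeist schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (uri TEXT, origin TEXT)")
conn.executemany(
    "INSERT INTO event VALUES (?, ?)",
    [
        ("http://youtube.com/v?7da84bdksy", "http://youtube.com"),
        # Same origin, but the URIs do not share the prefix:
        ("http://www.youtube.com/v?abc", "http://youtube.com"),
        ("urn:example:video:42", "http://youtube.com"),
        # Unrelated item
        ("file:///tmp/notes.txt", "file:///tmp"),
    ],
)

def count(sql, arg):
    """Run a single-parameter COUNT(*) query and return the number."""
    return conn.execute(sql, (arg,)).fetchone()[0]

# Prefix matching only finds URIs that happen to start with the prefix.
by_prefix = count("SELECT COUNT(*) FROM event WHERE uri LIKE ?",
                  "http://youtube.com/%")

# Matching on the stored origin makes no assumption about URI structure.
by_origin = count("SELECT COUNT(*) FROM event WHERE origin = ?",
                  "http://youtube.com")

# Clustering events by origin becomes a plain GROUP BY.
clusters = dict(conn.execute(
    "SELECT origin, COUNT(*) FROM event GROUP BY origin"))
```

With these made-up rows the prefix filter misses two of the three
events that share the origin, while the origin column catches all of
them.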

>>>> Questionable: Remove the app table altogether?
>>>> [...] One less table can save us a SQL JOIN.
>>> No, if we get two things from the same table we still need two joins
>>> as the stuff is in different rows. Further, this would increment disk
>>> space usage.
>>
>> Storing maybe 100 apps in the item table is not any significant
>> overhead space-wise I believe. If we stored 25.000 then maybe, but
>> that is way beyond realistic.
>
> Right, the space isn't really the reason why I'm against it, I just
> see no benefit in this. The JOIN is still needed anyway (and it
> probably becomes slower as there is way more stuff in "item" than in
> "app").

You are right about the JOIN. It still simplifies our data model a bit
though... It is more or less a needless complication to put apps in
their own table, and we are changing the DB structure anyway...
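
A rough sketch of the two layouts, again with Python's sqlite3 module.
All table and column names here are made up for illustration and are
not the proposed schema. It shows that looking up both the item and
the app of an event takes two JOINs either way; the merged variant
just joins "item" twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Variant A (hypothetical): apps live in their own table.
    CREATE TABLE item_a (id INTEGER PRIMARY KEY, uri TEXT);
    CREATE TABLE app (id INTEGER PRIMARY KEY, uri TEXT);
    CREATE TABLE event_a (item_id INTEGER, app_id INTEGER);
    INSERT INTO item_a VALUES (1, 'file:///tmp/notes.txt');
    INSERT INTO app VALUES (1, 'gedit.desktop');
    INSERT INTO event_a VALUES (1, 1);

    -- Variant B (hypothetical): apps are ordinary rows in the item table.
    CREATE TABLE item_b (id INTEGER PRIMARY KEY, uri TEXT);
    CREATE TABLE event_b (item_id INTEGER, app_id INTEGER);
    INSERT INTO item_b VALUES (1, 'file:///tmp/notes.txt');
    INSERT INTO item_b VALUES (2, 'gedit.desktop');
    INSERT INTO event_b VALUES (1, 2);
""")

# Variant A: one JOIN for the item, one for the app.
rows_a = conn.execute("""
    SELECT i.uri, a.uri
    FROM event_a e
    JOIN item_a i ON i.id = e.item_id
    JOIN app a ON a.id = e.app_id
""").fetchall()

# Variant B: still two JOINs, both against the same (larger) table.
rows_b = conn.execute("""
    SELECT i.uri, a.uri
    FROM event_b e
    JOIN item_b i ON i.id = e.item_id
    JOIN item_b a ON a.id = e.app_id
""").fetchall()
```

Both queries return the same single row, so the gain from merging is
one less table to maintain rather than one less JOIN.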

-- 
Cheers,
Mikkel

References

RFC: Database Schema Changes Blueprint
From: Mikkel Kamstrup Erlandsen, 2009-10-08
Fwd: RFC: Database Schema Changes Blueprint
From: Siegfried Gevatter, 2009-10-10