zeitgeist team mailing list archive

Fwd: RFC: Database Schema Changes Blueprint

To: zeitgeist@xxxxxxxxxxxxxxxxxxx
From: Siegfried Gevatter <rainct@xxxxxxxxxx>
Date: Sat, 10 Oct 2009 13:47:34 +0200
In-reply-to: <9961daf10910092251q770c8295sdd6097507994bc28@mail.gmail.com>
Sender: siggi.gevatter@xxxxxxxxx

---------- Forwarded message ----------
From: Mikkel Kamstrup Erlandsen <mikkel.kamstrup@xxxxxxxxx>
Date: 2009/10/10
Subject: Re: [Zeitgeist] RFC: Database Schema Changes Blueprint
To: Siegfried Gevatter <rainct@xxxxxxxxxx>

2009/10/9 Siegfried Gevatter <rainct@xxxxxxxxxx>:
> Hey,

We probably need to document this conversation somewhere, but here
goes anyway...

> (1)
>> Relational Mimetype
> To save disk space? Sounds fine.

Yeah, mostly that

> (4)
>> Questionable: Origin moved to uri Table?
>> Siegfried has an idea - I don't completely get it :-) https://bugs.launchpad.net/zeitgeist/+bug/425258
>
> The "origin" field we currently have is in the "item" table, and this
> doesn't make any sense at all. It should be a property of events,
> instead. Let me explain it with an example. Suppose I'm on Google,
> search for Zeitgeist and click on a link to zeitgeist-project.com; the
> origin here would be "google.com". Now, some hours later, I'm reading
> Seif's blog and click on a link to zeitgeist-project.com; for this
> event, the origin should be Seif's blog: but origin is in the "item"
> table and is already set to "google.com", so the new origin can't be
> stored!

Ok. I get the idea now - the original intent of origin was not what
you describe though.

The idea with origin was as follows: If I visit
http://youtube.com/v?7da84bdksy this would also be the URI of the item
in question, but the origin of the item would be http://youtube.com.
The reason we want to extract the origin in a rigorous manner (and not
simply use some prefix matching at query time) is that we want to be
able to cluster events based on their origins. "Which youtube videos
have I watched lately?". Or ask the more general question "what do I
usually do after watching a youtube video?".
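
A minimal sketch of what extracting the origin up front and clustering
on it could look like. The helper names (origin_of, cluster_by_origin)
and the second YouTube URI are made up for illustration; they are not
part of the blueprint or of Zeitgeist itself:

```python
from collections import defaultdict
from urllib.parse import urlsplit


def origin_of(uri):
    """Return scheme://host for a URI (hypothetical helper)."""
    parts = urlsplit(uri)
    return f"{parts.scheme}://{parts.netloc}"


def cluster_by_origin(uris):
    """Group URIs under their extracted origin, keeping input order."""
    clusters = defaultdict(list)
    for uri in uris:
        clusters[origin_of(uri)].append(uri)
    return dict(clusters)


visited = [
    "http://youtube.com/v?7da84bdksy",
    "http://zeitgeist-project.com/",
    "http://youtube.com/v?hypothetical2",  # invented second video
]
clusters = cluster_by_origin(visited)
```

With the origin stored as its own value, "which youtube videos have I
watched lately?" becomes a lookup on clusters["http://youtube.com"]
rather than a prefix match over every URI.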

But origin for events is not well defined in this context. Your idea
of an origin for events is very nice, though. It doubles as what I've
called 'actor' in the blueprint...
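
To make the per-event reading concrete, here is a rough sketch using
Python's sqlite3 module. The column layout is an assumption for
illustration only (it is not the schema from the blueprint), and the
blog URI is a placeholder:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed, simplified layout: origin lives on the event, not the item.
    CREATE TABLE item (id INTEGER PRIMARY KEY, uri TEXT UNIQUE);
    CREATE TABLE event (
        id INTEGER PRIMARY KEY,
        subject_id INTEGER REFERENCES item(id),
        origin TEXT,
        timestamp INTEGER
    );
""")

# One item, visited twice from two different places.
conn.execute(
    "INSERT INTO item (id, uri) VALUES (1, 'http://zeitgeist-project.com')"
)
conn.executemany(
    "INSERT INTO event (subject_id, origin, timestamp) VALUES (?, ?, ?)",
    [
        (1, "http://google.com", 1000),
        (1, "http://blog.example", 2000),  # placeholder for Seif's blog
    ],
)

origins = [
    row[0]
    for row in conn.execute(
        "SELECT origin FROM event WHERE subject_id = 1 ORDER BY timestamp"
    )
]
```

Both origins survive because each visit is its own event row; with
origin on the item row, the second visit would have nowhere to put
its value.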

> (5)
> I'm also not quite sure what the value of "origin" should be. In the
> example above, what would the URI for Google be, just
> "http://google.com" or the URI of the search query? In case we go with
> the former, what happens with pages on shared hosting, how do we
> differentiate between them?
>
> I find the "origin" field rather confusing, and given that we'll get
> the focus tracking I tend to agree with "Questionable: Remove
> item.origin?".

I hope my above explanation clarifies things a bit?

> (6)
>> Questionable: Remove the app table altogether?
>> [...] One less table can save us a SQL JOIN.
> No, if we get two things from the same table we still need two joins
> as the stuff is in different rows. Further, this would increase disk
> space usage.

Storing maybe 100 apps in the item table is not a significant
space overhead, I believe. If we stored 25,000 then maybe, but
that is way beyond realistic.
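
For reference, the join point from (6) can be sketched like this,
again with sqlite3 and an assumed, simplified layout (the app_id
column and the sample URIs are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed layout: apps are stored as ordinary rows in item.
    CREATE TABLE item (id INTEGER PRIMARY KEY, uri TEXT);
    CREATE TABLE event (
        id INTEGER PRIMARY KEY,
        subject_id INTEGER,
        app_id INTEGER
    );
    INSERT INTO item VALUES (1, 'file:///home/user/notes.txt');
    INSERT INTO item VALUES (2, 'application://gedit.desktop');
    INSERT INTO event VALUES (1, 1, 2);
""")

# The subject and the app are different rows of the same table,
# so the item table is joined twice under two aliases.
row = conn.execute("""
    SELECT subject.uri, app.uri
    FROM event
    JOIN item AS subject ON subject.id = event.subject_id
    JOIN item AS app ON app.id = event.app_id
""").fetchone()
```

So dropping the app table removes a table but not a join; what it
costs or saves in disk space is a separate question.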

> (7)
>> Solution 2: Separate Tables
>> [...] the event and annotation tables will look exactly like the item table but
>> adding an extra column subject_id
> I don't understand this at all.
>
> Gotta run now, see you later. By the way, I'll try to take a look at
> how views/procedures/etc work to see how we can optimize the big
> query.

That sounds sweet! I'll probably not be able to respond to mails
before Sunday night. Have a good weekend!

--
Cheers,
Mikkel

References

RFC: Database Schema Changes Blueprint
From: Mikkel Kamstrup Erlandsen, 2009-10-08