u1db-discuss team mailing list archive

Thread
Date

Re: Indexing and lists

To: Stuart Langridge <stuart.langridge@xxxxxxxxxxxxx>, U1DB Discuss <u1db-discuss@xxxxxxxxxxxxxxxxxxx>
From: eric casteleijn <eric.casteleijn@xxxxxxxxxxxxx>
Date: Thu, 17 Nov 2011 17:03:32 -0500
In-reply-to: <1321560389.7594.90.camel@giles>
User-agent: Notmuch/0.10~rc1 (http://notmuchmail.org) Emacs/23.3.1 (x86_64-pc-linux-gnu)

On Thu, 17 Nov 2011 20:06:29 +0000, Stuart Langridge <stuart.langridge@xxxxxxxxxxxxx> wrote:
> So, let's talk about indexing.
> 
> Imagine a database with the following documents in:
> 
> { "name": "Stuart", "address": {"town": "Birmingham", "country": "UK"},
>   "hair": "red", "colours": [ "red", "blue" ], "id": "sil" }
> { "name": "Samuele", "address": {"town": "Sometown", "country": "CH"},
>   "hair": "brown", "colours": [ "green" ], "id": "pedronis" }
> { "name": "Lucio", "address": {"town": "Othertown", "country": "AR"},
>   "hair": "brown", "colours": [ "pink" ], "id": "lucio" }
> { "name": "Rodney", "address": {"town": "Podunk", "country": "US"},
>   "hair": "brown", "colours": [ "brown", "red" ], "id": "dobey" }
> 
> If I create an index as create_index("haircolour", ["hair"]), the index
> would look (conceptually) like this:
> brown: dobey
> brown: lucio
> brown: pedronis
> red: sil
> 
> All agreed so far. We ought to also be able to create an index on
> subfields, as was planned from the start, so create_index("townname",
> ["address.town"]):
> Birmingham: sil
> Sometown: pedronis
> Othertown: lucio
> Podunk: dobey
> 
> And indexing on multiple fields should also be doable, and was also
> planned from the start, so create_index("hairandtown", ["hair",
> "address.town"]):
> brown,Othertown: pedronis
> brown,Podunk: dobey
> brown,Sometown: lucio
> red,Birmingham: sil
> 
> However... I also want to create an index on favourite colours.
> Importantly, favourite colours is a list in the documents, so an
> individual document should show up more than once in the index. So the
> index would want to end up looking like this:
> 
> blue: sil
> brown: dobey
> green: pedronis
> pink: lucio
> red: dobey
> red: sil
> 
> So my question is... what's the index expression to create that index?
> create_index("colour", ["colours"]) seems wrong -- that feels like it
> should put the whole list in the index. What should it be?

I know we're purposely trying to keep the API as small as possible, but
I think being explicit here would help in two ways:

1. We signal the API consumer that not all indices are created equal,
and that they have to think (at least a little) about what they index
and how.

2. We give the backend implementation a hint as to how to build the
index, without it having to inspect the data (and what if there is no
data yet for instance.)

I can imagine several ways to do this, but the cleanest to me seems an
(optional perhaps) index_type argument to create_index. We can discuss
what kinds of indices we allow. (If that is not already hammered out
somewhere.)

The list example above would be awesome to have in the context of
tagging/folksonomies, but it comes at a cost, because basically you're
doing full text indexing. Maybe because are indexes are (presumably,
hopefully) not stored as wildly inefficiently as they were in couchdb,
this will not be as much of a problem as it is there.

> Extra thing: what if a document also contains
> "phones" [ {"name":"home", "num":"123"},{"name":"work", "num":"456"} ]
> 
> how would I do an index on people who have a work phone number?
> create_index("worknums", [ "phones.name" ]) ? That feels weird; the
> indexer would act differently depending on whether the value of "phones"
> is a dict or a list of dicts. Then again, maybe that's the answer; if a
> part of an index expression resolves to a list, then we do the remainder
> of the index expression for *each item in the list*. This would also
> cope with the above colours example, ignoring my reservations about it
> feeling weird. To me that makes a certain amount of sense. Thoughts?

This reminds me of mergeable lists and the dramas that came thereof. I
would probably, at least for starters, not allow indexes on paths into
arbitrary depths of the document.

I think maybe instead of trying to be overly generic at the beginning
again, we should encourage document schema writers to do what every
other schema does: have separate fields for these kinds of data, rather
than an annotated list of arbitrary length.

-- 
eric casteleijn
https://launchpad.net/~thisfred

References

Indexing and lists
From: Stuart Langridge, 2011-11-17