u1db-discuss team mailing list archive

Re: Indexing and lists

To: John Rowland Lenton <john.lenton@xxxxxxxxxxxxx>
From: John Arbash Meinel <john@xxxxxxxxxxxxxxxxx>
Date: Fri, 18 Nov 2011 09:30:34 +0100
Cc: U1DB Discuss <u1db-discuss@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <8762ii2yu3.fsf@canonical.com>
User-agent: Mozilla/5.0 (Windows NT 6.0; rv:8.0) Gecko/20111105 Thunderbird/8.0
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Just so we're all aware, James Westby already has an initial Index
implementation that handles 'lists':
https://code.launchpad.net/~james-w/u1db/index-transformations/+merge/81069



On 11/17/2011 11:01 PM, John Rowland Lenton wrote:
> On Thu, 17 Nov 2011 20:06:29 +0000, Stuart Langridge
> <stuart.langridge@xxxxxxxxxxxxx> wrote:
>> 
>> how would I do an index on people who have a work phone number? 
>> create_index("worknums", [ "phones.name" ]) ? That feels weird;
>> the indexer would act differently depending on whether the value
>> of "phones" is a dict or a list of dicts. Then again, maybe
>> that's the answer; if a part of an index expression resolves to a
>> list, then we do the remainder of the index expression for *each
>> item in the list*. This would also cope with the above colours
>> example, ignoring my reservations about it feeling weird. To me
>> that makes a certain amount of sense. Thoughts?
> 
> More questions: do you want to be able to create an index on the
> names of an object? Do we want partial indexes? If we have an index
> expression that transforms a string into a list of strings, do we
> need to explicitly say that we want each of those added separately
> to the index, rather than the list itself?

1) I think you mean by this something like:
   create_index('names', ['names()'])
   create_doc('{"a": 1, "b": 2}')
   get_from_index('names', ["a"])

   Would then return the document.
   I can see where that could be useful, though if there are only a
   small number of names that you care about, then you can create an
   index for each one.
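
A minimal sketch of what an index on field names could look like. This
is not u1db code: names_keys() and the in-memory dict index are
hypothetical stand-ins for a 'names()' index expression, and the
document ids are made up for illustration.

```python
import json
from collections import defaultdict


def names_keys(doc_json):
    """Return the top-level field names of a JSON document (hypothetical helper)."""
    doc = json.loads(doc_json)
    # Only JSON objects have field names; anything else contributes no keys.
    return sorted(doc) if isinstance(doc, dict) else []


# Toy stand-in for the index: field name -> set of document ids.
index = defaultdict(set)
docs = {"doc-1": '{"a": 1, "b": 2}', "doc-2": '{"c": 3}'}
for doc_id, content in docs.items():
    for key in names_keys(content):
        index[key].add(doc_id)

# Looking up "a" finds doc-1, mirroring get_from_index('names', ["a"]).
print(sorted(index["a"]))
```

With only a handful of interesting names, one ordinary index per name
gives the same lookups without a new function.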

2) I'm not 100% sure what you mean by partial indexes here. If part of
   an index evaluates to 'null', then that document is not put into the
   index.
   Maybe you are taking it a step further and having an equality check?
   create_index('john', ['equal(name, "john")'])
   or
   create_index('john', ['name == "john"'])

   The former fits into our current syntax ok, the latter would be a
   possible transformation, but I imagine the syntax parser gets crazy
   when you start layering them.
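
A rough sketch of the two ideas in this point, under stated
assumptions: field(), equal() and build_index() are hypothetical
helpers written for illustration, not u1db functions. An expression
that evaluates to None (JSON null) leaves the document out of the
index, and equal() is modelled as a transformation that yields None
unless the value matches.

```python
def field(name):
    """Expression that reads a top-level field, or None if it is missing."""
    return lambda doc: doc.get(name)


def equal(expr, expected):
    """Expression that yields `expected` on a match and None otherwise."""
    def evaluate(doc):
        return expected if expr(doc) == expected else None
    return evaluate


def build_index(docs, expr):
    """Map each non-null expression value to the ids of matching documents."""
    index = {}
    for doc_id, doc in docs.items():
        value = expr(doc)
        if value is None:
            # A null result means the document is simply not indexed.
            continue
        index.setdefault(value, []).append(doc_id)
    return index


docs = {"d1": {"name": "john"}, "d2": {"name": "stuart"}, "d3": {}}
print(build_index(docs, field("name")))
print(build_index(docs, equal(field("name"), "john")))
```

Modelling equal() as just another function keeps partial indexes
inside the existing "null means not indexed" rule, which is why the
function form fits the current syntax more easily than an infix
operator would.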

3) I think here you mean do we want something like:

   create_index('favcolor', ["any(colour)"])

   rather than just writing it as:

   create_index('favcolor', ["colour"])

   And if the 'colour' field is a list, we just evaluate each item of
   the list.

   I think I agree that 'any()' seems superfluous. The question that
   remains is if we want an 'all()' function (flatten a list into a
   single item).

   As an example:
     create_index('all_colour', ['all(colours)'])

     get_from_index('all_colour', ['green'])
         returns Samuel
     get_from_index('all_colour', ['red'])
         returns [], nobody likes *just* red.
     get_from_index('all_colour', ['red|blue'])
         returns Stuart

   I don't think we want all(), because its semantics are probably
   those of a set operation: is (red,blue) the same as (blue,red)?
   And I think users can approximate it in user-space with:

     create_index('colour', ['colours'])
     docs = get_from_index('colour', ['red', 'blue'])
     for doc in docs:
         if 'red' not in doc.colours or 'blue' not in doc.colours:
             # doesn't like both
             continue
         ...
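
The user-space approach can be sketched end to end as follows. The
sample data (Samuel likes green, Stuart likes red and blue) is assumed
from the example results above, and index_keys(), likes_all() and
likes_exactly() are hypothetical helpers over a plain dict, not part
of u1db.

```python
def index_keys(value):
    """Expand a field value into the keys it contributes to the index."""
    if isinstance(value, list):
        # A list contributes one key per string item, as if any() were implied.
        return [item for item in value if isinstance(item, str)]
    return [value] if isinstance(value, str) else []


# Assumed sample documents, keyed by a made-up document id.
docs = {
    "samuel": {"colours": ["green"]},
    "stuart": {"colours": ["red", "blue"]},
}

# Plain per-item index: colour -> set of document ids.
index = {}
for doc_id, doc in docs.items():
    for key in index_keys(doc.get("colours")):
        index.setdefault(key, set()).add(doc_id)


def likes_all(wanted):
    """Ids of documents whose colours include every wanted colour."""
    if not wanted:
        return []
    matches = set.intersection(*(index.get(c, set()) for c in wanted))
    return sorted(matches)


def likes_exactly(wanted):
    """Ids of documents whose colours are exactly the wanted set."""
    return [d for d in likes_all(wanted)
            if set(docs[d]["colours"]) == set(wanted)]


print(likes_all(["red", "blue"]))
print(likes_exactly(["red"]))
```

Comparing sets in likes_exactly() sidesteps the ordering question:
(red, blue) and (blue, red) give the same answer without the index
having to define a canonical order.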




> 
> I think the answer to those is no, yes, and no: I think the rule
> for index expressions should be that they either resolve to a
> single "scalar" value (one of string|number|true|false|null), which
> is added to the index, or to a list, whose scalar elements are
> added sequentially to the index, and that if neither of those
> happens it's not an error, it simply isn't added (I'm on the fence
> as to whether lists that have list elements should have the
> elements of the list element added recursively; having to explain
> that makes my head hurt a little. man perllol). That we should
> provide no index functions to address individual items of a list;
> if you need to treat the second item differently from the first,
> then it should be an object, not a list. That
> "name.split().lower()" (or "lower(split(name))", or 
> name|split|lower, or whatever) should result in the same values
> added to the index as "name.lower().split()". And that we should
> continue to enforce the semantics (in the same way i said "you
> shouldn't care about the nth element of the list") by saying that
> you shouldn't get into the situation where you have to create an
> index on the keys.
> 
> I also think that after describing what we want for the indexing 
> language, we need to look at what is the minimal thing we can do
> that is useful, and do that first. That we shouldn't spend too much
> time worrying about how we'd create an index of an object with 3
> layers of nested dicts and lists of lists; we can put hard limits
> to the complexity of the expressions we admit, especially at
> first.
> 
> We're going to want to throw away the indexing language in a few
> years (WHAT WERE WE THINKING?!? *hair pull*) and rewrite it, and
> still admit the old expressions for backwards compatibility, so the
> smaller it is (while still being useful) the less we'll have to
> hack it up later. Yes? (probably preaching to the choir by now).
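
The rule proposed in the quoted text (an expression resolves to a
scalar, or to a list whose scalar elements are added one by one, and
anything else is silently skipped) could be sketched like this. All
names here are hypothetical. Two choices are assumptions rather than
settled design: nested lists are not recursed into, and null is
counted as a scalar as in the quoted list of types.

```python
# JSON scalars as listed in the quoted rule: string|number|true|false|null.
SCALARS = (str, int, float, bool, type(None))


def index_values(value):
    """Return the values an expression result would add to the index."""
    if isinstance(value, SCALARS):
        return [value]
    if isinstance(value, list):
        # Only scalar elements are added; nested lists and objects are skipped.
        return [item for item in value if isinstance(item, SCALARS)]
    # Objects and anything else: not an error, just not indexed.
    return []


def lower(value):
    """Lowercase a string, or each item of a list; other values pass through."""
    if isinstance(value, list):
        return [lower(item) for item in value]
    return value.lower() if isinstance(value, str) else value


def split(value):
    """Split a string on whitespace; applied to a list, flatten the results."""
    if isinstance(value, list):
        return [word for item in value for word in split(item)]
    return value.split() if isinstance(value, str) else [value]


name = "John Arbash Meinel"
print(index_values(lower(split(name))))
print(index_values(split(lower(name))))
```

Because each transformation maps over list items, the order of split
and lower does not change what ends up in the index, which is the
property asked for in the quoted text.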

I think you have some good points here. Something simple that is
functional enough to get work done, and then iterate to find a better
solution.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7GF6oACgkQJdeBCYSNAAMRuwCfdtS2ihPUr0aeYqZWUZZAG9Do
jIsAnjxpAlUei1lyMuglI3CgiMFrC5o7
=Q54u
-----END PGP SIGNATURE-----
References

Indexing and lists
From: Stuart Langridge, 2011-11-17
Re: Indexing and lists
From: John Rowland Lenton, 2011-11-17