← Back to team overview

maria-developers team mailing list archive

Updated (by Monty): Add support for google protocol buffers (34)

 

-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Add support for google protocol buffers
CREATION DATE..: Tue, 21 Jul 2009, 21:11
SUPERVISOR.....: Monty
IMPLEMENTOR....: 
COPIES TO......: 
CATEGORY.......: Server-Sprint
TASK ID........: 34 (http://askmonty.org/worklog/?tid=34)
VERSION........: WorkLog-3.4
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

-=-=(Monty - Thu, 11 Feb 2010, 20:03)=-=-
High Level Description modified.
--- /tmp/wklog.34.old.18329     2010-02-11 18:03:42.000000000 +0000
+++ /tmp/wklog.34.new.18329     2010-02-11 18:03:42.000000000 +0000
@@ -19,3 +19,14 @@
 
 Any support for indexing GPB data is outside of scope of this WL entry.
 
+Example usage:
+
+SELECT proto_get(blob, 1, varchar) from table_with_proto;
+
+UPDATE table_with_proto SET blob=proto_add(blob, 2, "hello") where id=1;
+
+UPDATE table_with_proto SET blob=proto_del(blob,4) where id=5;
+
+Note that 'proto_add()' will replace any old value with the given proto_id.
+
+

-=-=(Monty - Thu, 11 Feb 2010, 19:59)=-=-
High-Level Specification modified.
--- /tmp/wklog.34.old.17962     2010-02-11 19:59:45.000000000 +0200
+++ /tmp/wklog.34.new.17962     2010-02-11 19:59:45.000000000 +0200
@@ -1,13 +1,8 @@
-
-<contents>
 1. GPB Encoding overview
 2. GPB in an SQL database
-2.1 Informing server about GPB field names and types
-2.2 Addressing GPB fields
-2.2.1 Option1: SQL Function
-2.2.2 Option2: SQL columns
-</contents>
-
+3. Encoding to use for dynamic columns
+4. How to store and access data in a protocol buffer from SQL
+5. Extensions for the future
 
 1. GPB Encoding overview
 ========================
@@ -37,42 +32,50 @@
 traffic right away, and will open path to getting the best possible
 performance.
 
-2.1 Informing server about GPB field names and types
-----------------------------------------------------
-User-friendly/meaningful access to GPB fields requires knowledge of GPB field
-names and types, which are not available from GPB message itself (see "GPB 
-encoding overview" section).
-
-So the first issue to be addressed is to get the server to know the definition
-of stored messages. We intend to assume that all records have GPB messages
-that conform to a certain single definition, which gives one definition per
-GPB field.
+3. Encoding to use for dynamic columns
+======================================
 
-DecisionToMake: How to pass the server the GPB definition?
-First idea: add a CREATE TABLE parameter which will specify either the
-definition itself or path to .proto file with the definition.
+The data should be coded into the proto buffer in the following format:
+
+<field_number><value_type><value>[<field_number><value_type><value>...]
+
+Where field_number is a number between 0-65536 that identifes the field
+<value_type> is a enum of type 'Item_result'
+<value> is the value coded in proto format.
+
+In other words, we should have no nested or complex structure.
+
+4. How to store and access data in a protocol buffer from SQL
+============================================================
+
+User-friendly/meaningful access to GPB fields requires knowledge of
+GPB field names and types, which are not available from GPB message
+itself (see "GPB encoding overview" section).
+
+To make things easy for the user, we will at first stage provide SQL
+functions to manipulate a string that is actually in proto format.
 
-2.2 Addressing GPB fields
--------------------------
-We'll need to provide a way to access GPB fields. This can be complicated as
-structures that are encoded in GPB message can be nested and recursive.
-
-2.2.1 Option1: SQL Function
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Introduce an SQL function GPB_FIELD(path) which will return contents of the
-field.  
-- Return type of the function will be determined from GPB message definition. 
-- For path, we can use XPath selector (a subset of XPath) syntax.
-
-(TODO ^ the above needs to be specified in more detail. is the selector as
-simple as filesystem path or we allow quantifiers (with predicates?)?)
-
-2.2.2 Option2: SQL columns
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-Make GPB columns to be accessible as SQL columns. 
-This approach has problems:
-- It might be hard to implement code-wise
-  - (TODO will Virtual columns patch help??)
-- It is not clear how to access fields from nested structures. Should we allow 
-  quoted names like `foo/bar[2]/baz' ? 
+The functions we should provde are:
 
+proto_get(gpb, field_number, type)
+
+This return the field tagged with 'field_number' from the 'gpb' buffer.
+
+Example:  proto_get(blob, 1, varchar)     -> Returns field number 1 as varchar
+
+proto_put(gpb, field_number, value)
+
+This returns a new gbp buffer with the new value appended.
+
+Example: proto_put(proto_put(blob, 1, 1), 2, "hello")
+
+5. Extension for future
+=======================
+
+In the future we may want to access data based on name and get MariaDB to
+automaticly know the correct type.  To do this we need to be able to
+store a definition for the content of the proto buffer somewhere.
+
+DecisionToMake: How to pass the server the GPB definition?
+First idea: add a CREATE TABLE parameter which will specify the
+definition itself.

-=-=(Monty - Thu, 11 Feb 2010, 19:59)=-=-
High Level Description modified.
--- /tmp/wklog.34.old.17915     2010-02-11 17:59:17.000000000 +0000
+++ /tmp/wklog.34.new.17915     2010-02-11 17:59:17.000000000 +0000
@@ -1,5 +1,21 @@
-Add support for Google Protocol Buffers (further GPB). It should be possible 
-to have columns that store GPB-encoded data, as well as use SQL constructs to
+Add support for dynamic columns:
+
+- A column that can hold information from many columns 
+- One can instantly add or remove column data
+
+This is a useful feature for any store type of application, where you want to
+store different type of information for different kind of items.
+
+For example, for shoes you want to store: material, size, colour, maker
+For a computer you want to store ram, hard disk size etc...
+
+In a normal 'relational' system you would need to a table for each type.
+With dynamic columns you have all common items as fixed fields (like
+product_code, manufacturer, price) and the rest stored in a dynamic column.
+
+The proposed idea is to store the dynamic information in a blob in
+Google Protocol Buffers (further GPB) format and use SQL constructs to
 extract parts of GPB data for use in select list, for filtering, and so forth.
+
 Any support for indexing GPB data is outside of scope of this WL entry.
 

-=-=(Knielsen - Fri, 22 Jan 2010, 11:38)=-=-
Low Level Design modified.
--- /tmp/wklog.34.old.29965     2010-01-22 11:38:57.000000000 +0200
+++ /tmp/wklog.34.new.29965     2010-01-22 11:38:57.000000000 +0200
@@ -2,3 +2,12 @@
   and a parser for text form of .proto file which then exposes the parsed
   file via standard GPB message navigation API.
 
+* We should have both server-side support and client-side support (client side
+  means functions in libmysqlclient so that user can select the full BLOB and
+  extract fields in the application).
+
+* Add some kind of header to the GPB blob to support versioning and future
+  extensibility.
+
+* Add complete syntax description (update, add, drop, exists, ...).
+

-=-=(Psergey - Tue, 21 Jul 2009, 21:13)=-=-
Low Level Design modified.
--- /tmp/wklog.34.old.6462      2009-07-21 21:13:13.000000000 +0300
+++ /tmp/wklog.34.new.6462      2009-07-21 21:13:13.000000000 +0300
@@ -1 +1,4 @@
+* GPB tarball contains a protocol definition for .proto file structure itself
+  and a parser for text form of .proto file which then exposes the parsed
+  file via standard GPB message navigation API.
 

-=-=(Psergey - Tue, 21 Jul 2009, 21:12)=-=-
High-Level Specification modified.
--- /tmp/wklog.34.old.6399      2009-07-21 21:12:23.000000000 +0300
+++ /tmp/wklog.34.new.6399      2009-07-21 21:12:23.000000000 +0300
@@ -1 +1,78 @@
 
+<contents>
+1. GPB Encoding overview
+2. GPB in an SQL database
+2.1 Informing server about GPB field names and types
+2.2 Addressing GPB fields
+2.2.1 Option1: SQL Function
+2.2.2 Option2: SQL columns
+</contents>
+
+
+1. GPB Encoding overview
+========================
+
+GBB is a compact encoding for structured and typed data. A unit of GPB data
+(it is called message) is only partially self-describing: it's possible to 
+iterate over its parts, but, quoting the spec
+
+http://code.google.com/apis/protocolbuffers/docs/encoding.html:
+  " the name and declared type for each field can only be determined on the
+  decoding end by referencing the message type's definition (i.e. the .proto 
+  file). "
+
+2. GPB in an SQL database
+=========================
+
+It is possible to store GPB data in MariaDB today - one can declare a binary
+blob column and use it to store GPB messages. Storing and retrieving entire
+messages will be the only available operations, though, as the server has no
+idea about the GPB format. 
+It is apparent that ability to peek inside GPB data from SQL layer would be of
+great advantage: one would be able to 
+- select only certain fields or parts of GPB messages
+- filter records based on the values of GPB fields
+- etc
+performing such operations at SQL layer will allow to reduce client<->server 
+traffic right away, and will open path to getting the best possible
+performance.
+
+2.1 Informing server about GPB field names and types
+----------------------------------------------------
+User-friendly/meaningful access to GPB fields requires knowledge of GPB field
+names and types, which are not available from GPB message itself (see "GPB 
+encoding overview" section).
+
+So the first issue to be addressed is to get the server to know the definition
+of stored messages. We intend to assume that all records have GPB messages
+that conform to a certain single definition, which gives one definition per
+GPB field.
+
+DecisionToMake: How to pass the server the GPB definition?
+First idea: add a CREATE TABLE parameter which will specify either the
+definition itself or path to .proto file with the definition.
+
+2.2 Addressing GPB fields
+-------------------------
+We'll need to provide a way to access GPB fields. This can be complicated as
+structures that are encoded in GPB message can be nested and recursive.
+
+2.2.1 Option1: SQL Function
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Introduce an SQL function GPB_FIELD(path) which will return contents of the
+field.  
+- Return type of the function will be determined from GPB message definition. 
+- For path, we can use XPath selector (a subset of XPath) syntax.
+
+(TODO ^ the above needs to be specified in more detail. is the selector as
+simple as filesystem path or we allow quantifiers (with predicates?)?)
+
+2.2.2 Option2: SQL columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+Make GPB columns to be accessible as SQL columns. 
+This approach has problems:
+- It might be hard to implement code-wise
+  - (TODO will Virtual columns patch help??)
+- It is not clear how to access fields from nested structures. Should we allow 
+  quoted names like `foo/bar[2]/baz' ? 
+



DESCRIPTION:

Add support for dynamic columns:

- A column that can hold information from many columns 
- One can instantly add or remove column data

This is a useful feature for any store type of application, where you want to
store different type of information for different kind of items.

For example, for shoes you want to store: material, size, colour, maker
For a computer you want to store ram, hard disk size etc...

In a normal 'relational' system you would need to a table for each type.
With dynamic columns you have all common items as fixed fields (like
product_code, manufacturer, price) and the rest stored in a dynamic column.

The proposed idea is to store the dynamic information in a blob in
Google Protocol Buffers (further GPB) format and use SQL constructs to
extract parts of GPB data for use in select list, for filtering, and so forth.

Any support for indexing GPB data is outside of scope of this WL entry.

Example usage:

SELECT proto_get(blob, 1, varchar) from table_with_proto;

UPDATE table_with_proto SET blob=proto_add(blob, 2, "hello") where id=1;

UPDATE table_with_proto SET blob=proto_del(blob,4) where id=5;

Note that 'proto_add()' will replace any old value with the given proto_id.



HIGH-LEVEL SPECIFICATION:



1. GPB Encoding overview
2. GPB in an SQL database
3. Encoding to use for dynamic columns
4. How to store and access data in a protocol buffer from SQL
5. Extensions for the future

1. GPB Encoding overview
========================

GBB is a compact encoding for structured and typed data. A unit of GPB data
(it is called message) is only partially self-describing: it's possible to 
iterate over its parts, but, quoting the spec

http://code.google.com/apis/protocolbuffers/docs/encoding.html:
  " the name and declared type for each field can only be determined on the
  decoding end by referencing the message type's definition (i.e. the .proto 
  file). "

2. GPB in an SQL database
=========================

It is possible to store GPB data in MariaDB today - one can declare a binary
blob column and use it to store GPB messages. Storing and retrieving entire
messages will be the only available operations, though, as the server has no
idea about the GPB format. 
It is apparent that ability to peek inside GPB data from SQL layer would be of
great advantage: one would be able to 
- select only certain fields or parts of GPB messages
- filter records based on the values of GPB fields
- etc
performing such operations at SQL layer will allow to reduce client<->server 
traffic right away, and will open path to getting the best possible
performance.

3. Encoding to use for dynamic columns
======================================

The data should be coded into the proto buffer in the following format:

<field_number><value_type><value>[<field_number><value_type><value>...]

Where field_number is a number between 0-65536 that identifes the field
<value_type> is a enum of type 'Item_result'
<value> is the value coded in proto format.

In other words, we should have no nested or complex structure.

4. How to store and access data in a protocol buffer from SQL
============================================================

User-friendly/meaningful access to GPB fields requires knowledge of
GPB field names and types, which are not available from GPB message
itself (see "GPB encoding overview" section).

To make things easy for the user, we will at first stage provide SQL
functions to manipulate a string that is actually in proto format.

The functions we should provde are:

proto_get(gpb, field_number, type)

This return the field tagged with 'field_number' from the 'gpb' buffer.

Example:  proto_get(blob, 1, varchar)     -> Returns field number 1 as varchar

proto_put(gpb, field_number, value)

This returns a new gbp buffer with the new value appended.

Example: proto_put(proto_put(blob, 1, 1), 2, "hello")

5. Extension for future
=======================

In the future we may want to access data based on name and get MariaDB to
automaticly know the correct type.  To do this we need to be able to
store a definition for the content of the proto buffer somewhere.

DecisionToMake: How to pass the server the GPB definition?
First idea: add a CREATE TABLE parameter which will specify the
definition itself.


LOW-LEVEL DESIGN:



* GPB tarball contains a protocol definition for .proto file structure itself
  and a parser for text form of .proto file which then exposes the parsed
  file via standard GPB message navigation API.

* We should have both server-side support and client-side support (client side
  means functions in libmysqlclient so that user can select the full BLOB and
  extract fields in the application).

* Add some kind of header to the GPB blob to support versioning and future
  extensibility.

* Add complete syntax description (update, add, drop, exists, ...).


ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)