maria-developers team mailing list archive

Re: Server-id patch

 

Robert Hodges <robert.hodges@xxxxxxxxxxxxxx> writes:

> I think we need to take this out to one of the email threads.  I hope other
> people would be interested in these problems.

Agree, I've quoted liberally and Cc:'ed maria-developers@

> On Aug 31, 2012, at 6:14 AM PDT, Kristian Nielsen wrote:

>> Robert Hodges <robert.hodges@xxxxxxxxxxxxxx> writes:
>> 
>>> 1.) The server-id of which I was speaking is a very simple patch to allow us
>>> to do the following client command: 'set session server-id=X' where X is
>>> whatever we choose.  Currently MySQL replication sets the server ID from an
>>> internal structure (THD as I recall), which means that applications like
>>> Tungsten cannot get the server-id into the binlog when applying updates on a
>>> slave.  Server-id is essential to prevent replication loops.  Monty and I
>>> discussed this feature briefly over dinner in April.  I hacked in support
>>> using a UDF about a year ago but it would be better to have this become a
>>> dynamic session variable.

>> Yes, it does not seem too hard to do (though as always some care is needed to
>> introduce it in a clean way).
>> 
>> I created an issue for this with some descriptions:
>> 
>>    https://mariadb.atlassian.net/browse/MDEV-500

> SET SESSION pseudo_server_id=236;
>
> Then the server_id 236 would go into the binlog.  This seems like a good
> idea and avoids accidentally resetting the server_id, which is a pretty
> significant variable.

Seems there is hope we can have this in 10.0.
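
Just to make the intended use concrete, here is a rough sketch of how an
external applier (such as Tungsten) could use a session-settable id to break
replication loops. It is only an illustration: the variable name
pseudo_server_id is just the one suggested above, and the connector calls and
the 'event' object are placeholders, not a decided interface (see MDEV-500
for the actual design discussion):

    import mysql.connector   # any client library would do

    MY_SERVER_ID = 236        # the id this applier stamps on everything it applies

    conn = mysql.connector.connect(host="slave-host", user="repl", password="...")
    cur = conn.cursor()

    # Everything applied through this session then goes into the slave's binlog
    # tagged with server_id 236 rather than the slave's own id.  The variable
    # name is only the one suggested in this thread; nothing is decided yet.
    cur.execute("SET SESSION pseudo_server_id = %d" % MY_SERVER_ID)

    def apply(event):
        # Loop prevention: drop events we originated ourselves.  'event' is a
        # placeholder for whatever the applier reads from an upstream binlog.
        if event.server_id == MY_SERVER_ID:
            return
        cur.execute(event.sql)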

>>> 2.) I agree with your criticisms of the MySQL 5.6 replication design.  It's
>>> surprisingly complex and looks as if it will have a lot of bugs.  I felt at
>> 
>> Yes. They need this complexity because their parallel replication (which is
>> similar to Tungsten parallel replication I think?) can arbitrarily reorder
>> transactions in a slave binlog compared to the order in the master binlog.

> We just use the sequential log.  The slaves are complex but it does not
> pollute the log.  In fact, everything up to the point of applying to the
> slave is purely sequential.  I think their design got this wrong.

Well, the MySQL --log-slave-updates option works by generating completely new events
on the slave as the received events are executed by the SQL thread(s).

Do you mean that Tungsten does not do this, but uses the original binlog with
original events from the master? How is multi-master (one slave replicating
from several masters) handled (if it is handled)?

>>> the time that the better approach on transaction IDs would be to put enough
>>> information into logs to enable applications to compute a vector clock on
>>> the stream of binlog events coming from multiple servers.  This is a
>> 
>> Interesting, I thought about the concept of a "vector clock" before, but did not
>> know the terminology.
>> 
>> Aha, so your point is that even though we will apply changes from multiple
>> upstream master servers in parallel on a single slave, with a vector clock, we
>> can still order them on a slave according to which transactions on one server
>> were visible to which transactions on the other. And stuff like
>> that. Interesting!
>> 
>> But note that to be truly "theoretically sound", all queries (ie. SELECT) as
>> well as all applications that interact directly or indirectly with multiple
>> servers need to be included in the vector clock. So it is not without
>> complexity.

> Are you thinking about holding locks?  For practical purposes it's enough to
> do writes only.  My point is that you have a global ID that is partially
> ordered by originating server.  Vector clocks are the traditional way of
> handling that.  This is the basis for most mechanisms of conflict
> resolution--you check to see if the clocks are ordered and if they are not
> (because two things happened at the same time, hence they are not comparable), it
> signals a conflict that must be reconciled.

Well, I was mostly just reacting to the phrase "truly theoretically sound". If
readers are not considered in the vector clock, then they are able to observe
different states of the database at different nodes in the cluster. But as you
say, that is probably acceptable.
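
For readers of the list who have not met the term before, the check Robert
describes is simple enough to sketch: each update carries a clock mapping
server id to how many transactions from that server it had seen, and a
conflict is exactly the case where neither clock dominates the other. A toy
version, just to illustrate the partial order, nothing MariaDB-specific:

    def compare(a, b):
        """Order two vector clocks: 'before', 'after', 'equal' or 'concurrent'."""
        servers = set(a) | set(b)
        a_le_b = all(a.get(s, 0) <= b.get(s, 0) for s in servers)
        b_le_a = all(b.get(s, 0) <= a.get(s, 0) for s in servers)
        if a_le_b and b_le_a:
            return "equal"
        if a_le_b:
            return "before"
        if b_le_a:
            return "after"
        return "concurrent"   # not comparable -> a conflict to reconcile

    # A's update after applying 5 of its own and 2 of B's transactions:
    x = {"A": 5, "B": 2}
    # B's update made before it had seen A's 5th transaction:
    y = {"A": 4, "B": 3}
    assert compare(x, y) == "concurrent"   # signals a conflict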

>> This does not immediately apply to the MySQL design, as they work with a
>> single master (applying transactions in different databases in parallel), and
>> as they arbitrarily reorder between different databases, without any attempt
>> to maintain inter-database consistency. But of course, my point is that this
>> design is fundamentally wrong.
>> 
>>> theoretically sound approach to enable both synchronous and asynchronous
>>> multi-master replication.  (Galera incidentally uses approximately this
>>> approach for sync replication as you may be aware.)  If you give us server
>>> IDs we don't need anything else.  Oops, I forgot.  Making the binlog truly
>>> crash safe in every possible case is also very helpful.
>> 
>> I think it is getting closer to being truly crash safe - can you mention the
>> situations where it is not? And btw., are you interested in the recent testing
>> we (especially Elena) made for testing such crash safety, and the fixes made
>> for it? We found some fairly interesting (and scary) stuff...

> Actually your (MP's?) fix for group commit is the most important
> improvement.  Can we turn sync_binlog back on without killing performance?

Yes, well at least performance is substantially improved with sync_binlog=1.
With a parallel load you can see improvements of 10x or more, and even
single-threaded load is improved in 10.0.

>> I have been thinking about this for some time. Basically, the correct approach
>> is to make the binlog into a true transaction log using standard textbook
>> techniques. Pre-allocate the log, make it page based (this will also improve
>> performance). Do ping-pong writes at the end so that we can recover even from
>> torn pages. Stuff like that. Would be cool to do, but so far I have not really
>> seen much interest for something like this.

> It would be a lot of work.  If group commit is fixed, then most on-premise
> people can solve enough problems with RAID and BBU that they don't need much
> more.

Right, agree, I thought you had something more subtle than sync_binlog=0 in
mind.
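
For reference, the "ping-pong writes" I mentioned above are just the standard
textbook trick: write the (partial) tail page alternately to two fixed slots,
each with a sequence number and checksum, so a torn write can never destroy
the only good copy; recovery takes whichever slot verifies with the highest
sequence number. A toy sketch of the trick only, not a proposal for the binlog
format:

    import os, struct, zlib

    PAGE_SIZE = 4096
    HEADER = struct.Struct("<QI")          # (sequence number, crc32 of payload)

    def write_tail(f, slots, seq, payload):
        # Alternate between the two slots; the previous good copy is untouched
        # by the write that a crash might tear.
        payload = payload.ljust(PAGE_SIZE - HEADER.size, b"\0")
        f.seek(slots[seq % 2])
        f.write(HEADER.pack(seq, zlib.crc32(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())

    def recover_tail(f, slots):
        best = None
        for off in slots:
            f.seek(off)
            page = f.read(PAGE_SIZE)
            seq, crc = HEADER.unpack(page[:HEADER.size])
            payload = page[HEADER.size:]
            if zlib.crc32(payload) == crc and (best is None or seq > best[0]):
                best = (seq, payload)
        return best                        # newest page that verifies, or None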

(BTW, I fixed a couple binlog crash recovery bugs recently in MariaDB, which
also affect MySQL).

>>> 3.) I read the email you cited.  I need some time to study it in order to
>>> comment intelligently.  The most immediate feedback is that I doubt you want
>>> to make assumptions about how slaves will apply transactions.  The role of
>>> the master is to provide local transactions in serial order.  Looking across
>>> a set of masters you then have a set of easily identifiable streams of
>>> partially ordered transactions.  That in turn provides a mathematically
>>> tractable model for reasoning about things like update order, dependencies,
>>> and potential conflicts.  Any metadata you can supply about dependencies,
>>> e.g., which schema(s) transactions actually touch is useful but may or may
>>> not be relevant on slaves depending on the application.  It's up to slaves
>>> to make decisions on reordering.
>> 
>> But think about a three-level multi-master topology. Servers A and B are
>> masters for server C, which is itself a master for D. The DBA/application
>> developer will declare that transactions on A and B are independent, so C can
>> apply them in any order, and in parallel. Clearly on D we will want to also
>> apply them in parallel, and the only way we can do that is to use the
>> different server ids A and B in the stream of events in the binlog of C.
>> 
>> So now consider a simple two-level topology with just master C and slave
>> D. Why not still let C put different "server_id"s in different events, as a
>> way to tell server D that events are independent? For example, if the DBA runs
>> a long-running ALTER TABLE, he can ensure that no transactions touch this
>> table while it runs, and just SET SESSION SERVER_ID to something else, and it
>> can run in parallel on D without delaying normal transactions.
>> 
>> Or think about MySQL 5.6 style parallel replication, where the DBA declares
>> that on server A, transactions against different databases are to be
>> considered independent and applied in parallel on the slave. While on server B
>> such transactions are not independent. Now we want to setup slave C to
>> replicate from both A and B. Clearly, C is the wrong place to configure
>> whether transactions in different databases are to be considered independent -
>> this is a property of how the transactions are generated, not where they are
>> replicated. But if we just let the server_id of transactions on master A be a
>> hash of the affected database, then MySQL 5.6 style parallel replication just
>> falls out of multi-master for free, without any of their complexity.

> I have not ever seen this topology arise in practice within MySQL, as people
> tend to keep things pretty homogeneous.  It seems to arise mostly in data
> warehouse loading where multiple applications load data into the same store
> for reporting.  These are more heterogeneous, but parallel information from
> MySQL is of limited value here--data warehouses in Vertica or Hadoop don't
> have constraints and you can just jam data into them in parallel.

Ok.
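
Still, to put my idea above on record in a more concrete form: the DBA would
simply run the long ALTER TABLE under its own server_id, and a master could
derive the server_id of ordinary transactions from the database they touch.
Something along these lines, with all names invented for illustration and
again assuming a settable pseudo_server_id:

    import zlib

    BASE_ID, ID_RANGE = 1000, 1000   # keep generated ids away from real servers

    def id_for_database(db):
        # Collisions merely lose some parallelism; they never break correctness.
        return BASE_ID + zlib.crc32(db.encode()) % ID_RANGE

    def run_independent(cur, pseudo_id, statements):
        # 'cur' is a cursor on the master connection, as in the earlier sketch.
        cur.execute("SET SESSION pseudo_server_id = %d" % pseudo_id)
        for sql in statements:
            cur.execute(sql)

    # A long ALTER TABLE under its own id, so a slave can run it in parallel
    # with normal (independent) transactions:
    #   run_independent(cur, 999, ["ALTER TABLE big_table ADD COLUMN note TEXT"])
    #
    # Ordinary transactions tagged per database; a slave that treats distinct
    # server ids as independent streams then gets 5.6-style per-database
    # parallelism for free:
    #   run_independent(cur, id_for_database("shop"), shop_transaction_sql)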

>> It is just a matter of not artificially restricting transactions generated on
>> a given server to have just one server_id value. Which is in fact exactly what
>> you asked me about originally with SET SESSION SERVER_ID :)
>> 
>> On the other hand, the slave is free to try to discover more opportunities for
>> independence / parallelism if it can, and use them. But automatically
>> discovering opportunities for parallelism on the slave is a very hard problem
>> in general, as I am sure you agree ...

> Meanwhile, I don't like the idea of using server_id in the way you describe
> because it overloads the meaning.  Server_id just tells us which server
> actually committed the transaction in the first place.  The reason we need
> the session variable was to be able to signal this properly.  I was not
> intending to use it more broadly.

Ok, I've noted your concern. It would be interesting to learn how you use the
knowledge that one mysqld instance can have at most one server id in its
generated transactions.

> More generally, I believe that if you want to allow programmers to declare
> that things are independent it would be best to add SQL-level features.  For
> instance, Facebook has a patch that prevents transactions from spanning
> schemas.  This is a very useful feature, because it ensures that all updates
> in one schema are independent from others.  That in turn means you can
> parallelize without being afraid of deadlocks (a big problem in this space)
> or constraint violations.
>
> You could generalize this into the notion of a shard, which is a scope for
> independent transactions and presumably also referential integrity.  This
> would be a very useful feature for SQL databases.  I don't know if you have
> read it but Pat Helland has a great paper on this topic called "Life beyond
> Distributed Transactions: an Apostate’s Opinion."  It was visionary at the
> time (and still is) and has never been properly implemented in relational
> DBMS to my knowledge.

Well, independent schemas are nice, but not enough. There are many
applications that need the slave to apply transactions in parallel even within
a single table.

 - Kristian.