← Back to team overview

randgen team mailing list archive

Case sensitivity in Transform validator's TRANSFORM_OUTCOME_DISTINCT check

 

Hi,


It is probably known to those of you using the Transformer validator that some query transforms are not comparable to the original in every case. One such case is the ExecuteAsUnion transforms when using TRANSFORM_OUTCOME_DISTINCT against queries using varchar columns.

With this transform, results from the original query (e.g. SELECT `col_varchar_10_latin1` FROM L) are compared to results of transformed queries using UNION. For example,

            (  SELECT `col_varchar_10_latin1`  FROM L    )
  UNION ALL (  SELECT `col_varchar_10_latin1`  FROM L    )
  /* TRANSFORM_OUTCOME_SUPERSET */;

where the /* TRANFORM_OUTCOME_SUPERSET */ signals to the validator that results are OK if results from the transformed query is a superset of the original query. This is all well.

On the other hand, one other type of transformed UNION query is:

        (  SELECT `col_varchar_10_latin1`  FROM L    )
  UNION (  SELECT `col_varchar_10_latin1`  FROM L    )
  /* TRANSFORM_OUTCOME_DISTINCT */;

where the /* TRANFORM_OUTCOME_DISTINCT */ signals to the validator that results are OK if results from the transformed query after removing duplicates is equal to the results of the original query.

This works OK in most cases, but when dealing with text values in a column with a case-insensitive collation (default in MySQL) the validator reports false positives. This is because the validator's idea of "DISTINCT" is not equal to the the database server's idea of DISTINCT. The issue is visible with transforms such as "ExecuteAsUnion" and "Distinct".

I see this for example when running the RQG against MySQL with --validator=Transform, --grammar=conf/optimizer/range_access.yy and --gendata=conf/optimizer/range_access.zz.

I am interested in your opinions on what is the best way to deal with this, especially if you too are affected by this. For example:

a) Implement permanent support for case insensitive duplicate removal (if possible) in the Transform validator's DISTINCT checking?

b) Automatically skip (return STATUS_WONT_HANDLE) this transform altogether if columns in query are varchar or char?

c) Implement temporary hacks to disable e.g. the ExecuteAsUnion transform in the affected test runs?



To make the issue more clear, consider the following MySQL table:

CREATE TABLE t9 (a VARCHAR(10), b VARCHAR(10) COLLATE latin1_bin);

-- Here, column a has default collation (|latin1_swedish_ci|), while column b has case sensitive collation (latin1_bin).
-- Let's insert some data:

insert into t9 values ('hello', 'hello');
insert into t9 values ('v', 'v');
insert into t9 values ('i', 'i');
insert into t9 values ('I', 'I');
insert into t9 values ('m', 'm');

-- Mimicking the Transform validator, we execute first the original query and then the transformed query:

mysql> SELECT * FROM t9;
+-------+-------+
| a     | b     |
+-------+-------+
| hello | hello |
| v     | v     |
| i     | i     |
| I     | I     |
| m     | m     |
+-------+-------+
5 rows in set (0.00 sec)

-- Using the column with default collation, we get:

mysql> (SELECT a FROM t9) UNION (SELECT a FROM t9);
+-------+
| a     |
+-------+
| hello |
| v     |
| i     |
| m     |
+-------+
4 rows in set (0.00 sec)

-- The validator thinks this is a bad result, because the value "I" is missing. -- However, the result is correct, since duplicate rows were removed with case insensitivity in mind.

-- Using the column with the case sensitive collation:

mysql> (SELECT b FROM t9) UNION (SELECT b FROM t9);
+-------+
| b     |
+-------+
| hello |
| v     |
| i     |
| I     |
| m     |
+-------+
5 rows in set (0.00 sec)

-- The validator agrees this is an OK result, since it is equal to the original. -- The result is correct, since this was a column with case sensitive collation (no duplicates).


The problem here is that although the validator's isDistinct() check tries to remove duplicates from the result before comparing, it does this in a case sensitive manner ("i" is not the same as "I"). The UNION operator of MySQL, however, regards "i" the same as "I" and removes one of them from the result, as long as the column has default collation.

Let me know if you have suggestions or opinions.


--
John


Follow ups