maria-discuss team mailing list archive
Re: How to deal with text pasted from Word?
On Wed, Jul 5, 2017 at 3:54 PM, David Karr <davidmichaelkarr@xxxxxxxxx> wrote:
> I've inherited a small webapp that is using MariaDB for persistence.
> Some of the forms have textarea fields for extended text to be entered.
> Someone reported an issue saving a form with some text that they had
> pasted from an email. The message started with this:
> Caused by: org.mariadb.jdbc.internal.util.dao.QueryException:
> Incorrect string value: '\xC2\x95\x09Onb...' for column 'ssimpact' at
> row 1
> I found where "Onb" is in the text, and right before it is a "bullet"
> character. So, this appeared to be a Unicode conversion issue. I
> tried pasting the same text after it had been passed to me, and it
> didn't fail. I'm pretty sure it didn't fail because that process of
> "passing it around" filtered the text to be all valid characters. The
> person who reported the problem said that when she just resubmitted
> it, it didn't fail. That might also point to a "cleansing" process
> that resulted in the submitted characters being legal.
> What are some reasonable strategies for getting this to work a little better?
Self-replying to add some more information.
I see from the output of "SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;"
that for my database, DEFAULT_CHARACTER_SET_NAME is "latin1" and
DEFAULT_COLLATION_NAME is "latin1_swedish_ci".
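The bytes quoted in the error message can be examined directly. The following is a minimal Python sketch, assuming the escaped bytes in the message are UTF-8 and that the bullet started out as the Windows-1252 byte 0x95 (both are assumptions, not something the error message states). It suggests the value that reached the database was the control character U+0095 rather than a real bullet, which would likely explain why a latin1 column rejected it.

```python
# Bytes copied from the error message: '\xC2\x95\x09Onb...'
raw = b"\xc2\x95\x09Onb"

# Assumption: the bytes are UTF-8, so decode them as such.
text = raw.decode("utf-8")
first = text[0]
print(f"U+{ord(first):04X}")  # U+0095, a C1 control character, not a bullet

# Assumption: the bullet was originally the Windows-1252 byte 0x95 and was
# misread as ISO-8859-1 somewhere between Word and the form submission.
recovered = first.encode("latin-1").decode("cp1252")
print(f"U+{ord(recovered):04X}")  # U+2022, the BULLET character
```

If that assumption holds, the text was already mis-decoded before it got to the database, so changing the column character set may not be the whole fix.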
When I created the database, I just did "create database <name>;".
I'm guessing that when I created this database, I should have added
"CHARACTER SET = 'utf-8'".
Now that my database is created and has data in it, if I do an
"alter table" on the tables that can hold this data, will that
properly convert the existing data and allow the insertion of
those "special" characters like bullets?
I would guess I would do something like this:
ALTER TABLE table_name CONVERT TO CHARACTER SET 'utf-8' COLLATE
I'm not certain about that collation name, but I noticed that the
"information_schema" database has the utf-8 charset, and the
"utf8_general_ci" collation name.
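For what it is worth, MariaDB spells its character set names without a hyphen: "utf8" (up to three bytes per character) and "utf8mb4" (full four-byte UTF-8), with matching collations such as "utf8_general_ci" and "utf8mb4_general_ci". Below is a minimal Python sketch that only builds the statements as strings so they can be reviewed before being run. The helper names, the "mydb" database name, and the choice of utf8mb4 are illustrative assumptions, and a backup before converting tables that hold data would be prudent.

```python
def quote_ident(name):
    """Quote an identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def table_conversion_sql(table, charset="utf8mb4", collation="utf8mb4_general_ci"):
    """Build a statement that converts a table's text columns and their data."""
    return (
        f"ALTER TABLE {quote_ident(table)} "
        f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
    )


def database_default_sql(database, charset="utf8mb4", collation="utf8mb4_general_ci"):
    """Build a statement that changes the default for tables created later."""
    return (
        f"ALTER DATABASE {quote_ident(database)} "
        f"CHARACTER SET {charset} COLLATE {collation};"
    )


# "mydb" and "table_name" are placeholders, not names from the real schema.
print(database_default_sql("mydb"))
for table in ["table_name"]:
    print(table_conversion_sql(table))
```

Changing the database default affects only tables created afterwards; existing tables need the per-table statement.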