maria-discuss team mailing list archive

Re: How to deal with text pasted from Word?

 

On Wed, Jul 5, 2017 at 3:54 PM, David Karr <davidmichaelkarr@xxxxxxxxx> wrote:
> I've inherited a small webapp that is using MariaDB for persistence.
> Some of the forms have textarea fields for extended text to be
> entered.
>
> Someone reported an issue saving a form with some text that they had
> pasted from an email.  The message started with this:
> ------------------
> Caused by: org.mariadb.jdbc.internal.util.dao.QueryException:
> Incorrect string value: '\xC2\x95\x09Onb...' for column 'ssimpact' at
> row 1
> ----------------
>
> I found where "Onb" is in the text, and right before it is a "bullet"
> character.  So, this appeared to be a Unicode conversion issue.  I
> tried pasting the same text after it had been passed to me, and it
> didn't fail.  I'm pretty sure it didn't fail because that process of
> "passing it around" filtered the text to be all valid characters.  The
> person who reported the problem said that when she just resubmitted
> it, it didn't fail.  That might also point to a "cleansing" process
> that resulted in the submitted characters being legal.
>
> What are some reasonable strategies for getting this to work a little better?

Self-replying to add some more information.

I see from the output of "SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;"
that for my database, DEFAULT_CHARACTER_SET_NAME is "latin1" and
DEFAULT_COLLATION_NAME is "latin1_swedish_ci".
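
For reference, the database default is only one of three places where a
character set is recorded: tables and individual columns carry their own
settings, and the column setting is the one that applies to stored text.
A minimal sketch for checking all three, assuming a schema called "mydb"
(a placeholder name, not one from this thread):

```sql
-- "mydb" is a placeholder schema name used only for illustration.

-- Database-level default (the value reported by the SCHEMATA query).
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME = 'mydb';

-- Table-level default collation; the character set is the prefix
-- of the collation name (for example latin1 in latin1_swedish_ci).
SELECT TABLE_NAME, TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'mydb';

-- Column-level settings; non-text columns have NULL here, so skip them.
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'mydb'
  AND CHARACTER_SET_NAME IS NOT NULL;
```

If the last query still lists latin1 for a column such as 'ssimpact',
that column is the one rejecting the pasted text, whatever the database
default says.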

When I created the database, I just did "create database <name>;".

I'm guessing that when I created this database, I should have added
"CHARACTER SET = 'utf8'".

Now that my database is created and has data in it, if I do an
"alter table" on the tables that can hold this data, will that
properly convert the existing data and allow the insertion of
those "special" characters like bullets?

From https://mariadb.com/kb/en/mariadb/setting-character-sets-and-collations/
I would guess I would do something like this:
-----------------
ALTER TABLE table_name CONVERT TO CHARACTER SET 'utf8' COLLATE
'utf8_general_ci';
-----------
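
A sketch of how the creation-time and after-the-fact options relate,
under some assumptions: "mydb" and "table_name" are placeholders, and
the character set is spelled utf8 because that is the name MariaDB uses
for it (the hyphenated form is not a character set name the server
knows). Changing the database default affects only tables created
afterwards, so existing tables still need CONVERT TO, which rewrites
their character columns and converts the data already stored in them.

```sql
-- "mydb" and "table_name" are placeholder names for illustration.

-- For a brand-new database: set the default when creating it.
CREATE DATABASE mydb CHARACTER SET = 'utf8' COLLATE = 'utf8_general_ci';

-- For an existing database: change the default used by tables created
-- from now on. Tables that already exist are not touched by this.
ALTER DATABASE mydb CHARACTER SET = utf8 COLLATE = utf8_general_ci;

-- For each existing table: convert all character columns and the data
-- stored in them. This rewrites the table, so a backup first is prudent.
ALTER TABLE mydb.table_name
  CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

-- Check the outcome: the column and table definitions show the new
-- character set and collation.
SHOW CREATE TABLE mydb.table_name;
```

MariaDB's utf8 stores at most three bytes per character, which is
enough for the bullet; utf8mb4 is the variant to consider if characters
outside the Basic Multilingual Plane (emoji, for instance) may be
pasted in.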

I'm not certain about that collation name, but I noticed that the
"information_schema" database has the "utf8" charset and the
"utf8_general_ci" collation.

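The collation names a server accepts can be listed from the server
itself rather than inferred from another schema. A small sketch,
assuming a utf8-family character set is the target:

```sql
-- List the collations whose names start with "utf8". The pattern also
-- matches utf8mb4 collations (and utf8mb3 ones on servers that use
-- that name). The row whose Default column is 'Yes' is the collation
-- applied when no COLLATE clause is given for that character set.
SHOW COLLATION LIKE 'utf8%';
```
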
