mahara-contributors team mailing list archive

[Bug 1482410] Re: Leap2A import problem: "simplexml_load_file()... parser error : PCDATA invalid Char value..."

 

On further research I've decided to be more thorough and just whitelist
the allowed characters in XML, listed here:
https://en.wikipedia.org/wiki/Valid_characters_in_XML

I wound up using preg_replace() with the "/u" modifier to make it
Unicode-safe. The downside to this is that we read the entire file into
memory and then do preg_replace on it, but that shouldn't use too much
more memory, because we're already reading the entire file into memory
in order to use simplexml.
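
A minimal sketch of that whitelist approach follows. It is not the actual
patch: the function name strip_invalid_xml_chars() is hypothetical, and the
character class is simply the negation of the Char production in the XML 1.0
specification (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF).

```php
<?php
// Hypothetical helper, for illustration only; not the code from the Mahara patch.
function strip_invalid_xml_chars($xml) {
    // Match anything that is NOT an allowed XML 1.0 character.
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8.
    $invalid = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($invalid, '', $xml);
    if ($clean === null) {
        // With "u", preg_replace() returns null if the subject is not valid UTF-8.
        throw new Exception('Input is not valid UTF-8');
    }
    return $clean;
}

// Example input: a page title containing a vertical tab (0x0B).
$raw = "<title>Page\x0BTitle</title>";
$clean = strip_invalid_xml_chars($raw);

// The cleaned string can now be handed to SimpleXML.
$doc = simplexml_load_string($clean);
```

In a real import the string would presumably come from reading the whole
leap2a.xml file first, which is the memory cost mentioned above.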

I also discovered that htmlentities() can get rid of these invalid
characters if you pass it the flags ENT_XML1 | ENT_DISALLOWED. But
those flags were only added in PHP 5.4, and we still aim to support PHP
5.3. Plus, the best they can do is replace each invalid character with
the Unicode replacement character (U+FFFD), which will still show up as
a placeholder glyph. So, it's still better to just remove them entirely.
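
For comparison, here is a rough sketch of the htmlentities() route. It only
runs on PHP 5.4 or later, the sample title is made up, and it is shown on a
single text value rather than a whole document, since htmlentities() would
also escape the markup itself.

```php
<?php
// Requires PHP 5.4+: ENT_XML1 and ENT_DISALLOWED are not defined on PHP 5.3.
// Made-up sample value containing a vertical tab (0x0B).
$title = "Page\x0BTitle";

// ENT_DISALLOWED substitutes code points that are invalid for the
// document type (XML 1.0 here) instead of leaving them in place.
$escaped = htmlentities($title, ENT_XML1 | ENT_DISALLOWED, 'UTF-8');

// The vertical tab is gone, but a substitute character remains in its place.
$stillHasVt = strpos($escaped, "\x0B") !== false;
```

The result parses cleanly, but the title keeps a visible placeholder where
the control character used to be.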

-- 
You received this bug notification because you are a member of Mahara
Contributors, which is subscribed to Mahara.
Matching subscriptions: Subscription for all Mahara Contributors -- please ask on #mahara-dev or mahara.org forum before editing or unsubscribing it!
https://bugs.launchpad.net/bugs/1482410

Title:
  Leap2A import problem: "simplexml_load_file()... parser error : PCDATA
  invalid Char value..."

Status in Mahara:
  In Progress

Bug description:
  We had a report of a Mahara-generated Leap2a file that caused this
  crash stack upon attempting to import it:

  [WAR] 38 (import/leap/lib.php:126) simplexml_load_file(): /home/aaronw/dataroot/mahara/temp/import/admin-1438900425/extract/leap2a.xml:1808: parser error : PCDATA invalid Char value 11
  Call stack (most recent first):

      log_message("simplexml_load_file(): /home/aaronw/dataroot/mahar...", 8, true, true, "/home/aaronw/www/mahara/htdocs/import/leap/lib.php", 126) at /home/aaronw/www/mahara/htdocs/lib/errors.php:441
      error(2, "simplexml_load_file(): /home/aaronw/dataroot/mahar...", "/home/aaronw/www/mahara/htdocs/import/leap/lib.php", 126, array(size 2)) at Unknown:0
      simplexml_load_file("/home/aaronw/dataroot/mahara/temp/import/admin-143...", "SimpleXMLElement", 67584) at /home/aaronw/www/mahara/htdocs/import/leap/lib.php:126
      PluginImportLeap->read_leap2a_xml_file() at /home/aaronw/www/mahara/htdocs/import/leap/lib.php:147
      PluginImportLeap->build_default_load_mapping() at /home/aaronw/www/mahara/htdocs/import/leap/lib.php:164
      PluginImportLeap->process(1) at /home/aaronw/www/mahara/htdocs/import/index.php:245
      import_submit(object(Pieform), array(size 3)) at Unknown:0
      call_user_func_array("import_submit", array(size 2)) at /home/aaronw/www/mahara/htdocs/lib/pieforms/pieform.php:537
      Pieform->__construct(array(size 6)) at /home/aaronw/www/mahara/htdocs/lib/pieforms/pieform.php:164
      Pieform::process(array(size 6)) at /home/aaronw/www/mahara/htdocs/lib/pieforms/pieform.php:71
      pieform(array(size 6)) at /home/aaronw/www/mahara/htdocs/import/index.php:171
      print_upload_form() at /home/aaronw/www/mahara/htdocs/import/index.php:61

  Upon investigation it turned out that the Leap2A XML file had a
  Vertical Tab character (ASCII 0x0B, decimal 11) in one of the page
  titles. There is a whole range of ASCII control characters that will
  cause a parser error in SimpleXML, and if they're placed in a Mahara
  page title, they will be included in the exported Leap2A file, which
  will cause Mahara to crash when it attempts to import the file.

To manage notifications about this bug go to:
https://bugs.launchpad.net/mahara/+bug/1482410/+subscriptions

