Saving user content to XML aka How to Remove Invalid Characters in UTF-8
July 23, 2009
Error: Contains Non-UTF-8 Content… Bleh
When encoding user content to XML there are often issues… 99.9% of the time they stem from users copy-pasting from MS Word into text boxes, with all the odd characters being carried through. PHP will play dumb and happily save this, then whinge like a b’tch about displaying it.
A few quick fixes to apply when saving the content are as follows…
Say you're constructing your document as follows (simple example):
$doc = new DOMDocument();
$root = $doc->createElement('articles');
$root = $doc->appendChild($root);
//...

/* When you get to your body text (i.e. the user-entered bit) */
//BODY
$body = $doc->createElement('mybodytext');
$root->appendChild($body);

$bodytext = $_POST['bodytext']; // Take the POST data

/* Convert the string to the requested character encoding and strip all of the
   non-UTF-8 characters (http://uk3.php.net/manual/en/function.iconv.php) */
$bodytext = iconv("UTF-8", "UTF-8//IGNORE", $bodytext);

/* Standard convert to HTML entities. Name the charset explicitly: PHP versions
   before 5.4 default to ISO-8859-1 and would mangle multi-byte characters */
$bodytext = htmlentities($bodytext, ENT_COMPAT, 'UTF-8');

/* Stick it in a CDATA section (as there could be all sorts of stuff in there) */
$cdata = $doc->createCDATASection($bodytext);
$body->appendChild($cdata);
//... Rest of your content + save
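If you'd rather keep the stripping step in one place, here is a minimal sketch of it as a reusable helper. Note the assumptions: `strip_invalid_utf8` is a made-up name, not something from PHP or from the snippet above, and it swaps `iconv` for `mb_convert_encoding` with the substitute character set to `none`, because `//IGNORE` is reported to behave differently across iconv builds. It needs the mbstring extension, and `$input` is just a stand-in for `$_POST['bodytext']`.

```php
<?php
// Hypothetical helper (not from the original post): drops bytes that are
// not valid UTF-8. Assumes the mbstring extension is loaded.
function strip_invalid_utf8(string $text): string
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character('none');        // drop invalid bytes instead of substituting them
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // put the setting back
    return $clean;
}

// Usage sketch: $input stands in for $_POST['bodytext'].
$input = "Pasted text with a stray byte: \xFF end";

$doc  = new DOMDocument();
$root = $doc->appendChild($doc->createElement('articles'));
$body = $root->appendChild($doc->createElement('mybodytext'));
$body->appendChild($doc->createCDATASection(strip_invalid_utf8($input)));

echo $doc->saveXML();
```

Valid multi-byte characters pass through untouched; only the bytes that can't be decoded as UTF-8 are thrown away, so this is lossy by design.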
As well as this there is also
mb_convert_encoding($str, 'UTF-8', 'HTML-ENTITIES');
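This offers related functionality (http://us2.php.net/manual/en/function.mb-convert-encoding.php): as written, it decodes HTML entities into UTF-8 characters. To convert a string from ISO to UTF-8, pass the source encoding as the third argument instead, e.g. mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1').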
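For the ISO-to-UTF-8 case, a rough sketch might look like the following. Everything named here is an assumption for illustration: `to_utf8` is a hypothetical helper, and defaulting to `Windows-1252` is only a guess that suits Word-style curly quotes (pass `'ISO-8859-1'` if that is what your input really is). It needs the mbstring extension.

```php
<?php
// Hypothetical helper (not from the original post). Assumes mbstring is loaded
// and that any input which is not valid UTF-8 is in $assumedEncoding.
function to_utf8(string $text, string $assumedEncoding = 'Windows-1252'): string
{
    // Leave text alone if it already decodes cleanly as UTF-8.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise re-encode from the assumed legacy encoding.
    return mb_convert_encoding($text, 'UTF-8', $assumedEncoding);
}

// Usage sketch: curly quotes and an accented e as single Windows-1252 bytes.
$fromWord = "\x93caf\xE9\x94";
$utf8 = to_utf8($fromWord);
var_dump(mb_check_encoding($utf8, 'UTF-8'));
```

The validity check is a heuristic: a legacy-encoded string can occasionally happen to be valid UTF-8, so if you know the source encoding for certain, convert unconditionally instead.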