[Templates] UTF8 support and issues

Mark Proctor m.proctor@bigfoot.com
Wed, 20 Nov 2002 12:30:46 -0000


I have managed to knock up a self-contained example, which I have
attached. An example input string is Descripción. You will need to
have XML::Simple installed to run it.

The example takes an input string and prints it twice: once concatenated
onto a string read from XML, and once on its own, exactly as submitted.
The mangling occurs only when you concatenate an XML string with a CGI
string.

I'm not sure why this happens, but here is a first attempt at a possible
theory. All XML parsing is done in UTF-8, but perl has no idea of the
encoding of incoming CGI streams and assumes them to be ISO-8859-1
(Latin-1) - I read this somewhere and don't know if it's correct. String
operations upgrade non-UTF-8 strings to UTF-8, so perl tries to convert
the CGI string from ISO-8859-1 to UTF-8, thus mangling it, as it is
already UTF-8.

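The theory above can be sketched without CGI or XML::Simple at all. This is
only an illustration, assuming Perl 5.8 or later, where the Encode module is
available (the attached script targets 5.6.1, which does not ship it). The
helper name join_and_encode is hypothetical, and the explicit encode() call
stands in for whatever happens when the page is printed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: join an XML-derived character string with a CGI
# value and return the UTF-8 bytes that would be sent to the browser.
sub join_and_encode {
    my ($xml_text, $cgi_value) = @_;
    my $joined = $xml_text . " " . $cgi_value;
    return encode('UTF-8', $joined);
}

# Character string, standing in for what the XML parser returns.
my $xml_text = decode('UTF-8', "hello");

# "Descripción" as raw UTF-8 bytes, as a browser would post it.
# Perl has not been told these are UTF-8, so it sees 12 Latin-1 characters.
my $cgi_bytes = "Descripci\xC3\xB3n";

my $out = join_and_encode($xml_text, $cgi_bytes);

# Dump the result as hex so the doubled encoding is visible.
print join(' ', map { sprintf '%02X', ord($_) } split //, $out), "\n";
```

In the hex dump, the two bytes C3 B3 that made up the ó come out as the four
bytes C3 83 C2 B3, because each original byte was treated as a separate
Latin-1 character and encoded again.
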
Ivan - thank you for your example; I think it shows the same issue as
mine. I'm not sure how your fix would help with mine, as the
concatenation happens at the compiled template stage, and we would have
to change Template Toolkit to work with sprintf, which I expect is not
desirable. Is there some way to tag an incoming value as UTF-8 so that it
doesn't get mangled when it is upgraded during the concatenation?

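One possible way to "tag" an incoming value is to decode it from UTF-8 bytes
into a character string before it reaches any concatenation. The sketch below
assumes Perl 5.8 or later with the Encode module, and that the browser really
did submit UTF-8; the helper name cgi_utf8 is hypothetical, and this is not
something Template Toolkit or CGI.pm is known to do for you here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn a raw CGI value (assumed to be UTF-8 bytes)
# into a Perl character string.
sub cgi_utf8 {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

my $xml_text = decode('UTF-8', "hello");            # stands in for XML parser output
my $cgi_text = cgi_utf8("Descripci\xC3\xB3n");      # decoded: ó is one character

# Both operands are character strings now, so nothing is re-encoded.
my $joined = $xml_text . " " . $cgi_text;
my $out    = encode('UTF-8', $joined);

print length($cgi_text), " characters in, ", length($out), " bytes out\n";
```

With the value decoded up front, the concatenation inside a compiled template
would not need to change; only the point where parameters are read would.
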
Barry - I'm still trying to digest what you said; I'm about to start
reading through the Unicode site you linked to. How does this issue
relate to the two examples from Ivan and myself?

Thanks

Mark


Attachment: testUTF8.pl

#!/usr/local/perl5.6.1/bin/perl
# require dies by itself on failure; the original "|| die" bound to the
# string and could never run.
require "httpd-paths.ph";
use XML::Simple;
use CGI;
use CGI::Carp;
use utf8;

my $file = \*DATA;
my $xmlSimple = XML::Simple->new(searchpath => ".", forcearray => 0,
                                 parseropts => [ProtocolEncoding => 'UTF-8']);
my $xml = $xmlSimple->XMLin($file);

my $q = CGI->new;

print "Content-type: text/html; charset=utf-8\n\n";

print <<END;
<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">
</head>
<body>
END

print <<END;
<form action="/cgi-bin/testUTF8.pl" method="post" enctype="multipart/form-data">
<textarea name="text" style="width:400;height:200">
END

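# Fetch the posted value once.
my $text = $q->param('text');
if ($text) {
  my $out = '';
  $out .= $xml->{message};   # string that came from the XML parser
  $out .= " ";
  $out .= $text;             # CGI value appended to the XML string
  $out .= "\n";
  print $out;                # concatenated copy: this one shows the mangling
  print $text;               # value on its own, as submitted
}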

print <<END;
</textarea>
<br>
<input type="submit"><br><br>
</form>
END


# Dump the CGI environment for debugging.
foreach my $i (keys %ENV) {
 print "$i:$ENV{$i} <br>\n";
}

print <<END;
</body>
</html>
END


__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<buzz>
  <message>hello</message>
</buzz>