[Templates] UTF8 support and issues
Ivan Kurmanov
kurmanov@openlib.org
Wed, 20 Nov 2002 06:09:09 +0200
--45Z9DzgjV8m4Oswq
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Friends,
On Tue, Nov 19, 2002 at 01:21:40PM -0800, Peter Guzis wrote:
> Try:
>
> my $legible_string = pack 'C*', (unpack 'U*',
> $message_mangled_by_xml_parser);
That doesn't help in my case, but I have another dirty work-around.
The problem is in perl 5.6.x, and I suppose we really should consult
the perl-unicode mailing list archives to see what is known about the
problem and how to handle it properly. It is a bit off-topic here, but I
think it is important for TT users, so I will continue.
I attach a script which reproduces the bug on my perl 5.6.1. It uses
XML::XPath, but I think XML::Parser would show the same behaviour if you
adapt the script appropriately. The script takes a string from a little
XML file in its __DATA__ section. First it extracts the string through
XML::XPath, then it reads the same data directly with the <> operator.
Then the script prints the two values one at a time. That works fine: the
output is the same. Then it prints them concatenated with the dot
operator. Oops, the second string gets UTF-8 encoded an extra time.
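For readers on a later perl (5.8 or newer, which ships the Encode
module), the effect can be sketched without an XML parser. This is only
an illustration under that assumption, not the attached script: $chars
stands in for what the parser returns and $bytes for the raw file data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes of the Russian word, as they would come from a plain filehandle.
my $bytes = "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# The same word as a character string (assumed to resemble the parser's output).
my $chars = decode('UTF-8', $bytes);

# Concatenation upgrades $bytes as if each byte were a Latin-1 character.
my $mixed = $chars . $bytes;

# Encoding the result for output encodes the second half a second time.
my $out = encode('UTF-8', $mixed);

printf "mixed: %d characters\n", length($mixed);   # 6 + 12 = 18
printf "out:   %d bytes\n",      length($out);     # 12 + 24 = 36
```

The second half of $out is twice as long as the original 12 bytes, which
is the double encoding described above.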
There is a work-around using the sprintf function together with a value
obtained from the parser. Stupid as it may be, see the source.
Another way is to use the bytes pragma. It helps in this test case, but my
experience shows that the bytes pragma is buggy and may break other
things. Once I had to create a special module just to use it, because it
didn't get along with something in the original one.
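As a possible alternative to the bytes pragma on later perls (again
assuming 5.8 or newer with the Encode module, so not applicable to
5.6.1 as such), one can decode the raw data into characters before
concatenating and encode once on output. A minimal sketch, where $raw
and $xmltext are stand-ins and not values taken from a real parser:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes, standing in for data read directly from the file.
my $raw = "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# Stand-in for the character string returned by the XML parser.
my $xmltext = decode('UTF-8', $raw);

# Decode the raw data too, so both operands are character strings.
my $othertext = decode('UTF-8', $raw);

# Concatenate characters, then encode exactly once for output.
my $out = encode('UTF-8', $xmltext . $othertext);

printf "out: %d bytes\n", length($out);   # 12 + 12 = 24
```

Both halves of $out are then the original 12 bytes, with no extra
encoding step.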
Ivan
P.S. The word in the XML is Russian for 'hello'.
--45Z9DzgjV8m4Oswq
Content-Type: application/x-perl
Content-Disposition: attachment; filename="demo.pl"
Content-Transfer-Encoding: 8bit
#!/usr/local/bin/perl
use strict;
use XML::XPath;

my $file = \*DATA;
my $xp = XML::XPath -> new( ioref => $file );
my $xmltext = $xp -> findvalue( '/doc/text()' );

seek ( $file, 0, 0 );

my $data = join '', <$file>;
$data =~ m/<doc>(.*)<\/doc>/g;

my $othertext = $1;

# use bytes; # that's a way to fix the problem, but is unsafe

print "--> separate print:\n";
print "xmltext = '$xmltext'\n";
print "othertext = '$othertext'\n";

print "--> concatenated (.) print:\n"
 . "xmltext = '$xmltext'\n"
 . "othertext = '$othertext'\n";

# a hack, which makes $processed a safe copy of $othertext:

my $temporary = sprintf( "%s%s", $xmltext, $othertext );
my $processed = substr ( $temporary, length( $xmltext ) );

print "--> workaround: \n";
print "temporary = '$temporary'\n";
print "processed = '$processed'\n";

__DATA__
<?xml version='1.0' encoding='utf-8'?>
<doc>привет</doc>
--45Z9DzgjV8m4Oswq--