[mythtvnz] xmltv-proc-nz problem with MHEG epg data
David Moore
dmoo1790 at ihug.co.nz
Fri Jun 29 11:33:02 BST 2012
On 29/06/12 12:17, David Moore wrote:
> On 29/06/12 04:46, Stephen Worthington wrote:
>> I run xmltv-proc-nz 0.5.8 on my ChoiceTV data that I get from MHEG. At
>> the moment, it has been having a problem with some characters in that
>> data:
>>
>> Running xmltv-proc-nz on the Freeview data
>> Movies: TMDB module not found.
>> Traceback (most recent call last):
>> File "/usr/local/bin/xmltv-proc-nz", line 563, in <module>
>> tree = ElementTree.XML(data)
>> File "<string>", line 106, in XML
>> cElementTree.ParseError: not well-formed (invalid token): line 2174,
>> column 50
>>
>> The data causing the problem seems to be an e-acute character, 0xE9,
>> in the word soufflés:
>>
>> <programme start="20120625140000 +1200" stop="20120625150000 +1200"
>> channel="tv1.freeviewnz.tv">
>> <title lang="eng">Celebrity Masterchef</title>
>> <sub-title>New Season</sub-title>
>> <desc>Temperatures rise higher than the soufflés as the determined
>> celebrities strive to demonstrate their ability to cook great food to
>> the judges.</desc>
>> <category>series</category>
>> <category>Education/Science/Factual</category>
>> <rating system="SKY-NZ">
>> <value>M</value>
>> </rating>
>> </programme>
>>
>> In order to fix this, I have had to get my epg script to add the
>> following line to the top of the output file from mhegepgsnoop.py:
>>
>> <?xml version="1.0" encoding="cp1252"?>
>>
>> That specifies the character encoding to be cp1252 which permits
>> characters such as 0xE9. But I think it might be better to have
>> mhegepgsnoop.py convert what it gets from MHEG and produce UTF-8
>> encoded data, as that is the default encoding for XML files.
>>
>
> Odd. The mhegepgsnoop.py code that writes the xml file specifies UTF-8
> encoding:
>
> ET.ElementTree(root_element).write(outfile, encoding="utf-8").
>
> Also I get "xmltvaaa.xml: text/plain; charset=utf-8" when I do "file
> -i xmltvaaa.xml". So maybe xmltv-proc-nz doesn't like UTF-8 extended
> characters? Or maybe I need to set the encoding attribute in the xml
> header? I had various issues with character encoding when writing
> mhegepgsnoop. I think one was myth didn't like the xml file before I
> specified UTF-8 encoding.
So it seems there might be a bug in the cElementTree parser, or at
least in the version you're running. Maybe it's auto-detecting the
character set as something other than UTF-8? The Unicode code point for
e-acute (é) is U+00E9, which UTF-8 encodes as the two bytes 0xC3 0xA9,
so there is no problem unless you mean the actual byte in the file was
0xE9 when it should have been 0xC3 0xA9. FYI, I had a play with
xmltv-proc-nz and it handled the e-acute character (and others) with no
problems. It seems to output extended HTML characters rather than UTF-8.
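
To check which case applies, here is a minimal sketch, assuming Python 3 and the standard xml.etree.ElementTree module rather than the exact cElementTree build in the traceback. The RAW sample and the parses() helper are made up for illustration; they are not taken from xmltv-proc-nz or mhegepgsnoop.py. It feeds the parser a lone 0xE9 byte with no encoding declaration, then with the cp1252 declaration, then after transcoding to UTF-8 (which assumes the source bytes really are cp1252).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: "souffles" with e-acute stored as the single byte 0xE9.
RAW = b"<desc>souffl\xe9s</desc>"


def parses(data):
    """Return the element text if data is well-formed XML, else None."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


# No declaration: the parser assumes UTF-8, and a lone 0xE9 byte
# followed by an ASCII letter is not a valid UTF-8 sequence.
bare = parses(RAW)

# Workaround from the thread: declare the encoding as cp1252.
declared = parses(b'<?xml version="1.0" encoding="cp1252"?>' + RAW)

# Alternative: transcode to UTF-8 first (assumes the input is cp1252).
transcoded = parses(RAW.decode("cp1252").encode("utf-8"))

print(bare, declared, transcoded)
```

If the first case fails and the other two succeed on a sample cut from the real file, the file contains a raw 0xE9 byte rather than the UTF-8 pair 0xC3 0xA9.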