[mythtvnz] xmltv-proc-nz problem with MHEG epg data

Sat Jun 30 12:13:36 BST 2012

On Fri, 29 Jun 2012 22:33:02 +1200, you wrote:

>On 29/06/12 12:17, David Moore wrote:
>> On 29/06/12 04:46, Stephen Worthington wrote:
>>> I run xmltv-proc-nz 0.5.8 on my ChoiceTV data that I get from MHEG. At
>>> the moment, it has been having a problem with some characters in that
>>> data:
>>>
>>> Running xmltv-proc-nz on the Freeview data
>>> Movies: TMDB module not found.
>>> Traceback (most recent call last):
>>>    File "/usr/local/bin/xmltv-proc-nz", line 563, in <module>
>>>      tree = ElementTree.XML(data)
>>>    File "<string>", line 106, in XML
>>> cElementTree.ParseError: not well-formed (invalid token): line 2174,
>>> column 50
>>>
>>> The data causing the problem seems to be an e-acute character, 0xE9,
>>> in the word soufflés:
>>>
>>> <programme start="20120625140000 +1200" stop="20120625150000 +1200"
>>> channel="tv1.freeviewnz.tv">
>>> <title lang="eng">Celebrity Masterchef</title>
>>> <sub-title>New Season</sub-title>
>>> <desc>Temperatures rise higher than the soufflés as the determined
>>> celebrities strive to demonstrate their ability to cook great food to
>>> the judges.</desc>
>>> <category>series</category>
>>> <category>Education/Science/Factual</category>
>>> <rating system="SKY-NZ">
>>> <value>M</value>
>>> </rating>
>>> </programme>
>>>
>>> In order to fix this, I have had to get my epg script to add the
>>> following line to the top of the output file from mhegepgsnoop.py:
>>>
>>> <?xml version="1.0" encoding="cp1252"?>
>>>
>>> That specifies the character encoding to be cp1252 which permits
>>> characters such as 0xE9.  But I think it might be better to have
>>> mhegepgsnoop.py convert what it gets from MHEG and produce UTF-8
>>> encoded data, as that is the default encoding for XML files.
>>>
>>
>> Odd. The mhegepgsnoop.py code that writes the xml file specifies UTF-8 
>> encoding:
>>
>> ET.ElementTree(root_element).write(outfile, encoding="utf-8").
>>
>> Also I get "xmltvaaa.xml: text/plain; charset=utf-8" when I do "file 
>> -i xmltvaaa.xml". So maybe xmltv-proc-nz doesn't like UTF-8 extended 
>> characters? Or maybe I need to set the encoding attribute in the xml 
>> header? I had various issues with character encoding when writing 
>> mhegepgsnoop. I think one was myth didn't like the xml file before I 
>> specified UTF-8 encoding.
>
>So it seems that there might be a bug in the cElementTree parser, or at 
>least in the version you're running. Maybe it's auto-detecting the 
>character set as something other than UTF-8? The Unicode code point for 
>e-acute (é) is U+00E9, so no problem there unless you mean the actual 
>bytes in the file were a lone 0xE9 when they should be 0xC3 0xA9 in 
>UTF-8? FYI I had a play with xmltv-proc-nz and it handled the e-acute 
>character (and others) with no problems. It seems to output them as 
>HTML-style character entities, not raw UTF-8.
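
To make the byte-level distinction above concrete, here is a minimal
Python 3 sketch.  It is not taken from xmltv-proc-nz or mhegepgsnoop.py;
the sample document and the parses() helper are made up for
illustration, and it assumes the standard-library
xml.etree.ElementTree module behaves like the cElementTree in the
traceback.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document; only the description text matters here.
text = "<tv><desc>soufflés</desc></tv>"

utf8_bytes = text.encode("utf-8")      # é becomes the two bytes 0xC3 0xA9
cp1252_bytes = text.encode("cp1252")   # é becomes the single byte 0xE9


def parses(data):
    """Return True if ElementTree accepts the byte string as XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# No XML declaration: the parser assumes UTF-8.
utf8_ok = parses(utf8_bytes)
bare_cp1252_ok = parses(cp1252_bytes)  # a lone 0xE9 is not valid UTF-8

# Same cp1252 bytes, but with the encoding declared up front.
declared = b'<?xml version="1.0" encoding="cp1252"?>' + cp1252_bytes
declared_ok = parses(declared)

print(utf8_ok, bare_cp1252_ok, declared_ok)
```

Only the middle case should be rejected, which matches the "invalid
token" error in the traceback and explains why adding the cp1252
declaration made it go away.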

I have been delving further into my epg script, and it looks like I
was wrong about it being an encoding problem with mhegepgsnoop.py.  It
looks like the downloads from nzepg.org have changed recently.  They
used to have epgsnoop headers at the top of the file, something like
this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE tv SYSTEM "xmltv.dtd">
<tv generator-info-name="epgsnoop/0.84"
generator-info-url="http://launchpad.net/epgsnoop"
date="20120622215556 ">

Now all they have is:

<tv>

With no XML declaration, a parser now assumes UTF-8 encoding, but the
data was in fact not UTF-8 at the time I was having that problem, so my
script was getting invalid e-acute bytes.  That seems to have changed
again in the last few days, and now the data that had the e-acute
characters in it has them encoded as HTML:

	<programme channel="tv1.freeviewnz.tv" start="20120706140000
+1200" stop="20120706150000 +1200">
		<title lang="eng">Celebrity Masterchef</title>
		<desc>Temperatures rise higher than the soufflés
as the determined celebrities strive to demonstrate their ability to
cook great food to the judges.</desc>
	</programme>

The HTML encoding is carrying through to the EPG visible in MythTV and,
as MythTV does not do HTML, you see the raw entity text in place of the
"é" in "soufflés", which is not good.
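
One possible workaround on the receiving side, sketched below in
Python 3, is to decode any leftover HTML character references in the
text fields before MythTV sees them.  This is only an assumption about
how the data could be patched up, not something xmltv-proc-nz or
nzepg.org is known to do.  The fix_entities() helper and the sample
string are hypothetical, and the sketch guesses that the stray text is
a numeric or named reference that has been escaped a second time.

```python
import html
import xml.etree.ElementTree as ET


def fix_entities(root, tags=("title", "sub-title", "desc")):
    """Decode leftover HTML character references in selected text elements."""
    for elem in root.iter():
        if elem.tag in tags and elem.text:
            elem.text = html.unescape(elem.text)
    return root


# Hypothetical sample: "&amp;#233;" survives XML parsing as the literal
# text "&#233;", which is what would then show up on screen.
sample = (
    '<tv><programme channel="tv1.freeviewnz.tv">'
    '<title lang="eng">Celebrity Masterchef</title>'
    '<desc>higher than the souffl&amp;#233;s</desc>'
    '</programme></tv>'
)

root = fix_entities(ET.fromstring(sample))
fixed_desc = root.find("programme/desc").text
print(fixed_desc)
```

Writing the tree back out with an explicit UTF-8 encoding afterwards
would then give MythTV the accented character itself.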

The loss of the epgsnoop headers and their "date=" field also broke my
EPG script in another way: it was using that date to compare the
downloaded EPG data to the latest epgsnoop-derived data I had on file.
Without that "date=", my script fell back to the local epgsnoop data,
and as I do not normally run epgsnoop, it kept getting the same old data
from last week and I was running out of EPG in MythTV.

I have fixed that now, so that if the epgsnoop "date=" field is
missing, I use the timestamp from the downloaded file.
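
The fallback described above could look roughly like this in Python 3.
It is a sketch rather than my actual script: the epg_timestamp()
function name is made up, and it assumes the "date=" value starts with
14 digits in YYYYMMDDhhmmss form, as in the old epgsnoop header.

```python
import os
import xml.etree.ElementTree as ET
from datetime import datetime


def epg_timestamp(path):
    """Return a timestamp for an XMLTV file.

    Uses the root element's date="YYYYMMDDhhmmss" attribute when present,
    otherwise falls back to the file's modification time.
    """
    root = ET.parse(path).getroot()
    date_attr = (root.get("date") or "").strip()
    if date_attr:
        # Only the first 14 digits are used; anything after them is ignored.
        return datetime.strptime(date_attr[:14], "%Y%m%d%H%M%S")
    return datetime.fromtimestamp(os.path.getmtime(path))
```

Comparing epg_timestamp() for the downloaded file against the value for
the local epgsnoop file then picks the newer data whether or not the
header is present.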