I wrote a Python scraper for the EPG data of the German TV stations broadcast by ARD. It is legal for private use. Here I share the code and my thoughts behind it.
Skip explanations and get the code.
Motivation
Now being rather satisfied with my Media Center in general, one thing was still missing: a complete EPG. For the DVB-T2 stations it is there, and I’d even say it is the best possible source for those stations: it is fast, most up to date, and can be grabbed at short intervals to accommodate last-minute changes to the TV schedule. However, for the HbbTV and the pure IPTV stations most or all EPG data was still missing. Looking around, it is not difficult to find EPG sources on the net, but none was both free and legal. So I decided to write my own scraper, which I offer here for your own (private!) use.
Legal? How Do I Know?
I asked ARD. I wrote an e-mail asking if it is OK to automatically consume their EPG web pages 2-3 times a day for private use, and they wrote back that it is. You may use the EPG data, including images, as long as you do not publish the data somewhere yourself. That’s nice! Well, I also pay for it with my mandatory German public TV fees, so I already felt somewhat entitled anyhow, but it’s nice to have it “official”.
Features
My scraper has the following features:
- Get the EPG data for the coming 14 days
- Convert them into XMLTV format (compliant with this DTD – see the small sketch after this list)
- Scrape the following information for each TV show:
- Start and end time (time zone/daylight savings time aware)
- VPS time (if available)
- Duration
- Title and subtitle
- Detailed description
- Credits (if available)
- Keywords
- Categories (derived from keywords)
- URL
- Video and audio properties
- Keeps a list of uncategorized keywords to help you keep the category mappings up to date
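To give an idea of the target format, here is a rough sketch – my own illustration, not taken from the grabber’s code – of how a single show could end up as an XMLTV <programme> element when built with lxml (channel id and content are made up):

from lxml import etree

prog = etree.Element("programme",
                     start="20190414165400 +0200",
                     stop="20190414180000 +0200",
                     channel="daserste.de")
etree.SubElement(prog, "title", lang="de").text = "Tagesschau"
etree.SubElement(prog, "sub-title", lang="de").text = "Example subtitle"
etree.SubElement(prog, "desc", lang="de").text = "Example description of the show."
etree.SubElement(prog, "category", lang="en").text = "News / Current affairs"
etree.SubElement(prog, "url").text = "https://www.daserste.de/"
print(etree.tostring(prog, pretty_print=True).decode())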
Limitations
- Some things are currently hard-coded that might better live in separate config files or be given as command line arguments
- Performance is very slow. I use the Beautiful Soup framework, which is definitely not the fastest on earth. Still, it is great work which saved me a lot of hassle, so thanks to the authors! Grabbing the 14 days of EPG data for the 18 available ARD TV stations takes some 4 hours on my Raspberry Pi 3 B+. But in the end: why bother? As long as it does not take days… Still, I’d strongly recommend running it on multi-core computers only (so e.g. not on older single-core Raspberries), unless they are otherwise idle; it may seriously impact the general performance of such an SBC. One possible mitigation is sketched right after this list.
- Stability: ARD may decide at any time to change their EPG layout and HTML code, so my scraper may break at any time. I think I will keep it up to date and adjust to changes within days, but since I won’t always have the time for that, don’t depend on it.
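One idea I have not implemented (just a sketch, and the tag/class names below are invented): you can tell Beautiful Soup to parse only the tags you actually search for via a SoupStrainer, which usually cuts parsing time noticeably:

from bs4 import BeautifulSoup, SoupStrainer

html = '<html><body><a class="program-link" href="/show/1">Show 1</a></body></html>'
# Only build soup objects for the links we care about (class name is made up)
only_program_links = SoupStrainer("a", attrs={"class": "program-link"})
soup = BeautifulSoup(html, "lxml", parse_only=only_program_links)
print([a["href"] for a in soup.find_all("a")])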
Requirements
You need
- python 3 (it will run on python 2, but be aware of the time conversion topic below; also, some minor code changes are required)
- Beautiful Soup (Raspbian: apt-get install python3-bs4 – or use pip)
- lxml (Raspbian: apt-get install python3-lxml – or use pip) – you can also use the built-in xml.etree.ElementTree. Main drawbacks: no DOCTYPE in the XML, and no “pretty_print” option, which makes the XML hard to read for humans. Also, without lxml you need to change the Beautiful Soup parser to html.parser, which makes it slower by a factor of ~2-4. As of now, I have not tested whether tvheadend accepts an XML file without DOCTYPE, but I’d be surprised if not.
If you plan to run this on LibreELEC or CoreELEC, requirements #1 and #3 are not fulfilled. Still, with code changes in a very few places it works fine there, too. I found that on a Le Potato the html.parser version is nearly as fast as, if not a bit faster than, the lxml version on a Raspberry Pi 3 B+.
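If you want one script that runs both with and without lxml, a minimal sketch of such a parser fallback could look like this (my illustration, not the exact code in the grabber):

from bs4 import BeautifulSoup

try:
    import lxml  # noqa: F401 – only imported to check availability
    SOUP_PARSER = "lxml"
except ImportError:
    SOUP_PARSER = "html.parser"  # built-in fallback, ~2-4x slower

soup = BeautifulSoup("<p>Hello</p>", SOUP_PARSER)
print(SOUP_PARSER, soup.p.text)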
Data Source
This page by Datenjournalist pointed me to the ARD EPG pages, which I use as data source. Using GET parameters, you can navigate to any date or TV station easily. Looking at the HTML response, you find links to a details page for each TV show in the program. The details page contains the information mentioned above in a sometimes more, sometimes less structured form, so with the Beautiful Soup framework and some search and split operations on strings it is possible to get what you need.
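In code, the fetch step boils down to something like the sketch below; the URL and its GET parameters are placeholders, not necessarily the real ones used by the ARD pages:

import urllib.request
from bs4 import BeautifulSoup

def fetch_program_page(url):
    # Download one day's program page (or a details page) and parse it
    with urllib.request.urlopen(url) as response:
        return BeautifulSoup(response.read(), "lxml")

# Hypothetical call for one station and one day:
# soup = fetch_program_page("https://programm.ard.de/TV/Programm/Sender?datum=2019-05-01&sender=...")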
Challenges
Most of the work was just diligence – fetch the pages in a browser, look at the HTML code, identify the tags and classes to search for, and adjust formats. Since ARD provides the description of each show already well formatted in a meta tag, even this was a piece of cake.
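For illustration, pulling a description out of a meta tag with Beautiful Soup looks roughly like this (I assume the common name="description" attribute here – check the actual page source for the tag ARD uses):

from bs4 import BeautifulSoup

html = '<html><head><meta name="description" content="Detailed description of the show."></head></html>'
soup = BeautifulSoup(html, "lxml")
meta = soup.find("meta", attrs={"name": "description"})
description = meta["content"].strip() if meta and meta.get("content") else ""
print(description)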
The Timezone/Daylight Savings Time Problem
The only thing I spent a whole evening getting right was the timezone/daylight savings time (DST) conversion. Looking at the time, datetime, pytz and other libraries, you’d think it’s straightforward, but it certainly is not! The problem is that the EPG times given on the ARD pages don’t come with timezone or DST information. That’s no surprise, since they are German, and all of Germany shares the same timezone: CET, or CEST during DST. However, I did not want to develop my own code to determine when it’s CET and when it’s CEST, and I was pretty sure that there are ready-made routines. I wanted to put in a timezone-agnostic time and get back a timezone- and DST-aware time. That this is not as simple as you might think becomes clear if you run the following code in python 2 and python 3:
import os
import time

os.environ['TZ'] = "Europe/Berlin"
Time = time.strptime("201905010513", "%Y%m%d%H%M")
Timestamp = time.mktime(Time)
print(time.strftime("%Y%m%d %H%M %z %Z", time.localtime(Timestamp)))
print(time.strftime("%Y%m%d %H%M %z %Z", time.gmtime(Timestamp)))
Output in python 3:
20190501 0513 +0200 CEST
20190501 0313 +0000 GMT
Output in python 2:
20190501 0513 +0000 CEST
20190501 0313 +0000 CET
Wtf?!
The following code, however, works for both python 3 and python 2. Input is a date and time like 201904141654 (YYYYMMDDHHMM):
time.strftime('%Y%m%d%H%M00 %z', time.gmtime(time.mktime(time.strptime(ShowStart, '%Y%m%d%H%M'))))
The output will always look like 201904141454 +0000 – the time correctly converted to GMT/UTC. In my code linked below, however, I use
time.strftime('%Y%m%d%H%M00 %z', time.localtime(time.mktime(time.strptime(ShowStart, '%Y%m%d%H%M'))))
The output will look like 201904141654 +0200 – so the times stay in the local timezone, which I consider a minor advantage. Not really important… But be aware that this is problematic in python 2, so remember to adjust the code in that case!
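For comparison, here is a small sketch of the pytz route, which makes the CET/CEST decision explicit – not what my grabber uses, and it needs the extra pytz package:

from datetime import datetime
import pytz

berlin = pytz.timezone("Europe/Berlin")
naive = datetime.strptime("201904141654", "%Y%m%d%H%M")
aware = berlin.localize(naive)            # pytz picks CEST for this date
print(aware.strftime("%Y%m%d%H%M00 %z"))  # prints 201904141654 +0200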
Usage
The Code
Download it here. I tried to put in comments wherever helpful. The comments also include the necessary modifications to run it in python 2 and/or without lxml, so if you plan to use it with Kodi on a closed OS, that may help you.
Running it is totally straightforward:
python3 GrabARD.py
This will output ARD.xml in the current working directory. It will also maintain a file with a list of keywords that are not (yet) mapped to a category.
Things to Adjust
At the beginning of the code you’ll find the configuration section – it should be self-explanatory. For the two text files, see the text below.
#####################
### Configuration ###
#####################

# Keyword to category assignments file
CategoryAssignmentsFileName = "CategoryAssignments.txt"

# Where to store uncategorized keywords
UnknownKeywordsFileName = "UnknownKeywords.txt"

# Name of output file
XMLtvFileName = "ARD.xml"

# Number of days to scrape from EPG
DaysToGrab = 14

# Debug mode?
DebugMode = False

### End Configuration ###
Categories
Looking at the tvheadend web interface, it seems that only a limited, fixed list of categories (aka content types) is available there.
However, some people point out that epg.c in tvheadend can do more, and looking there, indeed subcategories exist. A bit of trial and error showed that each <category> tag in the XMLtv file must contain only the subcategory or the category, not both; this way, both tvheadend and Kodi get along well with it (a small sketch follows the list below). More precisely, the following categories/subcategories are in epg.c:
- Movie / Drama
- Detective / Thriller
- Adventure / Western / War
- Science fiction / Fantasy / Horror
- Comedy
- Soap / Melodrama / Folkloric
- Romance
- Serious / Classical / Religious / Historical movie / Drama
- Adult movie / Drama
- News / Current affairs
- News / Weather report
- News magazine
- Documentary
- Discussion / Interview / Debate
- Show / Game show
- Game show / Quiz / Contest
- Variety show
- Talk show
- Sports
- Special events (Olympic Games, World Cup, etc.)
- Sports magazines
- Football / Soccer
- Tennis / Squash
- Team sports (excluding football)
- Athletics
- Motor sport
- Water sport
- Winter sports
- Equestrian
- Martial sports
- Children’s / Youth programs
- Pre-school children’s programs
- Entertainment programs for 6 to 14
- Entertainment programs for 10 to 16
- Informational / Educational / School programs
- Cartoons / Puppets
- Music / Ballet / Dance
- Rock / Pop
- Serious music / Classical music
- Folk / Traditional music
- Jazz
- Musical / Opera
- Ballet
- Arts / Culture (without music)
- Performing arts
- Fine arts
- Religion
- Popular culture / Traditional arts
- Literature
- Film / Cinema
- Experimental film / Video
- Broadcasting / Press
- New media
- Arts magazines / Culture magazines
- Fashion
- Social / Political issues / Economics
- Magazines / Reports / Documentary
- Economics / Social advisory
- Remarkable people
- Education / Science / Factual topics
- Nature / Animals / Environment
- Technology / Natural sciences
- Medicine / Physiology / Psychology
- Foreign countries / Expeditions
- Social / Spiritual sciences
- Further education
- Languages
- Leisure hobbies
- Tourism / Travel
- Handicraft
- Motoring
- Fitness and health
- Cooking
- Advertisement / Shopping
- Gardening
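As a small sketch (in the spirit of the code, not copied from it), writing these into the XMLtv file with lxml then looks like this – one <category> tag per mapped name, never “Category / Subcategory” combined:

from lxml import etree

programme = etree.Element("programme")
for name in ["Documentary", "Talk show"]:   # either a category or a subcategory
    etree.SubElement(programme, "category", lang="en").text = name
print(etree.tostring(programme, pretty_print=True).decode())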
The ARD pages do not deliver any content types at all, just keywords. However, the keywords can be mapped to the content types. For this, at the beginning of the code a text file is read in that contains the keyword assignments for each category or subcategory – the general format is:
Category or Subcategory:
Keyword
Another Keyword
Other Category or Subcategory:
Yet another Keyword
# Comment
More Keywords
…
Uncategorized:
Keyword intentionally not assigned to a category
And another keyword to be ignored
…
From these assignments the categories are derived. Blank lines are OK, comment lines starting with # are ignored, and indentation is not mandatory. The “category” Uncategorized contains all keywords that you know but do not want to assign a category to. I also included a category “Special characteristics”, which I have seen somewhere in a tutorial, but which does not seem to have any meaning in tvheadend.
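An illustrative parser for such an assignment file could look like the sketch below; the function and variable names are mine, not necessarily those used in the actual script:

def read_category_assignments(filename):
    # Returns a dict mapping each keyword to its category/subcategory
    keyword_to_category = {}
    current_category = None
    with open(filename, encoding="utf-8") as f:
        for raw_line in f:
            line = raw_line.strip()
            if not line or line.startswith("#"):
                continue                      # skip blank and comment lines
            if line.endswith(":"):
                current_category = line[:-1]  # a new category block starts
            elif current_category is not None:
                keyword_to_category[line] = current_category
    return keyword_to_category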
You may download my category assignments file.
Also, have a look at the unknown and not yet categorized keywords the script writes to a file at the end: some may be worth adding to the mappings. The file is read in at the beginning of the script if it exists, so over time new keywords should accumulate there.
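The accumulation idea in a nutshell (again just a sketch with made-up names): merge this run’s unmapped keywords with whatever the file already contains, then write everything back:

def remember_unknown_keywords(keywords, keyword_to_category,
                              filename="UnknownKeywords.txt"):
    try:
        with open(filename, encoding="utf-8") as f:
            unknown = {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        unknown = set()                      # first run: start with an empty set
    unknown.update(k for k in keywords if k not in keyword_to_category)
    with open(filename, "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(unknown)) + "\n")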
The categories that come via DVB-T2 are different – the ARD seems to have different ideas about how to categorize their shows. I decided not to follow their scheme – I find my mappings make more sense. However, I still use DVB-T2 EPG wherever available, because having correct schedules is more important than having nice categories – which I ignore most of the time anyhow.
One last thing: Kodi has its own algorithm to pick, from the available categories, the one that determines the color in the EPG display. So if you want to make sure that a given category “wins”, only provide one category (this needs a code modification!). I did not bother.
Interaction With tvheadend
To get the XMLTV file into tvheadend, you first have to enable the external XMLTV grabber in tvheadend’s EPG grabber module settings.
Take note of the “Path” value shown there – we need it in a moment. It points to a socket into which the XMLTV content needs to be piped, e.g. by running netcat:
cat ARD.xml | sudo nc -w 5 -U /var/lib/hts/.hts/tvheadend/epggrab/xmltv.sock
So a way to go would be to create a small shell script:
#!/bin/bash
cd /home/pi/EPGgrabber
python3 GrabARD.py > LastGrab.log 2>&1
cat ARD.xml | sudo nc -w 5 -U /var/lib/hts/.hts/tvheadend/epggrab/xmltv.sock
(Don’t forget to chmod u+x it)
Now this can run as a cronjob – done!
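If you would rather do the push step from Python than via netcat, a minimal sketch using the socket module looks like this (same socket path as above; run it with sufficient permissions to write to the socket):

import socket

with open("ARD.xml", "rb") as xml_file, \
        socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
    sock.connect("/var/lib/hts/.hts/tvheadend/epggrab/xmltv.sock")
    sock.sendall(xml_file.read())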
Things to Improve
Here’s my ToDo list to make it better – I will work through it on no specific schedule:
- Remember uncategorized keywords in a file so they accumulate over the days – done
- Make scraping date and days to scrape command line arguments
- Error handling (none yet…)
- Debug mode (no output otherwise) – done
- Put keyword-category mappings in a config file – done
- Improve categories to match epg.c from tvheadend – first thought this does not make sense, but it DOES make sense – and done…
- Find ways to make it faster
Alternatives
Of course, when you’re done with your project, you suddenly find another one that makes it kind of obsolete… At least xmltv.se seems to be a good source – though I cannot tell whether it is legal. Also, some channels are missing, and they provide a bit less/different information compared to my scraper… Still, worth a look! And they cover other TV stations as well.