Thursday, 11 April 2013

Astro xmltv listings

For quite a while now, I've been pulling TV listings from the Astro website. At first this involved scraping some HTML and the scripts I had were pretty ugly and worked only for the channels I wanted. Then they added some RSS feeds, which made the scraping a bit easier and less prone to breaking every time they updated the look of the website (which was often). For a couple of years now, the RSS feed has been replaced with a JSON API, and I had a pretty good grabber written for that, but never got around to sharing it. Recently they switched to a dedicated server for the JSON API, and gave it an overhaul, so my script stopped working. Now that I've updated it to the new API, its a good time to share.

#!/usr/bin/perl -w

eval 'exec /usr/bin/perl -w -S $0 ${1+"$@"}'
    if 0; # not running under some shell


=head1 NAME

tv_grab_my_astro - Grab TV listings for Malaysia using Astro JSON feeds.


tv_grab_my_astro --help

tv_grab_my_astro --configure [--config-file FILE] [--gui OPTION]

tv_grab_my_astro [--config-file FILE] 
           [--days N] [--offset N]
           [--output FILE] [--quiet] [--debug]

tv_grab_my_astro --list-channels [--config-file FILE]
           [--output FILE] [--quiet] [--debug]

Output TV and listings in XMLTV format for channels available on
Malaysia's Astro subscription satellite service.

First you must run B<tv_grab_my_astro --configure> to choose which stations
you want to receive.

Then running B<tv_grab_my_astro> with no arguments will get a listings for
the stations you chose for five days including today.

=head1 OPTIONS

B<--configure> Prompt for which stations to download and write the
configuration file.

B<--config-file FILE> Set the name of the configuration file, the
default is B<~/.xmltv/tv_grab_my_astro.conf>.  This is the file written by
B<--configure> and read when grabbing.

B<--gui OPTION> Use this option to enable a graphical interface to be used.
OPTION may be 'Tk', or left blank for the best available choice.
Additional allowed values of OPTION are 'Term' for normal terminal output
(default) and 'TermNoProgressBar' to disable the use of Term::ProgressBar.

B<--output FILE> When grabbing, write output to FILE rather than
standard output.

B<--days N> When grabbing, grab N days rather than 5.

B<--offset N> Start grabbing at today + N days.  N may be negative.

B<--quiet> Suppress the progress-bar normally shown on standard error.

B<--debug> Provide more information on progress to stderr to help in

B<--list-channels>    Output a list of all channels that data is available
                      for. The list is in xmltv-format.

B<--version> Show the version of the grabber.

B<--help> Print a help message and exit.


If the grabber fails to download data for some channel on a specific day, 
it will print an errormessage to STDERR and then continue with the other
channels and days. The grabber will exit with a status code of 1 to indicate 
that the data is incomplete. 


The environment variable HOME can be set to change where configuration
files are stored. All configuration is stored in $HOME/.xmltv/. On Windows,
it might be necessary to set HOME to a path without spaces in it.


For information on supported channels, see

=head1 AUTHOR

Jason Rumney, jasonr -at- gnu -dot- org. This documentation and some code
copied from tv_grab_se_swedb by Mattias Holmlund, mattias -at-
holmlund -dot- se, which was in turn partially copied from tv_grab_uk
by Ed Avis, ed -at- membled -dot- com. Overall code based on
documentation at

=head1 BUGS

use strict;
use XMLTV;
use XMLTV::ProgressBar;
use XMLTV::Options qw/ParseOptions/;
use XMLTV::Configure::Writer;

use JSON;
use DateTime;
use Date::Parse;
use Date::Format;
use File::Path;
use File::Basename;
use LWP::Simple qw($ua get);
sub t;


use HTTP::Cache::Transparent;

# Although we use HTTP::Cache::Transparent, this undocumented --cache
# option for debugging is still useful, since it will always use a
# cached copy of a page, without contacting the server at all.
use XMLTV::Memoize; XMLTV::Memoize::check_argv('get');

my $default_root_url = '';
my $default_cachedir = get_default_cachedir();

my ($opt, $conf) = ParseOptions( {
   grabber_name => "tv_grab_my_astro",
   capabilities => [qw/baseline manualconfig apiconfig/],
   stage_sub => \&config_stage,
   listchannels_sub => \&list_channels,
   version =*> "0.1",
   description => "Malaysia (",
} );

#check config

if (not defined( $conf->{cachedir} )) {
    print STDERR "No cachedir defined in config file " .
	$opt->{'config-file'} . "\n" .
	"Please run the grabber with --configure.\n";
    exit 1;

if (not defined ( $conf->{'root-url'} )) {
    print STDERR "No root-url defined in config file " .
	$opt->{'config-file'} . "\n" .
	"Please run the grabber with --configure.\n";	

if (not defined ( $conf->{'channel'} )) {
    print STDERR "No channels selected in config file " .
	$opt->{'config-file'} . "\n" .
	"Please run the grabber with --configure.\n";
    exit 1;

# Astro webserver is slow to generate responses, with lots of little
# JSON requests, so initialise the caching to use a 12hr timeout.
init_cachedir( $conf->{cachedir}->[0] );
HTTP::Cache::Transparent::init( {
    BasePath => $conf->{cachedir}->[0],
    NoUpdate => 12*3600,
    Verbose => $opt->{debug},
    ApproveContent => sub { return $_[0]->is_success },
} );

# Get the actual data and print it to stdout.

my ($xmldecl, $channels) = load_channels( $conf->{'root-url'}->[0] );

my ($odoc, $root );
my $warnings = 0;

binmode STDOUT, ":utf8";

write_header( $xmldecl );

write_channel_list( $conf->{channel} );

my $now = DateTime->today();
t $now;
my $date = $now;
$date = $now->add_duration($DateTime::Duration->new( days => $opt->{offset}))
    if ( $opt->{offset} );

my $bar = undef;
$bar = new XMLTV::ProgressBar( {
    name => 'downloading listings',
    count => $opt->{days} * @{$conf->{channel}},
} ) if (not $opt->{quiet}) && (not $opt->{debug});

for ( my $i=0; $i < $opt->{days}; $i++ )
    t "Date: $date";
    foreach my $channel_id (@{$conf->{channel}})
	# We already warned the user if the channel doesn't exist.
	if ( exists $channels->{$channel_id} )
	    t "  $channel_id";
	    my ( $ch_name, $ch_code, $url, $ch_num, $ch_cat, $ch_astroid ) = @{$channels->{$channel_id}};
	    print_data( $url, $channel_id, $date, $ch_astroid, $ch_cat )
		or warning( "Failed to download data for $channel_id on " .
			    UnixDate( $date, "%Y-%m-%d" ) . "." );
	$bar->update() if defined( $bar );
    $date = $date->add_duration(DateTime::Duration->new(days => 1));

$bar->finish() if defined( $bar );


# Signal that something went wrong if there are warnings.
exit(1) if $warnings;

t "Exiting without warnings.";

sub t
    my( $message ) = @_;
    print STDERR $message . "\n" if $opt->{debug};

sub warning
    my( $message ) = @_;
    print STDERR $message . "\n";

sub list_channels
    my ( $conf, $opt ) = @_;

    ( $xmldecl, $channels ) = load_channels( $conf->{'root-url'}->[0] );

    my $result="";
    my $fh = new IO::Scalar \$result;
    my $oldfh = select( $fh );

    write_header( $xmldecl );
    write_channel_list ( [sort keys %{$channels}] );
    select( $oldfh );

    return $result;

sub config_stage
    my( $stage, $conf ) = @_;

    die "Unknown stage $stage" if $stage ne "start";

    my $result;
    my $writer = new XMLTV::Configure::Writer( OUTPUT => \$result,
					       encoding => 'utf-8' );
    $writer->start( { grabber => 'tv_grab_my_astro' } );
    $writer->write_string( {
	id => 'root-url',
	title => [ [ 'Root URL for grabbing data', 'en' ] ],
	description => [
	    [ 'The file at this URL describes which channels are available ' .
	      'and where data can be found for them. ', 'en' ] ],
	default => $default_root_url,
    } );
    $writer->write_string( {
	id => 'cachedir',
	title => [ [ 'Directory to store the cache in', 'en' ] ],
	description => [
	    [ 'tv_grab_my_astro uses a cache with files that it has already ' .
	      'downloaded.  Please specify where the cache shall be stored. ',
	      'en' ] ],
	default => $default_cachedir,
    } );

    $writer->end( 'select-channels' );
    return $result;

sub get_default_cachedir
    my $winhome = $ENV{HOMEDRIVE} . $ENV{HOMEPATH}
    if defined( $ENV{HOMEDRIVE} )
	and defined( $ENV{HOMEPATH} );

    my $home = $ENV{HOME} || $winhome || ".";
    return "$home/.xmltv/cache";

sub init_cachedir
    my ( $path ) = @_;
    if ( not -d $path )
	mkpath( $path ) or die "Failed to create cache-directory $path: $@";

sub load_channels
    my ($baseurl) = @_;
    my %channels;
    my $xmldecl = "<\?xml version='1.0' encoding='utf-8'\?>\n";
    my $full_url = $baseurl . "/pack?format=jsonp";

    my $download = get( $full_url );
    my $json = new JSON;
    my $chan_data = $json->allow_nonref->utf8->relaxed->decode($download);
    $chan_data = $chan_data->{'services'};

    foreach my $ch (@{$chan_data})
	    my $ch_id = $ch->{'service_id'};
	    my $ch_name = $ch->{'title'};
	    my $ch_code = $ch->{'service_key'};
	    my $ch_num = $ch->{'channel_number'};
	    my $ch_cat = $ch->{'channel_category'} || 'unknown';
	    my $ch_uri = "$ch_num.$";
	    $ch_uri =~ s/ /_/g;
	    $ch_name = xml_escape($ch_name);
	    $ch_uri = lc($ch_uri);
	    $channels{$ch_uri} = [
		$ch_name, $ch_code, $baseurl, $ch_num, $ch_cat, $ch_id ];
    # TODO: Parse from astro website

    return ($xmldecl, \%channels);

sub print_data
    my( $url, $channel_id, $date, $astroid, $category ) = @_;
    my $start_date = $date->ymd('-') . 'T00:00';
    my $end_date = $date->ymd('-') . 'T23:59';
    my $full_url = "$url/guide/start/$start_date/end/$end_date/channels/$astroid?format=jsonp";
    my $download = get( $full_url );
    my $json = new JSON;

    my $schedule = $json->allow_nonref->utf8->relaxed->decode($download);
    if ($schedule && ref($schedule) eq "HASH")
	    $schedule = $schedule->{'guides'};
	    foreach my $programme (@{$schedule})
		    my $event_id = $programme->{'event_id'};
		    if ($event_id != 0)
			    my $prog_id = $programme->{'program_id'};
			    my $title = $programme->{'name'};
			    my $episode_title = $programme->{''};
			    my $episode_id = $programme->{'group_order_num'};
			    my $desc = $programme->{'description'};
			    my $startTime = $programme->{'display_datetime_utc'};
			    my $stopTime = $programme->{'display_datetime_end_utc'};
			    my $start = str2time($startTime);
			    my $stop = str2time($stopTime);
			    $start = time2str('%Y%m%d%H%M%S %z', $start);
			    $stop = time2str('%Y%m%d%H%M%S %z', $stop);
			    # XML escaping of fields that are likely to contain problem chars.
			    $title = xml_escape($title);
			    print "  <programme channel=\"$channel_id\" start=\"$start\" stop=\"$stop\">\n"
				. "    <title>$title</title>\n";
			    if ($episode_title)
				    $episode_title = xml_escape($episode_title);
				    print "    <sub-title>$episode_title</sub-title>\n";
			    if ($desc)
				    $desc = xml_escape($desc);
				    print "    <desc>$desc</desc>\n";
			    print "    <category>$category</category>\n";
			    print "    <url>$event_id?format=html</url>\n";
			    if ($episode_id)
				    print "    <episode-num system=\"onscreen\">$episode_id</episode-num>\n";
			    print "  </programme>\n";
    if ($@) {
	# Exception occurred above.  Add a comment to the xmltv file
	# about the error, and allow the details for other channels to
	# keep being processed.
	print "<!-- Error parsing channel listings for "
	    . $channel_id . "\n  from URL: " . $full_url
	    . "\n  Message: " . xml_escape($@)
	    . "\n-->";

    return 1;

sub write_header
    my ( $xmldecl ) = @_;

    print $xmldecl;
    print '<!DOCTYPE tv SYSTEM "xmltv.dtd">' . "\n";
    print '<tv>' . "\n";

sub write_channel_list
    my ( $channel_list ) = @_;

    # Write list of channels.
    t 'Writing list of channels.';

    foreach my $channel_id (@{$channel_list})
	if ( not exists $channels->{$channel_id} )
	    print STDERR "Unknown channel $channel_id." .
		" See" .
		" for a list of available channels or run" .
		" tv_grab_my_astro --configure to reconfigure.";

	my ( $ch_name, $ch_code, $namespace, $ch_num, $ch_cat, $ch_astroid ) = @{$channels->{$channel_id}};
	print "  <channel id=\"$channel_id\">\n"
	    . "    <display-name>$ch_name</display-name>\n"
	    . "    <display-name>$ch_num</display-name>\n"
	    . "    <icon src=\"$ch_num.png\"/>\n"
	    . "    <url>$ch_astroid?format=html</url>\n"
	    . "  </channel>\n";

sub write_footer
    print "</tv>\n";

sub xml_escape()
    my ($string) = @_;
    if ($string)
	    $string =~ s/&/&amp;/g;
	    $string =~ s/</&lt;/g;
	    $string =~ s/>/&gt;/g;
	    $string =~ s/"/&#34;/g;
	    # Some of the feeds come in as ISO8859-1 and I can't figure out how
	    # to get them encoded properly as UTF-8 in perl, so convert all non-ASCII
	    # to character references.
	    $string =~ s/([^\x{20}-\x{7E}])/'&#' . ord($1) . ';'/gse;
    return $string;
### Setup indentation in Emacs
## Local Variables:
## perl-indent-level: 4
## perl-continued-statement-offset: 4
## perl-continued-brace-offset: 0
## perl-brace-offset: -4
## perl-brace-imaginary-offset: 0
## perl-label-offset: -2
## cperl-indent-level: 4
## cperl-brace-offset: 0
## cperl-continued-brace-offset: 0
## cperl-label-offset: -2
## cperl-extra-newline-before-brace: t
## cperl-merge-trailing-else: nil
## cperl-continued-statement-offset: 2
## End:


Chris said...

Thanks for this.
I actually use a VB app that does this, but notice that the api call doesn't get shows that finish on the next day (i.e. moves that start at 10.30 and finish at 1.30am)
Do you have the same issues?

Jason Rumney said...
This comment has been removed by the author.
Chris said...

Hmm. ok. I tried getting more than 1 day, but it seems to only return a single day. I will revert back to that and see if I can get it to work again.

thanks. :)

Jason Rumney said...

[deleted by accident because Blogger showed me two copies] Yes I get warnings about that all the time too, but I think it only causes a problem for the last day in the schedule. I tend to fetch 7 or 14 days at a time and update daily, so I never have a problem in practice.

Jason Rumney said...

To clarify; my tv_grab script fetches one day at a time, then use tv_sort to combine the next 7 days into a single schedule.

Chris said...

Thanks Jason.
when I tried to do day by day, i cant get any shows that span midnight.
i.e. if it starts at 11pm, and the endtime is next morning, the shows doesnt appear in either day.
Do you see this happening?

If i do a multi-day select I do get those shows, but the response to the multi day call seems to be hit and miss?

Jason Rumney said...

When I check now, you're right, the last program on each channel every day is missing. In my previous version, it used to work, so I guess they must be missing from the JSON feeds.

Jason Rumney said...

Astro seem to have the same problem, so it looks like it is a problem with the API. I've tried getting the data spanning multiple days, specifying shorter times, but it looks like the API just gives you one day of programs (excluding any that start before, or end on or after midnight) whatever parameters you give it.

Chris said...

Yeah, thanks Jason.
Glad its not my code that's the problem :)
Guess, I will try first for a full week, and If i only get a day, then loop through each day instead.

Unknown said...

Hello Jason,
I would like to know more about the Malaysian EPG data. Please let me know how to get in touch with you.


Jason Rumney said...

I haven't had time to update the script lately and it is failing to get listings for about half the channels. You can email me specific questions at jasonrumney AT