NAME

podcatcher.pl -- a highly configurable command line podcatcher

SYNOPSIS

# Download new episodes from the podcasts listed in podconfig.txt.
podcatcher.pl podconfig.txt

# Copy new episodes of the podcasts listed in podconfig.txt to the mp3 player.
podcatcher.pl --copy podconfig.txt

# Download new episodes and write only to the log file.
podcatcher.pl --quiet --logfile podcatcher-log.txt podconfig.txt

# Download new episodes, sleeping an average of 120 seconds after each
# file downloaded.
podcatcher.pl --sleep 120 podconfig.txt

# Save the title and description of each episode from the RSS feed to an html
# file.
podcatcher.pl --description podconfig.txt

# Run in verbose debug mode.
podcatcher.pl --debug 2 podconfig.txt

# Get help.
podcatcher.pl --help

DESCRIPTION

podcatcher.pl reads a configuration file consisting of a list of podcasts and attributes for each. Depending on the mode, it either downloads new episodes of each podcast, or copies new episodes to the mp3 player and then moves them from the download directories to long-term storage directories. The default mode is to download new episodes; run in copy mode by setting the -c or --copy command line option.

Global settings can be set on the command line; most can also be set in the configuration file, although they won't take effect until that block of the configuration file is read. In most cases that won't matter, but e.g. using the --debug command line option may print some startup messages that aren't printed if you set debug = 1 at the beginning of the config file.

COMMAND LINE OPTIONS

-d --debug

Turn on debug mode. Followed by 1 or 2 for moderate or verbose messages.

-s --sleep

Number of seconds to sleep after downloading each episode.

-D --description

Turn on saving title/description of each episode to an html file.

-l --logfile

Specify where to write the output log. A %s in the filename will be replaced by today's date.

-q --quiet

Write only to the logfile, not to the terminal.

-c --copy

Instead of downloading new episodes, copy new files to player and move them to storage.

-e --extensions

A comma-separated list of file extensions for podcast episodes. Defaults to mp3,m4a,oga.

-a --agent

The User-Agent header to use for HTTP requests.

-h --help

Get brief help.

-m --manual

Get long help.

Example

I run this script in a daily cron job to download each day's podcast episodes:

#! /bin/bash
LOG=/home/jim/Downloads/wrap-podcatcher-log-`date +%Y-%m-%d`.txt

pushd /home/jim/Downloads

/home/jim/Documents/scripts/crontab-stuff/podcatcher.pl \
--description \
--logfile /home/jim/Downloads/podcatcher-log-%s.txt \
--sleep 60  \
/home/jim/Documents/podconfig.txt \
>> $LOG 2>&1

popd

The redundant logging in the wrapper script probably isn't necessary anymore, but it's helpful when first setting up a cron job if it doesn't work right the first time. If the podcatcher script won't run because of some Perl installation issue, a missing library, or invalid command line options, you'll see the error in the wrapper log.

CONFIGURATION FILE

'#' begins a comment. Each block of options, whether global or per-podcast, should be separated from the others by at least one blank line. Each option consists of a name, equals sign, and value; you can have optional whitespace at the beginnings of lines or around the equal sign.

If a block begins with the comment # TEMPLATE then that block will not be parsed and no error messages will be issued for it. That way we can have a template block with empty/default variable values that we can copy, paste, and edit when we want to add a podcast to the config file.
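
For example, a configuration file might open with a global block followed by a template block like this (the values are purely illustrative, using the variables documented below):

# global settings
sleep = 45
description = 1
logfile = /home/user/logs/podcatcher-%s.txt

# TEMPLATE
name =
url =
download_dir =
player_dir =
keep_dir =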

GLOBAL VARIABLES

These options should typically come at the beginning of the file, although you can turn them on for a particular podcast or group of podcasts and turn them off again for the rest of the file if you want to troubleshoot a podcast that's not working right (especially debug, logging, and description).

extensions

A comma-separated list of file extensions you want to download, defaulting to mp3,m4a,oga.

sleep

Number of seconds to sleep (on average, will vary randomly) after each download. Default is 30 seconds.

description

If set to a nonzero value, will save the title and description of each episode from the RSS feed to an html file.

agent

The User-Agent header to use for HTTP requests.

logfile

Set to the filename where you want log messages written, or an empty value to stop logging. Probably best to give this a full path, not using ~ as a synonym for your home directory.

If the logfile name contains a %s, that will be replaced with today's date.

quiet

Set to nonzero if you want output only written to the log file, not to the console.

debug

Set to 1 for moderately chatty debug messages and 2 for verbose debug messages.

PER-PODCAST VARIABLES

An example podcast configuration block:

name = Be the Serpent  # a podcast of extremely deep literary merit
url=https://feed.podbean.com/betheserpent/feed.xml
download_dir=/home/user/Downloads/betheserpent
limit=1/14            # download one episode every 14 days
replace=s/^/BtS__/
player_dir=/media/user/CLIP/Podcasts/
keep_dir = /home/user/podcasts/betheserpent
pause=0

REQUIRED VARIABLES

name

A freeform name for the podcast, used only in messages to the user.

url

The URL for the podcast's RSS feed.

download_dir

The directory to which to download the podcast. This must not be the same as the download directory for any other podcast. I recommend you use subdirectories under your $HOME/Downloads directory, one for each podcast.

If the directory doesn't exist yet the first time we download episodes for a given podcast, it will be created.

player_dir

The directory on the mp3 player (or phone) where new episodes of this podcast are to be copied. This can be the same for multiple podcasts, or all podcasts.

podcatcher.pl will not attempt to access this directory or keep_dir unless it's in copy mode, so you can download new episodes without having your mp3 player/phone mounted.

If the directory doesn't exist yet the first time we copy/move episodes for a given podcast, it will be created.

keep_dir

The directory where new episodes are to be moved to after they have been successfully copied to the mp3 player/phone. Again, this can be the same for multiple podcasts, but I recommend having one for each, possibly on an external hard drive. It must not be the same as the download directory for a given podcast, otherwise we'll keep copying the same episodes to the player every time we run in --copy mode.

If the directory doesn't exist yet the first time we copy/move episodes for a given podcast, it will be created.

OPTIONAL VARIABLES

pause

If pause is nonzero, the podcast will be skipped. This is equivalent to commenting out every line of the block, except that a log message will be printed about skipping the podcast.

limit

limit is the maximum number of episodes of a given podcast to download on a given run. If set to a fraction like N/D the catcher will only download N episodes every D days. (It calculates this by doing a modulus with the number of days since the epoch; it doesn't keep track of how many days it's been since you edited the config file to set the limit.)

If this isn't set, the podcatcher will attempt to download all new episodes of a given podcast every time it's run in download mode.
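
A minimal Perl sketch of one way the N/D day-modulus check might work (episodes_allowed_today is hypothetical, not the script's actual code):

use strict;
use warnings;

# Given a limit of "N/D", allow up to N downloads only on day-numbers
# (days since the epoch) that are multiples of D; a plain integer limit
# allows that many downloads on every run.  Illustrative only.
sub episodes_allowed_today {
    my ($limit) = @_;
    if ( $limit =~ m{^(\d+)/(\d+)$} ) {
        my ( $n, $d ) = ( $1, $2 );
        my $days_since_epoch = int( time() / 86400 );
        return ( $days_since_epoch % $d == 0 ) ? $n : 0;
    }
    return $limit;
}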

replace

replace must be one or more valid Perl s/// operators (separated by semicolons and whitespace) which will be applied to podcast filenames before saving them to disk (e.g., to prefix a consistent string to episodes of a podcast that names them ep1.mp3, EP_02.mp3, etc). Regexes can have any modifier except for /e, which could allow code injection problems. You must use forward slashes as delimiters.

I mostly use this to prefix a podcast name abbreviation to the filenames of podcasts which give them unhelpful names like "ep1.mp3" or whatever. E.g., the example above would turn "ep1.mp3" into "BtS__ep1.mp3". You can also use it to normalize the filenames in other ways, e.g., expanding "ep" to "episode" and making sure it's always lowercase, ensuring there's an underscore between the word "episode" and the number, etc. E.g.:

replace = s/ep(isode)?_*/episode_/i

will cause ep1.mp3, EPISODE__2.mp3, Ep_3.mp3 etc. to all be normalized to episode_1.mp3, episode_2.mp3, episode_3.mp3.

If a podcast's episodes sometimes have a prefix and sometimes don't, you can use negative lookahead to add it only when it's needed:

replace = s/^(?!WW-)/WW-/

will prefix "WW-" only to episode filenames that don't already have a WW- prefix.

reverse

If this is set to 1, download the episodes at the top of the RSS feed first. Default behavior is to download the episodes at the bottom first; usually RSS feeds are ordered from newest to oldest, but now and then you'll find a perverse feed that is ordered oldest to newest.

header

This lets you set individual HTTP headers for a given podcast. You can have multiple instances of this; its value should be a colon-separated string with the name of an HTTP header on the left and its value on the right. E.g.:

header = x-extra-header: stuff that should be ignored

You probably won't need this. I added it while trying to debug a stubborn podcast, but wound up fixing the problem by changing the LWP::UserAgent settings.

FILES

The configuration file is described above, as is the log file; either can have any name the user wishes. The other files created and used by the podcatcher:

Downloaded podcast episodes

Saved in the download_dir for each podcast, then copied to the player_dir and moved to the keep_dir.

Episode descriptions

Simple HTML files consisting of the episode title and description taken from the RSS feed. These are saved in the download_dir for each podcast and have the same filename as the episode, except for replacing .mp3, .oga etc. with .html.

Block lists

A text file named block-list.txt in each download_dir, listing the episodes we've already downloaded. You can edit this to delete some lines to make the podcatcher download them again (preferably not in the middle of a run, as that may cause unintended behavior).

Bad RSS files

If the podcatcher can't parse an RSS feed, it will save it in the download_dir to make debugging easier.

CHANGELOG

2023-09-08

Add -a --agent command line option and 'agent' configuration variable.

Escape regular expression metacharacters in the regular expression for http header names.

Fix invalid pod.

Add function prototypes.

DEPENDENCIES

This script requires Perl 5.14 or higher.

warnings, strict, constant, Getopt::Long, List::Util, File::Copy, File::Temp, Data::Dumper, and Pod::Usage are all in the standard library. LWP::UserAgent and XML::RSS::LibXML are in CPAN.

AUTHOR

Jim Henry III, http://jimhenry.conlang.org/software/

ACKNOWLEDGMENTS

Thanks to jwkrahn, AnomalousMonk, Your Mother, thomas895, and tobyink at perlmonks.org.

https://www.perlmonks.org/?node_id=11154272 https://www.perlmonks.org/?node_id=11120857

LICENSE

This script is free software; you may redistribute it and/or modify it under the same terms as Perl itself.

TO DO LIST

Add support for config variables that can be expanded in later directory variable settings. E.g.:

DL=/home/jim/Downloads
PLAYER=/media/jim/CLIP/Podcasts

....

name = somepodcast
download_dir = $DL/somepodcast
player_dir = $PLAYER/talk

Or maybe just let the user use environment variables in the config file?
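
If the environment-variable route were taken, the expansion might be as simple as this sketch (expand_env is hypothetical, not an existing function in the script):

use strict;
use warnings;

# Expand $NAME or ${NAME} in a config value from %ENV, leaving
# unrecognized names untouched.  Hypothetical sketch only.
sub expand_env {
    my ($value) = @_;
    $value =~ s/\$\{?(\w+)\}?/exists $ENV{$1} ? $ENV{$1} : "\$$1"/ge;
    return $value;
}

# e.g. with DL=/home/jim/Downloads set in the environment,
# expand_env('$DL/somepodcast') returns '/home/jim/Downloads/somepodcast'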

Would it be advantageous to delegate the downloading and saving to disk of the actual podcast episodes to `wget`? It can do much more thorough error checking and handling than anything I'm likely to be able to implement in a reasonable amount of time, and save files to disk progressively as they're downloaded, using up less memory when downloading large files.

If I don't do that, I should probably add more error checking to the episode downloading code.

-----

Should use MP3::Tag to check whether the files we've downloaded have metadata, and if not, supply it from the name config variable and the RSS feed title/description. For now I'm running a separate cron job to fix the metadata in files from Acatalepsis, which is the worst offender of those I'm currently listening to.

------

Check if a response was a redirect and log that. Ideally, we would update the config file with the new RSS URL if that is redirected or "moved permanently" or something, but that would require a refactoring of the main function where we first read the configuration file and then process each podcast in two separate loops (possibly two functions). In that case we should probably keep the original url= line but comment it out?

FUNCTIONS

usage()

Prints help message.

test_regex( regex )

Returns true if the arg is one or more valid s/// operators, separated by semicolons and optional whitespace, with any number of modifiers (but not /e), and contains nothing else (to prevent code injection attacks via the replace field).

Due to technical limitations, only slashes are allowed as delimiters. I tried matching with arbitrary delimiters (see the commented-out line) and it gave false negatives on any regex containing backreferences. It seems that \\1 in a negated character class doesn't match the character that the first parenthesis matched, but literal \ or 1.
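
For illustration, a validation along these lines might look like the following sketch (this is not the script's actual regex, and the exact set of accepted modifiers here is an assumption):

use strict;
use warnings;

# Accept one or more s/.../.../ operators with forward-slash delimiters,
# any modifiers except /e, separated by semicolons and optional whitespace,
# and nothing else.  Illustrative sketch only.
sub looks_like_safe_replace {
    my ($arg) = @_;
    my $subst = qr{ s/ (?: [^/\\] | \\. )* / (?: [^/\\] | \\. )* / [msixpogcdualnr]* }x;
    return $arg =~ m{\A \s* $subst (?: \s* ; \s* $subst )* \s* \z}x;
}

# looks_like_safe_replace('s/^/BtS__/')            # true
# looks_like_safe_replace('s/x/unlink("boom")/e')  # false: /e rejected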

set_extension_regex( extensions )

Take a comma-separated list of extensions, turn it into a pipe-separated alternation, and pre-compile it as a regular expression.
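
A minimal sketch of that transformation (the anchoring and case-insensitivity shown here are assumptions, not necessarily what the script does):

use strict;
use warnings;

# "mp3,m4a,oga" becomes a precompiled, case-insensitive alternation that
# matches the extension at the end of a filename.  Illustrative only.
sub build_extension_regex {
    my ($extensions) = @_;
    my $alternation = join '|', map { quotemeta } split /\s*,\s*/, $extensions;
    return qr/\.(?:$alternation)\z/i;
}

# my $ext_re = build_extension_regex('mp3,m4a,oga');
# 'episode_1.MP3' =~ $ext_re;   # true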

verify_dir( dirname )

Take a directory name. Add a trailing slash if it doesn't already have one. Test whether it exists and create it if necessary (but not if we would have to create more than one level of directory; this is likely to be a typo). Return the possibly modified directory name, or undef on failure to create dir.

E.g., if passed /home/jim/talk/newpodcast and newpodcast doesn't exist yet, but /home/jim/talk does, it will create the target and return /home/jim/talk/newpodcast/
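
A rough sketch of that behavior, assuming File::Basename::dirname and a single mkdir call (details are assumptions, not the script's actual implementation):

use strict;
use warnings;
use File::Basename qw(dirname);

# Return the directory name with a trailing slash, creating at most one
# missing level; return undef if the parent is also missing or mkdir fails.
sub verify_dir_sketch {
    my ($dir) = @_;
    ( my $clean = $dir ) =~ s{/+\z}{};        # drop trailing slashes for the checks
    return "$clean/" if -d $clean;
    return undef unless -d dirname($clean);   # more than one missing level: bail out
    return mkdir($clean) ? "$clean/" : undef;
}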

verify_limit( limit )

Take a podcast limit and check if it's an integer or a fraction. Return the limit if valid, undef if not.
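
For illustration, the validation might amount to something like the following sketch (not the actual code):

use strict;
use warnings;

# Accept a plain integer or an N/D fraction; anything else is invalid.
sub verify_limit_sketch {
    my ($limit) = @_;
    return $limit if defined $limit && $limit =~ m{^\d+(?:/\d+)?$};
    return undef;
}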

randsleep()

Sleep a random amount of time, varying from half our configured sleep time to one and a half times that. Avoid hammering servers too hard, especially if we're downloading a lot of episodes at once.
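
A minimal sketch of that calculation, assuming the configured base sleep time is passed in as seconds:

use strict;
use warnings;

# Sleep for a uniformly random duration between 0.5x and 1.5x the
# configured base time.  Illustrative sketch.
sub randsleep_sketch {
    my ($base) = @_;
    sleep int( $base / 2 + rand($base) );
}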

startlog()

Initialize the log file.

writelog( message )

Write a message to the log file, standard output, or both, depending on the logfile and quiet configuration variables.
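
A self-contained sketch of that dispatch, with the logfile name and quiet flag passed in explicitly (in the script these are configuration variables):

use strict;
use warnings;

# Write to the log file if one is configured, and to standard output
# unless quiet mode is on.  Illustrative sketch only.
sub writelog_sketch {
    my ( $message, $logfile, $quiet ) = @_;
    if ($logfile) {
        if ( open my $fh, '>>', $logfile ) {
            print {$fh} $message, "\n";
            close $fh;
        }
        else {
            warn "cannot open $logfile: $!";
        }
    }
    print $message, "\n" unless $quiet;
}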

copy_and_move()

Copy new episodes from the download directories to the MP3 player, then move them to the keep directories. Return the number of files successfully copied and moved.

save_description_to_file( episode_ref, save_dir, save_name )

Write the title and description of the episode to an html file in the download directory.

download_episodes( podcast hash )

Download new episodes of the podcast whose hash is passed as an argument. Return the number of episodes downloaded.

get_new_episodes( podcast hash )

Figure out which episodes are new, i.e. which we haven't already downloaded (by checking the block-list.txt file in the individual podcast download directory), then pass the modified hash with the new episode list to download_episodes().
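
The block-list check might be sketched as follows (the one-entry-per-line format follows the FILES section; the argument list and data structures are assumptions):

use strict;
use warnings;

# Return only the episodes not already listed in block-list.txt in the
# podcast's download directory.  Illustrative sketch only.
sub filter_new_episodes {
    my ( $download_dir, @episodes ) = @_;
    my %already_downloaded;
    if ( open my $fh, '<', "$download_dir/block-list.txt" ) {
        chomp( my @lines = <$fh> );
        @already_downloaded{@lines} = ();
        close $fh;
    }
    return grep { !exists $already_downloaded{$_} } @episodes;
}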

debug_headers( request )

Use Data::Dumper to write the HTTP headers from a request object to the log/standard output.

get_mp3_links_from_string()

Attempt to parse the RSS file using XML::RSS::LibXML, then check for podcast files in the <enclosure> tags; if none are found, check the <media:content> tags; and if none are found there, do a regular expression match for mp3 filenames. Save the title and description (if found in the enclosure or media:content tags) and return a reference to an array of episode hashes.

Discussion

Some RSS feeds mention each MP3 URL twice, in a <media:content> tag and in an <enclosure> tag. I started using uniq to get rid of the duplicates. I don't think we need to sort; sorting by pathname would probably get the episodes out of order (which might matter if we're downloading and listening to it gradually, one episode every few days) and sorting by filename exclusive of path probably would, for any podcast with inconsistent filenames (which seems to be the majority).

However, further testing with other podcasts indicates that <media:content> and <enclosure> might be different URLs for different places the same file is stored. Maybe one of them is a redirect to the other? Maybe we want to go back to checking for <enclosure> tags instead of all mp3 URLs regardless of what tag they occur in. That failed for some oddly-formatted podcasts, but we could have a fallback to the more greedy method of looking for URLs if the conservative method of looking at <enclosure> tags doesn't find any.
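
Conceptually, the <enclosure> part of that parsing might look like the following sketch, using the XML::RSS-compatible interface that XML::RSS::LibXML provides (the <media:content> and bare-URL fallbacks are omitted, and the field names used here are assumptions, not the script's actual data structures):

use strict;
use warnings;
use XML::RSS::LibXML;

# Parse an RSS string and collect the enclosure URL, title, and description
# of each item into an array of hashes.  Illustrative sketch only.
sub episodes_from_rss {
    my ($rss_text) = @_;
    my $rss = XML::RSS::LibXML->new;
    $rss->parse($rss_text);
    my @episodes;
    for my $item ( @{ $rss->{items} } ) {
        next unless $item->{enclosure} && $item->{enclosure}{url};
        push @episodes, {
            url         => $item->{enclosure}{url},
            title       => $item->{title},
            description => $item->{description},
        };
    }
    return \@episodes;
}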

check_rss_feed_via_wget( podcast hash )

Download the RSS feed using a system call to wget and check it for new episodes. Returns the number of episodes downloaded.

For now this is a fallback for when LWP::UserAgent gets an error trying to get an RSS feed; at some point I might make this the primary way to get RSS feeds and/or podcast files.

check_rss_feed( podcast hash )

Download the RSS feed using LWP::UserAgent::get, then call get_mp3_links_from_string() to check it for new episodes and download them. Returns the number of episodes downloaded.

main

Parse command line options, read the config file and parse it, verify the validity of the config variable values, and for each valid podcast block, call check_rss_feed() or copy_and_move(), then report number of episodes downloaded/copied.

Discussion

This should probably be refactored. The main function is over two hundred lines of code.

Possible ways to refactor, and advantages thereof:

1. Parse the config file in a while loop and save the podcast hashes in an array, then download or copy the podcasts in a for loop (which could be in new functions called by an if ( $copymode) else structure). Both of those could be separated out into functions called from main.

2. Iterate over the config file with a line-by-line while loop (not para-by-para). This would let us print line numbers for each parsing error, and line number ranges for problems like missing variables. We could also save the podcast line number range in each %podcast hash and let the download/copy functions report those line numbers when it would help the user debug an error (like a bad directory or regex). Of course, we currently print the podcast name whenever possible, so that might not be much more help.

3. Both