PodHoarder - mudeth

PodHoarder is a set of python scripts to bulk-download podcasts.

It can also do a couple of other cool things:

Generate a new RSS feed pointing to your hoarded files so that you can transparently listen to the mirrored podcast in your player of choice.
Run optional, on-demand post-processing commands so that you could, for example, get sox to do automatic volume control, or speed up the audio. These can be global or per-podcast.
It's tested to run in Termux, so you could run it on your phone (this currently requires root, plus alias tsudo=sudo).

GPL3-Licensed

Downloads

Get Podhoarder here

Installation and initial setup
Usage
Post-processing
More options
Known Issues and Troubleshooting

Installation and initial setup

Prerequisites

python3
defusedxml library (pip install defusedxml --user)

Optional, if you want to re-host the podcast:

A web-server
PHP-CGI set up if you want on-demand post-processing.
sox, probably (sudo apt install sox libsox-fmt-mp3 on Debian)

Get the latest version of podhoarder and extract it to a directory of your choice. Run init_setup.py and it'll ask you to enter some paths:

cache_dir is where the podcast files will be downloaded to. Separate sub-directories will be created for each podcast. Separate RSS files for each podcast will also be generated into this directory.
feed_cache is a master XML file where podhoarder stores all your hoarded channel and episode metadata.
www_user_group is the username and groupname of your webserver, usually www-data:www-data. This is necessary only if you're re-hosting the cache folder. It'll be used to set permissions for the post-processing helper script.
www_prefix, needed only if re-hosting, is the URL that corresponds to cache_dir on the filesystem.

For example, if your files are in /var/www/podcasts on the filesystem, and /var/www is accessible at https://example.com/, you would set your www_prefix to https://example.com/podcasts/. The trailing slash is important.

This way when the RSS feed is being re-generated by podhoarder, it will map /var/www/podcasts/my_podcast/whatever.mp3 to https://example.com/podcasts/my_podcast/whatever.mp3
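
The mapping described above can be sketched in a few lines of Python. This is only an illustration, not PodHoarder's actual code: the function name to_public_url is hypothetical, and percent-encoding the relative path is an assumption about how unusual filenames might be handled.

```python
import posixpath
from urllib.parse import quote


def to_public_url(path, cache_dir, www_prefix):
    """Map a file under cache_dir to the URL it would be served from.

    Hypothetical helper: plain prefix replacement, as described in the text.
    """
    rel = posixpath.relpath(path, cache_dir)
    if rel == ".." or rel.startswith("../"):
        raise ValueError(f"{path} is not inside {cache_dir}")
    # Assumption: percent-encode the relative part so spaces etc. stay valid.
    return www_prefix + quote(rel)


if __name__ == "__main__":
    print(to_public_url(
        "/var/www/podcasts/my_podcast/whatever.mp3",
        "/var/www/podcasts",
        "https://example.com/podcasts/",
    ))
```

Because the prefix is simply concatenated in this sketch, leaving off the trailing slash would produce https://example.com/podcastsmy_podcast/whatever.mp3, which is why the trailing slash matters.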

Usage

Run add.py with podcast RSS feeds as arguments (multiple are okay).

Run ui.py to interactively add, remove, or configure podcast feeds. You can also change global PodHoarder settings here.

Run sync.py to download and generate feeds. It shows a progress indicator when run interactively.

Anytime you change settings, you should probably re-run init_setup.py and regenerate_feeds.py for good measure.

Post-processing

Post-processing is done on-demand, and uses the ph_redir.php script to maintain a cache of processed files. It uses shell scripts, so you can write your own pipelines and share them between podcasts.

Post-processing might take a while the first time it is requested on a file, so if a file is not ready yet, PodHoarder will reply with an HTTP 503 message. You can set the wait time for this with the postprocess_async_time option (see the config options).
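
Most podcast players will simply retry later, but if you fetch post-processed episodes from your own script, a client can treat the 503 as "not ready yet" and try again. The following is a minimal sketch, assuming only what is described above (a 503 reply while processing is in progress); the function name, retry count and delay are hypothetical and not part of PodHoarder.

```python
import time
import urllib.error
import urllib.request


def fetch_when_ready(url, attempts=5, delay=2.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying while the server answers HTTP 503.

    opener and sleep are injectable so the retry logic can be tested
    without a network connection.
    """
    for attempt in range(attempts):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Only 503 means "still processing"; anything else is a real error.
            # On the last attempt, give up and re-raise the 503 as well.
            if err.code != 503 or attempt == attempts - 1:
                raise
            sleep(delay)
```

A longer delay between attempts is probably sensible for large episodes, since the first post-processing run on a file can take a while.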

In feed_cache, under the channel you want, add the tag

scriptname

The script should be under the libph/post directory. It should take arguments in the form

script

Input filename will be a full path to an episode file, with an appropriate extension (.mp3, .m4a etc.)

Check the included agc postprocess for a template on what to do, and remember to delete the lock file when you're done.

More options

If you want to poke into the config file at ~/.podhoarder.xml, here are all the config options:

    feed_cache // xml file with channel and episode information
    cache_dir // directory where episode audio files are downloaded
    www_prefix // external prefix to replace cache_dir when generating feed for server
    www_user_group // user:group that web server is run as
    verbosity // debug, info, errors - default info
    stack_traces // whether to re-raise exceptions (for debugging)
    overwrite_existing // global setting - if true, podcast media files are always re-downloaded, not just new ones.
    redownload_existing // if false, if a file is found with the same filename, it is assumed to be a completed download
    chunk_size // transfer chunk size - default 1024 * 1024 bytes
    log_file // log filename; opened in mode "w" by default
    postprocess_script_dir // directory where postprocess commands are stored, defaults to working_dir/libph/post
    postprocess_niceness // unix nice command level with which post-processing scripts are invoked, default 15
    postprocess_async_time // how long to wait before sending HTTP 503 when postprocessing
    title_postfix // If present, this text is added to regenerated feed titles (for ex: " (hoarded)")
    postprocess_cache_size // max size, in bytes, for post-processing cache (default -1 = ignore)
    retry_failed_downloads // default false. if true, when an episode download has failed, it will be retried every time sync.py is run
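
To give a feel for what chunk_size controls, here is a small sketch of a chunked transfer loop. It is not taken from PodHoarder's source; the function name copy_in_chunks is hypothetical, and the only detail borrowed from the list above is the default of 1024 * 1024 bytes.

```python
import io

DEFAULT_CHUNK_SIZE = 1024 * 1024  # the documented chunk_size default


def copy_in_chunks(src, dst, chunk_size=DEFAULT_CHUNK_SIZE):
    """Copy src to dst, reading at most chunk_size bytes at a time.

    Returns the total number of bytes written.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty bytes object means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    source = io.BytesIO(b"x" * 2500)
    target = io.BytesIO()
    print(copy_in_chunks(source, target, chunk_size=1024))
```

A larger chunk means fewer read calls but more memory held per transfer, which may matter on a phone running Termux.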

Known Issues and Troubleshooting

Python crashes with 'UnicodeEncodeError'

This happens when running interactively, because your terminal locale is probably set to ASCII or a derivative, and a feed being displayed on screen contains a Unicode character that the locale cannot encode. Set your codeset to UTF-8 to fix this.
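
To check whether your terminal is affected, a short Python snippet can report the encoding Python uses for output and whether a given string would fail. This is a diagnostic sketch only; the helper name can_encode and the sample title are made up for illustration.

```python
import sys


def can_encode(text, encoding):
    """Return True if text can be written using the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    title = "Caf\u00e9 Podcast"  # hypothetical feed title with a non-ASCII letter
    # Some redirected streams report no encoding; treat that as ASCII here.
    stdout_encoding = sys.stdout.encoding or "ascii"
    print("stdout encoding:", stdout_encoding)
    print("title is printable:", can_encode(title, stdout_encoding))
```

If this reports an ASCII-like encoding, switching to a UTF-8 locale, or setting the PYTHONIOENCODING environment variable to utf-8 before running the scripts, should avoid the crash.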