Saturday, 2 January 2021

Reading RSS feeds using Python

RSS Feeds

RSS stands for Really Simple Syndication, a format for publishing information that is generally meant to be read by computers. An RSS feed is an XML document served from a URL. It contains a number of entries, each commonly holding a title, link, description, publication date, and entry ID.
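To make the entry fields concrete, here is a minimal sketch that parses a small hand-written feed. The sample XML and the read_entries helper are invented for illustration. The import falls back to the standard library parser only so that the snippet runs where defusedxml is not installed.

```python
# Prefer defusedxml for untrusted input; fall back to the standard library
# only so this sketch runs without extra installs (the sample below is trusted).
try:
    import defusedxml.ElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Invented sample feed with a single entry
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Channel</title>
    <item>
      <title>Episode one</title>
      <link>https://example.com/episodes/1</link>
      <description>A short description.</description>
      <pubDate>Fri, 01 Jan 2021 09:00:00 +0000</pubDate>
      <guid>example-episode-1</guid>
    </item>
  </channel>
</rss>"""

def read_entries(xml_text):
    """Return one dict per <item>, holding the common entry fields."""
    root = ET.fromstring(xml_text)
    fields = ("title", "link", "description", "pubDate", "guid")
    return [{name: item.findtext(name) for name in fields}
            for item in root.findall("channel/item")]

entries = read_entries(SAMPLE_RSS)
print(entries[0]["title"], entries[0]["pubDate"])
```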

If you search for how to read RSS in Python, you will often be directed to the feedparser library. It handles much of the background work involved in using an RSS feed and protects against some of the pitfalls of reading what is effectively XML from potentially unknown sources. That protection includes reading only elements from a whitelist of trusted elements.

Unfortunately, not all RSS feeds use elements on that whitelist.

The BBC RSS feeds are among those that do not. This was a problem because I wanted to download the free BBC podcasts automatically.

So it was back to basics: the Requests library to access the RSS feed and download the files, and the defusedxml library to parse the RSS file as XML.
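As a rough illustration of treating the feed as plain XML, the sketch below looks up a namespaced media:content element, which is where the program further down expects to find the audio URL. The sample XML, its URL, and the media_urls helper are made up for the example. The fallback import is there only so the snippet runs without defusedxml.

```python
try:
    import defusedxml.ElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Prefix-to-URI map used when searching for namespaced elements
NS = {"media": "http://search.yahoo.com/mrss/"}

# Invented sample: one item carrying a media:content element
SAMPLE_ITEM = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Episode one</title>
      <media:content url="https://example.com/audio/episode1.mp3" type="audio/mpeg"/>
    </item>
  </channel>
</rss>"""

def media_urls(xml_text):
    """Collect the url attribute of each item's media:content element."""
    root = ET.fromstring(xml_text)
    urls = []
    for item in root.findall("channel/item"):
        content = item.find("media:content", NS)
        if content is not None:  # skip items without media content
            urls.append(content.get("url"))
    return urls

print(media_urls(SAMPLE_ITEM))
```

Without the namespace map, a search for "media:content" would not match, because ElementTree stores the element under its full namespace URI.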

RSS entries contain publication dates, so if you record the date of the most recent file you have downloaded, you can use that date to download only new files the next time you read the feed.
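The date comparison can be sketched in a few lines. To keep this example free of extra installs it uses the standard library's email.utils.parsedate_to_datetime rather than dateutil, which is a substitution and not what the program below does. The sample dates and the newer_than helper are invented.

```python
from email.utils import parsedate_to_datetime

def newer_than(pub_dates, last_date_text):
    """Return the publication dates that fall after the recorded last date."""
    last_date = parsedate_to_datetime(last_date_text)
    return [text for text in pub_dates
            if parsedate_to_datetime(text) > last_date]

# Invented publication dates, most recent first
pub_dates = [
    "Fri, 01 Jan 2021 09:00:00 +0000",
    "Fri, 25 Dec 2020 09:00:00 +0000",
    "Fri, 18 Dec 2020 09:00:00 +0000",
]

# Only the entry published after the recorded date is selected
print(newer_than(pub_dates, "Fri, 25 Dec 2020 09:00:00 +0000"))
```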

Installation

Check that pip belongs to Python 3.x. If it does not, replace pip with pip3 in the commands below.

Requests

pip install requests

Defusedxml

pip install defusedxml

Dateutil

This is required for reading and comparing dates.

pip install python-dateutil
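One way to confirm the installs worked is to check that each package can be found by the interpreter you intend to run. This is only a sketch, and the missing_modules helper is a made-up name. Note that python-dateutil is imported as dateutil.

```python
import importlib.util

# Import names of the three packages installed above
REQUIRED = ["requests", "defusedxml", "dateutil"]

def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
print("Missing:", missing if missing else "none")
```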

Design

The program has four main steps:
  • Read the RSS file
  • Iterate through the new files
  • Download the podcasts
  • Update the last file date
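The last step, remembering the date between runs, amounts to reading and writing a small text file. The sketch below shows the round trip in isolation. The helper names, the file name, and the default date are assumptions made for the example, not part of the program that follows.

```python
from pathlib import Path
import tempfile

def read_last_date(state_file, default="Sat, 01 Jan 2000 00:00:00 +0000"):
    """Return the stored date text, or an old default if no file exists yet."""
    path = Path(state_file)
    if path.is_file():
        return path.read_text().strip()
    return default

def write_last_date(state_file, pub_date_text):
    """Store the publication date of the most recent downloaded file."""
    Path(state_file).write_text(pub_date_text)

# Demonstrate a first run (no file) followed by a second run (file present)
with tempfile.TemporaryDirectory() as folder:
    state = Path(folder) / "Example_lastdate.txt"
    first_run = read_last_date(state)
    write_last_date(state, "Fri, 01 Jan 2021 09:00:00 +0000")
    second_run = read_last_date(state)

print(first_run)
print(second_run)
```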

Code

import defusedxml.ElementTree as ET
from pathlib import Path
import requests
from dateutil.parser import parse
import os

# Save the file referenced by the URL to the supplied filepath
def saveMP3(url, filepath):
    print(filepath)
    r = requests.get(url)
    # Stop here on an HTTP error rather than saving an error page as an MP3
    r.raise_for_status()
    with open(filepath, 'wb') as f:
        f.write(r.content)

# Get all the files from the supplied RSS url that are newer than the 
# last file date stored in the matching file
def getFiles(url, target_folder, prefix):
    # Build the Namespace for the XML searching
    ns = {'media': "http://search.yahoo.com/mrss/"}

    # Create target folder if required
    os.makedirs(target_folder, exist_ok=True)

    # Get response from URL
    r = requests.get(url)
    print(r.status_code)

    # Make XML document from response text
    root = ET.fromstring(r.text)

    # Get the channel title
    channel_title = root.find("channel/title").text.replace(": ", "-")

    # Build the filename for the last downloaded file date
    channel_file = channel_title + "_lastdate.txt"
    print(channel_file)
    
    last_file_date = ""
    # Set up a default date if there is no file
    last_date = parse("2000-01-01 00:00:00 +00:00")

    # If there is a file, obtain the date from the file
    if Path(channel_file).is_file():
        with open(channel_file, "r") as f:
            last_file_date = f.read()
        last_date = parse(last_file_date)
    print(last_date)

    # Find the item elements from the XML
    items = root.findall("channel/item")

    print(channel_title + " files:" + str(len(items)))
    download_count = 0
    first_file_downloaded_date = ""

    # For each item found (the feed is expected to list the most recent first)
    for item in items:
        # Obtain the information required
        title = item.find("title")
        link = item.find("media:content", ns)
        pub_date = item.find("pubDate")
        # Create the published date as datetime
        datetime_object = parse(pub_date.text)
        print(title.text)
        print(datetime_object)
        # Download if there is no previous date or the item is newer than the last run
        if last_file_date == "" or last_date < datetime_object:
            saveMP3(link.attrib['url'], target_folder + prefix + datetime_object.strftime("%Y%m%d%H%M%S") + ".mp3")
            download_count += 1
        else:  # Otherwise exit loop
            print("Not downloading")
            break
        # Record the filedate of the first (most recent) file
        if first_file_downloaded_date == "":
            first_file_downloaded_date = pub_date.text
    
    # Print how many files and the most recent file date
    print("Downloaded " + str(download_count))
    print(first_file_downloaded_date)
    
    # Write the most recent file date to the file
    if first_file_downloaded_date != "":
        with open(channel_file, "w") as f:
            f.write(first_file_downloaded_date)

# If this is the main call
if __name__ == "__main__":
    rss_url = 'https://podcasts.files.bbci.co.uk/p02nrss1.rss'
    getFiles(rss_url, "/home/pi/Music/moreorless/","mol")
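
Podcast files can be large, and saveMP3 above holds the whole file in memory before writing it. A possible alternative, sketched here under assumptions, is to write the body in chunks using the iter_content method that Requests responses provide. The save_streamed helper and the FakeResponse stand-in are invented so the example runs offline.

```python
import tempfile
from pathlib import Path

def save_streamed(response, filepath, chunk_size=64 * 1024):
    """Write a response body to disk in chunks; return the bytes written."""
    written = 0
    with open(filepath, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # ignore empty chunks
                f.write(chunk)
                written += len(chunk)
    return written

class FakeResponse:
    """Stand-in for a Requests response so the sketch runs without a network."""
    def __init__(self, data):
        self._data = data

    def iter_content(self, chunk_size=1):
        for start in range(0, len(self._data), chunk_size):
            yield self._data[start:start + chunk_size]

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "episode.mp3"
    size = save_streamed(FakeResponse(b"x" * 150_000), target)
    on_disk = target.stat().st_size

print(size, on_disk)
```

With Requests itself, the response would come from requests.get(url, stream=True), so that the body is fetched as it is iterated rather than all at once.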

References