RSS Feeds

RSS stands for Really Simple Syndication and is a method of supplying information that is generally meant to be read by computers. An RSS feed is an XML document served from a URL and contains a number of entries, each entry commonly containing: title, link, description, publication date, and entry ID.
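To make those entry fields concrete, here is a minimal sketch that parses a tiny hand-written feed. The sample XML, its example.com URLs, the read_entries helper, and the fallback to the standard library parser are all assumptions for illustration; a real feed would be fetched over HTTP.

```python
# Use defusedxml when installed; fall back to the standard library so the
# sketch still runs (the fallback is not hardened against malicious XML).
try:
    import defusedxml.ElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Hypothetical sample feed, not taken from a real site
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First entry</title>
      <link>https://example.com/first</link>
      <description>A short summary</description>
      <pubDate>Sat, 02 Jan 2021 10:00:00 +0000</pubDate>
      <guid>https://example.com/first</guid>
    </item>
  </channel>
</rss>"""


def read_entries(xml_text):
    """Return one dict per <item>, keyed by the common RSS child elements."""
    root = ET.fromstring(xml_text)
    fields = ("title", "link", "description", "pubDate", "guid")
    return [{name: item.findtext(name) for name in fields}
            for item in root.findall("channel/item")]


entries = read_entries(SAMPLE_RSS)
print(entries[0]["title"], entries[0]["pubDate"])
```

In RSS 2.0 the entry ID is carried by the guid element, which is why it appears in the field list.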

If you search for how to read RSS in Python, you will often be directed to the feedparser library. It handles a lot of the background work involved in using an RSS feed and protects against some of the potential pitfalls of reading what is effectively XML from potentially unknown sources. This includes only reading elements from a whitelist of trusted elements.

Unfortunately, not all RSS feeds use elements on that whitelist.

The BBC RSS feeds are among them, which was a problem because I wanted to download the free BBC podcasts automatically.

So it was back to basics: the Requests library to access the RSS feed and download the files, and the defusedxml library to parse the RSS file as XML.

RSS files contain publication dates, so if you record the publication date of the most recent file you have downloaded, you can use that date to download only newer files the next time you read the feed.
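As a rough sketch of that comparison: the example below uses email.utils.parsedate_to_datetime from the standard library rather than dateutil, which works here because RSS pubDate values follow the RFC 822 date style. The is_new helper and the sample dates are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_new(pub_date_text, last_date):
    """True when the entry was published after the last recorded download."""
    return parsedate_to_datetime(pub_date_text) > last_date


# Hypothetical stored date from a previous run (timezone-aware)
last_date = datetime(2021, 1, 1, tzinfo=timezone.utc)

print(is_new("Sat, 02 Jan 2021 10:00:00 +0000", last_date))
print(is_new("Thu, 31 Dec 2020 23:00:00 +0000", last_date))
```

Both sides of the comparison need to be timezone-aware; Python raises a TypeError when an aware datetime is compared with a naive one.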

Installation

Remember to check that your pip is the one for Python 3.x (if it is not, replace pip with pip3 in the commands below).

Requests

pip install requests

Defusedxml

pip install defusedxml

Dateutil

This is required for parsing and comparing dates.

pip install python-dateutil

Design

There are four main elements of the program:

Read the RSS file
Iterate through the new files
Download the podcasts
Update the last file date
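The middle steps rely on the feed listing entries newest first, so the loop can stop at the first entry that is not newer than the stored date. Here is a minimal sketch of that selection logic using plain data; select_new and the sample entries are hypothetical, and the newest-first ordering is an assumption about the feed.

```python
from datetime import datetime, timezone


def select_new(entries, last_date):
    """Return (entries to download, date to store for the next run).

    entries is a list of (published, url) tuples, assumed newest first.
    """
    to_download = []
    for published, url in entries:
        if published <= last_date:
            break  # every later entry is older still
        to_download.append((published, url))
    # The newest selected entry becomes the new stored date
    new_last = to_download[0][0] if to_download else last_date
    return to_download, new_last


utc = timezone.utc
entries = [
    (datetime(2021, 1, 2, tzinfo=utc), "https://example.com/c.mp3"),
    (datetime(2021, 1, 1, tzinfo=utc), "https://example.com/b.mp3"),
    (datetime(2020, 12, 25, tzinfo=utc), "https://example.com/a.mp3"),
]
picked, new_last = select_new(entries, datetime(2020, 12, 31, tzinfo=utc))
print(len(picked), new_last.date())
```

Keeping the selection separate from the downloading makes it possible to check the date handling without touching the network.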

Code

import os
from pathlib import Path

import defusedxml.ElementTree as ET
import requests
from dateutil.parser import parse

# Save the file referenced by the URL to the supplied filepath
def saveMP3(url, filepath):
    print(filepath)
    r = requests.get(url)
    with open(filepath, 'wb') as f:
        f.write(r.content)

# Get all the files from the supplied RSS url that are newer than the
# last file date stored in the matching file
def getFiles(url, target_folder, prefix):
    # Build the Namespace for the XML searching
    ns = {'media': "http://search.yahoo.com/mrss/"}

    # Create target folder if required
    if not Path(target_folder).is_dir():
        os.makedirs(target_folder)

    # Get response from URL (use the url parameter, not the global rss_url)
    r = requests.get(url)
    print(r.status_code)

    # Make XML document from response text
    root = ET.fromstring(r.text)

    # Get the channel title
    channel_title = root.find("channel/title").text.replace(": ", "-")

    # Build the filename for the last downloaded file date
    channel_file = channel_title + "_lastdate.txt"
    print(channel_file)
    last_file_date = ""

    # Set up a default date if there is no file
    last_date = parse("2000-01-01 00:00:00 +00:00")

    # If there is a file, obtain the date from the file
    if Path(channel_file).is_file():
        with open(channel_file, "r") as f:
            last_file_date = f.read()
        last_date = parse(last_file_date)
    print(last_date)

    # Find the item elements from the XML
    items = root.findall("channel/item")
    print(channel_title + " files:" + str(len(items)))
    download_count = 0
    first_file_downloaded_date = ""

    # For each item found (the feed lists the most recent item first)
    for y in items:
        # Obtain the information required
        title = y.find("title")
        link = y.find("media:content", ns)
        pub_date = y.find("pubDate")

        # Create the published date as datetime
        datetime_object = parse(pub_date.text)
        print(title.text)
        print(datetime_object)

        if last_file_date == "":  # Download if no previous date
            saveMP3(link.attrib['url'], target_folder + prefix + datetime_object.strftime("%Y%m%d%H%M%S") + ".mp3")
            download_count = download_count + 1
        elif last_date < datetime_object:  # Download if newer than the last run
            saveMP3(link.attrib['url'], target_folder + prefix + datetime_object.strftime("%Y%m%d%H%M%S") + ".mp3")
            download_count = download_count + 1
        else:  # Otherwise exit loop
            print("Not downloading")
            break

        # Record the filedate of the first (most recent) file
        if first_file_downloaded_date == "":
            first_file_downloaded_date = pub_date.text

    # Print how many files and the most recent file date
    print("Downloaded " + str(download_count))
    print(first_file_downloaded_date)

    # Write the most recent file date to the file
    if first_file_downloaded_date != "":
        with open(channel_file, "w") as f:
            f.write(first_file_downloaded_date)

# If this is the main call
if __name__ == "__main__":
    rss_url = 'https://podcasts.files.bbci.co.uk/p02nrss1.rss'
    getFiles(rss_url, "/home/pi/Music/moreorless/", "mol")
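The code above assumes every item has a media:content element, so it would raise an AttributeError on link.attrib if one were missing. Below is a sketch of a more defensive lookup that falls back to the standard RSS enclosure element. The audio_url helper and the sample item are hypothetical, and whether a given feed supplies either element is something to check against the feed itself.

```python
# Use defusedxml when installed; fall back to the standard library so the
# sketch still runs (the fallback is not hardened against malicious XML).
try:
    import defusedxml.ElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

NS = {"media": "http://search.yahoo.com/mrss/"}


def audio_url(item):
    """Return the item's audio URL, or None if no known element carries one."""
    for element in (item.find("media:content", NS), item.find("enclosure")):
        # Compare with None explicitly: an Element without children is falsy
        if element is not None and element.get("url"):
            return element.get("url")
    return None


# Hypothetical item that only carries an enclosure element
item = ET.fromstring(
    '<item><title>Episode</title>'
    '<enclosure url="https://example.com/ep.mp3" type="audio/mpeg" length="1"/>'
    '</item>'
)
print(audio_url(item))
```

A caller could skip items for which audio_url returns None instead of stopping the whole run.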

References

https://docs.python.org/3/library/xml.html

https://pypi.org/project/defusedxml/

https://pypi.org/project/requests/

https://requests.readthedocs.io/en/master/

https://pypi.org/project/python-dateutil/

https://dateutil.readthedocs.io/en/stable/

https://www.bbc.co.uk/sounds/help/questions/features/subscribe#rss

Technology Is Not Dull

Saturday, 2 January 2021

Reading RSS feeds using Python