RSS Feeds
RSS stands for Really Simple Syndication, a format for publishing information that is generally meant to be read by programs rather than people. An RSS feed is an XML document served from a URL and contains a number of entries, each commonly containing a title, link, description, publication date, and entry ID.
If you search for how to read RSS in Python, you will often be directed to the feedparser library. It handles much of the background work involved in using an RSS feed and protects against some of the pitfalls of reading what is effectively XML from potentially unknown sources. Part of that protection is only reading elements from a whitelist of trusted elements.
Unfortunately, not all RSS feeds use elements on that whitelist.
The BBC RSS feeds are one example. This was a problem because I wanted to download the free BBC podcasts automatically.
So it was back to basics: the Requests library to access the RSS feed and download the files, and the defusedxml library to manipulate the RSS file as XML.
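As a rough illustration of what treating the feed as XML looks like, the sketch below parses a small made-up feed and pulls out each item's title, publication date, and media URL. The sample XML, the example.com URLs, and the function name list_items are assumptions for illustration only. The fallback import is there just so the sketch runs where defusedxml is not installed; the standard library parser is not hardened against malicious XML.

```python
try:
    import defusedxml.ElementTree as ET  # hardened parser used in this post
except ImportError:
    # Fallback for illustration only; not hardened against malicious XML
    import xml.etree.ElementTree as ET

# Made-up sample feed (assumption): real feeds carry many more elements
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Example: Show</title>
    <item>
      <title>Episode 2</title>
      <pubDate>Fri, 05 Jan 2024 16:30:00 +0000</pubDate>
      <media:content url="https://example.com/ep2.mp3" type="audio/mpeg"/>
    </item>
    <item>
      <title>Episode 1</title>
      <pubDate>Thu, 04 Jan 2024 16:30:00 +0000</pubDate>
      <media:content url="https://example.com/ep1.mp3" type="audio/mpeg"/>
    </item>
  </channel>
</rss>"""

# Prefix-to-URI mapping so that "media:content" can be used in searches
NS = {"media": "http://search.yahoo.com/mrss/"}


def list_items(rss_text):
    """Return a list of dicts with the title, pub_date and url of each item."""
    root = ET.fromstring(rss_text)
    results = []
    for item in root.findall("channel/item"):
        content = item.find("media:content", NS)
        results.append({
            "title": item.findtext("title"),
            "pub_date": item.findtext("pubDate"),
            # Some items may lack a media:content element
            "url": content.get("url") if content is not None else None,
        })
    return results


for entry in list_items(SAMPLE_RSS):
    print(entry["title"], entry["pub_date"], entry["url"])
```

The namespace dictionary matters here: without it, a search for "media:content" cannot be resolved to the Media RSS namespace declared at the top of the feed.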
RSS files contain publication dates, so if you record the date of the last file you downloaded, you can use that date to download only new files the next time you check the feed.
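The date check can be sketched in isolation. This version uses the standard library's email.utils.parsedate_to_datetime rather than dateutil; that is a swapped-in alternative which handles the RFC 822 style dates that RSS uses in pubDate. The function name newer_items and the sample dates are made up for illustration.

```python
from email.utils import parsedate_to_datetime


def newer_items(pub_dates, last_date_text):
    """Return the pubDate strings that are later than last_date_text."""
    last_date = parsedate_to_datetime(last_date_text)
    return [d for d in pub_dates if parsedate_to_datetime(d) > last_date]


# Made-up dates, newest first, as they would typically appear in a feed
feed_dates = [
    "Fri, 05 Jan 2024 16:30:00 +0000",
    "Thu, 04 Jan 2024 16:30:00 +0000",
    "Wed, 03 Jan 2024 16:30:00 +0000",
]

# The stored date is the newest item seen on the previous run
print(newer_items(feed_dates, "Thu, 04 Jan 2024 16:30:00 +0000"))
```

Only the 5 January entry is later than the stored date, so it is the only one returned.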
Installation
Remember to check that the version of pip is for Python 3.x (if not, replace pip with pip3).
Requests
pip install requests
Defusedxml
pip install defusedxml
Dateutil
This is required for parsing and comparing dates.
pip install python-dateutil
Design
There are four main elements of the program:
- Read the RSS file
- Iterate through the new files
- Download the podcasts
- Update the last file date
Code
import defusedxml.ElementTree as ET
from pathlib import Path
import requests
from dateutil.parser import parse
import os


# Save the file referenced by the URL to the supplied filepath
def saveMP3(url, filepath):
    print(filepath)
    r = requests.get(url)
    with open(filepath, 'wb') as f:
        f.write(r.content)


# Get all the files from the supplied RSS url that are newer than the
# last file date stored in the matching file
def getFiles(url, target_folder, prefix):
    # Build the Namespace for the XML searching
    ns = {'media': "http://search.yahoo.com/mrss/"}
    # Create target folder if required
    if not Path(target_folder).is_dir():
        os.makedirs(target_folder)
    # Get response from URL (use the url parameter, not the global rss_url)
    r = requests.get(url)
    print(r.status_code)
    # Make XML document from response text
    root = ET.fromstring(r.text)
    # Get the channel title
    channel_title = root.find("channel/title").text.replace(": ", "-")
    # Build the filename for the last downloaded file date
    channel_file = channel_title + "_lastdate.txt"
    print(channel_file)
    last_file_date = ""
    # Set up a default date if there is no file
    last_date = parse("2000-01-01 00:00:00 +00:00")
    # If there is a file, obtain the date from the file
    if Path(channel_file).is_file():
        with open(channel_file, "r") as f:
            last_file_date = f.read()
        last_date = parse(last_file_date)
    print(last_date)
    # Find the item elements from the XML
    items = root.findall("channel/item")
    print(channel_title + " files:" + str(len(items)))
    download_count = 0
    first_file_downloaded_date = ""
    # For each item found (the feed lists the most recent item first)
    for y in items:
        # Obtain the information required
        title = y.find("title")
        link = y.find("media:content", ns)
        pub_date = y.find("pubDate")
        # Create the published date as datetime
        datetime_object = parse(pub_date.text)
        print(title.text)
        print(datetime_object)
        # Download if there is no previous date or the item is newer than the last run
        if last_file_date == "" or last_date < datetime_object:
            saveMP3(link.attrib['url'], target_folder + prefix + datetime_object.strftime("%Y%m%d%H%M%S") + ".mp3")
            download_count = download_count + 1
        else:  # Otherwise exit loop
            print("Not downloading")
            break
        # Record the filedate of the first (most recent) file
        if first_file_downloaded_date == "":
            first_file_downloaded_date = pub_date.text
    # Print how many files and the most recent file date
    print("Downloaded " + str(download_count))
    print(first_file_downloaded_date)
    # Write the most recent file date to the file
    if first_file_downloaded_date != "":
        with open(channel_file, "w") as f:
            f.write(first_file_downloaded_date)


# If this is the main call
if __name__ == "__main__":
    rss_url = 'https://podcasts.files.bbci.co.uk/p02nrss1.rss'
    getFiles(rss_url, "/home/pi/Music/moreorless/", "mol")
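The listing builds each filename by joining strings, so target_folder has to end with a slash. As one possible alternative, a small helper using pathlib avoids that dependency. build_filename is a hypothetical name and is not part of the script above; the date used here is made up.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_filename(target_folder, prefix, published):
    """Build <target_folder>/<prefix><YYYYmmddHHMMSS>.mp3 as a Path."""
    stamp = published.strftime("%Y%m%d%H%M%S")
    # The / operator inserts the path separator only where it is needed
    return Path(target_folder) / f"{prefix}{stamp}.mp3"


# Made-up publication date for illustration
published = datetime(2024, 1, 5, 16, 30, 0, tzinfo=timezone.utc)
print(build_filename("/home/pi/Music/moreorless", "mol", published))
```

Because open() accepts Path objects, the result could be passed to saveMP3 without further changes.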
References