Sunday 17 February 2019

Kitronik :MOVE mini for the BBC micro:bit

I ordered a Kitronik :MOVE mini robot from Pimoroni last year. This provides a battery-powered chassis that can be controlled using an on-board BBC micro:bit.
The kit comes in a robust cardboard box, unfortunately not big enough to take the completed robot (but see later).

The components are neatly bagged up, and include the required AA batteries.
The body is made up of laser cut acrylic pieces. There are two continuous rotation servo motors to provide the motive power.

The controller board is designed to use countersunk screws to provide the connection between the micro:bit and the board. This means control is limited to the two motors and the light bar (there is an option to isolate the light bar, giving access to an additional, optional, servo).
The back of the board. Note at bottom right the area to cut to access the third servo.
The kit does not include a BBC micro:bit. As one of the options is to control the robot's on-board micro:bit using a second micro:bit, I ordered an extra one.

I covered the BBC micro:bit in an earlier post.



The instructions to build the robot can be found here. They are generally straightforward (so much so I forgot to pause to photograph the stages).

The one thing to be aware of is that the controller board only runs from the batteries. I started testing on the assumption that the USB supply would power both the micro:bit and the controller board, and so had the battery switched off. The micro:bit was fine, but the ZIP LED light bar would not light up. Switching the battery pack on solved the problem.
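A quick way to check this behaviour is to flash a minimal ZIP LED test to the micro:bit. This is only a sketch, written in micro:bit MicroPython and assuming (per Kitronik's documentation) that the five ZIP LEDs are driven from pin0; the LEDs will only light when the battery pack is switched on, even though the micro:bit itself runs happily from USB:

```python
# Minimal ZIP LED test for the :MOVE mini (MicroPython on the micro:bit)
# Assumes the five ZIP LEDs hang off pin0, as per Kitronik's pinout
from microbit import pin0
import neopixel

zip_leds = neopixel.NeoPixel(pin0, 5)   # five ZIP LEDs on the light bar
for i in range(5):
    zip_leds[i] = (0, 32, 0)            # dim green, to limit current draw
zip_leds.show()                          # nothing lights if the battery is off
```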
As you can see, the micro:bit is screwed to the controller board. The ZIP LEDs are above the micro:bit and the 5x5 matrix is visible.

Side view. The robot has two wheels and uses the front and rear of the side walls as stabilising rails.
The ZIP LEDs are very bright.

Saturday 16 February 2019

Python Web Scraper

Search Engine Optimisation is a mysterious skill; the companies behind search engines do not want to make it easy to game their engines and force certain pages to the top of the results list.

However, one of the things that is listed as being useful is the presence of the key search words in the body of the HTML page.

The following is a simple Python script that retrieves the text from a supplied URL, creates a searchable tree using LXML, extracts the text, and then counts the occurrences of non-numeric words of two or more characters.

# Functions to scrape the body text from a web page
# and return the top 10 occurring words
# Currently does not play well with HTML comments

# Also note that the order of the printed results is subject to change;
# this matters if more than one word has the same count as the
# last word displayed

from lxml import html
import requests
import re
import collections

noisewords=['at','and','an','the','we','to','is','of','by','not','in','as','be','or','for']
def testgoodword(astring):
    if astring.strip() in noisewords:
        return False
    else:
        return True

def webscrape(aURL):

# Get the page from the URL
    page = requests.get(aURL)
# Make an HTML tree from the text
    tree = html.fromstring(page.content)

# Extract non script and non style text from the HTML tree
    bodytext=""
    for elt in tree.getiterator():
        if elt.text is not None:
            if elt.tag!="script" and elt.tag!="style":
                if elt.text.strip()!='':
                    bodytext=bodytext+' '+elt.text
                    
# Define a regular expression to extract words
# (two or more word characters followed by a whitespace character)
    p = re.compile(r'\w\w+\s')

# Use a Counter collection to record the occurrences (the word is the key)
# Counter collections return zero if there is no element with a supplied key
    c = collections.Counter()

# Iterate through the "words" found by the regular expression
    iterator=p.finditer(bodytext)
    for match in iterator:
        testword=match.group().strip().lower()
        if not testword.isnumeric():        # Ignore numbers
            if testgoodword(testword):      # Only use non noise words
                c[testword]+=1

    return c

# Print the most common words
def print_webscrape(c):
    print ('Most common:')
    for word, count in c.most_common(10):
        print ('\'%s\': %7d' % (word, count))

# Testing code 
if __name__ == "__main__":
    # execute only if run as a script
    c=webscrape("https://en.wikipedia.org/wiki/Python_(programming_language)")
    print_webscrape(c)


Most common:
'python':     189
'retrieved':     127
'programming':      49
'software':      35
'language':      29
'edit':      29
'with':      29
'pep':      26
'languages':      24
'march':      23
>>> 
Most common:
'python':     189
'retrieved':     127
'programming':      49
'software':      35
'edit':      29
'language':      29
'with':      29
'pep':      26
'languages':      24
'march':      23
>>> 
Note that the order of "edit", "language" and "with" alters between the two runs. If you really want the top ten, you need to expand the range until you reach a different count.
Most common:
'python':     189
'retrieved':     127
'programming':      49
'software':      35
'with':      29
'language':      29
'edit':      29
'pep':      26
'languages':      24
'march':      23
'february':      23
'org':      23
'from':      21
'van':      20
'december':      18
As you can see, there are three results with a count of 23. The Counter.most_common() function will return equal-count results in an arbitrary order.
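The "expand until the count drops" idea can be wrapped in a small helper. This is a sketch (top_with_ties is my own name, not part of collections) that extends the top-n list to include every word tied with the n-th place, so the result no longer depends on the arbitrary ordering:

```python
import collections

def top_with_ties(counter, n):
    """Return the n most common items, extended to include any
    further items tied with the count of the n-th item."""
    ranked = counter.most_common()       # all items, highest count first
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]            # count of the n-th item
    end = n
    # Extend past n while the count still matches the cutoff
    while end < len(ranked) and ranked[end][1] == cutoff:
        end += 1
    return ranked[:end]

# Small demonstration with a three-way tie at the boundary
c = collections.Counter({'python': 5, 'language': 3, 'edit': 3,
                         'with': 3, 'pep': 2})
print(top_with_ties(c, 2))
```

Asking for the top 2 here returns four entries, because all three words with a count of 3 are tied at second place; 'pep', below the cutoff, is excluded.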
