Web Scraping with 1 Line of Bash

In the past, Python with the Beautiful Soup library has been a great approach for my web scraping.

I was recently doing a small project and I was amazed at what one Bash statement would get me.

My approach was to use the text-based Lynx browser and pipe the output to a grep search.

Below is an example where I used Lynx to dump the “Sunshine Village Snow Forecast” web page to find how much snow they had.

The Lynx Text Browser

The first step in web scraping is to get a web page into a searchable format.

I started out by looking at using cURL with the html2text tool, but I found that the Lynx browser offered a one-step solution with cleaner text output.

To install Lynx on Raspbian/Debian/Ubuntu use:

sudo apt install lynx

The lynx -dump option will output a web page as text with the HTML tags and JavaScript removed. It’s important to note that what you see on the page in a graphical browser may not match the text output.

Below is an example where I wanted to get the new snow at Sunshine Village. On the web page JavaScript is used to show the snow depth as either centimetres or inches, but in the text output both units and their values are shown.

Bash has a good selection of string manipulation tools. Below is an example that extracts the first part of the string to show only the snow in centimetres (cm):

$ theurl="https://www.snow-forecast.com/resorts/Sunshine/6day/mid"
$ thestr="New snow in Sunshine Village:"
$ # Create a variable with the result string from Lynx
$ newsnow=$(lynx -dump "$theurl" | grep "$thestr")
$ 
$ echo "$newsnow"
   New snow in Sunshine Village:  4.8cm1.9in on Fri 8th (after 3 PM)
$    
$ # Get the first part of the string before "cm"
$ # %% removes the longest match of "cm*" from the end of the string
$ echo "${newsnow%%cm*} cm"
   New snow in Sunshine Village:  4.8 cm
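
The same kind of expansion can also pull out just the numbers. Below is a minimal sketch that works on a hard-coded copy of the line above, so it does not need Lynx or a network connection. The variable names (line, cm_value, rest, in_value) are only for illustration.

```shell
#!/bin/bash
# Sample line copied from the Lynx output above (hard-coded for illustration)
line="   New snow in Sunshine Village:  4.8cm1.9in on Fri 8th (after 3 PM)"

# %% removes the longest match of "cm*" from the end of the string
cm_part="${line%%cm*}"

# ## removes the longest match of "* " from the start,
# which leaves only the text after the last space
cm_value="${cm_part##* }"

# # removes the shortest match of "*cm" from the start,
# then %% cuts everything from the first "in" onwards
rest="${line#*cm}"
in_value="${rest%%in*}"

echo "New snow: ${cm_value} cm (${in_value} in)"
```

A single % or # removes the shortest match instead of the longest, which matters when the search text appears more than once in the line.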

My Final App

We were going on a family ski trip, and to get pumped I created a morning notification script that showed the new snow and the base.

#!/bin/bash
#
# skitrip.sh - show the Sunshine ski conditions in a notification
#
theurl="https://www.snow-forecast.com/resorts/Sunshine/6day/mid"

# Fetch the page once and reuse the text for both searches
page=$(lynx -dump "$theurl")

# Get the new snow depth (keep only the text before "cm")
thestr="New snow in Sunshine Village:"
result=$(grep "$thestr" <<< "$page")
newsnow="${result%%cm*} cm"

# Get the base
thestr="Top Lift:"
base=$(grep "$thestr" <<< "$page")

# Show the results in a desktop notification
msg="$newsnow\n$base (base)"
icon="$HOME/Downloads/mountain.png"
notify-send -t 10000000 -i "$icon"  "Sunshine Ski Resort" "$msg"

The notify-send utility puts a message on a Linux desktop. Another option would be to send an SMS message.

Summary

Scraping web pages can be tricky and the pages can change at any time.
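
One way to cope with a page changing is to check whether the search string was found at all before using the result. Below is a rough sketch of that idea. The find_line function is a hypothetical helper (not part of my script), and the page text is hard-coded for illustration.

```shell
#!/bin/bash
# find_line is a hypothetical helper: print the first line of $1 that
# contains the fixed string $2, or return non-zero if nothing matches.
find_line() {
    local text="$1" pattern="$2"
    # -m 1 stops after the first match, -F treats the pattern as plain text
    grep -m 1 -F -- "$pattern" <<< "$text"
}

# Hard-coded stand-in for the output of lynx -dump
page="   New snow in Sunshine Village:  4.8cm1.9in on Fri 8th"

if result=$(find_line "$page" "New snow in Sunshine Village:"); then
    echo "Found: $result"
else
    echo "Pattern not found, the page layout may have changed" >&2
fi
```

Because grep returns a non-zero exit status when nothing matches, the if statement can tell a missing string apart from a real result.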

I found that Lynx worked on many pages but not all.

The grep utility is extremely useful and it offers a lot of interesting options, such as -A and -B for getting lines after or before the found string.
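
Here is a small sketch of the before/after options in action. The sample text is made up for illustration (it is not real resort data) and stands in for a page where the value sits on the line after the label.

```shell
#!/bin/bash
# Made-up sample text standing in for a page dump
sample="Resort summary
New snow in Sunshine Village:
4.8cm on Fri 8th
Lifts open: 9"

# -A 1 prints the matching line plus one line after it
after=$(grep -A 1 "New snow" <<< "$sample")

# -B 1 prints the matching line plus one line before it
before=$(grep -B 1 "New snow" <<< "$sample")

echo "$after"
echo "---"
echo "$before"
```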
