Tools and approaches


As we saw with Data Miner, modern visual scraping tools are robust, generally easy to set up, and designed to avoid being blocked by the restrictions that specific websites impose. Tools like Data Miner can accomplish basic web scraping tasks in a few simple clicks. But what if Data Miner ceases to exist? What if you want to create a workflow that others can use no matter what tool they are using? Working with web scraping in a research context requires a level of reproducibility that may not be easy to achieve with proprietary tools (including Data Miner).

In this workshop we focus on visual web scraping with Data Miner to familiarize you with the structure of a website and how to target different areas with a scraping tool. Our focus is to build a mental model of web scraping generally. However, if you would like to have more control over what you are doing, be able to share a detailed explanation of how you got the data that you scraped, and share your process in a sustainable way, scripting is a better approach for you.

A note about environment setup

This workshop is not intended to teach you how to program or to set up your computer to work with programming languages. For absolute beginners who are interested in exploring scripts without the overhead of environment setup, Project Jupyter and Google Colab are great resources. They offer hosted notebook environments that run in your browser and need very little configuration.

You can see Google Colab in action and try some code here: https://colab.research.google.com/. Log in with a Google account, then select “File -> New notebook” for a basic coding environment.

What are scripting tools for web scraping?

Generally, these tools can be divided into two categories:

Tools that get information from the web
Tools that parse the information you get

We will look at a quick example of a Python script run through Jupyter. Python is a common programming language with a number of tools that help with web scraping, and it is relatively friendly for beginners. In Python, sets of related tools are organised into “libraries” (e.g. tools for both scraping and parsing in one package) and “frameworks”, which offer more structure for applying a certain set of tools than a library would.
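The two categories above (getting, then parsing) can be sketched with nothing but Python's standard library. This is a minimal illustration, not the workshop's own script: the function and class names (`fetch_html`, `LinkCollector`, `extract_links`) and the sample HTML are made up for this example, and the parsing step runs on a small hard-coded snippet so the code works without a network connection.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url):
    """Step 1 (get): download a page and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkCollector(HTMLParser):
    """Step 2 (parse): collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Run the parser over an HTML string and return the links found."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# A made-up snippet stands in for a downloaded page so this runs offline.
SAMPLE_HTML = """
<html><body>
  <h1>Staff directory</h1>
  <a href="/people/ada">Ada</a>
  <a href="/people/grace">Grace</a>
  <a>No link here</a>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
print(links)  # ['/people/ada', '/people/grace']
```

To try it against a live page, you could replace `SAMPLE_HTML` with `fetch_html("https://example.org/")`, after checking that the site's terms allow scraping.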

Web Scraping with Python

The most common Python-based tools for web scraping that you are likely to run into are:

requests
- a library for getting data from the web.
- tools to communicate over HTTP.
urllib
- tools to communicate over HTTP.
- best for smaller amounts of data, very similar to requests.
lxml
- a set of tools for parsing HTML and XML.
Beautiful Soup
- a slightly larger set of tools for parsing HTML and XML.
- a parser that processes HTML or XML by standardizing it. Beautiful Soup can troubleshoot structural problems in the output of your scrape, such as missing or unclosed HTML tags.
Scrapy
- a framework for web crawling and web scraping, suited to getting small or large amounts of data from the web and automating requests to happen repeatedly or over time.
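To make the "tools to communicate over HTTP" entries a little more concrete, here is a small sketch that uses urllib to prepare a request without sending it. The URL, query parameters, and User-Agent string are hypothetical values chosen for illustration. The requests library covers the same ground with a shorter call, `requests.get(url, params=..., headers=...)`, but it must be installed separately.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical search page and query parameters, for illustration only.
BASE_URL = "https://example.org/search"
params = {"q": "web scraping", "page": 2}

# urlencode turns the dictionary into a safely escaped query string.
url = f"{BASE_URL}?{urlencode(params)}"

# A Request object only describes what would be sent. Nothing goes over
# the network until it is passed to urllib.request.urlopen().
request = Request(url, headers={"User-Agent": "workshop-example/0.1"})

print(request.full_url)      # https://example.org/search?q=web+scraping&page=2
print(request.get_method())  # GET
```

Identifying your script with a descriptive User-Agent is a common courtesy when scraping, since it lets site owners see where the traffic comes from.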

Tutorials to explore

Web Scraping with R

R is a feature-rich programming language designed for statistical computing and graphics.

The most common R-based tool for web scraping is:

rvest
- rvest is not one of the core tidyverse packages loaded by library(tidyverse), so you load it separately, but it is part of the wider tidyverse and works well with the rest of the collection.