Intro to web scraping - what is web scraping?

What is web scraping?

Web scraping means acquiring non-tabular or poorly structured data from a website and converting it to a structured format (such as a .csv file or spreadsheet).
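As a rough illustration of that conversion, here is a minimal sketch using only Python's standard library. The HTML fragment, the class names "name" and "price", and the helper names ItemParser and html_to_csv are all made up for this example; a real page would need its own parsing rules.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment: names and prices buried in free-form markup.
SAMPLE_HTML = """
<div class="item"><span class="name">Widget</span> costs <span class="price">4.50</span></div>
<div class="item"><span class="name">Gadget</span> costs <span class="price">12.00</span></div>
"""


class ItemParser(HTMLParser):
    """Collect the text of <span class="name"> and <span class="price"> into rows."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows: [name, price]
        self._field = None   # field the parser is currently inside, if any
        self._current = {}   # partially filled row

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        # Text outside the two spans of interest (e.g. " costs ") is ignored.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span" and self._field:
            self._field = None
            # Emit a row once both fields have been seen.
            if "name" in self._current and "price" in self._current:
                self.rows.append([self._current["name"], self._current["price"]])
                self._current = {}


def html_to_csv(html):
    """Return the scraped rows as CSV text with a header line."""
    parser = ItemParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "price"])
    writer.writerows(parser.rows)
    return out.getvalue()


print(html_to_csv(SAMPLE_HTML))
```

The resulting text can be saved with a .csv extension and opened in a spreadsheet program.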

Crawling. What search engines such as Google do to index the web: systematically following links through all content on specified sites.

Scraping. More targeted than crawling - identifies and extracts specific content from pages it accesses.
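The difference between the two can be sketched on a tiny in-memory "site". Everything here is hypothetical (the PAGES dictionary, the paths, and the function names); no real website is contacted. The crawl function follows every link it finds, while scrape_title pulls one specific element from one page.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site": path -> HTML content.
PAGES = {
    "/": '<h1>Home</h1><a href="/a">A</a><a href="/b">B</a>',
    "/a": '<h1>Page A</h1><a href="/b">B</a>',
    "/b": '<h1>Page B</h1><a href="/">Home</a>',
}


class LinkAndTitleParser(HTMLParser):
    """Record every link target and the text of the first <h1>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = None
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_data(self, data):
        if self._in_h1 and self.title is None:
            self.title = data.strip()

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False


def crawl(start):
    """Crawling: visit every page reachable from start, breadth-first."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        path = queue.popleft()
        order.append(path)
        parser = LinkAndTitleParser()
        parser.feed(PAGES[path])
        for link in parser.links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


def scrape_title(path):
    """Scraping: extract one specific piece of content from one page."""
    parser = LinkAndTitleParser()
    parser.feed(PAGES[path])
    return parser.title


print(crawl("/"))
print(scrape_title("/a"))
```

In practice the two are often combined: a crawl discovers the pages, and a scraper extracts the targeted content from each one.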

Some sites use a robots.txt file to tell automated tools which parts of the site they should not access, which can rule out web scraping.
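A robots.txt file can be checked programmatically. The sketch below uses Python's standard urllib.robotparser module on a made-up robots.txt; the rules, the example.org URLs, and the user-agent name "my-scraper" are assumptions for illustration. With a live site you would typically call set_url() and read() instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it might be served at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-scraper"  # assumed name for this example

# can_fetch() reports whether the rules permit this user agent to request the URL.
print(parser.can_fetch(USER_AGENT, "https://example.org/public/data.html"))
print(parser.can_fetch(USER_AGENT, "https://example.org/private/data.html"))

# crawl_delay() returns the requested pause between requests, or None if unset.
print(parser.crawl_delay(USER_AGENT))
```

Note that robots.txt is a request from the site owner rather than a technical barrier, so a scraper has to check and follow it voluntarily.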

Am I allowed to take this data?

Check the website for terms of use that affect web scraping.

Are there restrictions on what I can do with this data?

Making a local copy of publicly available data is usually OK, but there may be limits on use and redistribution (check for copyright statements).

Am I overloading the website's servers?

Space out your requests so you do not strain the site, and respect the website's access rules, often encoded in robots.txt files.
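One common way to avoid overloading a server is to enforce a minimum pause between requests. The Throttle class below is a hypothetical helper written for this example, not part of any library; the clock and sleep parameters exist only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def fetch_all(urls, fetch, throttle):
    """Call fetch(url) for each URL, pausing between requests as the throttle requires.

    fetch is supplied by the caller (for example, a function wrapping an HTTP library).
    """
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

A call such as fetch_all(urls, my_fetch, Throttle(2.0)) would then leave at least two seconds between requests, where my_fetch stands in for whatever download function you use. If the site's robots.txt specifies a crawl delay, that value is a sensible choice for min_interval.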

When in doubt, ask a librarian or contact UBC's Copyright Office:
https://copyright.ubc.ca/support/contact-us/