Understanding website structure

Content is usually represented using HTML (HyperText Markup Language).


Servers make content available through HTTP (HyperText Transfer Protocol).
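To make the server and client roles concrete, here is a minimal sketch in Python using only the standard library. It starts a throwaway server on the local machine, requests one page over HTTP, and returns the status code, content type, and HTML body. The page content, the PageHandler class, and the fetch_from_local_server function are all invented for this illustration; a real scraper would request a public URL instead.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page content, invented for this sketch.
PAGE = b'<html><body><div title="buyer-name">Moe Dess</div></body></html>'


class PageHandler(BaseHTTPRequestHandler):
    """Answers every GET request with the same small HTML page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # suppress the default request log to keep the example quiet


def fetch_from_local_server():
    """Start a local server, request one page over HTTP, return the reply."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/"
        # An empty ProxyHandler bypasses any system proxy for this local request.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            body = response.read().decode("utf-8")
            return response.status, response.headers["Content-Type"], body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, content_type, html = fetch_from_local_server()
    print(status, content_type)
    print(html)
```

The body that comes back is plain HTML text. Everything a scraping tool does afterwards operates on that text.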

Web scraping tools use a website's HTML structure to navigate the page and identify the content to scrape.


Sites with a well-organized, descriptive structure are usually easier to scrape.

[Figure: Anatomy of an HTML element. Source: Anatomy of an HTML element]
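The parts of an element can also be seen programmatically. The sketch below uses Python's standard html.parser module to split one element into its opening tag (with attributes), its content, and its closing tag. The ElementAnatomy class and the describe function are hypothetical names chosen for this example, and the sample element is an assumption modeled on a buyer-name entry.

```python
from html.parser import HTMLParser


class ElementAnatomy(HTMLParser):
    """Records the parts of each element the parser encounters, in order."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.parts.append(("opening tag", tag, dict(attrs)))

    def handle_data(self, data):
        # skip whitespace-only text between tags
        if data.strip():
            self.parts.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.parts.append(("closing tag", tag))


def describe(html):
    """Return the list of parts found in an HTML snippet."""
    parser = ElementAnatomy()
    parser.feed(html)
    parser.close()
    return parser.parts


if __name__ == "__main__":
    # Sample element assumed for illustration.
    for part in describe('<div title="buyer-name">Moe Dess</div>'):
        print(part)
```

The attribute name and value in the opening tag are what a scraping tool typically matches on, and the content between the tags is what it extracts.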

Browser "inspect" tools allow you to explore the HTML structure of a web page.


Right-click and select Inspect or Inspect element to reveal how the selected content is coded in HTML.

A simple site listing buyer names and item prices

http://econpy.pythonanywhere.com/ex/001.html

Buyer names are structured like this in the HTML

<div title="buyer-name">Moe Dess</div>

The XPath expression that identifies all buyer-name nodes is

//div[@title="buyer-name"]
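As a rough illustration of how such an expression selects nodes, the sketch below runs it against a small, well-formed sample using Python's standard xml.etree.ElementTree module. The SAMPLE markup is invented for this example (only the buyer name "Moe Dess" comes from the text above; the second buyer and the price markup are assumptions), and buyer_names is a hypothetical helper. ElementTree supports only a subset of XPath and expects a relative path, so the expression is written with a leading dot.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; the real page may be structured differently.
SAMPLE = """
<html>
  <body>
    <div title="buyer-name">Moe Dess</div>
    <span class="item-price">$10.00</span>
    <div title="buyer-name">Ann Example</div>
    <span class="item-price">$20.00</span>
    <div title="footer">Not a buyer</div>
  </body>
</html>
"""


def buyer_names(markup):
    """Return the text of every div whose title attribute is 'buyer-name'."""
    root = ET.fromstring(markup.strip())
    # './/div' searches all descendants; the predicate filters on the attribute.
    matches = root.findall('.//div[@title="buyer-name"]')
    return [div.text for div in matches]


if __name__ == "__main__":
    print(buyer_names(SAMPLE))
```

ElementTree only accepts well-formed XML, which many real web pages are not. For live sites, a lenient HTML parser is usually a better fit; the third-party lxml library, for example, parses HTML with lxml.html.fromstring and evaluates full XPath expressions such as the one above through its xpath method.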

Scraping examples with Data Miner