
Scraping and Sourcing Data with Python


Class Description

Effective immediately in response to COVID-19, all Technology Training classes will be delivered online until further notice.

In advance of each session, Tech Training will provide you with a Zoom link to your class, along with any required class materials.

Learners should have an understanding of Basic Python Programming.

The ability to locate and acquire relevant data is a valuable skill for data analysis and data science.

In this class, we will: 

  • Explore sources and repositories of valuable data, such as open government and university datasets
  • Explore popular social APIs (e.g., Facebook, Spotify, Twitter) and domain-specific APIs (e.g., healthcare, news, science and math) that store a wealth of data
  • Discuss methods to query web servers, then request and parse data to extract the information you need
  • Explore how to scrape various types of data from websites, how to read and extract text from documents (e.g., PDF, Word), and how to clean and store sourced and scraped data

Learning Objectives

During this course, you will have the opportunity to:

  • Explore a Variety of Public Data Repositories
  • Understand Effective Means to Search for Valuable Data
  • Use the Python Programming Language to Source and Scrape Data
  • Use Popular Social and Domain-specific APIs to Access Data (e.g., Slack)
  • Extract Text from Documents (e.g., PDF, Word) and Access PDF Tables
  • Scrape Data from Web Pages
  • Clean Scraped Data and Store Sourced and Scraped Data

Topic Outline

Overview of Data Sourcing

  1. Public Open Datasets
  2. Government Data
  3. University Data
  4. Milestone 1 Learning Exercise: Explore public data repositories

Introduction to the Python Programming Language

  1. Installing Anaconda
  2. Milestone 2 Learning Exercise: Learn how to use Jupyter Notebooks

Using Public APIs (Application Programming Interfaces)

  1. Explore Popular and Domain-specific APIs
  2. Common Conventions
  3. Parsing JSON
  4. Milestone 3 Learning Exercise: Access a public API (e.g., Facebook, Twitter, Google)
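
As a rough illustration of the "Parsing JSON" topic, the sketch below decodes a JSON response body with Python's standard `json` module and pulls out one field per record. The response text, its field names (`results`, `title`, `tags`, `next_page`), and the helper `extract_titles` are hypothetical and are not taken from the class materials; a real API will return a different structure.

```python
import json

# Hypothetical API response body; a real API's fields will differ.
SAMPLE_RESPONSE = """
{
  "results": [
    {"id": 1, "title": "Air quality readings", "tags": ["environment", "open-data"]},
    {"id": 2, "title": "Transit ridership", "tags": ["transport"]}
  ],
  "next_page": null
}
"""


def extract_titles(raw_json):
    """Return the title of every record in a JSON response string."""
    payload = json.loads(raw_json)  # JSON text -> Python dicts and lists
    # .get() with a default keeps the function safe when "results" is absent.
    return [record["title"] for record in payload.get("results", [])]


titles = extract_titles(SAMPLE_RESPONSE)
print(titles)
```

In practice the string would come from an HTTP response rather than a constant, but the decoding step is the same.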

Extracting Text from Documents

  1. Milestone 4 Learning Exercise: Extract data from PDFs

Overview of Data Scraping

  1. Introduction to BeautifulSoup
  2. Parsing HTML and JavaScript
  3. Milestone 5 Learning Exercise: Scrape data from a website
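
To give a feel for what parsing HTML involves, here is a minimal sketch that collects every link and its text from a small page. The class itself introduces BeautifulSoup; this example deliberately swaps in Python's standard-library `html.parser` so it runs without installing anything. The sample HTML and the names `LinkCollector` and `collect_links` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real page would be downloaded first.
SAMPLE_HTML = """
<html><body>
  <h1>Datasets</h1>
  <ul>
    <li><a href="/data/air.csv">Air quality</a></li>
    <li><a href="/data/transit.csv">Transit ridership</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None  # set while inside an <a href=...> tag
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def collect_links(html):
    """Parse an HTML string and return its links as (href, text) tuples."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


for href, text in collect_links(SAMPLE_HTML):
    print(href, text)
```

A library such as BeautifulSoup wraps this kind of event handling behind a searchable tree, which is usually more convenient for real scraping work.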

Cleaning Scraped Data

  1. Storing Sourced and Scraped Data
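
One possible shape for a clean-then-store step is sketched below using only the standard library. The raw rows, the cleaning rules (collapse whitespace, strip thousands separators, drop rows that cannot be parsed), and the helper names `clean_rows` and `to_csv` are assumptions made for illustration and do not come from the class materials.

```python
import csv
import io

# Hypothetical scraped rows with typical defects: stray whitespace,
# thousands separators, and a row with missing values.
RAW_ROWS = [
    {"name": "  Air Quality ", "records": "1,204"},
    {"name": "Transit ridership", "records": " 87 "},
    {"name": "", "records": "n/a"},
]


def clean_rows(rows):
    """Normalize names, convert counts to int, and drop unusable rows."""
    cleaned = []
    for row in rows:
        name = " ".join(row["name"].split())  # trim and collapse whitespace
        count_text = row["records"].replace(",", "").strip()
        if not name or not count_text.isdigit():
            continue  # skip rows with no name or a non-numeric count
        cleaned.append({"name": name, "records": int(count_text)})
    return cleaned


def to_csv(rows):
    """Serialize cleaned rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "records"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(clean_rows(RAW_ROWS)))
```

Writing to an in-memory buffer keeps the sketch free of side effects; to store the result on disk, the same writer can be pointed at a file opened with `newline=""`.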

Conclusion: Next steps

University IT Technology Training classes are only available to Stanford University staff, faculty, students and Stanford Hospitals & Clinics employees. A valid SUNet ID is needed in order to enroll in a class.