
Scraping and Sourcing Data with Python

Class Code: ITS-2598
Date:
  • Fri May 17, 1:00 pm to 4:00 pm
  • Fri May 24, 1:00 pm to 4:00 pm
Delivery Method: Live Online - 2 sessions
Cost: $450

Most Technology Training classes will be delivered online until further notice.

Before each session, Tech Training will provide a Zoom link for live online classes, along with any required class materials.
 


Prerequisite:
Learners should have an understanding of Basic Python Programming.

The ability to locate and acquire important data is a valuable skill for doing data analysis and data science. 

In this class, we will: 

  • Explore sources and repositories of valuable data, such as open government and university datasets
  • Explore popular social APIs (e.g., Facebook, Spotify, Twitter) and domain-specific APIs (e.g., healthcare, news, science and math) that store a wealth of data
  • Discuss methods to query web servers, and request and parse data to extract the information you need
  • Explore scraping various types of data from websites, reading and extracting text from documents (e.g., PDF, Word), and methods to clean and store sourced and scraped data


Learning Objectives

During this course, you will have the opportunity to:

  • Explore a Variety of Public Data Repositories
  • Understand Effective Means to Search for Valuable Data
  • Use the Python Programming Language to Source and Scrape Data
  • Use Popular Social and Domain-specific APIs to Access Data (e.g., Slack)
  • Extract Text from Documents (e.g., Data in PDFs, Word) and Access PDF Tables
  • Scrape Data from Web Pages
  • Clean Scraped Data and Store Sourced and Scraped Data

 
Topic Outline

Overview of Data Sourcing

  1. Public Open Datasets
  2. Government Data
  3. University Data
  4. Milestone 1 Learning Exercise: Explore public data repositories

Introduction to the Python Programming Language

  1. Installing Anaconda
  2. Milestone 2 Learning Exercise: Learn how to use Jupyter Notebooks

Using Public APIs (Application Programming Interfaces)

  1. Explore Popular and Domain-specific APIs
  2. Common Conventions
  3. Parsing JSON
  4. Milestone 3 Learning Exercise: Access a public API (e.g., Facebook, Twitter, Google)
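
As a small illustration of the "Parsing JSON" topic, the sketch below decodes a JSON payload of the kind a public API might return and pulls out selected fields. The payload, its field names (`results`, `title`, `year`), and the helper `extract_titles` are hypothetical, invented for this example; they are not taken from any API covered in class.

```python
import json

# Hypothetical payload standing in for the body of an API response.
SAMPLE_RESPONSE = """
{
  "results": [
    {"title": "City Budget 2023", "year": 2023},
    {"title": "Transit Ridership", "year": 2021},
    {"title": "Untitled"}
  ]
}
"""


def extract_titles(raw_json, min_year=None):
    """Return record titles, optionally keeping only records from min_year on."""
    data = json.loads(raw_json)  # text -> Python dicts and lists
    titles = []
    for record in data.get("results", []):
        year = record.get("year")  # the field may be missing
        if min_year is not None and (year is None or year < min_year):
            continue
        titles.append(record["title"])
    return titles


if __name__ == "__main__":
    print(extract_titles(SAMPLE_RESPONSE, min_year=2022))
```

Using `dict.get` for optional fields keeps the loop from failing on records that omit a key, which is common in real API responses.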

Extracting Text from Documents

  1. Milestone 4 Learning Exercise: Extract data from PDFs

Overview of Data Scraping

  1. Introduction to BeautifulSoup
  2. Parsing HTML and JavaScript
  3. Milestone 5 Learning Exercise: Scrape data from a website
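
To give a feel for what scraping involves, here is a minimal sketch that collects links from an HTML snippet. The class introduces BeautifulSoup; this example deliberately swaps in Python's built-in `html.parser` module so it runs without extra installs. The sample HTML and the names `LinkCollector` and `collect_links` are assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<html><body>
  <h1>Datasets</h1>
  <ul>
    <li><a href="/data/budget.csv">Budget</a></li>
    <li><a href="/data/transit.csv">Transit</a></li>
  </ul>
  <a>No link here</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html):
    """Parse an HTML string and return the collected links."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for href, text in collect_links(SAMPLE_HTML):
        print(href, text)
```

Anchors without an `href` attribute are skipped, so only usable links end up in the result.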

Cleaning Scraped Data

  1. Storing Sourced and Scraped Data
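
Cleaning and storing can be sketched in a few lines. The example below normalizes whitespace, drops empty and duplicate rows, and writes the result as CSV text. The row layout (name, value), the cleaning rules, and the helpers `clean_rows` and `to_csv` are assumptions chosen for illustration, not the specific methods taught in class.

```python
import csv
import io


def clean_rows(rows):
    """Trim whitespace, drop rows with an empty name, and remove duplicates."""
    seen = set()
    cleaned = []
    for name, value in rows:
        name = " ".join(name.split())  # collapse runs of whitespace
        value = value.strip().replace(",", "")  # "1,200" -> "1200"
        if not name:
            continue
        key = (name.lower(), value)  # case-insensitive duplicate check
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((name, value))
    return cleaned


def to_csv(rows, header=("name", "value")):
    """Serialize rows to CSV text; write the string to a file to store it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    scraped = [("  Budget \n 2023", " 1,200 "), ("budget 2023", "1200"),
               ("", "5"), ("Transit", "300")]
    print(to_csv(clean_rows(scraped)))
```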

Conclusion: Next steps

Custom training workshops are available for this program

Technology training sessions can be structured around individual or group learning objectives. Learn more about custom training.


University IT Technology Training sessions are available to a wide range of participants, including Stanford University staff, faculty, students, and employees of Stanford Hospitals & Clinics, such as Stanford Health Care, Stanford Health Care Tri-Valley, Stanford Medicine Partners, and Stanford Medicine Children's Health.

Additionally, some of these programs are open to interested individuals not affiliated with Stanford, allowing for broader community engagement and learning opportunities.