
Applied Machine Learning using Python and Apache Spark (3-day Class)

Class Sessions

Date:
  • Tue May 12, 9:00 am to 4:00 pm
  • Wed May 13, 9:00 am to 4:00 pm
  • Thu May 14, 9:00 am to 4:00 pm

Location: Cardinal Hall at Redwood City C144 (IT Training Room)

Cost: $1,150

Class Code

ITS-1919

Class Description

 

 

Prerequisites:

- Familiarity with Python is required

- No machine learning knowledge is required

- Working knowledge of Spark is a plus

 

Audience:

Data Analysts, Software Engineers, Data Engineers, Data Professionals, Business Intelligence Developers, Data Architects

 

This course will provide you with a thorough understanding of Machine Learning concepts, terminology and usage. It will enable you to perform Machine Learning in two ways: using Python libraries and using Apache Spark. Specifically, you will learn NumPy, Pandas, Matplotlib and Scikit-learn, known collectively as the Python data stack. We will then explore how the distributed architecture of Apache Spark helps scale Machine Learning to large volumes of data. The course culminates with you applying these tools in a hands-on Machine Learning project.

 

We will discuss using Python libraries to visually explore, clean and prepare your data for analysis. We will then look at the various options Apache Spark offers for performing Machine Learning in a distributed architecture using the Spark MLlib library. We will use Databricks Notebooks for the Spark ML hands-on work. This approach gives us a side-by-side comparison of Machine Learning with two of the most popular offerings, Python and Apache Spark.

 

Learning Objectives

In this course, you will have the opportunity to:

  • Gain a basic understanding of Machine Learning
  • Understand the differences between Supervised and Unsupervised Learning
  • Understand how to use Python libraries to explore, clean and prepare data
  • Describe the role of Machine Learning and where it fits into Information Technology strategies
  • Explain the technical and business drivers that result from using Machine Learning
  • Understand techniques like Classification, Clustering, and Regression
  • Discuss how to identify which technique to apply to a specific use case
  • Understand popular Machine Learning offerings such as Amazon Machine Learning, TensorFlow, Azure Machine Learning, Google Cloud, Spark MLlib, Python and R
  • Install and set up Anaconda
  • Perform hands-on activities using Jupyter Notebooks
  • Understand popular Machine Learning algorithms such as Linear Regression, Decision Tree, Logistic Regression, K-Nearest Neighbor and K-Means Clustering
  • Perform hands-on activities with Python libraries such as NumPy, Pandas, Matplotlib and Scikit-learn
  • Understand the Apache Spark processing framework and its distributed architecture
  • Compare Machine Learning using Python versus Apache Spark
  • Perform hands-on activities on Databricks cloud using Apache Spark MLlib

 

 

Topic Outline

Day 1

  • Course Introduction
  • History and background of Machine Learning
  • Compare Traditional Programming vs Machine Learning
  • Supervised and Unsupervised Learning Overview
  • Machine Learning patterns
    - Classification
    - Clustering
    - Regression
  • Gartner Hype Cycle for Emerging Technologies
  • Machine Learning offerings in Industry
  • Hands-on exercise 1: Install and set up Anaconda
  • Descriptive statistics
  • Milestone 1: Learn how to use Jupyter Notebooks
  • Essential libraries
    - NumPy
    - Pandas
    - Matplotlib
  • Milestone 2: Exploratory data analysis
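
As a rough illustration of the "Descriptive statistics" topic above, the following minimal sketch summarizes a small sample using only Python's standard `statistics` module. The `page_views` values are made up for the example, and the course exercises themselves are expected to use NumPy and Pandas rather than this approach.

```python
import statistics

# Hypothetical sample: daily page views for one week (made-up values).
page_views = [120, 135, 150, 110, 400, 125, 130]

# Collect a few common descriptive statistics in one place.
summary = {
    "count": len(page_views),
    "mean": statistics.mean(page_views),
    "median": statistics.median(page_views),
    "stdev": statistics.stdev(page_views),  # sample standard deviation
    "min": min(page_views),
    "max": max(page_views),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The single large value (400) pulls the mean well above the median, which is the kind of pattern exploratory data analysis is meant to surface.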

 

Day 2

  • Getting data
  • Feature selection
  • Essential libraries
    - Scikit-learn
  • Milestone 3: End-to-end project (Initiation)
  • Transforming data
  • Binary encoding
  • One-hot encoding
  • Feature Engineering
  • Training and test sets
  • Algorithms
    - Linear Regression
    - Naive Bayes
    - Decision Tree
    - Random Forest
    - Logistic Regression
    - Support Vector Machine
    - K-Nearest Neighbor
    - K-Means Clustering
  • Milestone 4: Data modeling
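
To make the "One-hot encoding" and "Training and test sets" topics above more concrete, here is a minimal pure-Python sketch. The helper names `one_hot_encode` and `split_train_test` are hypothetical and written only for this illustration. In practice, Pandas and Scikit-learn provide their own tools for both steps, which is presumably what the class uses.

```python
import random

def one_hot_encode(values):
    """Map each categorical value to a list of 0/1 flags, one per category."""
    categories = sorted(set(values))
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows, then split into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up categorical feature.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded[0])  # [0, 0, 1], the flags for "red"

# Hold out a quarter of eight made-up rows for testing.
train_rows, test_rows = split_train_test(list(range(8)), test_fraction=0.25)
print(len(train_rows), len(test_rows))  # 6 2
```

Shuffling before splitting keeps the held-out rows from simply being the first rows of the file, and fixing the seed makes the split reproducible.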

 

Day 3

  • Apache Spark Overview
  • Spark Libraries
  • Compare Machine Learning using Python vs Spark
  • Milestone 5: Databricks Cloud Community Account Setup
  • Measuring performance
    - Confusion Matrix
    - ROC curve, Area Under Curve (AUC)
  • Refining the model
  • Hyperparameter tuning
  • Grid search
  • Milestone 6: Spark MLlib Hands-on
  • Milestone 7: End-to-end project (Completion)
  • Next steps
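
The "Confusion Matrix" item under "Measuring performance" above can be sketched in a few lines of plain Python. The labels and predictions are made up, and the function names `confusion_counts` and `metrics` are hypothetical. This is only an illustration of the idea, not the course's own material, which is likely to rely on library implementations.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 1 and p == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

def metrics(counts):
    """Derive accuracy, precision and recall from the four counts."""
    total = sum(counts.values())
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total if total else 0.0,
        "precision": counts["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": counts["tp"] / actual_pos if actual_pos else 0.0,
    }

# Made-up ground truth and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
scores = metrics(counts)
print(counts)  # {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 3}
print(scores)  # accuracy, precision and recall are all 0.75 here
```

Separating the raw counts from the derived scores mirrors how a confusion matrix is usually read: the four cells first, then whichever ratio matters for the use case.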

 

Structured Activity/Exercises/Case Studies:

Day 1:

  • Milestone 1: Learn how to use Jupyter Notebooks
  • Milestone 2: Exploratory data analysis

 

Day 2:

  • Milestone 3: End-to-end project (Initiation)
  • Milestone 4: Data modeling

 

Day 3:

  • Milestone 5: Model selection
  • Milestone 6: Spark MLlib Hands-on
  • Milestone 7: End-to-end project (Completion)
     


University IT Technology Training classes are only available to Stanford University staff, faculty, or students. A valid SUNet ID is needed in order to enroll in a class.