
Big Data: Data Engineering at Scale


 

Effective immediately in response to COVID-19, all Technology Training classes will be delivered online until further notice.

In advance of each session, Tech Training will provide you with a Zoom link to your class, along with any required class materials.
 



This course provides an opportunity to take a deep dive into Big Data application development. Topics include how to design and develop applications using Spark and other Big Data ecosystem components to manipulate, analyze, and perform computations on Big Data.

Prerequisites:

  • Experience with at least one programming language such as Python, Java, or Scala (required)
  • SQL and data knowledge (preferred)
  • Familiarity with Big Data (a plus)

 

Audience:

  • Data Analysts, Software Engineers, Data Engineers, Data Professionals, Business Intelligence Developers, Data Architects, DevOps Engineers


Today, Big Data is needed across all aspects of software to make enterprise software smart. In fact, many companies have implemented some form of Big Data distribution to tame the increasing volumes of data being collected at high velocity.

Much of enterprise software can benefit from Data Engineering and Data Science. The argument one often hears is, "If our smartphones can do it, why can't my enterprise software?" This course addresses the need for software made smart through the advanced use of Big Data.

The course is intended for software architects and engineers. It provides a practical level of experience through a combination of roughly 50% lecture and 50% demo work with student participation.

 

Learning Objectives:

This course will provide you the opportunity to:

  • Gain a broad understanding of the Big Data ecosystem
  • Understand the industry's Big Data offerings, in the cloud and on premises, such as Cloudera, Hortonworks, MapR, Amazon EMR, and Microsoft Azure HDInsight
  • Perform data ingestion using Sqoop and Kafka
  • Manage data using Hive
  • Understand the impact and value of Apache Spark in the Big Data ecosystem
  • Understand the Apache Spark architecture and its libraries for use cases such as streaming, machine and deep learning, and graph processing (GraphX)
  • Set up an account on the Databricks cloud platform for Apache Spark
  • Perform hands-on activities across the Big Data ecosystem

 

Topic Outline:

Big Data overview

  • History and background of Big Data and Hadoop
  • The 5 Vs of Big Data
  • The "secret sauce" of Big Data and Hadoop
  • Big Data distributions in industry
  • End-to-end Big Data lifecycle overview
  • Demos and Labs

 

Big Data Ecosystem before Spark

  • Big Data Ecosystem before Apache Spark
  • Storage options -- HDFS and NoSQL
  • Processing options -- MapReduce, Hive, etc.
  • Administrative tools -- ZooKeeper, Oozie, etc.
  • Ingestion tools -- Sqoop, Flume
  • Demos and Labs

 

Big Data Ecosystem after Spark

  • Big Data Ecosystem after Apache Spark
  • Comparing MapReduce and Apache Spark
  • Apache Spark architecture
  • Spark libraries for use cases such as streaming, machine and deep learning, and GraphX
  • Understanding Spark RDD
  • Demos and Labs

 

Extracting Data into Hadoop Using Sqoop

  • Syntax to Use Sqoop Commands
  • Sqoop Import
  • Controlling the Import
  • Exercise: Importing to HDFS
  • Exercise: Importing to HDFS Directory
  • Exercise: Importing a Subset of Rows
  • Exercise: Encoding Database NULL Values While Importing
  • Exercise: Importing Tables Using One Command
  • Exercise: Using Sqoop's Incremental Import Feature
  • Sqoop Export
  • Sqoop's Export Methodology
  • Export Control Arguments
  • Exercise: Import and Export using Sqoop
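
The exercises above share a common command-line shape. A hedged sketch of that shape follows; the connection string, table names, and paths are illustrative placeholders, not course-provided values:

```shell
# Basic import of one table into HDFS (all values illustrative).
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username student --password-file /user/student/.pw \
  --table orders \
  --target-dir /data/orders

# Controlling the import: a subset of rows, with NULL encoding.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --where "order_date >= '2020-01-01'" \
  --null-string '\\N' --null-non-string '\\N' \
  --target-dir /data/orders_2020

# Incremental import: only rows whose id exceeds the last saved value.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --incremental append --check-column id --last-value 1000 \
  --target-dir /data/orders

# Export HDFS files back into a database table.
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --table order_summaries \
  --export-dir /data/summaries
```

These commands assume a reachable database and Hadoop cluster, so they are shown as a fragment rather than something runnable here.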

 

Ingestion using Kafka

  • Introduction to Apache Kafka
  • What is a Messaging System?
  • Why Apache Kafka
  • Apache Kafka - Fundamentals & Architecture
  • Apache Kafka - Workflow
  • Apache Kafka - The Role of ZooKeeper
  • Apache Kafka - Basic Operations
  • Console Producer & Consumer
  • Apache Kafka - Simple Producer
  • Apache Kafka - Simple Consumer
  • Kafka APIs
  • Lab Exercises
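
Kafka's workflow centers on an append-only, partitioned log that consumers read by offset. The following is a toy in-memory model of that idea only; it is not the Kafka API, and all names are invented:

```python
# Toy model of Kafka's core abstraction: an append-only, partitioned
# log read by offset. NOT the real Kafka API -- just an illustration
# of the producer/consumer workflow covered above.

class ToyTopic:
    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key, value):
        # Like Kafka, route by key so the same key always lands in the
        # same partition, preserving per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def consume(self, partition, offset):
        # Consumers pull records starting at a stored offset.
        return self.partitions[partition][offset:]

topic = ToyTopic()
topic.produce("sensor-1", 20.5)
topic.produce("sensor-1", 21.0)
p, _ = topic.produce("sensor-1", 21.5)
records = topic.consume(p, 0)
print([v for _, v in records])   # [20.5, 21.0, 21.5]
```

In real Kafka, brokers store the partitions durably and consumer groups track their own offsets; the toy keeps only the routing and offset ideas.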

 

Managing Data Using Hive

  • HiveQL (HQL)
  • The Components of Hive
  • Query Flows
  • Interfacing with Hive
  • Hive Commands
  • Hive Data
  • Data Types
  • Operators and Functions
  • Creating and Dropping Databases
  • Hive Tables
  • Hive Views
  • Order By, Group By, Cluster By, Distribute By
  • Hive Partitions
  • Browsing, Altering, and Dropping Tables and Partitions
  • Loading Data
  • Exercise: Basic Commands for Working with Hive Tables
  • Exercise: Partition a Table
  • Bucketing in Hive
  • Lab Exercise
  • Indexes in Hive
  • Lab Exercise
  • User Defined Functions in Hive
  • Lab Exercise
  • Hive DML (ACID) Operations (Update and Delete)
  • Lab Exercise
  • Sampling in Hive
  • Lab Exercise
  • Running Hive as Script
  • Lab Exercise
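
Several of the Hive topics above (tables, partitions, loading, querying) fit together in a short HiveQL sketch; table, column, and path names here are illustrative, not the lab's actual values:

```sql
-- Hedged HiveQL sketch: create a partitioned table.
CREATE TABLE page_views (user_id BIGINT, url STRING)
PARTITIONED BY (view_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load a file into one partition (path is a placeholder).
LOAD DATA INPATH '/data/views/2020-03-01.csv'
INTO TABLE page_views PARTITION (view_date = '2020-03-01');

-- Partition pruning: only the named partition's files are scanned.
SELECT url, COUNT(*) AS hits
FROM page_views
WHERE view_date = '2020-03-01'
GROUP BY url
ORDER BY hits DESC;
```

Partitioning by a column like `view_date` is what makes the WHERE clause cheap: Hive reads only that partition's directory instead of the whole table.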

 

Apache Spark SQL, DataFrames, Datasets

  • Introduction to Spark SQL
  • The Spark SQL, DataFrames, and Datasets library
  • Comparing the APIs: RDDs, DataFrames, and Datasets
  • Demos and Labs
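
Spark SQL's core idea is running declarative SQL over structured, schema'd data. Since no Spark cluster is assumed here, this sketch deliberately substitutes Python's built-in sqlite3 just to show the query model; in PySpark the equivalent would be `spark.sql(...)` over a registered DataFrame:

```python
# Illustration of the Spark SQL query model using sqlite3 as a
# stand-in engine (a named substitution -- this is not Spark).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user TEXT, page TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("a", "home"), ("b", "home"), ("a", "cart")])

# The same declarative aggregation you would write in Spark SQL.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM clicks GROUP BY page ORDER BY page"
).fetchall()
print(rows)   # [('cart', 1), ('home', 2)]
```

The point of the comparison bullet above is that DataFrames and Datasets let you express this same query through method calls (`groupBy`, `count`) with the optimizer producing the same plan.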

 

Machine Learning using Apache Spark

  • Introduction to Machine Learning and Data Science
  • Machine Learning Spark Library
  • Spark Machine Learning examples
  • Demos and Labs
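
MLlib itself needs a Spark cluster, so here is a pure-Python sketch of what a linear-regression fit does conceptually; the data and learning rate are invented, and Spark's contribution is distributing this same kind of computation across many machines:

```python
# Conceptual sketch of fitting y = w*x + b by gradient descent on
# squared error -- what an MLlib LinearRegression fit does at scale.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]        # roughly y = 2x

w, b, lr = 0.0, 0.0, 0.02
for _ in range(2000):
    # Average gradients of the squared-error loss over the dataset.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)    # w near 2, b near 0
```

The least-squares slope for this toy data is 1.94; gradient descent converges to it, and in Spark the per-record gradient terms would be computed in parallel and summed.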

 

Streaming using Apache Spark

  • The need for real-time processing
  • Streaming Spark Library
  • Spark Streaming examples
  • Demos and Labs
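
Spark Streaming's classic model is micro-batching: events arrive continuously, are grouped into small batches, and each batch is processed with the same logic you would use on static data. A plain-Python sketch of that loop (no Spark assumed, data invented):

```python
# Micro-batch sketch: process a "stream" in fixed-size batches while
# keeping running state across batches, as Spark Streaming does.
from collections import Counter

events = ["error", "ok", "ok", "error", "ok", "error", "ok", "ok"]
batch_size = 4

running = Counter()
for i in range(0, len(events), batch_size):
    batch = events[i:i + batch_size]   # one micro-batch
    running.update(batch)              # stateful aggregation across batches
    print(i // batch_size, dict(running))

print(running["error"])   # 3
```

In Spark the batches arrive on a clock interval from a source such as Kafka, and the running state is managed fault-tolerantly, but the shape of the computation is the same.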

 

 

Getting Started with Apache Spark

  • Introduction to Spark RDD
  • Spark RDD Transformation and Actions
  • Spark Lifecycle
  • Spark Caching
  • Setup Account on Apache Spark Databricks Cloud
  • Databricks Notebooks overview
  • Lab - Spark RDD Transformation & Actions
  • Lab - Spark RDD Advanced Transformation & Actions
  • Demos and Labs
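
The key RDD behavior covered above is laziness: transformations like `map` and `filter` only record a plan, and nothing executes until an action such as `collect` runs. A minimal single-machine toy that mirrors the PySpark API shape (it is not Spark):

```python
# Toy illustration of lazy RDD transformations vs. eager actions.

class ToyRDD:
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops              # the recorded plan, not yet executed

    def map(self, f):                # transformation: returns a new plan
        return ToyRDD(self._data, self._ops + (("map", f),))

    def filter(self, f):             # transformation: returns a new plan
        return ToyRDD(self._data, self._ops + (("filter", f),))

    def collect(self):               # action: runs the whole plan now
        items = iter(self._data)
        for kind, f in self._ops:
            items = map(f, items) if kind == "map" else filter(f, items)
        return list(items)

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())   # [0, 4, 16, 36, 64]
```

In real Spark the recorded plan is a DAG distributed over partitions, and caching (also listed above) saves an intermediate plan's results so later actions skip recomputation.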

 

Non-relational databases

 

Evolution of Data Storage

  • History of data stores
  • OLTP versus OLAP
  • Data warehousing concepts
  • Data growth and usage patterns
  • 5Vs of Big Data
  • Hadoop and HDFS

 

Relational versus non-relational databases

  • CAP Theorem
  • Comparison of relational and non-relational databases
  • NoSQL databases
  • NoSQL database types
  • Document stores
  • Graph databases
  • Column-oriented databases
  • Key-value stores
  • Search
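
The NoSQL types above differ mainly in how they shape the same data. A hedged illustration of one user record in each layout; every name and value here is invented:

```python
# The same information shaped four ways, one per NoSQL family.

# Key-value store: an opaque value looked up by key.
kv = {"user:42": '{"name": "Ada", "city": "Palo Alto"}'}

# Document store: the value is a structured, queryable document.
doc = {"_id": 42, "name": "Ada", "city": "Palo Alto",
       "orders": [{"sku": "A1", "qty": 2}]}

# Column-oriented store: values grouped by column, good for scans
# and aggregates over one attribute.
columns = {"name": ["Ada", "Lin"], "city": ["Palo Alto", "Oakland"]}

# Graph database: entities plus explicit relationships.
graph = {"nodes": {42: "Ada", 7: "Lin"},
         "edges": [(42, "FRIEND_OF", 7)]}

print(doc["orders"][0]["qty"])   # 2
```

The trade-off the CAP discussion above sets up is visible here: each layout makes one access pattern cheap (lookup, document query, columnar scan, traversal) at the cost of the others.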

 

HBase

  • What is HBase?
  • HBase architecture
  • Understanding the HBase schema
  • HBase vs. RDBMS
  • The HBase shell
  • HBase commands
  • Importing data into HBase
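
The shell topics above map to a handful of commands; a hedged sketch follows, with the table name, column family, and values invented:

```
# Create a table with one column family, write cells, and read them.
create 'users', 'info'
put 'users', 'r1', 'info:name', 'Ada'
put 'users', 'r1', 'info:city', 'Palo Alto'
get 'users', 'r1'                  # read one row
scan 'users', {LIMIT => 10}        # scan up to 10 rows
disable 'users'                    # tables must be disabled before drop
drop 'users'
```

Note the schema model these commands reflect: only column families are fixed at table creation; individual columns (`info:name`, `info:city`) are created on write.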


University IT Technology Training classes are only available to Stanford University staff, faculty, or students. A valid SUNet ID is needed in order to enroll in a class.

Custom training workshops are available for this program: technology training sessions structured around individual or group learning objectives. Learn more about custom training.


University IT Technology Training sessions are available to a wide range of participants, including Stanford University staff, faculty, students, and employees of Stanford Hospitals & Clinics, such as Stanford Health Care, Stanford Health Care Tri-Valley, Stanford Medicine Partners, and Stanford Medicine Children's Health.

Additionally, some of these programs are open to interested individuals not affiliated with Stanford, allowing for broader community engagement and learning opportunities.