Big Data: Data Engineering at Scale

Effective immediately in response to COVID-19, all Technology Training classes will be delivered online until further notice.

In advance of each session, Tech Training will provide you with a Zoom link to your class, along with any required class materials.

This course provides an opportunity to deep dive into Big Data application development. Topics include how to design and develop applications using Spark and other Big Data Ecosystem components to manipulate, analyze and perform computations on Big Data.

Prerequisites:

Experience with at least one programming language such as Python, Java, or Scala is required
SQL and data knowledge preferred
Familiarity with Big Data is a plus

Audience:

Data Analysts, Software Engineers, Data Engineers, Data Professionals, Business Intelligence Developers, Data Architects, DevOps Engineers

Today, Big Data is needed in all aspects of software to make enterprise software smart. In fact, many companies have implemented some form of Big Data distribution to tame the increased volumes of data being collected at high velocity.

Much of enterprise software can benefit from Data Engineering and Data Science. The argument one often hears is, "If our smartphones can do it, why can't my enterprise software?" This course addresses the need for software made smart through the advanced use of Big Data.

The course is intended for software architects and engineers. It gives them practical experience through a combination of about 50% lecture and 50% demo work with student participation.

Learning Objectives:

This course will provide you the opportunity to:

Have a broad understanding of the Big Data Ecosystem
Understand the various industry offerings for Big Data in the cloud and on premises, such as Cloudera, Hortonworks, MapR, Amazon EMR, and Microsoft Azure HDInsight
Perform data ingestion using Sqoop and Kafka
Manage data using Hive
Understand the impact and value of Apache Spark in the Big Data Ecosystem
Understand the Apache Spark Architecture and its libraries for use cases such as Streaming, Machine & Deep Learning, GraphX, etc.
Set up an account on Apache Spark Databricks Cloud
Perform hands-on activities on the Big Data Ecosystem

Topic Outline:

Big Data overview

A brief history of Big Data
History and background of Big Data and Hadoop
5 Vs of Big Data
Secret Sauce of Big Data Hadoop
Big Data Distributions in Industry
End-to-End Big Data Life cycle overview
Demos and Labs

Big Data Ecosystem before Spark

Big Data Ecosystem before Apache Spark
Storage options -- HDFS and NoSQL
Processing options -- MapReduce, Hive, etc.
Administrative tools -- ZooKeeper, Oozie, etc.
Ingestion tools -- Sqoop, Flume
Demos and Labs

Big Data Ecosystem after Spark

Big Data Ecosystem after Apache Spark
Compare MapReduce vs. Apache Spark
Apache Spark Architecture
Understand Apache Spark Architecture and Libraries like Streaming, Machine & Deep Learning, GraphX, etc.
Understanding Spark RDD
Demos and Labs

Extracting Data into Hadoop Using Sqoop

Syntax to Use Sqoop Commands
Sqoop Import
Controlling the Import
Exercise: Importing to HDFS
Exercise: Importing to HDFS Directory
Exercise: Importing a Subset of Rows
Exercise: Encoding Database NULL Values While Importing
Exercise: Importing Tables Using One Command
Exercise: Using Sqoop's Incremental Import Feature
Sqoop Export
Sqoop's Export Methodology
Export Control Arguments
Exercise: Import and Export using Sqoop

Ingestion using Kafka

Introduction to Apache Kafka
What is a Messaging System?
Why Apache Kafka
Apache Kafka - Fundamentals & Architecture
Apache Kafka - Work Flow
Apache Kafka - Zookeeper Role
Apache Kafka - Basic Operations
Console Producer & Consumer
Apache Kafka - Simple Producer
Apache Kafka - Simple Consumer
Kafka APIs
Lab Exercises

Managing Data Using Hive

HiveQL (HQL)
The Components of Hive
Query Flows
Interfacing with Hive
Hive Commands
Hive Data
Data Types
Operators and Functions
Creating and Dropping Databases
Hive Tables
Hive Views
Order By, Group By, Cluster By, Distribute By
Hive Partitions
Browsing, Altering, and Dropping Tables and Partitions
Loading Data
Exercise: Basic Commands for Working with Hive Tables
Exercise: Partition a Table
Bucketing in Hive
Lab Exercise
Indexes in Hive
Lab Exercise
User Defined Functions in Hive
Lab Exercise
Hive DML (ACID) Operations (Alter and Delete)
Lab Exercise
Sampling in Hive
Lab Exercise
Running Hive as Script
Lab Exercise

Apache Spark SQL, DataFrames, Datasets

Introduction to Spark SQL
SQL, DataFrames and Datasets Spark Library
Compare the various APIs - RDD, DataFrames and Datasets
Demos and Labs

Machine Learning using Apache Spark

Introduction to Machine Learning and Data Science
Machine Learning Spark Library
Spark Machine Learning examples
Demos and Labs

Streaming using Apache Spark

The need for real-time processing
Streaming Spark Library
Spark Streaming examples
Demos and Labs

Getting Started with Apache Spark

Introduction to Spark RDD
Spark RDD Transformation and Actions
Spark Lifecycle
Spark Caching
Set Up an Account on Apache Spark Databricks Cloud
Databricks Notebooks overview
Lab - Spark RDD Transformation & Actions
Lab - Spark RDD Advanced Transformation & Actions
Demos and Labs

Non-relational databases

Evolution of Data Storage

History of data stores
OLTP versus OLAP
Data warehousing concepts
Data growth and usage patterns
5 Vs of Big Data
Hadoop and HDFS

Relational versus non-relational databases

CAP Theorem
Comparison of Relational and non-relational databases
NoSQL Databases
NoSQL Database types
Document Stores
Graph Database
Column-Oriented Database
Key-value
Search

HBase

What is HBase?
HBase Architecture
Understanding HBase Schema
HBase vs. RDBMS
HBase Shell
HBase commands
Importing Data to HBase

University IT Technology Training classes are only available to Stanford University staff, faculty, or students. A valid SUNet ID is needed in order to enroll in a class.

Custom training workshops are available for this program

For Stanford Affiliates: