Hadoop, Hive, and Pig (2-day class)

Learn techniques to store and process petabytes of data using the Hadoop Distributed File System (HDFS), Pig, and Hive. You will write Hive queries and Pig scripts to process data, and learn how HDFS stores large data sets across a cluster with replication.

We'll show you the basics of distributed computing and the corresponding programming models, and explain fault tolerance and redundancy. You'll also use NoSQL in the context of Hadoop to perform data analytics. The course covers both the Cloudera and Hortonworks platforms.

Topics covered include:

Big Data Overview
- Differences between traditional tools and Big Data tools
- How is ETL done in Big Data?
- What are the infrastructure tools?
- Cloud infrastructure
- Physical Infrastructure
- Hybrid model
- What is the Big Data ecosystem?
- Hadoop/Cassandra/Apache Spark
- NoSQL (MongoDB, HBase, etc.)

Hadoop Overview
- Parallel Computing vs. Distributed Computing
- RDBMS/SQL vs. Hadoop
- Hadoop Architecture (V1 and V2): Name Node, Data Node, Job Tracker, Task Tracker, YARN
- Vendor Comparison (Cloudera, Hortonworks, MapR, Amazon EMR)
- Use cases

Planning Cluster
- General Planning Considerations
- Choosing The Right Hardware
- Network Considerations
- Configuring Nodes

HDFS Deep Dive
- Name Node architecture (Edit Log, FsImage, location of replicas)
- Secondary Name Node architecture
- Data Node architecture
- Write Pipeline
- Read Pipeline
- Heartbeats, Data Node commissioning/decommissioning, Rack Awareness, Block Scanner, Balancer, Trash, Health Check, Safe mode
- HDFS Federation (next gen)
- HDFS HA (next gen)
- HDFS Benchmarking
- Exploring the HDFS Apache Web UI
- Exploring the Cloudera Web UI for HDFS functions
- LAB #1: HDFS commands using Hadoop cluster

Data Ingestion Tools        
- Flume
- Sqoop
- Kafka
- LAB #2: Book Store: Ingest unstructured data using Flume
- LAB #3: Book Store: Ingest Books tables using Sqoop from MySQL
 
Hive & Impala
- Philosophy and architecture
- Hive vs. RDBMS
- HiveQL and Hive Shell
- Managing tables
- Data types and schemas
- Querying data
- Partitions and Buckets
- Intro to User Defined Functions
- Hive Query Optimization
- LAB #6: Book Store: Data analysis with Hive
 
Pig
- Philosophy and architecture
- Why Pig?
- Pig Latin and the Grunt shell
- Loading and analyzing structured/unstructured data
- Data types and schemas
- Pig Latin details: structure, functions, expressions, and relational operators
- Intro to User Defined Functions and Scripts
- LAB #7: Book Store: Data analysis with Pig

Custom training workshops are available for this program

Technology training sessions can be structured around individual or group learning objectives.


University IT Technology Training sessions are available to a wide range of participants, including Stanford University staff, faculty, students, and employees of Stanford Hospitals & Clinics, such as Stanford Health Care, Stanford Health Care Tri-Valley, Stanford Medicine Partners, and Stanford Medicine Children's Health.

Additionally, some of these programs are open to interested individuals not affiliated with Stanford, allowing for broader community engagement and learning opportunities.