
Fundamentals of Big Data

Effective immediately in response to COVID-19, all Technology Training classes will be delivered online until further notice.


In advance of each session, Tech Training will provide you with a Zoom link to your class, along with any required class materials.
 




This class introduces the history and background of Big Data. You will also get an introduction to working with Big Data ecosystem technologies, including HDFS, MapReduce, Hive, Pig, Machine Learning, and more.

Prerequisite: Basic programming knowledge; SQL and data knowledge preferred.

Description:

The Introduction to Big Data course is the first stop in the Big Data curriculum series coming up at Stanford. It introduces the history and background of Big Data.

Along the way, you will get an introduction to working with Big Data ecosystem technologies (HDFS, MapReduce, Sqoop, Flume, Hive, Pig, Mahout (Machine Learning), R Connector, Ambari, Zookeeper, Oozie, and NoSQL stores such as HBase) for Big Data scenarios.

This course will provide an understanding of the Big Data ecosystem before and after Apache Spark. Finally, you will learn the Spark core fundamentals and architecture. You will set up an account on Apache Spark Databricks Cloud and perform a big data analysis exercise using Apache Spark.

Learning Objectives:

After this course, you will be able to:

  • Understand the history and background of Big Data and Hadoop
  • Describe the Big Data landscape, including examples of real-world big data problems
  • Explain the 5 V’s of Big Data (volume, velocity, variety, veracity, and value)
  • Understand the foundational principles that have made Big Data so successful
  • Explain ecosystem components such as HDFS, MapReduce, Sqoop, Flume, Hive, Pig, Mahout (Machine Learning), R Connector, Ambari, Zookeeper, Oozie, and NoSQL stores such as HBase
  • Understand the industry offerings for Big Data in the cloud and on premises, such as Cloudera, Hortonworks, MapR, Amazon EMR, and Microsoft Azure HDInsight
  • Understand the impact and value of Apache Spark in the Big Data ecosystem
  • Understand the Apache Spark architecture and the libraries that support use cases such as Streaming, Machine & Deep Learning, GraphX, etc.
  • Set up an account on Apache Spark Databricks Cloud
  • Perform hands-on activities in the Big Data ecosystem

Topic Outline:

  • Course Introduction
  • History and background of Big Data and Hadoop
  • 5 V’s of Big Data
  • Secret Sauce of Big Data Hadoop 
  • Big Data Distributions in Industry
  • Big Data Ecosystem before Apache Spark
  • Big Data Ecosystem after Apache Spark
  • Comparison of MapReduce Vs Apache Spark
  • Understand Apache Spark Architecture and Libraries like Streaming, Machine & Deep Learning, GraphX, etc.
  • Exercise 1 – Set Up an Account on Apache Spark Databricks Cloud
  • Exercise 2 – First Spark Program
  • Exercise 3 – Spark RDD Transformations & Actions
  • Exercise 4 – Spark RDD Advanced Transformations & Actions
  • References and Next steps

Structured Activity/Exercises/Case Studies:

  • Exercise 1 – Set Up an Account on Apache Spark Databricks Cloud
  • Exercise 2 – First Spark Program
  • Exercise 3 – Spark RDD Transformations & Actions
  • Exercise 4 – Spark RDD Advanced Transformations & Actions
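
As a point of reference for the MapReduce model named in the topic outline, here is a minimal word-count sketch in plain Python. It is not Hadoop or Spark code and is not taken from the course materials. The function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample sentences are hypothetical, chosen only to show the map, shuffle, and reduce steps on a single machine.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group the emitted values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases: map, then shuffle, then reduce."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["big data is big", "spark processes big data"]
    print(word_count(sample))
```

In a real cluster, the map and reduce steps run in parallel across many machines and the framework performs the shuffle; this sketch only mirrors the data flow.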
     

 

University IT Technology Training classes are only available to Stanford University staff, faculty, students and Stanford Hospitals & Clinics employees. A valid SUNet ID is needed in order to enroll in a class.

 

 

Custom training workshops are available for this program

Technology training sessions structured around individual or group learning objectives. Learn more about custom training


University IT Technology Training sessions are available to a wide range of participants, including Stanford University staff, faculty, students, and employees of Stanford Hospitals & Clinics, such as Stanford Health Care, Stanford Health Care Tri-Valley, Stanford Medicine Partners, and Stanford Medicine Children's Health.

Additionally, some of these programs are open to interested individuals not affiliated with Stanford, allowing for broader community engagement and learning opportunities.