Big Data Syllabus

The document outlines a curriculum for a 45-hour course on Big Data, covering five units of 9 hours each: Introduction to Big Data; MapReduce Development & Hive/Pig; Apache Spark for Batch & Real-Time Processing; NoSQL, Kafka & Real-Time Analytics; and Visualization, Optimization & Cloud Deployment. Each unit lists specific topics such as Hadoop architecture, the MapReduce pipeline, Spark cluster architecture, NoSQL database models, and data visualization approaches.

UNIT TITLE HOURS

I Introduction to Big Data 9


Introduction to Big Data, Big Data characteristics, Challenges of Conventional Systems, Types of Big Data,
Intelligent data analysis, Traditional vs. Big Data business approach, Case Study of Big Data Solutions,
Hadoop architecture: HDFS, NameNode/DataNode, block replication, Setting up and configuring HDFS in
standalone/pseudo-distributed mode, HDFS commands and data ingestion best practices, Hadoop
ecosystem: YARN, MapReduce framework overview, Data ingestion patterns: Sqoop for RDBMS, Flume
for streaming
II MapReduce Development & Hive/Pig 9
MapReduce pipeline: Mapper, Reducer, Combiner, Partitioner, Data formats: Writables, SequenceFile, Avro,
Parquet, Hive architecture, HiveQL: table creation, partitions, UDFs, Pig Latin: scripting, data flow
operators, performance considerations

III Apache Spark for Batch & Real-Time Processing 9
Spark cluster architecture: driver, executors, master, RDD vs DataFrame vs Dataset abstractions, Spark
SQL and DataFrame transformations & actions, Spark Streaming: micro-batch processing, MLlib
introduction: basic ML pipelines
IV NoSQL, Kafka & Real-Time Analytics 9
NoSQL database models: key-value, document, column-family, graph, Cassandra data modelling and
architecture, MongoDB CRUD operations and indexing strategies, Kafka architecture: producers,
consumers, partitions, Kafka-Spark integration for real-time processing
V Visualization, Optimization & Cloud Deployment 9
Data visualization approaches using Zeppelin, Jupyter, or Grafana, Spark optimization: shuffles, caching,
partitioning strategies, Hadoop & Spark deployment models: standalone, YARN, Mesos, Kubernetes,
Integration with cloud services: AWS EMR, Azure HDInsight, End-to-end workflow orchestration using
Oozie or Airflow
TOTAL HOURS: 45
