Data Analyst
Core Data Analyst
This course introduces you to modern tools, such as Jupyter Notebook and various Python
libraries, and shows you how to work with data. Along the way, you will learn about the many
different types of data and how to clean, blend, visualize, and analyse data to gain useful insights.
Data literacy is the ability to read, work with, analyse, and argue with data. Data analysis is
the process of cleaning and modelling your data to discover useful information.
Python for Data Analysis
Why Python Programming
Data Types and Operators
• Arithmetic Operators
• Variables and Assignment Operators
• Integers and Floats
• Booleans, Comparison Operators, and Logical Operators
• Strings
• Type and Type Conversion
• String Methods
• Lists and Membership Operators
• List Methods
• Tuples
• Sets
• Dictionaries and Identity Operators
• Compound Data Structures
Functions
• Defining Functions
• Variable Scope
• Documentation
• Lambda Expressions
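As a taste of the Functions topics above, here is a minimal sketch covering definition, variable scope, docstrings, and lambda expressions. The names `make_counter`, `area`, and `by_length` are illustrative only and are not taken from the course material.

```python
def make_counter():
    """Return a function that counts how many times it has been called."""
    count = 0  # local to make_counter, captured by the inner function

    def increment():
        nonlocal count  # rebind the enclosing variable instead of creating a new local
        count += 1
        return count

    return increment


def area(width, height=1.0):
    """Return the area of a rectangle; height defaults to 1.0."""
    return width * height


counter = make_counter()
first, second = counter(), counter()

# A lambda expression is an anonymous, single-expression function.
by_length = sorted(["pandas", "sql", "numpy"], key=lambda name: len(name))
```

The docstrings are available at runtime through `help(area)` or `area.__doc__`, which is what the Documentation topic refers to.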
NumPy
• Introduction to NumPy
• Why Use NumPy?
• Creating and Saving NumPy ndarrays
• Using Built-in Functions to Create ndarrays
• Create an ndarray
• Accessing, Deleting, and Inserting Elements Into ndarrays
• Slicing ndarrays
• Boolean Indexing, Set Operations, and Sorting
• Manipulating ndarrays
• Arithmetic operations and Broadcasting
• Creating ndarrays with Broadcasting
Pandas
• Introduction to Pandas
• Why Use Pandas?
• Creating Pandas Series
• Accessing and Deleting Elements in Pandas Series
• Arithmetic Operations on Pandas Series
• Manipulate a Series
• Creating Pandas DataFrames
• Accessing Elements in Pandas DataFrames
• Dealing with NaN
• Manipulate a DataFrame
• Loading Data into a Pandas DataFrame
Data Wrangling with Pandas
• What is data wrangling?
• Cleaning up the data
• Restructuring the data
• Handling duplicate, missing, or invalid data
Aggregating Pandas DataFrames
• Database-style operations on DataFrames
• DataFrame operations
• Aggregations with pandas and numpy
• Time series
Exploratory Data Analysis
• Exploratory Data Analysis Fundamentals
• Visual Aids for EDA
• EDA with Personal Email
• Data Transformation
SQL
• Basic SQL
• SQL Joins
• SQL Aggregations
• SQL Subqueries and Temporary Tables
• SQL Data Cleaning
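The join, aggregation, and subquery topics above can be tried without installing a database server. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their rows are made up for illustration, and other SQL dialects may differ in small details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 30.0), (3, 2, 15.0);
""")

# Join + aggregation: total per customer; LEFT JOIN keeps customers with no orders.
totals = conn.execute("""
SELECT c.name, COALESCE(SUM(o.amount), 0) AS total
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY c.id
""").fetchall()

# Subquery: customers whose total spend exceeds the average order amount.
above_avg = conn.execute("""
SELECT c.name
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name
HAVING SUM(o.amount) > (SELECT AVG(amount) FROM orders)
ORDER BY c.name
""").fetchall()

conn.close()
```

`COALESCE` replaces the NULL total of a customer without orders with 0, a small instance of the data cleaning topic.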
Data Visualization
Visualizing Data with Pandas and Matplotlib
• An introduction to matplotlib
• The basics
• Plot components
• Additional options
• Plotting with pandas
• Evolution over time
• Relationships between variables
• Distributions
• Counts and frequencies
• The pandas.plotting subpackage
• Scatter matrices
• Lag plots
• Autocorrelation plots
• Bootstrap plots
Plotting with Seaborn and Customization Techniques
• Utilizing seaborn for advanced plotting
• Categorical data
• Correlations and heatmaps
• Regression plots
• Distributions
• Faceting
• Formatting
• Titles and labels
• Legends
• Formatting axes
• Customizing visualizations
• Adding reference lines
• Shading regions
• Annotations
• Colors
Working with Unstructured Big Data
Exploring Text Data and Unstructured Data
• Preparing to work with unstructured data
• Tokenization explained
• Counting words and exploring results
• Normalizing text techniques
• Stemming and lemmatization in action
• Excluding words from analysis
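To make the tokenization, normalization, word-counting, and exclusion topics above concrete, here is a minimal sketch using only the standard library. The stop-word list and sample sentence are invented for illustration. Stemming and lemmatization are not shown because they normally rely on a dedicated NLP package.

```python
import re
from collections import Counter

# A tiny, illustrative stop-word list; real analyses use much larger ones.
STOP_WORDS = {"the", "a", "and", "is", "of", "to"}


def tokenize(text):
    """Normalize to lowercase and split the text into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_words(text, stop_words=STOP_WORDS):
    """Count tokens, excluding any that appear in stop_words."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    return Counter(tokens)


sample = "The data is clean, and the data is ready to analyse."
counts = count_words(sample)
```

`counts.most_common(n)` then lists the `n` most frequent remaining words, which is a common first step when exploring results.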
Practical Sentiment Analysis
• Why sentiment analysis is important
• Elements of an NLP model
• Sentiment analysis packages
• Sentiment analysis in action
• Manual input
• Social media file input
Data Engineer
Big Data Analytics with Hadoop
Apache Hadoop is one of the most widely used platforms for big data processing, and can be combined
with a host of other big data tools to build powerful analytics solutions. Big Data Analytics
with Hadoop 3 shows you how to do just that, by providing insights into the software as well
as its benefits with the help of practical examples.
Once you have taken a tour of Hadoop 3's latest features, you will get an overview of HDFS,
MapReduce, and YARN, and how they enable faster, more efficient big data processing. You
will then move on to learning how to integrate Hadoop with open source tools, such as Python
and R, to analyse and visualize data and perform statistical computing on big data. As you
become acquainted with all of this, you will explore how to use Hadoop 3 with Apache Spark
and Apache Flink for real-time data analytics and stream processing.
Chapter 1: Introduction to Hadoop
• Hadoop Distributed File System
• MapReduce framework
• YARN
• Installing Hadoop 3
Chapter 2: Overview of Big Data Analytics
• Introduction to data analytics
• Introduction to big data
• Distributed computing using Apache Hadoop
• The MapReduce framework
• Hive
• Apache Spark
Chapter 3: Big Data Processing with MapReduce
• The MapReduce framework
• MapReduce job types
• MapReduce patterns
o Aggregation patterns
o Filtering patterns
o Join patterns
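The aggregation pattern listed above can be sketched in plain Python, with no Hadoop cluster involved. This is a single-process imitation of the map, shuffle, and reduce phases, not Hadoop's actual API. The function names and input records are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for each word in one input line."""
    return [(word, 1) for word in record.lower().split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all values for one key into a single count."""
    return key, sum(values)


records = ["big data tools", "big data analytics"]
mapped = [pair for record in records for pair in map_phase(record)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster the map and reduce calls run in parallel on different nodes, and the shuffle moves data across the network. The data flow is the same.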
Chapter 4: Scientific Computing and Big Data Analysis with Python and Hadoop
• Installation
• Data analysis
Chapter 5: Statistical Big Data Computing with R and Hadoop
• Introduction
• Methods of integrating R and Hadoop
• Data analytics
Chapter 6: Batch Analytics with Apache Spark
• SparkSQL and DataFrames
• DataFrame APIs and the SQL API
• Schema – structure of data
• Loading datasets
• Saving datasets
• Aggregations
• Joins
Chapter 7: Real-Time Analytics with Apache Spark
• Streaming
• Spark Streaming
• fileStream
• Transformations
• Checkpointing
• Driver failure recovery
• Interoperability with streaming platforms (Apache Kafka)
• Handling event time and late data
• Fault-tolerance semantics
Chapter 8: Batch Analytics with Apache Flink
• Introduction to Apache Flink
• Installing Flink
• Using the Flink cluster UI
• Batch analytics
Chapter 9: Stream Processing with Apache Flink
• Introduction to streaming execution model
• Data processing using the DataStream API
Chapter 10: Visualizing Big Data
• Introduction
• Tableau
• Chart types
• Using Python to visualize data
• Using R to visualize data
• Big data visualization tools
Optional:
Chapter 11: Introduction to Cloud Computing
• Concepts and terminology
• Goals and benefits
• Risks and challenges
• Roles and boundaries
• Cloud characteristics
• Cloud delivery models
• Cloud deployment models