
Foundation of Data Science

1. Introduction to Data Science


Introduction to Data Science
• Definition: Data Science is a multidisciplinary field that uses scientific methods, algorithms,
processes, and systems to extract knowledge and insights from structured and unstructured
data.
• Key Components: It involves the integration of statistics, computer science, machine
learning, data mining, and domain knowledge.
• The 3 V’s of Data:
• Volume: Refers to the vast amount of data generated every second from various
sources (e.g., social media, sensors, transactions).
• Velocity: The speed at which data is generated, processed, and analyzed. In today’s
fast-paced world, data needs to be processed in real-time or near real-time.
• Variety: The different forms and types of data, including structured (e.g., databases),
semi-structured (e.g., XML, JSON), and unstructured data (e.g., text, images,
videos).

Why Learn Data Science?


• Demand for Data Scientists: The demand for data scientists is high across various
industries, as businesses increasingly rely on data-driven decision-making.
• Versatility: Data Science skills are applicable in numerous fields such as healthcare,
finance, marketing, and technology.
• Problem Solving: Data Science enables professionals to solve complex problems, improve
business processes, and innovate.
• Career Growth: Offers lucrative career opportunities with high earning potential and job
security.

Applications of Data Science


• Healthcare: Predictive analytics for patient outcomes, personalized medicine, and medical
image analysis.
• Finance: Fraud detection, risk management, algorithmic trading, and customer
segmentation.
• Marketing: Customer behavior analysis, targeted advertising, sentiment analysis, and
recommendation systems.
• Retail: Inventory management, demand forecasting, and personalized shopping experiences.
• Transportation: Route optimization, autonomous vehicles, and predictive maintenance.

The Data Science Lifecycle


• Data Collection: Gathering data from various sources such as databases, sensors, or the
web.
• Data Cleaning: Preprocessing the data to handle missing values, outliers, and errors to
ensure quality.
• Data Exploration: Analyzing the data to discover patterns, trends, and relationships using
statistical methods.
• Data Modeling: Building predictive models using machine learning algorithms to make
forecasts or decisions.
• Data Interpretation: Interpreting the results to gain insights and inform decision-making.
• Model Deployment: Implementing the model in a production environment where it can be
used to make real-time decisions.
• Monitoring & Maintenance: Continuously monitoring the model’s performance and
updating it as needed.

Data Scientist’s Toolbox


• Programming Languages: Python, R, and SQL are essential for data manipulation,
analysis, and modeling.
• Libraries & Frameworks:
• Pandas: Data manipulation and analysis.
• NumPy: Numerical computing.
• Scikit-learn: Machine learning algorithms.
• TensorFlow & PyTorch: Deep learning frameworks.
• Data Visualization Tools: Matplotlib, Seaborn, and Tableau for creating visual
representations of data.
• Big Data Technologies: Hadoop and Spark for processing and analyzing large datasets.
• Database Management: SQL databases (e.g., MySQL, PostgreSQL) and NoSQL databases
(e.g., MongoDB, Cassandra).
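
A minimal sketch of how a few of these tools fit together, assuming pandas, NumPy, and scikit-learn are installed (the toy dataset and column names are hypothetical):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy dataset: hours studied vs. exam score (illustrative values only)
df = pd.DataFrame({"hours": [1, 2, 3, 4, 5], "score": [52, 58, 65, 70, 78]})

X = df[["hours"]].to_numpy()            # feature matrix as a NumPy array
y = df["score"].to_numpy()              # target values

model = LinearRegression().fit(X, y)    # fit a simple regression model
print(model.predict(np.array([[6]])))   # predicted score for 6 hours of study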

Types of Data
• Structured Data:
• Definition: Data that is organized in a specific format, often in rows and columns,
making it easily searchable in databases.
• Examples: Excel sheets, SQL databases.
• Semi-structured Data:
• Definition: Data that doesn’t have a fixed format but includes tags or markers to
separate elements.
• Examples: XML, JSON files.
• Unstructured Data:
• Definition: Data that lacks a specific format or structure, making it more challenging
to process and analyze.
• Examples: Text documents, images, videos, emails.
• Problems with Unstructured Data:
• Storage Issues: Requires more space and advanced storage solutions.
• Processing Complexity: Difficult to process and analyze due to its lack of
structure.
• Interpretation Challenges: Requires advanced techniques like natural
language processing (NLP) or image recognition.
Data Sources
• Open Data: Publicly available data that can be freely used and shared. Examples include
government datasets, public health data, and environmental data.
• Social Media Data: Data generated from social media platforms, such as posts, likes,
shares, and comments. Useful for sentiment analysis and trend prediction.
• Multimodal Data: Data that combines multiple types of information, such as text, images,
and audio. Examples include video files with subtitles or annotated images.
• Standard Datasets: Widely-used datasets in Data Science for benchmarking algorithms and
models. Examples include the Iris dataset, MNIST dataset, and ImageNet.
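
As an example, a standard benchmark dataset can be loaded directly from scikit-learn; a minimal sketch (as_frame=True requires scikit-learn 0.23 or newer):

from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)     # Iris dataset as a pandas DataFrame
print(iris.frame.head())            # first rows: four measurements + target
print(iris.target_names)            # the three iris species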

Data Formats
• Integers and Floats:
• Integers: Whole numbers used for counting or indexing.
• Floats: Numbers with decimal points, used for representing continuous data.
• Text Data:
• Plain Text: Simple text data stored without any formatting (e.g., .txt files).
• Text Files:
• CSV Files: Comma-separated values, often used for storing tabular data.
• JSON Files: JavaScript Object Notation, used for storing and exchanging data.
• XML Files: Extensible Markup Language, used for encoding documents in a format
that is both human-readable and machine-readable.
• HTML Files: Hypertext Markup Language, used for creating web pages.
• Dense Numerical Arrays: Arrays containing numerical data, typically used in scientific
computing and data analysis (e.g., NumPy arrays).
• Compressed or Archived Data:
• Tar Files: Archive files that can contain multiple files and directories.
• GZip Files: Compressed files that reduce storage space and transfer time.
• Zip Files: Archive files that can contain multiple files in a compressed format.
• Image Files:
• Rasterized Images: Images made up of pixels (e.g., JPEG, PNG).
• Vectorized Images: Images made up of paths and curves, scalable without losing
quality (e.g., SVG files).
• Compressed Images: Images that have been compressed to reduce file size (e.g.,
JPEG).
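
A minimal sketch of reading a few of these formats with pandas (the file names are hypothetical placeholders):

import pandas as pd

df_csv = pd.read_csv("sales.csv")        # tabular, comma-separated data
df_json = pd.read_json("records.json")   # JSON records
df_gz = pd.read_csv("sales.csv.gz")      # pandas decompresses gzip files transparently

print(df_csv.head())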

2. Statistical Data Analysis


Role of Statistics in Data Science
• Definition: Statistics is the branch of mathematics that deals with the collection, analysis,
interpretation, presentation, and organization of data.
• Importance in Data Science:
• Data Collection: Statistics provides methods to design surveys and experiments to
collect data efficiently.
• Data Analysis: Statistical techniques are essential for analyzing and interpreting
complex data sets.
• Inference: Statistics helps in making inferences about a population based on sample
data.
• Decision Making: Statistical methods enable data-driven decision-making by
providing a quantitative basis for assessing the reliability and significance of results.

Descriptive Statistics (6 Lectures)


• Definition: Descriptive statistics involves summarizing and organizing data to understand
its main characteristics, typically through numerical summaries, graphs, and tables.
• Key Components:
• Measuring the Frequency:
• Definition: Frequency refers to how often a data point occurs in a dataset.
• Tools: Frequency distributions, histograms, and bar charts are used to
visualize frequency.
• Measuring the Central Tendency:
• Mean: The arithmetic average of a set of numbers.
• Median: The middle value in a dataset when arranged in ascending or
descending order.
• Mode: The value that appears most frequently in a dataset.
• Measuring the Dispersion:
• Range: The difference between the highest and lowest values in a dataset.
• Standard Deviation: A measure of the amount of variation or dispersion in a
set of values.
• Variance: The square of the standard deviation, representing the spread of a
dataset.
• Interquartile Range (IQR): The difference between the third quartile (Q3) and the first quartile (Q1), representing the middle 50% of the data (these measures are illustrated in the sketch after this list).
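
A minimal sketch of these summary measures in Python, using NumPy and the standard-library statistics module on a small illustrative sample:

import numpy as np
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 7, 8, 6]

print("mean:", np.mean(data))
print("median:", np.median(data))
print("mode:", statistics.mode(data))          # most frequent value
print("range:", np.max(data) - np.min(data))
print("variance:", np.var(data, ddof=1))       # sample variance
print("std dev:", np.std(data, ddof=1))        # sample standard deviation

q1, q3 = np.percentile(data, [25, 75])
print("IQR:", q3 - q1)                         # interquartile range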

Inferential Statistics (10 Lectures)


• Definition: Inferential statistics involves making predictions or inferences about a
population based on a sample of data drawn from that population.
• Key Concepts:
• Hypothesis Testing:
• Definition: A method used to determine if there is enough evidence to reject
a null hypothesis in favor of an alternative hypothesis.
• Steps:
1. Formulate Hypotheses: Define the null hypothesis (H0) and
alternative hypothesis (H1).
2. Choose Significance Level (α): Commonly used levels are 0.05 or
0.01.
3. Calculate Test Statistic: Based on the sample data.
4. Determine p-value: Compare the p-value with the significance level
to make a decision.
5. Make a Conclusion: Reject the null hypothesis if the p-value is below the significance level; otherwise fail to reject it.
• Multiple Hypothesis Testing:
• Definition: Testing several hypotheses simultaneously, often using
adjustments like the Bonferroni correction to control the overall error rate.
• Parameter Estimation Methods:
• Point Estimation: Estimating an unknown parameter using a single value
(e.g., sample mean for population mean).
• Interval Estimation: Providing a range within which the parameter is
expected to lie, with a certain level of confidence (e.g., confidence intervals).
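
A minimal sketch of a one-sample t-test and an interval estimate with SciPy (the sample values and the hypothesised mean of 50 are illustrative assumptions):

import numpy as np
from scipy import stats

sample = np.array([52, 48, 55, 51, 49, 53, 50, 54])

# H0: population mean = 50, H1: population mean != 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print("t =", t_stat, "p =", p_value)
if p_value < 0.05:
    print("Reject H0 at the 5% significance level")
else:
    print("Fail to reject H0")

# 95% confidence interval for the population mean (interval estimation)
ci = stats.t.interval(0.95, len(sample) - 1,
                      loc=np.mean(sample), scale=stats.sem(sample))
print("95% CI:", ci)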

Measuring Data Similarity and Dissimilarity


• Definition: Similarity and dissimilarity measures are used to compare data points or objects,
which is essential for clustering, classification, and other data analysis tasks.
• Key Concepts:
• Data Matrix versus Dissimilarity Matrix:
• Data Matrix: Represents data with rows as objects and columns as attributes.
• Dissimilarity Matrix: Represents pairwise dissimilarities between objects,
with values indicating how different two objects are.
• Proximity Measures for Nominal Attributes:
• Definition: Nominal attributes are categorical attributes with no intrinsic
ordering (e.g., color, gender).
• Proximity Measures: Jaccard coefficient, Simple Matching Coefficient
(SMC).
• Proximity Measures for Binary Attributes:
• Definition: Binary attributes take on two values (e.g., 0 or 1).
• Proximity Measures: Hamming distance, Jaccard coefficient for binary data.
• Dissimilarity of Numeric Data:
• Euclidean Distance: The straight-line distance between two points in a
multi-dimensional space.
• Manhattan Distance: The sum of absolute differences between the
coordinates of two points (also known as L1 distance).
• Minkowski Distance: A generalization of Euclidean and Manhattan
distances, parameterized by a value 'p' that determines the specific distance
measure (p=1 for Manhattan, p=2 for Euclidean).
• Proximity Measures for Ordinal Attributes:
• Definition: Ordinal attributes have a clear, ordered relationship between
values (e.g., rankings).
• Proximity Measures: Can use rank correlation coefficients like Spearman's
rank correlation or Kendall's tau.
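
A minimal sketch of the numeric dissimilarity measures above, using SciPy on two hypothetical 3-D points:

from scipy.spatial import distance

a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.0]

print("Euclidean:", distance.euclidean(a, b))           # straight-line (L2) distance
print("Manhattan:", distance.cityblock(a, b))           # sum of absolute differences (L1)
print("Minkowski p=3:", distance.minkowski(a, b, p=3))  # general Lp distance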

Concept of Outlier
• Definition: An outlier is a data point that significantly differs from other observations in a
dataset.
• Types of Outliers:
• Univariate Outliers: Outliers that occur in a single variable.
• Multivariate Outliers: Outliers that occur in a combination of variables, not
apparent when looking at individual variables.
• Contextual Outliers: Outliers that are only considered abnormal in a specific
context (e.g., temperature readings that are normal in summer but outliers in winter).
• Outlier Detection Methods:
• Z-Score Method: Calculates how many standard deviations a data point is from the
mean. Data points with a Z-score beyond a certain threshold (e.g., ±3) are considered
outliers.
• IQR Method: Outliers are identified as data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
• Machine Learning Methods: Techniques like clustering, isolation forests, and one-
class SVMs can be used to detect outliers in more complex datasets.
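
A minimal sketch of the Z-score and IQR rules on synthetic data with one injected outlier (the values and random seed are illustrative):

import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50, scale=5, size=200), 120.0)  # 120 is an injected outlier

# Z-score method: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 3])

# IQR method: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("IQR outliers:", outliers)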

3. Data Preprocessing
Data Objects and Attribute Types
• What is an Attribute?
• Definition: An attribute (or feature) is a property or characteristic of an object or
data point. In a dataset, attributes are the columns that describe different aspects of
the data objects (rows).
• Types of Attributes:
• Nominal Attributes:
• Definition: Categorical attributes with no inherent order or ranking among
the values.
• Examples: Colors (red, blue, green), gender (male, female).
• Binary Attributes:
• Definition: Attributes that have two possible states or values.
• Types:
• Symmetric Binary: Both outcomes are equally important (e.g., gender recorded as male/female, where neither value carries more weight than the other).
• Asymmetric Binary: One outcome is more significant than the other
(e.g., success/failure, where success is more critical).
• Ordinal Attributes:
• Definition: Categorical attributes with a meaningful order or ranking
between values.
• Examples: Education levels (high school, bachelor's, master's), customer
satisfaction ratings (poor, fair, good, excellent).
• Numeric Attributes:
• Definition: Attributes that are quantifiable and expressible in numbers.
• Types:
• Discrete Attributes: Attributes that take on a countable number of
distinct values.
• Examples: Number of students in a class, number of cars in a
parking lot.
• Continuous Attributes: Attributes that can take on any value within a
range.
• Examples: Temperature, height, weight.
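
A minimal sketch of how these attribute types can be represented in pandas (the column names and values are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "colour": ["red", "blue", "green"],              # nominal
    "passed": [1, 0, 1],                             # binary
    "satisfaction": ["poor", "good", "excellent"],    # ordinal
    "num_children": [0, 2, 1],                        # discrete numeric
    "height_cm": [172.5, 160.2, 181.0],               # continuous numeric
})

# Mark the ordinal column as an ordered categorical so the ranking is explicit
df["satisfaction"] = pd.Categorical(
    df["satisfaction"],
    categories=["poor", "fair", "good", "excellent"],
    ordered=True,
)
print(df.dtypes)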
Data Quality: Why Preprocess the Data?
• Importance of Data Preprocessing:
• Accuracy: Ensures the accuracy and reliability of the analysis by addressing issues
such as missing data, noise, and inconsistencies.
• Efficiency: Reduces the complexity of data, making it easier to process and analyze.
• Consistency: Aligns data from different sources or formats, ensuring that it is
coherent and uniform.
• Improves Model Performance: Clean and well-preprocessed data lead to better
model performance and more accurate predictions.

Data Munging/Wrangling Operations


• Definition: Data munging or wrangling refers to the process of transforming raw data into a
clean, structured format suitable for analysis.
• Common Operations:
• Data Parsing: Converting raw data into a structured format.
• Data Filtering: Removing irrelevant or redundant data.
• Data Aggregation: Summarizing or combining data from multiple sources.
• Data Enrichment: Enhancing data with additional relevant information.

Cleaning Data
• Definition: Data cleaning is the process of identifying and correcting (or removing) errors
and inconsistencies in data to improve its quality.
• Common Data Cleaning Issues:
• Missing Values: Data points where information is absent.
• Handling Methods: Imputation (filling in missing values), deletion, or using
algorithms that can handle missing data.
• Noisy Data: Data that contains errors, inconsistencies, or irrelevant information.
• Types of Noisy Data:
• Duplicate Entries: Multiple records for the same entity.
• Multiple Entries for a Single Entity: Different entries representing
the same entity with slight variations.
• Missing Entries: Partial data missing for certain records.
• NULLs: Missing values represented as NULL.
• Huge Outliers: Data points that are significantly different from other
observations.
• Out-of-Date Data: Data that is no longer accurate or relevant.
• Artificial Entries: Data that is not genuine or was created for testing
purposes.
• Irregular Spacings: Inconsistent spacing within text data.
• Formatting Issues: Different formatting styles used across tables or
columns.
• Extra Whitespace: Unnecessary spaces that can cause parsing issues.
• Irregular Capitalization: Inconsistent use of uppercase and
lowercase letters.
• Inconsistent Delimiters: Different delimiters used to separate data
fields.
• Irregular NULL Format: Inconsistent representation of missing
data.
• Invalid Characters: Characters that do not belong in the dataset.
• Incompatible Datetimes: Different date and time formats that need
standardization.
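
A minimal sketch of a few of these cleaning steps with pandas (the toy records are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["  Alice", "bob ", "Alice", None],
    "age": [25, np.nan, 25, 32],
})

df["name"] = df["name"].str.strip().str.title()   # extra whitespace, irregular capitalisation
df["age"] = df["age"].fillna(df["age"].mean())    # impute missing values with the mean
df = df.drop_duplicates()                         # remove duplicate entries
df = df.dropna(subset=["name"])                   # drop rows where the key field is still NULL
print(df)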

Data Transformation
• Definition: Data transformation involves converting data into a suitable format or structure
for analysis.
• Common Data Transformation Techniques:
• Rescaling: Adjusting the range of data values to a specific scale, often to bring all
variables into the same range.
• Example: Rescaling data to a range of 0 to 1.
• Normalizing: Scaling values onto a common scale, typically so that each feature falls within a fixed range or each sample has unit norm.
• Example: Min-max normalization to the range 0 to 1.
• Binarizing: Converting numerical data into binary form (e.g., 0 or 1).
• Example: Converting a continuous attribute into a binary attribute based on a
threshold.
• Standardizing: Transforming data so that it has a mean of 0 and a standard deviation of 1 (z-score standardization).
• Example: Standardizing features measured on different scales so they contribute equally to a model.
• Label Encoding: Converting categorical attributes into numerical form by assigning
a unique integer to each category.
• One-Hot Encoding: Converting categorical attributes into binary vectors where each
category is represented by a binary variable (0 or 1).
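
A minimal sketch of these transformations with scikit-learn's preprocessing module (the toy values and city names are hypothetical):

import numpy as np
from sklearn.preprocessing import (MinMaxScaler, StandardScaler, Binarizer,
                                   LabelEncoder, OneHotEncoder)

X = np.array([[10.0], [20.0], [30.0], [40.0]])

print(MinMaxScaler().fit_transform(X).ravel())           # rescaling to the range [0, 1]
print(StandardScaler().fit_transform(X).ravel())         # standardizing (z-scores)
print(Binarizer(threshold=25).fit_transform(X).ravel())  # binarizing around a threshold

cities = np.array(["Pune", "Mumbai", "Pune", "Nashik"])
print(LabelEncoder().fit_transform(cities))              # label encoding
print(OneHotEncoder().fit_transform(cities.reshape(-1, 1)).toarray())  # one-hot encoding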

Data Reduction
• Definition: Data reduction involves reducing the volume of data while maintaining its
integrity and meaning, making it easier to analyze.
• Techniques:
• Dimensionality Reduction: Reducing the number of attributes or features while
retaining essential information (e.g., PCA, LDA).
• Numerosity Reduction: Reducing the number of data points or records through
techniques like clustering, sampling, or aggregation.
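
A minimal sketch of dimensionality reduction with PCA in scikit-learn, using the Iris dataset as an example:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)           # 150 samples, 4 features
X_reduced = PCA(n_components=2).fit_transform(X)

print(X.shape, "->", X_reduced.shape)       # (150, 4) -> (150, 2)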

Data Discretization
• Definition: Data discretization involves converting continuous data into discrete intervals or
categories.
• Importance: Useful for transforming continuous attributes into categorical attributes, which
can simplify analysis and improve model performance.
• Methods:
• Binning: Dividing data into intervals, or "bins," and assigning a categorical label to
each bin.
• Histogram Analysis: Using histograms to define intervals based on data distribution.
• Cluster Analysis: Grouping similar data points and assigning them to discrete
categories.
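
A minimal sketch of binning a continuous attribute with pandas (the ages and bin edges are hypothetical):

import pandas as pd

ages = pd.Series([5, 17, 24, 33, 41, 58, 72])
bins = pd.cut(ages, bins=[0, 18, 35, 60, 100],
              labels=["child", "young adult", "adult", "senior"])
print(bins)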

4. Data Visualization
Introduction to Exploratory Data Analysis (EDA)
• Definition: EDA is an approach to analyzing data sets to summarize their main
characteristics, often using visual methods.
• Purpose of EDA:
• Identifying Patterns: Detecting trends, correlations, and relationships in data.
• Spotting Anomalies: Finding outliers or irregularities in the data.
• Checking Assumptions: Verifying the validity of assumptions made about the data.
• Guiding Further Analysis: Informing the choice of statistical models or algorithms
to apply.

Data Visualization and Visual Encoding


• Definition: Data visualization is the graphical representation of data to make complex data
more accessible and understandable.
• Visual Encoding: The process of mapping data attributes (e.g., numbers, categories) to
visual elements like color, shape, size, or position in a chart.
• Examples of Visual Encoding:
• Position: The location of data points on a plot (e.g., x and y axes in a scatter
plot).
• Color: Used to distinguish different categories or indicate data intensity (e.g.,
heat maps).
• Size: Represents the magnitude of data points (e.g., bubble size in bubble
plots).
• Shape: Differentiates between categories (e.g., different marker shapes in a
scatter plot).
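
A minimal sketch of visual encoding with Matplotlib, where position, colour, and size each carry one attribute of a randomly generated dataset:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)          # encoded as horizontal position
y = rng.uniform(0, 10, 30)          # encoded as vertical position
value = rng.uniform(0, 1, 30)       # encoded as colour intensity
size = rng.uniform(20, 300, 30)     # encoded as marker size

plt.scatter(x, y, c=value, s=size, cmap="viridis", alpha=0.7)
plt.colorbar(label="value")
plt.xlabel("x")
plt.ylabel("y")
plt.show()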

Data Visualization Libraries


• Definition: Libraries or software packages that provide tools and functions for creating
visual representations of data.
• Popular Libraries:
• Matplotlib: A widely used Python library for creating static, animated, and
interactive visualizations.
• Seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for
creating attractive and informative statistical graphics.
• Plotly: An interactive graphing library that enables complex, web-based
visualizations.
• ggplot2: A popular data visualization package in R, based on the Grammar of
Graphics.
• D3.js: A JavaScript library for producing dynamic, interactive data visualizations in
web browsers.
Basic Data Visualization Tools
• Histograms:
• Definition: A graphical representation of the distribution of a dataset. It shows the
frequency of data points in specified ranges (bins).
• Use: Ideal for displaying the distribution of a single continuous variable.
• Bar Charts/Graphs:
• Definition: A chart that presents categorical data with rectangular bars. The length of
each bar is proportional to the value it represents.
• Use: Best for comparing the frequency or count of different categories.
• Scatter Plots:
• Definition: A plot that shows the relationship between two numerical variables. Each
point represents an observation in the dataset.
• Use: Useful for identifying correlations or patterns between variables.
• Line Charts:
• Definition: A type of chart that displays data points connected by a line. It shows
trends over time or ordered categories.
• Use: Commonly used to track changes over time.
• Area Plots:
• Definition: Similar to line charts, but the area under the line is filled with color or
shading.
• Use: Good for visualizing cumulative data or comparing multiple variables.
• Pie Charts:
• Definition: A circular chart divided into sectors, each representing a proportion of
the whole.
• Use: Ideal for showing the relative proportions of categories in a dataset.
• Donut Charts:
• Definition: A variation of the pie chart with a central hole, often used to provide
additional information in the center.
• Use: Similar to pie charts but with an added aesthetic appeal.
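
A minimal sketch of two of these basic charts with Matplotlib (the scores and category counts are hypothetical):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
scores = rng.normal(70, 10, 200)        # a continuous variable
categories = ["A", "B", "C", "D"]
counts = [23, 45, 12, 30]               # counts per category

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(scores, bins=20)               # histogram: distribution of scores
ax1.set_title("Histogram")
ax2.bar(categories, counts)             # bar chart: comparing categories
ax2.set_title("Bar chart")
plt.show()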

Specialized Data Visualization Tools


• Boxplots:
• Definition: A graphical representation of the distribution of a dataset based on five
summary statistics: minimum, first quartile (Q1), median, third quartile (Q3), and
maximum.
• Use: Effective for identifying outliers and understanding the spread and skewness of
data.
• Bubble Plots:
• Definition: A variation of a scatter plot where each point is replaced by a bubble, and
the size of the bubble represents a third variable.
• Use: Useful for visualizing three dimensions of data on a two-dimensional plane.
• Heat Maps:
• Definition: A graphical representation of data where individual values are
represented by colors.
• Use: Ideal for displaying the intensity or density of data points across a matrix.
• Dendrogram:
• Definition: A tree-like diagram used to illustrate the arrangement of clusters
produced by hierarchical clustering.
• Use: Useful for visualizing the structure and hierarchy of data clusters.
• Venn Diagram:
• Definition: A diagram that shows all possible logical relations between a finite
collection of sets.
• Use: Effective for illustrating set relationships, such as intersections and unions.
• Treemap:
• Definition: A hierarchical structure represented as nested rectangles, where each
rectangle's size is proportional to the data value.
• Use: Useful for visualizing large amounts of hierarchical data.
• 3D Scatter Plots:
• Definition: An extension of the scatter plot into three dimensions, where each point
is defined by three numerical coordinates.
• Use: Ideal for visualizing the relationship between three continuous variables.
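
A minimal sketch of a boxplot and a heat map with Seaborn (the data are randomly generated for illustration):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
data = rng.normal(0, 1, 100)

sns.boxplot(x=data)                                # five-number summary plus outliers
plt.show()

matrix = rng.uniform(0, 1, size=(5, 5))
sns.heatmap(matrix, annot=True, cmap="coolwarm")   # colour encodes each cell's value
plt.show()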

Advanced Data Visualization Tools - Wordclouds


• Definition: A visual representation of text data where the size of each word reflects its
frequency or importance.
• Use: Effective for quickly identifying the most prominent words or themes in a text dataset.
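
A minimal sketch using the third-party wordcloud package (assumed installed via pip install wordcloud); the sample text is hypothetical:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = ("data science statistics machine learning data analysis "
        "visualization data model data insight")

wc = WordCloud(width=600, height=400, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")   # word size reflects frequency in the text
plt.axis("off")
plt.show()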

Visualization of Geospatial Data


• Definition: The process of visualizing data that includes geographical or spatial
components.
• Tools and Techniques:
• Choropleth Maps: Maps where areas are shaded or patterned in proportion to the
data value.
• Point Maps: Maps that represent individual data points as symbols, such as dots.
• Heat Maps: Geographical maps that use color to represent the density of data points
in a given area.
• Interactive Maps: Maps that allow users to interact with data by zooming, clicking,
or filtering.
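
A minimal sketch of an interactive point map with the folium package (assumed installed via pip install folium); the coordinates are illustrative:

import folium

m = folium.Map(location=[18.52, 73.86], zoom_start=11)   # map centred on Pune
folium.CircleMarker(location=[18.53, 73.85], radius=8,
                    popup="Sample point", color="blue").add_to(m)
m.save("map.html")                                        # open the saved file in a browser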

Data Visualization Types


• Categorical Data Visualization:
• Tools: Bar charts, pie charts, donut charts.
• Purpose: Comparing different categories or understanding the distribution of
categorical data.
• Numerical Data Visualization:
• Tools: Histograms, boxplots, scatter plots.
• Purpose: Understanding the distribution, trends, and relationships between
numerical variables.
• Hierarchical Data Visualization:
• Tools: Treemaps, dendrograms.
• Purpose: Displaying the structure and relationships within hierarchical datasets.
• Network Data Visualization:
• Tools: Network graphs, node-link diagrams.
• Purpose: Visualizing relationships and interactions between entities within a
network.
