CC DataScience Material
Data science has become one of the most in-demand professions of the 21st century. Organizations
across industries are looking for candidates with data science skills.
Data mining, machine learning, and data visualization are just a few of the tools and methods data
scientists use to draw meaning from data. They work with both structured and unstructured
data, including databases, spreadsheets, text, and images.
A number of sectors, including healthcare, finance, marketing, and more, use the insights and
experience gained via data analysis to steer innovation, advise business decisions, and address
challenging problems.
o Collecting data from a range of sources, including databases, sensors, websites, etc.
    o   Making sure data is in a format that can be analyzed while also organizing and processing it to
        remove mistakes and inconsistencies.
    o   Finding patterns and correlations in the data using statistical and machine learning
        approaches.
    o   Developing visual representations of the data to aid in comprehension of the conclusions and
        insights.
    o   Creating mathematical models and computer programs that can classify and forecast based
        on data.
Example:
Suppose we want to travel from station A to station B by car. We need to decide which route is
fastest, which route is free of traffic jams, and which is most cost-effective. These decision
factors act as the input data, and analyzing them gives us an appropriate answer. This process is
called data analysis, which is a part of data science.
Some years ago, data volumes were small and data was mostly available in structured form, so it
could easily be stored in Excel sheets and processed using BI tools.
Every company requires data to work, grow, and improve its business.
Handling such huge amounts of data is a challenging task for every organization. To handle,
process, and analyze it, we need complex, powerful, and efficient algorithms and
technology, and that technology is data science. The following are some of the main reasons
for using data science:
    o   Every day, the world produces enormous volumes of data, which must be processed and
        analysed by data scientists in order to provide new information and understanding.
    o   Data science is now crucial for creating and educating intelligent systems as artificial
        intelligence and machine learning have grown in popularity.
    o   Data science increases productivity and lowers costs in a variety of industries, including
        manufacturing and logistics, by streamlining procedures and forecasting results.
The average salary for a data scientist ranges from approximately $95,000 to $165,000 per annum, and
according to various studies, about 11.5 million jobs will be created by the year 2026.
If you learn data science, you get the opportunity to pursue various exciting job roles in this
domain. The main job roles are given below:
1. Data Scientist
2. Machine Learning Engineer
3. Data Analyst
4. Business Intelligence Analyst
5. Data Engineer
6. Big Data Engineer
7. Data Architect
8. Data Administrator
9. Business Analyst
1. Data Scientist: A data scientist is in charge of deciphering large, complicated data sets for patterns
and trends, as well as creating prediction models that may be applied to business choices. They could
also be in charge of creating data-driven solutions for certain business issues.
Skill Required: To become a data scientist, one needs skills in mathematics, statistics, programming
languages (such as Python, R, and Julia), machine learning, data visualization, big data technologies
(such as Hadoop), domain expertise (the ability to understand data from the relevant
domain), and communication and presentation skills to convey insights from the data
effectively.
2. Machine Learning Engineer: A machine learning engineer is in charge of creating, testing, and
implementing machine learning algorithms and models that may be utilized to automate tasks and
boost productivity.
Skill Required: Programming languages like Python and Java, statistics, machine learning frameworks
like TensorFlow and PyTorch, big data technologies like Hadoop and Spark, software engineering, and
problem-solving skills are all necessary for a machine learning engineer.
3. Data Analyst: Data analysts are in charge of gathering and examining data in order to spot patterns
and trends and offer insights that may be applied to guide business choices. Creating data
visualizations and reports to present results to stakeholders may also fall within the scope of their
responsibility.
Skill Required: Data analysis and visualization, statistical analysis, database querying, programming in
languages like SQL or Python, critical thinking, and familiarity with tools and technologies like Excel,
Tableau, SQL Server, and Jupyter Notebook are all necessary for a data analyst.
4. Business Intelligence Analyst: Data analysis for business development and improvement is the
responsibility of a business intelligence analyst. They could also be in charge of developing and putting
into use data warehouses and other types of data management systems.
Skill Required: A business intelligence analyst has to be skilled in data analysis and visualization,
business knowledge, SQL and data warehousing, data modeling, and ETL procedures, as well as
programming languages like Python and knowledge of BI tools like Tableau, Power BI, or QlikView.
5. Data Engineer: A data engineer is in charge of creating, constructing, and maintaining the
infrastructure and pipelines for collecting and storing data from diverse sources. In addition to
guaranteeing data security and quality, they could also be in charge of creating data integration
solutions.
Skill Required: To create, build, and maintain scalable and effective data pipelines and data
infrastructure for processing and storing large volumes of data, a data engineer needs expertise in
database architecture, ETL procedures, data modeling, programming languages like Python and SQL,
big data technologies like Hadoop and Spark, cloud computing platforms like AWS or Azure, and tools
like Apache Airflow or Talend.
6. Big Data Engineer: Big data engineers are in charge of planning and constructing systems that can
handle and analyze massive volumes of data. Additionally, they can be in charge of putting scalable
data storage options into place and creating distributed computing systems.
Skill Required: Big data engineers must be proficient in distributed systems, programming
languages like Java or Scala, data modeling, database management, cloud computing platforms like
AWS or Azure, big data technologies like Apache Spark, Kafka, and Hive, and tools like
Apache NiFi or Apache Beam in order to design, build, and maintain large-scale distributed data
processing systems.
7. Data Architect: Data models and database systems that can support data-intensive applications
must be designed and implemented by a data architect. They could also be in charge of maintaining
data security, privacy, and compliance.
Skill Required: A data architect needs knowledge of database design and modeling, data warehousing,
ETL procedures, programming languages like SQL or Python, proficiency with data modeling tools like
ER/Studio or ERwin, familiarity with cloud computing platforms like AWS or Azure, and expertise in
data governance and security.
8. Data Administrator: An organization's data assets must be managed and organized by a data
administrator. They are in charge of guaranteeing the security, accuracy, and completeness of data as
well as making sure that those who require it can readily access it.
Skill Required: A data administrator needs expertise in database management, backup, and recovery,
data security, SQL programming, data modeling, familiarity with database platforms like Oracle or SQL
Server, proficiency with data management tools like SQL Developer or Toad, and experience with cloud
computing platforms like AWS or Azure.
9. Business Analyst: A business analyst is a professional who helps organizations identify business
problems and opportunities and recommends solutions to those problems through the use of data
and analysis.
Skill Required: A business analyst needs expertise in data analysis, business process modeling,
stakeholder management, requirements gathering and documentation, proficiency in tools like Excel,
Power BI, or Tableau, and experience with project management.
While technical skills are essential for data science, there are also non-technical skills that are
important for success in this field. Here are some non-technical prerequisites for data science:
    1. Domain knowledge: To succeed in data science, it might be essential to have a thorough grasp
       of the sector or area you are working in. Your understanding of the data and its importance to
       the business will improve as a result of this information.
    2. Problem-solving skills: Solving complicated issues is a common part of data science, thus, the
       capacity to do it methodically and systematically is crucial.
    3. Communication skills: Data scientists need to be good communicators. You must be able to
       communicate the insights to others.
    4. Curiosity and creativity: Data science frequently entails venturing into unfamiliar territory, so
       being able to think creatively and approach issues from several perspectives may be a
       significant skill.
    5. Business Acumen: For data scientists, it is crucial to comprehend how organizations function
       and create value. This aids in improving your comprehension of the context and applicability
       of your work as well as pointing up potential uses of data to produce commercial results.
    6. Critical thinking: In data science, it's critical to be able to assess information with objectivity
       and reach logical conclusions. This involves the capacity to spot biases and assumptions in data
       and analysis as well as the capacity to form reasonable conclusions based on the facts at hand.
Technical Prerequisite:
Since data science includes dealing with enormous volumes of data and necessitates a thorough
understanding of statistical analysis, machine learning algorithms, and programming languages,
technical skills are crucial. Here are some technical prerequisites for data science:
    1. Mathematics and Statistics: Data science involves working with data and analyzing it using
       statistical methods. As a result, you should have a strong background in statistics and mathematics.
       Calculus, linear algebra, probability theory, and statistical inference are some of the important
       ideas you should be familiar with.
    3. Data Manipulation and Analysis: Working with data is an important component of data
       science. You should be skilled in methods for cleaning, transforming, and analyzing data, as
       well as in data visualization. Knowledge of programs like Tableau or Power BI might be helpful.
    4. Machine Learning: A key component of data science is machine learning. Decision trees,
       random forests, and clustering are a few examples of supervised and unsupervised learning
       algorithms that you should be well-versed in. Additionally, you should be familiar with well-
       known machine learning frameworks like Scikit-learn and TensorFlow.
    5. Deep Learning: Neural networks are used in deep learning, a kind of machine learning. Deep
       learning frameworks like TensorFlow, PyTorch, or Keras should be familiar to you.
    6. Big Data Technologies: Data scientists commonly work with large and intricate datasets.
       You should be familiar with big data technologies like Hadoop, Spark, and Hive.
    7. Databases: A solid understanding of databases and of query languages such as SQL is
       essential in data science for retrieving and working with data.
Data science involves several components that work together to extract insights and value from data.
Here are some of the key components of data science:
   1. Statistics: Statistics is one of the most important components of data science. Statistics is a
      way to collect and analyze large amounts of numerical data and find meaningful insights in
      it.
   2. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study
      of quantity, structure, space, and change. For a data scientist, a good knowledge of
      mathematics is essential.
   3. Domain Expertise: In data science, domain expertise binds data science together. Domain
      expertise means specialized knowledge or skills in a particular area. In data science, there are
      various areas for which we need domain experts.
    4. Data Collection: Data is gathered and acquired from a number of sources. This can be
       unstructured data from social media, text, or photographs, as well as structured data from
       databases.
    6. Data Exploration and Visualization: This entails exploring the data and gaining insights using
       methods like statistical analysis and data visualization. To aid in understanding the data, this
       may entail developing graphs, charts, and dashboards.
    7. Data Modeling: In order to analyze the data and derive insights, this component entails
       creating models and algorithms. Regression, classification, and clustering are a few examples
       of supervised and unsupervised learning techniques that may be used in this.
    8. Machine Learning: This entails building predictive models that can learn from data.
       It might include deep learning methods, such as neural networks, which are increasingly
       significant in data science.
    9. Communication: This entails informing stakeholders of the findings of the data analysis. It
       might involve explaining the results and producing reports, visualizations, and presentations.
    10. Deployment and Maintenance: The models and algorithms need to be deployed and
        maintained when the data science project is over. This may entail keeping an eye on the
        models' performance and upgrading them as necessary.
o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner.
o Regression
o Decision tree
o Clustering
o Naive Bayes
o Apriori
Here is a brief introduction to a few of the important algorithms:
1. Linear Regression Algorithm: Linear regression is one of the most popular supervised machine
learning algorithms. It performs regression, which is a method of modeling a target
value based on independent variables. The model takes the form of a linear equation that
relates the set of inputs to the predicted output. This algorithm is mostly used in
forecasting and prediction. Because it models a linear relationship between the input and output
variables, it is called linear regression.
The below equation can describe the relationship between x and y variables:
    y = mx + c
where y = dependent variable, x = independent variable, m = slope, and c = intercept.
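To make the equation concrete, the following is a minimal sketch of fitting the slope m and intercept c by ordinary least squares. The function name fit_line and the sample data are assumptions made for illustration and are not part of the original material.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Illustrative data that lies exactly on the line y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, c = fit_line(xs, ys)
prediction = m * 6 + c  # forecast y for a new input x = 6
print(f"slope={m}, intercept={c}, prediction for x=6: {prediction}")
```

Once m and c are estimated, forecasting is just a matter of plugging a new x into the equation.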
2. Decision Tree: The decision tree is another supervised machine learning algorithm and one of
the most popular. It can be used for both classification and regression problems.
In the decision tree algorithm, we solve the problem using a tree representation in which each
node represents a feature, each branch represents a decision, and each leaf represents the outcome.
We start from the root of the tree and compare the value of the root attribute
with the record's attribute. On the basis of this comparison, we follow the matching branch and
move to the next node. We continue comparing values until we reach a leaf node with the
predicted class value.
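The traversal described above can be sketched as follows. The tree is hand-built for illustration (learning the tree from data is not shown), and the feature names, values, and class labels are hypothetical.

```python
# Internal nodes test one feature; leaves hold the predicted class.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {
            "feature": "humidity",
            "branches": {
                "high": {"leaf": "stay in"},
                "normal": {"leaf": "play"},
            },
        },
        "overcast": {"leaf": "play"},
        "rainy": {"leaf": "stay in"},
    },
}


def predict(node, record):
    """Walk from the root to a leaf, following the branch that matches the record."""
    while "leaf" not in node:
        value = record[node["feature"]]   # compare the node's feature with the record
        node = node["branches"][value]    # follow the matching branch
    return node["leaf"]


print(predict(tree, {"outlook": "sunny", "humidity": "normal"}))
```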
3. K-Means Clustering:
K-means clustering is one of the most popular machine learning algorithms and belongs to the
unsupervised learning category. It solves the clustering problem.
Given a data set of items with certain features and values, where we need to categorize the
items into groups, the problem can be solved using the k-means clustering algorithm.
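As a rough illustration of how k-means groups items, here is a minimal sketch in plain Python. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sample points, the fixed starting centroids, and the fixed iteration count are assumptions chosen to keep the example deterministic; practical implementations usually pick starting centroids randomly and stop on convergence.

```python
def kmeans(points, centroids, iterations=10):
    """Basic k-means: repeat the assignment step and the centroid update step."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```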
4. SVM: The supervised learning technique known as SVM, or support vector machine, is used for
regression and classification. The fundamental principle of SVM is to identify the hyperplane in a high-
dimensional space that best discriminates between the various classes of data.
SVM, to put it simply, seeks to identify a decision boundary that maximizes the margin between the
two classes of data. The margin is the separation of each class's nearest data points, known as support
vectors, from the hyperplane.
The use of various kernel types that translate the input data to a higher-dimensional space where it
may be linearly separated allows SVM to be used for both linearly separable and non-linearly separable
data.
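The idea of mapping data to a higher-dimensional space can be illustrated without a full SVM. The sketch below is not an SVM implementation: it uses an explicit feature map x -> (x, x^2), whereas kernels perform such mappings implicitly. The data and the helper separable_by_threshold are hypothetical and exist only to show that points which cannot be split by one threshold in the original space become linearly separable after the mapping.

```python
# Hypothetical 1-D data: the inner points are class -1, the outer points class +1.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]


def separable_by_threshold(values, labels):
    """Return True if a single threshold on `values` splits the two classes.

    Assumes no value is shared by points of different classes.
    """
    ordered = [label for _, label in sorted(zip(values, labels))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


def feature_map(x):
    """Explicit map to a higher-dimensional space: x -> (x, x^2)."""
    return (x, x * x)


raw_ok = separable_by_threshold(points, labels)
mapped = [feature_map(x) for x in points]
# In the mapped space, a threshold on the second coordinate is a straight line.
mapped_ok = separable_by_threshold([m[1] for m in mapped], labels)
print(raw_ok, mapped_ok)
```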
Among the various uses for SVM are bioinformatics, text classification, and image classification. Due
to its strong performance and theoretical guarantees, it has been widely employed in both industry
and academic studies.
5. KNN: The supervised learning technique known as KNN, or k-Nearest Neighbours, is used for
regression and classification. The fundamental goal of KNN is to categorize a data point by selecting
the class that appears most frequently among the "k" nearest labeled data points in the feature space.
Simply put, KNN is a lazy learning method: it stores all training data points in memory and uses them
for classification or regression whenever a new data point is provided, rather than building an
explicit model in advance.
The value of "k" determines how many neighbors are taken into account when making a prediction.
A larger value of "k" produces a smoother decision boundary, whereas a smaller value of "k"
produces a more complex decision boundary.
KNN has several uses, including recommendation systems, text classification, and image
classification. Due to its efficacy and simplicity, it has been extensively employed in both academic and
industrial research. However, it can be computationally costly on large datasets, and it requires
careful selection of the value of "k" and of the distance metric used to measure the separation
between data points.
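A minimal sketch of KNN classification is shown below, assuming Euclidean distance as the distance metric and a small made-up training set. Note that there is no training step: the stored points are consulted directly at prediction time.

```python
import math
from collections import Counter


def knn_predict(training, query, k):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Sort stored points by Euclidean distance to the query and keep the k closest
    neighbours = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labeled points in a 2-D feature space
training = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
    ((6.0, 6.0), "blue"), ((7.0, 7.5), "blue"), ((6.5, 7.0), "blue"),
]

print(knn_predict(training, (2.0, 2.0), k=3))
print(knn_predict(training, (6.0, 7.0), k=3))
```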
6. Naive Bayes: Naive Bayes is a supervised learning method used for classification.
It is founded on Bayes' theorem, a result from probability theory that gives the probability
of a hypothesis in light of the data currently available.
The term "naive" refers to the assumption made by Naive Bayes, which is that the existence of one
feature in a class is unrelated to the presence of any other features in that class. This presumption
makes conditional probability computation easier and increases the algorithm's computing efficiency.
Naive Bayes uses Bayes' theorem to determine the probability of each class given a set of
input features, for both binary and multi-class classification problems. The predicted class for the
input data is the class with the highest probability.
Naive Bayes has several uses, including document categorization, sentiment analysis, and email spam
filtering. Due to its ease of use, efficiency, and strong performance across a wide range of
tasks, it is used extensively in both academic research and industry. However, it may not
be effective for complex problems in which the independence assumption is violated.
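The following is a minimal sketch of a Naive Bayes text classifier, assuming word counts as features and add-one (Laplace) smoothing, which the text above does not mention. The training messages and the function names train and predict are made up for illustration. Each class score is the class prior multiplied by the per-word probabilities, which is exactly where the independence assumption is used; logarithms are summed instead of multiplying probabilities to avoid numeric underflow.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count classes and per-class word frequencies from (words, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in samples:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, words):
    """Return the class with the highest (log) posterior score."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total)  # log prior
        n_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing so unseen words do not zero out the score (assumption)
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


samples = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "offer"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "at", "noon"], "ham"),
]
model = train(samples)
print(predict(model, ["free", "money"]))
```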
7. Random Forest: Random Forest is a supervised learning algorithm used for regression and
classification. It is an ensemble learning technique that combines multiple decision trees to increase
the model's robustness and accuracy.
Simply said, Random Forest builds a number of decision trees using randomly chosen portions of the
training data and features, combining the results to provide a final prediction. The characteristics and
data used to construct each decision tree in the Random Forest are chosen at random, and each tree
is trained independently of the others.
Random Forest can solve both classification and regression problems and is known for its
accuracy, robustness, and resistance to overfitting. It can be used for feature selection and
ranking and can handle large datasets with high dimensionality and missing values.
Random Forest has several uses, including bioinformatics, text classification, and image
classification. Due to its strong performance and capacity for handling complex problems, it has been
widely employed in both academic research and industry. For problems involving strongly correlated
features or class imbalance, however, it might not be very effective.
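The step of combining the individual trees' results can be sketched as a majority vote. This example shows only that combining step: the three "trees" are hypothetical hand-written rules on made-up features, and the random sampling of training data and features used to build real trees is deliberately left out.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical "trees", each a simple rule on a different feature.
trees = [
    lambda r: "spam" if r["links"] > 3 else "ham",
    lambda r: "spam" if r["exclamations"] > 2 else "ham",
    lambda r: "spam" if r["known_sender"] == 0 else "ham",
]


def forest_predict(record):
    """Ask every tree for a prediction and combine the answers by voting."""
    return majority_vote([tree(record) for tree in trees])


print(forest_predict({"links": 5, "exclamations": 0, "known_sender": 0}))
```

For regression, the same structure applies with the vote replaced by an average of the trees' numeric outputs.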
8. Logistic Regression: Logistic regression is a supervised learning technique for binary
classification problems, where the objective is to predict the probability of a binary outcome
(such as Yes/No, True/False, or 1/0). It is a statistical model that converts the output of a linear
model into a probability value between 0 and 1 by applying the logistic function.
Simply put, logistic regression uses the logistic function to represent the relationship
between the input features and the output probability. The logistic function converts any input
value to a probability value between 0 and 1. Given the input features, this probability
indicates how likely it is that the binary outcome will be 1.
Logistic regression can be applied to both simple and difficult problems and can handle input
features with both numerical and categorical data. It is computationally efficient and simple to
interpret, and it may also be used for feature selection and ranking.
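The conversion from a linear score to a probability can be sketched as follows. The weights, bias, and feature values are made-up numbers rather than fitted parameters; fitting them from data is not shown.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability that the binary outcome is 1, given the input features."""
    z = bias + sum(w * x for w, x in zip(weights, features))  # linear part
    return sigmoid(z)


# Hypothetical model parameters and input
p = predict_probability(features=[2.0, 1.0], weights=[1.5, -0.5], bias=-1.0)
label = 1 if p >= 0.5 else 0  # a common choice of decision threshold
print(round(p, 3), label)
```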
Now, let's look at the most common types of problems that occur in data science and the
approach to solving them. In data science, problems are solved using algorithms,
and the applicable algorithms for each type of question are described below:
Is this A or B? :
This type of problem has only two fixed answers, such as Yes or No, 1 or 0, may
or may not. Such problems can be solved using classification algorithms.
Is this different? :
This type of question involves data that follows various patterns, and we need to find the odd one
out. Such problems can be solved using anomaly detection algorithms.
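One simple way to find the odd value out is a z-score rule, sketched below. This is only one of many anomaly detection approaches; the threshold of 2 standard deviations and the sample readings are assumptions chosen for illustration.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Flag values that lie more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical sensor readings with one obvious outlier
readings = [10, 11, 9, 10, 12, 10, 11, 50]
anomalies = find_anomalies(readings)
print(anomalies)
```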
Another type of problem asks for numerical values or figures, such as what the temperature will
be today. Such problems can be solved using regression algorithms.
If you have a problem that deals with the organization of data, it can be solved
using clustering algorithms.
A clustering algorithm organizes and groups the data based on features, colors, or other common
characteristics.
2. Data preparation: Data preparation is also known as Data Munging. In this phase, we need to
perform the following tasks:
o Data cleaning
o Data Reduction
o Data integration
o Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
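The four tasks can be sketched on a tiny in-memory dataset. The records, the field names, the region lookup table, and the choice of min-max scaling as the transformation are all hypothetical; real projects would typically use a data-frame library and much larger inputs.

```python
# Hypothetical raw records from one source
raw_sales = [
    {"id": 1, "city": " Pune ", "amount": "120.5"},
    {"id": 2, "city": "Delhi", "amount": None},   # missing value
    {"id": 3, "city": "pune", "amount": "80"},
    {"id": 3, "city": "pune", "amount": "80"},    # duplicate record
]
# Hypothetical second source to integrate
regions = {"pune": "West", "delhi": "North"}

# Data cleaning: drop rows with missing amounts and duplicates, normalise text and types
seen_ids = set()
cleaned = []
for row in raw_sales:
    if row["amount"] is None or row["id"] in seen_ids:
        continue
    seen_ids.add(row["id"])
    cleaned.append({
        "id": row["id"],
        "city": row["city"].strip().lower(),
        "amount": float(row["amount"]),
    })

# Data integration: attach the region from the second source
for row in cleaned:
    row["region"] = regions[row["city"]]

# Data transformation: min-max scale the amount into the range [0, 1]
amounts = [row["amount"] for row in cleaned]
low, high = min(amounts), max(amounts)
span = (high - low) or 1.0  # avoid dividing by zero when all amounts are equal
for row in cleaned:
    row["amount_scaled"] = (row["amount"] - low) / span

# Data reduction: keep only the columns needed for the next phase
prepared = [{"region": r["region"], "amount_scaled": r["amount_scaled"]} for r in cleaned]
print(prepared)
```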
3. Model Planning: In this phase, we need to determine the various methods and techniques to
establish the relation between input variables. We apply exploratory data analysis (EDA), using
various statistical formulas and visualization tools, to understand the relations between variables
and to see what the data can tell us. Common tools used for model planning are:
    o   SQL Analysis Services
o R
o SAS
o Python
4. Model-building: In this phase, the process of model building starts. We create datasets for
training and testing purposes and apply different techniques, such as association, classification, and
clustering, to build the model. Common tools used for model building are:
o WEKA
o SPSS Modeler
o MATLAB
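Creating the training and testing datasets mentioned in the model-building phase can be sketched as a shuffled split. The 75/25 ratio, the fixed seed, and the function name train_test_split are assumptions for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # stand-in for 20 data records
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fitted on the training set only, and the testing set is held back to estimate how well it performs on unseen data.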
5. Operationalize: In this phase, we deliver the final reports of the project, along with briefings,
code, and technical documents. This phase gives a clear overview of the complete project's
performance and other components on a small scale before full deployment.
6. Communicate results: In this phase, we check whether we have reached the goal set in the
initial phase. We communicate the findings and final result to the business team.
    o   Gaming world:
        In the gaming world, the use of machine learning algorithms is increasing day by day. EA
        Sports, Sony, and Nintendo widely use data science to enhance the user experience.
    o   Internet search:
        When we want to search for something on the internet, we use search
        engines such as Google, Yahoo, Bing, and Ask. All these search engines use data science
        to improve the search experience, so that you get search results within a
        fraction of a second.
    o   Transport:
        The transport industry is also using data science to create self-driving cars.
        Self-driving cars are expected to help reduce the number of road accidents.
    o   Healthcare:
        In the healthcare sector, data science is providing lots of benefits. Data science is being used
        for tumor detection, drug discovery, medical image analysis, virtual medical bots, etc.
    o   Recommendation systems:
        Many companies, such as Amazon, Netflix, and Google Play, use data science
        to improve the user experience with personalized recommendations. For
        example, when you search for something on Amazon and start getting suggestions for
        similar products, this is because of data science.
    o   Risk detection:
        The finance industry has always faced fraud and the risk of losses, but with the help of data
        science, these can be reduced.
        Most finance companies are looking for data scientists to avoid risk and losses
        while increasing customer satisfaction.
Unit-2
Data that is very large in size is called Big Data. Normally we work with data in the megabyte range
(Word documents, Excel sheets) or at most the gigabyte range (movies, code), but data in the
petabyte range, i.e. 10^15 bytes, is called Big Data. It is stated that almost 90% of today's data
has been generated in the past 3 years.
    o   Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of
        data on a day-to-day basis, as they have billions of users worldwide.
    o   E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge amounts of logs from
        which users' buying trends can be traced.
    o   Weather stations: Weather stations and satellites produce very large volumes of data, which
        are stored and processed to forecast the weather.
    o   Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
        publish their plans accordingly, and for this they store the data of their millions of users.
    o   Share market: Stock exchanges across the world generate huge amounts of data through their
        daily transactions.
    1. Velocity: The data is increasing at a very fast rate. It is estimated that the volume of data
       doubles every 2 years.
    2. Variety: Nowadays data is not only stored in rows and columns. Data is structured as well as
       unstructured. Log files and CCTV footage are unstructured data. Data that can be saved in
       tables, such as a bank's transaction data, is structured data.
    3. Volume: The amount of data we deal with is very large, on the order of petabytes.
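The effect of the doubling estimate is easy to underestimate, so here is a small arithmetic sketch. The starting volume of 5 petabytes and the six-year horizon are made-up numbers; the two-year doubling period and the definition of a petabyte as 10^15 bytes come from the text above.

```python
PETABYTE = 10 ** 15  # bytes


def projected_volume(current_bytes, years, doubling_period_years=2):
    """Project data volume assuming it doubles once every doubling period."""
    return current_bytes * 2 ** (years / doubling_period_years)


start = 5 * PETABYTE                         # hypothetical current volume
after_six_years = projected_volume(start, 6)  # three doublings: 5 -> 10 -> 20 -> 40
print(after_six_years / PETABYTE, "PB")
```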
What is Big Data?
Data science is the study of data analysis using advanced technology (machine learning, artificial
intelligence, big data). It processes huge amounts of structured, semi-structured, and unstructured
data to extract meaningful insights, from which patterns can be identified that help in making
decisions about seizing new business opportunities, improving products and services, and
ultimately growing the business. The data science process makes sense of the big data
used in business. The workflow of data science is as follows:
    •   Determining the business objective and issue – What is the organization’s objective, what
        level does the organization want to achieve, and what issue is the company facing? These are
        the factors under consideration. Based on these factors, the relevant types of data are
        identified.
    •   Collection of relevant data – Relevant data is collected from various sources.
    •   Exploring the filtered, cleaned data – Finding hidden patterns and relationships in the data and
        plotting them in graphs, charts, etc., in a form that is understandable to a non-technical person.
    •   Helping businesspeople make decisions and take steps for the sake of business
        growth.
Data Mining: It is the process of extracting meaningful insights and hidden patterns from collected
data that are useful for making business decisions aimed at decreasing expenditure and increasing
revenue.
Big Data: This term refers to extracting meaningful information by analyzing huge amounts
of complex, variously formatted data generated at high speed that cannot be handled or processed
by traditional systems.
Data Expansion Day by Day: The amount of data is increasing exponentially day by day because of
today’s many data sources, such as smart electronic devices. As
per an IDC (International Data Corporation) report, the new data created per person in the world per
second was projected to be 1.7 MB by 2020. The total amount of data in the world was projected to
reach around 44 zettabytes (44 trillion gigabytes) by 2020 and 175 zettabytes by 2025. The total
volume of data is observed to double every two years. The year-to-year growth in the total size of
data worldwide, as per the IDC report, is shown below:
    •   Social Media: Today a large share of the world’s population is engaged with
        social media such as Facebook, WhatsApp, Twitter, YouTube, and Instagram. Each activity on
        such media, like uploading a photo or video, sending a message, commenting, or liking a post,
        creates data.
    •   Sensors placed in various locations: Sensors placed around a city gather
        data on temperature, humidity, etc. Cameras placed beside roads gather information
        about traffic conditions. Security cameras placed in sensitive areas like
        airports, railway stations, and shopping malls create a lot of data.
    •   IoT Appliances: Electronic devices that are connected to the internet create data for their
        smart functionality. Examples are smart TVs, smart washing machines, smart coffee machines,
        and smart ACs. This is machine-generated data created by sensors built into the
        devices. For example, consider smart printing machines connected to the internet. A number of
        such printing machines connected to a network can transfer data among each other. If
        someone loads a file into one printing machine, the system stores the file content, and
        another printing machine in another building or on another floor can print a
        hard copy of that file. Such transfers between printing machines generate data.
    •   Transactional Data: Transactional data, as the name implies, is information obtained through
        online and offline transactions at various points of sale. The data contains important
        information about transactions, such as the date and time of the transaction, the location
        where it took place, the items bought, their prices, the methods of payment, the discounts
        or coupons that were applied, and other pertinent quantitative data. Some sources of
        transactional data are payment orders, invoices, e-receipts, and other records.
Classification of Data
In this article, we discuss the classification of data, covering structured, unstructured, and
semi-structured data. We also cover the features of data classification. Let's discuss them one by
one.
Data Classification :
Data classification is the process of sorting data into relevant categories so that it can be used
or applied more efficiently. Classification makes it easy for the user to retrieve data. It is
important for data security and compliance, and for meeting different business or personal
objectives. It is also a major requirement because data must be easily retrievable within a specific
period of time.
1. Structured Data :
Structured data is created using a fixed schema and is maintained in tabular format. The elements in
structured data are addressable for effective analysis. It contains all the data which can be stored
in a SQL database in a tabular format. Today, most data is developed and processed in this form
because it is the simplest way to manage information.
Examples – Tables in a relational (SQL) database.
2. Unstructured Data :
Unstructured data is data that does not follow a pre-defined standard, that is, it does not follow
any organized format. This kind of data is not a good fit for a relational database, because a
relational database expects data in a pre-defined, organized form. Unstructured data is very
important for the big data domain, and many platforms, such as NoSQL databases, exist to manage and
store it.
Examples – Text, images, and videos.
3. Semi-Structured Data :
Semi-structured data is information that does not reside in a relational database but that has some
organizational properties that make it easier to analyze. With some processing, it can be stored in
a relational database, although this is very hard for some kinds of semi-structured data. Keeping
data in semi-structured form can also save space.
Example –
XML data.
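The three categories above can be contrasted with a small Python sketch. The customer records, field names, and the xml_to_rows helper are all made up for illustration; the point is only that XML carries tags that let us recover rows, while free text does not.

```python
import xml.etree.ElementTree as ET

# Structured: fixed schema, every row has the same columns (as in a SQL table).
structured_rows = [
    {"id": 1, "name": "Asha", "city": "Pune"},
    {"id": 2, "name": "Ravi", "city": "Delhi"},
]

# Semi-structured: tags give some organization, but fields may vary per record.
semi_structured_xml = """
<customers>
  <customer id="1"><name>Asha</name><city>Pune</city></customer>
  <customer id="2"><name>Ravi</name></customer>
</customers>
"""

# Unstructured: free text with no predefined fields.
unstructured_text = "Asha called from Pune about a late delivery."


def xml_to_rows(xml_text):
    """Flatten hypothetical <customer> records into rows; missing tags become None."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for customer in root.findall("customer"):
        rows.append({
            "id": int(customer.get("id")),
            "name": customer.findtext("name"),
            "city": customer.findtext("city"),  # None when the tag is absent
        })
    return rows


rows = xml_to_rows(semi_structured_xml)
print(rows)
```

The second record has no city tag, which is the kind of irregularity that makes some semi-structured data awkward to load into a fixed relational schema.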
The main goal of organizing data is to arrange it in a form that makes it readily available to
users. Its basic features are as follows.
    •   Homogeneity – The data items in a particular group should be similar to each other.
    •   Clarity – There must be no confusion in the positioning of any data item in a particular
        group.
    •   Stability – The data item set must be stable i.e. any investigation should not affect the same
        set of classification.
    •   Elastic – One should be able to change the basis of classification as the purpose of
        classification changes.
This article explores some of the most pressing challenges associated with Big Data and offers
potential solutions for overcoming them.
   •   Challenge: The most apparent challenge with Big Data is the sheer volume of data being
       generated. Organizations are now dealing with petabytes or even exabytes of data, making
       traditional storage solutions inadequate. This vast amount of data requires advanced storage
       infrastructure, which can be costly and complex to maintain.
   •   Solution: Adopting scalable cloud storage solutions, such as Amazon S3, Google Cloud
       Storage, or Microsoft Azure, can help manage large volumes of data. These platforms offer
       flexible storage options that can grow with your data needs. Additionally, implementing data
       compression and deduplication techniques can reduce storage costs and optimize the use of
       available storage space.
   •   Challenge: Big Data encompasses a wide variety of data types, including structured data
       (e.g., databases), semi-structured data (e.g., XML, JSON), and unstructured data (e.g., text,
       images, videos). The diversity of data types can make it difficult to integrate, analyze, and
       extract meaningful insights.
   •   Solution: To address the challenge of data variety, organizations can employ data integration
       platforms and tools like Apache NiFi, Talend, or Informatica. These tools help in consolidating
       disparate data sources into a unified data model. Moreover, adopting schema-on-read
       approaches, as opposed to traditional schema-on-write, allows for more flexibility in
       handling diverse data types.
    •   Challenge: With Big Data, ensuring the quality, accuracy, and reliability of data—referred to
        as data veracity—becomes increasingly difficult. Inaccurate or low-quality data can lead to
        misleading insights and poor decision-making. Data veracity issues can arise from various
        sources, including data entry errors, inconsistencies, and incomplete data.
    •   Solution: Implementing robust data governance frameworks is crucial for maintaining data
        veracity. This includes establishing data quality standards, performing regular data audits,
        and employing data cleansing techniques. Tools like Trifacta, Talend Data Quality, and Apache
        Griffin can help automate and streamline data quality management processes.
    •   Challenge: As organizations collect and store more data, they face increasing risks related to
        data security and privacy. High-profile data breaches and growing concerns over data privacy
        regulations, such as GDPR and CCPA, highlight the importance of safeguarding sensitive
        information.
    •   Solution: To mitigate security and privacy risks, organizations must adopt comprehensive
        data protection strategies. This includes implementing encryption, access controls, and
        regular security audits. Additionally, organizations should stay informed about evolving data
        privacy regulations and ensure compliance by adopting privacy-by-design principles in their
        data management processes.
    •   Challenge: Integrating data from various sources, especially when dealing with legacy
        systems, can be a daunting task. Data silos, where data is stored in separate systems without
        easy access, further complicate the integration process, leading to inefficiencies and
        incomplete analysis.
    •   Solution: Data integration platforms like Apache Camel, MuleSoft, and IBM DataStage can
        help streamline the process of integrating data from multiple sources. Adopting a
        microservices architecture can also facilitate easier data integration by breaking down
        monolithic applications into smaller, more manageable services that can be integrated more
        easily.
    •   Challenge: As data becomes a critical asset, establishing effective data governance becomes
        essential. However, many organizations struggle with creating and enforcing policies and
        standards for data management, leading to issues with data consistency, quality, and
        compliance.
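The compression and deduplication techniques named in the storage solution above can be sketched in a few lines of Python. This is only a minimal illustration on made-up sensor records, using content hashing and zlib from the standard library; production systems typically deduplicate at the block or object level inside the storage layer.

```python
import hashlib
import zlib


def deduplicate(records):
    """Keep the first copy of each record, identified by its SHA-256 digest."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


# Hypothetical records: the first two are byte-for-byte duplicates.
records = [
    b"sensor=12;temp=30\n" * 50,
    b"sensor=12;temp=30\n" * 50,
    b"sensor=13;temp=28\n" * 50,
]

unique = deduplicate(records)
raw = b"".join(unique)
compressed = zlib.compress(raw)

print("records:", len(records), "unique:", len(unique))
print("raw bytes:", len(raw), "compressed bytes:", len(compressed))
```

Deduplication removes whole repeated records, while compression shrinks the repetition that remains inside the records that are kept.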
Conclusion
While Big Data offers tremendous potential for driving innovation and business growth, it also
presents significant challenges that must be addressed. By adopting the right tools, strategies, and
best practices, organizations can overcome these challenges and unlock the full value of their data.
As the field of Big Data continues to evolve, staying informed and proactive in addressing these
challenges will be crucial for maintaining a competitive edge in the data-driven landscape.
Big Data refers to data sets so large that traditional data storage or processing systems cannot
handle them. Many multinational companies use Big Data technologies to process data and run their
business. The data flow would exceed 150 exabytes per day before replication.
There are five V's of Big Data that explain its characteristics.
    o   Volume
    o   Veracity
    o   Variety
    o   Value
    o   Velocity
Volume
The name Big Data itself is related to an enormous size. Big Data is a vast 'volume' of data
generated from many sources daily, such as business processes, machines, social media platforms,
networks, human interactions, and many more.
Facebook can generate approximately a billion messages, record 4.5 billion clicks of the "Like"
button, and receive more than 350 million new posts each day. Big data technologies can handle such
large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, and is collected from different
sources. In the past, data was collected only from databases and spreadsheets, but these days data
comes in an array of forms such as PDFs, emails, audio, social media posts, photos, videos, etc.
    1. Structured data: Structured data follows a fixed schema with all the required columns. It is
       in tabular form and is stored in a relational database management system.
    4. Quasi-structured Data: This data is textual with inconsistent formats, and can be formatted
       with some effort, time, and tools.
Example: Web server logs, i.e., log files created and maintained by a server that contain a list of
activities.
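As a rough illustration of how quasi-structured web server logs can be formatted with a tool, the Python sketch below parses a few made-up log lines with a regular expression. The log layout (an Apache-style access log) and the sample lines are assumptions for this example; real logs vary by server and configuration.

```python
import re

# Hypothetical log lines in an Apache-style access log layout.
LOG_LINES = [
    '192.168.1.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.168.1.11 - - [10/Oct/2023:13:55:40 +0000] "POST /login HTTP/1.1" 302 512',
    'corrupted line without the expected layout',
]

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)


def parse_log(lines):
    """Split lines into structured rows and lines that do not match the layout."""
    parsed, rejected = [], []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            row = match.groupdict()
            row["status"] = int(row["status"])
            row["size"] = int(row["size"])
            parsed.append(row)
        else:
            rejected.append(line)
    return parsed, rejected


parsed, rejected = parse_log(LOG_LINES)
for row in parsed:
    print(row["ip"], row["method"], row["path"], row["status"])
print("rejected lines:", len(rejected))
```

Lines that do not fit the pattern are set aside rather than dropped silently, since inconsistent formats are exactly what makes this data quasi-structured.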
Veracity
Veracity means how reliable the data is. It involves the many ways to filter or translate the data.
Veracity is also about being able to handle and manage data efficiently. Big Data is also essential
in business development.
Value
Value is an essential characteristic of big data. It is not just any data that we process or store;
it is valuable and reliable data that we store, process, and analyze.
Velocity
Velocity plays an important role compared to the others. Velocity refers to the speed at which data
is created in real time. It covers the speed of incoming data sets, the rate of change, and activity
bursts. A primary aspect of Big Data is to provide demanded data rapidly.
Big data velocity deals with the speed at which data flows from sources like application logs,
business processes, networks, social media sites, sensors, mobile devices, etc.
Applications of Big Data
Big companies utilize big data for their business growth. By analyzing this data, useful decisions
can be made in various cases, as discussed below:
1. Tracking Customer Spending Habits and Shopping Behavior: In big retail stores (like Amazon,
Walmart, Big Bazar, etc.) the management team has to keep data on customers' spending habits (which
products customers spend on, which brands they prefer, how frequently they spend), shopping
behavior, and most-liked products (so that those products can be kept in the store). Based on which
products are searched for or sold most, the production or stocking rate of those products is set.
The banking sector uses data on customers' spending behavior so that it can offer a particular
customer a discount or cashback for buying a product they like with the bank's credit or debit card.
In this way, banks can send the right offer to the right person at the right time.
2. Recommendation: By tracking customer spending habits and shopping behavior, big retail stores
provide recommendations to customers. E-commerce sites like Amazon, Walmart, and Flipkart recommend
products. They track what products a customer is searching for and, based on that data, recommend
that type of product to the customer.
As an example, suppose a customer searched for bed covers on Amazon. Amazon now has data that the
customer may be interested in buying a bed cover. The next time that customer visits a Google page,
advertisements for various bed covers will be shown. Thus, advertisements for the right product can
be sent to the right customer.
YouTube also recommends videos based on the types of video a user previously liked or watched. Based
on the content of the video the user is watching, relevant advertisements are shown while the video
plays. As an example, suppose someone is watching a tutorial video on big data; then an
advertisement for some other big data course will be shown during that video.
3. Smart Traffic System: Data about traffic conditions on different roads is collected through
cameras placed beside the road and at entry and exit points of the city, and through GPS devices
placed in vehicles (Ola, Uber cabs, etc.). All such data is analyzed, and jam-free or less
congested, less time-consuming routes are recommended. In this way a smart traffic system can be
built in the city through big data analysis. A further benefit is that fuel consumption can be
reduced.
4. Secure Air Traffic System: Sensors are present at various places on an aircraft (like the
propeller, etc.). These sensors capture data like the speed of the flight, moisture, temperature,
and other environmental conditions. Based on analysis of such data, environmental parameters within
the aircraft are set up and varied.
By analyzing the aircraft's machine-generated data, it can be estimated how long the machine can
operate flawlessly and when it should be replaced or repaired.
5. Auto Driving Car: Big data analysis helps drive a car without human intervention. Cameras and
sensors placed at various spots on the car gather data like the size of surrounding cars, obstacles,
the distance from them, etc. This data is analyzed, and then various calculations, like how many
degrees to turn, what the speed should be, when to stop, etc., are carried out. These calculations
help the car take action automatically.
6. Virtual Personal Assistant Tool: Big data analysis helps virtual personal assistant tools (like
Siri on Apple devices, Cortana on Windows, Google Assistant on Android) answer the various questions
asked by users. The tool tracks the user's location, local time, season, and other data related to
the question asked. By analyzing all such data, it provides an answer.
As an example, suppose a user asks "Do I need to take an umbrella?". The tool collects data like
the user's location and the season and weather conditions at that location, analyzes this data to
conclude whether there is a chance of rain, and then provides the answer.
7. IoT:
    •     Manufacturing companies install IoT sensors in machines to collect operational data. By
          analyzing such data, it can be predicted how long a machine will work without any problem
          and when it will require repair, so that the company can take action before the machine
          develops many issues or breaks down completely. Thus, the cost of replacing the whole
          machine can be saved.
    •     In the healthcare field, big data makes a significant contribution. Using big data tools,
          data regarding patient experience is collected and used by doctors to give better
          treatment. IoT devices can sense symptoms of a probable upcoming disease in the human body
          and help prevent it through early treatment. IoT sensors placed near a patient or new-born
          baby constantly keep track of health parameters like heart rate, blood pressure, etc.
          Whenever any parameter crosses the safe limit, an alarm is sent to a doctor, so that they
          can take steps remotely very quickly.
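The alarm rule described for patient monitoring (raise an alert whenever a parameter crosses its safe limit) can be sketched as a simple threshold check. The parameter names and the numeric ranges below are placeholders chosen for illustration only, not clinical guidance, and the readings are made up.

```python
# Hypothetical safe ranges; real limits would be set by clinicians.
SAFE_LIMITS = {
    "heart_rate_bpm": (60, 100),
    "systolic_bp_mmhg": (90, 140),
}


def check_reading(reading):
    """Return one alert message for each parameter outside its safe range."""
    alerts = []
    for name, (low, high) in SAFE_LIMITS.items():
        value = reading.get(name)
        if value is None:
            continue  # the sensor did not report this parameter
        if value < low or value > high:
            alerts.append(f"{name}={value} outside safe range {low}-{high}")
    return alerts


# Made-up sensor readings for one patient.
readings = [
    {"patient": "P1", "heart_rate_bpm": 72, "systolic_bp_mmhg": 118},
    {"patient": "P1", "heart_rate_bpm": 131, "systolic_bp_mmhg": 121},
]

for reading in readings:
    for alert in check_reading(reading):
        print(f"ALERT for {reading['patient']}: {alert}")
```

In a real deployment the print call would be replaced by whatever notification channel reaches the doctor; that part is outside the scope of this sketch.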
8. Education Sector: Organizations that conduct online educational courses utilize big data to find
candidates interested in a course. If someone searches for a YouTube tutorial video on a subject,
then online or offline course providers for that subject send that person online ads about their
courses.
9. Energy Sector: Smart electric meters read the power consumed every 15 minutes and send the
readings to a server, where the data is analyzed to estimate the times of day when the power load is
lowest across the city. Using this system, manufacturing units or householders are advised to run
their heavy machines at night, when the power load is lower, to get a lower electricity bill.
10. Media and Entertainment Sector: Media and entertainment service providers like Netflix, Amazon
Prime, and Spotify analyze data collected from their users. Data such as what type of video or music
users watch or listen to most, how long users spend on the site, etc. is collected and analyzed to
set the next business strategy.
Most of this data is generated by social media sites like Facebook, Instagram, Twitter, etc.; other
sources include e-business, e-commerce transactions, and hospital, school, and bank data. This data
is impossible to manage with traditional data storage techniques, so Big Data technologies came into
existence for handling data which is big and impure.
Big Data is the field in which enterprises collect large data sets from various sources like social
media, GPS, sensors, etc., analyze them systematically, and extract useful patterns using tools and
techniques. Before the data is analyzed, the data architecture must be designed by the architect.
Data is one of the essential pillars of enterprise architecture, through which an enterprise
succeeds in executing its business strategy.
Data architecture design is important for creating a vision of the interactions occurring between
data systems. For example, if a data architect wants to implement data integration, two systems will
need to interact, and the data architecture provides a model of the data interaction during the
process.
Data architecture also describes the types of data structures applied to manage data, and it
provides an easy way to preprocess data. The data architecture is formed by dividing it into three
essential models, which are then combined :
    •   Conceptual model –
        It is a business model which uses the Entity Relationship (ER) model to describe relations
        between entities and their attributes.
    •   Logical model –
        It is a model where problems are represented in logical form, such as rows and columns of
        data, classes, XML tags, and other DBMS techniques.
    •   Physical model –
        The physical model holds the database design, such as which type of database technology
        will be suitable for the architecture.
A data architect is responsible for the design, creation, management, and deployment of the data
architecture and defines how data is to be stored and retrieved; other decisions are made by
internal bodies.
    •   Business requirements –
        These include factors such as the expansion of business, the performance of the system
        access, data management, transaction management, making use of raw data by converting
        them into image files and records, and then storing in data warehouses. Data warehouses
        are the main aspects of storing transactions in business.
    •   Business policies –
        The policies are rules that are useful for describing the way of processing data. These policies
        are made by internal organizational bodies and other government agencies.
    •   Technology in use –
        This includes using previously completed data architecture designs as examples, as well as
        existing licensed software purchases and database technology.
    •   Business economics –
        Economic factors such as business growth and loss, interest rates, loans, the condition of
        the market, and the overall cost will also have an effect on the design of the architecture.
Data Management :
    •   Data management is the process of managing tasks like extracting data, storing data,
        transferring data, processing data, and then securing data at low cost.
    •   The main motive of data management is to manage and safeguard people's and organizations'
        data in an optimal way so that they can easily create, access, delete, and update the data.
    •   Data management is an essential process in the growth of every enterprise; without it,
        policies and decisions for business advancement can't be made. The better the data
        management, the better the productivity of the business.
    •   Large volumes of data, like big data, are harder to manage traditionally, so optimal
        technologies and tools such as Hadoop, Scala, Tableau, AWS, etc. must be used for data
        management. These can further be used for big data analysis to discover patterns and drive
        improvements.
    •   Data management can be achieved by training employees as necessary and through maintenance
        by DBAs, data analysts, and data architects.
Unit-3
What is Hadoop
Hadoop is an open source framework from Apache used to store, process, and analyze data that is very
large in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is
used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and
many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
    1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed
       on the basis of it. It states that files will be broken into blocks and stored on nodes
       across the distributed architecture.
    2. Yarn: Yet Another Resource Negotiator is used for job scheduling and managing the cluster.
    3. Map Reduce: This is a framework which helps Java programs do parallel computation on data
       using key-value pairs. The Map task takes input data and converts it into a data set which
       can be computed as key-value pairs. The output of the Map task is consumed by the Reduce
       task, and the output of the reducer gives the desired result.
    4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
       Hadoop modules.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop
Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes Job
Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes DataNode and
TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It has a
master/slave architecture. This architecture consists of a single NameNode that performs the role of
master and multiple DataNodes that perform the role of slaves.
Both the NameNode and DataNode are capable of running on commodity machines. HDFS is developed in
the Java language, so any machine that supports Java can easily run the NameNode and DataNode
software.
NameNode
    o   It manages the file system namespace by executing operations like opening, renaming, and
        closing files.
DataNode
    o   It is the responsibility of the DataNode to serve read and write requests from the file
        system's clients.
    o   It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker
    o   The role of the Job Tracker is to accept MapReduce jobs from the client and process the data
        by using the NameNode.
Task Tracker
    o   It receives tasks and code from the Job Tracker and applies that code to the file. This
        process can also be called a Mapper.
MapReduce Layer
The MapReduce layer comes into play when the client application submits a MapReduce job to the Job
Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes
a TaskTracker fails or times out. In such a case, that part of the job is rescheduled.
Advantages of Hadoop
    o   Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster
        retrieval. Even the tools to process the data are often on the same servers, thus reducing
        the processing time. It is able to process terabytes of data in minutes and petabytes in
        hours.
    o   Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster.
    o   Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is
        really cost effective compared to a traditional relational database management system.
    o   Resilient to failure: HDFS can replicate data over the network, so if one node is down or
        some other network failure happens, Hadoop uses another copy of the data. Normally, data is
        replicated three times, but the replication factor is configurable.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File
System paper, published by Google.
  o    While working on Apache Nutch, they were dealing with big data. Storing that data was very
       costly, which became a serious problem for that project. This problem became one of the
       important reasons for the emergence of Hadoop.
  o    In 2003, Google introduced a file system known as GFS (Google File System). It is a
       proprietary distributed file system developed to provide efficient access to data.
  o    In 2004, Google released a white paper on MapReduce. This technique simplifies data
       processing on large clusters.
  o    In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS
       (Nutch Distributed File System). This file system also includes MapReduce.
  o    In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he introduced a new
       project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's
       first version, 0.1.0, was released in this year.
  o    Doug Cutting named his project Hadoop after his son's toy elephant.
  o    In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900 node cluster
       within 209 seconds.
What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several
machines and replicated to ensure durability against failure and high availability to parallel
applications.
It is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes,
and the name node.
    o   Streaming Data Access: The time to read the whole data set is more important than the
        latency in reading the first record. HDFS is built on a write-once, read-many-times pattern.
    o   Low Latency data access: Applications that require very little time to access the first
        record should not use HDFS, as it gives importance to the whole data set rather than the
        time to fetch the first record.
    o   Lots Of Small Files: The name node holds the metadata of files in memory, and if the files
        are small in size, their metadata takes up a lot of the name node's memory, which is not
        feasible.
    o   Multiple Writes: It should not be used when we have to write multiple times.
HDFS Concepts
    1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are
       128 MB by default, and this is configurable. Files in HDFS are broken into block-sized
       chunks, which are stored as independent units. Unlike in an ordinary file system, if a file
       in HDFS is smaller than the block size, it does not occupy the full block's size, i.e. a
       5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS
       block size is large just to minimize the cost of seeks.
    2. Name Node: HDFS works in a master-worker pattern where the name node acts as master. The
       name node is the controller and manager of HDFS, as it knows the status and the metadata of
       all the files in HDFS; the metadata includes file permissions, names, and the location of
       each block. The metadata is small, so it is stored in the memory of the name node, allowing
       faster access to data. Moreover, the HDFS cluster is accessed by multiple clients
       concurrently, so all this information is handled by a single machine. File system
       operations like opening, closing, renaming, etc. are executed by it.
    3. Data Node: Data nodes store and retrieve blocks when they are told to by a client or the
       name node. They report back to the name node periodically with the list of blocks that they
       are storing. The data node, being commodity hardware, also does the work of block creation,
       deletion, and replication as instructed by the name node.
Secondary Name Node: It is a separate physical machine which acts as a helper of the name node. It
performs periodic checkpoints. It communicates with the name node and takes snapshots of the
metadata, which helps minimize downtime and loss of data.
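The block arithmetic described above can be checked with a short Python sketch. It uses the 128 MB default block size and a replication factor of 3 from the text; the round-robin placement and the data node names are simplifications invented for this example, since real HDFS placement also considers racks and node load.

```python
BLOCK_SIZE_MB = 128  # HDFS default block size, as stated above
REPLICATION = 3      # default replication factor, as stated earlier


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file would be split into."""
    if file_size_mb <= 0:
        return []
    full_blocks = int(file_size_mb // block_size_mb)
    remainder = file_size_mb - full_blocks * block_size_mb
    blocks = [block_size_mb] * full_blocks
    if remainder > 0:
        blocks.append(remainder)  # the last block holds only the leftover data
    return blocks


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy round-robin placement; assumes len(data_nodes) >= replication."""
    metadata = {}
    for block_id in range(num_blocks):
        metadata[block_id] = [
            data_nodes[(block_id + i) % len(data_nodes)]
            for i in range(replication)
        ]
    return metadata


small_file = split_into_blocks(5)    # a 5 MB file occupies one 5 MB block
large_file = split_into_blocks(300)  # a 300 MB file needs three blocks
placement = place_replicas(len(large_file), ["dn1", "dn2", "dn3", "dn4"])
raw_storage_mb = sum(large_file) * REPLICATION

print(small_file, large_file)
print(placement)
print("raw storage used (MB):", raw_storage_mb)
```

The placement dictionary plays the role of the name node's in-memory metadata: for every block it records which data nodes hold a copy.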
Starting HDFS
The HDFS should be formatted initially and then started in the distributed mode. Commands are
given below.
To Start $ start-dfs.sh
o First create a folder in HDFS where data can be put from the local file system.
            o    Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS
                 folder /user/test
Recursive deleting
Example:
"<localSrc>" and "<localDest>" are paths as above, but on the local file system
o put <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
o copyFromLocal <localSrc><dest>
Identical to -put
o moveFromLocal <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and
then deletes the local copy on success.
Copies the file or directory in HDFS identified by src to the local file system path identified by
localDest.
o cat <file-name>
o moveToLocal <src><localDest>
o touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path,
unless the file is already size 0.
Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n),
block size (%o), replication (%r), and modification date (%y, %Y).
What is YARN
Yet Another Resource Negotiator takes programming to the next level beyond Java and makes it
interactive, letting other applications such as HBase, Spark, etc. work on it. Different YARN
applications can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the
same time, bringing great benefits for manageability and cluster utilization.
Components Of YARN
    o   Node Manager: Launches and monitors the compute containers on machines in the cluster.
    o   Map Reduce Application Master: Checks the tasks running the MapReduce job. The application
        master and the MapReduce tasks run in containers that are scheduled by the resource
        manager, and managed by the node managers.
JobTracker and TaskTracker were used in the previous version of Hadoop and were responsible for
handling resources and tracking progress. However, Hadoop 2.0 has the ResourceManager and
NodeManager to overcome the shortfalls of the JobTracker and TaskTracker.
Benefits of YARN
    o   Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes and 40,000 tasks, but
        YARN is designed for 10,000 nodes and 100,000 (1 lakh) tasks.
    o   Utilization: The Node Manager manages a pool of resources rather than a fixed number of
        designated slots, thus increasing utilization.
    o   Multitenancy: Different versions of MapReduce can run on YARN, which makes the process of
        upgrading MapReduce more manageable.
What is MapReduce?
MapReduce is a data processing tool used to process data in parallel in a distributed
form. It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data
Processing on Large Clusters," published by Google.
MapReduce is a paradigm with two phases, the mapper phase and the reducer phase. In
the Mapper, the input is given in the form of key-value pairs. The output of the Mapper is fed to the
reducer as input. The reducer runs only after the Mapper is over. The reducer also takes input in
key-value format, and the output of the reducer is the final output.
    o   The map takes data in the form of <key, value> pairs and returns a list of <key, value> pairs.
        The keys need not be unique at this stage.
    o   The Hadoop architecture applies sort and shuffle to the output of the map. Sort and
        shuffle acts on this list of <key, value> pairs and sends out each unique key with the list of
        values associated with it, <key, list(values)>.
    o   The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined
        function on the list of values for each unique key, and the final output <key, value> is
        stored/displayed.
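The three steps above can be sketched in plain Python as a word count. This is only an in-memory illustration of the data flow, not Hadoop code; the function names (map_phase, shuffle_and_sort, reduce_phase) and the sample lines are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a <word, 1> pair for every word; keys may repeat."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Group values by key and return <key, list(values)> in key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """Reduce: apply a function (here, sum) to the values of one unique key."""
    return key, sum(values)

# Hypothetical input split into two lines.
lines = ["big data big cluster", "data node"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_and_sort(mapped)
result = dict(reduce_phase(key, values) for key, values in grouped)
print(result)
```

Here the mapper output contains the key "big" twice, the shuffle step turns it into one entry with a list of two values, and the reducer collapses that list into a single count.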
Sort and Shuffle
The sort and shuffle occur on the output of Mapper and before the reducer. When the Mapper task
is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written
to disk. Using the input from each Mapper <k2,v2>, we collect all the values for each unique key k2.
This output from the shuffle phase in the form of <k2, list(v2)> is sent as input to reducer phase.
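A minimal sketch of the partitioning step, assuming two reducers and a CRC32-based partitioner chosen only so that the result is stable across runs (Hadoop's own default partitioner is not reproduced here). The names partition and shuffle and the sample mapper outputs are hypothetical.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer for a key; every occurrence of a key maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Collect <k2, v2> pairs from every mapper into per-reducer <k2, list(v2)>."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in sorted(output):  # each mapper's output is sorted by key
            reducers[partition(key, num_reducers)][key].append(value)
    return [dict(sorted(r.items())) for r in reducers]

mapper_outputs = [
    [("b", 1), ("a", 1), ("b", 1)],  # output of mapper 1
    [("a", 1), ("c", 1)],            # output of mapper 2
]
reducer_inputs = shuffle(mapper_outputs, num_reducers=2)
for number, reducer_input in enumerate(reducer_inputs):
    print(number, reducer_input)
```

Whichever reducer a key lands on, all values for that key end up in one list on one reducer, which is what allows the reduce function to run independently per key.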
Usage of MapReduce
    o   It can be used in various applications like document clustering, distributed sorting, and web
        link-graph reversal.
    o   It can be used for distributed pattern-based searching.
    o   It was used by Google to regenerate Google's index of the World Wide Web.
Unit-4
Types of Databases
There are various types of databases used for storing different varieties of data:
1) Centralized Database
It is the type of database that stores data in one centralized database system. It lets users
access the stored data from different locations through several applications. These applications
include an authentication process so that users can access data securely. An example of a centralized
database is a central library that holds a central database of each library in a college/university.
Advantages of Centralized Database
    o   It decreases the risk in data management, i.e., manipulation of data will not affect the
        core data.
    o   It provides better data quality, which enables organizations to establish data standards.
    o   It is less costly because fewer vendors are required to handle the data sets.
Disadvantages of Centralized Database
    o   The size of the centralized database is large, which increases the response time for fetching
        the data.
    o   If the central server fails, all of the data may be lost, which could be a huge loss.
2) Distributed Database
Unlike a centralized database system, in distributed systems, data is distributed among different
database systems of an organization. These database systems are connected via communication
links. Such links help the end-users to access the data easily. Examples of the Distributed database
are Apache Cassandra, HBase, Ignite, etc.
    o   Homogeneous DDB: Database systems that run on the same operating system, use the same
        application process, and use the same hardware devices.
    o   Heterogeneous DDB: Database systems that run on different operating systems, under
        different application procedures, and use different hardware devices.
Advantages of Distributed Database
    o   Modular development is possible in a distributed database, i.e., the system can be expanded
        by including new computers and connecting them to the distributed system.
    o   One server failure will not affect the entire data set.
3) Relational Database
This database is based on the relational data model, which stores data in the form of rows (tuples)
and columns (attributes) that together form a table (relation). A relational database uses SQL for
storing, manipulating, and maintaining the data. E.F. Codd proposed the relational model in 1970.
Each table in the database has a key that uniquely identifies each row. Examples of relational
databases are MySQL, Microsoft SQL Server, Oracle, etc.
Transactions in a relational database follow four commonly known properties, known as the ACID
properties, where:
A means Atomicity: This ensures that a data operation completes either entirely or not at all.
It follows the 'all or nothing' strategy. For example, a transaction will either be committed or will
abort.
C means Consistency: Any operation performed on the data must take the database from one valid
state to another. For example, in a money transfer, the total balance across the accounts before and
after the transaction should remain conserved.
I means Isolation: Several users can access data from the database at the same time, so their
transactions must be kept isolated from one another. For example, when multiple transactions occur
at the same time, the effects of one transaction should not be visible to the other transactions in
the database until it commits.
D means Durability: It ensures that once the operation completes and the data is committed, the
changes remain permanent.
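Atomicity and consistency can be illustrated with a small sketch using Python's built-in sqlite3 module. The accounts table, the names, and the transfer function are hypothetical; the CHECK constraint stands in for a business rule that balances may not go negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates are committed or neither is."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

first = transfer(conn, "alice", "bob", 30)    # succeeds
second = transfer(conn, "alice", "bob", 500)  # debit violates the CHECK, so it aborts
print(first, second, balances(conn))
```

In the failed transfer the credit to bob has already been applied when the debit fails, yet the rollback undoes it, so the total of 150 is conserved.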
4) NoSQL Database
Non-SQL/Not Only SQL is a type of database that is used for storing a wide range of data sets. It is
not a relational database as it stores data not only in tabular form but in several different ways. It
came into existence when the demand for building modern applications increased. Thus, NoSQL
presented a wide variety of database technologies in response to the demands. We can further
divide a NoSQL database into the following four types:
    1. Key-value storage: It is the simplest type of database storage, where every single item is
       stored as a key (or attribute name) together with its value.
    4. Wide-column stores: It is similar to the data represented in relational databases. Here, data
       is stored in large columns together, instead of storing in rows.
    o   It enables good productivity in the application development as it is not required to store data
        in a structured format.
o Users can quickly access data from the database through key-value.
5) Cloud Database
A type of database where data is stored in a virtual environment and executes over the cloud
computing platform. It provides users with various cloud computing services (SaaS, PaaS, IaaS, etc.)
for accessing the database. There are numerous cloud platforms; some well-known options are:
o Microsoft Azure
o Kamatera
o phoenixNAP
o ScienceSoft
6) Object-oriented Databases
The type of database that uses the object-based data model approach for storing data in the
database system. The data is represented and stored as objects which are similar to the objects used
in the object-oriented programming language.
7) Hierarchical Databases
It is the type of database that stores data in the form of parent-children relationship nodes. Here, it
organizes data in a tree-like structure.
Data get stored in the form of records that are connected via links. Each child record in the tree will
contain only one parent. On the other hand, each parent record can have multiple child records.
8) Network Databases
It is the database that typically follows the network data model. Here, the representation of data is in
the form of nodes connected via links between them. Unlike the hierarchical database, it allows each
record to have multiple children and parent nodes to form a generalized graph structure.
9) Personal Database
Collecting and storing data on the user's system defines a Personal Database. This database is
basically designed for a single user.
10) Operational Database
The type of database which creates and updates the database in real time. It is basically designed for
executing and handling the daily data operations in several businesses. For example, an organization
uses operational databases for managing its daily transactions.
11) Enterprise Database
Large organizations or enterprises use this database for managing a massive amount of data. It helps
organizations to increase and improve their efficiency. Such a database allows simultaneous access to
users.
NoSQL Databases
MongoDB is a NoSQL database, so it is necessary to know about NoSQL databases in order
to understand MongoDB thoroughly.
NoSQL Database
It provides a mechanism for storage and retrieval of data other than tabular relations model used in
relational databases. NoSQL database doesn't use tables for storing data. It is generally used to store
big data and real-time web applications.
In the early 1970s, flat file systems were used. Data was stored in flat files, and the biggest
problem with flat files was that each company implemented its own format and there were no standards.
It was very difficult to store data in the files and retrieve data from them because there was no
standard way to store data.
Then E.F. Codd introduced the relational database, which answered the problem of having no standard
way to store data. Later, however, relational databases ran into a problem of their own: they could
not handle big data well. This created the need for a database that could handle such workloads, and
so the NoSQL database was developed.
Advantages of NoSQL
SQL vs NoSQL
There are a lot of databases used today in the industry. Some are SQL databases, some are NoSQL
databases. The conventional database is SQL database system that uses tabular relational model to
represent data and their relationship. The NoSQL database is the newer one database that provides a
mechanism for storage and retrieval of data other than tabular relations model used in relational
databases.
 7)       SQL databases are not best suited for     NoSQL databases are best suited for
          hierarchical data storage.                hierarchical data storage.
A database is a collection of structured data or information which is stored in a computer system and
can be accessed easily. A database is usually managed by a Database Management System (DBMS).
NoSQL is a non-relational database that is used to store the data in the nontabular form. NoSQL
stands for Not only SQL. The main types are documents, key-value, wide-column, and graphs.
• Document-based databases
• Key-value stores
• Column-oriented databases
• Graph-based databases
Document-Based Database:
The document-based database is a nonrelational database. Instead of storing the data in rows and
columns (tables), it uses the documents to store the data in the database. A document database
stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data objects used in
applications which means less translation is required to use these data in the applications. In the
Document database, the particular elements can be accessed by using the index value that is
assigned for faster querying.
Collections are groups of documents that have similar contents. The documents in a collection
are not required to share the same schema, because document databases have
a flexible schema.
    •   Flexible schema: Documents in the database have a flexible schema. It means the documents
        in the database need not have the same schema.
    •   Faster creation and maintenance: The creation of documents is easy, and minimal
        maintenance is required once we create the document.
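The flexible-schema idea can be sketched with plain Python dictionaries and the standard json module. This is not MongoDB code; the students collection, its fields, and the find helper are assumptions made for illustration.

```python
import json

# A hypothetical "students" collection: documents with different fields coexist.
students = [
    {"_id": 1, "name": "Asha", "courses": ["Data Science", "Statistics"]},
    {"_id": 2, "name": "Ravi", "address": {"city": "Pune"}},
]

# A simple index on _id so that one document can be fetched without a scan.
index_by_id = {doc["_id"]: doc for doc in students}

def find(collection, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

# Documents map directly to JSON, close to the objects the application uses.
serialized = json.dumps(index_by_id[2])
print(find(students, name="Asha"))
print(serialized)
```

The second document has an address but no courses, and nothing in the collection needs to change to accommodate it.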
Key-Value Stores:
A key-value store is a nonrelational database. The simplest form of a NoSQL database is a key-value
store. Every data element in the database is stored in key-value pairs. The data can be retrieved by
using a unique key allotted to each element in the database. The values can be simple data types like
strings and numbers or complex objects.
A key-value store is like a relational database with only two columns which is the key and the value.
• Simplicity.
• Scalability.
• Speed.
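A key-value store can be sketched in a few lines on top of a Python dictionary. The KeyValueStore class and the keys below are hypothetical and keep everything in memory, unlike a real key-value database.

```python
class KeyValueStore:
    """A tiny in-memory key-value store (illustrative only)."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        """Store a value under a unique key, replacing any earlier value."""
        self._items[key] = value

    def get(self, key, default=None):
        """Fetch the value for a key, or the default if the key is absent."""
        return self._items.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it existed."""
        if key in self._items:
            del self._items[key]
            return True
        return False

store = KeyValueStore()
store.put("user:1", "Asha")                                   # simple value
store.put("cart:1", {"items": ["pen", "book"], "total": 2})  # complex object
print(store.get("user:1"))
print(store.get("missing", "not found"))
```

Every operation goes through the key, which is why such stores are simple and fast but offer no way to query by the contents of a value.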
Column-Oriented Databases:
A column-oriented database is a non-relational database that stores the data in columns instead of
rows. That means when we want to run analytics on a small number of columns, we can read those
columns directly without consuming memory with the unwanted data.
Columnar databases are designed to read data more efficiently and retrieve the data with greater
speed. A columnar database is used to store a large amount of data. Key features of a
column-oriented database:
• Scalability.
• Compression.
• Very responsive.
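The difference between row and column layouts can be sketched with plain Python lists. The sales records, to_columns, and run_length_encode are hypothetical; run-length encoding is used here only as one simple example of why values stored together in a column compress well.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Pune", "sales": 120},
    {"id": 2, "city": "Delhi", "sales": 80},
    {"id": 3, "city": "Pune", "sales": 100},
]

def to_columns(rows):
    """Pivot row records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    """Compress runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

columns = to_columns(rows)
total_sales = sum(columns["sales"])  # reads only the "sales" column
compressed_city = run_length_encode(sorted(columns["city"]))
print(total_sales, compressed_city)
```

The aggregate touches one list instead of every field of every record, which is the access pattern that analytics on a few columns benefits from.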
Graph-Based databases:
Graph-based databases focus on the relationships between the elements. They store the data in the
form of nodes in the database. The connections between the nodes are called links or relationships.
    •   In a graph-based database, it is easy to identify the relationship between the data by using
        the links.
• The speed depends upon the number of relationships among the database elements.
    •   Updating data is also easy, as adding a new node or edge to a graph database is a
        straightforward task that does not require significant schema changes.
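Nodes and links can be sketched with an adjacency structure in Python. The Graph class, its method names, and the people in the example are assumptions for illustration and do not correspond to any particular graph database API.

```python
from collections import deque

class Graph:
    """Nodes connected by undirected links, stored as adjacency sets."""

    def __init__(self):
        self.links = {}

    def add_relationship(self, a, b):
        """Adding a node or link needs no schema change, only a new entry."""
        self.links.setdefault(a, set()).add(b)
        self.links.setdefault(b, set()).add(a)

    def neighbours(self, node):
        return sorted(self.links.get(node, set()))

    def connected(self, start, goal):
        """Breadth-first search: is there a path of links from start to goal?"""
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                return True
            for nxt in self.links.get(node, set()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

graph = Graph()
graph.add_relationship("Asha", "Ravi")
graph.add_relationship("Ravi", "Meena")
graph.add_relationship("Kiran", "Dev")
print(graph.neighbours("Ravi"))
print(graph.connected("Asha", "Meena"), graph.connected("Asha", "Dev"))
```

The traversal follows links directly from node to node, so the cost of a query grows with the number of relationships visited rather than with the total size of the data.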
Unit-5
Data Analytics is a systematic approach that transforms raw data into valuable insights. This process
encompasses a suite of technologies and tools that facilitate data collection, cleaning,
transformation, and modelling, ultimately yielding actionable information. This information serves as
a robust support system for decision-making. Data analysis plays a pivotal role in business growth
and performance optimization. It aids in enhancing decision-making processes, bolstering risk
management strategies, and enriching customer experiences. By presenting statistical summaries,
data analytics provides a concise overview of quantitative data.
While data analytics finds extensive application in the finance industry, its utility is not confined to
this sector alone. It is also leveraged in diverse fields such as agriculture, banking, retail, and
government, among others, underscoring its universal relevance and impact. Thus, data analytics
serves as a powerful tool for driving informed decisions and fostering growth across various
industries.
Data analysts, data scientists, and data engineers together create data pipelines, which help to set
up the model and do further analysis. Data analytics can be done in the steps
mentioned below:
    1. Data Collection: This is the first step, where raw data is collected for analysis. It can
       happen in two ways. If the data comes from different source systems, the data analysts
       combine it using data integration routines. If the required data is a subset of a larger
       data set, the analyst extracts the useful subset and transfers it to a separate
       compartment in the system.
    2. Data Cleansing: After collection, the next step is to improve the quality of the data, since
       collected data often contains quality problems such as errors, duplicate entries, and
       white spaces that need to be corrected before moving to the next step. Running data
       profiling and data cleansing tasks corrects these errors. The analysts then organise the
       data according to the needs of the analytical model.
    3. Data Analysis and Data Interpretation: Analytical models are created using software and
       other tools that interpret the data and help to understand it. The tools include Python,
       Excel, R, Scala, and SQL. The model is tested repeatedly until it works as required; then,
       in production mode, the data set is run against the model.
    4. Data Visualisation: Data visualisation is the process of creating visual representations of
       data using plots, charts, and graphs, which helps to analyse patterns and trends and to get
       valuable insights from the data. By comparing and analysing the datasets, data analysts find
       the useful data in the raw data.
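The cleansing step described above (errors, duplicate entries, and white spaces) can be sketched in plain Python. The clean_records function and the sample records are hypothetical, and treating an empty field as an error to be dropped is an assumption made for this example.

```python
def clean_records(records):
    """Strip white space, drop rows with empty fields, and remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        row = tuple(field.strip() for field in record)
        if any(field == "" for field in row):
            continue  # assumed rule: a row with a missing value is an error
        key = tuple(field.lower() for field in row)
        if key in seen:
            continue  # duplicate entry once white space and case are ignored
        seen.add(key)
        cleaned.append(row)
    return cleaned

raw = [
    ("  Asha ", "Pune"),
    ("Asha", "Pune "),   # duplicate of the first row after stripping
    ("Ravi", ""),        # missing city
    ("Meena", "Delhi"),
]
cleaned = clean_records(raw)
print(cleaned)
```

In practice the rules for what counts as an error depend on the data set, so they would be agreed during data profiling rather than hard-coded as here.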
There are different types of data analysis in which raw data is converted into valuable insights. Some
of the types of data analysis are mentioned below:
    1. Descriptive Data Analytics: Descriptive data analytics summarises the data set. It is used
       to compare past results, distinguish strengths from weaknesses, and identify
       anomalies. Companies use descriptive data analysis to identify problems in the data set,
       as it helps in identifying patterns.
    2. Real-time Data Analytics: Real-time data analytics does not use data from past events. It
       analyses data as soon as it is entered in the database. Companies use this type of
       analysis to identify trends and track competitors’ operations.
    3. Diagnostic Data Analytics: Diagnostic data analytics uses past data sets to analyse the cause
       of an anomaly. Some of the techniques used in diagnostic analysis are correlation analysis,
       regression analysis, and analysis of variance. The results provided by diagnostic
       analysis help companies to find accurate solutions to their problems.
    4. Predictive Data Analytics: This type of analytics is performed on current data to predict
       future outcomes. It builds predictive models using machine learning algorithms and
       statistical modelling techniques to identify trends and patterns. Predictive data analysis
       is used in sales forecasting, risk estimation, and predicting customer behaviour.
There are two types of methods in data analysis which are mentioned below:
Qualitative data analysis doesn’t use statistics; it derives data from words, pictures, and
symbols. Some common qualitative methods are:
    •   Narrative analytics is used for working with data acquired from diaries, interviews, and so on.
    •   Content analytics is used for the analysis of verbal data and behaviour.
Quantitative data analytics collects data and then processes it into numerical data. Some
of the quantitative methods are mentioned below:
    •   Sample size determination decides how many members of a large group to sample; the small
        sample is then analysed in place of the whole group.
    •   The average, or mean, of a list is the sum of the numbers in the list divided by the number
        of items in that list.
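The two quantitative methods can be sketched together: draw a small sample from a large group and compute its mean. The population of 100 values, the sample size of 10, and the fixed seed are assumptions chosen only to make the example repeatable.

```python
import random

def mean(values):
    """Sum of the numbers divided by how many numbers there are."""
    return sum(values) / len(values)

# Hypothetical population: the integers 1 to 100.
population = list(range(1, 101))

# Draw a small sample without replacement; the seed makes the run repeatable.
rng = random.Random(42)
sample = rng.sample(population, 10)

print(mean(population))
print(sample, mean(sample))
```

The sample mean will usually differ somewhat from the population mean of 50.5, and a larger sample tends to narrow that gap, which is the trade-off that sample size determination addresses.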
There are multiple skills which are required to be a Data analyst. Some of the main skills are
mentioned below:
• Some of the common programming languages which are used are R and Python.
• Probability and statistics are used to better analyse and interpret the data.
• For collecting and organising data, Data Management is used in data analysis.
For entry-level positions in data analytics, these job roles are available:
For experienced-level positions in data analytics, these job roles are available:
• Data Analyst
• Data Architect
• Data Engineer
• Data Scientist
• Marketing Analyst
• Business Analyst
    •   Data analytics helps a business target its main audience by identifying trends and
        patterns in the data sets. This helps the business to grow and optimise its
        performance.
    •   Data analysis shows the areas where the business needs more resources, products,
        and money, and where the right amount of interaction with customers is not happening.
        The business can grow by identifying these problems and then working on
        them.
    •   Data analysis also helps in the marketing and advertising of the business, making it more
        popular so that more customers know about it.
    •   The valuable information extracted from the raw data can bring an advantage to the
        organisation by examining present situations and predicting future outcomes.
    •   With data analytics, the business can improve by targeting the right audience and
        understanding its disposable income and spending habits, which helps the business to set
        prices according to the interest and budget of customers.
Conclusion
In conclusion, Data Analytics serves as a powerful catalyst for business growth and performance
optimization. By transforming raw data into actionable insights, it empowers businesses to make
informed decisions, bolster risk management strategies, and enhance customer experiences. The
process of data analysis involves identifying trends and patterns within data sets, thereby facilitating
the extraction of meaningful conclusions.
While the finance industry is a prominent user of data analytics, its applications are not confined to
this sector alone. It finds extensive use in diverse fields such as agriculture, banking, retail, and
government, underscoring its universal relevance and impact.
The above discussion elucidates the processes, types, methods, and significance of data analysis. It
underscores the pivotal role of data analytics in today’s data-driven world, highlighting its potential
to drive informed decisions and foster growth across various industries. Thus, Data Analytics stands
as a cornerstone in the realm of data-driven decision making, playing an instrumental role in shaping
the future of businesses across the globe.
                                UNIT – 1 Introduction to Data Science
1. What is Data Science
   Data science is a field that combines statistical analysis, machine learning, data visualization, and domain
   expertise to understand and interpret complex data. It involves various processes, including data collection,
   cleaning, analysis, modelling, and communication of insights. The goal is to extract valuable information
   from data to inform decision-making and strategy.
   Multiple-Choice Questions
   1. What is the primary goal of data science?
     - A) To collect data
     - B) To analyze data
     - C) To extract insights from data
     - D) To visualize data
       Answer: C) To extract insights from data
Data science has become essential in today's data-driven world for several reasons:
   1. Data-Driven Decision Making: Organizations use data science to make informed decisions based on
   empirical evidence rather than intuition or guesswork.
   2. Understanding Customer Behaviour: Data science helps businesses analyse customer data to
   understand preferences and behaviours, leading to personalized marketing and improved customer
   experiences.
   3. Predictive Analytics: Companies leverage data science to forecast trends and outcomes, allowing them
   to be proactive rather than reactive in their strategies.
   4. Operational Efficiency: Data science can identify inefficiencies in processes and suggest improvements,
   leading to cost savings and better resource allocation.
   5. Competitive Advantage: Organizations that utilize data science effectively can gain insights that provide
   a competitive edge in their industry.
   6. Innovation: Data science enables organizations to explore new business models, products, and services
   based on data insights.
 1. Definition: Focuses on analyzing historical data to provide actionable insights and improve decision-
 making.
3. Techniques: Dashboards, SQL queries, data visualization tools (e.g., Power BI, Tableau).
 1. Definition: Extracts insights and predictions from large, complex datasets using advanced algorithms.
 2. Key Goal: Predictive and prescriptive analytics.
 3. Techniques: Machine learning, AI, statistical modelling, coding (e.g., Python, R).
 4. Usage: Forecasting trends, customer segmentation, anomaly detection.
 5. Output: Predictive models, recommendations, insights for innovation.
  MCQs
  1. Data Collection
  Definition: The process of gathering raw data from various sources such as databases, APIs, web scraping,
  or manual entry.
  Tools: SQL, Python (libraries like requests, Beautiful Soup), APIs.
5. Model Evaluation
Definition: Assessing the performance of a model using metrics to ensure accuracy and reliability.
Metrics: Accuracy, precision, recall, F1 score, RMSE.
Tools: Python (Scikit-learn), R, cross-validation techniques.
6. Data Visualization
Definition: Representing data insights through charts, graphs, and dashboards.
Purpose: Communicating results effectively to stakeholders.
Tools: Matplotlib, Seaborn, Power BI, Tableau.
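The evaluation metrics listed under Model Evaluation can be computed by hand, as sketched below in plain Python. The function names and the sample labels are hypothetical; libraries such as Scikit-learn provide equivalent functions, which are not used here so that the arithmetic stays visible.

```python
import math

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def rmse(actual, predicted):
    """Root mean squared error for numeric predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative, 1 true negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
metrics = classification_metrics(y_true, y_pred)
error = rmse([3, 5, 2], [2, 5, 4])
print(metrics, error)
```

Accuracy, precision, recall, and F1 apply to classification models, while RMSE applies to models that predict numeric values.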
MCQs
 1. Problem Definition
        Understanding the business problem and defining the goals.
 2. Data Collection
        Gathering relevant data from various sources.
 5. Model Building
       Selecting algorithms and training machine learning models.
 6. Model Evaluation
       Testing the model's performance using metrics (e.g., accuracy, precision).
 7. Deployment
        Integrating the model into a production environment.
 MCQs
 1. Which is the first step in the Data Science Life Cycle?
    A. Data Collection
    B. Problem Definition
    C. Model Building
    D. Data Visualization
    Answer: B. Problem Definition
 2. What is the main objective of Data Preparation?
    A. Building machine learning models
    B. Cleaning and organizing data for analysis
    C. Testing the model’s accuracy
    D. Deploying the model
    Answer: B. Cleaning and organizing data for analysis
 3. Which process helps identify patterns and trends in data?
    A. Model Deployment
    B. Data Collection
    C. Exploratory Data Analysis (EDA)
    D. Data Cleaning
    Answer: C. Exploratory Data Analysis (EDA)
   4. What step involves splitting data into training and testing datasets?
      A. Model Deployment
      B. Model Evaluation
      C. Model Building
      D. Data Preparation
      Answer: C. Model Building
   5. Which of the following is NOT a model evaluation metric?
      A. Accuracy
      B. Precision
      C. Data Cleaning
      D. Recall
      Answer: C. Data Cleaning
   6. After deploying a model, what is the next step?
      A. Model Building
      B. Data Wrangling
      C. Monitoring and Maintenance
      D. EDA
      Answer: C. Monitoring and Maintenance
   7. What is the purpose of data visualization during EDA?
      A. Build predictive models
      B. Clean raw data
      C. Identify patterns and insights
      D. Test model performance
      Answer: C. Identify patterns and insights
MCQs
1. Volume
   The sheer size of data generated daily, often measured in terabytes or petabytes. Examples include social media
   posts, transaction records, and sensor data.
2. Velocity
   The speed at which data is generated and processed. For instance, real-time data streams from IoT devices or
   financial transactions.
3. Variety
   The diversity of data types, such as structured (databases), semi-structured (XML, JSON), and unstructured (videos,
   images, text).
4. Veracity
   The quality and accuracy of data, which can often be incomplete, noisy, or misleading.
5. Value
   The actionable insights and benefits derived from analyzing Big Data.
4. IoT (Internet of Things): Data from connected devices like sensors, cameras, and wearables.
5. Government & Public Services: Census data, transportation data, and utility records.
1. Storage
2. Processing
Apache Hadoop
Apache Spark
3. Database Systems
 Big Data can be classified into different types based on its nature, source, and structure:
   1. Based on Data Type
Structured Data: Organized in rows and columns, stored in relational databases (e.g., SQL).
Unstructured Data: Lacks a predefined format, difficult to process (e.g., videos, images, social media posts).
Semi-structured Data: Does not follow strict schema but has some organizational properties.
2. Based on Source
Batch Data: Processed in chunks over time (e.g., transaction data at the end of the day).
Stream Data: Real-time data processed as it is generated (e.g., stock market feeds).
4. Based on Domain
A) Structured
B) Unstructured
C) Semi-structured
D) Machine-generated
Answer: C) Semi-structured
C) Audio files
A) Human-generated
B) Semi-structured
C) Machine-generated
D) Batch data
Answer: C) Machine-generated
A) Batch Data
B) Stream Data
C) Structured Data
D) Scientific Data
A) Structured B) Unstructured
C) Semi-structured D) Machine-generated
Answer: C) Semi-structured
Tools like SQL were used to manage and query relational databases.
Social media platforms and IoT devices started producing massive volumes of unstructured data.
Limitations of traditional databases led to the development of new frameworks like Hadoop.
Advancements in cloud computing, distributed systems, and machine learning enhanced Big Data analytics.
Big Data became critical in industries such as finance, healthcare, and retail.
3. 2010s: NoSQL databases and real-time processing tools like Kafka and Spark emerge.
4. 2020s: Integration of Big Data with AI, IoT, and edge computing.
The origin of the data, which can include: Structured data (databases, spreadsheets).
A) MapReduce
B) Apache Spark
C) HDFS
D) MongoDB
A) Centralized storage
B) Distributed storage
C) Local storage
D) Cloud-based storage
4. Which layer in Big Data architecture is responsible for analyzing and visualizing the data?
Sqoop: For importing and exporting data between Hadoop and relational databases.
Flume: Designed to collect, aggregate, and move large amounts of log data.
2. Fault Tolerance: Automatically replicates data across nodes to prevent data loss.
a) To manage relational databases b) To process and store large datasets in a distributed manner
a) HDFS
b) MapReduce
c) SQL Server
d) YARN
a) 1
b) 2
c) 3
d) 4
Answer: c) 3
a) Pig b) Hive
c) Flume d) Sqoop
Answer: b) Hive
 Hadoop Architecture
  The Hadoop architecture is designed to process and store massive datasets efficiently in a distributed and fault-
  tolerant manner. It is based on the Master-Slave architecture and includes the following major components:
Keeps track of where data blocks are stored across the cluster.
NodeManager: Manages individual nodes in the cluster and monitors resource usage.
3. MapReduce Framework
Map Phase: Processes input data and produces intermediate key-value pairs.
  Reduce Phase: Aggregates the output from the Map phase into meaningful results.
 Hadoop Ecosystem
  The Hadoop ecosystem consists of various tools that extend Hadoop's functionality, enabling it to manage, process,
  and analyze diverse data types effectively.
  Key Components:
  1. Hive: SQL-like querying for structured data.
  2. Pig: High-level scripting for data transformation.
  3. HBase: A NoSQL database for real-time data access.
  4. Sqoop: Import/export data between Hadoop and relational databases.
  5. Flume: Collects and moves log data into HDFS.
  6. Spark: An in-memory data processing engine for real-time analytics.
  7. Zookeeper: Coordination and synchronization for distributed systems.
  8. Mahout: Machine learning and recommendation system libraries.
  MCQs
  1. What is the primary function of HDFS in Hadoop?
   a) To process data
   b) To store data in a distributed manner
   c) To query data
   d) To manage resources
  Answer: b) To store data in a distributed manner
  2. What is the responsibility of the NameNode in HDFS?
   a) Store data blocks
   b) Manage metadata and block locations
   c) Schedule jobs in the cluster
   d) Monitor resource usage
  Answer: b) Manage metadata and block locations
  3. Which component is responsible for resource allocation in YARN?
   a) DataNode
   b) ResourceManager
   c) NameNode
   d) NodeManager
  Answer: b) ResourceManager
  4. What are the two phases of a MapReduce job?
   a) Store and Process
   b) Map and Reduce
   c) Fetch and Aggregate
   d) Query and Execute
  Answer: b) Map and Reduce
5. Which tool in the Hadoop ecosystem is used for real-time data processing?
 a) Sqoop
 b) Spark
 c) Pig
 d) Hive
Answer: b) Spark
6. What is the default block size in HDFS for storing data?
 a) 32MB
 b) 64MB
 c) 128MB
 d) 256MB
Answer: c) 128MB
7. Which Hadoop ecosystem component is used for machine learning?
 a) Flume
 b) Mahout
 c) Hive
 d) HBase
Answer: b) Mahout
8. Which of the following tools is used for importing and exporting data in Hadoop?
 a) Hive
 b) Pig
 c) Sqoop
 d) Flume
Answer: c) Sqoop
9. In the Hadoop architecture, which node performs data storage tasks?
 a) NameNode
 b) DataNode
 c) ResourceManager
 d) NodeManager
Answer: b) DataNode
10. What is the role of Zookeeper in the Hadoop ecosystem?
 a) Data storage
 b) Log management
 c) Coordination and synchronization
 d) Machine learning
Answer: c) Coordination and synchronization
   Hadoop Ecosystem Components
    The Hadoop ecosystem includes a variety of tools and frameworks that complement Hadoop's core functionality
    (HDFS, MapReduce, and YARN) to handle diverse big data tasks like storage, processing, analysis, and real-time
    computation.
    1. Core Components
             HDFS (Hadoop Distributed File System): Distributed storage for large datasets.
             MapReduce: Programming model for batch data processing.
             YARN (Yet Another Resource Negotiator): Resource management and task scheduling.
2. Ecosystem Tools
    1. Hive
          Data warehousing and SQL-like querying for large datasets.
          Suitable for structured data.
          Converts SQL-like (HiveQL) queries into MapReduce jobs.
    2. Pig
          High-level scripting language for data transformation.
          Converts Pig scripts into MapReduce tasks.
          Suitable for semi-structured and unstructured data.
    3. HBase
          A distributed NoSQL database for real-time read/write access.
          Built on HDFS, optimized for random access.
    4. Spark
          An in-memory processing engine for fast analytics.
          Supports batch processing, machine learning, graph processing, and stream processing.
    5. Sqoop
          Transfers data between Hadoop and relational databases like MySQL or Oracle.
          Ideal for ETL (Extract, Transform, Load) operations.
    6. Flume
          Collects, aggregates, and moves large amounts of log data into HDFS.
          Best for streaming data.
    7. Zookeeper
          Manages and coordinates distributed systems.
          Ensures synchronization across nodes.
    8. Oozie
          Workflow scheduler for managing Hadoop jobs.
          Executes workflows involving Hive, Pig, and MapReduce tasks.
    9. Mahout
          Provides machine learning algorithms for clustering, classification, and recommendations.
    10. Kafka
          A distributed messaging system for real-time data streams.
          Often used with Spark or Flume.
    MCQs
    1. Which Hadoop tool provides SQL-like querying capabilities?
     a) Pig                  b) Hive
     c) Sqoop                d) Oozie
    Answer: b) Hive
2. What is the purpose of HBase in the Hadoop ecosystem?
 a) To process data in real-time
 b) To provide NoSQL database storage
 c) To collect log data
 d) To manage workflows
Answer: b) To provide NoSQL database storage
3. Which component is used for data import/export between Hadoop and relational databases?
 a) Flume
 b) Sqoop
 c) Spark
 d) Mahout
Answer: b) Sqoop
4. What is the main use of Flume in the Hadoop ecosystem?
 a) To manage workflows
 b) To process structured data
 c) To collect and move log data
 d) To provide SQL queries
Answer: c) To collect and move log data
5. Which of the following is a workflow management tool in Hadoop?
 a) Zookeeper
 b) Oozie
 c) Pig
 d) Hive
Answer: b) Oozie
6. What is the primary function of Spark in the Hadoop ecosystem?
 a) Data storage
 b) Batch processing
 c) In-memory data processing
 d) Real-time log collection
Answer: c) In-memory data processing
7. Which tool is used for machine learning tasks in Hadoop?
 a) Oozie
 b) Mahout
 c) Hive
 d) Sqoop
Answer: b) Mahout
  8. What does Zookeeper provide in the Hadoop ecosystem?
   a) Data synchronization and coordination
   b) SQL-like querying
   c) Real-time data streaming
   d) Workflow scheduling
  Answer: a) Data synchronization and coordination
  9. Which component handles the streaming of real-time data in Hadoop?
   a) Kafka
   b) Pig
   c) Hive
   d) Spark
  Answer: a) Kafka
  10. Which tool in the Hadoop ecosystem is ideal for large-scale graph processing?
   a) Hive
   b) Spark
   c) HBase
   d) Flume
  Answer: b) Spark
 MapReduce Overview
  MapReduce is Hadoop's programming model and framework for processing large datasets in a distributed and
  parallel manner. It consists of two key phases:
1. Map Phase:
       Splits the input data into smaller chunks and processes them independently.
       Converts data into key-value pairs.
2. Reduce Phase:
       Aggregates, filters, or summarizes the intermediate key-value pairs produced by the Map phase.
       Produces the final output.
  Key Features:
       Works with HDFS for fault tolerance and distributed processing.
       Suitable for batch processing of large-scale datasets.
 Workflow of MapReduce
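  1. Input data is split into chunks.
  2. The Map phase processes each chunk independently and emits intermediate key-value pairs.
  3. The Shuffle and Sort phase organizes these key-value pairs by key.
  4. The Reduce phase aggregates the grouped pairs and produces the final output.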
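The Map, Shuffle and Sort, and Reduce phases can be illustrated with a small word count. This is a minimal single-machine sketch in plain Python, not Hadoop's API: the function names (map_phase, shuffle_and_sort, reduce_phase, word_count) and the sample chunks are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit an intermediate (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(pairs):
    """Shuffle and Sort: group intermediate values by key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key into a final count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run the three phases in order over a list of text chunks."""
    intermediate = []
    for chunk in chunks:  # each chunk is mapped independently of the others
        intermediate.extend(map_phase(chunk))
    return reduce_phase(shuffle_and_sort(intermediate))


if __name__ == "__main__":
    chunks = ["big data needs big storage", "hadoop stores big data"]
    print(word_count(chunks))
```

In a real cluster each chunk would be mapped on a different node and the grouped keys would be spread across several reducers; here everything runs in one process so the data flow is easy to follow.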
4. Block Storage
4. Write Once, Read Many: Data is written once and read multiple times, making it ideal for analytics.
HDFS Workflow
1. Data Input: Data is divided into blocks and sent to DataNodes for storage.
3. Fault Tolerance: If a DataNode fails, the system retrieves the data from replicas stored on other nodes.
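The block splitting and replica-based recovery described above can be sketched in a few lines. The 128 MB block size and replication factor of 3 are the defaults quoted in these notes; the round-robin placement, the node names, and the function names are assumptions for illustration (real HDFS uses a rack-aware placement policy).

```python
import math

BLOCK_SIZE_MB = 128      # default block size quoted in these notes
REPLICATION_FACTOR = 3   # default replication factor quoted in these notes


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return how many blocks are needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, datanodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct DataNodes (simple round-robin).

    The returned mapping plays the role of the NameNode's block metadata.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, nodes in placement.items()
        if any(node not in failed_nodes for node in nodes)
    ]


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical DataNode names
    blocks = split_into_blocks(500)               # a 500 MB file
    layout = place_replicas(blocks, nodes)
    print(blocks, "blocks,", blocks * REPLICATION_FACTOR, "replicas in total")
    print(len(readable_blocks(layout, {"dn2"})), "blocks readable after dn2 fails")
```

A 500 MB file needs 4 blocks (the last one is only partly filled), and with three replicas per block the loss of a single DataNode leaves every block readable.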
MCQs
1. What is the primary function of HDFS?
a) DataNode
b) NameNode
c) Secondary NameNode
d) ResourceManager
Answer: b) NameNode
3. What is the default block size in HDFS?
a) 64MB
b) 128MB
c) 256MB
d) 512MB
Answer: b) 128MB
4. What does the Secondary NameNode do?
a) Replaces the NameNode in case of failure
b) Stores actual data blocks
c) Periodically merges and checkpoints metadata
d) Manages task scheduling
Answer: c) Periodically merges and checkpoints metadata
5. How does HDFS ensure fault tolerance?
a) By using expensive hardware
b) By replicating data blocks across multiple nodes
c) By compressing data
d) By storing all data on a single node
Answer: b) By replicating data blocks across multiple nodes
6. What is the default replication factor in HDFS?
a) 1
b) 2
c) 3
d) 4
Answer: c) 3
7. Which of the following is NOT a characteristic of HDFS?
a) Write Once, Read Many
b) Real-time data updates
c) Fault tolerance
d) Distributed storage
Answer: b) Real-time data updates
8. What happens if a DataNode fails in HDFS?
a) The system shuts down.
b) Data is lost permanently.
c) Data is retrieved from replicated blocks on other nodes.
d) The NameNode fails.
Answer: c) Data is retrieved from replicated blocks on other nodes.
9. Which node in HDFS stores actual data?
a) NameNode
b) DataNode
c) Secondary NameNode
d) ResourceManager
Answer: b) DataNode
 YARN (Yet Another Resource Negotiator)
  YARN is a core component of Hadoop responsible for cluster resource management and task scheduling. It enables
  multiple data processing engines like MapReduce, Spark, and others to run on Hadoop, making the system more
  efficient and versatile.
3. ApplicationMaster:
4. Container:
  Features of YARN
  1. Scalability: Efficiently handles large clusters.
4. Flexibility: Supports various processing frameworks like MapReduce, Spark, and Tez.
  YARN Workflow
  1. The client submits an application to the ResourceManager.
3. The ApplicationMaster requests containers from the ResourceManager for task execution.
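The interaction between the ApplicationMaster and the ResourceManager can be pictured with a toy model. The sketch below is not YARN's API: the two classes, their methods, and the node and task names are all hypothetical, and a "container" is reduced to a free slot on a node.

```python
class ResourceManager:
    """Toy ResourceManager: tracks free container slots per node."""

    def __init__(self, slots_per_node):
        # slots_per_node maps a node name to its number of free containers
        self.free = dict(slots_per_node)

    def allocate(self, requested):
        """Grant up to `requested` containers; return the node of each one."""
        granted = []
        for node in sorted(self.free):
            while self.free[node] > 0 and len(granted) < requested:
                self.free[node] -= 1
                granted.append(node)
        return granted

    def release(self, containers):
        """Return finished containers to the pool."""
        for node in containers:
            self.free[node] += 1


class ApplicationMaster:
    """Toy ApplicationMaster: asks for one container per task."""

    def __init__(self, tasks):
        self.tasks = tasks

    def run(self, rm):
        containers = rm.allocate(len(self.tasks))
        # Tasks beyond the granted containers would wait in a real cluster;
        # this sketch only reports which tasks received a container.
        assignments = dict(zip(self.tasks, containers))
        rm.release(containers)
        return assignments


if __name__ == "__main__":
    rm = ResourceManager({"node1": 2, "node2": 1})
    am = ApplicationMaster(["map-0", "map-1", "reduce-0"])
    print(am.run(rm))
```

The point of the model is the division of labour: the ResourceManager only knows about cluster capacity, while the ApplicationMaster knows about the tasks of one application.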
  MCQs
  1. What is the primary function of YARN in Hadoop?
a) Data storage
c) Query execution
d) Metadata management
a) Data blocks
b) Metadata
c) Containers
d) Files
Answer: c) Containers
a) Only MapReduce
b) Only Spark
2. NoSQL Databases
3. Distributed Databases
4. Cloud Databases
5. Object-Oriented Databases
6. Hierarchical Databases
7. Network Databases
  MCQs
  1. Which database is based on a table structure of rows and columns?
a) NoSQL Database
b) Relational Database
c) Hierarchical Database
d) Object-Oriented Database
a) Relational Database
b) Hierarchical Database
a) Hierarchical Database
b) Graph Database
c) Relational Database
d) Object-Oriented Database
a) Relational Database
b) Cloud Database
c) Hierarchical Database
d) Network Database
4. Distributed Architecture: Often built for distributed systems and fault tolerance.
2. Document Stores:
4. Graph Databases:
  MCQs
  1. Which of the following best describes NoSQL databases?
  a) They store data only in relational tables.
  b) They provide support for unstructured or semi-structured data.
  c) They are always slower than relational databases.
  d) They cannot scale horizontally.
  Answer: b) They provide support for unstructured or semi-structured data.
  2. Which type of NoSQL database is optimized for managing relationships between data?
  a) Document Store
  b) Key-Value Store
  c) Graph Database
  d) Column-Family Store
  Answer: c) Graph Database
  3. MongoDB is an example of which type of NoSQL database?
  a) Key-Value Store
  b) Document Store
  c) Column-Family Store
  d) Graph Database
  Answer: b) Document Store
  4. Which of the following is a feature of NoSQL databases?
  a) Fixed schema                           b) Vertical scaling
  c) Support for distributed data           d) Mandatory use of SQL
  Answer: c) Support for distributed data
  5. Cassandra is an example of which type of NoSQL database?
a) Document Store
b) Key-Value Store
c) Graph Database
d) Column-Family Store
Answer: d) Column-Family Store
  Relational databases struggle with the massive data generated by IoT, social media, and web applications. NoSQL
  databases can efficiently process and store large volumes of data.
2. Scalability:
Traditional databases rely on vertical scaling (adding more resources to a single server), which can be expensive.
NoSQL databases support horizontal scaling (adding more servers to a cluster), providing cost-effective scalability.
3. Flexibility:
  Relational databases require a fixed schema. In contrast, NoSQL databases allow schema-less designs, which are
  more adaptable for dynamic and evolving data structures.
  With the rise of unstructured and semi-structured data (e.g., images, videos, JSON), NoSQL databases are better
  equipped to handle such data types.
5. High Performance:
  NoSQL databases are optimized for high-speed read and write operations, making them ideal for applications
  requiring real-time data processing.
6. Distributed Systems:
  Modern applications are often distributed globally, and NoSQL databases are designed for distributed architectures,
  ensuring reliability and fault tolerance.
7. Cost-Effectiveness:
Many NoSQL solutions are open-source and can run on commodity hardware, reducing costs.
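Horizontal scaling, as described under Scalability above, usually means spreading keys across several servers. A minimal sketch of one common approach, hash-based sharding, follows; the server names, key format, and function names are assumptions for illustration and do not describe any particular NoSQL product.

```python
import hashlib


def shard_for(key, servers):
    """Pick a server for a key using a stable hash (simple modulo sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


def distribute(keys, servers):
    """Return a mapping of server name to the keys it would store."""
    layout = {server: [] for server in servers}
    for key in keys:
        layout[shard_for(key, servers)].append(key)
    return layout


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(1000)]     # hypothetical record keys
    for servers in (["s1", "s2", "s3"], ["s1", "s2", "s3", "s4"]):
        layout = distribute(keys, servers)
        print({server: len(stored) for server, stored in layout.items()})
```

Adding a fourth server lowers the load on each of the others, which is the cost advantage over buying a larger single machine. Plain modulo sharding relocates many keys when the server count changes, so production systems often use consistent hashing instead.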
  MCQs
  1. Why are NoSQL databases preferred for handling big data?
c) They can process large volumes of unstructured data.
d) They cannot scale horizontally.
d) Vertical scalability
c) Static scalability
d) Single-node scalability
5. For real-time applications requiring high-speed data access, which type of database is suitable?
  NoSQL databases are designed for horizontal scaling, allowing the addition of more servers to handle increasing data
  and traffic.
2. Flexibility:
Schema-less design enables dynamic changes to data structures without disrupting operations.
Supports structured, semi-structured, and unstructured data such as JSON, XML, videos, and images.
4. High Performance:
Optimized for fast read and write operations, suitable for real-time applications.
  Built for distributed architecture, ensuring fault tolerance and high availability.
6. Cost-Effectiveness:
Many NoSQL solutions are open-source and can run on commodity hardware, reducing costs.
Efficiently processes large volumes of data and provides insights in real time.
8. Easy Integration:
9. No Complex Joins:
10. Cloud-Friendly:
MCQs
1. What is the main advantage of NoSQL databases in terms of scalability?
a) Vertical scaling
b) Horizontal scaling
c) Limited scaling
d) No scaling
Answer: b) Horizontal scaling
  MCQs
  1. Which type of database uses a predefined schema?
  a) SQL
  b) NoSQL
  c) Both SQL and NoSQL
  d) Neither SQL nor NoSQL
  Answer: a) SQL
2. What is a major advantage of NoSQL databases over SQL databases?
a) Strong support for joins
b) Fixed schema
c) Horizontal scalability
d) Limited support for unstructured data
Answer: c) Horizontal scalability
3. Which type of database is better suited for handling structured data?
a) SQL
b) NoSQL
c) Both SQL and NoSQL
d) None of the above
Answer: a) SQL
4. Which of the following is an example of a NoSQL database?
a) MySQL
b) PostgreSQL
c) MongoDB
d) Oracle
Answer: c) MongoDB
5. What makes NoSQL databases more flexible than SQL databases?
a) Predefined schema
b) Schema-less design
c) Complex relationships
d) Dependence on SQL
Answer: b) Schema-less design
6. SQL databases are typically scaled by:
a) Adding more servers (horizontal scaling)
b) Adding more resources to the same server (vertical scaling)
c) Both horizontal and vertical scaling equally
d) Neither horizontal nor vertical scaling
Answer: b) Adding more resources to the same server (vertical scaling)
7. Which type of database is ideal for applications with rapidly changing data models?
a) SQL
b) NoSQL
c) Both SQL and NoSQL
d) None of the above
Answer: b) NoSQL
  8. Which query language is used by SQL databases?
  a) Structured Query Language (SQL)
  b) JSON Query Language (JQL)
  c) NoSQL Query Language
  d) Custom APIs
  Answer: a) Structured Query Language (SQL)
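The predefined-schema versus schema-less contrast tested in the questions above can be seen in a short experiment. In this sketch Python's built-in sqlite3 module stands in for an SQL database, and a plain list of dicts stands in for a document store; the table, field names, and sample records are made up for illustration.

```python
import sqlite3

# SQL side: the table schema is declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Asha"))


def insert_with_extra_column(connection):
    """Try to store a field the schema does not define; return the error text."""
    try:
        connection.execute(
            "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
            (2, "Ravi", "ravi@example.com"),
        )
    except sqlite3.OperationalError as exc:
        return str(exc)
    return None


sql_error = insert_with_extra_column(conn)

# Document side: each record is a free-form dict, as in a document store.
documents = [
    {"_id": 1, "name": "Asha"},
    {"_id": 2, "name": "Ravi", "email": "ravi@example.com"},    # extra field
    {"_id": 3, "name": "Meena", "tags": ["admin", "analyst"]},  # nested list
]


def find(docs, **criteria):
    """Return the documents whose fields match all the given criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


if __name__ == "__main__":
    print("SQL insert rejected:", sql_error)
    print(find(documents, name="Ravi"))
```

The SQL insert fails until the table is altered to add the new column, while the document list accepts records of different shapes. That flexibility is what makes schema-less stores convenient for rapidly changing data models.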
MCQs
1. Which type of NoSQL database stores data in key-value pairs?
a) Document Store               b) Column-Family Store
c) Key-Value Store              d) Graph Database
Answer: c) Key-Value Store
2. MongoDB is an example of which type of NoSQL database?
a) Document Store               b) Key-Value Store
c) Column-Family Store          d) Graph Database
Answer: a) Document Store
3. Which type of NoSQL database is optimized for managing relationships between entities?
a) Document Store               b) Graph Database
c) Column-Family Store          d) Key-Value Store
Answer: b) Graph Database
4. What type of NoSQL database is best suited for analytics and time-series data?
a) Key-Value Store              b) Column-Family Store
c) Graph Database               d) Document Store
Answer: b) Column-Family Store
5. Redis is an example of which type of NoSQL database?
a) Document Store               b) Key-Value Store
c) Column-Family Store          d) Graph Database
Answer: b) Key-Value Store
6. Which type of NoSQL database is most suitable for storing hierarchical data in JSON format?
Answer: b) Neo4j
8. Which NoSQL database type is best for storing large-scale tabular data?
MCQs
Question: What is the primary goal of data analytics?
A) To store large amounts of data
B) To create complex algorithms
C) To extract valuable insights from data
D) To visualize data in graphs and charts
Answer: C) To extract valuable insights from data
1. Which of the following is a key step in the data analytics process?
A) Data collection
B) Data visualization
C) Data cleaning
D) All of the above
Answer: D) All of the above
2. What type of analysis is used to make predictions based on historical data?
A) Descriptive analysis
B) Diagnostic analysis
C) Predictive analysis
D) Prescriptive analysis
Answer: C) Predictive analysis
3. Which of the following tools is commonly used for data visualization?
A) Excel
B) Power BI
C) Tableau
D) All of the above
Answer: D) All of the above
4. Which of these is a type of unstructured data?
A) A database table
B) A text document
C) A CSV file
D) A spreadsheet
Answer: B) A text document
5. What is the purpose of data cleaning in the analytics process?
The use of data analytics spans various fields and industries, enabling organizations to gain valuable insights,
make data-driven decisions, and optimize processes.
Use: Data analytics helps monitor energy usage, optimize supply distribution, and improve sustainability practices.
   Example: Smart meters in homes use analytics to track energy consumption, helping users optimize usage and
   reduce costs.
  MCQs
  1. What is the first step in the Data Analytics Life Cycle?
  A) Data Collection                B) Problem Definition
  C) Model Building                 D) Data Cleaning and Preprocessing
  Answer: B) Problem Definition
  2. In which stage of the Data Analytics Life Cycle is data cleaned and transformed for analysis?
  A) Model Evaluation               B) Exploratory Data Analysis
  C) Data Collection                D) Data Cleaning and Preprocessing
  Answer: D) Data Cleaning and Preprocessing
  3. Which of the following is the primary goal of Exploratory Data Analysis (EDA)?
  A) To collect data from external sources                      B) To build a predictive model
  C) To visualize and summarize the data to find patterns and insights
  D) To deploy the model into production
  Answer: C) To visualize and summarize the data to find patterns and insights
  4. At which stage of the Data Analytics Life Cycle is the model’s performance evaluated?
  A) Data Collection                B) Model Building           C) Model Evaluation               D) Problem Definition
  Answer: C) Model Evaluation
  5. Which of the following describes the deployment and monitoring stage of the Data Analytics Life Cycle?
  A) Applying models to historical data                                 B) Collecting and cleaning data
  C) Putting the model into use and tracking its performance            D) Visualizing the data
  Answer: C) Putting the model into use and tracking its performance
  6. What is the purpose of the "Problem Definition" stage in the Data Analytics Life Cycle?
  A) To collect data from different sources          B) To identify the specific questions to answer and goals to achieve
  C) To build a predictive model                     D) To evaluate the model's accuracy
  Answer: B) To identify the specific questions to answer and goals to achieve
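Two of the stages named in the questions above, Data Cleaning and Preprocessing and Exploratory Data Analysis, can be sketched on a tiny dataset. The records, field names, and cleaning rules below are assumptions chosen for illustration; real projects decide such rules during problem definition.

```python
from statistics import mean


def clean(records):
    """Data cleaning: drop rows with missing values, normalize, remove duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        if row.get("city") is None or row.get("sales") is None:
            continue  # missing value
        normalized = (row["city"].strip().title(), float(row["sales"]))
        if normalized in seen:
            continue  # duplicate after normalization
        seen.add(normalized)
        cleaned.append({"city": normalized[0], "sales": normalized[1]})
    return cleaned


def summarize(cleaned):
    """Exploratory summary: average sales per city."""
    by_city = {}
    for row in cleaned:
        by_city.setdefault(row["city"], []).append(row["sales"])
    return {city: mean(values) for city, values in by_city.items()}


if __name__ == "__main__":
    raw = [
        {"city": " pune ", "sales": "120"},
        {"city": "Pune", "sales": 120},     # duplicate once normalized
        {"city": "Delhi", "sales": None},   # missing value
        {"city": "delhi", "sales": 200},
        {"city": "Pune", "sales": 80},
    ]
    print(summarize(clean(raw)))
```

Cleaning comes first because the summary would otherwise count the duplicated Pune row twice and fail on the missing Delhi value.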
 Types of Analytics:
  1. Descriptive Analytics
       Focuses on understanding historical data and summarizing what has happened. It answers questions like
        "What happened?" through methods like data aggregation and visualization.
  2. Diagnostic Analytics
       Goes a step further than descriptive analytics by investigating the reasons behind past outcomes. It answers
        "Why did it happen?" through techniques like correlation analysis and root cause analysis.
  3. Predictive Analytics
       Uses statistical models and machine learning techniques to forecast future outcomes. It answers "What
        could happen?" by analyzing patterns in historical data to make predictions.
4. Prescriptive Analytics
     Recommends actions to achieve desired outcomes by using optimization and simulation techniques. It
      answers "What should we do?" to optimize decisions and strategies.
5. Cognitive Analytics
     Involves advanced AI techniques that simulate human thought processes. It focuses on improving decision-
      making by learning from experience, often using natural language processing (NLP) and machine learning.
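The first four types can be contrasted on one small dataset. The sketch below uses made-up advertising spend and sales figures, and the function names and the "sales minus spend" objective are assumptions for illustration; cognitive analytics is left out because it does not reduce to a few lines.

```python
from math import sqrt


def describe(values):
    """Descriptive: summarize what happened (here, the average)."""
    return sum(values) / len(values)


def correlation(xs, ys):
    """Diagnostic: Pearson correlation, one clue to why values moved together."""
    mx, my = describe(xs), describe(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Predictive: least-squares line y = slope * x + intercept."""
    mx, my = describe(xs), describe(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def recommend(candidates, slope, intercept):
    """Prescriptive: pick the spend with the highest predicted sales minus spend."""
    return max(candidates, key=lambda spend: slope * spend + intercept - spend)


if __name__ == "__main__":
    spend = [10, 20, 30, 40, 50]     # hypothetical monthly ad spend
    sales = [30, 50, 70, 90, 110]    # hypothetical monthly sales
    slope, intercept = fit_line(spend, sales)
    print("What happened?      average sales =", describe(sales))
    print("Why did it happen?  correlation   =", correlation(spend, sales))
    print("What could happen?  sales at 60   =", slope * 60 + intercept)
    print("What should we do?  best spend    =", recommend([20, 40, 60], slope, intercept))
```

With this data the average is 70.0, the correlation is 1.0, the forecast for a spend of 60 is 130.0, and the recommended spend among the three candidates is 60. Each step builds on the previous one, which is why the four types are usually presented in this order.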
MCQs
1. Which type of analytics answers the question, "What happened?"
A) Predictive Analytics           B) Prescriptive Analytics
C) Descriptive Analytics          D) Cognitive Analytics
Answer: C) Descriptive Analytics
2. What is the primary goal of diagnostic analytics?
A) To summarize past events                        B) To predict future outcomes
C) To understand why something happened            D) To recommend the best course of action
Answer: C) To understand why something happened
3. Which type of analytics is used to predict future outcomes based on historical data?
A) Prescriptive Analytics                 B) Cognitive Analytics
C) Predictive Analytics                   D) Descriptive Analytics
Answer: C) Predictive Analytics
4. Which of the following is a feature of prescriptive analytics?
A) It predicts future trends based on past data.              B) It answers why something occurred in the past.
C) It recommends the best course of action to optimize outcomes.
D) It visualizes and summarizes historical data.
Answer: C) It recommends the best course of action to optimize outcomes.
5. Which type of analytics involves using AI to simulate human thought processes?
A) Predictive Analytics           B) Descriptive Analytics           C) Cognitive Analytics     D) Diagnostic Analytics
Answer: C) Cognitive Analytics
6. What is the key difference between descriptive and diagnostic analytics?
A) Descriptive analytics focuses on predicting future events, while diagnostic analytics focuses on past events.
B) Descriptive analytics summarizes historical data, while diagnostic analytics explores the reasons behind past
events.
C) Descriptive analytics makes recommendations, while diagnostic analytics predicts outcomes.
D) Descriptive analytics uses AI, while diagnostic analytics does not.
Answer: B) Descriptive analytics summarizes historical data, while diagnostic analytics explores the reasons behind
past events.