Q.1 What is Big Data? Write its sources and importance?
Answer:
   Big Data:
          It is defined as “The process of examining large amounts of data, from a variety
   of data sources and in different formats, to deliver insights that can enable decisions in
   real or near real time”.
   Sources of Big Data:
   1) Social Networking - Facebook, Twitter, Instagram, Google+, etc.
   2) Sensors - Used in aircraft, cars, industrial machines, space technology, CCTV
      footage, etc.
   3) Data created from Transportation Services - Aviation, Railways, Shipping, etc.
   4) Online Shopping Portals - Amazon, Flipkart, Snapdeal, Alibaba, etc.
   5) Mobile Applications - WhatsApp, Google Hangouts, Hike, etc.
   6) Data created by Different Firms - Educational Institutes, Banks, Hospitals, Companies,
      etc.
   The Importance of Big Data:
                The real issue is not that you are acquiring large amounts of data. It's what you
      do with the data that counts. The hopeful vision is that organizations will be able to
      take data from any source, harness relevant data and analyze it to find answers that
      enable:
      •   Cost reductions
      •   Time reductions
      •   New product development and optimized offerings
      •   Smarter business decision making
Q.2 What are applications of Big Data?
Answer:
      Applications:
      1) Science
             Databases from astronomy, genomics, environmental data, transportation
      data, …
      2) Humanities and Social Sciences
             Scanned books, historical documents, social interactions data, new technology
      like GPS …
      3) Business & Commerce
             Corporate sales, stock market transactions, census, airline traffic, …
      4) Entertainment
          Internet images, Hollywood movies, MP3 files, …
      5) Medicine
          MRI & CT scans, patient records, …
Q.3 How does Big Data work?
Answer:
          The fundamental idea behind big data is that the more you know about
      anything, the more insight you can gain to make a decision or find a
      solution. To that end, you need to understand how big data works and the three
      fundamental activities behind it:
      1. Integration
      2. Management
      3. Analysis
          Data analysts, data scientists, predictive modelers, statisticians and other analytics
      professionals collect, process, clean and analyze growing volumes of structured
      transaction data as well as other forms of data not used by conventional BI and
      analytics programs.
    Here is an overview of the four steps of the big data analytics process:
1. Data professionals collect data from a variety of different sources. Often, it is a mix
    of semi-structured and unstructured data. While each organization will use different
    data streams, some common sources include:
   •  internet clickstream data;
   •  web server logs;
   •  cloud applications;
   •  mobile applications;
   •  social media content;
   •  text from customer emails and survey responses;
   •  mobile phone records; and
   •  machine data captured by sensors connected to the internet of things (IoT).
2. Data is prepared and processed. After data is collected and stored in a data warehouse
    or data lake, data professionals must organize, configure and partition the data
    properly for analytical queries. Thorough data preparation and processing makes for
    higher performance from analytical queries.
3. Data is cleansed to improve its quality. Data professionals scrub the data using
    scripting tools or data quality software. They look for any errors or inconsistencies,
    such as duplications or formatting mistakes, and organize and tidy up the data.
4. The collected, processed and cleaned data is analyzed with analytics software. This
    includes tools for:
   •  data mining, which sifts through data sets in search of patterns and relationships
   •  predictive analytics, which builds models to forecast customer behavior and other
      future actions, scenarios and trends
   •  machine learning, which taps various algorithms to analyze large data sets
   •  deep learning, which is a more advanced offshoot of machine learning
   •  text mining and statistical analysis software
   •  artificial intelligence (AI)
   •  mainstream business intelligence software
   •  data visualization tools
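The four steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the record fields (`user`, `page`, `ms`) and all sample values are hypothetical and stand in for collected clickstream data.

```python
from collections import Counter

# Step 1: collect records from two hypothetical sources.
web_logs = [
    {"user": "u1", "page": "/home", "ms": "120"},
    {"user": "u2", "page": "/cart", "ms": "340"},
]
mobile_logs = [
    {"user": "u2", "page": "/cart", "ms": "340"},  # duplicate of a web record
    {"user": "u3", "page": "/HOME ", "ms": "95"},  # formatting mistake
    {"user": "u4", "page": "/cart", "ms": ""},     # missing value
]
collected = web_logs + mobile_logs

# Step 2: prepare the data by converting text fields to usable types.
prepared = [
    {"user": r["user"], "page": r["page"], "ms": int(r["ms"]) if r["ms"] else None}
    for r in collected
]

# Step 3: cleanse by fixing formatting, then dropping gaps and duplicates.
cleaned, seen = [], set()
for r in prepared:
    page = r["page"].strip().lower()
    key = (r["user"], page, r["ms"])
    if r["ms"] is None or key in seen:
        continue
    seen.add(key)
    cleaned.append({"user": r["user"], "page": page, "ms": r["ms"]})

# Step 4: analyze by counting views per page.
views_per_page = Counter(r["page"] for r in cleaned)
print(views_per_page)
```

In practice each step would use dedicated tools (a data lake, data quality software, analytics packages); the sketch only shows the order of the steps and what each one contributes.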
Q.4 Explain big data analysis. What is analytics?
   Answer:
   Big Data Analysis:
       Big data analytics is the use of advanced analytic techniques against very large,
   diverse big data sets that include structured, semi-structured and unstructured data, from
   different sources, and in different sizes from terabytes to zettabytes.
    Analytics:
       Analytics is the process of taking data and applying mathematical and statistical
   algorithms or tools to build a model.
       This model can be predictive or exploratory. It yields information that gives us
   insights, and those insights allow us to take action.
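As a small illustration of "data in, model out", the sketch below fits a straight line by ordinary least squares and uses it to make a prediction. The data (advertising spend against sales) and the helper name `fit_line` are hypothetical and chosen only for this example.

```python
# Hypothetical data: advertising spend (x) and resulting sales (y).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(spend, sales)  # build the model from the data
forecast = slope * 6.0 + intercept         # use the model to predict a new value
print(slope, intercept, forecast)
```

The fitted slope and intercept are the "information" in the model; the forecast is the insight that could drive an action such as setting next month's budget.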
Q.5 What are the main types of big data?
 Answer:
   Types of Big Data:
   1. Structured Data
   2. Unstructured Data
   3. Semi-Structured Data
   4. Subtypes of Data
   5. Interacting with Data Through Programming
   6. Advantages of Big Data
1. Structured Data
       Any data that can be stored, accessed and processed in a fixed format is called
structured data. Over time, computer science has made considerable progress in developing
techniques for working with this kind of data and deriving value from it. Nowadays, however,
problems arise when the size of such data grows very large, with typical sizes in the range
of multiple zettabytes.
        Structured data is the most straightforward type of big data to work with. It is
highly organized, with dimensions defined by set parameters.
It’s all your quantitative data:
    1. Address
    2. Debit/credit card numbers
    3. Age
    4. Expenses
    5. Contact
    6. Billing
Example:
An ‘Employee’ table in a database is an example of structured data.
 Employee_ID        Employee_Name            Gender        Department       Salary_In_Lacs
 1865               Meg Lanning              Female        HR               6,30,000
 2145               Virat Kohli              Male          Finance          6,30,000
 4500               Ellyse Perry             Female        HR               4,00,000
 5475               Alyssa Healy             Female        HR               4,00,000
 6570               Rohit Sharma             Male          Finance          5,30,000
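Because structured data has a fixed format, it can be loaded into a relational table and queried directly. The sketch below does this for the Employee table using Python's built-in `sqlite3` module. It assumes the salary figures are plain rupee amounts (6,30,000 written as 630000), and the column name `Salary` is a simplification made for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    "Employee_ID INTEGER PRIMARY KEY, Employee_Name TEXT, "
    "Gender TEXT, Department TEXT, Salary INTEGER)"
)

# Rows copied from the table above; salaries assumed to be rupee amounts.
rows = [
    (1865, "Meg Lanning", "Female", "HR", 630000),
    (2145, "Virat Kohli", "Male", "Finance", 630000),
    (4500, "Ellyse Perry", "Female", "HR", 400000),
    (5475, "Alyssa Healy", "Female", "HR", 400000),
    (6570, "Rohit Sharma", "Male", "Finance", 530000),
]
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", rows)

# The fixed schema makes an aggregate query a one-liner.
avg_by_dept = dict(
    conn.execute(
        "SELECT Department, AVG(Salary) FROM Employee GROUP BY Department"
    ).fetchall()
)
print(avg_by_dept)
conn.close()
```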
2. Unstructured Data:
        Any data whose form or structure is unknown is classified as unstructured data.
This type of big data includes files such as images, audio, logs, and video. Because its
size is huge, unstructured data poses several challenges when it comes to processing it
to derive value.
An example is a heterogeneous data source that contains a mix of images, videos, and
text files. Many organizations have a large amount of data available to them. However, they
do not know how to derive value from it, since the data is in its raw form.
Example:
The output returned by ‘Google Search.’
3. Semi-Structured Data
    Semi-structured data is a type of big data that contains elements of both formats
mentioned above, that is, structured and unstructured data. To be precise, it refers to
data that is not organized under a particular database schema, yet contains tags or
markers that separate the individual elements within the data. This completes the three
main types of big data.
Semi-structured data example:
Personal data stored in an XML file.
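To make the XML example concrete, the sketch below parses a small XML document with Python's standard `xml.etree.ElementTree` module. The element names and the personal details are hypothetical. Note that the two records carry different fields: the tags separate the elements, but no fixed schema is enforced.

```python
import xml.etree.ElementTree as ET

# Hypothetical personal data; each <person> may carry different fields.
xml_text = """
<people>
  <person id="1"><name>Asha</name><age>29</age></person>
  <person id="2"><name>Ravi</name><email>ravi@example.com</email></person>
</people>
"""

root = ET.fromstring(xml_text)
people = []
for person in root.findall("person"):
    record = {"id": person.get("id")}  # attribute on the tag
    for field in person:               # child elements, whatever they are
        record[field.tag] = field.text
    people.append(record)
print(people)
```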
4. Subtypes of Data
    Although they are not formally considered types of big data, there are subtypes of data
that hold some relevance to the field of analytics. Often, these refer to the origin of the
data, for example, social media, machine, geospatial or event-triggered. These subtypes can
also refer to access levels: linked, lost/dark or open.
5. Interacting with Data Through Programming
    Different programming languages accomplish different things when working with data.
Three major languages are in common use:
    1. Scala: Growing in popularity, Scala is a JVM-based language. It was used to develop
        several Apache products, including Spark, a major player in the big data platform
        market.
    2. R: For more sophisticated analysis and specific modeling, R is the language of choice.
        It is one of the top languages available for data manipulation and can be used at
        every step of an analysis, all the way through to visualization.
    3. Python: It is an open-source language and is considered one of the simplest to
        learn. It uses concise syntax and abstraction.
6. Advantages of Big Data
      • Predictive analysis is a major benefit of Big Data. Analytics of Big Data help
        businesses make better decisions, while simultaneously maximizing operational
        efficiency and reducing risk.
      • With the help of Big Data analytics tools, businesses around the world are improving
        their digital marketing strategies by utilising and processing data from social media
        platforms. Insights from Big Data allow companies to improve their products and
        services based on customer pain points.
      • Big Data combines data from multiple sources to produce actionable insights.
        Companies can save time and money by using analytics tools to filter out redundant
        data.
      • Using Big Data analytics, companies could increase their revenue by generating more
        sales leads. Many businesses are turning to it to learn how well their products and
        services are performing on the market and how their customers are responding. In this
        way, they can make informed decisions about where to invest their time and
        resources.
Q.6 How is big data analysis useful? What is an example of big data analysis?
Answer:
       Big Data Analysis: Big data analytics helps organizations harness their data and use it
       to identify new opportunities. That, in turn, leads to smarter business moves, more
       efficient operations, higher profits and happier customers.
       Examples: Two conspicuous examples are Amazon Prime, which uses Big Data
       analytics to recommend programming for individual users, and Spotify, which does
       the same to offer personalized music suggestions.
Q.7 How is big data analytics transforming agriculture?
Answer:
   1. Boosting productivity – Data collected from GPS-equipped tractors, soil sensors, and
       other external sources has helped in better management of seeds, pesticides, and
       fertilizers while increasing productivity to feed the ever-increasing global population.
   2. Access to plant genome information – This has allowed the development of useful
       agronomic traits.
   3. Predicting yields – Mathematical models and machine learning are used to collate and
       analyze data obtained from yield, chemicals, weather, and biomass index. The use of
       sensors for data collection reduces erroneous manual work and provides useful
       insights on yield prediction.
   4. Risk management – Data-driven farming has mitigated crop failures arising from
       changing weather patterns.
   5. Food safety – Collecting data on temperature, humidity, and chemicals
       lowers the risk of food spoilage through early detection of microbes and other
       contaminants.
Q.8 What is an attribute? What are the types of attributes? Explain them.
Answer:
  Attribute: An attribute is a data item that appears as a property of a data entity.
       Attributes may be classified into two main types depending on their domain
               1) Numeric Attributes
               2) Categorical Attributes
1) Numeric Attributes:
    • A numeric attribute has a real-valued or integer-valued domain.
    • Numeric attributes that take on a finite or countably infinite set of values are called
       discrete, whereas those that can take on any real value are called continuous.
    • As a special case of discrete, if an attribute has as its domain the set {0,1}, it is called
       a binary attribute.
      Interval-scaled:
                 For these kinds of attributes only differences (addition or subtraction) make
       sense.
                 For example, the attribute temperature measured in °C or °F is interval-scaled.
       If it is 20 °C on one day and 10 °C on the following day, it is meaningful to talk about
       a temperature drop of 10 °C, but it is not meaningful to say that it is twice as cold as
       the previous day.
      Ratio-scaled:
                 Here one can compute both differences as well as ratios between values.
                 For example, for attribute Age, we can say that someone who is 20 years old is
       twice as old as someone who is 10 years old.
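       A quick calculation shows why ratios fail for interval-scaled attributes. The sketch below
       re-expresses the same two temperatures in kelvin (an illustrative choice of a second scale,
       not taken from the text above): the difference stays the same, but the ratio changes, so
       "twice as warm" depends on the unit chosen. Age has a true zero, so its ratio is meaningful.

```python
def celsius_to_kelvin(c):
    """Convert a temperature from degrees Celsius to kelvin."""
    return c + 273.15

day1_c, day2_c = 20.0, 10.0

# Differences agree across scales: a drop of 10 degrees either way.
diff_c = day1_c - day2_c
diff_k = celsius_to_kelvin(day1_c) - celsius_to_kelvin(day2_c)

# Ratios do not agree: 2.0 in Celsius, but only about 1.035 in kelvin.
ratio_c = day1_c / day2_c
ratio_k = celsius_to_kelvin(day1_c) / celsius_to_kelvin(day2_c)

# Age is ratio-scaled: 20 years really is twice 10 years.
age_ratio = 20 / 10

print(diff_c, round(diff_k, 6), ratio_c, round(ratio_k, 3), age_ratio)
```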
   2) Categorical Attributes
                 A categorical attribute is one that has a set-valued domain composed of a set
       of symbols. For example, Sex and Education could be categorical attributes with their
       domains given as
                                domain (Sex) = {M, F}
                                domain (Education) = {High School, BS, MS, PhD}
                 It has two types: A) Nominal         B) Ordinal
A) Nominal:
    • The attribute values in the domain are unordered, and thus only equality comparisons
       are meaningful.
    • We can check only whether the value of the attribute for two given instances is the
       same or not.
    • Example: Sex is a nominal attribute.
B) Ordinal:
    • The attribute values are ordered, and thus both equality comparisons and inequality
       comparisons are allowed, though it may not be possible to quantify the difference
       between values.
    • For example, Education is an ordinal attribute because its domain values are ordered
       by increasing educational qualification.
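The difference between nominal and ordinal attributes can be sketched in code. The domains below are the ones given above; the helper names (`same_category`, `education_rank`, `is_lower`) are hypothetical and exist only for this illustration.

```python
# Education domain from the text, listed in increasing order of qualification.
EDUCATION_ORDER = ["High School", "BS", "MS", "PhD"]

def same_category(a, b):
    """Nominal comparison: only equality is meaningful."""
    return a == b

def education_rank(level):
    """Position of a level within the ordered Education domain."""
    return EDUCATION_ORDER.index(level)

def is_lower(a, b):
    """Ordinal comparison: True if level a ranks below level b."""
    return education_rank(a) < education_rank(b)

print(same_category("M", "F"))   # nominal: are two Sex values equal?
print(is_lower("BS", "PhD"))     # ordinal: order comparison is allowed
print(sorted(["PhD", "High School", "MS"], key=education_rank))
```

The rank is used only to order values. Subtracting two ranks would not give a meaningful "distance" between qualifications, which is exactly the limitation of ordinal attributes.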
Q.9 Write a short note on the probabilistic view of data.
Answer:
   1. The probabilistic view of the data assumes that each numeric attribute X is a random
      variable, defined as a function that assigns a real number to each outcome of an
      experiment (i.e., some process of observation or measurement).
   2. Formally, X is a function X : O → R, where O, the domain of X, is the set of all
      possible outcomes of the experiment, also called the sample space, and R, the range of
      X, is the set of real numbers.
   3. If the outcomes are numeric, and represent the observed values of the random
      variable, then X: O → O is simply the identity function: X(v) = v for all v ∈ O.
   4. A random variable X is called a discrete random variable if it takes on only a finite or
      countably infinite number of values in its range.
   5. In contrast, X is called a continuous random variable if it can take on any value in
      its range.
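A small worked example may help. The sketch below assumes an experiment of tossing a fair coin twice (an illustrative choice, not from the text above), defines the random variable X as the number of heads, and tabulates the probability of each value. X is discrete because it takes only the values 0, 1 and 2.

```python
from fractions import Fraction
from itertools import product

# Sample space O: all outcomes of two coin tosses, e.g. ('H', 'T').
sample_space = list(product("HT", repeat=2))

def X(outcome):
    """Random variable: maps an outcome to the number of heads in it."""
    return outcome.count("H")

# P(X = x) for each value x, assuming all four outcomes are equally likely.
pmf = {}
for outcome in sample_space:
    x = X(outcome)
    pmf[x] = pmf.get(x, 0) + Fraction(1, len(sample_space))

print(pmf)
```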
Q.10 What is the Cumulative Distribution Function?
Answer:
               For any random variable X, whether discrete or continuous, we can define the
      cumulative distribution function (CDF) F : R → [0,1], which gives the probability of
      observing a value at most some given value x:
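                              $F(x) = P(X \le x)$ for all $-\infty < x < \infty$
               When X is discrete, F is given as
                              $F(x) = P(X \le x) = \sum_{u \le x} f(u)$
               where f is the probability mass function of X,
               and when X is continuous, F is given as
                              $F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du$
               where f is the probability density function of X.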
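For the discrete case, the CDF is simply a running sum of probabilities. The sketch below assumes a fair six-sided die as the random variable (an illustrative choice) and uses exact fractions so the results are easy to check by hand.

```python
from fractions import Fraction

# Probability mass function f(u) of a fair die: each face has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): sum f(u) over all values u with u <= x."""
    return sum(p for u, p in pmf.items() if u <= x)

print(cdf(0), cdf(3), cdf(2.5), cdf(6))
```

F is defined for every real x, not only for the values X can take: cdf(2.5) equals cdf(2), and F stays flat between the jumps at 1, 2, ..., 6.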
Unit I
Short Answer Questions (30)
1. Define data mining and its importance.
2. Explain the algebraic view of data.
3. Describe the geometric view of data.
4. What is the probabilistic view of data?
5. Define CRISP-DM and its phases.
6. What is business understanding in CRISP-DM?
7. Explain data understanding in CRISP-DM.
8. What is data preparation in CRISP-DM?
9. Describe the modeling phase in CRISP-DM.
10. What is evaluation in CRISP-DM?
11. Define data warehouse and data mart.
12. Explain the difference between database and data warehouse.
13. What is data scrubbing?
14. Define data privacy and security.
15. Explain the importance of data quality.
16. What are the types of data?
17. Define qualitative and quantitative data.
18. Explain temporal data.
19. What is data collation?
20. Define data set and its types.
21. Explain the purposes of data mining.
22. What are the intents of data mining?
23. Define limitations of data mining.
24. Explain organizational understanding in CRISP-DM.
25. What is data understanding in CRISP-DM?
26. Define data preparation in CRISP-DM.
27. Explain data transformation.
28. What is data reduction?
29. Define data selection.
30. Explain data visualization.
[5:11 pm, 20/7/2024] Er.Ganesh Kahar:
1. *CRISP-DM stands for:*
 - A) Comprehensive Review of Industry Standard Process for Data Mining
 - B) Cross-Industry Standard Process for Data Mining
 - C) Critical Review of Integrated Standard Process for Data Management
 - D) Comprehensive Resource for Industrial Data Mining
 *Answer: B*
2. *Which phase of CRISP-DM focuses on analyzing the data to discover
initial insights?*
 - A) Data Preparation
 - B) Data Understanding
 - C) Business Understanding
 - D) Modeling
 *Answer: B*
3. *The primary goal of the Business Understanding phase in CRISP-DM is
to:*
 - A) Evaluate model performance
 - B) Understand and define the project objectives
 - C) Clean and transform data
 - D) Select modeling techniques
 *Answer: B*
4. *Data preparation includes:*
 - A) Data collection and analysis
 - B) Data cleaning and transformation
 - C) Data modeling
 - D) Data visualization
 *Answer: B*
5. *Which phase involves selecting and applying various modeling
techniques in CRISP-DM?*
 - A) Data Preparation
 - B) Modeling
 - C) Business Understanding
 - D) Evaluation
 *Answer: B*
6. *Data Understanding primarily involves:*
 - A) Setting business objectives
 - B) Exploring and understanding data characteristics
 - C) Developing a predictive model
 - D) Data cleaning and transformation
 *Answer: B*
7. *The Evaluation phase in CRISP-DM focuses on:*
 - A) Building the data model
 - B) Understanding the business context
 - C) Assessing the model results and performance
 - D) Data preparation and cleaning
 *Answer: C*
8. *Which of the following is NOT a component of CRISP-DM?*
 - A) Data Preparation
 - B) Data Collection
 - C) Modeling
 - D) Evaluation
 *Answer: B*
9. *In data mining, discovering patterns and knowledge from data typically
involves:*
 - A) Data modeling
 - B) Data collection
 - C) Data warehousing
 - D) Data cleaning
 *Answer: A*
10. *One limitation of data mining is:*
  - A) It always ensures data accuracy
  - B) It requires large amounts of data
  - C) It processes only unstructured data
  - D) It guarantees data security
  *Answer: B*
11. *Data mining is used for:*
  - A) Generating data backups
  - B) Predicting future trends and patterns
  - C) Creating data warehouses
  - D) Designing database schemas
  *Answer: B*
12. *Data mining techniques are primarily aimed at:*
  - A) Reducing data redundancy
  - B) Discovering hidden patterns in data
  - C) Increasing data storage capacity
  - D) Securing data from unauthorized access
  *Answer: B*
13. *A data warehouse is typically used for:*
  - A) Real-time transaction processing
  - B) Storing large volumes of historical data
  - C) Managing operational databases
  - D) Conducting real-time analytics
  *Answer: B*
14. *A data mart is:*
  - A) A type of operational database
  - B) A subset of a data warehouse focused on a specific area
  - C) A tool for data visualization
  - D) A method for real-time data collection
  *Answer: B*
15. *A dataset is:*
  - A) A database management system
  - B) A collection of related data points
  - C) A type of data warehouse
  - D) A tool for data mining
  *Answer: B*
16. *In a database, data is typically organized into:*
  - A) Files and folders
  - B) Tables with rows and columns
  - C) Document formats
  - D) Unstructured text files
  *Answer: B*
17. *Categorical data represents:*
  - A) Numeric values
  - B) Quantitative measurements
  - C) Qualitative attributes
  - D) Time-based sequences
  *Answer: C*
18. *Time-series data is characterized by:*
  - A) Random data without a sequence
  - B) Data organized according to time intervals
  - C) Data grouped into categories
  - D) Data with no specific order
  *Answer: B*
19. *Ordinal data provides:*
  - A) Exact numerical values
  - B) Data organized into categories without a meaningful order
  - C) A ranking or order of categories
  - D) Data with no inherent order
  *Answer: C*
20. *Ratio data differs from interval data in that it:*
  - A) Lacks a true zero point
  - B) Includes a meaningful zero point
  - C) Measures only categorical attributes
  - D) Is not used for statistical analysis
  *Answer: B*
21. *Data privacy primarily concerns:*
  - A) Improving data processing speed
  - B) Protecting personal and sensitive information
  - C) Enhancing data storage capacity
  - D) Increasing data redundancy
  *Answer: B*
22. *A common method to secure data is:*
  - A) Data normalization
  - B) Data encryption
  - C) Data aggregation
  - D) Data transformation
  *Answer: B*
23. *Data anonymization is used to:*
  - A) Increase data accuracy
  - B) Protect individual identities in datasets
  - C) Improve data storage efficiency
  - D) Generate new data records
  *Answer: B*
24. *GDPR is a regulation focused on:*
  - A) Data processing in the US
  - B) Data protection and privacy in the EU
  - C) Data warehousing in Asia
  - D) Data mining techniques in Canada
  *Answer: B*
25. *Data preparation tasks include:*
  - A) Building predictive models
  - B) Cleaning and transforming data
  - C) Defining business goals
  - D) Evaluating model results
  *Answer: B*
26. *Data scrubbing involves:*
  - A) Creating data models
  - B) Removing or correcting errors and inconsistencies
  - C) Generating new data
  - D) Encrypting sensitive information
  *Answer: B*
27. *Data collation refers to:*
  - A) Combining data from multiple sources
  - B) Analyzing data trends
  - C) Cleaning and transforming data
  - D) Securing data from unauthorized access
  *Answer: A*
28. *Data cleaning is the process of:*
  - A) Combining data from various sources
  - B) Removing inaccuracies and inconsistencies in data
  - C) Designing database schemas
  - D) Generating new data records
  *Answer: B*
29. *Data normalization is aimed at:*
  - A) Reducing data redundancy
  - B) Scaling data to a specific range
  - C) Aggregating data from multiple sources
  - D) Encrypting data
  *Answer: B*
30. *Data validation ensures:*
  - A) Data is accurate and consistent
  - B) Data is properly indexed
  - C) Data is backed up regularly
  - D) Data is encrypted
  *Answer: A*
31. *Data integration involves:*
  - A) Combining data from various sources into a unified dataset
  - B) Generating new business rules
  - C) Securing data from unauthorized access
  - D) Creating predictive models
  *Answer: A*
32. *Data mining aims to:*
  - A) Increase data redundancy
  - B) Discover patterns and relationships in data
  - C) Generate raw data
  - D) Design database structures
  *Answer: B*
33. *The primary purpose of a data warehouse is to:*
  - A) Process real-time transactions
  - B) Store large volumes of historical data
  - C) Manage operational databases
  - D) Facilitate real-time analytics
  *Answer: B*
34. *A data mart is different from a data warehouse in that it:*
  - A) Is larger in scale
  - B) Focuses on a specific business area
  - C) Stores real-time data
  - D) Manages operational data
  *Answer: B*
35. *Data privacy measures are designed to:*
  - A) Increase data processing speed
  - B) Protect sensitive and personal information
  - C) Enhance data storage capacity
  - D) Improve data redundancy
  *Answer: B*
36. *Data encryption is used to:*
  - A) Increase data volume
  - B) Secure data from unauthorized access
  - C) Normalize data
  - D) Cleanse data
  *Answer: B*
37. *Which of the following is NOT a step in data preparation?*
  - A) Data cleaning
  - B) Data transformation
  - C) Data visualization
  - D) Data integration
  *Answer: C*
38. *Data transformation is intended to:*
  - A) Convert data into a suitable format for analysis
  - B) Generate new data records
  - C) Secure data from unauthorized access
  - D) Aggregate data from different sources
  *Answer: A*
39. *The process of combining data from multiple sources to create a
unified dataset is known as:*
  - A) Data collation
  - B) Data encryption
  - C) Data validation
  - D) Data anonymization
  *Answer: A*
40. *In data mining, evaluating the results involves:*
  - A) Building the data model
  - B) Assessing the effectiveness of the data mining process and models
  - C) Transforming data into a suitable format
  - D) Collecting new data
  *Answer: B*