Tutorial 2: Getting to know your data
1. Define what “data” refers to. A collection of data objects and their attributes.
2. Explain what an “attribute” represents. A variable property or characteristic of an object.
3. Name the two broad types of variables that may be present within any given set of data.
   Qualitative, quantitative
4. Discuss the similarities and differences between nominal and ordinal variables. Both describe
   qualitative attributes whose values are states or categories. However, the categories of a
   nominal variable have no natural order and cannot be ranked, whereas the categories of an
   ordinal variable have a meaningful order or ranking.
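
As a minimal sketch of the nominal/ordinal distinction, the snippet below uses made-up category values (`colours`, `sizes`) and an assumed ranking (`size_rank`); none of these names come from the tutorial. Ordinal values can be sorted by rank, while nominal values can only be compared for equality and counted.

```python
from collections import Counter

# Hypothetical example values; the category names are illustrative only.
colours = ["red", "blue", "green", "blue"]      # nominal: no natural order
sizes = ["medium", "small", "large", "small"]   # ordinal: small < medium < large

# An ordinal attribute has a meaningful rank for each category,
# so sorting by that rank makes sense.
size_rank = {"small": 0, "medium": 1, "large": 2}
sorted_sizes = sorted(sizes, key=size_rank.get)

# For a nominal attribute only equality checks and counts are meaningful.
colour_counts = Counter(colours)

print(sorted_sizes)
print(colour_counts.most_common(1))
```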
5. Which of the following statements are correct regarding an attribute that is on the interval
   scale? You may choose more than one.
       a. There is no meaningful zero, and ratios don’t make sense
       b. There is a natural zero, but differences are meaningless.
       c. There is no meaningful zero, but differences are meaningful.
       d. There is a natural zero, and ratios are meaningful
6. Discuss how discrete and continuous attributes differ. Discrete attributes have only a finite or
   countably infinite set of values, whereas continuous attributes can take any real value,
   usually within some interval.
7. When would you reduce dimensions in your data? When you have a large number of features,
   many of which have similar characteristics (i.e. are redundant or highly correlated).
8. A binary variable with states that are not equally important may be described as…
       a. Symmetric
       b. Asymmetric
       c. None of the above
9. Briefly discuss the characteristics that need to be considered for any given data set.
   Dimensionality: having to deal with a large number of features can become problematic,
   particularly if the number of attributes exceeds the number of observations.
   Sparsity: sometimes most of the features in the data have a high percentage of zero values.
   Resolution: It is frequently possible to obtain data at different levels of resolution, and often the
   properties of the data are different at different resolutions (or any related answer).
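
Two of these characteristics can be checked directly. The sketch below uses a small invented table (`rows`) to compute the proportion of zero entries as a simple sparsity measure, and to test whether there are more attributes than observations. The data and variable names are assumptions for illustration.

```python
# Hypothetical data: 3 observations (rows) by 5 attributes (columns).
rows = [
    [0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 2],
]

n_obs = len(rows)
n_attr = len(rows[0])

# Sparsity measured here as the fraction of entries that are exactly zero.
n_zero = sum(value == 0 for row in rows for value in row)
sparsity = n_zero / (n_obs * n_attr)

# High dimensionality is especially problematic when attributes outnumber observations.
more_attributes_than_observations = n_attr > n_obs

print(f"sparsity = {sparsity:.2f}")
print(f"attributes > observations: {more_attributes_than_observations}")
```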
10. Which of the following refers to a type of record data?
       a. Transaction or Market Basket Data
       b. Graph-Based Data
       c. Both A and B are correct
11. Briefly describe what time series data refers to. Time series data is a special type of ordered,
    sequential data in which each record has a time stamp and the time interval between records is
    regular, i.e. a series of measurements recorded over time at regular intervals.
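
The "regular interval" condition can be tested in code. The helper below (`is_regular`, a hypothetical name) is one possible sketch: it checks that all gaps between consecutive time stamps are identical. The example time stamps are invented.

```python
from datetime import datetime, timedelta

def is_regular(timestamps):
    """Return True if consecutive time stamps are equally spaced."""
    gaps = {later - earlier for earlier, later in zip(timestamps, timestamps[1:])}
    return len(gaps) <= 1

# Hypothetical measurements taken every hour.
start = datetime(2024, 1, 1)
hourly = [start + timedelta(hours=i) for i in range(5)]

# The same series with an uneven final gap: ordered, but not regularly spaced.
irregular = hourly[:3] + [start + timedelta(hours=7)]

print(is_regular(hourly), is_regular(irregular))
```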
12. Briefly discuss the 3 different types of data structures.
    Structured data is considered the most traditional form of data: it can be presented in a
    tabular format and is straightforward to analyse.
    Unstructured data is information that either does not have a predefined data model or is not
    organised in a predefined manner.
    Semi-structured data is a form of structured data that does not conform to the formal
    structure of data models associated with relational databases or other forms of data tables, but
    nonetheless contains tags or other markers to separate semantic elements and enforce
    hierarchies of records and fields within the data. It is therefore also known as a
    self-describing structure.
13. Which of the following might be considered correct for the ordered process of Data Mining?
       a. Infrastructure, Exploration, Analysis, Interpretation, Exploitation
       b. Exploration, Infrastructure, Analysis, Interpretation, Exploitation
       c. Exploration, Infrastructure, Interpretation, Analysis, Exploitation
       d. Exploration, Infrastructure, Analysis, Exploitation, Interpretation
14. Which of the following may be described as a process in which intelligent methods are applied
    to extract hidden patterns and relationships?
       a. Warehousing
       b. Data Mining
       c. Text Mining
       d. Data Selection
15. What does the acronym KDD stand for?
       a. Knowledge Discovery in Databases
       b. Knowledge Discovery Dimension
       c. Knowledge Data Definition
       d. Knowledge Data Dimension
16. What are the functions of Data Mining?
       a. Association and correlation analysis, classification
       b. Prediction and characterization
       c. Cluster analysis and Evolution analysis
       d. All of the above
17. Which one of the following statements about the mean is incorrect?
       a. It is sensitive to extreme values (outliers)
       b. It is a single value that is useful for describing a data set
       c. It is always the best way of measuring the centre of the data
       d. None of the above
18. Which one of the following statements about the median is incorrect?
       a. It is sensitive to extreme values (outliers)
       b. The median is often referred to as “the middle”
       c. It is robust
       d. None of the above
19. Which one of the following statements about the mode is incorrect?
       a. It is the most common value
       b. It is not affected by outliers
       c. There may not actually be a mode
       d. None of the above
20. A data set which has two or more modes is called ……….
       a. Unimodal
       b. Symmetric
       c. Multimodal
       d. None of the above
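
The behaviour of the three measures of centre in questions 17 to 20 can be explored with Python's standard `statistics` module. This is a sketch with invented values: adding a single extreme value shifts the mean substantially but barely moves the median, and `statistics.multimode` returns every most-common value when there is more than one.

```python
import statistics

# Hypothetical sample, then the same sample with one extreme value added.
values = [2, 3, 3, 4, 5]
with_outlier = values + [100]

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

# multimode returns a list of all values that share the highest frequency.
modes = statistics.multimode([1, 1, 2, 2, 3])

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
print(f"modes:  {modes}")
```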
21. Which one of the following statements about perfectly symmetric, bell-shaped data is incorrect?
       a. The highest point on the curve represents the mean
       b. Its standard deviation depicts the bell curve's relative width around the mean
       c. Median=mean=mode
       d. None of the above
22. Which one of the following statements about positively skewed data is correct?
       a. There are outliers/extreme values on the upper end of the scale
       b. The tail of the distribution lies on the left side
       c. The mean is the best measure of centrality for such data
       d. None of the above
23. Which one of the following statements about negatively skewed data is incorrect?
       a. There are outliers/extreme values on the lower end of the scale.
       b. The value of the mean is the greatest one followed by median and then by mode
       c. The median is the best measure of centrality for such data
       d. None of the above
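
To see how skew affects the mean and median, the sketch below builds a small positively skewed sample and mirrors it to get a negatively skewed one. The numbers are invented. In the positively skewed sample the long upper tail pulls the mean above the median, and the reverse holds for the mirrored sample.

```python
import statistics

# Hypothetical sample with a long tail on the upper end (positive skew).
right_skewed = [1, 2, 2, 3, 3, 4, 20]

# Mirroring the values around 21 gives a long tail on the lower end (negative skew).
left_skewed = [21 - v for v in right_skewed]

right_mean = statistics.mean(right_skewed)
right_median = statistics.median(right_skewed)
left_mean = statistics.mean(left_skewed)
left_median = statistics.median(left_skewed)

print(f"positive skew: mean={right_mean}, median={right_median}")
print(f"negative skew: mean={left_mean}, median={left_median}")
```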
24. Which of the following measures are appropriate for assessing the dispersion of an asymmetric
    numeric variable?
       a. Range
       b. Percentile
       c. Variance
       d. All of the above
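
The dispersion measures named in question 24 can be computed with the standard library. In this sketch the data are invented, `statistics.quantiles` is called with its default "exclusive" method to obtain the quartiles (the 25th, 50th and 75th percentiles), and the interquartile range is included as an extra summary derived from those percentiles.

```python
import statistics

# Hypothetical asymmetric sample with one large value.
data = [1, 2, 2, 3, 3, 4, 20]

# Range: difference between the largest and smallest value.
data_range = max(data) - min(data)

# Quartiles (25th, 50th, 75th percentiles) using the default "exclusive" method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Sample variance (divides by n - 1).
variance = statistics.variance(data)

print(f"range={data_range}, quartiles=({q1}, {q2}, {q3}), IQR={iqr}")
print(f"variance={variance:.2f}")
```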
25. For data containing multiple continuous variables, the spread or variability of the data can be
    represented using…
       a. Range
       b. Covariance matrix
       c. Interquartile range
       d. None of the above
26. Which one of the following statements regarding the measured covariance between two
    variables is correct?
       a. A value near zero denotes that the two variables do not have a linear relationship
       b. It measures the direction of a relationship between two variables
       c. Covariance is different from the correlation coefficient
       d. All of the above
27. Which one of the following statements about the correlation coefficient is incorrect?
       a. Correlation coefficient indicates how strongly two variables are (linearly) related
       b. A positive correlation means that the two variables move together in the same direction
           while a negative correlation means they move inversely.
       c. A perfect positive correlation means that the correlation coefficient is exactly 1
       d. None of the above
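
Covariance and the correlation coefficient from questions 25 to 27 can be compared in a short sketch. The helper functions `covariance` and `correlation` below are written out by hand for illustration (sample covariance and Pearson correlation), and the data are invented. Rescaling a variable changes the covariance but leaves the correlation coefficient unchanged, which is one way the two measures differ.

```python
import math

def covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data with exact linear relationships.
x = [1, 2, 3, 4, 5]
y_up = [2 * v + 1 for v in x]      # moves in the same direction as x
y_down = [-v for v in x]           # moves in the opposite direction
y_up_scaled = [10 * v for v in y_up]

print(covariance(x, y_up), correlation(x, y_up))
print(covariance(x, y_down), correlation(x, y_down))
print(covariance(x, y_up_scaled), correlation(x, y_up_scaled))
```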
28. State what type of data visualization would be best to use in order to:
       a. Explore the relationship between 3 different quantitative variables: 3-dimensional scatter
          plot / bubble plot
       b. Compare the shape of the distributions of 4 different quantitative variables: multiple box
          plots
       c. Identify outliers: box plot
       d. Partition spatial data into regions of similar values: contour plot
       e. Explore the relationship between 2 different quantitative variables plus a 3rd categorical
          variable: scatter plot, with the categorical variable shown by colour or marker shape