Types of Data
A common term in data science and computer science is the data type, which refers to the nature of the values in the data: character, integer, float, boolean, date, time, and so on.
Going forward in this article, when a data type is mentioned, it will refer to these types.
The categories discussed ahead, however, refer to the variables/samples you come across when performing statistical computing or other data science operations.
While one can divide variables based on the aforementioned data types, the following categorization is more useful from a data mining and statistical computing point of view:
1. Quantitative
2. Qualitative
Quantitative Data
Quantitative data refers to data that is generated by counting or measuring something. Generally,
quantitative refers to numerical data.
It’s important to note that ‘numerical’ does not mean the data type will always be numeric; in some cases it can be a string. What matters is that the value represents a number that can be counted or measured.
For example, if a variable in a dataframe provides the number of cars a household has and has
values such as ‘one,’ ‘two,’ ‘three,’ and so on, then this variable will be considered quantitative
data even when its data type is ‘string’.
Therefore, quantitative data can have either a numeric or a string/character data type. In both cases the information is gathered by counting or measuring something, and after a suitable transformation, statistical analysis can be performed on it (see the sketch below).
Thus quantitative data answers questions of the type ‘how many’, ‘how much’, ‘how often’ etc.
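As an illustration, here is a minimal Python/pandas sketch of how such string-valued counts can be mapped to numbers so they can be analyzed quantitatively; the num_cars column name and values are hypothetical.

import pandas as pd

# Hypothetical dataframe: the car counts are stored as words (string data type)
df = pd.DataFrame({"num_cars": ["one", "two", "three", "two", "one"]})

# Map the word labels to integers so the variable can be treated as quantitative
word_to_int = {"one": 1, "two": 2, "three": 3}
df["num_cars_numeric"] = df["num_cars"].map(word_to_int)

print(df["num_cars_numeric"].mean())  # statistical analysis is now possible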
Quantitative data can be categorized as:
1. Continuous vs. discrete
2. Ratio vs. interval
A discrete variable expresses data in countable numbers, i.e., integers. Therefore, data is known
as discrete data whenever a count of individual items is recorded.
Typically discrete data includes numeric, finite, non-negative countable integers. Examples of such
data include the number of houses, the number of students in a class, etc.
Continuous data differs from discrete data as it may or may not be comprised of whole numbers,
can consider any value in a specific range, and can take on an infinite number of values.
Another difference is that decimal numbers or fractions can appear in continuous data, whereas this is not the case with discrete data. Examples of such data include weight, height, distance, volume, etc.
Another way quantitative data can be categorized is Ratio and Interval data.
Ratio data is where a true zero exists, and there is an equal interval between neighboring points.
Here a zero indicates a total absence of a value.
For example, a zero measurement of population, length, or area would mean an absence of the
subject. Another example can be temperature measured in Kelvin, where zero indicates a
complete lack of thermal energy.
Interval data is similar to ratio data, with the difference being that there is no true zero point in
such data.
For example, the temperature measured in Celsius is an example of interval data where zero
doesn’t indicate an absence of temperature.
Qualitative Data
Qualitative data is fundamentally different from quantitative data because it cannot simply be measured or counted; it is more descriptive. It relies on language, i.e., characters, rather than numerical values to provide information.
However, when the qualitative data is encoded, it can be represented through numbers.
Note: Qualitative data is often called categorical data. In contrast, quantitative data is often referred to as numerical data (not to be confused with the numeric data type; as mentioned earlier, in special cases quantitative data can have a string data type).
Qualitative data can be categorized into:
1. Binary
2. Ordinal
3. Nominal
Binary Data
Binary Data is also known as dichotomous variables, where there are always two possible
outcomes. For example, Yes/No, Heads/Tails, Win/Lose, etc.
Ordinal Data
Ordinal data is where we have different categories with a natural rank order. For example, a
variable with pain level has five categories: ‘no pain’, ‘mild pain’, ‘moderate pain’, ‘high pain’, and
‘severe pain’.
Here if you notice, there is a natural order to the categories as the categories can be ranked from
the category indicating lower pain intensity to higher.
Identifying ordinal variables is important because when the data is being prepared for statistical
modeling, the categorical variable is often encoded, i.e., represented in numbers. The categorical
variable type dictates the encoding mechanism to be used.
Label encoding is used for ordinal variables where the categories are ranked, and value labels are
provided to them from 1 to N (N being the number of unique categories in the variable).
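As a minimal sketch in Python/pandas, ordinal categories can be declared with an explicit order and then label-encoded; the pain-level categories are the ones from the example above, while the variable itself is hypothetical.

import pandas as pd

# Hypothetical ordinal variable with a natural rank order
pain = pd.Series(["mild pain", "no pain", "severe pain", "moderate pain", "high pain"])

# Declare the ordering explicitly, then encode the categories as 1..N
levels = ["no pain", "mild pain", "moderate pain", "high pain", "severe pain"]
pain_cat = pd.Categorical(pain, categories=levels, ordered=True)
pain_encoded = pain_cat.codes + 1  # codes are 0-based, so add 1 to get labels 1..N

print(list(pain_encoded))  # [2, 1, 5, 3, 4]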
An important thing to note here is that ordinal data can be confused with several other data types.
For example, ordinal data, when encoded, can resemble discrete data.
Ordinal data can be confused with interval data.
The difference is that the distance between the two categories is unknown in ordinal data. In
contrast, the distance between two adjacent values is known and fixed in interval data.
For example, if the data provides a pain scale from 0 to 10, with 0 indicating no pain and 10 indicating severe pain, then such data is interval.
Here, the values have fixed measurement units that are of equal and known size.
On the other hand, if the data has five categories (‘no pain’, ‘mild pain’, ‘moderate pain’, ‘high pain’, and ‘severe pain’), then the data is considered ordinal because we can’t quantify the difference between one category and another.
Nominal Data
Another type of qualitative data is nominal data. Here the categories of the data are mutually
exclusive and cannot be ordered in a meaningful way.
For example, if a variable indicates a mode of transportation with categories like ‘bus’, ‘car’, ‘train’,
and ‘motorcycle’, then such data is nominal. Other examples can include zip code and genre of
music.
Nominal variables are encoded during the data preparation process using a method known as
one-hot encoding (also known as dummy variable creation).
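A minimal pandas sketch of one-hot encoding, using the mode-of-transportation example above (the column name is hypothetical):

import pandas as pd

# Hypothetical nominal variable: mode of transportation
df = pd.DataFrame({"transport": ["bus", "car", "train", "motorcycle", "car"]})

# One-hot encoding (dummy variables): one 0/1 column per category
dummies = pd.get_dummies(df["transport"], prefix="transport")
print(dummies.head())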
It’s important to note that you can get confused between the two types of categorical variables.
For example, the variable indicating the color of a car and having categories like ‘green’, ‘yellow’,
and ‘red’ is nominal. In contrast, the same categories can be a part of an ordinal variable where
the colors indicate a place’s danger level, with ‘green’ indicating safe and ‘red’ indicating unsafe.
Similarly, a variable indicating the temperature of an object with categories like ‘cold’ and ‘hot’ may seem binary. Still, it is possible to have another category, such as ‘mild’, in which case the variable is no longer binary.
6 steps in data processing
1. Data collection
The first stage of data collection involves gathering and discovering raw data from various
sources, such as sensors, databases, or customer surveys. It is essential to ensure the collected
data is accurate, complete, and relevant to the analysis or processing goals. Care must be taken
to avoid selection bias, where the method of collecting data inadvertently favors certain outcomes
or groups, potentially skewing results and leading to inaccurate conclusions.
2. Data preparation
Once the data is collected, it moves to the data preparation stage. Here, the raw data is cleaned
up, organized, and often enriched for further processing. This stage involves checking for errors,
removing any bad data (redundant, incomplete, or incorrect), and enhancing the dataset with
additional relevant information from external sources, a process known as data enrichment. Data
preparation aims to create high-quality, reliable, and comprehensive data for subsequent
processing steps.
3. Data input
The next stage is data input. In this stage, the clean and prepped data is fed into a processing
system, which could be software or an algorithm designed for specific data types or analysis
goals. Various methods, such as manual entry, data import from external sources, or automatic
data capture, can be used to input data into the processing system.
4. Data processing
In the data processing stage, the input data is transformed, analyzed, and organized to produce
relevant information. Several data processing techniques, like filtering, sorting, aggregation, or
classification, may be employed to process the data. The choice of methods depends on the
desired outcome or insights from the data.
5. Data output and interpretation
The data output and interpretation stage deals with presenting the processed data in an easily
digestible format. This could involve generating reports, graphs, or visualizations that simplify
complex data patterns and help with decision-making. Furthermore, the output data should be
interpreted and analyzed to extract valuable insights and knowledge.
6. Data storage
Finally, in the data storage stage, the processed information is securely stored in databases
or data warehouses for future retrieval, analysis, or use. Proper storage ensures data longevity,
availability, and accessibility while maintaining data privacy and security.
Batch processing
Batch processing involves handling large volumes of data collectively at predetermined times,
making it ideal for non-time-sensitive tasks. This method allows organizations to efficiently
manage data by aggregating it and processing it during off-peak hours to minimize the impact on
daily operations.
Example: Financial institutions batch process checks and transactions overnight, updating account
balances in one comprehensive sweep to ensure accuracy and efficiency.
Real-time processing
Real-time processing is essential for tasks that require immediate handling of data upon receipt,
providing instant processing and feedback. This type of processing is crucial for applications
where delays cannot be tolerated, ensuring timely decisions and responses.
Example: GPS navigation systems rely on real-time processing to offer turn-by-turn directions,
adjusting routes based on live traffic and road conditions to ensure the fastest path.
Multiprocessing
Multiprocessing divides a single task across multiple processors or machines that work on it simultaneously, shortening the time needed for computation-heavy workloads.
Example: Movie production often utilizes multiprocessing for rendering complex 3D animations. By distributing the rendering across multiple computers, the overall project's completion time is significantly reduced, leading to faster production cycles and improved visual quality.
Online processing
Online processing facilitates the interactive processing of data over a network, with continuous
input and output for instant responses. It enables systems to handle user requests immediately,
making it an essential component of e-commerce and online services.
Example: Online banking systems utilize online processing for real-time financial transactions,
allowing users to transfer funds, pay bills, and check account balances with immediate updates.
Manual data processing
Manual data processing requires human intervention for the input, processing, and output of data,
typically without the aid of electronic devices. This labor-intensive method is prone to errors but
was common before the advent of computerized systems.
Example: Before the widespread use of computers, libraries cataloged books manually, requiring
librarians to carefully record each book's details by hand for inventory and retrieval purposes.
Mechanical data processing
Mechanical data processing uses machines or equipment to manage and process data tasks, a
prevalent method before the digital era. This approach involved using tangible, mechanical
devices to input, process, and output data.
Example: Voting in the early 20th century often involved mechanical lever machines, where votes
were tallied by pulling levers for each choice, simplifying vote counting and reducing the potential
for errors.
Electronic data processing
Electronic data processing employs computers and digital technology to process, store, and
communicate data with efficiency and accuracy. This modern approach to data handling allows for
rapid processing speeds, vast storage capabilities, and easy data retrieval.
Example: Retailers use electronic data processing at checkouts, where barcode scans instantly
update inventory systems and process sales, enhancing checkout speed and inventory
management.
Distributed processing
Distributed processing spreads data and computation across multiple interconnected machines that work on a task in parallel, improving throughput and fault tolerance.
Example: Video streaming services use distributed processing to deliver content efficiently. By
storing videos on multiple servers, they ensure smooth playback and quick access for users
worldwide.
Cloud computing
Cloud computing offers computing resources, such as servers, storage, and databases, over the
internet, providing flexibility and scalability. This model enables users to access and utilize
computing resources as needed, without the burden of maintaining physical infrastructure.
Example: Small businesses leverage cloud computing for data storage and software services,
avoiding the need for significant upfront hardware investments and allowing easy scaling as the
business grows.
Automatic data processing
Automatic data processing uses software to automate routine tasks, reducing the need for manual
input and increasing operational efficiency. This method streamlines repetitive processes,
minimizes human error, and frees up personnel for more strategic tasks.
Example: Automated billing systems in telecommunications automatically calculate and send out
monthly charges to customers, streamlining billing operations and reducing errors.
Data processing is the key to unlocking the potential of raw data, transforming it into the
knowledge that shapes our future. By systematically analyzing and interpreting data, organizations
gain critical insights that inform strategic decisions, streamline processes, and drive innovation.
As the volume and complexity of data continue to expand, the ability to understand and effectively
process it will become even more essential for success in a data-driven world.
One sample median test
A one sample median test allows us to test whether a sample median differs significantly from a
hypothesized value. We will use the same variable, write, as we did in the one sample t-
test example above, but we do not need to assume that it is interval and normally distributed (we
only need to assume that write is an ordinal variable).
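The article discusses SPSS; as an illustrative alternative, a sign test (one form of the one sample median test) can be sketched in Python with SciPy (>= 1.7). The file name and the hypothesized median below are assumptions.

from scipy.stats import binomtest
import pandas as pd

# Assumes the hsb2 data have been loaded into a DataFrame with a 'write' column
df = pd.read_csv("hsb2.csv")          # hypothetical file name
hypothesized_median = 50              # hypothetical value to test against

above = (df["write"] > hypothesized_median).sum()
below = (df["write"] < hypothesized_median).sum()

# Sign test: under H0 the median equals the hypothesized value,
# so observations fall above/below it with probability 0.5
result = binomtest(above, n=above + below, p=0.5)
print(result.pvalue)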
Binomial test
A one sample binomial test allows us to test whether the proportion of successes on a two-level
categorical dependent variable significantly differs from a hypothesized value. For example, using
the hsb2 data file, say we wish to test whether the proportion of females (female) differs
significantly from 50%, i.e., from .5. We can do this as shown below.
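A sketch of the same binomial test in Python with SciPy (>= 1.7), assuming hsb2 has been exported to a CSV file and female is coded 0/1:

from scipy.stats import binomtest
import pandas as pd

# Assumes hsb2 is loaded and 'female' is coded 0/1 (1 = female)
df = pd.read_csv("hsb2.csv")  # hypothetical file name

successes = int(df["female"].sum())   # number of females
n = len(df)

# Test whether the proportion of females differs from .5
result = binomtest(successes, n=n, p=0.5)
print(result.pvalue)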
Chi-square goodness of fit
A chi-square goodness of fit test allows us to test whether the observed proportions for a
categorical variable differ from hypothesized proportions. For example, let’s suppose that we
believe that the general population consists of 10% Hispanic, 10% Asian, 10% African American
and 70% White folks. We want to test whether the observed proportions from our sample differ
significantly from these hypothesized proportions.
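A SciPy sketch of the goodness of fit test; the observed counts below are made up for illustration, while the hypothesized proportions are the ones stated above.

from scipy.stats import chisquare
import numpy as np

# Hypothetical observed counts for Hispanic, Asian, African American, White
observed = np.array([20, 25, 30, 125])
n = observed.sum()

# Hypothesized proportions: 10%, 10%, 10%, 70%
expected = n * np.array([0.10, 0.10, 0.10, 0.70])

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(stat, p)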
Independent samples t-test
An independent samples t-test is used when you want to compare the means of a normally
distributed interval dependent variable for two independent groups. For example, using the hsb2
data file, say we wish to test whether the mean for write is the same for males and females.
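An illustrative SciPy version of this independent samples t-test, assuming hsb2 is available as a CSV with write and female columns:

from scipy.stats import ttest_ind
import pandas as pd

# Assumes hsb2 is loaded with 'write' (score) and 'female' (0 = male, 1 = female)
df = pd.read_csv("hsb2.csv")  # hypothetical file name

males = df.loc[df["female"] == 0, "write"]
females = df.loc[df["female"] == 1, "write"]

# Two-sample t-test assuming equal variances
stat, p = ttest_ind(males, females)
print(stat, p)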
Wilcoxon-Mann-Whitney test
The Wilcoxon-Mann-Whitney test is the non-parametric analog of the independent samples t-test; it can be used when you do not assume that the dependent variable is a normally distributed interval variable (you only assume that the variable is at least ordinal).
Chi-square test
A chi-square test is used when you want to see if there is a relationship between two categorical
variables. In SPSS, the chisq option is used on the statistics subcommand of
the crosstabs command to obtain the test statistic and its associated p-value. Using the hsb2
data file, let’s see if there is a relationship between the type of school attended (schtyp) and
students’ gender (female). Remember that the chi-square test assumes that the expected value
for each cell is five or higher. This assumption is easily met in the examples below. However, if
this assumption is not met in your data, please see the section on Fisher’s exact test below.
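Outside SPSS, the same chi-square test of independence can be sketched in Python with pandas and SciPy (file name assumed):

from scipy.stats import chi2_contingency
import pandas as pd

# Assumes hsb2 is loaded with 'schtyp' (school type) and 'female'
df = pd.read_csv("hsb2.csv")  # hypothetical file name

# Build the crosstab and run the chi-square test of independence
table = pd.crosstab(df["schtyp"], df["female"])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)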
Fisher’s exact test
The Fisher’s exact test is used when you want to conduct a chi-square test but one or more of
your cells has an expected frequency of five or less. Remember that the chi-square test assumes
that each cell has an expected frequency of five or more, but the Fisher’s exact test has no such
assumption and can be used regardless of how small the expected frequency is. In SPSS unless
you have the SPSS Exact Test Module, you can only perform a Fisher’s exact test on a 2×2 table,
and these results are presented by default. Please see the results from the chi squared example
above.
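A SciPy sketch of Fisher's exact test on the same 2×2 table (SciPy's fisher_exact likewise handles only 2×2 tables); the file name is an assumption.

from scipy.stats import fisher_exact
import pandas as pd

# Assumes hsb2 is loaded; the schtyp-by-female crosstab is 2x2
df = pd.read_csv("hsb2.csv")  # hypothetical file name

table = pd.crosstab(df["schtyp"], df["female"])
odds_ratio, p = fisher_exact(table)
print(odds_ratio, p)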
One-way ANOVA
A one-way analysis of variance (ANOVA) is used when you have a categorical independent
variable (with two or more categories) and a normally distributed interval dependent variable and
you wish to test for differences in the means of the dependent variable broken down by the levels
of the independent variable. For example, using the hsb2 data file, say we wish to test whether
the mean of write differs between the three program types (prog).
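As an illustrative alternative to the SPSS command, a one-way ANOVA of write by prog can be sketched with SciPy (file name assumed):

from scipy.stats import f_oneway
import pandas as pd

# Assumes hsb2 is loaded with 'write' and 'prog' (program type with three levels)
df = pd.read_csv("hsb2.csv")  # hypothetical file name

# One list of write scores per program type
groups = [g["write"].values for _, g in df.groupby("prog")]
stat, p = f_oneway(*groups)
print(stat, p)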
Kruskal Wallis test
The Kruskal Wallis test is used when you have one independent variable with two or more levels
and an ordinal dependent variable. In other words, it is the non-parametric version of ANOVA and
a generalized form of the Mann-Whitney test method since it permits two or more groups. We will
use the same data file as the one way ANOVA example above (the hsb2 data file) and the same
variables as in the example above, but we will not assume that write is a normally distributed
interval variable.
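A SciPy sketch of the Kruskal Wallis test for the same variables (file name assumed):

from scipy.stats import kruskal
import pandas as pd

# Assumes hsb2 is loaded with 'write' and 'prog'
df = pd.read_csv("hsb2.csv")  # hypothetical file name

groups = [g["write"].values for _, g in df.groupby("prog")]
stat, p = kruskal(*groups)
print(stat, p)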
Paired t-test
A paired samples t-test is used when you have two related observations (i.e., two observations per subject) and you want to see if the means of two normally distributed interval variables differ from one another. For example, using the hsb2 data file, we can test whether the mean of read is equal to the mean of write; in these data the difference is not statistically significant (p = .387).
Wilcoxon signed rank sum test
The Wilcoxon signed rank sum test is the non-parametric version of a paired samples t-test. You
use the Wilcoxon signed rank sum test when you do not wish to assume that the difference
between the two variables is interval and normally distributed (but you do assume the difference is
ordinal). We will use the same example as above, but we will not assume that the difference
between read and write is interval and normally distributed.
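A SciPy sketch of the Wilcoxon signed rank test for read versus write (file name assumed):

from scipy.stats import wilcoxon
import pandas as pd

# Assumes hsb2 is loaded with 'read' and 'write'
df = pd.read_csv("hsb2.csv")  # hypothetical file name

stat, p = wilcoxon(df["read"], df["write"])
print(stat, p)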
McNemar test
You would perform McNemar’s test if you were interested in the marginal frequencies of two
binary outcomes. These binary outcomes may be the same outcome variable on matched pairs
(like a case-control study) or two outcome variables from a single group. Continuing with
the hsb2 dataset used in several above examples, let us create two binary outcomes in our
dataset: himath and hiread. These outcomes can be considered in a two-way contingency table.
The null hypothesis is that the proportion of students in the himath group is the same as the
proportion of students in hiread group (i.e., that the contingency table is symmetric).
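A sketch of McNemar's test using statsmodels; the cut-off of 60 used to create himath and hiread below is an assumption for illustration.

import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar

# Assumes hsb2 is loaded; himath/hiread are created with a hypothetical cut-off of 60
df = pd.read_csv("hsb2.csv")  # hypothetical file name
df["himath"] = (df["math"] > 60).astype(int)
df["hiread"] = (df["read"] > 60).astype(int)

table = pd.crosstab(df["himath"], df["hiread"])
result = mcnemar(table, exact=True)
print(result.statistic, result.pvalue)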
Repeated measures ANOVA
You would perform a one-way repeated measures analysis of variance if you had one categorical
independent variable and a normally distributed interval dependent variable that was repeated at
least twice for each subject. This is the equivalent of the paired samples t-test, but allows for two
or more levels of the categorical variable. This tests whether the mean of the dependent variable
differs by the categorical variable. We have an example data set called rb4wide, which is used in
Kirk’s book Experimental Design. In this data set, y is the dependent variable, a is the repeated
measure and s is the variable that identifies the subject.
Repeated measures logistic regression
If you have a binary outcome measured repeatedly for each subject and you wish to run a logistic
regression that accounts for the effect of multiple measures from single subjects, you can perform
a repeated measures logistic regression. In SPSS, this can be done using the GENLIN command
and indicating binomial as the probability distribution and logit as the link function to be used in the
model. The exercise data file contains 3 pulse measurements from each of 30 people assigned to
2 different diet regimens and 3 different exercise regimens. If we define a “high” pulse as being over 100, we can then predict the probability of a high pulse using diet regimen.
Factorial ANOVA
A factorial ANOVA has two or more categorical independent variables (either with or without the
interactions) and a single normally distributed interval dependent variable. For example, using
the hsb2 data file we will look at writing scores (write) as the dependent variable and gender
(female) and socio-economic status (ses) as independent variables, and we will include an
interaction of female by ses. Note that in SPSS, you do not need to have the interaction term(s)
in your data set. Rather, you can have SPSS create it/them temporarily by placing an asterisk
between the variables that will make up the interaction term(s).
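A statsmodels sketch of this factorial ANOVA with the female-by-ses interaction (file name assumed):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Assumes hsb2 is loaded with 'write', 'female', and 'ses'
df = pd.read_csv("hsb2.csv")  # hypothetical file name

# Two-way (factorial) ANOVA including the interaction term
model = smf.ols("write ~ C(female) * C(ses)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)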
Friedman test
You perform a Friedman test when you have one within-subjects independent variable with two or
more levels and a dependent variable that is not interval and normally distributed (but at least
ordinal). We will use this test to determine if there is a difference in the reading, writing and math
scores. The null hypothesis in this test is that the distribution of the ranks of each type of score
(i.e., reading, writing and math) are the same. To conduct a Friedman test, the data need to be in
a long format. SPSS handles this for you, but in other statistical packages you will have to
reshape the data before you can conduct this test.
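A SciPy sketch of the Friedman test on the three score columns (file name assumed); SciPy accepts the repeated measurements as separate arrays, so no reshaping is needed.

from scipy.stats import friedmanchisquare
import pandas as pd

# Assumes hsb2 is loaded with 'read', 'write', and 'math' scores for each student
df = pd.read_csv("hsb2.csv")  # hypothetical file name

stat, p = friedmanchisquare(df["read"], df["write"], df["math"])
print(stat, p)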
Friedman’s chi-square has a value of 0.645 and a p-value of 0.724 and is not statistically
significant. Hence, there is no evidence that the distributions of the three types of scores are
different.
Ordered logistic regression
Ordered logistic regression is used when the dependent variable is ordered, but not continuous.
For example, using the hsb2 data file we will create an ordered variable called write3. This
variable will have the values 1, 2 and 3, indicating a low, medium or high writing score. We do not
generally recommend categorizing a continuous variable in this way; we are simply creating a
variable to use for this example. We will use gender (female), reading score (read) and social
studies score (socst) as predictor variables in this model. We will use a logit link and on
the print subcommand we have requested the parameter estimates, the (model) summary
statistics and the test of the parallel lines assumption.
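A sketch of an ordered logistic regression in Python using statsmodels' OrderedModel (available in statsmodels >= 0.12); the file name and the three-level recoding of write below are assumptions for illustration.

import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Assumes hsb2 is loaded; write3 is a hypothetical 3-level ordered recoding of write
df = pd.read_csv("hsb2.csv")  # hypothetical file name
df["write3"] = pd.cut(df["write"], bins=3, labels=["low", "medium", "high"])

# Ordered logit: write3 predicted from female, read, and socst
model = OrderedModel(df["write3"], df[["female", "read", "socst"]], distr="logit")
result = model.fit(method="bfgs")
print(result.summary())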
Factorial logistic regression
A factorial logistic regression is used when you have two or more categorical independent
variables but a dichotomous dependent variable. For example, using the hsb2 data file we will
use female as our dependent variable, because it is the only dichotomous variable in our data set;
certainly not because it is common practice to use gender as an outcome variable. We will use type
of program (prog) and school type (schtyp) as our predictor variables. Because prog is a
categorical variable (it has three levels), we need to create dummy codes for it. SPSS will do this
for you by making dummy codes for all variables listed after the keyword with. SPSS will also
create the interaction term; simply list the two variables that will make up the interaction separated
by the keyword by.
Correlation
A correlation is useful when you want to see the relationship between two (or more) normally
distributed interval variables. For example, using the hsb2 data file we can run a correlation
between two continuous variables, read and write.
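A SciPy sketch of the Pearson correlation between read and write (file name assumed):

from scipy.stats import pearsonr
import pandas as pd

# Assumes hsb2 is loaded with 'read' and 'write'
df = pd.read_csv("hsb2.csv")  # hypothetical file name

r, p = pearsonr(df["read"], df["write"])
print(r, p)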
Simple linear regression
Simple linear regression allows us to look at the linear relationship between one normally
distributed interval predictor and one normally distributed interval outcome variable. For example,
using the hsb2 data file, say we wish to look at the relationship between writing scores (write) and
reading scores (read); in other words, predicting write from read.
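A SciPy sketch of this simple linear regression, predicting write from read (file name assumed):

from scipy.stats import linregress
import pandas as pd

# Assumes hsb2 is loaded; predict write from read
df = pd.read_csv("hsb2.csv")  # hypothetical file name

result = linregress(df["read"], df["write"])
print(result.slope, result.intercept, result.rvalue, result.pvalue)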
Non-parametric correlation
A Spearman correlation is used when one or both of the variables are not assumed to be normally
distributed and interval (but are assumed to be ordinal). The values of the variables are converted into ranks and then correlated. In our example, we will look for a relationship
between read and write. We will not assume that both of these variables are normal and interval.
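A SciPy sketch of the Spearman correlation between read and write (file name assumed):

from scipy.stats import spearmanr
import pandas as pd

# Assumes hsb2 is loaded with 'read' and 'write'
df = pd.read_csv("hsb2.csv")  # hypothetical file name

rho, p = spearmanr(df["read"], df["write"])
print(rho, p)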
Simple logistic regression
Logistic regression assumes that the outcome variable is binary (i.e., coded as 0 and 1). We have
only one variable in the hsb2 data file that is coded 0 and 1, and that is female. We understand
that female is a silly outcome variable (it would make more sense to use it as a predictor variable),
but we can use female as the outcome variable to illustrate how the code for this command is
structured and how to interpret the output. The first variable listed after the logistic command is
the outcome (or dependent) variable, and all of the rest of the variables are predictor (or
independent) variables. In our example, female will be the outcome variable, and read will be the
predictor variable. As with OLS regression, the predictor variables must be either dichotomous or
continuous; they cannot be categorical.
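A statsmodels sketch of this logistic regression of female on read (file name assumed):

import pandas as pd
import statsmodels.formula.api as smf

# Assumes hsb2 is loaded; female (0/1) is the outcome, read the predictor
df = pd.read_csv("hsb2.csv")  # hypothetical file name

model = smf.logit("female ~ read", data=df).fit()
print(model.summary())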
Multiple regression
Multiple regression is very similar to simple regression, except that in multiple regression you have
more than one predictor variable in the equation. For example, using the hsb2 data file we will
predict writing score from gender (female), reading, math, science and social studies (socst)
scores.
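A statsmodels sketch of this multiple regression (file name assumed):

import pandas as pd
import statsmodels.formula.api as smf

# Assumes hsb2 is loaded with write, female, read, math, science, and socst
df = pd.read_csv("hsb2.csv")  # hypothetical file name

model = smf.ols("write ~ female + read + math + science + socst", data=df).fit()
print(model.summary())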
Multiple logistic regression
Multiple logistic regression is like simple logistic regression, except that there are two or more
predictors. The predictors can be interval variables or dummy variables, but cannot be categorical
variables. If you have categorical predictors, they should be coded into one or more dummy
variables. We have only one variable in our data set that is coded 0 and 1, and that is female. We
understand that female is a silly outcome variable (it would make more sense to use it as a
predictor variable), but we can use female as the outcome variable to illustrate how the code for
this command is structured and how to interpret the output. The first variable listed after
the logistic regression command is the outcome (or dependent) variable, and all of the rest of
the variables are predictor (or independent) variables (listed after the keyword with). In our
example, female will be the outcome variable, and read and write will be the predictor variables.
Discriminant analysis
Discriminant analysis is used when you have one or more normally distributed interval
independent variables and a categorical dependent variable. It is a multivariate technique that
considers the latent dimensions in the independent variables for predicting group membership in
the categorical dependent variable. For example, using the hsb2 data file, say we wish to
use read, write and math scores to predict the type of program a student belongs to (prog).
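A scikit-learn sketch of a linear discriminant analysis predicting prog from read, write, and math (file name assumed):

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumes hsb2 is loaded; predict program type from read, write, and math
df = pd.read_csv("hsb2.csv")  # hypothetical file name

X = df[["read", "write", "math"]]
y = df["prog"]

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.score(X, y))  # in-sample classification accuracy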
One-way MANOVA
MANOVA (multivariate analysis of variance) is like ANOVA, except that there are two or more
dependent variables. In a one-way MANOVA, there is one categorical independent variable and
two or more dependent variables.
Multivariate multiple regression
Multivariate multiple regression is used when you have two or more dependent variables that are
to be predicted from two or more independent variables. In our example using the hsb2 data file,
we will predict write and read from female, math, science and social studies (socst) scores.
Canonical correlation
Canonical correlation is a multivariate technique used to examine the relationship between two
groups of variables. For each set of variables, it creates latent variables and looks at the
relationships among the latent variables. It assumes that all variables in the model are interval and
normally distributed. SPSS requires that each of the two groups of variables be separated by the
keyword with. There need not be an equal number of variables in the two groups (before and
after the with).
Factor analysis
Factor analysis is a form of exploratory multivariate analysis that is used to either reduce the
number of variables in a model or to detect relationships among variables. All variables involved
in the factor analysis need to be interval and are assumed to be normally distributed. The goal of
the analysis is to try to identify factors which underlie the variables. There may be fewer factors
than variables, but there may not be more factors than variables. For our example using the hsb2
data file, let’s suppose that we think that there are some common factors underlying the various
test scores. We will include subcommands for varimax rotation and a plot of the eigenvalues. We
will use a principal components extraction and will retain two factors.
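As an illustrative alternative to the SPSS factor procedure, scikit-learn's FactorAnalysis can fit two factors with varimax rotation (the rotation argument requires scikit-learn >= 0.24; note it uses maximum-likelihood extraction rather than principal components, and the file name is an assumption).

import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Assumes hsb2 is loaded with the five test scores
df = pd.read_csv("hsb2.csv")  # hypothetical file name

scores = df[["read", "write", "math", "science", "socst"]]

# Two factors with varimax rotation
fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(scores)
print(fa.components_)  # factor loadings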