
UNIT – III

DATA COLLECTION
VARIABLE
Meaning:
In research, a variable is a characteristic, quantity, or number that can be measured or quantified, and
that can change or vary. Variables are essential to research because they allow researchers to:

 Frame research questions


 Formulate hypotheses
 Interpret results
 Gain insights into relationships, causes, and effects
Variables can be categorized in different ways, including:

 Type of data: Whether the variable is quantitative or categorical


 Role in the study: Whether the variable is independent, dependent, controlled, or confounding
 Relationship to other variables: Whether the variable is confounding or controlled
Some examples of variables include:

 Age
 Height
 Satisfaction levels
 Economic status
 Time it takes for something to occur
 Whether or not an object is used within a study

TYPES OF VARIABLE:
Qualitative Variables
Qualitative variables are those that express a qualitative attribute, such as hair color, religion, race,
gender, social status, method of payment, and so on. The values of a qualitative variable do not imply
a meaningful numerical ordering.
The value of the variable 'religion' (Muslim, Hindu, etc.) differs qualitatively; no ordering of religion is implied. Qualitative variables are sometimes referred to as categorical variables.
For example, the variable sex has two distinct categories: 'male' and 'female.' Since the values of this variable are expressed in categories, we refer to this as a categorical variable.
Similarly, the place of residence may be categorized as urban and rural and thus is a categorical
variable.
Categorical variables may again be described as nominal and ordinal.
Ordinal variables can be logically ordered or ranked higher or lower than one another but do not necessarily establish a numeric difference between each category, such as examination grades (A+, A, B+, etc.) and clothing size (extra large, large, medium, small).
Nominal variables are those that can neither be ranked nor logically ordered, such as religion, sex, etc.
A qualitative variable is a characteristic that is not capable of being measured but can be categorized
as possessing or not possessing some characteristics.



Quantitative Variables
Quantitative variables, also called numeric variables, are those variables that are measured in terms of
numbers. A simple example of a quantitative variable is a person's age.
Age can take on different values because a person can be 20 years old, 35 years old, and so on.
Likewise, family size is a quantitative variable because a family might be comprised of one, two, or
three members, and so on.
Each of these properties or characteristics referred to above varies or differs from one individual to
another. Note that these variables are expressed in numbers, for which we call quantitative or
sometimes numeric variables.
A quantitative variable is one for which the resulting observations are numeric and thus possess a
natural ordering or ranking.
Discrete and Continuous Variables
Quantitative variables are again of two types: discrete and continuous.
Variables such as some children in a household or the number of defective items in a box are discrete
variables since the possible scores are discrete on the scale.
For example, a household could have three or five children, but not 4.52 children.
Other variables, such as 'time required to complete an MCQ test' and 'waiting time in a queue in front of a bank counter,' are continuous variables.
The time required in the above examples is a continuous variable, which could be, for example, 1.65
minutes or 1.6584795214 minutes.
Of course, the practicalities of measurement preclude most measured variables from being
continuous.
Discrete Variable
A discrete variable, restricted to certain values, usually (but not necessarily) consists of whole numbers, such as family size or the number of defective items in a box. Discrete variables are often the result of enumeration or counting.
A few more examples are;

 The number of accidents in the twelve months.


 The number of mobile cards sold in a store within seven days.
 The number of patients admitted to a hospital over a specified period.
 The number of new branches of a bank opened annually during 2001- 2007.
 The number of weekly visits made by health personnel in the last 12 months.
Continuous Variable
A continuous variable may take on an infinite number of intermediate values along a specified
interval. Examples are:

 The sugar level in the human body;


 Blood pressure reading;
 Temperature;



 Height or weight of the human body;
 Rate of bank interest;
 Internal rate of return (IRR),
 Earning ratio (ER);
 Current ratio (CR)
No matter how close two observations might be, if the instrument of measurement is precise enough,
a third observation can be found, falling between the first two.
A continuous variable generally results from measurement and can assume countless values in the
specified range.
Dependent Variables and Independent Variable
In many research settings, two specific classes of variables need to be distinguished from one another:
independent variable and dependent variable.
Many research studies aim to reveal and understand the causes of underlying phenomena or problems
with the ultimate goal of establishing a causal relationship between them.
Look at the following statements:

 Low intake of food causes underweight.


 Smoking enhances the risk of lung cancer.
 Level of education influences job satisfaction.
 Advertisement helps in sales promotion.
 The drug causes improvement of health problems.
 Nursing intervention causes more rapid recovery.
 Previous job experiences determine the initial salary.
 Blueberries slow down aging.
 The dividend per share determines share prices.
In each of the above statements, we have two variables: an independent variable and a dependent variable. In the first example, 'low intake of food' is believed to have caused the 'problem of being underweight.'
It is thus the so-called independent variable. Underweight is the dependent variable because we believe this 'problem' (the problem of being underweight) has been caused by 'the low intake of food' (the factor).
Similarly, smoking, dividend, and advertisement are all independent variables, and lung cancer, job
satisfaction, and sales are dependent variables.
In general, an independent variable is manipulated by the experimenter or researcher, and its effects
on the dependent variable are measured.
Independent Variable
The variable that is used to describe or measure the factor that is assumed to cause or at least to
influence the problem or outcome is called an independent variable.
The definition implies that the experimenter uses the independent variable to describe or explain its
influence or effect of it on the dependent variable.
Variability in the dependent variable is presumed to depend on variability in the independent variable.



Depending on the context, an independent variable is sometimes called a predictor variable, regressor,
controlled variable, manipulated variable, explanatory variable, exposure variable (as used in
reliability theory), risk factor (as used in medical statistics), feature (as used in machine learning and
pattern recognition), or input variable.
The explanatory variable is preferred by some authors over the independent variable when the
quantities treated as independent variables may not be statistically independent or independently
manipulable by the researcher.
If the independent variable is referred to as an explanatory variable, then the term response variable is
preferred by some authors for the dependent variable.
Dependent Variable
The variable used to describe or measure the problem or outcome under study is called a dependent
variable.
In a causal relationship, the cause is the independent variable, and the effect is the dependent variable.
If we hypothesize that smoking causes lung cancer, 'smoking' is the independent variable and cancer the dependent variable.
A business researcher may find it useful to include the dividend in determining the share prices. Here
dividend is the independent variable, while the share price is the dependent variable.
The dependent variable usually is the variable the researcher is interested in understanding,
explaining, or predicting.
In lung cancer research, the carcinoma is of real interest to the researcher, not smoking behavior per
se. The independent variable is the presumed cause of, antecedent to, or influence on the dependent
variable.
Depending on the context, a dependent variable is sometimes called a response variable, regressand,
predicted variable, measured variable, explained variable, experimental variable, responding variable,
outcome variable, output variable, or label.
An explained variable is preferred by some authors over the dependent variable when the quantities
treated as dependent variables may not be statistically dependent.
If the dependent variable is referred to as an explained variable, then the term predictor variable is
preferred by some authors for the independent variable.
Levels of an Independent Variable
If an experimenter compares an experimental treatment with a control treatment, then the independent
variable (a type of treatment) has two levels: experimental and control.
If an experiment were to compare five types of diets, then the independent variables (types of diet)
would have five levels.
In general, the number of levels of an independent variable is the number of experimental conditions.
Background Variable
In almost every study, we collect information such as age, sex, educational attainment, socioeconomic
status, marital status, religion, place of birth, and the like. These variables are referred to as
background variables.



Suppose we find a similar association between work status and duration of breast-feeding in both groups of mothers. In that case, we conclude that mothers' educational level is not a confounding variable.
Intervening Variable
Often an apparent relationship between two variables is caused by a third variable.
For example, variables X and Y may be highly correlated, but only because X causes the third
variable, Z, which in turn causes Y. In this case, Z is the intervening variable.
An intervening variable theoretically affects the observed phenomena but cannot be seen, measured,
or manipulated directly; its effects can only be inferred from the effects of the independent and
moderating variables on the observed phenomena.
We might view motivation or counselling as the intervening variable in the work-status and
breastfeeding relationship.
Thus, motive, job satisfaction, responsibility, behaviour, and justice are some of the examples of
intervening variables.
Suppressor Variable
In many cases, we have good reasons to believe that the variables of interest have a relationship, but
our data fail to establish any such relationship. Some hidden factors may suppress the true relationship
between the two original variables.
Such a factor is referred to as a suppressor variable because it suppresses the relationship between the
other two variables.
The suppressor variable suppresses the relationship by being positively correlated with one of the
variables in the relationship and negatively correlated with the other. The true relationship between
the two variables will reappear when the suppressor variable is controlled for.
Thus, for example, low age may pull education up but income down. In contrast, a high age may pull
income up but education down, effectively cancelling the relationship between education and income
unless age is controlled.
TECHNIQUES OF DATA COLLECTION
Data collection is the process of collecting, measuring and analysing different types of information
using a set of standard validated techniques. The main objective of data collection is to gather
information-rich and reliable data, and analyze them to make critical business decisions. Once the
data is collected, it goes through a rigorous process of data cleaning and data processing to make this
data truly useful for businesses. There are two main methods of data collection in research based on
the information that is required, namely:

 Primary Data Collection


 Secondary Data Collection
1. PRIMARY DATA COLLECTION
Primary data refers to original data collected directly from its source for a specific research or analysis
purpose. This information has not been previously gathered, processed, or interpreted by anyone else.
It is the data that researchers or analysts collect first-hand. Primary data collection methods include
surveys, interviews, experiments, observations, or direct measurements.



Primary data is often contrasted with secondary data, which others have already collected and
analysed for a different purpose. It is valuable because it can be tailored to address specific research
questions or objectives and is typically more reliable and relevant to the study.
For example, if a company conducts a customer satisfaction survey to gather customer feedback, the
responses it collects from the survey would be considered primary data. The company gathers this
data directly for its use and analysis.
Advantages:
Primary data collection has several advantages. Because primary data is collected directly from the people who are experiencing the problem or issue being studied, it is generally more accurate, reliable, and relevant than data that was collected by others for a different purpose.
Secondary sources describe a problem only at second hand and do not allow feedback from the people who are actually experiencing it. By collecting primary data, you are able to gather information from those who are affected by the issue at hand. This allows you to get a better understanding of how they are feeling and what needs to be done in order to address their concerns.
It can also make research more efficient, since a very large number of respondents is not always needed; enough people who have experienced the issue first-hand will often do.
Primary data is also often more relevant because it can consider many aspects of an individual's experience rather than just one. Ultimately, this leads to solutions that reflect respondents' reality accurately and efficiently.
Limitations of Primary data:
1. Time-consuming
Primary data collection can be a lengthy process, from designing data collection tools to analyzing the
results.
2. Expensive
Primary data collection can be costly, requiring resources for material creation, data gathering,
personnel, and data analysis.
3. Limited scope
Primary data collection is usually focused on a specific research question or context, which may limit
the breadth of the data.
4. Bias
The process of collecting primary data can be susceptible to various biases, which can compromise
the data's accuracy and reliability.
5. Ethical, legal, or logistical challenges
There may be challenges in accessing and contacting the target population.
6. Invasive and disruptive



Primary data collection can be invasive and disruptive, often requiring people to take time away from
their normal activities.
7. May not be representative
Primary data may not be representative of the entire audience.
8. Incomplete, inconsistent, or inaccurate
Primary sources can be incomplete, inconsistent, or inaccurate due to gaps, errors, contradictions, or
distortions.
Methods of Primary Data Collection
Primary data collection involves gathering first-hand information directly from the source for specific
research purposes. This process includes various methods, allowing researchers to obtain relevant and
accurate data tailored to their study's objectives.

1. INTERVIEW:
Interviews are a direct method of data collection. It is simply a process in which the interviewer asks
questions and the interviewee responds to them. It provides a high degree of flexibility because
questions can be adjusted and changed anytime according to the situation.
Techniques of Interview:
Here are some common techniques used in research interviews:
Structured interviews:
In a structured interview, the interviewer asks a set of standard, predetermined questions about
particular topics, in a specific order. The respondents need to select their answers from a list of
options. The interviewer may provide clarification on some questions. Structured interviews are typically used in surveys.



Instructions to the informants: The questionnaire should provide necessary instructions to the
informants. For example, it should specify the time within which it should be sent back and the
address to which it should be sent. Instructions necessary to fill up the questions can also be given in
the questionnaire.
Type of answer: As far as possible, the answers to the questions should be of an objective type; that is, 'Yes' or 'No' type questions are most welcome. However, when the alternative is not clear-cut, 'Yes' or 'No' questions should be avoided.
Questions requiring calculations: Questions requiring calculation of ratios, percentages, and totals
should not be asked as it may take much time and the respondents may feel reluctant.
Attraction: A questionnaire should be made to look as attractive as possible. The printing and paper used should be neat and of good quality. Enough space should be left for answering the questions.
QUESTIONNAIRE DESIGN PROCESS:
STEP I: Determine survey objectives, resources and constraints
STEP II: Determine the data collection method
STEP III: Determine the question response format
STEP IV: Decide on the question wording
STEP V: Establish questionnaire flow and layout
STEP VI: Evaluate the questionnaire
STEP VII: Obtain approval of all relevant parties
STEP VIII: Pre-test and revise
STEP IX: Prepare the final copy
STEP X: Implement the survey

TYPES OF QUESTIONNAIRES:
Exploratory questionnaire (qualitative)
Exploratory questionnaires are used to collect qualitative data, that is, information that can be observed and recorded but not expressed in numerical form. They are used to describe and characterize a topic rather than to measure it. An example would be someone giving feedback on a piece of writing: they may comment on the tone, clarity, and word choice, which helps you improve the essay, but you cannot attach a number to that feedback. Exploratory questionnaires are ideal when you are in the early stages of research and need to become familiar with a subject before formulating a solution or hypothesis, for instance in the early stages of product development, when you do not yet know enough about the market.
Formal standardized questionnaire (quantitative)
These are also called structured questionnaires. They are used to collect quantitative data, that is, information recorded as counts or numerical values. The data is quantifiable, which means it can be used for mathematical calculations or statistical analysis. It answers questions such as how much, how many, or how often. An example of quantitative data would be the answer to the question "How old are you?", which requires a numerical response. Standardized questionnaires are best used once you have formed an initial hypothesis or worked out a model for a product; you use them to stress-test your assumptions, designs, and use cases before going further with development. Because of their narrower focus, the questions asked are limited in scope and request detailed information.



Just as important as the questionnaire type are the question types you choose. Not all question types are suitable in every situation, which is why it is important to first understand the kind of questionnaire you are creating. With that knowledge, it becomes easier to pick the right kinds of questions for questionnaire design in business research.
Open-ended questionnaire
As the name states, these questions are open for the respondent to answer with more freedom. Instead of presenting a set of answer choices, the respondent writes as much or as little as they want. Open-ended questions are ideal for exploratory questionnaires that collect qualitative data.
Closed questionnaire
Closed questions structure the response by only permitting answers that fit into pre-chosen categories. Data that can be placed into a category is called nominal data. The categories can be limited to as few as two options, i.e., dichotomous (e.g., 'yes' or 'no,' 'male' or 'female'), or include quite complex sets of options from which the respondent can choose (e.g., multiple choice). Closed questions can also give ordinal data (which can be ranked). This type frequently involves using a rating scale to gauge the strength of attitudes or emotions, and it is useful in business survey questionnaire design. For instance: strongly agree / agree / neutral / disagree / strongly disagree / unable to answer.
Multiple-choice questionnaire
This question type gives the respondent a list of answer options, from which they can choose one or more. The challenge with multiple-choice questions is providing an incomplete set of answer options. For instance, you might ask which industry respondents work in and list five of the most common industries; since there are far more than five industries, some people will not be represented. A simple answer to this problem is adding an "other" option.
Dichotomous questionnaire
A dichotomous question has only two possible answers. It often takes a yes/no form, but it can also be something like agree/disagree or true/false. Use this type when all you need is basic validation without going deeply into the motivations behind the answer.
Scaled questionnaire
Scaled questions are common in questionnaires, and they are mainly used to judge the degree of a
feeling. Both exploratory and standardized questionnaires can be used because there are many
different types of scaled questions such as:

 Rating scale
 Likert scale
 Semantic differential scale
Pictorial questionnaire
Pictorial questions substitute images for text. Respondents are asked a question and choose among pictures as answer options. This format usually achieves a higher response rate than other question types.



The observer as participant: This is someone who takes no part in the activity but whose status as a
researcher is known to the participants. Such a state is aspired to by many researchers using
systematic observation. However, it is questionable whether anyone who is known to be a researcher
can be said not to take part in the activity - in the sense that their role is now one of the roles within
the larger group that includes the researcher.
SECONDARY DATA COLLECTION
The next technique of data collection is secondary data collection, which involves using existing data that was collected by someone else for a different purpose. Researchers analyze and interpret this data to extract relevant information. Secondary data can be obtained from various sources, including:
a. Published Sources: Researchers refer to books, academic journals, magazines, newspapers,
government reports, and other published materials that contain relevant data.
b. Online Databases: Numerous online databases provide access to a wide range of secondary data,
such as research articles, statistical information, economic data, and social surveys.
c. Government and Institutional Records: Government agencies, research institutions, and
organizations often maintain databases or records that can be used for research purposes.
d. Publicly Available Data: Data shared by individuals, organizations, or communities on public
platforms, websites, or social media can be accessed and utilized for research.
e. Past Research Studies: Previous research studies and their findings can serve as valuable secondary
data sources. Researchers can review and analyze the data to gain insights or build upon existing
knowledge.



UNIT – IV
DATA ANALYSIS

Data Analysis
Data analysis is the systematic process of inspecting, cleaning, transforming, and interpreting
data with the objective of discovering valuable insights and drawing meaningful conclusions.
This process involves several steps:
1. Inspecting: Initial examination of data to understand its structure, quality, and
completeness.
2. Cleaning: Removing errors, inconsistencies, or irrelevant information to ensure
accurate analysis.
3. Transforming: Converting data into a format suitable for analysis, such as
normalization or aggregation.
4. Interpreting: Analyzing the transformed data to identify patterns, trends, and
relationships.
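As a rough illustration of these four steps, the following sketch uses the pandas library; the file name and the column names ("income", "region") are hypothetical.

```python
# Illustrative sketch of inspecting, cleaning, transforming, and interpreting
# data with pandas. The file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("survey_responses.csv")

# 1. Inspecting: structure, data types, and completeness
print(df.info())
print(df.describe())

# 2. Cleaning: remove duplicates and fill missing numeric values
df = df.drop_duplicates()
df["income"] = df["income"].fillna(df["income"].median())

# 3. Transforming: normalize a numeric column and aggregate by group
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()
summary = df.groupby("region")["income"].agg(["mean", "count"])

# 4. Interpreting: examine the aggregated pattern for trends
print(summary.sort_values("mean", ascending=False))
```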
Types of Data Analysis Techniques in Research
Data analysis techniques in research are categorized into qualitative and quantitative
methods, each with its specific approaches and tools. These techniques are instrumental in
extracting meaningful insights, patterns, and relationships from data to support informed
decision-making, validate hypotheses, and derive actionable recommendations. Below is an
in-depth exploration of the various types of data analysis techniques commonly employed in
research:
1) Qualitative Analysis:
Definition: Qualitative analysis focuses on understanding non-numerical data, such as
opinions, concepts, or experiences, to derive insights into human behavior, attitudes, and
perceptions.
Content Analysis: Examines textual data, such as interview transcripts, articles, or open-ended survey responses, to identify themes, patterns, or trends.
Narrative Analysis: Analyzes personal stories or narratives to understand individuals' experiences, emotions, or perspectives.
Ethnographic Studies: Involves observing and analyzing cultural practices, behaviors, and
norms within specific communities or settings.
2) Quantitative Analysis:
Quantitative analysis emphasizes numerical data and employs statistical methods to explore
relationships, patterns, and trends. It encompasses several approaches:
a) Descriptive Analysis:
Frequency Distribution: Represents the number of occurrences of distinct values within a
dataset.



Central Tendency: Measures such as mean, median, and mode provide insights into the
central values of a dataset.
Dispersion: Techniques like variance and standard deviation indicate the spread or
variability of data.
b) Diagnostic Analysis:
Regression Analysis: Assesses the relationship between dependent and independent
variables, enabling prediction or understanding causality.
ANOVA (Analysis of Variance): Examines differences between groups to identify significant
variations or effects.
c) Predictive Analysis:
Time Series Forecasting: Uses historical data points to predict future trends or outcomes.
Machine Learning Algorithms: Techniques like decision trees, random forests, and neural
networks predict outcomes based on patterns in data.
d) Prescriptive Analysis:
Optimization Models: Utilizes linear programming, integer programming, or other
optimization techniques to identify the best solutions or strategies.
Simulation: Mimics real-world scenarios to evaluate various strategies or decisions and
determine optimal outcomes.
e) Specific Techniques:
Monte Carlo Simulation: Models probabilistic outcomes to assess risk and uncertainty.
Factor Analysis: Reduces the dimensionality of data by identifying underlying factors or
components.
Cohort Analysis: Studies specific groups or cohorts over time to understand trends,
behaviors, or patterns within these groups.
Cluster Analysis: Classifies objects or individuals into homogeneous groups or clusters
based on similarities or attributes.
Sentiment Analysis: Uses natural language processing and machine learning techniques to
determine sentiment, emotions, or opinions from textual data.
UNIVARIATE ANALYSIS
It involves examining a single variable at a time. This approach allows you to summarize and
describe the distribution of that variable without considering relationships with other factors.
In univariate analysis, you focus on measures such as:
 Central tendency (mean, median, mode)
 Dispersion (range, standard deviation, variance)
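A small illustration of these univariate measures using Python's standard statistics module; the age values are made up for the example.

```python
# Univariate summary of a single variable (age); the values are illustrative.
import statistics

ages = [21, 23, 23, 25, 28, 30, 31, 35, 35, 35, 40]

# Central tendency
print("Mean:  ", statistics.mean(ages))
print("Median:", statistics.median(ages))
print("Mode:  ", statistics.mode(ages))

# Dispersion
print("Range:   ", max(ages) - min(ages))
print("Variance:", statistics.variance(ages))  # sample variance
print("Std dev: ", statistics.stdev(ages))     # sample standard deviation
```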



The first object of measuring central tendency is to determine a single figure which may be used to represent a whole series involving magnitudes of the same variable. The second object is that an average represents the entire data; it facilitates comparison within one group or between groups of data. Thus, the performance of the members of a group can be compared with the average performance of different groups. The third object is that an average helps in computing various other statistical measures such as dispersion, skewness, kurtosis, etc.
Different methods of measuring "central tendency" provide us with different kinds of averages. The following are the main types of averages that are commonly used:
1. Mean
2. Median
3. Mode
1. Arithmetic Mean:
The arithmetic mean is the most commonly used average or measure of central tendency, applicable only in the case of quantitative data; it is also simply called the "mean". The arithmetic mean is defined as the quotient of the sum of the given values and the number of the given values. The arithmetic mean can be computed for both ungrouped data (raw data: data without any statistical treatment) and grouped data (data arranged in tabular form containing different groups).
Pros and Cons of Arithmetic Mean:
Pros:
 It is rigidly defined
 It is easy to calculate and simple to follow
 It is based on all the observations
 It is determined for almost every kind of data
 It is finite and not indefinite
 It is readily used in algebraic treatment
 It is least affected by fluctuations of sampling
Cons:
 The arithmetic mean is highly affected by extreme values
 It cannot average the ratios and percentages properly
 It is not an appropriate average for highly skewed distributions
 It cannot be computed accurately if any item is missing
 The mean sometimes does not coincide with any of the observed values
Formula: Mean (x̄) = ∑fi xi / ∑fi



Calculate the mean height for the following data using the direct method.

Height (in inches) 60 – 62 62 – 64 64 – 66 66 – 68 68 – 70 70 – 72

Frequency 3 6 9 12 8 2

As x̄ = ∑fi xi / ∑fi,
Height (in inches) Frequency(fi) Midpoint (xi) fi × xi

60 – 62 3 61 183

62 – 64 6 63 378

64 – 66 9 65 585

66 – 68 12 67 804

68 – 70 8 69 552

70 – 72 2 71 142

∑fi = 40 ∑fi xi = 2644

⇒ Mean = 2644/40 = 66.1


Thus, mean height is 66.1 inches.
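The direct-method calculation above can be reproduced in a few lines of Python; the midpoints and frequencies are taken from the table.

```python
# Grouped arithmetic mean by the direct method: mean = sum(f*x) / sum(f)
midpoints   = [61, 63, 65, 67, 69, 71]   # class midpoints x_i for 60-62 ... 70-72
frequencies = [3, 6, 9, 12, 8, 2]        # frequencies f_i

total_fx = sum(f * x for f, x in zip(frequencies, midpoints))  # 2644
total_f  = sum(frequencies)                                    # 40

print(total_fx / total_f)   # 66.1 inches
```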

2. Median
The median is that value of the variable which divides the group into two equal parts, one part comprising the values greater than the median and the other all values less than the median. The median of a distribution may be defined as that value of the variable which exceeds and is exceeded by the same number of observations. It is the value such that the number of observations above it is equal to the number of observations below it. Thus, while the arithmetic mean is based on all items of the distribution, the median is a positional average, i.e., it depends upon the position occupied by a value in the frequency distribution. When the items of a series are arranged in ascending or descending order of magnitude, the value of the middle item in the series is known as the median in the case of individual observations.
Symbolically, Median = size of the (n + 1)/2 th item.



If the number of items is even, then there is no value exactly in the middle of the series. In such a situation, the median is arbitrarily taken to be halfway between the two middle items.
Advantages of Median:
1. It is very simple to understand and easy to calculate. In some cases it is
obtained simply by inspection.
2. Median lies at the middle part of the series and hence it is not affected by the
extreme values.
3. It is a special average used in qualitative phenomena like intelligence or
beauty which are not quantified but ranks are given. Thus we can locate the
person whose intelligence or beauty is the average.
4. In a grouped frequency distribution it can be graphically located by drawing ogives.
5. It is especially useful in open-ended distributions, since it is the position rather than the value of the item that matters for the median.
Disadvantages of Median:
1. In simple series, the item values have to be arranged. If the series contains large
number of items, then the process becomes tedious.
2. It is a less representative average because it does not depend on all the items in the
series.
3. It is not capable of further algebraic treatment. For example, we cannot find a
combined median of two or more groups if the median of different groups are given.
4. It is affected more by sampling fluctuations than the mean, as it is concerned with only one item, i.e., the middle item.
5. It is not rigidly defined. In simple series having even number of items, median cannot
be exactly found. Moreover, the interpolation formula applied in the continuous
series is based on the unrealistic assumption that the frequency of the median class is
evenly spread over the magnitude of the class interval of the median group.
Median for Ungrouped or Raw data:
Problem:
The numbers of rooms in the seven five-star hotels in Chennai city are 71, 30, 61, 59, 31, 40 and 29.
Find the median number of rooms.
Solution:
Arrange the data in ascending order 29, 30, 31, 40, 59, 61, 71
n = 7 (odd)
Median = (7 + 1)/2 = 4th positional value
Median = 40 rooms
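The same positional rule can be sketched in Python using the hotel data above.

```python
# Median of a raw series: sort and take the (n + 1)/2-th value when n is odd.
rooms = [71, 30, 61, 59, 31, 40, 29]
rooms.sort()                              # [29, 30, 31, 40, 59, 61, 71]

n = len(rooms)
if n % 2 == 1:
    median = rooms[(n + 1) // 2 - 1]      # 4th value
else:
    median = (rooms[n // 2 - 1] + rooms[n // 2]) / 2

print(median)   # 40 rooms
```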

Median for Discrete grouped data:



3. Mode
The mode is that value of the variable which occurs, or repeats itself, the maximum number of times. The mode is the most "fashionable" size in the sense that it is the most common and typical, and it is defined by Zizek as "the value occurring most frequently in a series of items and around which the other items are distributed most densely." In the words of Croxton and Cowden, the mode of a distribution is the value at the point where the items tend to be most heavily concentrated. According to A.M. Tuttle, the mode is the value which has the greatest frequency density in its immediate neighbourhood. In the case of individual observations, the mode is that value which is repeated the maximum number of times in the series. The value of the mode can also be denoted by the letter z.
Graphic Location of Mode:



Since mode is a positional average it can be located graphically by the following process:
 A histogram of the frequency distribution is drawn.
 In the histogram, the highest rectangle represents the modal class.
 The top left corner of the highest rectangle is joined with the top left corner of the following rectangle, and the top right corner of the highest rectangle is joined with the top right corner of the preceding rectangle.
 From the point of intersection of these two lines, a perpendicular is drawn to the X-axis; the point where it meets the X-axis is the required value of the mode.
Advantages and Disadvantages of Mode:
Advantages:
 It is easy to understand and simple to calculate.
 It is not affected by extremely large or small values.
 It can be located just by inspection in ungrouped data and discrete frequency
distribution.
 It can be useful for qualitative data.
 It can be computed in an open-end frequency table.
 It can be located graphically.
Disadvantages:
 It is not well defined.
 It is not based on all the values.
 It is stable only for a large number of values; it is not well defined if the data consist of a small number of values.
 It is not capable of further mathematical treatment.
 Sometimes the data has one or more than one mode and sometimes the data has no
mode at all.
For Ungrouped or Raw Data:
Problem:
The following are the marks scored by 20 students in the class. Find the mode 90, 70, 50, 30,
40, 86, 65, 73, 68, 90, 90, 10, 73, 25, 35, 88, 67, 80, 74, 46
Solution:
Since the mark 90 occurs the maximum number of times (three times, more than any other value), the mode is 90.
Problem:
The sugar levels of 10 patients checked by a doctor are given below. Find the mode of the sugar levels: 80, 112, 110, 115, 124, 130, 100, 90, 150, 180
Solution:
Since each value occurs only once, there is no mode.
Mode for Continuous data:

The mode or modal value of the distribution is that value of the variate for which the frequency is maximum. It is the value around which the items or observations tend to be most heavily concentrated. For a continuous frequency distribution, the mode is computed by the formula:
Mode = L + [(f1 − f0) / (2f1 − f0 − f2)] × h
where L is the lower boundary of the modal class, f1 is the frequency of the modal class, f0 and f2 are the frequencies of the classes preceding and succeeding the modal class, and h is the width of the modal class.



ANOVA Test
ANOVA Test is used to analyze the differences among the means of various groups using
certain estimation procedures. ANOVA means analysis of variance. ANOVA test is a
statistical significance test that is used to check whether the null hypothesis can be rejected or
not during hypothesis testing.
An ANOVA test can be either one-way or two-way depending upon the number of
independent variables. In this article, we will learn more about an ANOVA test, the one-way
ANOVA and two-way ANOVA, its formulas and see certain associated examples.
What is ANOVA Test?
ANOVA test, in its simplest form, is used to check whether the means of three or more
populations are equal or not. The ANOVA test applies when there are more than two
independent groups. The goal of the ANOVA test is to check for variability within the groups
as well as the variability among the groups. The ANOVA test statistic is the F statistic.
ANOVA Test Definition
ANOVA test can be defined as a type of test used in hypothesis testing to compare whether
the means of two or more groups are equal or not. This test is used to check if the null
hypothesis can be rejected or not depending upon the statistical significance exhibited by the
parameters. The decision is made by comparing the ANOVA test statistic with the critical
value.
ANOVA Test Example
Suppose it needs to be determined if consumption of a certain type of tea will result in a mean
weight loss. Let there be three groups using three types of tea - green tea, earl grey tea, and
jasmine tea. Thus, to compare if there was any mean weight loss exhibited by a certain group,
the ANOVA test (one way) will be used.
Suppose a survey was conducted to check if there is an interaction between income and
gender with anxiety level at job interviews. To conduct such a test a two-way ANOVA will
be used.
ANOVA FORMULA:

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F Value
Between Groups | SSB = Σ nj (X̄j − X̄)² | df1 = k − 1 | MSB = SSB / (k − 1) | F = MSB / MSE (or F = MST / MSE)
Error (Within Groups) | SSE = Σ Σ (Xij − X̄j)² | df2 = N − k | MSE = SSE / (N − k) |
Total | SST = SSB + SSE | df3 = N − 1 | |

where,
 F = ANOVA Coefficient
 MSB = Mean of the total of squares between groupings
 MSW = Mean total of squares within groupings
 MSE = Mean sum of squares due to error
 SST = total Sum of squares
 p = Total number of populations
 n = The total number of samples in a population
 SSW = Sum of squares within the groups
 SSB = Sum of squares between the groups
 SSE = Sum of squares due to error
 s = Standard deviation of the samples
 N = Total number of observations

Examples of the use of ANOVA Formula


Assume it is necessary to assess whether consuming a specific type of tea results in a mean weight decrease. Let three groups use three different varieties of tea: green tea, Earl Grey tea, and jasmine tea. A one-way ANOVA test will be used to examine whether any mean weight decrease was displayed by a particular group.
Assume a survey was undertaken to see whether there is an interaction between salary and gender on stress levels during job interviews. A two-way ANOVA will be used to carry out such a test.
ANOVA Table
An ANOVA (Analysis of Variance) test table is used to summarize the results of an ANOVA
test, which is used to determine if there are any statistically significant differences between
the means of three or more independent groups. Here is a general structure of an ANOVA table:

Types of ANOVA Formula
One-Way ANOVA
This test is used to see whether there is a variation in the mean values of three or more groups. Such a test is used where the data set has only one independent variable. If the test statistic exceeds the critical value, the null hypothesis is rejected, and the averages of at least two of the groups differ in a statistically significant way.
Two-Way ANOVA
Two independent variables are used in the two-way ANOVA. As a result, it can be viewed as
an extension of a one-way ANOVA in which only one variable influences the dependent
variable. A two-way ANOVA test is used to determine the main effect of each independent
variable and whether there is an interaction effect. Each factor is examined independently to
determine the main effect, as in a one-way ANOVA. Furthermore, all components are
analyzed at the same time to test the interaction impact.
Solved Examples on ANOVA Formula
Example 1: Three different kinds of food are tested on three groups of rats for 5 weeks. The
objective is to check the difference in mean weight(in grams) of the rats per week. Apply
one-way ANOVA using a 0.05 significance level to the following data:

Food I Food II Food III

8 4 11

12 5 8

Using a chi-square distribution table, the critical value for α = 0.05 and 2 degrees of freedom is 5.99.
Since the calculated test statistic (8.8) is greater than the critical value (5.99), we reject the null hypothesis.
Conclusion:
There is a significant difference in weight loss across the three diets at the 0.05 significance level.
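As a hedged sketch, a one-way ANOVA like the food example above can be computed with SciPy's f_oneway; the weight values below are illustrative stand-ins, not the full data of the worked example.

```python
# One-way ANOVA with SciPy; the three groups hold illustrative weight values.
from scipy import stats

food_1 = [8, 12, 19, 8, 6, 11]
food_2 = [4, 5, 4, 6, 9, 7]
food_3 = [11, 8, 7, 13, 7, 9]

f_stat, p_value = stats.f_oneway(food_1, food_2, food_3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Decision rule at the 0.05 significance level
if p_value < 0.05:
    print("Reject the null hypothesis: at least two group means differ.")
else:
    print("Fail to reject the null hypothesis.")
```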
CLUSTER ANALYSIS
Cluster analysis is a statistical technique used to group similar objects or observations into clusters or
categories based on their characteristics or features. The goal of cluster analysis is to identify natural
groupings within the data, where objects in the same group are more similar to each other than to
objects in other groups.
Applications of Cluster Analysis:
Cluster analysis is widely used across various fields such as:
 Marketing: Segmenting customers based on purchasing behavior.
 Biology: Grouping species based on genetic or phenotypic traits.
 Social Sciences: Identifying patterns in human behavior or attitudes.
 Healthcare: Classifying patients based on symptoms for disease diagnosis.
Types of Cluster Analysis Methods
1. Hierarchical Clustering:
o Agglomerative (bottom-up): Starts with each observation as its own cluster and then
progressively merges the closest clusters.
o Divisive (top-down): Starts with all observations in one cluster and then divides them
into smaller clusters.
o Dendrogram: A tree-like diagram used to visualize hierarchical clustering.
2. Partitioning Clustering:
o Divides the data into a predetermined number of clusters.
o K-Means Clustering: One of the most common partitioning methods. It minimizes
the variance within each cluster by repeatedly adjusting the cluster centers (centroids)
and reassigning points to the closest centroid.
o K-Medoids (or PAM): Similar to K-Means, but instead of centroids, it uses actual
data points (medoids) as cluster centers.
3. Density-Based Clustering:
o Clusters are defined as dense regions of points separated by areas of low density.
o DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups
together closely packed points and marks points that are isolated as noise.

4. Model-Based Clustering:
o Assumes that the data is generated from a mixture of underlying probability
distributions (e.g., Gaussian distributions).
o Expectation-Maximization (EM): A popular model-based approach that estimates
the parameters of the probability distributions to find the clusters.
5. Fuzzy Clustering:
o Allows each data point to belong to more than one cluster, assigning a membership
value to each cluster.
o Fuzzy C-Means (FCM): Similar to K-Means but allows for soft clustering by giving
partial membership to data points.
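As a brief illustration of partitioning clustering, the sketch below runs K-Means with scikit-learn; the two-dimensional points and the choice of three clusters are made up for the example.

```python
# K-Means partitioning with scikit-learn on illustrative 2-D points.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],    # group around (1, 2)
              [8.0, 8.5], [8.3, 8.0], [7.8, 9.0],    # group around (8, 8.5)
              [0.5, 9.0], [1.0, 8.5], [0.8, 9.2]])   # group around (1, 9)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment of each point
print(kmeans.cluster_centers_)  # the three centroids
```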
Steps in Cluster Analysis:
Data Pre-processing:
Standardize the data if the variables are on different scales to ensure that no variable dominates the
distance calculation.
Handle missing values and decide whether to remove outliers or treat them in a special way.
Choosing a Clustering Method:
Based on the type of data and the desired outcome, select a clustering algorithm (e.g., K-Means,
hierarchical, DBSCAN).
Define Similarity Measure:
Use a similarity or distance measure to determine how close or far apart the data points are from each
other.
Common distance measures:
Euclidean distance: Measures the straight-line distance between two points.
Manhattan distance: Measures the sum of the absolute differences between two points.
Cosine similarity: Measures the angle between two vectors, often used for text or high-dimensional
data.
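The three measures listed above can be computed with SciPy, for example; the two points are illustrative.

```python
# Common distance/similarity measures between two illustrative points.
from scipy.spatial import distance

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

print("Euclidean distance:", distance.euclidean(a, b))   # straight-line distance
print("Manhattan distance:", distance.cityblock(a, b))   # sum of absolute differences
print("Cosine similarity: ", 1 - distance.cosine(a, b))  # SciPy returns cosine *distance*
```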
Determine the Number of Clusters:
For methods like K-Means, the number of clusters k needs to be specified in advance. Common techniques for determining k include:
Elbow Method: Plots the explained variance as a function of the number of clusters and looks for the
"elbow point."
Silhouette Analysis: Measures how similar an object is to its own cluster compared to other clusters.
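A short sketch of both techniques with scikit-learn; the data points and the range of k values are illustrative assumptions.

```python
# Comparing candidate values of k with inertia (elbow method) and silhouette scores.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 8.0], [7.8, 9.0],
              [0.5, 9.0], [1.0, 8.5], [0.8, 9.2]])

for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}  inertia={km.inertia_:.2f}  "
          f"silhouette={silhouette_score(X, km.labels_):.2f}")
# Look for the "elbow" where inertia stops dropping sharply and
# for the k with the highest silhouette score.
```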
Perform Clustering:
Apply the chosen clustering algorithm and generate the clusters.
Evaluate Clustering Results:

The CFA results confirm that the survey's items are valid measures of two distinct but related
constructs: Job Satisfaction and Work-Life Balance. The strong factor loadings and good fit indices
provide evidence that the hypothesized model fits the data well.
STRUCTURAL EQUATION MODELING (SEM)
Structural Equation Modelling (SEM) is a comprehensive statistical technique that allows for the
examination of complex relationships among observed and latent (unobserved) variables. It combines
elements of factor analysis and multiple regression to estimate interrelated dependencies and test
theoretical models.
SEM is used to:
 Model complex relationships between variables.
 Test theories or hypotheses about how variables are related.
 Confirm models (e.g., confirmatory factor analysis) or explore relationships (e.g., path
analysis).
Advantages of SEM:
Simultaneous Estimation: SEM allows for the simultaneous estimation of multiple relationships
between variables, unlike traditional regression models.
Latent Variables: SEM can handle both observed and latent variables, offering a more comprehensive
view of constructs that are not directly measurable.
Mediation and Moderation: SEM is excellent for testing complex relationships, including mediation
and moderation effects.
Theory Testing: SEM is primarily a confirmatory technique, allowing researchers to test theoretical
models with empirical data.
Concepts in SEM:
Observed Variables:
These are directly measured variables, often referred to as indicators. In surveys or tests, they are the
responses or data points that researchers directly collect.
Latent Variables:
Unobserved variables that are inferred from observed variables. These represent underlying
theoretical constructs (e.g., intelligence, satisfaction) that cannot be measured directly.
Measurement Model:
This portion of the SEM specifies how latent variables are measured by observed variables. It is
equivalent to a confirmatory factor analysis (CFA).
Structural Model:
The structural model specifies the relationships between latent variables, often through regression-like
equations. This is the core of SEM, showing how latent variables influence each other.
Path Diagram:

SEM models are often represented graphically using path diagrams. Variables are represented by
circles (for latent variables) or squares (for observed variables), and arrows indicate the direction of
influence.
Factor Loadings:
The coefficients that describe the relationships between latent variables and their observed indicators
in the measurement model. High factor loadings mean the observed variable is a strong indicator of
the latent variable.
Direct and Indirect Effects:
SEM allows for the estimation of both direct effects (relationships between two variables with a single
arrow) and indirect effects (the effect of one variable on another through one or more mediators).
Model Fit:
SEM provides several fit indices to evaluate how well the proposed model fits the observed data. Key
fit indices include the Chi-Square, RMSEA (Root Mean Square Error of Approximation), CFI
(Comparative Fit Index), and TLI (Tucker-Lewis Index).
Example of SEM
Imagine we want to test the relationship between Job Satisfaction (latent variable) and Job
Performance (latent variable), with Work-Life Balance (latent variable) acting as a mediator.
 Observed Variables for Job Satisfaction: Satisfaction with pay, opportunities for growth,
relationships with colleagues.
 Observed Variables for Work-Life Balance: Time for personal activities, flexible work
schedule.
 Observed Variables for Job Performance: Quality of work, punctuality, meeting deadlines.
Hypothesized Model:
 Job Satisfaction influences Work-Life Balance.
 Work-Life Balance mediates the effect of Job Satisfaction on Job Performance.
 Job Satisfaction also has a direct effect on Job Performance.
In this model:
 Direct effect: From Job Satisfaction to Job Performance.
 Indirect effect: From Job Satisfaction to Job Performance through Work-Life Balance.
Path Diagram:
 Latent variables (Job Satisfaction, Work-Life Balance, and Job Performance) are represented
by circles.
 Observed variables (e.g., Satisfaction with pay) are represented by squares.
 Arrows indicate hypothesized causal relationships.
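As a tentative sketch only, the hypothesized model could be specified in Python with the third-party semopy package, which accepts lavaan-style model syntax; the indicator names, the data file, and the exact calls are assumptions that may need adjusting to the installed version.

```python
# Hedged sketch of the mediation model using semopy (lavaan-style syntax).
# Column names and the CSV file are hypothetical.
import pandas as pd
from semopy import Model

model_desc = """
JobSatisfaction =~ pay + growth + colleagues
WorkLifeBalance =~ personal_time + flexible_schedule
JobPerformance  =~ quality + punctuality + deadlines
WorkLifeBalance ~ JobSatisfaction
JobPerformance  ~ WorkLifeBalance + JobSatisfaction
"""

df = pd.read_csv("survey_data.csv")   # hypothetical dataset of observed indicators
sem = Model(model_desc)
sem.fit(df)
print(sem.inspect())                  # parameter estimates, including the indirect path
```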
Software for SEM:

Several software packages can perform SEM:
1. AMOS (Analysis of Moment Structures): User-friendly with a drag-and-drop interface for
building path diagrams.
2. LISREL: One of the oldest SEM programs, favored for advanced users.
3. Mplus: Offers a range of SEM features and is highly versatile.
4. lavaan package in R: A popular SEM tool for R users, offering flexibility and open-source
accessibility.
MULTIPLE DISCRIMINANT ANALYSIS (MDA):
Multiple Discriminant Analysis (MDA) is a classification and dimensionality reduction technique
used to distinguish between two or more groups of objects or individuals based on several predictor
variables. It is a generalization of linear discriminant analysis (LDA) and is primarily used in cases
where the dependent variable is categorical and the independent variables are continuous.
MDA is used when:

 You have multiple groups (typically 2 or more) to classify.


 You want to understand how different predictor variables discriminate between these groups.
Concepts in MDA:
Dependent (Categorical) Variable:

The variable that indicates group membership. For example, if you are studying customers' buying
behavior, the dependent variable could be whether they purchase a product (Yes/No), or which
product category they belong to.
Independent (Continuous) Variables:
The set of continuous variables that are used to predict group membership. For instance, income, age,
or spending habits could serve as predictor variables.
Discriminant Function:
The core of MDA is the discriminant function, which is a linear combination of the predictor variables. It takes the form:
D = b0 + b1X1 + b2X2 + … + bnXn
where D is the discriminant score, b0 is a constant, and b1 … bn are the discriminant coefficients (weights) applied to the predictor variables X1 … Xn.
Discriminant Coefficients:

These coefficients (or weights) represent the contribution of each predictor variable to the
discriminant function. Larger absolute values suggest that the variable contributes more to
distinguishing between groups.
Centroids:
The mean values of the discriminant scores for each group. These help determine which group an
observation belongs to by comparing its discriminant score to the group centroids.
Classification:
Once the discriminant function is created, new cases can be classified into one of the predefined
groups based on their discriminant scores.
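As a hedged illustration, scikit-learn's LinearDiscriminantAnalysis can play the role of MDA when several continuous predictors are used to separate more than two groups; the predictor values, group labels, and the new observation are invented for the example.

```python
# Discriminant analysis with scikit-learn on invented customer data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Predictors: [income (thousands), age, monthly spending]
X = np.array([[30, 25, 210], [35, 30, 190], [32, 27, 240],   # group 0
              [60, 45, 520], [65, 50, 480], [62, 48, 530],   # group 1
              [90, 40, 910], [95, 42, 960], [92, 38, 900]])  # group 2
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])                    # group membership

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

print(lda.coef_)                     # weights showing each predictor's contribution
print(lda.predict([[55, 35, 480]]))  # classify a new observation into a group
```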
Advantages of MDA:
Multivariate: MDA allows for the simultaneous consideration of multiple independent variables to
classify individuals or objects into groups.
Interpretability: MDA provides a linear discriminant function that is easy to interpret, showing the
contribution of each variable to the discrimination between groups.
Predictive Power: MDA can be used to predict group membership for new observations based on the
discriminant function.
Handling of Multiple Groups: MDA can handle more than two groups, unlike simple LDA, which is
limited to binary classification.
Limitations of MDA:
Strict Assumptions: The assumption of multivariate normality and homogeneity of variance-
covariance matrices can be difficult to meet in real-world data.
Sensitive to Outliers: Outliers can distort the results of MDA by significantly affecting the
discriminant function.
Linear Relationships: MDA assumes linear relationships between the independent variables and group
membership, which may not always hold.
Multicollinearity: High correlations between independent variables can lead to unreliable discriminant
functions.

