Data Preparation Stage
• Do you have all the data needed to solve the problem?
• Is there a sample bias?
• What is the sample size, and is it large enough for use?
• Are there any extreme values? Are they outliers, i.e. extreme data
points located along the general trend?
• Are there any erroneous values? Are they isolated? E.g. errors in
units, coordinate errors, typos etc.
• Are there any visible trends in the data set?
• Are there any missing samples?
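The checks above can be partly automated. The following is a minimal sketch, not a prescribed workflow: the function name `screen_samples`, the plausible-range test for erroneous values and the "more than `z_limit` standard deviations from the mean" rule for extreme values are all assumptions made for illustration, and the porosity values are hypothetical.

```python
import math
import statistics

def screen_samples(values, valid_range, z_limit=3.0):
    """Basic data-preparation screen for one variable (illustrative only).

    values      : list of floats; None or NaN marks a missing sample
    valid_range : (low, high) physically plausible bounds chosen by the user
    z_limit     : flag plausible values further than this many standard
                  deviations from the mean (an assumed rule of thumb)
    Assumes at least one plausible value is present.
    """
    low, high = valid_range
    present = [v for v in values if v is not None and not math.isnan(v)]
    erroneous = [v for v in present if not low <= v <= high]
    plausible = [v for v in present if low <= v <= high]
    mean = statistics.mean(plausible)
    sd = statistics.pstdev(plausible)
    extreme = [v for v in plausible if abs(v - mean) > z_limit * sd]
    return {
        "n_total": len(values),
        "n_missing": len(values) - len(present),
        "erroneous": erroneous,
        "extreme": extreme,
    }

# Hypothetical porosity samples in percent: one missing, one impossible value.
porosity_pct = [12.0, 15.0, None, 14.0, 13.5, 150.0, 16.0, 14.5, 13.0, 35.0]
report = screen_samples(porosity_pct, valid_range=(0.0, 100.0), z_limit=2.0)
print(report)
```

Flagged values still need a manual decision: an extreme value may be a genuine high-porosity sample rather than an error.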
EXPLORATORY DATA ANALYSIS: Univariate Analysis, Bivariate Analysis, Spatial Analysis
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is not solely geostatistical; however, it
is a prerequisite for ensuring data integrity and is the first critical step
in reservoir modelling.
EDA consists of:
• Checking the data for errors.
• Calculating the descriptive statistics (univariate statistics).
• Identifying relationships between two (bivariate stats) or more
(multivariate stats) variables.
• Looking for overall trends or anomalies in the data and the degree of
continuity.
• Describing or predicting how a variable will change in space or from
one location to another.
• Describing any drift or trend in the data set, any isotropy or anisotropy.
Univariate Analysis
Consists of profiling the data by calculating
traditional statistical descriptors such as mean,
mode, median, std. deviation, variance etc.
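As a minimal sketch of such a profile, the descriptors can be computed with Python's standard `statistics` module. The porosity values are hypothetical, and the population (divide-by-n) forms of variance and standard deviation are an assumption.

```python
import statistics

# Hypothetical porosity samples (fraction); values are illustrative only.
porosity = [0.12, 0.15, 0.15, 0.18, 0.21, 0.22, 0.15, 0.30]

summary = {
    "mean": statistics.mean(porosity),
    "median": statistics.median(porosity),
    "mode": statistics.mode(porosity),
    "variance": statistics.pvariance(porosity),  # population variance (divide by n)
    "std_dev": statistics.pstdev(porosity),      # population standard deviation
}
for name, value in summary.items():
    print(f"{name:>8}: {value:.4f}")
```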
Bivariate Analysis
Consists of examining the relationship
between two or more variables with
methods such as linear regression, the
correlation coefficient, cluster analysis etc.
Spatial Analysis
• Petrophysical properties such as permeability (k), porosity (φ) and
water saturation (Sw) are distributed anisotropically within
the depositional environment.
• This principle is not addressed adequately
in most computer-based interpolation
algorithms.
• Semi-variograms are used to identify and
quantify anisotropic behaviour in data.
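To illustrate how a semi-variogram can expose anisotropy, the sketch below computes an experimental semi-variogram, γ(h) = Σ(zi − zj)² / (2N(h)), separately along two directions. It is a simplified illustration, not a production routine: the function name, the lag-binning rule, the angular tolerance and the gridded data are all assumptions.

```python
import math

def directional_semivariogram(points, azimuth_deg, lag, n_lags, tol_deg=5.0):
    """Experimental semi-variogram along one direction (illustrative sketch).

    points      : list of (x, y, value) tuples
    azimuth_deg : direction, measured clockwise from north (+y axis)
    lag, n_lags : lag spacing and number of lags; bin k is centred on (k+1)*lag
    tol_deg     : angular tolerance either side of the azimuth
    Returns {lag distance: gamma} for every bin holding at least one pair.
    """
    az = math.radians(azimuth_deg)
    ux, uy = math.sin(az), math.cos(az)  # unit vector of the search direction
    cos_tol = math.cos(math.radians(tol_deg))
    sq_diff = [0.0] * n_lags
    count = [0] * n_lags
    for i in range(len(points)):
        xi, yi, zi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj, zj = points[j]
            dx, dy = xj - xi, yj - yi
            h = math.hypot(dx, dy)
            if h == 0.0:
                continue
            # abs() so a pair is accepted whichever way round it is taken
            if abs(dx * ux + dy * uy) / h < cos_tol:
                continue
            k = int(h / lag + 0.5) - 1  # nearest lag multiple
            if 0 <= k < n_lags:
                sq_diff[k] += (zi - zj) ** 2
                count[k] += 1
    return {(k + 1) * lag: sq_diff[k] / (2 * count[k])
            for k in range(n_lags) if count[k] > 0}

# Hypothetical 5 x 5 grid whose value changes only in the east-west direction.
grid = [(float(x), float(y), float(x)) for x in range(5) for y in range(5)]
gamma_east = directional_semivariogram(grid, azimuth_deg=90.0, lag=1.0, n_lags=3)
gamma_north = directional_semivariogram(grid, azimuth_deg=0.0, lag=1.0, n_lags=3)
print("east :", gamma_east)
print("north:", gamma_north)
```

In this constructed example the semi-variogram rises with lag in the east direction and stays at zero in the north direction, which is the signature of anisotropic behaviour.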
BIVARIATE ANALYSIS
• Quantile-Quantile (Q-Q) Plots
• Scatter-plots
• Correlation Coefficient (ρ)
• Rank Correlation Coefficient (ρrank)
• Covariance
• Linear Regression
• Conditional Expectation
Q-Q Plot
• Do the data sets come from populations with a common distribution?
• Do they have common location and scale?
• Do they have similar distributional shapes?
• Do they have similar tail behaviour?
Q-Q Plots
• A Q-Q plot is a graphical technique for
determining if two data sets come from
populations with a common distribution.
• It is a plot of the quantiles of the 1st data set
against the quantiles of the 2nd data set.
• The quantile values are obtained from the semi-
log cumulative frequency plot (normal probability
plot) of each variable.
• A quantile corresponds to a certain percentile.
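As a rough sketch of the pairing step, the quantiles of two data sets can be matched up numerically. Note the swap: instead of reading quantiles from a cumulative frequency plot, this example computes them directly with `statistics.quantiles`. The function name `qq_pairs`, the choice of deciles and the data are assumptions.

```python
import statistics

def qq_pairs(x, y, n=10):
    """Return (x-quantile, y-quantile) pairs at the n-1 interior cut points."""
    qx = statistics.quantiles(x, n=n, method="inclusive")
    qy = statistics.quantiles(y, n=n, method="inclusive")
    return list(zip(qx, qy))

# Hypothetical data: v is a shifted and rescaled copy of u.
u = list(range(1, 22))
v = [2 * value + 5 for value in u]

pairs = qq_pairs(u, v)
for q_u, q_v in pairs:
    print(f"Qu = {q_u:5.1f}   Qv = {q_v:5.1f}")
```

Plotting these pairs gives the Q-Q plot. The two data sets need not have the same number of samples, since only their quantiles are compared.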
• A Q-Q plot is used to compare the shapes
of distributions, providing a graphical view
of parameters such as centre of location
(mode, mean, median), scale and spread,
and how they are similar or different in the
two distributions.
A 45° reference line (y = x) is also plotted. If the two sets come from a
population with the same distribution, the points should fall
approximately along this reference line.
Points lying above the 45° line indicate that the y-distribution
values are higher than the corresponding x-distribution values, while
points lying below the 45° line indicate that the x-distribution values are
higher than the y-distribution values.
If the q-q plot produces a straight line other than y=x, then the two
distributions have the same shape, but their centre of location and
spread differ.
A slope m > 1 indicates that σ²y > σ²x.
A slope m < 1 indicates that σ²x > σ²y.
A curved q-q plot indicates that the two distributions have a different
shape.
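The link between the slope of a straight Q-Q line and the two variances can be checked with a short derivation. It assumes, purely for illustration, that the y-values are an exact linear rescaling of the x-values, Y = mX + c with m > 0. Here q(p) denotes the quantile at cumulative probability p, and m with a subscript denotes a mean while m alone is the slope.

```latex
\begin{align*}
q_Y(p) &= m\,q_X(p) + c
  && \text{every quantile is scaled and shifted the same way} \\
m_Y &= m\,m_X + c, \qquad \sigma_Y^2 = m^2\,\sigma_X^2
  && \text{mean and variance of the rescaled variable} \\
m > 1 &\;\Rightarrow\; \sigma_Y^2 > \sigma_X^2, \qquad
0 < m < 1 \;\Rightarrow\; \sigma_Y^2 < \sigma_X^2
\end{align*}
```

The Q-Q plot is then a straight line of slope m and intercept c, and it coincides with y = x only when m = 1 and c = 0.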
[Figure: Normal Probability Plot of Var. V and Normal Probability Plot of Var. U. Cumulative Frequency (y-axis) against UCB (x-axis, logarithmic scale).]
[Figure: Q-Q Plot of Qu (y-axis) against Qv (x-axis).]
Scatter-plots
Scatter plots show the relationship between two variables by
displaying data points on a 2-D graph. The explanatory variable is plotted
on the x-axis, while the response variable is plotted on the y-axis.
They provide the following information about the relationship between two variables:
• Strength of the relationship – represented by how closely the data
points cluster about the overall trend.
• Shape – linear, quadratic, polynomial, etc.
• Direction – positive or negative.
• Presence of outliers – aberrant or anomalous data points.
Scatter plots usually consist of a large body of data. The closer the
data points come to making a straight line when plotted, the higher the
correlation between the two variables, or the stronger the relationship.
Scatter-plot with regression line
If there appears to be a linear relationship from the scatter-plot, then
a regression line may be used to model the relationship. The
regression line is a straight line of best fit drawn using the “least
squares method”. The “red” sample points represent outliers.
Correlation Coefficient (ρ)
Once the scatter plot has established that there is a linear correlation between
the two sets of data, the correlation coefficient measures the "strength" of this
correlation, i.e. how close the data points are to the linear regression line.
Values range from -1 to +1.
The correlation coefficient is affected by aberrant pairs of data.
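The statistical formula is:
$$\rho = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - m_x)(y_i - m_y)}{\sigma_x\,\sigma_y}$$
n is the number of data pairs (xi, yi).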
mx & my are the means of the x & y variables respectively.
σx & σy are the standard deviations of the x & y variables respectively.
ρ = 1 indicates perfect positive linear correlation.
ρ = -1 indicates perfect negative linear correlation.
ρ = 0 indicates no linear correlation.
[Figure: example scatter plots for ρ = -1, ρ = -0.8, ρ = -0.3, ρ = 0, ρ = +0.3, ρ = +0.8 and ρ = +1.]
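A minimal sketch of the calculation follows, using the mean of the products of deviations divided by the product of the standard deviations. The divide-by-n form is an assumption, and the data are made up to hit the three limiting cases.

```python
import math

def pearson(x, y):
    """Correlation coefficient from population (divide-by-n) statistics."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # exact increasing straight line
y_down = [10, 8, 6, 4, 2]   # exact decreasing straight line
y_curve = [4, 1, 0, 1, 4]   # y = (x - 3)^2: strong but non-linear relationship

rho_up = pearson(x, y_up)
rho_down = pearson(x, y_down)
rho_curve = pearson(x, y_curve)
print(rho_up, rho_down, rho_curve)
```

The last case shows why ρ = 0 should be read as "no linear correlation": y is fully determined by x, yet the linear measure detects nothing.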
Covariance
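Measures the extent to which the observed x and y values vary
together about their respective means.
$$C_{xy} = \frac{1}{n}\sum_{i=1}^{n}(x_i - m_x)(y_i - m_y)$$
xi represents the observed x value.
yi represents the observed y value.
mx represents the mean of the x values.
my represents the mean of the y values.
n is the number of data pairs.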
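The sketch below computes the covariance as the mean product of the deviations of x and y from their means. Dividing by n rather than n - 1 is an assumption, and the porosity and permeability values are hypothetical. It also shows that the covariance carries the units of both variables, so its magnitude changes when the units change.

```python
def covariance(x, y):
    """Mean product of deviations from the means (divide-by-n form)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

# Hypothetical core data: porosity (fraction) and permeability (mD).
porosity_frac = [0.10, 0.14, 0.18, 0.22, 0.26]
perm_md = [5.0, 20.0, 60.0, 150.0, 400.0]

# Same porosity expressed in percent instead of as a fraction.
porosity_pct = [100.0 * p for p in porosity_frac]

c_frac = covariance(porosity_frac, perm_md)
c_pct = covariance(porosity_pct, perm_md)
print(f"covariance with porosity as a fraction: {c_frac:.2f}")
print(f"covariance with porosity in percent   : {c_pct:.2f}")
```

Because of this unit dependence, the covariance is usually standardised by σx and σy, which gives the correlation coefficient ρ.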
Rank correlation coefficient (ρrank)
The rank correlation coefficient measures the strength of the linear
relationship between the rankings of two variables.
ρrank is far less sensitive to aberrant pairs of data than ρ, hence a large
difference between ρ and ρrank suggests the presence of outliers in the data set.
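The statistical formula is:
$$\rho_{rank} = \frac{\frac{1}{n}\sum_{i=1}^{n}(Rx_i - m_{Rx})(Ry_i - m_{Ry})}{\sigma_{Rx}\,\sigma_{Ry}}$$
n is the number of data pairs.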
Rxi is the rank of xi among all the x values.
Ryi is the rank of yi among all the y values.
mRx is the mean of all the ranked x values.
mRy is the mean of all the ranked y values.
σRx is the std. deviation of all the ranked x values.
σRy is the std. deviation of all the ranked y values.
If ρrank > ρ, then a few outliers are spoiling an
otherwise good correlation.
If ρrank < ρ, then a few outliers are enhancing an
otherwise poor correlation.
If ρrank = ρ, then there are not many outliers.
If ρrank = 1, then a non-linear transform of one
covariate can make ρ = 1.
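The behaviour of ρrank can be illustrated with a small sketch that ranks each variable and then applies the ordinary correlation formula to the ranks. The helper names, the average-rank treatment of ties and the data are assumptions. Case 1 has one aberrant pair; case 2 is an exact but non-linear relationship.

```python
import math

def pearson(x, y):
    """Correlation coefficient from population (divide-by-n) statistics."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result

def rank_correlation(x, y):
    """Correlation coefficient of the ranks of x and y."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6, 7, 8]

# Case 1: near-linear data with one aberrant pair at the end.
y_outlier = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.0, 80.0]
rho_1, rho_rank_1 = pearson(x, y_outlier), rank_correlation(x, y_outlier)
print(f"outlier case: rho = {rho_1:.3f}, rho_rank = {rho_rank_1:.3f}")

# Case 2: exact exponential relationship, then a log transform of y.
y_curved = [math.exp(v) for v in x]
rho_2, rho_rank_2 = pearson(x, y_curved), rank_correlation(x, y_curved)
rho_2_log = pearson(x, [math.log(v) for v in y_curved])
print(f"curved case : rho = {rho_2:.3f}, rho_rank = {rho_rank_2:.3f}, "
      f"rho after log transform = {rho_2_log:.3f}")
```

In case 1, ρrank exceeds ρ because the single outlier spoils an otherwise good linear correlation. In case 2, ρrank = 1 and a logarithmic transform of y brings ρ to 1.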
Linear Regression
This is a technique used to develop an equation (a
linear regression line) for predicting a value of the
dependent variable given a value of the independent
variable.
The regression equation of Y on X is given by:
Y = a+bX
• X is the independent variable.
• Y is the dependent variable.
• a is the intercept.
• b is the slope of the line.
Statistical formulae for a & b:
b = ρ(σy/σx) & a = my - b*mx.
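As a minimal sketch, the least-squares line Y = a + bX can be obtained from the summary statistics, with slope b = ρ(σy/σx) and intercept a = my - b·mx. The function name `fit_line` and the data are hypothetical.

```python
import math

def fit_line(x, y):
    """Return (a, b) for Y = a + b*X from means, std. deviations and rho."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / n)
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / n)
    rho = sum((p - mx) * (q - my) for p, q in zip(x, y)) / (n * sx * sy)
    b = rho * sy / sx   # slope
    a = my - b * mx     # intercept: the line passes through (mx, my)
    return a, b

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(x, y)
print(f"Y = {a:.2f} + {b:.2f} X")
print(f"predicted Y at X = 6: {a + b * 6:.2f}")
```

The line is fitted for predicting Y from X. Predicting X from Y requires the regression of X on Y, which is in general a different line.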
Conditional Expectation
A conditional expectation curve can be used
to describe a non-linear relationship between
two variables.
The conditional expectation curve allows us
to predict the mean value of one variable from a
corresponding class of known values of
another variable.
The procedure involves calculating the mean of y (my)
for different ranges (classes) of x.
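The procedure can be sketched as follows: split x into classes, then average the y values falling in each class. The function name, the class edges and the data are assumptions chosen for illustration.

```python
def conditional_expectation(x, y, edges):
    """Mean of y within each class of x defined by consecutive edges [lo, hi).

    Returns a list of (class midpoint, mean of y, number of samples),
    skipping classes that contain no samples.
    """
    curve = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        ys = [yi for xi, yi in zip(x, y) if lo <= xi < hi]
        if ys:
            curve.append(((lo + hi) / 2, sum(ys) / len(ys), len(ys)))
    return curve

# Hypothetical data with a clearly non-linear trend.
x = [0.2, 0.8, 1.1, 1.9, 2.3, 2.7, 3.4, 3.6]
y = [1, 3, 4, 8, 16, 20, 50, 54]

curve = conditional_expectation(x, y, edges=[0, 1, 2, 3, 4])
for midpoint, mean_y, n in curve:
    print(f"x class centred on {midpoint}: mean y = {mean_y} (n = {n})")
```

Joining the (midpoint, mean y) points gives the conditional expectation curve. Narrower classes follow the trend more closely but leave fewer samples per class, so each mean becomes less reliable.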
Case Study
Optimizing Recovery Factors
• The first step on the road to determining appropriate
algorithms for attaining improved recovery factors (RFs)
is to run an EDA that uses both graphical and
quantitative techniques.
By default, a 95% bivariate normal density
ellipse is imposed on each scatter plot.
Thus, it can be noted that the RF has a strong
correlation with the OOIP.
A small significance level should be specified.
It can readily be seen that the recovery factor
has its strongest correlations with OOIP and
porosity, with Pearson correlation values of
0.7509 and 0.6089 respectively.