R (and S-PLUS) Manual to Accompany Agresti's Categorical Data Analysis (2002), 2nd edition
Laura A. Thompson, 2009
Table of Contents

Introduction and Changes from First Edition .................... 1
A. Obtaining the R Software for Windows .................... 1
B. Libraries in S-PLUS and Packages in R .................... 1
C. Setting contrast types using options() .................... 3
D. Credit for functions .................... 3
E. Editing functions .................... 3
F. A note about using S-PLUS Menus .................... 4
G. Notice of errors .................... 4
H. Introductions to the S Language .................... 4
I. References .................... 4
J. Acknowledgements .................... 5
Chapter 1: Distributions and Inference for Categorical Data .................... 6
A. Summary of Chapter 1, Agresti .................... 6
B. Categorical Distributions in S-PLUS and R .................... 6
C. Proportion of Vegetarians (Statistical Inference for Binomial Parameters) .................... 8
D. The Mid-P-Value .................... 11
E. Pearson's Chi-Squared Statistic .................... 11
F. Likelihood Ratio Chi-Squared Statistic .................... 12
G. Maximum Likelihood Estimation .................... 12
Chapter 2: Describing Contingency Tables ..................... 16
A. Summary of Chapter 2, Agresti .................... 16
B. Comparing two proportions .................... 18
C. Partial Association in Stratified 2 x 2 Tables .................... 19
D. Conditional Odds Ratios .................... 23
E. Summary Measures of Association: Ordinal Trends .................... 24
Chapter 3: Inference for Contingency Tables .................. 28
A. Summary of Chapter 3, Agresti .................... 28
B. Confidence Intervals for Association Parameters .................... 29
C. Testing Independence in Two-way Contingency Tables .................... 35
D. Following Up Chi-Squared Tests .................... 37
E. Two-Way Tables with Ordered Classification .................... 39
F. Small Sample Tests of Independence .................... 41
G. Small-Sample Confidence Intervals For 2x2 Tables .................... 44
Chapter 4: Generalized Linear Models ............................ 50
A. Summary of Chapter 4, Agresti .................... 50
B. Generalized Linear Models for Binary Data .................... 51
C. Generalized Linear Models for Count Data .................... 56
D. Overdispersion in Poisson Generalized Linear Models .................... 61
E. Negative Binomial GLIMs .................... 63
F. Residuals for GLIMs .................... 65
G. Quasi-Likelihood and GLIMs .................... 67
H. Generalized Additive Models (GAMs) .................... 68
Chapter 5: Logistic Regression .................... 72
A. Summary of Chapter 5, Agresti .................... 72
B. Logistic Regression for Horseshoe Crab Data .................... 73
C. Goodness-of-fit for Logistic Regression for Ungrouped Data .................... 77
D. Logit Models with Categorical Predictors .................... 78
E. Multiple Logistic Regression .................... 82
F. Extended Example (Problem 5.17) .................... 88
Chapter 6 Building and Applying Logistic Regression Models ............................................................................. 92
A. Summary of Chapter 6, Agresti .................... 92
B. Model Selection .................... 93
C. Using Causal Hypotheses to Guide Model Fitting .................... 94
D. Logistic Regression Diagnostics .................... 96
E. Inference about Conditional Associations in 2 x 2 x K Tables .................... 102
F. Estimation/Testing of Common Odds Ratio .................... 105
G. Using Models to Improve Inferential Power .................... 106
H. Sample Size and Power Considerations .................... 107
I. Probit and Complementary Log-Log Models .................... 109
J. Conditional Logistic Regression and Exact Distributions .................... 111
K. Bias-reduced Logistic Regression .................... 116
Chapter 7 Logit Models for Multinomial Responses .... 117
A. Summary of Chapter 7, Agresti .................... 117
B. Nominal Responses: Baseline-Category Logit Models .................... 118
C. Cumulative Logit Models .................... 121
D. Cumulative Link Models .................... 125
E. Adjacent-Categories Logit Models .................... 127
F. Continuation-Ratio Logit Models .................... 128
G. Mean Response Models .................... 134
H. Generalized Cochran-Mantel-Haenszel Statistic for Ordinal Categories .................... 139
Chapter 8 Loglinear Models for Contingency Tables .. 141
A. Summary of Chapter 8, Agresti .................... 141
B. Loglinear Models for Three-way Tables .................... 142
C. Inference for Loglinear Models .................... 145
D. Loglinear Models for Higher Dimensions .................... 147
E. Loglinear-Logit Model Connection .................... 150
F. Contingency Table Standardization .................... 151
Chapter 9 Building and Extending Loglinear Models... 152
A. Summary of Chapter 9, Agresti .................... 152
B. Model Selection and Comparison .................... 153
C. Diagnostics for Checking Models .................... 155
D. Modeling Ordinal Associations .................... 156
E. Association Models .................... 158
F. Association Models, Correlation Models, and Correspondence Analysis .................... 164
G. Poisson Regression for Rates .................... 170
H. Modeling Survival Times .................... 172
I. Empty Cells and Sparseness .................... 174
Chapter 10 Models for Matched Pairs ........................ 176
A. Summary of Chapter 10, Agresti .................... 176
B. Comparing Dependent Proportions .................... 177
C. Conditional Logistic Regression for Binary Matched Pairs .................... 178
D. Marginal Models for Square Contingency Tables .................... 181
E. Symmetry, Quasi-symmetry, and Quasi-independence .................... 186
F. Square Tables with Ordered Categories .................... 189
G. Measuring Agreement Between Observers .................... 192
H. Kappa Measure of Agreement .................... 195
I. Bradley-Terry Model for Paired Preferences .................... 196
J. Bradley-Terry Model with Order Effect .................... 199
K. Marginal and Quasi-symmetry Models for Matched Sets .................... 200
Chapter 11 Analyzing Repeated Categorical Response Data ............................................................................... 203
A. Summary of Chapter 11, Agresti .................... 203
B. Comparing Marginal Distributions: Multiple Responses .................... 203
C. Marginal Modeling: Maximum Likelihood Approach .................... 205
D. Marginal Modeling: Maximum Likelihood Approach. Modeling a Repeated Multinomial Response .................... 211
E. Marginal Modeling: GEE Approach. Modeling a Repeated Multinomial Response .................... 215
F. Marginal Modeling: GEE Approach. Modeling a Repeated Multinomial Ordinal Response .................... 219
G. Markov Chains: Transitional Modeling .................... 221
Chapter 12 Random Effects: Generalized Linear Mixed Models for Categorical Responses................................ 226
A. Summary of Chapter 12, Agresti .................... 226
B. Logit GLIMM for Binary Matched Pairs .................... 227
C. Examples of Random Effects Models for Binary Data .................... 230
D. Random Effects Models for Multinomial Data .................... 243
E. Multivariate Random Effects Models for Binary Data .................... 245
Chapter 13 Other Mixture Models for Categorical Data .................... 252
A. Summary of Chapter 13, Agresti .................... 252
B. Latent Class Models .................... 252
C. Nonparametric Random Effects Models .................... 260
D. Beta-Binomial Models .................... 268
E. Negative-Binomial Regression .................... 273
F. Poisson Regression with Random Effects .................... 275
Introduction and Changes from First Edition
This manual accompanies Agresti's Categorical Data Analysis (2002). It provides assistance in doing the statistical methods illustrated there, using S-PLUS and the R language. Although I have used the Windows versions of these two softwares, I suspect few changes would be needed to use the code in other ports. I have included examples of almost all of the major (and some minor) analyses introduced by Agresti. The manual chapters parallel those from Agresti so that, for example, in Chapter 2 I discuss using the software to conduct analyses from Agresti's Chapter 2. In most cases I use the data provided in the text. There are only one or two occasions where I use data from the problems sections in Agresti. Discussion of the results of analyses is brief, since the full discussion appears in the text. That said, one of the improvements of the current manual over the previous version (Thompson, 1999) is that it is more self-contained. In addition, I include a summary of the corresponding chapter from Agresti at the beginning of each chapter in the manual. However, it will still be helpful to refer to the text to understand the analyses completely.

In the manual, I frequently employ functions that come from user-contributed libraries (packages) of S-PLUS (or R). In the text, I note when a particular library or package is used. These libraries are not automatically included with the software, but can be easily downloaded from the internet, especially for the newest version of R for Windows. I mention in the next section how to obtain these libraries and how to install them for use. Many times, the library or package has its own help manual or help section. I will demonstrate how to access these from inside the software.

I used S-PLUS 6.1 through 7.0 for Windows and R versions 1.8 through 2.8.1 for Windows for the analyses. However, S-PLUS for Windows versions as far back as 3.0 will do many of the analyses (but not all).
This is not so easily said for R, as user-contributed packages frequently require newer versions of R (e.g., at least 1.3.0). Many of the analyses can be applied to either S-PLUS or R. Some need small changes to apply to both softwares; I have provided these changes where necessary. In general, when I refer to both of the softwares, I use the generic name S. Also, if I do not indicate that I am using either S-PLUS or R for a particular command, that means the command will work in both softwares.

To separate input to R or S-PLUS and output from other text in this manual, I have put normal text in Arial font and commands and output in courier font. Input commands are in bold font, whereas output is not. Also, all assignments use the <- convention instead of = (or _).

Finally, this manual assumes some familiarity with the basic commands of S. To keep the manual from being too long, I do not discuss at great length functions I use that are not directly related to categorical data analysis. See Section H below for information on obtaining introductory documentation for R or S-PLUS.
A. Obtaining the R Software for Windows
The language (and associated software interface) R can loosely be described as open-source S. It is downloadable from the site http://cran.r-project.org. Information on how to install R, as well as several PDF documents and user-contributed documents on the language and its features are included on the website.
B. Libraries in S-PLUS and Packages in R
The S-PLUS libraries used in this manual that do not come with the software are:

MASS (B. Ripley) - (used throughout)
Multiv (F. Murtagh) - (used for correspondence analysis)
cond (A. Brazzale) - (used for conditional logistic regression in chapter 6; NOTE: no longer supported) http://www.ladseb.pd.cnr.it/~brazzale/lib.html#ins
Design (F. Harrell) - (used throughout)
Hmisc (F. Harrell) - (support for Design library)
nnet (B. Ripley) - for the function multinom (chapter 7, multinomial logit models)
nolr (M. Mathieson) - (nonlinear ordinal models - supplement to chapter 9)
rmtools (A. Azzalini & M. Chiogna) - (used for chapter 11)
yags2 (V. Carey) - (used for chapter 11) Most of these libraries can be obtained in .zip form from URL http://lib.stat.cmu.edu/DOS/S/Swin or http://lib.stat.cmu.edu/DOS/S. Currently, the URL http://www.stats.ox.ac.uk/pub/MASS4/Winlibs/ contains many ports to S-PLUS 6.0 for Windows. To install a library, first download it to any folder on your computer. Next, unzip the file using an unzipping program. This will extract all the files into a new folder in the directory into which you downloaded the zip file. Move the entire folder to the library directory under your S-PLUS directory (e.g., c:/program files/Insightful/splus61/library). To load a library, you can either pull down the File menu in S-PLUS and select Load Library or type one of the following in a script or command window
library(libraryname, first=T)  # loads libraryname into first database position
library(libraryname)
To use the library's help manual from inside S-PLUS or R, type in a script or command window
help(library=libraryname)
Many of the R packages used in this manual that do not come with the software are listed below (not a complete list):

MASS (VR bundle, Venables and Ripley)
rmutil (J. Lindsey) - (used with gnlm) http://alpha.luc.ac.be/~lucp0753/rcode.html
gnlm (J. Lindsey) http://alpha.luc.ac.be/~lucp0753/rcode.html
repeated (J. Lindsey) http://alpha.luc.ac.be/~lucp0753/rcode.html
SuppDists (B. Wheeler) - (used in chapter 1)
combinant (V. Carey) - (used in chapters 1, 3)
methods - (used in chapter 3)
Bhat (E. Luebeck) - (used throughout)
mgcv (S. Wood) - (used for fitting GAM models)
modreg (B. Ripley) - (used for fitting GAM models)
gee and geepack (J. Yan) - (used in chapter 11)
yags (V. Carey) - (used in chapter 11)
gllm - (used for generalized log linear models and latent class models)
GLMMGibbs (Myles and Clayton) - (used for generalized linear mixed models, chapter 12)
glmmML (G. Broström) - (used for generalized linear mixed models, chapter 12)
CoCoAn (S. Dray) - (used for correspondence analysis)
e1071 (A. Weingessel) - (used for latent class analysis)
vcd (M. Friendly) - (used in chapters 2, 3, 5, 9 and 10)
brlr (D. Firth) - (used in chapter 6)
BradleyTerry (D. Firth) - (used in chapter 10)
ordinal (P. Lindsey) - (used in chapter 7) http://popgen.unimaas.nl/~plindsey/rlibs.html
design (F. Harrell) - (used throughout) http://hesweb1.med.virginia.edu/biostat
Hmisc (F. Harrell) - (used throughout)
VGAM (T. Yee) - (used in chapters 7 and 9) http://www.stat.auckland.ac.nz/~yee/VGAM/download.shtml
mph (J. Lang) - (used in chapters 7, 10) http://www.stat.uiowa.edu/~jblang/mph.fitting/mph.fit.documentation.htm#availability
exactLoglinTest (B. Caffo) - (used in chapters 9, 10) http://www.biostat.jhsph.edu/~bcaffo/downloads.htm
aod - (used in chapter 13)
lca - (used in chapter 13)
mmlcr - (used in chapter 13)
flexmix - (used in chapter 13)
npmlreg - (used in chapter 13)
Rcapture - (used in chapter 13)

R packages can be installed from Windows using the install.packages function. This function can be called from the console or from the pull-down Packages menu. A package is loaded in the same way as for S-PLUS. As well, the command help(package=pkg) can be used to invoke the help menu for package pkg. To detach a library or package, named library.name or pkg, respectively, one can issue the following commands, among others.
detach(library.name)   # S-PLUS
detach(package:pkg)    # R
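For example, installation and loading in R, as described above, might look like the following; the package name vcd is just an illustration drawn from the list above:

```r
install.packages("vcd")  # download and install from CRAN (console or Packages menu)
library(vcd)             # load the package
help(package = vcd)      # open the package's help contents
```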
C. Setting contrast types using options()
The options function can be used to set the type of contrasts used in estimating models. The default for S-PLUS is Helmert contrasts for factors and polynomial contrasts for ordered factors. The default for R is treatment contrasts for factors and polynomial contrasts for ordered factors. I use treatment contrasts for factors for most of the analyses so that they match Agresti's estimates. However, there are times when I use sum-to-zero contrasts (contr.sum). The type of contrasts I use is usually indicated by a call to options prior to fitting the model, if necessary. If the call to options has been omitted, please assume I am using treatment contrasts for factors. One can find out exactly what the contrasts are in a glm-type fit by using the functions model.matrix and contrasts. Issuing the command contrasts(model.matrix(fit)) gives the contrasts.
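As a sketch of the usual idiom (fit here stands for a hypothetical fitted glm object):

```r
options(contrasts = c("contr.treatment", "contr.poly"))  # treatment contrasts for factors
# to use sum-to-zero contrasts for factors instead:
# options(contrasts = c("contr.sum", "contr.poly"))
contrasts(model.matrix(fit))  # inspect the contrasts used by the fitted model 'fit'
```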
D. Credit for functions
The author of a function is named if the function was not written by me. Whenever I use functions that do not come with S, I will mention the S-PLUS library or R package that includes them. I also give a URL, if applicable.
E. Editing functions
In several places in the text, I mention creating functions or editing existing functions that come with either S-PLUS or R. There are many ways to edit or create a function. Some of them are specific to your platform. However, one procedure that works on all platforms from the command line is a call to fix. For example, to edit the method function update.default, and call the new version update.crosstabs, type the following at the command line (R or S-PLUS)
update.crosstabs<-fix(update.default)
This will bring up the function code for update.default in a text editor from which you can make changes to the function, save them, and exit the editor. The changes will be incorporated in update.crosstabs. Note that the function edit works in mostly the same way here, but is actually a generic function that allows editing of not just function objects, but all other S objects as well. To create a function from scratch, put the name of the new function as the argument to fix. For example,
fix(my.new.function)
To create functions from a script file (e.g., S-PLUS) or another editing program, one general procedure is to source the script file using e.g.,
source("c:/path/name.of.script.file")
F. A note about using S-PLUS Menus
Many of the more common methods I will illustrate can be accomplished via the S-PLUS menus. If you want to know what code corresponds to a particular menu command, issue the menu command and call up the History window (using the Window menu). All commands you have issued from menus will be there in (gui) code form which can then be used in a command window or script.
G. Notice of errors
All code has been tested, but there are undoubtedly still errors. Please notify me of any errors in the manual or of easier ways to perform tasks. My email address is lthompson10@yahoo.com.
H. Introductions to the S Language
This manual assumes some working knowledge of the S language. There is not space to describe the language here as well. Fortunately, there are many tutorials available for learning S. Some of them are listed in the User's Guides that come with S-PLUS. Others are listed on the CRAN website for R (see Section A above).
I. References
Agresti, A. (2002). Categorical Data Analysis, 2nd edition. Wiley.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Cambridge University Press.
Chambers, J. (1998). Programming with Data. Springer-Verlag.
Chambers, J. and Hastie, T. (1992). Statistical Models in S. Chapman & Hall.
Ewens, W. and Grant, G. (2001). Statistical Methods in Bioinformatics. Springer-Verlag.
Gentleman, R. and Ihaka, R. (2000). "Lexical Scope and Statistical Computing." Journal of Computational and Graphical Statistics, 9, 491-508.
Green, P. and Silverman, B. (1994). Nonparametric Regression and Generalized Linear Models. Chapman & Hall.
Fletcher, R. (1987). Practical Methods of Optimization. Wiley.
Harrell, F. (1998). Predicting Outcomes: Applied Survival Analysis and Logistic Regression. Unpublished manuscript. Now available as Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer (2001).
Liao, L. and Rosen, O. (2001). "Fast and Stable Algorithms for Computing and Sampling From the Noncentral Hypergeometric Distribution." The American Statistician, 55, 366-369.
Lloyd, C. (1999). Statistical Analysis of Categorical Data. Wiley.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, 2nd ed. Chapman & Hall.
Ripley, B. (2002). On-line complements to Modern Applied Statistics with S-PLUS (http://www.stats.ox.ac.uk/pub/MASS4).
Ross, S. (1997). Introduction to Probability Models. Addison-Wesley.
Selvin, S. (1998). Modern Applied Biostatistical Methods Using S-PLUS. Oxford University Press.
Sprent, P. (1997). Applied Nonparametric Statistical Methods. Chapman & Hall.
Thompson, L. (1999). S-PLUS Manual to Accompany Agresti's (1990) Categorical Data Analysis. (http://math.cl.uh.edu/~thompsonla/5537/Splusdiscrete.PDF).
Venables, W. and Ripley, B. (2000). S Programming. Springer-Verlag.
Venables, W. and Ripley, B. (2002). Modern Applied Statistics with S. Springer-Verlag.
Wickens, T. (1989). Multiway Contingency Tables Analysis for the Social Sciences. LEA.
J. Acknowledgements
Thanks to Gregory Rodd and Frederico Zanqueta Poleto for reviewing the manuscript and finding many errors that I overlooked. Remaining errors are the sole responsibility of the author.
Chapter 1: Distributions and Inference for Categorical Data
A. Summary of Chapter 1, Agresti
In Chapter 1, Agresti introduces categorical data, including its types, distributions and statistical inference. Categorical variables measured on a nominal scale take values that do not have a natural ordering, whereas categorical variables measured on an ordinal scale take values that do have an ordering, but not necessarily a numerical ordering. Categorical variables are sometimes called discrete variables because they take on only a discrete or countable number of values.

Agresti discusses three main distributions for categorical variables: binomial, multinomial, and Poisson. The binomial distribution describes the distribution of the sum of a fixed number of independent Bernoulli trials (i.e., binary outcomes), where the probability of success is fixed across trials. The multinomial distribution extends the binomial distribution to handle trials with possibly more than two outcomes. The Poisson distribution describes the distribution of the number of events occurring in a specified length of time or space, where the intervals of time or space are independent with respect to the occurrence of an event. Formal assumptions related to the Poisson distribution (which derives from the homogeneous Poisson process) can be found in any text on stochastic processes (see, for example, Ross, 1997). A nice explanation can also be found in Ewens and Grant (2001).

When observations modeled as coming from a binomial or Poisson distribution are much more variable than predicted by the respective theoretical distributions, the use of these distributions can lead to poor predictions because the excess variability is not accounted for. The presence of this excess variability is called overdispersion. There are options for dealing with overdispersion when it is suspected in a data set that otherwise could be reasonably modeled using a conventional distribution like the binomial or Poisson. One option is to incorporate random effects into the model (see Agresti, Chapter 12). Another option is to use a distribution with a greater variance than that dictated by the binomial or Poisson (e.g., the beta-binomial distribution is a mixture of binomials with different probabilities of success).

Given a particular probability distribution to describe the data, the likelihood function is the probability of the data as a function of the parameters. The maximum likelihood estimate (MLE) is the value of the parameter that maximizes this function; that is, the MLE is the parameter value that gives the observed data the highest probability of occurrence. Inference for categorical data deals with the MLE and tests derived from maximum likelihood. Tests of the null hypothesis of zero-valued parameters can be carried out via the Wald Test, the Likelihood Ratio Test, and the Score Test. These tests are based on different aspects of the likelihood function, but are asymptotically equivalent. In smaller samples, however, they can yield different answers and are not equally reliable.

Agresti also discusses confidence intervals derived from inverting the above three tests. As mentioned on p. 13 of Agresti, a 95% confidence interval for β is the set of β0 for which the test of H0: β = β0 has a P-value exceeding 0.05. That is, it describes the set of β values for which we would keep the null hypothesis (assuming a significance level of 0.05).

The chapter concludes with illustration of statistical inference for binomial and multinomial parameters.
B. Discrete Probability Distributions in S-PLUS and R
S-PLUS and R come with support for many built-in probability distributions, and support for many specialized distributions can be downloaded from statlib or from the CRAN website (see Introduction Section). Each distribution has functions for obtaining cumulative probabilities, density values, quantiles, and realizations of random variates. For example, in both S-PLUS and R
dbinom(x, size, prob) computes the density of the indicated binomial distribution at x
pbinom(x, size, prob) computes the cumulative probability of the indicated binomial distribution at x
qbinom(p, size, prob) computes the pth quantile of the indicated binomial distribution
rbinom(n, size, prob) draws n random variates from the indicated binomial distribution
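As a quick illustration of these four functions for a binomial distribution with size = 10 and prob = 0.5 (the numbers are arbitrary):

```r
dbinom(2, size = 10, prob = 0.5)    # P(X = 2) = choose(10, 2)/2^10, about 0.0439
pbinom(2, size = 10, prob = 0.5)    # P(X <= 2)
qbinom(0.5, size = 10, prob = 0.5)  # median of the distribution (here, 5)
rbinom(3, size = 10, prob = 0.5)    # three random draws
```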
Some of the other discrete distributions included with S-PLUS and R are the Poisson, negative binomial, geometric, hypergeometric, and discrete uniform. rnegbin is included with the MASS library and can generate negative binomial variates from a distribution with nonintegral size. The beta-binomial is included in R with packages rmutil and gnlm (for beta-binomial regression via gnlr) from Jim Lindsey, as well as the R package SuppDists from Bob Wheeler. Both the rmutil package and SuppDists package contain functions for many other distributions as well (e.g., negative hypergeometric, generalized hypergeometric, beta-negative binomial, and beta-pascal or beta-geometric). The library wle in R has a function for generating from a discrete uniform using runif. The library combinat in R gives functions for the density of a multinomial random vector (dmnom) and for generating multinomial random vectors (rmultinomial). The rmultinomial function is given below
rmultinomial <- function(n, p, rows = max(c(length(n), nrow(p)))) {
    rmultinomial.1 <- function(n, p) {
        k <- length(p)
        tabulate(sample(k, n, replace = TRUE, prob = p), nbins = k)
    }
    n <- rep(n, length = rows)
    p <- p[rep(1:nrow(p), length = rows), , drop = FALSE]
    t(apply(matrix(1:rows, ncol = 1), 1, function(i) rmultinomial.1(n[i], p[i, ])))
    # could be replaced by
    # sapply(1:rows, function(i) rmultinomial.1(n[i], p[i, ]))
}
Because of the difference in scoping rules between the two implementations of S (i.e., R uses lexical scoping and S-PLUS uses static scoping), the function rmultinomial in R cannot just be sourced into S-PLUS. One alternative is to use the following S-PLUS implementation, which avoids defining the rmultinomial.1 function, the source of the scoping problem above.
rmultinomial <- function(n, p, rows = max(c(length(n), nrow(p)))) {
    n <- rep(n, length = rows)
    p <- p[rep(1:nrow(p), length = rows), , drop = FALSE]
    sapply(1:rows, function(i, n, p) {
        k <- length(p[i, ])
        tabulate(sample(k, n[i], replace = TRUE, prob = p[i, ]), nbins = k)
    }, n = n, p = p)
}
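As an aside, more recent versions of R ship a base-stats function rmultinom (note the different name and the different orientation of the result: one column per draw); a minimal sketch:

```r
# rmultinom returns a length(prob) x n matrix; each column is one
# multinomial draw whose entries sum to size
x <- rmultinom(3, size = 20, prob = c(0.5, 0.3, 0.2))
colSums(x)   # each of the 3 columns sums to 20
```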
See Gentleman and Ihaka (2000) or the R-FAQ for more information on the differences in scoping rules between R and S-PLUS. Jim Lindsey's gnlm package contains a function called fit.dist, which fits a probability distribution of choice to frequency data. One can also use the sample function to generate (with or without replacement) from multiple categories given a set of probabilities for those categories. More information on the use of these functions can be found in the S-PLUS or R online manuals or on pages 107-108 of Venables and Ripley (2002).
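For instance, sample can draw with replacement from labeled categories; the category labels and probabilities below are invented for illustration:

```r
set.seed(1)  # for reproducibility
# 10 draws from three hypothetical categories with unequal probabilities
draws <- sample(c("low", "medium", "high"), size = 10, replace = TRUE,
                prob = c(0.2, 0.5, 0.3))
table(draws)  # frequency of each category among the 10 draws
```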
C. Proportion of Vegetarians (Statistical Inference for Binomial Parameters)
An example of the use of the S-PLUS distribution functions comes from computing the asymptotic and exact confidence intervals on the proportion of vegetarians (p. 16). In a questionnaire given to an introductory statistics class (n = 25), zero students said they were vegetarians. Assuming that the number responding yes is distributed binomially with success probability π, what is a 95% confidence interval for π, given that a point estimate of π, π̂, is 0? Agresti cites three approximate methods and two exact methods. The approximate methods are given first.
1. Approximate Confidence Intervals on π
1) Inverting the Wald Test (AKA Normal Approximation)
The interval is

    π̂ ± z_{α/2} √( π̂ (1 − π̂) / n )    (1.1)

which is computed quite easily in S-PLUS by hand.
phat <- 0
n <- 25
phat + c(-1, 1) * qnorm(p = 0.975) * sqrt((phat * (1 - phat))/n)
[1] 0 0
However, the interval is also available via the binconf function in the Hmisc library in both S-PLUS and R, using the option method="asymptotic".
library(Hmisc, T)
binconf(x=0, n=25, method="asymptotic")
 PointEst Lower Upper
        0     0     0
2) The score confidence interval contains the π₀ values for which | z_S | < z_{.025}, where

    z_S = (π̂ − π₀) / √( π₀ (1 − π₀) / n )

The endpoints of the interval solve the equations

    (π̂ − π₀) / √( π₀ (1 − π₀) / n ) = ± z_{.025}    (1.2)
Agresti gives the analytical expressions for these endpoints on his p. 16, which we could type into S-PLUS or R to get the values. However, even if there were no analytical expression, or we didn't want to try to find them, we could use the function nlmin to solve the set of simultaneous equations in (1.2), yielding an approximate confidence interval. (See point 3 below and Chapter 3 for examples.) Built-in functions that compute the score confidence interval include the prop.test function (S-PLUS and R)
res <- prop.test(x=0, n=25, conf.level=0.95, correct=F)
res$conf.int
[1] 0.0000000 0.1331923
attr(, "conf.level"):
[1] 0.95
and the binconf function from Hmisc via its method="wilson" option:
library(Hmisc, T)
binconf(x=0, n=25, alpha=.05, method="wilson")
 PointEst         Lower     Upper
        0 1.202937e-017 0.1331923
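Although the closed-form endpoints make it unnecessary here, the nonzero score endpoint can also be found numerically with uniroot; a sketch for y = 0, n = 25 (the lower endpoint is 0 in this case):

```r
# score statistic z_S as a function of the null value pi0
z.score <- function(pi0, phat = 0, n = 25) (phat - pi0) / sqrt(pi0 * (1 - pi0) / n)
# the upper endpoint solves z_S(pi0) = -z_{.025}
upper <- uniroot(function(pi0) z.score(pi0) + qnorm(0.975),
                 interval = c(1e-06, 0.999))$root
upper   # about 0.1332, matching prop.test above
```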
3) A 95% likelihood-ratio (LR) confidence interval is the set of π₀ for which the likelihood ratio test has a p-value exceeding the significance level, α. The expression for the LR statistic is simple enough so that we can find the upper confidence bound analytically. However, we could have obtained it using uniroot (both R and S-PLUS) or using nlmin (S-PLUS only). These functions can be used to solve an equation, i.e., find its zeroes. uniroot is only for univariate functions. nlmin can also be used to solve a system of equations. A function doing the same in R would be nlm or nls (for nonlinear least squares, in library nls). Using uniroot we set up the LR statistic as a function, giving as first argument the parameter for which we want the confidence interval. The remaining arguments are the observed successes, the total, and the significance level. uniroot takes an argument called interval that specifies where we want to search for the root. There cannot be a singularity at either of these values. Thus, we cannot use 0 and 1.

LR <- function(pi.0, y, n, alpha) {
    -2*(y*log(pi.0) + (n-y)*log(1-pi.0)) - qchisq(1-alpha, df=1)
}
uniroot(f=LR, interval=c(0.000001, .999999), n=25, y=0, alpha=.05)

$root:
[1] 0.07395187
$f.root:
[1] -5.615436e-006
The function nlmin can be used to solve the nonlinear equation

    −50 log(1 − π₀) = χ²₁(0.05)

for π₀.
We can take advantage of the function solveNonlinear listed in the help page for nlmin in S-PLUS. This function minimizes the squared difference between the LR statistic and the chi-squared quantile at α.
solveNonlinear <- function(f, y0, x, ...){
    # solve f(x) = y0
    # x is vector of initial guesses, same length as y0
    # ... are additional arguments to nlmin (not to f)
    g <- function(x, y0, f) sum((f(x) - y0)^2)
    g$y0 <- y0   # set g's default value for y0
    g$f <- f     # set g's default value for f
    nlmin(g, x, ...)
}
LR <- function(x) -50*log(1-x)   # define the LR function
solveNonlinear(f=LR, y0=qchisq(.95, df=1), x=.5)   # start finding the solution at 0.5

$x:
[1] 0.07395197
$converged:
[1] T
$conv.type:
[1] "x convergence"
2. Exact Confidence Intervals on π
There are several functions available for computing exact confidence intervals in R and S-PLUS. In R, the function binom.test computes a Clopper-Pearson interval. But, the same function in S-PLUS doesn't give confidence intervals, at least not automatically.
binom.test(x=0, n=25, conf.level=.95)   # R

        Exact binomial test
data: 0 and 25
number of successes = 0, number of trials = 25, p-value = 5.96e-08
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.0000000 0.1371852
sample estimates:
probability of success
                     0
The function binconf in the Hmisc library also gives Clopper-Pearson intervals via the use of the method="exact" option.
library(Hmisc, T)
binconf(x = 0, n = 25, alpha = .05, method = "exact")   # S-PLUS
 PointEst Lower     Upper
        0     0 0.1371852
In addition, several individuals have written their own functions for computing Clopper-Pearson as well as other types of approximate intervals besides the normal approximation interval. A search for "exact binomial confidence interval" on the S-news search page (http://lib.stat.cmu.edu/cgi-bin/iform?SNEWS) gave several user-made functions. The improved confidence interval described in Blaker (2000, cited in Agresti) can be computed in S-PLUS using the following function, which is a slight modification of the function appearing in the Appendix of the original reference.
acceptbin <- function(x, n, p){
    # computes the acceptability of p when x is observed and X is Bin(n, p)
    # adapted from Blaker (2000)
    p1 <- 1 - pbinom(x - 1, n, p)
    p2 <- pbinom(x, n, p)
    a1 <- p1 + pbinom(qbinom(p1, n, p) - 1, n, p)
    a2 <- p2 + 1 - pbinom(qbinom(1 - p2, n, p), n, p)
    min(a1, a2)
}

acceptinterval <- function(x, n, level, tolerance=1e-04){
    # computes acceptability interval for p at 1 - alpha equal to level
    # (in (0,1)) when x is an observed value of X which is Bin(n, p)
    # adapted from Blaker (2000)
    lower <- 0; upper <- 1
    if(x) {
        lower <- qbeta((1-level)/2, x, n-x+1)
        while(acceptbin(x, n, lower) < (1-level)) lower <- lower + tolerance
    }
    if(x != n) {
        upper <- qbeta(1-(1-level)/2, x + 1, n - x)
        while(acceptbin(x, n, upper) < (1-level)) upper <- upper - tolerance
    }
    c(lower=lower, upper=upper)
}

acceptinterval(x=0, n=25, level=.95)

 lower     upper
     0 0.1275852
D. The Mid-P-Value
A confidence interval can be based on the mid-p-value discussed in Section 1.4.5 of Agresti. For the Vegetarian example above, we can obtain a 100(1 − α)% Clopper-Pearson confidence interval on π using the mid-p-value by finding all values of π₀ for which

    (1/2) P(y = k | π₀) + P(y < k | π₀) > α/2

and

    (1/2) P(y = n − k | π₀) + P(y > n − k | π₀) > α/2

where P(y = k | π₀) is the binomial probability mass function, y = 0, and n = 25. For the example, this set of inequalities reduces to the inequality

    π₀ < 1 − α^(1/n) = 1 − .05^(1/25) = 0.113

because the lower limit is 0 when y = 0 (p. 18, Agresti).
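The 0.113 bound can be checked numerically in R; a sketch, first via the closed form and then by solving the mid-p equation for y = 0 (where the mid-p-value reduces to half the probability of the observed count):

```r
# closed-form upper bound, about 0.113
bound <- 1 - 0.05^(1/25)
bound
# the same bound from solving (1/2) * P(y = 0 | pi0) = alpha/2 numerically
root <- uniroot(function(pi0) 0.5 * dbinom(0, 25, pi0) - 0.025,
                interval = c(1e-06, 0.999))$root
root
```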
E. Pearson's Chi-Squared Statistic
In both S-PLUS and R, one can find functions that will compute Pearson's Chi-Squared statistic. However, they appear in different places across the two implementations. In R, the ctest library contains the function chisq.test, which takes as arguments a set of frequencies, x, and its corresponding null probabilities, p. On Mendel's results (p. 22, Agresti) we get
library(ctest)
chisq.test(x=c(6022,2001), p=c(.75,.25))

        Chi-squared test for given probabilities
data: c(6022, 2001)
X-squared = 0.015, df = 1, p-value = 0.9025
In S-PLUS, there does exist a built-in function called chisq.test, but its arguments are different, and the above code will not work. However, it calls a function .pearson.x2 (as does the function chisq.gof) which allows one to input observed and expected frequencies.
res <- .pearson.x2(observed=c(6022,2001), expected=c(8023*.75, 8023*.25))
1 - pchisq(res$X2, df=1)
[1] 0.902528
The S-PLUS function chisq.test is used for situations where one has a 2x2 cross-tabulation (preferably computed using the function table) or two factor vectors, and wants to test for independence of the row and column variables.
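In R, the same independence test is available by giving chisq.test a matrix or a table; the 2x2 counts below are invented for illustration:

```r
# hypothetical 2x2 table of counts (rows: one factor, columns: the other)
tab <- matrix(c(30, 10, 20, 40), nrow = 2)
res <- chisq.test(tab, correct = FALSE)   # Pearson X^2 without continuity correction
res$statistic   # X-squared = 50/3, about 16.67, on df = 1 for these counts
res$p.value
```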
F. Likelihood Ratio Chi-Squared Statistic
The likelihood ratio chi-squared statistic is easily computed by hand for multinomial frequencies. Using vectorized calculations, G² for Mendel's data is
obs <- c(6022, 2001)
expected <- 8023 * c(0.75, 0.25)
1 - pchisq(2 * sum(obs * log(obs/expected)), df=1)
[1] 0.9025024
These methods can also be used on the dairy calves example on p. 25.
G. Maximum Likelihood Estimation
S-PLUS and R come with several functions for general optimization. We already saw examples of the use of uniroot for finding a zero of a univariate function and nlmin for minimizing a sum of squares. Here we look at some simple examples of using general optimization functions for maximum likelihood estimation with categorical data. Venables and Ripley (2002, Chapter 16) discuss these methods in more detail and give a few guidelines about parameterizing the objective function.

Lloyd (1999) gives data on the number of serious (fatal/nonfatal) accidents in the state of Victoria in Australia in 1985, separated by age group (over and under 21 years). Thus, we have a 2 x 2 contingency table with variables age and seriousness of accident. Lloyd initially treats the four accident counts as independent Poisson random variables with means λ₁, λ₂, λ₃, and λ₄, then considers several submodels. One of the sub-models uses the (fictitious) information that 23.3% of serious accidents are expected to be fatal. This information gives the restriction ν₁/(ν₁ + ν₂) = 0.233, where ν₁ = λ₁ + λ₃, the mean number of fatalities, and ν₂ = λ₂ + λ₄, the mean number of nonfatalities. The expected counts can be given by e₁ = λ₁, e₂ = λ₂, e₃ = ν₁ − λ₁, and e₄ = 3.292ν₁ − λ₂. Thus, this submodel has three parameters to estimate. The likelihood is (equation 1.10 in Lloyd)