Top 80 R Programming Interview Questions
Core Concepts
1. What is R and why is it so popular for data science?
R is an open-source language built specifically for statistical analysis and data visualization. It
has over 18,000 CRAN packages, rich graphics capabilities, strong community support, and is
easily extendable using C/C++, Python, and more.
2. Name three disadvantages of using R in production.
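● Higher memory usage than Python or Java, since R holds objects in memory by default
● Slower for large-scale computations, especially in interpreted loops that are not vectorized
● Less robust error handling, and uneven QA across community-contributed packages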
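3. What are the five commonly used atomic data types in R?
● Numeric (double): 3.14
● Integer: 2L
● Character: "hello"
● Logical: TRUE, FALSE
● Complex: 2 + 3i
R also defines a sixth atomic type, raw, for byte data; it rarely comes up in analysis work.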
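The types listed above can be checked with typeof(). The following is a minimal sketch; the variable names are illustrative, and a raw value is included only to show how byte data is reported.

```r
# typeof() reports the underlying storage type of each value
values <- list(3.14, 2L, "hello", TRUE, 2 + 3i, as.raw(255))
types <- vapply(values, typeof, character(1))
print(types)
# "double" "integer" "character" "logical" "complex" "raw"
```

Note that typeof() reports "double" for what R users usually call numeric.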
4. Difference between vector, list, matrix, and data frame:
● Vector: 1D homogeneous data
● List: 1D heterogeneous
● Matrix: 2D homogeneous
● Data frame: 2D list with equal-length columns (heterogeneous)
5. How to read CSV and tab-delimited files?
Use read.csv("file.csv") for CSV and read.delim("file.txt") for tab-delimited
files.
6. How do you import an Excel sheet into R?
Use readxl::read_excel("file.xlsx") or openxlsx::read.xlsx().
7. Difference between library() and require()?
Both attach packages. library() throws an error if the package is missing; require()
returns FALSE (with a warning), allowing conditional logic.
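The difference is easiest to see with a package that is not installed. This sketch assumes the name below (a made-up package name, used only for illustration) does not exist on your system.

```r
# Hypothetical package name, assumed not to be installed
pkg <- "aPackageThatDoesNotExist"

# require() returns FALSE (plus a warning) instead of stopping
has_pkg <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)

# library() raises an error, which we catch here to keep the script running
lib_result <- tryCatch({
  library(pkg, character.only = TRUE)
  "attached"
}, error = function(e) "error")

print(has_pkg)     # FALSE
print(lib_result)  # "error"
```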
8. How does R handle missing values and NaNs?
Missing values are represented as NA. Undefined numeric results (e.g., 0/0) are NaN.
is.na() returns TRUE for both NA and NaN, while is.nan() returns TRUE only for NaN.
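A short sketch of how the two checks behave on the same vector (the vector itself is illustrative):

```r
x <- c(1, NA, 0 / 0)

na_flags <- is.na(x)    # FALSE TRUE TRUE: flags both NA and NaN
nan_flags <- is.nan(x)  # FALSE FALSE TRUE: flags only NaN

# na.rm = TRUE drops both kinds of missing value before computing
clean_mean <- mean(x, na.rm = TRUE)  # 1
```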
9. Why do R programmers prefer <- over = for assignment?
Because <- is always assignment, while = also binds arguments in functions, so <- avoids
ambiguity.
10. Two ways to open help for a function in R?
Use ?mean or help(mean).
11. How to append a row to a data frame?
Use rbind(df, new_row) or bind_rows(df, new_row) from dplyr.
12. How to filter and select columns from a data frame?
df[df$score > 90, "name"] or df %>% filter(score > 90) %>% pull(name)
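The base R form can be tried on a small data frame. This is a sketch with made-up data; subset() is shown as an alternative because it silently drops rows where the condition is NA.

```r
# Illustrative data, not from the original text
df <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  score = c(95, 78, 92),
  stringsAsFactors = FALSE
)

# Row filter plus column selection in one indexing call
top_names <- df[df$score > 90, "name"]

# Equivalent with subset(), which ignores rows where the test is NA
top_names2 <- subset(df, score > 90)$name

print(top_names)  # "Ana" "Cy"
```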
13. What is the pipe operator %>% used for?
Used to chain operations, where the output of one function becomes the input of the next.
14. When should you use data.table instead of dplyr?
When working with very large datasets that need in-place updates (modification by reference) or the fastest possible performance.
15. How do you convert a string to a date?
Use as.Date("2025-06-27") or lubridate::ymd("20250627").
Data Manipulation
16. What is the apply() family used for?
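To apply a function across a data structure without writing an explicit loop: apply() works over
the margins (rows/columns) of a matrix, lapply() and sapply() over the elements of a vector or
list, and tapply() over groups defined by a factor.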
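A small sketch of the main members of the family, using made-up inputs:

```r
m <- matrix(1:6, nrow = 2)  # rows are (1, 3, 5) and (2, 4, 6)

row_sums <- apply(m, 1, sum)  # margin 1 = rows: 9 12
col_sums <- apply(m, 2, sum)  # margin 2 = columns: 3 7 11

# lapply() always returns a list; sapply() simplifies to a vector when it can
lens <- lapply(list(a = 1:3, b = 1:5), length)
squares <- sapply(1:3, function(i) i^2)  # 1 4 9

# tapply() applies a function within groups
grp_means <- tapply(c(10, 20, 30, 40), c("a", "a", "b", "b"), mean)  # a = 15, b = 35
```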
17. How to join two data frames by column id?
Using dplyr: left_join(df1, df2, by = "id") or base: merge(df1, df2, by =
"id").
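The base merge() call can be sketched on two small, made-up data frames. By default merge() performs an inner join; all.x = TRUE gives the same rows as a left join.

```r
# Illustrative data, not from the original text
df1 <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
df2 <- data.frame(id = c(1, 3, 4), score = c(90, 85, 70))

inner <- merge(df1, df2, by = "id")                # ids 1 and 3 only
left <- merge(df1, df2, by = "id", all.x = TRUE)   # ids 1, 2, 3; id 2 has score NA

print(left)
```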
18. Difference between rbind() and cbind()?
● rbind() adds rows
● cbind() adds columns
19. What does aggregate(score ~ class, df, mean) do?
Returns average scores for each class as a summary data frame.
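For example, on a small made-up data frame:

```r
# Illustrative data, not from the original text
df <- data.frame(
  class = c("A", "A", "B", "B"),
  score = c(80, 90, 70, 60)
)

res <- aggregate(score ~ class, df, mean)
print(res)
#   class score
# 1     A    85
# 2     B    65
```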
20. How to sort a data frame by column descending?
Base R: df[order(df$col, decreasing = TRUE), ]
Dplyr: df %>% arrange(desc(col))
21. Name five dplyr verbs and their uses:
● filter(): filter rows
● select(): pick columns
● mutate(): add/modify columns
● summarise(): summary stats
● arrange(): reorder rows
22. How to pivot data to wide format?
Use pivot_wider(names_from = category, values_from = value).
23. How does lubridate help with dates?
Simplifies date parsing and transformations with functions like ymd(), mdy_hms(), year(),
month().
24. What is a factor and when to use it?
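A factor stores categorical data as integer codes plus a fixed set of levels. Use it for variables with a known set of categories, so that modeling and plotting functions treat them as categories (with a defined order, if needed) rather than as free text.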
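A brief sketch of how a factor is stored, with an explicit level order (the values are illustrative):

```r
sizes <- factor(
  c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high")
)

lv <- levels(sizes)         # "low" "medium" "high"
codes <- as.integer(sizes)  # 1 3 2 1: integer codes that index into the levels
counts <- table(sizes)      # counts per level, in level order
```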
25. Memory difference between list and data frame?
Similar underlying memory, but data frames have extra class metadata and allow column-level
operations.
Visualization
26. What are ggplot2’s main layers?
● Data
● Aesthetics
● Geometries
● Scales
● Themes
● Coordinates
27. Create a histogram of prices with binwidth = 5:
ggplot(df, aes(price)) + geom_histogram(binwidth = 5)
28. How to add regression line to scatterplot?
Add: + geom_smooth(method = "lm", se = FALSE)
29. When to use base plotting over ggplot?
For quick EDA, scripting, or for-loop plotting where speed matters more than aesthetics.
30. How to make a correlation heatmap?
Use heatmap(cor(df)), or reshape the correlation matrix to long format (e.g., with reshape2::melt()) and plot it with geom_tile() in ggplot2.
Programming & Functions
31. Define a function for geometric mean:
geom_mean <- function(x) exp(mean(log(x), na.rm = TRUE))
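A quick usage sketch for the function above, with made-up inputs. Note that it assumes strictly positive values, since log() of zero or a negative number is not usable here.

```r
geom_mean <- function(x) exp(mean(log(x), na.rm = TRUE))

g1 <- geom_mean(c(2, 8))        # approximately 4
g2 <- geom_mean(c(1, 10, 100))  # approximately 10
g3 <- geom_mean(c(2, NA, 8))    # NA is dropped, so again approximately 4
```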
32. What is lexical scoping in R?
Functions remember variables from their defining environment, not where they’re called.
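The rule can be demonstrated with three variables that share a name. In this sketch, make_printer() and caller() are hypothetical helper names chosen for illustration.

```r
x <- "global"

make_printer <- function() {
  x <- "defining env"
  function() x  # x is looked up where this function was created
}

printer <- make_printer()

caller <- function() {
  x <- "calling env"  # ignored by printer()
  printer()
}

result <- caller()
print(result)  # "defining env"
```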
33. What does <<- do?
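Searches the enclosing (parent) environments for an existing variable and assigns to it there; if none is found, it creates the variable in the global environment. Dangerous if overused.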
34. Difference between S3, S4, R6?
● S3: informal OOP
● S4: formal slots, type checking
● R6: reference objects (mutable), better for apps
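Of the three systems, S3 is the simplest to show in a few lines of base R. This is a minimal sketch with a hypothetical generic named speak(); S4 and R6 need setClass() and the R6 package respectively and are not shown.

```r
# An S3 object is just a value with a class attribute
dog <- structure(list(name = "Rex"), class = "dog")

# The generic dispatches on the class of its first argument
speak <- function(x, ...) UseMethod("speak")
speak.dog <- function(x, ...) paste(x$name, "says woof")
speak.default <- function(x, ...) "(no sound)"

dog_says <- speak(dog)  # "Rex says woof"
other_says <- speak(42) # falls back to speak.default
```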
35. How do you debug R code in a loop?
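Insert browser() inside the loop body to pause and step through an iteration, call debug(fn) to step through a function line by line, or run traceback() after an error to see the call stack.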
36. What’s Rprof() used for?
Profiles code performance to find bottlenecks. Use summaryRprof() to analyze.
37. Why prefer vectorization over loops?
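Vectorized operations run in compiled code, so they avoid per-iteration interpreter overhead and are usually both faster and more readable than explicit loops.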
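As an illustration, the same sum of squares written both ways (a sketch; timings are not shown because they depend on the machine):

```r
x <- 1:1000

# Explicit loop: one interpreted iteration per element
loop_total <- 0
for (v in x) {
  loop_total <- loop_total + v^2
}

# Vectorized: the whole computation happens in compiled code
vec_total <- sum(x^2)

print(loop_total == vec_total)  # TRUE
```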
38. Show a closure example:
counter <- local({ i <- 0; function() { i <<- i + 1; i } })
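The same idea can be written as a factory, which makes it clearer that each closure keeps its own private state. In this sketch make_counter() is a hypothetical name; the inner function uses <<- to update i in its enclosing environment.

```r
make_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1  # updates i in the enclosing environment, not a local copy
    i
  }
}

first <- make_counter()
second <- make_counter()

# Each counter has its own i: first counts 1, 2 while second starts at 1
calls <- c(first(), first(), second())
print(calls)  # 1 2 1
```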
39. How to write and document a package?
Use usethis::create_package(), add code in R/, use roxygen2, document via
devtools::document().
40. What is non-standard evaluation in dplyr?
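dplyr captures column names as unevaluated expressions, so they can be written without quotes. Inside your own functions, wrap an argument in {{ }} to pass a column through to a dplyr verb.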
Statistics & Machine Learning
41. Code for two-sample t-test:
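t.test(x, y, var.equal = TRUE)
Here var.equal = TRUE runs the pooled-variance (Student's) test; omit it for the default Welch test, which does not assume equal variances.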
42. How to extract R-squared from linear model?
summary(lm_model)$r.squared
43. Fit a logistic regression:
glm(default ~ income + balance, data = df, family = binomial)
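The calls in questions 41 to 43 can be tried on the built-in mtcars data. This is a sketch only: the choice of mpg, wt, and am as variables is an assumption made for illustration and does not come from the questions above.

```r
# Split fuel economy by transmission type (am: 0 = automatic, 1 = manual)
auto <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Two-sample t-test with pooled variance (drop var.equal for Welch's test)
tt <- t.test(auto, manual, var.equal = TRUE)

# Linear model and its R-squared
lm_model <- lm(mpg ~ wt, data = mtcars)
r2 <- summary(lm_model)$r.squared

# Logistic regression: probability of a manual transmission given weight
logit_model <- glm(am ~ wt, data = mtcars, family = binomial)
p_manual <- predict(logit_model, data.frame(wt = 3), type = "response")
```

Passing type = "response" to predict() returns a probability rather than a value on the log-odds scale.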
44. How to choose best k for k-means?
Use elbow plot, silhouette score, or gap statistic.
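The elbow approach can be sketched with base R's kmeans(). The example assumes the built-in iris measurements as data and an arbitrary range of k from 1 to 6; the "elbow" is then read off the plot by eye.

```r
set.seed(42)  # kmeans() uses random starting centers
x <- scale(iris[, 1:4])

ks <- 1:6
# Total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# Look for the k after which the curve flattens out
if (interactive()) {
  plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
}
```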
45. PCA code:
prcomp(df, scale. = TRUE)
46. Random forest with 1000 trees:
randomForest(y ~ ., data = df, ntree = 1000)
47. Techniques to handle class imbalance:
SMOTE, downsampling or oversampling, and class weights in the loss function.
48. What is AUC?
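Area under the ROC curve. It equals the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one: 0.5 is chance level and 1.0 is a perfect ranking.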
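AUC can be computed without extra packages from the ranks of the scores (the Mann-Whitney formulation). This is a sketch: auc() is a hypothetical helper, labels are assumed to be coded 0/1, and the data is made up.

```r
auc <- function(labels, scores) {
  r <- rank(scores)  # ties receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # Sum of positive ranks, shifted and scaled to the 0..1 range
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

auc_value <- auc(labels, scores)
print(auc_value)  # 0.75
```

In practice a package such as pROC or yardstick would normally be used; the hand-rolled version is only meant to show what the number represents.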
49. Cross-validation using tidymodels:
Use vfold_cv(), workflow(), and fit_resamples().
50. Time series forecasting function?
forecast::auto.arima()
Deployment & Reproducibility
51. REST API with R?
Use plumber:
#* @post /predict
function(input) { predict(model, input) }
52. How to monitor model drift?
Track changes in input distributions or prediction errors; retrain if thresholds are crossed.
53. What’s the difference between caret and tidymodels?
caret is older, single-interface; tidymodels is modern, modular, tidyverse-aligned.
54. Matching for causal inference in R?
Use MatchIt for propensity matching.
55. What does vetiver do?
Helps standardize and deploy R models via APIs or dashboards.
56. What does renv do?
Manages per-project package versions and dependencies. Use renv::init().
57. Automate reports daily?
Use cronR or RStudio Connect to schedule rendering .Rmd.
58. How to revert last Git commit but keep changes?
git reset --soft HEAD~1
59. Why use Docker for R?
Creates reproducible environments, especially for Shiny/Plumber apps.
60. Minimal Shiny app example:
ui <- fluidPage(textInput("x", ""), textOutput("y"))
server <- function(input, output) { output$y <- renderText(input$x) }
shinyApp(ui, server)
61. What is DBI in R?
It provides a backend-agnostic interface to SQL databases, used by dplyr and dbplyr.
62. Why use Apache Arrow in R?
For reading/writing Parquet and large data files efficiently.
63. Parameterized R Markdown example:
In the YAML header:
params:
  region: "Asia"
Access with params$region.
64. What is the Rocker project?
Official Docker images for R, e.g., rocker/verse for tidyverse and R Markdown.
65. Outline your complete data pipeline in R:
Ingest (DBI) → Clean (dplyr) → Model (tidymodels) → Deploy (plumber/Shiny) → Monitor
(logs/alerts/Grafana).
66. How do you monitor concept drift once a model is deployed?
Monitor distribution changes in input features (e.g., using JS divergence, population stability
index) and prediction outcomes. Set up alerts when thresholds are crossed. You may use
packages like drifter or tools like Prometheus + Grafana.
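The population stability index mentioned above can be computed in a few lines of base R. This is a sketch under stated assumptions: psi() is a hypothetical helper, bins are the deciles of the training sample, a small floor avoids log(0), and the data is simulated. The alert threshold is left out because it is a project-specific choice.

```r
psi <- function(expected, actual, breaks) {
  # Share of observations falling in each bin, for both samples
  e <- table(cut(expected, breaks, include.lowest = TRUE)) / length(expected)
  a <- table(cut(actual, breaks, include.lowest = TRUE)) / length(actual)
  eps <- 1e-6  # floor so that empty bins do not produce log(0)
  e <- pmax(as.numeric(e), eps)
  a <- pmax(as.numeric(a), eps)
  sum((a - e) * log(a / e))
}

set.seed(1)
train <- rnorm(1000)              # feature at training time
same <- rnorm(1000)               # new data, same distribution
shifted <- rnorm(1000, mean = 1)  # new data after a shift in the mean

# Decile bin edges from the training sample, open-ended at both sides
breaks <- c(-Inf, quantile(train, probs = seq(0.1, 0.9, 0.1), names = FALSE), Inf)

psi_same <- psi(train, same, breaks)
psi_shifted <- psi(train, shifted, breaks)
```

The shifted sample yields a much larger index than the unshifted one, which is the signal a monitoring job would compare against its threshold.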
67. Explain the difference between caret and tidymodels.
● caret is a single package that wraps multiple ML models but is less modular.
● tidymodels is a modern, tidyverse-style framework with separate packages for
preprocessing (recipes), modeling (parsnip), resampling (rsample), and evaluation
(yardstick), providing better structure and extensibility.
68. Which R package helps with causal inference through matching?
Use the MatchIt package to perform propensity score matching, which balances covariates
across treatment groups in observational data.
69. What is the purpose of the vetiver package?
vetiver streamlines the deployment of R and Python models by versioning them, storing
metadata, and providing prediction APIs using plumber. It supports reproducible and secure
MLOps workflows.
70. Describe how to deploy an R model as a REST API using plumber.
● Create a plumber file with annotated endpoints
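#* @post /predict
function(req) {
  # req$body holds the parsed request body (plumber >= 1.0); `model` is assumed to be loaded earlier in the file
  predict(model, as.data.frame(req$body))
}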
● Use plumber::plumb("file.R")$run(port = 8000) to start the API
● Host using Posit Connect (formerly RStudio Connect), Docker, or cloud platforms like Heroku
Reproducibility & DevOps
71. What problem does the renv package solve and how is it used?
renv manages project-specific R package environments, ensuring consistent dependencies
across collaborators or servers. Use renv::init() to start and renv::snapshot() to lock
versions.
72. How can you schedule an R Markdown report to run daily?
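● Use the cronR package to create and manage cron jobs. Point it at an R script that calls rmarkdown::render("report.Rmd") (here a hypothetical render_report.R), since an .Rmd file cannot be run directly:
cmd <- cronR::cron_rscript("render_report.R")
cronR::cron_add(cmd, frequency = "daily", at = "07:00")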
● Or use RStudio Connect to schedule automatic rendering.
73. What Git command reverts the last commit but retains the changes staged?
git reset --soft HEAD~1
74. Why is Docker useful for deploying R applications?
Docker ensures consistent environments by bundling R, its packages, system libraries, and
configurations. It simplifies deployment of Shiny apps, Plumber APIs, and reports across
different systems.
75. Write a minimal working Shiny app that echoes text input.
library(shiny)
ui <- fluidPage(
  textInput("text", "Enter something"),
  textOutput("output")
)
server <- function(input, output) {
  output$output <- renderText(input$text)
}
shinyApp(ui, server)
76. What is the DBI package in R and how is it different from dplyr's DB tools?
DBI is a database interface providing low-level access to databases (e.g., using dbConnect,
dbGetQuery). dplyr DB tools build on top of DBI to allow manipulation using SQL-translated
dplyr verbs (filter, mutate, etc.) in a backend-agnostic way.
77. What benefit does Apache Arrow provide for R users?
Arrow defines a columnar in-memory format that supports zero-copy reads, and the arrow package
reads and writes columnar files such as Parquet efficiently. It improves speed and scalability
and supports cross-language data sharing (Python, R, Java, etc.).
78. How do you create a parameterized R Markdown report?
● Define parameters in the YAML header:
params:
  region: "Asia"
● Use params$region inside the document
● Render with:
rmarkdown::render("report.Rmd", params = list(region = "Europe"))
79. What does the Rocker project offer?
The Rocker project provides prebuilt Docker containers for R. Examples include:
● rocker/r-ver: base R
● rocker/verse: includes tidyverse, RMarkdown
● rocker/shiny: includes Shiny Server for deployment
80. Describe an end-to-end R data science pipeline you have built.
A complete R pipeline typically involves:
● Data Ingestion using DBI or readr
● Data Cleaning & Wrangling using dplyr or data.table
● EDA with ggplot2 or DataExplorer
● Model Building using tidymodels
● Model Evaluation via yardstick, cross-validation
● Deployment using plumber, shiny, or vetiver
● Monitoring using logs, dashboards (e.g., Grafana), or concept drift detection