Demand for R programmers remains strong and continues to grow, driven by the increasing importance of data analysis, statistical modeling, and machine learning across industries. This R programming language tutorial helps you understand the R programming basics and prepares you for R programming projects. If you are interested in learning more, explore our R programming course syllabus.
R Programming Basics
R is a free and open-source software environment and programming language that is mostly used for statistical computation and graphics. It has several packages that support different statistical and data analysis methods, and it is quite extendable.
Key Aspects of R Programming
Here are a few essential R features:
- Statistical Computing: R is used for statistical analysis and has built-in functions for modeling, time-series analysis, classification, clustering, and a variety of statistical tests.
- Data Visualization: To create different kinds of plots, packages like ggplot2 provide strong and adaptable capabilities.
- Open Source: R is open source, so it is free to use, modify, and share.
- Extensive Package Environment: The Comprehensive R Archive Network, or CRAN, is home to thousands of packages that enhance R’s functionality for specialized tasks in domains such as machine learning, finance, biostatistics, and more.
- Data Manipulation: R can be used to prepare data for analysis because it offers capabilities for data transformation, cleaning, and wrangling.
- Cross-Platform: R is compatible with Windows, macOS, and Linux, among other operating systems.
Because of its wide ecosystem of packages, statistical skills, and data visualization features, R is a powerful and versatile language that is preferred by statisticians, data scientists, and academics.
R Programming’s Role in Data Science and Statistics
R plays a central role in both statistics and data science. Here is why it matters:
In Data Science:
- Data Wrangling and Preparation: R has robust tools and packages for cleaning, converting, and reshaping data, such as tidyr and dplyr.
- Statistical Analysis: From simple descriptive statistics to sophisticated modeling, R provides a wide range of statistical methods.
- Data Visualization: With tools like ggplot2, R is excellent at producing visually appealing and informative visualizations.
- Machine Learning: For a variety of machine-learning applications, R has powerful packages such as caret, randomForest, and others.
- Reporting and Reproducibility: Tools like R Markdown let you create dynamic reports that combine code, output, and narrative, which makes analyses easier to reproduce.
- Big and Active Community: The enormous R community makes it simpler to obtain answers and keep up with new techniques by contributing a multitude of packages and offering support.
In Statistics:
- Statistical Computing: R was created by statisticians for statistical work. It offers a comprehensive environment for applying statistical techniques and developing new ones.
- Hypothesis Testing: To conduct hypothesis testing and derive conclusions from data, R includes built-in functions for a variety of statistical tests (t-tests, ANOVA, chi-squared tests, etc.).
- Statistical Modeling: R excels at creating a wide range of statistical models, such as time series analysis and linear and non-linear regression.
- Probability Distributions: Calculating probabilities, creating random samples, and displaying distributions are just a few of the functions that R provides for working with different probability distributions (normal, binomial, Poisson, etc.).
- Customization and Extensibility: Statisticians can easily write their own functions and build new packages to implement specific statistical procedures.
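As a minimal sketch of the hypothesis-testing and probability-distribution functions mentioned above, the following uses only the stats package that ships with R. The sample values and variable names are made up for illustration.

```r
# Hypothetical measurements for two groups (illustrative values only)
group_a <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3)
group_b <- c(6.2, 6.8, 7.1, 6.5, 7.0, 6.9)

# Two-sample t-test (R runs Welch's version by default)
test_result <- t.test(group_a, group_b)
p_value <- test_result$p.value

# Distribution functions follow a naming pattern:
# d* = density, p* = cumulative probability, q* = quantile, r* = random draws
prob_below_196 <- pnorm(1.96)                         # P(Z <= 1.96), standard normal
prob_three_heads <- dbinom(3, size = 5, prob = 0.5)   # P(exactly 3 heads in 5 fair flips)

set.seed(42)                                          # make the random draws reproducible
poisson_sample <- rpois(10, lambda = 4)               # 10 draws from a Poisson(4) distribution
```

The same d/p/q/r prefix pattern applies to the other built-in distributions, so dnorm(), pbinom(), and qpois() work the way their names suggest.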
In Short:
- For Data Scientists: With a focus on statistical rigor, R is an effective tool for the full data science pipeline, from data preparation and exploration to modeling and visualization.
- For Statisticians: R offers an adaptable and expandable framework for statistical analysis, computation, and the creation of novel statistical techniques.
R is still a mainstay in these domains, even though other tools and languages, such as Python, are also widely used, especially when thorough statistical analysis and excellent data visualization are crucial.
Installing R and RStudio
Here is the installation guide for R and RStudio:
Installing R:
- Go to the CRAN (Comprehensive R Archive Network) website: https://cran.r-project.org/
- Choose the link corresponding to your operating system (Windows, macOS, Linux).
- Follow the instructions to download and install the base R package.
Installing RStudio
RStudio is an Integrated Development Environment (IDE) that makes working with R considerably easier.
- Go to the RStudio website: https://posit.co/download/rstudio-desktop/
- Under “RStudio Desktop,” choose the installer appropriate for your operating system.
- Download and run the installer. Make sure you have installed R before installing RStudio.
The RStudio Interface
Let’s explore the RStudio interface briefly. Usually, it is separated into four major panes:
Source Editor (Top-Left):
- Where your R scripts are written and edited.
- Here, you can also examine and open data files.
- Enables you to store your code for future use.
Console (Bottom-Left):
- Where the commands you run are actually carried out by R.
- This is where your code's output (messages, warnings, errors, and results) appears.
- Additionally, commands can be typed into the console and executed instantly.
Environment/History/Connections/Tutorial (Top-Right):
- Environment: Displays the items you’ve made in your current R session, such as variables, data frames, functions, etc.
- History: Shows a list of the commands you've previously run in the console.
- Connections: Lets you set up and manage database connections.
- Tutorial: Offers interactive lessons, if any are provided.
Files/Plots/Packages/Help/Viewer/Presentations (Bottom-Right):
- Files: The files and folders in your working directory are displayed in a file explorer.
- Plots: Where R-generated graphs and visualizations are shown.
- Packages: Lists the R packages that have been loaded and installed. New packages can be installed here as well.
- Help: Offers documentation for R packages and functions.
- Viewer: Used to show local web content, such as output from R Markdown documents or Shiny apps.
- Presentations: Displays presentations created in R.
The arrangement of these panes can be altered. All of the tools required to work with R are integrated in one convenient location with RStudio.
R Programming Basics
Here are the fundamentals of the R programming language:
Variables and Assignment:
To assign values to variables, you use the assignment operator <- (or = in some situations, but <- is usually preferred in R).
x <- 10
name <- "Alice"
is_valid <- TRUE
Data Types:
R offers a number of basic data types:
- Numeric: For real numbers, such as 3.14 and -5.
- Integer: For whole numbers, such as 5L. Note the L suffix, which explicitly marks the value as an integer.
- Character: For text strings (such as "hello" or "world") enclosed in single or double quotes.
- Logical: For the boolean values TRUE and FALSE.
- Factor: For categorical data, whether nominal or ordinal.
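The types listed above can be checked with class(). This is a short sketch; the sizes variable is a made-up example.

```r
# class() reports the type of each value
class(3.14)      # "numeric"
class(5L)        # "integer"
class(5)         # "numeric": without the L suffix, 5 is stored as a double
class("hello")   # "character"
class(TRUE)      # "logical"

# A factor stores categorical data as a fixed set of levels
sizes <- factor(c("small", "large", "small"), levels = c("small", "large"))
class(sizes)     # "factor"
levels(sizes)    # "small" "large"
```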
Data Structures:
Data collections can be arranged in the following ways:
Vector: A one-dimensional sequence of elements that all share the same data type.
numbers <- c(1, 2, 3, 4, 5)
names <- c("Bob", "Charlie", "David")
List: Can store elements of different data types.
my_list <- list(name = "Alice", age = 30, scores = c(85, 92, 78))
Matrix: A two-dimensional array whose elements all share the same data type.
my_matrix <- matrix(1:9, nrow = 3, ncol = 3)
Array: A generalization of the matrix to more than two dimensions.
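A small sketch of an array, using the base R array() function (the variable name my_array is only illustrative):

```r
# A 2 x 3 x 4 array: 2 rows, 3 columns, 4 layers
my_array <- array(1:24, dim = c(2, 3, 4))

dim(my_array)        # 2 3 4
my_array[1, 2, 3]    # row 1, column 2, layer 3 -> 15 (values fill column by column)
```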
Data Frame: A spreadsheet-like tabular structure whose columns can hold different data types. It is the most commonly used structure for datasets in R.
my_df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(30, 25, 35),
  score = c(85, 92, 78)
)
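Once these structures exist, you will usually want to pull values back out of them. The following sketch shows common ways to index each structure. It redefines the objects so that it runs on its own, and all values are illustrative.

```r
# Vectors: square brackets, with positions starting at 1
numbers <- c(10, 20, 30, 40, 50)
numbers[2]                 # 20
numbers[c(1, 5)]           # 10 50

# Lists: $ or [[ ]] to extract a single element
my_list <- list(name = "Alice", age = 30, scores = c(85, 92, 78))
my_list$name               # "Alice"
my_list[["scores"]][2]     # 92

# Matrices: [row, column]
my_matrix <- matrix(1:9, nrow = 3, ncol = 3)
my_matrix[2, 3]            # 8 (the matrix is filled column by column)

# Data frames: $ for a column, [row, column] for a cell
my_df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(30, 25, 35),
  score = c(85, 92, 78)
)
my_df$age                  # 30 25 35
my_df[2, "score"]          # 92
```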
Operators:
- Arithmetic: + (addition), - (subtraction), * (multiplication), / (division), ^ or ** (exponentiation), %% (modulo, i.e. the remainder).
- Comparison: == (equal to), != (not equal to), > (greater than), < (less than), >= (greater than or equal to), <= (less than or equal to).
- Logical: & (AND), | (OR), ! (NOT).
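Here is a brief sketch of these operators in action. Note that they work element by element on vectors; the ages vector is a made-up example.

```r
# Arithmetic
7 + 2      # 9
7 / 2      # 3.5
2 ^ 3      # 8
7 %% 2     # 1 (remainder of 7 divided by 2)

# Comparison and logical operators apply to each element of a vector
ages <- c(22, 35, 41)
over_30 <- ages > 30                    # FALSE TRUE TRUE
in_thirties <- ages > 30 & ages < 40    # FALSE TRUE FALSE
not_over_30 <- !over_30                 # TRUE FALSE FALSE
```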
Basic Functions:
R comes with a large number of built-in functions.
- print(): Prints an object's value to the console.
- length(): Returns the number of elements in a vector or list.
- class(): Returns an object's class, such as "numeric" or "data.frame".
- str(): Gives a succinct overview of an object’s structure.
- head(): Displays a data frame’s initial few rows.
- tail(): Displays a data frame’s final few rows.
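As a quick sketch, here are these functions applied to a small vector of made-up scores. head() and tail() work on vectors as well as data frames.

```r
scores <- c(85, 92, 78, 64, 90, 71, 88)

length(scores)     # 7
class(scores)      # "numeric"
head(scores, 3)    # 85 92 78
tail(scores, 2)    # 71 88
str(scores)        # compact one-line summary of type, length, and values
print(scores)      # prints the whole vector
```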
Working with Data in R Programming
Let’s explore how to use R to work with data. Usually, this entails a few crucial steps:
- Importing Data: Getting data into R.
- Exploring and Understanding Data: Gaining an understanding of the content and structure of the data.
- Manipulating Data: Cleaning, transforming, and reshaping the data.
- Analyzing Data: Building models or performing statistical analysis.
Importing Data
Data from a variety of formats can be imported into R. Commonly used functions include the following:
read.csv(): For reading comma-separated value (CSV) files.
my_data <- read.csv("your_file.csv")
read.table(): For reading delimited text files more generally. You choose the separator.
my_data <- read.table("your_file.txt", sep = "\t", header = TRUE) # Tab-separated, with a header row
read_excel(): For reading Excel files (.xls and .xlsx) with the readxl package. Install the package first (install.packages("readxl")), then load it (library(readxl)).
library(readxl)
my_excel_data <- read_excel("your_file.xlsx", sheet = "Sheet1")
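If you do not have a data file at hand, you can try the import step with a round trip: write a small data frame to a temporary CSV file and read it back. This is only a sketch; the data values and variable names are invented for illustration.

```r
# Build a small illustrative data frame
sample_df <- data.frame(
  name = c("Alice", "Bob"),
  age = c(30, 25)
)

# Write it to a temporary CSV file, then read it back in
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_df, csv_path, row.names = FALSE)
imported <- read.csv(csv_path)

print(imported)     # same two rows and two columns as sample_df
unlink(csv_path)    # remove the temporary file
```

Setting row.names = FALSE keeps write.csv() from adding an extra column of row numbers to the file.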
Exploring and Understanding Data
- head(my_data): Displays the data frame’s initial few rows.
- tail(my_data): Displays the last few rows.
- str(my_data): Shows the data frame’s structure, including each column’s data type.
- summary(my_data): Offers summary statistics for every column (e.g., frequencies for factors; mean, median, min, max for numeric columns).
- dim(my_data): Displays the data frame’s dimensions, including the number of rows and columns.
- names(my_data) or colnames(my_data): Returns the column names.
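To try these functions without importing anything, you can use mtcars, an example dataset that ships with R. Assigning it to my_data here is only for consistency with the function calls listed above.

```r
my_data <- mtcars       # built-in example dataset: 32 cars, 11 variables

dim(my_data)            # 32 11
names(my_data)[1:3]     # "mpg" "cyl" "disp"
head(my_data, 3)        # first three rows
str(my_data)            # every column is numeric
summary(my_data$mpg)    # min, quartiles, median, mean, and max of one column
```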
Manipulating Data
R provides strong data manipulation features, particularly when using tidyverse packages (such as dplyr). Let’s examine a few fundamental operations:
Selecting Columns:
# Using $
just_names <- my_data$name
# Using square brackets
just_ages <- my_data[, "age"] # All rows, column named "age"
just_name_and_score <- my_data[, c("name", "score")]
Filtering Rows:
older_than_28 <- my_data[my_data$age > 28, ] # Rows where age is greater than 28, all columns
Adding New Columns:
my_data$grade <- ifelse(my_data$score >= 80, "A", "B")
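The base R operations above can be combined into one self-contained sketch. It also sorts rows with order(), a base R function that the text above does not cover. The data values are made up for illustration.

```r
# Illustrative data frame
my_data <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(30, 25, 35),
  score = c(85, 92, 78)
)

# Select columns, filter rows, and add a column
name_and_score <- my_data[, c("name", "score")]
older_than_28 <- my_data[my_data$age > 28, ]
my_data$grade <- ifelse(my_data$score >= 80, "A", "B")

# Sort rows: order() returns the row positions in sorted order
by_age <- my_data[order(my_data$age), ]
by_score_desc <- my_data[order(my_data$score, decreasing = TRUE), ]

by_age$name           # "Bob" "Alice" "Charlie"
by_score_desc$name    # "Bob" "Alice" "Charlie"
my_data$grade         # "A" "A" "B"
```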
Using dplyr (requires installation with install.packages("dplyr") and loading with library(dplyr)):
The dplyr package offers a more readable syntax for data manipulation.
select(): Choose columns.
library(dplyr)
selected_data <- my_data %>% select(name, score)
filter(): Choose rows based on conditions.
filtered_data <- my_data %>% filter(age > 28)
mutate(): Modify existing columns or add new ones.
mutated_data <- my_data %>% mutate(passed = score >= 70)
arrange(): Sort rows.
sorted_data <- my_data %>% arrange(age) # Ascending order
sorted_data_desc <- my_data %>% arrange(desc(score)) # Descending order
Control Flow Statements in R Programming
By using control flow statements, you can control the sequence in which code is run in response to specific criteria. The following are R’s primary control flow structures:
Conditional Statements (if, else if, else): Depending on whether a condition is true or false, these let you run alternative code blocks.
x <- 10
if (x > 0) {
  print("x is positive")
}
y <- -5
if (y > 0) {
  print("y is positive")
} else {
  print("y is not positive")
}
z <- 0
if (z > 0) {
  print("z is positive")
} else if (z < 0) {
  print("z is negative")
} else {
  print("z is zero")
}
Loops (for, while, repeat): These let you repeatedly run a block of code.
for loop: Iterates over a sequence (such as a vector or a list).
numbers <- 1:5
for (i in numbers) {
  print(i^2)
}
fruits <- c("apple", "banana", "cherry")
for (fruit in fruits) {
  print(paste("I like", fruit))
}
while loop: Runs a block of code for as long as a condition is true. To avoid an infinite loop, make sure the condition eventually becomes false.
count <- 1
while (count <= 5) {
  print(paste("Count is:", count))
  count <- count + 1
}
repeat loop: Runs a block of code over and over until it reaches a break statement.
counter <- 1
repeat {
  print(paste("Counter:", counter))
  counter <- counter + 1
  if (counter > 5) {
    break
  }
}
break and next: These are employed to change the way loops flow.
break: Terminates the loop instantly.
for (i in 1:10) {
  if (i > 5) {
    break
  }
  print(i)
}
next: Proceeds to the subsequent loop iteration, bypassing the current one.
for (i in 1:5) {
  if (i == 3) {
    next
  }
  print(i)
}
These control flow statements are the building blocks for more complex and dynamic R programs.
Functions in R Programming
Functions are reusable blocks of code that carry out a specific task. They are essential for writing modular, well-structured, and efficient R programs.
Defining a Function:
The function keyword is used in R to define a function. The standard syntax is:
function_name <- function(argument1, argument2, ...) {
  # Code to be executed inside the function
  # ...
  return(value) # Optional: the value the function returns
}
- function_name: The name you give your function.
- argument1, argument2, ...: The parameters the function accepts. A function can take zero or more arguments.
- The body of the function is the code enclosed in curly braces {}.
- return(value): Optional. It specifies the value the function gives back to the caller. If return() is not called explicitly, the function returns the value of the last expression evaluated in its body.
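The difference between an implicit return and an explicit return() can be sketched as follows. The function names square and safe_divide are hypothetical helpers written for this illustration.

```r
# No return() call: the value of the last expression is returned
square <- function(x) {
  x^2
}
square(4)             # 16

# return() is handy for leaving a function early
safe_divide <- function(a, b) {
  if (b == 0) {
    return(NA)        # exit immediately when dividing by zero
  }
  a / b               # otherwise the last expression is the result
}
safe_divide(10, 2)    # 5
safe_divide(1, 0)     # NA
```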
Example 1: A function that adds two numbers:
add_numbers <- function(a, b) {
  sum_result <- a + b
  return(sum_result)
}
# Calling the function
result <- add_numbers(5, 3)
print(result) # Output: 8
Example 2: A function with a default argument:
greet <- function(name = "Guest") {
  greeting <- paste("Hello,", name, "!")
  print(greeting)
}
greet("Alice") # Output: Hello, Alice !
greet() # Output: Hello, Guest ! (using the default argument)
Example 3: A function that returns multiple values (as a list):
stats <- function(x) {
  m <- mean(x)
  s <- sd(x)
  return(list(mean = m, standard_deviation = s))
}
data <- c(1, 2, 3, 4, 5)
results <- stats(data)
print(results)
# Output:
# $mean
# [1] 3
#
# $standard_deviation
# [1] 1.581139
Key Aspects of R functions:
- First-class objects: Functions in R are first-class objects. This means you can assign them to variables, pass them as arguments to other functions, and return them from other functions.
- Lexical scoping: R uses lexical scoping, also known as static scoping. The values of free variables in a function are looked up in the environment where the function was defined, not the environment from which it was called.
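Both properties can be seen in a short sketch. All function and variable names below are hypothetical and chosen only for illustration.

```r
# First-class functions: pass a function as an argument
apply_twice <- function(f, x) {
  f(f(x))
}
add_ten <- function(x) x + 10
apply_twice(add_ten, 1)    # 21

# First-class functions: return a function from a function
make_multiplier <- function(m) {
  function(x) x * m        # m is a free variable of the inner function
}
times_two <- make_multiplier(2)

# Lexical scoping: m is looked up where the inner function was defined,
# so a global variable with the same name does not change the result
m <- 100
times_two(5)               # 10, not 500
```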
Functions are an essential building block for writing more complex and maintainable R code.
Basic Data Visualization Using R Programming
Let's explore the fundamentals of data visualization in R. The base R graphics system provides the essential plotting functions, while the ggplot2 package is powerful and widely used for more complex and polished plots.
Base R Graphics:
Several built-in functions in Base R allow you to create common plot types:
plot(): A flexible function that produces a scatter plot when you supply two numeric vectors, plots the values against their index when you supply one, and draws a line graph when you set type = "l".
# Scatter plot
x <- 1:10
y <- x^2
plot(x, y, main = "Scatter Plot", xlab = "X-axis", ylab = "Y-axis")
# Line plot
z <- sin(seq(0, 2*pi, length.out = 20))
plot(z, type = "l", main = "Sine Wave", ylab = "Amplitude")
hist(): Creates histograms that show how a single numeric variable is distributed.
data <- rnorm(100) # Generate 100 random numbers from a normal distribution
hist(data, main = "Histogram of Random Data", xlab = "Value")
boxplot(): Creates box plots that compare the distributions of one or more groups.
group_a <- rnorm(50, mean = 5)
group_b <- rnorm(50, mean = 7)
boxplot(group_a, group_b, names = c("Group A", "Group B"), main = "Box Plot Comparison")
barplot(): Creates bar charts, which are often used to show the size of different quantities or the frequency of categorical data.
counts <- c(10, 15, 8, 12)
names <- c("A", "B", "C", "D")
barplot(counts, names.arg = names, main = "Bar Chart of Counts")
ggplot2 Package:
With ggplot2's grammar of graphics, plots are constructed layer by layer. The package is very versatile and produces publication-quality graphics.
Before loading it (library(ggplot2)), make sure it is installed (install.packages("ggplot2")).
A simple ggplot2 plot includes:
- ggplot(): Sets the data frame to be used and initializes the plot.
- aes() (aesthetics): Defines how variables in your data map to visual properties such as the x-axis, y-axis, color, shape, and size.
- geom_ functions (geometric objects): Specify the plot type (e.g., geom_point() for scatter plots, geom_line() for line plots, and geom_bar() for bar charts).
The base R plots above have the following ggplot2 equivalents:
library(ggplot2)
# Scatter plot
df_scatter <- data.frame(x = 1:10, y = (1:10)^2)
ggplot(df_scatter, aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Scatter Plot", x = "X-axis", y = "Y-axis")
# Line plot
df_line <- data.frame(index = 1:20, value = sin(seq(0, 2*pi, length.out = 20)))
ggplot(df_line, aes(x = index, y = value)) +
  geom_line() +
  labs(title = "Sine Wave", y = "Amplitude", x = "Index")
# Histogram
df_hist <- data.frame(values = rnorm(100))
ggplot(df_hist, aes(x = values)) +
  geom_histogram(binwidth = 0.5, fill = "lightblue", color = "black") +
  labs(title = "Histogram of Random Data", x = "Value", y = "Frequency")
# Box plot
df_box <- data.frame(
  value = c(rnorm(50, mean = 5), rnorm(50, mean = 7)),
  group = factor(rep(c("A", "B"), each = 50))
)
ggplot(df_box, aes(x = group, y = value)) +
  geom_boxplot() +
  labs(title = "Box Plot Comparison", x = "Group", y = "Value")
# Bar chart
df_bar <- data.frame(category = c("A", "B", "C", "D"), count = c(10, 15, 8, 12))
ggplot(df_bar, aes(x = category, y = count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Bar Chart of Counts", x = "Category", y = "Count")
ggplot2 also allows far more customization of themes, colors, and labels, and lets you add multiple layers to a single plot.
Popular R Programming Packages
Here are some well-known R packages, grouped by their main purpose:
For Data Manipulation
- dplyr: A robust data manipulation package that offers a set of user-friendly verbs for common data wrangling operations such as selecting, filtering, modifying, and arranging data. It is part of the tidyverse.
- tidyr: Another essential tidyverse package for "tidying" data, which makes reshaping data (such as from wide to long format and vice versa) simple.
- data.table: A fast, high-performance alternative to data frames that is optimized for large datasets.
- stringr: A consistent and intuitive set of functions for manipulating strings.
- lubridate: Utilities for parsing, processing, and formatting date-time data, which make working with dates and times easier.
For Data Visualization
- ggplot2: Based on the Grammar of Graphics, this versatile and powerful package can produce a wide variety of customizable plots. It is part of the tidyverse.
- plotly: Creates interactive, web-based plots.
- leaflet: Creates interactive maps.
For Machine Learning
- caret: Offers a unified interface for training and evaluating a wide range of machine learning models.
- randomForest: Implements the random forest algorithm for regression and classification.
- xgboost: An efficient and scalable gradient boosting library.
- tidymodels: A collection of packages, such as rsample, parsnip, dials, workflows, tune, and yardstick, that offers a tidy and consistent approach to modeling and machine learning in R.
For Reporting and Reproducibility
- knitr: Embeds R code and its output directly into documents (such as Markdown, LaTeX, and HTML).
- rmarkdown: Builds on knitr to create dynamic documents that combine code, text, and output in a variety of output formats.
- shiny: Builds interactive dashboards and web apps with R.
Conclusion
Well done on taking your first steps into R programming! With the basics covered in this R Programming Language Tutorial, you can now handle data, write code, and create simple visualizations. We encourage you to apply your new skills to real data problems and to keep exploring R's extensive ecosystem of packages with our R Programming Course in Chennai.