R and RStudio: Introduction and Data Structures

R is a free programming language for statistical computing and graphics. It is an implementation of the S programming language and was created by Ross Ihaka and Robert Gentleman at the Univeristy of Auckland, New Zealand. R is currently developed by the R Development Core Team. RStudio is an Integrated Development Environment (IDE) for R.

To start, download the latest versions of R and RStudio following the instructions provided here

Getting Started

Open RStudio locally and learn how to use the RStudio interface.

Basic Operations in R

We can use R as a calculator to do simple math operations such as addition (+), subtraction (-), multiplication (*), division (), and raise a value to a power (^). We can type these calculations in the console or run them in an R script that extension ends in .R

#We can use hashtags to make comments about our code
#Basic calculations in R
4 + 5
4 - 5
4 * 5
4/5
#Outputs of calculations
[1] 9
[1] -1
[1] 20
[1] 0.8
#Calculate exponents using ^
4^5
#Output of exponent
[1] 1024

Data Structures

R has many data structures and types that we can use, depending on the information we want to work with.

The major data types include:

  • character
  • numeric (real or decimal)
  • integer
  • logical
  • double
  • complex

The major data structures include:

  • Scalars
  • Atomic Vectors
  • Factors
  • Lists
  • Matrices and Arrays
  • Dataframes

Scalars

The simplest type of object is a scalar which is an object with one value. We can assign a value or calculations to a variable using the assignment operator “<-“.

Note: The equals sign “=” is not an assignment operator in R and has a different functionality which will be discussed further below.

To create scalar data objects x and y:

#Set x and y as values
x <- 4
y <- 5

The objects x and y were set a numeric data type.

We can manipulate these objects in R and perform different calculations together. To print the value of these variables, we can use the print() function or call the variable itself.

#Calculations with numeric variables

z <- x+y

z

print(z)

x*y/z
#Output of calculations

[1] 9

[1] 9

[1] 29

As stated above, we can also create data objects of other data types such as logical and character mode.

For logical data, we use TRUE (T) and FALSE (F)

Logical <- T

Logical

[1] TRUE

For characte data, we use single or double quotation to enclose the data

Character_Data <- "T"

Character_Data

[1] "T"

We can use available functions in R to determine the mode or type of data we are working with.

#Use mode function
mode(x)
[1] "numeric"

mode(Logical)
[1] "logical"

mode(Character_Data)
[1] "character"
#Use is.object() function
is.numeric(x)
[1] TRUE

is.logical(Logical)
[1] TRUE

is.numeric(Character_Data)
[1] FALSE

Vectors

A vector is a basic data structure in R. It is a set of scalars of the same data type.

We can create vectors in different ways.

One of the main ways is to use the function c() to concatenate multiple scalars together.

x <- c(1, 5, 4, 9, 0)

x

[1] 1  5  4  9  0

We can use function typeof() to determine the data type of a vector, and we can check the length of the vector using the funtion length() .

typeof(x)

[1] "double"

length(x)

[1] 5

If we set x to have elements of different data types, the elements will be coerced to the same type.

x <- c(1, 5, FALSE, 9, "help")

x

[1] "1"  "5"  "FALSE"  "9"  "help"

typeof(x)

[1] "character"

Instead of reassigning the elements of x using the function c(), we could reassign specific elements based on the index number.

#Reassign third and fifth elements back to original values
x

[1] "1"  "5"  "FALSE"  "9"  "help"

x[3] <- 4

x[5] <- 0

x

[1] 1  5  4  9  0

typeof(x)

[1] "double"

Other ways to creat vectors is to use other operators and functions such as “:” operator, seq() function, and rep() function.

#Create vector of consecutive numbers

y <- 1:10

y

[1] 1  2  3  4  5  6  7  8  9  10

#Create vector of a sequence of numbers
#Defining number of points in an interval or step size

seq(1, 10, by = 1)

[1]  1  2  3  4  5  6  7  8  9 10

seq(1, 10, length.out = 10)

[1]  1  2  3  4  5  6  7  8  9 10

#Create vector of the same values

rep(3, 5)  # A set of 5 numbers with value set as 3

[1] 3 3 3 3 3

Factors

A factor is a special type of character vector. Factors are qualitative or categorical variables that are often used in statistical modeling. To create a factor data structure, we will first create a character vector and convert it to a factor using the factor() function.

temperature <- c("High","Medium","Low")
temperature <- factor(temperature)

Converting temperature character vector to a factor type creates “levels” based on the factor values (these are the values of categorical variables).

temperature

[1] High Medium Low
Levels: High Low Medium

Matrices and Arrays

So far we have discussed one-dimensional objects. We can create objects of multidimensional data. Matrices are data structures that contain data values in two dimensions. An array is a matrix with more than two dimensions. Matrices and arrays are used perform efficient calculations in a computationally fast and efficient manner.

To create a matrix, we can use the matrix() function, which takes as arguments a data vector and parameters for the number of rows and columns.

We can determine the dimensions of a matrix using the dim() function.

#Create a simple 2 by 2 matrix.

mat<-matrix(c(2,6,3,8),nrow=2,ncol=2)

mat

    [,1] [,2]
[1,] 2    3
[2,] 6    8

dim(mat)

[1] 2 2

We can also choose to add row names and column names to the matrix.

#Add row names and column names

rownames(mat) <- c("a", "b")

colnames(mat) <- c("c", "d")

  c d
a 2 3
b 6 8

#Add row names and column through the matrix function

mat<-matrix(c(2,6,3,8),nrow=2,ncol=2,
            dimnames = list(
                c(a,b),
                c(c,d)
            )
            )

mat

  c d
a 2 3
b 6 8

We can also create a matrix by concatenating vectors together using rbind() function to concatenate by rows or cbind() function to concatenate by columns.

x <- 1:3

y <- 4:6

# Combine by rows
a <- rbind(x,y)

a
   [,1] [,2] [,3]
x    1    2    3
y    4    5    6
# Combined by columns
b <- cbind(x,y)

b
     x y
[1,] 1 4
[2,] 2 5
[3,] 3 6

To create an array, we can use the function array(), which takes as arguments vectors as input and uses the values in the dim parameter to create an array.

vector1 <- c(1,2,3)
vector2 <- c(5,6,7,8,9,10)

# Create an array with dimension (3,3,2) that creates 2 arrays each with 3 rows and 3 columns.

array1 <- array(c(vector1,vector2),dim = c(3,3,2))

array1
, , 1

     [,1] [,2] [,3]
[1,]    1    5    8
[2,]    2    6    9
[3,]    3    7   10

, , 2

     [,1] [,2] [,3]
[1,]    1    5    8
[2,]    2    6    9
[3,]    3    7   10

Lists

Lists are data objects which contain elements of different types including numbers, strings, vectors, and other lists. A list can also contain a matrix or even a function as its elements.

#Create a list of different data types

list_data <- list(c(2,4,6,8), "Hello", matrix(c(11,12,13,14),nrow=2,ncol=2),TRUE, 62.13, FALSE)
print(list_data)

# Give names to the elements in the list

names(list_data) <- c("Vector1", "Character1", "Matrix1", "Logical1", "Numeric", "Logical2")

list_data
$Vector1
[1] 2 4 6 8

$Character1
[1] "Hello"

$Matrix1
     [,1] [,2]
[1,]   11   13
[2,]   12   14

$Logical1
[1] TRUE

$Numeric
[1] 62.13

$Logical2
[1] FALSE

We can use the function str() to list the underlying structure of the data object.

str(list_data)
  List of 6
$ Vector1   : num [1:4] 2 4 6 8
$ Character1: chr "Hello"
$ Matrix1   : num [1:2, 1:2] 11 12 13 14
$ Logical1  : logi TRUE
$ Numeric   : num 62.1
$ Logical2  : logi FALSE

Data Frames

A data frame is a table in which each column contains values of one variable or vector and each row contains one set of values from each column. Within each column, all data elements must be of the same data type. However, different columns can be of different data types. The data stored in a data frame can be of numeric, factor or character type. In addition, each column should contain same number of data elements.

To create a data frame, we can use the function data.frame():

#Create a data frame with employee ID, salaries, and start dates

emp.data <- data.frame(
 emp_id = c("U974","U503","U298","U545","U612"),
 salary = c(623.3,515.2,611.0,729.0,843.25),
 start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
    "2015-03-27")),
 stringsAsFactors = FALSE
)

emp.data
    emp_id salary start_date
1     U974 623.30 2012-01-01
2     U503 515.20 2013-09-23
3     U298 611.00 2014-11-15
4     U545 729.00 2014-05-11
5     U612 843.25 2015-03-27

We can use the function str() to list the underlying structure of the data object.

str(emp.data)
 'data.frame': 5 obs. of  3 variables:
$ emp_name  : chr  "U974" "U503" "U298" "U545" ...
$ salary    : num  623 515 611 729 843
$ start_date: Date, format: "2012-01-01" "2013-09-23" ...

We can extract data from the data frame and also add data to the data frame.

#Extract salary information
emp.data$salary
[1] 623.30 515.20 611.00 729.00 843.25
#Add column vector
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
    emp_id salary start_date       dept
1     U974 623.30 2012-01-01         IT
2     U503 515.20 2013-09-23 Operations
3     U298 611.00 2014-11-15         IT
4     U545 729.00 2014-05-11         HR
5     U612 843.25 2015-03-27    Finance

More Examples of Data Structures and Types

To learn more about data types and structures and see more examples, watch these available videos below. Part 1 Part 2

Conditional Statements and Looping

Logical and relational operators

Logical and relational operators can be used to execute code based on certain conditions. Common operators include:

../../_images/Logical_Operators.png

If statements

q <- 3
t<-5

#if else conditional statement

if(q<t){

 w<-q+t

 } else

   w<-q-t
 w

[1] 8
a<-2
b<-3
c<-4
#Using and to test two conditions, both true

if(a<b & b<c) x<-a+b+c
 x
[1] 9

Looping

We can use looping to efficiently repeat code without having to write the same code over and over.

The while loop repeats a condition while the expression in parenthesis holds true and takes the form:

while (condition controlling flow is true)
      perform task
x<-0
while(x<=5){x<-x+1}
x
[1] 6

For loops are used to iterate through a process a specified number of times. A counter variable such as “i” is used to count the number of times the loop is executed:

for (i in start:finish)
    execute task

An example is to add values 1 to 10 to vector y using a for loop.

#Create empty vector
y<-vector(mode="numeric")

#Loop through 1 to 10 to add values to y
for(i in 1:10){
   y[i]<-i
   }
y

[1] 1 2 3 4 5 6 7 8 9 10

To learn more about if statements and logical operators, check out this video

Alternatives to using looping and conditional statements include using the apply function in R. A quick introduction to apply function is provided here.