1  Project Part 1: Finding a dataset

Data Science investigations often involve deriving insights from existing or observational data, which differs from designing an experiment or collecting data for a specific use. Such an investigation may involve data that is publicly available, and the available data may bound the scope of a given project, since only some of the questions posed a priori may be answerable. However, relevance, interest, quality, background knowledge, and other aspects can be key factors that influence the questions you want to consider and the data you would like to obtain.

We encourage you to choose data that is relevant and of interest to you. Given that there are many ways to go about this process and each person may have their own interests, we offer some general guidelines so that your data investigation is productive and allows you to apply the programming skills and frameworks that you will learn.

1.1 Dataset Requirements

When choosing your dataset of interest, think about characteristics that will allow you to pose dynamic and insightful questions and to explore potential trends, patterns, and relationships. These considerations include the number of observations, the number of variables, and the types of the variables within the data.

Please note the following guidelines in choosing a dataset for your project.

Data accessibility is a nice feature. If everyone is able to access the data you choose, they will also be able to explore and investigate it, whether in a similar way or a different one.

  • Choose a publicly available dataset without restrictions or special permissions.

Your data should contain a sufficient amount of information, but also be manageable within the context of the project and course goals.

  • Choose a dataset between 300 and 10,000 observations (rows).

In order to consider various and common data types and to be able to compare, group, visualize, and perform other data moves and transformations:

  • Choose a dataset with 10 to 30 variables (columns) that includes both quantitative and categorical data types. Your chosen data may also include character or text-based data types.

Also consider datasets with a time component, which may give you the option to investigate or observe temporal change.

1.1.1 Other dataset considerations

Other considerations for choosing your dataset may include the format of the available data. For example, if you have a time feature in your data, this can be formatted as a column for each different measure across time (wide format), or as multiple rows per unit, representing different times (long format). Both formats can be useful depending on your project step and the corresponding goal, but keep in mind that some datasets may require more procedural steps than others (e.g., data wrangling).
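
To make the wide versus long distinction concrete, here is a minimal sketch using a made-up dataset and the tidyr package (which we have not yet introduced); the dataset and column names below are hypothetical and only for illustration.

# A minimal sketch: converting wide-format data (one column per year)
# to long format (one row per unit-year combination) with tidyr.
# The scores_wide dataset below is made up for illustration.

library(tidyr)

scores_wide <- data.frame(
  school = c("A", "B", "C"),
  y2021  = c(10, 12, 9),
  y2022  = c(11, 14, 10)
)

scores_long <- pivot_longer(scores_wide,
                            cols = c(y2021, y2022),
                            names_to = "year",
                            values_to = "score")

scores_long # one row per school-year combination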

You should also attempt to identify potential hurdles such as missing data. In this case, you will want to understand why some data is missing and be able to identify references that explain this occurrence. Furthermore, you should ensure that your intended data use complies with ethical guidelines. As a guide on broad considerations (e.g., transparency, purpose specification, and others), designed for agencies but also applicable to individuals, you can reference the Fair Information Practice Principles (1).

If you would like your class project to involve data from your research, make sure to adhere to the guidelines above and consult your instructor for additional requirements. This may include obtaining permission from your advisor and compliance with any related research guidelines.

Once you find a dataset, you will need to be able to work with it! So, let’s explore some initial steps to get started using the R programming language through the RStudio platform.

1.2 Data Types & Structures

Voltron is a classic cartoon from the 1980s, with later versions made in the 2000s. In this cartoon various independent lions, each with unique attributes, merge together to form the Voltron Robot.

Data types and data structures are fundamental building blocks in any programming language. Data types are like categories of information that come with certain properties, such as the type of operations that can be performed on them. Data structures often involve collections of data types and allow for efficient organization, manipulation, and analysis of the various datatypes, or information contained in a given dataset.

So, data types are like the Voltron lions and can be combined into a data structure to create something new and functional, like the Voltron robot. As you learn more about data types and structures, consider an analogy that you can make to paint a picture of the relationship between these programming building blocks.

R provides a range of data types, including numeric, character, logical, and others. The main data structures in R are vectors, matrices, lists, and dataframes (or tibbles).

1.2.1 Data types

Numeric is the default quantitative data type in R. Let’s look at some examples in the code below.

# (in R, variable names are flexible but are case sensitive)

num1 <- 5 # Here I'm storing the value 5 in an object called num1
num.2 <- pi
num_3 <- .07
Num4 <- 10e-3

Each of the numbers above (5, pi, .07, and 10e-3) is recognized as a numeric data type. The integer, 5, and decimal value, .07, may be obvious candidates for a default numeric type, but what about pi and 10e-3?

R has certain “reserved” objects that are built into the language. pi is an example and is a reserved name for the value 3.141593… Likewise, 10e-3 is how R represents scientific notation. This expression is equivalent to \(10*10^{-3}\). In your future R use, you might encounter certain model output that reports large or small values in this scientific notation format.

There are a few additional very important components of the code above. Notice that each of the numeric data types has been associated with a name through the <- symbol. The arrow pointing to the left is the assignment operator, which stores each of the numeric data types in the corresponding name. Our numbers are now objects that we can reference by name!

num1 # this is the name I gave to the value 5
[1] 5
Num4 # this should be 10e-3 or 0.01
[1] 0.01

From the executed code, you can see that calling the variables by name (or referencing them) results in their display. In general, calling an object by its name (running code that references stored information) displays its contents. Now that we know how to store and retrieve information (such as numeric data types), let’s move on to the final component of the code we’ve observed so far.

Notice in both code snippets there are lines with the # symbol followed by descriptions. These are called code comments and are an essential part of the reproducible coding process. Within a code snippet (or code chunk/block) anything following the # symbol on a particular line does not execute. R treats all things that follow the # as comments, or notes, that aren’t part of the code that should be run.

In the first code chunk we made an explanatory comment to emphasize that R object names can be specified in different ways, but are case sensitive. Other restrictions on R object names can be found here: R’s Naming Rules. In the second code chunk, we made comments following the executable code to specify what we should observe in the code output.

Looking Forward

We will emphasize the importance of code comments in the data science programming workflow throughout this book.

Numeric data types are not just for storing values. We can also perform operations on numeric data, such as addition, subtraction, exponentiation, and others.

1 + 3
[1] 4
num1*Num4 # 5*0.01
[1] 0.05
num1^2
[1] 25

Character data includes non-numeric information, such as strings or text data, and is specified in quotes.

Let’s look at some examples.

charA <- "Whaz up,"
charB <- "whazz upp,"
charC <- "whazzz uppp!"


# In R, you can pass objects to a function. 
# Here is an example using the paste() function, 
# with charA, charB, and charC

paste(charA, charB, charC)
[1] "Whaz up, whazz upp, whazzz uppp!"

Note that numbers, which would be numeric by default, can be specified as character data by using quotes.

A <- "7"
B <- "200"

A
[1] "7"

Logical data types in R represent Boolean values that can be either TRUE or FALSE. Logicals are very useful for checking conditions and are used, for example, to construct if-then statements and to subset data.

1 + 1 == 2
[1] TRUE
8 > 9
[1] FALSE
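
As a brief sketch of how a logical value can drive a condition check, consider the following if-else statement (the temperature value here is made up for illustration):

# the condition temp > 70 evaluates to a logical value (FALSE here),
# which determines the branch that runs

temp <- 65

if (temp > 70) {
  "warm"
} else {
  "cool"
}
[1] "cool"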

Now that we’ve learned about R data types, and before we learn about R data structures, let’s pivot to a different topic that will allow us to bring tools into our R environment to facilitate both our data science learning and programming applications.

1.3 Packages & Libraries

R packages contain supplements to all of the wonderful and useful default tools that exist in base R. An R package is typically built for a particular purpose and might contain customized functions, data sources, & related documentation. An example is the ggplot2 R package that is built for data visualization. This package includes data visualization functions built on “geometries” and useful documentation for each function is included in the package for easy reference.

R packages can be installed right through the console by using the install.packages() function, or more conveniently through the Install feature found within the Packages pane.
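
For example, to install the ggplot2 package from the console (this typically only needs to be done once per machine):

# install a package from the console; the package name goes in quotes

install.packages("ggplot2")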

Once a package is installed it is stored in a location that we call a library. The package contents can be loaded into your current R session from this storage location, sort of like checking out a book from, well, a library. You can “check out” a package and use its contents in your current R session by referencing the package name inside the library() function.

#  Example syntax of calling the library function
#  to load the (already installed) ggplot2 package
#  into our current R session

library(ggplot2)

Looking Ahead

In Chapter 7 we will see how to use the ggplot2 package to visualize our data.

1.4 Importing data

Recall the file structure you set up for your R project. The folder associated with your project holds your course files. When your R project is open in your current RStudio session, the folder contents can be seen and accessed through the Files pane. This means that your current “working directory” is your R project folder.

You can also import other files into your R session that are not necessarily in your R project folder by referencing them through their file path, that is, their location on your computer. So, a dataset of a particular file type can be imported using this method whether or not it is in your current working directory.

For a .csv file, the base R import function is read.csv(). You would place the file path of the dataset of interest, in quotes, inside this function to import the data.
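
As a sketch, the syntax would look something like the following (the file name and path below are hypothetical; yours will depend on where the file lives on your computer):

# base R import of a .csv file by its file path
# (this file path is hypothetical)

my_data <- read.csv("data/my_dataset.csv")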

Let’s take a look at an example using the college dataset from the ISL with R, 2nd Edition Resources. Since we learned about packages and libraries, for this example we can load the readr package from our library and use the read_csv() function from this package.

# Since the dataset is in my current working
# directory, I only need to specify the name
# as the file path

library(readr)
college <- read_csv("College.csv")

If you’re thinking, “but wait! We’re using RStudio. There must be another way,” then you’d be right.

As we mentioned before, within the Environment tab there is an Import feature that will allow you to search your computer’s file system for the dataset you wish to import, and this will generate the syntax for you with the function and file path built in.

1.5 Data Structures

1.5.1 Vectors

A vector in R is a fundamental data structure that represents a sequence of elements, all of which must be of the same type. For example, every element in a character vector is a character data type, every element in a numeric vector is a numeric data type, and so on. Vectors form the foundation for more complex data structures like matrices and dataframes (or tibbles).

Vectors are one-dimensional data structures where each element corresponds to an index value according to its position. Note that in R, index values start at 1. This means that the element in the first position of a vector corresponds to index 1, the element in the second position corresponds to index 2, and so on. This indexing allows for referencing vector elements or groups of elements by their location, which can be leveraged to facilitate data processing and manipulation.

Let’s explore a few vectors by first creating one ourselves.

# the c() function "combines" elements (separated by commas) into a vector.

even_vec <- c(2,4,6,8,10)

odd_vec <- even_vec - 1

even_vec # display the elements of even_vec
[1]  2  4  6  8 10
odd_vec # display the elements of odd_vec
[1] 1 3 5 7 9

Note that in the above code, we “combined” our data values into a vector by using the c() function which is designed for this purpose. Wrapping our vector values in the c() function tells R to treat the group of values as a vector data structure.

Looking Ahead: Vectorization

In the code above we created a vector which contained the first 5 even integers. By subtracting 1 from this vector, we were able to create a new object containing the first 5 odd integers. In other words, through the code above the subtraction operation was applied to each element of even_vec to create odd_vec. This process of applying an operation to each element of a vector without having to iterate through each one in a loop is known as vectorization.
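
Since vector elements are indexed by position, we can also reference individual elements or groups of elements using square brackets. Here is a short sketch using even_vec:

# square brackets reference vector elements by their index position

even_vec[1]       # the first element
[1] 2
even_vec[2:3]     # the second and third elements
[1] 4 6
even_vec[c(1, 5)] # the first and fifth elements
[1]  2 10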

So, now that we’ve created some vectors out of thin air, let’s return to the dataset we loaded into our environment to find a vector existing among its peers (other vectors!). But first, let’s take a look at some of the observations and variables in the college dataset.

Institution                   Private   Apps   Accept   Enroll
Abilene Christian University  Yes       1660     1232      721
Adelphi University            Yes       2186     1924      512
Adrian College                Yes       1428     1097      336
Agnes Scott College           Yes        417      349      137
Alaska Pacific University     Yes        193      146       55

The display shows the first 5 rows and columns of the college dataset. We can see that the variable types appear to change across the rows and remain similar within each column. The columns are our vectors!

1.5.2 Dataframes

A dataframe in R is a two-dimensional data structure. Dataframes can store different types of data in each column, such as numeric, character, or logical data types all in one group. We can think of a dataframe as a collection of vectors. However, for typical datasets represented as a dataframe structure, not only are the elements within each column related (e.g., by representing the same data type and variable), but so are the elements within each row, which typically represents an observation of many values (or measurements) on a single unit or individual. This explains what we observed in the college dataset output.

Each element within a dataframe is accessed using a pair of indices, one for the row and one for the column. This indexing system allows for organized data storage and facilitates access and manipulation of entire rows, columns, or specific elements within them.

Let’s print out the first variable of the first observation.

# displays element (1,1) of the dataset.

college[1,1]
# A tibble: 1 × 1
  Institution                 
  <chr>                       
1 Abilene Christian University

Since we read in the college data with the read_csv() function from the readr package instead of the read.csv() function in base R, the default data structure is a tibble. Tibbles are essentially the same as dataframes.

Finally, for our dataframes and tibbles, let’s observe code that demonstrates how indices can be used to reference and display the first 5 rows and columns of our imported dataset.

# gets rows 1 to 5, and gets columns 1 to 5

college[1:5,1:5]
# A tibble: 5 × 5
  Institution                  Private  Apps Accept Enroll
  <chr>                        <chr>   <dbl>  <dbl>  <dbl>
1 Abilene Christian University Yes      1660   1232    721
2 Adelphi University           Yes      2186   1924    512
3 Adrian College               Yes      1428   1097    336
4 Agnes Scott College          Yes       417    349    137
5 Alaska Pacific University    Yes       193    146     55

Notice that the data structure is listed as a tibble, and along with the data we also see each column data type between the column names and the data values.

1.6 A Few More Data Structures

1.6.1 Matrices (aka Matrixes)

A matrix in R is also a two-dimensional data structure where elements are arranged in rows and columns. Matrices are not as flexible as dataframes. One restriction on matrices is that all elements in a matrix must be of the same type. Due to this restriction, matrices are like two-dimensional vectors.

As with a dataframe, each element in the matrix is accessed using a pair of indices, one for the row and one for the column. This structured indexing allows for organized storage of data and supports operations that involve manipulating entire rows, columns, or specific elements within the matrix.
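
As a brief sketch, here is one way to create a small matrix with the matrix() function and reference one of its elements by row and column index:

# build a 2 x 3 matrix from the integers 1 through 6,
# filling the values column by column (the default)

mat <- matrix(1:6, nrow = 2, ncol = 3)

mat
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
mat[2, 3] # the element in row 2, column 3
[1] 6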

1.6.2 Lists

Lists are another type of important data structure in R. Lists are very flexible and can have as elements entire data structures of various types. Lists can be useful in building functions and for other complex tasks that require a flexible data structure. As useful as lists can be, for this course the data structure of choice is the dataframe (or tibble)!
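
For a quick sketch of this flexibility, here is a small list that stores a numeric vector, a character value, and a logical value together, with elements referenced by name:

# a list can hold different data types and structures as its elements

my_list <- list(numbers = c(2, 4, 6),
                label   = "example",
                flag    = TRUE)

my_list$label # reference a list element by name
[1] "example"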

Copyright © 2025 David J. Stokes and Mahmoud Harding

License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
