3 Project Part 3.1: Diving into Data Exploration - R
Now that we are familiar with our dataset’s “metadata,” let’s consider a more thorough exploratory process informed by the background knowledge encoded in the data dictionary. To do this, we will access some data science programming tools and concepts that will provide us with the means to dive and delve into our data.
3.1 Data Moves
To begin exploring your data in more detail, you will likely need to modify your dataset in some way. This may involve removing variables or observations, creating groupings or categories, performing operations on certain values, and other transformations of the data. For example, if you want to understand or visualize a particular trend in your data, you may need to index your observations in a particular order. This might be a necessary step if your data has a temporal component and you want to visualize a trend or phenomenon with respect to time.
Erickson et al. (1) refer to such data transformations as “data moves,” and understanding what these are can facilitate your data science explorations and investigations regardless of your programming language or platform of choice. We will consider data moves as a fundamental part of our programming and as a guide to the many aspects of our data exploration and investigation process, from data cleaning to data visualization and beyond. Our version of data moves will be connected to the dataframe object structure that is part of both R and Python. These concepts are the SCUBA gear that will allow us to move through the depths of our data with purpose and confidence.
3.2 Data Moves in Base R
Typical dataset modifications involve data moves such as filtering, subsetting, grouping, and merging. In chapter 3, we looked at the Marine6 dataset. This 815 × 10 dataset was created through a series of data moves from a larger dataset consisting of 48,237 rows and nine columns. Creating this dataset of interest involved data exploration and data moves such as filtering, subsetting, grouping, and creating new variables.
3.2.1 Filtering
The filtering data move is what we will use to reduce or examine a dataset based on certain row criteria. Since we have information about the range of years in the dataset, from our data dictionary, we might wonder “How many observations were recorded from the beginning, in 1970?”.
# read in the Marine6 dataset using base R
# convert characters to factors upon reading in
Marine6 <- read.csv("Marine6.csv", stringsAsFactors = TRUE)

# find the minimum value for the variable "Year"
# aka, the earliest recordings
min(Marine6$Year)
[1] 1970
In the code above, we read in the Marine6 data and used the min() function on the “Year” variable to get the earliest date. The output confirms that the earliest year recorded in the dataset was 1970. We’ve identified the row criteria on which we will filter the data to investigate our question of interest.
# filter the dataset to observations of interest
# get the number of observations
nrow(Marine6[Marine6$Year == 1970, ])
[1] 2
In the code above, we used the logical statement Marine6$Year == 1970 in the row index position to filter the Marine6 dataset (i.e., get all rows in the dataset where the logical statement is true). This output served as the input for the nrow() function, which counts the number of rows and returns the answer to our inquiry.
As with the briefly mentioned process of creating the Marine6 dataset, in many instances we might filter our data to create a new object, or dataset, of interest. For example, we might store all observations from a specific range of years.
# Store observations from 1970 to 2000 in M6_2000
M6_2000 <- Marine6[Marine6$Year <= 2000, ]
summary(M6_2000$Year) # max should be 2000
Min. 1st Qu. Median Mean 3rd Qu. Max.
1970 1983 1991 1989 1996 2000
3.2.2 Subsetting
We can use column criteria to reduce our data to only certain variables of interest. We can distinguish the subsetting data move from filtering based on the use of column vs. row criteria, although both operations result in what can be considered subsets of the data.
# summary of only the subset of variables of interest
summary(Marine6[, c(1, 2, 8, 10)])
Year Region Depth
Min. :1970 Northeast US Fall :164 Min. : 7.48
1st Qu.:1991 Northeast US Spring:132 1st Qu.: 8.43
Median :2000 Maritimes Summer : 99 Median : 52.98
Mean :1999 Southeast US Summer: 93 Mean : 67.22
3rd Qu.:2010 Southeast US Spring: 92 3rd Qu.:119.63
Max. :2020 Southeast US Fall : 87 Max. :255.06
(Other) :148 NA's :53
Common.Name
American lobster :187
Banded drum :158
Black sea bass :180
Guachanche barracuda:126
Rainbow star : 26
Red hake :138
In the code above, we used a vector of indices to get a subset of the Marine6 variables. In larger datasets, particularly for gleaning insights from the summary() function, subsetting data in this way can render the output more informative and digestible.
3.2.3 Grouping (Loop Example)
The grouping data move may serve various purposes, but it is often used to create categories for the purpose of comparison. Often, the grouping data move is a specific case of creating a new variable or attribute, particularly when the grouping information is needed for subsequent analysis. For example, a new variable containing grouping information may be a parameter that is used to create a data visualization in a certain way.
In the Marine6 dataset, we have species of fish and species that are not fish. We might be interested in comparing the changes in depth over time between these more general classifications. To set this comparison up, we can use grouping.
# Using a for loop to create a new group
for(i in 1:nrow(Marine6)){
  if(Marine6$Common.Name[i] %in% c("American lobster", "Rainbow star")) {
    Marine6$Group[i] = "Other"
  } else if(is.na(Marine6$Common.Name[i])) {
    Marine6$Group[i] = NA
  } else {
    Marine6$Group[i] = "Fish"
  }
}
Marine6$Group <- as.factor(Marine6$Group)
summary(Marine6$Group)
Fish Other
602 213
Let’s examine the code above, which contains a for loop. This programming sequence iterates through each observation in the Marine6 data and assigns the value “Other”, NA, or “Fish” based on the specified conditions. In the first condition we introduced the %in% operator, a logical statement that returns TRUE if the value of “Common.Name” matches any of the character strings in the vector c("American lobster","Rainbow star"), and FALSE otherwise. The next condition uses is.na(), also a logical statement, which returns TRUE if the value of “Common.Name” is missing, and FALSE otherwise. Although there are no missing values in “Common.Name”, this may not always be the case for a given variable, and accounting for the missing-value condition helps our loop run without error. Finally, the last remaining case includes all other common names not specified in the first condition. We do not need to write these out explicitly, as they are captured by the else statement. Note how our data dictionary can provide this factor level information to us. We may even revise our data dictionary to include information about the levels that correspond to fish and those that do not.
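To see these two logical tests on their own, here is a brief illustration using a small, made-up character vector (not part of the Marine6 data); the expected results are shown as comments.

# a small made-up vector that includes a missing value
x <- c("American lobster", "Red hake", NA)

# %in% returns TRUE wherever a value matches an element of the comparison vector
x %in% c("American lobster", "Rainbow star")
# [1]  TRUE FALSE FALSE

# is.na() returns TRUE wherever a value is missing
is.na(x)
# [1] FALSE FALSE  TRUE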
3.2.4 Additional Data Moves
In some cases, there may be information spread across multiple datasets that we want to combine into a single dataframe. If each of these datasets has a common “identifier” which links the information, we can use the merging data move to accomplish this task. For example, we might want to add instructor information to student data, based on a common course ID. We can merge data with the base R merge() function, where the inputs would be each dataset and the common identifier on which the merge would be based.
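To sketch how this might look, consider two small, hypothetical data frames, students and instructors, linked by a shared “CourseID” column (all names and values here are invented for illustration).

# two hypothetical data frames that share the common identifier "CourseID"
students <- data.frame(Student = c("Ana", "Ben", "Cal"),
                       CourseID = c("BIO101", "BIO101", "STA205"))
instructors <- data.frame(CourseID = c("BIO101", "STA205"),
                          Instructor = c("Dr. Reyes", "Dr. Liu"))

# merge() combines the rows of both data frames wherever "CourseID" matches
merge(students, instructors, by = "CourseID")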
The creating-hierarchy data move may be no different from a combination of grouping and creating a new variable. However, certain models are based on data hierarchy, and depending on the dataset there may be a need to create this structure for a related analysis. For example, we may have a dataset that contains information on states, counties, and schools. Across the states there may be counties (and schools) that have the same names. In order to distinguish one county from another, information about the state would be necessary. Thus, we would nest our counties within the states (and the schools within the counties), and this nesting would create a hierarchical data structure. As with comparing our fish and non-fish groups, we might be interested in visualizing the variation in county test scores across a sample of states, and the hierarchical information would be essential to such a task.
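As a minimal sketch of this idea, using made-up state, county, and score values, we could nest counties within states by building a combined identifier so that identically named counties remain distinct.

# hypothetical data where two different states each contain a "Lincoln" county
schools <- data.frame(State  = c("Maine", "Maine", "Kansas"),
                      County = c("Lincoln", "Lincoln", "Lincoln"),
                      Score  = c(81, 77, 90))

# nest counties within states by combining the two identifiers
schools$StateCounty <- paste(schools$State, schools$County, sep = ":")
table(schools$StateCounty)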
In the examples above, we used base R and bracket notation (i.e., referencing indices and conditions in the row and column spaces of our dataframe) to perform our data moves. Needless to say, there are many more ways in R to perform our data moves. In fact, many data moves directly correspond to functions that exist in the dplyr package in the “tidyverse”. Even better, these functions can really simplify our coding process.
3.3 Tidyverse Data Moves - Tidy Moves
In the previous section, we demonstrated data moves one at a time. We could have easily added both row-filtering and column-subsetting information at the same time to create our new dataset of interest. These do not need to be independent steps, but combining them into one may lead to rather long code statements that can be visually difficult to parse (and revise). Let’s see an example of this.
# summary of a subset of variables
# for the observations prior to 2001 and
# in one region of interest
summary(Marine6[Marine6$Year <= 2000 & Marine6$Region == "Northeast US Fall", c(1, 2, 8, 10)])
Year Region Depth
Min. :1974 Northeast US Fall :93 Min. : 21.50
1st Qu.:1980 Gulf of Alaska : 0 1st Qu.: 34.83
Median :1987 Gulf of Mexico : 0 Median : 89.55
Mean :1987 Gulf of St. Lawrence South: 0 Mean : 75.41
3rd Qu.:1995 Maritimes Summer : 0 3rd Qu.:108.07
Max. :2000 Northeast US Spring : 0 Max. :134.87
(Other) : 0 NA's :1
Common.Name
American lobster :27
Banded drum :12
Black sea bass :27
Guachanche barracuda: 0
Rainbow star : 0
Red hake :27
In the code above, we added two conditions in the row space to filter to observations before 2001 that are in the “Northeast US Fall” region. We also added the column-space index subsetting criteria, all in the same statement. This accomplishes the filtering and subsetting task of interest in short order, but we can certainly improve on the readability of this code. Also, it’s not ideal to have to revisit column information to find out which column names correspond to which indices. So, is there a better - or at least different - way?
3.3.1 A different way (and the pipe operator)
The tidyverse that we previewed in chapter 3 contains a very handy package called dplyr. One of the cool things about the dplyr package is that many of the functions that exist within it can be put into direct correspondence with our data moves! Before we jump into these functions, we need an essential tool called the pipe operator, which we can get from the magrittr package.
# I can add two separate statements on one line using ";"
library(magrittr); library(dplyr)

# example using the pipe operator
Marine6 %>% summary()
Year Region Species
Min. :1970 Northeast US Fall :164 Centropristis striata:180
1st Qu.:1991 Northeast US Spring:132 Homarus americanus :187
Median :2000 Maritimes Summer : 99 Larimus fasciatus :158
Mean :1999 Southeast US Summer: 93 Orthasterias koehleri: 26
3rd Qu.:2010 Southeast US Spring: 92 Sphyraena guachancho :126
Max. :2020 Southeast US Fall : 87 Urophycis chuss :138
(Other) :148
Latitude Latitude.Standard.Error Longitude
Min. :26.51 Min. :0.0000 Min. :-159.98
1st Qu.:32.40 1st Qu.:0.1770 1st Qu.: -79.90
Median :37.92 Median :0.2267 Median : -75.05
Mean :37.54 Mean :0.2942 Mean : -76.79
3rd Qu.:42.23 3rd Qu.:0.3342 3rd Qu.: -68.63
Max. :57.50 Max. :4.4575 Max. : -60.24
NA's :5 NA's :34 NA's :5
Longitude.Standard.Error Depth Depth.Standard.Error
Min. :0.0000 Min. : 7.48 Min. : 0.0000
1st Qu.:0.2362 1st Qu.: 8.43 1st Qu.: 0.1287
Median :0.3298 Median : 52.98 Median : 4.4436
Mean :0.3859 Mean : 67.22 Mean : 5.7448
3rd Qu.:0.4475 3rd Qu.:119.63 3rd Qu.: 9.2985
Max. :1.7813 Max. :255.06 Max. :63.2701
NA's :34 NA's :53 NA's :73
Common.Name Group
American lobster :187 Fish :602
Banded drum :158 Other:213
Black sea bass :180
Guachanche barracuda:126
Rainbow star : 26
Red hake :138
In the code above, the %>% symbol is the operator of interest. The pipe operator applies the operation that follows it (to the right) to the input that precedes it (to the left). Just to understand a little more about this useful tool, let’s briefly step away from our dataset context to see another example of how the pipe operator works.
# What's the standard deviation of [2,6,10]?
c(2, 6, 10) %>% var() %>% sqrt()
[1] 4
In the example above, we applied the variance function var() to [2,6,10], and took the square root of the variance output by applying the sqrt() function. This is the same as sqrt(var(c(2,6,10))), but (not so) arguably easier to read, and definitely less prone to parentheses errors. Now that we’ve seen the pipe operator, let’s use it to smooth out those data moves!
3.3.2 Filtering 2.0
Previously, using base R, we filtered the Marine6 dataset in order to count the number of observations that corresponded only to the year 1970. Let’s try this using the appropriate dplyr method, aptly known as the filter() function.
Marine6 %>%
  filter(Year == 1970) %>% # I can just name the variable
  nrow() # counting the previously filtered data
[1] 2
Wow! Above we were able to count the number of observations corresponding to the year 1970 using filter() together with the pipe operator. This format also allows for in-line comments relative to each step.
3.3.3 Subsetting 2.0
Next, in base R, we used bracket notation and variable indices to subset our data (to produce a summary of interest). Let’s take a look at how we can go about accomplishing this subsetting data move (and summary) using the select() function.
# a summary of the selected variables
Marine6 %>%
  select(Year, Region, Depth, Common.Name) %>%
  summary()
Year Region Depth
Min. :1970 Northeast US Fall :164 Min. : 7.48
1st Qu.:1991 Northeast US Spring:132 1st Qu.: 8.43
Median :2000 Maritimes Summer : 99 Median : 52.98
Mean :1999 Southeast US Summer: 93 Mean : 67.22
3rd Qu.:2010 Southeast US Spring: 92 3rd Qu.:119.63
Max. :2020 Southeast US Fall : 87 Max. :255.06
(Other) :148 NA's :53
Common.Name
American lobster :187
Banded drum :158
Black sea bass :180
Guachanche barracuda:126
Rainbow star : 26
Red hake :138
Notice that we can specify the variable names as they are in the dataset. This is preferable to using indices, which can lack sufficient information about what the subset should actually contain.
3.3.4 Grouping 2.0
The grouping data move may serve various purposes and in some cases it is not necessary to create new variables to group data for comparisons. For example, in the dplyr package, we can use group_by() to create a “grouped” data frame where subsequent functions are applied to each group.
Marine6 %>%
  group_by(Common.Name) %>%
  count()
# A tibble: 6 × 2
# Groups: Common.Name [6]
Common.Name n
<fct> <int>
1 American lobster 187
2 Banded drum 158
3 Black sea bass 180
4 Guachanche barracuda 126
5 Rainbow star 26
6 Red hake 138
In the code above, we were able to generate a comparison of counts for the variable on which the grouping was based. We input our grouped data frame (created by group_by()) into the count() function to generate a summary similar to what we got through applying the summary() function to as.factor(Marine6$Common.Name).
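For comparison, a base R statement along the lines mentioned above should produce the same counts (assuming the Marine6 object we read in earlier).

# base R counterpart: tabulate the levels of "Common.Name"
summary(as.factor(Marine6$Common.Name))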
3.3.5 A few more dplyr examples
Recall the rather involved base R loop that we created for the purpose of adding a grouping variable to our Marine6 dataset. Let’s consider how we could do this in the dplyr setting.
# function to categorize the different common names
fishCat <- function(x){ # x is the dataset variable of interest
  ifelse(is.na(x), NA,
         ifelse(x %in% c("American lobster", "Rainbow star"), "Other", "Fish"))
}

# creating a new grouping variable
Marine6 %>%
  mutate(Group = fishCat(Common.Name)) %>%
  group_by(Group) %>%
  count()
# A tibble: 2 × 2
# Groups: Group [2]
Group n
<chr> <int>
1 Fish 602
2 Other 213
In the code above, we created a user-defined function (specific to our Marine6 data context) that takes in a dataset variable and returns either NA, “Fish,” or “Other.” The function uses nested ifelse() statements to account for the various criteria, rather than the more general if-else statement that we used for the loop in our base R grouping example. Although this function does not require anything beyond base R, we at least are able to see, in case you hadn’t heard, that there is more than one way to do a thing in R. In addition, ifelse() is a vectorized function: it applies the specified conditions to each element of the vector at once, making for a more efficient process!
Below the user-defined fishCat() function we introduced the mutate() function. This function allows us to create a new variable from one that exists in the referenced dataset. Just as with our base R example, we named this variable “Group.” We used fishCat() to create the values of “Group” based on the components of “Common.Name”. Finally, we applied the group_by() function and counted the frequencies within each of our newly created “Group” categories.
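As an aside, to see that vectorized behavior on its own, we could apply fishCat() directly to a short, made-up character vector rather than to a dataset column; the expected result is shown as a comment.

# fishCat() evaluates its conditions for every element in a single call
fishCat(c("Red hake", NA, "American lobster"))
# [1] "Fish"  NA      "Other"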
Now, let’s look at one last example that combines our filtering and subsetting data moves into one sequence of pipes and operations.
Marine6 %>%
  filter(Year <= 2000, Region == "Northeast US Fall") %>% # filtering
  select(Year, Region, Depth, Common.Name) %>% # subsetting
  summary()
Year Region Depth
Min. :1974 Northeast US Fall :93 Min. : 21.50
1st Qu.:1980 Gulf of Alaska : 0 1st Qu.: 34.83
Median :1987 Gulf of Mexico : 0 Median : 89.55
Mean :1987 Gulf of St. Lawrence South: 0 Mean : 75.41
3rd Qu.:1995 Maritimes Summer : 0 3rd Qu.:108.07
Max. :2000 Northeast US Spring : 0 Max. :134.87
(Other) : 0 NA's :1
Common.Name
American lobster :27
Banded drum :12
Black sea bass :27
Guachanche barracuda: 0
Rainbow star : 0
Red hake :27
Consider
In comparing the “2.0” tidyverse versions with the base R versions, the readability of applying data moves with dplyr may have come into focus. As you continue to apply data moves, consider the advantages and disadvantages of the two different methods presented here. Which methods might work best for you, and do you imagine using both base R and dplyr for your data moves in R?
In this section we learned about important data moves that represent common programming applications for data dives and explorations within the data science workflow. We realized these data moves using base R and the dplyr package (with help from magrittr). Furthermore, we introduced loops and user-defined functions as examples of how various concepts come together and play a role in the data exploration process. As you move forward with using data moves for your data dive, think about the questions that come to mind via this exploration. Your data moves can even guide you towards a new data hypothesis!
1. Erickson T, Wilkerson M, Finzer W, Reichsman F. Data moves. Technology Innovations in Statistics Education (2019) 12.