3 Project Part 3.1: Diving into Data Exploration - R
Now that we are familiar with our dataset’s “metadata,” let’s consider a more thorough exploratory process informed by the background knowledge encoded in the data dictionary. To do this, we will access some data science programming tools and concepts that will provide us with the means to dive and delve into our data.
3.1 Data Moves
To begin exploring your data in more detail, you will likely need to modify your dataset in some way. This may involve removing variables or observations, creating groupings or categories, performing operations on certain values, and other transformations of the data. For example, if you want to understand or visualize a particular trend in your data, you may need to index your observations in a particular order. This might be a necessary step if your data has a temporal component and you want to visualize a trend or phenomenon with respect to time.
Erickson et al. (1) refer to such data transformations as “data moves,” and understanding what these are can facilitate your data science explorations and investigations regardless of your programming language or platform of choice. We will consider data moves as a fundamental part of our programming and as a guide to the many aspects of our data exploration and investigation process, from data cleaning to data visualization and beyond. Our version of data moves will be connected to the dataframe object structure that is part of both R and Python. These concepts are the SCUBA gear that will allow us to move through the depths of our data with purpose and confidence.
3.2 Data Moves in Base R
Typical dataset modifications involve data moves such as filtering, subsetting, grouping, and merging. In chapter 3, we looked at the Marine6 dataset. This 815 × 10 dataset was created through a series of data moves from a larger dataset consisting of 48,237 rows and nine columns. Creating this dataset of interest involved data exploration and data moves such as filtering, subsetting, grouping, and creating new variables.
3.2.1 Filtering
The filtering data move is what we will use to reduce or examine a dataset based on certain row criteria. Since we have information about the range of years in the dataset, from our data dictionary, we might wonder “How many observations were recorded from the beginning, in 1970?”.
# read in the Marine6 dataset using base R
# convert characters to factors upon reading in
Marine6 <- read.csv("Marine6.csv", stringsAsFactors = TRUE)

# find the minimum value for the variable "Year"
# aka, the earliest recordings
min(Marine6$Year)
[1] 1970
In the code above, we read in the Marine6 data and used the min() function on the “Year” variable to get the earliest date. The output confirms that the earliest year recorded in the dataset was 1970. We’ve identified the row criteria on which we will filter the data to investigate our question of interest.
# filter the dataset to observations of interest
# get the number of observations
nrow(Marine6[Marine6$Year == 1970, ])
[1] 2
In the code above, we used the logical statement Marine6$Year == 1970 in the row index position to filter the Marine6 dataset (i.e., get all rows in the dataset where the logical statement is true). This output served as the input for the nrow() function, which counts the number of rows and returns the answer to our inquiry.
As with the briefly mentioned process of creating the Marine6 dataset, in many instances we might filter our data to create a new object, or dataset, of interest. For example, we might store all observations from a specific range of years.
# Store observations from 1970 to 2000 in M6_2000
M6_2000 <- Marine6[Marine6$Year <= 2000, ]
summary(M6_2000$Year) # max should be 2000
Min. 1st Qu. Median Mean 3rd Qu. Max.
1970 1983 1991 1989 1996 2000
3.2.2 Subsetting
We can use column criteria to reduce our data to only certain variables of interest. We can distinguish the subsetting data move from filtering based on the use of column vs. row criteria, although both operations result in what can be considered subsets of the data.
# summary of only the subset of variables of interest
summary(Marine6[, c(1, 2, 8, 10)])
Year Region Depth
Min. :1970 Northeast US Fall :164 Min. : 7.48
1st Qu.:1991 Northeast US Spring:132 1st Qu.: 8.43
Median :2000 Maritimes Summer : 99 Median : 52.98
Mean :1999 Southeast US Summer: 93 Mean : 67.22
3rd Qu.:2010 Southeast US Spring: 92 3rd Qu.:119.63
Max. :2020 Southeast US Fall : 87 Max. :255.06
(Other) :148 NA's :53
Common.Name
American lobster :187
Banded drum :158
Black sea bass :180
Guachanche barracuda:126
Rainbow star : 26
Red hake :138
In the code above, we used a vector of indices to get a subset of the Marine6 variables. In larger datasets, particularly for gleaning insights from the summary() function, subsetting data in this way can render the output more informative and digestible.
3.2.3 Grouping (Loop Example)
The grouping data move may serve various purposes, but it is often used to create categories for the purpose of comparison. Often, the grouping data move is a specific case of creating a new variable or attribute, particularly when the grouping information is needed for subsequent analysis. For example, a new variable containing grouping information may be a parameter that is used to create a data visualization in a certain way.
In the Marine6 dataset, we have species of fish and species that are not fish. We might be interested in comparing the changes in depth over time between these more general classifications. To set this comparison up, we can use grouping.
# Using a for loop to create a new group
for(i in 1:nrow(Marine6)){
  if(Marine6$Common.Name[i] %in% c("American lobster", "Rainbow star")) {
    Marine6$Group[i] = "Other"
  } else if(is.na(Marine6$Common.Name[i])) {
    Marine6$Group[i] = NA
  } else {
    Marine6$Group[i] = "Fish"
  }
}
Marine6$Group <- as.factor(Marine6$Group)
summary(Marine6$Group)
Fish Other
602 213
Let’s examine the code above, which contains a for loop. This programming sequence iterates through each observation in the Marine6 data and assigns the value “Other”, NA, or “Fish” based on the specified conditions. In the first condition we introduced the %in% operator, a logical statement that returns TRUE if the value of “Common.Name” matches any of the character strings in the vector c("American lobster","Rainbow star"), and FALSE otherwise. The next condition uses is.na(), also a logical statement, which returns TRUE if the value of “Common.Name” is missing, and FALSE otherwise. Although there are no missing values in “Common.Name”, this may not always be the case for a given variable, and accounting for the missing-value condition helps our loop run without error. Finally, the last remaining case includes all other common names not specified in the first condition. We do not need to write these out explicitly, as they are captured by the else statement. Note how our data dictionary can provide this factor level information to us. We may even revise our data dictionary to include information about the levels that correspond to fish and those that do not.
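To see these two logical tests on their own, here is a brief illustration using a small, made-up character vector (not part of the Marine6 data); the expected results are shown as comments.

# a small made-up vector that includes a missing value
x <- c("American lobster", "Red hake", NA)

# %in% returns TRUE wherever a value matches an element of the comparison vector
x %in% c("American lobster", "Rainbow star")
# [1]  TRUE FALSE FALSE

# is.na() returns TRUE wherever a value is missing
is.na(x)
# [1] FALSE FALSE  TRUE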
3.2.4 Additional Data Moves
In some cases, there may be information spread across multiple datasets that we want to combine into a single dataframe. If each of these datasets has a common “identifier” which links the information, we can use the merging data move to accomplish this task. For example, we might want to add instructor information to student data, based on a common course ID. We can merge data with the base R merge() function, where the inputs would be each dataset and the common identifier on which the merge would be based.
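To sketch how this might look, consider two small, hypothetical data frames, students and instructors, linked by a shared “CourseID” column (all names and values here are invented for illustration).

# two hypothetical data frames that share the common identifier "CourseID"
students <- data.frame(Student = c("Ana", "Ben", "Cal"),
                       CourseID = c("BIO101", "BIO101", "STA205"))
instructors <- data.frame(CourseID = c("BIO101", "STA205"),
                          Instructor = c("Dr. Reyes", "Dr. Liu"))

# merge() combines the rows of both data frames wherever "CourseID" matches
merge(students, instructors, by = "CourseID")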
The creating-hierarchy data move may be no different from a combination of grouping and creating a new variable. However, certain models are based on data hierarchy, and depending on the dataset there may be a need to create this structure for a related analysis. For example, we may have a dataset that contains information on states, counties, and schools. Across the states there may be counties (and schools) that have the same names. In order to distinguish one county from another, information about the state would be necessary. Thus, we would nest our counties within the states (and the schools within the counties), and this nesting would create a hierarchical data structure. As with comparing our fish and non-fish groups, we might be interested in visualizing the variation in county test scores across a sample of states, and the hierarchical information would be essential to such a task.
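As a minimal sketch of this idea, using made-up state, county, and score values, we could nest counties within states by building a combined identifier so that identically named counties remain distinct.

# hypothetical data where two different states each contain a "Lincoln" county
schools <- data.frame(State  = c("Maine", "Maine", "Kansas"),
                      County = c("Lincoln", "Lincoln", "Lincoln"),
                      Score  = c(81, 77, 90))

# nest counties within states by combining the two identifiers
schools$StateCounty <- paste(schools$State, schools$County, sep = ":")
table(schools$StateCounty)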
In the examples above, we used base R and bracket notation (i.e., referencing indices and conditions in the row and column spaces of our dataframe) to perform our data moves. Needless to say, there are many more ways in R to perform our data moves. In fact, many data moves directly correspond to functions that exist in the dplyr package in the “tidyverse”. Even better, these functions can really simplify our coding process.
3.3 Tidyverse Data Moves - Tidy Moves
In the previous section, we demonstrated data moves one at a time. We could have easily added both row-filtering and column-subsetting information at the same time to create our new dataset of interest. These do not need to be independent steps, but combining them into one may lead to rather long code statements that can be visually difficult to parse (and revise). Let’s see an example of this.
# summary of a subset of variables
# for the observations prior to 2001 and
# in one region of interest
summary(Marine6[Marine6$Year <= 2000 & Marine6$Region == "Northeast US Fall", c(1, 2, 8, 10)])
Year Region Depth
Min. :1974 Northeast US Fall :93 Min. : 21.50
1st Qu.:1980 Gulf of Alaska : 0 1st Qu.: 34.83
Median :1987 Gulf of Mexico : 0 Median : 89.55
Mean :1987 Gulf of St. Lawrence South: 0 Mean : 75.41
3rd Qu.:1995 Maritimes Summer : 0 3rd Qu.:108.07
Max. :2000 Northeast US Spring : 0 Max. :134.87
(Other) : 0 NA's :1
Common.Name
American lobster :27
Banded drum :12
Black sea bass :27
Guachanche barracuda: 0
Rainbow star : 0
Red hake :27
In the code above, we added two conditions in the row space to filter to observations before 2001 that are in the “Northeast US Fall” region. We also added the column-space index subsetting criteria, all in the same statement. This accomplishes the filtering and subsetting task of interest in short order, but we can certainly improve on the readability of this code. Also, it’s not ideal to have to revisit column information to find out which column names correspond to which indices. So, is there a better - or at least different - way?
3.3.1 A different way (and the pipe operator)
The tidyverse that we previewed in chapter 3 contains a very handy package called dplyr. One of the cool things about the dplyr package is that many of the functions that exist within it can be put into direct correspondence with our data moves! Before we jump into these functions, we need an essential tool called the pipe operator, which we can get from the magrittr package.
# I can add two separate statements on one line using ";"
library(magrittr); library(dplyr)

# example using the pipe operator
Marine6 %>% summary()
Year Region Species
Min. :1970 Northeast US Fall :164 Centropristis striata:180
1st Qu.:1991 Northeast US Spring:132 Homarus americanus :187
Median :2000 Maritimes Summer : 99 Larimus fasciatus :158
Mean :1999 Southeast US Summer: 93 Orthasterias koehleri: 26
3rd Qu.:2010 Southeast US Spring: 92 Sphyraena guachancho :126
Max. :2020 Southeast US Fall : 87 Urophycis chuss :138
(Other) :148
Latitude Latitude.Standard.Error Longitude
Min. :26.51 Min. :0.0000 Min. :-159.98
1st Qu.:32.40 1st Qu.:0.1770 1st Qu.: -79.90
Median :37.92 Median :0.2267 Median : -75.05
Mean :37.54 Mean :0.2942 Mean : -76.79
3rd Qu.:42.23 3rd Qu.:0.3342 3rd Qu.: -68.63
Max. :57.50 Max. :4.4575 Max. : -60.24
NA's :5 NA's :34 NA's :5
Longitude.Standard.Error Depth Depth.Standard.Error
Min. :0.0000 Min. : 7.48 Min. : 0.0000
1st Qu.:0.2362 1st Qu.: 8.43 1st Qu.: 0.1287
Median :0.3298 Median : 52.98 Median : 4.4436
Mean :0.3859 Mean : 67.22 Mean : 5.7448
3rd Qu.:0.4475 3rd Qu.:119.63 3rd Qu.: 9.2985
Max. :1.7813 Max. :255.06 Max. :63.2701
NA's :34 NA's :53 NA's :73
Common.Name Group
American lobster :187 Fish :602
Banded drum :158 Other:213
Black sea bass :180
Guachanche barracuda:126
Rainbow star : 26
Red hake :138
In the code above, the %>% symbol is the operator of interest. The pipe operator applies the operation that follows it (to the right) to the input that precedes it (to the left). Just to understand a little more about this useful tool, let’s briefly step away from our dataset context to see another example of how the pipe operator works.
# What's the standard deviation of [2,6,10]?
c(2, 6, 10) %>% var() %>% sqrt()
[1] 4
In the example above, we applied the variance function var() to [2,6,10], and took the square root of the variance output by applying the sqrt() function. This is the same as sqrt(var(c(2,6,10))), but (not so) arguably easier to read, and definitely less prone to parentheses errors. Now that we’ve seen the pipe operator, let’s use it to smooth out those data moves!
3.3.2 Filtering 2.0
Previously, using base R, we filtered the Marine6 dataset in order to count the number of observations that corresponded only to the year 1970. Let’s try this using the appropriate dplyr method, aptly known as the filter() function.
Marine6 %>%
  filter(Year == 1970) %>% # I can just name the variable
  nrow() # counting the previously filtered data
[1] 2
Wow! Above we were able to count the number of observations corresponding to the year 1970 using filter() together with the pipe operator. This format also allows for in-line comments relative to each step.
3.3.3 Subsetting 2.0
Next, in base R, we used bracket notation and variable indices to subset our data (to produce a summary of interest). Let’s take a look at how we can go about accomplishing this subsetting data move (and summary) using the select() function.
# a summary of the selected variables
Marine6 %>%
  select(Year, Region, Depth, Common.Name) %>%
  summary()
Year Region Depth
Min. :1970 Northeast US Fall :164 Min. : 7.48
1st Qu.:1991 Northeast US Spring:132 1st Qu.: 8.43
Median :2000 Maritimes Summer : 99 Median : 52.98
Mean :1999 Southeast US Summer: 93 Mean : 67.22
3rd Qu.:2010 Southeast US Spring: 92 3rd Qu.:119.63
Max. :2020 Southeast US Fall : 87 Max. :255.06
(Other) :148 NA's :53
Common.Name
American lobster :187
Banded drum :158
Black sea bass :180
Guachanche barracuda:126
Rainbow star : 26
Red hake :138
Notice that we can specify the variable names as they are in the dataset. This is preferable to using indices, which can lack sufficient information about what the subset should actually contain.
3.3.4 Grouping 2.0
The grouping data move may serve various purposes and in some cases it is not necessary to create new variables to group data for comparisons. For example, in the dplyr package, we can use group_by() to create a “grouped” data frame where subsequent functions are applied to each group.
Marine6 %>%
  group_by(Common.Name) %>%
  count()
# A tibble: 6 × 2
# Groups: Common.Name [6]
Common.Name n
<fct> <int>
1 American lobster 187
2 Banded drum 158
3 Black sea bass 180
4 Guachanche barracuda 126
5 Rainbow star 26
6 Red hake 138
In the code above, we were able to generate a comparison of counts for the variable on which the grouping was based. We input our grouped data frame (created by group_by()) into the count() function to generate a summary similar to what we got through applying the summary() function to as.factor(Marine6$Common.Name).
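For comparison, a base R statement along the lines mentioned above should produce the same counts (assuming the Marine6 object we read in earlier).

# base R counterpart: tabulate the levels of "Common.Name"
summary(as.factor(Marine6$Common.Name))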
3.3.5 A few more dplyr examples
Recall the rather involved base R loop that we created for the purpose of adding a grouping variable to our Marine6 dataset. Let’s consider how we could do this in the dplyr setting.
# function to categorize the different common names
fishCat <- function(x){ # x is the dataset variable of interest
  ifelse(is.na(x), NA,
         ifelse(x %in% c("American lobster", "Rainbow star"), "Other", "Fish"))
}

# creating a new grouping variable
Marine6 %>%
  mutate(Group = fishCat(Common.Name)) %>%
  group_by(Group) %>%
  count()
# A tibble: 2 × 2
# Groups: Group [2]
Group n
<chr> <int>
1 Fish 602
2 Other 213
In the code above, we created a user-defined function (specific to our Marine6 data context) that takes in a dataset variable and returns either NA, “Fish,” or “Other.” The function uses nested ifelse() statements to account for the various criteria, rather than the more general if-else statement that we used for the loop in our base R grouping example. Although this function does not require anything beyond base R, we at least are able to see, in case you hadn’t heard, that there is more than one way to do a thing in R. In addition, ifelse() is a vectorized function: it applies the specified conditions to each element of the vector at once, making for a more efficient process!
Below the user-defined fishCat() function we introduced the mutate() function. This function allows us to create a new variable from one that exists in the referenced dataset. Just as with our base R example, we named this variable “Group.” We used fishCat() to create the values of “Group” based on the components of “Common.Name”. Finally, we applied the group_by() function and counted the frequencies within each of our newly created “Group” categories.
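As an aside, to see that vectorized behavior on its own, we could apply fishCat() directly to a short, made-up character vector rather than to a dataset column; the expected result is shown as a comment.

# fishCat() evaluates its conditions for every element in a single call
fishCat(c("Red hake", NA, "American lobster"))
# [1] "Fish"  NA      "Other"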
Now, let’s look at one last example that combines our filtering and subsetting data moves into one sequence of pipes and operations.
Marine6 %>%
  filter(Year <= 2000, Region == "Northeast US Fall") %>% # filtering
  select(Year, Region, Depth, Common.Name) %>% # subsetting
  summary()
Year Region Depth
Min. :1974 Northeast US Fall :93 Min. : 21.50
1st Qu.:1980 Gulf of Alaska : 0 1st Qu.: 34.83
Median :1987 Gulf of Mexico : 0 Median : 89.55
Mean :1987 Gulf of St. Lawrence South: 0 Mean : 75.41
3rd Qu.:1995 Maritimes Summer : 0 3rd Qu.:108.07
Max. :2000 Northeast US Spring : 0 Max. :134.87
(Other) : 0 NA's :1
Common.Name
American lobster :27
Banded drum :12
Black sea bass :27
Guachanche barracuda: 0
Rainbow star : 0
Red hake :27
Consider
In comparing the “2.0” tidyverse versions with the base R versions, the readability of applying data moves with dplyr may have come into focus. As you continue to apply data moves, consider the advantages and disadvantages of the two different methods presented here. Which methods might work best for you, and do you imagine using both base R and dplyr for your data moves in R?
In this section we learned about important data moves that represent common programming applications for data dives and explorations within the data science workflow. We realized these data moves using base R and the dplyr package (with help from magrittr). Furthermore, we introduced loops and user-defined functions as examples of how various concepts come together and play a role in the data exploration process. As you move forward with using data moves for your data dive, think about the questions that come to mind via this exploration. Your data moves can even guide you towards a new data hypothesis!
1. Erickson T, Wilkerson M, Finzer W, Reichsman F. Data moves. Technology Innovations in Statistics Education (2019) 12.