## Load in the data into a pandas DataFrame, using the pd.read_csv function
import pandas as pd
ramen = pd.read_csv("theramenrater_big_list.csv")
5 Project Part 3.2: Diving into Data Exploration - Python
5.1 Data Moves in Python
The Ramen Rater has been around since 2002. The website’s founder, Hans Lienesch, reviews ramen noodles and records data on the brand, variety, style, country, and his own personal rating, with scores given in increments of 0.25. To expand upon the background description and address other components of the data dictionary criteria, we will read in The Big List dataset and use some functions to access general metadata and information on the variables within.
When data is read in through the pandas read_csv method, the default structure is a DataFrame. Recall that DataFrames store data in tabular format (i.e., rows and columns, like a spreadsheet).
We can use pandas functions (or methods) to extract useful information about the data, such as column names and data types. Gathering this information is often an initial, default step in a data science workflow after reading in data. We can begin with the .info() method, which returns a summary of the structure of the DataFrame.
## List summary information about the ramen dataframe
ramen.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5015 entries, 0 to 5014
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review_number 5015 non-null int64
1 brand 5015 non-null object
2 variety 5015 non-null object
3 style 5015 non-null object
4 country 5015 non-null object
5 stars 5015 non-null float64
dtypes: float64(1), int64(1), object(4)
memory usage: 235.2+ KB
The output above shows that there are 5015 entries (rows) and 6 columns. In the table, each row represents a column in the ramen dataset: the # shows the column index (from 0 to 5), followed by the column name, the Non-Null Count (i.e., the number of non-missing values), and the data type. The review_number column is int64 (numerical), the stars column is float64 (numerical), and the remaining columns are objects. When a column’s data type is listed as object, the values could be strings, numbers, or a mix of both. All of this information is useful and includes things we would want to record in something like a data dictionary.
We could also retrieve individual pieces of this information using DataFrame attributes like .shape and .columns. Attributes are properties of pandas objects, such as metadata (e.g., column names, indices), that can be accessed using dot notation just like methods. Unlike methods, though, attributes are called without using parentheses ().
## Check the dimensions of the dataset
ramen.shape
(5015, 6)
## List the variables in the dataset
ramen.columns
Index(['review_number', 'brand', 'variety', 'style', 'country', 'stars'], dtype='object')
5.1.1 Accessing Column Values
To retrieve all values from a column in Pandas we can use bracket notation, where the column name is placed inside square brackets and enclosed in single or double quotes. For example, to access the values in the style column we can use ramen['style']. The returned object is a Series.
ramen['style']
0 Pack
1 Bowl
2 Bowl
3 Bowl
4 Cup
...
5010 Pack
5011 Pack
5012 Pack
5013 Pack
5014 Cup
Name: style, Length: 5015, dtype: object
A benefit of returning a Series is that it supports vectorized operations. Pandas also provides built-in methods for Series that perform a wide range of tasks for both categorical and numerical data. In addition, Series have attributes that store information such as the data type and size.
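As a quick illustration of what vectorized means here, the following sketch (the stars_100 name is purely illustrative) rescales every value in the numerical stars column in one step, without writing a loop, and then checks a Series attribute:
## A vectorized operation: rescale every 'stars' value in one step (illustrative sketch)
stars_100 = ramen['stars'] * 20

## Series attributes, such as .size, store metadata about the Series
stars_100.size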
5.1.2 Summarizing
The summarizing data move can be used to condense the contents of a variable into frequency counts or proportions for categorical variables, and other descriptive statistics of interest for numerical variables. As an example, we can use the .value_counts() method to tally the different styles of noodles.
## Create a frequency table of the labels in the 'style' column
ramen['style'].value_counts()
style
Pack 2637
Bowl 1022
Cup 1002
Tray 228
Box 120
Restaurant 3
Bar 1
Bottle 1
Can 1
Name: count, dtype: int64
It appears that most of the reviewed noodles are sold in packs, bowls, or cups. Slightly more than half were in packs, while bowls and cups make up another 40%.
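If we prefer those shares as proportions rather than raw counts, one possible sketch uses the normalize argument of .value_counts():
## Create a relative frequency table (proportions) for the 'style' column
ramen['style'].value_counts(normalize=True)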
We can use the .describe() method to get a comprehensive summary of a numerical variable. The .describe() method returns a summary that includes the number of non-missing values, as well as statistical measures: the mean, standard deviation, minimum, 25th percentile (Q1), 50th percentile (median), 75th percentile (Q3), and maximum.
## Generate summary statistics for the 'stars' column
ramen['stars'].describe()
count 5015.000000
mean 3.735278
std 1.090282
min 0.000000
25% 3.250000
50% 3.750000
75% 4.500000
max 5.000000
Name: stars, dtype: float64
The output above allows us to get an idea of the distribution of the data. The ratings may be slightly skewed to the right when comparing the distance between Q1 and the median (0.5) with the distance between the median and Q3 (0.75). Looking ahead, a related visualization could provide an informative visual depiction of the data distribution.
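As a preview, one possible sketch of such a visualization is a simple histogram of the ratings (this assumes matplotlib is installed, since pandas plotting relies on it):
## Plot a histogram of the 'stars' ratings for a quick look at the distribution
import matplotlib.pyplot as plt
ramen['stars'].plot(kind='hist', bins=20, title='Distribution of stars ratings')
plt.show()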
5.1.3 Filtering
Recall that the filtering data move is what we use to reduce or examine a dataset based on certain row criteria. In our ramen dataset we have the country of origin for each noodle and it may be of interest to analyze the information related to a specific country. For example, suppose we wanted to filter for all the noodles that are manufactured in Japan. How can we do this?
5.1.3.1 Boolean Masks
A Boolean mask is typically a Series containing True and False values. It is created by applying comparison operators (==, >, <, etc.) or logical operators (&, |, etc.). To create a Boolean mask for the rows where the country is Japan, we can use the comparison statement ramen['country'] == "Japan". This returns a Series that can be used to filter out, or hide, the rows we do not want (e.g., countries other than Japan).
## Create a Boolean mask where 'country' == "Japan"
ramen['country'] == "Japan"
0 False
1 False
2 False
3 False
4 False
...
5010 False
5011 False
5012 False
5013 False
5014 False
Name: country, Length: 5015, dtype: bool
So, now that we have Boolean masks at our disposal, we can use one to filter the data of interest. To do this, we can assign the filter criteria to a variable (perhaps named mask) and use it to filter the dataframe.
## Create the Boolean mask where 'country' == "Japan"
mask = ramen['country'] == "Japan"
## Filter the dataframe
## The .head() method limits the output to the first 5 rows
ramen[mask].head()
review_number brand variety style country stars
1105 3464 Takamori Agodashi Udon Pack Japan 5.0
1106 4613 Ippudo Akamaru Modern Ramen Box Japan 5.0
1107 3382 Itsuki Akikara Ramen Pack Japan 5.0
1108 4308 Marutai Asahikawa Soy Sauce Ramen Pack Japan 5.0
1109 2974 Itomen Bansyu Ramen Pack Japan 5.0
Note: The use of the .head() method returns only the first five rows (from the filtered dataframe). What method could we use to see that the filtered data contains 1039 entries?
Next, in case the opportunity to travel to Japan arises, let’s lean on the data for the best ramen experience and find out which ramen is rated the best!
## Find the average 'stars' rating
## The .mean() method calculates the mean of a numerical Series
mean_stars = ramen['stars'].mean()
## Create the Boolean mask where 'country' == "Japan" and 'stars' greater
## than or equal to the average stars rating
mask = (ramen['country'] == "Japan") & (ramen['stars'] >= mean_stars)
## Filter the dataframe and display the number of rows that match the condition
ramen[mask].shape[0]
660
It looks like we have 660 options to choose from, so maybe we should narrow the search. But first, let’s see what’s going on in the code above.
- We took the mean of the ratings and stored it in mean_stars.
- Then, we added more criteria to our mask variable to filter rows where the country is Japan and (&) the ratings are greater than or equal to (>=) the overall average.
- Lastly, we applied the mask criteria to our ramen data and called the shape attribute (which gives the dimensions).
  - NOTE: Since Python indices begin at zero, the [0] index of shape yields the number of rows.
Although ramen was originally invented in China, it was Japan that refined and popularized it, and as a result, modern ramen is distinctly Japanese. So much so that it is featured in the Shin-Yokohama Ramen Museum, which documents its rich history and tasty transformations over time!
Note that contextual information like this can facilitate your data exploration workflow and its efficacy.
5.1.3.2 Query
In programming, a query is a request to access or retrieve information from a database or dataset based on conditions or criteria. Queries may also include instructions to filter, sort, or aggregate the data.
In Pandas, the .query() method is specifically used to filter rows in a DataFrame based on a string input consisting of a boolean expression. For example, we can assign a string that contains a boolean expression (i.e., filtering criteria) to a variable and pass it as the parameter to .query().
Let’s use this method to show a different way to filter the ramen DataFrame for rows where the country is equal to Japan.
## Define a query string to filter the DataFrame for rows where the 'country' is "Japan"
q = "country == 'Japan'"
## Use the .query() method to filter the DataFrame and access the first 5 entries (with the .head() method)
ramen.query(q).head()
review_number brand variety style country stars
1105 3464 Takamori Agodashi Udon Pack Japan 5.0
1106 4613 Ippudo Akamaru Modern Ramen Box Japan 5.0
1107 3382 Itsuki Akikara Ramen Pack Japan 5.0
1108 4308 Marutai Asahikawa Soy Sauce Ramen Pack Japan 5.0
1109 2974 Itomen Bansyu Ramen Pack Japan 5.0
Notice that the results are identical to those obtained using a Boolean mask. While the advantages of using .query() may not be clear for a single condition, its benefits become more apparent with more complex filtering. For example, suppose we want to find extreme cases where the noodle ratings are 0.0 or 5.0, for Myojo brand, Tray-style noodles, manufactured in Japan.
## Define a query string to filter the DataFrame for
## rows where the 'country' is "Japan",
## the style of the noodles is "Tray",
## the brand is "Myojo", and
## the stars rating is either 0.0 or 5.0
q = 'country == "Japan" and style == "Tray" and brand == "Myojo" and (stars == 0.0 or stars == 5.0)'
## Use the .query() method to filter the DataFrame with criteria q
ramen.query(q)
review_number brand ... country stars
1112 3940 Myojo ... Japan 5.0
1151 2951 Myojo ... Japan 5.0
1170 4686 Myojo ... Japan 5.0
1171 1103 Myojo ... Japan 5.0
1172 2801 Myojo ... Japan 5.0
1173 5014 Myojo ... Japan 5.0
1174 3470 Myojo ... Japan 5.0
1175 3231 Myojo ... Japan 5.0
1270 2906 Myojo ... Japan 5.0
2142 3505 Myojo ... Japan 0.0
[10 rows x 6 columns]
The output shows partial (column-truncated) information for the ten rows of the filtered dataset of interest. In terms of the code, the query string improves readability and reduces the required syntax in comparison to specifying the same criteria with Boolean masks. When using a Boolean mask, you need to repeatedly reference the DataFrame and use bracket notation (a sketch of this appears below). In contrast, .query() lets you reference column names directly as variables in a string expression, which is often more efficient.
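Here is that Boolean-mask version of the same multi-condition filter; notice how many times the DataFrame name and bracket notation must be repeated:
## Equivalent filter expressed with a Boolean mask instead of .query()
mask = (
    (ramen['country'] == "Japan") &
    (ramen['style'] == "Tray") &
    (ramen['brand'] == "Myojo") &
    ((ramen['stars'] == 0.0) | (ramen['stars'] == 5.0))
)
ramen[mask]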
5.1.4 Subsetting
Recall that the subsetting data move involves reducing the data based on column criteria. We can subset a Pandas DataFrame by specifying columns to keep using bracket notation, or by dropping columns that are not needed using the .drop() method. For bracket notation, we can input a list object consisting of column names inside brackets as shown below. Subsetting the DataFrame in this way results in a new DataFrame with the columns of interest.
## List of columns to select
columns = ['brand', 'style', 'stars']
## Subset the ramen dataframe
## ramen[['brand', 'style', 'stars']].head() gives the same result
ramen[columns].head()
brand style stars
0 Maggi Pack 5.00
1 Suimin Bowl 5.00
2 Suimin Bowl 5.00
3 Suimin Bowl 5.00
4 Suimin Cup 4.25
Now, let’s subset by dropping unneeded columns, using .drop().
## Drop the review_number column from the ramen dataframe
ramen = ramen.drop(columns = ['review_number'])
## Verify the column was removed
ramen.columns
Index(['brand', 'variety', 'style', 'country', 'stars'], dtype='object')
5.1.5 Grouping
Recall that the grouping data move results in certain categorizations that allow for related comparisons via statistics, visualizations, models, and more. In the Pandas library, the .groupby() method groups rows based on values in one or more variables. For example, we may be interested in groupings based on levels within a specified categorical variable (e.g., ramen styles). Once the grouping has been implemented, we can apply aggregation methods, such as summing or averaging values, to each group.
## Create a groupby object
style_grps = ramen.groupby('style')
Above, we created a variable style_grps. This is a grouped object that contains DataFrames organized into subgroups based on the levels within the style column. Each subgroup contains a separate DataFrame consisting of the row information corresponding to one of the unique levels (e.g., "Cup", "Can", etc.) in style.
In the code below, the command style_grps.groups.keys() displays all the subgroup names (i.e., the grouping criteria, or the levels of style) in the style_grps object. Note that the .groups attribute of a grouped object is a dictionary on which we can run the .keys() method.
## Display all the subgroup names in style_grps
style_grps.groups.keys()
dict_keys(['Bar', 'Bottle', 'Bowl', 'Box', 'Can', 'Cup', 'Pack', 'Restaurant', 'Tray'])
Various data moves are related and can be complementary. Although the grouping data move does not remove row information (but rather partitions the rows) and filtering does remove certain rows, we could choose to use either of these data moves to accomplish certain desired tasks. For example, if we were interested in comparing the average stars rating across each style, we could use our filtering data move for each of the 9 different styles and compute the mean of the stars ratings for each filtered dataset (a sketch of this approach for a single style is shown below). Or, we could leverage grouping to compute the means in a more efficient way using our style_grps object. Besides the reduction in code, one difference between grouping and filtering is that computations applied to the grouped object are vectorized (and also more efficient in that way). The mean stars per group calculation is demonstrated below.
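For instance, a sketch of the filtering approach for a single style might look like this, and it would need to be repeated for each of the nine styles:
## Filtering approach: mean stars rating for one style only ("Cup")
ramen[ramen['style'] == "Cup"]['stars'].mean()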
## Calculate the mean stars rating for each style in the style_grps grouped object
## The .sort_values method arranges the output in ascending order
style_grps['stars'].mean().sort_values()
style
Cup 3.493693
Can 3.500000
Restaurant 3.583333
Tray 3.595965
Bowl 3.664540
Pack 3.851422
Bottle 4.000000
Box 4.060417
Bar 5.000000
Name: stars, dtype: float64
Applying the .mean() method to our grouped object results in a Series. We can then apply the .sort_values() method to this Series so that the group means appear in ascending (or descending) order.
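For instance, a descending version (just a small sketch) passes ascending=False to .sort_values():
## Sort the group means in descending order instead
style_grps['stars'].mean().sort_values(ascending=False)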
From the output we can see that the three highest mean ratings are for Bar, Box, and Bottle. To complement the mean rating analysis, we should also examine the number of occurrences of each style, and we can do this by applying the appropriate method to our same style_grps object.
## Return the number of rows (reviews) for each style in the style_grps object
## The .sort_values() method arranges the output in ascending order
style_grps['stars'].size().sort_values()
style
Bar 1
Bottle 1
Can 1
Restaurant 3
Box 120
Tray 228
Cup 1002
Bowl 1022
Pack 2637
Name: stars, dtype: int64
From the output we can see that two of our top three mean ratings are represented by only 1 review each. So, we can regard the other averages (besides Restaurant) that are based on many more ratings as more reliable with respect to the style categorization. The frequencies may reflect the preferences of the rater, or may be reflective of a concept like “availability”. At the least, we have an idea about a line of inquiry that may be worth investigating, and we might want to put more or less faith in some mean ratings than others.
Calculating (a new attribute)
The calculating a new attribute data move allows us to create new variables from existing ones. We might calculate a new attribute to reveal underlying patterns in the data, to facilitate comparisons, or to prepare data for analysis, among other useful transformations. We can calculate various types of new attributes, including numerical or categorical variables. For example, two numerical variables might be combined in a formula to calculate a new quantity, or a numerical variable might be used to sort observations into groups like small, medium, and large (see the sketch below). These calculated values become part of the dataset and can be used to support further exploration.
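As a small sketch of the second kind of calculation (a categorical attribute derived from a numerical one), we could bin the ramen stars ratings into rough quality labels; the cutoffs and the rating_level column name here are purely illustrative:
## Calculate a new categorical attribute by binning the numerical 'stars' ratings (illustrative)
ramen['rating_level'] = pd.cut(
    ramen['stars'],
    bins=[0, 2.5, 4.0, 5.0],
    labels=['low', 'medium', 'high'],
    include_lowest=True
)

## Summarize the new attribute
ramen['rating_level'].value_counts()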
5.1.5.1 A calculating data move example
The American Federation of Labor and Congress of Industrial Organizations (AFL-CIO) website provides data on CEO compensation packages in the United States. The information on the compensation packages includes base salary, bonuses, stock awards, and other earnings, all captured in a dataset compiled from AFL-CIO data.
Below are the attributes and descriptions for this dataset:
CEO Compensation Packages - Variable Descriptions (AFL-CIO)

| Name | Description |
|---|---|
| ticker | Stock ticker symbol for the company. |
| salary | Base annual salary. |
| bonus | Additional cash bonus. |
| stock_awards | Value of stock granted. |
| option_awards | Value of stock options granted. |
| non_equity_comp | Performance-based cash compensation not tied to equity. |
| pension_change | Increase in pension value and deferred compensation earnings. |
| other_comp | Miscellaneous compensation (e.g., perks, benefits). |
Although the dataset does not contain a column representing total compensation, we can calculate this attribute from the existing data. We could also create additional attributes that represent various measures like the percentage of total compensation coming from stock or bonus pay.
# Load the CEO compensation data
ceo_pay = pd.read_csv("ceo_compensation_summary.csv")

# Calculate total compensation and store the result in the dataframe in a new column called total
ceo_pay['total'] = (
    ceo_pay['salary'] +
    ceo_pay['bonus'] +
    ceo_pay['stock_awards'] +
    ceo_pay['option_awards'] +
    ceo_pay['non_equity_comp'] +
    ceo_pay['pension_change']
)
Now that we’ve computed this new column, additional analysis and exploration can be performed to better understand CEO pay trends. In addition, we could use the summarizing data move to calculate statistics such as the mean, median, and standard deviation of the new total attribute.
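For instance, a brief sketch (building on the ceo_pay DataFrame above; the pct_stock column name is illustrative) could calculate the share of total compensation coming from stock awards and then summarize the new total column:
# Calculate the percentage of total compensation coming from stock awards (illustrative)
ceo_pay['pct_stock'] = ceo_pay['stock_awards'] / ceo_pay['total'] * 100

# Summarize the new 'total' attribute
ceo_pay['total'].describe()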
5.1.6 A little extra - Merging & Joining
Merging and joining involve combining information from various datasets based on some identifying criteria. These terms are often used interchangeably.
Consider the ceo_pay dataset, which includes details about the CEO compensation structure but lacks information about the companies and the particular CEOs. If we had the additional company and CEO information in another dataset, and a common identifier between these disparate sources, we could merge the data into a single source containing all of the information.
As it turns out, the company dataset, read in below, has information on company and CEO names. In addition, both the company and ceo_pay datasets contain a common identifier, ticker. So, we can use this identifier to merge the two data sources into one.
# Load the company information data
company = pd.read_csv("company_info.csv")
# Join the CEO pay dataset with the company dataset
ceos_and_pay = pd.merge(ceo_pay, company, on = 'ticker', how = 'inner')
# Show the new combined data column names
ceos_and_pay.columns
Index(['ticker', 'salary', 'bonus', 'stock_awards', 'option_awards',
'non_equity_comp', 'pension_change', 'other_comp', 'total', 'city',
'state', 'display_name', 'fiscal_year', 'ceo_name'],
dtype='object')
# preview the first 5 rows
ceos_and_pay.head()
ticker salary ... fiscal_year ceo_name
0 TPG 509615 ... 2023.0 Mr. Jon Winkelried
1 CG 838462 ... 2023.0 Mr. Harvey Mitchell Schwartz
2 AVGO 1200000 ... 2024.0 Mr. Hock E. Tan
3 PANW 750000 ... 2024.0 Mr. Nikesh Arora C.F.A.
4 COTY 3549000 ... 2024.0 Ms. Sue Y. Nabi
[5 rows x 14 columns]
Above we used the pd.merge() function (in the Pandas library) to combine the two dataframes based on the ticker variable/identifier present in both datasets. We specified the identifier ticker as the on parameter value, and we specified inner as the how parameter value. The inner specification defines an inner join, which means only the rows where both datasets have a matching ticker value are merged and kept in the resulting dataset. In general, pd.merge() is used to combine DataFrames based on one or more common keys (i.e., columns).
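To make the inner-join behavior concrete, here is a tiny self-contained sketch using made-up toy data (not from the AFL-CIO files): only the ticker value that appears in both frames survives the merge.
# Toy example: 'AAA' is the only ticker present in both DataFrames
left = pd.DataFrame({'ticker': ['AAA', 'BBB'], 'salary': [100, 200]})
right = pd.DataFrame({'ticker': ['AAA', 'CCC'], 'ceo_name': ['Pat', 'Sam']})

# The inner join keeps only the matching 'AAA' row
pd.merge(left, right, on='ticker', how='inner')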
Now that we have additional information in our ceos_and_pay dataset, we have the option to explore entirely new data investigations, such as CEO salaries by industry or salaries by other company characteristics. And of course, to do this, we can leverage data moves!