## Load in the data into a pandas DataFrame, using the pd.read_csv function
import pandas as pd
ramen = pd.read_csv("theramenrater_big_list.csv")
5 Project Part 3.2: Diving into Data Exploration - Python
5.1 Data Moves in Python
The Ramen Rater has been around since 2002. The website’s founder, Hans Lienesch, reviews ramen noodles and records data on the brand, variety, style, country, and his own personal rating, with scores given in increments of 0.25. To expand upon the background description and address other components of the data dictionary criteria, we will read in The Big List dataset and use some functions to access general metadata and information on the variables within.
When data is read in through the pandas read_csv method, the default structure is a DataFrame. Recall that DataFrames store data in tabular format (i.e., rows and columns, like a spreadsheet).
We can use pandas functions (or methods) to extract useful information about the data, such as column names and data types. Gathering this information is often an initial, default step in a data science workflow after reading in data. We can begin with the .info() method, which returns a summary of the structure of the DataFrame.
## List summary information about the ramen dataframe
ramen.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5015 entries, 0 to 5014
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review_number 5015 non-null int64
1 brand 5015 non-null object
2 variety 5015 non-null object
3 style 5015 non-null object
4 country 5015 non-null object
5 stars 5015 non-null float64
dtypes: float64(1), int64(1), object(4)
memory usage: 235.2+ KB
The output above shows that there are 5015 entries (rows) and 6 columns. In the table, each row represents a column in the ramen dataset: the # shows the column index (from 0 to 5), followed by the column name, the Non-Null Count (i.e., the number of non-missing values), and the data type. The review_number column is int64 (numerical), the stars column is float64 (numerical), and the remaining columns are objects. When a column’s data type is listed as object, the values could be strings, numbers, or a mix of both. All of this information is useful and includes things we would want to record in something like a data dictionary.
We could also retrieve individual pieces of this information using DataFrame attributes like .shape and .columns. Attributes are properties of pandas objects, such as metadata (e.g., column names, indices), that can be accessed using dot notation just like methods. Unlike methods, though, attributes are called without using parentheses ().
## Check the dimensions of the dataset
ramen.shape
(5015, 6)
## List the variables in the dataset
ramen.columns
Index(['review_number', 'brand', 'variety', 'style', 'country', 'stars'], dtype='object')
5.1.1 Accessing Column Values
To retrieve all values from a column in Pandas we can use bracket notation, where the column name is placed inside square brackets and enclosed in single or double quotes. For example, to access the values in the style column we can use ramen['style']. The returned object is a Series.
ramen['style']
0 Pack
1 Bowl
2 Bowl
3 Bowl
4 Cup
...
5010 Pack
5011 Pack
5012 Pack
5013 Pack
5014 Cup
Name: style, Length: 5015, dtype: object
A benefit of returning a Series is that it supports vectorized operations. Pandas also provides built-in methods for Series that perform a wide range of tasks for both categorical and numerical data. In addition, Series have attributes that store information such as the data type and size.
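As a quick illustration of what vectorized means here, the following sketch (the stars_100 name is purely illustrative) rescales every value in the numerical stars column in one step, without writing a loop, and then checks a Series attribute:
## A vectorized operation: rescale every 'stars' value in one step (illustrative sketch)
stars_100 = ramen['stars'] * 20

## Series attributes, such as .size, store metadata about the Series
stars_100.size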
5.1.2 Summarizing
The summarizing data move can be used to condense the contents of a variable into frequency counts or proportions for categorical variables, and other descriptive statistics of interest for numerical variables. As an example, we can use the .value_counts() method to tally the different styles of noodles.
## Create a frequency table of the labels in the 'style' column
ramen['style'].value_counts()
style
Pack 2637
Bowl 1022
Cup 1002
Tray 228
Box 120
Restaurant 3
Bar 1
Bottle 1
Can 1
Name: count, dtype: int64
It appears that most of the reviewed noodles are sold in packs, bowls, or cups. Slightly more than half were in packs, while bowls and cups make up another 40%.
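If we prefer those shares as proportions rather than raw counts, one possible sketch uses the normalize argument of .value_counts():
## Create a relative frequency table (proportions) for the 'style' column
ramen['style'].value_counts(normalize=True)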
We can use the .describe() method to get a comprehensive summary of a numerical variable. The .describe() method returns a summary that includes the number of non-missing values, as well as statistical measures: the mean, standard deviation, minimum, 25th percentile (Q1), 50th percentile (median), 75th percentile (Q3), and maximum.
## Generate summary statistics for the 'stars' column
ramen['stars'].describe()
count 5015.000000
mean 3.735278
std 1.090282
min 0.000000
25% 3.250000
50% 3.750000
75% 4.500000
max 5.000000
Name: stars, dtype: float64
The output above allows us to get an idea of the distribution of the data. The ratings may be slightly skewed to the right when comparing the distance between Q1 and the median (0.5) with the distance between the median and Q3 (0.75). Looking ahead, a related visualization could provide an informative visual depiction of the data distribution.
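As a preview, one possible sketch of such a visualization is a simple histogram of the ratings (this assumes matplotlib is installed, since pandas plotting relies on it):
## Plot a histogram of the 'stars' ratings for a quick look at the distribution
import matplotlib.pyplot as plt
ramen['stars'].plot(kind='hist', bins=20, title='Distribution of stars ratings')
plt.show()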
5.1.3 Filtering
Recall that the filtering data move is what we use to reduce or examine a dataset based on certain row criteria. In our ramen dataset we have the country of origin for each noodle and it may be of interest to analyze the information related to a specific country. For example, suppose we wanted to filter for all the noodles that are manufactured in Japan. How can we do this?
5.1.3.1 Boolean Masks
A Boolean mask is typically a Series containing True and False values. It is created by applying comparison operators (==, >, <, etc.) or logical operators (&, |, etc.). To create a Boolean mask for the rows where the country is Japan, we can use the comparison statement ramen['country'] == "Japan". This returns a Series that can be used to filter out, or hide, the rows we do not want (e.g., countries other than Japan).
## Create a Boolean mask where 'country' == "Japan"
ramen['country'] == "Japan"
0 False
1 False
2 False
3 False
4 False
...
5010 False
5011 False
5012 False
5013 False
5014 False
Name: country, Length: 5015, dtype: bool
So, now that we have Boolean masks at our disposal, we can use one to filter the data of interest. To do this, we can assign the filter criteria to a variable (perhaps named mask) and use it to filter the dataframe.
## Create the Boolean mask where 'country' == "Japan"
mask = ramen['country'] == "Japan"
## Filter the dataframe
## The .head() method limits the output to the first 5 rows
ramen[mask].head()
review_number brand variety style country stars
1105 3464 Takamori Agodashi Udon Pack Japan 5.0
1106 4613 Ippudo Akamaru Modern Ramen Box Japan 5.0
1107 3382 Itsuki Akikara Ramen Pack Japan 5.0
1108 4308 Marutai Asahikawa Soy Sauce Ramen Pack Japan 5.0
1109 2974 Itomen Bansyu Ramen Pack Japan 5.0
Note: The use of the .head() method returns only the first five rows (from the filtered dataframe). What method could we use to see that the filtered data contains 1039 entries?
Next, in case the opportunity to travel to Japan arises, let’s lean on the data for the best ramen experience and find out which ramen is rated the best!
## Find the average 'stars' rating
## The .mean() method calculates the mean of a numerical Series
mean_stars = ramen['stars'].mean()
## Create the Boolean mask where 'country' == "Japan" and 'stars' greater
## than or equal to the average stars rating
mask = (ramen['country'] == "Japan") & (ramen['stars'] >= mean_stars)
## Filter the dataframe and display the number of rows that match the condition
ramen[mask].shape[0]
660
It looks like we have 660 options to choose from, so maybe we should narrow the search. But first, let’s see what’s going on in the code above.
- We took the mean of the ratings and stored it in mean_stars.
- Then, we added more criteria to our mask variable to filter rows where the country is Japan and (&) the ratings are greater than or equal to (>=) the overall average.
- Lastly, we applied the mask criteria to our ramen data and called the shape attribute (which gives the dimensions).
  - NOTE: Since Python indices begin at zero, the [0] index of shape yields the number of rows.
Although ramen was originally invented in China, it was Japan that refined and popularized it, and as a result, modern ramen is distinctly Japanese. So much so that it is featured in the Shin-Yokohama Ramen Museum, which documents its rich history and tasty transformations over time!
Note that contextual information like this can facilitate your data exploration workflow and its efficacy.
5.1.3.2 Query
In programming, a query is a request to access or retrieve information from a database or dataset based on conditions or criteria. Queries may also include instructions to filter, sort, or aggregate the data.
In Pandas, the .query() method is specifically used to filter rows in a DataFrame based on a string input consisting of a boolean expression. For example, we can assign a string that contains a boolean expression (i.e., filtering criteria) to a variable and pass it as the parameter to .query().
Let’s use this method to show a different way to filter the ramen DataFrame for rows where the country is equal to Japan.
## Define a query string to filter the DataFrame for rows where the 'country' is "Japan"
q = "country == 'Japan'"
## Use the .query() method to filter the DataFrame and access the first 5 entries (with the .head() method)
ramen.query(q).head()
review_number brand variety style country stars
1105 3464 Takamori Agodashi Udon Pack Japan 5.0
1106 4613 Ippudo Akamaru Modern Ramen Box Japan 5.0
1107 3382 Itsuki Akikara Ramen Pack Japan 5.0
1108 4308 Marutai Asahikawa Soy Sauce Ramen Pack Japan 5.0
1109 2974 Itomen Bansyu Ramen Pack Japan 5.0
Notice that the results are identical to those obtained using a Boolean mask. While the advantages of using .query() may not be clear for a single condition, its benefits become more apparent with more complex filtering. For example, suppose we want to find extreme cases where the noodle ratings are 0.0 or 5.0, for Myojo brand, Tray-style noodles, manufactured in Japan.
## Define a query string to filter the DataFrame for
## rows where the 'country' is "Japan",
## the style of the noodles is "Tray",
## the brand is "Myojo", and
## the stars rating is either 0.0 or 5.0
q = 'country == "Japan" and style == "Tray" and brand == "Myojo" and (stars == 0.0 or stars == 5.0)'
## Use the .query() method to filter the DataFrame with criteria q
ramen.query(q)
review_number brand ... country stars
1112 3940 Myojo ... Japan 5.0
1151 2951 Myojo ... Japan 5.0
1170 4686 Myojo ... Japan 5.0
1171 1103 Myojo ... Japan 5.0
1172 2801 Myojo ... Japan 5.0
1173 5014 Myojo ... Japan 5.0
1174 3470 Myojo ... Japan 5.0
1175 3231 Myojo ... Japan 5.0
1270 2906 Myojo ... Japan 5.0
2142 3505 Myojo ... Japan 0.0
[10 rows x 6 columns]
The output shows partial (column-truncated) information for the ten rows of the filtered dataset of interest. In terms of the code, the query string improves readability and reduces the required syntax in comparison to specifying the same criteria with Boolean masks. When using a Boolean mask, you need to repeatedly reference the DataFrame and use bracket notation (a sketch of this appears below). In contrast, .query() lets you reference column names directly as variables in a string expression, which is often more efficient.
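Here is that Boolean-mask version of the same multi-condition filter; notice how many times the DataFrame name and bracket notation must be repeated:
## Equivalent filter expressed with a Boolean mask instead of .query()
mask = (
    (ramen['country'] == "Japan") &
    (ramen['style'] == "Tray") &
    (ramen['brand'] == "Myojo") &
    ((ramen['stars'] == 0.0) | (ramen['stars'] == 5.0))
)
ramen[mask]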
5.1.4 Subsetting
Recall that the subsetting data move involves reducing the data based on column criteria. We can subset a Pandas DataFrame by specifying columns to keep using bracket notation, or by dropping columns that are not needed using the .drop() method. For bracket notation, we can input a list object consisting of column names inside brackets as shown below. Subsetting the DataFrame in this way results in a new DataFrame with the columns of interest.
## List of columns to select
columns = ['brand', 'style', 'stars']
## Subset the ramen dataframe
## ramen[['brand', 'style', 'stars']].head() gives the same result
ramen[columns].head()
brand style stars
0 Maggi Pack 5.00
1 Suimin Bowl 5.00
2 Suimin Bowl 5.00
3 Suimin Bowl 5.00
4 Suimin Cup 4.25
Now, let’s subset by dropping unneeded columns, using .drop().
## Drop the review_number column from the ramen dataframe
ramen = ramen.drop(columns = ['review_number'])
## Verify the column was removed
ramen.columns
Index(['brand', 'variety', 'style', 'country', 'stars'], dtype='object')
5.1.5 Grouping
Recall that the grouping data move results in certain categorizations that allow for related comparisons via statistics, visualizations, models, and more. In the Pandas library, the .groupby() method groups rows based on values in one or more variables. For example, we may be interested in groupings based on levels within a specified categorical variable (e.g., ramen styles). Once the grouping has been implemented, we can apply aggregation methods, such as summing or averaging values, to each group.
## Create a groupby object
style_grps = ramen.groupby('style')
Above, we created a variable style_grps. This is a grouped object that contains DataFrames organized into subgroups based on the levels within the style column. Each subgroup contains a separate DataFrame consisting of the row information corresponding to one of the unique levels (e.g., "Cup", "Can", etc.) in style.
In the code below, the command style_grps.groups.keys() displays all the subgroup names (i.e., the grouping criteria, or the levels of style) in the style_grps object. Note that the .groups attribute of a grouped object is a dictionary on which we can run the .keys() method.
## Display all the subgroup names in style_grps
style_grps.groups.keys()
dict_keys(['Bar', 'Bottle', 'Bowl', 'Box', 'Can', 'Cup', 'Pack', 'Restaurant', 'Tray'])
Various data moves are related and can be complementary. Although the grouping data move does not remove row information (but rather partitions the rows) and filtering does remove certain rows, we could choose to use either of these data moves to accomplish certain desired tasks. For example, if we were interested in comparing the average stars rating across each style, we could use our filtering data move for each of the 9 different styles and compute the mean of the stars ratings for each filtered dataset (a sketch of this approach for a single style is shown below). Or, we could leverage grouping to compute the means in a more efficient way using our style_grps object. Besides the reduction in code, one difference between grouping and filtering is that computations applied to the grouped object are vectorized (and also more efficient in that way). The mean stars per group calculation is demonstrated below.
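For instance, a sketch of the filtering approach for a single style might look like this, and it would need to be repeated for each of the nine styles:
## Filtering approach: mean stars rating for one style only ("Cup")
ramen[ramen['style'] == "Cup"]['stars'].mean()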
## Calculate the mean stars rating for each style in the style_grps grouped object
## The .sort_values method arranges the output in ascending order
style_grps['stars'].mean().sort_values()
style
Cup 3.493693
Can 3.500000
Restaurant 3.583333
Tray 3.595965
Bowl 3.664540
Pack 3.851422
Bottle 4.000000
Box 4.060417
Bar 5.000000
Name: stars, dtype: float64
Applying the .mean() method to our grouped object results in a Series. We can then apply the .sort_values() method to this Series so that the group means appear in ascending (or descending) order.
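For instance, a descending version (just a small sketch) passes ascending=False to .sort_values():
## Sort the group means in descending order instead
style_grps['stars'].mean().sort_values(ascending=False)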
From the output we can see that the three highest mean ratings are for Bar, Box, and Bottle. To complement the mean rating analysis, we should also examine the number of occurrences of each style, and we can do this by applying the appropriate method to our same style_grps object.
## Return the number of rows (reviews) for each style in the style_grps object
## The .sort_values() method arranges the output in ascending order
style_grps['stars'].size().sort_values()
style
Bar 1
Bottle 1
Can 1
Restaurant 3
Box 120
Tray 228
Cup 1002
Bowl 1022
Pack 2637
Name: stars, dtype: int64
From the output we can see that two of our top three mean ratings are represented by only 1 review each. So, we can regard the other averages (besides Restaurant) that are based on many more ratings as more reliable with respect to the style categorization. The frequencies may reflect the preferences of the rater, or may be reflective of a concept like “availability”. At the least, we have an idea about a line of inquiry that may be worth investigating, and we might want to put more or less faith in some mean ratings than others.
Calculating (a new attribute)
The calculating a new attribute data move allows us to create new variables from existing ones. We might calculate a new attribute to reveal underlying patterns in the data, to facilitate comparisons, or to prepare data for analysis, among other useful transformations. We can calculate various types of new attributes, including numerical or categorical variables. For example, two numerical variables might be combined in a formula to calculate a new quantity, or a numerical variable might be used to sort observations into groups like small, medium, and large (see the sketch below). These calculated values become part of the dataset and can be used to support further exploration.
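As a small sketch of the second kind of calculation (a categorical attribute derived from a numerical one), we could bin the ramen stars ratings into rough quality labels; the cutoffs and the rating_level column name here are purely illustrative:
## Calculate a new categorical attribute by binning the numerical 'stars' ratings (illustrative)
ramen['rating_level'] = pd.cut(
    ramen['stars'],
    bins=[0, 2.5, 4.0, 5.0],
    labels=['low', 'medium', 'high'],
    include_lowest=True
)

## Summarize the new attribute
ramen['rating_level'].value_counts()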
5.1.5.1 A calculating data move example
The American Federation of Labor and Congress of Industrial Organizations (AFL-CIO) website provides data on CEO compensation packages in the United States. The information on the compensation packages includes base salary, bonuses, stock awards, and other earnings, all captured in a dataset compiled from AFL-CIO data.
Below are the attributes and descriptions for this dataset:
CEO Compensation Packages - Variable Descriptions (AFL-CIO)

| Name | Description |
|---|---|
| ticker | Stock ticker symbol for the company. |
| salary | Base annual salary. |
| bonus | Additional cash bonus. |
| stock_awards | Value of stock granted. |
| option_awards | Value of stock options granted. |
| non_equity_comp | Performance-based cash compensation not tied to equity. |
| pension_change | Increase in pension value and deferred compensation earnings. |
| other_comp | Miscellaneous compensation (e.g., perks, benefits). |
Although the dataset does not contain a column representing total compensation, we can calculate this attribute from the existing data. We could also create additional attributes that represent various measures like the percentage of total compensation coming from stock or bonus pay.
# Load the CEO compensation data
ceo_pay = pd.read_csv("ceo_compensation_summary.csv")

# Calculate total compensation and store the result in the dataframe in a new column called total
ceo_pay['total'] = (
    ceo_pay['salary'] +
    ceo_pay['bonus'] +
    ceo_pay['stock_awards'] +
    ceo_pay['option_awards'] +
    ceo_pay['non_equity_comp'] +
    ceo_pay['pension_change']
)
Now that we’ve computed this new column, additional analysis and exploration can be performed to better understand CEO pay trends. In addition, we could use the summarizing data move to calculate statistics such as the mean, median, and standard deviation of the new total attribute.
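For instance, a brief sketch (building on the ceo_pay DataFrame above; the pct_stock column name is illustrative) could calculate the share of total compensation coming from stock awards and then summarize the new total column:
# Calculate the percentage of total compensation coming from stock awards (illustrative)
ceo_pay['pct_stock'] = ceo_pay['stock_awards'] / ceo_pay['total'] * 100

# Summarize the new 'total' attribute
ceo_pay['total'].describe()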
5.1.6 A little extra - Merging & Joining
Merging and joining involve combining information from various datasets based on some identifying criteria. These terms are often used interchangeably.
Consider the ceo_pay dataset, which includes details about the CEO compensation structure but lacks information about the companies and the particular CEOs. If we had the additional company and CEO information in another dataset, and a common identifier between these disparate sources, we could merge the data into a single source containing all of the information.
As it turns out, the company dataset, read in below, has information on company and CEO names. In addition, both the company and ceo_pay datasets contain a common identifier, ticker. So, we can use this identifier to merge the two data sources into one.
# Load the company information data
company = pd.read_csv("company_info.csv")
# Join the CEO pay dataset with the company dataset
ceos_and_pay = pd.merge(ceo_pay, company, on = 'ticker', how = 'inner')
# Show the new combined data column names
ceos_and_pay.columns
Index(['ticker', 'salary', 'bonus', 'stock_awards', 'option_awards',
'non_equity_comp', 'pension_change', 'other_comp', 'total', 'city',
'state', 'display_name', 'fiscal_year', 'ceo_name'],
dtype='object')
# preview the first 5 rows
ceos_and_pay.head()
ticker salary ... fiscal_year ceo_name
0 TPG 509615 ... 2023.0 Mr. Jon Winkelried
1 CG 838462 ... 2023.0 Mr. Harvey Mitchell Schwartz
2 AVGO 1200000 ... 2024.0 Mr. Hock E. Tan
3 PANW 750000 ... 2024.0 Mr. Nikesh Arora C.F.A.
4 COTY 3549000 ... 2024.0 Ms. Sue Y. Nabi
[5 rows x 14 columns]
Above we used the pd.merge() function (in the Pandas library) to combine the two dataframes based on the ticker variable/identifier present in both datasets. We specified the identifier ticker as the on parameter value, and we specified inner as the how parameter value. The inner specification defines an inner join, which means only the rows where both datasets have a matching ticker value are merged and kept in the resulting dataset. In general, pd.merge() is used to combine DataFrames based on one or more common keys (i.e., columns).
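To make the inner-join behavior concrete, here is a tiny self-contained sketch using made-up toy data (not from the AFL-CIO files): only the ticker value that appears in both frames survives the merge.
# Toy example: 'AAA' is the only ticker present in both DataFrames
left = pd.DataFrame({'ticker': ['AAA', 'BBB'], 'salary': [100, 200]})
right = pd.DataFrame({'ticker': ['AAA', 'CCC'], 'ceo_name': ['Pat', 'Sam']})

# The inner join keeps only the matching 'AAA' row
pd.merge(left, right, on='ticker', how='inner')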
Now that we have additional information in our ceos_and_pay dataset, we have the option to explore entirely new data investigations, such as CEO salaries by industry or salaries by other company characteristics. And of course, to do this, we can leverage data moves!