We collected the data we’ll be cleaning through web scraping in part 1 and part 2 of this series. We have 3 datasets overall, which we will combine into one by the end of this section.

Cleaning mvps data

We start with the mvps data. Import the pandas package and read in the data. I saved my file as mvps.csv; yours could be named differently. Make sure it is in the same directory as your Jupyter notebook file, otherwise you’ll have to specify the whole path, which I find very annoying.

import pandas as pd

# read in the data
mvps = pd.read_csv("mvps.csv")
players = pd.read_csv("player_stats.csv")
teams = pd.read_csv("teams.csv")

Next, we’ll view a sample of data from all three datasets. We’ll include a line of code that shows all the columns instead of truncating them with ellipses (…), because I would like to compare the columns from all three datasets.

# view a sample of the data
pd.set_option('display.max_columns', None)  # display all column names
mvps.sample()

You can do the same for the remaining datasets.
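For example, the same method works for the other two dataframes (run each in its own cell so both outputs display):

# view a sample of the remaining datasets
players.sample()
teams.sample()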

Since the mvps and players data are basically about players, we will remove some columns from the mvps dataset that are already in the players dataset. First, we’ll find the common columns in both using the intersection() method.

common_columns = mvps.columns.intersection(players.columns)
print(common_columns)

Next, we will select the subset of columns that we want to keep.

  • Pts Won — the points won in the MVP voting

  • Pts Max — the maximum points possible

  • Share — Pts Won/Pts Max

mvps = mvps[["Player", "Pts Won", "Pts Max", "Share", "Year"]]
mvps.head()

Cleaning Players Data

Next, we’ll clean the player data and then merge it with the mvps data so that we can have the votes for every player. We’ll delete some unnecessary columns.

del players["Unnamed: 0"]
del players["Rk"]

# confirm
players.head()

Among the common columns were the Player and Year columns. We will check these to make sure they’re in the correct format, since we will merge players and mvps based on these 2 columns.

players["Player"].head(50)

We notice some names have an asterisk (*) at the end. We have to remove the asterisks as they will cause issues during merging, because pandas will interpret a name with and without the asterisk as two different names. We use pandas’ str.replace() method.

The str.replace() method can treat the pattern as a regular expression (this was the default in older pandas versions), and * is a special character in regex. So we set regex=False to make pandas treat the * as a literal asterisk.

players["Player"] = players["Player"].str.replace("*","", regex=False)

#confirm the change
players.head()

We now check for duplicates.

players.duplicated().sum()

There are 466 duplicated rows. Since they’re too many to display, I did a visual inspection of the data by displaying the first 20 records.
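For example, a quick way to do that inspection:

# display the first 20 records for a visual check
players.head(20)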

You’ll notice that player 8, Greg Anderson, has several records for 1991. This is because he played for different teams that year. To confirm this, we will create a group for each player and year. This returns all 4 records of Greg Anderson for 1991.

players.groupby(["Player", "Year"]).get_group(("Greg Anderson", 1991))

We will write a function that makes sure each group has only one row. We then use the apply() method to apply this function to each group.

def single_row(players):
    # if there's only one row for the player, return it
    if players.shape[0] == 1:
        return players
    else:
        # keep the "TOT" row, which holds the player's combined stats for the year
        row = players[players["Tm"] == "TOT"]
        # replace "TOT" with the last team the player played for
        row["Tm"] = players.iloc[-1, :]["Tm"]
        return row

# apply the function to each group
players = players.groupby(["Player", "Year"]).apply(single_row)

We then look at the first 20 rows to make sure the changes were applied, using the head() method. The output shows each player with all the years they played.
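For example:

# confirm the changes were applied to each group
players.head(20)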

We have a problem though: the groupby operation added 2 extra index levels. We now have a multilevel index because we split the data into groups and combined it again. We don’t need those levels, so we will drop them. Run the code below twice (in separate cells), then display the dataframe again to confirm they’re gone.

players.index = players.index.droplevel()
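Since the line above removes one level at a time, run it again in a second cell to drop the remaining extra level:

# run again (in a separate cell) to drop the second extra index level
players.index = players.index.droplevel()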

For further confirmation, you can check the records for Greg Anderson to make sure there’s only one record for the year 1991 instead of 4.

players[players["Player"] == "Greg Anderson"]

Combining the players and mvps data

We’ll merge the two datasets on the Player and Year columns. We’ll do this using an outer merge, because not all players in the players data appear in the mvps data.

The mvps data contains only players who received MVP votes, while the players data contains all players whether they received votes or not. An outer merge keeps the rows even when there is no match in the other dataset.

combined = players.merge(mvps, how="outer", on=["Player", "Year"])
combined.head()

Players with NaN in the last 3 columns never received MVP votes. But we can check the players who did, just to make sure the merge worked correctly.

combined[combined["Pts Won"] > 0]

You can fill in the last 3 columns with zero for the players who never received votes if you wish. Technically, the values are not missing; there were simply no votes to record.
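If you do want zeros instead of NaN, a minimal sketch would be:

# optional: treat missing voting figures as zero
combined[["Pts Won", "Pts Max", "Share"]] = combined[["Pts Won", "Pts Max", "Share"]].fillna(0)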

Cleaning team data

We already read in the teams data at the beginning, but you can do so now if you hadn’t already. We’ll look at the dataframe’s first 20 rows. You’ll notice there are division header rows repeated after every 6 or 7 records; we need to remove these.
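To see them for yourself:

# display the first 20 rows; the division header rows appear among them
teams.head(20)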

# ~ means not
teams = teams[~teams["W"].str.contains("Division")]

Some of the team names have an asterisk * at the end. We’ll use the replace method as before and remove the *.

# remove * from team names
teams["Team"] = teams["Team"].str.replace("*","", regex=False)
# confirm
teams.head()

Another issue we have is with the team names. When we get the unique team names in the teams dataset and the ones in the combined dataset, we realise the ones from the combined dataset are abbreviated while those in the teams dataset are not. This will cause an issue when merging, so we need to fix it.
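For example, you can print the unique values from each (run in separate cells); the exact output will depend on your scraped data:

# team names in combined are abbreviations, e.g. "LAL"
combined["Tm"].unique()
# team names in teams are full names, e.g. "Los Angeles Lakers"
teams["Team"].unique()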

One way to do this is to match the abbreviations with their full names, save that as a file, and then use it. It’s already been done for you (you can say thanks with a tip below ;) ). The file is called nicknames.txt and you can download it from my Telegram group.

nicknames = {}

with open("nicknames.txt") as f:
    lines = f.readlines()
    for line in lines[1:]: # skip header row
        abbrev, name = line.replace("\n","").split(",")
        nicknames[abbrev] = name

# view the dictionary
nicknames

We will create a new column called Team in combined, then use the map() method, which applies the nicknames dictionary to each element in the Tm column of the combined dataset.

By default the new column would be added at the end of the dataframe, but I would love to have it next to the Tm column so it’s easy to match the names and verify they’re right. I’ll use the insert() method to create the new column.

combined.insert(4, "Team", combined["Tm"].map(nicknames))
combined.head()

Our teams data is clean and ready. We’ll merge it with the combined data on the Team and Year columns. So for every row in combined, we’ll add the corresponding row from teams with that team’s win/loss record for that year.

stats = combined.merge(teams, how="outer", on=["Team","Year"])
stats

# drop the leftover index column that came from the csv file
del stats["Unnamed: 0"]

You can then confirm whether the datatypes are correct and change them where necessary. Since we have so many columns, the easiest way to do this is as below. The columns that are actual strings will throw errors that are ignored, so their datatype won’t change.

stats = stats.apply(pd.to_numeric, errors="ignore")

# confirm change
stats.dtypes

Go through the columns that didn’t change and confirm the datatype is correct. For example, the GB column should be numeric but remained as a string. We’ll confirm and then change if necessary.

stats["GB"].unique()

There’s a dash in one of the values, which represents that a team has zero games back (GB), so we’ll replace that. Copy and paste the dash from the output into your code rather than typing it, because some characters are not exactly what they appear to be on the screen. If you look closely, it is an em dash (—), which is longer than the hyphen you’d get from your keyboard.

stats["GB"] = stats["GB"].str.replace("—","0")
stats["GB"].unique()

Confirm the change using the unique() method again, then convert the column to numeric and confirm the datatype changed.

stats["GB"] = pd.to_numeric(stats["GB"])
stats.dtypes

Save the merged dataset to CSV.

stats.to_csv("player_mvp_stats.csv")

That’s the end of our data cleaning. You’re free to start exploring the data. Here’s a head start: we’ll find the highest-scoring players (among those who played more than 70 games) and plot them on a bar chart.

highest_scoring = stats[stats["G"] > 70].sort_values("PTS", ascending=False).head(10)
highest_scoring

Plot a bar chart of the same.

highest_scoring.plot.bar("Player", "PTS")

Well, this has been fun and tiring. You can continue from here if you wish.