In part 1 of this series, we extracted data of all players who have won the MVP award from 1991 to 2021. If you would like to follow along and missed out on part 1 of this series, click here.
In this section, we will extract data for all the players and their stats as well as team data.
The following is the code used to extract data using the requests library.
player_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html"
url = player_stats_url.format(1991)
data = requests.get(url)
with open("PLAYERS/1991.html", "w+", encoding="utf-8") as f:
f.write(data.text)The web page we intend to scrape loads the tables using javascript. As a result, the data scraped does not contain all the records we want. It only provides 17 rows, despite the fact that the table contains over 300 rows. To get around this problem, we'll use selenium.
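If you want to verify the problem yourself, a quick row count on the file we just saved makes it obvious (a short sketch; it uses the per_game_stats table id that we also rely on when parsing later):
from bs4 import BeautifulSoup

# count the table rows in the page we just saved with requests
with open("PLAYERS/1991.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

table = soup.find(id="per_game_stats")
rows = table.find_all("tr") if table else []
print(len(rows))  # far fewer than the 300+ rows the full table should have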
Selenium is a free, open-source software testing framework that allows developers to automate web browser actions. It is primarily used for testing web applications, but it can also be used for web scraping.
One of the key advantages of Selenium is that it allows developers to control a real web browser, rather than just making HTTP requests like some other web scraping tools do. This means that it can interact with websites in the same way a user would, and can handle complex interactions such as JavaScript, cookies, and pop-ups.
Let’s begin!
To use Selenium, you will need to have the following prerequisites installed on your system:
A web browser: Selenium can control a number of different web browsers, including Chrome, Firefox, Safari, and Edge. You will need to have one of these browsers installed on your system in order to use Selenium.
The Selenium Python library: You will need to install the Selenium Python library in order to use Selenium in your Python code. You can install it using pip, the Python package manager, by running the following command (a quick import check after this list confirms the install went through):
pip install selenium
A web driver: In order to control a web browser with Selenium, you will need to have a web driver installed on your system. A web driver is a piece of software that acts as a bridge between Selenium and the web browser, allowing Selenium to send commands to the browser and receive responses.
Different web browsers require different web drivers, so you will need to install the appropriate driver for the browser you are using. You can find more information about how to install web drivers on the Selenium documentation website.
Basic knowledge of Python: To use Selenium in your Python code, you will need to have a basic understanding of Python programming. This includes understanding concepts such as variables, loops, and functions, as well as how to write and execute Python code.
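Before moving on, a quick import check (optional) confirms the library installed correctly:
import selenium

print(selenium.__version__)  # prints the installed Selenium version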
After installing the web driver, import Selenium and create a driver object that points to the driver executable, using the code below:
from selenium import webdriver
driver = webdriver.Chrome(executable_path="PATH TO CHROMEDRIVER EXECUTABLE")
Running the code above opens a new browser window that’s being controlled by Selenium. We will use it to render a page with all the rows of data we need. As we did in Part 1, we will do it for one year first to make sure our program works properly before creating a loop for all of the years.
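Note: on newer Selenium releases (4.x and later), the executable_path argument has been removed in favour of a Service object, and you can optionally run the browser headless so no window opens. A rough equivalent, assuming the same chromedriver path, would be:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # optional: run without opening a visible window
driver = webdriver.Chrome(service=Service("PATH TO CHROMEDRIVER EXECUTABLE"), options=options)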
import time
year = 1991
url = player_stats_url.format(year)
# render url in the browser
driver.get(url)
# tell the browser to scroll down so the entire table renders
driver.execute_script("window.scrollTo(1, 1000)")
time.sleep(2)
# get the html of the page
html = driver.page_source
We then save the rendered HTML page, which contains all 300+ rows. First I created a folder called PLAYERS that will contain all the HTML pages we will download. You can name it however you want.
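If you would rather create the folder from code than by hand, a one-liner does it (this simply mirrors the folder name used below):
import os

os.makedirs("PLAYERS", exist_ok=True)  # create the folder if it does not already exist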
with open("PLAYERS/{}.html".format(year), "w+", encoding="utf-8") as f:
f.write(html)Open the downloaded HTML page to confirm it captured all the rows in the table. If successful, create a loop to download the pages for all the years from 1991 to 2021. We use a time delay of 2 seconds because there are many rows of data being parsed.
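If you already built a years list in Part 1 (along with the requests, BeautifulSoup, and pandas imports), the loop below reuses it; otherwise it is simply the range of seasons we are covering:
years = list(range(1991, 2022))  # 1991 through 2021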
for year in years:
    url = player_stats_url.format(year)
    # render the url in the browser
    driver.get(url)
    # tell the browser to scroll down so the entire table renders
    driver.execute_script("window.scrollTo(1, 1000)")
    time.sleep(2)
    # get the html of the page
    html = driver.page_source
    with open("PLAYERS/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(html)
After downloading all the pages, it is time to parse the stats with BeautifulSoup. When we look at the structure of the table, we notice that the row headers are repeated within the table after every 20 records.

This will be confusing when the table is loaded into pandas, so we will use the decompose() method to remove all of the repeated header rows, keeping only the first one. Upon inspection, the repeated header rows are <tr> tags with a class of thead, while the table itself has an id of per_game_stats.
The dataframes are stored in a list called player_df.
player_df = []
for year in years:
    with open("PLAYERS/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()
    # create a parser object to extract the table from the page
    soup = BeautifulSoup(page, "html.parser")
    # remove the repeated header rows inside the table
    for header in soup.find_all("tr", class_="thead"):
        header.decompose()
    # ignore all other page elements and only find the specific table we want
    player_table = soup.find(id="per_game_stats")
    # convert the table into a string for read_html;
    # it returns a list of dataframes, so take the first index
    player = pd.read_html(str(player_table))[0]
    player["Year"] = year
    player_df.append(player)
We then use the pandas concat() function to combine all of the player stats before viewing a sample of the data to ensure it worked properly.
players = pd.concat(player_df)
# view sample of data
pd.set_option('display.max_columns', None)  # display all column names
players.sample(5)
Here is a sample of the output:

Save the dataframe as a CSV file.
players.to_csv("player_stats.csv")
The next thing we’ll do is scrape the team data, which will be important in helping us make predictions. The URL we’ll use is below; the curly braces represent the year. As before, create a folder (here called TEAM) to hold the pages, then run the code to download all the HTML pages from 1991 to 2021.
team_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_standings.html"
# scraping the data
for year in years:
    url = team_stats_url.format(year)
    data = requests.get(url)
    with open("TEAM/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(data.text)
Here’s a sample of the page with the table whose data we will be extracting. We will use the Division standings tables.

There are two separate tables that we need to scrape, the Eastern and Western Conference standings, and we will do this using the BeautifulSoup library. You can check the table elements using Inspect in your browser (right click > Inspect) or Ctrl+Shift+I.
dfs = []
for year in years:
    with open("TEAM/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()
    soup = BeautifulSoup(page, 'html.parser')
    # remove the repeated header row
    soup.find('tr', class_="thead").decompose()
    # Eastern Conference
    e_table = soup.find_all(id="divs_standings_E")[0]
    e_df = pd.read_html(str(e_table))[0]
    e_df["Year"] = year
    e_df["Team"] = e_df["Eastern Conference"]
    del e_df["Eastern Conference"]
    dfs.append(e_df)
    # Western Conference
    w_table = soup.find_all(id="divs_standings_W")[0]
    w_df = pd.read_html(str(w_table))[0]
    w_df["Year"] = year
    w_df["Team"] = w_df["Western Conference"]
    del w_df["Western Conference"]
    dfs.append(w_df)
Afterwards, we combine both sets of dataframes into one called teams and then view a sample of the data.
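The combining step is a single concat over the dfs list we built above, followed by a quick look at the result (a minimal sketch):
teams = pd.concat(dfs)
# view a sample of the combined team data
teams.sample(5)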
Save the data as a CSV file.
teams.to_csv("teams.csv")
Now that we have gathered all of the information we require, Part 3 of this project series will focus on cleaning the data we collected. By now you should have player data, team data, and MVP data.
If you’re yet to get started, Part 1 of this series is linked below; click on it and follow me so you don’t miss out on future posts.
See you in Part 3!!