The ultimate goal of this project is to predict the MVP for 2022. I scraped the data from Basketball-Reference.com, and I'll provide links to all of the pages I scrape from to make it easier for you to follow along. The project is also on my GitHub page.
The following are the tools I used:
Jupyter Notebook for writing code
Python, Pandas
Requests library
Selenium for scraping JavaScript pages
Beautiful Soup
I will do my best to comment on my code and also leave explanations for my approaches within this article. Let’s begin!
I scraped data from 1991 to 2021. We start by creating a list of all the years using the range function. Because range excludes its stop value, I use range(1991, 2022) to generate the years 1991 through 2021. The end parameter of print replaces the default trailing newline so the list prints on a single line, letting you confirm the output at a glance instead of scrolling.
years = list(range(1991, 2022))
print(years, end=" ")
The next step is to find the page with the data we need and assign its URL to a variable. This is the format of the original URL (https://www.basketball-reference.com/awards/awards_1991.html). Replace the year with any year between 1991 and 2021 to access that year's page.
url_start = "https://www.basketball-reference.com/awards/awards_{}.html"
The curly braces in the URL are a placeholder for the years we will iterate through as we download the HTML pages. We will use the requests library to download the web pages we want. If you're using Jupyter Notebook, install the requests library by running !pip install requests in a cell.
Create a folder called MVP or your preferred name. Make sure it is in the same directory as your notebook; otherwise, you will have to write the full path to the folder. You can also create it from within the notebook, as shown in the sketch below.
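A minimal sketch of creating the folder from the notebook (my own addition; it assumes you keep the folder name MVP used throughout this article):
import os

# create the MVP folder next to the notebook if it doesn't already exist
os.makedirs("MVP", exist_ok=True)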
import requests # used to request and download each web page
import time

for year in years:
    # create the url for a specific year
    url = url_start.format(year)
    data = requests.get(url)
    # "w+" opens the file for writing and overwrites it if it already exists
    with open("MVP/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(data.text) # .text holds the page's HTML
    # pause for 3 seconds before the next request
    time.sleep(3)
To avoid character encoding errors, specify the encoding as utf-8. The time.sleep(3) call pauses for 3 seconds after downloading each year's page to avoid overloading the website's server. You can increase the number of seconds if you wish.
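As a quick sanity check (my own addition, assuming the MVP folder used above), you can list the downloaded files and confirm that all 31 years are present:
import os

# one HTML file per year, 1991.html through 2021.html
files = sorted(os.listdir("MVP"))
print(len(files)) # should print 31
print(files[:3]) # first few file names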
Each page has a table similar to the one below.

This is the table whose data we will extract for each year. We parse the table using the Beautiful Soup library. Install it using !pip install beautifulsoup4. We will first parse and extract data from a single page to ensure that we are doing it correctly, and then repeat for the remaining years.
# import beautiful soup
from bs4 import BeautifulSoup

# read the HTML data
with open("MVP/1991.html", encoding="utf-8") as f:
    page = f.read()

# create a BeautifulSoup object to parse the page
soup = BeautifulSoup(page, "html.parser")
To scrape the data, we need to find the tag elements of the table. We do this by inspecting the webpage: right-click anywhere on the page and select Inspect. This displays the HTML code for the page, and hovering over the code highlights the corresponding elements of the page.

From the table, we can see that there is an extra header row. We need to remove this row because when we load the table data in pandas, it will create an extra header row that is unnecessary. Afterward, we find the specific table whose data we want to extract.
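If you are unsure which table is which, one way to orient yourself (my own addition, not part of the original walkthrough) is to list the id of every table Beautiful Soup finds on the page:
# print the id of each table Beautiful Soup finds on the page
print([table.get("id") for table in soup.find_all("table")])
# the MVP voting table has id="mvp"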
We use Beautiful Soup's find function to find the header element, which is in a <tr> tag with a class of "over_header". Because class is a reserved keyword in Python and using it as an argument name would cause a syntax error, Beautiful Soup expects class_ with a trailing underscore instead.
# remove the top row of the table
soup.find("tr", class_="over_header").decompose()
print("Header row removed successfully")
# find the specific table we want using its id
mvp_table = soup.find(id="mvp")
The decompose function removes a tag along with its inner content. We find the table we want using its id, because an id is globally unique in HTML: only one element on a page should have it.
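One caveat (my own note): if you re-run this cell, the over_header row is already gone, so find returns None and calling decompose on it raises an AttributeError. A re-run-safe version looks like this:
# only remove the header row if it is still present
over_header = soup.find("tr", class_="over_header")
if over_header is not None:
    over_header.decompose()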
Finally, we read the table into pandas. pd.read_html expects a string of HTML rather than a Beautiful Soup tag, so we convert mvp_table with the str() function. The result is a list of data frames rather than a single one, so we take the first element with [0].
import pandas as pd
mvp_1991 = pd.read_html(str(mvp_table))[0]
mvp_1991
The output above shows that we have successfully extracted the data for 1991. So we will do the same for the rest of the years and then merge the data frames into one dataset. All the data frames will be stored in a list called all_dfs. You can give it a different name if you wish.
One observation I made is that it is impossible to tell which year each row came from, so I added a Year column to each data frame to keep track.
all_dfs = []
for year in years:
    # read the HTML data
    with open("MVP/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()
    # create a BeautifulSoup object to parse the page
    soup = BeautifulSoup(page, "html.parser")
    # remove the top row of the table
    soup.find("tr", class_="over_header").decompose()
    # ignore all other page elements and keep only the table we want
    mvp_table = soup.find(id="mvp")
    # read the table into a pandas dataframe
    mvp = pd.read_html(str(mvp_table))[0]
    # create a year column so we know where each row came from
    mvp["Year"] = year
    all_dfs.append(mvp)
We then combine all of the dataframes into a single dataframe called mvps, which we save as a CSV file.
mvps = pd.concat(all_dfs)
pd.set_option('display.max_columns', None) # display all column names
mvps.to_csv("mvps.csv", index=False) # save the combined data as a csv file
mvps.sample(5)
Here's a sample of the final dataframe with the Year column included.

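As a final check (my own addition; mvps.csv is the filename chosen above), you can reload the CSV and confirm the full range of years survived the round trip:
# reload the saved file and verify it covers 1991 to 2021
check = pd.read_csv("mvps.csv")
print(check.shape)
print(check["Year"].min(), "-", check["Year"].max())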
The data we have gathered so far comes only from the MVP voting tables. We need stats for all players to determine which properties are associated with players likely to win the MVP award. In the next part of this project, we will download the player stats and introduce Selenium for scraping JavaScript pages.
Follow me so you don’t miss out on part 2!