Social power, influence, and performance in the NBA, Part 2

Exploring the individual NBA players

Python, pandas, and a touch of R


Getting started

In Part 1 of this series, you learned about the basics of data science and machine learning. You used Jupyter Notebook, pandas, and scikit-learn to explore the relationship between NBA teams and their valuation. Here, you will explore the relationship between social media, salary, and on-the-court performance for NBA players.

Create a unified data frame (Warning: hard work ahead!)

To get started, create a new Jupyter Notebook and name it nba_player_power_influence_performance.

Next, load all of the data about players and merge the data into a single unified data frame.

Manipulating several data frames is a good example of the "80 percent of the work" in data science: the hard, unglamorous data preparation. In listings 1 and 2, the data is loaded and the plus-minus data frame is cleaned up by renaming columns.

Listing 1. Setting up Jupyter Notebook and loading data frames

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
color = sns.color_palette()
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
%matplotlib inline
attendance_valuation_elo_df = pd.read_csv("../data/nba_2017_att_val_elo.csv")
salary_df = pd.read_csv("../data/nba_2017_salary.csv")
pie_df = pd.read_csv("../data/nba_2017_pie.csv")
plus_minus_df = pd.read_csv("../data/nba_2017_real_plus_minus.csv")
br_stats_df = pd.read_csv("../data/nba_2017_br.csv")

Listing 2. Fixing bad data in the PLAYER column of the plus-minus data frame

plus_minus_df.rename(columns={"NAME":"PLAYER"}, inplace=True)
players = []
for player in plus_minus_df["PLAYER"]:
    plyr, _ = player.split(",")
    players.append(plyr)
plus_minus_df.drop(["PLAYER"], inplace=True, axis=1)
plus_minus_df["PLAYER"] = players

The output of the commands that rename the NAME column to PLAYER is shown below. The extra column is also dropped. Note that inplace=True applies the rename and the drop to the existing data frame rather than returning a modified copy.
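As a quick illustration of the inplace behavior (the tiny frame below is made up for demonstration and is not part of the NBA data):

```python
import pandas as pd

# Hypothetical two-row frame, for illustration only
demo_df = pd.DataFrame({"NAME": ["Player A", "Player B"], "RPM": [1.5, -0.3]})

# inplace=True mutates the existing frame and returns None,
# so there is no need to reassign the result
demo_df.rename(columns={"NAME": "PLAYER"}, inplace=True)
demo_df.drop(["RPM"], inplace=True, axis=1)

print(demo_df.columns.tolist())  # ['PLAYER']
```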

Figure 1. NBA dataset load and describe
Image shows output of commands

The next step is to rename and merge the core data frame that holds the majority of the stats from Basketball Reference. To do this, use the code provided in listings 3 and 4.

Listing 3. Rename and merge basketball reference data frame

nba_players_df = br_stats_df.copy()
nba_players_df.rename(columns={'Player': 'PLAYER','Pos':'POSITION', 'Tm': "TEAM", 'Age': 'AGE'}, inplace=True)
nba_players_df.drop(["G", "GS", "TEAM"], inplace=True, axis=1)
nba_players_df = nba_players_df.merge(plus_minus_df, how="inner", on="PLAYER")

Listing 4. Clean up and merge PIE fields

pie_df_subset = pie_df[["PLAYER", "PIE", "PACE"]].copy()
nba_players_df = nba_players_df.merge(pie_df_subset, how="inner", on="PLAYER")

Figure 2 shows the output of the merged data frames. Splitting a column into parts and re-creating it, as was done with the PLAYER column, is a typical operation and takes up much of the time spent on data manipulation in data science problems.

Figure 2. Merge PIE data frames
Image shows 5 rows x 37 columns of data

Up until now, most of the data manipulation tasks have been relatively straightforward. Things are about to get more difficult because there are missing records: as listings 5 and 6 show, 111 players are missing salary records. One way to deal with this is to do a merge that drops the missing rows. There are many techniques for handling missing data, and just dropping missing rows, as shown in the example, is not always the best choice. The Kaggle competition Titanic: Machine Learning from Disaster contains many examples of dealing with missing data; it is well worth the time to explore a few example notebooks there.
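As a sketch of the alternatives, the same pandas API can either drop or impute missing values (the small frame below is invented for illustration):

```python
import numpy as np
import pandas as pd

# Invented example frame with one missing salary
df = pd.DataFrame({"PLAYER": ["A", "B", "C"],
                   "SALARY_MILLIONS": [25.0, np.nan, 12.5]})

# Option 1: drop rows with missing values (what the inner merge effectively does)
dropped = df.dropna()

# Option 2: impute the column median instead of losing the row
filled = df.fillna({"SALARY_MILLIONS": df["SALARY_MILLIONS"].median()})

print(len(dropped))                        # 2
print(filled["SALARY_MILLIONS"].tolist())  # [25.0, 18.75, 12.5]
```

Imputing preserves the other columns' information for that player at the cost of a fabricated salary value; dropping is simpler but shrinks the data set.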

Listing 5. Clean up salary

salary_df.rename(columns={'NAME': 'PLAYER'}, inplace=True)
salary_df.drop(["POSITION","TEAM"], inplace=True, axis=1)

In Listing 6, a set difference is used to find the players that appear in one data frame but not the other, which is a handy trick for determining how two data frames differ. The Python built-in function len(), commonly used in regular Python programming to get the length of a list, then counts the missing records.

Listing 6. Find missing records and merge

diff = list(set(nba_players_df["PLAYER"].values.tolist()) - set(salary_df["PLAYER"].values.tolist()))
len(diff)

Out[45]:  111
nba_players_with_salary_df = nba_players_df.merge(salary_df)

The output is shown below.

Figure 3. Difference between data frames
Image shows difference between data frames

With the data frame merges complete, it's time to create a correlation heatmap to discover which features are correlated. The heatmap below shows the combined output of the correlation of 35 columns and 342 rows. A couple of things immediately pop out: salary is highly correlated with both points and WINS_RPM, an advanced statistic that estimates the wins a player adds to his team by being on the court.

Another interesting correlation is that Wikipedia page views are strongly correlated with Twitter Favorite counts. This correlation makes sense intuitively because they are both measures of engagement and popularity of NBA players by fans. This is an example of how a visualization can help nail down which features will go into a machine learning model.

Figure 4. NBA player correlation heatmap: 2016-2017 season (stats and salary)
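A heatmap like this one can be generated with seaborn along the following lines. The data frame below is a small synthetic stand-in for nba_players_with_salary_df, so only the plotting calls mirror the tutorial:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the merged player data frame
rng = np.random.RandomState(0)
df = pd.DataFrame({"POINTS": rng.rand(50) * 30,
                   "WINS_RPM": rng.rand(50) * 15,
                   "SALARY_MILLIONS": rng.rand(50) * 30})

plt.subplots(figsize=(10, 8))
ax = plt.axes()
ax.set_title("NBA Player Correlation Heatmap: 2016-2017 Season (STATS & SALARY)")
corr = df.corr()  # pairwise correlation of all numeric columns
sns.heatmap(corr, xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
```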

With some initial discovery of what features are correlated, the next step is to further discover relationships in the data by plotting in Seaborn. Commands executed to run the plot are shown below.

Listing 7. Seaborn lmplot of salary versus WINS_RPM

sns.lmplot(x="SALARY_MILLIONS", y="WINS_RPM", data=nba_players_with_salary_df)

In the plot output shown below, there appears to be a strong linear relationship between salary and WINS_RPM. To further investigate this, run a linear regression.

Figure 5. Seaborn lmplot of salary and wins real plus minus
Image shows salary millions x axis, wins y axis

The output of two linear regressions on wins is below. One of the more interesting findings is that wins are explained better by WINS_RPM than by points: the R-squared (goodness of fit) is 0.324 for WINS_RPM versus 0.200 for points. WINS_RPM is the statistic that shows the individual wins attributed to a player. It makes sense that a more advanced statistic that accounts for defense, offense, and time on the court is more predictive than a purely offensive statistic.

An example of how this could play out in practice is to imagine a player who scores a lot of points but has a very low shooting percentage. If he shoots the ball often, instead of a teammate with a higher shooting percentage, it could cost his team wins. This case played out in real life during the 2015-16 season, when Kobe Bryant, in his last year with the Los Angeles Lakers, averaged 17.6 points per game but shot only 41 percent on two-pointers. The team ended up winning only 17 games, and his WINS_RPM was 0.66 (only about half a win attributed to his play during the season).

Figure 6. Linear regression wins
Image shows linear wins data

Listing 8. Regression of wins on points and on WINS_RPM

results = smf.ols('W ~ POINTS', data=nba_players_with_salary_df).fit()
print(results.summary())
results = smf.ols('W ~ WINS_RPM', data=nba_players_with_salary_df).fit()
print(results.summary())

Another way to represent this relationship graphically is with ggplot in Python. Listing 9 is an example of how to set up the plot. The Python library is a direct port of ggplot in R and is in active development. As of this writing, it isn't as smooth to use as the R original, but it has a lot of nice features. The graph is shown below.
Note: A handy feature is the ability to represent an additional continuous variable with color.

Listing 9. Python ggplot

from ggplot import *
p = ggplot(nba_players_with_salary_df,aes(x="POINTS", y="WINS_RPM", color="SALARY_MILLIONS")) + geom_point(size=200)
p + xlab("POINTS/GAME") + ylab("WINS/RPM") + ggtitle("NBA Players 2016-2017:  POINTS/GAME, WINS REAL PLUS MINUS and SALARY")
Figure 7. Python ggplot plus minus salary points
Image shows wins/rpm x axis, points/game y axis

Grabbing Wikipedia page views for NBA players

The next task is to figure out how to collect Wikipedia page views, which is a typically messy data-collection job. Problems include:

  1. Figuring out how to retrieve the data from Wikipedia (or some website)
  2. Figuring out how to programmatically generate Wikipedia handles
  3. Writing the data into a data frame and joining it to the rest of the data

The code below is in the GitHub repository for this tutorial. Comments about this code are throughout the sections below.

Listing 10 provides the code to construct a Wikipedia URL that returns a JSON response. The route to construct is shown in a comment at the top of part 1; this is the URL the code calls to get the page view data.

Listing 10. Wikipedia, part 1

# Example route to construct:
# https://wikimedia.org/api/rest_v1/
#   metrics/pageviews/per-article/
#   en.wikipedia/all-access/user/
#   LeBron_James/daily/2015070100/2017070500

import requests
import pandas as pd
import time
import wikipedia

# Base of the Wikimedia REST API pageviews endpoint
BASE_URL = ("https://wikimedia.org/api/rest_v1/"
            "metrics/pageviews/per-article/en.wikipedia/all-access/user")


def construct_url(handle, period, start, end):
    """Constructs a pageviews URL from its parts

    Example URL to construct:
    .../LeBron_James/daily/2015070100/2017070500
    """

    urls = [BASE_URL, handle, period, start, end]
    constructed = str.join('/', urls)
    return constructed

def query_wikipedia_pageviews(url):

    res = requests.get(url)
    return res.json()

def wikipedia_pageviews(handle, period, start, end):
    """Returns JSON"""

    constructed_url = construct_url(handle, period, start,end)
    pageviews = query_wikipedia_pageviews(url=constructed_url)
    return pageviews

In Listing 10, part 2, Wikipedia handles are created by first guessing that the handle is simply the player's first and last name, for example, LeBron_James, and then trying the "(basketball)" suffix if that guess fails. The first-and-last-name guess matches close to 80 percent of the Wikipedia pages and saves the time of finding the URLs one by one. For the roughly 20 percent of names that don't fit this pattern, there is another method (shown below) that matches about 80 percent of those initial misses.

Appending "(basketball)" lets Wikipedia differentiate a basketball player from other famous people with the same name, and this convention catches the majority of the names that did not match initially. Listing 10, part 2 shows these handle-creation helpers.

Listing 10. Wikipedia, part 2

def wikipedia_2016(handle,sleep=0):
    """Retrieve pageviews for 2016""" 
    print("SLEEP: {sleep}".format(sleep=sleep))
    pageviews = wikipedia_pageviews(handle=handle, 
            period="daily", start="2016010100", end="2016123100")
    if 'items' not in pageviews:
        print("NO PAGEVIEWS: {handle}".format(handle=handle))
        return None
    return pageviews

def create_wikipedia_df(handles):
    """Creates a Dataframe of Pageviews"""

    pageviews = []
    timestamps = []
    names = []
    wikipedia_handles = []
    for name, handle in handles.items():
        pageviews_record = wikipedia_2016(handle)
        if pageviews_record is None:
            continue
        for record in pageviews_record['items']:
            pageviews.append(record['views'])
            timestamps.append(record['timestamp'])
            names.append(name)
            wikipedia_handles.append(handle)
    data = {
        "names": names,
        "wikipedia_handles": wikipedia_handles,
        "pageviews": pageviews,
        "timestamps": timestamps
    }
    df = pd.DataFrame(data)
    return df

def create_wikipedia_handle(raw_handle):
    """Takes a raw handle and converts it to a wikipedia handle"""

    wikipedia_handle = raw_handle.replace(" ", "_")
    return wikipedia_handle

def create_wikipedia_nba_handle(name):
    """Appends basketball to link"""

    url = " ".join([name, "(basketball)"])
    return url

In Listing 10, part 3, guessing a handle is made easier by having access to a roster of players. This portion of the code runs the matching code shown above against the entire NBA roster collected earlier in the tutorial.

Listing 10. Wikipedia, part 3

def wikipedia_current_nba_roster():
    """Gets all links on wikipedia current roster page"""

    links = {}
    nba = wikipedia.page("List_of_current_NBA_team_rosters")
    for link in nba.links:
        links[link] = create_wikipedia_handle(link)
    return links

def guess_wikipedia_nba_handle(data="data/nba_2017_br.csv"):
    """Attempt to get the correct wikipedia handle"""

    links = wikipedia_current_nba_roster() 
    nba = pd.read_csv(data)
    count = 0
    verified = {}
    guesses = {}
    for player in nba["Player"].values:
        if player in links:
            print("Player: {player}, Link: {link}".format(player=player,
                link=links[player]))
            count += 1
            verified[player] = links[player] #add wikipedia link
        else:
            print("NO MATCH: {player}".format(player=player))
            guesses[player] = create_wikipedia_handle(player)
    return verified, guesses

In Listing 10, part 4, the entire script runs, using a CSV file as input and writing another CSV file as output. Note that the wikipedia Python library is used to inspect the page summary for the word "NBA" in the final matches; this is the last check for pages that have failed multiple guessing techniques. The result of all of these heuristics is a relatively reliable way to get the Wikipedia handles for NBA athletes. You could imagine using a similar technique for other sports.

Listing 10. Wikipedia, part 4

def validate_wikipedia_guesses(guesses):
    """Validate guessed wikipedia accounts"""

    verified = {}
    wrong = {}
    for name, link in guesses.items():
        try:
            page = wikipedia.page(link)
        except (wikipedia.DisambiguationError, wikipedia.PageError) as error:
            #try basketball suffix
            print("Initial wikipedia URL Failed: {error}".format(error=error))
            nba_handle = create_wikipedia_nba_handle(name)
            try:
                page = wikipedia.page(nba_handle)
            except (wikipedia.DisambiguationError, wikipedia.PageError) as error:
                print("Second Match Failure: {error}".format(error=error))
                wrong[name] = link
                continue
        if "NBA" in page.summary:
            verified[name] = link
        else:
            print("NO GUESS MATCH: {name}".format(name=name))
            wrong[name] = link
    return verified, wrong

def clean_wikipedia_handles(data="data/nba_2017_br.csv"):
    """Clean Handles"""

    verified, guesses = guess_wikipedia_nba_handle(data=data)
    verified_cleaned, wrong = validate_wikipedia_guesses(guesses)
    print("WRONG Matches: {wrong}".format(wrong=wrong))
    handles = {**verified, **verified_cleaned}
    return handles

def nba_wikipedia_dataframe(data="data/nba_2017_br.csv"):
    handles = clean_wikipedia_handles(data=data)
    df = create_wikipedia_df(handles)    
    return df

def create_wikipedia_csv(data="data/nba_2017_br.csv"):
    df = nba_wikipedia_dataframe(data=data)
    df.to_csv("data/wikipedia_nba.csv")  # output path is illustrative

if __name__ == "__main__":
    create_wikipedia_csv()

Grabbing Twitter engagement for NBA players

Now you need the Twitter library so you can download tweets for NBA players. Listing 11, part 1 shows an example session using this code, followed by the code itself. The full Twitter API is considerably more involved than the simple script shown below; hiding that complexity is one of the advantages of using a third-party library that has been developed for years.

Listing 11. Twitter extract metadata, part 1

Example session: getting Twitter statuses for a user

df = stats_df(user="KingJames")
In [34]: df.describe()
       favorite_count  retweet_count
count      200.000000     200.000000
mean     11680.670000    4970.585000
std      20694.982228    9230.301069
min          0.000000      39.000000
25%       1589.500000     419.750000
50%       4659.500000    1157.500000
75%      13217.750000    4881.000000
max     128614.000000   70601.000000

In [35]: df.corr()
                favorite_count  retweet_count
favorite_count        1.000000       0.904623
retweet_count         0.904623       1.000000


import time

import twitter
from . import config
import pandas as pd
import numpy as np
from twitter.error import TwitterError

def api_handler():
    """Creates connection to Twitter API"""
    # the remaining credentials are assumed to be defined in config
    # alongside CONSUMER_KEY
    api = twitter.Api(consumer_key=config.CONSUMER_KEY,
                      consumer_secret=config.CONSUMER_SECRET,
                      access_token_key=config.ACCESS_TOKEN_KEY,
                      access_token_secret=config.ACCESS_TOKEN_SECRET)
    return api

def tweets_by_user(api, user, count=200):
    """Grabs the "n" number of tweets.  Defaults to 200"""

    tweets = api.GetUserTimeline(screen_name=user, count=count)
    return tweets

In this next section, the tweets are pulled down and converted into a pandas data frame, and each player's engagement is summarized as a median. This is an excellent way to compress the data by storing only the values of interest (here, the median of a set of data). The median is a useful metric because it is robust against outliers.
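A two-line sketch shows why: one viral tweet drags the mean far from typical engagement, while the median barely moves (the counts below are invented for illustration).

```python
import numpy as np

# Hypothetical favorite counts for one player: typical tweets plus one viral outlier
favorites = np.array([1200, 1500, 1800, 2100, 128000])

print(np.mean(favorites))    # 26920.0 -- distorted by the single outlier
print(np.median(favorites))  # 1800.0  -- still reflects typical engagement
```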

Listing 11. Twitter extract metadata, part 2

def stats_to_df(tweets):
    """Takes twitter stats and converts them to a dataframe"""

    records = []
    for tweet in tweets:
        records.append({"created_at": tweet.created_at,
                        "screen_name": tweet.user.screen_name,
                        "retweet_count": tweet.retweet_count,
                        "favorite_count": tweet.favorite_count})
    df = pd.DataFrame(data=records)
    return df

def stats_df(user):
    """Returns a dataframe of stats"""

    api = api_handler()
    tweets = tweets_by_user(api, user)
    df = stats_to_df(tweets)
    return df

def twitter_handles(sleep=.5,data="data/twitter_nba_combined.csv"):
    """yield handles"""

    nba = pd.read_csv(data) 
    for handle in nba["twitter_handle"]:
        time.sleep(sleep) #Avoid throttling in twitter api
        try:
            df = stats_df(handle)
        except TwitterError as error:
            print("Error {handle} and error msg {error}".format(
                handle=handle, error=error))
            df = None
        yield df

def median_engagement(data="data/twitter_nba_combined.csv"):
    """Median engagement on twitter"""

    favorite_count = []
    retweet_count = []
    nba = pd.read_csv(data)
    for record in twitter_handles(data=data):
        #None records stored as Nan value
        if record is None:
            print("NO RECORD: {record}".format(record=record))
            favorite_count.append(np.nan)
            retweet_count.append(np.nan)
            continue
        try:
            favorite_count.append(record['favorite_count'].median())
            retweet_count.append(record['retweet_count'].median())
        except KeyError as error:
            print("No values found to append {error}".format(error=error))
            favorite_count.append(np.nan)
            retweet_count.append(np.nan)
    print("Creating DF")
    nba['twitter_favorite_count'] = favorite_count
    nba['twitter_retweet_count'] = retweet_count
    return nba

def create_twitter_csv(data="data/nba_2016_2017_wikipedia.csv"):
    nba = median_engagement(data)
    nba.to_csv("data/nba_2016_2017_wikipedia_twitter.csv")  # output path is illustrative

Creating advanced visualizations

With the addition of social media data, you can create more advanced plots with additional insights. Figure 8 is a correlation heatmap of a compressed set of key features. These features are a great building block for further machine learning, such as clustering (see Part 1 of this series). It is worth experimenting with different clustering configurations on this data on your own.

Figure 8. NBA player endorsement, social power, on-court performance, team valuation correlation heatmap: 2016-17 season

Listing 12 provides the code to create the correlation heatmap.

Listing 12. Correlation heatmap

endorsements = pd.read_csv("../data/nba_2017_endorsement_full_stats.csv")
ax = plt.axes()
ax.set_title("NBA Player Endorsement, Social Power, On-Court Performance, Team Valuation Correlation Heatmap:  2016-2017 Season")
corr = endorsements.corr()
sns.heatmap(corr, xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, cmap="copper")

Listing 13 shows a heatmap whose colors are mapped on a log scale, along with a special color map. This is a great trick for providing distinct contrast between cells. A log scale shows relative change rather than absolute change, which is a common technique when the values span large magnitudes, for example, 10 and 10 million. On a log scale, equal steps correspond to equal multiplicative factors, so a curve that would shoot upward on a linear scale flattens out. Showing the relative change, rather than the absolute change, adds clarity to such a plot.
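A small numeric sketch makes the compression concrete: values spanning seven orders of magnitude collapse to single digits on a log scale.

```python
import numpy as np

values = np.array([10, 1_000, 10_000_000])

# Linear spread is dominated entirely by the largest value
print(values.max() - values.min())  # 9999990

# Log spread treats each factor-of-100 step equally
log_values = np.log10(values)       # approximately [1, 3, 7]
print(log_values.max() - log_values.min())  # 6.0
```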

Listing 13. Correlation heatmap advanced

from matplotlib.colors import LogNorm
pd.set_option('display.float_format', lambda x: '%.3f' % x)
norm = LogNorm()
ax = plt.axes()
grid = endorsements.select_dtypes([np.number])
ax.set_title("NBA Player Endorsement, Social Power, On-Court Performance, Team Valuation Heatmap:  2016-2017 Season")
sns.heatmap(grid, annot=True, yticklabels=endorsements["PLAYER"],
            fmt='g', cmap="Accent", cbar=False, norm=norm)
Figure 9. NBA player endorsement, social power, on-court performance, team valuation heatmap: 2016-17 season
Image shows heatmap

One last plot uses the R language to create a multi-dimensional plot in ggplot. This is shown in Listing 14 and Figure 10. The native ggplot library in R is a powerful and unique charting library that can create multiple dimensions with color, size, facets, and shapes. The ggplot library in R is well worth the time to explore on your own.

Listing 14. Advanced R-based ggplot

ggplot(nba_players_stats, aes(x=WINS_RPM, y=PAGEVIEWS,
        color=SALARY_MILLIONS, size=TWITTER_FAVORITE_COUNT)) +
    geom_point() +
    geom_smooth() +
    scale_color_gradient2(low = "blue", mid = "grey", high = "red",
        midpoint = 15) +
    labs(y="Wikipedia Median Daily Pageviews",
        x="WINS Attributed to Player (WINS_RPM)",
        title = "Social Power NBA 2016-2017 Season: Wikipedia Daily Median Pageviews and Wins Attributed to Player (Adjusted Plus Minus)") +
    geom_text(check_overlap = TRUE,
        data=subset(nba_players_stats,
            SALARY_MILLIONS > 25 | PAGEVIEWS > 4500 | WINS_RPM > 15),
        aes(WINS_RPM, label=PLAYER)) +
    annotate("text", x=8, y=13000, label="NBA Fans Value Player Skill More Than Salary, Points, Team Wins or Any Other Factor?", size=5) +
    annotate("text", x=8, y=11000, label=paste("PAGEVIEWS/WINS Correlation: 28%"), size=4) +
    annotate("text", x=8, y=10000, label=paste("PAGEVIEWS/POINTS Correlation: 44%"), size=4) +
    annotate("text", x=8, y=9000, label=paste("PAGEVIEWS/WINS_RPM Correlation: 49%"), size=4, color="red") +
    annotate("text", x=8, y=8000, label=paste("SALARY_MILLIONS/TWITTER_FAVORITE_COUNT: 24%"), size=4)
Figure 10. NBA player social power: 2016-17 season
Image shows chart with end result NBA fans value player skill more than salary, points, or other factors


Conclusion

In Part 1 of this series, you learned the basics of machine learning and used unsupervised clustering techniques to explore team valuations. The tools used for that data science were Python, Jupyter Notebook, and advanced graphing libraries.

Here in Part 2, you explored the players and their relationship with social media, influence, salary, and on-the-court performance. Many advanced graphs were created in a Jupyter Notebook, but there was also a brief touch of R.

Some observations that surfaced and need further investigation (they might be wrong assumptions):

  • Salary paid to players isn't the best predictor of wins.
  • Fans engage more with highly skilled athletes than with highly paid ones.
  • Endorsement income correlates with how many wins a player's team has, so players may want to be careful about which team they switch to.
  • The audience that attends games in person appears to be different from the audience that engages on social media, and the in-person audience seems bothered if its team is unskilled.

There's more you can do. Try applying both supervised and unsupervised machine learning to the data set provided in GitHub. I have also uploaded the data set to Kaggle so you can experiment with this project there.
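As a starting sketch (not the tutorial's own code), an unsupervised pass with the KMeans class already imported in Listing 1 might look like this. The data frame below is synthetic, and the feature choice and cluster count are assumptions to replace with the merged player data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged player data frame
rng = np.random.RandomState(42)
df = pd.DataFrame({"WINS_RPM": rng.rand(100) * 15,
                   "SALARY_MILLIONS": rng.rand(100) * 30,
                   "PAGEVIEWS": rng.rand(100) * 5000})

# Scale first: KMeans is distance-based, so the large PAGEVIEWS
# values would otherwise dominate the clustering
scaled = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled)
df["CLUSTER"] = kmeans.labels_
print(df.groupby("CLUSTER").mean())
```

Comparing the per-cluster means is a quick way to see whether the groups separate along skill, pay, or popularity.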

