Introduction
1.1 Background information
1.2 Concept
1.3 Libraries
Data Collection
2.1 Data scraping
2.2 Load data
Analysis
3.1 Height of players Overview
3.2 Height changed over time
3.3 Height, points, wins
3.4 Analysis in different time periods
Test on relationship
4.1 Test on height vs. winning rate
4.2 Test on height vs. points
Conclusion
On January 31, 2020, at the Rockets’ home arena, Toyota Center, James Harden led the Rockets to a 128-121 victory over the Dallas Mavericks. What made this game special is that the Rockets became the first NBA team to play an entire game without a player listed taller than 6-foot-6 since the New York Knicks in a Jan. 31, 1963, loss to the Chicago Zephyrs. Before the game, the Rockets' starting center, the 2.08-meter Clint Capela, was injured. This left the Rockets with only substitutes Tyson Chandler and Isaiah Hartenstein at the center position.
Is using a small lineup a last resort for the Rockets? Or does the NBA not need big guys anymore?
This tutorial shows our understanding of data analysis and how to apply it in the real world. Specifically, we aim to test a "rumor" that "Small Ball" is becoming more and more popular in the NBA. Is it true? We are going to verify this claim based on how the average height of each team has changed over the years.
We would like to thank Basketball-Reference (https://www.basketball-reference.com/), which covers player statistics from 1950 to 2020. As loyal basketball fans (especially of the NBA), we have always wanted to investigate the league's statistics to reveal trends and answer specific questions.
The NBA (National Basketball Association) was established in 1949 after the merger of two leagues: the Basketball Association of America (BAA) and the National Basketball League (NBL). Today the NBA is widely regarded as the highest level of basketball in the world.
More than a decade ago, perimeter stars rarely sought out mismatches against their opponents' big men. More often they went one-on-one against the perimeter defenders assigned to them. For example, Kobe faced a series of so-called "Kobe Stoppers": Raja Bell, Ruben Patterson, and Shane Battier. In the 2001 Finals, Tyronn Lue stuck to Allen Iverson. There was no scene in which Iverson took the ball and isolated against O'Neal near the three-point line, or Kobe attacked Duncan in the high post.
However, in recent years, a group of new-style guards represented by Curry and Harden has increasingly hunted small-on-big mismatches. They create these mismatches deliberately through the pick and roll, and their success rate is quite high.
In the 2018 playoffs, Curry attacked Gobert of the Jazz several times, and Harden also frequently drove past Kanter of the Thunder in the high post. Most teams seem to have no good answer. Either the big men grit it out on the perimeter, or the team lowers its height and lets wing-sized players play inside to speed up. In last season's Finals, there was even a stretch where neither side had a center on the floor. A big reason is that a big man on the court becomes a target for the opposing guards.
"Small Ball" is a concept in basketball. It is the style of having more "smaller" players on the team. These players tend to be faster, more agile, and higher scoring compared to "big" players, who are taller and physically stronger. Players like Isaiah Thomas, Stephen Curry, Kyrie Irving, and Klay Thompson are familiar to audiences today, and they are representative of "Small Ball."
Load the Python libraries we will use for this tutorial.
· Python 3.5
· Pandas
· Requests
· BeautifulSoup
· re
· os
· matplotlib
· seaborn
· statsmodels
· pickle
import pandas as pd
import requests
import string
import re
import os
from bs4 import BeautifulSoup as bs
import pickle
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import statsmodels.api as sm
import statsmodels.formula.api as smf
# sns.set(rc={'figure.figsize':(24,10)})
# Create the working directory if it does not exist yet, then switch into it.
os.makedirs("./final_2", exist_ok=True)
os.chdir("./final_2")
As the world's No. 1 basketball league, the NBA began to record player statistics for each game as early as the early 1970s. At the time, the statistics were still handwritten on paper by the staff of each team. Nowadays, NBA statistics can be easily downloaded from the Internet. But the data on different websites are not identical, which can introduce inconsistency into an analysis. All the data we selected come from the website Basketball-Reference. We believe that the data on this website are the most reliable because they all come from the official NBA database and are widely used in the NBA data analysis field.
Because our analysis is based on the height of the players of each NBA team, we need the statistics of each team in each season, including various technical statistics, the number of wins and losses, and the average height of players. To get these data more efficiently, we wrote a small web crawler. It downloads the statistics of each team for each season from the teams directory on the Basketball-Reference website.
teams_url = 'https://www.basketball-reference.com/teams/'
teams_page = requests.get(teams_url)
soup = bs(teams_page.content, 'html.parser')
# Links to the active franchises; each one contains a path like /teams/ATL/.
team_links = soup.select('#teams_active a')
# The third "/"-separated piece of each link is the team abbreviation.
tm_url = [str(lk).split("/")[2] for lk in team_links]

# Download the first table of each team's yr_yr page and cache it as a pickle.
pattern_stats_per_game = 'https://www.basketball-reference.com/teams/TEAM_KURZ/stats_per_game_yr_yr.html'
for tm in tm_url:
    print(tm)
    tmp_team = pattern_stats_per_game.replace("TEAM_KURZ", tm)
    tmp_dfs = pd.read_html(tmp_team)[0]
    with open(tm + ".pkl", 'wb') as f:
        pickle.dump(tmp_dfs, f)

# Do the same for each team's totals page.
num_pattern_url = 'https://www.basketball-reference.com/teams/TEAM_KURZ/stats_per_game_totals.html'
for tm in tm_url:
    print(tm)
    tmp_team = num_pattern_url.replace("TEAM_KURZ", tm)
    tmp_dfs = pd.read_html(tmp_team)[0]
    with open(tm + "num_per_game.pkl", 'wb') as f:
        pickle.dump(tmp_dfs, f)
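The abbreviation step in the crawler relies on every team link having the form `/teams/XXX/`. As a minimal sketch under that assumption, the same step could be written with a regular expression that skips links of any other shape. The helper name `team_codes` and the sample hrefs are hypothetical and only for illustration; they are not part of the original crawler.

```python
import re

# Assumption: team links look like "/teams/ATL/" (three capital letters).
TEAM_HREF = re.compile(r"^/teams/([A-Z]{3})/$")

def team_codes(hrefs):
    """Return the team abbreviations found in a list of href strings."""
    codes = []
    for href in hrefs:
        match = TEAM_HREF.match(href)
        if match:  # skip links that are not team pages
            codes.append(match.group(1))
    return codes

# Hypothetical sample input, not scraped data.
sample = ["/teams/ATL/", "/teams/BOS/", "/leagues/NBA_2020.html"]
print(team_codes(sample))
```

A stricter pattern like this fails quietly on unexpected links instead of producing a wrong abbreviation, which may be preferable if the page layout changes.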
After getting the basic data, let us load it. When loading the data, we remove the extra columns and keep only the 1970 and later seasons. We believe that NBA games before 1970 were also very exciting and representative. However, because the statistics of that era were recorded independently by each team and handwritten on paper, the data set lacks reliability and consistency. Therefore, we decided to use only data from 1970 onward.
team_change = None
for tm in tm_url:
    with open(tm + ".pkl", 'rb') as f:
        ts = pickle.load(f)
    # Drop the unnamed filler columns.
    ts = ts[[c for c in ts.columns if c.find("Unname") < 0]]
    # Drop repeated header rows, then keep seasons starting in 1970 or later.
    ts = ts.loc[ts['Season'] != 'Season'].copy()
    ts['Year'] = ts['Season'].apply(lambda x: x.split("-")[0])
    ts = ts.loc[ts.Year >= '1970']
    team_change = pd.concat([team_change, ts], axis=0)
team_change = team_change.loc[team_change.Lg == 'NBA'].copy()
# Convert every statistic column to float; Year stays a string so that
# it can be parsed with format='%Y' below.
obj_cols = ['Season', 'Lg', 'Tm', 'Year']
for col in team_change.columns:
    if col not in obj_cols:
        team_change[col] = [float(str(v).replace("%", '')) for v in team_change[col].values]
team_change['Year'] = pd.to_datetime(team_change.Year, format='%Y')
team_change
team_num = None

def ft2m(x):
    # Convert a height listed as feet-inches (e.g. '6-6') to meters.
    x = x.split('-')
    return int(x[0]) * 0.3048 + int(x[1]) * 0.0254

for tm in tm_url:
    with open(tm + "num_per_game.pkl", 'rb') as f:
        ts = pickle.load(f)
    # Drop the unnamed filler columns.
    ts = ts[[c for c in ts.columns if c.find("Unname") < 0]]
    # Drop repeated header rows, then keep seasons starting in 1970 or later.
    ts = ts.loc[ts['Season'] != 'Season'].copy()
    ts['Year'] = ts['Season'].apply(lambda x: x.split("-")[0])
    ts = ts.loc[ts.Year >= '1970']
    team_num = pd.concat([team_num, ts], axis=0)
team_num = team_num.loc[team_num.Lg == 'NBA'].copy()
# Convert every statistic column to float; Ht. and Year are handled below.
obj_cols = ['Season', 'Lg', 'Tm', 'Ht.', 'Year']
for col in team_num.columns:
    if col not in obj_cols:
        team_num[col] = [float(str(v).replace("%", '')) for v in team_num[col].values]
team_num['Year'] = pd.to_datetime(team_num.Year, format='%Y')
team_num['Ht.'] = team_num['Ht.'].apply(ft2m)
team_num
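As a quick check of the unit conversion, the sketch below repeats the feet-inches arithmetic on two hand-picked listings. One foot is 0.3048 m and one inch is 0.0254 m. The name `feet_inches_to_m` is ours and simply stands in for `ft2m`; the two listings are examples, not rows from the data.

```python
def feet_inches_to_m(listing):
    """Convert a height such as '6-6' (feet-inches) to meters."""
    feet, inches = listing.split("-")
    return int(feet) * 0.3048 + int(inches) * 0.0254

# 6-6 is the cutoff mentioned in the introduction; 6-10 is about 2.08 m.
for listing in ["6-6", "6-10"]:
    print(listing, round(feet_inches_to_m(listing), 2))
```

So the 6-foot-6 threshold from the Rockets game corresponds to roughly 1.98 m, which helps when reading the height axes in the plots that follow.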
The NBA is full of talented players. Analyzing the data of just a few players is not good evidence of a trend across the entire NBA, so we take the team as the unit of our analysis. Now let's take a look at how tall these sports giants are, and what height most of them have. Maybe their height is not too different from yours and mine.
team_num.head()
The horizontal axis is the average height of a team in a season (in meters), and the vertical axis is the density, that is, the relative share of team-seasons at that height.
plt.figure(figsize=(12,9)) #
# sns.set() # for style
sns.distplot(team_num['Ht.'], bins=10)
plt.title("Histogram of Average Height of Player on the Court") # for histogram title
plt.xlabel('Height')
plt.show()
Ignoring time, the average height of most teams is between 1.95 and 2.01 meters. The histogram has three main pillars: around 1.95, around 1.98, and around 2.00. This distribution is not surprising. Most point guards and shooting guards are between 1.90 and 2 meters tall. They need agility and the ability to move, so excessive height is a disadvantage. Centers, power forwards, and small forwards are relatively taller.
Has there been a significant change in the average height of NBA teams in the 50 seasons since 1970? Let us use the following line chart to see how the average height changes over time. The horizontal axis of the graph represents the year, and the vertical axis represents the average height.
team_num['W_pct'] = team_num.W /(team_num.L + team_num.W) * 100
plt.figure(figsize=(12, 9))
plt.plot(team_num.loc[team_num['W_pct'] > 50].groupby('Year')[['Ht.']].mean(), color = 'blue')
plt.plot(team_num.loc[team_num['W_pct'] <= 50].groupby('Year')[['Ht.']].mean(), color = 'red')
plt.title('Height in Time')
plt.ylabel(' Average Height(m)')
plt.show()
Over time, average height peaked around 1985 and has now reached a low point.
The highest average player height in NBA history appears around 1985. You may think that this is because the 1980s was a time when big men competed, but the facts are a little different. The height peak in 1985 is the result of the popularity of centers in the 1970s and even the 1960s. At that time, play in the paint was not restricted, so all teams were looking for big players to strengthen their interior game. After reaching the peak, the average height quickly slipped. If you followed the NBA of that era, you will know that it was the era of the Lakers and the Celtics. Larry Bird and Magic Johnson changed NBA strategy for the first time. The league shifted its focus from defense to dazzling passing and smooth offense. Although both players are around 2.06 meters tall, they have strong athletic talent and are very flexible. Around 1985 the NBA began to lean towards a balance of offense and defense and rapid rotation of player positions, but physical play in the paint remained very important.
Since 2000, the height of players has increased again. The peak at this stage appeared around 2003. In 2001, to limit O’Neal’s dominance in the paint, the league introduced the defensive three-second rule. O'Neal still led the Lakers to their third consecutive championship in 2002. At the same time, there were many great big men, such as Kevin Garnett of the Timberwolves, Tim Duncan of the Spurs, and Dirk Nowitzki of the Mavericks. Yao Ming, nicknamed the "Little Giant," joined the Rockets in 2002. Howard was selected by the Magic in 2004.
Starting around 2014, the blue curve (teams with a winning percentage above 50%) began to fall below the red curve (teams with a winning percentage of 50% or lower). This suggests that lower height is gradually becoming an advantage for achieving a high winning rate. This has happened twice in the past 15 years, the previous time between 2008 and 2010, and each episode lasted about three seasons. Before 2008 it was rare and, when it did happen, brief.
The change in height is not large across the history of the NBA. However, the combination of decreasing height and a higher winning rate suggests that using smaller or more flexible players is raising teams' winning rates. The large drop in player height in the 2019-2020 season suggests that more teams have begun to implement small ball strategies.
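The comparison between the two curves can also be expressed as a single number per season: the mean height of winning teams minus the mean height of the other teams. The sketch below shows one way to compute it with pandas. The toy table and the `group` and `gap` names are hypothetical; in the tutorial the same columns (`Year`, `Ht.`, `W_pct`) would come from `team_num`.

```python
import pandas as pd

# Hypothetical toy data: one row per team-season.
toy = pd.DataFrame({
    "Year":  [2013, 2013, 2014, 2014],
    "Ht.":   [2.01, 1.99, 1.98, 2.00],
    "W_pct": [60.0, 40.0, 55.0, 45.0],
})

# Label each team-season by whether it won more than half of its games.
toy["group"] = (toy["W_pct"] > 50).map({True: "winning", False: "losing"})

# Mean height per year for each group, one column per group.
by_group = toy.groupby(["Year", "group"])["Ht."].mean().unstack("group")

# Positive gap: winning teams are taller; negative gap: they are shorter.
gap = by_group["winning"] - by_group["losing"]
print(gap)
```

A run of negative values in `gap` corresponds to the seasons where the blue curve sits below the red one.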
plt.figure(figsize =(12, 9))
change_ht = team_change.groupby('Year')[['Ht.']].agg(['mean', 'median', "min", "max"])#.
change_ht.columns =[i[1] for i in change_ht.columns]
change_ht.plot(figsize = (12, 8))
plt.ylabel('Change in %')
plt.title("Trends on Height of Players on the Court")
plt.show()
There is a significant decreasing trend in the average height of players on the court in recent years.
### Points
plt.figure(figsize =(12, 9))
sns.distplot(team_num['PTS'], bins=30)
plt.vlines(team_num['PTS'].median(), ymin = 0 , ymax = 0.06, label = 'median')
plt.vlines(team_num['PTS'].mean(), ymin = 0 , ymax = 0.07,color = 'red', label = 'mean')
plt.title("Histogram of Average PTS") # for histogram title
plt.xlabel('Points')
plt.legend()
plt.show()
Points are asymmetrically distributed, with mean and median around 103.
def get_period(x):
    a = 2
    if x < pd.to_datetime(1990, format="%Y"):
        a = 0
    elif x < pd.to_datetime(2010, format="%Y"):
        a = 1
    return a
team_num['period'] = team_num['Year'].apply(lambda x: get_period(x))
We classify the records into three time periods: period 0 (before 1990), period 1 (1990 to 2009), and period 2 (2010 onward).
The NBA before 1990 and the NBA today are very different in terms of rules, the overall physical fitness of players, and technical and tactical theory. Therefore, we treat the NBA before 1990 as a single stage. From 1990 to 2010, it was an era that emphasized interior offense and defense, and teams built their lineups around the center. Although the situation began to change after 2005, when a small number of teams started to implement small ball strategies, they were still exploring and experimenting with skills and tactics. The period that shows the advantage of the small ball strategy starts in 2010.
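The same three-way split can be expressed with `pd.cut`, which assigns each value to a bin. This is only a sketch of an alternative, assuming plain integer years rather than the datetime values used in the tutorial; the sample years are hypothetical.

```python
import pandas as pd

# Hypothetical sample of season starting years.
years = pd.Series([1975, 1989, 1990, 2009, 2010, 2019])

# Left-closed bins: [-inf, 1990) -> 0, [1990, 2010) -> 1, [2010, inf) -> 2.
period = pd.cut(
    years,
    bins=[float("-inf"), 1990, 2010, float("inf")],
    right=False,
    labels=[0, 1, 2],
).astype(int)

print(period.tolist())
```

Setting `right=False` matters here: it places 1990 in period 1 and 2010 in period 2, matching the strict `<` comparisons in `get_period`.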
plt.figure(figsize =(12, 9))
sns.scatterplot(data=team_num, x='Year', y='PTS', hue = 'period', palette="deep")
plt.title('Points in Time')
plt.show()
In terms of points, there are two peaks: around 1985 and now. They coincide with the two notable moments for height: the height peak around 1985 and the low point now.
Starting around 1990, the NBA's defensive intensity gradually increased, which led to a decline in all teams' points. In 2001, the league began to implement the three-second rule, which pushed team scoring up. After 2010, the popularity of small players and three-pointers pushed scoring up again.
plt.figure(figsize =(12, 9))
#sns.scatterplot(data=team_num.loc[team_num['period'] == 2], x='Ht.', y='PTS', hue = 'period', palette="Set2", style='period')
sns.scatterplot(data=team_num, x='Ht.', y='PTS', hue = 'period', palette="Set2", style='period')
plt.title('Height and Points')
plt.show()
team_num.loc[team_num['PTS'] > 120] # high point team
As long as their height is not extremely low, small teams are likely to score high. At the same time, taller teams are unlikely to score low.
team_num['W_pct'] = team_num.W /(team_num.L + team_num.W) * 100
plt.figure(figsize =(12, 9))
sns.scatterplot(data=team_num, x='Ht.', y='W_pct',hue = 'period', palette="Set2",style='period')
plt.title('Height and Win Rate')
plt.ylabel('Win Rate %')
plt.show()
team_num.loc[team_num['W_pct'] > 80] # high win rate team
Regarding win rate, however, small ball does not produce a better win rate in general. Still, considering the small absolute number of teams in the small ball pillar shown above, we may ask whether the probability of getting a high win rate with small ball is relatively high.
cols_plot = ['MP', 'FG', 'FGA', 'FG%', '3P',
'3PA', '3P%', '2P', '2PA', '2P%',
'FT', 'FTA', 'FT%', 'ORB', 'DRB',
'TRB', 'AST', 'STL', 'BLK', 'TOV',
'PF','PTS']
team_num.groupby('period')[cols_plot].mean().T
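n_rows = -(-len(cols_plot) // 3)  # ceiling division so every statistic gets a panel
fig, ax = plt.subplots(ncols=3, nrows=n_rows, figsize=(18, 24))
plt.subplots_adjust(top=0.99, bottom=0.01, hspace=0.5, wspace=0.4)
# One bar chart per statistic: its mean value in each period.
for name, axis in zip(cols_plot, ax.flat):
    team_num.groupby('period')[name].mean().plot(
        kind='bar', title=name, ax=axis, color=['red', 'blue', 'black'])
    axis.set_xlabel('')
# Hide the panels left over when the grid has more cells than statistics.
for axis in ax.ravel()[len(cols_plot):]:
    axis.set_visible(False)
plt.show()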
There are obvious features in today's NBA:
Quick ball:
Easy ball:
team_num['W_pct'] = team_num.W /(team_num.L + team_num.W) * 100
plt.figure(figsize =(12, 9))
sns.scatterplot(data=team_num, x='3PA', y='2P%', hue = 'period', palette="Set2",style='period')
Let's look at the relationship between the number of three-point attempts (3PA) and the two-point field goal percentage (2P%). These two seemingly unrelated statistics are the key to the small ball strategy. We can see from the figure that the number of three-point attempts is positively correlated with the two-point percentage. This phenomenon appears only in the third stage of our division, period 2. What is the cause of this phenomenon?
The first cause is that changes to the NBA's three-second rule and zone defense rules have made it more difficult for interior players such as centers to score. Star players also face more double-teams off the ball because of the cancellation of the zone defense rules. The best way to break the defense in this situation is to increase the space between the defenders. More three-point shots draw defenders out to the three-point line and create more room for two-point offense such as drives and layups. This raises the two-point percentage and makes the offense more efficient.
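The visual impression from the scatter plot can be checked numerically by computing the correlation between 3PA and 2P% separately within each period. The sketch below uses a small hypothetical table with the same column names; the numbers are made up to illustrate the calculation and are not results from the data.

```python
import pandas as pd

# Hypothetical toy data; in the tutorial these columns come from team_num.
toy = pd.DataFrame({
    "period": [1, 1, 1, 2, 2, 2],
    "3PA":    [10.0, 12.0, 14.0, 25.0, 30.0, 35.0],
    "2P%":    [0.48, 0.47, 0.48, 0.50, 0.52, 0.54],
})

# Pearson correlation between 3PA and 2P%, computed within each period.
corr_by_period = {
    period: group["3PA"].corr(group["2P%"])
    for period, group in toy.groupby("period")
}
print(corr_by_period)
```

A value near zero in one period and a clearly positive value in another would be the numerical counterpart of a pattern that shows up in only one cluster of the plot.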
Next, we test the relationship between height and win rate and between height and points, with period as an interaction term.
model_df = team_num[[ 'Ht.', 'W_pct', 'period', 'PTS']].copy()
model_df.columns = ['Ht', 'W_pct', 'period','PTS']
res = smf.ols(formula='W_pct ~ Ht + C(period) + Ht : C(period)', data= model_df).fit()
res.summary()
The result suggests:
res2 = smf.ols(formula='PTS ~ Ht + C(period) + Ht : C(period) ', data= model_df).fit()
res2.summary()
The result suggests:
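One way to read models of this form: because each period gets its own intercept and its own slope, the fitted slope of height in a given period equals the slope of an ordinary least-squares line fitted to that period's rows alone. In the regression output, the coefficient on `Ht` is the slope in the baseline period, and each interaction coefficient is the difference in slope from that baseline. The sketch below fits the per-period lines directly with NumPy; the toy numbers are hypothetical and chosen only to make the slopes easy to verify.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the shape of model_df.
toy = pd.DataFrame({
    "Ht":     [1.96, 1.98, 2.00, 1.96, 1.98, 2.00],
    "PTS":    [100.0, 102.0, 104.0, 112.0, 110.0, 108.0],
    "period": [0, 0, 0, 2, 2, 2],
})

# Fit PTS = intercept + slope * Ht separately within each period.
slopes = {}
for period, group in toy.groupby("period"):
    slope, intercept = np.polyfit(group["Ht"], group["PTS"], deg=1)
    slopes[period] = slope

print(slopes)
```

In this made-up example the slope is positive in one period and negative in the other, which is exactly the kind of difference an interaction term is meant to capture.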
The prevalence of the small ball strategy reflects the progress of the game, but it is also a result of changes in the rules. Compared with the past, the overall shooting ability of guards is now far better. When a big man is switched onto a guard after a screen, the guard can score easily if the big man does not step out or is too slow. NBA playing styles on both offense and defense are changing: modern basketball pays more attention to shooting, speed, and the transition between offense and defense, and height is no longer decisive. In particular, the height of centers and power forwards is decreasing significantly. Lineups with a power forward playing center, or with five small players on the floor at the same time, now appear not only in the regular season but also in the fierce competition of the playoffs. It is foreseeable that, as the small ball strategy takes deeper root, the average height of the NBA will decrease further.