NFL Data Web Scrape and Analysis

Synopsis

I'm not much into sports, but I do love data. So I figured a good example project would be scraping and analyzing NFL data. In its simplest form, the steps are: scope out the data you want to scrape, develop a plan to do so, and then run some statistical analysis on that data.


Phase One:

I found a really good resource for NFL data on pro-football-reference.com. They have an extensive collection of NFL team data going all the way back to 1922. I'll be accessing data from 2000 to 2020. They structured the site really well, and some understanding of HTML will make it clear why.
When we look at the site we see a whole mess of tables, including conference standings, team offense, passing, rushing, drives, etc. Our first task here is to identify the tables we want to web-scrape. For starters, let's scrape the standings tables. It usually pays to get one case working first, because if you can do it once you can do it several times.

Scraping the data from one page:


Code Editor: 

'''
We need four imports for this web-scrape. 
'''
import pandas as pd  #data science library 
from bs4 import BeautifulSoup, Comment  #HTML parser library 
import requests  #sleek HTTP library to interact with web-pages
import re  #Regex library in Python 

'''
The first thing we do here is request the web page with the requests library. 
This sends an HTTP GET request and gives us the raw HTML of the page. 
'''
page = requests.get('https://www.pro-football-reference.com/years/2000/')

'''
Next we parse the page content into a bs4 object, which we'll use to extract 
the HTML content we want. 
'''
soup = BeautifulSoup(page.content, 'lxml') 

'''
The bs4 object contains the page's tables. find_all('table') locates every 
<table> tag in the soup, from its start tag to its end tag, and returns them 
as a list-like ResultSet of tag objects. 
'''
tables = soup.find_all('table')

'''
If we print tables[0] we see the first table in HTML format. Pandas' 
read_html parses every table in the string we hand it and returns a list of 
dataframes, so index 0 is the AFC standings. 
'''
afc_standings_df = pd.read_html(str(tables))[0]

'''
We can do the same for the second table. 
'''
nfc_standings_df = pd.read_html(str(tables))[1]
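As an aside, pd.read_html is doing the heavy lifting in those last two lines: it parses every <table> it finds in the HTML it's given and returns a list of dataframes. A minimal self-contained sketch, with a made-up two-row table, shows the behavior:

```python
from io import StringIO

import pandas as pd

# A made-up standings fragment with a header row and two data rows
html = """
<table>
  <tr><th>Tm</th><th>W</th><th>L</th></tr>
  <tr><td>Miami Dolphins</td><td>11</td><td>5</td></tr>
  <tr><td>New York Jets</td><td>9</td><td>7</td></tr>
</table>
"""
# read_html returns one dataframe per <table> found in the input
dfs = pd.read_html(StringIO(html))
print(len(dfs))      # 1
print(dfs[0].shape)  # (2, 3)
```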



Python Console:

'''
We have now extracted and can view the first two dataframes. 
'''

In[]: afc_standings_df.head()
Out[]: 
                    Tm         W         L      W-L%        PF        PA  \
0             AFC East  AFC East  AFC East  AFC East  AFC East  AFC East   
1      Miami Dolphins*        11         5      .688       323       226   
2  Indianapolis Colts+        10         6      .625       429       326   
3        New York Jets         9         7      .563       321       321   
4        Buffalo Bills         8         8      .500       315       350   

         PD       MoV       SoS       SRS      OSRS      DSRS  
0  AFC East  AFC East  AFC East  AFC East  AFC East  AFC East  
1        97       6.1       1.0       7.1       0.0       7.1  
2       103       6.4       1.5       7.9       7.1       0.8  
3         0       0.0       3.5       3.5       1.4       2.2  
4       -35      -2.2       2.2       0.0       0.5      -0.5 

In[]: nfc_standings_df.head()
Out[]: 
                     Tm         W         L      W-L%        PF        PA  \
0              NFC East  NFC East  NFC East  NFC East  NFC East  NFC East   
1      New York Giants*        12         4      .750       328       246   
2  Philadelphia Eagles+        11         5      .688       351       245   
3   Washington Redskins         8         8      .500       281       269   
4        Dallas Cowboys         5        11      .313       294       361   

         PD       MoV       SoS       SRS      OSRS      DSRS  
0  NFC East  NFC East  NFC East  NFC East  NFC East  NFC East  
1        82       5.1      -2.7       2.4      -1.3       3.8  
2       106       6.6      -3.6       3.1       1.0       2.1  
3        12       0.8       0.2       1.0      -2.9       3.8  
4       -67      -4.2      -0.4      -4.6      -1.5      -3.0  
Getting the rest of the dataframes is a little more tricky because they are embedded in the HTML comments of the page.
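To see why the extra work is needed, here is a tiny self-contained illustration (the markup is made up): a table wrapped in an HTML comment is invisible to find_all, but its raw HTML is still recoverable from the comment text.

```python
from bs4 import BeautifulSoup, Comment

# A made-up page fragment: the table is commented out, the way
# pro-football-reference hides its lower tables in the page source
html = """
<div>
  <!-- <table id="team_stats"><tr><td>hidden</td></tr></table> -->
</div>
"""
soup = BeautifulSoup(html, 'html.parser')  #stdlib parser; lxml works too

# the commented-out markup is not part of the tag tree...
print(len(soup.find_all('table')))  # 0

# ...but the comment node still holds the raw table HTML as text
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
print('<table' in comments[0])      # True
```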

Code Editor: 

'''
Now let's scrape the Team Offense dataframe. To do this start by extracting all 
the comments in the bs4 object.
'''
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

'''
Next we compile a regex to pull out the table with a specific id. 
The pattern matches an opening <table tag, lazily skips ahead (.+?) to the 
id attribute (id="team_stats"), continues lazily to the end of the opening 
tag (>), then matches any character including newlines ([\s\S]+?, since . 
does not match newlines by default) up to the closing </table> tag. 
'''
rx = re.compile(r'<table.+?id="team_stats".+?>[\s\S]+?</table>')

'''
Then we loop over the comments, running the regex against each one. 
rx.search returns None when there is no match, and calling .group(0) on 
None raises an AttributeError; we catch that and keep going. Once the 
pattern matches, we break out of the loop. 
'''
for c in comments:
    try:
        table = rx.search(c.string).group(0)
        # break the loop if found
        break
    except AttributeError:  #rx.search found no match in this comment
        pass

'''
The table is in HTML format. Convert it with Pandas. The output is a list, 
so call the dataframe by its index—here it is 0. 
'''
df_team_stats = pd.read_html(str(table))[0]

'''
Now let's get Passing Offense using the same methods. 
'''
rx = re.compile(r'<table.+?id="passing".+?>[\s\S]+?</table>')

for c in comments:
    try:
        table = rx.search(c.string).group(0)
        # break the loop if found
        break
    except AttributeError:  #rx.search found no match in this comment
        pass

df_passing_stats = pd.read_html(str(table))[0]
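Since this search-the-comments loop will be repeated for every hidden table, it can be folded into a small helper. This is a sketch of my own (the helper name and the looser attribute matching are mine, not part of the original code):

```python
import re
from io import StringIO

import pandas as pd

def extract_commented_table(comments, table_id):
    """Return the first commented-out <table> with the given id as a
    dataframe, or None if no comment contains a matching table."""
    rx = re.compile(r'<table[^>]*id="{}"[^>]*>[\s\S]+?</table>'.format(table_id))
    for c in comments:
        match = rx.search(c)
        if match:
            # parse just the matched table HTML into a dataframe
            return pd.read_html(StringIO(match.group(0)))[0]
    return None

# usage, with the comments list built as above:
# df_team_stats = extract_commented_table(comments, 'team_stats')
# df_passing_stats = extract_commented_table(comments, 'passing')
```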
Let's look at the heads of the dataframes we just scraped.

Python Console:

In[]: df_team_stats.head()
Out[]: 
  Unnamed: 0_level_0  Unnamed: 1_level_0 Unnamed: 2_level_0  \
                  Rk                  Tm                  G   
0                1.0      St. Louis Rams               16.0   
1                2.0      Denver Broncos               16.0   
2                3.0     Oakland Raiders               16.0   
3                4.0  Indianapolis Colts               16.0   
4                5.0   Minnesota Vikings               16.0   

  Unnamed: 3_level_0 Unnamed: 4_level_0 Tot Yds & TO             \
                  PF                Yds          Ply  Y/P    TO   
0              540.0             7075.0       1014.0  7.0  35.0   
1              485.0             6554.0       1115.0  5.9  25.0   
2              479.0             5776.0       1023.0  5.6  20.0   
3              429.0             6141.0       1026.0  6.0  29.0   
4              397.0             5961.0        958.0  6.2  28.0   

  Unnamed: 8_level_0 Unnamed: 9_level_0 Passing                             \
                  FL               1stD     Cmp    Att     Yds    TD   Int   
0               12.0              380.0   380.0  587.0  5232.0  37.0  23.0   
1               13.0              383.0   354.0  569.0  4243.0  28.0  12.0   
2                9.0              337.0   284.0  475.0  3306.0  28.0  11.0   
3               14.0              357.0   357.0  571.0  4282.0  33.0  15.0   
4               10.0              319.0   307.0  495.0  3832.0  33.0  18.0   

              Rushing                           Penalties               \
  NY/A   1stD     Att     Yds    TD  Y/A   1stD       Pen    Yds 1stPy   
0  8.3  247.0   383.0  1843.0  26.0  4.8  112.0     111.0  942.0  21.0   
1  7.1  223.0   516.0  2311.0  21.0  4.5  124.0      89.0  792.0  36.0   
2  6.6  177.0   520.0  2470.0  23.0  4.8  128.0     118.0  940.0  32.0   
3  7.2  213.0   435.0  1859.0  14.0  4.3  111.0      89.0  866.0  33.0   
4  7.2  193.0   428.0  2129.0  14.0  5.0  107.0     106.0  908.0  19.0   

  Unnamed: 25_level_0 Unnamed: 26_level_0 Unnamed: 27_level_0  
                  Sc%                 TO%                 EXP  
0                46.8                16.7               49.41  
1                38.3                11.7              -40.21  
2                39.2                10.6               11.03  
3                40.4                15.2               47.85  
4                39.7                14.9              -60.12  

In[]: df_passing_stats.head()
Out[]: 
    Rk                   Tm     G    Cmp    Att  Cmp%     Yds    TD  TD%  \
0  1.0       St. Louis Rams  16.0  380.0  587.0  64.7  5232.0  37.0  6.3   
1  2.0   Indianapolis Colts  16.0  357.0  571.0  62.5  4282.0  33.0  5.8   
2  3.0       Denver Broncos  16.0  354.0  569.0  62.2  4243.0  28.0  4.9   
3  4.0  San Francisco 49ers  16.0  366.0  583.0  62.8  4239.0  32.0  5.5   
4  5.0   Kansas City Chiefs  16.0  342.0  582.0  58.8  4149.0  29.0  5.0   

    Int  Int%   Lng  Y/A  AY/A   Y/C    Y/G  Rate    Sk  Yds.1  NY/A  ANY/A  \
0  23.0   3.9  85.0  9.4   8.9  14.5  327.0  99.7  44.0  260.0   8.3    7.8   
1  15.0   2.6  78.0  7.7   7.7  12.4  267.6  94.7  20.0  131.0   7.2    7.2   
2  12.0   2.1  61.0  7.8   7.9  12.6  265.2  94.2  30.0  221.0   7.1    7.1   
3  10.0   1.7  69.0  7.5   7.9  12.0  264.9  97.0  25.0  161.0   7.0    7.3   
4  15.0   2.6  81.0  7.6   7.4  12.9  259.3  88.5  34.0  259.0   6.7    6.6   

   Sk%  4QC  GWD     EXP  
0  7.0  1.0  3.0  120.25  
1  3.4  2.0  3.0   88.16  
2  5.0  2.0  3.0   37.99  
3  4.1  NaN  NaN   53.79  
4  5.5  3.0  3.0   36.06  

'''
Notice there are unnamed column headers in df_team_stats. This is because 
it is multi-indexed: headers in the first level group the headers in the 
second. Here's what they look like when you call the columns. 
'''
In[]: df_team_stats.columns 
Out[]: 
MultiIndex([( 'Unnamed: 0_level_0',    'Rk'),
            ( 'Unnamed: 1_level_0',    'Tm'),
            ( 'Unnamed: 2_level_0',     'G'),
            ( 'Unnamed: 3_level_0',    'PF'),
            ( 'Unnamed: 4_level_0',   'Yds'),
            (       'Tot Yds & TO',   'Ply'),
            (       'Tot Yds & TO',   'Y/P'),
            (       'Tot Yds & TO',    'TO'),
            ( 'Unnamed: 8_level_0',    'FL'),
            ( 'Unnamed: 9_level_0',  '1stD'),
            (            'Passing',   'Cmp'),
            (            'Passing',   'Att'),
            (            'Passing',   'Yds'),
            (            'Passing',    'TD'),
            (            'Passing',   'Int'),
            (            'Passing',  'NY/A'),
            (            'Passing',  '1stD'),
            (            'Rushing',   'Att'),
            (            'Rushing',   'Yds'),
            (            'Rushing',    'TD'),
            (            'Rushing',   'Y/A'),
            (            'Rushing',  '1stD'),
            (          'Penalties',   'Pen'),
            (          'Penalties',   'Yds'),
            (          'Penalties', '1stPy'),
            ('Unnamed: 25_level_0',   'Sc%'),
            ('Unnamed: 26_level_0',   'TO%'),
            ('Unnamed: 27_level_0',   'EXP')],
           )

'''
We can flatten them to make this easier to work with and read. The following 
joins each header in the first level to its header in the second level 
with a "_". 
'''
In[]: df_team_stats.columns = ['_'.join(col) for col in df_team_stats.columns]
In[]: df_team_stats.columns 
Out[]: 
Index(['Unnamed: 0_level_0_Rk', 'Unnamed: 1_level_0_Tm',
       'Unnamed: 2_level_0_G', 'Unnamed: 3_level_0_PF',
       'Unnamed: 4_level_0_Yds', 'Tot Yds & TO_Ply', 'Tot Yds & TO_Y/P',
       'Tot Yds & TO_TO', 'Unnamed: 8_level_0_FL', 'Unnamed: 9_level_0_1stD',
       'Passing_Cmp', 'Passing_Att', 'Passing_Yds', 'Passing_TD',
       'Passing_Int', 'Passing_NY/A', 'Passing_1stD', 'Rushing_Att',
       'Rushing_Yds', 'Rushing_TD', 'Rushing_Y/A', 'Rushing_1stD',
       'Penalties_Pen', 'Penalties_Yds', 'Penalties_1stPy',
       'Unnamed: 25_level_0_Sc%', 'Unnamed: 26_level_0_TO%',
       'Unnamed: 27_level_0_EXP'],
      dtype='object')

'''
But we're not done fixing the columns. We still have to get rid of the 
"Unnamed:" portions that the first level contributed to the joined names. 
We can do that with Regex, so first we create a new pattern to match the 
unwanted string data in the headers. 

We're trying to match the following, where n is some digit: 
"Unnamed: n_level_n_" (the column number may be one or two digits). 

Be aware that regex is strict. 
'''
#The below means: match "Unnamed: " followed by a one- or two-digit 
#number (0-59), then "_level_", any single digit, and a trailing "_"
In[]: re_col_pat = r'(Unnamed: ([0-9]|[0-5][0-9])_level_\d_)'

'''
Now we iterate through every column header, strip out any match of 
re_col_pat with re.sub, and append the result to a new list of column 
headers. 
'''
In[]: new_cols = []
      for i in df_team_stats.columns:
          res = re.sub(re_col_pat , '', i)
          new_cols.append(res)

'''
Now we replace the old column names with the new ones by zipping both into 
a dictionary and passing it to rename(). The rename() method in Pandas 
takes a mapping of old names to new names, like 
someDf.rename(columns={old_col_1:new_col_1, old_col_2:new_col_2}). 
'''
In[]: new_col_dict = dict(zip(df_team_stats.columns,new_cols))
In[]: df_team_stats.rename(columns=new_col_dict, inplace=True)
In[]: df_team_stats.columns 
Out[]: 
Index(['Rk', 'Tm', 'G', 'PF', 'Yds', 'Tot Yds & TO_Ply', 'Tot Yds & TO_Y/P',
       'Tot Yds & TO_TO', 'FL', '1stD', 'Passing_Cmp', 'Passing_Att',
       'Passing_Yds', 'Passing_TD', 'Passing_Int', 'Passing_NY/A',
       'Passing_1stD', 'Rushing_Att', 'Rushing_Yds', 'Rushing_TD',
       'Rushing_Y/A', 'Rushing_1stD', 'Penalties_Pen', 'Penalties_Yds',
       'Penalties_1stPy', 'Sc%', 'TO%', 'EXP'],
      dtype='object')

'''
Now we'll want to insert a year column, because eventually we're going to 
get all the data from 2000-2020. This data is for 2000, and we'll need to 
differentiate it from other years. 
'''
In[]: df_team_stats.insert(loc=0, column='Yr', value=2000)
In[]: df_team_stats.head()
Out[]: 
     Yr   Rk                  Tm     G     PF     Yds  Tot Yds & TO_Ply  \
0  2000  1.0      St. Louis Rams  16.0  540.0  7075.0            1014.0   
1  2000  2.0      Denver Broncos  16.0  485.0  6554.0            1115.0   
2  2000  3.0     Oakland Raiders  16.0  479.0  5776.0            1023.0   
3  2000  4.0  Indianapolis Colts  16.0  429.0  6141.0            1026.0   
4  2000  5.0   Minnesota Vikings  16.0  397.0  5961.0             958.0   

   Tot Yds & TO_Y/P  Tot Yds & TO_TO    FL   1stD  Passing_Cmp  Passing_Att  \
0               7.0             35.0  12.0  380.0        380.0        587.0   
1               5.9             25.0  13.0  383.0        354.0        569.0   
2               5.6             20.0   9.0  337.0        284.0        475.0   
3               6.0             29.0  14.0  357.0        357.0        571.0   
4               6.2             28.0  10.0  319.0        307.0        495.0   

   Passing_Yds  Passing_TD  Passing_Int  Passing_NY/A  Passing_1stD  \
0       5232.0        37.0         23.0           8.3         247.0   
1       4243.0        28.0         12.0           7.1         223.0   
2       3306.0        28.0         11.0           6.6         177.0   
3       4282.0        33.0         15.0           7.2         213.0   
4       3832.0        33.0         18.0           7.2         193.0   

   Rushing_Att  Rushing_Yds  Rushing_TD  Rushing_Y/A  Rushing_1stD  \
0        383.0       1843.0        26.0          4.8         112.0   
1        516.0       2311.0        21.0          4.5         124.0   
2        520.0       2470.0        23.0          4.8         128.0   
3        435.0       1859.0        14.0          4.3         111.0   
4        428.0       2129.0        14.0          5.0         107.0   

   Penalties_Pen  Penalties_Yds  Penalties_1stPy   Sc%   TO%    EXP  
0          111.0          942.0             21.0  46.8  16.7  49.41  
1           89.0          792.0             36.0  38.3  11.7 -40.21  
2          118.0          940.0             32.0  39.2  10.6  11.03  
3           89.0          866.0             33.0  40.4  15.2  47.85  
4          106.0          908.0             19.0  39.7  14.9 -60.12  

'''
There we have a working dataframe for offensive team stats for the year 2000. 
Let's move on to get all the dataframes on the page without having to rewrite the 
code for every one. 
'''

Phase Two:

We need to look at the HTML and inspect it for the id's of the tables we want to scrape. Inspecting the team offense table shows it has id="team_stats", and the passing offense table has id="passing".

I've already gathered all the table id's for each dataframe. So now let's write our function to scrape all the dataframes on the one page—for year 2000.

I'm also going to add a dictionary that we will use to make the team names uniform across all the years. Between 2000 and 2020 there were four team name changes, either because a team moved to another city/state or because it simply renamed itself. We need the team names to be standardized, and I'm going to do it right when we scrape the data.
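The renaming itself leans on Pandas' replace with a dictionary and regex=True, which substitutes each key wherever it appears inside a cell. A quick sketch on a made-up frame shows why that matters here: the standings append playoff markers like * and + to team names, and substring replacement keeps them intact.

```python
import pandas as pd

# made-up standings column; '*' marks a division winner on the real site
df = pd.DataFrame({'Tm': ['St. Louis Rams*', 'Oakland Raiders', 'New York Jets']})

team_name_change_dict = {
    'St. Louis Rams': 'Los Angeles Rams',
    'Oakland Raiders': 'Las Vegas Raiders',
}
# each key is treated as a pattern and replaced wherever it occurs,
# so the trailing '*' survives the substitution
df.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
print(df['Tm'].tolist())  # ['Los Angeles Rams*', 'Las Vegas Raiders', 'New York Jets']
```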

Code Editor

#LET'S WRITE THE FUNCTION TO TAKE THE TWO THINGS THAT DIFFER FOR EACH 
#HTML PAGE HOUSING THE DATA: THE URL AND THE YEAR.
def get_all_data(url, yr):
    #NEXT DO ALMOST EVERYTHING WE'VE ALREADY DONE. 
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    #REG EX PATTERN FOR REMAPPING COLS
    re_col_pat = r'(Unnamed: ([0-9]|[0-5][0-9])_level_\d_)'
    #TEAM NAME CHANGE DICTIONARY
    team_name_change_dict = {
                         'St. Louis Rams':'Los Angeles Rams',
                         'Oakland Raiders':'Las Vegas Raiders',
                         'San Diego Chargers':'Los Angeles Chargers',
                         'Washington Redskins':'Washington Football Team',
                         }
    #CONVERT AVAILABLE AFC/NFC TABLES (NOT HIDDEN IN COMMENTS) 
    tables = soup.find_all('table') #standings table
    #ITERATE THROUGH TABLES AND SCRAPE AFC/NFC STANDINGS; THE LOOP ITERATES 
    #TWICE, ONCE FOR EACH TABLE IN TABLES
    for i3 in range(len(tables)):
        df_s = pd.read_html(str(tables))[i3]
        df_s.columns = ['_'.join(col) for col in df_s.columns]
        #NEW COL NAMES
        new_cols = []
        for i4 in df_s.columns:
            res = re.sub(re_col_pat, '', i4)
            new_cols.append(res)
        #CREATE NEW COLS/OLD COLS DICT
        new_col_dict = dict(zip(df_s.columns,new_cols))
        #APPLY NEW COL NAMES WITH RENAME()
        df_s.rename(columns=new_col_dict, inplace=True)
        df_s.rename(columns={'T_m':'Tm'}, inplace=True)
        #CHANGE TEAM NAMES SO THEY ARE UNIFORM
        df_s.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
        #INSERT YEAR COL
        df_s.insert(loc=0, column='Yr', value=yr)
        if i3 == 0: 
            df_afc_standings = df_s
        elif i3 == 1:
            df_nfc_standings = df_s
    #NOW WE SCRAPE THE DATAFRAMES HIDDEN IN THE COMMENTS
    #FIRST INSTANTIATE A LIST WITH ALL THE ID'S OF THE HTML TABLES
    df_str_lst = ['team_stats', 'passing', 'rushing', 
                  'returns', 'kicking', 'team_scoring', 
                  'team_conversions', 'drives', 'playoff_results']
    #SCRAPE THE COMMENTS FROM THE PAGE SO WE CAN ITERATE THROUGH THEM AND GRAB THE DATAFRAMES
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    #ITERATE THROUGH DF_STR_LST WITH A FOR LOOP AND FORMAT THE ITEM ITERATING INTO THE REGEX 
    for i in range(len(df_str_lst)):
        rx_compiler_str = r'<table.+?id="{}".+?>[\s\S]+?</table>'.format(df_str_lst[i])
        #SAME AS BEFORE INSTANTIATE A REGEX COMPILED OBJECT AND ITERATE THROUGH THE COMMENTS
        rx = re.compile(rx_compiler_str)
        for comment in comments:
            try:
                table = rx.search(comment.string).group(0)
                # break the loop if found
                break
            except AttributeError:  #NO MATCH IN THIS COMMENT
                pass
        #CONVERT TABLE AND CLEAN THE DF    
        df = pd.read_html(str(table))[0]
        df.columns = ['_'.join(col) for col in df.columns]
        #NEW COL NAMES
        new_cols = []
        for i2 in df.columns:
            res = re.sub(re_col_pat , '', i2)
            new_cols.append(res)
        #CREATE NEW COLS/OLD COLS DICT
        new_col_dict = dict(zip(df.columns,new_cols))
        #APPLY NEW COL NAMES WITH RENAME()
        df.rename(columns=new_col_dict, inplace=True)
        df.rename(columns={'T_m':'Tm'}, inplace=True)
        #UPDATE TEAM NAMES SO THEY ARE ALL UNIFORM 
        df.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
        #INSERT YEAR COL
        df.insert(loc=0, column='Yr', value=yr)    
        #NOW USE AN IF STATEMENT THAT CHECKS WHICH TABLE ID THE ITERATION IS ON 
        #AND ASSIGN IT TO THE CORRESPONDING DATAFRAME
        if i == 0:
            df_team_stats = df
        elif i == 1:
            df_passing = df
        elif i == 2:
            df_rushing = df
        elif i == 3: 
            df_returns = df
        elif i == 4: 
            df_kicking = df
        elif i == 5: 
            df_team_scoring = df
        elif i == 6:
            df_team_conversions = df
        elif i == 7: 
            df_team_drives = df
        elif i == 8:
            df_playoff_results = df
    #LASTLY RETURN ALL THE DATAFRAMES 
    return [df_afc_standings, df_nfc_standings, 
            df_team_stats, df_passing, 
            df_rushing, df_returns, 
            df_kicking, df_team_scoring, 
            df_team_conversions, df_team_drives, 
            df_playoff_results]

#call the function for the data; each item in the return list is reassigned to a 
#variable dataframe 
#I reassigned them with the addition of _mast to denote master dataframes for all years 
[df_afc_standings_mast, df_nfc_standings_mast, 
df_team_stats_mast, df_passing_mast, 
df_rushing_mast, df_returns_mast, 
df_kicking_mast, df_team_scoring_mast, 
df_team_conversions_mast, df_team_drives_mast, 
df_playoff_results_mast] = get_all_data('https://www.pro-football-reference.com/years/2000/', 2000)

'''
Now we have all the dataframes on the page. 
'''
In[]: df_passing_mast.head()
Out[]: 
     Yr  R_k                  T_m     G  C_m_p  A_t_t  C_m_p_%   Y_d_s   T_D  \
0  2000  1.0     Los Angeles Rams  16.0  380.0  587.0     64.7  5232.0  37.0   
1  2000  2.0   Indianapolis Colts  16.0  357.0  571.0     62.5  4282.0  33.0   
2  2000  3.0       Denver Broncos  16.0  354.0  569.0     62.2  4243.0  28.0   
3  2000  4.0  San Francisco 49ers  16.0  366.0  583.0     62.8  4239.0  32.0   
4  2000  5.0   Kansas City Chiefs  16.0  342.0  582.0     58.8  4149.0  29.0   

   T_D_%  I_n_t  I_n_t_%  L_n_g  Y_/_A  A_Y_/_A  Y_/_C  Y_/_G  R_a_t_e   S_k  \
0    6.3   23.0      3.9   85.0    9.4      8.9   14.5  327.0     99.7  44.0   
1    5.8   15.0      2.6   78.0    7.7      7.7   12.4  267.6     94.7  20.0   
2    4.9   12.0      2.1   61.0    7.8      7.9   12.6  265.2     94.2  30.0   
3    5.5   10.0      1.7   69.0    7.5      7.9   12.0  264.9     97.0  25.0   
4    5.0   15.0      2.6   81.0    7.6      7.4   12.9  259.3     88.5  34.0   

   Y_d_s_._1  N_Y_/_A  A_N_Y_/_A  S_k_%  4_Q_C  G_W_D   E_X_P  
0      260.0      8.3        7.8    7.0    1.0    3.0  120.25  
1      131.0      7.2        7.2    3.4    2.0    3.0   88.16  
2      221.0      7.1        7.1    5.0    2.0    3.0   37.99  
3      161.0      7.0        7.3    4.1    NaN    NaN   53.79  
4      259.0      6.7        6.6    5.5    3.0    3.0   36.06  
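One wrinkle in the passing output above: the headers came out as R_k, T_m, C_m_p and so on. The passing table has single-level columns, and applying '_'.join(col) to a plain string joins its individual characters; the function then only renames 'T_m' back to 'Tm'. A guard worth considering (my own suggestion, not part of the function above) is to flatten only when the header really is a MultiIndex:

```python
import pandas as pd

def flatten_columns(df):
    # join level names only for a true MultiIndex; joining a plain string
    # column name would split it character by character ('Tm' -> 'T_m')
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = ['_'.join(col) for col in df.columns]
    return df

# single-level columns pass through untouched
df1 = flatten_columns(pd.DataFrame({'Tm': ['A'], 'Cmp': [1]}))
print(list(df1.columns))  # ['Tm', 'Cmp']

# multi-level columns get joined
df2 = flatten_columns(pd.DataFrame(
    [[1, 2]],
    columns=pd.MultiIndex.from_tuples([('Passing', 'Cmp'), ('Passing', 'Att')])))
print(list(df2.columns))  # ['Passing_Cmp', 'Passing_Att']
```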

In[]: df_team_drives_mast.head()
Out[]: 
     Yr   Rk                    Tm     G   Dr  Plays   Sc%   TO%  \
0  2000  1.0  Los Angeles Chargers  16.0  206   1007  21.4  23.3   
1  2000  2.0         New York Jets  16.0  205   1109  27.8  19.0   
2  2000  3.0         Detroit Lions  16.0  204   1034  26.5  14.2   
3  2000  4.0         Buffalo Bills  16.0  200   1118  29.0  10.5   
4  2000  5.0  Jacksonville Jaguars  16.0  198   1113  34.8  13.6   

   Average Drive_Plays  Average Drive_Yds Average Drive_Start  \
0                  4.9               20.8            Own 29.3   
1                  5.4               26.3            Own 33.2   
2                  5.1               21.6            Own 35.9   
3                  5.6               27.4            Own 29.4   
4                  5.6               28.7            Own 34.1   

  Average Drive_Time  Average Drive_Pts  
0               2:12               1.15  
1               2:23               1.50  
2               2:22               1.38  
3               2:35               1.51  
4               2:33               1.85  

Phase Three:

Now we need to apply the function to all the HTML pages for 2000 to 2020. To do this we'll have to re-write some of the function. Here is that function. I did my best to heavily comment it.

There are a few things we will need. We need a list of the urls we're scraping and a list of years to scrape.

Code Editor: 

import pandas as pd
from bs4 import BeautifulSoup, Comment
import requests, re

pd.options.display.max_columns = 50
pd.options.display.max_rows = 50
pd.set_option('display.max_colwidth', 50)

def get_all_data(url_lst):
    #IF/ELSE STATEMENT CHECKS FOR A LIST OR STR
    if isinstance(url_lst, list):
        print('List is being processed...')
    else:
        url_lst = [url_lst]
        print('processing...')
    #REG EX PATTERN FOR REMAPPING COLS
    re_col_pat = r'(Unnamed: ([0-9]|[0-5][0-9])_level_\d_)'    
    #REG EX PATTERN FOR GETTING THE YEAR FROM THE URL
    re_year_pat = r'(\d\d\d\d)'
    #TEAM NAME CHANGE DICTIONARY
    team_name_change_dict = {
                         'St. Louis Rams':'Los Angeles Rams',
                         'Oakland Raiders':'Las Vegas Raiders',
                         'San Diego Chargers':'Los Angeles Chargers',
                         'Washington Redskins':'Washington Football Team',
                         }
    #CREATE EMPTY MASTER DATAFRAMES AS PLACEHOLDERS
    [df_afc_standings_mast, df_nfc_standings_mast, 
    df_team_stats_mast, df_passing_mast, 
    df_rushing_mast, df_returns_mast, 
    df_kicking_mast, df_team_scoring_mast, 
    df_team_conversions_mast, df_team_drives_mast, 
    df_playoff_results_mast] = [pd.DataFrame() for n in range(11)]
    
    for n in url_lst:
        #BS4 DATA PARSE
        r = requests.get(n)
        soup = BeautifulSoup(r.text, 'lxml')
        #EXTRACT THE YEAR FROM THE URL IN THE CURRENT ITERATION; NOTE 
        #re.findall RETURNS STRINGS, SO yr[0] IS E.G. '2000'
        yr = re.findall(re_year_pat, n)
        #CONVERT AVAILABLE AFC/NFC TABLES (NOT HIDDEN IN COMMENTS)
        tables = soup.find_all('table') #standings table
        #####################################################################
        #GRAB AFC/NFC STANDINGS
        for i3 in range(len(tables)):
            df_s = pd.read_html(str(tables))[i3]
            df_s.columns = ['_'.join(col) for col in df_s.columns]
            #NEW COL NAMES
            new_cols = []
            for i4 in df_s.columns:
                res = re.sub(re_col_pat, '', i4)
                new_cols.append(res)
            #CREATE NEW COLS/OLD COLS DICT
            new_col_dict = dict(zip(df_s.columns,new_cols))
            #APPLY NEW COL NAMES WITH RENAME()
            df_s.rename(columns=new_col_dict, inplace=True)
            df_s.rename(columns={'T_m':'Tm'}, inplace=True)
            #CHANGE TEAM NAMES SO THEY ARE UNIFORM
            df_s.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
            #INSERT YEAR COL
            df_s.insert(loc=0, column='Yr', value=yr[0])
            if i3 == 0: 
                df_afc_standings = df_s
                df_afc_standings_mast = df_afc_standings_mast.append(df_afc_standings)
            elif i3 == 1:
                df_nfc_standings = df_s
                df_nfc_standings_mast = df_nfc_standings_mast.append(df_nfc_standings)
                
        #####################################################################
        #GRAB THE DFs HIDDEN IN COMMENTS
        df_str_lst = ['team_stats', 'passing', 'rushing', 
                      'returns', 'kicking', 'team_scoring', 
                      'team_conversions', 'drives', 'playoff_results']
        #GET THE COMMENTS
        comments = soup.find_all(string=lambda text: isinstance(text, Comment))
        #LOOP OVER THE TABLE IDS AND APPEND EACH YEAR'S DF TO ITS MASTER
        for i in range(len(df_str_lst)):
            rx_compiler_str = r'<table.+?id="{}".+?>[\s\S]+?</table>'.format(df_str_lst[i])
            #LOOK FOR THE TABLE WITH THE CURRENT ID
            rx = re.compile(rx_compiler_str)
            for comment in comments:
                try:
                    table = rx.search(comment.string).group(0)
                    # break the loop if found
                    break
                except AttributeError:  #NO MATCH IN THIS COMMENT
                    pass
            #CONVERT TABLE AND CLEAN THE DF COLS  
            df = pd.read_html(str(table))[0]
            df.columns = ['_'.join(col) for col in df.columns]
            #NEW COL NAMES
            new_cols = []
            for i2 in df.columns:
                res = re.sub(re_col_pat , '', i2)
                new_cols.append(res)
            #CREATE NEW COLS/OLD COLS DICT
            new_col_dict = dict(zip(df.columns,new_cols))
            #APPLY NEW COL NAMES WITH RENAME()
            df.rename(columns=new_col_dict, inplace=True)
            df.rename(columns={'T_m':'Tm'}, inplace=True)
            #CHANGE TEAM NAMES SO THEY ARE UNIFORM
            df.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
            #INSERT YEAR COL
            df.insert(loc=0, column='Yr', value=yr[0])    
            #RETURN THE CORRECT DF TO CORRESPONDING MASTER DATASET PER ITERATION
            if i == 0:
                df_team_stats = df
                df_team_stats_mast = df_team_stats_mast.append(df_team_stats)
            elif i == 1:
                df_passing = df
                df_passing_mast = df_passing_mast.append(df_passing)
            elif i == 2:
                df_rushing = df
                df_rushing_mast = df_rushing_mast.append(df_rushing)
            elif i == 3: 
                df_returns = df
                df_returns_mast = df_returns_mast.append(df_returns)
            elif i == 4: 
                df_kicking = df
                df_kicking_mast = df_kicking_mast.append(df_kicking)
            elif i == 5: 
                df_team_scoring = df
                df_team_scoring_mast = df_team_scoring_mast.append(df_team_scoring)
            elif i == 6:
                df_team_conversions = df
                df_team_conversions_mast = df_team_conversions_mast.append(df_team_conversions)
            elif i == 7: 
                df_team_drives = df
                df_team_drives_mast = df_team_drives_mast.append(df_team_drives)
            elif i == 8:
                df_playoff_results = df
                df_playoff_results_mast = df_playoff_results_mast.append(df_playoff_results)
    #RETURN THE MASTER DATAFRAMES
    return [df_afc_standings_mast, df_nfc_standings_mast, 
            df_team_stats_mast, df_passing_mast, 
            df_rushing_mast, df_returns_mast, 
            df_kicking_mast, df_team_scoring_mast, 
            df_team_conversions_mast, df_team_drives_mast, 
            df_playoff_results_mast]

'''
Because nobody wants to copy and paste 21 urls into a list, we'll build a 
list of the years 2000 to 2020 with a list comprehension, then loop over it 
and format each year into the link (the links are identical except for the 
year), appending each result to a stand-alone url list. 
'''
year_lst = [i for i in range(2000,2021)]
link_lst = []
for yr in year_lst: 
    url = "http://www.pro-football-reference.com/years/{}/".format(yr)
    link_lst.append(url)

'''
Now that we have all our links pass them into the function. 
'''
[df_afc_standings_mast, df_nfc_standings_mast, 
df_team_stats_mast, df_passing_mast, 
df_rushing_mast, df_returns_mast, 
df_kicking_mast, df_team_scoring_mast, 
df_team_conversions_mast, df_team_drives_mast, 
df_playoff_results_mast] = get_all_data(link_lst)


Phase Four: Final Function

Now we have all the dataframes from every page dated from 2000 to 2020, with the team names standardized and a year column in each.

We have to re-write the function one more time to also extract the divisional, conference, and Super Bowl game data, which we will need for the analysis. The only caveat is that these live at different links, but again the links differ only by year. So we reuse mostly the same methods and have the function take a list of years (or a single year).

I've also added try and except statements to keep the function from breaking when a table is not found. This will be useful for pulling the current year's data: as I write this, the 2021 season has not yet played its divisional or conference rounds, so those tables do not exist. When we pull the 2021 data the function simply skips past them.
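Before the full listing, here is a minimal, self-contained sketch of the comment-hidden-table pattern with the try/except guard. It uses a toy HTML string rather than the live site, and the table id is just an example:

```python
import re
from bs4 import BeautifulSoup, Comment

# Toy page: pro-football-reference hides many tables inside HTML comments
html = ('<div><!-- <table class="s" id="team_stats" border="1">'
        '<tr><td>42</td></tr></table> --></div>')
soup = BeautifulSoup(html, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

rx = re.compile(r'<table.+?id="team_stats".+?>[\s\S]+?</table>')
table = None
for comment in comments:
    try:
        table = rx.search(str(comment)).group(0)
        break  # stop once the table is found
    except AttributeError:  # rx.search returned None; keep looking
        pass
```

If no comment matches, `table` stays `None` and the caller can skip that dataset, which is exactly what the try/except blocks in the full function rely on.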

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup, Comment
import requests, re
import string

pd.options.display.max_columns = 50
pd.options.display.max_rows = 50
pd.set_option('display.max_colwidth', 50)

def get_all_data(yr_lst):
    #IF/ELSE STATEMENT CHECKS FOR A LIST OR A SINGLE YEAR
    if isinstance(yr_lst, list):
        print('List is being processed...')
    else:
        yr_lst = [yr_lst]
        print('processing...')
    link_lst_team_data = []
    link_lst_div = []
    link_lst_conf = []
    url_sb_data = "https://www.pro-football-reference.com/super-bowl/"
    for i in range(len(yr_lst)): 
        url_td = "http://www.pro-football-reference.com/years/{}/".format(yr_lst[i])
        url_div = "https://www.pro-football-reference.com/years/{}/week_19.htm".format(yr_lst[i])
        url_confs = "https://www.pro-football-reference.com/years/{}/week_20.htm".format(yr_lst[i])
        link_lst_team_data.append(url_td)
        link_lst_div.append(url_div)
        link_lst_conf.append(url_confs)
    
    #REG EX PATTERN FOR REMAPPING COLS
    re_col_pat = r'(Unnamed: ([0-9]|[0-5][0-9])_level_\d_)'    
    #REG EX PATTERN FOR GETTING THE YEAR FROM THE URL
    re_year_pat = r'(\d\d\d\d)'
    #TEAM NAME CHANGE DICTIONARY
    team_name_change_dict = {
                            'St. Louis Rams':'Los Angeles Rams',
                            'Oakland Raiders':'Las Vegas Raiders',
                            'San Diego Chargers':'Los Angeles Chargers',
                            'Washington Redskins':'Washington Football Team',
                            }
    
    #CREATE EMPTY MASTER DATAFRAMES AS PLACEHOLDERS
    [df_afc_standings_mast, df_nfc_standings_mast, 
    df_team_stats_mast, df_passing_mast, 
    df_rushing_mast, df_returns_mast, 
    df_kicking_mast, df_team_scoring_mast, 
    df_team_conversions_mast, df_team_drives_mast, 
    df_playoff_results_mast] = [pd.DataFrame() for n in range(11)]
    #CREATE EMPTY MASTER DATAFRAMES AS PLACEHOLDERS
    [df_div_mast] = [pd.DataFrame() for n in range(1)]
    #CREATE EMPTY MASTER DATAFRAMES AS PLACEHOLDERS
    [df_conf_mast] = [pd.DataFrame() for n in range(1)] 
    #CREATE EMPTY MASTER DATAFRAMES AS PLACEHOLDERS
    [df_sb_champ_mast] = [pd.DataFrame() for n in range(1)] 

    for n in link_lst_team_data:
        #BS4 DATA PARSE
        r = requests.get(n)
        soup = BeautifulSoup(r.text, 'lxml')
        #EXTRACT THE YEAR FROM THE URL STRING IN CURRENT ITERATION
        yr = re.findall(re_year_pat, n)
        #COLLECT THE VISIBLE AFC/NFC TABLES (NOT HIDDEN IN COMMENTS)
        tables = soup.find_all('table') #standings table
        #####################################################################
        #GRAB AFC/NFC STANDINGS
        for i3 in range(len(tables)):
            df_s = pd.read_html(str(tables))[i3]
            df_s.columns = ['_'.join(col) for col in df_s.columns]
            #NEW COL NAMES
            new_cols = []
            for i4 in df_s.columns:
                res = re.sub(re_col_pat, '', i4)
                new_cols.append(res)
            #CREATE NEW COLS/OLD COLS DICT
            new_col_dict = dict(zip(df_s.columns,new_cols))
            #APPLY NEW COL NAMES WITH RENAME()
            df_s.rename(columns=new_col_dict, inplace=True)
            df_s.rename(columns={'T_m':'Tm'}, inplace=True)
            
            #CHANGE TEAM NAMES SO THEY ARE UNIFORM
            df_s.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
            #INSERT YEAR COL
            df_s.insert(loc=0, column='Yr', value=yr[0])
            if i3 == 0: 
                df_afc_standings = df_s
                df_afc_standings_mast = df_afc_standings_mast.append(df_afc_standings)
            elif i3 == 1:
                df_nfc_standings = df_s
                df_nfc_standings_mast = df_nfc_standings_mast.append(df_nfc_standings)
                
        #####################################################################
        #GRAB THE DFs HIDDEN IN COMMENTS
        df_str_lst = ['team_stats', 'passing', 'rushing', 
                        'returns', 'kicking', 'team_scoring', 
                        'team_conversions', 'drives', 'playoff_results']
        #GET THE COMMENTS
        comments = soup.find_all(string=lambda text: isinstance(text, Comment))
        #CREATE A MAST DF 
        for i in range(len(df_str_lst)):
            rx_compiler_str = r'<table.+?id="{}".+?>[\s\S]+?</table>'.format(df_str_lst[i])
            #LOOK FOR TABLE WITH THE ID "TEAM_STATS"
            rx = re.compile(rx_compiler_str)
            for comment in comments:
                try:
                    table = rx.search(comment.string).group(0)
                    #break out of the loop once the table is found
                    break
                except AttributeError: #no match in this comment; keep looking
                    pass
            #CONVERT TABLE AND CLEAN THE DF COLS  
            df = pd.read_html(str(table))[0]
            df.columns = ['_'.join(col) for col in df.columns]
            #NEW COL NAMES
            new_cols = []
            for i2 in df.columns:
                res = re.sub(re_col_pat , '', i2)
                new_cols.append(res)
            #CREATE NEW COLS/OLD COLS DICT
            new_col_dict = dict(zip(df.columns,new_cols))
            #APPLY NEW COL NAMES WITH RENAME()
            df.rename(columns=new_col_dict, inplace=True)
            df.rename(columns={'T_m':'Tm'}, inplace=True)
            df.rename(columns={'R_k':'Rk'}, inplace=True)
            #CHANGE TEAM NAMES SO THEY ARE UNIFORM
            df.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
            #DROP THE SUMMARY ROWS (Avg Team, League Total, Avg Tm/G) THAT ARE NOT TEAMS
            drop_avgs_lst = ['Avg Team', 'League Total', 'Avg Tm/G']
            try:
                df = df.loc[~df['Tm'].isin(drop_avgs_lst)]
            except:
                pass
            #INSERT YEAR COL
            df.insert(loc=0, column='Yr', value=yr[0])    
            #RETURN THE CORRECT DF TO CORRESPONDING MASTER DATASET PER ITERATION
            if i == 0:
                df_team_stats = df
                df_team_stats_mast = df_team_stats_mast.append(df_team_stats)
            elif i == 1:
                df_passing = df
                df_passing_mast = df_passing_mast.append(df_passing)
            elif i == 2:
                df_rushing = df
                df_rushing_mast = df_rushing_mast.append(df_rushing)
            elif i == 3: 
                df_returns = df
                df_returns_mast = df_returns_mast.append(df_returns)
            elif i == 4: 
                df_kicking = df
                df_kicking_mast = df_kicking_mast.append(df_kicking)
            elif i == 5: 
                df_team_scoring = df
                df_team_scoring_mast = df_team_scoring_mast.append(df_team_scoring)
            elif i == 6:
                df_team_conversions = df
                df_team_conversions_mast = df_team_conversions_mast.append(df_team_conversions)
            elif i == 7: 
                df_team_drives = df
                df_team_drives_mast = df_team_drives_mast.append(df_team_drives)
            elif i == 8:
                df_playoff_results = df
                df_playoff_results_mast = df_playoff_results_mast.append(df_playoff_results)
                
    #parse division URLs
    try:
        for n in link_lst_div: 
            #BS4 DATA PARSE
            r = requests.get(n)
            soup = BeautifulSoup(r.text, 'lxml')
            #EXTRACT THE YEAR FROM THE URL STRING IN CURRENT ITERATION
            yr = re.findall(re_year_pat, n)
            #COLLECT THE VISIBLE TABLES ON THE DIVISIONAL ROUND PAGE
            champ_tables_div = soup.find_all('table') #divisional round game tables
            for i in range(0,8,2):
                df_s = pd.read_html(str(champ_tables_div))[i]
                #INSERT YEAR COL
                df_s.insert(loc=0, column='Yr', value=yr[0])
                if i == 0: 
                    df_div_0 = df_s
                    df_div_mast = df_div_mast.append(df_div_0)
                elif i == 2: 
                    df_div_1 = df_s
                    df_div_mast = df_div_mast.append(df_div_1)
                elif i == 4: 
                    df_div_2 = df_s
                    df_div_mast = df_div_mast.append(df_div_2)
                elif i == 6: 
                    df_div_3 = df_s
                    df_div_mast = df_div_mast.append(df_div_3)
        #CHANGE TEAM NAMES SO THEY ARE UNIFORM 
        df_div_mast.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
    except: 
        df_div_mast.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
        print('no division table found')
    #Parse conf champ URLs
    try:
        for n in link_lst_conf: 
            #BS4 DATA PARSE
            r = requests.get(n)
            soup = BeautifulSoup(r.text, 'lxml')
            #EXTRACT THE YEAR FROM THE URL STRING IN CURRENT ITERATION
            yr = re.findall(re_year_pat, n)
            #COLLECT THE VISIBLE TABLES ON THE CONFERENCE CHAMPIONSHIP PAGE
            champ_tables_conf = soup.find_all('table') #conference game tables
            for i in range(0,4,2): #two conference games, one every other table
                df_s = pd.read_html(str(champ_tables_conf))[i]
                #INSERT YEAR COL
                df_s.insert(loc=0, column='Yr', value=yr[0])
                if i == 0: 
                    df_conf_0 = df_s
                    df_conf_mast = df_conf_mast.append(df_conf_0)
                elif i == 2: 
                    df_conf_1 = df_s
                    df_conf_mast = df_conf_mast.append(df_conf_1)
        #CHANGE TEAM NAMES SO THEY ARE UNIFORM 
        df_conf_mast.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
    except: 
        df_conf_mast.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
        print('no conference table found')
    #parse super bowl URL
    #BS4 DATA PARSE
    sb_df_link = requests.get(url_sb_data)
    soup = BeautifulSoup(sb_df_link.text, 'lxml')
    #COLLECT THE SUPER BOWL TABLE (THE YEAR COMES FROM THE Date COLUMN BELOW)
    sb_data = soup.find_all('table')
    df_sb_mast = pd.read_html(str(sb_data))[0]
    df_sb_mast['Date'] = pd.to_datetime(df_sb_mast['Date'])
    df_sb_mast['Yr'] = df_sb_mast['Date'].dt.year
    #CHANGE TEAM NAMES SO THEY ARE UNIFORM 
    df_sb_mast.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
    print('Done.')
    
    #RETURN THE MASTER DATAFRAMES (ORDER HERE MATTERS)
    return [df_afc_standings_mast, df_nfc_standings_mast, 
            df_team_stats_mast, df_passing_mast, 
            df_rushing_mast, df_returns_mast, 
            df_kicking_mast, df_team_scoring_mast, 
            df_team_conversions_mast, df_team_drives_mast, 
            df_playoff_results_mast,  
            df_sb_mast, df_div_mast, df_conf_mast]
    
#PASS THE YEARS THROUGH THE FUNCTION
[df_afc_standings_mast, df_nfc_standings_mast, 
df_team_stats_mast, df_passing_mast, 
df_rushing_mast, df_returns_mast, 
df_kicking_mast, df_team_scoring_mast, 
df_team_conversions_mast, df_team_drives_mast, 
df_playoff_results_mast, 
df_sb_champ_mast, df_div_mast, df_conf_mast] = get_all_data([i for i in range(2000,2023)])              

Full Data Download

Here is a binder you can download where I've placed all the data in CSV format, along with the codebooks for the columns in a txt file and a .py file that contains all the code.


Data Wrangle

Now that we have the data, we need to wrangle it. Wrangling is the process of cleaning the data and unifying it into a form that is easier to analyze.

'''
It is good practice to avoid changing the master data sets too much. 
So try not to change the originals. 

One of the first things we are going to do is combine the master standings 
dataframes. We will combine them and re-assign them into a new dataframe. 
'''
#IMPORTS 
import pandas as pd 
import numpy as np 
import string 
#PANDAS VIEW
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.set_option('display.max_colwidth', None)
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt  # To visualize
import seaborn as sns # To visualize
pd.options.mode.chained_assignment = None  # default='warn'

'''
Here is a function that adds a suffix to the columns of a dataframe. When we 
merge or append dataframes that share column names, the suffixes help us tell 
them apart. 
'''
def add_suffix(df, suf_str): 
    keep_same = {'Yr', 'Tm'}
    df.columns = ['{}{}'.format(c, '' if c in keep_same else suf_str)
              for c in df.columns]
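A quick demonstration of the effect on a made-up frame (the function is repeated here so the snippet runs on its own):

```python
import pandas as pd

def add_suffix(df, suf_str):
    # leave the merge keys Yr and Tm untouched; suffix everything else
    keep_same = {'Yr', 'Tm'}
    df.columns = ['{}{}'.format(c, '' if c in keep_same else suf_str)
                  for c in df.columns]

demo = pd.DataFrame({'Yr': [2000], 'Tm': ['Denver Broncos'], 'W': [11], 'L': [5]})
add_suffix(demo, '_standings')
# demo.columns is now ['Yr', 'Tm', 'W_standings', 'L_standings']
```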

'''
First, import the master dataframes. This is not the most efficient way but this 
is just how I did it. 
'''         
def data_import():
    df_afc_standings_mast = pd.read_csv(r'Path\to\data\df_afc_standings_mast.csv', engine='python', index_col=False)
    df_kicking_mast = pd.read_csv(r'Path\to\data\df_kicking_mast.csv', engine='python', index_col=False)
    df_nfc_standings_mast = pd.read_csv(r'Path\to\data\df_nfc_standings_mast.csv', engine='python', index_col=False)
    df_passing_mast = pd.read_csv(r'Path\to\data\df_passing_mast.csv', engine='python', index_col=False)
    df_playoff_results_mast = pd.read_csv(r'Path\to\data\df_playoff_results_mast.csv', engine='python', index_col=False)
    df_returns_mast = pd.read_csv(r'Path\to\data\df_returns_mast.csv', engine='python', index_col=False)
    df_rushing_mast = pd.read_csv(r'Path\to\data\df_rushing_mast.csv', engine='python', index_col=False)
    df_team_conversions_mast = pd.read_csv(r'Path\to\data\df_team_conversions_mast.csv', engine='python', index_col=False)
    df_team_drives_mast = pd.read_csv(r'Path\to\data\df_team_drives_mast.csv', engine='python', index_col=False)
    df_team_scoring_mast = pd.read_csv(r'Path\to\data\df_team_scoring_mast.csv', engine='python', index_col=False)
    df_team_stats_mast = pd.read_csv(r'Path\to\data\df_team_stats_mast.csv', engine='python', index_col=False)
    df_div_mast = pd.read_csv(r'Path\to\data\df_div_mast.csv', engine='python', index_col=False)
    df_conf_mast = pd.read_csv(r'Path\to\data\df_conf_mast.csv', engine='python', index_col=False)
    df_sb_champ_mast = pd.read_csv(r'Path\to\data\df_sb_champ_mast.csv', engine='python', index_col=False)

    return [df_afc_standings_mast, df_nfc_standings_mast, 
            df_team_stats_mast, df_passing_mast, 
            df_rushing_mast, df_returns_mast, 
            df_kicking_mast, df_team_scoring_mast, 
            df_team_conversions_mast, df_team_drives_mast, 
            df_playoff_results_mast, df_div_mast,
            df_conf_mast, df_sb_champ_mast]
          
[df_afc_standings_mast, df_nfc_standings_mast, 
df_team_stats_mast, df_passing_mast, 
df_rushing_mast, df_returns_mast, 
df_kicking_mast, df_team_scoring_mast, 
df_team_conversions_mast, df_team_drives_mast, 
df_playoff_results_mast, df_div_mast,
df_conf_mast, df_sb_champ_mast] = data_import()

'''
Next, clean the standings data. This data will be the base of our analysis dataframe. 
'''
def clean_standings(): 
    team_names_lst = ['Los Angeles Rams', 'Denver Broncos', 'Las Vegas Raiders',
                    'Indianapolis Colts', 'Minnesota Vikings', 'Tampa Bay Buccaneers',
                    'San Francisco 49ers', 'Jacksonville Jaguars',
                    'Kansas City Chiefs', 'New Orleans Saints', 'Green Bay Packers',
                    'Philadelphia Eagles', 'Tennessee Titans', 'Baltimore Ravens',
                    'New York Giants', 'Miami Dolphins', 'Pittsburgh Steelers',
                    'New York Jets', 'Seattle Seahawks', 'Buffalo Bills',
                    'Carolina Panthers', 'Detroit Lions', 'Dallas Cowboys',
                    'Washington Football Team', 'New England Patriots',
                    'Los Angeles Chargers', 'Atlanta Falcons', 'Chicago Bears',
                    'Arizona Cardinals', 'Cincinnati Bengals', 'Cleveland Browns',
                    'Houston Texans']

    #PUT THE STANDINGS DFs TOGETHER
    df_wins_losses = df_afc_standings_mast.append(df_nfc_standings_mast)
    #STRIP THE PLAYOFF MARKERS (e.g. * AND +) AND OTHER PUNCTUATION FROM TEAM NAMES
    df_wins_losses['Tm'] = df_wins_losses['Tm'].str.strip(string.punctuation)
    #DROP ANY ROW THAT IS NOT A TEAM
    df_wins_losses = df_wins_losses.loc[df_wins_losses['Tm'].isin(team_names_lst)]
    #ISOLATE THE STANDINGS COLUMNS WE NEED (Yr, Tm, W, L, W-L%)
    df_standings_wins_loss_percent = df_wins_losses.iloc[0:,0:5]
    #SELF-MERGE ON Yr AND Tm SO THE OVERLAPPING COLUMNS PICK UP A _standings SUFFIX
    df_team_stats_mast_w_percent = df_standings_wins_loss_percent.merge(df_standings_wins_loss_percent, suffixes=('_standings','stand'), how='left', on=['Yr','Tm'])
    #GET A FOUNDATIONAL DF TO COMPARE WL% TO OTHER VARIABLES
    df1 = df_team_stats_mast_w_percent.iloc[0:,0:5]
    return df1
#DECLARES THE STANDINGS DATA AS df 
df = clean_standings()

'''
Then, drop all of the Rk and G columns from the dataframes. We do not need them. 
'''
#DROP ALL THE RANKS AND GAMES COLUMNS
def dropper():
    df_passing = df_passing_mast.drop(columns=['Rk','G'])
    add_suffix(df_passing, '_passing')
    
    df_kicking = df_kicking_mast.drop(columns=['Rk', 'G'])
    add_suffix(df_kicking, '_kicking')
    
    df_returns = df_returns_mast.drop(columns=['Rk', 'G'])
    add_suffix(df_returns, '_returns')
    
    df_rushing = df_rushing_mast.drop(columns=['Rk', 'G'])
    add_suffix(df_rushing, '_rushing')
    
    df_conversions = df_team_conversions_mast.drop(columns=['Rk', 'G'])
    add_suffix(df_conversions, '_conversions')
    
    df_drives = df_team_drives_mast.drop(columns=['Rk', 'G'])
    add_suffix(df_drives, '_drives')
    
    df_scoring = df_team_scoring_mast.drop(columns=['Rk', 'G'])
    add_suffix(df_scoring, '_scoring')
    
    df_team_stats = df_team_stats_mast.drop(columns=['Rk','G'])
    add_suffix(df_team_stats, '_team_offense')
    
    df_drop_lst = [df_passing, df_kicking, 
                      df_returns, df_rushing, 
                      df_conversions, df_drives,
                      df_scoring, df_team_stats]
    return df_drop_lst
#THIS PLACES THE DFs RETURNED INTO A LIST
df_merger_lst = dropper()

#THIS MERGES ALL THE DATAFRAMES INTO df ON Tm AND Yr. 
for i in df_merger_lst:
    df = df.merge(i, on=['Tm','Yr'])
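The merge loop above can be illustrated on toy frames (team names and numbers here are invented):

```python
import pandas as pd

# Base frame: the standings columns keyed on Yr and Tm
df_demo = pd.DataFrame({'Yr': [2000, 2000], 'Tm': ['A', 'B'], 'W': [10, 6]})
passing = pd.DataFrame({'Yr': [2000, 2000], 'Tm': ['A', 'B'],
                        'Yds_passing': [3800, 3100]})
rushing = pd.DataFrame({'Yr': [2000, 2000], 'Tm': ['A', 'B'],
                        'Yds_rushing': [1700, 2000]})

# Each pass adds one stats table's columns, matched on the two keys
for piece in [passing, rushing]:
    df_demo = df_demo.merge(piece, on=['Tm', 'Yr'])
```

After the loop, `df_demo` has one row per team-year with the columns of every merged frame side by side.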

'''
Many of the columns are object (string) data and we need them to be numeric. 
get_sec() converts columns holding time as minutes:seconds into seconds only. 
'''
def get_sec(time_str):
    m, s = time_str.split(':')
    return int(m) * 60 + int(s)
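For example (function repeated so the snippet is self-contained):

```python
def get_sec(time_str):
    # split 'm:ss' and convert to total seconds
    m, s = time_str.split(':')
    return int(m) * 60 + int(s)

# an average drive time of '2:38' becomes 158 seconds
avg_drive = get_sec('2:38')
```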

def objs_to_numerics():
    df['Avg_Drive_Time_Sec_drives'] = [get_sec(i) for i in df['Average Drive_Time_drives']]
    df.drop(columns=['Average Drive_Time_drives'],inplace=True)
    
    #FIRST FIND THEM AND PUT THEM IN A LIST
    obj_cols = []
    for i in df.iloc[0:,2:].columns: 
        if df[i].dtype == object: 
            obj_cols.append(i)
    #NOW THAT WE HAVE THEM STRIP THE % SIGN FROM PERCENT COLS
    for i in obj_cols:
        df[i] = df[i].str.strip(string.punctuation)
        df[i] = df[i].str.strip(r'Own ')  #all are 'Own '
        df[i] = pd.to_numeric(df[i])
objs_to_numerics()
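The strip-and-convert step works like this on a toy column (values invented for illustration):

```python
import pandas as pd
import string

col = pd.Series(['63.4%', '41.2%', 'Own 27.4'])
col = col.str.strip(string.punctuation)  # removes the trailing % signs
col = col.str.strip('Own ')              # strips the chars O, w, n, space from the ends
col = pd.to_numeric(col)                 # now floats: 63.4, 41.2, 27.4
```

Note that `str.strip` removes a *set of characters* from both ends, not a substring, which is why stripping `'Own '` leaves `27.4` intact.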

'''
We need to add boolean columns that mark whether a team made it to the divisional 
round, the conference round, and/or the Super Bowl, and whether the team won 
that game.
'''
def wrangle_div_conf(some_df, tm_col): 
    team_names_lst = ['Los Angeles Rams', 'Denver Broncos', 'Las Vegas Raiders',
                  'Indianapolis Colts', 'Minnesota Vikings', 'Tampa Bay Buccaneers',
                  'San Francisco 49ers', 'Jacksonville Jaguars',
                  'Kansas City Chiefs', 'New Orleans Saints', 'Green Bay Packers',
                  'Philadelphia Eagles', 'Tennessee Titans', 'Baltimore Ravens',
                  'New York Giants', 'Miami Dolphins', 'Pittsburgh Steelers',
                  'New York Jets', 'Seattle Seahawks', 'Buffalo Bills',
                  'Carolina Panthers', 'Detroit Lions', 'Dallas Cowboys',
                  'Washington Football Team', 'New England Patriots',
                  'Los Angeles Chargers', 'Atlanta Falcons', 'Chicago Bears',
                  'Arizona Cardinals', 'Cincinnati Bengals', 'Cleveland Browns',
                  'Houston Texans']
    #.loc ALL THE ROWS THAT CONTAIN THE TEAMS
    some_df = some_df.loc[some_df[tm_col].isin(team_names_lst)]
    #CONVERT THE SCORES FROM AN OBJECT STRING TO INTEGERS 
    some_df['1'] = some_df['1'].astype(int)
    #DROP THE FINALS COLUMN
    some_df.drop(columns='2',inplace=True)
    #REPLACE THE COL NAMES
    some_df.rename(columns={'0':'Tm', '1':'score'}, inplace=True)
    #BUILD A WIN/LOSS BOOLEAN FOR EACH ROW
    #SEPARATE THE SCORES INTO EVEN- AND ODD-INDEXED ROWS; EASIER TO COMPARE
    df_div_n2 = list(some_df['score'].iloc[0::2])
    df_div_n3 = list(some_df['score'].iloc[1::2])
    #ITERATE THROUGH THE LISTS TO GET TRUE FOR WIN; AND FALSE FOR LOSS
    start_idx = 0
    win_loss_bool = []
    for i in range(len(df_div_n2)):
        win_loss_bool.append(df_div_n2[start_idx] > df_div_n3[start_idx])
        win_loss_bool.append(df_div_n2[start_idx] < df_div_n3[start_idx])
        start_idx += 1
    #ADD A WIN_BOOL COLUMN TO THE DF
    some_df['win_bool'] = win_loss_bool
    return some_df
    
df_div = wrangle_div_conf(df_div_mast, '0')
df_conf = wrangle_div_conf(df_conf_mast, '0')
#RENAME THE COLUMNS IN DIVISION AND CONFERENCE DATA
df_div.rename(columns={'score':'score_div', 'win_bool':'win_bool_div'}, inplace=True)
df_conf.rename(columns={'score':'score_conf', 'win_bool':'win_bool_conf'}, inplace=True)
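The pairwise comparison behind win_bool can be seen on a toy score list (scores invented). Rows alternate between the two teams in each game, so even-indexed and odd-indexed scores pair up:

```python
# game 1: 27-13, game 2: 20-23
scores = [27, 13, 20, 23]
evens = scores[0::2]  # first listed team in each game
odds = scores[1::2]   # second listed team in each game

win_loss_bool = []
for a, b in zip(evens, odds):
    win_loss_bool.append(a > b)  # did the first team win?
    win_loss_bool.append(a < b)  # did the second team win?
```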

'''
Now we add to df whether each team went to the divisional and/or conference round.
'''
#GET THE TM YR COLUMNS FROM DF IN A STANDALONE DATAFRAME
df_ana = df.iloc[0:,0:2]
#SET THE YR TM COLUMNS TO THE INDEX DF_ANA 
df.set_index(['Yr','Tm'], inplace=True)
df_div.set_index(['Yr','Tm'], inplace=True)
df_conf.set_index(['Yr','Tm'], inplace=True)
#MERGE THE DATAFRAMES 
df = df.merge(df_div, left_index=True, right_index=True, how='left')
df = df.merge(df_conf, left_index=True, right_index=True, how='left')
#CONVERT THE 'W_-_L_%_standings' COLUMN FROM THOUSANDTHS TO A PROPORTION
df['W_-_L_%_standings'] = df['W_-_L_%_standings'] / 1000

'''
Now we must clean the Super Bowl data. 
'''
def clean_sb_data():
    #ISOLATE THE COLUMNS WE NEED
    df_sb = df_sb_champ_mast[['Yr', 'Winner', 'Pts', 'Loser', 'Pts.1']]
    df_sb['Yr'] = df_sb['Yr']-1
    #SEPARATE THE LOSERS AND WINNERS 
    sb_loser = df_sb[['Yr', 'Loser', 'Pts.1']]
    sb_winner = df_sb.iloc[0:,0:3]
    #CHANGE THE WINNER AND LOSER DATAFRAMES TEAM COLUMN TO TM AND PTS TO SCORE
    sb_loser.rename(columns={'Loser':'Tm', 'Pts.1':'score'}, inplace=True)
    sb_winner.rename(columns={'Winner':'Tm', 'Pts':'score'}, inplace=True)
    #CONCAT THE SEPARATED DATAFRAMES
    conc_sb_dfs = [sb_winner, sb_loser]
    df_sb = pd.concat(conc_sb_dfs)
    #RESET THE INDEX OF THE DATAFRAME
    df_sb.reset_index(drop=True, inplace=True)
    #ADD A WIN/LOSS BOOL COLUMN: AFTER THE CONCAT, ROWS 0-54 ARE THE 55 
    #SUPER BOWL WINNERS AND ROWS 55+ ARE THE LOSERS
    df_sb_win_loss_bool = []
    for i in df_sb.index:
        df_sb_win_loss_bool.append(i < 55)
    df_sb['win_loss_bool'] = df_sb_win_loss_bool
    #LOC IN THE YEARS 2000 TO 2020 
    df_sb = df_sb.loc[df_sb['Yr'] >= 2000]
    #ADD A SUFFIX TO THE COLUMNS
    add_suffix(df_sb, '_sb')
    #SET THE INDEX TO MATCH DF
    df_sb.set_index(['Yr', 'Tm'], inplace=True)
    return df_sb
df_sb = clean_sb_data()
df = df.merge(df_sb, left_index=True, right_index=True, how='left')
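The reshape inside clean_sb_data(), from one wide winner/loser row per game to two long rows per game, looks like this on a single game (Super Bowl LV, the 2020 season):

```python
import pandas as pd

# One wide row in the winner/loser layout of the scraped table
sb = pd.DataFrame({'Yr': [2020], 'Winner': ['Tampa Bay Buccaneers'], 'Pts': [31],
                   'Loser': ['Kansas City Chiefs'], 'Pts.1': [9]})

winners = sb[['Yr', 'Winner', 'Pts']].rename(columns={'Winner': 'Tm', 'Pts': 'score'})
losers = sb[['Yr', 'Loser', 'Pts.1']].rename(columns={'Loser': 'Tm', 'Pts.1': 'score'})
long_sb = pd.concat([winners, losers], ignore_index=True)
long_sb['win_loss_bool'] = [True, False]  # winner rows first, then loser rows
```

The long format gives each team its own row keyed on Yr and Tm, which is what lets the Super Bowl data merge cleanly onto df.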

'''
Now we have to organize a new dataframe with the division, conference, and 
Super Bowl data at the forefront. 
'''
#Start with a function that cleanly prints the columns of df by 3s
#makes it easier to work with
def get_df_cols():
    #PRINT THE COLUMN NAMES IN QUOTED GROUPS OF THREE
    cols = list(df.columns)
    for idx_v in range(0, len(cols), 3):
        print(''.join("'{}',".format(c) for c in cols[idx_v:idx_v + 3]))
    return 

'''
Finally, corral it all into one big dataframe. 
'''
def wrangle_ana_df():
    '''
    This re-orders and re-assigns df.
    '''
    df_ana = df[['score_div', 'win_bool_div', 'score_conf', 
                 'win_bool_conf', 'score_sb', 'win_loss_bool_sb', 
                 'W_standings','L_standings','W_-_L_%_standings',
                 'C_m_p_passing','A_t_t_passing','C_m_p_%_passing',
                 'Y_d_s_passing','T_D_passing','T_D_%_passing',
                 'I_n_t_passing','I_n_t_%_passing','L_n_g_passing',
                 'Y_/_A_passing','A_Y_/_A_passing','Y_/_C_passing',
                 'Y_/_G_passing','R_a_t_e_passing','S_k_passing',
                 'Y_d_s_._1_passing','S_k_%_passing','N_Y_/_A_passing',
                 'A_N_Y_/_A_passing','4_Q_C_passing','G_W_D_passing',
                 'E_X_P_passing','0-19_FGA_kicking','0-19_FGM_kicking',
                 '20-29_FGA_kicking','20-29_FGM_kicking','30-39_FGA_kicking',
                 '30-39_FGM_kicking','40-49_FGA_kicking','40-49_FGM_kicking',
                 '50+_FGA_kicking','50+_FGM_kicking','Scoring_FGA_kicking',
                 'Scoring_FGM_kicking','Scoring_Lng_kicking','Scoring_FG%_kicking',
                 'Scoring_XPA_kicking','Scoring_XPM_kicking','Scoring_XP%_kicking',
                 'Kickoffs_KO_kicking','Kickoffs_KOYds_kicking','Kickoffs_TB_kicking',
                 'Kickoffs_TB%_kicking','Kickoffs_KOAvg_kicking','Punting_Pnt_kicking',
                 'Punting_Yds_kicking','Punting_Lng_kicking','Punting_Blck_kicking',
                 'Punting_Y/P_kicking','Punt Returns_Ret_returns','Punt Returns_Yds_returns',
                 'Punt Returns_TD_returns','Punt Returns_Lng_returns','Punt Returns_Y/R_returns',
                 'Kick Returns_Rt_returns','Kick Returns_Yds_returns','Kick Returns_TD_returns',
                 'Kick Returns_Lng_returns','Kick Returns_Y/Rt_returns','APYd_returns',
                 'A_t_t_rushing','Y_d_s_rushing','T_D_rushing',
                 'L_n_g_rushing','Y_/_A_rushing','Y_/_G_rushing',
                 'F_m_b_rushing','E_X_P_rushing','Downs_3DAtt_conversions',
                 'Downs_3DConv_conversions','Downs_3D%_conversions','Downs_4DAtt_conversions',
                 'Downs_4DConv_conversions','Downs_4D%_conversions','Red Zone_RZAtt_conversions',
                 'Red Zone_RZTD_conversions','Red Zone_RZPct_conversions','#Dr_drives',
                 'Plays_drives','Sc%_drives','TO%_drives',
                 'Average Drive_Plays_drives','Average Drive_Yds_drives','Average Drive_Start_drives',
                 'Average Drive_Pts_drives','R_s_h_T_D_scoring','R_e_c_T_D_scoring',
                 'P_R_ _T_D_scoring','K_R_ _T_D_scoring','F_b_l_T_D_scoring',
                 'I_n_t_T_D_scoring','O_t_h_T_D_scoring','A_l_l_T_D_scoring',
                 '2_P_M_scoring','2_P_A_scoring','X_P_M_scoring',
                 'X_P_A_scoring','F_G_M_scoring','F_G_A_scoring',
                 'S_f_t_y_scoring','P_t_s_scoring','P_t_s_/_G_scoring',
                 'D_2_P_scoring','PF_team_offense','Yds_team_offense',
                 'Tot Yds & TO_Ply_team_offense','Tot Yds & TO_Y/P_team_offense','Tot Yds & TO_TO_team_offense',
                 'FL_team_offense','1stD_team_offense','Passing_Cmp_team_offense',
                 'Passing_Att_team_offense','Passing_Yds_team_offense','Passing_TD_team_offense',
                 'Passing_Int_team_offense','Passing_NY/A_team_offense','Passing_1stD_team_offense',
                 'Rushing_Att_team_offense','Rushing_Yds_team_offense','Rushing_TD_team_offense',
                 'Rushing_Y/A_team_offense','Rushing_1stD_team_offense','Penalties_Pen_team_offense',
                 'Penalties_Yds_team_offense','Penalties_1stPy_team_offense','Sc%_team_offense',
                 'TO%_team_offense','EXP_team_offense','Avg_Drive_Time_Sec_drives',]]
    #Convert all the data to floats
    df_ana = df_ana.iloc[0:,0:].astype(float)
    #sort the index of ana_df
    df_ana.sort_index(inplace=True)
    #reset the index and get the Yr and Tm cols back into the dataframe
    df_ana.reset_index(inplace=True)
    return df_ana
df_ana = wrangle_ana_df()
#EXPORT DF_ANA
df_ana.to_csv(r'Export\file\path\of\df_ana.csv', index=False)
'''
Now we have all of the data in one dataframe we can use as the primary for analysis. 
You will have to do some more wrangling later, but this was the heaviest lifting. 
'''
  

Descriptive Analysis

Here I'll go over some basic descriptive statistics.

'''
First, I'm going to isolate this year's data into a separate dataframe, 
because this data covers full 16-game regular seasons and, at the time of 
writing, the 2021 season has not yet played all of its games. 
'''
In[]: df_ana_2021 = df_ana.iloc[-32:,0:]
In[]: df_ana_d = df_ana.iloc[0:670, 0:]
In[]: df_ana_2021.reset_index(drop=True, inplace=True)

'''
We can start by looking at some descriptive statistics of the data in df_ana.
I am going to exclude the division, conference, and Super Bowl data for now.
'''
In[]: df_ana.iloc[0:,8:].describe()
#Here is a snippet of the output 
       W_standings  L_standings  W_-_L_%_standings  C_m_p_passing  \
count   670.000000   670.000000         670.000000     670.000000   
mean      7.983582     7.983582           0.498746     331.092537   
std       3.101863     3.098005           0.193672      48.968711   
min       0.000000     0.000000           0.000000     204.000000   
25%       6.000000     6.000000           0.375000     297.250000   
50%       8.000000     8.000000           0.500000     329.000000   
75%      10.000000    10.000000           0.625000     366.000000   
max      16.000000    16.000000           0.938000     472.000000   

       A_t_t_passing  C_m_p_%_passing  Y_d_s_passing  T_D_passing  \
count     670.000000       670.000000     670.000000    670.00000   
mean      540.602985        61.103582    3555.582090     23.08209   
std        59.762089         4.392035     620.522169      7.45294   
min       358.000000        45.600000    1898.000000      6.00000   
25%       501.000000        58.000000    3098.500000     18.00000   
50%       542.500000        61.100000    3536.500000     22.00000   
75%       583.000000        64.300000    3998.000000     27.75000   
max       740.000000        73.400000    5444.000000     55.00000 ...

'''
The above shows the count, mean (average), standard deviation (std), min, max, 
and percentiles for each variable, computed over the entire dataset. We can also 
isolate the data by year. 
'''
df_2000 = df_ana.loc[df_ana['Yr']==2000]

'''
Here we will generate a correlation matrix, starting at the wins and losses columns. 
I am going to exclude the division, conference, and Super Bowl data for now. 
'''
In[]: df_corr = df_ana.iloc[:,8:].corr(method='pearson')

'''
So if we look at the first ten rows and four columns we can see the structure of a
correlation matrix.
'''
                   W_standings  L_standings  W_-_L_%_standings  C_m_p_passing
W_standings           1.000000    -0.998346           0.979706       0.122332
L_standings          -0.998346     1.000000          -0.979749      -0.127578
W_-_L_%_standings     0.979706    -0.979749           1.000000       0.113643
C_m_p_passing         0.122332    -0.127578           0.113643       1.000000
A_t_t_passing        -0.114731     0.109959          -0.118299       0.886518
C_m_p_%_passing       0.429580    -0.433127           0.418009       0.697468
Y_d_s_passing         0.313159    -0.317413           0.300767       0.872765
T_D_passing           0.502388    -0.503220           0.475131       0.610782
T_D_%_passing         0.591888    -0.591131           0.565948       0.327591
I_n_t_passing        -0.473305     0.476590          -0.465494      -0.078404

'''
It shows the correlation coefficient between each variable and every other variable.
So, let us look at all the coefficients for W_-_L_%_standings.
'''
In[]: w_loss_per_corr = df_corr.iloc[0:, 2]
Out[]: 
W_standings                      0.979706
L_standings                     -0.979749
W_-_L_%_standings                1.000000
C_m_p_passing                    0.113643
A_t_t_passing                   -0.118299
C_m_p_%_passing                  0.418009
Y_d_s_passing                    0.300767
T_D_passing                      0.475131
T_D_%_passing                    0.565948
I_n_t_passing                   -0.465494
I_n_t_%_passing                 -0.422179
L_n_g_passing                    0.066474
Y_/_A_passing                    0.512011
A_Y_/_A_passing                  0.591349
Y_/_C_passing                    0.327409
Y_/_G_passing                    0.300762
R_a_t_e_passing                  0.589854
S_k_passing                     -0.456082
Y_d_s_._1_passing               -0.453944
S_k_%_passing                   -0.368286
N_Y_/_A_passing                  0.569628
A_N_Y_/_A_passing                0.619554
4_Q_C_passing                    0.329574
G_W_D_passing                    0.375747
E_X_P_passing                    0.560876
0-19_FGA_kicking                -0.005201
0-19_FGM_kicking                -0.018923
20-29_FGA_kicking                0.225080
20-29_FGM_kicking                0.230720
30-39_FGA_kicking                0.209211
30-39_FGM_kicking                0.216945
40-49_FGA_kicking                0.125072
40-49_FGM_kicking                0.151656
50+_FGA_kicking                 -0.122340
50+_FGM_kicking                 -0.080407
Scoring_FGA_kicking              0.274723
Scoring_FGM_kicking              0.310597
Scoring_Lng_kicking             -0.007467
Scoring_FG%_kicking              0.187031
Scoring_XPA_kicking              0.674673
Scoring_XPM_kicking              0.675618
Scoring_XP%_kicking              0.057681
Kickoffs_KO_kicking              0.567831
Kickoffs_KOYds_kicking           0.575851
Kickoffs_TB_kicking              0.148144
Kickoffs_TB%_kicking             0.025797
Kickoffs_KOAvg_kicking           0.222601
Punting_Pnt_kicking             -0.374634
Punting_Yds_kicking             -0.386927
Punting_Lng_kicking             -0.027955
Punting_Blck_kicking            -0.019960
Punting_Y/P_kicking             -0.066211
Punt Returns_Ret_returns         0.142696
Punt Returns_Yds_returns         0.171174
Punt Returns_TD_returns          0.047896
Punt Returns_Lng_returns         0.069457
Punt Returns_Y/R_returns         0.089882
Kick Returns_Rt_returns         -0.259872
Kick Returns_Yds_returns        -0.238477
Kick Returns_TD_returns          0.020882
Kick Returns_Lng_returns         0.031521
Kick Returns_Y/Rt_returns        0.060090
APYd_returns                     0.353581
A_t_t_rushing                    0.466039
Y_d_s_rushing                    0.381405
T_D_rushing                      0.480960
L_n_g_rushing                   -0.011970
Y_/_A_rushing                    0.140499
Y_/_G_rushing                    0.381364
F_m_b_rushing                   -0.306978
E_X_P_rushing                    0.022537
Downs_3DAtt_conversions         -0.141215
Downs_3DConv_conversions         0.397804
Downs_3D%_conversions            0.497981
Downs_4DAtt_conversions         -0.451078
Downs_4DConv_conversions        -0.172604
Downs_4D%_conversions            0.217802
Red Zone_RZAtt_conversions       0.597728
Red Zone_RZTD_conversions        0.587456
Red Zone_RZPct_conversions       0.379994
#Dr_drives                      -0.110206
Plays_drives                     0.276659
Sc%_drives                       0.643822
TO%_drives                      -0.504452
Average Drive_Plays_drives       0.287888
Average Drive_Yds_drives         0.455470
Average Drive_Start_drives       0.390743
Average Drive_Pts_drives         0.638762
R_s_h_T_D_scoring                0.480960
R_e_c_T_D_scoring                0.475131
P_R_ _T_D_scoring                0.021229
K_R_ _T_D_scoring               -0.100318
F_b_l_T_D_scoring                0.066160
I_n_t_T_D_scoring                0.225410
O_t_h_T_D_scoring                0.159459
A_l_l_T_D_scoring                0.656068
2_P_M_scoring                    0.092491
2_P_A_scoring                   -0.086518
X_P_M_scoring                    0.675618
X_P_A_scoring                    0.674673
F_G_M_scoring                    0.310597
F_G_A_scoring                    0.274723
S_f_t_y_scoring                  0.119496
P_t_s_scoring                    0.703648
P_t_s_/_G_scoring                0.703709
D_2_P_scoring                         NaN
PF_team_offense                  0.703685
Yds_team_offense                 0.497314
Tot Yds & TO_Ply_team_offense    0.255818
Tot Yds & TO_Y/P_team_offense    0.485444
Tot Yds & TO_TO_team_offense    -0.487077
FL_team_offense                 -0.298245
1stD_team_offense                0.473765
Passing_Cmp_team_offense         0.113643
Passing_Att_team_offense        -0.118299
Passing_Yds_team_offense         0.300767
Passing_TD_team_offense          0.475131
Passing_Int_team_offense        -0.465494
Passing_NY/A_team_offense        0.569628
Passing_1stD_team_offense        0.279824
Rushing_Att_team_offense         0.466039
Rushing_Yds_team_offense         0.381405
Rushing_TD_team_offense          0.480960
Rushing_Y/A_team_offense         0.140499
Rushing_1stD_team_offense        0.434559
Penalties_Pen_team_offense      -0.113117
Penalties_Yds_team_offense      -0.078770
Penalties_1stPy_team_offense     0.103788
Sc%_team_offense                 0.643822
TO%_team_offense                -0.504452
EXP_team_offense                 0.456270
Avg_Drive_Time_Sec_drives        0.418059
Name: W_-_L_%_standings, dtype: float64
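'''
A long Series like this is easier to digest once it is ranked. A minimal sketch of
sorting correlates by magnitude, using a small made-up frame in place of df_ana
(the column names here are illustrative, not from the scrape).
'''

```python
import pandas as pd

# Hypothetical stand-in data: win percentage plus a few candidate predictors
df = pd.DataFrame({
    'win_pct':   [0.250, 0.500, 0.625, 0.750, 0.875],
    'points':    [250, 310, 340, 380, 420],
    'turnovers': [30, 25, 22, 18, 15],
    'penalties': [90, 110, 85, 120, 95],
})

# Correlation of every column with win %, dropping the trivial self-correlation
corr = df.corr(method='pearson')['win_pct'].drop('win_pct')

# Reorder so the strongest relationships (positive or negative) come first
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting on the absolute value matters here: a strong negative correlate (like turnovers) is just as informative as a strong positive one.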

'''
So, let us look at this year's data to see the variables with a coefficient above
.6 for W_-_L_%_standings.
'''
In[]: df_corr_2021 = df_ana_2021.iloc[:,8:].corr(method='pearson')
In[]: df_corr_win_loss_per_2021 = df_corr_2021[['W_-_L_%_standings']]
In[]: df_corr_win_loss_per_2021 = df_corr_win_loss_per_2021.loc[df_corr_win_loss_per_2021['W_-_L_%_standings'] > .6]
In[]: df_corr_win_loss_per_2021
Out[]: 
                               W_-_L_%_standings
W_standings                             0.995154
W_-_L_%_standings                       1.000000
Y_d_s_passing                           0.622321
T_D_passing                             0.634460
T_D_%_passing                           0.637850
Y_/_A_passing                           0.656540
A_Y_/_A_passing                         0.696654
Y_/_G_passing                           0.619605
R_a_t_e_passing                         0.702980
N_Y_/_A_passing                         0.731743
A_N_Y_/_A_passing                       0.716489
E_X_P_passing                           0.791657
Scoring_XPA_kicking                     0.787438
Scoring_XPM_kicking                     0.796545
Kickoffs_KO_kicking                     0.773950
Kickoffs_KOYds_kicking                  0.799903
Kickoffs_TB_kicking                     0.636094
Kickoffs_KOAvg_kicking                  0.679450
APYd_returns                            0.617156
Downs_3D%_conversions                   0.640967
Red Zone_RZAtt_conversions              0.788430
Red Zone_RZTD_conversions               0.783980
Sc%_drives                              0.850716
Average Drive_Yds_drives                0.722622
Average Drive_Pts_drives                0.849570
R_e_c_T_D_scoring                       0.634460
A_l_l_T_D_scoring                       0.754195
X_P_M_scoring                           0.796545
X_P_A_scoring                           0.787438
P_t_s_scoring                           0.800817
P_t_s_/_G_scoring                       0.834334
PF_team_offense                         0.800817
Yds_team_offense                        0.732371
Tot Yds & TO_Y/P_team_offense           0.679825
1stD_team_offense                       0.727823
Passing_Yds_team_offense                0.622321
Passing_TD_team_offense                 0.634460
Passing_NY/A_team_offense               0.731743
Sc%_team_offense                        0.850716
EXP_team_offense                        0.825778

'''
One figure that really sticks out is Sc%, the percentage of drives ending in an
offensive score. And this makes a lot of sense: for a team to win a lot, it needs
to be good at moving the ball down the field and then scoring.

The other obvious correlations all relate to scoring. This makes sense because to
win, a team must score.
'''
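'''
As a sanity check on what these coefficients actually measure: Pearson's r is the
covariance of two variables divided by the product of their standard deviations.
A minimal sketch with made-up numbers, verified against pandas' built-in .corr().
'''

```python
import pandas as pd

# Made-up series: season points and win percentage for five hypothetical teams
pts = pd.Series([250.0, 310.0, 340.0, 380.0, 420.0])
win = pd.Series([0.250, 0.500, 0.625, 0.750, 0.875])

# Pearson's r computed by hand: cov(x, y) / (std(x) * std(y))
r_manual = pts.cov(win) / (pts.std() * win.std())

# The same figure via the built-in method
r_pandas = pts.corr(win, method='pearson')

print(round(r_manual, 6), round(r_pandas, 6))
```

Both figures agree, which is a quick way to convince yourself the matrix above is nothing more mysterious than this ratio computed for every pair of columns.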