Using ML to get better at FPL
Background
2020 was a particularly difficult year for everyone, with so much suffering around the world & life coming to a standstill for those stuck in their homes during lockdowns. As a sports lover, I found it particularly difficult to get used to life without any live sport. Summers in India are packed with excitement: home Test series at the start of the year, followed by the IPL through April & May. Without access to any live sport, I turned to OTT platforms and started watching documentaries about football clubs like Man City, Leeds United, Tottenham Hotspur and Sunderland. I finished them all in no time, and though I’ve been a cricket fan most of my life & football for me had meant only international competitions like the World Cup & Euros, I started getting drawn to club football.
As the spread of Covid started reducing around the world, football was the first sport to break through the shackles, & Premier League football resumed around July. I got hooked on the sport and at the same time was introduced to the world of Fantasy Football. I decided to give it a shot with the new season beginning in September. Because I was still very new to football, my knowledge of players’ capabilities in the PL was limited, which led me to make several poor decisions. I also found myself biased towards the clubs I liked & filling my squad with players from those clubs. This led to a poor showing over the first few gameweeks of FPL. I decided to put my data analysis skills to work, using them to compensate for my limited understanding of the PL, understand how different players compare against each other & make transfer decisions backed by data.
Gathering Data
Following is the list of data sources used for the analysis:
- Vaastav’s FPL GitHub - I think this is the best available data repository for FPL, containing historical data on player & team performances going back several years, refreshed on a weekly basis. Data is available here
- FPL API - I used the FPL API to get information on upcoming fixtures & the fitness status of players for a gameweek. Details on how to access the API can be found here
- FiveThirtyEight score predictions - I really like Nate Silver’s fivethirtyeight.com, which uses analytics & data science to predict the outcomes of several real-life events, including expected scores for PL games. I use this dataset to capture the level of difficulty of a fixture & score predictions
Data Overview
Our base dataset for this analysis is the GitHub repository mentioned above. It contains various data points around a player’s performance in a particular match, as shown below -
We combine the base dataset with additional data points from the FPL API & FiveThirtyEight datasets. Once the data is ready, the first thing we do is look at the distribution of points scored by players during a gameweek. As the chart below suggests, a huge majority of players score 2 points or fewer in a gameweek. Events providing returns like assists, goals & clean sheets are quite rare. The high number of players scoring zero can be attributed to the fact that all Premier League clubs have big squads of around 25 players, of whom only about 12-13 feature in a game. Since the distribution of scores above 2 points is quite scattered, it’s quite difficult to predict the exact number of points a player will score.
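The skew towards blanks can be checked directly from the points column. Here is a minimal sketch on made-up points for a single gameweek (the values are illustrative, not real FPL data):

```python
import pandas as pd

# Made-up points for 20 players in one gameweek.
points = pd.Series([0, 0, 1, 2, 2, 0, 2, 6, 2, 1, 0, 9, 2, 3, 1, 0, 2, 2, 5, 1])

# Share of players scoring 2 points or fewer ("blanking").
blank_share = (points <= 2).mean()
```

On the real merged gameweek data the same one-liner gives the blank rate per gameweek when combined with a groupby.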
Let’s now look at how the proportion of blanking (<=2 pts) vs not blanking (>2 pts) varies by player position. Goalkeepers are most likely to blank based on the chart below, which seems logical since the only way GKs generally gain points is by keeping clean sheets. For every other position, the proportions are quite similar.
After taking a look at the above data points, I decided it would be better to build a classification model rather than a regression model for this exercise. The objective of the model is to identify the players most likely to score more than 2 points in a week. Since most players score 2 points or fewer in a week, this is an example of an imbalanced classification problem. We’ll be building tree-based classifiers for this problem.
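Framing the task as classification just means deriving a binary label from total points. A sketch on a hypothetical slice of the data (the column names here are assumptions, not necessarily the exact schema of the repository):

```python
import pandas as pd

# Hypothetical slice of the merged gameweek data.
df = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C", "Player D"],
    "total_points": [12, 2, 6, 1],
})

# Binary target: 1 if the player returned more than 2 points, 0 if they blanked.
df["target"] = (df["total_points"] > 2).astype(int)
```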
Feature Engineering
The quality of predictions for any model is directly correlated with the quality of the features fed into it. I found the data for this project to be very rich & clean overall, so a lot of variables from the raw datasets could be used directly as features. On top of that, I added several features of my own to capture form & opponent strength. After going through several iterations of model training, here is the list of features I came up with -
Player Performance
- Influence, Creativity & Threat metrics
- Rolling average of points scored during last four weeks
- Position of a player
- Player’s contribution to team’s total points
- Rolling average of minutes played during last four weeks
- Goals scored, assists & clean sheets kept
- Yellow & Red Cards
- Number of incoming transfers by FPL managers during a game week
Team Performance
- Diff. between team & opponent’s position in the points table
- Team’s form over last four weeks
- Home or away fixture
- Projected scores from fivethirtyeight
- Total points scored by all players in the team
- Penetrations into opponent’s box & number of penetrations allowed
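The rolling-average features above can be built with a grouped rolling mean. Shifting by one gameweek keeps the current week’s points out of its own feature, so the model never sees the outcome it is predicting. A pandas sketch on made-up data:

```python
import pandas as pd

# Made-up per-gameweek points for two players.
df = pd.DataFrame({
    "player": ["A"] * 6 + ["B"] * 6,
    "gameweek": list(range(1, 7)) * 2,
    "points": [2, 6, 1, 2, 9, 2, 0, 1, 2, 5, 2, 6],
})

# Rolling mean over the previous four gameweeks, shifted so the current
# week's points never leak into its own feature.
df["form_last4"] = (
    df.groupby("player")["points"]
      .transform(lambda s: s.shift(1).rolling(4, min_periods=1).mean())
)
```

The first gameweek for each player has no history, so its feature is NaN; those rows can be dropped or imputed before training.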
Model Design
My workflow uses historical data for the entire season up to the latest gameweek to train the model & then makes predictions for the upcoming week. The predictions include a list of the 11 players most likely to score more than 2 points during the gameweek. The project is deployed as a pipeline that runs every week.
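A minimal sketch of this weekly train-then-predict loop, on synthetic data with made-up feature names (the real pipeline uses the full feature set listed above):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic frame: one row per player per gameweek.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gameweek": np.repeat(np.arange(1, 11), 50),   # gameweeks 1-10, 50 players each
    "form_last4": rng.normal(3, 1, 500),
    "minutes_last4": rng.uniform(0, 90, 500),
    "target": rng.integers(0, 2, 500),             # 1 = scored more than 2 points
})

upcoming = 10
features = ["form_last4", "minutes_last4"]

# Train on all completed gameweeks, predict the upcoming one.
train = df[df["gameweek"] < upcoming]
test = df[df["gameweek"] == upcoming]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[features], train["target"])
proba = model.predict_proba(test[features])[:, 1]

# The 11 players most likely to return points this week.
top11 = test.assign(p=proba).nlargest(11, "p")
```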
Model Development
As seen in the data earlier, the majority of players tend to blank during a particular gameweek (score 2 points or fewer for an appearance). Since football is a low-scoring sport & events like goals can be quite random, it is very hard to predict the exact number of points a player will score during a gameweek. Therefore I turned this into a two-class classification problem where I’m just trying to predict whether a player will blank in a particular gameweek or score more than 2 points.
I decided to train tree-based ensemble models using Random Forests & XGBoost and used a weighted average of the predictions produced by both models. The final output of the model is a dataset with the probability of each player not blanking during the upcoming gameweek. Here is the feature importance plot for the random forest model -
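The blend itself is just a weighted average of the two models’ predicted probabilities. In this sketch, sklearn’s GradientBoostingClassifier stands in for XGBoost to keep the example self-contained, and the 50/50 weight is illustrative, not the weight used in the project:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset standing in for the player-gameweek features.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.7], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
gb = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)

# Weighted average of the two models' probabilities of not blanking.
w = 0.5
p_ensemble = w * rf.predict_proba(X_te)[:, 1] + (1 - w) * gb.predict_proba(X_te)[:, 1]
```

The weight can be tuned on a held-out set of gameweeks rather than fixed at 0.5.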
As we all know, the 2020-21 season has been a pretty weird one, with teams going through runs of good & bad form. The model seems to recognize that as well, & the features built on rolling averages of the last four weeks (points scored, ICT index etc.) tend to have a lot of importance. In particular, this model answers the classic conundrum of FPL managers, form vs fixtures, in favor of form.
This model had an overall accuracy of 78%, but given that this is an imbalanced classification problem & we care about accurately identifying the top 11 players likely to return points, the true positive rate of the model is the more relevant metric.
The charts above show that the predicted probability distribution is heavily skewed to the left, with very few players having a predicted probability over 0.5. The AUC for the ROC curve is 0.75 & the true positive rate is 70%. That’s not bad for an initial model, but there’s definitely room for improvement in future iterations.
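Both metrics fall out of sklearn directly. A sketch on toy labels & probabilities (not the project’s actual predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy ground truth (1 = returned points) and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
p_pred = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.15, 0.6, 0.3, 0.05])

auc = roc_auc_score(y_true, p_pred)

# Threshold at 0.5 and compute the true positive rate:
# the share of returning players the model correctly flagged.
y_hat = (p_pred >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
tpr = tp / (tp + fn)
```

Since we only pick the top 11 players by probability each week, precision-at-11 would be another natural metric to track.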
Output
The final output of the model is a list of the 11 players most likely to score more than 2 points during the upcoming gameweek. The workflow runs every gameweek & the output predictions are available on the Streamlit website created here.
I also compare the points scored in previous gameweeks by my model’s team against the actual dream team for the week & the average score for the gameweek. So far, the team predicted by the model is doing slightly better than the average human on points scored every gameweek: over the six gameweeks the model has been running, the model’s team scored 249 points vs a cumulative average score of 239 points.
The hope is that the model will continue to improve its predictions through the season as it gains access to a larger volume of training data.
Code for the entire project can be found here
Next Steps & Improvement
I took up this project over Christmas to understand PL football better & to learn to use Streamlit for dashboarding. During development I identified a few things that could be improved -
- Gather different metrics for attacking & defensive footballers and train different models for each
- Use linear programming to maximize the predicted team’s expected points while ensuring the team budget doesn’t exceed 100M pounds

I’ll be looking to work further on this & hopefully improve the performance of the model.
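That budget-constrained selection can be sketched as a small mixed-integer program, here using scipy.optimize.milp on made-up prices & predicted points (a real version would also enforce formation & per-club quotas):

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# Made-up prices (in millions) and predicted points for 15 candidate players.
prices = np.array([11.5, 9.0, 7.5, 6.0, 5.5, 8.0, 10.0, 4.5,
                   6.5, 7.0, 9.5, 5.0, 8.5, 12.0, 4.0])
points = np.array([7.2, 5.8, 5.1, 4.0, 3.9, 5.5, 6.4, 3.2,
                   4.4, 4.8, 6.0, 3.5, 5.6, 7.8, 3.0])

n = len(points)
budget = 100.0

constraints = [
    LinearConstraint(prices, ub=budget),         # total cost <= 100M
    LinearConstraint(np.ones(n), lb=11, ub=11),  # pick exactly 11 players
]

# milp minimises, so negate predicted points to maximise them.
res = milp(c=-points, constraints=constraints,
           integrality=np.ones(n),               # binary pick/don't-pick variables
           bounds=Bounds(0, 1))

selected = np.flatnonzero(res.x > 0.5)           # indices of the chosen players
```

Position rules (e.g. exactly one goalkeeper) and the three-players-per-club limit slot in as additional LinearConstraint rows over indicator vectors.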