Building a model to predict IPL match outcomes

App Link with final predictions - https://share.streamlit.io/arpitsolanki/ipl-prediction-engine/main/app.py

Code - https://github.com/arpitsolanki/IPL-Prediction-Engine

Background

Over the last few months I’ve been trying to combine my two passions - sports & data analytics. It all started with using analytics to get better at Fantasy League for Football, and I also recently participated in the annual Kaggle competition for predicting outcomes of NCAA matches. While I enjoyed both projects, I found that my limited knowledge of Football & Basketball restricted my ability to come up with good features & models. I therefore decided to pick Cricket, a sport I understand really well and have been following for a long time.

Cricket traditionally hasn’t been known for widespread use of data to inform tactics & team selection, at least not to the extent of sports like Baseball & Football. However, with the growing popularity of T20 over the last 10 years, things have been changing rapidly in Cricket as well, with teams employing performance analysts & using data to get any possible edge over the opposition. This has spilled over to cricket broadcasts & journalism, with commentators bringing in more visualizations & journalists building their stories around data. A great example is Jarrod Kimber, who uses a lot of analytics in his video blogs to deep dive into matches & explain why teams performed the way they did. I’d been watching all of this closely over the years & decided to get my hands dirty by combining my love for cricket with my skills as a data analyst in a project that predicts the outcomes of IPL matches.

Why is the IPL a difficult tournament to predict accurately?

I did check out a few solutions available on the internet predicting match outcomes, but they all had low accuracy scores, sometimes worse than a coin toss. I think this could be due to the following reasons -

Upsets are more likely in T20 than in any other cricket format. In cricket, as in any other sport, the longer a match goes on, the better the stronger team’s chance of overcoming poor passages of play and winning the game. Since a T20 innings lasts just 120 balls, a few moments of individual brilliance can change the outcome of a game.
IPL teams have a fixed purse to spend on players, which ensures a level playing field. The gap in quality between the best & worst teams isn’t as large as in some other competitions like EPL, NBA etc.
The format of the tournament isn’t consistent - squads get changed every few years, the tournament gets played outside India every now and then, and teams keep getting added or banned. This makes it difficult to get a high volume of high quality data & the factors above add a lot of noise.

As mentioned above, only a few teams have consistently dominated the IPL over a long time. A team having a good season can suddenly become poor the next season. The chart above shows the percentage of matches won by IPL teams over different seasons. There aren’t many relatively straight lines in it - win rates fluctuate wildly from one season to the next. Two teams have had longer successful spells - Mumbai Indians & Chennai Super Kings - but even they’ve had a poor run of seasons every now and then.

Gathering Data

I had initially thought that getting access to good quality data for IPL matches would be a challenge, but a few Google searches helped me discover cricsheet.org - an absolute gem of a website when it comes to cricket data. It tracks data for all major domestic & international competitions going back several years. The dataset available for the IPL is fairly granular - ball by ball data for all the IPL matches that have happened so far. This is what the data looks like -

[Image: sample of the ball by ball data]

The data is available for all the matches, tracking information around venue, batsman, bowler, extras, dismissals etc. However, it does require some amount of cleaning & wrangling to get it into a format ready for making predictions.

Data Preparation

Although the data is available in a reasonably clean format, there are still some operations that need to be performed to get it into the desired format -

Ensure team & stadium names are consistent across seasons. For example, the Hyderabad team shows up as both Deccan Chargers & Sunrisers Hyderabad across seasons; such instances need to be treated as the same team.
Roll up the ball level data to match level data, with a clear winner identified for each match.
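The two steps above can be sketched in plain Python. This is only an illustration and not the code from the repository: the field names (`match_id`, `batting_team`, `runs`) and the alias table are assumptions made for this example rather than the actual cricsheet schema, and the winner is taken to be the side with the higher total.

```python
from collections import defaultdict

# Hypothetical alias table: map every historical name to one canonical name.
TEAM_ALIASES = {
    "Deccan Chargers": "Sunrisers Hyderabad",
}


def canonical_team(name):
    """Return the canonical team name, falling back to the name itself."""
    return TEAM_ALIASES.get(name, name)


def roll_up(balls):
    """Collapse ball-level rows into one summary dict per match.

    Each row is assumed to look like
    {"match_id": ..., "batting_team": ..., "runs": ...}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for ball in balls:
        team = canonical_team(ball["batting_team"])
        totals[ball["match_id"]][team] += ball["runs"]

    matches = []
    for match_id, team_runs in totals.items():
        ranked = sorted(team_runs.items(), key=lambda item: item[1], reverse=True)
        # Equal totals have no clear winner in this simplified sketch.
        tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        matches.append({
            "match_id": match_id,
            "runs": dict(team_runs),
            "winner": None if tied else ranked[0][0],
        })
    return matches


# Toy ball-level rows, made up for illustration.
balls = [
    {"match_id": 1, "batting_team": "Deccan Chargers", "runs": 4},
    {"match_id": 1, "batting_team": "Deccan Chargers", "runs": 2},
    {"match_id": 1, "batting_team": "Mumbai Indians", "runs": 1},
    {"match_id": 2, "batting_team": "Sunrisers Hyderabad", "runs": 1},
    {"match_id": 2, "batting_team": "Mumbai Indians", "runs": 6},
]
matches = roll_up(balls)
```

Rain-affected matches and super overs don’t follow the simple "higher total wins" rule, so where the source data records the result directly it is safer to use that than to recompute it from totals.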

Feature Engineering

As with any modelling problem, domain knowledge is very important to come up with good features relevant to the problem statement. Very often the quality of data is just as important as the choice of the model for making predictions. Based on my experience of watching cricket broadcasts over the years, I came up with a list of factors which I think play a role in determining match outcomes. Exploratory analysis of data on these factors along with several iterations during the model building stage will help us select the optimal set of features which maximizes prediction accuracy. Here are the categories of features I came up with intuitively -

Venue Stats

Home/Away fixture - Unlike in football & some other sports, pitch conditions in cricket play a huge role in determining the outcome of a match. Teams build squads to suit their home ground pitch conditions - for example, Chennai going spin heavy to suit the spinning pitches at Chepauk.
Venue Win Rates - As some seasons are played at neutral venues, it’s good to keep track of a team’s performance at every individual venue.

Runs & Wicket Stats

Powerplay Batting & Bowling Stats - Average runs scored & wickets lost while batting, and average runs conceded & wickets taken while bowling, during the powerplay
Innings Batting & Bowling Stats - The same averages for batting & bowling over the entire innings

Win Rate Stats

Team Overall Win Rates - Even though we saw that a team’s win rate can fluctuate a lot across seasons, it’s still a good metric to separate the likes of MI & CSK from the rest
Head2Head - Head to head records of teams against each other
Season Performance Stats - Team’s position on points table & number of matches won in a season is a good indicator of recent form and could be useful for predicting the winner. For ex. you would expect a team at the top of the table to beat the team at the bottom of the table most times during a season.

Let’s start diving deep into some of these feature groups and look at which ones correlate with a team’s win rate.

Venue Stats

[Image: home vs away win rates by team]

As you can see above, most teams seem to do better in their home conditions than at away grounds. Teams play at least half of their matches at their home ground every season. Pitch characteristics & ground sizes vary a lot across India, with some pitches like Chennai being spin bowler friendly while others like Bangalore are a batting paradise. Teams therefore build their squads to best suit their home ground conditions, which gives them an advantage over visiting teams. Home support is also a big factor, as all IPL games are well attended and fan bases are quite strong.

Rajasthan & Chennai both win almost 70% of their matches on their home grounds. No team other than Mumbai & Chennai has a win rate of more than 50% on away grounds. This versatility across conditions is probably the reason why Mumbai & Chennai are the strongest teams in the IPL. Let’s also look at the distribution of win rates across different grounds to see how much an individual team’s performance varies from ground to ground -

[Image: violin plot of team win rates across venues]

As you can see in the chart above, a team’s win rate fluctuates quite a bit from venue to venue. The violins show that even within away venues there is a huge variation in win rates. This supports the point that pitch & ground conditions play a huge role in deciding match outcomes and should not be ignored while creating the model.
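A home/away win rate of the kind discussed above could be computed roughly as follows. This is a minimal sketch in plain Python, assuming match summaries with hypothetical `team1`, `team2`, `venue` and `winner` fields and a hand-written `home_grounds` mapping; the rows are toy data, not real results.

```python
from collections import defaultdict


def home_away_win_rates(matches, home_grounds):
    """Return {team: {"home": rate, "away": rate}} from match summaries.

    A venue that is home to neither side counts as "away" for both teams.
    A rate is None when the team has played no matches on that side.
    """
    played = defaultdict(lambda: {"home": 0, "away": 0})
    won = defaultdict(lambda: {"home": 0, "away": 0})
    for match in matches:
        for team in (match["team1"], match["team2"]):
            side = "home" if home_grounds.get(team) == match["venue"] else "away"
            played[team][side] += 1
            if match["winner"] == team:
                won[team][side] += 1

    return {
        team: {
            side: (won[team][side] / count if count else None)
            for side, count in counts.items()
        }
        for team, counts in played.items()
    }


# Toy data, made up for illustration.
home_grounds = {"Chennai Super Kings": "Chepauk", "Mumbai Indians": "Wankhede"}
venue_matches = [
    {"team1": "Chennai Super Kings", "team2": "Mumbai Indians",
     "venue": "Chepauk", "winner": "Chennai Super Kings"},
    {"team1": "Mumbai Indians", "team2": "Chennai Super Kings",
     "venue": "Wankhede", "winner": "Chennai Super Kings"},
    {"team1": "Chennai Super Kings", "team2": "Mumbai Indians",
     "venue": "Chepauk", "winner": "Chennai Super Kings"},
    {"team1": "Mumbai Indians", "team2": "Chennai Super Kings",
     "venue": "Wankhede", "winner": "Mumbai Indians"},
]
rates = home_away_win_rates(venue_matches, home_grounds)
```

Grouping by the venue itself instead of the home/away flag would give the per-venue win rates in the same way.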

Batting & Bowling Stats

[Image: scatter plots of team win rates against powerplay & innings batting & bowling averages]

The scatter plots above show the relationship between teams’ win rates & their average batting & bowling stats for the powerplay & the full innings. Some of these stats, like mean runs scored while batting first, show a correlation with win rate, while others, like runs scored during the powerplay while batting first, don’t show any relationship with it. Studying these scatter plots helps us understand which variables correlate with win rate & guides the feature selection & feature engineering stages of the exercise.
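As a rough sketch of how the powerplay batting averages could be derived from ball level data, the snippet below sums runs and wickets over the first six overs of each innings and then averages per team. The row layout (`over` numbered from 0, a boolean `wicket` flag) is an assumption for this example, and the numbers are toy data.

```python
from collections import defaultdict
from statistics import mean

POWERPLAY_OVERS = 6  # the first six overs of a T20 innings


def powerplay_summary(balls):
    """Return per-team mean powerplay runs scored and wickets lost.

    Rows are assumed to look like {"match_id", "batting_team", "over",
    "runs", "wicket"}, with overs numbered from 0 and "wicket" a bool.
    """
    per_innings = defaultdict(lambda: {"runs": 0, "wickets": 0})
    for ball in balls:
        if ball["over"] < POWERPLAY_OVERS:
            key = (ball["match_id"], ball["batting_team"])
            per_innings[key]["runs"] += ball["runs"]
            per_innings[key]["wickets"] += int(ball["wicket"])

    by_team = defaultdict(list)
    for (_, team), stats in per_innings.items():
        by_team[team].append(stats)

    return {
        team: {
            "mean_runs": mean([s["runs"] for s in innings]),
            "mean_wickets": mean([s["wickets"] for s in innings]),
        }
        for team, innings in by_team.items()
    }


# Toy ball-level rows, made up for illustration.
pp_balls = [
    {"match_id": 1, "batting_team": "MI", "over": 0, "runs": 4, "wicket": False},
    {"match_id": 1, "batting_team": "MI", "over": 2, "runs": 0, "wicket": True},
    {"match_id": 1, "batting_team": "MI", "over": 5, "runs": 6, "wicket": False},
    {"match_id": 1, "batting_team": "MI", "over": 6, "runs": 4, "wicket": False},
    {"match_id": 2, "batting_team": "MI", "over": 1, "runs": 2, "wicket": False},
    {"match_id": 2, "batting_team": "MI", "over": 3, "runs": 0, "wicket": True},
    {"match_id": 2, "batting_team": "MI", "over": 4, "runs": 0, "wicket": True},
]
summary = powerplay_summary(pp_balls)
```

The ball in over 6 of match 1 is ignored because it falls outside the powerplay. Keying on the bowling team instead would give the corresponding bowling averages.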

Win Rate Stats

We’re looking at multiple win rate stats like head to head records of teams against each other, overall win rates since the beginning of the IPL, and the total number of wins in the ongoing season. Let’s take a look at one feature - difference in number of wins for the ongoing season.

[Image: match outcomes against the difference in season wins between the two teams]

As we can see above, current form in the season plays an important role in determining match outcomes. Teams that have won a significantly higher number of matches compared to the opposition in the season tend to win more often.
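One thing to be careful about with a form feature like this is that it should only use matches played before the one being predicted. A minimal sketch, assuming chronologically ordered match summaries with hypothetical `season`, `team1`, `team2` and `winner` fields (toy data below):

```python
from collections import defaultdict


def add_season_win_diff(matches):
    """Annotate each match with team1's season wins minus team2's, so far.

    `matches` must be in chronological order. Only earlier matches count,
    so the feature never peeks at the result it is meant to predict.
    """
    wins = defaultdict(int)  # (season, team) -> wins before the current match
    rows = []
    for match in matches:
        season = match["season"]
        diff = wins[(season, match["team1"])] - wins[(season, match["team2"])]
        rows.append({**match, "season_win_diff": diff})
        # Update the tally only after the feature has been recorded.
        if match["winner"] is not None:
            wins[(season, match["winner"])] += 1
    return rows


# Toy fixtures, made up for illustration.
season_matches = [
    {"season": 2019, "team1": "MI", "team2": "CSK", "winner": "MI"},
    {"season": 2019, "team1": "MI", "team2": "RR", "winner": "MI"},
    {"season": 2019, "team1": "CSK", "team2": "MI", "winner": "CSK"},
    {"season": 2020, "team1": "MI", "team2": "CSK", "winner": "CSK"},
]
featured = add_season_win_diff(season_matches)
```

Because the tally is keyed by season, the difference starts again from zero at the first match of 2020.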

Model Design & Development

[Image: model design & development flow]

Summarizing the image above -

Model development - Happens once per season, using data from all the previous seasons as training data
Test Data Update - Happens once every week taking into account change in team’s batting & bowling stats as well as current season’s performance. This test data is used to make predictions every week.

The aim of the model is to predict the winning team for every match. We use Random Forest, a tree based ensemble method, for making our predictions. After several iterations of hyperparameter tuning & feature selection, here is the final feature importance plot we got for our model.

[Image: feature importance plot]

Current season form is the most important feature, followed by the difference in mean runs scored and overall win rates in the IPL. Since we only have about 500 data points, it’s good to restrict the depth of the trees & the number of features going into the model to avoid overfitting.

We also tested the model on different seasons to get an understanding of its average accuracy over multiple seasons. As mentioned above, to make predictions for a season we train the model on data up to the end of the previous season and predict the current season. Using this method the accuracy of the model generally lies in the range of 65%-70%, though it fell to 61% for the 2018 season.
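The season by season evaluation described above can be sketched as a walk-forward loop. To keep the example dependency free, the model here is a deliberately simple stand-in (majority label per sign of the season win difference), not the Random Forest used in the project; a scikit-learn `RandomForestClassifier` could be wrapped in the same `fit` interface. The row layout and the data are made up for illustration, with label 1 meaning team1 won.

```python
from collections import Counter, defaultdict


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def fit_bucket_majority(train):
    """Stand-in learner: majority label per sign of the season win difference."""
    votes = defaultdict(Counter)
    for row in train:
        votes[sign(row["features"]["season_win_diff"])][row["label"]] += 1
    majority = {b: c.most_common(1)[0][0] for b, c in votes.items()}
    # Buckets never seen in training default to backing team1 (label 1).
    return lambda features: majority.get(sign(features["season_win_diff"]), 1)


def walk_forward_accuracy(rows, fit, first_test_season):
    """Train on all seasons before each test season and score on that season.

    `fit` takes the training rows and returns a predict(features) function.
    """
    seasons = sorted({r["season"] for r in rows})
    scores = {}
    for season in seasons:
        if season < first_test_season:
            continue
        train = [r for r in rows if r["season"] < season]
        test = [r for r in rows if r["season"] == season]
        if not train or not test:
            continue
        predict = fit(train)
        hits = sum(predict(r["features"]) == r["label"] for r in test)
        scores[season] = hits / len(test)
    return scores


def make_row(season, diff, label):
    """Build one toy training row."""
    return {"season": season, "features": {"season_win_diff": diff}, "label": label}


# Toy rows, made up for illustration.
rows = [
    make_row(2017, 2, 1), make_row(2017, -1, 0), make_row(2017, 1, 1),
    make_row(2018, 3, 1), make_row(2018, -2, 1), make_row(2018, -1, 0),
    make_row(2019, -1, 0), make_row(2019, 1, 1), make_row(2019, -3, 1),
    make_row(2019, 2, 1),
]
scores = walk_forward_accuracy(rows, fit_bucket_majority, first_test_season=2018)
```

Each test season is scored by a model that has never seen that season, which is what makes the per-season accuracies comparable to how the model would be used in practice.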

[Image: model accuracy by season]

Generally you would expect the model to get better with every passing year as more data becomes available, but that isn’t the case with the IPL due to the reasons mentioned earlier - complete changes in squads due to mega auctions, the tournament shifting to neutral venues etc.

Alternative approaches

While this model gives us reasonable accuracy, there is still a lot of scope for improvement. We did not consider toss outcomes, match importance for teams, day/night status or squad strength based on the players involved in the match in our predictions. Adding these & other features could potentially improve the model accuracy.

While researching methods for predicting sports outcomes, I also came across the fivethirtyeight.com method of creating Elo ratings for teams based on squad strength and then running Monte Carlo simulations thousands of times to predict the outcomes of matches & seasons. They’ve used this approach a number of times for sports like Football, Basketball, Baseball etc. I find it fascinating and will be looking to try it out for upcoming seasons & tournaments as well.
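To give a flavour of the idea, here is a minimal sketch of the textbook Elo formulas plus a tiny Monte Carlo loop. It is a generic illustration and not FiveThirtyEight’s actual method: the K value of 20, the starting ratings and the function names are assumptions for this example, and squad strength adjustments are not modelled.

```python
import random


def expected_score(rating_a, rating_b):
    """Standard Elo win expectancy for side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, a_won, k=20.0):
    """Return both ratings after one match; k controls the step size."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def simulate_win_share(rating_a, rating_b, n_sims=10_000, seed=0):
    """Monte Carlo estimate of how often A beats B, with ratings held fixed."""
    rng = random.Random(seed)
    p_a = expected_score(rating_a, rating_b)
    wins = sum(rng.random() < p_a for _ in range(n_sims))
    return wins / n_sims


# Two evenly rated teams: the winner gains exactly what the loser gives up.
new_a, new_b = update(1500.0, 1500.0, a_won=True)

# A 100 point rating gap corresponds to roughly a 64% win expectancy.
share = simulate_win_share(1600.0, 1500.0)
```

Simulating a whole season would repeat the same draw over every fixture, updating ratings after each simulated result, and then count how often each team finishes on top across the runs.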

Final Comments

This was a fun exercise and took me about two weekends to complete, from data gathering & manipulation to model building. The final predictions have been made available on a Streamlit app, and the code is available in the GitHub repository linked at the top of this post.

Hope you guys enjoyed reading it. Please feel free to leave your feedback in comments.

PS - Unfortunately the IPL season had to be postponed due to the Covid situation in India, but this was a fun project & a good starting point for me to use data for cricket. I will look to build on this work when normalcy is restored to life & cricket.