When I first got interested in sports analytics, I ran into a problem that many people still face.
The data is scarce and hard to come by.
There are two main ways to obtain it:
Pay for it
Gather it through web scraping
Since I don’t have tens of thousands of dollars to pay for it, I’ve always scraped it, which is something that I teach in my
“Get Free Sports Data Forever by Building a Web Scraping Pipeline”
course.
Here are my five favorite sources that we can use to get free data.
1. WhoScored.com
Key Data available:
event data from live and historical matches
WhoScored is a gold mine because it has the one thing most other websites don’t: event data.
Through this, we can access each individual pass and action taken during a match.
The downside is that xG values aren’t available, so to get those we have to combine it with a different data source.
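As a rough sketch of what combining sources can look like, the snippet below attaches xG values from a second source to a list of shot events. The field names (match_id, player, minute, xg), the attach_xg helper, and the idea of matching on player plus a small minute tolerance are all assumptions for illustration. Neither site publishes data in exactly this shape, and real matching usually also needs player-name cleanup.

```python
def attach_xg(shots, xg_rows, minute_tolerance=1):
    """Return a copy of each shot with an "xg" key taken from xg_rows.

    A shot is matched to the xG row with the same match_id and player
    whose minute is closest, as long as it is within minute_tolerance.
    Unmatched shots get xg=None.
    """
    # Group the xG rows by (match, player) so each lookup is cheap.
    by_key = {}
    for row in xg_rows:
        by_key.setdefault((row["match_id"], row["player"]), []).append(row)

    merged = []
    for shot in shots:
        candidates = by_key.get((shot["match_id"], shot["player"]), [])
        # Pick the candidate closest in time, if there is one.
        best = min(
            candidates,
            key=lambda r: abs(r["minute"] - shot["minute"]),
            default=None,
        )
        xg = None
        if best is not None and abs(best["minute"] - shot["minute"]) <= minute_tolerance:
            xg = best["xg"]
        merged.append({**shot, "xg": xg})
    return merged
```

Note that this simple version can assign the same xG row to two shots taken by the same player within a minute of each other, so it is worth checking for duplicates before trusting the totals.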
2. StatsBomb Free API
Key Data available:
Advanced event data from historical matches
The free StatsBomb API is amazing because it gives you access to high-level event data, including StatsBomb 360 data.
I recommend that anyone who is interested in learning sports analytics start here, because you get a ton of data that is very accessible.
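To give a feel for how accessible it is, here is a minimal sketch that assumes the free data is read as JSON files from the public statsbomb/open-data repository on GitHub. The helper names (events_url, fetch_events, events_of_type, total_xg) are my own for illustration. The event fields used here (type.name and shot.statsbomb_xg) follow the layout of the open-data event files.

```python
import json
from urllib.request import urlopen

# Raw-file base of the public statsbomb/open-data GitHub repository.
BASE_URL = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"


def events_url(match_id):
    """Build the URL of the events file for one match."""
    return f"{BASE_URL}/events/{match_id}.json"


def fetch_events(match_id):
    """Download and parse every event for one match (needs network access)."""
    with urlopen(events_url(match_id)) as resp:
        return json.load(resp)


def events_of_type(events, type_name):
    """Keep only events whose type name matches, e.g. "Pass" or "Shot"."""
    return [e for e in events if e.get("type", {}).get("name") == type_name]


def total_xg(events):
    """Sum the StatsBomb xG of every shot in a list of events."""
    return sum(e["shot"]["statsbomb_xg"] for e in events_of_type(events, "Shot"))
```

Match IDs come from the matches files in the same repository. If you would rather not handle the files yourself, the statsbombpy package wraps the same data.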
3. FBref.com
Key Data available:
aggregated and high-level data
FBref is useful because the data is already aggregated for us. While having individual actions is great, we can use FBref to get aggregated values over weeks, seasons, careers, etc.
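To make the difference between event data and aggregated data concrete, here is a small sketch that rolls individual actions up into per-player totals, which is roughly the kind of summary FBref hands you ready-made. The event shape (a dict with player and type keys) and the aggregate_actions helper are assumptions for illustration, not FBref's format.

```python
from collections import Counter, defaultdict


def aggregate_actions(events):
    """Count how many actions of each type every player made."""
    totals = defaultdict(Counter)
    for event in events:
        totals[event["player"]][event["type"]] += 1
    # Convert to plain dicts so the result is easy to print or serialize.
    return {player: dict(counts) for player, counts in totals.items()}
```

Run over a whole season of event files, the same idea gives season totals. The point of FBref is that you can skip this step when totals are all you need.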
4. SofaScore.com
Key Data available:
xG values for shots
I’ve been using SofaScore for a while to get the xG values for shots. I like it because the data comes from Opta. The downside is that it is pretty difficult to scrape, since you have to access their API and they do all they can to keep you from doing so.
5. Understat.com
Key Data available:
xG values for shots
This is one I use as a backup to SofaScore.
It’s very easy to scrape and has good information about matches, shots, etc.
The downside is that, as far as I know, the xG values come from their own model, so it’s not as precise as an Opta or StatsBomb model, but it is still really good.
Those are my five favorites!
If you want to get started with them, I have YouTube videos dedicated to some of them on my channel, which will walk you through accessing and building scrapers in Python.
I also love community wrappers such as soccerdata, which let you sidestep writing web scraping scripts where possible.