We learnt about NLG in our previous post. In this post let us look at building a fun NLG system.
Cricket is a religion in India. Whenever a match is on, everyone in India is glued to their TV sets or following the scores on a website like ESPNcricinfo (https://www.espncricinfo.com/).
The ball-by-ball live commentary by Cricinfo is legendary. When you look at the ball-by-ball commentary, you immediately see a pattern. And where there is a pattern, can AI be far behind? Ideally, if we have enough video footage and corresponding text commentary, we can build a real-time commentary system which will look at the video and write commentary. That is the topic of a different post. But in this post we will see if we can build a predictive commentator. That’s more fun. Can a system commentate on a hypothetical ball bowled at a certain stage by a certain bowler to a certain batsman?
With the advent of GPT2, we have had AI which writes stories, AI which writes music, AI which writes songs etc. So we thought, let's use GPT2 to build ourselves an AI commentator. Now we use this commentator whenever a match is ongoing. We feed it data about the next ball and voila, it predicts the result! Not bad for a day's work.
How did we build this AI commentator?
The base for any good AI is its dataset. Getting a clean dataset is a very tough problem. But thanks to the wonderful people at Cricinfo, we have access to a very good one: the ball-by-ball commentaries. Cricinfo commentators follow a set format for commenting on each ball. It goes something like below:
Great. The structure seems to be Over, Result, Bowler to Batsman, Result, Shot description, Detailed commentary.
But it looks like we have to scrape lots of web pages. Let's look underneath and see if a better structure is available. By opening the dev console and observing the web requests, we find out that the commentary is actually a JSON object. Bingo. So we have a clean dataset organized in a nice form for us. Because we don’t want people to bombard Cricinfo with web requests, we are not open sourcing the downloader script. We will reach out to Cricinfo and if they are OK with it, we will make it public.
We downloaded the commentary for around 40 of India's matches from the Cricinfo site and used that to fine-tune a GPT2 model. The good thing is that a GPT2 model just needs plain text as training data, with one sentence per line. So there is actually very little data cleaning to do.
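To make the "one sentence per line" idea concrete, here is a minimal sketch of flattening commentary records into training text. The field names (`over`, `bowler`, `batsman`, `result`, `text`) are assumptions made up for illustration and are not Cricinfo's actual JSON schema; the sample record simply reuses the example ball quoted later in this post.

```python
import json

# Hypothetical record layout: these field names are assumptions for
# illustration only, not Cricinfo's real JSON schema.
SAMPLE_JSON = """
[
  {"over": "31.4", "bowler": "Agar", "batsman": "Kohli",
   "result": "1 run",
   "text": "length ball in the channel, defended back towards mid-off."}
]
"""

def to_training_line(ball: dict) -> str:
    """Flatten one ball record into a single comma-separated line."""
    # Collapse newlines and repeated spaces so each ball stays on one line.
    text = " ".join(ball["text"].split())
    return f'{ball["over"]},{ball["bowler"]} to {ball["batsman"]}, {ball["result"]},{text}'

def build_corpus(raw_json: str) -> str:
    """Return one line of text per ball, ready to be written to a training file."""
    balls = json.loads(raw_json)
    return "\n".join(to_training_line(b) for b in balls)

print(build_corpus(SAMPLE_JSON))
```

Keeping every ball on its own line, in the same field order, is what lets the model pick up the format without any extra labelling.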
There are a lot of ready-made Colab notebooks and you can use any one of those for fine-tuning. You can also use a Python package like https://github.com/minimaxir/gpt-2-simple for fine-tuning.
Also thanks to Huggingface for making it so easy to play around with text.
Because the amount of data was comparatively small, we were able to fine-tune within an hour.
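For readers who want to try this, a fine-tuning run with gpt-2-simple might look roughly like the sketch below. The file name `commentary.txt`, the `124M` model size, the step count and the sampling settings are all assumptions for illustration; the post does not record which values we actually used. Running it needs gpt-2-simple and TensorFlow installed, and it downloads the base model the first time.

```python
def fine_tune_and_sample(corpus_path: str = "commentary.txt",
                         model_name: str = "124M",
                         steps: int = 1000,
                         prefix: str = "31.4,Agar to Kohli,") -> list:
    """Fine-tune GPT-2 on a one-line-per-ball corpus, then sample completions.

    All default values are illustrative assumptions, not the settings
    used for the bot described in this post.
    """
    # Imported inside the function so this file still loads on machines
    # where gpt-2-simple is not installed.
    import gpt_2_simple as gpt2

    gpt2.download_gpt2(model_name=model_name)  # fetch the base model
    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess, dataset=corpus_path, model_name=model_name, steps=steps)

    # Ask the fine-tuned model to continue a partly filled commentary line.
    return gpt2.generate(sess, prefix=prefix, length=60, temperature=0.7,
                         nsamples=3, return_as_list=True)

# Usage (slow, and needs a GPU to be practical):
# for line in fine_tune_and_sample():
#     print(line)
```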
Without further ado, our cricket commentary bot:
The cool thing is this AI engine can be used for a lot of things. We can use it as a predictive engine. If it is the 31st over and 4th ball and Agar is bowling to Kohli, what is the outcome? If we feed this to our engine we get an output like “31.4,Agar to Kohli, 1 run,length ball in the channel, defended back towards mid-off.” Right now this is more of a fun system. But if we feed it enough data we can make it a good enough predictive system.
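The prediction step boils down to writing the known part of the line as a prompt and reading the fields back out of whatever the model completes. Here is a small sketch of that bookkeeping; `build_prefix` and `parse_commentary` are hypothetical helper names, and the comma-separated layout is inferred from the example output above rather than from any official format.

```python
def build_prefix(over: int, ball: int, bowler: str, batsman: str) -> str:
    """Format the known facts about the next ball as a generation prompt."""
    return f"{over}.{ball},{bowler} to {batsman},"

def parse_commentary(line: str) -> dict:
    """Split one generated line into over, bowler, batsman, result, description."""
    # Only split on the first three commas: the description has commas of its own.
    over, players, result, description = line.split(",", 3)
    bowler, batsman = players.split(" to ", 1)
    return {
        "over": over.strip(),
        "bowler": bowler.strip(),
        "batsman": batsman.strip(),
        "result": result.strip(),
        "description": description.strip(),
    }

prefix = build_prefix(31, 4, "Agar", "Kohli")
# The example output quoted above, standing in for a real model completion.
generated = "31.4,Agar to Kohli, 1 run,length ball in the channel, defended back towards mid-off."
parsed = parse_commentary(generated)
print(prefix)
print(parsed["result"])
```

Parsing the output this way also makes it easy to check that the bowler and batsman in the completion still match the ones in the prompt.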
Observations:
- Just from 40 matches (around 40 matches × 100 overs × 6 balls = 24,000 sentences) the system learnt the structure for commenting about a single ball.
- The system does not know any rules of cricket, but it gets the sentences right most of the time.
- In rare cases, it makes a misstep and displays the wrong batsman name when describing the shot. This shows it has not learnt everything, but is just making a statistical guess.
- This is just a toy system which can be used for fun. The goal was to showcase the advancements in the area of NLG and how easy it is now to build a system like this from scratch. It took us exactly 8 hours to build this system from scratch, including data download and data cleanup.
- All the data belongs to Cricinfo. This was done just for research purposes.