In PART 1 and PART 2 of this series we saw the different types of summarization techniques. Let us try to apply our learning and build a news summarization engine from scratch. We will go with an extractive approach: we believe abstractive approaches are not yet up to the mark, and even attempting them would require training on a lot of data.
If you just want to play with the new summarizer we have built, just use the chatbot below.
So let us see what components we need to build our system:
- Lots of training data. Luckily for text based systems we can find a lot of data on which we can train our systems. WikiText-2 is a good example.
- A web page parser. Since a web page will have lots of tags and meta data, we need to be able to extract just the article content from a web page. https://github.com/codelucas/newspaper/ is a very good python package which works very well for newspaper articles.
- A good sentence embedding. We should get a good representation of a sentence. Huggingface provides some good sentence embeddings. But in our system we have gone with our own Alpes embedding for sentences.
- An AI algorithm to find the best sentences to represent the summary of the article. Here also, we went with our Alpes algorithm.
To build this system we used the Alpes algorithm to build a hyperdimensional model space of sentences. We used 20,000 sentences to build this model and we got a space with 440 dimensions. Once the model space was built, it was just a matter of placing the sentences in this hyperdimensional space to get a sentence embedding. We can then use these sentence embeddings for downstream tasks. We have applied this to the article summarization problem and you can see the result in the chatbot above. Just mention the URL of the article and number of sentences you want in the summary to get the summary of the article.
This was a very quick way for us to showcase the power of the Alpes algorithm. This whole system was built in 2 days. The system works as follows:
- Take the URL passed by the user.
- Pass the URL to newspaper module in python and get the text.
```python
from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
text = article.text
```
- Pass the article text to spaCy and get the sentences.
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
# doc.sents contains the sentences in the article.
sentences = [sent.text for sent in doc.sents]
```
- Get the sentence embeddings. We use Alpes embeddings, but you can also use Hugging Face sentence embeddings.
- You can use something like https://github.com/UKPLab/sentence-transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
# encode() expects a list of strings, so convert the spaCy spans first.
sentence_embeddings = model.encode([sent.text for sent in doc.sents])
```
- Finally, use the embeddings to find clusters of sentences. You can use regular K-Means. One approach is to find 5 clusters (if we want 5 sentences in the summary); for each cluster, find the centroid and then the sentence closest to that centroid. That becomes a candidate sentence for the summary. Put the chosen sentences back in order and print the summary.
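The clustering step above can be sketched as follows. This is a minimal illustration, not the original system: it assumes `sentences` is a list of sentence strings and `sentence_embeddings` is a matching NumPy array (as produced in the earlier steps); the function name and parameters are our own.

```python
import numpy as np
from sklearn.cluster import KMeans

def summarize(sentences, sentence_embeddings, n_sentences=5):
    # One cluster per desired summary sentence.
    kmeans = KMeans(n_clusters=n_sentences, n_init=10, random_state=0)
    kmeans.fit(sentence_embeddings)
    chosen = []
    for centroid in kmeans.cluster_centers_:
        # Pick the sentence whose embedding lies closest to the centroid.
        distances = np.linalg.norm(sentence_embeddings - centroid, axis=1)
        chosen.append(int(np.argmin(distances)))
    # Keep original article order so the summary reads naturally.
    return ' '.join(sentences[i] for i in sorted(set(chosen)))
```

Sorting the chosen indices is a small but important detail: K-Means returns centroids in arbitrary order, and a summary reads much better when its sentences follow the flow of the original article.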
That’s it. We are done.
The amazing thing is that our system learnt about sentences by looking at just 20,000 sentences. From that it is able to build a model space which captures the essence of sentences. The training process took just 4 minutes. None of the 20,000 training sentences came from news articles, yet the system is performing pretty well on summarization tasks. We believe it will perform much better once we can train it on a domain data set and feed it more data.