Text Summarization – Part 1

1.  Introduction:

Text summarization is the process of generating a brief, accurate outline of a large text without losing the overall relevant information. We will walk through the different summarization approaches step by step and finally build our own summarization engine. This is the first part of the series, and in this part we will mainly be talking about extractive summarization.

2.  Text Summarization Approach:

There are two types of approaches to Text Summarization:

2.1   Extractive: In the extractive approach, the sentences of the input text are fed into the system for processing. The system retains the most important points by examining sentence similarity. Since multiple sentences may be similar to one another, a weight is computed for each candidate sentence, taking both its importance and its similarity into account, and the sentences with the highest ranks are suggested to the user. The limitation of this approach is its inability to produce novel words in the output. Extractive summarization thus focuses on choosing which paragraphs and sentences of the original documents reproduce their content in a precise form; the importance of a sentence is determined from linguistic and statistical features. A minimal sketch of this idea appears after Figure 1.

Figure 1: Extractive Approach
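To make this concrete, below is a minimal sketch of frequency-based extractive summarization in Python. The function name and the two-sentence default are our own illustrative choices, not from any particular paper: sentence weights are sums of normalized word frequencies, and the top-ranked sentences are returned in their original order.

import re
from collections import Counter

def summarize(text, num_sentences=2):
    # Split on sentence-ending punctuation (a crude boundary detector).
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'[a-z]+', text.lower())
    if not words:
        return ''
    freq = Counter(words)
    max_freq = max(freq.values())
    scored = []
    for position, sent in enumerate(sentences):
        tokens = re.findall(r'[a-z]+', sent.lower())
        weight = sum(freq[t] / max_freq for t in tokens)
        scored.append((weight, position, sent))
    # Keep the highest-weighted sentences, restored to document order.
    top = sorted(scored, reverse=True)[:num_sentences]
    return ' '.join(s for _, _, s in sorted(top, key=lambda x: x[1]))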

There are two types of Extractive Text Summarization:

  • Unsupervised: The unsupervised approach does not need human-written summaries (user input) to decide the important features of the document; instead, it requires more sophisticated algorithms to compensate for the lack of human knowledge. Unsupervised summarization provides a higher level of automation than supervised models and is more suitable for processing Big Data. Unsupervised learning models have proved successful in the text summarization task. The following are unsupervised approaches:
      • Graph-based approach: These models are extensively used in document summarization, since graphs can efficiently represent the document structure. Extractive summarization using external knowledge from Wikipedia, incorporating a bipartite graph framework, has been proposed [2]: an iterative ranking algorithm (a variation of the HITS algorithm [3]) that is efficient in selecting important sentences and also ensures coherence in the final summary. The unique aspect of this work is that it combines graph-based and concept-based approaches to the summarization task. Another graph-based approach is LexRank [4], where the salience of a sentence is determined by the concept of eigenvector centrality. The sentences of the document are represented as nodes of a graph, and the edges between sentences are weighted by cosine similarity. The sentences are clustered into groups based on their similarity measures and then ranked by their LexRank scores, similarly to the PageRank algorithm [5]. (A rough LexRank sketch appears after this list.)
      • Concept-based approach: In this approach, concepts are extracted from external knowledge bases such as HowNet [8] and Wikipedia [2]. In the methodology proposed in [8], the importance of sentences is calculated based on the concepts retrieved from HowNet instead of words. A conceptual vector model is built to obtain a rough summary, and similarity measures between sentences are calculated to reduce redundancy in the final summary.
      • Latent Semantic Analysis Method (LSA): Latent Semantic Analysis (LSA) [13] [14] is a method that extracts hidden semantic structures of words and sentences and is popular in text summarization tasks. It is an unsupervised learning approach that does not demand any external or training knowledge. LSA takes the text of the input document and extracts information such as which words frequently occur together and which words are commonly seen across different sentences. A high number of common words amongst sentences indicates that the sentences are semantically related. (A minimal SVD-based sketch appears after this list.)
  • Supervised Learning Methods: Supervised extractive summarization techniques are based on sentence-level classification, where the system learns by example to distinguish summary from non-summary sentences. The major drawback of the supervised approach is that it requires manually created reference summaries so that the sentences of the original training documents can be labeled "summary sentence" or "non-summary sentence"; in other words, it requires labeled training data for classification.
      • Machine learning approach based on Bayes' rule: In this approach, a set of training documents together with their extractive summaries is fed into the training stage, and summarization is viewed as a classification problem: each sentence is classified as a summary or non-summary sentence based on the features it possesses. The probability of classification is learned from the training data by Bayes' rule [16]: P(s ∈ S | f1, f2, …, fN), where s denotes a sentence of the document, f1, …, fN denote the features used in the classification stage, and S denotes the set of sentences in the summary; P(s ∈ S | f1, f2, …, fN) is thus the probability that a sentence is included in the summary, given its features. (A toy version of this classifier is sketched after this list.)
      • Neural network-based approach: In the approach proposed in [9], a two-layer neural network trained with the RankNet algorithm (using backpropagation) automatically identifies the important sentences in a document. The first step labels the training data using a machine learning approach; features are then extracted from the sentences of both the training and test sets and fed to the neural network, which ranks the sentences in the document. Another approach [10] uses a three-layered feed-forward neural network that learns the characteristics of summary and non-summary sentences during the training stage. Its major phase is the feature fusion phase, where the relationships between features are identified in two stages:
            1. eliminating infrequent features, and
            2. collapsing frequent features,
        after which sentence ranking is performed to identify the important summary sentences. (A sketch of the RankNet objective appears after this list.)

      • Conditional random fields: This is a statistical modeling approach that uses machine learning to produce structured predictions. The proposed system overcomes issues faced by non-negative matrix factorization (NMF) methods by incorporating conditional random fields (CRFs) to identify and extract the correct features for determining the important sentences of a given text. The main advantage of this method is that it identifies the correct features, provides a better representation of sentences, and groups terms appropriately into segments.
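For the graph-based bullet above, here is a rough LexRank-style sketch, simplified from [4]; the damping factor and iteration count are our own illustrative defaults. Sentences become graph nodes, edge weights are cosine similarities between term-frequency vectors, and a PageRank-style power iteration yields salience scores.

import re
import numpy as np
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = (sum(v * v for v in a.values()) ** 0.5) * \
          (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def lexrank(sentences, damping=0.85, iterations=50):
    vecs = [Counter(re.findall(r'[a-z]+', s.lower())) for s in sentences]
    n = len(sentences)
    sim = np.array([[cosine(vecs[i], vecs[j]) for j in range(n)]
                    for i in range(n)])
    # Row-normalize to get a transition matrix, then run power iteration.
    row_sums = sim.sum(axis=1, keepdims=True)
    transition = sim / np.where(row_sums == 0, 1, row_sums)
    scores = np.ones(n) / n
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * (transition.T @ scores)
    return scores  # one salience score per sentence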
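The LSA bullet can likewise be sketched in a few lines: build a term-sentence count matrix, take its SVD, and rank sentences by their weight on the leading latent topic. Real systems use several topics and TF-IDF weighting; this toy version keeps only the first singular vector.

import re
import numpy as np

def lsa_rank(sentences):
    # Term-sentence count matrix A: rows are words, columns are sentences.
    vocab = sorted({w for s in sentences
                      for w in re.findall(r'[a-z]+', s.lower())})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in re.findall(r'[a-z]+', s.lower()):
            A[index[w], j] += 1
    # SVD exposes the latent topics; Vt[0] is each sentence's loading
    # on the strongest topic.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return np.abs(Vt[0])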
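The Bayes-rule classifier can be illustrated with a toy naive Bayes over binary sentence features (for example, "contains a cue word" or "appears early in the document"). The independence assumption and the Laplace smoothing are our simplifications, not details taken from the cited work.

def train(feature_vectors, labels):
    # feature_vectors: list of 0/1 feature tuples; labels: 1 = summary.
    n = len(labels)
    prior = sum(labels) / n  # P(s in S)
    k = len(feature_vectors[0])
    likelihood = {}
    for c in (0, 1):
        rows = [f for f, y in zip(feature_vectors, labels) if y == c]
        # P(f_i = 1 | class c) with Laplace smoothing.
        likelihood[c] = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                         for i in range(k)]
    return prior, likelihood

def posterior(features, prior, likelihood):
    # P(s in S | f1..fN) via Bayes' rule, assuming independent features.
    def joint(c, p_c):
        p = p_c
        for f, q in zip(features, likelihood[c]):
            p *= q if f else (1 - q)
        return p
    num = joint(1, prior)
    return num / (num + joint(0, 1 - prior))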
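Finally, the pairwise objective behind the RankNet-based ranker of [9] can be sketched as follows (PyTorch assumed; the network size is illustrative, and this is the generic RankNet loss rather than the paper's exact setup).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceRanker(nn.Module):
    # Two-layer scoring network: feature vector -> scalar relevance.
    def __init__(self, num_features, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def ranknet_loss(score_better, score_worse):
    # RankNet models P(better outranks worse) = sigmoid(s_b - s_w)
    # and minimizes the negative log-likelihood of observed orderings.
    return -F.logsigmoid(score_better - score_worse).mean()

# Usage sketch: given feature tensors for summary / non-summary sentences,
#   model = SentenceRanker(num_features=10)
#   loss = ranknet_loss(model(summary_feats), model(non_summary_feats))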

Extractive Text Summarization comprises the following functionalities and characteristics:

  • Extractive summaries do not focus on understanding the text; they extract the most important parts based on statistical and linguistic features such as cue words, location, and word frequency.
  • Sentence boundaries are identified by the dot terminating a sentence.
  • Stop words and unnecessary information are discarded.
  • For every word, a stem is built that carries its core meaning.
  • The processing phase scores the relevant sentences and assigns weights using a weight-learning method.
  • The extracted sentences tend to be long, so the summary consumes space.
  • Not all relevant sentences are included.
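The steps above, sketched end to end. The hardcoded stop-word list and the crude suffix-stripping stemmer are stand-ins for real resources such as a stop-word corpus and the Porter stemmer.

import re
from collections import Counter

STOP_WORDS = {'the', 'a', 'an', 'is', 'are', 'of', 'to', 'and', 'in',
              'that', 'it'}

def stem(word):
    # Naive suffix stripping; a real system would use a proper stemmer.
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    # Boundary identification: split at the dot terminating a sentence.
    sentences = [s for s in re.split(r'(?<=\.)\s+', text.strip()) if s]
    processed = []
    for sent in sentences:
        tokens = re.findall(r'[a-z]+', sent.lower())
        # Discard stop words, then reduce each word to its stem.
        stems = [stem(t) for t in tokens if t not in STOP_WORDS]
        processed.append(stems)
    return sentences, processed

def sentence_weights(processed):
    # Weight assignment: sum of stem frequencies per sentence.
    freq = Counter(s for stems in processed for s in stems)
    return [sum(freq[s] for s in stems) for stems in processed]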

Continue to PART 2

References:

[1]  https://arxiv.org/abs/1810.04805v2

[2] Y. Sankarasubramaniam, K. Ramanathan, and S. Ghosh, “Text summarization using Wikipedia,” Information Processing & Management, vol. 50, no. 3, pp. 443-461, 2014.

[3] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal of the ACM (JACM), vol. 46, no. 5, pp. 604–632, 1999.

[4] G. Erkan and D. R. Radev, "LexRank: Graph-based lexical centrality as salience in text summarization," Journal of Artificial Intelligence Research, pp. 457-479, 2004.

[5] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Technical Report, Stanford University, Stanford, CA, 1998.

[8] M. Wang, X. Wang, and C. Xu, "An approach to concept-oriented text summarization," in Proceedings of ISCIT'05, IEEE International Conference, China, pp. 1290-1293, 2005.

[9] K. M. Svore, L. Vanderwende, and C. J. Burges, "Enhancing single-document summarization by combining RankNet and third-party sources," in EMNLP-CoNLL, 2007, pp. 448-457.

[10] K. Kaikhah, “Automatic text summarization with neural networks,” 2004.
