The multi-document summarization task is performed in five stages:
- TF-IDF: The news articles are represented as vectors using TF-IDF (term frequency-inverse document frequency), a weighting technique from information retrieval.
- Document clustering: K-means clustering is applied to the multi-document collection to group the documents into clusters.
- Latent Dirichlet Allocation (LDA): LDA topic modelling is applied to each document cluster to generate the cluster's topics and the terms belonging to each topic.
- Frequent and semantic terms: Global frequent terms and semantically similar terms are generated from the whole document collection.
- Sentence filtering: For each document, the sentences that contain the frequent terms and semantically similar terms are selected for inclusion in the summary.
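To make the first stage concrete, here is a minimal sketch of TF-IDF vectorization in plain Python. It is not the project's actual code: the tokenizer, the function names (`tokenize`, `tfidf_vectors`, `cosine`), the unsmoothed `log(N / df)` weighting, and the three sample sentences are all assumptions chosen for illustration. Clustering and LDA are not shown; a K-means step would operate on vectors like the ones produced here.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, strip punctuation.
    words = (t.strip('.,;:!?"()').lower() for t in text.split())
    return [w for w in words if w]

def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    vocab = sorted(df)
    # Unsmoothed IDF (an assumption): terms found in every document get weight 0.
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors

def cosine(u, v):
    # Cosine similarity, the usual way to compare TF-IDF vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Illustrative sample articles (not from the project's data set).
docs = [
    "The central bank raised interest rates.",
    "Interest rates were raised by the bank again.",
    "The football team won the final match.",
]
vocab, vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
print(sim_related > sim_unrelated)
```

The two finance sentences share several weighted terms, while the sports sentence shares only "the", which occurs in every document and therefore carries zero weight. This is the property a clustering step would rely on to group related articles.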
The output of this project is a set of summarized documents for large collections of news articles.
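The last two stages (global frequent terms and sentence filtering) can be sketched roughly as follows. This is an illustrative assumption, not the project's implementation: the stopword list, the `min_count` threshold, the regex-based sentence splitter, and the sample documents are all hypothetical. The source does not say how the semantically similar terms are derived, so they are left out here; in a fuller version they would simply be added to the set of key terms.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and",
             "is", "are", "was", "were", "by", "for"}

def terms(text):
    # Lowercased alphabetic tokens with stopwords removed.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def global_frequent_terms(docs, min_count=2):
    # Terms whose total count over the whole collection reaches min_count.
    counts = Counter(t for d in docs for t in terms(d))
    return {t for t, c in counts.items() if c >= min_count}

def split_sentences(doc):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]

def summarize(doc, key_terms):
    # Keep only the sentences that mention at least one key term.
    return [s for s in split_sentences(doc) if key_terms & set(terms(s))]

# Illustrative sample documents (not from the project's data set).
docs = [
    "The bank raised interest rates. Analysts expected the move. "
    "Traffic was heavy downtown.",
    "Higher interest rates affect mortgages. The bank will review rates in June.",
]
frequent = global_frequent_terms(docs)
summaries = [summarize(d, frequent) for d in docs]
print(summaries[0])  # ['The bank raised interest rates.']
```

With this sample data the frequent terms are "bank", "interest", and "rates", so the two off-topic sentences in the first document are dropped while both sentences of the second document are kept.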