A Comprehensive Comparison of Text Summarization Performance: A Multi-Faceted Evaluation of Large Language Models with Practical Considerations

Authors

Anantharaman Janakiraman and Behnaz Ghoraani, Florida Atlantic University

Abstract

Text summarization is crucial for mitigating information overload across domains. This research evaluates summarization performance across 17 large language models using seven diverse datasets at three output lengths (50, 100, and 150 tokens). We employ a novel multi-dimensional framework assessing factual consistency, semantic similarity, lexical overlap, and human-like quality while considering both quality and efficiency factors. Key findings reveal significant differences among models, with specific models excelling in factual accuracy (deepseek-v3), human-like quality (claude-3-5-sonnet), processing efficiency (gemini-1.5-flash), and cost-effectiveness (gemini-1.5-flash). Performance varies dramatically by dataset, with models struggling on technical domains but performing well on conversational content. We identify a critical tension between factual consistency (best at 50 tokens) and perceived quality (best at 150 tokens).
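
The paper's exact metric implementations are not reproduced in this abstract; the sketch below only illustrates the multi-dimensional idea for two of the four dimensions, assuming ROUGE-L (via the rouge-score package) for lexical overlap and sentence-embedding cosine similarity (via sentence-transformers with the all-MiniLM-L6-v2 model) for semantic similarity. These library and model choices are assumptions, not the authors' stated setup.

# Minimal sketch of a multi-dimensional summary evaluation.
# Assumed dependencies: pip install rouge-score sentence-transformers
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

def evaluate_summary(reference: str, candidate: str) -> dict:
    # Lexical overlap: ROUGE-L F1 between reference and candidate summary.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

    # Semantic similarity: cosine similarity of sentence embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode([reference, candidate], convert_to_tensor=True)
    semantic = util.cos_sim(embeddings[0], embeddings[1]).item()

    # Factual consistency and human-like evaluation typically require an
    # NLI- or LLM-based judge and are omitted from this sketch.
    return {"rouge_l": rouge_l, "semantic_similarity": semantic}

if __name__ == "__main__":
    ref = "The study compared 17 language models on seven datasets."
    cand = "Seventeen LLMs were evaluated across seven diverse datasets."
    print(evaluate_summary(ref, cand))

In practice, per-dimension scores like these are kept separate rather than collapsed into a single number, which is what allows quality-versus-efficiency trade-offs of the kind the paper reports.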

Keywords

Text Summarization, Large Language Models, Multi-dimensional Evaluation, Evaluation Metrics, Model Comparison

Volume 16, Number 2