Comparing Results of Sentiment Analysis Using Naive Bayes and Support Vector Machine in Distributed Apache Spark Environment

Authors

Tomasz Szandala, Wroclaw University of Technology, Poland

Abstract

Short messages, such as those posted on Twitter or Facebook, have become a very popular tool for sharing opinions among Internet users. Microblogging websites are therefore rich sources of data for opinion mining and sentiment analysis. Analyzing such messages is challenging, however, because of the limited contextual information they normally contain. Furthermore, the greatest benefit comes from determining the sentiment class in real time, when the post is published, so that one can react as soon as possible. Nevertheless, most existing solutions are limited to centralized environments; thus, they can process at most a few thousand tweets. Given the massive number of tweets published daily, such a sample is not representative enough to define the sentiment polarity towards a topic. A sample analysis has been performed using machine learning methods alongside natural language processing techniques, utilizing Apache Spark’s machine learning library, MLlib, on a labelled (positive/negative) corpus of 4234 tweets regarding the 2016 presidential election in the USA. The analysis was carried out in a distributed Apache Spark environment with a simulated stream of data from Apache Kafka.

Keywords

Apache Spark, natural language processing, sentiment analysis

Full Text  Volume 8, Number 14