Text Analytics Framework using Apache Spark and Combination of Lexical and Machine Learning Techniques
Published: 2016
Author(s) Name: Anuja Prakash Jain, Prof. Padma Dandannavar
Abstract
Today, we live in a data age. The sudden increase in the amount of user-generated data on social media platforms like Twitter has led to new opportunities and challenges for companies that strive to keep an eye on customer reviews and opinions about their products. Twitter is a huge, fast-growing micro-blogging social networking platform where users express their views about politics, products, sports, etc. These views are useful for businesses, governments, and individuals. Hence, tweets are used in this framework for mining public opinion. Sentiment analysis is the process of automatically recognizing whether user-generated content expresses a positive, negative, or neutral opinion about an entity (i.e., a product, person, topic, event, etc.). Traditional analytics tools are costly and are not built to handle big data. Hadoop, though a popular framework for data-intensive applications, does not perform well on iterative processes (like data analysis) because of the cost of reloading data from disk on each iteration. This paper proposes a text analysis framework for Twitter data built on Apache Spark, which makes it more flexible, faster, and more scalable. The proposed framework is also domain-independent, as it uses a hybrid approach that combines supervised machine learning algorithms (Naïve Bayes and decision tree) with a lexicon approach (pattern analyzer) for sentiment classification, comparing the various supervised learning models and using the one with the highest accuracy to predict sentiment.
Keywords: Sentiment Analysis, Machine Learning, Lexical Approach, Apache Spark, Natural Language Processing, Twitter
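
The model-selection idea in the abstract (score several sentiment classifiers on held-out data and keep the most accurate one) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python rather than Apache Spark, a hand-written multinomial Naïve Bayes, and a tiny hypothetical lexicon in place of the pattern analyzer, and it omits the decision tree. All data and names (TRAIN, TEST, LEXICON, select_best) are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy labelled tweets, invented for illustration (not data from the paper).
TRAIN = [
    ("i love this phone", "pos"),
    ("great camera and great battery", "pos"),
    ("what a wonderful screen", "pos"),
    ("i hate this phone", "neg"),
    ("terrible battery and awful screen", "neg"),
    ("worst camera ever", "neg"),
]
TEST = [
    ("love the great screen", "pos"),
    ("awful phone i hate it", "neg"),
    ("wonderful battery", "pos"),
    ("terrible camera", "neg"),
    ("battery died", "neg"),
]

# Hypothetical mini-lexicon standing in for a real sentiment lexicon.
LEXICON = {"love": 1, "great": 1, "wonderful": 1,
           "hate": -1, "terrible": -1, "awful": -1, "worst": -1}


def lexicon_classify(text):
    """Sum word polarities; the sign of the total decides the label."""
    score = sum(LEXICON.get(word, 0) for word in text.split())
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neutral"


def train_naive_bayes(examples):
    """Fit a multinomial Naive Bayes model with add-one smoothing."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)

    def classify(text):
        best_label, best_logp = None, -math.inf
        for label, n in class_counts.items():
            logp = math.log(n / len(examples))  # class prior
            total = sum(word_counts[label].values())
            for word in text.split():
                if word in vocab:  # words never seen in training are skipped
                    logp += math.log(
                        (word_counts[label][word] + 1) / (total + len(vocab))
                    )
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

    return classify


def accuracy(classify, examples):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(classify(text) == label for text, label in examples) / len(examples)


def select_best(models, examples):
    """Score every candidate and return the name of the most accurate one."""
    scores = {name: accuracy(fn, examples) for name, fn in models.items()}
    return max(scores, key=scores.get), scores


MODELS = {
    "lexicon": lexicon_classify,
    "naive_bayes": train_naive_bayes(TRAIN),
}

if __name__ == "__main__":
    best, scores = select_best(MODELS, TEST)
    print(scores, "->", best)
```

On data this small the accuracies mean little. The sketch only shows the selection step: each candidate is a function from text to label, so further models can be added to MODELS without changing select_best.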