A Term Weight Measures based Approach for Celebrity Profiling

Siva Nagi Reddy Kalli, B. Narendra Kumar, S. Jagadeesh

Abstract

Celebrity profiling is a text classification problem in which profile attributes of celebrity authors, such as birth year, gender, fame and occupation, are predicted by analysing their writing style. The PAN competition introduced this task in 2019 and provided a corpus annotated with these four attributes. To differentiate authors' writing styles, researchers have extracted several types of features in celebrity profiling approaches, including style-based, content-based, lexical, character-based, syntactic and structural features. They found that content-based features play a crucial role in identifying the author when compared with the other feature types. In this work, content-based features are used in the celebrity profiling experiments. Term frequencies over the whole corpus are used to identify the important features, and the most frequent terms are used as features for representing documents as vectors. The value assigned to each term in the vector representation plays a vital role in the performance of celebrity profiling. Term Weight Measures (TWMs) are used for this purpose: they compute the importance of a term in a document. Various TWMs have been proposed in the existing literature across several research domains. In this paper, a TWM-based approach is proposed for celebrity profiling, in which a new TWM is introduced and its performance is compared with that of existing TWMs. The results show that the proposed TWM attained higher accuracy for profile prediction than the other TWMs.
Three Machine Learning (ML) algorithms, Naïve Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF), are used to evaluate the performance of the proposed approach. The PAN 2019 competition celebrity profiling corpus is used in this work, and prediction of the gender, fame and occupation profiles is considered.
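
The feature pipeline described in the abstract (select the most frequent corpus terms, then weight each term per document) can be illustrated with a minimal sketch. The abstract does not define the proposed TWM, so the sketch below uses a smoothed TF-IDF weight purely as a stand-in for an existing TWM; it is not the measure proposed in the paper. The function names `top_terms` and `tfidf_vectors`, the whitespace tokenisation and the toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter


def top_terms(docs, k):
    """Return the k most frequent terms counted over the whole corpus."""
    counts = Counter(term for doc in docs for term in doc.split())
    return [term for term, _ in counts.most_common(k)]


def tfidf_vectors(docs, vocab):
    """Represent each document as a vector of weights over vocab.

    Stand-in weight (an assumption, not the paper's TWM):
    tf(t, d) * log((1 + N) / (1 + df(t))).
    """
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for toks in tokenised if t in toks) for t in vocab}
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)  # raw term frequency in this document
        vectors.append(
            [tf[t] * math.log((1 + n_docs) / (1 + df[t])) for t in vocab]
        )
    return vectors


# Toy documents (hypothetical; not taken from the PAN 2019 corpus).
docs = ["new album out now", "match day goal goal", "new film out today"]
vocab = top_terms(docs, 3)
vectors = tfidf_vectors(docs, vocab)
```

The resulting vectors could then be passed to a classifier such as NB, SVM or RF; swapping in a different TWM would only require changing the weight expression inside `tfidf_vectors`.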
