
An Analysis of Topic Modeling Approaches for Unlabeled Dark Web Data Classification


  • Details
Category: Conference
Conference Name: International Conference on Innovations and Advances in Cognitive Systems
Conference From: 27-May-2024
Conference To: 28-May-2024
Conference Venue: Online
  • Abstract

The dark web hosts a large amount of hidden information, but the peculiar nature and context of dark web data make it challenging to understand. To study and analyze this data effectively, it is important to recognize the variety of information domains it covers. Topic modeling approaches can help determine the origins of dark web data, which makes it necessary to compare the available topic models and pick the one best suited to this purpose. In the dark web domain, topic models such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) have been extensively studied and applied, whereas algorithms such as Top2Vec, BERTopic, and Non-negative Matrix Factorization (NMF) have been experimented with less often. Very little work offers a comprehensive comparative analysis of all these algorithms for topic modeling specifically. The proposed work used crawled dark web data in raw HTML format as unlabeled text data. The authors applied five topic modeling algorithms (LDA, LSA, Top2Vec, BERTopic, and NMF) to the data. Cybersecurity experts then manually evaluated the resulting topic models using parameters such as accuracy, coherence, and repetition. BERTopic emerged as the top-performing algorithm, with an average score of 80%.
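As an illustration of the first step described in the abstract, turning raw crawled HTML into unlabeled text that a topic model can consume, the following is a minimal sketch using only the Python standard library. It is not the authors' pipeline: the class and function names (`TextExtractor`, `html_to_text`, `bag_of_words`), the `min_length` token filter, and the sample page are all assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents.

    Hypothetical helper for illustration; not taken from the paper.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(raw_html):
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


def bag_of_words(text, min_length=3):
    """Count lowercase alphabetic tokens of at least min_length letters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if len(t) >= min_length)


# Made-up page standing in for one crawled document.
sample_page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Market listing</h1><script>var x = 1;</script>"
    "<p>Escrow payment accepted. Escrow is required.</p></body></html>"
)
plain_text = html_to_text(sample_page)
term_counts = bag_of_words(plain_text)
```

Per-document term counts of this kind are the usual input to count-based models such as LDA, LSA, and NMF, while embedding-based models such as Top2Vec and BERTopic typically work from the cleaned text itself.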
