Accomplishments
An Analysis of Feature Engineering Approaches for Unlabeled Dark Web
- Abstract
Feature engineering is critical when building machine learning models for unlabeled text classification: relevant features are extracted and constructed from the unlabeled text so that a model can learn from it. A variety of feature engineering approaches can be used for this purpose. The proposed work used crawled dark web data, in raw HTML form, as its unlabeled text data. Bag of Words, TF-IDF, and BM25 are commonly used feature engineering techniques in natural language processing. The proposed work applied each of these three techniques to transform the text data and extract features for training a Latent Dirichlet Allocation (LDA) model, which identifies latent topics in unlabeled text. The log perplexity score was used as the performance measure to evaluate the LDA model and thereby compare the Bag of Words, TF-IDF, and BM25 feature engineering techniques.
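
To make the three weighting schemes concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the whitespace tokenizer, the function names, the toy documents, the IDF variants, and the BM25 parameters `k1` and `b` are all assumptions chosen for illustration. It only shows how the same term counts lead to three different feature representations.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split. Real crawled HTML would
    # first need tag stripping and further cleaning, which is not shown.
    return text.lower().split()


def bag_of_words(docs):
    """Raw term counts per document."""
    return [Counter(tokenize(d)) for d in docs]


def document_frequency(bows):
    """Number of documents in which each term occurs."""
    return Counter(term for bow in bows for term in bow)


def tf_idf(bows):
    """Term count scaled by log(N / df), one common TF-IDF variant."""
    n = len(bows)
    df = document_frequency(bows)
    return [{t: c * math.log(n / df[t]) for t, c in bow.items()}
            for bow in bows]


def bm25(bows, k1=1.5, b=0.75):
    """BM25 term weights: saturating term frequency plus length normalization.

    Uses the smoothed, always-positive IDF log(1 + (N - df + 0.5) / (df + 0.5)).
    k1 and b are conventional defaults, not values taken from the paper.
    """
    n = len(bows)
    df = document_frequency(bows)
    avg_len = sum(sum(bow.values()) for bow in bows) / n
    weighted = []
    for bow in bows:
        doc_len = sum(bow.values())
        norm = k1 * (1 - b + b * doc_len / avg_len)
        weighted.append({
            t: math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
               * c * (k1 + 1) / (c + norm)
            for t, c in bow.items()
        })
    return weighted


# Hypothetical toy corpus standing in for cleaned page text.
docs = ["market drugs market", "forum hacking tools", "market forum"]
bows = bag_of_words(docs)
tfidf_vectors = tf_idf(bows)
bm25_vectors = bm25(bows)
```

Each function returns one sparse term-to-weight mapping per document. In a pipeline like the one the abstract describes, such representations would then be passed to a topic model such as LDA and compared by log perplexity; that step is not shown here.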