Exploring topic models to discern zero-day vulnerabilities on Twitter through a case study on log4shell

Exploring topic models to discern zero-day vulnerabilities on Twitter through a case study on log4shell

Joy Wang, Bashar Md Abul, Mahinthan Chandramohan, Richi Nayak

30 December 2022

Twitter has demonstrated advantages in providing timely information about zero-day vulnerabilities and exploits. The large volume of unstructured tweets, on the other hand, makes it difficult for cybersecurity professionals to perform manual analysis and investigation into critical cyberattack incidents. To improve the efficiency of data processing on Twitter, we propose a novel vulnerability discovery and monitoring framework that can collect and organize unstructured tweets into semantically related topics with temporal dynamic patterns. Unlike existing supervised machine learning methods that process tweets based on a labelled dataset, our framework is unsupervised, making it better suited for analyzing emerging cyberattack and vulnerability incidents when no prior knowledge is available (e.g., zero-day vulnerability and incidents). The proposed framework compares three topic modeling techniques(Latent Dirichlet Allocation, Non-negative Matrix Factorization and Contextualized Topic Modeling) in combination of different text representation methods (Bag-of-word and contextualized pre-trained language models) on a Twitter dataset that was collected from 47 influential users in the cybersecurity community. We show how the proposed framework can be used to analyze a critical zero-day vulnerability incident(Log4shell) on Apache log4j java library in order to understand its temporal evolution and dynamic patterns across its vulnerability life-cycle. Results show that our proposed framework can be used to effectively analyze vulnerability related topics and their dynamic patterns. Twitter can reveal valuable information regarding the early indicator of exploits and users behaviors. The pre-trained contextualized text representation shows advantages for the unstructured, domain dependent, sparse Twitter textual data under the cybersecurity domain


Venue : IEEE Transactions on Information Forensics and Security

File Name : paper_explore_topic_models_submission.pdf