Twitter has demonstrated advantages in providing timely information about zero-day vulnerabilities and exploits.
However, the large volume of unstructured tweets makes it difficult for cybersecurity professionals to manually analyze and investigate critical cyberattack incidents. To improve the efficiency of processing Twitter data, we propose a novel vulnerability discovery and monitoring framework that collects unstructured tweets and organizes them into semantically related topics with temporal dynamic patterns. Unlike existing supervised machine learning methods, which process tweets based on a labelled dataset, our framework is unsupervised, making it better suited for analyzing emerging cyberattack and vulnerability incidents for which no prior knowledge is available (e.g., zero-day vulnerabilities and incidents). The proposed framework compares three topic modeling techniques (Latent Dirichlet Allocation, Non-negative Matrix Factorization, and Contextualized Topic Modeling) in combination with different text representation methods (bag-of-words and contextualized pre-trained language models) on a Twitter dataset collected from 47 influential users in the cybersecurity community. We show how the proposed framework can be used to analyze a critical zero-day vulnerability incident (Log4Shell) in the Apache Log4j Java library in order to understand its temporal evolution and dynamic patterns across its vulnerability life-cycle. Results show that the proposed framework can effectively analyze vulnerability-related topics and their dynamic patterns. Twitter can reveal valuable information about early indicators of exploits and about user behaviors. Pre-trained contextualized text representations show advantages for unstructured, domain-dependent, sparse Twitter text in the cybersecurity domain.
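
As an illustration only, and not the implementation used in this work, the pairing of a bag-of-words representation with Non-negative Matrix Factorization named above can be sketched on a handful of invented tweets. The toy corpus, the function names (`bag_of_words`, `nmf`, `top_words`), the choice of two topics, and the use of the standard multiplicative update rules are all assumptions made for this sketch; a real pipeline would typically add preprocessing, term weighting, and a library implementation.

```python
import random
from collections import Counter

# Invented toy tweets (assumption: not drawn from the paper's dataset).
TOY_TWEETS = [
    "log4j exploit jndi lookup",
    "log4j jndi patch released",
    "exploit log4j jndi attack",
    "phishing email credential theft",
    "phishing campaign email credential",
    "credential theft phishing email",
]

def bag_of_words(docs):
    """Return (vocab, matrix), where matrix[d][j] counts vocab[j] in docs[d]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            row[index[tok]] = float(count)
        matrix.append(row)
    return vocab, matrix

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x words) into W (docs x topics) and H (topics x words)
    using multiplicative updates for the squared-error objective."""
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_words)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_words)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h

def top_words(h, vocab, n=3):
    """For each topic row of H, list the n highest-weighted vocabulary words."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in h]

vocab, v = bag_of_words(TOY_TWEETS)
w, h = nmf(v, n_topics=2)
for topic_id, words in enumerate(top_words(h, vocab)):
    print(topic_id, words)
```

Each row of `h` weights the vocabulary for one topic, and each row of `w` weights the topics for one tweet; grouping the rows of `w` by tweet timestamp is one simple way such topic weights could be tracked over time.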