What is DarkBERT? Can AI help fight cyber threats?

The popularity of large language models (LLMs) is skyrocketing, with new ones constantly appearing. Models like ChatGPT are often trained on a variety of Internet sources, including articles, websites, books, and social media.

A team of Korean researchers developed DarkBERT, an LLM trained on datasets sourced exclusively from the dark web. Their aim is to create a language model that outperforms existing ones on dark web text and helps threat researchers, law enforcement and cybersecurity professionals fight cyber threats.

What is DarkBERT?

DarkBERT is an encoder model based on the RoBERTa architecture. The model was trained on millions of dark web pages, including data from hacking forums, phishing sites and other online sources associated with illegal activities.

The term "dark web" refers to a hidden part of the Internet that is not accessible through standard web browsers. This section of the Internet is notorious for harboring anonymous websites and illegal markets that trade in stolen data, drugs and weapons.

To train DarkBERT, the researchers accessed the dark web through the Tor network and collected raw text. They filtered this data with deduplication, category balancing, and other preprocessing steps to build a refined dark web corpus, which was then used to further train RoBERTa for about 15 days, producing DarkBERT.
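The filtering steps named above can be illustrated with a minimal Python sketch. It is not the researchers' actual pipeline: the record layout (`category` and `text` keys), the exact-match hashing, the per-category cap and the toy pages are all assumptions made for illustration.

```python
import hashlib
import random
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(pages):
    """Keep the first page for each distinct normalized text (exact-match dedup)."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(normalize(page["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def balance_categories(pages, max_per_category, seed=0):
    """Cap each category at max_per_category pages by random downsampling."""
    by_category = defaultdict(list)
    for page in pages:
        by_category[page["category"]].append(page)
    rng = random.Random(seed)
    balanced = []
    for category in sorted(by_category):
        group = by_category[category]
        if len(group) > max_per_category:
            group = rng.sample(group, max_per_category)
        balanced.extend(group)
    return balanced


# Hypothetical toy records; a real crawl would be far larger.
pages = [
    {"category": "forum", "text": "Selling  fresh accounts"},
    {"category": "forum", "text": "selling fresh accounts"},
    {"category": "forum", "text": "New exploit thread"},
    {"category": "forum", "text": "Looking for a partner"},
    {"category": "market", "text": "Price list updated"},
]

unique_pages = deduplicate(pages)
corpus = balance_categories(unique_pages, max_per_category=2)
```

Capping each category keeps one very common page type from dominating the training corpus. A production pipeline would likely also need near-duplicate detection, which this sketch does not attempt.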

Applications of DarkBERT in Cybersecurity

Because it was trained on dark web text, DarkBERT handles the language of cybercriminals better than general-purpose models and is well suited to detecting specific potential threats. It can analyze dark web content and flag cybersecurity threats such as data leaks and ransomware, making it a potentially useful tool against cyber threats.

According to the paper published on arxiv.org, the researchers evaluated DarkBERT by comparing it with two well-known NLP models, BERT and RoBERTa, across three important cybersecurity use cases.

1. Monitor dark web forums for potentially harmful topics

Monitoring dark web forums, which are often used to exchange illegal information, is important for identifying potentially dangerous topics. However, reviewing them manually is time-consuming, so automating the process benefits security professionals.

The researchers focused on potentially harmful activity in hacking forums and drew up annotation guidelines for noteworthy topics, including the sharing of confidential data and the distribution of malware or critical vulnerabilities.

DarkBERT outperformed the other language models in precision, recall, and F1 score, making it the stronger choice for identifying noteworthy topics on the dark web.
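For readers unfamiliar with these metrics, recall and F1, together with precision, are all derived from counts of true positives, false positives and false negatives. The sketch below computes them for a binary "noteworthy or not" labeling; the labels are made-up toy values, not results from the study.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels: 1 = noteworthy thread, 0 = not noteworthy (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision penalizes false alarms, recall penalizes missed threads, and F1 is their harmonic mean, which is why the three are usually reported together for detection tasks like this one.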

2. Detect sites that host confidential information

Hackers and ransomware groups use the dark web to set up leak sites, where they publish confidential data stolen from organizations that refuse to comply with ransom demands. Other cybercriminals simply upload leaked sensitive data, such as passwords and financial information, to the dark web with the intention of selling it.

In their study, the researchers collected data from notorious ransomware groups and analyzed ransomware leak sites that publish organizations' private data. DarkBERT outperformed the other language models in identifying and classifying such sites, demonstrating its understanding of the language used in underground hacking forums on the dark web.

3. Identify keywords related to threats on the dark web

DarkBERT uses the fill-mask function, an inherent feature of the BERT family of language models, to pinpoint keywords associated with illegal activities, including drug sales on the dark web.

When the word "MDMA" was masked on a drug sales page, DarkBERT suggested drug-related words, while the other models suggested generic words and terms unrelated to drugs, such as various professions.
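The masking experiment described above can be approximated with the Hugging Face `transformers` library. The sketch below is an assumption-laden illustration rather than the researchers' setup: it uses the public `roberta-base` checkpoint as a stand-in for DarkBERT, the listing text is invented, and the helper names `mask_word` and `top_predictions` are hypothetical.

```python
import re


def mask_word(text, word, mask_token="<mask>"):
    """Replace the first whole-word occurrence of `word` with the mask token."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags=re.IGNORECASE)
    masked, count = pattern.subn(lambda match: mask_token, text, count=1)
    if count == 0:
        raise ValueError(f"{word!r} not found in text")
    return masked


def top_predictions(masked_text, model_name="roberta-base", top_k=5):
    """Return (word, score) pairs the model proposes for the masked position."""
    # Imported lazily: needs `pip install transformers torch` and downloads
    # the model weights on first use. "<mask>" is RoBERTa's mask token.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_name)
    results = fill(masked_text, top_k=top_k)
    return [(r["token_str"].strip(), r["score"]) for r in results]


# Hypothetical listing text, written for illustration only.
listing = "High quality MDMA, discreet worldwide shipping."
masked_listing = mask_word(listing, "MDMA")
print(masked_listing)

# To query the stand-in model (requires network access on first run):
# for word, score in top_predictions(masked_listing):
#     print(f"{word}\t{score:.3f}")
```

Running the same masked sentence through a general-purpose model and a domain-specific one, then comparing the suggested words, is the kind of side-by-side comparison the study describes.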

DarkBERT's ability to identify keywords associated with illegal activities can be valuable in tracking and addressing emerging cyber threats.