Cybersecurity Datasets - Arash Habibi Lashkari

10. Large-Scale Intrusion Detection Dataset (BCCC-CSE-CIC-IDS2018)

The BCCC-CSE-CIC-IDS2018 dataset is an enhanced version of CSE-CIC-IDS2018 with 46 million labelled records and 300 features, addressing key issues to improve data quality and reliability for behavioral profiling in IDS research. Labeling inconsistencies, particularly for DoS attacks, were corrected by aligning attack labels with attacker IPs instead of timestamps. NTLFlowLyzer, a new network traffic analyzer, was developed to resolve anomalies in extracted features and refine feature implementation. Additionally, protocol issues were fixed by removing UDP-based attacks previously misclassified due to TCP-specific analysis. Attacks with insufficient flow counts were retained but excluded from analysis and profiling. The dataset now includes an expanded feature set to better detect evolving cyber threats, making it a robust benchmark for AI-driven IDS/IPS research.
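As a rough illustration of IP-based labelling, the sketch below assigns a label to each flow by looking up its endpoints in a table of known attacker IPs, ignoring the timestamp. The field names (`src_ip`, `dst_ip`, `timestamp`), the IP addresses, the labels `DoS-A` and `DoS-B`, and the choice to match on either endpoint are all assumptions made for this example. They are not taken from the dataset or from NTLFlowLyzer.

```python
# Hypothetical attacker table: IPs and label names are placeholders,
# not values from the BCCC-CSE-CIC-IDS2018 dataset.
ATTACKER_IPS = {
    "198.51.100.7": "DoS-A",
    "198.51.100.9": "DoS-B",
}


def label_flow(flow, attacker_ips=ATTACKER_IPS):
    """Return the attack label if either endpoint is a known attacker IP.

    The timestamp is deliberately ignored: the label depends only on
    who is talking, not on when the flow was captured.
    """
    for ip in (flow["src_ip"], flow["dst_ip"]):
        if ip in attacker_ips:
            return attacker_ips[ip]
    return "Benign"


# Hypothetical flow records (field names are assumptions).
flows = [
    {"src_ip": "198.51.100.7", "dst_ip": "10.0.0.5", "timestamp": 1000},
    {"src_ip": "10.0.0.8", "dst_ip": "10.0.0.5", "timestamp": 1001},
    {"src_ip": "10.0.0.5", "dst_ip": "198.51.100.9", "timestamp": 5000},
]

labels = [label_flow(f) for f in flows]
```

With timestamp-based labelling, the second flow could be marked as an attack simply because it falls inside an attack window. Keying on the attacker IP avoids that.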

The full research paper outlining the details of the dataset and its underlying principles:

"Toward Generating a Large Scale Intrusion Detection Dataset and Intruders Behavioral Profiling Using Network and Transportation Layers Traffic Flow Analyzer (NTLFlowLyzer)", MohammadMoein Shafi, Arash Habibi Lashkari & Arousha Haghighian Roudsari, Journal of Network and Systems Management, Vol 33, article 44, 2025

Download Dataset:

Request Dataset

9. Smart Contracts Vulnerabilities (BCCC-SCsVuls-2024)

The BCCC-SCsVuls-2024 dataset is a comprehensive resource for analyzing and detecting vulnerabilities in Solidity-based smart contracts, featuring 111,897 meticulously labeled samples across 11 vulnerabilities such as Re-entrancy (17,698), IntegerUO (16,740), DenialOfService (12,394), and Secure contracts (26,914). The dataset was curated from reputable sources like Smart Bugs, Ethereum SCs, and SmartScan-Dataset, ensuring diverse and representative vulnerability coverage. All entries were processed into SHA-256 hashes to maintain integrity and uniqueness, eliminating duplicates. This dataset provides a robust foundation for developing and testing vulnerability detection models for smart contracts, advancing research in blockchain security.
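The hash-based deduplication described above can be sketched with Python's standard `hashlib` module. This is a minimal illustration, not the authors' pipeline: the helper names `sha256_of` and `deduplicate` and the sample contract strings are hypothetical, and any normalisation of the source code before hashing is not covered.

```python
import hashlib


def sha256_of(source: str) -> str:
    """Return the hex SHA-256 digest of a contract's source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def deduplicate(contracts):
    """Map each distinct digest to the first source text that produced it.

    Byte-identical contracts hash to the same digest, so later copies
    are dropped.
    """
    unique = {}
    for source in contracts:
        unique.setdefault(sha256_of(source), source)
    return unique


# Hypothetical sample: the third entry is an exact copy of the first.
contracts = ["contract A {}", "contract B {}", "contract A {}"]
unique = deduplicate(contracts)
```

Only byte-identical sources collapse to one entry here. Contracts that differ in whitespace or comments would still be counted as distinct.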

The full research paper outlining the details of the dataset and its underlying principles:

"Unveiling Smart Contracts Vulnerabilities: Toward Profiling Smart Contracts Vulnerabilities using Enhanced Genetic Algorithm and Generating Benchmark Dataset", Sepideh HajiHosseinKhani, Arash Habibi Lashkari, Ali Mizani Oskui, Blockchain: Research and Applications, December 2024, 100253

Download Dataset:

Request Dataset

8. Intrusion Detection Dataset (BCCC-CIC-IDS2017)

Using NTLFlowLyzer, we generated the “BCCC-CIC-IDS2017” dataset by extracting key flows from the raw network traffic data of CIC-IDS2017, resulting in CSV files that integrate essential network and transport layer features. This new dataset offers a structured approach for analyzing intrusion detection, combining diverse traffic types into multiple sub-categories. The “BCCC-CIC-IDS2017” dataset provides the depth and variety needed to rigorously evaluate our proposed profiling model, advancing research in network security and supporting the development of intrusion detection systems.

The full research paper outlining the details of the dataset and its underlying principles:

"NTLFlowLyzer: Toward Generating an Intrusion Detection Dataset and Intruders Behavior Profiling through Network Layer Traffic Analysis and Pattern Extraction", MohammadMoein Shafi, Arash Habibi Lashkari, Arousha Haghighian Roudsari, Computers & Security, 104160, ISSN 0167-4048, 2024

Download Dataset:

Request Dataset

7. Tabular IoT Attack Dataset (CIC-BCCC-NRC TabularIoTAttack-2024)

The CIC-BCCC-NRC TabularIoTAttack-2024 dataset is a comprehensive collection of IoT network traffic data, created to provide a reliable source for training and testing AI-powered IoT cybersecurity models. It is designed to address modern challenges in detecting and identifying IoT-specific cyberattacks, offering a rich and diverse set of labeled data that reflects realistic IoT network behaviours. A wide array of network characteristics was extracted using CICFlowMeter, with each record containing relevant features such as network flows, timestamps, source/destination IPs, and attack labels.

The full research paper outlining the details of the dataset and its underlying principles:

"An Efficient Self Attention-Based 1D-CNN-LSTM Network for IoT Attack Detection and Identification Using Network Traffic", Tinshu Sasi, Arash Habibi Lashkari, Rongxing Lu, Pulei Xiong, Shahrear Iqbal, Journal of Information and Intelligence, 2024, ISSN 2949-7159, https://doi.org/10.1016/j.jiixd.2024.09.001

Download Dataset:

Request Dataset

6. Malicious DNS and Attacks (BCCC-CIC-Bell-DNS-2024)

Using ALFlowLyzer, we successfully generated an augmented dataset, "BCCC-CIC-Bell-DNS-2024," from two existing datasets: "CIC-Bell-DNS-2021" and "CIC-Bell-DNS-EXF-2021." ALFlowLyzer enabled the extraction of essential flows from raw network traffic data, resulting in CSV files that integrate DNS metadata and application layer features. This new dataset combines light and heavy data exfiltration traffic into six unique sub-categories, providing a comprehensive structure for analyzing DNS data exfiltration attacks. The "BCCC-CIC-Bell-DNS-2024" dataset enhances the richness and diversity needed to evaluate our proposed profiling model effectively.

The full research paper outlining the details of the dataset and its underlying principles:

"Unveiling Malicious DNS Behavior Profiling and Generating Benchmark Dataset through Application Layer Traffic Analysis", Shafi, MohammadMoein, Arash Habibi Lashkari, Hardhik Mohanty; Computers and Electrical Engineering, 2024

Download Dataset:

Request Dataset

5. Cloud DDoS Attacks (BCCC-cPacket-Cloud-DDoS-2024)

Distributed denial of service (DDoS) attacks pose a significant threat to network security, and the effectiveness of new detection methods depends heavily on well-constructed datasets. After an in-depth analysis of 16 publicly available datasets that identified their shortcomings across various dimensions, the 'BCCC-cPacket-Cloud-DDoS-2024' dataset was created on a cloud infrastructure to address the challenges found in previous datasets. The dataset contains over eight benign user activities and 17 DDoS attack scenarios. It is fully labeled (with a total of 26 labels), with over 300 features extracted from the network and transport layers of the traffic flows using NTLFlowLyzer. The dataset's extensive size and comprehensive features make it a valuable resource for researchers and practitioners developing and validating more robust and accurate DDoS detection and mitigation strategies. Researchers can also use the 'BCCC-cPacket-Cloud-DDoS-2024' dataset to train learning-based models for predicting benign user behavior, detecting attacks, identifying patterns, and classifying network data.

The full research paper outlining the details of the dataset and its underlying principles:

"Toward Generating a New Cloud-Based Distributed Denial of Service (DDoS) Dataset and Cloud Intrusion Traffic Characterization", Shafi, MohammadMoein, Arash Habibi Lashkari, Vicente Rodriguez, and Ron Nevo.; Information 15, no. 4: 195. https://doi.org/10.3390/info15040195

Download Dataset:

Request Dataset

4. DNS over HTTPS (BCCC-CIRA-CIC-DoHBrw-2020)

The 'BCCC-CIRA-CIC-DoHBrw-2020' dataset is an augmented dataset created to address the imbalance in the 'CIRA-CIC-DoHBrw-2020' dataset. Unlike 'CIRA-CIC-DoHBrw-2020', which is skewed with about 90% malicious and only 10% benign DNS over HTTPS (DoH) network traffic, the 'BCCC-CIRA-CIC-DoHBrw-2020' dataset offers a balanced composition. It includes equal numbers of malicious and benign DoH network traffic instances, with 249,836 instances in each category. This balance was achieved using the Synthetic Minority Over-sampling Technique (SMOTE). The 'BCCC-CIRA-CIC-DoHBrw-2020' dataset comprises three CSV files: one for malicious DoH traffic, one for benign DoH traffic, and a third that combines both types.
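The core idea of SMOTE is to create each synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified, standard-library-only illustration of that idea, not the procedure or library used to build the dataset. The function name `smote`, the parameter `k`, and the two-dimensional toy points are assumptions for this example.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from the minority class.

    Each new point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy example: 4 benign (minority) points versus 10 malicious points.
benign = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
malicious_count = 10
new_points = smote(benign, malicious_count - len(benign))
balanced_benign = benign + new_points
```

After oversampling, the benign class has as many instances as the malicious class, which mirrors the equal class sizes reported for the balanced dataset.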

The full research paper outlining the details of the dataset and its underlying principles:

“Unveiling DoH Tunnel: Toward Generating a Balanced DoH Encrypted Traffic Dataset and Profiling malicious Behaviour using Inherently Interpretable Machine Learning”, Sepideh Niktabe, Arash Habibi Lashkari, Arousha Haghighian Roudsari, Peer-to-Peer Networking and Applications, Vol. 17, 2023

Download Dataset:

Request Dataset

3. Vulnerable Smart Contracts (BCCC-VulSCs-2023)

The BCCC-VulSCs-2023 dataset is a substantial collection for Solidity Smart Contracts (SCs) analysis, comprising 36,670 samples, each enriched with 70 feature columns. These features include the raw source code of the smart contract, a hashed version of the source code for secure referencing, and a binary label that indicates a contract as secure (0) or vulnerable (1). The dataset's extensive size and comprehensive features make it a valuable resource for machine-learning models to predict contract behavior, identify patterns, or classify contracts based on security and functionality criteria.

The full research paper outlining the details of the dataset and its underlying principles:

“Unveiling Vulnerable Smart Contracts: Toward Profiling Vulnerable Smart Contracts using Genetic Algorithm and Generating Benchmark Dataset”, Sepideh Hajihosseinkhani, Arash Habibi Lashkari, Ali Mizani, Blockchain: Research and Applications, Vol. 4, 2023

Download Dataset:

Request Dataset

2. SQL Injection Attack (BCCC-SFU-SQLInj-2023)

This dataset consists of a collection of 11,012 evasive or sophisticated malicious SQL queries. These queries are generated using a genetic algorithm applied to the Kaggle malicious SQL dataset. The goal of the genetic algorithm is to enhance the evasiveness and sophistication of the original malicious queries.

The full research paper outlining the details of the dataset and its underlying principles:

"An Evolutionary Algorithm for Adversarial SQL Injection Attack Generation", Maryam Issakhani, Mufeng Huang, Mohammad A. Tayebi, Arash Habibi Lashkari, IEEE Intelligence and Security Informatics (ISI2023), NC, USA

Download Dataset:

BCCC-SFU-SQLInj-2023 (CSV file)

1. Source Code Authorship Attribution (YU-SCAA-2022)

Source Code Authorship Attribution (SCAA) is the technique of finding the real author of source code in a corpus. Though it is a privacy threat to open-source programmers, it has been shown to be significantly helpful in developing forensic-based applications such as ghostwriting detection, copyright dispute settlement, catching authors of malicious applications using source code, and other code analysis applications. This dataset was created by extracting code data from the GCJ and GitHub datasets; examples of attacks and adversarial examples were created using Source Code imitator. The dataset contains a total of 1,632 code files from 204 authors.

The full research paper outlining the details of the dataset and its underlying principles:

“AuthAttLyzer: A Robust defensive distillation-based Authorship Attribution framework”, Abhishek Chopra, Nikhill Vombatkere, Arash Habibi Lashkari, The 12th International Conference on Communication and Network Security (ICCNS), 2022, China

Download Dataset:

YU-SCAA-2022 (zip file)