ATME College of Engineering · Lead AI/ML Researcher · Jan 2025

AI-Powered Cyber Threat Intelligence System

Operational cyber threat intelligence platform combining BERT NER, ensemble classification, severity prediction, and real-time analysis, delivering 89.2% classification accuracy and sub-second processing for SOC workflows.

Aug 2024 to Jan 2025 · 5 months · Live

A cyber threat intelligence system that uses Natural Language Processing (NLP) and AI-based classification to extract threat indicators from unstructured text, categorize threat types, predict severity levels, and visualize the results through an interactive web interface.

Project Overview

In today’s rapidly evolving cyber threat landscape, security analysts are overwhelmed with vast amounts of unstructured threat intelligence data from blogs, forums, reports, and social media. This project addresses the critical need for automated threat analysis by developing a comprehensive AI system that delivers:

  • Automated Threat Entity Extraction: Identifies malware names, threat actors, IPs, domains, and CVEs using BERT-based NER
  • Intelligent Threat Classification: Categorizes threats as Phishing, Malware, APT, or Ransomware with 89.2% accuracy
  • Risk Severity Assessment: Predicts threat impact as Low, Medium, or High using ensemble learning
  • Real-time Analysis Dashboard: Provides actionable intelligence for SOC teams and security analysts

Technical Architecture

Core NLP Pipeline
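# Named Entity Recognition using BERT-based models
from transformers import pipeline

# aggregation_strategy="simple" merges word pieces so each result carries "entity_group".
# Checkpoint name is the one listed under Machine Learning Models; swap in fine-tuned weights if they differ.
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER",
                        aggregation_strategy="simple")

def extract_threat_entities(text):
    entities = ner_pipeline(text)
    return [{"word": e["word"], "entity_group": e["entity_group"]}
            for e in entities]

# Threat Classification with ensemble methods
# tfidf_vectorizer and threat_classifier are the fitted vectorizer and ensemble model, loaded elsewhere at startup
def classify_threat(text):
    features = tfidf_vectorizer.transform([text])
    prediction = threat_classifier.predict(features)
    return prediction[0]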

Machine Learning Models

  • Named Entity Recognition: Fine-tuned BERT model (dslim/bert-base-NER) for cybersecurity entities
  • Threat Classification: Ensemble approach combining TF-IDF + Logistic Regression with Random Forest
  • Severity Prediction: Random Forest with engineered cybersecurity-specific features
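
One common way to combine two classifiers like these is soft voting: each model's class probabilities are averaged and the highest-scoring class wins. The sketch below is a minimal, library-free illustration of that idea. The function name `soft_vote`, the weights, and the probability values are hypothetical, and the source does not state which voting scheme the project actually uses.

```python
from typing import Dict, List, Optional


def soft_vote(model_probs: List[Dict[str, float]],
              weights: Optional[List[float]] = None) -> str:
    """Average per-class probabilities across models and return the top class."""
    if not model_probs:
        raise ValueError("need at least one model's probabilities")
    if weights is None:
        weights = [1.0] * len(model_probs)
    if len(weights) != len(model_probs):
        raise ValueError("one weight per model is required")

    total = sum(weights)
    combined: Dict[str, float] = {}
    for probs, w in zip(model_probs, weights):
        for label, p in probs.items():
            # Weighted average of this class's probability across models
            combined[label] = combined.get(label, 0.0) + w * p / total

    # The class with the highest averaged probability wins
    return max(combined, key=combined.get)


# Hypothetical outputs for one report (illustrative numbers, not project data)
logreg_probs = {"Phishing": 0.55, "Malware": 0.25, "APT": 0.10, "Ransomware": 0.10}
forest_probs = {"Phishing": 0.30, "Malware": 0.50, "APT": 0.10, "Ransomware": 0.10}

label = soft_vote([logreg_probs, forest_probs])
```

With fitted scikit-learn estimators, `VotingClassifier` with `voting="soft"` provides the same behavior without hand-rolled averaging.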

Advanced Feature Engineering

  • IOC frequency analysis (IPs, domains, CVEs)
  • Cybersecurity keyword density mapping
  • Named entity occurrence patterns
  • Text complexity and sentiment metrics
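
As a rough illustration of the first two feature families, the sketch below counts IOCs with regular expressions and computes a keyword density. The patterns are deliberately simplified, and the keyword list, TLD list, and function name `ioc_features` are assumptions made for this example rather than the project's actual feature code.

```python
import re
from typing import Dict

# Simplified patterns for illustration; production IOC matching needs stricter rules
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+(?:com|net|org|io|ru|cn)\b", re.IGNORECASE)

# Hypothetical keyword list; the project's real vocabulary is not given in the source
KEYWORDS = {"phishing", "ransomware", "malware", "exploit", "backdoor", "payload"}


def ioc_features(text: str) -> Dict[str, float]:
    """Return IOC counts and the share of tokens that are security keywords."""
    tokens = re.findall(r"[a-z0-9'-]+", text.lower())
    keyword_hits = sum(1 for t in tokens if t in KEYWORDS)
    return {
        "ip_count": len(IPV4_RE.findall(text)),
        "domain_count": len(DOMAIN_RE.findall(text)),
        "cve_count": len(CVE_RE.findall(text)),
        "keyword_density": keyword_hits / len(tokens) if tokens else 0.0,
    }


sample = ("Ransomware payload fetched from evil-updates.com at 203.0.113.7 "
          "exploits CVE-2024-12345.")
features = ioc_features(sample)
```

A feature dictionary like this can be turned into a numeric vector and fed to a tree-based model alongside the text features.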

Performance Metrics

Model Component          | Accuracy | Precision | Recall | F1-Score
Threat Classification    | 89.2%    | 87.8%     | 88.5%  | 88.1%
Severity Prediction      | 84.7%    | 83.2%     | 85.1%  | 84.1%
Named Entity Recognition | 91.3%    | 89.7%     | 92.8%  | 91.2%
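
F1 is the harmonic mean of precision and recall, so the F1 column can be sanity-checked against the other two. The short sketch below recomputes it from the precision and recall values in the table above; the helper name `f1_score` is only for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both fractions or both percentages)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) pairs taken from the table above
reported = {
    "Threat Classification": (87.8, 88.5),
    "Severity Prediction": (83.2, 85.1),
    "Named Entity Recognition": (89.7, 92.8),
}

# Round to one decimal place to match the table's precision
recomputed = {name: round(f1_score(p, r), 1) for name, (p, r) in reported.items()}
```

The recomputed values (88.1, 84.1, 91.2) agree with the reported F1 scores.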

Interactive Dashboard Features

  • Real-time Threat Analysis: Instant processing of threat reports
  • Visual Entity Highlighting: Color-coded threat indicators
  • Expandable Result Cards: Detailed classification breakdown
  • Export Capabilities: JSON/CSV report downloads
  • Responsive Design: Mobile-friendly interface
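
The JSON/CSV export can be sketched with the Python standard library alone. The field names (`report_id`, `threat_type`, `severity`) and the sample rows below are hypothetical; the dashboard's actual export schema is not described in the source.

```python
import csv
import io
import json
from typing import Dict, List


def export_json(results: List[Dict[str, str]]) -> str:
    """Serialize analysis results as pretty-printed JSON."""
    return json.dumps(results, indent=2)


def export_csv(results: List[Dict[str, str]]) -> str:
    """Serialize analysis results as CSV, using the first row's keys as the header."""
    if not results:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


# Hypothetical analysis results (illustrative only)
results = [
    {"report_id": "r-001", "threat_type": "Phishing", "severity": "High"},
    {"report_id": "r-002", "threat_type": "Malware", "severity": "Medium"},
]

json_report = export_json(results)
csv_report = export_csv(results)
```

Returning strings rather than writing files keeps the helpers easy to plug into a download endpoint.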

Research Impact & Innovation

Key Technical Contributions:
  1. Novel Ensemble Architecture: Combined traditional ML with modern NLP for robust predictions
  2. Domain-Specific Feature Engineering: Developed cybersecurity-focused feature extraction methods
  3. Real-time Processing Pipeline: Optimized for sub-second threat analysis
  4. Production-Ready Implementation: Designed for actual SOC deployment

Academic Achievement:

  • Outstanding Final Year Project at ATME College of Engineering (CSE - AI & ML)
  • Team Leadership: Successfully coordinated 4-member interdisciplinary research team
  • Industry Relevance: Addressed real-world cybersecurity operational challenges
  • Open Source Contribution: Growing community engagement on GitHub

Future Enhancements

Planned Technical Improvements:
  1. Advanced Transformer Models: Integration with ThreatBERT and domain-specific transformers
  2. Real-time Intelligence Feeds: Live threat data from multiple sources
  3. Graph-based Analytics: Threat actor relationship mapping and visualization
  4. Automated Response Systems: IOC blocking and firewall integration
  5. Multilingual Support: Analysis capabilities for non-English threat sources

This project demonstrates the practical application of cutting-edge AI/ML techniques in cybersecurity, providing immediate operational value while contributing to the advancement of automated threat intelligence systems.

Problem → Solution → Impact

Problem                                        | Solution                                                               | Impact
No structured extraction of threat entities    | Fine-tuned BERT NER for security vocabulary                            | 91.3% entity recognition accuracy
Inconsistent threat classification reliability | Ensemble (TF-IDF + Logistic Regression + Random Forest)                | 89.2% multi-class accuracy
Latency blocked analyst adoption               | Pre-warmed model + caching + async pipeline                            | Sub-second interactive analysis
Feature sparsity hurting severity prediction   | Domain-specific engineered indicators (IOC density, keyword weighting) | +6–8% F1 lift in the severity model
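
The latency row combines three ideas: load the model once before traffic arrives, cache results for repeated inputs, and keep blocking inference off the event loop. The sketch below shows one way to wire these together using only the standard library (Python 3.9+ for `asyncio.to_thread`). `StubModel` and all function names are stand-ins invented for this example, not the project's code.

```python
import asyncio
import time
from functools import lru_cache
from typing import List


class StubModel:
    """Stand-in for a real classifier; construction is deliberately slow."""

    def __init__(self) -> None:
        time.sleep(0.05)  # simulate expensive model loading

    def predict(self, text: str) -> str:
        return "Phishing" if "password" in text.lower() else "Malware"


@lru_cache(maxsize=1)
def get_model() -> StubModel:
    # Built on the first call; later calls return the same instance
    return StubModel()


@lru_cache(maxsize=1024)
def analyze_cached(text: str) -> str:
    # Once a report has been analyzed, repeats are served from the cache
    return get_model().predict(text)


async def analyze(text: str) -> str:
    # Run blocking inference in a worker thread so the event loop stays responsive
    return await asyncio.to_thread(analyze_cached, text)


async def main() -> List[str]:
    get_model()  # pre-warm at startup, before the first request arrives
    reports = ["Reset your password here", "Trojan dropper observed",
               "Reset your password here"]
    return await asyncio.gather(*(analyze(r) for r in reports))


labels = asyncio.run(main())
```

In a web service, the pre-warm call would typically sit in the application's startup hook so the first analyst request does not pay the model-loading cost.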

Project Timeline

  1. 2024-08 · Dataset curation & labeling

    Aggregated multi-source threat reports; established annotation schema.

  2. 2024-09 · Baseline models & pipeline

    Implemented TF-IDF + Logistic Regression + Random Forest ensemble; initial NER integration.

  3. 2024-10 · Advanced feature engineering

    Added IOC frequency, sentiment metrics, entity pattern features.

  4. 2024-11 · Real-time API & dashboard

    Built FastAPI services + interactive visualization layer.

  5. 2024-12 · Optimization & latency tuning

    Model loading strategy + vectorization caching for sub‑second responses.

  6. 2025-01 · Final evaluation & reporting

    Reached target metrics; documentation & deployment packaging.