
A production-ready Cyber Threat Intelligence (CTI) system that uses Natural Language Processing (NLP) and AI-based classification to extract cyber threat indicators from unstructured text, categorize threat types, and predict severity levels. Results are served through an interactive web interface, backed by database persistence, Docker deployment, and automated testing.
In today’s cyber threat landscape, real-time intelligence is crucial. This platform uses NLP-based entity recognition, machine learning-based threat classification, and severity prediction models to generate actionable insights from threat reports and social media data.
The system is designed for analysts and SOC teams to triage, investigate, and act — all within one command-center styled dashboard with full database persistence and API-driven architecture.
```bash
# Clone the repository
git clone https://github.com/sanjanb/cti-nlp-system.git
cd cti-nlp-system

# Quick setup
make quick-start

# Or manually:
python setup.py
docker-compose up -d

# Access the application
open http://localhost:8000/docs
```
```bash
# Setup Python environment
python -m venv myenv
myenv\Scripts\activate      # Windows
source myenv/bin/activate   # Linux/Mac

# Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# Train models (if needed)
python scripts/train_threat_classifier.py
python scripts/train_severity_model.py

# Start the application
uvicorn backend.main:app --reload
```
```
cti-nlp-system/
├── backend/                  # FastAPI backend with ML models
│   ├── main.py               # Main API application
│   ├── threat_ner.py         # Named Entity Recognition
│   ├── classifier.py         # Threat classification
│   └── severity_predictor.py # Severity prediction
│
├── database/                 # Database layer
│   ├── models.py             # SQLAlchemy models
│   ├── database.py           # Database configuration
│   ├── services.py           # Database services/CRUD
│   └── init.sql              # Database initialization
│
├── data_ingestion/           # Data collection modules
│   ├── fetch_twitter.py      # Twitter API integration
│   ├── fetch_darkweb.py      # Dark web data collection
│   ├── fetch_mitre_attack.py # MITRE ATT&CK integration
│   └── preprocess.py         # Data preprocessing
│
├── tests/                    # Comprehensive test suite
│   ├── conftest.py           # Test configuration
│   ├── test_api.py           # API endpoint tests
│   ├── test_models.py        # ML model tests
│   └── run_tests.py          # Test runner
│
├── scripts/                  # Utility and training scripts
│   ├── train_threat_classifier.py # Train classification model
│   ├── train_severity_model.py    # Train severity model
│   └── ingest_all_sources.py      # Data ingestion orchestrator
│
├── docs/                     # Documentation
│   ├── DEPLOYMENT.md         # Deployment guide
│   ├── API.md                # API documentation
│   └── [model docs]          # Model-specific documentation
│
├── nginx/                    # Nginx configuration
├── models/                   # Trained ML models
├── data/                     # Raw and processed data
├── docker-compose.yml        # Multi-service orchestration
├── Dockerfile                # Application container
├── Makefile                  # Development automation
└── requirements.txt          # Python dependencies
```
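To make the role of `threat_ner.py` concrete, here is a minimal, standard-library-only sketch of the kind of indicators an entity-extraction component pulls from raw text. The actual module is model-based (BERT-NER); the regexes, labels, and `extract_iocs` helper below are illustrative, not the project's real patterns.

```python
import re

# Illustrative IOC patterns; the real threat_ner.py uses model-based NER.
IOC_PATTERNS = {
    "CVE": re.compile(r"\bCVE-\d{4}-\d{4,7}\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "MD5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
}

def extract_iocs(text: str) -> list:
    """Return labeled pattern matches, sorted by position of appearance."""
    hits = []
    for label, pattern in IOC_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append({"label": label, "text": m.group(), "start": m.start()})
    return sorted(hits, key=lambda h: h["start"])

iocs = extract_iocs("QakBot exploited CVE-2023-1234 from 203.0.113.7")
```

A regex pass like this is often kept alongside the NER model, since structured indicators (CVE IDs, IPs, hashes) are cheaper and more reliable to match lexically than to learn.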
| Category | Tools & Libraries |
|---|---|
| Backend | FastAPI, Uvicorn, SQLAlchemy, Alembic |
| Database | PostgreSQL 15, Redis 7 |
| ML/NLP | spaCy, HuggingFace Transformers, scikit-learn |
| Models | BERT-NER, TF-IDF + Logistic Regression, XGBoost |
| Testing | pytest, httpx, pytest-cov |
| DevOps | Docker, Docker Compose, Nginx |
| Monitoring | Health checks, Prometheus metrics |
| Security | CORS, Rate limiting, Environment variables |
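The classification stack in the table above (TF-IDF + Logistic Regression) can be sketched end to end with scikit-learn. The toy corpus and labels here are illustrative stand-ins, not the project's training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for labeled threat reports.
texts = [
    "phishing email credential harvesting fake login page",
    "spear phishing campaign targeting bank employees",
    "ransomware encrypted files demanding bitcoin payment",
    "ransomware attack locked hospital systems for ransom",
]
labels = ["Phishing", "Phishing", "Ransomware", "Ransomware"]

# Same shape as the trained classifier: vectorize text, then a linear model.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["phishing lure with a fake login portal"])[0])
```

A pipeline like this is what `scripts/train_threat_classifier.py` would serialize to `models/` for the API to load at startup.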
```bash
# Run the setup script
python setup.py

# Check system health
make health

# Run tests
make test
```

**Environment Configuration**

```bash
cp .env.example .env
# Edit .env with your configuration
```

**Database Setup**

```bash
# Start database services
docker-compose up -d postgres redis

# Initialize database
make db-init
```

**Model Training**

```bash
# Train all models
make train-models

# Or individually:
make train-classifier
make train-severity
```

**Start Application**

```bash
# Development mode
make dev

# Production mode
make deploy-prod
```
- `POST /analyze` - Analyze threat intelligence text
- `GET /threats` - Retrieve threat records with filtering
- `GET /threats/{id}` - Get detailed threat information
- `GET /analytics` - Get threat statistics and analytics
- `POST /upload_csv` - Batch analysis from CSV files
- `GET /health` - System health check

```python
import requests

# Analyze text
response = requests.post(
    "http://localhost:8000/analyze",
    json={"text": "APT29 phishing campaign targeting banks"},
)
result = response.json()
print(f"Threat: {result['threat_type']}")
print(f"Severity: {result['severity']}")
print(f"Entities: {result['entities']}")
```
Full API documentation: docs/API.md
Interactive docs: http://localhost:8000/docs (when running)
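The `/analyze` response carries `original_text`, `entities`, `threat_type`, and `severity` fields. A small helper can turn that payload into a one-line triage summary; `summarize_result` is a hypothetical client-side utility (not part of the backend), and the entity dicts with `text`/`label` keys are an assumed shape.

```python
def summarize_result(result: dict) -> str:
    """Format an /analyze response as a one-line triage summary."""
    entities = ", ".join(e.get("text", str(e)) for e in result.get("entities", [])) or "none"
    return f"[{result['severity']}] {result['threat_type']} (entities: {entities})"

sample = {
    "original_text": "QakBot malware exploited CVE-2023-1234 via phishing in Russia",
    "entities": [{"text": "QakBot", "label": "MALWARE"},
                 {"text": "CVE-2023-1234", "label": "CVE"}],
    "threat_type": "Phishing",
    "severity": "High",
}
print(summarize_result(sample))  # [High] Phishing (entities: QakBot, CVE-2023-1234)
```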
```bash
# Complete test suite
make test

# Unit tests only
make test-unit

# Integration tests only
make test-integration

# With coverage report
pytest --cov=backend --cov=database --cov-report=html
```
The test suite maintains 70%+ code coverage, with detailed HTML reports generated in `htmlcov/`.
```bash
# Start all services
docker-compose up -d

# View logs
docker-compose logs -f

# Scale application
docker-compose up -d --scale app=3

# Production deployment
make deploy-prod

# With custom environment
docker-compose -f docker-compose.prod.yml up -d
```
Supported deployment targets and the full walkthrough are covered in the deployment guide: docs/DEPLOYMENT.md
```bash
# Check application health
curl http://localhost:8000/health

# View service status
docker-compose ps

# Monitor logs
make logs-app

# Create full backup
make backup-full

# Database backup only
make db-backup

# Restore from backup
make restore
```
| Document | Description |
|---|---|
| `docs/DEPLOYMENT.md` | Complete deployment guide |
| `docs/API.md` | Comprehensive API documentation |
| `docs/1. SCRIPTS_OVERVIEW.md` | Training scripts documentation |
| `docs/2. THREAT_CLASSIFIER.md` | Threat classification model |
| `docs/3. SEVERITY_MODEL.md` | Severity prediction model |
| `docs/4. NER_MODEL.md` | Named Entity Recognition model |
| `docs/5. BACKEND_OVERVIEW.md` | Backend architecture overview |
| `CONTRIBUTING.md` | Contribution guidelines |
| Metric | Performance |
|---|---|
| API Response Time | < 200ms (95th percentile) |
| Throughput | 100+ requests/second |
| Model Inference | < 50ms per text sample |
| Database Queries | < 10ms (indexed queries) |
| Memory Usage | < 2GB (production) |
We welcome contributions from students, researchers, and cybersecurity enthusiasts!
```bash
# Setup development environment
make setup
make install

# Create a feature branch
git checkout -b feature/amazing-feature

# Make changes
# ...

# Test your changes
make test
make lint
make format

# Submit PR
git push origin feature/amazing-feature
```
Detailed guidelines: CONTRIBUTING.md
This project is released under the MIT License. See LICENSE for more details.
Developed as part of the final year project (CSE - AI & ML) at ATME College of Engineering, Mysuru.
If you find this project useful, please give it a star! ⭐
_Last updated: August 2025 | v2.0.0_