Sai Sanjeev · AI/ML Engineer

Industrial IoT

Modbus-Based SCADA System

A low-cost, production-grade SCADA system for real-time industrial monitoring and remote device control — built entirely with off-the-shelf hardware and open protocols.

Goal

Replace expensive proprietary SCADA setups with a cost-effective alternative using commodity IoT hardware. The system needed to be offline-capable, remotely accessible, and maintainable by non-specialist staff.

Architecture

ESP32 acts as the Modbus Slave, collecting DHT11 temperature/humidity readings into mapped Modbus registers. A Raspberry Pi 5 serves as the Modbus Master, polling the ESP32 every 2 seconds via Modbus RTU over serial (UART). The Pi also runs a lightweight local web server (Flask/Python) that exposes a live dashboard with real-time readings and device control buttons.
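To make the polling step concrete, here is a minimal sketch of the request frame a Modbus RTU master would send to read the ten holding registers. It assumes slave ID 1 and the conventional mapping of register 40001 to protocol address 0; neither is stated in the project notes, and the helper names are illustrative.

```python
import struct


def crc16_modbus(data: bytes) -> int:
    """CRC-16/MODBUS: initial value 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def read_holding_request(slave_id: int, first_register: int, count: int) -> bytes:
    """Build a function-0x03 (Read Holding Registers) request frame."""
    # Assumption: 4xxxx register numbers map to zero-based protocol addresses.
    address = first_register - 40001
    body = struct.pack(">BBHH", slave_id, 0x03, address, count)
    # The CRC is sent low byte first, unlike the big-endian fields before it.
    return body + struct.pack("<H", crc16_modbus(body))


frame = read_holding_request(slave_id=1, first_register=40001, count=10)
print(frame.hex(" "))  # 01 03 00 00 00 0a c5 cd
```

In practice a library such as pymodbus builds and validates these frames; the sketch only shows what travels over the UART on each 2-second poll.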

Implementation

Programmed ESP32 in embedded C to sample DHT11 sensor data and populate Modbus register map (holding registers 40001–40010)
Configured Raspberry Pi as Modbus RTU master with automatic reconnection and error-handling for dropped packets
Built responsive web dashboard (HTML/JS/Chart.js) showing live time-series graphs and toggle controls for connected relays
Integrated Cloudflare Tunnel (cloudflared) for zero-trust remote access — no port forwarding, no public IP exposure
Added local data logging to SQLite for offline analytics and audit trails
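The local logging step could look roughly like the sketch below. The table name, column names, and helper functions are hypothetical, since the project notes do not describe the schema; an in-memory database is used so the example runs anywhere.

```python
import sqlite3
import time

# Hypothetical schema: one row per successful poll of the ESP32.
SCHEMA = """
CREATE TABLE IF NOT EXISTS readings (
    ts            REAL NOT NULL,
    temperature_c REAL NOT NULL,
    humidity_pct  REAL NOT NULL
)
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the log database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_reading(conn, temperature_c, humidity_pct, ts=None):
    """Append one reading; the timestamp defaults to the current Unix time."""
    ts = time.time() if ts is None else ts
    with conn:  # commits on success, rolls back if the insert fails
        conn.execute(
            "INSERT INTO readings VALUES (?, ?, ?)",
            (ts, temperature_c, humidity_pct),
        )


def recent(conn, limit=10):
    """Return the newest readings first, e.g. for the dashboard graph."""
    query = (
        "SELECT ts, temperature_c, humidity_pct "
        "FROM readings ORDER BY ts DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


conn = open_log()
log_reading(conn, 27.0, 61.0, ts=1.0)
log_reading(conn, 28.0, 60.0, ts=3.0)
print(recent(conn, limit=1))  # [(3.0, 28.0, 60.0)]
```

Pointing `open_log` at a file path instead of `:memory:` would give the persistent, offline audit trail described above.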

Outcome & Impact

System runs continuously with 99% data accuracy across 72-hour stress tests. Remote access latency under 200ms via Cloudflare Tunnel. Total hardware cost under ₹4,000 vs. ₹50,000+ for commercial equivalents.

Skills Applied

Modbus RTU/TCP, embedded C, Python (pymodbus, Flask), Cloudflare Tunnel, SQLite, IoT system architecture, real-time dashboard development.

99% data accuracy

<200ms remote latency

₹4K vs ₹50K cost

Offline capable

Cybersecurity

STIX Malware Analyser

An automated ML pipeline for malware detection using structured cyber threat intelligence in STIX format — combining unsupervised anomaly detection with supervised classification for layered defence.

Goal

Automate the process of ingesting raw threat feeds, normalizing them into the STIX 2.1 standard, and applying ML models to detect both known and unknown malware with minimal analyst intervention.

Approach

Parsed and normalized threat data (IP indicators, file hashes, attack patterns, TTPs) into STIX 2.1 JSON bundles using the stix2 Python library
Engineered features from STIX objects: IP geolocation behavior, file signature entropy, lateral movement patterns, C2 communication frequency
Layer 1 — Isolation Forest (unsupervised): flags zero-day and novel anomalies where no labelled data exists
Layer 2 — SVM with RBF kernel (supervised): classifies flagged samples against a labelled corpus of known malware families
Built a report generation module that outputs STIX Course-of-Action objects with remediation suggestions for each detected threat
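As an illustration of the feature-engineering step, one plausible reading of "file signature entropy" is Shannon entropy over raw bytes. The project notes do not say how the feature was computed, so the function below is an assumption rather than the pipeline's actual code.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over every byte value that actually occurs.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy(b"\x00" * 64))       # 0.0
print(shannon_entropy(bytes(range(256))))  # 8.0
```

Packed or encrypted payloads tend to score close to 8 bits per byte, which is why entropy is a common input to anomaly detectors such as Isolation Forest.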

Results

Hybrid two-layer approach improved overall detection reliability by 30% over a single-model baseline. False positive rate reduced by 18% compared to Isolation Forest alone. End-to-end pipeline processes a 10,000-indicator feed in under 45 seconds.

Skills Applied

STIX 2.1 standard, Isolation Forest, SVM, feature engineering from threat intelligence data, end-to-end ML pipeline automation, Python (scikit-learn, stix2).

+30% detection reliability

-18% false positives

10K indicators / 45s

Predictive Maintenance

NVMe Level-3 Health Analyzer

A predictive maintenance system for NVMe SSDs that uses SMART telemetry data and XGBoost — tuned via Genetic Algorithm — to forecast drive failures before they occur.

Goal

Provide data centre operators and end users with an early warning system for NVMe drive failure, reducing unplanned downtime and data loss. Target: predict failure with at least 90% accuracy and an actionable lead time of 24–72 hours.

Workflow

Automated extraction of 15+ SMART attributes: NVMe temperature, read/write error rates, power-on hours, unsafe shutdowns, media errors, wear levelling counts, percentage-used endurance
Built a custom feature extraction module to handle missing SMART attributes (vendor-specific registers) using median imputation and flag encoding
Derived rolling-window features (7-day and 30-day) to capture degradation trends, not just point-in-time values
Trained XGBoost classifier, chosen for its native handling of missing values and its strong fit for structured log data
Applied Genetic Algorithm (DEAP library, 200+ generations, tournament selection) to optimize XGBoost hyperparameters: max_depth, learning_rate, subsample, colsample_bytree, n_estimators
Compared GA-tuned model against Grid Search and Bayesian Optimization baselines — GA achieved best F1 at comparable compute cost
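The rolling-window features in the workflow above can be sketched as follows. The choice of mean and end-to-end change as the two features, and the sample readings, are assumptions for illustration; the project notes do not list the exact aggregates.

```python
from statistics import mean


def rolling_features(values, window):
    """For each day with a full trailing window, return (mean, change over window)."""
    features = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        # The change across the window captures the degradation trend,
        # which a single point-in-time reading cannot show.
        features.append((mean(chunk), chunk[-1] - chunk[0]))
    return features


# Hypothetical daily "percentage used" readings for one drive.
pct_used = [10, 10, 11, 11, 12, 13, 15, 18]
for avg, change in rolling_features(pct_used, window=7):
    print(f"7-day mean={avg:.2f} change={change}")
```

The same function with `window=30` would give the longer-horizon features; both sets can then be joined to the point-in-time SMART attributes before training.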

Performance

Processed 10,000 drive records in 137 seconds end-to-end. Achieved 92% failure prediction accuracy with 89% recall on the minority (failure) class. GA tuning improved accuracy by 4.2% over default XGBoost hyperparameters.

Skills Applied

XGBoost, Genetic Algorithms (DEAP), feature engineering, predictive maintenance, NVMe/SMART telemetry parsing, model evaluation (precision/recall/F1), Python.

92% failure accuracy

89% recall on failures

10K records / 137s

+4.2% from GA tuning

RAG + LLM

AI Document Query Chatbot

A Retrieval-Augmented Generation (RAG) chatbot that answers user questions from uploaded documents using semantic search — not keyword matching — backed by LLaMA 2 and Chroma DB.

Goal

Enable users to query large documents (PDFs, reports, manuals) in natural language and receive accurate, grounded answers, reducing hallucination by anchoring every response to retrieved document chunks.

Architecture

Documents ingested and split into 500-token overlapping chunks (50-token overlap) to preserve context at boundaries
Each chunk converted to a 768-dim dense vector using Sentence Transformers (all-mpnet-base-v2 model)
Vectors stored in Chroma DB (local persistent vector store) with document metadata for source attribution
At query time: user query embedded with same Sentence Transformer, cosine similarity search retrieves top-3 most relevant chunks
Retrieved chunks + user query assembled into a structured prompt and passed to LLaMA 2 (7B, 4-bit quantized, GGUF format) via LangChain's RetrievalQA chain
Source citations returned alongside answers, showing which document sections were used
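The chunking and retrieval steps above can be sketched in plain Python. This is a simplified stand-in, not the project's code: tokens are represented as a plain list, vectors as lists of floats, and the brute-force search replaces what Chroma DB does internally. Function names are illustrative.

```python
import math


def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


chunks = chunk_tokens(list(range(1000)))
print([len(c) for c in chunks])  # [500, 500, 100]
```

With a 50-token overlap each window starts 450 tokens after the previous one, so a sentence that straddles a boundary appears whole in at least one chunk.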

Optimizations

Switched from FAISS to Chroma DB — 30% faster retrieval due to persistent indexing
Quantized LLaMA 2 to 4-bit GGUF format — reduced memory footprint from 14GB to 4GB, enabling local CPU inference
Implemented query caching for repeated questions — reduced average response time from 2.1s to 1.26s (40% improvement)

Metrics

90% answer accuracy on a held-out evaluation set of 200 QA pairs. Retrieval precision@3 of 88%. Runs fully offline on consumer hardware (8GB RAM, no GPU required).

Skills Applied

RAG architecture, vector databases (Chroma DB), LLM integration, LangChain, Sentence Transformers, GGUF quantization, Python.

90% answer accuracy

40% faster responses

2.1s → 1.26s avg

Runs fully offline

FinTech

Online Payment Fraud Detection

A robust fraud detection pipeline for credit card transactions tackling the real-world challenge of extreme class imbalance (only 0.17% fraud cases in 284,807 transactions).

Goal

Build a production-ready fraud classifier that maintains high precision and recall simultaneously — minimising both missed fraud (false negatives, which cost money) and false alarms (false positives, which hurt customer trust).

Process

Dataset: Kaggle European credit card dataset — 284,807 transactions, 492 fraud cases (0.17%), 28 PCA-transformed features + Time + Amount
Applied SMOTE (Synthetic Minority Oversampling Technique) to oversample fraud class from 492 → 10,000 synthetic samples, preserving realistic feature distributions
Feature engineering: transaction velocity (frequency per hour per card), balance shift ratio, time-since-last-transaction, normalized Amount (RobustScaler to handle outliers)
Trained and benchmarked three models: Logistic Regression (baseline), XGBoost, and Random Forest, each with 5-fold stratified cross-validation
Optimized prediction threshold (default 0.5 → 0.35) using Precision-Recall curve to maximize F1 on imbalanced test set
Built a lightweight inference pipeline: feature transformation → model predict_proba → threshold check → alert flag, achieving 25% faster runtime than naive sklearn pipeline
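The threshold-tuning step can be illustrated with a simple grid sweep that picks the cut-off with the highest F1. This is a simplified stand-in for reading the optimum off a Precision-Recall curve, and the labels and probabilities are toy values chosen for illustration, not the project's data.

```python
def f1_at(y_true, probs, threshold):
    """F1 score when every probability >= threshold is flagged as fraud."""
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, probs) if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probs, candidates):
    """Return the candidate threshold that maximizes F1."""
    return max(candidates, key=lambda t: f1_at(y_true, probs, t))


# Toy labels (1 = fraud) and model probabilities.
y_true = [0, 0, 0, 1, 1, 0, 1]
probs = [0.10, 0.20, 0.40, 0.36, 0.80, 0.30, 0.60]
candidates = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05 .. 0.95

print(best_threshold(y_true, probs, candidates))  # 0.35
```

On this toy data the default 0.5 cut-off misses the fraud case scored 0.36, while lowering the threshold recovers it at the cost of one false alarm. The threshold should be chosen on validation data and only then applied to the held-out test set.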

Outcome

Random Forest outperformed both Logistic Regression and XGBoost. Final model: 95% precision, 93% recall, F1-score 0.94 on held-out test set. ROC-AUC of 0.98. Pipeline throughput: 50,000 transactions processed per minute.
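As a quick arithmetic check, the reported F1-score follows from the precision and recall figures, and the class imbalance follows from the dataset counts. The snippet only recomputes numbers already stated in this write-up.

```python
precision, recall = 0.95, 0.93

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.94

# Share of fraud cases in the dataset: 492 out of 284,807 transactions.
fraud_share = 492 / 284_807
print(f"{fraud_share:.2%}")  # 0.17%
```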

Skills Applied

SMOTE, imbalanced learning, threshold optimization, Random Forest, XGBoost, feature engineering, precision/recall tradeoff analysis, Python (scikit-learn, imbalanced-learn).

95% precision

93% recall

0.98 ROC-AUC

25% faster pipeline

Smart Agri · Active 🌱

SOIL — Sustainable Organic Intelligence Layer

An end-to-end smart agriculture system providing real-time soil health monitoring and AI-powered crop/fertilizer recommendations for small-scale farmers — a Govt. of Karnataka Grassroot Innovation 2025 finalist.

Goal

Eliminate the dependence on expensive and infrequent laboratory soil testing for small farmers. Provide continuous, affordable, and actionable soil intelligence directly in the field, reducing input costs and improving yield decisions.

System Components

Sensor layer: capacitive soil moisture sensor, analog pH probe (SEN0161), NPK sensor (RS485 Modbus output) — all connected to an ESP32 microcontroller
ESP32 aggregates multi-sensor readings every 5 minutes and transmits compressed packets over LoRa (SX1278, 433 MHz) to a central gateway up to 2km away
Gateway (Raspberry Pi 4) receives LoRa packets, decodes sensor values, and runs local AI inference — no internet dependency
AI model: Random Forest classifier trained on a regional crop-soil dataset (5,000 labelled samples across Karnataka crops) — classifies soil health into 3 categories (Healthy / Nutrient-deficient / Degraded) and recommends suitable crops + fertilizer ratios
Results displayed on a local e-ink display at the gateway node and a mobile-optimised web dashboard accessible on the farm's local WiFi
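One way the sensor node could keep LoRa payloads small is fixed-point binary packing, sketched below. The 12-byte field layout, the scaling factors, and the units are assumptions made for illustration; the project notes only say that the packets are compressed.

```python
import struct

# Hypothetical layout (little-endian, six unsigned 16-bit fields):
# node id, moisture % x10, pH x100, then N, P, K readings as integers.
PACKET = struct.Struct("<HHHHHH")


def encode(node_id, moisture_pct, ph, n, p, k):
    """Pack one set of readings into 12 bytes using fixed-point scaling."""
    return PACKET.pack(node_id, round(moisture_pct * 10), round(ph * 100), n, p, k)


def decode(payload):
    """Reverse of encode(), as the gateway would run it on a received packet."""
    node_id, moisture, ph, n, p, k = PACKET.unpack(payload)
    return {
        "node_id": node_id,
        "moisture_pct": moisture / 10,
        "ph": ph / 100,
        "n": n,
        "p": p,
        "k": k,
    }


payload = encode(7, 31.5, 6.8, 42, 18, 120)
print(len(payload), decode(payload))
```

A fixed binary layout like this is far smaller than JSON or CSV text, which matters at LoRa data rates and helps keep transmit time, and therefore power draw, low.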

Pilot Results

Tested on 3 farms in the Mysore district over 8 weeks. Reduced manual soil testing time by 80% (from 2 hours per sample trip to 20 minutes per week of dashboard review). Crop recommendations matched expert agronomist advice in 85% of cases. Avg. sensor power draw: 12mA — projected battery life of 6 months on a 10,000mAh pack.

Planned Upgrades

Weather API integration (OpenWeather) to factor rainfall forecasts into irrigation advice
Auto-irrigation control via relay-controlled solenoid valves triggered by soil moisture thresholds
SMS alert gateway for farmers without smartphones

Skills Applied

LoRa communication, ESP32 embedded programming, Modbus RS485, edge AI inference, Random Forest, agricultural domain knowledge, low-power IoT design, Raspberry Pi, Python.

80% less manual testing

85% expert match rate

2km LoRa range

🏆 Gov. Finalist 2025

I'M SAI SANJEEV

Work Experience

Machine Learning Intern

Student TL Manager

Teaching Assistant

Featured Projects

Core Competencies

Education & Certifications

B.E. Computer Science (AI & ML)

Let's Connect

Contact Details

Social & Profiles
