---
language:
- en
- id
tags:
- bert
- text-classification
- token-classification
- cybersecurity
- fill-mask
- named-entity-recognition
- transformers
- tensorflow
- pytorch
- masked-language-modeling
base_model: boltuix/bert-micro
library_name: transformers
pipeline_tag: fill-mask
---

# bert-micro-cybersecurity

## 1. Model Details

**Model description**

"bert-micro-cybersecurity" is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs. benign content).

- Model type: fine-tuned lightweight BERT variant
- Languages: English & Indonesian
- Fine-tuned from: `boltuix/bert-micro`
- Status: **Early version**, trained on **66.14%** of the planned data.

**Model sources**

- Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
- Data: Cybersecurity data

## 2. Uses

### Direct use

You can use this model to classify cybersecurity-related text, for example to decide whether a given message, report, or log entry indicates malicious intent, abnormal behaviour, or the presence of a threat. A minimal usage sketch is included at the end of this card.

### Downstream use

- Embedding extraction for clustering.
- Named Entity Recognition on logs or other security data.
- Classification of security data.
- Anomaly detection in security logs.
- As part of a pipeline for phishing detection, malicious email filtering, or incident triage.
- As a feature extractor feeding a downstream system (e.g., alert generation, a SOC dashboard).

### Out-of-scope use

- Not meant for high-stakes automated blocking decisions without human review.
- Not optimized for languages other than English and Indonesian.
- Not tested on non-cybersecurity domains or out-of-distribution data.

### Downstream use cases in development

- NER on security logs, botnet data, and JSON data.
- Early classification of SIEM alerts and events.

## 3. Bias, Risks, and Limitations

Because the model is trained on a partial subset (66.14%) of the planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control systems, IoT logs, other languages).

- Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data, e.g., over-representation of certain threat types or vendor- and tooling-specific vocabulary.
- **Should not be used as the sole authority for incident decisions; only as an aid to human analysts.**

## 4. Training Details

### Text Processing & Chunking

Since cybersecurity data often contains lengthy alert descriptions and execution logs that exceed BERT's 512-token limit, we implement an overlapping chunking strategy:

- **Max sequence length**: 512 tokens
- **Stride**: 32 tokens (overlap between consecutive chunks)
- **Chunking behavior**: Long texts are split into overlapping segments. For example, with max_length=512 and stride=32, a 1000-token document becomes ~3 chunks with 32-token overlaps, preserving context across boundaries. A tokenization sketch is included at the end of this card.

### Training Hyperparameters

- **Base model**: `boltuix/bert-micro`
- **Training epochs**: 3
- **Learning rate**: 5e-05
- **Batch size**: 16
- **Weight decay**: 0.01
- **Warmup ratio**: 0.06
- **Gradient accumulation steps**: 1
- **Optimizer**: AdamW
- **LR scheduler**: Linear with warmup

### Training Data

- **Total database rows**: 246,838
- **Rows processed (cumulative)**: 163,258 (66.14%)
- **Training date**: 2025-12-30 04:18:17

### Post-Training Metrics

- **Final training loss**:
- **Rows→Samples ratio**:
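
## 5. Usage Sketch

The snippet below is a minimal sketch, not a definitive recipe: it shows how the model could be loaded for fill-mask inference and how a long log or alert text can be chunked with the max length and stride described in Section 4. The repository id `your-org/bert-micro-cybersecurity` and the sample texts are placeholders (assumptions, not the published names); only standard `transformers` APIs are used.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Hypothetical repository id; replace with the actual published model name.
model_id = "your-org/bert-micro-cybersecurity"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill-mask inference on a short, cybersecurity-flavoured sentence.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill(f"The attacker used a phishing {tokenizer.mask_token} to steal credentials."))

# Overlapping chunking for long inputs, mirroring the strategy above
# (max_length=512 tokens, 32-token overlap between consecutive chunks).
long_text = " ".join(["suspicious outbound connection to 10.0.0.5 detected"] * 200)
encoded = tokenizer(
    long_text,
    max_length=512,
    stride=32,                      # overlap between consecutive chunks
    truncation=True,
    return_overflowing_tokens=True, # return every chunk, not just the first
)
print(f"Number of chunks: {len(encoded['input_ids'])}")
```

For classification-style downstream use, a common design choice is to load the same checkpoint with `AutoModelForSequenceClassification`, run each chunk separately, and aggregate chunk-level predictions (e.g., by max or mean pooling); the card does not prescribe a specific aggregation method.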