# semeion

## Overview

Semeion is a **timeline-first digital forensics analysis platform** that addresses two major pain points in modern investigations: navigating massive communication datasets and correlating events across data sources, including filesystem artifacts and OS events.

> **Core Question Semeion Answers:** "What happened around this time, and how are these events connected?"

Traditional forensic tools often have an unintuitive or slow user experience (an opinionated view). Semeion, in contrast, aims to provide a fast, intelligent timeline that automatically clusters semantically related events, so investigators can see patterns at a glance rather than clicking through endless individual entries.

An investigator opens Semeion alongside their primary forensic tool, then either searches for relevant content (like "*discussions about data transfer*" in Russian messages) or navigates to a known timestamp (from incident logs). They immediately see all correlated activity (messages, file access, network connections, system events) organized into meaningful sequences rather than flat lists.

## Key Features

| Feature | Description |
|---------|-------------|
| **Timeline Visualization** | Render millions of events with <200 ms response time (design target); smooth zooming and panning |
| **Semantic Event Clustering** | Automatically group related events into patterns (file transfers, communication bursts, login sequences) |
| **Multilingual Semantic Search** | Find concepts across languages: search in English, find results in Russian/Chinese/Arabic with translation |
| **Two-Mode Operation** | Semantic content discovery → timeline analysis OR timestamp → content correlation |
| **Cross-Source Correlation** | Unified timeline across messages, files, network, browser, and system events |
| **Event Grouping** | Collapse/expand semantically related events; discover patterns |
| **Native Desktop Application** | PySide6-based; runs fully locally, tailored for forensic-grade air-gapped environments |
| **Open Source & Transparent** | Auditable methods, minimal "black-box" AI, court-defensible results |

## The Problem Semeion Solves

### Problematic UX with existing forensic suites

Existing forensic tools have timeline interfaces that:

- Render slowly
- Show flat event lists with no context
- Require manual correlation across sources
- Have poor visualization and navigation

*This assessment is highly opinionated and based on the maintainer's personal experience with several large commercial suites that did not deliver satisfying results. None are named here, and exceptions may exist, but this area needs improvement.*

**Semeion's Solution:** Fast, interactive timeline with semantic clustering that shows patterns immediately.

### Large Communication Datasets

Modern investigations involve:

- Seized databases with millions of relevant artifacts
- Multilingual content
- Slang, code words, and poor machine translation
- Hours spent reading irrelevant conversations

**Semeion's Solution:** Semantic search that understands meaning, handles multiple languages, and jumps directly to timeline context.

## How It Works

### Two Entry Points

#### **Entry Point 1: Semantic content discovery → timeline analysis**

```text
1. Investigator searches: "discussions about file transfers"
2. Semeion finds semantically relevant messages
3. Click any result → Timeline centers on that timestamp
4. See all activity ±2 hours: file access, USB connections, deletions
5. Semantic clustering highlights: "File Exfiltration Sequence" pattern
```

#### **Entry Point 2: timestamp → content correlation**

```text
1. Incident log shows: "Ransomware encrypted files at 2024-03-15 14:22:00"
2. Navigate timeline to that timestamp
3. Semantic clustering reveals:
   - Communication spike 30min before (cluster: "Coordination Activity")
   - Suspicious file downloads (cluster: "Malware Delivery")
   - Network connections (cluster: "C2 Communication")
   - Process executions leading to encryption
4. Complete attack chain visible at a glance
```

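
Both entry points end the same way: the timeline centers on one timestamp and shows the surrounding activity. Below is a minimal sketch of that window lookup, assuming a hypothetical SQLite table `events(ts, source, description)` that stores UTC epoch seconds. The table layout, the `events_around` helper, and the sample rows are illustrative assumptions, not Semeion's actual schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def events_around(conn, center, window=timedelta(hours=2)):
    """Return (ts, source, description) rows within +/- window of center, oldest first."""
    lo = (center - window).timestamp()
    hi = (center + window).timestamp()
    rows = conn.execute(
        "SELECT ts, source, description FROM events "
        "WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (lo, hi),
    )
    return rows.fetchall()


def utc(*args):
    """Build a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


# Hypothetical in-memory store; an index on ts keeps range scans cheap.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts REAL, source TEXT, description TEXT)")
conn.execute("CREATE INDEX idx_events_ts ON events (ts)")
sample = [
    (utc(2024, 3, 15, 9, 0, 0), "browser", "News site visited"),  # outside the window
    (utc(2024, 3, 15, 13, 52, 0), "messages", "Chat burst begins"),
    (utc(2024, 3, 15, 14, 10, 0), "browser", "Suspicious download"),
    (utc(2024, 3, 15, 14, 22, 0), "system", "Files encrypted"),
]
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(t.timestamp(), src, desc) for t, src, desc in sample],
)

hits = events_around(conn, utc(2024, 3, 15, 14, 22, 0))
for ts, source, description in hits:
    print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(), source, description)
```

Storing timestamps as a single normalized numeric column is one way to make the ±2 hour lookup a plain indexed range scan; a real implementation may choose a different representation.
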
## Semantic Event Clustering

**Traditional Timeline View (Overwhelming):**

```text
14:33:15 - USB device connected
14:33:18 - Chrome visited facebook.com
14:33:22 - File modified: report.docx
14:33:30 - File copied to USB: project_data.zip (2.3 GB)
14:33:35 - File copied to USB: budget.xlsx
14:33:40 - File deleted: project_data.zip
14:33:45 - WhatsApp message sent
...50 more individual events...
```

**Semeion Clustered View (Clear Pattern):**

```text
┌────────────────────────────────────────────────────┐
│ 14:33:15-14:33:40  ⚠️ File Transfer (4)            │  ← Click to expand
│ Pattern: Data Exfiltration Sequence                │
│ USB connected → 2 files copied → source deleted    │
│ Related: Files mentioned in chat 2 min earlier     │
└────────────────────────────────────────────────────┘
```

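
The collapsed view above can be read as the result of matching an ordered template against the raw events. The sketch below is a minimal illustration under stated assumptions: the event kinds (`usb_connected`, `file_copied`, `file_deleted`), the `Event` type, and the `match_template` helper are hypothetical, and the subsequence match that skips unrelated events is one possible approach rather than Semeion's implemented matcher.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: str    # HH:MM:SS, as in the listing above
    kind: str    # hypothetical event-kind label
    detail: str


# Hypothetical template: ordered (kind, minimum count) steps.
USB_EXFILTRATION = [("usb_connected", 1), ("file_copied", 1), ("file_deleted", 1)]


def match_template(events, template):
    """Return the events that match the template as an ordered subsequence.

    Unrelated events are skipped. Repeated events of the current step's kind
    keep joining that step until the next step's kind appears. Returns []
    when the template is not completed.
    """
    matched = []
    step, count = 0, 0
    for ev in events:
        kind, minimum = template[step]
        if ev.kind == kind:
            matched.append(ev)
            count += 1
        elif (
            count >= minimum
            and step + 1 < len(template)
            and ev.kind == template[step + 1][0]
        ):
            step, count = step + 1, 1
            matched.append(ev)
    complete = step == len(template) - 1 and count >= template[step][1]
    return matched if complete else []


events = [
    Event("14:33:15", "usb_connected", "USB device connected"),
    Event("14:33:18", "browser_visit", "Chrome visited facebook.com"),
    Event("14:33:22", "file_modified", "report.docx"),
    Event("14:33:30", "file_copied", "project_data.zip (2.3 GB)"),
    Event("14:33:35", "file_copied", "budget.xlsx"),
    Event("14:33:40", "file_deleted", "project_data.zip"),
    Event("14:33:45", "message_sent", "WhatsApp message sent"),
]

cluster = match_template(events, USB_EXFILTRATION)
if cluster:
    # prints "14:33:15-14:33:40 File Transfer (4)"
    print(f"{cluster[0].time}-{cluster[-1].time} File Transfer ({len(cluster)})")
```
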
### How Clustering Works

1. **Temporal Proximity**: Events within a configurable time window are evaluated together
2. **Semantic Similarity**: Vector embeddings detect conceptually related events
3. **Entity Linking**: Shared file names, IPs, and usernames connect events
4. **Pattern Templates**: Pre-defined sequences (USB exfiltration, login chains, etc.)
5. **LLM Summarization** (Optional): Natural-language cluster descriptions

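
Steps 1 and 3 above can be sketched without any ML dependency. The example below groups an event with an existing cluster when it falls within a time window of one of that cluster's events and shares an entity with it. The `Event` type, the 60-second default window, the entity labels, and the greedy single-pass strategy are assumptions for illustration; the embedding-based similarity of step 2 is deliberately left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Event:
    timestamp: datetime
    description: str
    entities: frozenset = field(default_factory=frozenset)


def cluster_events(events, window=timedelta(seconds=60)):
    """Greedy single pass over time-sorted events.

    An event joins the most recent cluster containing an event that is at most
    `window` older and shares at least one entity; otherwise it starts a new
    cluster.
    """
    clusters = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        for cluster in reversed(clusters):
            if any(
                ev.timestamp - other.timestamp <= window
                and ev.entities & other.entities
                for other in cluster
            ):
                cluster.append(ev)
                break
        else:
            clusters.append([ev])
    return clusters


def at(second):
    return datetime(2024, 3, 15, 14, 33, second)


# Entity labels such as "usb-drive" are hypothetical placeholders.
events = [
    Event(at(15), "USB device connected", frozenset({"usb-drive"})),
    Event(at(18), "Chrome visited facebook.com", frozenset({"facebook.com"})),
    Event(at(30), "File copied to USB", frozenset({"usb-drive", "project_data.zip"})),
    Event(at(35), "File copied to USB", frozenset({"usb-drive", "budget.xlsx"})),
    Event(at(40), "File deleted", frozenset({"project_data.zip"})),
]

clusters = cluster_events(events)
for cluster in clusters:
    print(len(cluster), [e.description for e in cluster])
```

Note how the deletion joins the transfer cluster only through the shared file name, which is the kind of link that a purely time-based grouping would not distinguish from the unrelated browser visit.
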
## Architecture

### Desktop Application (PySide6)

```text
┌─────────────────────────────────────────────────────────┐
│                   PySide6 Desktop UI                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │ Timeline     │  │ Semantic     │  │ Artifact     │   │
│  │ View         │  │ Search       │  │ Detail       │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│               Analysis & Clustering Engine              │
│  • Event clustering algorithm                           │
│  • Semantic similarity calculation                      │
│  • Cross-artifact correlation                           │
│  • Timeline rendering engine                            │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                       Data Layer                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │ Qdrant       │  │ SQLite       │  │ LLM API      │   │
│  │ (Vectors)    │  │ (Metadata)   │  │ (Optional)   │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
```

### Deployment Options

| Configuration | Qdrant | Embedding | LLM | Use Case |
|---------------|--------|-----------|-----|----------|
| **Fully Local** | localhost | Ollama/local model | Ollama (optional) | Single investigator, offline |
| **Air-gapped Network** | Internal server | Internal server | Internal server | Proper forensic environment |
| **Hybrid** | Local | Local | Cloud API (optional) | Investigative environment without a focus on confidentiality |

## Artifact Types

### Searchable & Timeline-Enabled

| Type | Examples | Features |
|------|----------|----------|
| **Messages** | WhatsApp, Telegram, Signal, SMS | Semantic search, multilingual, timeline clustering |
| **Browser** | Chrome, Firefox, Safari, Edge | History, downloads, searches; timeline correlation |
| **Email** | Common email formats (EML, MBOX, PST) | Searchable content, timeline placement |
| **Documents** | PDF, Word, plain text | Content search, access time correlation |

### Timeline-Only (Context View)

| Type | Examples |
|------|----------|
| **File Events** | File creation, modification, deletion, access |
| **Process Events** | Application launches, process creation |
| **Network Events** | Connections, DNS queries |
| **System Events** | Logs, authentication events, USB connections |

## Technical Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **UI Framework** | PySide6 (Qt6) | Native desktop application |
| **Timeline Rendering** | PyQtGraph or Plotly | High-performance visualization |
| **Vector Database** | Qdrant | Semantic search and similarity |
| **Metadata Storage** | SQLite | Fast local queries, metadata |
| **LLM Interface** | OpenAI-compatible API | Optional: cluster summarization, query interpretation |
| **Embedding** | Sentence Transformers / OpenAI | Multilingual vector generation |
| **Forensic Parsing** | pytsk3, pyewf | Disk image processing |
| **Language** | Python 3.13+ | Application logic |

## Data Model

### SemeionArtifact (Universal Schema, WiP)

Every artifact, regardless of source, conforms to a unified model:

```text
┌─────────────────────────────────────────────────────────┐
│                     SemeionArtifact                     │
├─────────────────────────────────────────────────────────┤
│ Identity:        id, case_id                            │
│ Classification:  artifact_class, source_platform        │
│ Temporal:        timestamp (UTC normalized)             │
│ Actors:          [{identifier, display_name, role}]     │
│ Content:         text, semantic_text                    │
│ Entities:        indexed_entities[] (files, IPs, etc)   │
│ Hierarchy:       parent_id, context_group               │
│ Location:        url, path, title                       │
│ Embeddings:      semantic_vector (768-dim)              │
│ Source-Specific: message{}, browser{}, email{}, etc     │
└─────────────────────────────────────────────────────────┘
```

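
As a rough illustration, the schema above could map onto a Python dataclass like the one below. The field names follow the diagram, but the concrete types, the defaults, the single `source_specific` dict, and the validation in `__post_init__` are assumptions, since the schema is still marked WiP.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


@dataclass
class Actor:
    identifier: str
    display_name: str
    role: str


@dataclass
class SemeionArtifact:
    # Identity and classification
    id: str
    case_id: str
    artifact_class: str
    source_platform: str
    # Temporal: normalized to UTC in __post_init__
    timestamp: datetime
    actors: list[Actor] = field(default_factory=list)
    # Content and entities
    text: str = ""
    semantic_text: str = ""
    indexed_entities: list[str] = field(default_factory=list)
    # Hierarchy and location
    parent_id: Optional[str] = None
    context_group: Optional[str] = None
    url: Optional[str] = None
    path: Optional[str] = None
    title: Optional[str] = None
    # Embedding (768 dimensions per the diagram) and per-source payload
    semantic_vector: Optional[list[float]] = None
    source_specific: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        if self.timestamp.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")
        self.timestamp = self.timestamp.astimezone(timezone.utc)
        if self.semantic_vector is not None and len(self.semantic_vector) != 768:
            raise ValueError("semantic_vector must have 768 dimensions")


# A message recorded at 17:33:45 local time (UTC+3) lands at 14:33:45 UTC.
msg = SemeionArtifact(
    id="a-001",
    case_id="case-42",
    artifact_class="message",
    source_platform="whatsapp",
    timestamp=datetime(2024, 3, 15, 17, 33, 45, tzinfo=timezone(timedelta(hours=3))),
    actors=[Actor("+10000000000", "Sender", "sender")],
    text="sample message text",
)
print(msg.timestamp.isoformat())  # 2024-03-15T14:33:45+00:00
```

Rejecting naive timestamps at construction time is one way to keep every source on the same UTC axis before events reach the timeline.
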
### Cluster Model (WiP)

```text
┌─────────────────────────────────────────────────────────┐
│                      EventCluster                       │
├─────────────────────────────────────────────────────────┤
│ id:             unique_cluster_id                       │
│ time_range:     (start_timestamp, end_timestamp)        │
│ events:         [artifact_ids]                          │
│ pattern_type:   "file_exfiltration" | "communication"   │
│ confidence:     0.0-1.0                                 │
│ summary:        "USB transfer with deletion"            │
│ icon:           "⚠️" | "💬" | "🌐" | etc                │
│ semantic_links: [related_cluster_ids]                   │
└─────────────────────────────────────────────────────────┘
```

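
A comparable sketch for the cluster model, again with assumed types: the field names come from the diagram, while the range checks and the `collapsed_label` helper (which mimics the collapsed timeline row shown earlier) are hypothetical additions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EventCluster:
    id: str
    time_range: tuple[datetime, datetime]
    events: list[str]                      # artifact ids
    pattern_type: str                      # e.g. "file_exfiltration"
    confidence: float                      # 0.0-1.0
    summary: str = ""
    icon: str = ""
    semantic_links: list[str] = field(default_factory=list)

    def __post_init__(self):
        start, end = self.time_range
        if start > end:
            raise ValueError("time_range start must not be after end")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")

    def collapsed_label(self):
        """One-line label for the collapsed timeline row."""
        start, end = self.time_range
        return (
            f"{start:%H:%M:%S}-{end:%H:%M:%S} "
            f"{self.icon} {self.summary} ({len(self.events)})"
        )


cluster = EventCluster(
    id="c-001",
    time_range=(
        datetime(2024, 3, 15, 14, 33, 15, tzinfo=timezone.utc),
        datetime(2024, 3, 15, 14, 33, 40, tzinfo=timezone.utc),
    ),
    events=["a-001", "a-004", "a-005", "a-006"],
    pattern_type="file_exfiltration",
    confidence=0.9,
    summary="USB transfer with deletion",
    icon="⚠️",
)
print(cluster.collapsed_label())
```
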
## What Semeion Is NOT (yet)

| Not This | Why |
|----------|-----|
| **Forensic suite replacement** | Companion tool: use alongside a primary suite such as Autopsy for acquisition |
| **Reporting tool** | Timeline export is intended for reports, but documentation happens in the primary suite |
| **AI evidence interpreter** | AI assists with search and clustering; the investigator interprets evidence |

## Development Setup

This project uses [uv](https://github.com/astral-sh/uv) for dependency management.

### Prerequisites

- Python 3.13+
- [uv](https://github.com/astral-sh/uv) installed
- `requirements.txt`

### Installation

```bash
git clone https://git.cc24.dev/mstoeck3/semeion
cd semeion

# Create virtual environment
uv venv --python 3.13

# Activate environment
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

# Install dependencies
uv pip install -r requirements.txt -e .

# Configure environment
cp .env.example .env
# Edit .env with your Qdrant and embedding endpoint configurations
```

### Running

```bash
semeion
```

## System Requirements

### Minimum (Remote Processing)

| Resource | Requirement |
|----------|-------------|
| CPU | 4 cores |
| RAM | 8 GB |
| Storage | Minimal (evidence stored elsewhere) |
| GPU | Not required |
| Network | Access to Qdrant and embedding endpoints (if remote) |

### Recommended (Local Processing)

| Resource | Requirement |
|----------|-------------|
| CPU | 8+ cores |
| RAM | 16 GB (32 GB for large cases) |
| Storage | SSD, sufficient for evidence & vectors |
| GPU | Optional (improves embedding speed with local models) |
| Network | Optional (fully offline capable) |

## Project Status

**Current Phase:** MVP Development - Timeline & Core Features

**Roadmap:**

1. ✅ Concept and architecture design
2. ⬜ Core infrastructure
   - Unified artifact ingestion (WhatsApp, Chrome)
   - SQLite + Qdrant integration
3. ⬜ Timeline visualization
   - High-performance rendering (target: <200ms for 100k events)
   - Multi-source swim lanes
   - Zoom/pan navigation
4. ⬜ Semantic clustering
   - Temporal proximity grouping
   - Pattern template matching
   - Semantic similarity detection
5. ⬜ Communication search
   - Multilingual embedding
   - Query interpretation
   - Search → timeline jump
6. ⬜ Additional parsers (Telegram, Signal, etc.)
7. ⬜ Export and reporting
8. ⬜ Performance optimization & polish

## Why Open Source Matters

**Transparency for Court:**

- Auditable algorithms: no "black box" analysis
- Reproducible results: scientific validation is possible
- Methods open to peer review and community scrutiny

**Accessibility:**

- Free for budget-constrained labs
- No vendor lock-in
- Community-driven development

**Innovation:**

- Rapid feature development
- Specialized extensions possible
- Academic research enabled

## Branding

**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark." The name reflects the software's mission of revealing meaningful patterns within digital evidence through intelligent timeline analysis.

The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden connections through analytical thought.

## License

BSD 3-Clause

## Contributing

Semeion is in active development. Contributions are welcome, especially from:

- Digital forensics practitioners (workflow validation)
- Timeline visualization experts
- Multilingual NLP specialists
- Performance optimization engineers