# semeion

![Semeion title image](resources/title_image.png)

## Overview

Semeion is a **timeline-first digital forensics analysis platform** that addresses two major pain points in modern investigations: navigating massive communication datasets and correlating events across data sources, including filesystem artifacts and OS events.

> **Core Question Semeion Answers:** "What happened around this time—and how are these events connected?"

Traditional forensic tools often have an unintuitive or slow user experience (an opinionated view, explained below). Semeion, in contrast, aims to provide a fast, intelligent timeline that automatically clusters semantically related events, so investigators can see patterns at a glance rather than clicking through endless individual entries.

In a typical workflow, an investigator opens Semeion alongside their primary forensic tool and either searches for relevant content (such as "*discussions about data transfer*" in Russian messages) or navigates to a known timestamp (for example, from incident logs). Semeion then shows all correlated activity (messages, file access, network connections, system events) organized into meaningful sequences rather than flat lists.

## Key Features

| Feature | Description |
|---------|-------------|
| **Timeline Visualization** | Render millions of events with <200ms response time; smooth zooming/panning |
| **Semantic Event Clustering** | Automatically group related events into patterns (file transfers, communication bursts, login sequences) |
| **Multilingual Semantic Search** | Find concepts across languages—search in English, find results in Russian/Chinese/Arabic with translation |
| **Two-Mode Operation** | Semantic content discovery → timeline analysis OR timestamp → content correlation |
| **Cross-Source Correlation** | Unified timeline across messages, files, network, browser, system events |
| **Event Grouping** | Collapse/expand semantically-related events; discover patterns |
| **Native Desktop Application** | PySide6-based—runs fully local, tailored for forensic-grade airgapped environments |
| **Open Source & Transparent** | Auditable methods, minimal "black box" AI, court-defensible results |

## The Problem Semeion Solves

### Problematic UX in Existing Forensic Suites

Existing forensic tools have timeline interfaces that:

- Render slowly
- Show flat event lists with no context
- Require manual correlation across sources
- Have poor visualization and navigation

*This assessment is highly opinionated. It reflects the maintainer's personal experience with multiple large commercial suites that weren't able to deliver a satisfying result. I won't name any here, and exceptions may exist, but improvements are needed.*

**Semeion's Solution:** Fast, interactive timeline with semantic clustering that shows patterns immediately.

### Large Communication Datasets

Modern investigations involve:

- Seized databases with millions of relevant artifacts
- Multilingual content
- Slang, code words, and poor machine translation
- Hours spent reading irrelevant conversations

**Semeion's Solution:** Semantic search that understands meaning, handles multiple languages, and jumps directly to timeline context.

## How It Works

### Two Entry Points

#### **Entry Point 1: Semantic content discovery → timeline analysis**

```text
1. Investigator searches: "discussions about file transfers"
2. Semeion finds semantically relevant messages
3. Click any result → Timeline centers on that timestamp
4. See all activity ±2 hours: file access, USB connections, deletions
5. Semantic clustering highlights: "File Exfiltration Sequence" pattern
```
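
Step 2 above (finding semantically relevant messages) can be illustrated with a minimal sketch that ranks messages by cosine similarity to a query vector. This is not Semeion's implementation: the function names and the toy 3-dimensional vectors are hypothetical, and a real deployment would obtain vectors from an embedding model and delegate the search to Qdrant.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_messages(query_vec, messages, top_k=3):
    """Return up to top_k (score, message_id) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), msg_id) for msg_id, vec in messages.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (hypothetical data).
messages = {
    "msg-1": [0.9, 0.1, 0.0],
    "msg-2": [0.0, 1.0, 0.0],
    "msg-3": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
results = rank_messages(query, messages, top_k=2)
for score, msg_id in results:
    print(f"{msg_id}: {score:.3f}")
```

Because multilingual embedding models aim to map text with the same meaning to nearby vectors regardless of language, the same ranking step can serve an English query against Russian messages.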

#### **Entry Point 2: timestamp → content correlation**

```text
1. Incident log shows: "Ransomware encrypted files at 2024-03-15 14:22:00"
2. Navigate timeline to that timestamp
3. Semantic clustering reveals:
   - Communication spike 30min before (cluster: "Coordination Activity")
   - Suspicious file downloads (cluster: "Malware Delivery")
   - Network connections (cluster: "C2 Communication")
   - Process executions leading to encryption
4. Complete attack chain visible at a glance
```
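
Navigating to a timestamp amounts to a time-window query over the event metadata. The sketch below shows one way this could look with SQLite. The `events` table, its columns, and the `events_around` helper are assumptions for illustration only, and the incident time is treated as UTC.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def events_around(conn, center, window=timedelta(hours=2)):
    """Return (ts, source, description) rows within center +/- window, oldest first."""
    start = (center - window).isoformat()
    end = (center + window).isoformat()
    cursor = conn.execute(
        "SELECT ts, source, description FROM events "
        "WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (start, end),
    )
    return cursor.fetchall()


# Hypothetical schema: timestamps stored as ISO 8601 UTC strings in one
# consistent format, so string order matches chronological order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT NOT NULL, source TEXT, description TEXT)")
conn.execute("CREATE INDEX idx_events_ts ON events (ts)")


def utc(hour, minute):
    """Build an ISO 8601 UTC timestamp on the example incident day."""
    return datetime(2024, 3, 15, hour, minute, tzinfo=timezone.utc).isoformat()


conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (utc(9, 0), "browser", "unrelated morning browsing"),
        (utc(13, 52), "messages", "communication spike"),
        (utc(14, 10), "browser", "suspicious file download"),
        (utc(14, 22), "files", "files encrypted"),
        (utc(18, 0), "system", "unrelated evening login"),
    ],
)

incident = datetime(2024, 3, 15, 14, 22, tzinfo=timezone.utc)
nearby = events_around(conn, incident)
for row in nearby:
    print(row)
```

The index on `ts` keeps the range scan cheap as the table grows, which matters for the interactive response times the timeline targets.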

## Semantic Event Clustering

**Traditional Timeline View (Overwhelming):**

```text
14:33:15 - USB device connected
14:33:18 - Chrome visited facebook.com
14:33:22 - File modified: report.docx
14:33:30 - File copied to USB: project_data.zip (2.3 GB)
14:33:35 - File copied to USB: budget.xlsx
14:33:40 - File deleted: project_data.zip
14:33:45 - WhatsApp message sent
...50 more individual events...
```

**Semeion Clustered View (Clear Pattern):**

```text
┌──────────────────────────────────────────────────┐
│ 14:33:15-14:33:40  ⚠️ File Transfer (4)          │ ← Click to expand
│ Pattern: Data Exfiltration Sequence              │
│ USB connected → 2 files copied → source deleted  │
│ Related: Files mentioned in chat 2min earlier    │
└──────────────────────────────────────────────────┘
```
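
A sequence such as "USB connected → files copied → source deleted" could be recognized by matching an ordered template against the event stream. The following is a hedged sketch, not the project's matcher: the event kind names, the `match_template` function, and the placeholder date are all hypothetical. It matches greedily, so it only returns the first event for each template step (three events here, not all four shown in the box).

```python
from datetime import datetime, timedelta

# Hypothetical event kinds; the template is an ordered sequence that may be
# interleaved with unrelated events.
USB_EXFILTRATION = ("usb_connected", "file_copied_to_usb", "file_deleted")


def match_template(events, template, max_span=timedelta(minutes=5)):
    """Return the events matching `template` in order, or [] if there is no match.

    `events` must be sorted by time. Matching is greedy: each template step
    takes the first later event of the required kind.
    """
    matched = []
    step = 0
    for ts, kind, detail in events:
        if step < len(template) and kind == template[step]:
            matched.append((ts, kind, detail))
            step += 1
    if step < len(template):
        return []  # some template step never occurred
    if matched[-1][0] - matched[0][0] > max_span:
        return []  # steps occurred, but too far apart to count as one sequence
    return matched


def t(hms):
    """Parse a time of day on an arbitrary placeholder date."""
    return datetime.strptime("2024-01-01 " + hms, "%Y-%m-%d %H:%M:%S")


events = [
    (t("14:33:15"), "usb_connected", "USB device"),
    (t("14:33:18"), "browser_visit", "facebook.com"),
    (t("14:33:22"), "file_modified", "report.docx"),
    (t("14:33:30"), "file_copied_to_usb", "project_data.zip"),
    (t("14:33:35"), "file_copied_to_usb", "budget.xlsx"),
    (t("14:33:40"), "file_deleted", "project_data.zip"),
    (t("14:33:45"), "message_sent", "WhatsApp"),
]

match = match_template(events, USB_EXFILTRATION)
for ts, kind, detail in match:
    print(ts.time(), kind, detail)
```

A production matcher would likely need backtracking (to recover when the first candidate exceeds `max_span`) and a check that the deleted file is one of the copied files.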

### How Clustering Works

1. **Temporal Proximity**: Events within a configurable time window are evaluated together
2. **Semantic Similarity**: Vector embeddings detect conceptually related events
3. **Entity Linking**: Shared file names, IPs, usernames connect events
4. **Pattern Templates**: Pre-defined sequences (USB exfiltration, login chains, etc.)
5. **LLM Summarization** (Optional): Natural language cluster descriptions
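
As a rough illustration of steps 1 and 3 only, the sketch below groups events that are close in time and share at least one entity. The `Event` class, the `cluster_events` function, the 30-second gap, and the entity labels (such as `usb:E`) are assumptions made for this example. Semantic similarity, pattern templates, and LLM summarization are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    ts: datetime
    description: str
    entities: frozenset = frozenset()


def cluster_events(events, max_gap=timedelta(seconds=30)):
    """Group events that are close in time AND share at least one entity.

    An event joins the most recently created cluster whose last event is
    within `max_gap` and whose entity set overlaps the event's entities.
    Otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"events": [...], "entities": set()}
    for ev in sorted(events, key=lambda e: e.ts):
        target = None
        for cluster in reversed(clusters):
            close_in_time = ev.ts - cluster["events"][-1].ts <= max_gap
            shares_entity = bool(ev.entities & cluster["entities"])
            if close_in_time and shares_entity:
                target = cluster
                break
        if target is None:
            target = {"events": [], "entities": set()}
            clusters.append(target)
        target["events"].append(ev)
        target["entities"] |= ev.entities
    return clusters


def t(second):
    """Timestamp at 14:33:<second> on an arbitrary placeholder date."""
    return datetime(2024, 1, 1, 14, 33, second)


events = [
    Event(t(15), "USB device connected", frozenset({"usb:E"})),
    Event(t(18), "Chrome visited facebook.com", frozenset({"facebook.com"})),
    Event(t(30), "Copied project_data.zip to USB", frozenset({"usb:E", "project_data.zip"})),
    Event(t(35), "Copied budget.xlsx to USB", frozenset({"usb:E", "budget.xlsx"})),
    Event(t(40), "Deleted project_data.zip", frozenset({"project_data.zip"})),
]

clusters = cluster_events(events)
for cluster in clusters:
    print(len(cluster["events"]), [ev.description for ev in cluster["events"]])
```

Note how the deletion joins the transfer cluster through the shared file name even though it does not involve the USB device. This is the kind of link that entity-based grouping is meant to surface.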

## Architecture

### Desktop Application (PySide6)

```text
┌─────────────────────────────────────────────────────────┐
│                   PySide6 Desktop UI                    │
│  ┌───────────────┐  ┌──────────────┐  ┌─────────────┐  │
│  │   Timeline    │  │   Semantic   │  │   Artifact  │  │
│  │     View      │  │    Search    │  │    Detail   │  │
│  └───────────────┘  └──────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│              Analysis & Clustering Engine               │
│  • Event clustering algorithm                           │
│  • Semantic similarity calculation                      │
│  • Cross-artifact correlation                           │
│  • Timeline rendering engine                            │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│                   Data Layer                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Qdrant     │  │   SQLite     │  │   LLM API    │  │
│  │   (Vectors)  │  │   (Metadata) │  │   (Optional) │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└─────────────────────────────────────────────────────────┘
```

### Deployment Options

| Configuration | Qdrant | Embedding | LLM | Use Case |
|---------------|--------|-----------|-----|----------|
| **Fully Local** | localhost | Ollama/local model | Ollama (optional) | Single investigator, offline |
| **Airgapped Network** | Internal server | Internal server | Internal server | Proper forensic environment |
| **Hybrid** | Local | Local | Cloud API (optional) | Investigations without strict confidentiality requirements |

## Artifact Types

### Searchable & Timeline-Enabled

| Type | Examples | Features |
|------|----------|----------|
| **Messages** | WhatsApp, Telegram, Signal, SMS | Semantic search, multilingual, timeline clustering |
| **Browser** | Chrome, Firefox, Safari, Edge | History, downloads, searches—timeline correlation |
| **Email** | Any email format (EML, MBOX, PST) | Searchable content, timeline placement |
| **Documents** | PDF, Word, plain text | Content search, access time correlation |

### Timeline-Only (Context View)

| Type | Examples |
|------|----------|
| **File Events** | File creation, modification, deletion, access |
| **Process Events** | Application launches, process creation |
| **Network Events** | Connections, DNS queries |
| **System Events** | Logs, authentication events, USB connections |

## Technical Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **UI Framework** | PySide6 (Qt6) | Native desktop application |
| **Timeline Rendering** | PyQtGraph or Plotly | High-performance visualization |
| **Vector Database** | Qdrant | Semantic search and similarity |
| **Metadata Storage** | SQLite | Fast local queries, metadata |
| **LLM Interface** | OpenAI-compatible API | Optional: cluster summarization, query interpretation |
| **Embedding** | Sentence Transformers / OpenAI | Multilingual vector generation |
| **Forensic Parsing** | pytsk3, pyewf | Disk image processing |
| **Language** | Python 3.13+ | Application logic |

## Data Model

### SemeionArtifact (Universal Schema, WiP)

Every artifact—regardless of source—conforms to a unified model:

```text
┌─────────────────────────────────────────────────────────┐
│  SemeionArtifact                                        │
├─────────────────────────────────────────────────────────┤
│  Identity:        id, case_id                           │
│  Classification:  artifact_class, source_platform       │
│  Temporal:        timestamp (UTC normalized)            │
│  Actors:          [{identifier, display_name, role}]    │
│  Content:         text, semantic_text                   │
│  Entities:        indexed_entities[] (files, IPs, etc)  │
│  Hierarchy:       parent_id, context_group              │
│  Location:        url, path, title                      │
│  Embeddings:      semantic_vector (768-dim)             │
│  Source-Specific: message{}, browser{}, email{}, etc    │
└─────────────────────────────────────────────────────────┘
```
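
Since the schema is still a work in progress, the following is only one possible rendering of it as Python dataclasses. Field names follow the diagram; the types, defaults, the single `source_specific` dictionary, and the UTC normalization in `__post_init__` are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Actor:
    identifier: str
    display_name: str = ""
    role: str = ""


@dataclass
class SemeionArtifact:
    # Identity and classification
    id: str
    case_id: str
    artifact_class: str
    source_platform: str
    # Temporal: always stored as timezone-aware UTC
    timestamp: datetime
    actors: list[Actor] = field(default_factory=list)
    # Content and entities
    text: str = ""
    semantic_text: str = ""
    indexed_entities: list[str] = field(default_factory=list)
    # Hierarchy and location
    parent_id: Optional[str] = None
    context_group: Optional[str] = None
    url: Optional[str] = None
    path: Optional[str] = None
    title: Optional[str] = None
    # Embedding (768 dimensions per the diagram) and source-specific payload
    semantic_vector: Optional[list[float]] = None
    source_specific: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject naive datetimes instead of guessing their timezone.
        if self.timestamp.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")
        self.timestamp = self.timestamp.astimezone(timezone.utc)


# Example: a message recorded with a +03:00 offset is normalized to UTC.
artifact = SemeionArtifact(
    id="a-1",
    case_id="case-42",
    artifact_class="message",
    source_platform="whatsapp",
    timestamp=datetime(2024, 3, 15, 17, 22, tzinfo=timezone(timedelta(hours=3))),
    actors=[Actor(identifier="+10000000000", role="sender")],
    text="example message",
    source_specific={"message": {"chat_id": "chat-1"}},
)
print(artifact.timestamp.isoformat())
```

Rejecting naive timestamps at construction time is a design choice of this sketch: it forces each parser to state the source timezone explicitly, which keeps cross-source correlation trustworthy.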

### Cluster Model (WiP)

```text
┌─────────────────────────────────────────────────────────┐
│  EventCluster                                           │
├─────────────────────────────────────────────────────────┤
│  id:              unique_cluster_id                     │
│  time_range:      (start_timestamp, end_timestamp)      │
│  events:          [artifact_ids]                        │
│  pattern_type:    "file_exfiltration" | "communication" │
│  confidence:      0.0-1.0                               │
│  summary:         "USB transfer with deletion"          │
│  icon:            "⚠️" | "💬" | "🌐" | etc             │
│  semantic_links:  [related_cluster_ids]                 │
└─────────────────────────────────────────────────────────┘
```
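
The cluster model could be sketched in the same way. Field names again follow the diagram; the validation rules and the `overlaps` helper are hypothetical additions that show how clusters might be compared on the timeline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EventCluster:
    id: str
    time_range: tuple[datetime, datetime]  # (start_timestamp, end_timestamp)
    events: list[str]                      # artifact ids
    pattern_type: str                      # e.g. "file_exfiltration", "communication"
    confidence: float                      # 0.0-1.0
    summary: str = ""
    icon: str = ""
    semantic_links: list[str] = field(default_factory=list)  # related cluster ids

    def __post_init__(self):
        start, end = self.time_range
        if start > end:
            raise ValueError("time_range start must not be after its end")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")

    def overlaps(self, other: "EventCluster") -> bool:
        """True if the two clusters' time ranges intersect (inclusive bounds)."""
        return (self.time_range[0] <= other.time_range[1]
                and other.time_range[0] <= self.time_range[1])


def at(minute, second):
    """UTC timestamp at 14:<minute>:<second> on an arbitrary placeholder date."""
    return datetime(2024, 1, 1, 14, minute, second, tzinfo=timezone.utc)


transfer = EventCluster(
    id="c-1",
    time_range=(at(33, 15), at(33, 40)),
    events=["a-1", "a-4", "a-5", "a-6"],
    pattern_type="file_exfiltration",
    confidence=0.8,
    summary="USB transfer with deletion",
)
chat = EventCluster(
    id="c-2",
    time_range=(at(31, 0), at(33, 20)),
    events=["a-9"],
    pattern_type="communication",
    confidence=0.6,
)
print(transfer.overlaps(chat))
```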

## What Semeion Is NOT (yet)

| Not This | Why |
|----------|-----|
| **Forensic suite replacement** | Companion tool—use alongside Autopsy for acquisition |
| **Reporting tool** | Timeline export for reports is on the roadmap; documentation happens in the primary suite |
| **AI evidence interpreter** | AI assists with search/clustering; investigator interprets evidence |

## Development Setup

This project uses [uv](https://github.com/astral-sh/uv) for dependency management.

### Prerequisites

- Python 3.13+
- [uv](https://github.com/astral-sh/uv) installed
- Dependencies listed in `requirements.txt` (installed during the steps below)

### Installation

```bash
git clone https://git.cc24.dev/mstoeck3/semeion
cd semeion

# Create virtual environment
uv venv --python 3.13

# Activate environment
source .venv/bin/activate      # Linux/macOS
# .venv\Scripts\activate       # Windows

# Install dependencies
uv pip install -r requirements.txt -e .

# Configure environment
cp .env.example .env
# Edit .env with your Qdrant and embedding endpoint configurations
```

### Running

```bash
semeion
```

## System Requirements

### Minimum (Remote Processing)

| Resource | Requirement |
|----------|-------------|
| CPU | 4 cores |
| RAM | 8 GB |
| Storage | Minimal (evidence stored elsewhere) |
| GPU | Not required |
| Network | Access to Qdrant and embedding endpoints (if remote) |

### Recommended (Local Processing)

| Resource | Requirement |
|----------|-------------|
| CPU | 8+ cores |
| RAM | 16 GB (32 GB for large cases) |
| Storage | SSD, sufficient for evidence & vectors |
| GPU | Optional (improves embedding speed with local models) |
| Network | Optional (fully offline capable) |

## Project Status

**Current Phase:** MVP Development - Timeline & Core Features

**Roadmap:**

1. ✅ Concept and architecture design
2. ⬜ Core infrastructure
   - Unified artifact ingestion (WhatsApp, Chrome)
   - SQLite + Qdrant integration
3. ⬜ Timeline visualization
   - High-performance rendering (target: <200ms for 100k events)
   - Multi-source swim lanes
   - Zoom/pan navigation
4. ⬜ Semantic clustering
   - Temporal proximity grouping
   - Pattern template matching
   - Semantic similarity detection
5. ⬜ Communication search
   - Multilingual embedding
   - Query interpretation
   - Search → timeline jump
6. ⬜ Additional parsers (Telegram, Signal, etc.)
7. ⬜ Export and reporting
8. ⬜ Performance optimization & polish

## Why Open Source Matters

**Transparency for Court:**

- Auditable algorithms—no "black box" analysis
- Reproducible results—scientific validation possible
- Peer-reviewed methods—community scrutiny

**Accessibility:**

- Free for budget-constrained labs
- No vendor lock-in
- Community-driven development

**Innovation:**

- Rapid feature development
- Specialized extensions possible
- Academic research enabled

## Branding

**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark"—embodying the software's mission of revealing meaningful patterns within digital evidence through intelligent timeline analysis.

The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden connections through analytical thought.

## License

BSD 3-Clause

## Contributing

Semeion is in active development. Contributions are welcome, especially from:

- Digital forensics practitioners (workflow validation)
- Timeline visualization experts
- Multilingual NLP specialists
- Performance optimization engineers