semeion/README.md
2025-12-17 22:08:13 +01:00

semeion

Overview

Semeion is a timeline-first digital forensics analysis platform that addresses the two biggest pain points in modern investigations: navigating massive communication datasets and correlating events across data sources, including filesystem artifacts and OS events.

Core Question Semeion Answers: "What happened around this time—and how are these events connected?"

Traditional forensic tools often have an unintuitive or slow user experience (this is the maintainer's opinion). Semeion, in contrast, aims to provide a fast, intelligent timeline that automatically clusters semantically related events, so investigators can see patterns at a glance rather than clicking through endless individual entries.

An investigator opens Semeion alongside their primary forensic tool and either searches for relevant content (such as "discussions about data transfer" in Russian messages) or navigates to a known timestamp (for example, from incident logs). Semeion then shows all correlated activity (messages, file access, network connections, system events) organized into meaningful sequences rather than flat lists.

Key Features

| Feature | Description |
| --- | --- |
| Timeline Visualization | Render millions of events with <200ms response time; smooth zooming/panning |
| Semantic Event Clustering | Automatically group related events into patterns (file transfers, communication bursts, login sequences) |
| Multilingual Semantic Search | Find concepts across languages: search in English, find results in Russian/Chinese/Arabic with translation |
| Two-Mode Operation | Semantic content discovery → timeline analysis OR timestamp → content correlation |
| Cross-Source Correlation | Unified timeline across messages, files, network, browser, system events |
| Event Grouping | Collapse/expand semantically related events; discover patterns |
| Native Desktop Application | PySide6-based; runs fully local, tailored for forensic-grade airgapped environments |
| Open Source & Transparent | Auditable methods, minimal "black box" AI, court-defensible results |

The Problem Semeion Solves

Problematic UX with existing forensic suites

Existing forensic tools have timeline interfaces that:

  • Render slowly
  • Show flat event lists with no context
  • Require manual correlation across sources
  • Have poor visualization and navigation

This assessment is highly opinionated and based on the maintainer's personal experience with multiple large commercial suites that weren't able to deliver a satisfying result. I won't name any here, and exceptions may exist, but improvements are needed.

Semeion's Solution: Fast, interactive timeline with semantic clustering that shows patterns immediately.

Large Communication Datasets

Modern investigations involve:

  • Seized databases with millions of relevant artifacts
  • Multilingual content
  • Slang, code words, and poor machine translation
  • Hours spent reading irrelevant conversations

Semeion's Solution: Semantic search that understands meaning, handles multiple languages, and jumps directly to timeline context.

How It Works

Two Entry Points

Entry Point 1: Semantic content discovery → timeline analysis

1. Investigator searches: "discussions about file transfers"
2. Semeion finds semantically relevant messages
3. Click any result → Timeline centers on that timestamp
4. See all activity ±2 hours: file access, USB connections, deletions
5. Semantic clustering highlights: "File Exfiltration Sequence" pattern
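
Step 4 boils down to a time-window query around the selected result. The following is a minimal sketch using Python's built-in sqlite3 module. It assumes a hypothetical `events` table with `ts` (UTC epoch seconds), `source`, and `description` columns; Semeion's actual schema is not specified in this README.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def events_around(conn, center, window=timedelta(hours=2)):
    """Return (ts, source, description) rows within center +/- window, oldest first."""
    lo = int((center - window).timestamp())
    hi = int((center + window).timestamp())
    return conn.execute(
        "SELECT ts, source, description FROM events "
        "WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (lo, hi),
    ).fetchall()

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts INTEGER, source TEXT, description TEXT)")
conn.execute("CREATE INDEX idx_events_ts ON events (ts)")  # keeps range scans fast

hit = datetime(2024, 3, 15, 14, 33, 45, tzinfo=timezone.utc)  # clicked search result
samples = [
    (hit - timedelta(hours=3), "browser", "Visited webmail"),
    (hit - timedelta(minutes=30), "system", "USB device connected"),
    (hit, "messages", "WhatsApp message sent"),
    (hit + timedelta(hours=1), "files", "File deleted: project_data.zip"),
]
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(int(t.timestamp()), s, d) for t, s, d in samples],
)

window_rows = events_around(conn, hit)  # the browser event 3 h earlier is excluded
```

Storing timestamps as integers with an index on `ts` is one simple way to keep such range queries cheap; other layouts are equally possible.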

Entry Point 2: timestamp → content correlation

1. Incident log shows: "Ransomware encrypted files at 2024-03-15 14:22:00"
2. Navigate timeline to that timestamp
3. Semantic clustering reveals:
   - Communication spike 30min before (cluster: "Coordination Activity")
   - Suspicious file downloads (cluster: "Malware Delivery")
   - Network connections (cluster: "C2 Communication")
   - Process executions leading to encryption
4. Complete attack chain visible at a glance
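
The "communication spike" in step 3 can be approximated by bucketing message timestamps and flagging buckets that clearly exceed the average rate. This is a rough sketch only: the bucket size, the threshold factor, and the function name are illustrative assumptions, not Semeion's detection logic.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def spike_buckets(timestamps, bucket=timedelta(minutes=10), factor=3.0):
    """Return start times of buckets whose count exceeds factor * mean bucket count."""
    size = int(bucket.total_seconds())
    counts = Counter(int(t.timestamp()) // size for t in timestamps)
    if not counts:
        return []
    n_buckets = max(counts) - min(counts) + 1  # empty buckets count toward the baseline
    mean = sum(counts.values()) / n_buckets
    return [
        datetime.fromtimestamp(b * size, tz=timezone.utc)
        for b in sorted(counts)
        if counts[b] > factor * mean
    ]

incident = datetime(2024, 3, 15, 14, 22, tzinfo=timezone.utc)
# Hypothetical data: one message per hour, then a burst starting 30 min before the incident.
background = [incident - timedelta(hours=h) for h in range(1, 9)]
burst = [incident - timedelta(minutes=30) + timedelta(seconds=20 * i) for i in range(12)]
spikes = spike_buckets(background + burst)
```

With this sample data, only the 10-minute bucket containing the burst is flagged.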

Semantic Event Clustering

Traditional Timeline View (Overwhelming):

14:33:15 - USB device connected
14:33:18 - Chrome visited facebook.com
14:33:22 - File modified: report.docx
14:33:30 - File copied to USB: project_data.zip (2.3 GB)
14:33:35 - File copied to USB: budget.xlsx
14:33:40 - File deleted: project_data.zip
14:33:45 - WhatsApp message sent
...50 more individual events...

Semeion Clustered View (Clear Pattern):

┌─────────────────────────────────────────────────┐
│ 14:33:15-14:33:40  ⚠️ File Transfer (4)         │ ← Click to expand
│ Pattern: Data Exfiltration Sequence             │
│ USB connected → 2 files copied → source deleted │
│ Related: Files mentioned in chat 2min earlier   │
└─────────────────────────────────────────────────┘
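
To make the collapse from the flat list to the clustered box concrete, here is a minimal sketch of template matching over the sample events above. The event kinds, the `Event` tuple, and the matching rule (USB connect, then one or more copies to USB, then deletion of a copied file) are illustrative assumptions rather than Semeion's actual template format.

```python
from typing import NamedTuple, Optional

class Event(NamedTuple):
    time: str                     # HH:MM:SS on the same day, so string order is time order
    kind: str                     # hypothetical event kinds
    target: Optional[str] = None

EVENTS = [
    Event("14:33:15", "usb_connected"),
    Event("14:33:18", "browser_visit", "facebook.com"),
    Event("14:33:22", "file_modified", "report.docx"),
    Event("14:33:30", "file_copied_to_usb", "project_data.zip"),
    Event("14:33:35", "file_copied_to_usb", "budget.xlsx"),
    Event("14:33:40", "file_deleted", "project_data.zip"),
    Event("14:33:45", "message_sent", "WhatsApp"),
]

def match_usb_exfiltration(events):
    """Return the events forming 'USB connected -> files copied -> source deleted', or []."""
    matched, copied = [], set()
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "usb_connected" and not matched:
            matched.append(ev)
        elif ev.kind == "file_copied_to_usb" and matched:
            matched.append(ev)
            copied.add(ev.target)
        elif ev.kind == "file_deleted" and ev.target in copied:
            matched.append(ev)
    # The template requires at least one copy and one deletion of a copied file.
    deleted = any(e.kind == "file_deleted" for e in matched)
    return matched if copied and deleted else []

cluster = match_usb_exfiltration(EVENTS)
print(f"{cluster[0].time}-{cluster[-1].time}  File Transfer ({len(cluster)})")
# prints: 14:33:15-14:33:40  File Transfer (4)
```

The unrelated browser visit, document edit, and chat message stay outside the cluster, which is why the box reports four events.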

How Clustering Works

  1. Temporal Proximity: Events within configurable time window evaluated together
  2. Semantic Similarity: Vector embeddings detect conceptually-related events
  3. Entity Linking: Shared file names, IPs, usernames connect events
  4. Pattern Templates: Pre-defined sequences (USB exfiltration, login chains, etc.)
  5. LLM Summarization (Optional): Natural language cluster descriptions
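
Steps 1 to 3 can be combined in a single pass: sort by time, then keep an event in the current cluster only if it is close in time and either shares an entity with the cluster or is semantically similar to the previous member. The sketch below is one possible reading of that idea. The hand-made 3-dimensional vectors stand in for real embeddings, and the thresholds and dictionary layout are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster_events(events, max_gap=60.0, min_sim=0.8):
    """Greedy single-pass clustering over events sorted by time.

    An event joins the current cluster when it is within max_gap seconds of the
    previous member AND (shares an entity with the cluster OR its vector is
    similar to the previous member's vector). Otherwise it starts a new cluster.
    """
    clusters = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        if clusters:
            current = clusters[-1]
            prev = current[-1]
            near = ev["ts"] - prev["ts"] <= max_gap                # temporal proximity
            known = set().union(*(m["entities"] for m in current))
            linked = bool(known & ev["entities"])                  # entity linking
            similar = cosine(prev["vec"], ev["vec"]) >= min_sim    # semantic similarity
            if near and (linked or similar):
                current.append(ev)
                continue
        clusters.append([ev])
    return clusters

# Toy data: ts in seconds; vectors are made-up stand-ins for embeddings.
events = [
    {"ts": 0,   "entities": {"usb0"},                     "vec": (1.0, 0.0, 0.0)},
    {"ts": 15,  "entities": {"usb0", "project_data.zip"}, "vec": (0.9, 0.1, 0.0)},
    {"ts": 25,  "entities": {"project_data.zip"},         "vec": (0.2, 0.9, 0.0)},
    {"ts": 30,  "entities": {"facebook.com"},             "vec": (0.0, 0.0, 1.0)},
    {"ts": 500, "entities": {"usb0"},                     "vec": (1.0, 0.0, 0.0)},
]
clusters = cluster_events(events)
```

A greedy pass like this closes a cluster as soon as an unrelated event appears, so interleaved activity would need a less naive strategy, for example comparing against several recent clusters instead of only the last one.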

Architecture

Desktop Application (PySide6)

┌─────────────────────────────────────────────────────────┐
│                   PySide6 Desktop UI                    │
│  ┌───────────────┐  ┌──────────────┐  ┌─────────────┐  │
│  │   Timeline    │  │   Semantic   │  │   Artifact  │  │
│  │     View      │  │    Search    │  │    Detail   │  │
│  └───────────────┘  └──────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│              Analysis & Clustering Engine               │
│  • Event clustering algorithm                           │
│  • Semantic similarity calculation                      │
│  • Cross-artifact correlation                           │
│  • Timeline rendering engine                            │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│                   Data Layer                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Qdrant     │  │   SQLite     │  │   LLM API    │  │
│  │   (Vectors)  │  │   (Metadata) │  │   (Optional) │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└─────────────────────────────────────────────────────────┘

Deployment Options

| Configuration | Qdrant | Embedding | LLM | Use Case |
| --- | --- | --- | --- | --- |
| Fully Local | localhost | Ollama/local model | Ollama (optional) | Single investigator, offline |
| Airgapped Network | Internal server | Internal server | Internal server | Proper forensic environment |
| Hybrid | Local | Local | Cloud API (optional) | Investigative environment without focus on confidentiality |
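
The three configurations differ mainly in where the endpoints live, so they can be described by a small settings object. The sketch below is an assumption-heavy illustration: the variable names `QDRANT_URL`, `EMBEDDING_URL`, and `LLM_URL` are hypothetical (the keys Semeion actually reads are defined in `.env.example`), and the defaults simply use Qdrant's default HTTP port 6333 and Ollama's default port 11434.

```python
from dataclasses import dataclass
from typing import Mapping, Optional
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}

@dataclass(frozen=True)
class EndpointConfig:
    qdrant_url: str
    embedding_url: str
    llm_url: Optional[str] = None  # the LLM is optional

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "EndpointConfig":
        # Hypothetical variable names; check .env.example for the real ones.
        return cls(
            qdrant_url=env.get("QDRANT_URL", "http://localhost:6333"),
            embedding_url=env.get("EMBEDDING_URL", "http://localhost:11434"),
            llm_url=env.get("LLM_URL") or None,
        )

    def is_fully_local(self) -> bool:
        """True when every configured endpoint points at this machine."""
        urls = [self.qdrant_url, self.embedding_url, self.llm_url]
        return all(urlparse(u).hostname in LOCAL_HOSTS for u in urls if u)

local = EndpointConfig.from_env({})
hybrid = EndpointConfig.from_env({"LLM_URL": "https://api.example.com/v1"})
```

A check like `is_fully_local()` could be used to warn the investigator before any case data leaves the workstation.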

Artifact Types

Searchable & Timeline-Enabled

| Type | Examples | Features |
| --- | --- | --- |
| Messages | WhatsApp, Telegram, Signal, SMS | Semantic search, multilingual, timeline clustering |
| Browser | Chrome, Firefox, Safari, Edge | History, downloads, searches; timeline correlation |
| Email | Any email format (EML, MBOX, PST) | Searchable content, timeline placement |
| Documents | PDF, Word, plain text | Content search, access time correlation |

Timeline-Only (Context View)

| Type | Examples |
| --- | --- |
| File Events | File creation, modification, deletion, access |
| Process Events | Application launches, process creation |
| Network Events | Connections, DNS queries |
| System Events | Logs, authentication events, USB connections |

Technical Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| UI Framework | PySide6 (Qt6) | Native desktop application |
| Timeline Rendering | PyQtGraph or Plotly | High-performance visualization |
| Vector Database | Qdrant | Semantic search and similarity |
| Metadata Storage | SQLite | Fast local queries, metadata |
| LLM Interface | OpenAI-compatible API | Optional: cluster summarization, query interpretation |
| Embedding | Sentence Transformers / OpenAI | Multilingual vector generation |
| Forensic Parsing | pytsk3, pyewf | Disk image processing |
| Language | Python 3.13+ | Application logic |

Data Model

SemeionArtifact (Universal Schema, WiP)

Every artifact—regardless of source—conforms to a unified model:

┌─────────────────────────────────────────────────────────┐
│  SemeionArtifact                                        │
├─────────────────────────────────────────────────────────┤
│  Identity:        id, case_id                           │
│  Classification:  artifact_class, source_platform       │
│  Temporal:        timestamp (UTC normalized)            │
│  Actors:          [{identifier, display_name, role}]    │
│  Content:         text, semantic_text                   │
│  Entities:        indexed_entities[] (files, IPs, etc)  │
│  Hierarchy:       parent_id, context_group              │
│  Location:        url, path, title                      │
│  Embeddings:      semantic_vector (768-dim)             │
│  Source-Specific: message{}, browser{}, email{}, etc    │
└─────────────────────────────────────────────────────────┘
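
As a rough illustration, the schema above could be expressed as a Python dataclass. The field names follow the diagram; the concrete types, the defaults, and the validation in `__post_init__` are assumptions, since the model is still work in progress.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Optional

@dataclass
class Actor:
    identifier: str
    display_name: str
    role: str

@dataclass
class SemeionArtifact:
    id: str
    case_id: str
    artifact_class: str
    source_platform: str
    timestamp: datetime
    actors: list[Actor] = field(default_factory=list)
    text: str = ""
    semantic_text: str = ""
    indexed_entities: list[str] = field(default_factory=list)
    parent_id: Optional[str] = None
    context_group: Optional[str] = None
    url: Optional[str] = None
    path: Optional[str] = None
    title: Optional[str] = None
    semantic_vector: Optional[list[float]] = None  # 768-dim when present
    source_specific: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        # Normalize to UTC; naive timestamps are rejected rather than guessed.
        if self.timestamp.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")
        self.timestamp = self.timestamp.astimezone(timezone.utc)
        if self.semantic_vector is not None and len(self.semantic_vector) != 768:
            raise ValueError("semantic_vector must have 768 dimensions")

# Hypothetical message recorded at 15:22 in UTC+1, stored as 14:22 UTC.
msg = SemeionArtifact(
    id="a1",
    case_id="case-001",
    artifact_class="message",
    source_platform="whatsapp",
    timestamp=datetime(2024, 3, 15, 15, 22, tzinfo=timezone(timedelta(hours=1))),
    actors=[Actor("user-1", "Example User", "sender")],
    text="hypothetical message text",
)
```

Rejecting naive timestamps at construction time is a deliberate choice in this sketch: a parser that cannot determine the source timezone has to make that decision explicitly instead of silently producing a shifted timeline.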

Cluster Model (WiP)

┌─────────────────────────────────────────────────────────┐
│  EventCluster                                           │
├─────────────────────────────────────────────────────────┤
│  id:              unique_cluster_id                     │
│  time_range:      (start_timestamp, end_timestamp)      │
│  events:          [artifact_ids]                        │
│  pattern_type:    "file_exfiltration" | "communication" │
│  confidence:      0.0-1.0                               │
│  summary:         "USB transfer with deletion"          │
│  icon:            "⚠️" | "💬" | "🌐" | etc             │
│  semantic_links:  [related_cluster_ids]                 │
└─────────────────────────────────────────────────────────┘
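
The cluster model can be sketched the same way. Field names follow the diagram above; the types, the validation, and the `overlaps` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EventCluster:
    id: str
    time_range: tuple[datetime, datetime]  # (start, end)
    events: list[str]                      # artifact ids
    pattern_type: str
    confidence: float
    summary: str = ""
    icon: str = ""
    semantic_links: list[str] = field(default_factory=list)

    def __post_init__(self):
        start, end = self.time_range
        if start > end:
            raise ValueError("time_range start must not be after end")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within 0.0-1.0")

    def overlaps(self, other: "EventCluster") -> bool:
        """True when the two clusters' time ranges intersect."""
        return (self.time_range[0] <= other.time_range[1]
                and other.time_range[0] <= self.time_range[1])

def utc(h, m, s):
    # The date is a placeholder; the sample timeline only gives times of day.
    return datetime(2024, 3, 15, h, m, s, tzinfo=timezone.utc)

transfer = EventCluster(
    id="c1",
    time_range=(utc(14, 33, 15), utc(14, 33, 40)),
    events=["a1", "a4", "a5", "a6"],
    pattern_type="file_exfiltration",
    confidence=0.9,
    summary="USB transfer with deletion",
    semantic_links=["c2"],
)
chat = EventCluster("c2", (utc(14, 31, 0), utc(14, 31, 30)), ["a0"], "communication", 0.7)
```

In this example the two clusters do not overlap in time; they are connected only through `semantic_links`, which mirrors the "Related: Files mentioned in chat 2min earlier" line in the clustered view.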

What Semeion Is NOT (yet)

| Not This | Why |
| --- | --- |
| Forensic suite replacement | Companion tool; use alongside Autopsy for acquisition |
| Reporting tool | Timeline export for reports, but documentation happens in primary suite |
| AI evidence interpreter | AI assists with search/clustering; investigator interprets evidence |

Development Setup

This project uses uv for dependency management.

Prerequisites

  • Python 3.13+
  • uv installed
  • requirements.txt

Installation

git clone https://git.cc24.dev/mstoeck3/semeion
cd semeion

# Create virtual environment
uv venv --python 3.13

# Activate environment
source .venv/bin/activate      # Linux/macOS
# .venv\Scripts\activate       # Windows

# Install dependencies
uv pip install -r requirements.txt -e .

# Configure environment
cp .env.example .env
# Edit .env with your Qdrant and embedding endpoint configurations

Running

semeion

System Requirements

Minimum (Remote Processing)

| Resource | Requirement |
| --- | --- |
| CPU | 4 cores |
| RAM | 8 GB |
| Storage | Minimal (evidence stored elsewhere) |
| GPU | Not required |
| Network | Access to Qdrant and embedding endpoints (if remote) |

Recommended (Local Processing)

| Resource | Requirement |
| --- | --- |
| CPU | 8+ cores |
| RAM | 16 GB (32 GB for large cases) |
| Storage | SSD, sufficient for evidence & vectors |
| GPU | Optional (improves embedding speed with local models) |
| Network | Optional (fully offline capable) |

Project Status

Current Phase: MVP Development - Timeline & Core Features

Roadmap:

  1. Concept and architecture design
  2. Core infrastructure
    • Unified artifact ingestion (WhatsApp, Chrome)
    • SQLite + Qdrant integration
  3. Timeline visualization
    • High-performance rendering (target: <200ms for 100k events)
    • Multi-source swim lanes
    • Zoom/pan navigation
  4. Semantic clustering
    • Temporal proximity grouping
    • Pattern template matching
    • Semantic similarity detection
  5. Communication search
    • Multilingual embedding
    • Query interpretation
    • Search → timeline jump
  6. Additional parsers (Telegram, Signal, etc.)
  7. Export and reporting
  8. Performance optimization & polish

Why Open Source Matters

Transparency for Court:

  • Auditable algorithms—no "black box" analysis
  • Reproducible results—scientific validation possible
  • Peer-reviewed methods—community scrutiny

Accessibility:

  • Free for budget-constrained labs
  • No vendor lock-in
  • Community-driven development

Innovation:

  • Rapid feature development
  • Specialized extensions possible
  • Academic research enabled

Branding

Semeion derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark"—embodying the software's mission of revealing meaningful patterns within digital evidence through intelligent timeline analysis.

The mascot, Koios (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden connections through analytical thought.

License

BSD 3-Clause

Contributing

Semeion is in active development. Contributions welcome, especially from:

  • Digital forensics practitioners (workflow validation)
  • Timeline visualization experts
  • Multilingual NLP specialists
  • Performance optimization engineers