semeion
Overview
Semeion is a semantic search companion tool for digital forensics investigators. It does not replace traditional forensic suites (Autopsy, Axiom, X-Ways) but augments them by enabling natural language queries across communication artifacts to quickly identify areas of interest.
Core Question Semeion Answers: "Where should I look?"
An investigator opens Semeion alongside their primary forensic tool, types a query like "discussions about cryptocurrency payments", reviews ranked results across chat messages, browser history, emails, and documents, then uses those findings to guide deeper analysis in their forensic suite.
Key Features
| Feature | Description |
|---|---|
| Natural Language Search | Query artifacts using plain English instead of keywords |
| LLM-Assisted Interpretation | Queries are parsed by an LLM into structured search parameters |
| Human-in-the-Loop Confirmation | Review and edit interpreted search parameters before execution |
| Hybrid Search | Combines meaning-based (semantic) vector search with keyword matching |
| Interactive Refinement | Mark results as relevant [+] or irrelevant [-], then refine via Qdrant's Recommend API |
| Temporal Context View | See what happened before and after a discovered artifact |
| Universal Artifact Model | Platform-agnostic schema that can accommodate forensic data from any external source and is designed to be extended |
| Flexible Deployment | Runs fully local or on airgapped forensic networks (or with cloud infrastructure if anyone would do such a thing) |
Artifact Types (for first PoC)
Searchable (Semantic Vector)
These artifacts are embedded and searchable via natural language:
| Type | Examples |
|---|---|
| Messages | WhatsApp, Telegram, Signal, or any other app that stores messages in a parseable SQLite database |
| Browser Events | Chrome, Firefox, Safari, Edge — history, downloads, bookmarks, searches |
| Emails | Any email files (simplified: sender, receiver, subject, body, timestamp) |
| Documents | PDF, Word, plain text |
Timeline-Only (Context View)
These artifacts appear in the temporal context view but are not semantically searchable:
| Type | Examples |
|---|---|
| File Events | File creation, modification, deletion, access (via Sleuthkit) |
| Process Events | Application launches, process creation |
| Network Events | Connections, DNS queries |
| Registry Events | Windows registry modifications |
| System Events | Logs, authentication events |
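The temporal context view can be thought of as a window query over a timestamp-sorted timeline. The sketch below assumes Unix epoch seconds and an in-memory list of `(timestamp, description)` tuples; the `context_window` function and the sample events are hypothetical, and a real implementation would more likely run a range query against the store.

```python
from bisect import bisect_left, bisect_right


def context_window(events, anchor_ts, before=300, after=300):
    """Return events with timestamps in [anchor_ts - before, anchor_ts + after].

    `events` must be a list of (timestamp, description) tuples sorted by
    timestamp, with timestamps given in Unix epoch seconds.
    """
    timestamps = [ts for ts, _ in events]
    lo = bisect_left(timestamps, anchor_ts - before)
    hi = bisect_right(timestamps, anchor_ts + after)
    return events[lo:hi]


# Illustrative timeline mixing a searchable message with timeline-only events.
timeline = [
    (1000, "file created: invoice.pdf"),
    (1200, "message: 'payment sent'"),
    (1350, "process started: wallet.exe"),
    (5000, "dns query: example.org"),
]

# Everything within five minutes either side of the message found by search.
nearby = context_window(timeline, anchor_ts=1200)
```

Here `nearby` holds the first three events; the DNS query at 5000 falls outside the window.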
How It Works
Architecture
Client-Server Design
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│                         ┌─────────────────────┐                         │
│                         │   PySide6 Client    │                         │
│                         │    (Investigator    │                         │
│                         │    Workstation)     │                         │
│                         └──────────┬──────────┘                         │
│                                    │                                    │
│                    ┌───────────────┴───────────────┐                    │
│                    │                               │                    │
│                    ▼                               ▼                    │
│         ┌─────────────────────┐         ┌──────────────────────┐        │
│         │     Qdrant API      │         │       LLM API        │        │
│         │                     │         │   (OpenAI-compatible)│        │
│         └─────────────────────┘         └──────────────────────┘        │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
Deployment Options
| Configuration | Qdrant | LLM | Use Case |
|---|---|---|---|
| Fully Local | localhost | Ollama (localhost) | Single investigator, offline |
| Airgapped Network | Internal server | Internal server | Forensic lab, sensitive cases |
| Hybrid | Local | Cloud API | Balance of privacy and capability |
| Full Cloud | Cloud | Cloud API | Team access, scalability |
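Since the client only needs two endpoints, the deployment options above reduce to a pair of base URLs per profile. The following is a sketch under assumptions: the `Endpoints` class, the profile names, and every hostname other than localhost are placeholders. Ports 6333 (Qdrant HTTP) and 11434 (Ollama, with its OpenAI-compatible API under `/v1`) are those tools' defaults.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Endpoints:
    qdrant_url: str    # base URL of the Qdrant HTTP API
    llm_base_url: str  # base URL of an OpenAI-compatible API


# Hypothetical profiles mirroring the table; non-local hostnames are placeholders.
PROFILES = {
    "fully_local": Endpoints("http://localhost:6333", "http://localhost:11434/v1"),
    "airgapped": Endpoints("http://qdrant.lab.example:6333", "http://llm.lab.example/v1"),
    "hybrid": Endpoints("http://localhost:6333", "https://llm.example.com/v1"),
    "full_cloud": Endpoints("https://qdrant.example.com:6333", "https://llm.example.com/v1"),
}

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_fully_local(profile_name: str) -> bool:
    """True if both endpoints of the profile point at this machine."""
    ep = PROFILES[profile_name]
    return all(
        urlparse(url).hostname in LOCAL_HOSTS
        for url in (ep.qdrant_url, ep.llm_base_url)
    )
```

A check like `is_fully_local` could back a visible warning in the client whenever evidence-derived text is about to leave the workstation.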
Use Case Examples
Example 1: Cryptocurrency Investigation
Query: "Find discussions about buying software with cryptocurrency"
Semeion finds:
- Chat messages mentioning crypto payments
- Browser visits to exchange websites
- Wallet-related searches
Context view reveals:
- File downloads after payment discussions
- Application installations
- Network connections to blockchain services
Example 2: Data Exfiltration
Query: "Messages about sending confidential documents"
Semeion finds:
- Emails discussing document sharing
- Chat messages about file transfers
Context view reveals:
- File access events before discussions
- Cloud storage uploads after discussions
- USB device connections
Example 3: Timeline Reconstruction
Query: "Threatening messages received in March"
Semeion finds:
- Messages matching threatening language patterns
Context view reveals:
- What the recipient searched afterward
- Files accessed or deleted
- Communication with others about the threat
Technical Stack
| Component | Technology | Purpose |
|---|---|---|
| GUI Framework | PySide6 | Desktop application |
| Vector Database | Qdrant | Semantic search and storage |
| LLM Interface | OpenAI-compatible API | Query interpretation |
| Embedding API | OpenAI-compatible API | Vector generation |
| Forensic Parsing | pytsk3, pyewf | Disk image processing |
| Language | Python 3.13 (version constrained by PySide6 support) | Application logic |
Data Model
SemeionArtifact (Simplified)
Every artifact — regardless of source platform — conforms to a universal schema:
┌─────────────────────────────────────────────────────────────────────────┐
│                             SemeionArtifact                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Identity:         id, case_id                                          │
│  Classification:   artifact_class, source_platform, searchable          │
│  Temporal:         timestamp                                            │
│  Actors:           [{identifier, display_name, role}]                   │
│  Content:          text, semantic_text                                  │
│  Entities:         indexed_entities[] (for filtering)                   │
│  Hierarchy:        parent_id, chunk_info (for documents)                │
│  Context:          context_group (conversation, thread, session)        │
│  Location:         url, path, title                                     │
│  Source-Specific:  message{}, browser{}, email{}, document{}, etc.      │
│  Ingestion:        ingested_at, source_file, parser_id                  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
Vector Strategy
| Vector | Purpose | Required |
|---|---|---|
| Semantic | Conceptual similarity search | Yes |
| Sparse (keywords) | Exact term matching (hybrid) | Optional |
Timeline-only artifacts use a placeholder vector and are excluded from search but included in context views.
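One plausible way to keep timeline-only artifacts out of search results is a payload filter on the `searchable` flag, reused when refining results. The sketch below builds plain request bodies in the shape of Qdrant's REST filter and recommend requests as I understand them; the helper names and the vector name `semantic` are assumptions, and the exact request fields should be checked against the Qdrant version in use.

```python
import json


def search_filter(case_id: str) -> dict:
    """Payload filter: restrict results to one case and to searchable artifacts."""
    return {
        "must": [
            {"key": "case_id", "match": {"value": case_id}},
            {"key": "searchable", "match": {"value": True}},
        ]
    }


def recommend_body(relevant_ids, irrelevant_ids, case_id, limit=20) -> dict:
    """Request body for refining results from [+] and [-] markings."""
    return {
        "positive": list(relevant_ids),    # points marked relevant [+]
        "negative": list(irrelevant_ids),  # points marked irrelevant [-]
        "using": "semantic",               # named vector to compare (assumed name)
        "filter": search_filter(case_id),
        "limit": limit,
        "with_payload": True,
    }


body = recommend_body(["a1", "b2"], ["c3"], case_id="case-001")
print(json.dumps(body, indent=2))
```

Because the context view has to include timeline-only artifacts, it would use a filter without the `searchable` condition.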
What Semeion Is NOT (yet)
| Not This | Why |
|---|---|
| Forensic suite replacement | Companion tool — use alongside Autopsy/Axiom |
| Reporting tool | Review and analyse findings in Semeion; document them in your primary forensic application |
| Forensic interpretation robot | Helps you discover what you otherwise wouldn't; interpreting the findings remains the investigator's job |
Development Setup
This project uses uv for dependency management.
Prerequisites
- Python 3.13+
- uv installed
Installation
git clone <repository-url>
cd semeion
# virtual environment
uv venv --python 3.13
# activate environment
source .venv/bin/activate # Linux/macOS
# dependencies
uv pip install -r requirements.txt -e .
Running
The editable install above exposes the startup script as the semeion command. With the virtual environment activated, run:
semeion
System Requirements
Minimum (Remote Processing)
| Resource | Requirement |
|---|---|
| CPU | Multi-core |
| RAM | 4 GB |
| Storage | Minimal (evidence stored elsewhere) |
| Network | Access to Qdrant and LLM endpoints |
Recommended (Local Processing)
| Resource | Requirement |
|---|---|
| CPU | 8+ cores |
| RAM | 32 GB |
| Storage | Sufficient for evidence and vectors, plus LLM model files if the LLM runs locally |
| GPU | Optional (improves embedding speed) |
Project Status
Current Phase: Architecture and data model definition
Roadmap:
- ✅ Concept and schema design
- ⬜ Core infrastructure (Qdrant collection, basic ingestion)
- ⬜ Search execution (semantic search, filtering)
- ⬜ LLM integration (query interpretation)
- ⬜ Refinement system (Qdrant Recommend)
- ⬜ Context view
- ⬜ Platform parsers (WhatsApp, Chrome, etc.)
- ⬜ Hybrid search (sparse vectors)
Branding
Semeion derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark" — embodying the software's mission of interpreting semantic signals within digital evidence.
The mascot, Koios (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden knowledge through analytical thought.
License
BSD 3-Clause

