Semeion is a semantic search companion tool for digital forensics investigators. It does not replace traditional forensic suites (Autopsy, Axiom, X-Ways) but augments them by enabling natural language queries across communication artifacts to quickly identify areas of interest.

Core Question Semeion Answers: "Where should I look?"

An investigator opens Semeion alongside their primary forensic tool, types a query like "discussions about cryptocurrency payments", reviews ranked results across chat messages, browser history, emails, and documents, then uses those findings to guide deeper analysis in their forensic suite.

(Intro Video)

Key Features

Feature	Description
Natural Language Search	Query artifacts using plain English instead of keywords
LLM-Assisted Interpretation	Queries are parsed by an LLM into structured search parameters
Human-in-the-Loop Confirmation	Review and edit interpreted search parameters before execution
Semantic + (Hybrid Search)	Combines meaning-based vector search with keyword matching
Interactive Refinement	Mark results as relevant [+] or irrelevant [-], then refine via Qdrant's Recommend API
Temporal Context View	See what happened before and after a discovered artifact
Universal Artifact Model	Platform-agnostic — can adapt to any forensic data from external sources, expandable concept
Flexible Deployment	Runs fully local or on airgapped forensic networks (or with cloud infrastructure if anyone would do such a thing)

Artifact Types (for first PoC)

Searchable (Semantic Vector)

These artifacts are embedded and searchable via natural language:

Type	Examples
Messages	WhatsApp, Telegram, Signal, or anything which has a parseable SQLite database
Browser Events	Chrome, Firefox, Safari, Edge — history, downloads, bookmarks, searches
Email	Any email files (simplified: sender, receiver, subject, body, timestamp)
Documents	PDF, Word, plain text

Timeline-Only (Context View)

These artifacts appear in the temporal context view but are not semantically searchable:

Type	Examples
File Events	File creation, modification, deletion, access (via Sleuthkit)
Process Events	Application launches, process creation
Network Events	Connections, DNS queries
Registry Events	Windows registry modifications
System Events	Logs, authentication events

How It Works

Architecture

Client-Server Design

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│                      ┌─────────────────────┐                            │
│                      │   PySide6 Client    │                            │
│                      │   (Investigator     │                            │
│                      │    Workstation)     │                            │
│                      └──────────┬──────────┘                            │
│                                 │                                       │
│                 ┌───────────────┴───────────────┐                       │
│                 │                               │                       │
│                 ▼                               ▼                       │
│      ┌─────────────────────┐       ┌──────────────────────┐             │
│      │    Qdrant API       │       │   LLM API            │             │
│      │                     │       │   (OpenAI-compatible)│             │
│      └─────────────────────┘       └──────────────────────┘             │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Deployment Options

Configuration	Qdrant	LLM	Use Case
Fully Local	localhost	Ollama (localhost)	Single investigator, offline
Airgapped Network	Internal server	Internal server	Forensic lab, sensitive cases
Hybrid	Local	Cloud API	Balance of privacy and capability
Full Cloud	Cloud	Cloud API	Team access, scalability

Use Case Examples

Example 1: Cryptocurrency Investigation

Query: "Find discussions about buying software with cryptocurrency"

Semeion finds:

Chat messages mentioning crypto payments
Browser visits to exchange websites
Wallet-related searches

Context view reveals:

File downloads after payment discussions
Application installations
Network connections to blockchain services

Example 2: Data Exfiltration

Query: "Messages about sending confidential documents"

Semeion finds:

Emails discussing document sharing
Chat messages about file transfers

Context view reveals:

File access events before discussions
Cloud storage uploads after discussions
USB device connections

Example 3: Timeline Reconstruction

Query: "Threatening messages received in March"

Semeion finds:

Messages matching threatening language patterns

Context view reveals:

What the recipient searched afterward
Files accessed or deleted
Communication with others about the threat

Technical Stack

Component	Technology	Purpose
GUI Framework	PySide6	Desktop application
Vector Database	Qdrant	Semantic search and storage
LLM Interface	OpenAI-compatible API	Query interpretation
Embedding API	OpenAI-compatible API	Vector generation
Forensic Parsing	pytsk3, pyewf	Disk image processing
Language	Python 3.13 (PySide6-restricted)	Application logic

Data Model

SemeionArtifact (Simplified)

Every artifact — regardless of source platform — conforms to a universal schema:

┌─────────────────────────────────────────────────────────────────────────┐
│  SemeionArtifact                                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Identity:        id, case_id                                           │
│  Classification:  artifact_class, source_platform, searchable           │
│  Temporal:        timestamp                                             │
│  Actors:          [{identifier, display_name, role}]                    │
│  Content:         text, semantic_text                                   │
│  Entities:        indexed_entities[] (for filtering)                    │
│  Hierarchy:       parent_id, chunk_info (for documents)                 │
│  Context:         context_group (conversation, thread, session)         │
│  Location:        url, path, title                                      │
│  Source-Specific: message{}, browser{}, email{}, document{}, etc.       │
│  Ingestion:       ingested_at, source_file, parser_id                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Vector Strategy

Vector	Purpose	Required
Semantic	Conceptual similarity search	Yes
Sparse (keywords)	Exact term matching (hybrid)	Optional

Timeline-only artifacts use a placeholder vector and are excluded from search but included in context views.

What Semeion Is NOT (yet)

Not This	Why
Forensic suite replacement	Companion tool — use alongside Autopsy/Axiom
Reporting Tool	Review and analyse findings, documents in primary application
Forensic Interpretation Robot	Helps you to discover what you otherwise wouldnt

Development Setup

This project uses uv for dependency management.

Prerequisites

Python 3.13+
uv installed

Installation

git clone <repository-url>
cd semeion

# Create virtual environment
uv venv --python 3.13

# Activate environment
source .venv/bin/activate      # Linux/macOS
# .venv\Scripts\activate       # Windows

# Install dependencies
uv pip install -r requirements.txt -e .

# Configure environment
cp .env.example .env
# Edit .env with your Qdrant and LLM endpoint configurations

Running

uv handles the startup script:

semeion

System Requirements

Minimum (Remote Processing)

Resource	Requirement
CPU	Multi-core
RAM	4 GB
Storage	Minimal (evidence stored elsewhere)
Network	Access to Qdrant and LLM endpoints

Recommended (Local Processing)

Resource	Requirement
CPU	8+ cores
RAM	32 GB
Storage	sufficient for evidence & vectors, LLM if installed locally
GPU	optional (improves embedding speed)

Project Status

Current Phase: Architecture and data model definition

Roadmap:

✅ Concept and schema design
⬜ Core infrastructure (Qdrant collection, basic ingestion)
⬜ Search execution (semantic search, filtering)
⬜ LLM integration (query interpretation)
⬜ Refinement system (Qdrant Recommend)
⬜ Context view
⬜ Platform parsers (WhatsApp, Chrome, etc.)
⬜ Hybrid search (sparse vectors)

Branding

Semeion derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark" — embodying the software's mission of interpreting semantic signals within digital evidence.

The mascot, Koios (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden knowledge through analytical thought.

License

BSD 3-Clause