semeion

Overview

Semeion is a semantic search companion tool for digital forensics investigators. It does not replace traditional forensic suites (Autopsy, Axiom, X-Ways) but augments them by enabling natural language queries across communication artifacts to quickly identify areas of interest.

Core Question Semeion Answers: "Where should I look?"

An investigator opens Semeion alongside their primary forensic tool, types a query like "discussions about cryptocurrency payments", reviews ranked results across chat messages, browser history, emails, and documents, then uses those findings to guide deeper analysis in their forensic suite.

(Intro Video)

Key Features

| Feature | Description |
| --- | --- |
| Natural Language Search | Query artifacts using plain English instead of keywords |
| LLM-Assisted Interpretation | An LLM parses each query into structured search parameters |
| Human-in-the-Loop Confirmation | Review and edit the interpreted search parameters before execution |
| Semantic + Hybrid Search | Combines meaning-based vector search with keyword matching |
| Interactive Refinement | Mark results as relevant [+] or irrelevant [-], then refine via Qdrant's Recommend API |
| Temporal Context View | See what happened before and after a discovered artifact |
| Universal Artifact Model | Platform-agnostic and extensible: adapts to forensic data from any external source |
| Flexible Deployment | Runs fully local or on airgapped forensic networks (or with cloud infrastructure, if anyone would do such a thing) |

Artifact Types (for first PoC)

Searchable (Semantic Vector)

These artifacts are embedded and searchable via natural language:

| Type | Examples |
| --- | --- |
| Messages | WhatsApp, Telegram, Signal, or any other app with a parseable SQLite database |
| Browser Events | Chrome, Firefox, Safari, Edge: history, downloads, bookmarks, searches |
| Email | Any email files (simplified to sender, receiver, subject, body, timestamp) |
| Documents | PDF, Word, plain text |

Timeline-Only (Context View)

These artifacts appear in the temporal context view but are not semantically searchable:

| Type | Examples |
| --- | --- |
| File Events | File creation, modification, deletion, access (via Sleuthkit) |
| Process Events | Application launches, process creation |
| Network Events | Connections, DNS queries |
| Registry Events | Windows registry modifications |
| System Events | Logs, authentication events |
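
The temporal context view can be pictured as a time-window query around a search hit that returns both searchable and timeline-only artifacts. The following is a minimal sketch under that assumption; the `context_window` function, the 30-minute default, and the dictionary keys are illustrative and not taken from the project.

```python
from datetime import datetime, timedelta


def context_window(artifacts, hit_time, minutes=30):
    """Return artifacts within +/- `minutes` of a hit, oldest first."""
    delta = timedelta(minutes=minutes)
    nearby = [a for a in artifacts if abs(a["timestamp"] - hit_time) <= delta]
    return sorted(nearby, key=lambda a: a["timestamp"])


# Made-up artifacts: one searchable message and two timeline-only events.
hit = datetime(2025, 3, 14, 12, 0)
events = [
    {"artifact_class": "file_event", "timestamp": datetime(2025, 3, 14, 12, 10), "searchable": False},
    {"artifact_class": "message", "timestamp": datetime(2025, 3, 14, 11, 45), "searchable": True},
    {"artifact_class": "network_event", "timestamp": datetime(2025, 3, 15, 9, 0), "searchable": False},
]

window = context_window(events, hit)
print([e["artifact_class"] for e in window])
```

The window ignores the `searchable` flag on purpose: timeline-only events are exactly what the context view is meant to surface around a hit.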

How It Works

resources/workflow.png

Architecture

Client-Server Design

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│                      ┌─────────────────────┐                            │
│                      │   PySide6 Client    │                            │
│                      │   (Investigator     │                            │
│                      │    Workstation)     │                            │
│                      └──────────┬──────────┘                            │
│                                 │                                       │
│                 ┌───────────────┴───────────────┐                       │
│                 │                               │                       │
│                 ▼                               ▼                       │
│      ┌─────────────────────┐       ┌──────────────────────┐             │
│      │    Qdrant API       │       │   LLM API            │             │
│      │                     │       │   (OpenAI-compatible)│             │
│      └─────────────────────┘       └──────────────────────┘             │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Deployment Options

| Configuration | Qdrant | LLM | Use Case |
| --- | --- | --- | --- |
| Fully Local | localhost | Ollama (localhost) | Single investigator, offline |
| Airgapped Network | Internal server | Internal server | Forensic lab, sensitive cases |
| Hybrid | Local | Cloud API | Balance of privacy and capability |
| Full Cloud | Cloud | Cloud API | Team access, scalability |
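
Since every deployment option differs only in where the two endpoints live, switching between them could come down to two settings. The sketch below assumes that; the variable names `SEMEION_QDRANT_URL` and `SEMEION_LLM_BASE_URL` and the lab hostnames are hypothetical. The defaults use the standard local ports for Qdrant (6333) and Ollama's OpenAI-compatible API (11434, under `/v1`).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoints:
    qdrant_url: str
    llm_base_url: str


def load_endpoints(env=os.environ) -> Endpoints:
    """Read endpoints from the environment, defaulting to a fully local setup."""
    return Endpoints(
        # Variable names are hypothetical, chosen for this example only.
        qdrant_url=env.get("SEMEION_QDRANT_URL", "http://localhost:6333"),
        llm_base_url=env.get("SEMEION_LLM_BASE_URL", "http://localhost:11434/v1"),
    )


# Fully local: nothing configured, so the defaults apply.
local = load_endpoints({})

# Airgapped network: both services on (made-up) internal hosts.
lab = load_endpoints({
    "SEMEION_QDRANT_URL": "http://qdrant.lab.internal:6333",
    "SEMEION_LLM_BASE_URL": "http://llm.lab.internal:8000/v1",
})
print(local, lab, sep="\n")
```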

Use Case Examples

Example 1: Cryptocurrency Investigation

Query: "Find discussions about buying software with cryptocurrency"

Semeion finds:

  • Chat messages mentioning crypto payments
  • Browser visits to exchange websites
  • Wallet-related searches

Context view reveals:

  • File downloads after payment discussions
  • Application installations
  • Network connections to blockchain services

Example 2: Data Exfiltration

Query: "Messages about sending confidential documents"

Semeion finds:

  • Emails discussing document sharing
  • Chat messages about file transfers

Context view reveals:

  • File access events before discussions
  • Cloud storage uploads after discussions
  • USB device connections

Example 3: Timeline Reconstruction

Query: "Threatening messages received in March"

Semeion finds:

  • Messages matching threatening language patterns

Context view reveals:

  • What the recipient searched afterward
  • Files accessed or deleted
  • Communication with others about the threat

Technical Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| GUI Framework | PySide6 | Desktop application |
| Vector Database | Qdrant | Semantic search and storage |
| LLM Interface | OpenAI-compatible API | Query interpretation |
| Embedding API | OpenAI-compatible API | Vector generation |
| Forensic Parsing | pytsk3, pyewf | Disk image processing |
| Language | Python 3.13 (version constrained by PySide6) | Application logic |
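
To make the Qdrant side of interactive refinement more concrete, the sketch below builds the request for Qdrant's recommend points REST endpoint from the results an investigator marked [+] and [-]. It only constructs the path and JSON body and sends nothing. The collection name, the point IDs, the helper name, and filtering on a `case_id` payload field are assumptions for illustration. Newer Qdrant releases also offer recommendation through the universal query endpoint, which the project may prefer.

```python
import json


def build_recommend_request(collection, case_id, relevant_ids, irrelevant_ids, limit=20):
    """Build path and JSON body for Qdrant's recommend points endpoint."""
    path = f"/collections/{collection}/points/recommend"
    body = {
        "positive": list(relevant_ids),    # results marked [+]
        "negative": list(irrelevant_ids),  # results marked [-]
        # Restrict recommendations to the current case (assumed payload field).
        "filter": {"must": [{"key": "case_id", "match": {"value": case_id}}]},
        "limit": limit,
        "with_payload": True,
    }
    return path, body


# Hypothetical collection name and point IDs.
path, body = build_recommend_request(
    "semeion_artifacts", "case-001", relevant_ids=[101, 205], irrelevant_ids=[330]
)
print("POST", path)
print(json.dumps(body, indent=2))
```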

Data Model

SemeionArtifact (Simplified)

Every artifact — regardless of source platform — conforms to a universal schema:

┌─────────────────────────────────────────────────────────────────────────┐
│  SemeionArtifact                                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Identity:        id, case_id                                           │
│  Classification:  artifact_class, source_platform, searchable           │
│  Temporal:        timestamp                                             │
│  Actors:          [{identifier, display_name, role}]                    │
│  Content:         text, semantic_text                                   │
│  Entities:        indexed_entities[] (for filtering)                    │
│  Hierarchy:       parent_id, chunk_info (for documents)                 │
│  Context:         context_group (conversation, thread, session)         │
│  Location:        url, path, title                                      │
│  Source-Specific: message{}, browser{}, email{}, document{}, etc.       │
│  Ingestion:       ingested_at, source_file, parser_id                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Vector Strategy

| Vector | Purpose | Required |
| --- | --- | --- |
| Semantic | Conceptual similarity search | Yes |
| Sparse (keywords) | Exact term matching (hybrid) | Optional |

Timeline-only artifacts use a placeholder vector and are excluded from search but included in context views.
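
The text above does not say how semantic and sparse results are merged in hybrid mode. Reciprocal Rank Fusion (RRF) is one common technique for combining two ranked lists, and the sketch below shows it purely as an assumption about how such a merge could work. The function name, the artifact IDs, and the constant `k=60` (a conventional default) are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores = {}
    for ranking in rankings:
        for rank, artifact_id in enumerate(ranking, start=1):
            scores[artifact_id] = scores.get(artifact_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Made-up result lists, best match first.
semantic_hits = ["a", "b", "c"]  # from the semantic (dense) vector
keyword_hits = ["c", "a", "d"]   # from the sparse (keyword) vector

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Artifacts that appear in both lists rise to the top, while those found by only one method are kept but ranked lower.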

What Semeion Is NOT (yet)

| Not This | Why |
| --- | --- |
| Forensic suite replacement | Companion tool: use alongside Autopsy/Axiom |
| Reporting tool | Review, analyse, and document findings in your primary application |
| Forensic interpretation robot | Helps you discover what you otherwise wouldn't |

Development Setup

This project uses uv for dependency management.

Prerequisites

  • Python 3.13+
  • uv installed

Installation

git clone <repository-url>
cd semeion

# Create virtual environment
uv venv --python 3.13

# Activate environment
source .venv/bin/activate      # Linux/macOS
# .venv\Scripts\activate       # Windows

# Install dependencies
uv pip install -r requirements.txt -e .

# Configure environment
cp .env.example .env
# Edit .env with your Qdrant and LLM endpoint configurations

Running

uv handles the startup script:

semeion

System Requirements

Minimum (Remote Processing)

| Resource | Requirement |
| --- | --- |
| CPU | Multi-core |
| RAM | 4 GB |
| Storage | Minimal (evidence stored elsewhere) |
| Network | Access to Qdrant and LLM endpoints |

Recommended (Local Processing)

| Resource | Requirement |
| --- | --- |
| CPU | 8+ cores |
| RAM | 32 GB |
| Storage | Sufficient for evidence and vectors, plus the LLM if installed locally |
| GPU | Optional (improves embedding speed) |

Project Status

Current Phase: Architecture and data model definition

Roadmap:

  1. Concept and schema design
  2. Core infrastructure (Qdrant collection, basic ingestion)
  3. Search execution (semantic search, filtering)
  4. LLM integration (query interpretation)
  5. Refinement system (Qdrant Recommend)
  6. Context view
  7. Platform parsers (WhatsApp, Chrome, etc.)
  8. Hybrid search (sparse vectors)

Branding

Semeion derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark" — embodying the software's mission of interpreting semantic signals within digital evidence.

The mascot, Koios (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden knowledge through analytical thought.

License

BSD 3-Clause
