# semeion

## Overview

Semeion is a **semantic search companion tool** for digital forensics investigators. It does not replace traditional forensic suites (Autopsy, Axiom, X-Ways) but augments them by enabling natural language queries across communication artifacts to quickly identify areas of interest.

> **Core Question Semeion Answers:** "Where should I look?"

An investigator opens Semeion alongside their primary forensic tool, types a query like *"discussions about cryptocurrency payments"*, reviews ranked results across chat messages, browser history, emails, and documents, then uses those findings to guide deeper analysis in their forensic suite.

## Key Features

| Feature | Description |
|---------|-------------|
| **Natural Language Search** | Query artifacts using plain English instead of keywords |
| **LLM-Assisted Interpretation** | Queries are parsed by an LLM into structured search parameters |
| **Human-in-the-Loop Confirmation** | Review and edit interpreted search parameters before execution |
| **Semantic (+ Hybrid) Search** | Combines meaning-based vector search with keyword matching |
| **Interactive Refinement** | Mark results as relevant [+] or irrelevant [-], then refine via Qdrant's Recommend API |
| **Temporal Context View** | See what happened before and after a discovered artifact |
| **Universal Artifact Model** | Platform-agnostic: the schema adapts to forensic data from any external source and is meant to be extended |
| **Flexible Deployment** | Runs fully local or on airgapped forensic networks (or with cloud infrastructure, if anyone would do such a thing) |

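The interactive refinement row can be illustrated with a minimal sketch of how marked results might be turned into a request body for Qdrant's recommend endpoint. The `positive`, `negative`, `limit`, and `with_payload` fields are part of Qdrant's recommend request; `ReviewState` and `build_recommend_request` are hypothetical names, and the rule that at least one [+] mark is required is an assumption of this sketch, not documented Semeion behavior.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewState:
    """Feedback collected in the results list (hypothetical helper)."""
    relevant: list[str] = field(default_factory=list)    # point IDs marked [+]
    irrelevant: list[str] = field(default_factory=list)  # point IDs marked [-]


def build_recommend_request(state: ReviewState, limit: int = 20) -> dict:
    """Build a JSON body for POST /collections/{name}/points/recommend."""
    # Assumption of this sketch: refuse to refine without a positive example.
    if not state.relevant:
        raise ValueError("mark at least one result as relevant before refining")
    return {
        "positive": list(state.relevant),
        "negative": list(state.irrelevant),
        "limit": limit,
        "with_payload": True,
    }


state = ReviewState(relevant=["msg-001", "msg-007"], irrelevant=["msg-003"])
request_body = build_recommend_request(state, limit=10)
```

The body can then be sent with any HTTP client or mapped onto the equivalent client-library call; each "Refine" click would rebuild it from the current marks.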
## Artifact Types (for first PoC)

### Searchable (Semantic Vector)

These artifacts are embedded and searchable via natural language:

| Type | Examples |
|------|----------|
| **Messages** | WhatsApp, Telegram, Signal, or any other app with a parseable SQLite database |
| **Browser Events** | Chrome, Firefox, Safari, Edge: history, downloads, bookmarks, searches |
| **Email** | Any email files (simplified: sender, receiver, subject, body, timestamp) |
| **Documents** | PDF, Word, plain text |

### Timeline-Only (Context View)

These artifacts appear in the temporal context view but are not semantically searchable:

| Type | Examples |
|------|----------|
| **File Events** | File creation, modification, deletion, access (via Sleuthkit) |
| **Process Events** | Application launches, process creation |
| **Network Events** | Connections, DNS queries |
| **Registry Events** | Windows registry modifications |
| **System Events** | Logs, authentication events |

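As a rough illustration of the context view, the sketch below selects every artifact within ±N minutes of a discovered hit and returns them in chronological order. It works on plain in-memory dictionaries; the function name `context_window`, the dictionary keys, and the sample events are all invented for this example. A real implementation would more likely push the time range into a database filter.

```python
from datetime import datetime, timedelta


def context_window(events: list[dict], anchor: datetime, minutes: int = 10) -> list[dict]:
    """Return events within +/- `minutes` of `anchor`, oldest first."""
    delta = timedelta(minutes=minutes)
    nearby = [e for e in events if abs(e["timestamp"] - anchor) <= delta]
    return sorted(nearby, key=lambda e: e["timestamp"])


# Invented sample data: one searchable message plus timeline-only events.
events = [
    {"artifact_class": "file_event", "timestamp": datetime(2024, 1, 15, 9, 41), "text": "wallet.dat created"},
    {"artifact_class": "message", "timestamp": datetime(2024, 1, 15, 9, 30), "text": "payment sent"},
    {"artifact_class": "process_event", "timestamp": datetime(2024, 1, 15, 9, 35), "text": "browser launched"},
    {"artifact_class": "network_event", "timestamp": datetime(2024, 1, 15, 8, 0), "text": "DNS query"},
]

hit_time = datetime(2024, 1, 15, 9, 30)
timeline = context_window(events, hit_time, minutes=10)
```

With a 10-minute window the 09:41 file event and the 08:00 DNS query fall outside the range, so only the message and the process launch remain.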
## How It Works

```text
┌────────────────────────────────────────────────────────────┐
│                      SEMEION WORKFLOW                      │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  1. QUERY          "Find messages about buying ransomware  │
│                     access with crypto in January"         │
│                             │                              │
│                             ▼                              │
│  2. INTERPRET      ┌──────────────────┐                    │
│                    │ LLM parses       │                    │
│                    │ into search      │                    │
│                    │ object           │                    │
│                    └────────┬─────────┘                    │
│                             │                              │
│                             ▼                              │
│  3. CONFIRM        ┌──────────────────┐                    │
│                    │ User reviews     │                    │
│                    │ and adjusts      │                    │
│                    │ parameters       │                    │
│                    └────────┬─────────┘                    │
│                             │                              │
│                             ▼                              │
│  4. SEARCH         ┌──────────────────┐                    │
│                    │ Qdrant executes  │                    │
│                    │ (hybrid) search  │                    │
│                    └────────┬─────────┘                    │
│                             │                              │
│                             ▼                              │
│  5. REVIEW         ┌──────────────────┐                    │
│                    │ Mark [+] / [-]   │◄─────┐             │
│                    │ Click "Refine"   │      │             │
│                    └────────┬─────────┘      │             │
│                             │                │             │
│                             ▼                │             │
│                    ┌──────────────────┐      │             │
│                    │ Qdrant Recommend │──────┘             │
│                    │ returns better   │  (iterate)         │
│                    │ results          │                    │
│                    └────────┬─────────┘                    │
│                             │                              │
│                             ▼                              │
│  6. CONTEXT        ┌──────────────────┐                    │
│                    │ View surrounding │                    │
│                    │ timeline (±N min)│                    │
│                    │ including system │                    │
│                    │ artifacts        │                    │
│                    └──────────────────┘                    │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

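Steps 2 and 3 hinge on a structured search object that the investigator can inspect before anything runs. The project does not yet define that object, so the following is only a sketch under assumptions: the `SearchObject` fields, the JSON shape of the LLM reply, and the year in the sample dates are all hypothetical.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class SearchObject:
    """Editable search parameters shown to the user in step 3 (assumed shape)."""
    semantic_query: str
    artifact_classes: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)
    date_from: Optional[str] = None  # ISO date, inclusive
    date_to: Optional[str] = None    # ISO date, inclusive


def parse_llm_reply(raw: str) -> SearchObject:
    """Parse the LLM's JSON reply, rejecting keys the schema does not know."""
    data = json.loads(raw)
    known = {f.name for f in fields(SearchObject)}
    unknown = set(data) - known
    if unknown:
        raise ValueError(f"unexpected keys from LLM: {sorted(unknown)}")
    return SearchObject(**data)


# Hypothetical LLM reply for the query in the diagram (the year is invented).
reply = json.dumps({
    "semantic_query": "buying ransomware access with crypto",
    "artifact_classes": ["message"],
    "keywords": ["ransomware", "crypto"],
    "date_from": "2024-01-01",
    "date_to": "2024-01-31",
})
search = parse_llm_reply(reply)
```

Rejecting unknown keys keeps a malformed or over-eager LLM reply from silently reaching the search step; the user still confirms or edits every field before execution.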
## Architecture

### Client-Server Design

```text
┌────────────────────────────────────────────────────────────┐
│                                                            │
│                  ┌─────────────────────┐                   │
│                  │   PySide6 Client    │                   │
│                  │    (Investigator    │                   │
│                  │     Workstation)    │                   │
│                  └──────────┬──────────┘                   │
│                             │                              │
│             ┌───────────────┴───────────────┐              │
│             │                               │              │
│             ▼                               ▼              │
│  ┌─────────────────────┐         ┌─────────────────────┐   │
│  │     Qdrant API      │         │       LLM API       │   │
│  │                     │         │ (OpenAI-compatible) │   │
│  └─────────────────────┘         └─────────────────────┘   │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

### Deployment Options

| Configuration | Qdrant | LLM | Use Case |
|---------------|--------|-----|----------|
| **Fully Local** | localhost | Ollama (localhost) | Single investigator, offline |
| **Airgapped Network** | Internal server | Internal server | Forensic lab, sensitive cases |
| **Hybrid** | Local | Cloud API | Balance of privacy and capability |
| **Full Cloud** | Cloud | Cloud API | Team access, scalability |

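Because the client only talks to two HTTP endpoints, a deployment could in principle be described by two URLs. The sketch below assumes that shape; the `Endpoints` class, the profile names, and the placeholder cloud host are hypothetical, while ports 6333 (Qdrant) and 11434 (Ollama's OpenAI-compatible `/v1` API) are the defaults those tools ship with.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


@dataclass(frozen=True)
class Endpoints:
    """The two services the client talks to (hypothetical config shape)."""
    qdrant_url: str
    llm_base_url: str

    def is_fully_local(self) -> bool:
        """True when both services are reached on the investigator's machine."""
        hosts = (urlparse(self.qdrant_url).hostname, urlparse(self.llm_base_url).hostname)
        return all(host in LOCAL_HOSTS for host in hosts)


PROFILES = {
    # Default ports of a local Qdrant and a local Ollama instance.
    "fully_local": Endpoints("http://localhost:6333", "http://localhost:11434/v1"),
    # Placeholder cloud host; substitute the provider actually in use.
    "hybrid": Endpoints("http://localhost:6333", "https://llm.example.com/v1"),
}
```

A check like `is_fully_local()` could back a visible warning in the client whenever case data would leave the workstation.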
## Use Case Examples

### Example 1: Cryptocurrency Investigation

**Query:** *"Find discussions about buying software with cryptocurrency"*

**Semeion finds:**

- Chat messages mentioning crypto payments
- Browser visits to exchange websites
- Wallet-related searches

**Context view reveals:**

- File downloads after payment discussions
- Application installations
- Network connections to blockchain services

### Example 2: Data Exfiltration

**Query:** *"Messages about sending confidential documents"*

**Semeion finds:**

- Emails discussing document sharing
- Chat messages about file transfers

**Context view reveals:**

- File access events before discussions
- Cloud storage uploads after discussions
- USB device connections

### Example 3: Timeline Reconstruction

**Query:** *"Threatening messages received in March"*

**Semeion finds:**

- Messages matching threatening language patterns

**Context view reveals:**

- What the recipient searched afterward
- Files accessed or deleted
- Communication with others about the threat

## Technical Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| GUI Framework | PySide6 | Desktop application |
| Vector Database | Qdrant | Semantic search and storage |
| LLM Interface | OpenAI-compatible API | Query interpretation |
| Embedding API | OpenAI-compatible API | Vector generation |
| Forensic Parsing | pytsk3, pyewf | Disk image processing |
| Language | Python 3.13 (PySide6-restricted) | Application logic |

## Data Model

### SemeionArtifact (Simplified)

Every artifact, regardless of source platform, conforms to a universal schema:

```text
┌──────────────────────────────────────────────────────────────────────┐
│                           SemeionArtifact                            │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Identity:        id, case_id                                        │
│  Classification:  artifact_class, source_platform, searchable        │
│  Temporal:        timestamp                                          │
│  Actors:          [{identifier, display_name, role}]                 │
│  Content:         text, semantic_text                                │
│  Entities:        indexed_entities[] (for filtering)                 │
│  Hierarchy:       parent_id, chunk_info (for documents)              │
│  Context:         context_group (conversation, thread, session)      │
│  Location:        url, path, title                                   │
│  Source-Specific: message{}, browser{}, email{}, document{}, etc.    │
│  Ingestion:       ingested_at, source_file, parser_id                │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

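One way to read the schema above is as a Python dataclass. The field names follow the diagram; the types, defaults, the single `source_specific` dictionary standing in for `message{}`, `browser{}`, and so on, and the `to_payload` helper are assumptions of this sketch, not the project's final model.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class Actor:
    identifier: str    # e.g. phone number or email address
    display_name: str
    role: str          # e.g. "sender" or "recipient" (assumed vocabulary)


@dataclass
class SemeionArtifact:
    # Identity and classification
    id: str
    case_id: str
    artifact_class: str
    source_platform: str
    searchable: bool
    # Temporal
    timestamp: datetime
    # Content, actors, entities
    text: str = ""
    semantic_text: str = ""
    actors: list[Actor] = field(default_factory=list)
    indexed_entities: list[str] = field(default_factory=list)
    # Hierarchy, context, location
    parent_id: Optional[str] = None
    chunk_info: Optional[dict[str, Any]] = None
    context_group: Optional[str] = None
    url: Optional[str] = None
    path: Optional[str] = None
    title: Optional[str] = None
    # Source-specific details keyed by "message", "browser", "email", ...
    source_specific: dict[str, Any] = field(default_factory=dict)
    # Ingestion
    ingested_at: Optional[datetime] = None
    source_file: Optional[str] = None
    parser_id: Optional[str] = None

    def to_payload(self) -> dict[str, Any]:
        """Flatten to a JSON-friendly dict, e.g. for a vector-store payload."""
        payload = asdict(self)
        for key in ("timestamp", "ingested_at"):
            if payload[key] is not None:
                payload[key] = payload[key].isoformat()
        return payload


artifact = SemeionArtifact(
    id="a1", case_id="case-42", artifact_class="message",
    source_platform="whatsapp", searchable=True,
    timestamp=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    text="payment sent", actors=[Actor("+100", "Alice", "sender")],
)
payload = artifact.to_payload()
```

Keeping the common fields flat and the platform details in a nested dictionary means that filters on `case_id`, `artifact_class`, or `timestamp` work identically for every source.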
### Vector Strategy

| Vector | Purpose | Required |
|--------|---------|----------|
| **Semantic** | Conceptual similarity search | Yes |
| **Sparse** (keywords) | Exact term matching (hybrid) | Optional |

Timeline-only artifacts use a placeholder vector and are excluded from search but included in context views.

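When both vectors are present, the two ranked result lists have to be merged into one. The source does not say how Semeion will do this; reciprocal rank fusion (RRF) is one common choice, and Qdrant offers it as a server-side fusion option. The client-side sketch below is an assumption made for illustration only, with `k = 60` as the conventional smoothing constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] = scores.get(point_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


semantic_hits = ["a", "b", "c"]  # ranked by dense-vector similarity
keyword_hits = ["b", "d"]        # ranked by sparse (keyword) match
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Artifact `b` appears in both lists and therefore ranks first, which is the behavior hybrid search is after: items supported by both meaning and exact terms rise to the top.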
## What Semeion Is NOT (yet)

| Not This | Why |
|----------|-----|
| Forensic suite replacement | Companion tool: use it alongside Autopsy/Axiom |
| Reporting tool | Review, analyse, and document findings in your primary application |
| Forensic interpretation robot | Helps you discover what you otherwise wouldn't; it does not interpret the evidence for you |

## Development Setup

This project uses [uv](https://github.com/astral-sh/uv) for dependency management.

### Prerequisites

- Python 3.13+
- [uv](https://github.com/astral-sh/uv) installed

### Installation

```bash
git clone <repository-url>
cd semeion

# virtual environment
uv venv --python 3.13

# activate environment
source .venv/bin/activate  # Linux/macOS

# dependencies
uv pip install -r requirements.txt -e .
```

### Running

The editable install registers a `semeion` startup script, so with the environment activated the application starts with:

```bash
semeion
```

## System Requirements

### Minimum (Remote Processing)

| Resource | Requirement |
|----------|-------------|
| CPU | Multi-core |
| RAM | 4 GB |
| Storage | Minimal (evidence stored elsewhere) |
| Network | Access to Qdrant and LLM endpoints |

### Recommended (Local Processing)

| Resource | Requirement |
|----------|-------------|
| CPU | 8+ cores |
| RAM | 32 GB |
| Storage | Sufficient for evidence and vectors, plus the LLM if installed locally |
| GPU | Optional (improves embedding speed) |

## Project Status

**Current Phase:** Architecture and data model definition

**Roadmap:**

1. ✅ Concept and schema design
2. ⬜ Core infrastructure (Qdrant collection, basic ingestion)
3. ⬜ Search execution (semantic search, filtering)
4. ⬜ LLM integration (query interpretation)
5. ⬜ Refinement system (Qdrant Recommend)
6. ⬜ Context view
7. ⬜ Platform parsers (WhatsApp, Chrome, etc.)
8. ⬜ Hybrid search (sparse vectors)

## Branding

**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark," embodying the software's mission of interpreting semantic signals within digital evidence.

The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden knowledge through analytical thought.

## License

BSD 3-Clause