semeion/README.md
2025-11-29 19:02:40 +01:00


# semeion
![Semeion title image](resources/title_image.png)
## Overview
Semeion is a **semantic search companion tool** for digital forensics investigators. It does not replace traditional forensic suites (Autopsy, Axiom, X-Ways) but augments them by enabling natural language queries across communication artifacts to quickly identify areas of interest.
> **Core Question Semeion Answers:** "Where should I look?"
An investigator opens Semeion alongside their primary forensic tool, types a query like *"discussions about cryptocurrency payments"*, reviews ranked results across chat messages, browser history, emails, and documents, then uses those findings to guide deeper analysis in their forensic suite.
## Key Features
| Feature | Description |
|---------|-------------|
| **Natural Language Search** | Query artifacts using plain English instead of keywords |
| **LLM-Assisted Interpretation** | Queries are parsed by an LLM into structured search parameters |
| **Human-in-the-Loop Confirmation** | Review and edit interpreted search parameters before execution |
| **Semantic + Keyword (Hybrid) Search** | Combines meaning-based vector search with keyword matching |
| **Interactive Refinement** | Mark results as relevant [+] or irrelevant [-], then refine via Qdrant's Recommend API |
| **Temporal Context View** | See what happened before and after a discovered artifact |
| **Universal Artifact Model** | Platform-agnostic schema that can be extended to forensic data from any external source |
| **Flexible Deployment** | Runs fully local or on airgapped forensic networks (or with cloud infrastructure if anyone would do such a thing) |
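
The interpretation and confirmation features can be pictured as a small structure that the LLM fills in and the investigator edits before anything runs. This is a minimal sketch under assumptions: the `SearchRequest` class, its field names, and the JSON shape of the LLM reply are hypothetical and not taken from the project's schema.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class SearchRequest:
    """Structured search parameters (illustrative field names only)."""
    semantic_text: str
    artifact_classes: list[str] = field(default_factory=list)
    start: Optional[datetime] = None
    end: Optional[datetime] = None
    limit: int = 20


def parse_llm_reply(reply: str) -> SearchRequest:
    """Convert the LLM's JSON reply into a SearchRequest and sanity-check it."""
    data = json.loads(reply)
    start = datetime.fromisoformat(data["start"]) if data.get("start") else None
    end = datetime.fromisoformat(data["end"]) if data.get("end") else None
    if start and end and start > end:
        raise ValueError("time range starts after it ends")
    return SearchRequest(
        semantic_text=data["semantic_text"],
        artifact_classes=list(data.get("artifact_classes", [])),
        start=start,
        end=end,
        limit=int(data.get("limit", 20)),
    )


# Hypothetical reply for the query
# "Find messages about buying ransomware access with crypto in January".
# The year is an assumption; the query itself does not name one.
reply = json.dumps({
    "semantic_text": "buying ransomware access with cryptocurrency",
    "artifact_classes": ["message"],
    "start": "2025-01-01T00:00:00",
    "end": "2025-01-31T23:59:59",
})
request = parse_llm_reply(reply)

# Human-in-the-loop step: the investigator may adjust any field before searching.
request.limit = 50
```

Keeping the parsed parameters in a plain, editable object is what makes the confirmation step cheap: the GUI only has to render and write back a handful of fields.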
## Artifact Types (for first PoC)
### Searchable (Semantic Vector)
These artifacts are embedded and searchable via natural language:
| Type | Examples |
|------|----------|
| **Messages** | WhatsApp, Telegram, Signal, or any other app that stores messages in a parseable SQLite database |
| **Browser Events** | Chrome, Firefox, Safari, Edge — history, downloads, bookmarks, searches |
| **Email** | Any email files (simplified: sender, receiver, subject, body, timestamp) |
| **Documents** | PDF, Word, plain text |
### Timeline-Only (Context View)
These artifacts appear in the temporal context view but are not semantically searchable:
| Type | Examples |
|------|----------|
| **File Events** | File creation, modification, deletion, access (via Sleuthkit) |
| **Process Events** | Application launches, process creation |
| **Network Events** | Connections, DNS queries |
| **Registry Events** | Windows registry modifications |
| **System Events** | Logs, authentication events |
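
The split between searchable and timeline-only artifacts could be encoded as a simple lookup at ingestion time. The sketch below is illustrative: the `ArtifactClass` enum and its string values are assumed names, chosen to mirror the two tables above.

```python
from enum import Enum


class ArtifactClass(Enum):
    # Searchable (semantic vector)
    MESSAGE = "message"
    BROWSER_EVENT = "browser_event"
    EMAIL = "email"
    DOCUMENT = "document"
    # Timeline-only (context view)
    FILE_EVENT = "file_event"
    PROCESS_EVENT = "process_event"
    NETWORK_EVENT = "network_event"
    REGISTRY_EVENT = "registry_event"
    SYSTEM_EVENT = "system_event"


# Classes that get embedded and can be queried in natural language.
SEARCHABLE = frozenset({
    ArtifactClass.MESSAGE,
    ArtifactClass.BROWSER_EVENT,
    ArtifactClass.EMAIL,
    ArtifactClass.DOCUMENT,
})


def is_searchable(artifact_class: ArtifactClass) -> bool:
    """Return True if artifacts of this class should be embedded for search."""
    return artifact_class in SEARCHABLE


timeline_only = [c.value for c in ArtifactClass if not is_searchable(c)]
```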
## How It Works
```text
┌──────────────────────────────────────────────────────────────────────────┐
│ SEMEION WORKFLOW │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. QUERY "Find messages about buying ransomware │
│ access with crypto in January" │
│ │ │
│ ▼ │
│ 2. INTERPRET ┌──────────────────┐ │
│ │ LLM parses │ │
│ │ into search │ │
│ │ object │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ 3. CONFIRM ┌──────────────────┐ │
│ │ User reviews │ │
│ │ and adjusts │ │
│ │ parameters │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ 4. SEARCH ┌──────────────────┐ │
│ │ Qdrant executes │ │
│ │ (hybrid) search │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ 5. REVIEW ┌──────────────────┐ │
│ │ Mark [+] / [-] │◄─────┐ │
│ │ Click "Refine" │ │ │
│ └────────┬─────────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌──────────────────┐ │ │
│ │ Qdrant Recommend │──────┘ │
│ │ returns better │ (iterate) │
│ │ results │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ 6. CONTEXT ┌──────────────────┐ │
│ │ View surrounding │ │
│ │ timeline (±N min)│ │
│ │ including system │ │
│ │ artifacts │ │
│ └──────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
```
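
The refinement loop in step 5 boils down to turning the investigator's [+] and [-] marks into a recommendation request. The sketch below only builds the request body. It assumes Qdrant's classic recommend endpoint (`POST /collections/<collection>/points/recommend`) with `positive`, `negative`, and `limit` fields; check this against the Qdrant version in use. The function name and the shape of the `marks` mapping are hypothetical.

```python
def build_recommend_body(marks: dict[int, str], limit: int = 20) -> dict:
    """Translate result marks into a recommend request body.

    marks maps a Qdrant point ID to "+" (relevant) or "-" (irrelevant).
    Unmarked results are simply absent from the mapping.
    """
    positive = [point_id for point_id, mark in marks.items() if mark == "+"]
    negative = [point_id for point_id, mark in marks.items() if mark == "-"]
    if not positive:
        # Simplification of this sketch: require at least one relevant example.
        raise ValueError("mark at least one result as relevant before refining")
    return {
        "positive": positive,
        "negative": negative,
        "limit": limit,
        "with_payload": True,
    }


# Example: two results marked relevant, one marked irrelevant.
body = build_recommend_body({101: "+", 205: "-", 317: "+"}, limit=10)
```

Each click on "Refine" would rebuild this body from the accumulated marks, so the loop stays stateless on the server side.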
## Architecture
### Client-Server Design
```text
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────────────┐ │
│ │ PySide6 Client │ │
│ │ (Investigator │ │
│ │ Workstation) │ │
│ └──────────┬──────────┘ │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────┐ ┌──────────────────────┐ │
│ │ Qdrant API │ │ LLM API │ │
│ │ │ │ (OpenAI-compatible)│ │
│ └─────────────────────┘ └──────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
### Deployment Options
| Configuration | Qdrant | LLM | Use Case |
|---------------|--------|-----|----------|
| **Fully Local** | localhost | Ollama (localhost) | Single investigator, offline |
| **Airgapped Network** | Internal server | Internal server | Forensic lab, sensitive cases |
| **Hybrid** | Local | Cloud API | Balance of privacy and capability |
| **Full Cloud** | Cloud | Cloud API | Team access, scalability |
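
Because the client only talks to two HTTP endpoints, switching between these configurations can be reduced to swapping two URLs. The sketch below is an assumption about how such profiles might be stored. Ports 6333 (Qdrant) and 11434 (Ollama, with its OpenAI-compatible API under `/v1`) are the tools' defaults; every other hostname is a placeholder, and the `Endpoints` and `stays_on_host` names are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Endpoints:
    qdrant_url: str
    llm_base_url: str


# Placeholder hostnames; only the localhost entries use real default ports.
PROFILES = {
    "fully_local": Endpoints("http://localhost:6333", "http://localhost:11434/v1"),
    "airgapped": Endpoints("http://qdrant.lab.internal:6333", "http://llm.lab.internal/v1"),
    "hybrid": Endpoints("http://localhost:6333", "https://llm.example.com/v1"),
    "full_cloud": Endpoints("https://qdrant.example.com", "https://llm.example.com/v1"),
}

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def stays_on_host(endpoints: Endpoints) -> bool:
    """True if both services are reached on the investigator's own machine."""
    hosts = (
        urlparse(endpoints.qdrant_url).hostname,
        urlparse(endpoints.llm_base_url).hostname,
    )
    return all(host in LOCAL_HOSTS for host in hosts)


local_only = [name for name, ep in PROFILES.items() if stays_on_host(ep)]
```

A check like `stays_on_host` could back a visible warning in the GUI whenever case data would leave the workstation.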
## Use Case Examples
### Example 1: Cryptocurrency Investigation
**Query:** *"Find discussions about buying software with cryptocurrency"*
**Semeion finds:**
- Chat messages mentioning crypto payments
- Browser visits to exchange websites
- Wallet-related searches
**Context view reveals:**
- File downloads after payment discussions
- Application installations
- Network connections to blockchain services
### Example 2: Data Exfiltration
**Query:** *"Messages about sending confidential documents"*
**Semeion finds:**
- Emails discussing document sharing
- Chat messages about file transfers
**Context view reveals:**
- File access events before discussions
- Cloud storage uploads after discussions
- USB device connections
### Example 3: Timeline Reconstruction
**Query:** *"Threatening messages received in March"*
**Semeion finds:**
- Messages matching threatening language patterns
**Context view reveals:**
- What the recipient searched afterward
- Files accessed or deleted
- Communication with others about the threat
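
In all three examples the context view is a time-window lookup around a discovered artifact. A minimal sketch of that lookup, assuming events are available as `(timestamp, description)` tuples; the function name, the tuple shape, and the sample events are illustrative only.

```python
from datetime import datetime, timedelta

Event = tuple[datetime, str]


def context_window(events: list[Event], anchor: datetime, minutes: int = 10) -> list[Event]:
    """Return all events within +/- `minutes` of the anchor, oldest first."""
    delta = timedelta(minutes=minutes)
    nearby = [event for event in events if abs(event[0] - anchor) <= delta]
    return sorted(nearby, key=lambda event: event[0])


# Made-up sample timeline around a chat message found at 14:00.
anchor = datetime(2025, 3, 4, 14, 0)
events = [
    (datetime(2025, 3, 4, 15, 30), "process: browser started"),
    (datetime(2025, 3, 4, 14, 3), "file: archive downloaded"),
    (datetime(2025, 3, 4, 13, 55), "file: document accessed"),
    (anchor, "message: discovered chat message"),
]
window = context_window(events, anchor, minutes=10)
```

In a real deployment the same window would more likely be expressed as a timestamp range filter on the stored artifacts rather than an in-memory scan.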
## Technical Stack
| Component | Technology | Purpose |
|-----------|------------|---------|
| GUI Framework | PySide6 | Desktop application |
| Vector Database | Qdrant | Semantic search and storage |
| LLM Interface | OpenAI-compatible API | Query interpretation |
| Embedding API | OpenAI-compatible API | Vector generation |
| Forensic Parsing | pytsk3, pyewf | Disk image processing |
| Language | Python 3.13 (version constrained by PySide6 support) | Application logic |
## Data Model
### SemeionArtifact (Simplified)
Every artifact — regardless of source platform — conforms to a universal schema:
```text
┌─────────────────────────────────────────────────────────────────────────┐
│ SemeionArtifact │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Identity: id, case_id │
│ Classification: artifact_class, source_platform, searchable │
│ Temporal: timestamp, timestamp_precision │
│ Actors: [{identifier, display_name, role}] │
│ Content: text, semantic_text │
│ Entities: indexed_entities[] (for filtering) │
│ Hierarchy: parent_id, chunk_info (for documents) │
│ Context: context_group (conversation, thread, session) │
│ Location: url, path, title │
│ Source-Specific: message{}, browser{}, email{}, document{}, etc. │
│ Ingestion: ingested_at, source_file, parser_id │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
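
The schema above could be sketched in Python as a dataclass. The field names follow the diagram, but the types, defaults, the subset of fields shown, and the `to_payload` helper are assumptions for illustration, not the project's actual model.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class Actor:
    identifier: str
    display_name: str
    role: str


@dataclass
class SemeionArtifact:
    # Identity and classification
    id: str
    case_id: str
    artifact_class: str
    source_platform: str
    searchable: bool
    # Temporal
    timestamp: datetime
    timestamp_precision: str
    # Content
    text: str
    semantic_text: str
    # Optional groups (types are guesses)
    actors: list[Actor] = field(default_factory=list)
    indexed_entities: list[str] = field(default_factory=list)
    parent_id: Optional[str] = None
    context_group: Optional[str] = None
    source_specific: dict[str, Any] = field(default_factory=dict)

    def to_payload(self) -> dict[str, Any]:
        """Flatten to a JSON-friendly dict, e.g. for storage next to the vector."""
        payload = asdict(self)
        payload["timestamp"] = self.timestamp.isoformat()
        return payload


# Made-up example artifact.
artifact = SemeionArtifact(
    id="a-001",
    case_id="case-42",
    artifact_class="message",
    source_platform="whatsapp",
    searchable=True,
    timestamp=datetime(2025, 1, 14, 9, 30, tzinfo=timezone.utc),
    timestamp_precision="second",
    text="can you pay in btc?",
    semantic_text="Request to pay in bitcoin",
    actors=[Actor("+4912345", "Alex", "sender")],
    context_group="chat-7",
)
payload = artifact.to_payload()
```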
### Vector Strategy
| Vector | Purpose | Required |
|--------|---------|----------|
| **Semantic** | Conceptual similarity search | Yes |
| **Sparse** (keywords) | Exact term matching (hybrid) | Optional |
Timeline-only artifacts use a placeholder vector and are excluded from search but included in context views.
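
One common way to merge a semantic result list with a keyword result list is reciprocal rank fusion (RRF). This README does not say which fusion method Semeion will use, so the sketch below is only an illustration of the hybrid idea; the function name and the sample IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, artifact_id in enumerate(ranking, start=1):
            scores[artifact_id] = scores.get(artifact_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda artifact_id: (-scores[artifact_id], artifact_id))


semantic_hits = ["a", "b", "c"]  # ranked by dense (semantic) similarity
keyword_hits = ["b", "d", "a"]   # ranked by sparse (keyword) match
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
# fused == ["b", "a", "d", "c"]
```

Artifacts that appear in both lists rise to the top, which is the behaviour wanted when a query contains both a concept and an exact term such as a wallet address.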
## What Semeion Is NOT (yet)
| Not This | Why |
|----------|-----|
| Forensic suite replacement | Companion tool — use alongside Autopsy/Axiom |
| Reporting tool | Review and analyse findings in Semeion; document them in your primary forensic application |
| Forensic interpretation robot | Helps you discover what you otherwise wouldn't; it does not interpret the evidence for you |
## Development Setup
This project uses [uv](https://github.com/astral-sh/uv) for dependency management.
### Prerequisites
- Python 3.13+
- [uv](https://github.com/astral-sh/uv) installed
### Installation
```bash
git clone <repository-url>
cd semeion
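# create the virtual environment
uv venv --python 3.13
# activate the environment
source .venv/bin/activate # Linux/macOS
# on Windows use: .venv\Scripts\activate
# install dependencies and the project itself in editable mode
uv pip install -r requirements.txt -e .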
```
### Running
With the virtual environment activated, launch the application through the `semeion` startup script set up during installation:
```bash
semeion
```
## System Requirements
### Minimum (Remote Processing)
| Resource | Requirement |
|----------|-------------|
| CPU | Multi-core |
| RAM | 4 GB |
| Storage | Minimal (evidence stored elsewhere) |
| Network | Access to Qdrant and LLM endpoints |
### Recommended (Local Processing)
| Resource | Requirement |
|----------|-------------|
| CPU | 8+ cores |
| RAM | 32 GB |
| Storage | Sufficient for evidence and vectors, plus the LLM if it runs locally |
| GPU | Optional (improves embedding speed) |
## Project Status
**Current Phase:** Architecture and data model definition
**Roadmap:**
1. ✅ Concept and schema design
2. ⬜ Core infrastructure (Qdrant collection, basic ingestion)
3. ⬜ Search execution (semantic search, filtering)
4. ⬜ LLM integration (query interpretation)
5. ⬜ Refinement system (Qdrant Recommend)
6. ⬜ Context view
7. ⬜ Platform parsers (WhatsApp, Chrome, etc.)
8. ⬜ Hybrid search (sparse vectors)
## Branding
**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark" — embodying the software's mission of interpreting semantic signals within digital evidence.
The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden knowledge through analytical thought.
## License
BSD 3-Clause