# semeion

![alt text](resources/title_image.png)

## Overview

Semeion is a **semantic search companion tool** for digital forensics investigators. It does not replace traditional forensic suites (Autopsy, Axiom, X-Ways) but augments them by enabling natural language queries across communication artifacts to quickly identify areas of interest.

> **Core Question Semeion Answers:** "Where should I look?"

An investigator opens Semeion alongside their primary forensic tool, types a query like *"discussions about cryptocurrency payments"*, reviews ranked results across chat messages, browser history, emails, and documents, then uses those findings to guide deeper analysis in their forensic suite.

## Key Features

| Feature | Description |
|---------|-------------|
| **Natural Language Search** | Query artifacts using plain English instead of keywords |
| **LLM-Assisted Interpretation** | Queries are parsed by an LLM into structured search parameters |
| **Human-in-the-Loop Confirmation** | Review and edit interpreted search parameters before execution |
| **Semantic (+ Hybrid) Search** | Combines meaning-based vector search with keyword matching |
| **Interactive Refinement** | Mark results as relevant [+] or irrelevant [-], then refine via Qdrant's Recommend API |
| **Temporal Context View** | See what happened before and after a discovered artifact |
| **Universal Artifact Model** | Platform-agnostic: adapts to forensic data from any external source and is designed to be extended |
| **Flexible Deployment** | Runs fully local or on airgapped forensic networks (or with cloud infrastructure if anyone would do such a thing) |

## Artifact Types (for first PoC)

### Searchable (Semantic Vector)

These artifacts are embedded and searchable via natural language:

| Type | Examples |
|------|----------|
| **Messages** | WhatsApp, Telegram, Signal, or anything that has a parseable SQLite database |
| **Browser Events** | Chrome, Firefox, Safari, Edge: history, downloads, bookmarks, searches |
| **Email** | Any email files (simplified: sender, receiver, subject, body, timestamp) |
| **Documents** | PDF, Word, plain text |

### Timeline-Only (Context View)

These artifacts appear in the temporal context view but are not semantically searchable:

| Type | Examples |
|------|----------|
| **File Events** | File creation, modification, deletion, access (via Sleuthkit) |
| **Process Events** | Application launches, process creation |
| **Network Events** | Connections, DNS queries |
| **Registry Events** | Windows registry modifications |
| **System Events** | Logs, authentication events |

## How It Works

```text
┌──────────────────────────────────────────────────────────────────────┐
│                           SEMEION WORKFLOW                           │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   1. QUERY       "Find messages about buying ransomware              │
│                   access with crypto in January"                     │
│                           │                                          │
│                           ▼                                          │
│   2. INTERPRET   ┌──────────────────┐                                │
│                  │  LLM parses      │                                │
│                  │  into search     │                                │
│                  │  object          │                                │
│                  └────────┬─────────┘                                │
│                           │                                          │
│                           ▼                                          │
│   3. CONFIRM     ┌──────────────────┐                                │
│                  │  User reviews    │                                │
│                  │  and adjusts     │                                │
│                  │  parameters      │                                │
│                  └────────┬─────────┘                                │
│                           │                                          │
│                           ▼                                          │
│   4. SEARCH      ┌──────────────────┐                                │
│                  │  Qdrant executes │                                │
│                  │  (hybrid) search │                                │
│                  └────────┬─────────┘                                │
│                           │                                          │
│                           ▼                                          │
│   5. REVIEW      ┌──────────────────┐                                │
│                  │  Mark [+] / [-]  │◄─────┐                         │
│                  │  Click "Refine"  │      │                         │
│                  └────────┬─────────┘      │                         │
│                           │                │                         │
│                           ▼                │                         │
│                  ┌──────────────────┐      │                         │
│                  │ Qdrant Recommend │──────┘                         │
│                  │ returns better   │  (iterate)                     │
│                  │ results          │                                │
│                  └────────┬─────────┘                                │
│                           │                                          │
│                           ▼                                          │
│   6. CONTEXT     ┌──────────────────┐                                │
│                  │ View surrounding │                                │
│                  │ timeline (±N min)│                                │
│                  │ including system │                                │
│                  │ artifacts        │                                │
│                  └──────────────────┘                                │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

## Architecture

### Client-Server Design

```text
┌──────────────────────────────────────────────────────────────────────┐
│                                                                      │
│                       ┌─────────────────────┐                        │
│                       │   PySide6 Client    │                        │
│                       │   (Investigator     │                        │
│                       │    Workstation)     │                        │
│                       └──────────┬──────────┘                        │
│                                  │                                   │
│                  ┌───────────────┴───────────────┐                   │
│                  │                               │                   │
│                  ▼                               ▼                   │
│       ┌─────────────────────┐         ┌──────────────────────┐       │
│       │     Qdrant API      │         │       LLM API        │       │
│       │                     │         │  (OpenAI-compatible) │       │
│       └─────────────────────┘         └──────────────────────┘       │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

### Deployment Options

| Configuration | Qdrant | LLM | Use Case |
|---------------|--------|-----|----------|
| **Fully Local** | localhost | Ollama (localhost) | Single investigator, offline |
| **Airgapped Network** | Internal server | Internal server | Forensic lab, sensitive cases |
| **Hybrid** | Local | Cloud API | Balance of privacy and capability |
| **Full Cloud** | Cloud | Cloud API | Team access, scalability |

## Use Case Examples

### Example 1: Cryptocurrency Investigation

**Query:** *"Find discussions about buying software with cryptocurrency"*

**Semeion finds:**

- Chat messages mentioning crypto payments
- Browser visits to exchange websites
- Wallet-related searches

**Context view reveals:**

- File downloads after payment discussions
- Application installations
- Network connections to blockchain services

### Example 2: Data Exfiltration

**Query:** *"Messages about sending confidential documents"*

**Semeion finds:**

- Emails discussing document sharing
- Chat messages about file transfers

**Context view reveals:**

- File access events before discussions
- Cloud storage uploads after discussions
- USB device connections

### Example 3: Timeline Reconstruction

**Query:**
*"Threatening messages received in March"*

**Semeion finds:**

- Messages matching threatening language patterns

**Context view reveals:**

- What the recipient searched afterward
- Files accessed or deleted
- Communication with others about the threat

## Technical Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| GUI Framework | PySide6 | Desktop application |
| Vector Database | Qdrant | Semantic search and storage |
| LLM Interface | OpenAI-compatible API | Query interpretation |
| Embedding API | OpenAI-compatible API | Vector generation |
| Forensic Parsing | pytsk3, pyewf | Disk image processing |
| Language | Python 3.13 (version restricted by PySide6) | Application logic |

## Data Model

### SemeionArtifact (Simplified)

Every artifact, regardless of source platform, conforms to a universal schema:

```text
┌──────────────────────────────────────────────────────────────────────┐
│                           SemeionArtifact                            │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Identity:        id, case_id                                        │
│  Classification:  artifact_class, source_platform, searchable        │
│  Temporal:        timestamp, timestamp_precision                     │
│  Actors:          [{identifier, display_name, role}]                 │
│  Content:         text, semantic_text                                │
│  Entities:        indexed_entities[] (for filtering)                 │
│  Hierarchy:       parent_id, chunk_info (for documents)              │
│  Context:         context_group (conversation, thread, session)      │
│  Location:        url, path, title                                   │
│  Source-Specific: message{}, browser{}, email{}, document{}, etc.    │
│  Ingestion:       ingested_at, source_file, parser_id                │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

### Vector Strategy

| Vector | Purpose | Required |
|--------|---------|----------|
| **Semantic** | Conceptual similarity search | Yes |
| **Sparse** (keywords) | Exact term matching (hybrid) | Optional |

Timeline-only artifacts are stored with a placeholder vector: they are excluded from search results but still appear in context views.
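The split between searchable and timeline-only artifacts can be illustrated with a small in-memory sketch. The field names are taken from the schema above; the class layout, the helper names `search_candidates` and `context_window`, and the sample data are hypothetical and only meant to show the idea. In Semeion itself this filtering would presumably be done by Qdrant rather than in Python.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SemeionArtifact:
    """Trimmed-down, hypothetical subset of the universal schema."""
    id: str
    case_id: str
    artifact_class: str       # e.g. "message", "file_event"
    source_platform: str      # e.g. "whatsapp", "sleuthkit"
    searchable: bool          # False for timeline-only artifacts
    timestamp: datetime
    text: str = ""
    context_group: Optional[str] = None


def search_candidates(artifacts):
    """Only artifacts flagged as searchable take part in semantic search."""
    return [a for a in artifacts if a.searchable]


def context_window(artifacts, hit, minutes=5):
    """Everything in the same case within +/- `minutes` of a hit,
    including timeline-only artifacts, ordered by time."""
    delta = timedelta(minutes=minutes)
    nearby = [
        a for a in artifacts
        if a.case_id == hit.case_id and abs(a.timestamp - hit.timestamp) <= delta
    ]
    return sorted(nearby, key=lambda a: a.timestamp)


# Made-up sample data for illustration only.
artifacts = [
    SemeionArtifact("msg-1", "case-42", "message", "whatsapp", True,
                    datetime(2024, 1, 15, 10, 0), text="payment sent in BTC"),
    SemeionArtifact("file-1", "case-42", "file_event", "sleuthkit", False,
                    datetime(2024, 1, 15, 10, 3), text="created: wallet.dat"),
    SemeionArtifact("web-1", "case-42", "browser_event", "chrome", True,
                    datetime(2024, 1, 15, 11, 0), text="visited exchange site"),
]

searchable_ids = [a.id for a in search_candidates(artifacts)]
context_ids = [a.id for a in context_window(artifacts, artifacts[0], minutes=5)]
print(searchable_ids)  # ['msg-1', 'web-1']
print(context_ids)     # ['msg-1', 'file-1']
```

The file event never shows up as a search candidate, yet it appears next to the message hit once the context window is opened around it, which is the behaviour the two artifact categories are meant to provide.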
## What Semeion Is NOT (yet)

| Not This | Why |
|----------|-----|
| Forensic suite replacement | Companion tool: use alongside Autopsy/Axiom |
| Reporting tool | Review, analyse, and document findings in the primary application |
| Forensic interpretation robot | Helps you discover what you otherwise wouldn't |

## Development Setup

This project uses [uv](https://github.com/astral-sh/uv) for dependency management.

### Prerequisites

- Python 3.13+
- [uv](https://github.com/astral-sh/uv) installed

### Installation

```bash
# clone the repository (substitute the actual repository URL)
git clone <repository-url>
cd semeion

# virtual environment
uv venv --python 3.13

# activate environment
source .venv/bin/activate  # Linux/macOS

# dependencies
uv pip install -r requirements.txt -e .
```

### Running

The install step provides the `semeion` startup script. With the environment activated, run:

```bash
semeion
```

## System Requirements

### Minimum (Remote Processing)

| Resource | Requirement |
|----------|-------------|
| CPU | Multi-core |
| RAM | 4 GB |
| Storage | Minimal (evidence stored elsewhere) |
| Network | Access to Qdrant and LLM endpoints |

### Recommended (Local Processing)

| Resource | Requirement |
|----------|-------------|
| CPU | 8+ cores |
| RAM | 32 GB |
| Storage | Sufficient for evidence and vectors, plus the LLM if installed locally |
| GPU | Optional (improves embedding speed) |

## Project Status

**Current Phase:** Architecture and data model definition

**Roadmap:**

1. ✅ Concept and schema design
2. ⬜ Core infrastructure (Qdrant collection, basic ingestion)
3. ⬜ Search execution (semantic search, filtering)
4. ⬜ LLM integration (query interpretation)
5. ⬜ Refinement system (Qdrant Recommend)
6. ⬜ Context view
7. ⬜ Platform parsers (WhatsApp, Chrome, etc.)
8. ⬜ Hybrid search (sparse vectors)

## Branding

**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark," which reflects the software's mission of interpreting semantic signals within digital evidence.
The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden knowledge through analytical thought.

## License

BSD 3-Clause