From c1826ebab6aff3bf9251a7a36f5f54bbb711394d Mon Sep 17 00:00:00 2001
From: mstoeck
Date: Sat, 29 Nov 2025 19:02:40 +0100
Subject: [PATCH] adkust README

---
 README.md | 423 ++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 270 insertions(+), 153 deletions(-)

diff --git a/README.md b/README.md
index 46656e0..aeac3b5 100644
--- a/README.md
+++ b/README.md
@@ -2,197 +2,314 @@
 ![alt text](resources/title_image.png)

-## Concept
+## Overview

-Desktop application for forensic investigators which allows to search for various types of artifacts via natural language.
-The application uses a multi-stage hybrid search approach which accepts a query in natural language and then utilizes a GPT to generate a machine-readable search plan which can be adjusted by the user before execution.
-The system then utilizes a combination of semantic search via pre-generated embeddings and filters the results as defined via the machine interface.
-This enables the combination of semantic understanding with context and temporal relationships rather than traditional approaches of exact keyword matching.
+Semeion is a **semantic search companion tool** for digital forensics investigators. It does not replace traditional forensic suites (Autopsy, Axiom, X-Ways) but augments them by enabling natural language queries across communication artifacts to quickly identify areas of interest.

-## UX Example
+> **Core Question Semeion Answers:** "Where should I look?"

-An investigator can ask "show me what happened after they discussed the payment" and the system will find relevant communication about payments, then correlate subsequent activities (file access, application launches, network traffic) in a temporal sequence, regardless of the specific applications or messaging platforms involved.
+An investigator opens Semeion alongside their primary forensic tool, types a query like *"discussions about cryptocurrency payments"*, reviews ranked results across chat messages, browser history, emails, and documents, then uses those findings to guide deeper analysis in their forensic suite.

-## System Overview
+## Key Features

-### Core Concept
+| Feature | Description |
+|---------|-------------|
+| **Natural Language Search** | Query artifacts using plain English instead of keywords |
+| **LLM-Assisted Interpretation** | Queries are parsed by an LLM into structured search parameters |
+| **Human-in-the-Loop Confirmation** | Review and edit interpreted search parameters before execution |
+| **Semantic (+ Hybrid) Search** | Combines meaning-based vector search with keyword matching |
+| **Interactive Refinement** | Mark results as relevant [+] or irrelevant [-], then refine via Qdrant's Recommend API |
+| **Temporal Context View** | See what happened before and after a discovered artifact |
+| **Universal Artifact Model** | Platform-agnostic — can adapt to any forensic data from external sources, expandable concept |
+| **Flexible Deployment** | Runs fully local or on airgapped forensic networks (or with cloud infrastructure if anyone would do such a thing) |

-The system treats all digital artifacts as semantically-rich object containers, embedded in a multi-dimensional vector space and associated with metadata. On top of matching exact strings, it finds artifacts with similar meaning. A local language model interprets natural language queries and decomposes them into structured search operations (represented by a json-like datastructure) that can handle temporal reasoning ("after"), relationship detection ("both in"), and causal inference ("because of").
-The query object contains fields which enable result narrowing for data which can be discovered by deterministic data just like timestamps or application purpose as well as a string which suits well for vector retreival.
+## Artifact Types (for first PoC)

-### Key Innovation
+### Searchable (Semantic Vector)

-Generic artifact understanding: The system doesn't need to "know" about TOX, WhatsApp, Signal, Telegram, or any specific application. Ingestion is a pre-scripted pre-processing operation which constructs a standardized data container, which has multiple pre-defined fileds: A set of Metadata which contains data of machine-origin such as OS/application events with timestamps similar to traditional forensic artifacts or timeline data, and a vector representation which holds whatever provides semantic relevance for retreival purposes (which is primarily, but not restricted to content generated by user behavior). This means it works on artifacts from applications that don't even exist yet, or proprietary communication tools without public documentation, or even arbitrary data which holds semantic information only.
+These artifacts are embedded and searchable via natural language:

-### Architecture Philosophy
+| Type | Examples |
+|------|----------|
+| **Messages** | WhatsApp, Telegram, Signal, or anything which has a parseable SQLite database |
+| **Browser Events** | Chrome, Firefox, Safari, Edge — history, downloads, bookmarks, searches |
+| **Email** | Any email files (simplified: sender, receiver, subject, body, timestamp) |
+| **Documents** | PDF, Word, plain text |

-Client-Server Separation: Compute-intensive operations (embedding generation, LLM inference, vector search) can run on powerful remote infrastructure, while the GUI client remains lightweight and runs on the investigator's local machine. This enables:
+### Timeline-Only (Context View)

-- Shared infrastructure across investigation teams
-- Scaling compute resources independently of workstations
-- Deployment in both air-gapped labs and cloud environments
-- Efficient resource utilization (GPU servers can serve multiple investigators)
+These artifacts appear in the temporal context view but are not semantically searchable:
+
+| Type | Examples |
+|------|----------|
+| **File Events** | File creation, modification, deletion, access (via Sleuthkit) |
+| **Process Events** | Application launches, process creation |
+| **Network Events** | Connections, DNS queries |
+| **Registry Events** | Windows registry modifications |
+| **System Events** | Logs, authentication events |
+
+## How It Works
+
+```text
+┌──────────────────────────────────────────────────────────────────────────┐
+│                             SEMEION WORKFLOW                             │
+├──────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  1. QUERY         "Find messages about buying ransomware                 │
+│                    access with crypto in January"                        │
+│                            │                                             │
+│                            ▼                                             │
+│  2. INTERPRET     ┌──────────────────┐                                   │
+│                   │ LLM parses       │                                   │
+│                   │ into search      │                                   │
+│                   │ object           │                                   │
+│                   └────────┬─────────┘                                   │
+│                            │                                             │
+│                            ▼                                             │
+│  3. CONFIRM       ┌──────────────────┐                                   │
+│                   │ User reviews     │                                   │
+│                   │ and adjusts      │                                   │
+│                   │ parameters       │                                   │
+│                   └────────┬─────────┘                                   │
+│                            │                                             │
+│                            ▼                                             │
+│  4. SEARCH        ┌──────────────────┐                                   │
+│                   │ Qdrant executes  │                                   │
+│                   │ (hybrid) search  │                                   │
+│                   └────────┬─────────┘                                   │
+│                            │                                             │
+│                            ▼                                             │
+│  5. REVIEW        ┌──────────────────┐                                   │
+│                   │ Mark [+] / [-]   │◄─────┐                            │
+│                   │ Click "Refine"   │      │                            │
+│                   └────────┬─────────┘      │                            │
+│                            │                │                            │
+│                            ▼                │                            │
+│                   ┌──────────────────┐      │                            │
+│                   │ Qdrant Recommend │──────┘                            │
+│                   │ returns better   │  (iterate)                        │
+│                   │ results          │                                   │
+│                   └────────┬─────────┘                                   │
+│                            │                                             │
+│                            ▼                                             │
+│  6. CONTEXT       ┌──────────────────┐                                   │
+│                   │ View surrounding │                                   │
+│                   │ timeline (±N min)│                                   │
+│                   │ including system │                                   │
+│                   │ artifacts        │                                   │
+│                   └──────────────────┘                                   │
+│                                                                          │
+└──────────────────────────────────────────────────────────────────────────┘
+```
+
+## Architecture
+
+### Client-Server Design
+
+```text
+┌─────────────────────────────────────────────────────────────────────────┐
+│                                                                         │
+│                         ┌─────────────────────┐                         │
+│                         │   PySide6 Client    │                         │
+│                         │   (Investigator     │                         │
+│                         │    Workstation)     │                         │
+│                         └──────────┬──────────┘                         │
+│                                    │                                    │
+│                    ┌───────────────┴───────────────┐                    │
+│                    │                               │                    │
+│                    ▼                               ▼                    │
+│         ┌─────────────────────┐        ┌──────────────────────┐         │
+│         │     Qdrant API      │        │       LLM API        │         │
+│         │                     │        │  (OpenAI-compatible) │         │
+│         └─────────────────────┘        └──────────────────────┘         │
+│                                                                         │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+### Deployment Options
+
+| Configuration | Qdrant | LLM | Use Case |
+|---------------|--------|-----|----------|
+| **Fully Local** | localhost | Ollama (localhost) | Single investigator, offline |
+| **Airgapped Network** | Internal server | Internal server | Forensic lab, sensitive cases |
+| **Hybrid** | Local | Cloud API | Balance of privacy and capability |
+| **Full Cloud** | Cloud | Cloud API | Team access, scalability |
+
+## Use Case Examples
+
+### Example 1: Cryptocurrency Investigation
+
+**Query:** *"Find discussions about buying software with cryptocurrency"*
+
+**Semeion finds:**
+
+- Chat messages mentioning crypto payments
+- Browser visits to exchange websites
+- Wallet-related searches
+
+**Context view reveals:**
+
+- File downloads after payment discussions
+- Application installations
+- Network connections to blockchain services
+
+### Example 2: Data Exfiltration
+
+**Query:** *"Messages about sending confidential documents"*
+
+**Semeion finds:**
+
+- Emails discussing document sharing
+- Chat messages about file transfers
+
+**Context view reveals:**
+
+- File access events before discussions
+- Cloud storage uploads after discussions
+- USB device connections
+
+### Example 3: Timeline Reconstruction
+
+**Query:** *"Threatening messages received in March"*
+
+**Semeion finds:**
+
+- Messages matching threatening language patterns
+
+**Context view reveals:**
+
+- What the recipient searched afterward
+- Files accessed or deleted
+- Communication with others about the threat
+
+## Technical Stack
+
+| Component | Technology | Purpose |
+|-----------|------------|---------|
+| GUI Framework | PySide6 | Desktop application |
+| Vector Database | Qdrant | Semantic search and storage |
+| LLM Interface | OpenAI-compatible API | Query interpretation |
+| Embedding API | OpenAI-compatible API | Vector generation |
+| Forensic Parsing | pytsk3, pyewf | Disk image processing |
+| Language | Python 3.13 (PySide6-restricted) | Application logic |
+
+## Data Model
+
+### SemeionArtifact (Simplified)
+
+Every artifact — regardless of source platform — conforms to a universal schema:
+
+```text
+┌─────────────────────────────────────────────────────────────────────────┐
+│                             SemeionArtifact                             │
+├─────────────────────────────────────────────────────────────────────────┤
+│                                                                         │
+│  Identity:         id, case_id                                          │
+│  Classification:   artifact_class, source_platform, searchable          │
+│  Temporal:         timestamp, timestamp_precision                       │
+│  Actors:           [{identifier, display_name, role}]                   │
+│  Content:          text, semantic_text                                  │
+│  Entities:         indexed_entities[] (for filtering)                   │
+│  Hierarchy:        parent_id, chunk_info (for documents)                │
+│  Context:          context_group (conversation, thread, session)        │
+│  Location:         url, path, title                                     │
+│  Source-Specific:  message{}, browser{}, email{}, document{}, etc.      │
+│  Ingestion:        ingested_at, source_file, parser_id                  │
+│                                                                         │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+### Vector Strategy
+
+| Vector | Purpose | Required |
+|--------|---------|----------|
+| **Semantic** | Conceptual similarity search | Yes |
+| **Sparse** (keywords) | Exact term matching (hybrid) | Optional |
+
+Timeline-only artifacts use a placeholder vector and are excluded from search but included in context views.
+
+## What Semeion Is NOT (yet)
+
+| Not This | Why |
+|----------|-----|
+| Forensic suite replacement | Companion tool — use alongside Autopsy/Axiom |
+| Reporting Tool | Review and analyse findings here; document them in your primary application |
+| Forensic Interpretation Robot | Helps you discover what you otherwise wouldn't |

 ## Development Setup

-This project uses [uv](https://github.com/astral-sh/uv) for fast dependency management and the modern "Src Layout" structure.
+This project uses [uv](https://github.com/astral-sh/uv) for dependency management.

 ### Prerequisites

 - Python 3.13+
-- [uv](https://github.com/astral-sh/uv) installed (`curl -LsSf https://astral.sh/uv/install.sh | sh`)
+- [uv](https://github.com/astral-sh/uv) installed

-### Installation Steps
-
-1. **Clone the repository**
-
-   ```bash
-   git clone
-   cd semeion
-   ```
-
-2. **Create a virtual environment**
-   This project requires Python 3.13.
-
-   ```bash
-   uv venv --python 3.13
-   ```
-
-3. **Environment**
-
-   - Linux/macOS:
-
-     ```bash
-     source .venv/bin/activate
-     ```
-
-   - Windows:
-
-     ```powershell
-     .venv\Scripts\activate
-     ```
-
-4. **Dependencies**
-   This command installs locked dependencies and links the local `semeion` package in editable mode.
-
-   ```bash
-   uv pip install -r requirements.txt -e .
-   ```
-
-### Running the Application
+### Installation

 ```bash
-python src/semeion/main.py
+git clone
+cd semeion
+
+# virtual environment
+uv venv --python 3.13
+
+# activate environment
+source .venv/bin/activate # Linux/macOS
+
+# dependencies
+uv pip install -r requirements.txt -e .
 ```

-### Running Tests
+### Running
+
+uv handles the startup script:

 ```bash
-pytest
+semeion
 ```

-## Technical Stack
+## System Requirements

-### Core Technologies
+### Minimum (Remote Processing)

-| Component | Technology | Version | License | Purpose |
-|----------------------|---------------------------|-----------|--------------------------------|--------------------------------|
-| GUI Framework | PySide6 | TBD | LGPL | Desktop application interface |
-| Vector Database | Qdrant | 1.10+ | Apache 2.0 | Semantic search and storage |
-| LLM Inference | OpenAI-compatible API | various | various | Natural language understanding |
-| LLM Models | various | various | various | Query parsing |
-| Embeddings | OpenAI-compatible API | various | various | Semantic vectors |
-| Embedding Model | TBD | TBD | TBD | Text to vector conversion |
-| Forensic Parsing | pytsk3 | TBD | Apache 2.0 | Disk image processing |
-| Image Handling | pyewf | TBD | LGPL | E01 image support |
-| NLP | spaCy | TBD | MIT | Entity extraction |
-| Programming Language | Python | 3.13+ | PSF | Application logic |
+| Resource | Requirement |
+|----------|-------------|
+| CPU | Multi-core |
+| RAM | 4 GB |
+| Storage | Minimal (evidence stored elsewhere) |
+| Network | Access to Qdrant and LLM endpoints |

-### Infrastructure Requirements
+### Recommended (Local Processing)

-#### Remote Processing
+| Resource | Requirement |
+|----------|-------------|
+| CPU | 8+ cores |
+| RAM | 32 GB |
+| Storage | sufficient for evidence & vectors, LLM if installed locally |
+| GPU | optional (improves embedding speed) |

-- CPU: multi-Core
-- RAM: ~4GiB
-- Storage: insignificant, local client, possible evidence
-- GPU: irrelevant
-- remote infrastructure for compute
+## Project Status

-#### Small scale local processing (local workstation)
+**Current Phase:** Architecture and data model definition

-- CPU: 8+ cores (modern Ryzen5/Ryzen7/Ryzen AI series)
-- RAM: 32GB minimum (16GB for Qdrant, 8GB for LLM, 8GB for OS/app)
-- Storage: 2TB SSD (1TB evidence + 1TB index)
-- GPU: recommended, not required (speed considerations)
+**Roadmap:**

-#### Recommended Configuration (Remote Server Compute Node)
-
-- CPU: 16+ cores (Ryzen7+)
-- RAM: 64-128 GiB
-- Storage: sufficient for index+evidence
-- GPU: recommended, AMD Instinct MI50 32 GiB or better (Inference, Embeddings)
-- Network: sufficient
-
-#### Enterprise Configuration (Multi-User)
-
-- TBD, out of scope
-
-## Supported Ingestion Formats
-
-### Primary: Specialized Data Objects
-
-TBD
-
-### Secondary: Conversion Engine (algorithmic)
-
-Example:
-
-- SQLite Parser for browser History -> Special Data Object
-- Converter for TSK artifacts -> Metadata in Special Data Object (TBD)
-
-## Use Case Scenarios
-
-### Scenario 1: Drug Transaction Investigation
-
-Query: "Find when the suspect made cryptocurrency payments after discussing deals" Process:
-
-1. System finds chat messages about drug deals
-2. Extracts timestamps of deal discussions
-3. Searches for cryptocurrency-related activity after each discussion
-4. Correlates wallet launches, browser activity, blockchain transactions
-5. Presents timeline showing: discussion → wallet launch → transaction Evidence: Timeline of intent → action, strengthening case
-
-### Scenario 2: Data Exfiltration
-
-Query: "Show file access before large uploads to cloud storage" Process:
-
-1. Identifies cloud storage upload events
-2. Looks backward in time for file access
-3. Correlates accessed files with uploaded data
-4. Maps file paths to user actions Evidence: Demonstrates what data was taken and when
-
-### Scenario 3: Coordinated Activity
-
-Query: "Find people who communicated privately and are also in the same groups" Process:
-
-1. Extracts participants from private messages
-2. Extracts participants from group chats
-3. Identifies overlap (intersection)
-4. Shows sample conversations from each context Evidence: Demonstrates coordinated behavior across communication channels
-
-### Scenario 4: Timeline Reconstruction
-
-Query: "What happened between receiving the threatening email and deleting files?" Process:
-
-1. Finds threatening email (semantic search)
-2. Finds file deletion events (system logs)
-3. Returns all artifacts between these timestamps
-4. Visualizes complete timeline Evidence: Establishes sequence of events and potential motive
+1. ✅ Concept and schema design
+2. ⬜ Core infrastructure (Qdrant collection, basic ingestion)
+3. ⬜ Search execution (semantic search, filtering)
+4. ⬜ LLM integration (query interpretation)
+5. ⬜ Refinement system (Qdrant Recommend)
+6. ⬜ Context view
+7. ⬜ Platform parsers (WhatsApp, Chrome, etc.)
+8. ⬜ Hybrid search (sparse vectors)

 ## Branding

-**Semeion** derives from the Greek σημεῖον (semeion), meaning "sign," "signal," or "meaningful mark"—a name that embodies the software's core mission of interpreting semantic signals hidden within digital evidence. The mascot, **Koios** (Coeus), the Titan of intellect, inquiry, and questioning in Greek mythology, represents the investigative mindset at the heart of forensic work: the relentless pursuit of hidden knowledge through analytical thought. Just as Koios personified the axis of heavenly inquiry, Semeion serves as the pivot point for discovering meaningful patterns across temporal and semantic dimensions. Together, the name and mascot bridge ancient wisdom traditions of interpretation with cutting-edge vector embedding technology, reflecting the software's philosophy that effective investigation requires both sophisticated computational tools and human insight—finding the signs that matter in an ocean of data.
+**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark" — embodying the software's mission of interpreting semantic signals within digital evidence.
+
+The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden knowledge through analytical thought.

 ## License

-BSD 3-Clause (subject to change during development)
+BSD 3-Clause
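
Review note on the data model and the temporal context view introduced by this patch: the README says timeline-only artifacts are excluded from search but still appear within ±N minutes of a discovered artifact. A minimal sketch of that selection step, using only the Python standard library, could look like the following. The `Artifact` class is a hypothetical, trimmed-down stand-in for `SemeionArtifact` (only a few of the listed fields), and `context_window` is an illustrative helper name, not part of the project or of Qdrant.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Artifact:
    """Hypothetical, reduced stand-in for the README's SemeionArtifact."""
    id: str
    case_id: str
    artifact_class: str   # e.g. "message", "file_event"
    source_platform: str
    searchable: bool      # False for timeline-only artifacts
    timestamp: datetime
    text: str = ""


def context_window(artifacts, anchor, minutes):
    """Return same-case artifacts within +/- `minutes` of the anchor, oldest first.

    Timeline-only artifacts (searchable=False) are deliberately kept, since
    the context view is where they are meant to show up.
    """
    delta = timedelta(minutes=minutes)
    nearby = [
        a for a in artifacts
        if a.case_id == anchor.case_id
        and a.id != anchor.id
        and abs(a.timestamp - anchor.timestamp) <= delta
    ]
    return sorted(nearby, key=lambda a: a.timestamp)


# Illustrative data only: one search hit plus surrounding system events.
t0 = datetime(2025, 1, 10, 12, 0)
hit = Artifact("m1", "case-1", "message", "whatsapp", True, t0, "pay in crypto")
timeline = [
    hit,
    Artifact("f1", "case-1", "file_event", "sleuthkit", False, t0 + timedelta(minutes=3)),
    Artifact("p1", "case-1", "process_event", "windows", False, t0 - timedelta(minutes=2)),
    Artifact("n1", "case-1", "network_event", "windows", False, t0 + timedelta(hours=2)),
    Artifact("f2", "case-2", "file_event", "sleuthkit", False, t0 + timedelta(minutes=1)),
]
context_ids = [a.id for a in context_window(timeline, hit, minutes=5)]
print(context_ids)
```

In an actual deployment this filter would presumably run as a timestamp range filter inside the vector database rather than in client memory; the in-memory version is only meant to make the intended semantics concrete.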