rework concept

2025-12-17 22:08:13 +01:00
parent ea460864a6
commit 096b9868a7

379
README.md

@@ -4,189 +4,231 @@
## Overview ## Overview
Semeion is a **semantic search companion tool** for digital forensics investigators. It does not replace traditional forensic suites (Autopsy, Axiom, X-Ways) but augments them by enabling natural language queries across communication artifacts to quickly identify areas of interest. Semeion is a **timeline-first digital forensics analysis platform** that solves the two biggest pain points in modern investigations: navigating massive communication datasets and correlating events across data sources, including filesystem artifacts and OS events.
> **Core Question Semeion Answers:** "Where should I look?" > **Core Question Semeion Answers:** "What happened around this time—and how are these events connected?"
An investigator opens Semeion alongside their primary forensic tool, types a query like *"discussions about cryptocurrency payments"*, reviews ranked results across chat messages, browser history, emails, and documents, then uses those findings to guide deeper analysis in their forensic suite. Traditional forensic tools often have an unintuitive or slow user experience (an opinionated assessment). Semeion, in contrast, aims to provide a fast, intelligent timeline that automatically clusters semantically related events, enabling investigators to see patterns at a glance rather than clicking through endless individual entries.
([Intro Video](https://cloud.cc24.dev/s/pNwNWJE9QkDiX3J)) An investigator opens Semeion alongside their primary forensic tool, either searches for relevant content (like "*discussions about data transfer*" in Russian messages) or navigates to a known timestamp (from incident logs), and immediately sees all correlated activity—messages, file access, network connections, system events—organized into meaningful sequences rather than flat lists.
## Key Features ## Key Features
| Feature | Description | | Feature | Description |
|---------|-------------| |---------|-------------|
| **Natural Language Search** | Query artifacts using plain English instead of keywords | | **Timeline Visualization** | Render millions of events with <200ms response time; smooth zooming/panning |
| **LLM-Assisted Interpretation** | Queries are parsed by an LLM into structured search parameters | | **Semantic Event Clustering** | Automatically group related events into patterns (file transfers, communication bursts, login sequences) |
| **Human-in-the-Loop Confirmation** | Review and edit interpreted search parameters before execution | | **Multilingual Semantic Search** | Find concepts across languages—search in English, find results in Russian/Chinese/Arabic with translation |
| **Semantic + (Hybrid Search)** | Combines meaning-based vector search with keyword matching | | **Two-Mode Operation** | Semantic content discovery → timeline analysis OR timestamp → content correlation |
| **Interactive Refinement** | Mark results as relevant [+] or irrelevant [-], then refine via Qdrant's Recommend API | | **Cross-Source Correlation** | Unified timeline across messages, files, network, browser, system events |
| **Temporal Context View** | See what happened before and after a discovered artifact | | **Event Grouping** | Collapse/expand semantically-related events; discover patterns |
| **Universal Artifact Model** | Platform-agnostic — can adapt to any forensic data from external sources, expandable concept | | **Native Desktop Application** | PySide6-based—runs fully local, tailored for forensic-grade airgapped environments |
| **Flexible Deployment** | Runs fully local or on airgapped forensic networks (or with cloud infrastructure if anyone would do such a thing) | | **Open Source & Transparent** | Auditable methods, minimize "blackbox" AI, court-defensible results |
## Artifact Types (for first PoC) ## The Problem Semeion Solves
### Searchable (Semantic Vector) ### Problematic UX with existing forensic suites
These artifacts are embedded and searchable via natural language: Existing forensic tools have timeline interfaces that:
| Type | Examples | - Render slowly
|------|----------| - Show flat event lists with no context
| **Messages** | WhatsApp, Telegram, Signal, or anything which has a parseable SQLite database | - Require manual correlation across sources
| **Browser Events** | Chrome, Firefox, Safari, Edge — history, downloads, bookmarks, searches | - Have poor visualization and navigation
| **Email** | Any email files (simplified: sender, receiver, subject, body, timestamp) | *This is a highly opinionated view based on the maintainer's personal experience with multiple large commercial suites that weren't able to deliver a satisfying result. I won't name any here, and exceptions may exist, but improvements are needed in this area.*
| **Documents** | PDF, Word, plain text |
### Timeline-Only (Context View) **Semeion's Solution:** Fast, interactive timeline with semantic clustering that shows patterns immediately.
These artifacts appear in the temporal context view but are not semantically searchable: ### Large Communication Datasets
| Type | Examples | Modern investigations involve:
|------|----------|
| **File Events** | File creation, modification, deletion, access (via Sleuthkit) | - Seized databases with millions of relevant artifacts
| **Process Events** | Application launches, process creation | - Multilingual content
| **Network Events** | Connections, DNS queries | - Slang, code words, and poor machine translation
| **Registry Events** | Windows registry modifications | - Hours spent reading irrelevant conversations
| **System Events** | Logs, authentication events |
**Semeion's Solution:** Semantic search that understands meaning, handles multiple languages, and jumps directly to timeline context.
## How It Works ## How It Works
![resources/workflow.png](resources/workflow.png) ### Two Entry Points
#### **Entry Point 1: Semantic content discovery → timeline analysis**
```bash
1. Investigator searches: "discussions about file transfers"
2. Semeion finds semantically relevant messages
3. Click any result → Timeline centers on that timestamp
4. See all activity ±2 hours: file access, USB connections, deletions
5. Semantic clustering highlights: "File Exfiltration Sequence" pattern
```
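The "±2 hours" step in Entry Point 1 amounts to a time-window query over the metadata store. The following is only a minimal sketch, assuming a hypothetical SQLite table `artifacts` with an integer `ts` column (UTC epoch seconds); the table layout, column names, and the `events_around` helper are illustrative and not taken from the Semeion codebase.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def events_around(conn, center, window=timedelta(hours=2)):
    """Return all artifacts whose timestamp lies within center +/- window."""
    lo = int((center - window).timestamp())
    hi = int((center + window).timestamp())
    return conn.execute(
        "SELECT id, ts, artifact_class, text FROM artifacts "
        "WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (lo, hi),
    ).fetchall()


# Hypothetical in-memory data standing in for an ingested case.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifacts ("
    "id INTEGER PRIMARY KEY, ts INTEGER, artifact_class TEXT, text TEXT)"
)
hit = datetime(2024, 3, 15, 14, 22, tzinfo=timezone.utc)  # timestamp of a search hit
sample = [
    (hit - timedelta(hours=3), "message", "unrelated chat"),
    (hit - timedelta(minutes=30), "message", "send the files tonight"),
    (hit, "file", "project_data.zip copied"),
    (hit + timedelta(minutes=90), "system", "USB device removed"),
    (hit + timedelta(hours=5), "browser", "news site"),
]
conn.executemany(
    "INSERT INTO artifacts (ts, artifact_class, text) VALUES (?, ?, ?)",
    [(int(t.timestamp()), cls, txt) for t, cls, txt in sample],
)

context = events_around(conn, hit)
for _id, ts, cls, txt in context:
    print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(), cls, txt)
```

Storing timestamps as integers keeps the range query index-friendly; an index on `ts` would likely be needed for large cases.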
#### **Entry Point 2: timestamp → content correlation**
```bash
1. Incident log shows: "Ransomware encrypted files at 2024-03-15 14:22:00"
2. Navigate timeline to that timestamp
3. Semantic clustering reveals:
- Communication spike 30min before (cluster: "Coordination Activity")
- Suspicious file downloads (cluster: "Malware Delivery")
- Network connections (cluster: "C2 Communication")
- Process executions leading to encryption
4. Complete attack chain visible at a glance
```
## Semantic Event Clustering
**Traditional Timeline View (Overwhelming):**
```bash
14:33:15 - USB device connected
14:33:18 - Chrome visited facebook.com
14:33:22 - File modified: report.docx
14:33:30 - File copied to USB: project_data.zip (2.3 GB)
14:33:35 - File copied to USB: budget.xlsx
14:33:40 - File deleted: project_data.zip
14:33:45 - WhatsApp message sent
...50 more individual events...
```
**Semeion Clustered View (Clear Pattern):**
```bash
┌──────────────────────────────────────────────┐
│ 14:33:15-14:33:40 ⚠️ File Transfer (4) │ ← Click to expand
│ Pattern: Data Exfiltration Sequence │
│ USB connected → 2 files copied → source deleted │
│ Related: Files mentioned in chat 2min earlier │
└──────────────────────────────────────────────┘
```
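The clustered view above could be produced by a simple ordered-sequence matcher. This is a rough sketch under assumptions: the `Event` class, the event kind labels, and `find_usb_exfiltration` are hypothetical names chosen for illustration, and the rule (USB connected, then one or more copies, then deletion of a copied source file) is one plausible reading of the "Data Exfiltration Sequence" pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: str          # HH:MM:SS on the same day, so string order equals time order
    kind: str          # hypothetical event kind label
    target: str = ""   # file name, URL, or app the event refers to


def find_usb_exfiltration(events):
    """Match: USB connected -> file(s) copied to USB -> a copied source file deleted."""
    usb = None
    copied = []
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "usb_connected" and usb is None:
            usb = ev
        elif ev.kind == "file_copied_to_usb" and usb is not None:
            copied.append(ev)
        elif ev.kind == "file_deleted" and any(c.target == ev.target for c in copied):
            return [usb, *copied, ev]
    return []  # pattern not present


# The events from the flat timeline view above, with hypothetical kind labels.
events = [
    Event("14:33:15", "usb_connected"),
    Event("14:33:18", "browser_visit", "facebook.com"),
    Event("14:33:22", "file_modified", "report.docx"),
    Event("14:33:30", "file_copied_to_usb", "project_data.zip"),
    Event("14:33:35", "file_copied_to_usb", "budget.xlsx"),
    Event("14:33:40", "file_deleted", "project_data.zip"),
    Event("14:33:45", "message_sent", "WhatsApp"),
]

cluster = find_usb_exfiltration(events)
print(f"{cluster[0].time}-{cluster[-1].time} File Transfer ({len(cluster)})")
```

Unrelated events (the browser visit, the document edit, the chat message) stay outside the cluster, which is what lets the timeline collapse four events into one entry.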
### How Clustering Works
1. **Temporal Proximity**: Events within configurable time window evaluated together
2. **Semantic Similarity**: Vector embeddings detect conceptually-related events
3. **Entity Linking**: Shared file names, IPs, usernames connect events
4. **Pattern Templates**: Pre-defined sequences (USB exfiltration, login chains, etc.)
5. **LLM Summarization** (Optional): Natural language cluster descriptions
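Steps 1 and 3 of the list above (temporal proximity and entity linking) can be sketched as a greedy grouping pass. This is an illustration only, not the project's algorithm: the `Ev` class, the `cluster_events` function, the 60-second default window, and the entity labels are all assumptions. Semantic similarity (step 2) could be added as a further linking criterion next to the shared-entity check.

```python
from dataclasses import dataclass


@dataclass
class Ev:
    ts: float                        # seconds since epoch (UTC)
    label: str
    entities: frozenset = frozenset()  # file names, IPs, usernames, devices


def cluster_events(events, window_s=60.0):
    """Join an event to a cluster when it is close in time to that cluster's
    last event and shares at least one entity with the cluster."""
    clusters = []  # each cluster is a list of Ev in time order
    for ev in sorted(events, key=lambda e: e.ts):
        for cl in reversed(clusters):              # prefer the most recent cluster
            if ev.ts - cl[-1].ts > window_s:
                continue                           # too far apart in time
            known = set().union(*(e.entities for e in cl))
            if ev.entities & known:                # shared entity links the events
                cl.append(ev)
                break
        else:
            clusters.append([ev])                  # nothing matched: start a new cluster
    return clusters


events = [
    Ev(0, "USB connected", frozenset({"usb:E:"})),
    Ev(3, "Chrome visited facebook.com", frozenset({"facebook.com"})),
    Ev(15, "File copied to USB: project_data.zip", frozenset({"usb:E:", "project_data.zip"})),
    Ev(25, "File deleted: project_data.zip", frozenset({"project_data.zip"})),
    Ev(500, "File opened: project_data.zip", frozenset({"project_data.zip"})),
]

clusters = cluster_events(events)
for cl in clusters:
    print(len(cl), [e.label for e in cl])
```

The last event shares an entity with the first cluster but falls outside the time window, so it starts its own cluster; this shows why both criteria are evaluated together.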
## Architecture ## Architecture
### Client-Server Design ### Desktop Application (PySide6)
```bash ```bash
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│                         ┌─────────────────────┐                         │
│                         │   PySide6 Client    │                         │
│                         │   (Investigator     │                         │
│                         │    Workstation)     │                         │
│                         └──────────┬──────────┘                         │
│                                    │                                    │
│                    ┌───────────────┴───────────────┐                    │
│                    │                               │                    │
│         ┌─────────────────────┐         ┌──────────────────────┐        │
│         │     Qdrant API      │         │       LLM API        │        │
│         │                     │         │ (OpenAI-compatible)  │        │
│         └─────────────────────┘         └──────────────────────┘        │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                   PySide6 Desktop UI                    │
│  ┌───────────────┐  ┌──────────────┐  ┌─────────────┐   │
│  │   Timeline    │  │   Semantic   │  │  Artifact   │   │
│  │     View      │  │    Search    │  │   Detail    │   │
│  └───────────────┘  └──────────────┘  └─────────────┘   │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│              Analysis & Clustering Engine               │
│  • Event clustering algorithm                           │
│  • Semantic similarity calculation                      │
│  • Cross-artifact correlation                           │
│  • Timeline rendering engine                            │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│                       Data Layer                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │    Qdrant    │  │    SQLite    │  │   LLM API    │   │
│  │  (Vectors)   │  │  (Metadata)  │  │  (Optional)  │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
``` ```
### Deployment Options ### Deployment Options
| Configuration | Qdrant | LLM | Use Case | | Configuration | Qdrant | Embedding | LLM | Use Case |
|---------------|--------|-----|----------| |---------------|--------|-----------|-----|----------|
| **Fully Local** | localhost | Ollama (localhost) | Single investigator, offline | | **Fully Local** | localhost | Ollama/local model | Ollama (optional) | Single investigator, offline |
| **Airgapped Network** | Internal server | Internal server | Forensic lab, sensitive cases | | **Airgapped Network** | Internal server | Internal server | Internal server | Proper forensic environment |
| **Hybrid** | Local | Cloud API | Balance of privacy and capability | | **Hybrid** | Local | Local | Cloud API (optional) | Investigative environment without a focus on confidentiality |
| **Full Cloud** | Cloud | Cloud API | Team access, scalability |
## Use Case Examples ## Artifact Types
### Example 1: Cryptocurrency Investigation ### Searchable & Timeline-Enabled
**Query:** *"Find discussions about buying software with cryptocurrency"* | Type | Examples | Features |
|------|----------|----------|
| **Messages** | WhatsApp, Telegram, Signal, SMS | Semantic search, multilingual, timeline clustering |
| **Browser** | Chrome, Firefox, Safari, Edge | History, downloads, searches—timeline correlation |
| **Email** | Any email format (EML, MBOX, PST) | Searchable content, timeline placement |
| **Documents** | PDF, Word, plain text | Content search, access time correlation |
**Semeion finds:** ### Timeline-Only (Context View)
- Chat messages mentioning crypto payments | Type | Examples |
- Browser visits to exchange websites |------|----------|
- Wallet-related searches | **File Events** | File creation, modification, deletion, access |
| **Process Events** | Application launches, process creation |
**Context view reveals:** | **Network Events** | Connections, DNS queries |
| **System Events** | Logs, authentication events, USB connections |
- File downloads after payment discussions
- Application installations
- Network connections to blockchain services
### Example 2: Data Exfiltration
**Query:** *"Messages about sending confidential documents"*
**Semeion finds:**
- Emails discussing document sharing
- Chat messages about file transfers
**Context view reveals:**
- File access events before discussions
- Cloud storage uploads after discussions
- USB device connections
### Example 3: Timeline Reconstruction
**Query:** *"Threatening messages received in March"*
**Semeion finds:**
- Messages matching threatening language patterns
**Context view reveals:**
- What the recipient searched afterward
- Files accessed or deleted
- Communication with others about the threat
## Technical Stack ## Technical Stack
| Component | Technology | Purpose | | Component | Technology | Purpose |
|-----------|------------|---------| |-----------|------------|---------|
| GUI Framework | PySide6 | Desktop application | | **UI Framework** | PySide6 (Qt6) | Native desktop application |
| Vector Database | Qdrant | Semantic search and storage | | **Timeline Rendering** | PyQtGraph or Plotly | High-performance visualization |
| LLM Interface | OpenAI-compatible API | Query interpretation | | **Vector Database** | Qdrant | Semantic search and similarity |
| Embedding API | OpenAI-compatible API | Vector generation | | **Metadata Storage** | SQLite | Fast local queries, metadata |
| Forensic Parsing | pytsk3, pyewf | Disk image processing | | **LLM Interface** | OpenAI-compatible API | Optional: cluster summarization, query interpretation |
| Language | Python 3.13 (PySide6-restricted) | Application logic | | **Embedding** | Sentence Transformers / OpenAI | Multilingual vector generation |
| **Forensic Parsing** | pytsk3, pyewf | Disk image processing |
| **Language** | Python 3.13+ | Application logic |
## Data Model ## Data Model
### SemeionArtifact (Simplified) ### SemeionArtifact (Universal Schema, WiP)
Every artifact, regardless of source platform, conforms to a universal schema: Every artifact, regardless of source, conforms to a unified model:
```bash ```bash
┌───────────────────────────────────────────────────────────────────────── ┌─────────────────────────────────────────────────────────┐
│ SemeionArtifact │ │ SemeionArtifact │
├───────────────────────────────────────────────────────────────────────── ├─────────────────────────────────────────────────────────┤
│ │
│ Identity: id, case_id │ │ Identity: id, case_id │
│ Classification: artifact_class, source_platform, searchable │ Classification: artifact_class, source_platform │
│ Temporal: timestamp │ Temporal: timestamp (UTC normalized)
│ Actors: [{identifier, display_name, role}] │ Actors: [{identifier, display_name, role}]
│ Content: text, semantic_text │ │ Content: text, semantic_text │
│ Entities: indexed_entities[] (for filtering) │ Entities: indexed_entities[] (files, IPs, etc)
│ Hierarchy: parent_id, chunk_info (for documents) │ Hierarchy: parent_id, context_group
│ Context: context_group (conversation, thread, session)
│ Location: url, path, title │ │ Location: url, path, title │
Source-Specific: message{}, browser{}, email{}, document{}, etc. Embeddings: semantic_vector (768-dim)
Ingestion: ingested_at, source_file, parser_id Source-Specific: message{}, browser{}, email{}, etc
│ │ └─────────────────────────────────────────────────────────┘
└─────────────────────────────────────────────────────────────────────────┘
``` ```
### Vector Strategy ### Cluster Model (WiP)
| Vector | Purpose | Required | ```bash
|--------|---------|----------| ┌─────────────────────────────────────────────────────────┐
| **Semantic** | Conceptual similarity search | Yes | │ EventCluster │
| **Sparse** (keywords) | Exact term matching (hybrid) | Optional | ├─────────────────────────────────────────────────────────┤
│ id: unique_cluster_id │
Timeline-only artifacts use a placeholder vector and are excluded from search but included in context views. │ time_range: (start_timestamp, end_timestamp)
│ events: [artifact_ids]
│ pattern_type: "file_exfiltration" | "communication"
│ confidence: 0.0-1.0 │
│ summary: "USB transfer with deletion"
│ icon: "⚠️" | "💬" | "🌐" | etc │
│ semantic_links: [related_cluster_ids]
└─────────────────────────────────────────────────────────┘
```
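The cluster model above could likewise be sketched as a dataclass with basic validation. The field names follow the diagram; the types, the validation rules, and the `duration_s` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EventCluster:
    id: str
    time_range: tuple[datetime, datetime]   # (start_timestamp, end_timestamp)
    events: list[str]                        # artifact ids
    pattern_type: str                        # e.g. "file_exfiltration", "communication"
    confidence: float                        # 0.0 to 1.0
    summary: str = ""
    icon: str = ""
    semantic_links: list[str] = field(default_factory=list)  # related cluster ids

    def __post_init__(self):
        start, end = self.time_range
        if start > end:
            raise ValueError("time_range start must not be after end")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")

    @property
    def duration_s(self) -> float:
        """Length of the cluster's time range in seconds."""
        start, end = self.time_range
        return (end - start).total_seconds()


cluster = EventCluster(
    id="c-0001",
    time_range=(
        datetime(2024, 3, 15, 14, 33, 15, tzinfo=timezone.utc),
        datetime(2024, 3, 15, 14, 33, 40, tzinfo=timezone.utc),
    ),
    events=["a1", "a2", "a3", "a4"],
    pattern_type="file_exfiltration",
    confidence=0.8,
    summary="USB transfer with deletion",
)
print(cluster.pattern_type, len(cluster.events), cluster.duration_s)
```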
## What Semeion Is NOT (yet) ## What Semeion Is NOT (yet)
| Not This | Why | | Not This | Why |
|----------|-----| |----------|-----|
| Forensic suite replacement | Companion tool; use alongside Autopsy/Axiom | | **Forensic suite replacement** | Companion tool; use alongside Autopsy for acquisition |
| Reporting Tool | Review and analyse findings, documents in primary application | | **Reporting tool** | Timeline export for reports, but documentation happens in primary suite |
| Forensic Interpretation Robot | Helps you to discover what you otherwise wouldn't | | **AI evidence interpreter** | AI assists with search/clustering; investigator interprets evidence |
## Development Setup ## Development Setup
@@ -196,11 +238,12 @@ This project uses [uv](https://github.com/astral-sh/uv) for dependency managemen
- Python 3.13+ - Python 3.13+
- [uv](https://github.com/astral-sh/uv) installed - [uv](https://github.com/astral-sh/uv) installed
- requirements.txt
### Installation ### Installation
```bash ```bash
git clone <repository-url> git clone https://git.cc24.dev/mstoeck3/semeion
cd semeion cd semeion
# Create virtual environment # Create virtual environment
@@ -215,13 +258,11 @@ uv pip install -r requirements.txt -e .
# Configure environment # Configure environment
cp .env.example .env cp .env.example .env
# Edit .env with your Qdrant and LLM endpoint configurations # Edit .env with your Qdrant and embedding endpoint configurations
``` ```
### Running ### Running
uv handles the startup script:
```bash ```bash
semeion semeion
``` ```
@@ -232,41 +273,83 @@ semeion
| Resource | Requirement | | Resource | Requirement |
|----------|-------------| |----------|-------------|
| CPU | Multi-core | | CPU | 4 cores |
| RAM | 4 GB | | RAM | 8 GB |
| Storage | Minimal (evidence stored elsewhere) | | Storage | Minimal (evidence stored elsewhere) |
| Network | Access to Qdrant and LLM endpoints | | GPU | Not required |
| Network | Access to Qdrant and embedding endpoints (if remote) |
### Recommended (Local Processing) ### Recommended (Local Processing)
| Resource | Requirement | | Resource | Requirement |
|----------|-------------| |----------|-------------|
| CPU | 8+ cores | | CPU | 8+ cores |
| RAM | 32 GB | | RAM | 16 GB (32 GB for large cases) |
| Storage | sufficient for evidence & vectors, LLM if installed locally | | Storage | SSD, sufficient for evidence & vectors |
| GPU | optional (improves embedding speed) | | GPU | Optional (improves embedding speed with local models) |
| Network | Optional (fully offline capable) |
## Project Status ## Project Status
**Current Phase:** Architecture and data model definition **Current Phase:** MVP Development - Timeline & Core Features
**Roadmap:** **Roadmap:**
1. ✅ Concept and schema design 1. ✅ Concept and architecture design
2. ⬜ Core infrastructure (Qdrant collection, basic ingestion) 2. ⬜ Core infrastructure
3. ⬜ Search execution (semantic search, filtering) - Unified artifact ingestion (WhatsApp, Chrome)
4. ⬜ LLM integration (query interpretation) - SQLite + Qdrant integration
5. ⬜ Refinement system (Qdrant Recommend) 3. ⬜ Timeline visualization
6. ⬜ Context view - High-performance rendering (target: <200ms for 100k events)
7. ⬜ Platform parsers (WhatsApp, Chrome, etc.) - Multi-source swim lanes
8. ⬜ Hybrid search (sparse vectors) - Zoom/pan navigation
4. ⬜ Semantic clustering
- Temporal proximity grouping
- Pattern template matching
- Semantic similarity detection
5. ⬜ Communication search
- Multilingual embedding
- Query interpretation
- Search → timeline jump
6. ⬜ Additional parsers (Telegram, Signal, etc.)
7. ⬜ Export and reporting
8. ⬜ Performance optimization & polish
## Why Open Source Matters
**Transparency for Court:**
- Auditable algorithms—no "black box" analysis
- Reproducible results—scientific validation possible
- Peer-reviewed methods—community scrutiny
**Accessibility:**
- Free for budget-constrained labs
- No vendor lock-in
- Community-driven development
**Innovation:**
- Rapid feature development
- Specialized extensions possible
- Academic research enabled
## Branding ## Branding
**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark," embodying the software's mission of interpreting semantic signals within digital evidence. **Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark," embodying the software's mission of revealing meaningful patterns within digital evidence through intelligent timeline analysis.
The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden knowledge through analytical thought. The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden connections through analytical thought.
## License ## License
BSD 3-Clause BSD 3-Clause
## Contributing
Semeion is in active development. Contributions welcome, especially from:
- Digital forensics practitioners (workflow validation)
- Timeline visualization experts
- Multilingual NLP specialists
- Performance optimization engineers