rework concept

2025-12-17 22:08:13 +01:00
parent ea460864a6
commit 096b9868a7

389
README.md

@@ -4,189 +4,231 @@
## Overview
Semeion is a **semantic search companion tool** for digital forensics investigators. It does not replace traditional forensic suites (Autopsy, Axiom, X-Ways) but augments them by enabling natural language queries across communication artifacts to quickly identify areas of interest.
Semeion is a **timeline-first digital forensics analysis platform** that solves the two biggest pain points in modern investigations: navigating massive communication datasets and correlating events across data sources, including filesystem artifacts and OS events.
> **Core Question Semeion Answers:** "Where should I look?"
> **Core Question Semeion Answers:** "What happened around this time—and how are these events connected?"
An investigator opens Semeion alongside their primary forensic tool, types a query like *"discussions about cryptocurrency payments"*, reviews ranked results across chat messages, browser history, emails, and documents, then uses those findings to guide deeper analysis in their forensic suite.
Traditional forensic tools often have an unintuitive or slow user experience (an opinionated view). Semeion, in contrast, aims to provide a fast, intelligent timeline that automatically clusters semantically related events, so investigators can see patterns at a glance rather than clicking through endless individual entries.
([Intro Video](https://cloud.cc24.dev/s/pNwNWJE9QkDiX3J))
An investigator opens Semeion alongside their primary forensic tool, either searches for relevant content (like "*discussions about data transfer*" in Russian messages) or navigates to a known timestamp (from incident logs), and immediately sees all correlated activity—messages, file access, network connections, system events—organized into meaningful sequences rather than flat lists.
## Key Features
| Feature | Description |
|---------|-------------|
| **Natural Language Search** | Query artifacts using plain English instead of keywords |
| **LLM-Assisted Interpretation** | Queries are parsed by an LLM into structured search parameters |
| **Human-in-the-Loop Confirmation** | Review and edit interpreted search parameters before execution |
| **Semantic + (Hybrid Search)** | Combines meaning-based vector search with keyword matching |
| **Interactive Refinement** | Mark results as relevant [+] or irrelevant [-], then refine via Qdrant's Recommend API |
| **Temporal Context View** | See what happened before and after a discovered artifact |
| **Universal Artifact Model** | Platform-agnostic — can adapt to any forensic data from external sources, expandable concept |
| **Flexible Deployment** | Runs fully local or on airgapped forensic networks (or with cloud infrastructure if anyone would do such a thing) |
| **Timeline Visualization** | Render millions of events with <200ms response time; smooth zooming/panning |
| **Semantic Event Clustering** | Automatically group related events into patterns (file transfers, communication bursts, login sequences) |
| **Multilingual Semantic Search** | Find concepts across languages—search in English, find results in Russian/Chinese/Arabic with translation |
| **Two-Mode Operation** | Semantic content discovery → timeline analysis OR timestamp → content correlation |
| **Cross-Source Correlation** | Unified timeline across messages, files, network, browser, system events |
| **Event Grouping** | Collapse/expand semantically-related events; discover patterns |
| **Native Desktop Application** | PySide6-based—runs fully local, tailored for forensic-grade airgapped environments |
| **Open Source & Transparent** | Auditable methods, minimize "blackbox" AI, court-defensible results |
## Artifact Types (for first PoC)
## The Problem Semeion Solves
### Searchable (Semantic Vector)
### Problematic UX with existing forensic suites
These artifacts are embedded and searchable via natural language:
Existing forensic tools have timeline interfaces that:
| Type | Examples |
|------|----------|
| **Messages** | WhatsApp, Telegram, Signal, or anything which has a parseable SQLite database |
| **Browser Events** | Chrome, Firefox, Safari, Edge — history, downloads, bookmarks, searches |
| **Email** | Any email files (simplified: sender, receiver, subject, body, timestamp) |
| **Documents** | PDF, Word, plain text |
- Render slowly
- Show flat event lists with no context
- Require manual correlation across sources
- Have poor visualization and navigation
*This is a highly opinionated view based on the maintainer's personal experience with multiple large commercial suites that weren't able to deliver a satisfying result. I won't name any here, and exceptions may exist, but improvements are needed.*
### Timeline-Only (Context View)
**Semeion's Solution:** Fast, interactive timeline with semantic clustering that shows patterns immediately.
These artifacts appear in the temporal context view but are not semantically searchable:
### Large Communication Datasets
| Type | Examples |
|------|----------|
| **File Events** | File creation, modification, deletion, access (via Sleuthkit) |
| **Process Events** | Application launches, process creation |
| **Network Events** | Connections, DNS queries |
| **Registry Events** | Windows registry modifications |
| **System Events** | Logs, authentication events |
Modern investigations involve:
- Seized databases with millions of relevant artifacts
- Multilingual content
- Slang, code words, and poor machine translation
- Hours spent reading irrelevant conversations
**Semeion's Solution:** Semantic search that understands meaning, handles multiple languages, and jumps directly to timeline context.
## How It Works
![resources/workflow.png](resources/workflow.png)
### Two Entry Points
#### **Entry Point 1: Semantic content discovery → timeline analysis**
```bash
1. Investigator searches: "discussions about file transfers"
2. Semeion finds semantically relevant messages
3. Click any result → Timeline centers on that timestamp
4. See all activity ±2 hours: file access, USB connections, deletions
5. Semantic clustering highlights: "File Exfiltration Sequence" pattern
```
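The "±2 hours" lookup in step 4 can be sketched as a simple range query against the SQLite metadata store. This is a minimal illustration only: the `events` table, its columns, and the `events_around` helper are hypothetical, and it assumes timestamps are stored as UTC ISO-8601 strings in one consistent format so that string comparison matches chronological order.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def events_around(conn, center, window=timedelta(hours=2)):
    """Return (timestamp, source, description) rows within center +/- window."""
    start = (center - window).isoformat()
    end = (center + window).isoformat()
    return conn.execute(
        "SELECT timestamp, source, description FROM events "
        "WHERE timestamp BETWEEN ? AND ? ORDER BY timestamp",
        (start, end),
    ).fetchall()


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (timestamp TEXT, source TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2024-03-15T11:00:00+00:00", "browser", "visited news site"),
        ("2024-03-15T13:50:00+00:00", "message", "chat about file transfers"),
        ("2024-03-15T14:22:00+00:00", "filesystem", "project_data.zip copied to USB"),
        ("2024-03-15T18:30:00+00:00", "system", "user logout"),
    ],
)

# Center the window on the timestamp of the clicked search result.
center = datetime(2024, 3, 15, 13, 50, tzinfo=timezone.utc)
nearby = events_around(conn, center)
sources = [row[1] for row in nearby]
```

Here only the chat message and the USB copy fall inside the window; the browser visit and the logout are outside it.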
#### **Entry Point 2: timestamp → content correlation**
```bash
1. Incident log shows: "Ransomware encrypted files at 2024-03-15 14:22:00"
2. Navigate timeline to that timestamp
3. Semantic clustering reveals:
   - Communication spike 30min before (cluster: "Coordination Activity")
   - Suspicious file downloads (cluster: "Malware Delivery")
   - Network connections (cluster: "C2 Communication")
   - Process executions leading to encryption
4. Complete attack chain visible at a glance
```
## Semantic Event Clustering
**Traditional Timeline View (Overwhelming):**
```bash
14:33:15 - USB device connected
14:33:18 - Chrome visited facebook.com
14:33:22 - File modified: report.docx
14:33:30 - File copied to USB: project_data.zip (2.3 GB)
14:33:35 - File copied to USB: budget.xlsx
14:33:40 - File deleted: project_data.zip
14:33:45 - WhatsApp message sent
...50 more individual events...
```
**Semeion Clustered View (Clear Pattern):**
```bash
┌──────────────────────────────────────────────┐
│ 14:33:15-14:33:40 ⚠️ File Transfer (4) │ ← Click to expand
│ Pattern: Data Exfiltration Sequence │
│ USB connected → 2 files copied → source deleted │
│ Related: Files mentioned in chat 2min earlier │
└──────────────────────────────────────────────┘
```
### How Clustering Works
1. **Temporal Proximity**: Events within a configurable time window are evaluated together
2. **Semantic Similarity**: Vector embeddings detect conceptually-related events
3. **Entity Linking**: Shared file names, IPs, usernames connect events
4. **Pattern Templates**: Pre-defined sequences (USB exfiltration, login chains, etc.)
5. **LLM Summarization** (Optional): Natural language cluster descriptions
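
As a rough sketch of how steps 1 and 3 could combine, the snippet below groups events that are close in time and share at least one entity. It is not the project's clustering algorithm: the `Event` class, `cluster_events`, the 30-second window, the entity labels, and the example date are assumptions made for illustration, and semantic similarity, pattern templates, and LLM summarization are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    timestamp: datetime
    description: str
    entities: frozenset = frozenset()  # file names, devices, IPs, usernames


def cluster_events(events, window=timedelta(seconds=30)):
    """Greedy grouping by temporal proximity plus shared entities."""
    clusters = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        for cluster in clusters:
            recent = ev.timestamp - cluster[-1].timestamp <= window
            linked = any(ev.entities & other.entities for other in cluster)
            if recent and linked:
                cluster.append(ev)
                break
        else:
            # No existing cluster is both recent and entity-linked.
            clusters.append([ev])
    return clusters


def at(hour, minute, second):
    # Hypothetical date; only the times come from the example timeline.
    return datetime(2024, 3, 15, hour, minute, second)


events = [
    Event(at(14, 33, 15), "USB device connected", frozenset({"usb"})),
    Event(at(14, 33, 18), "Chrome visited facebook.com", frozenset({"facebook.com"})),
    Event(at(14, 33, 22), "File modified: report.docx", frozenset({"report.docx"})),
    Event(at(14, 33, 30), "File copied to USB: project_data.zip",
          frozenset({"usb", "project_data.zip"})),
    Event(at(14, 33, 35), "File copied to USB: budget.xlsx",
          frozenset({"usb", "budget.xlsx"})),
    Event(at(14, 33, 40), "File deleted: project_data.zip",
          frozenset({"project_data.zip"})),
]

clusters = cluster_events(events)
sizes = [len(c) for c in clusters]
```

With these inputs the USB connection, both copies, and the deletion end up in one group of four, while the browser visit and the unrelated file edit stay on their own.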
## Architecture
### Client-Server Design
### Desktop Application (PySide6)
```bash
┌─────────────────────────────────────────────────────────────────────────┐
│                        ┌─────────────────────┐                          │
│                        │   PySide6 Client    │                          │
│                        │    (Investigator    │                          │
│                        │    Workstation)     │                          │
│                        └──────────┬──────────┘                          │
│                                   │                                     │
│                   ┌───────────────┴───────────────┐                     │
│                   │                               │                     │
│        ┌─────────────────────┐         ┌──────────────────────┐         │
│        │     Qdrant API      │         │       LLM API        │         │
│        │                     │         │ (OpenAI-compatible)  │         │
│        └─────────────────────┘         └──────────────────────┘         │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│  PySide6 Desktop UI                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   Timeline   │  │   Semantic   │  │   Artifact   │   │
│  │     View     │  │    Search    │  │    Detail    │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│  Analysis & Clustering Engine                           │
│  • Event clustering algorithm                           │
│  • Semantic similarity calculation                      │
│  • Cross-artifact correlation                           │
│  • Timeline rendering engine                            │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│  Data Layer                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │    Qdrant    │  │    SQLite    │  │   LLM API    │   │
│  │  (Vectors)   │  │  (Metadata)  │  │  (Optional)  │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
```
### Deployment Options
| Configuration | Qdrant | LLM | Use Case |
|---------------|--------|-----|----------|
| **Fully Local** | localhost | Ollama (localhost) | Single investigator, offline |
| **Airgapped Network** | Internal server | Internal server | Forensic lab, sensitive cases |
| **Hybrid** | Local | Cloud API | Balance of privacy and capability |
| **Full Cloud** | Cloud | Cloud API | Team access, scalability |
| Configuration | Qdrant | Embedding | LLM | Use Case |
|---------------|--------|-----------|-----|----------|
| **Fully Local** | localhost | Ollama/local model | Ollama (optional) | Single investigator, offline |
| **Airgapped Network** | Internal server | Internal server | Internal server | Proper forensic environment |
| **Hybrid** | Local | Local | Cloud API (optional) | Investigative environment where confidentiality is not the main concern |
## Use Case Examples
## Artifact Types
### Example 1: Cryptocurrency Investigation
### Searchable & Timeline-Enabled
**Query:** *"Find discussions about buying software with cryptocurrency"*
| Type | Examples | Features |
|------|----------|----------|
| **Messages** | WhatsApp, Telegram, Signal, SMS | Semantic search, multilingual, timeline clustering |
| **Browser** | Chrome, Firefox, Safari, Edge | History, downloads, searches—timeline correlation |
| **Email** | Any email format (EML, MBOX, PST) | Searchable content, timeline placement |
| **Documents** | PDF, Word, plain text | Content search, access time correlation |
**Semeion finds:**
### Timeline-Only (Context View)
- Chat messages mentioning crypto payments
- Browser visits to exchange websites
- Wallet-related searches
**Context view reveals:**
- File downloads after payment discussions
- Application installations
- Network connections to blockchain services
### Example 2: Data Exfiltration
**Query:** *"Messages about sending confidential documents"*
**Semeion finds:**
- Emails discussing document sharing
- Chat messages about file transfers
**Context view reveals:**
- File access events before discussions
- Cloud storage uploads after discussions
- USB device connections
### Example 3: Timeline Reconstruction
**Query:** *"Threatening messages received in March"*
**Semeion finds:**
- Messages matching threatening language patterns
**Context view reveals:**
- What the recipient searched afterward
- Files accessed or deleted
- Communication with others about the threat
| Type | Examples |
|------|----------|
| **File Events** | File creation, modification, deletion, access |
| **Process Events** | Application launches, process creation |
| **Network Events** | Connections, DNS queries |
| **System Events** | Logs, authentication events, USB connections |
## Technical Stack
| Component | Technology | Purpose |
|-----------|------------|---------|
| GUI Framework | PySide6 | Desktop application |
| Vector Database | Qdrant | Semantic search and storage |
| LLM Interface | OpenAI-compatible API | Query interpretation |
| Embedding API | OpenAI-compatible API | Vector generation |
| Forensic Parsing | pytsk3, pyewf | Disk image processing |
| Language | Python 3.13 (PySide6-restricted) | Application logic |
| **UI Framework** | PySide6 (Qt6) | Native desktop application |
| **Timeline Rendering** | PyQtGraph or Plotly | High-performance visualization |
| **Vector Database** | Qdrant | Semantic search and similarity |
| **Metadata Storage** | SQLite | Fast local queries, metadata |
| **LLM Interface** | OpenAI-compatible API | Optional: cluster summarization, query interpretation |
| **Embedding** | Sentence Transformers / OpenAI | Multilingual vector generation |
| **Forensic Parsing** | pytsk3, pyewf | Disk image processing |
| **Language** | Python 3.13+ | Application logic |
## Data Model
### SemeionArtifact (Simplified)
### SemeionArtifact (Universal Schema, WiP)
Every artifact, regardless of source platform, conforms to a universal schema:
Every artifact, regardless of source, conforms to a unified model:
```bash
┌─────────────────────────────────────────────────────────────────────────┐
│                             SemeionArtifact                             │
├─────────────────────────────────────────────────────────────────────────┤
│  Identity:        id, case_id                                           │
│  Classification:  artifact_class, source_platform, searchable           │
│  Temporal:        timestamp                                             │
│  Actors:          [{identifier, display_name, role}]                    │
│  Content:         text, semantic_text                                   │
│  Entities:        indexed_entities[] (for filtering)                    │
│  Hierarchy:       parent_id, chunk_info (for documents)                 │
│  Context:         context_group (conversation, thread, session)         │
│  Location:        url, path, title                                      │
│  Source-Specific: message{}, browser{}, email{}, document{}, etc.       │
│  Ingestion:       ingested_at, source_file, parser_id                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│                     SemeionArtifact                     │
├─────────────────────────────────────────────────────────┤
│  Identity:        id, case_id                           │
│  Classification:  artifact_class, source_platform       │
│  Temporal:        timestamp (UTC normalized)            │
│  Actors:          [{identifier, display_name, role}]    │
│  Content:         text, semantic_text                   │
│  Entities:        indexed_entities[] (files, IPs, etc)  │
│  Hierarchy:       parent_id, context_group              │
│  Location:        url, path, title                      │
│  Embeddings:      semantic_vector (768-dim)             │
│  Source-Specific: message{}, browser{}, email{}, etc    │
└─────────────────────────────────────────────────────────┘
```
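To make the unified model more concrete, here is one way it could be written as a Python dataclass. Field names follow the schema box; the types, defaults, the `Actor` helper class, the `source_specific` dict, and the choice to reject naive timestamps are assumptions for this sketch, since the schema itself is still a work in progress.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Actor:
    identifier: str
    display_name: str
    role: str


@dataclass
class SemeionArtifact:
    id: str
    case_id: str
    artifact_class: str
    source_platform: str
    timestamp: datetime
    text: str = ""
    semantic_text: str = ""
    actors: list[Actor] = field(default_factory=list)
    indexed_entities: list[str] = field(default_factory=list)
    parent_id: Optional[str] = None
    context_group: Optional[str] = None
    url: Optional[str] = None
    path: Optional[str] = None
    title: Optional[str] = None
    semantic_vector: Optional[list[float]] = None  # 768-dim per the schema
    source_specific: dict = field(default_factory=dict)  # message{}, browser{}, ...

    def __post_init__(self):
        # Assumption: naive timestamps are rejected rather than guessed.
        if self.timestamp.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")
        self.timestamp = self.timestamp.astimezone(timezone.utc)


# A message recorded at 15:22 local time (UTC+1) is stored as 14:22 UTC.
local = timezone(timedelta(hours=1))
msg = SemeionArtifact(
    id="a1",
    case_id="case-001",
    artifact_class="message",
    source_platform="whatsapp",
    timestamp=datetime(2024, 3, 15, 15, 22, tzinfo=local),
    text="sending the files now",
    actors=[Actor("user-1", "Alice", "sender")],
)
```

Normalizing in `__post_init__` keeps every artifact on a single time axis, which is what the cross-source timeline relies on.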
### Vector Strategy
### Cluster Model (WiP)
| Vector | Purpose | Required |
|--------|---------|----------|
| **Semantic** | Conceptual similarity search | Yes |
| **Sparse** (keywords) | Exact term matching (hybrid) | Optional |
Timeline-only artifacts use a placeholder vector and are excluded from search but included in context views.
```bash
┌─────────────────────────────────────────────────────────┐
│                      EventCluster                       │
├─────────────────────────────────────────────────────────┤
│  id: unique_cluster_id                                  │
│  time_range: (start_timestamp, end_timestamp)           │
│  events: [artifact_ids]                                 │
│  pattern_type: "file_exfiltration" | "communication"    │
│  confidence: 0.0-1.0                                    │
│  summary: "USB transfer with deletion"                  │
│  icon: "⚠️" | "💬" | "🌐" | etc                         │
│  semantic_links: [related_cluster_ids]                  │
└─────────────────────────────────────────────────────────┘
```
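A matching sketch for the cluster record, again with assumed types and validation rules (the range check on `confidence` and the ordering check on `time_range` are illustrative choices, not confirmed behaviour; the example values are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class EventCluster:
    id: str
    time_range: tuple[datetime, datetime]
    events: list[str]  # artifact ids
    pattern_type: str  # e.g. "file_exfiltration" or "communication"
    confidence: float
    summary: str = ""
    icon: str = ""
    semantic_links: list[str] = field(default_factory=list)  # related cluster ids

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
        start, end = self.time_range
        if start > end:
            raise ValueError("time_range start must not be after end")


cluster = EventCluster(
    id="c-001",
    time_range=(datetime(2024, 3, 15, 14, 33, 15), datetime(2024, 3, 15, 14, 33, 40)),
    events=["a1", "a2", "a3", "a4"],
    pattern_type="file_exfiltration",
    confidence=0.8,
    summary="USB transfer with deletion",
)
duration = cluster.time_range[1] - cluster.time_range[0]
```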
## What Semeion Is NOT (yet)
| Not This | Why |
|----------|-----|
| Forensic suite replacement | Companion tool; use alongside Autopsy/Axiom |
| Reporting Tool | Review and analyse findings here; document them in the primary application |
| Forensic Interpretation Robot | Helps you discover what you otherwise wouldn't |
| **Forensic suite replacement** | Companion tool; use alongside Autopsy for acquisition |
| **Reporting tool** | Timeline export for reports, but documentation happens in primary suite |
| **AI evidence interpreter** | AI assists with search/clustering; investigator interprets evidence |
## Development Setup
@@ -196,11 +238,12 @@ This project uses [uv](https://github.com/astral-sh/uv) for dependency managemen
- Python 3.13+
- [uv](https://github.com/astral-sh/uv) installed
- requirements.txt
### Installation
```bash
git clone <repository-url>
git clone https://git.cc24.dev/mstoeck3/semeion
cd semeion
# Create virtual environment
@@ -215,13 +258,11 @@ uv pip install -r requirements.txt -e .
# Configure environment
cp .env.example .env
# Edit .env with your Qdrant and LLM endpoint configurations
# Edit .env with your Qdrant and embedding endpoint configurations
```
### Running
uv handles the startup script:
```bash
semeion
```
@@ -232,41 +273,83 @@ semeion
| Resource | Requirement |
|----------|-------------|
| CPU | Multi-core |
| RAM | 4 GB |
| CPU | 4 cores |
| RAM | 8 GB |
| Storage | Minimal (evidence stored elsewhere) |
| Network | Access to Qdrant and LLM endpoints |
| GPU | Not required |
| Network | Access to Qdrant and embedding endpoints (if remote) |
### Recommended (Local Processing)
| Resource | Requirement |
|----------|-------------|
| CPU | 8+ cores |
| RAM | 32 GB |
| Storage | sufficient for evidence & vectors, LLM if installed locally |
| GPU | optional (improves embedding speed) |
| RAM | 16 GB (32 GB for large cases) |
| Storage | SSD, sufficient for evidence & vectors |
| GPU | Optional (improves embedding speed with local models) |
| Network | Optional (fully offline capable) |
## Project Status
**Current Phase:** Architecture and data model definition
**Current Phase:** MVP Development - Timeline & Core Features
**Roadmap:**
1. ✅ Concept and schema design
2. ⬜ Core infrastructure (Qdrant collection, basic ingestion)
3. ⬜ Search execution (semantic search, filtering)
4. ⬜ LLM integration (query interpretation)
5. ⬜ Refinement system (Qdrant Recommend)
6. ⬜ Context view
7. ⬜ Platform parsers (WhatsApp, Chrome, etc.)
8. ⬜ Hybrid search (sparse vectors)
1. ✅ Concept and architecture design
2. ⬜ Core infrastructure
   - Unified artifact ingestion (WhatsApp, Chrome)
   - SQLite + Qdrant integration
3. ⬜ Timeline visualization
   - High-performance rendering (target: <200ms for 100k events)
   - Multi-source swim lanes
   - Zoom/pan navigation
4. ⬜ Semantic clustering
   - Temporal proximity grouping
   - Pattern template matching
   - Semantic similarity detection
5. ⬜ Communication search
   - Multilingual embedding
   - Query interpretation
   - Search → timeline jump
6. ⬜ Additional parsers (Telegram, Signal, etc.)
7. ⬜ Export and reporting
8. ⬜ Performance optimization & polish
## Why Open Source Matters
**Transparency for Court:**
- Auditable algorithms—no "black box" analysis
- Reproducible results—scientific validation possible
- Peer-reviewed methods—community scrutiny
**Accessibility:**
- Free for budget-constrained labs
- No vendor lock-in
- Community-driven development
**Innovation:**
- Rapid feature development
- Specialized extensions possible
- Academic research enabled
## Branding
**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark," embodying the software's mission of interpreting semantic signals within digital evidence.
**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark," embodying the software's mission of revealing meaningful patterns within digital evidence through intelligent timeline analysis.
The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden knowledge through analytical thought.
The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden connections through analytical thought.
## License
BSD 3-Clause
## Contributing
Semeion is in active development. Contributions welcome, especially from:
- Digital forensics practitioners (workflow validation)
- Timeline visualization experts
- Multilingual NLP specialists
- Performance optimization engineers