### Core Concept
The system treats all digital artifacts as semantically rich object containers, embedded in a multi-dimensional vector space and associated with metadata. In addition to matching exact strings, it finds artifacts with similar meaning. A local language model interprets natural language queries and decomposes them into structured search operations (represented as a JSON-like data structure) that can handle temporal reasoning ("after"), relationship detection ("both in"), and causal inference ("because of").
The query object contains fields that narrow results by deterministic data such as timestamps or application purpose, together with a free-text string well suited to vector retrieval.
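For illustration, the structured plan for a query like "bitcoin transaction after drug deal" (used as an example later in this README) might resemble the sketch below. The field names are hypothetical and do not reflect the project's actual schema.

```python
# Hypothetical query plan emitted by the LLM parser; field names are
# illustrative only, not the project's actual schema.
query_plan = {
    "steps": [
        {   # Step 0: anchor event, found by semantic similarity
            "semantic_query": "discussion of a drug deal",
            "filters": {"artifact_type": "chat message"},
        },
        {   # Step 1: follow-up event, constrained to occur after step 0
            "semantic_query": "bitcoin transaction",
            "filters": {},
            "temporal": {"relation": "after", "reference_step": 0},
        },
    ]
}
```

Deterministic fields such as `filters` and `temporal` narrow the candidate set, while `semantic_query` feeds vector retrieval.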
### Key Innovation
Generic artifact understanding: The system doesn't need to "know" about TOX, WhatsApp, Signal, Telegram, or any specific application. Ingestion is a pre-scripted pre-processing operation that constructs a standardized data container with pre-defined fields: a set of metadata holding machine-origin data (such as OS and application events with timestamps, comparable to traditional forensic artifacts or timeline data), and a vector representation holding whatever provides semantic relevance for retrieval (primarily, but not exclusively, content generated by user behavior). This means it works on artifacts from applications that don't even exist yet, on proprietary communication tools without public documentation, and even on arbitrary data that carries only semantic information.
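A minimal sketch of what such a standardized container could look like; the field names below are assumptions for illustration, not the project's actual schema.

```python
# Illustrative artifact container: deterministic, machine-origin metadata
# alongside a semantic payload used for embedding. Field names are
# assumptions, not the project's actual schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ArtifactContainer:
    source_path: str                      # where the artifact was recovered from
    timestamp: datetime | None = None     # machine-origin event time, if any
    application: str | None = None        # originating OS/application, if known
    metadata: dict[str, str] = field(default_factory=dict)  # further deterministic facts
    semantic_text: str = ""               # content that carries meaning for retrieval
    embedding: list[float] | None = None  # filled in later by the embedding service
```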
### Architecture Philosophy
Client-Server Separation: Compute-intensive operations (embedding generation, LLM inference, vector search) can run on powerful remote infrastructure, while the GUI client remains lightweight and runs on the investigator's local machine. This enables:
- Shared infrastructure across investigation teams
- Scaling compute resources independently of workstations
- Deployment in both air-gapped labs and cloud environments
- Efficient resource utilization (GPU servers can serve multiple investigators)
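As an illustration of this split, the client can stay thin by delegating embedding to a remote compute node over HTTP. The host name, endpoint, and payload shape below are assumptions, not the project's actual API.

```python
# Minimal sketch of the client-server split: the GUI client delegates
# embedding to a remote compute node. Endpoint and payload are hypothetical.
import requests

EMBEDDING_SERVER = "http://gpu-node.lab.internal:8000"  # hypothetical shared compute node

def embed_remotely(text: str) -> list[float]:
    """Send text to the remote embedding service and return its vector."""
    response = requests.post(
        f"{EMBEDDING_SERVER}/embed",   # hypothetical endpoint, not the project's actual API
        json={"text": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["vector"]
```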
## Development Setup
This project uses [uv](https://github.com/astral-sh/uv) for fast dependency management.
2. **Create the virtual environment**
```bash
uv venv --python 3.13
```
3. **Activate the environment**
- Linux/macOS:
```bash
source .venv/bin/activate
```
- Windows:
```bash
.venv\Scripts\activate
```
4. **Install dependencies**
This command installs locked dependencies and links the local `semeion` package in editable mode.
```bash
uv sync
```
### Running the Application
You can execute the module directly:
```bash
python src/semeion/main.py
```
### Running Tests
```bash
pytest
```
## Data Flow
### Ingestion Pipeline
```text
Raw Evidence Sources
├─ Forensic Images (E01, DD, AFF4)
├─ Timeline CSV (Timesketch format)
└─ Loose Files (documents, logs, databases)
┌────────────────────────┐
│ Artifact Extraction │
│ • pytsk3 (images) │
│ • CSV parser │
│ • File processors │
└───────┬────────────────┘
┌────────────────────────┐
│ Content Extraction │
│ • PDF, DOCX, XLSX │
│ • SQLite databases │
│ • Text files │
│ • OCR for images │
└───────┬────────────────┘
┌────────────────────────┐
│ Semantic Enrichment │
│ • Classify type │
│ • Extract entities │
│ • Detect relationships │
│ • Add metadata │
└───────┬────────────────┘
┌────────────────────────┐
│ Embedding Generation │
│ → Remote/Local Service │
└───────┬────────────────┘
┌────────────────────────┐
│ Index in Qdrant │
│ • Vector + Payload │
│ • Create indexes │
│ • Snapshot for audit │
└────────────────────────┘
```
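One way the "Classify type" step in the Semantic Enrichment stage could work is by comparing an artifact's embedding against embeddings of plain-language type descriptions and picking the closest one. The sketch below assumes a generic `embed` callable and is not the project's actual implementation.

```python
# One possible implementation of the "Classify type" enrichment step.
# `embed` stands in for the system's embedding service and is an assumption.
from typing import Callable
import numpy as np

TYPE_DESCRIPTIONS = {
    "chat message": "a short conversational message exchanged between people",
    "system event": "a log entry produced automatically by the operating system or an application",
    "document": "a longer human-authored file such as a report, letter, or spreadsheet",
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_artifact(text: str, embed: Callable[[str], np.ndarray]) -> str:
    """Return the artifact type whose description is semantically closest to the text."""
    artifact_vec = embed(text)
    scores = {label: cosine(artifact_vec, embed(desc)) for label, desc in TYPE_DESCRIPTIONS.items()}
    return max(scores, key=scores.get)
```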
Reproducibility: Each ingestion run generates a manifest file containing:
- Source hashes (MD5/SHA256 of evidence)
- Model versions (embedding model, LLM)
- Configuration parameters
- Processing statistics
- Timestamp and operator ID
This manifest allows exact reproduction of the index from the same source data.
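For illustration, such a manifest might be written as a small JSON file at the end of an ingestion run; the key names and example values below are hypothetical, not the project's actual format.

```python
# Hypothetical ingestion-manifest writer; keys and values are illustrative.
import hashlib
import json
from datetime import datetime, timezone

def write_manifest(evidence_path: str, out_path: str = "manifest.json") -> None:
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with open(evidence_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):  # hash large evidence files in 1 MiB chunks
            md5.update(chunk)
            sha256.update(chunk)
    manifest = {
        "source_hashes": {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()},
        "model_versions": {"embedding_model": "example-embedder-v1", "llm": "example-llm-v1"},
        "configuration": {"chunk_size": 512},
        "processing_statistics": {"artifacts_indexed": 0},
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operator_id": "investigator-01",
    }
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2)
```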
### Query Execution Pipeline
```text
Natural Language Query
"bitcoin transaction after drug deal"
┌────────────────────────┐
│ LLM Query Parser │
│ → Remote/Local Service │
│ Returns: JSON Plan │
└───────┬────────────────┘
┌────────────────────────┐
│ Query Plan Editor (UI) │
│ • Review plan │
│ • Adjust parameters │
│ • Modify steps │
│ • User approves │
└───────┬────────────────┘
┌────────────────────────┐
│ Search Orchestrator │
│ • Execute Step 1 │
│ • Extract timestamps │
│ • Execute Step 2 │
│ • Apply temporal logic │
└───────┬────────────────┘
┌────────────────────────┐
│ Correlation Engine │
│ • Calculate proximity │
│ • Weight scores │
│ • Build relationships │
└───────┬────────────────┘
┌────────────────────────┐
│ Results Presentation │
│ • Timeline view │
│ • Correlation graph │
│ • Export options │
└────────────────────────┘
```
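The correlation step could, for instance, blend the vector-similarity score with temporal proximity to an anchor event. The exponential decay, its 24-hour time constant, and the equal weighting below are assumptions, not the project's actual scoring.

```python
# Illustrative temporal-proximity weighting for the correlation step;
# decay shape, time constant, and blend weights are assumptions.
import math
from datetime import datetime

def proximity_weight(anchor: datetime, candidate: datetime, tau_hours: float = 24.0) -> float:
    """Score in (0, 1]: the closer a candidate is in time to the anchor event, the higher the weight."""
    delta_hours = abs((candidate - anchor).total_seconds()) / 3600.0
    return math.exp(-delta_hours / tau_hours)

def combined_score(semantic_score: float, anchor: datetime, candidate: datetime) -> float:
    """Blend the vector-similarity score with temporal proximity."""
    return 0.5 * semantic_score + 0.5 * proximity_weight(anchor, candidate)
```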
---
## Technical Stack
### Core Technologies
- TBD, out of scope
---
## Supported Ingestion Formats
### Primary: Specialized Data Objects
Query: "What happened between receiving the threatening email and deleting files"
1. Finds the threatening email
2. Finds the file deletion events
3. Returns all artifacts between these timestamps
4. Visualizes the complete timeline
Evidence: Establishes sequence of events and potential motive
## Branding
**Semeion** derives from the Greek σημεῖον (semeion), meaning "sign," "signal," or "meaningful mark"—a name that embodies the software's core mission of interpreting semantic signals hidden within digital evidence. The mascot, **Koios** (Coeus), the Titan of intellect, inquiry, and questioning in Greek mythology, represents the investigative mindset at the heart of forensic work: the relentless pursuit of hidden knowledge through analytical thought. Just as Koios personified the axis of heavenly inquiry, Semeion serves as the pivot point for discovering meaningful patterns across temporal and semantic dimensions. Together, the name and mascot bridge ancient wisdom traditions of interpretation with cutting-edge vector embedding technology, reflecting the software's philosophy that effective investigation requires both sophisticated computational tools and human insight—finding the signs that matter in an ocean of data.
## License
BSD 3-Clause (subject to change during development)