From 0782fb29d053270759a6fce912482e69b9003dd7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mario=20St=C3=B6ckl?=
Date: Wed, 26 Nov 2025 21:10:05 +0000
Subject: [PATCH] README.md updated

---
 README.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index d362172..9f8c994 100644
--- a/README.md
+++ b/README.md
@@ -17,19 +17,21 @@ An investigator can ask "show me what happened after they discussed the payment"
 
 ### Core Concept
 
-The system treats all digital artifacts as semantically-rich objects embedded in a multi-dimensional vector space. Instead of matching exact words, it finds artifacts with similar meaning. A local language model interprets natural language queries and decomposes them into structured search operations that can handle temporal reasoning ("after"), relationship detection ("both in"), and causal inference ("because of").
+The system treats all digital artifacts as semantically rich object containers, embedded in a multi-dimensional vector space and associated with metadata. In addition to matching exact strings, it finds artifacts with similar meaning. A local language model interprets natural language queries and decomposes them into structured search operations (represented by a JSON-like data structure) that can handle temporal reasoning ("after"), relationship detection ("both in"), and causal inference ("because of").
+The query object contains fields that narrow results by deterministic metadata, such as timestamps or application purpose, together with a free-text string well suited for vector retrieval.
 
 ### Key Innovation
 
-Generic artifact understanding: The system doesn't need to "know" about TOX, WhatsApp, Signal, Telegram, or any specific application. During ingestion, it automatically classifies content as "chat message," "system event," "document," etc., based purely on semantic similarity to type descriptions. This means it works on artifacts from applications that don't even exist yet, or proprietary communication tools without public documentation.
+Generic artifact understanding: The system doesn't need to "know" about TOX, WhatsApp, Signal, Telegram, or any specific application. Ingestion is a scripted pre-processing step that constructs a standardized data container with several pre-defined fields: a metadata set holding machine-origin data, such as OS/application events with timestamps (similar to traditional forensic artifacts or timeline data), and a vector representation holding whatever carries semantic relevance for retrieval (primarily, but not restricted to, content generated by user behavior). This means it works on artifacts from applications that don't even exist yet, on proprietary communication tools without public documentation, and even on arbitrary data that carries only semantic information.
 
 ### Architecture Philosophy
 
 Client-Server Separation: Compute-intensive operations (embedding generation, LLM inference, vector search) can run on powerful remote infrastructure, while the GUI client remains lightweight and runs on the investigator's local machine. This enables:
 
+- Shared infrastructure across investigation teams
 - Scaling compute resources independently of workstations
-- Deployment in air-gapped labs
-- Efficient resource utilization (centralized compute nodes can serve multiple investigators)
+- Deployment in both air-gapped labs and cloud environments
+- Efficient resource utilization (GPU servers can serve multiple investigators)
 
 ## Development Setup
 
@@ -91,7 +93,7 @@ python src/semeion/main.py
 pytest
 ```
 
-## Data Flow
+## Data Flow (subject to change)
 
 ### Ingestion Pipeline
 
@@ -200,7 +202,6 @@ Natural Language Query
 └────────────────────────┘
 ```
 
----
 
 ## Technical Stack
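The structured query object this patch describes could be sketched as follows. This is a minimal illustration only: `StructuredQuery` and every field name (`semantic_text`, `after`, `before`, `artifact_types`) are assumptions for the sketch, not the project's actual schema.

```python
# Sketch of a JSON-like structured query: deterministic fields for
# metadata narrowing plus a free-text string for vector retrieval.
# All names here are hypothetical, not the project's real API.
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class StructuredQuery:
    semantic_text: str                 # embedded and matched by vector similarity
    after: Optional[str] = None        # ISO-8601 lower bound on artifact timestamps
    before: Optional[str] = None       # ISO-8601 upper bound
    artifact_types: list = field(default_factory=list)  # e.g. ["chat message"]


# What an LLM might emit for:
# "show me what happened after they discussed the payment"
query = StructuredQuery(
    semantic_text="discussion about a payment",
    after="2025-11-01T00:00:00Z",
    artifact_types=["chat message"],
)
print(json.dumps(asdict(query), indent=2))
```

A search backend would then apply the deterministic fields as hard filters and use `semantic_text` for nearest-neighbor retrieval in the vector space.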