README.md updated

2025-11-26 21:12:37 +00:00
parent 0782fb29d0
commit dbd1ea1ebf

118
README.md

@@ -58,7 +58,7 @@ This project uses [uv](https://github.com/astral-sh/uv) for fast dependency mana
uv venv --python 3.13
```
3. **Activate the environment** 3. **Environment**
- Linux/macOS:
@@ -72,7 +72,7 @@ This project uses [uv](https://github.com/astral-sh/uv) for fast dependency mana
.venv\Scripts\activate
```
4. **Install dependencies** 4. **Dependencies**
This command installs locked dependencies and links the local `semeion` package in editable mode.
```bash
@@ -81,8 +81,6 @@ This project uses [uv](https://github.com/astral-sh/uv) for fast dependency mana
### Running the Application
You can execute the module directly:
```bash
python src/semeion/main.py
```
@@ -93,116 +91,6 @@ python src/semeion/main.py
pytest
```
## Data Flow (subject to change)
### Ingestion Pipeline
```bash
Raw Evidence Sources
├─ Forensic Images (E01, DD, AFF4)
├─ Timeline CSV (Timesketch format)
└─ Loose Files (documents, logs, databases)
┌────────────────────────┐
│ Artifact Extraction │
│ • pytsk3 (images) │
│ • CSV parser │
│ • File processors │
└───────┬────────────────┘
┌────────────────────────┐
│ Content Extraction │
│ • PDF, DOCX, XLSX │
│ • SQLite databases │
│ • Text files │
│ • OCR for images │
└───────┬────────────────┘
┌────────────────────────┐
│ Semantic Enrichment │
│ • Classify type │
│ • Extract entities │
│ • Detect relationships │
│ • Add metadata │
└───────┬────────────────┘
┌────────────────────────┐
│ Embedding Generation │
│ → Remote/Local Service │
└───────┬────────────────┘
┌────────────────────────┐
│ Index in Qdrant │
│ • Vector + Payload │
│ • Create indexes │
│ • Snapshot for audit │
└────────────────────────┘
```
Reproducibility: Each ingestion run generates a manifest file containing:
- Source hashes (MD5/SHA256 of evidence)
- Model versions (embedding model, LLM)
- Configuration parameters
- Processing statistics
- Timestamp and operator ID
This manifest allows exact reproduction of the index from the same source data.
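As an illustration of the manifest described above, the following is a minimal sketch, not the project's actual implementation. The names `hash_file`, `IngestionManifest`, and `build_manifest`, as well as the JSON layout, are assumptions made for this example; only the list of recorded fields comes from the text.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> dict[str, str]:
    """Compute MD5 and SHA-256 of a file in a single streaming pass."""
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}


@dataclass
class IngestionManifest:
    """Hypothetical manifest layout covering the fields listed above."""

    source_hashes: dict[str, dict[str, str]]  # evidence path -> digests
    model_versions: dict[str, str]            # e.g. embedding model, LLM
    config: dict[str, object]                 # configuration parameters
    stats: dict[str, int]                     # processing statistics
    operator_id: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Sorted keys keep the serialized manifest stable across runs.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def build_manifest(sources, model_versions, config, stats, operator_id):
    """Hash every evidence source and collect the run metadata."""
    hashes = {str(p): hash_file(Path(p)) for p in sources}
    return IngestionManifest(hashes, model_versions, config, stats, operator_id)
```

Hashing in fixed-size chunks keeps memory use flat for large forensic images, and writing `to_json()` output next to the index would give an auditor the inputs needed to repeat the run.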
### Query Execution Pipeline
```bash
Natural Language Query
"bitcoin transaction after drug deal"
┌────────────────────────┐
│ LLM Query Parser │
│ → Remote/Local Service │
│ Returns: JSON Plan │
└───────┬────────────────┘
┌────────────────────────┐
│ Query Plan Editor (UI) │
│ • Review plan │
│ • Adjust parameters │
│ • Modify steps │
│ • User approves │
└───────┬────────────────┘
┌────────────────────────┐
│ Search Orchestrator │
│ • Execute Step 1 │
│ • Extract timestamps │
│ • Execute Step 2 │
│ • Apply temporal logic │
└───────┬────────────────┘
┌────────────────────────┐
│ Correlation Engine │
│ • Calculate proximity │
│ • Weight scores │
│ • Build relationships │
└───────┬────────────────┘
┌────────────────────────┐
│ Results Presentation │
│ • Timeline view │
│ • Correlation graph │
│ • Export options │
└────────────────────────┘
```
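The temporal logic and proximity weighting in the pipeline above could look roughly like the sketch below. Everything here is an assumption for illustration: the `Hit` type, the `correlate` function, the 24-hour window, and the linear decay are hypothetical and not taken from the project. For a query such as "bitcoin transaction after drug deal", the hits from step 1 act as anchors and the hits from step 2 as followers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit: one artifact returned by a search step."""

    artifact_id: str
    timestamp: datetime
    score: float  # similarity score, assumed to lie in [0, 1]


def correlate(anchors, followers, window=timedelta(hours=24)):
    """Pair each anchor with followers that occur after it, within `window`.

    The combined score is the product of both similarity scores, scaled by
    a linear proximity factor that falls from 1 toward 0 as the time gap
    approaches the window size (an assumed weighting scheme).
    """
    pairs = []
    for a in anchors:
        for f in followers:
            delta = f.timestamp - a.timestamp
            if timedelta(0) < delta <= window:
                proximity = 1.0 - delta / window
                pairs.append((a, f, a.score * f.score * proximity))
    # Strongest correlations first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

The nested loop is quadratic in the number of hits, which is acceptable for the short result lists a reviewed query plan returns; sorting followers by timestamp and using a sliding window would be the natural next step for larger sets.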
## Technical Stack
### Core Technologies
@@ -249,8 +137,6 @@ Natural Language Query
- TBD, out of scope
---
## Supported Ingestion Formats
### Primary: Specialized Data Objects