diff --git a/README.md b/README.md
index 9f8c994..83f2dc9 100644
--- a/README.md
+++ b/README.md
@@ -58,7 +58,7 @@ This project uses [uv](https://github.com/astral-sh/uv) for fast dependency mana
    uv venv --python 3.13
    ```
 
-3. **Activate the environment**
+3. **Environment**
 
    - Linux/macOS:
 
@@ -72,7 +72,7 @@ This project uses [uv](https://github.com/astral-sh/uv) for fast dependency mana
      .venv\Scripts\activate
      ```
 
-4. **Install dependencies**
+4. **Dependencies**
    This command installs locked dependencies and links the local `semeion` package in editable mode.
 
    ```bash
@@ -81,8 +81,6 @@ This project uses [uv](https://github.com/astral-sh/uv) for fast dependency mana
 
 ### Running the Application
 
-You can execute the module directly:
-
 ```bash
 python src/semeion/main.py
 ```
@@ -93,116 +91,6 @@ python src/semeion/main.py
 pytest
 ```
 
-## Data Flow (subject to change)
-
-### Ingestion Pipeline
-
-```bash
-Raw Evidence Sources
-├─ Forensic Images (E01, DD, AFF4)
-├─ Timeline CSV (Timesketch format)
-└─ Loose Files (documents, logs, databases)
-        │
-        ▼
-┌────────────────────────┐
-│ Artifact Extraction    │
-│ • pytsk3 (images)      │
-│ • CSV parser           │
-│ • File processors      │
-└───────┬────────────────┘
-        │
-        ▼
-┌────────────────────────┐
-│ Content Extraction     │
-│ • PDF, DOCX, XLSX      │
-│ • SQLite databases     │
-│ • Text files           │
-│ • OCR for images       │
-└───────┬────────────────┘
-        │
-        ▼
-┌────────────────────────┐
-│ Semantic Enrichment    │
-│ • Classify type        │
-│ • Extract entities     │
-│ • Detect relationships │
-│ • Add metadata         │
-└───────┬────────────────┘
-        │
-        ▼
-┌────────────────────────┐
-│ Embedding Generation   │
-│ → Remote/Local Service │
-└───────┬────────────────┘
-        │
-        ▼
-┌────────────────────────┐
-│ Index in Qdrant        │
-│ • Vector + Payload     │
-│ • Create indexes       │
-│ • Snapshot for audit   │
-└────────────────────────┘
-```
-
-Reproducibility: Each ingestion run generates a manifest file containing:
-
-- Source hashes (MD5/SHA256 of evidence)
-- Model versions (embedding model, LLM)
-- Configuration parameters
-- Processing statistics
-- Timestamp and operator ID
-
-This manifest allows exact reproduction of the index from the same source data.
-
-### Query Execution Pipeline
-
-```bash
-Natural Language Query
-"bitcoin transaction after drug deal"
-        │
-        ▼
-┌────────────────────────┐
-│ LLM Query Parser       │
-│ → Remote/Local Service │
-│ Returns: JSON Plan     │
-└───────┬────────────────┘
-        │
-        ▼
-┌────────────────────────┐
-│ Query Plan Editor (UI) │
-│ • Review plan          │
-│ • Adjust parameters    │
-│ • Modify steps         │
-│ • User approves        │
-└───────┬────────────────┘
-        │
-        ▼
-┌────────────────────────┐
-│ Search Orchestrator    │
-│ • Execute Step 1       │
-│ • Extract timestamps   │
-│ • Execute Step 2       │
-│ • Apply temporal logic │
-└───────┬────────────────┘
-        │
-        ▼
-┌────────────────────────┐
-│ Correlation Engine     │
-│ • Calculate proximity  │
-│ • Weight scores        │
-│ • Build relationships  │
-└───────┬────────────────┘
-        │
-        ▼
-┌────────────────────────┐
-│ Results Presentation   │
-│ • Timeline view        │
-│ • Correlation graph    │
-│ • Export options       │
-└────────────────────────┘
-```
-
-
 ## Technical Stack
 
 ### Core Technologies
@@ -249,8 +137,6 @@ Natural Language Query
 
 - TBD, out of scope
 
----
-
 ## Supported Ingestion Formats
 
 ### Primary: Specialized Data Objects