
Gamayun


Concept

Desktop application for forensic investigators that lets them search for various types of artifacts via natural language. The application uses a multi-stage hybrid search approach: it accepts a query in natural language and uses an LLM to generate a machine-readable search plan, which the user can adjust before execution. The system then runs semantic search over pre-generated embeddings and filters the results as defined in the plan. This combines semantic understanding with contextual and temporal relationships, rather than relying on traditional exact keyword matching.

UX Example

An investigator can ask "show me what happened after they discussed the payment" and the system will find relevant communication about payments, then correlate subsequent activities (file access, application launches, network traffic) in a temporal sequence, regardless of the specific applications or messaging platforms involved.

System Overview

Core Concept

The system treats all digital artifacts as semantically-rich objects embedded in a multi-dimensional vector space. Instead of matching exact words, it finds artifacts with similar meaning. A local language model interprets natural language queries and decomposes them into structured search operations that can handle temporal reasoning ("after"), relationship detection ("both in"), and causal inference ("because of").

Key Innovation

Generic artifact understanding: The system doesn't need to "know" about TOX, WhatsApp, Signal, Telegram, or any specific application. During ingestion, it automatically classifies content as "chat message," "system event," "document," etc., based purely on semantic similarity to type descriptions. This means it works on artifacts from applications that don't even exist yet, or proprietary communication tools without public documentation.
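The "classify by semantic similarity to type descriptions" idea can be sketched as follows. This is a toy illustration only: it scores similarity with a token-count cosine instead of real embedding vectors, and the type descriptions, labels, and function names are hypothetical, not the project's actual taxonomy.

```python
from collections import Counter
from math import sqrt

# Hypothetical type descriptions; the real system would embed these with the
# configured embedding model rather than counting tokens.
TYPE_DESCRIPTIONS = {
    "chat message": "short conversational text sent between two or more people",
    "system event": "log entry recording an operating system or application event",
    "document": "long-form written content such as a report letter or article",
}

def _vector(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(artifact_text: str) -> str:
    """Pick the type whose description is most similar to the artifact text."""
    scores = {
        label: _cosine(_vector(artifact_text), _vector(desc))
        for label, desc in TYPE_DESCRIPTIONS.items()
    }
    return max(scores, key=scores.get)
```

Because nothing here is application-specific, adding support for a new artifact type amounts to adding one more description, which is the property the paragraph above relies on.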

Architecture Philosophy

Client-Server Separation: Compute-intensive operations (embedding generation, LLM inference, vector search) can run on powerful remote infrastructure, while the GUI client remains lightweight and runs on the investigator's local machine. This enables:

  • Scaling compute resources independently of workstations
  • Deployment in air-gapped labs
  • Efficient resource utilization (centralized compute nodes can serve multiple investigators)

Development Setup

This project uses uv for fast dependency management and the "src layout" project structure.

Prerequisites

  • Python 3.13+
  • uv installed (curl -LsSf https://astral.sh/uv/install.sh | sh)

Installation Steps

  1. Clone the repository

    git clone <your-repo-url>
    cd gamayun
    
  2. Create a virtual environment. This project requires Python 3.13.

    uv venv --python 3.13
    
  3. Activate the environment

    • Linux/macOS:

      source .venv/bin/activate
      
    • Windows:

      .venv\Scripts\activate
      
  4. Install dependencies. This command installs locked dependencies and links the local gamayun package in editable mode.

    uv pip install -r requirements.txt -e .
    

Running the Application

You can execute the module directly:

python src/gamayun/main.py

Running Tests

pytest

Data Flow

Ingestion Pipeline

Raw Evidence Sources
├─ Forensic Images (E01, DD, AFF4)
├─ Timeline CSV (Timesketch format)
└─ Loose Files (documents, logs, databases)
         │
         ▼
┌────────────────────────┐
│ Artifact Extraction    │
│ • pytsk3 (images)      │
│ • CSV parser           │
│ • File processors      │
└───────┬────────────────┘
        │
        ▼
┌────────────────────────┐
│ Content Extraction     │
│ • PDF, DOCX, XLSX      │
│ • SQLite databases     │
│ • Text files           │
│ • OCR for images       │
└───────┬────────────────┘
        │
        ▼
┌────────────────────────┐
│ Semantic Enrichment    │
│ • Classify type        │
│ • Extract entities     │
│ • Detect relationships │
│ • Add metadata         │
└───────┬────────────────┘
        │
        ▼
┌────────────────────────┐
│ Embedding Generation   │
│ → Remote/Local Service │
└───────┬────────────────┘
        │
        ▼
┌────────────────────────┐
│ Index in Qdrant        │
│ • Vector + Payload     │
│ • Create indexes       │
│ • Snapshot for audit   │
└────────────────────────┘

Reproducibility: Each ingestion run generates a manifest file containing:

  • Source hashes (MD5/SHA256 of evidence)
  • Model versions (embedding model, LLM)
  • Configuration parameters
  • Processing statistics
  • Timestamp and operator ID

This manifest allows exact reproduction of the index from the same source data.
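A minimal sketch of how such a manifest could be assembled with the standard library. The function names and the exact JSON layout are assumptions for illustration, not the project's actual schema:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def file_hashes(path: Path) -> dict:
    """MD5 and SHA-256 of one evidence file, streamed in 1 MiB chunks."""
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}

def build_manifest(sources, models, config, stats, operator_id) -> str:
    """Serialize one ingestion run as a reproducibility manifest (JSON)."""
    manifest = {
        "sources": {str(p): file_hashes(Path(p)) for p in sources},
        "models": models,          # e.g. {"embedding": "...", "llm": "..."}
        "config": config,          # configuration parameters of the run
        "stats": stats,            # processing statistics
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operator": operator_id,
    }
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Streaming the hashes keeps memory flat even for multi-terabyte evidence images, and `sort_keys=True` makes two manifests of the same run byte-identical.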

Query Execution Pipeline

Natural Language Query
"bitcoin transaction after drug deal"
         │
         ▼
┌────────────────────────┐
│ LLM Query Parser       │
│ → Remote/Local Service │
│ Returns: JSON Plan     │
└───────┬────────────────┘
        │
        ▼
┌────────────────────────┐
│ Query Plan Editor (UI) │
│ • Review plan          │
│ • Adjust parameters    │
│ • Modify steps         │
│ • User approves        │
└───────┬────────────────┘
        │
        ▼
┌────────────────────────┐
│ Search Orchestrator    │
│ • Execute Step 1       │
│ • Extract timestamps   │
│ • Execute Step 2       │
│ • Apply temporal logic │
└───────┬────────────────┘
        │
        ▼
┌────────────────────────┐
│ Correlation Engine     │
│ • Calculate proximity  │
│ • Weight scores        │
│ • Build relationships  │
└───────┬────────────────┘
        │
        ▼
┌────────────────────────┐
│ Results Presentation   │
│ • Timeline view        │
│ • Correlation graph    │
│ • Export options       │
└────────────────────────┘
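For illustration, here is a plan the parser might return for the example query, together with a minimal validity check of the kind the Query Plan Editor could run before execution. The JSON schema shown is hypothetical; the real plan format is defined by the LLM parser and may differ:

```python
# Hypothetical two-step plan for "bitcoin transaction after drug deal".
EXAMPLE_PLAN = {
    "steps": [
        {"id": 1, "type": "semantic_search",
         "query": "conversation about a drug deal", "top_k": 50},
        {"id": 2, "type": "semantic_search",
         "query": "cryptocurrency or bitcoin transaction", "top_k": 50,
         "temporal": {"after_step": 1, "window_hours": 48}},
    ]
}

def validate_plan(plan: dict) -> list[str]:
    """Return human-readable problems; an empty list means the plan is executable."""
    problems = []
    seen_ids = set()
    for step in plan.get("steps", []):
        seen_ids.add(step.get("id"))
        if not step.get("query"):
            problems.append(f"step {step.get('id')}: missing query")
        ref = step.get("temporal", {}).get("after_step")
        if ref is not None and ref not in seen_ids:
            problems.append(f"step {step.get('id')}: references unknown step {ref}")
    return problems
```

Surfacing the plan as editable JSON is what lets the investigator review, adjust, and approve it before the Search Orchestrator runs anything.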

Technical Stack

Core Technologies

| Component            | Technology            | Version | License    | Purpose                       |
|----------------------|-----------------------|---------|------------|-------------------------------|
| GUI Framework        | PySide6               | TBD     | LGPL       | Desktop application interface |
| Vector Database      | Qdrant                | 1.10+   | Apache 2.0 | Semantic search and storage   |
| LLM Inference        | OpenAI-compatible API | various | various    | Natural language understanding |
| LLM Models           | various               | various | various    | Query parsing                 |
| Embeddings           | OpenAI-compatible API | various | various    | Semantic vectors              |
| Embedding Model      | TBD                   | TBD     | TBD        | Text to vector conversion     |
| Forensic Parsing     | pytsk3                | TBD     | Apache 2.0 | Disk image processing         |
| Image Handling       | pyewf                 | TBD     | LGPL       | E01 image support             |
| NLP                  | spaCy                 | TBD     | MIT        | Entity extraction             |
| Programming Language | Python                | 3.13+   | PSF        | Application logic             |

Infrastructure Requirements

Remote Processing (thin client; compute on remote infrastructure)

  • CPU: multi-core
  • RAM: ~4 GiB
  • Storage: insignificant (client only, plus any locally held evidence)
  • GPU: not required

Small-scale local processing (local workstation)

Minimum:

  • CPU: 8+ cores (modern Ryzen 5/Ryzen 7/Ryzen AI series)
  • RAM: 32 GiB (16 GiB for Qdrant, 8 GiB for LLM, 8 GiB for OS/app)
  • Storage: 2 TB SSD (1 TB evidence + 1 TB index)
  • GPU: recommended, not required (speed considerations)

Recommended:

  • CPU: 16+ cores (Ryzen 7 or better)
  • RAM: 64-128 GiB
  • Storage: sufficient for index + evidence
  • GPU: AMD Instinct MI50 32 GiB or better (inference, embeddings)
  • Network: sufficient bandwidth for evidence transfer

Enterprise Configuration (Multi-User)

  • TBD, out of scope

Supported Ingestion Formats

Primary: Specialized Data Objects

TBD

Secondary: Conversion Engine (algorithmic)

Example:

  • SQLite parser for browser history -> Specialized Data Object
  • Converter for TSK artifacts -> metadata in Specialized Data Object (TBD)

Use Case Scenarios

Scenario 1: Drug Transaction Investigation

Query: "Find when the suspect made cryptocurrency payments after discussing deals"

Process:

  1. System finds chat messages about drug deals
  2. Extracts timestamps of deal discussions
  3. Searches for cryptocurrency-related activity after each discussion
  4. Correlates wallet launches, browser activity, blockchain transactions
  5. Presents timeline showing: discussion → wallet launch → transaction

Evidence: Timeline of intent → action, strengthening the case
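The "activity after each discussion" step can be sketched as a windowed temporal join: for each anchor timestamp found in step 2, collect follow-up events and score them by proximity. The 24-hour window and the linear decay scoring are assumptions for illustration, not fixed system parameters:

```python
from datetime import datetime, timedelta

def correlate_after(anchors, events, window=timedelta(hours=24)):
    """For each anchor timestamp, collect (timestamp, label) events that
    follow it within the window, scored higher the closer they are to
    the anchor (linear decay from 1.0 at the anchor to 0.0 at the edge)."""
    hits = []
    for anchor in anchors:
        for ts, label in events:
            delta = ts - anchor
            if timedelta(0) < delta <= window:
                score = 1.0 - delta / window
                hits.append((anchor, label, round(score, 3)))
    # Strongest temporal correlations first.
    return sorted(hits, key=lambda h: -h[2])
```

Events before the anchor or outside the window drop out, which is what turns a flat artifact list into the discussion → wallet launch → transaction sequence described above.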

Scenario 2: Data Exfiltration

Query: "Show file access before large uploads to cloud storage"

Process:

  1. Identifies cloud storage upload events
  2. Looks backward in time for file access
  3. Correlates accessed files with uploaded data
  4. Maps file paths to user actions

Evidence: Demonstrates what data was taken and when

Scenario 3: Coordinated Activity

Query: "Find people who communicated privately and are also in the same groups"

Process:

  1. Extracts participants from private messages
  2. Extracts participants from group chats
  3. Identifies overlap (intersection)
  4. Shows sample conversations from each context

Evidence: Demonstrates coordinated behavior across communication channels
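The overlap step in this scenario reduces to a set intersection over participant handles. A minimal sketch, where the `participants` field is an assumed attribute of an ingested message object:

```python
def coordinated_contacts(private_msgs, group_msgs):
    """Handles that appear in both private and group communications.
    Each message is assumed to carry a 'participants' iterable of handles."""
    private = {p for m in private_msgs for p in m["participants"]}
    groups = {p for m in group_msgs for p in m["participants"]}
    return sorted(private & groups)
```

Because handles are compared after extraction, the check works across messaging platforms, matching the application-agnostic design described earlier.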

Scenario 4: Timeline Reconstruction

Query: "What happened between receiving the threatening email and deleting files?"

Process:

  1. Finds threatening email (semantic search)
  2. Finds file deletion events (system logs)
  3. Returns all artifacts between these timestamps
  4. Visualizes complete timeline

Evidence: Establishes sequence of events and potential motive

License

BSD 3-Clause (subject to change during development)
