# semeion

![semeion](resources/title_image.png)

## Concept

Desktop application that lets forensic investigators search for various types of artifacts via natural language. The application uses a multi-stage hybrid search approach: it accepts a query in natural language, then uses a GPT to generate a machine-readable search plan that the user can adjust before execution. The system then combines semantic search over pre-generated embeddings with the filters defined in that machine-readable plan. This pairs semantic understanding with contextual and temporal relationships, rather than relying on the traditional approach of exact keyword matching.

## UX Example

An investigator can ask "show me what happened after they discussed the payment" and the system will find relevant communication about payments, then correlate subsequent activities (file access, application launches, network traffic) in a temporal sequence, regardless of the specific applications or messaging platforms involved.

## System Overview

### Core Concept

The system treats all digital artifacts as semantically rich object containers, embedded in a multi-dimensional vector space and associated with metadata. Beyond matching exact strings, it finds artifacts with similar meaning.

A local language model interprets natural language queries and decomposes them into structured search operations (represented by a JSON-like data structure) that can handle temporal reasoning ("after"), relationship detection ("both in"), and causal inference ("because of"). The query object contains fields that narrow results by deterministic data such as timestamps or application purpose, as well as a string well suited for vector retrieval (see the sketch below).
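To make the query object concrete, the following is a minimal sketch of what such a plan could look like and how it could be executed as a hybrid search: embed the semantic part via an OpenAI-compatible endpoint, then search Qdrant with the deterministic fields as filters. The plan schema, field names (`semantic_query`, `artifact_type`, `time_after`), collection name, model name, and endpoint URLs are all illustrative assumptions, not the project's final design.

```python
# Illustrative sketch only: the plan schema, field names, collection name,
# model name, and endpoints below are assumptions, not the final design.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

# A hypothetical machine-readable plan as the LLM might emit it for
# "show me what happened after they discussed the payment":
plan = {
    "semantic_query": "conversation discussing a payment",  # text used for vector retrieval
    "artifact_type": "communication",                       # deterministic filter field
    "time_after": 1700000000.0,                             # unix-timestamp lower bound
}

emb_client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
qdrant = QdrantClient(url="http://localhost:6333")

# Turn the semantic part of the plan into a vector ...
vector = emb_client.embeddings.create(
    model="example-embedding-model",  # placeholder model name
    input=plan["semantic_query"],
).data[0].embedding

# ... and combine it with the deterministic filters for a hybrid search.
hits = qdrant.search(
    collection_name="artifacts",
    query_vector=vector,
    query_filter=Filter(
        must=[
            FieldCondition(key="artifact_type", match=MatchValue(value=plan["artifact_type"])),
            FieldCondition(key="timestamp", range=Range(gte=plan["time_after"])),
        ]
    ),
    limit=20,
)
for hit in hits:
    print(hit.score, hit.payload)
```

Because the filters and the semantic string live in separate fields of the same plan, the user can adjust either part (e.g. widen the time window) before execution without re-phrasing the natural language query.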
### Key Innovation

**Generic artifact understanding:** The system doesn't need to "know" about TOX, WhatsApp, Signal, Telegram, or any specific application. Ingestion is a pre-scripted pre-processing operation which constructs a standardized data container with multiple pre-defined fields: a set of metadata holding data of machine origin, such as OS/application events with timestamps (similar to traditional forensic artifacts or timeline data), and a vector representation holding whatever provides semantic relevance for retrieval purposes (primarily, but not restricted to, content generated by user behavior). This means it works on artifacts from applications that don't even exist yet, on proprietary communication tools without public documentation, or even on arbitrary data which holds semantic information only.

### Architecture Philosophy

**Client-Server Separation:** Compute-intensive operations (embedding generation, LLM inference, vector search) can run on powerful remote infrastructure, while the GUI client remains lightweight and runs on the investigator's local machine. This enables:

- Shared infrastructure across investigation teams
- Scaling compute resources independently of workstations
- Deployment in both air-gapped labs and cloud environments
- Efficient resource utilization (GPU servers can serve multiple investigators)

## Development Setup

This project uses [uv](https://github.com/astral-sh/uv) for fast dependency management and the modern "src layout" structure.

### Prerequisites

- Python 3.13+
- [uv](https://github.com/astral-sh/uv) installed (`curl -LsSf https://astral.sh/uv/install.sh | sh`)

### Installation Steps

1. **Clone the repository**

   ```bash
   git clone <repository-url>
   cd semeion
   ```

2. **Create a virtual environment**

   This project requires Python 3.13.

   ```bash
   uv venv --python 3.13
   ```

3. **Activate the environment**

   - Linux/macOS:

     ```bash
     source .venv/bin/activate
     ```

   - Windows:

     ```powershell
     .venv\Scripts\activate
     ```

4. **Install dependencies**

   This command installs locked dependencies and links the local `semeion` package in editable mode.

   ```bash
   uv pip install -r requirements.txt -e .
   ```

### Running the Application

```bash
python src/semeion/main.py
```

### Running Tests

```bash
pytest
```

## Technical Stack

### Core Technologies

| Component            | Technology            | Version | License    | Purpose                        |
|----------------------|-----------------------|---------|------------|--------------------------------|
| GUI Framework        | PySide6               | TBD     | LGPL       | Desktop application interface  |
| Vector Database      | Qdrant                | 1.10+   | Apache 2.0 | Semantic search and storage    |
| LLM Inference        | OpenAI-compatible API | various | various    | Natural language understanding |
| LLM Models           | various               | various | various    | Query parsing                  |
| Embeddings           | OpenAI-compatible API | various | various    | Semantic vectors               |
| Embedding Model      | TBD                   | TBD     | TBD        | Text to vector conversion      |
| Forensic Parsing     | pytsk3                | TBD     | Apache 2.0 | Disk image processing          |
| Image Handling       | pyewf                 | TBD     | LGPL       | E01 image support              |
| NLP                  | spaCy                 | TBD     | MIT        | Entity extraction              |
| Programming Language | Python                | 3.13+   | PSF        | Application logic              |

### Infrastructure Requirements

#### Remote Processing (local client)

- CPU: multi-core
- RAM: ~4 GiB
- Storage: insignificant (local client; possibly evidence)
- GPU: irrelevant (compute runs on remote infrastructure)

#### Small-Scale Local Processing (local workstation)

- CPU: 8+ cores (modern Ryzen 5/Ryzen 7/Ryzen AI series)
- RAM: 32 GB minimum (16 GB for Qdrant, 8 GB for LLM, 8 GB for OS/apps)
- Storage: 2 TB SSD (1 TB evidence + 1 TB index)
- GPU: recommended, not required (speed considerations)

#### Recommended Configuration (Remote Server Compute Node)

- CPU: 16+ cores (Ryzen 7 or better)
- RAM: 64-128 GiB
- Storage: sufficient for index + evidence
- GPU: recommended, AMD Instinct MI50 32 GiB or better (inference, embeddings)
- Network: sufficient bandwidth

#### Enterprise Configuration (Multi-User)

- TBD, out of scope

## Supported Ingestion Formats

### Primary: Specialized Data Objects

TBD

### Secondary: Conversion Engine (algorithmic)

Examples (a minimal converter sketch follows below):

- SQLite parser for browser history → specialized data object
- Converter for TSK artifacts → metadata in specialized data object (TBD)
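As a concrete illustration of both the standardized data container and the conversion engine, here is a minimal sketch of a Firefox-history converter. The `ArtifactObject` class and all of its field names are hypothetical assumptions for illustration; only the `moz_places` table and its microsecond timestamps are actual Firefox specifics.

```python
# Minimal sketch, not the project's actual container: the ArtifactObject
# class and all field names are illustrative assumptions.
import sqlite3
from dataclasses import dataclass


@dataclass
class ArtifactObject:
    """Hypothetical standardized data container."""
    metadata: dict       # machine-origin data: timestamps, source, event type
    semantic_text: str   # whatever should be embedded for vector retrieval


def convert_firefox_history(places_db: str) -> list[ArtifactObject]:
    """Convert a Firefox places.sqlite history into standardized containers."""
    conn = sqlite3.connect(places_db)
    rows = conn.execute(
        "SELECT url, title, last_visit_date FROM moz_places "
        "WHERE last_visit_date IS NOT NULL"
    )
    artifacts = []
    for url, title, visit_us in rows:
        artifacts.append(
            ArtifactObject(
                metadata={
                    "source": "firefox_places",
                    "event_type": "browser_visit",
                    # Firefox stores visit times as microseconds since the epoch
                    "timestamp": visit_us / 1_000_000,
                    "url": url,
                },
                semantic_text=f"Visited web page: {title or url}",
            )
        )
    conn.close()
    return artifacts
```

Each resulting object would then be embedded (via its `semantic_text`) and stored in the vector database with its `metadata` as payload, so the same downstream search works regardless of which converter produced the object.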
## Use Case Scenarios

### Scenario 1: Drug Transaction Investigation

Query: "Find when the suspect made cryptocurrency payments after discussing deals"

Process:

1. System finds chat messages about drug deals
2. Extracts timestamps of deal discussions
3. Searches for cryptocurrency-related activity after each discussion
4. Correlates wallet launches, browser activity, blockchain transactions
5. Presents timeline showing: discussion → wallet launch → transaction

Evidence: Timeline of intent → action, strengthening the case

### Scenario 2: Data Exfiltration

Query: "Show file access before large uploads to cloud storage"

Process:

1. Identifies cloud storage upload events
2. Looks backward in time for file access
3. Correlates accessed files with uploaded data
4. Maps file paths to user actions

Evidence: Demonstrates what data was taken and when

### Scenario 3: Coordinated Activity

Query: "Find people who communicated privately and are also in the same groups"

Process:

1. Extracts participants from private messages
2. Extracts participants from group chats
3. Identifies overlap (intersection)
4. Shows sample conversations from each context

Evidence: Demonstrates coordinated behavior across communication channels

### Scenario 4: Timeline Reconstruction

Query: "What happened between receiving the threatening email and deleting files?"

Process:

1. Finds threatening email (semantic search)
2. Finds file deletion events (system logs)
3. Returns all artifacts between these timestamps
4. Visualizes complete timeline

Evidence: Establishes sequence of events and potential motive

## License

BSD 3-Clause (subject to change during development)