Skip to content

emmorais/blindsearch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🔍 BlindSearch

A Rust-powered tool for encrypting files securely using AES-GCM and enabling searchable encryption through deterministic tokens.


🔐 Overview

BlindSearch encrypts files using a password-derived key and stores the encrypted version with an .enc extension. It also optionally generates encrypted search tokens (keywords) that enable secure querying over encrypted data—without revealing the plaintext.

Highlights

  • 📂 Accepts a mandatory file path argument.
  • 🔑 Securely prompts for a password (no command-line exposure).
  • 🔒 Uses AES-256-GCM for authenticated encryption (confidentiality + integrity).
  • 🧂 Derives keys with Argon2id and stores salt in <filename>.salt.
  • 🔁 Supports:
    • 🔐 Encryption/decryption based on file extension.
    • 🔖 Search token generation with -g / --generate_keywords.
    • 🔍 Keyword search with -q <keyword> / --query <keyword>.

🚀 Installation

If you are using Ubuntu you need curl for downloading the installation script, and you need gcc, pkg-config and libssl-dev to build the source code.

On Ubuntu you can do:

sudo apt install curl gcc pkg-config libssl-dev

Install Rust: https://www.rust-lang.org/tools/install

Then after decompressing the zip file:

cd blindsearch
cargo build --release
cargo test

If you want to install it on you system:

cargo install --path .

Testing with cargo commands

cargo run -- input.txt            # encrypt
cargo run -- input.txt.enc        # decrypt
cargo run -- input.txt -g         # generate tokens
cargo run -- input.txt -q secrets # positive test
cargo run -- input.txt -q secret  # negative test

Examples:

Simple encryption:

> blindsearch input.txt
🔐 Enter your password: 

✅ Loaded existing salt from input.txt.salt
Encrypting file: input.txt
✅ Encrypted file: "input.txt.enc"
Blind Search finished successful!

Generate keywords:

> blindsearch input.txt -g
🔐 Enter your password: 

✅ Loaded existing salt from input.txt.salt
Encrypting file: input.txt
✅ Encrypted file: "input.txt.enc"
Generating keywords tokens in file: input.txt.keywords
Blind Search finished successful!

Query:

> blindsearch input.txt -q secrets
🔐 Enter your password: 

✅ Loaded existing salt from input.txt.salt
Querying for keyword: secrets
Keyword 'secrets' found in the encrypted database.

Large file

In order to generate a large.txt file, we did:

head -n 100000 /usr/share/dict/words > large.txt 

This resulted in a file of size ˜1 Mb.

Then we got the worst case with regards to the size of the keywords extracted from it, since there is no repeatition, which resulted in file size of 1.59 Mb.

🧠 Design

This project implements a simplified form of Searchable Symmetric Encryption (SSE), inspired by the construction in Figure 2 (page 21) of the paper:
Searchable Symmetric Encryption: Improved Definitions and Efficient Constructions by Curtmola et al.

We make the following simplifications and assumptions:

  • Since this tool operates on a single document at a time, we fix n = 1 (number of documents). Therefore, we do not append indices like w || i to the keywords, as the paper suggests for multi-document settings.
  • The document identifier is simply the filename. Avoid embedding sensitive information in filenames, as they are not encrypted.

To enable encrypted search:

  1. Deterministic encryption (e.g., AES in GCM mode) is used to convert each keyword into a fixed-length (16-byte) search token.
  2. These tokens are:
    • Truncated to a uniform length.
    • Shuffled before being written to the .keywords file to prevent revealing search patterns or keyword order.

An adversary:

  • Cannot generate valid tokens without the user's encryption key.
  • Cannot distinguish between valid and dummy tokens.

To query:

  • The user inputs a keyword.
  • It is deterministically encrypted using the same process and key.
  • The token is compared against the stored list.
  • If found, the keyword is confirmed to exist in the document.

Choice of algorithms

  • File encryption method: AES-GCM is part of many different standards, including NIST and FIPS, and is specifically adopted for encryption of data at rest by big companies like Amazon and Google, see the References.

  • Deterministic encryption: The original paper recommends using a PRP for generation of tokens. This means that the function used must be invertible and its output must be indistiguishable from random. Therefore, naive encryption would leak the keyword length. One solution for it is to use fixed length keywords, padding when needed. Here we simply adopt truncated AES-GCM with zero nonce. We leave as a future task to implement padding appropriately for different sizes of keywords like suggested in the paper.

🛡️ Threat Model

This design assumes the following:

  • The adversary may observe:
    • The encrypted file (.enc)
    • The salt (.salt)
    • The keyword tokens (.keywords)
  • The adversary does not know the user's password.
  • The adversary cannot:
    • Derive the encryption key from the password (thanks to Argon2id).
    • Generate valid tokens.
    • Access plaintext contents or keywords.

This model does not hide:

  • Access/query patterns (i.e., which tokens are queried and how often).

More specifically, the paper improves the model called IND2-CKA, where CPA-security is adapted to protect keywords. The high level idea is that an adversary can not distinguish keywords encryption from random data, therefore keywords are secure. The model must assume the adversary have no information about plaintexts, because deterministic encryption would always output the same result, hence the adversary would win the underlying security game. Beyond the basic model, the paper provides a simulation-based definition, which gives an extra guarantee.


⚠️ Limitations of current implementation

  • The number of keywords is revealed by the .keywords file.
  • This can be mitigated by:
    • Defining a maximum keyword capacity.
    • Padding unused slots with random dummy tokens.

⚖️ Trade-offs

  • ✅ Fast and simple implementation suitable for local single-file use.
  • ❌ Does not support complex queries, multi-document indexing, or access pattern privacy.
  • Other cryptographic solution could be used to solve all the privacy issues, like for example FHE, however it would be orders of magnitude slower in practice.

References

Searchable Symmetric Encryption: Improved Definitions and Efficient Constructions. Reza Curtmola, Juan Garay, Seny Kamara, Rafail Ostrovsky.

Practical Challenges with AES-GCM and the need for a new cipher. Panos Kampanakis, Matt Campagna, Eric Crocket, Adam Petcher, Shay Gueron.

Google cloud uses GCM

Searchable Encryption: The “Just Right” Balance of Data Privacy and Usability Blind Insight

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages