flashlight
The Problem
Picture this: you give your coding agent a task.
The first thing it does is try to understand your project. Every coding agent has grep built in, so the agent makes some naming assumptions, picks keywords, and searches. The trouble starts here.
If its assumptions don’t match the symbols that actually exist in your codebase, after a few failed attempts (maybe a few ls and read calls thrown in) it gives up and writes a new implementation from scratch. Congratulations — your project now has two versions of the same functionality with different names and different consumers.
OK, say the agent gets lucky and grep finds something. What happens next? A few scenarios: it reads those few matching lines, maybe reads a few more around them, maybe reads the whole file. Then, more likely than not, it starts implementing — with zero understanding of the callers, the upstream, or the broader system — based entirely on its own assumptions.
Your project starts rotting. It starts looking like what people call “vibe slop.”
I’ve lost count of how many times I’ve watched this happen and felt genuinely helpless. Those moments when your agent feels profoundly stupid. Countless times I’ve hit Escape in frustration, interrupted the agent, and demanded it go read more context before touching anything.
Better documentation, better workflows, better architecture, more human intervention — they all help, but the problem persists. Especially when your agent’s context crosses 200K tokens and its capabilities start degrading while your own patience is wearing thin. /clear means you have to reconstruct the entire conversation from scratch — re-explaining decisions, re-describing context that existed only in the previous session, especially the things that never made it into docs. /compact makes it worse: the agent forgets even the few lines it grep’d, and relies on a summary produced by a smaller, dumber model that misses the point or hallucinates, then charges ahead without pulling the actual code back into context.
What an agent needs is a global view: enough of a map to judge how much it must read before it actually understands the code it is about to change. But it can’t read everything, because that would burn through context instantly. Explore subagents help, but their output is secondhand: an LLM summary rather than actual code. Information degrades in transit, and plausible-sounding guesses multiply.
On top of all this, having an expensive coding model spend its time on ls-grep-read cycles doesn’t just waste time — you’re paying cache-read costs for work that shouldn’t be happening at the coding model’s rate.
Existing Solutions
Embedding-based RAG: 1) Similarity search has no real understanding of the code, so results are either too noisy or miss the target entirely. 2) No global comprehension of the codebase.
Augment Context Engine: Likely combines a code-specialized embedding model with better chunking strategies. It does a decent job addressing basic RAG retrieval shortcomings, but: 1) Still fundamentally an embedding model — it cannot reason across files or follow multi-hop dependency chains, so complex queries that require true codebase understanding fall short. 2) Results include irrelevant matches — the embedding model surfaces code that is semantically similar on the surface but unrelated in purpose. 3) Locked behind an expensive subscription ($20/mo minimum).
What Changed
When the DeepSeek V4 series launched, we got a model with 1M context, cheap input pricing, even cheaper cache-hit pricing, and long cache TTL. Suddenly a brute-force approach became viable: stuff the entire codebase into the model and let it reason about what’s relevant.
But you still need the right strategy — sharding for large projects, cache optimization to keep costs acceptable, and reliable extraction to get structured results back.
That’s what Flashlight does.
Design
Flashlight is an MCP server that exposes a single tool. The agent sends a natural language query; Flashlight loads the codebase into DeepSeek’s context, lets the model find relevant code, and returns validated snippets with line numbers.
It’s like working in a pitch-black warehouse. Grep is a laser pointer — it finds exactly one spot, but you have no idea what’s around it. What the agent actually needs is a flashlight: something that illuminates not just the code that looks relevant, but the surrounding structure that makes it genuinely understood.
Setup
npm install -g @1percentsync/flashlight
Add to your MCP client config (e.g. ~/.claude.json under mcpServers):
{
"flashlight": {
"command": "flashlight",
"env": {
"DEEPSEEK_API_KEY": "sk-..."
}
}
}
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| DEEPSEEK_API_KEY | (required) | DeepSeek API key |
| DEEPSEEK_BASE_URL | https://api.deepseek.com | API base URL (for proxies or compatible endpoints) |
| FLASHLIGHT_MODEL | deepseek-v4-flash | deepseek-v4-flash or deepseek-v4-pro |
| FLASHLIGHT_REASONING_EFFORT | max | Thinking effort: high or max |
| FLASHLIGHT_CHANGE_THRESHOLD | 0.1 | Ratio of changed tokens that triggers a full base rebuild |
| FLASHLIGHT_MAX_CONTEXT_TOKENS | 900000 | Token budget per shard (auto-sharding triggers when exceeded) |
Project Config
.flashlight/config.json in the workspace root:
| Field | Default | Description |
|---|---|---|
| ext_whitelist | [] | File extensions to index |
| ext_whitelist_override | false | true = only listed extensions; false = merge with built-in defaults |
Priority: project config > FLASHLIGHT_EXT_WHITELIST env var > built-in defaults (100+ extensions covering most languages, including shader languages like GLSL, HLSL, WGSL, and Metal).
Agent Interface
The MCP tool description includes the effective extension whitelist, so the agent knows exactly what file types are indexed.
Parameters:
| Parameter | Required | Description |
|---|---|---|
| query | yes | Natural language description of the code to find |
| scope | no | Relative directory path to narrow search |
| file_types | no | Extension filter, e.g. [".ts", ".py"] |
Returns: Matched code snippets with line numbers, extracted from the local file snapshot:
--- src/shard.ts:27-38 ---
27 export function computeShardPlan(...): ShardPlan {
28 const allFiles = [...snapshot.keys()];
...
Overlapping or adjacent ranges (within 3 lines) are merged before output. Results are validated against the actual snapshot — hallucinated file paths and out-of-range line numbers are filtered out, so the agent never sees fabricated code.
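The merging rule can be sketched as follows. This is a minimal illustration, not Flashlight's actual code: the `Range` shape and `mergeRanges` name are hypothetical, and it assumes "within 3 lines" means the next range starts at most 3 lines after the previous one ends.

```typescript
// Hypothetical result shape for illustration.
interface Range {
  file: string;
  start_line: number;
  end_line: number;
}

// Assumption: two ranges in the same file merge when the next one starts
// no more than `maxGap` lines after the previous one ends (overlaps included).
function mergeRanges(ranges: Range[], maxGap = 3): Range[] {
  // Sort by file, then by start line, so mergeable ranges become neighbours.
  const sorted = [...ranges].sort((a, b) =>
    a.file < b.file ? -1 : a.file > b.file ? 1 : a.start_line - b.start_line,
  );
  const merged: Range[] = [];
  for (const r of sorted) {
    const last = merged[merged.length - 1];
    if (last && last.file === r.file && r.start_line - last.end_line <= maxGap) {
      // Extend the previous range instead of emitting a new one.
      last.end_line = Math.max(last.end_line, r.end_line);
    } else {
      merged.push({ ...r });
    }
  }
  return merged;
}
```

Sorting first keeps the merge to a single pass, and copying each range avoids mutating the caller's input.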
LLM Request Structure
Each query is assembled as an array of user messages, concatenated in this order:
| # | Message | Content |
|---|---|---|
| 1 | System instructions | Role definition: “you are a code retrieval assistant.” Output rules: must always respond via tool call, never plain text. Ranking/formatting guidelines. |
| 2 | Base context | Every file in the snapshot, formatted as --- path (lines 1-N) --- with numbered lines. Files sorted by git commit time, oldest first — files that haven’t been touched in a long time are least likely to change, so placing them at the start of the prefix maximizes the stable portion that hits cache across queries. |
| 3 | Change context (conditional) | Only present when reusing a cached base. Contains files changed since the snapshot, tagged [UPDATED] or [DELETED]. |
| 4 | Query turn | Directory tree (with [CHANGED]/[DELETED] annotations if applicable), optional scope and file type filters, and the natural language query. |
In sharded mode, each shard gets its own variant of the system instructions that sets the expectation: “you are looking at a subset of the project; returning empty results is normal and expected.” Each shard also receives a directory tree that lists its own files individually while summarizing other shards’ directories as (N files, other shards) — preserving structural awareness without leaking file names that could trigger hallucinated results.
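To make the base context layout concrete, here is a rough sketch of how such a message could be built. The `SnapshotFile` shape, the `lastCommitTime` field, and the `formatBaseContext` name are assumptions for illustration; only the header format and the oldest-first ordering come from the table above.

```typescript
// Hypothetical snapshot entry; lastCommitTime is assumed to be a Unix timestamp from git.
interface SnapshotFile {
  path: string;
  content: string;
  lastCommitTime: number;
}

function formatBaseContext(files: SnapshotFile[]): string {
  // Oldest first, so the most stable files form the start of the cached prefix.
  // Ties are broken by path to keep the output deterministic across runs.
  const ordered = [...files].sort(
    (a, b) => a.lastCommitTime - b.lastCommitTime || (a.path < b.path ? -1 : 1),
  );
  return ordered
    .map((f) => {
      const lines = f.content.split("\n");
      const numbered = lines.map((line, i) => `${i + 1} ${line}`);
      return [`--- ${f.path} (lines 1-${lines.length}) ---`, ...numbered].join("\n");
    })
    .join("\n\n");
}
```

Deterministic ordering matters here because prefix caching only helps when the leading text is identical between requests.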
LLM Response
DeepSeek is forced to respond via a tool call to report_search_results:
{
"results": [
{ "file": "src/shard.ts", "start_line": 27, "end_line": 38 },
{ "file": "src/base.ts", "start_line": 71, "end_line": 103 }
]
}
If the model responds with plain text instead of a tool call, Flashlight retries up to 3 times with exponential backoff. API errors (429/500/503) are also retried; non-retryable errors (400/401/402) fail immediately.
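The retry policy might look roughly like the sketch below. All names (`Failure`, `Outcome`, `isRetryable`, `backoffDelayMs`, `withRetry`) and the 1-second base delay are hypothetical. It also assumes "up to 3 times" means three retries after the initial attempt.

```typescript
// Hypothetical failure model: either the model answered in plain text,
// or the API returned an HTTP error status.
type Failure = { kind: "no_tool_call" } | { kind: "api_error"; status: number };
type Outcome<T> = { ok: true; value: T } | { ok: false; failure: Failure };

const RETRYABLE_STATUS = new Set([429, 500, 503]);

function isRetryable(f: Failure): boolean {
  // Plain-text responses are always retried; API errors only for the listed statuses.
  return f.kind === "no_tool_call" || RETRYABLE_STATUS.has(f.status);
}

// Exponential backoff: the delay doubles with every retry (base delay is an assumption).
function backoffDelayMs(retryIndex: number, baseMs = 1000): number {
  return baseMs * Math.pow(2, retryIndex);
}

async function withRetry<T>(
  call: () => Promise<Outcome<T>>,
  maxRetries = 3,
): Promise<Outcome<T>> {
  let outcome = await call();
  for (let retry = 0; retry < maxRetries && !outcome.ok && isRetryable(outcome.failure); retry++) {
    await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(retry)));
    outcome = await call();
  }
  // Either a success, a non-retryable failure, or the last failure after all retries.
  return outcome;
}
```

Modelling failures as plain values rather than thrown exceptions keeps the retryable versus non-retryable decision in one small, testable function.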
The results are not passed through as-is. Flashlight validates each result against the local snapshot, normalizes line numbers, filters out entries referencing files outside the shard’s scope, and extracts the actual code from the snapshot. The agent receives real code, not model-generated text.
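A simplified version of this validation step is sketched below. The `SearchResult` shape mirrors the tool-call payload shown earlier, and the header line mirrors the returned snippet format. The `extractSnippets` name, the snapshot type, and the exact normalization rule (clamp to the file's bounds, drop ranges that end up empty) are assumptions.

```typescript
interface SearchResult {
  file: string;
  start_line: number;
  end_line: number;
}

// Hypothetical snapshot: file path -> array of lines.
type LineSnapshot = Map<string, string[]>;

function extractSnippets(results: SearchResult[], snapshot: LineSnapshot): string[] {
  const snippets: string[] = [];
  for (const r of results) {
    const lines = snapshot.get(r.file);
    if (lines === undefined) continue; // path not in the snapshot: treat as hallucinated

    // Assumed normalization: clamp the range to the file, using 1-based line numbers.
    const start = Math.max(1, Math.floor(r.start_line));
    const end = Math.min(lines.length, Math.floor(r.end_line));
    if (!(start <= end)) continue; // empty or fully out-of-range (also drops NaN)

    // The code itself always comes from the local snapshot, never from the model.
    const body = lines.slice(start - 1, end).map((text, i) => `${start + i} ${text}`);
    snippets.push([`--- ${r.file}:${start}-${end} ---`, ...body].join("\n"));
  }
  return snippets;
}
```

Because the model only ever supplies coordinates, a wrong answer can at worst point at irrelevant real code, not invented code.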
Cache Reuse Strategy
The entire economic model depends on DeepSeek’s prefix caching: if the same prefix is sent across queries, cached tokens cost ¥0.07/M versus ¥2/M for misses on deepseek-v4-flash, a roughly 28x cost reduction.
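As a rough worked example of what those prices mean per query, the sketch below computes input cost for a hypothetical 900,000-token prefix. The prices are the ones quoted above. The function name and the prefix size are illustrative, and output tokens are ignored.

```typescript
// Input prices in ¥ per million tokens, as quoted above for deepseek-v4-flash.
const PRICE_CACHE_HIT = 0.07;
const PRICE_CACHE_MISS = 2;

// Input cost of one request, split into cached and uncached tokens.
function inputCost(cachedTokens: number, uncachedTokens: number): number {
  return (cachedTokens * PRICE_CACHE_HIT + uncachedTokens * PRICE_CACHE_MISS) / 1e6;
}

const PREFIX_TOKENS = 900000; // hypothetical prefix size for illustration

const coldCost = inputCost(0, PREFIX_TOKENS); // whole prefix misses: about ¥1.80
const warmCost = inputCost(PREFIX_TOKENS, 0); // whole prefix hits: about ¥0.063
```

In practice a warm query also pays the miss rate on whatever follows the cached prefix, such as the change context and the query turn, which is why keeping that tail small matters.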
Flashlight’s caching works as follows:
- First query: all files are sent as the base context. The full request text and a SHA-256 hash per file are persisted to .flashlight/base.json.
- Subsequent queries: each file’s content is hashed and compared against the stored base.
- Change ratio = sum of changed file tokens / base total tokens.
- If ratio > threshold (default 10%): full rebuild. The base is regenerated and saved.
- If ratio ≤ threshold: the exact stored base text is reused (guaranteeing a prefix cache hit), and only changed/deleted files are appended as an incremental change context message.
This means that as long as the cumulative changes since the base was built stay under 10% of total tokens, every query reuses the cached prefix. Once the threshold is crossed, a full rebuild occurs and a new base is established — resetting the accumulation.
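The rebuild-or-reuse decision can be sketched like this. The types and the `decideCacheAction` name are hypothetical. The sketch also assumes that new and modified files count with their current token size and deleted files with their size in the base, which the description above does not specify.

```typescript
// Hypothetical per-file record: content hash plus token count.
interface FileEntry {
  hash: string;
  tokens: number;
}
type FileIndex = Map<string, FileEntry>;

interface CacheDecision {
  action: "rebuild" | "reuse";
  ratio: number;
  updated: string[]; // new or modified files
  deleted: string[];
}

function decideCacheAction(base: FileIndex, current: FileIndex, threshold = 0.1): CacheDecision {
  let baseTotal = 0;
  let changedTokens = 0;
  const updated: string[] = [];
  const deleted: string[] = [];

  base.forEach((entry, path) => {
    baseTotal += entry.tokens;
    if (!current.has(path)) {
      deleted.push(path);
      changedTokens += entry.tokens; // assumption: deletions count at base size
    }
  });
  current.forEach((entry, path) => {
    const old = base.get(path);
    if (old === undefined || old.hash !== entry.hash) {
      updated.push(path);
      changedTokens += entry.tokens; // assumption: edits count at current size
    }
  });

  // An empty base has nothing to reuse, so force a rebuild.
  const ratio = baseTotal === 0 ? 1 : changedTokens / baseTotal;
  return { action: ratio > threshold ? "rebuild" : "reuse", ratio, updated, deleted };
}
```

Note that the comparison is always against the stored base, not the previous query, so the ratio accumulates until a rebuild resets it.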
Sharding
When a project exceeds FLASHLIGHT_MAX_CONTEXT_TOKENS (default 900K), Flashlight automatically partitions the codebase:
- Recursive directory splitting: attempt to fit the whole project first. If it doesn’t fit, group files by top-level directory and check each group. If a group still overflows, recurse into its subdirectories. Continue until every group fits the budget.
- Incremental plan evolution: on subsequent queries, existing shard boundaries are reused. Only shards that overflow are re-split. New files not covered by any existing shard are adopted as orphans and assigned via the same splitting algorithm.
- Parallel execution: all relevant shards are queried concurrently via Promise.allSettled. If a scope parameter is provided, only shards whose prefix intersects the scope are queried.
- Per-shard hallucination filtering: each shard’s results are checked against its own file list. If the model returns a result referencing a file from a different shard (possible because the directory tree reveals that other shards’ directories exist), it’s silently filtered out.
- Merge and dedup: results from all shards are merged, deduplicated by file:start_line:end_line, and returned as a single result set.
Each shard maintains independent cache state (shard_{id}.json), so changes in one part of the codebase don’t invalidate the cache for unrelated shards.
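The recursive directory splitting step could be sketched as follows. This is an illustration under stated assumptions rather than the actual computeShardPlan logic: the `FileInfo` and `Shard` shapes and the `splitIntoShards` name are hypothetical, and files sitting directly in an overflowing directory are simply grouped into their own shard, a case the description above does not cover.

```typescript
interface FileInfo {
  path: string; // relative path with "/" separators
  tokens: number;
}
interface Shard {
  prefix: string; // directory prefix covered by this shard ("" = project root)
  files: FileInfo[];
}

function splitIntoShards(files: FileInfo[], budget: number, prefix = ""): Shard[] {
  if (files.length === 0) return [];
  const total = files.reduce((sum, f) => sum + f.tokens, 0);
  if (total <= budget) return [{ prefix, files }]; // the whole group fits

  // Group by the next directory level below the current prefix.
  const direct: FileInfo[] = [];
  const byDir = new Map<string, FileInfo[]>();
  for (const f of files) {
    const slash = f.path.indexOf("/", prefix.length);
    if (slash === -1) {
      direct.push(f); // file lives directly in this directory
      continue;
    }
    const dir = f.path.slice(0, slash + 1);
    const group = byDir.get(dir);
    if (group) group.push(f);
    else byDir.set(dir, [f]);
  }

  const shards: Shard[] = [];
  // Assumption: loose files at this level become one shard, even if oversized.
  if (direct.length > 0) shards.push({ prefix, files: direct });
  byDir.forEach((group, dir) => {
    shards.push(...splitIntoShards(group, budget, dir));
  });
  return shards;
}
```

Splitting along directory boundaries keeps related files in the same shard and gives each shard a stable prefix, which is what lets the scope filter and the per-shard cache files work independently.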