Code Update: Similarity Search with Co-Located Vectors and Access Control
A companion to “The Separation Tax: When You Split Data Between Your Relational And Vector DB“
In a recent post, I argued that separating vectors from relational data creates a separation tax. This shows up as costs you pay in correctness, complexity, performance, and observability. My main thesis is that when embeddings and metadata live together in one database, retrieval becomes a bounded act of reasoning under policy constraints, not just simple recall.
Today, I’m releasing an update to the CRDB Tuning Report Generator that demonstrates this pattern in action with semantic similarity search and access control guardrails all running in a single CockroachDB query.
What’s New
This update adds two major features that work together to prove the co-location thesis:
1. Semantic Similarity Search
Reports and findings now get OpenAI embeddings automatically when created. The system can then find semantically similar content without relying on keyword matching.
Frontend:
“Find Similar” button on every report detail page
Dialog showing top 5 similar reports with similarity scores
Real-time semantic search across existing reports
Backend:
Auto-embedding via OpenAI
text-embedding-3-small(1536 dimensions)Vector similarity using CockroachDB’s native
VECTOR(1536)typeSingle-query retrieval with the
<->distance operator
Database:
-- Co-located embeddings with metadata
CREATE TABLE reports (
id UUID PRIMARY KEY,
title STRING,
report_text STRING,
embedding VECTOR(1536), -- Semantic representation
customer_id UUID, -- Access control
region STRING, -- Compliance boundary
pii_flag BOOL, -- Content guardrail
status STRING, -- Workflow state
...
);
2. Access Control Demo
Three demo users showcase how guardrails work alongside similarity search:
Alice Johnson (
analyst_alice) – Can only see Acme Corp reportsBob Smith (
analyst_bob) – Can only see Globex Industries reportsCharlie Davis (
admin_charlie) – Admin with full access to all customers
Switch between users in the UI and watch the similarity results change based on their access level. Same semantic query, different results—because access control happens before retrieval, not after.
Why This Matters: The Single Query
With separated systems, you’d do this:
# Retrieve all, then filter
all_similar = vector_db.search(embedding, limit=100)
# Now fetch metadata for each to check access...
allowed = []
for report in all_similar:
metadata = relational_db.get(report.id)
if user_has_access(user_id, metadata.customer_id):
allowed.append(report)
if len(allowed) >= 5:
break
return allowed
With co-located vectors, it’s one query:
-- Filter and retrieve in one serializable snapshot
WITH authorized AS (
SELECT customer_id
FROM user_access
WHERE user_id = $user_id
)
SELECT
r.id,
r.title,
r.customer_id,
r.embedding <-> $query_embedding AS distance
FROM reports r
WHERE r.customer_id IN (SELECT * FROM authorized)
AND r.embedding IS NOT NULL
AND r.status IN ('published', 'in_review')
AND r.pii_flag = FALSE
ORDER BY distance
LIMIT 5;
What you gain:
Correctness – Access control enforced before retrieval; no “oops” moments
Simplicity – One database, one query, one transaction, one truth
Performance – Prefilter by indexed customer_id, then compute distance on small candidate set
Observability –
EXPLAIN ANALYZEshows the full plan: filters, joins, and vector distance together
Code Change Highlights
The similarity search integrates cleanly with the existing agentic architecture:
Main API (FastAPI) now includes:
GET /api/v1/reports/{id}/similar– Semantic search with access controlembedding_service.py– Generates embeddings via OpenAIAuto-embedding on report/finding creation
Frontend (React + TypeScript):
User selector in top navigation (demo users: Alice, Bob, Charlie)
“Find Similar” button triggers similarity search dialog
Results filtered by selected user’s access level
Automatic redirect when switching users on detail pages
Database (CockroachDB):
VECTOR(1536)columns for embeddingsAccess control via
user_accessandcustomerstablesSingle-query retrieval with
<->operatorACID transactions guarantee consistency
Get the Code
Repository: github.com/kikiya/agentic-tuning-reporter
git clone https://github.com/kikiya/agentic-tuning-reporter.git
cd agentic-tuning-reporter
Quick Start
See the complete SIMILARITY_SEARCH.md guide. Quick version:
# 1. Install dependencies
cd backend && pip install -r requirements.txt
# 2. Add OpenAI key to backend/.env (You'll need this to generate embeddings)
OPENAI_API_KEY=sk-proj-your-key-here
# 3. Run database migration
cockroach sql --file=schema-add-embeddings.sql
# 4. (Optional) Load demo users for access control
cockroach sql --file=test-data-access-control.sql
# 5. Restart services
cd backend && python main.py
# 6. Test it!
# Open http://localhost:5173 → Any report → Click "Find Similar"
# Switch between users (Alice/Bob/Charlie) to see access control
Core Demonstrations
Co-Location Pattern – Embeddings + metadata + guardrails in one database
Single-Query Retrieval – Similarity search with access control in one plan
ACID Guarantees – Embedding updates atomic with metadata changes
Access Control – Customer isolation, PII filtering, region boundaries
Observability –
EXPLAIN ANALYZEshows the full retrieval planDemo Architecture – User switching to showcase access control in action
Key Takeaways
From the original “Separation Tax” post:
“When semantics meet relations under ACID, retrieval becomes a bounded act of reasoning as opposed to simple recall. Co-locating vectors with metadata lets you express hypotheses as predicates and distances in the same query plan—what should be visible, which era matters, which regions and roles apply, how to weigh similarity against severity—and settle it with a single, auditable transaction.”
This code release demonstrates that thesis in practice. The result isn’t just “the model said these are similar,” but rather “the database proved these are the most similar items you’re allowed to see, under one serializable snapshot.”
While this is a proof-of-concept with areas for enhancement, the core pattern is pretty solid.
Performance Notes
Cost: OpenAI embeddings cost ~$0.02 per 1M tokens. For typical reports (500-2000 words), this is negligible—a few cents for thousands of reports.
Query Performance:
With proper indexes (
customer_id,status,pii_flag)Prefiltering reduces candidates before distance calculation
Exact distance search is fast on filtered sets (<1000 rows)
Add ANN indexes when dataset grows large (CockroachDB 24.x+ supports
ivfflat)
Contributing & Feedback
This is a learning resource built to help others navigate co-located vector architectures. If you:
Find bugs or improvements
Have questions about the implementation
Want to share how you adapted it
Built something similar
Open an issue or PR! I’m building this in public to demonstrate practical patterns for “AI as a collaborator” systems with deterministic foundations.
What’s Next
I’ll continue exploring the intersection of:
Agentic architectures with stateful reasoning
Co-located vectors for bounded retrieval
Guardrails that enforce correctness at the database layer
Human-in-the-loop workflows with AI assistance
