Dense Vector Search Parsers
Dense vector search parsers enable semantic similarity search using vector embeddings. These parsers work with dense vector fields in your Solr schema.
KNNQueryParser
The KNNQueryParser performs k-nearest neighbor search using pre-computed vector embeddings.
Solr Documentation: Dense Vector Search
Basic Usage
from taiyo.parsers import KNNQueryParser
# Your query vector (from your embedding model)
query_vector = [0.23, -0.45, 0.67, ...]
parser = KNNQueryParser(field="content_vector", vector=query_vector, top_k=10)
results = client.search(parser)
Parameters
parser = KNNQueryParser(
field="embedding", # Dense vector field name
vector=[...], # Query embedding vector
top_k=10, # Number of nearest neighbors
# Common parameters
rows=10,
start=0,
field_list=["id", "title", "content"],
filters=["status:active"],
)
Example
from taiyo.parsers import KNNQueryParser
from your_embedding_model import embed_text
# Generate query embedding
query_text = "machine learning algorithms"
query_vector = embed_text(query_text)
# Search for similar documents
parser = KNNQueryParser(
field="content_vector",
vector=query_vector,
top_k=20,
rows=10,
field_list=["id", "title", "author", "abstract"],
filters=["category:technology", "published_date:[NOW-1YEAR TO NOW]"],
)
results = client.search(parser)
for doc in results.docs:
print(f"{doc.title} by {doc.author}")
print(f"Score: {doc.score}")
Pre-Filtering
Pre-filtering narrows the candidate set before KNN search for better performance and relevance.
Implicit Pre-Filtering
By default, all filter queries are automatically applied as pre-filters:
parser = KNNQueryParser(
field="embedding",
vector=query_vector,
top_k=10,
filters=[
"category:science", # Only science articles
"language:en", # English only
"quality_score:[7 TO *]", # High quality only
],
)
# All filters automatically pre-filter the KNN search
Explicit Pre-Filtering
Use pre_filter to explicitly specify pre-filter conditions:
parser = KNNQueryParser(
field="embedding",
vector=query_vector,
top_k=10,
pre_filter=["category:science", "status:published"],
filters=["language:en"], # This becomes a post-filter
)
Tagged Filtering
Control which filters are used for pre-filtering with tags:
parser = KNNQueryParser(
field="embedding",
vector=query_vector,
top_k=10,
filters=[
"{!tag=prefilter}category:science",
"{!tag=prefilter}status:published",
"language:en", # Not tagged
],
include_tags=["prefilter"], # Only use filters with this tag for pre-filtering
)
# Or exclude specific tags
parser = KNNQueryParser(
field="embedding",
vector=query_vector,
top_k=10,
filters=["{!tag=postfilter}facet_field:value", "category:science"],
exclude_tags=["postfilter"], # Exclude this from pre-filtering
)
Query Serialization with .build()
KNN parsers serialize into Solr's local-params query format:
from taiyo.parsers import KNNQueryParser
parser = KNNQueryParser(
field="embedding",
vector=[0.1, 0.2, 0.3, 0.4, 0.5],
top_k=10,
pre_filter=["category:electronics"],
filters=["status:active"],
)
# Build query parameters as dictionary
params = parser.build()
# {
# 'q': '{!knn f=embedding topK=10 preFilter=category:electronics}[0.1,0.2,0.3,0.4,0.5]',
# 'fq': ['status:active'],
# 'rows': 10
# }
The generated q parameter follows the local-params syntax described in Apache Solr's Dense Vector Search Guide.
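When tagged filtering is used, the tag names are presumably passed through as the knn parser's includeTags / excludeTags local params. A rough sketch of what .build() might return under that assumption (short illustrative vector):
from taiyo.parsers import KNNQueryParser
parser = KNNQueryParser(
    field="embedding",
    vector=[0.1, 0.2, 0.3],
    top_k=10,
    filters=["{!tag=prefilter}category:science", "language:en"],
    include_tags=["prefilter"],
)
params = parser.build()
# Expected shape (illustrative):
# {
#     'q': '{!knn f=embedding topK=10 includeTags=prefilter}[0.1,0.2,0.3]',
#     'fq': ['{!tag=prefilter}category:science', 'language:en'],
#     'rows': 10
# }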
Compatibility with Other Clients
The dictionary output can be used with any HTTP client:
import httpx
parser = KNNQueryParser(field="product_vector", vector=[0.5, 0.3, 0.8, 0.2], top_k=20)
params = parser.build()
# Use with httpx or any HTTP library
response = httpx.get("http://localhost:8983/solr/products/select", params=params)
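Long embedding vectors can push the q parameter past typical URL length limits, so it is often safer to POST the same parameters as a form body. A minimal sketch using the dictionary from .build():
import httpx
from taiyo.parsers import KNNQueryParser
parser = KNNQueryParser(field="product_vector", vector=[0.5, 0.3, 0.8, 0.2], top_k=20)
params = parser.build()
# Send the parameters as an application/x-www-form-urlencoded body
response = httpx.post("http://localhost:8983/solr/products/select", data=params)
results = response.json()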
KNNTextToVectorQueryParser
The KNNTextToVectorQueryParser sends the query text to Solr, which encodes it into a vector using an embedding model configured on the server, then performs KNN search.
Solr Documentation: Text-to-Vector with KNN
Basic Usage
from taiyo.parsers import KNNTextToVectorQueryParser
parser = KNNTextToVectorQueryParser(
field="content_vector",
text="machine learning algorithms",
model="my-encoder-model",
top_k=10,
)
results = client.search(parser)
Parameters
parser = KNNTextToVectorQueryParser(
field="embedding", # Dense vector field name
text="search query", # Text query (will be encoded)
model="my-bert-encoder", # Model name in text-to-vector store (required)
top_k=10, # Number of nearest neighbors
# Common parameters
rows=10,
start=0,
field_list=["id", "title"],
filters=["status:active"],
)
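Under the hood this should serialize to Solr's knn_text_to_vector query parser (provided by the llm module). A rough sketch of the expected .build() output, assuming the same serialization pattern as KNNQueryParser and omitting fl/start:
params = parser.build()
# Expected shape (illustrative):
# {
#     'q': '{!knn_text_to_vector f=embedding model=my-bert-encoder topK=10}search query',
#     'fq': ['status:active'],
#     'rows': 10
# }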
Encoder Configuration
Server-side encoding requires Solr's llm module: the embedding model referenced by model must be registered in Solr's text-to-vector model store. The dense vector field itself is a regular DenseVectorField:
<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="768" similarityFunction="cosine"/>
<field name="content_vector" type="knn_vector" indexed="true" stored="true"/>
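The model registration itself goes through the collection's schema API. The model class and params depend entirely on your embedding provider, so everything below is a placeholder sketch; consult the Solr llm module documentation for the exact model JSON:
import httpx
# Placeholder model definition: "class" and "params" depend on the embedding
# provider wired into your Solr llm module setup.
model_definition = {
    "name": "my-encoder-model",
    "class": "dev.langchain4j.model.cohere.CohereEmbeddingModel",  # illustrative only
    "params": {"apiKey": "YOUR_API_KEY", "modelName": "embed-english-light-v3.0"},
}
response = httpx.put(
    "http://localhost:8983/solr/mycollection/schema/text-to-vector-model-store",
    json=model_definition,
)
response.raise_for_status()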
Example
from taiyo.parsers import KNNTextToVectorQueryParser
# No need to generate embeddings yourself!
parser = KNNTextToVectorQueryParser(
field="content_vector",
text="What are the latest advances in neural networks?",
model="bert-base", # Use configured model
rows=10,
field_list=["id", "title", "abstract", "published_date"],
sort="published_date desc",
filters=["category:ai", "published_date:[NOW-2YEARS TO NOW]"],
)
results = client.search(parser)
print(f"Found {results.num_found} similar documents")
for doc in results.docs:
print(f"\n{doc.title}")
print(f"Published: {doc.published_date}")
print(f"Similarity: {doc.score:.4f}")
VectorSimilarityQueryParser
The VectorSimilarityQueryParser returns every document whose similarity to the query vector meets a threshold, rather than a fixed number of nearest neighbors, giving more control over result quality and graph traversal.
Solr Documentation: Vector Similarity Function
Basic Usage
from taiyo.parsers import VectorSimilarityQueryParser
parser = VectorSimilarityQueryParser(
field="content_vector", vector=query_vector, min_return=0.7
)
results = client.search(parser)
Parameters
parser = VectorSimilarityQueryParser(
field="embedding", # Dense vector field name
vector=[...], # Query embedding vector
min_return=0.7, # Minimum similarity threshold for returned docs
min_traverse=0.5, # Optional: minimum similarity to continue traversal
# Common parameters
rows=10,
start=0,
field_list=["id", "title"],
filters=["status:active"],
)
Minimum Similarity Threshold
Filter out low-similarity results using min_return:
parser = VectorSimilarityQueryParser(
field="content_vector",
vector=query_vector,
min_return=0.75, # Only return docs with similarity >= 0.75
min_traverse=0.6, # Continue graph traversal for similarity >= 0.6
)
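Assuming the parser serializes to Solr's vectorSimilarity query parser, the built query should look roughly like this (vector abbreviated):
params = parser.build()
# Expected shape (illustrative):
# {
#     'q': '{!vectorSimilarity f=content_vector minReturn=0.75 minTraverse=0.6}[0.12,-0.03,...]',
#     'rows': 10
# }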
Example
from taiyo.parsers import VectorSimilarityQueryParser
from your_model import embed_text
query_text = "deep learning for computer vision"
query_vector = embed_text(query_text)
parser = VectorSimilarityQueryParser(
field="abstract_vector",
vector=query_vector,
min_return=0.7, # High similarity threshold
min_traverse=0.5, # Traverse threshold
rows=20,
field_list=["id", "title", "authors", "abstract", "citations"],
sort="score desc, citations desc", # Sort by similarity, then citations
filters=["field:computer_science", "year:[2020 TO *]", "citations:[10 TO *]"],
)
results = client.search(parser)
print(f"Found {results.num_found} highly similar papers")
for doc in results.docs:
print(f"\n{doc.title}")
print(f"Authors: {', '.join(doc.authors)}")
print(f"Similarity: {doc.score:.3f}")
print(f"Citations: {doc.citations}")
Parser Comparison
| Feature | KNN | KNN Text-to-Vector | Vector Similarity |
|---|---|---|---|
| Input | Vector | Text | Vector |
| Encoding | Client-side | Server-side | Client-side |
| Top-K | Yes | Yes | Yes (via rows) |
| Similarity Function | Defined on field | Defined on field | Defined on field |
| Min Similarity | No | No | Yes (min_return / min_traverse) |
| Best for | External encoders | Simpler setup | Fine-tuned control |
Schema Setup
Vector Field Definition
from taiyo.schema import FieldType, Field
# Define vector field type
vector_type = FieldType(
name="knn_vector_768",
class_name="solr.DenseVectorField",
vector_dimension=768,
similarity_function="cosine", # cosine, dot_product, or euclidean
)
# Add vector field
embedding_field = Field(
name="content_vector", type="knn_vector_768", indexed=True, stored=True
)
# Update schema
client.schema.add_field_type(vector_type)
client.schema.add_field(embedding_field)
Indexing Documents with Vectors
from your_model import embed_text
documents = [
{
"id": "doc1",
"title": "Introduction to Machine Learning",
"content": "Machine learning is a subset of artificial intelligence...",
"content_vector": embed_text(
"Machine learning is a subset of artificial intelligence..."
),
},
{
"id": "doc2",
"title": "Deep Learning Fundamentals",
"content": "Deep learning uses neural networks with multiple layers...",
"content_vector": embed_text(
"Deep learning uses neural networks with multiple layers..."
),
},
]
client.index(documents)
Best Practices
Choose Appropriate Vector Dimensions
Common embedding dimensions:
- BERT-base: 768
- BERT-large: 1024
- Sentence-BERT: 384 or 768
- OpenAI embeddings: 1536
- word2vec: 100-300
Match schema configuration to your embedding model:
vector_type = FieldType(
name="knn_vector", class_name="solr.DenseVectorField", vector_dimension=768
)
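A mismatch between embedding length and vector_dimension causes indexing and query errors, so it is worth failing fast on the client side. A small sketch (embed_text is the same hypothetical helper used in the examples above):
from your_model import embed_text
EXPECTED_DIM = 768  # must match vector_dimension in the schema
def check_dimension(vector):
    """Raise early if an embedding does not match the schema's dimension."""
    if len(vector) != EXPECTED_DIM:
        raise ValueError(f"expected {EXPECTED_DIM} dimensions, got {len(vector)}")
    return vector
query_vector = check_dimension(embed_text("machine learning algorithms"))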
Normalize Vectors
Normalizing embeddings to unit length keeps scores comparable across queries and makes dot_product behave like cosine similarity:
import numpy as np
def normalize_vector(vector):
    """Normalize a vector to unit length for cosine or dot_product similarity."""
    vector = np.asarray(vector, dtype=float)
    norm = np.linalg.norm(vector)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return (vector / norm).tolist()
# Use normalized vectors
query_vector = normalize_vector(raw_embedding)
parser = KNNQueryParser(field="content_vector", vector=query_vector, top_k=10)
Optimize Top-K
# Start with reasonable top_k
parser = KNNQueryParser(
field="embedding",
vector=query_vector,
top_k=100, # Get more candidates
)
# Then filter and re-rank
reranked_parser = parser.model_copy(
update={
"rows": 10, # Return only 10
"filters": ["category:relevant"],
"sort": "score desc",
}
)
results = client.search(reranked_parser)
Next Steps
- Learn about Sparse Parsers for keyword search
- Explore Spatial Parsers for geographic queries
- Add Faceting for aggregations
- Use Highlighting for result snippets