Schemas
Solr Schema Definition Models
Analyzer
Bases: BaseModel
Analyzer configuration for field types.
Defines how text is processed for indexing and querying.
Example
analyzer = Analyzer(
tokenizer=Tokenizer(name="standard"),
filters=[
Filter(name="lowercase"),
Filter(name="stop", ignore_case=True, words="stopwords.txt")
]
)
build(format='xml', indent=' ')
Serialize the analyzer configuration to the requested format.
CharFilter
Bases: BaseModel
Character filter for text analysis.
Applied before tokenization to preprocess text.
Example
from taiyo.schema.enums import SolrCharFilterFactory
# Using enum
char_filter = CharFilter(
solr_class=SolrCharFilterFactory.PATTERN_REPLACE,
pattern="([a-zA-Z])\\1+",
replacement="$1$1"
)
# Or using string
char_filter = CharFilter(
name="patternReplace",
pattern="([a-zA-Z])\\1+",
replacement="$1$1"
)
CopyField
Bases: BaseModel
Solr copyField directive for automatic field data copying.
Copy fields instruct Solr to automatically copy data from a source field (or pattern) to a destination field during indexing. The destination field receives the original pre-analyzed text, which is then analyzed according to its own field type.
This enables:
- Catch-all search fields aggregating multiple source fields
- Same content analyzed differently for different purposes
- Faceting on different field types than used for searching
- Creating summary fields with character limits
Attributes:

| Name | Type | Description |
|---|---|---|
| source | str | Source field name (supports wildcards like '*_txt' or 'attr_*') |
| dest | str | Destination field name (must be a defined field, no wildcards) |
| maxChars | str | Optional character limit for copied content (truncates if exceeded) |
Example
from taiyo.schema import CopyField
# Copy single field to catch-all
title_copy = CopyField(source="title", dest="text")
content_copy = CopyField(source="content", dest="text")
# Copy all text fields with wildcard
all_text_copy = CopyField(source="*_txt", dest="text")
# Copy with character limit for summaries
summary_copy = CopyField(
source="content",
dest="content_summary",
maxChars=500
)
# Copy for different analysis
# (e.g., stemmed text to unstemmed for exact phrase matching)
exact_copy = CopyField(source="description", dest="description_exact")
# Copy multiple dynamic fields
multi_lang_copy = CopyField(source="title_*", dest="title_all")
Note
- Copies happen before analysis of the destination field
- Destination field must be defined (cannot be dynamic)
- Wildcards only work in source, not destination
- maxChars truncates at character boundary, may split words
- Copying is one-way; changes to dest don't affect source
Reference
https://solr.apache.org/guide/solr/latest/indexing-guide/copy-fields.html
build(format='xml', indent='')
Serialize the copyField directive to the requested format.
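A quick usage sketch; the XML in the comment is an assumed output shape for orientation, not a guaranteed rendering:

from taiyo.schema import CopyField

directive = CopyField(source="content", dest="content_summary", maxChars=500)
print(directive.build(format="xml"))
# expected shape: <copyField source="content" dest="content_summary" maxChars="500"/>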
Filter
Bases: BaseModel
Token filter for text analysis.
Applied after tokenization to transform tokens.
Example
from taiyo.schema.enums import SolrFilterFactory
# Using enum
filter = Filter(solr_class=SolrFilterFactory.LOWER_CASE)
# Using name
filter = Filter(name="lowercase")
# With parameters using enum
filter = Filter(
solr_class=SolrFilterFactory.STOP,
ignore_case=True,
words="stopwords.txt"
)
# With parameters using name
filter = Filter(
name="stop",
ignore_case=True,
words="stopwords.txt"
)
Schema
Bases: BaseModel
Complete Solr schema definition with all components.
A Schema represents the complete structure of a Solr collection, defining how documents are indexed, stored, and searched. It combines field definitions, field types, copy field directives, and configuration into a single model that can be serialized to XML or JSON.
The Schema supports both:
- Classic schema.xml format (XML serialization)
- Schema API format (JSON serialization)
Attributes:

| Name | Type | Description |
|---|---|---|
| name | Optional[str] | Optional schema name identifier |
| version | Optional[float] | Schema version number (typically 1.6 for modern Solr) |
| uniqueKey | Optional[str] | Field name to use as unique document identifier (commonly 'id') |
| fields | List[SolrField] | List of field definitions |
| dynamicFields | List[SolrDynamicField] | List of dynamic field patterns |
| fieldTypes | List[SolrFieldType] | List of field type definitions |
| copyFields | List[CopyField] | List of copy field directives |
Example
from taiyo.schema import (
Schema, SolrField, SolrDynamicField,
SolrFieldType, SolrFieldClass, CopyField
)
from taiyo.schema.field_type import Analyzer, Tokenizer, Filter
# Define field types
text_type = SolrFieldType(
name="text_general",
solr_class=SolrFieldClass.TEXT,
position_increment_gap=100,
analyzer=Analyzer(
tokenizer=Tokenizer(name="standard"),
filters=[
Filter(name="lowercase"),
Filter(name="stop", words="stopwords.txt")
]
)
)
# Define fields
id_field = SolrField(
name="id",
type="string",
indexed=True,
stored=True,
required=True
)
title_field = SolrField(
name="title",
type="text_general",
indexed=True,
stored=True
)
# Define dynamic fields
text_dynamic = SolrDynamicField(
name="*_txt",
type="text_general",
indexed=True,
stored=True
)
# Define copy fields
title_copy = CopyField(source="title", dest="text")
# Build complete schema
schema = Schema(
name="my_collection",
version=1.6,
uniqueKey="id",
fields=[id_field, title_field],
dynamicFields=[text_dynamic],
fieldTypes=[text_type],
copyFields=[title_copy]
)
# Serialize to XML for schema.xml
xml_output = schema.build(format="xml")
with open("schema.xml", "w") as f:
f.write(xml_output)
# Serialize to JSON for Schema API
json_output = schema.build(format="json")
# Use builder pattern
schema = (
Schema(name="my_schema", version=1.6, uniqueKey="id")
.add_field_type(text_type)
.add_field(id_field)
.add_field(title_field)
.add_dynamic_field(text_dynamic)
.add_copy_field(title_copy)
)
Reference
https://solr.apache.org/guide/solr/latest/indexing-guide/schema-elements.html
https://solr.apache.org/guide/solr/latest/indexing-guide/schema-api.html
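The JSON output targets Solr's Schema API. Below is a minimal sketch of pushing it to a running core with requests; the endpoint is the standard Schema API path, but whether build(format="json") emits a payload the API accepts as-is is an assumption to verify against your Solr version:

import requests

# Reuses the `schema` object built in the example above.
json_output = schema.build(format="json")
response = requests.post(
    "http://localhost:8983/solr/my_collection/schema",
    data=json_output,
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()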
add_copy_field(copy_field)
Add a copy field to the schema (builder pattern).
add_dynamic_field(field)
Add a dynamic field to the schema (builder pattern).
add_field(field)
Add a field to the schema (builder pattern).
add_field_type(field_type)
Add a field type to the schema (builder pattern).
SolrCharFilterFactory
Type-safe enum for Solr/Lucene char filter factory classes.
Char filters process the raw character stream before tokenization, performing
operations like HTML stripping or character mapping.
Each enum member maps to a char filter factory class name using the short notation
(e.g., solr.HTMLStripCharFilterFactory).
Categories
- HTML/Markup: HTML_STRIP
- Pattern-based: PATTERN_REPLACE
- Mapping: MAPPING
- ICU: ICU_NORMALIZER2
Example
from taiyo.schema.enums import SolrCharFilterFactory
analyzer_config = {
"charFilters": [
{"class": SolrCharFilterFactory.HTML_STRIP},
{"class": SolrCharFilterFactory.MAPPING, "mapping": "mapping-FoldToASCII.txt"}
],
"tokenizer": {"class": "solr.StandardTokenizerFactory"}
}
Reference
https://solr.apache.org/guide/solr/latest/indexing-guide/charfilters.html
SolrDynamicField
Bases: SolrField
Dynamic field pattern for automatic field creation.
Dynamic fields use wildcard patterns (* prefix or suffix) to automatically configure fields that match the pattern. When a document contains a field name matching a dynamic field pattern, Solr automatically creates and configures that field using the dynamic field's settings.
Dynamic fields are particularly useful for:
- Handling varying field names in semi-structured data
- Supporting multi-language fields (e.g., title_en, title_fr)
- Creating type-specific field groups (e.g., *_txt, *_i, *_dt)
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | Dynamic field pattern (e.g., '*_txt', '*_s', 'attr_*') |
| type | str | Field type name (must reference a defined fieldType) |
Example
from taiyo.schema import SolrDynamicField
# Text fields with suffix pattern
text_dynamic = SolrDynamicField(
name="*_txt",
type="text_general",
indexed=True,
stored=True,
multi_valued=True
)
# String fields with suffix pattern
string_dynamic = SolrDynamicField(
name="*_s",
type="string",
indexed=True,
stored=True,
doc_values=True
)
# Integer fields with suffix pattern
int_dynamic = SolrDynamicField(
name="*_i",
type="pint",
indexed=True,
stored=True,
doc_values=True
)
# Date fields with suffix pattern
date_dynamic = SolrDynamicField(
name="*_dt",
type="pdate",
indexed=True,
stored=True,
doc_values=True
)
# Attribute fields with prefix pattern
attr_dynamic = SolrDynamicField(
name="attr_*",
type="text_general",
indexed=True,
stored=True
)
# Ignored fields (no indexing or storage)
ignored_dynamic = SolrDynamicField(
name="ignored_*",
type="string",
indexed=False,
stored=False
)
Note
- Patterns must contain exactly one asterisk (*)
- Asterisk can be at beginning or end only
- More specific patterns take precedence over less specific ones
- If multiple patterns match, the longest match wins (see the sketch below)
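For example (a hypothetical sketch; the text_en type name is assumed for illustration, not part of the library):

from taiyo.schema import SolrDynamicField

generic = SolrDynamicField(name="*_txt", type="text_general")
english = SolrDynamicField(name="*_en_txt", type="text_en")  # hypothetical type
# A document field named "title_en_txt" matches both patterns;
# Solr applies "*_en_txt" because it is the longer (more specific) match.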
Reference
https://solr.apache.org/guide/solr/latest/indexing-guide/dynamic-fields.html
build(format='xml', indent='')
Serialize the dynamic field definition to the requested format.
SolrField
Bases: BaseModel
Solr field definition specifying data indexing and storage behavior.
Fields are named data containers that reference a field type to determine how their values are analyzed, indexed, and stored. Each field can override default behaviors inherited from its field type.
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | Field identifier (alphanumeric/underscore, no leading digit) |
| type | str | Field type name (must reference a defined fieldType) |
| default | Optional[Any] | Default value when not provided in documents |
| indexed | Optional[bool] | Enable field in queries and sorting (default: true) |
| stored | Optional[bool] | Store original value for retrieval (default: true) |
| doc_values | Optional[bool] | Column-oriented storage for sorting/faceting (default: true) |
| multi_valued | Optional[bool] | Allow multiple values per field (default: false) |
| required | Optional[bool] | Reject documents missing this field (default: false) |
| omit_norms | Optional[bool] | Disable length normalization (default: true for non-analyzed) |
| omit_term_freq_and_positions | Optional[bool] | Omit term frequency and positions |
| omit_positions | Optional[bool] | Omit positions but keep term frequency |
| term_vectors | Optional[bool] | Store term vectors for highlighting (default: false) |
| term_positions | Optional[bool] | Store term positions in vectors |
| term_offsets | Optional[bool] | Store term offsets in vectors |
| term_payloads | Optional[bool] | Store term payloads in vectors |
| sort_missing_first | Optional[bool] | Sort docs without this field first |
| sort_missing_last | Optional[bool] | Sort docs without this field last |
| uninvertible | Optional[bool] | Allow un-inverting when indexed=true, docValues=false |
| use_doc_values_as_stored | Optional[bool] | Return docValues when using '*' in fl param |
| large | Optional[bool] | Lazy load values >512KB (requires stored=true, multiValued=false) |
Example
from taiyo.schema import SolrField
# Unique ID field
id_field = SolrField(
name="id",
type="string",
indexed=True,
stored=True,
required=True
)
# Text field for full-text search
title_field = SolrField(
name="title",
type="text_general",
indexed=True,
stored=True
)
# Multi-valued field
tags_field = SolrField(
name="tags",
type="string",
indexed=True,
stored=True,
multi_valued=True
)
# Field with docValues for faceting
category_field = SolrField(
name="category",
type="string",
indexed=True,
stored=True,
doc_values=True
)
# Numeric field with default value
price_field = SolrField(
name="price",
type="pdouble",
indexed=True,
stored=True,
default=0.0
)
# Version field (internal)
version_field = SolrField(
name="_version_",
type="plong",
indexed=False,
stored=False
)
Reference
https://solr.apache.org/guide/solr/latest/indexing-guide/fields.html
build(format='xml', indent='')
Serialize the field definition to the requested format.
SolrFieldClass
Type-safe enum for Solr field type implementation classes.
Each enum member maps to a Solr field type class name using the short notation
(e.g., solr.TextField). These can be used when defining SolrFieldType instances
to ensure correct class names and enable IDE autocompletion.
Categories
- Text/String: TEXT, STR, SORTABLE_TEXT, COLLATION, ICU_COLLATION, ENUM
- Boolean/Binary: BOOL, BINARY, UUID
- Numeric: INT_POINT, LONG_POINT, FLOAT_POINT, DOUBLE_POINT, DATE_POINT
- Date/Currency: DATE_RANGE, CURRENCY
- Spatial: LATLON_POINT_SPATIAL, BBOX, SPATIAL_RPT, RPT_WITH_GEOMETRY, POINT
- Special: RANDOM_SORT, RANK, NEST_PATH, PRE_ANALYZED
- Vector/ML: DENSE_VECTOR
Example
from taiyo.schema import SolrFieldType, SolrFieldClass
# Text field type
text_type = SolrFieldType(
name="text_en",
solr_class=SolrFieldClass.TEXT,
analyzer=...
)
# Numeric field type
int_type = SolrFieldType(
name="pint",
solr_class=SolrFieldClass.INT_POINT
)
# Vector field type
vector_type = SolrFieldType(
name="vector",
solr_class=SolrFieldClass.DENSE_VECTOR,
vectorDimension=768
)
Reference
https://solr.apache.org/guide/solr/latest/indexing-guide/field-types-included-with-solr.html
SolrFieldType
Bases: BaseModel
Represents a Solr field type definition.
Field types define how data is analyzed and stored.
Example
from taiyo.schema import SolrFieldType, Analyzer, Tokenizer, Filter
from taiyo.schema.enums import (
SolrFieldClass,
SolrTokenizerFactory,
SolrFilterFactory
)
# Using enums (recommended)
field_type = SolrFieldType(
name="text_general",
solr_class=SolrFieldClass.TEXT,
position_increment_gap=100,
analyzer=Analyzer(
tokenizer=Tokenizer(solr_class=SolrTokenizerFactory.STANDARD),
filters=[
Filter(solr_class=SolrFilterFactory.LOWER_CASE),
Filter(solr_class=SolrFilterFactory.STOP, ignore_case=True, words="stopwords.txt")
]
)
)
# Or using strings (also supported)
field_type = SolrFieldType(
name="text_general",
solr_class="solr.TextField",
position_increment_gap=100,
analyzer=Analyzer(
tokenizer=Tokenizer(name="standard"),
filters=[
Filter(name="lowercase"),
Filter(name="stop", ignore_case=True, words="stopwords.txt")
]
)
)
Reference
https://solr.apache.org/guide/solr/latest/indexing-guide/field-type-definitions-and-properties.html
build(format='xml', indent='')
Serialize the field type definition to the requested format.
validate_field_class(v)
classmethod
Accept both enum and string values.
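The behavior can be pictured with a minimal pydantic sketch; the wiring below is an illustrative assumption, not the library's actual implementation:

from pydantic import BaseModel, field_validator
from taiyo.schema import SolrFieldClass

class EnumOrString(BaseModel):
    solr_class: str

    @field_validator("solr_class", mode="before")
    @classmethod
    def validate_field_class(cls, v):
        # Enum members carry the class name in .value; plain strings pass through.
        return v.value if hasattr(v, "value") else v

EnumOrString(solr_class=SolrFieldClass.TEXT)   # enum accepted
EnumOrString(solr_class="solr.TextField")      # string accepted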
SolrFilterFactory
Type-safe enum for Solr/Lucene filter factory classes.
Filters modify, remove, or add tokens in the token stream produced by tokenizers.
Each enum member maps to a filter factory class name using the short notation
(e.g., solr.LowerCaseFilterFactory).
Categories
- Case: LOWER_CASE, UPPER_CASE, TURKISH_LOWER_CASE
- Stemming: PORTER_STEM, SNOWBALL_PORTER, KS_STEM, various language stems
- Stop words: STOP, SUGGEST_STOP
- Synonyms: SYNONYM_GRAPH, SYNONYM (deprecated)
- N-grams: EDGE_NGRAM, NGRAM, SHINGLE
- Phonetic: BEIDER_MORSE, DAITCH_MOKOTOFF_SOUNDEX, DOUBLE_METAPHONE, METAPHONE, PHONEX, REFINED_SOUNDEX, SOUNDEX
- Word analysis: WORD_DELIMITER_GRAPH, WORD_DELIMITER (deprecated)
- Language-specific: Multiple for various languages
- Special purpose: ASCII_FOLDING, TRUNCATE, REVERSE, TRIM, PROTECTED_TERM
Example
from taiyo.schema.enums import SolrFilterFactory
analyzer_config = {
"tokenizer": {"class": "solr.StandardTokenizerFactory"},
"filters": [
{"class": SolrFilterFactory.LOWER_CASE},
{"class": SolrFilterFactory.STOP, "words": "stopwords.txt"},
{"class": SolrFilterFactory.PORTER_STEM}
]
}
Reference
https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html
SolrTokenizerFactory
Type-safe enum for Solr/Lucene tokenizer factory classes.
Tokenizers break text into tokens (words) that can be further processed by filters.
Each enum member maps to a tokenizer factory class name using the short notation
(e.g., solr.StandardTokenizerFactory).
Categories
- Standard: STANDARD, CLASSIC, WHITESPACE, KEYWORD
- Pattern-based: PATTERN, SIMPLE_PATTERN, SIMPLE_PATTERN_SPLIT
- Path/Email: PATH_HIERARCHY, UAX29_URL_EMAIL
- Language-specific: THAI, KOREAN, JAPANESE, ICU
- N-gram: EDGE_NGRAM, NGRAM
- OpenNLP: OPENNLP
Example
from taiyo.schema import SolrFieldType, Analyzer, Tokenizer
from taiyo.schema.enums import SolrTokenizerFactory
field_type = SolrFieldType(
    name="text_standard",
    solr_class="solr.TextField",
    analyzer=Analyzer(
        tokenizer=Tokenizer(solr_class=SolrTokenizerFactory.STANDARD)
    )
)
Reference
https://solr.apache.org/guide/solr/latest/indexing-guide/tokenizers.html
Tokenizer
Bases: BaseModel
Tokenizer for text analysis.
Splits text into tokens.
Example
from taiyo.schema.enums import SolrTokenizerFactory
# Using enum
tokenizer = Tokenizer(solr_class=SolrTokenizerFactory.STANDARD)
# Or using name
tokenizer = Tokenizer(name="standard")
# Or using string class
tokenizer = Tokenizer(solr_class="solr.StandardTokenizerFactory")