Schemas

Solr Schema Definition Models

Analyzer

Bases: BaseModel

Analyzer configuration for field types.

Defines how text is processed for indexing and querying.

Example
analyzer = Analyzer(
    tokenizer=Tokenizer(name="standard"),
    filters=[
        Filter(name="lowercase"),
        Filter(name="stop", ignore_case=True, words="stopwords.txt")
    ]
)

build(format='xml', indent=' ')

Build analyzer definition in specified format.

Parameters:

  format (str): Output format - "xml" (default) or "json". Default: 'xml'
  indent (str): Indentation prefix for XML output. Default: ' '

Returns:

  Dict[str, Any] | str: XML string for format="xml", dict for format="json"
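As a rough sketch of the two shapes build can return, using only the stdlib (the dict follows Solr's Schema API component notation and the XML follows classic schema.xml markup; neither is actual taiyo output):

```python
# Illustrative sketch of the two analyzer output shapes.
import xml.etree.ElementTree as ET

# format="json": a plain dict of analyzer components
analyzer_json = {
    "tokenizer": {"class": "solr.StandardTokenizerFactory"},
    "filters": [
        {"class": "solr.LowerCaseFilterFactory"},
        {"class": "solr.StopFilterFactory", "ignoreCase": "true", "words": "stopwords.txt"},
    ],
}

# format="xml": a nested <analyzer> element rendered to a string
root = ET.Element("analyzer")
ET.SubElement(root, "tokenizer", {"class": "solr.StandardTokenizerFactory"})
ET.SubElement(root, "filter", {"class": "solr.LowerCaseFilterFactory"})
analyzer_xml = ET.tostring(root, encoding="unicode")
print(analyzer_xml)
```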

CharFilter

Bases: BaseModel

Character filter for text analysis.

Applied before tokenization to preprocess text.

Example
from taiyo.schema.enums import SolrCharFilterFactory

# Using enum
char_filter = CharFilter(
    solr_class=SolrCharFilterFactory.PATTERN_REPLACE,
    pattern="([a-zA-Z])\\1+",
    replacement="$1$1"
)

# Or using string
char_filter = CharFilter(
    name="patternReplace",
    pattern="([a-zA-Z])\\1+",
    replacement="$1$1"
)

build(format='xml')

Build char filter definition.

Parameters:

  format (str): Output format - only "json" is supported for components. Default: 'xml'

Returns:

  Dict[str, Any]: dict representation

validate_class(v) classmethod

Accept both enum and string values.

CopyField

Bases: BaseModel

Solr copyField directive for automatic field data copying.

Copy fields instruct Solr to automatically copy data from a source field (or pattern) to a destination field during indexing. The destination field receives the original pre-analyzed text, which is then analyzed according to its own field type.

This enables:
  • Catch-all search fields aggregating multiple source fields
  • Same content analyzed differently for different purposes
  • Faceting on different field types than used for searching
  • Creating summary fields with character limits

Attributes:

  source (str): Source field name (supports wildcards like '*_txt' or 'attr_*')
  dest (str): Destination field name (must be a defined field, no wildcards)
  maxChars (int): Optional character limit for copied content (truncates if exceeded)

Example
from taiyo.schema import CopyField

# Copy single field to catch-all
title_copy = CopyField(source="title", dest="text")
content_copy = CopyField(source="content", dest="text")

# Copy all text fields with wildcard
all_text_copy = CopyField(source="*_txt", dest="text")

# Copy with character limit for summaries
summary_copy = CopyField(
    source="content",
    dest="content_summary",
    maxChars=500
)

# Copy for different analysis
# (e.g., stemmed text to unstemmed for exact phrase matching)
exact_copy = CopyField(source="description", dest="description_exact")

# Copy multiple dynamic fields
multi_lang_copy = CopyField(source="title_*", dest="title_all")
Note
  • Copies happen before analysis of the destination field
  • Destination field must be defined (cannot be dynamic)
  • Wildcards only work in source, not destination
  • maxChars truncates at character boundary, may split words
  • Copying is one-way; changes to dest don't affect source
Reference

https://solr.apache.org/guide/solr/latest/indexing-guide/copy-fields.html
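For orientation, the classic schema.xml element a copy field directive serializes to can be sketched with the stdlib (standard Solr markup; the helper below is illustrative, not taiyo code):

```python
import xml.etree.ElementTree as ET
from typing import Optional

def copy_field_xml(source: str, dest: str, max_chars: Optional[int] = None) -> str:
    # Hypothetical helper; emits the standard Solr copyField element.
    attrs = {"source": source, "dest": dest}
    if max_chars is not None:
        attrs["maxChars"] = str(max_chars)  # XML attribute values are strings
    return ET.tostring(ET.Element("copyField", attrs), encoding="unicode")

print(copy_field_xml("content", "content_summary", max_chars=500))
# <copyField source="content" dest="content_summary" maxChars="500" />
```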

build(format='xml', indent='')

Build copy field definition in specified format.

Parameters:

  format (str): Output format - "xml" (default) or "json". Default: 'xml'
  indent (str): Indentation prefix for XML output. Default: ''

Returns:

  Dict[str, Any] | str: XML string for format="xml", dict for format="json"

Filter

Bases: BaseModel

Token filter for text analysis.

Applied after tokenization to transform tokens.

Example
from taiyo.schema.enums import SolrFilterFactory

# Using enum
filter = Filter(solr_class=SolrFilterFactory.LOWER_CASE)

# Using name
filter = Filter(name="lowercase")

# With parameters using enum
filter = Filter(
    solr_class=SolrFilterFactory.STOP,
    ignore_case=True,
    words="stopwords.txt"
)

# With parameters using name
filter = Filter(
    name="stop",
    ignore_case=True,
    words="stopwords.txt"
)

build(format='xml')

Build filter definition.

Parameters:

  format (str): Output format - only "json" is supported for components. Default: 'xml'

Returns:

  Dict[str, Any]: dict representation

validate_class(v) classmethod

Accept both enum and string values.

Schema

Bases: BaseModel

Complete Solr schema definition with all components.

A Schema represents the complete structure of a Solr collection, defining how documents are indexed, stored, and searched. It combines field definitions, field types, copy field directives, and configuration into a single model that can be serialized to XML or JSON.

The Schema supports both:
  • Classic schema.xml format (XML serialization)
  • Schema API format (JSON serialization)

Attributes:

  name (Optional[str]): Optional schema name identifier
  version (Optional[float]): Schema version number (typically 1.6 for modern Solr)
  uniqueKey (Optional[str]): Field name to use as the unique document identifier (commonly 'id')
  fields (List[SolrField]): List of field definitions
  dynamicFields (List[SolrDynamicField]): List of dynamic field patterns
  fieldTypes (List[SolrFieldType]): List of field type definitions
  copyFields (List[CopyField]): List of copy field directives

Example
from taiyo.schema import (
    Schema, SolrField, SolrDynamicField,
    SolrFieldType, SolrFieldClass, CopyField
)
from taiyo.schema.field_type import Analyzer, Tokenizer, Filter

# Define field types
text_type = SolrFieldType(
    name="text_general",
    solr_class=SolrFieldClass.TEXT,
    position_increment_gap=100,
    analyzer=Analyzer(
        tokenizer=Tokenizer(name="standard"),
        filters=[
            Filter(name="lowercase"),
            Filter(name="stop", words="stopwords.txt")
        ]
    )
)

# Define fields
id_field = SolrField(
    name="id",
    type="string",
    indexed=True,
    stored=True,
    required=True
)

title_field = SolrField(
    name="title",
    type="text_general",
    indexed=True,
    stored=True
)

# Define dynamic fields
text_dynamic = SolrDynamicField(
    name="*_txt",
    type="text_general",
    indexed=True,
    stored=True
)

# Define copy fields
title_copy = CopyField(source="title", dest="text")

# Build complete schema
schema = Schema(
    name="my_collection",
    version=1.6,
    uniqueKey="id",
    fields=[id_field, title_field],
    dynamicFields=[text_dynamic],
    fieldTypes=[text_type],
    copyFields=[title_copy]
)

# Serialize to XML for schema.xml
xml_output = schema.build(format="xml")
with open("schema.xml", "w") as f:
    f.write(xml_output)

# Serialize to JSON for Schema API
json_output = schema.build(format="json")

# Use builder pattern
schema = (
    Schema(name="my_schema", version=1.6, uniqueKey="id")
    .add_field_type(text_type)
    .add_field(id_field)
    .add_field(title_field)
    .add_dynamic_field(text_dynamic)
    .add_copy_field(title_copy)
)
Reference

https://solr.apache.org/guide/solr/latest/indexing-guide/schema-elements.html
https://solr.apache.org/guide/solr/latest/indexing-guide/schema-api.html
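For the JSON side, Solr's Schema API represents a complete schema roughly as the dict below (illustrative values in the standard GET /schema response shape; the exact output of build(format="json") may differ):

```python
# Solr Schema API shape for a complete schema (example values only).
schema_json = {
    "name": "my_collection",
    "version": 1.6,
    "uniqueKey": "id",
    "fieldTypes": [
        {"name": "text_general", "class": "solr.TextField",
         "positionIncrementGap": "100"},
    ],
    "fields": [
        {"name": "id", "type": "string", "indexed": True, "stored": True, "required": True},
        {"name": "title", "type": "text_general", "indexed": True, "stored": True},
    ],
    "dynamicFields": [{"name": "*_txt", "type": "text_general"}],
    "copyFields": [{"source": "title", "dest": "text"}],
}
```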

add_copy_field(copy_field)

Add a copy field to the schema (builder pattern).

add_dynamic_field(field)

Add a dynamic field to the schema (builder pattern).

add_field(field)

Add a field to the schema (builder pattern).

add_field_type(field_type)

Add a field type to the schema (builder pattern).

build(format='xml')

Build schema in specified format.

Parameters:

  format (str): Output format - "xml" (default) or "json". Default: 'xml'

Returns:

  Dict[str, Any] | str: XML string for format="xml", dict for format="json"

SolrCharFilterFactory

Bases: str, Enum

Type-safe enum for Solr/Lucene char filter factory classes.

Char filters process the raw character stream before tokenization, performing operations like HTML stripping or character mapping. Each enum member maps to a char filter factory class name using the short notation (e.g., solr.HTMLStripCharFilterFactory).

Categories
  • HTML/Markup: HTML_STRIP
  • Pattern-based: PATTERN_REPLACE
  • Mapping: MAPPING
  • ICU: ICU_NORMALIZER2
Example
from taiyo.schema.enums import SolrCharFilterFactory

analyzer_config = {
    "charFilters": [
        {"class": SolrCharFilterFactory.HTML_STRIP},
        {"class": SolrCharFilterFactory.MAPPING, "mapping": "mapping-FoldToASCII.txt"}
    ],
    "tokenizer": {"class": "solr.StandardTokenizerFactory"}
}
Reference

https://solr.apache.org/guide/solr/latest/indexing-guide/charfilters.html
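Because the factory enums subclass both str and Enum, a member compares equal to its class-name string and slots directly into config dicts. A minimal sketch of that pattern (the two-member enum below is a hypothetical stand-in for the real SolrCharFilterFactory):

```python
from enum import Enum
import json

class CharFilterFactory(str, Enum):
    # Hypothetical stand-in for taiyo's SolrCharFilterFactory.
    HTML_STRIP = "solr.HTMLStripCharFilterFactory"
    MAPPING = "solr.MappingCharFilterFactory"

# A (str, Enum) member compares equal to its string value...
is_equal = CharFilterFactory.HTML_STRIP == "solr.HTMLStripCharFilterFactory"

# ...so it serializes cleanly into JSON-style configs.
cfg = {"charFilters": [{"class": CharFilterFactory.HTML_STRIP.value}]}
cfg_json = json.dumps(cfg)
print(cfg_json)
```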

SolrDynamicField

Bases: SolrField

Dynamic field pattern for automatic field creation.

Dynamic fields use wildcard patterns (* prefix or suffix) to automatically configure fields that match the pattern. When a document contains a field name matching a dynamic field pattern, Solr automatically creates and configures that field using the dynamic field's settings.

Dynamic fields are particularly useful for:
  • Handling varying field names in semi-structured data
  • Supporting multi-language fields (e.g., title_en, title_fr)
  • Creating type-specific field groups (e.g., *_txt, *_i, *_dt)

Attributes:

  name (str): Dynamic field pattern (e.g., '*_txt', '*_s', 'attr_*')
  type (str): Field type name (must reference a defined fieldType)

Example
from taiyo.schema import SolrDynamicField

# Text fields with suffix pattern
text_dynamic = SolrDynamicField(
    name="*_txt",
    type="text_general",
    indexed=True,
    stored=True,
    multi_valued=True
)

# String fields with suffix pattern
string_dynamic = SolrDynamicField(
    name="*_s",
    type="string",
    indexed=True,
    stored=True,
    doc_values=True
)

# Integer fields with suffix pattern
int_dynamic = SolrDynamicField(
    name="*_i",
    type="pint",
    indexed=True,
    stored=True,
    doc_values=True
)

# Date fields with suffix pattern
date_dynamic = SolrDynamicField(
    name="*_dt",
    type="pdate",
    indexed=True,
    stored=True,
    doc_values=True
)

# Attribute fields with prefix pattern
attr_dynamic = SolrDynamicField(
    name="attr_*",
    type="text_general",
    indexed=True,
    stored=True
)

# Ignored fields (no indexing or storage)
ignored_dynamic = SolrDynamicField(
    name="ignored_*",
    type="string",
    indexed=False,
    stored=False
)
Note
  • Patterns must contain exactly one asterisk (*)
  • Asterisk can be at beginning or end only
  • More specific patterns take precedence over less specific ones
  • If multiple patterns match, the longest match wins
Reference

https://solr.apache.org/guide/solr/latest/indexing-guide/dynamic-fields.html
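The matching rules in the Note above can be sketched as a small resolver (illustrative logic, not taiyo or Solr source):

```python
from typing import List, Optional

def matches(pattern: str, field_name: str) -> bool:
    # Exactly one asterisk, at the beginning or the end of the pattern.
    if pattern.startswith("*"):
        return field_name.endswith(pattern[1:])
    if pattern.endswith("*"):
        return field_name.startswith(pattern[:-1])
    return pattern == field_name

def resolve(field_name: str, patterns: List[str]) -> Optional[str]:
    # When several patterns match, the longest (most specific) one wins.
    candidates = [p for p in patterns if matches(p, field_name)]
    return max(candidates, key=len) if candidates else None

patterns = ["*_txt", "*_en_txt", "attr_*"]
print(resolve("title_en_txt", patterns))  # *_en_txt (longest match)
print(resolve("attr_color", patterns))    # attr_*
```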

build(format='xml', indent='')

Build field definition in specified format.

Parameters:

  format (str): Output format - "xml" (default) or "json". Default: 'xml'
  indent (str): Indentation prefix for XML output. Default: ''

Returns:

  Dict[str, Any] | str: XML string for format="xml", dict for format="json"

SolrField

Bases: BaseModel

Solr field definition specifying data indexing and storage behavior.

Fields are named data containers that reference a field type to determine how their values are analyzed, indexed, and stored. Each field can override default behaviors inherited from its field type.

Attributes:

  name (str): Field identifier (alphanumeric/underscore, no leading digit)
  type (str): Field type name (must reference a defined fieldType)
  default (Optional[Any]): Default value when not provided in documents
  indexed (Optional[bool]): Enable field in queries and sorting (default: true)
  stored (Optional[bool]): Store original value for retrieval (default: true)
  doc_values (Optional[bool]): Column-oriented storage for sorting/faceting (default: true)
  multi_valued (Optional[bool]): Allow multiple values per field (default: false)
  required (Optional[bool]): Reject documents missing this field (default: false)
  omit_norms (Optional[bool]): Disable length normalization (default: true for non-analyzed types)
  omit_term_freq_and_positions (Optional[bool]): Omit term frequency and positions
  omit_positions (Optional[bool]): Omit positions but keep term frequency
  term_vectors (Optional[bool]): Store term vectors for highlighting (default: false)
  term_positions (Optional[bool]): Store term positions in vectors
  term_offsets (Optional[bool]): Store term offsets in vectors
  term_payloads (Optional[bool]): Store term payloads in vectors
  sort_missing_first (Optional[bool]): Sort docs without this field first
  sort_missing_last (Optional[bool]): Sort docs without this field last
  uninvertible (Optional[bool]): Allow un-inverting when indexed=true, docValues=false
  use_doc_values_as_stored (Optional[bool]): Return docValues when '*' is used in the fl param
  large (Optional[bool]): Lazy-load values >512KB (requires stored=true, multiValued=false)

Example
from taiyo.schema import SolrField

# Unique ID field
id_field = SolrField(
    name="id",
    type="string",
    indexed=True,
    stored=True,
    required=True
)

# Text field for full-text search
title_field = SolrField(
    name="title",
    type="text_general",
    indexed=True,
    stored=True
)

# Multi-valued field
tags_field = SolrField(
    name="tags",
    type="string",
    indexed=True,
    stored=True,
    multi_valued=True
)

# Field with docValues for faceting
category_field = SolrField(
    name="category",
    type="string",
    indexed=True,
    stored=True,
    doc_values=True
)

# Numeric field with default value
price_field = SolrField(
    name="price",
    type="pdouble",
    indexed=True,
    stored=True,
    default=0.0
)

# Version field (internal)
version_field = SolrField(
    name="_version_",
    type="plong",
    indexed=False,
    stored=False
)
Reference

https://solr.apache.org/guide/solr/latest/indexing-guide/fields.html
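The snake_case attribute names above map onto Solr's camelCase XML attributes (multi_valued to multiValued, doc_values to docValues, and so on). A sketch of that rendering with the stdlib (the mapping helper is illustrative, not taiyo internals):

```python
import xml.etree.ElementTree as ET

def to_camel(snake: str) -> str:
    head, *rest = snake.split("_")
    return head + "".join(word.capitalize() for word in rest)

def field_xml(name: str, field_type: str, **props) -> str:
    # Solr XML wants camelCase attribute names and lowercase booleans.
    attrs = {"name": name, "type": field_type}
    for key, value in props.items():
        attrs[to_camel(key)] = str(value).lower() if isinstance(value, bool) else str(value)
    return ET.tostring(ET.Element("field", attrs), encoding="unicode")

print(field_xml("tags", "string", indexed=True, stored=True, multi_valued=True))
# <field name="tags" type="string" indexed="true" stored="true" multiValued="true" />
```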

build(format='xml', indent='')

Build field definition in specified format.

Parameters:

  format (str): Output format - "xml" (default) or "json". Default: 'xml'
  indent (str): Indentation prefix for XML output. Default: ''

Returns:

  Dict[str, Any] | str: XML string for format="xml", dict for format="json"

SolrFieldClass

Bases: str, Enum

Type-safe enum for Solr field type implementation classes.

Each enum member maps to a Solr field type class name using the short notation (e.g., solr.TextField). These can be used when defining SolrFieldType instances to ensure correct class names and enable IDE autocompletion.

Categories
  • Text/String: TEXT, STR, SORTABLE_TEXT, COLLATION, ICU_COLLATION, ENUM
  • Boolean/Binary: BOOL, BINARY, UUID
  • Numeric: INT_POINT, LONG_POINT, FLOAT_POINT, DOUBLE_POINT, DATE_POINT
  • Date/Currency: DATE_RANGE, CURRENCY
  • Spatial: LATLON_POINT_SPATIAL, BBOX, SPATIAL_RPT, RPT_WITH_GEOMETRY, POINT
  • Special: RANDOM_SORT, RANK, NEST_PATH, PRE_ANALYZED
  • Vector/ML: DENSE_VECTOR
Example
from taiyo.schema import SolrFieldType, SolrFieldClass

# Text field type
text_type = SolrFieldType(
    name="text_en",
    solr_class=SolrFieldClass.TEXT,
    analyzer=...
)

# Numeric field type
int_type = SolrFieldType(
    name="pint",
    solr_class=SolrFieldClass.INT_POINT
)

# Vector field type
vector_type = SolrFieldType(
    name="vector",
    solr_class=SolrFieldClass.DENSE_VECTOR,
    vectorDimension=768
)
Reference

https://solr.apache.org/guide/solr/latest/indexing-guide/field-types-included-with-solr.html

SolrFieldType

Bases: BaseModel

Represents a Solr field type definition.

Field types define how data is analyzed and stored.

Example
from taiyo.schema import SolrFieldType, Analyzer, Tokenizer, Filter
from taiyo.schema.enums import (
    SolrFieldClass,
    SolrTokenizerFactory,
    SolrFilterFactory
)

# Using enums (recommended)
field_type = SolrFieldType(
    name="text_general",
    solr_class=SolrFieldClass.TEXT,
    position_increment_gap=100,
    analyzer=Analyzer(
        tokenizer=Tokenizer(solr_class=SolrTokenizerFactory.STANDARD),
        filters=[
            Filter(solr_class=SolrFilterFactory.LOWER_CASE),
            Filter(solr_class=SolrFilterFactory.STOP, ignore_case=True, words="stopwords.txt")
        ]
    )
)

# Or using strings (also supported)
field_type = SolrFieldType(
    name="text_general",
    solr_class="solr.TextField",
    position_increment_gap=100,
    analyzer=Analyzer(
        tokenizer=Tokenizer(name="standard"),
        filters=[
            Filter(name="lowercase"),
            Filter(name="stop", ignore_case=True, words="stopwords.txt")
        ]
    )
)

Reference: https://solr.apache.org/guide/solr/latest/indexing-guide/field-type-definitions-and-properties.html
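For reference, the classic schema.xml markup such a field type serializes to looks roughly like this, built here with the stdlib (standard Solr structure; not actual taiyo output):

```python
import xml.etree.ElementTree as ET

# A fieldType wraps its analyzer chain as nested elements.
ft = ET.Element("fieldType", {
    "name": "text_general",
    "class": "solr.TextField",
    "positionIncrementGap": "100",
})
an = ET.SubElement(ft, "analyzer")
ET.SubElement(an, "tokenizer", {"class": "solr.StandardTokenizerFactory"})
ET.SubElement(an, "filter", {"class": "solr.LowerCaseFilterFactory"})
ET.SubElement(an, "filter", {
    "class": "solr.StopFilterFactory",
    "ignoreCase": "true",
    "words": "stopwords.txt",
})
fieldtype_xml = ET.tostring(ft, encoding="unicode")
print(fieldtype_xml)
```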

build(format='xml', indent='')

Build field type definition in specified format.

Parameters:

  format (str): Output format - "xml" (default) or "json". Default: 'xml'
  indent (str): Indentation prefix for XML output. Default: ''

Returns:

  Dict[str, Any] | str: XML string for format="xml", dict for format="json"

validate_field_class(v) classmethod

Accept both enum and string values.

SolrFilterFactory

Bases: str, Enum

Type-safe enum for Solr/Lucene filter factory classes.

Filters modify, remove, or add tokens in the token stream produced by tokenizers. Each enum member maps to a filter factory class name using the short notation (e.g., solr.LowerCaseFilterFactory).

Categories
  • Case: LOWER_CASE, UPPER_CASE, TURKISH_LOWER_CASE
  • Stemming: PORTER_STEM, SNOWBALL_PORTER, KS_STEM, various language stems
  • Stop words: STOP, SUGGEST_STOP
  • Synonyms: SYNONYM_GRAPH, SYNONYM (deprecated)
  • N-grams: EDGE_NGRAM, NGRAM, SHINGLE
  • Phonetic: BEIDER_MORSE, DAITCH_MOKOTOFF_SOUNDEX, DOUBLE_METAPHONE, METAPHONE, PHONEX, REFINED_SOUNDEX, SOUNDEX
  • Word analysis: WORD_DELIMITER_GRAPH, WORD_DELIMITER (deprecated)
  • Language-specific: Multiple for various languages
  • Special purpose: ASCII_FOLDING, TRUNCATE, REVERSE, TRIM, PROTECTED_TERM
Example
from taiyo.schema.enums import SolrFilterFactory

analyzer_config = {
    "tokenizer": {"class": "solr.StandardTokenizerFactory"},
    "filters": [
        {"class": SolrFilterFactory.LOWER_CASE},
        {"class": SolrFilterFactory.STOP, "words": "stopwords.txt"},
        {"class": SolrFilterFactory.PORTER_STEM}
    ]
}
Reference

https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html

SolrTokenizerFactory

Bases: str, Enum

Type-safe enum for Solr/Lucene tokenizer factory classes.

Tokenizers break text into tokens (words) that can be further processed by filters. Each enum member maps to a tokenizer factory class name using the short notation (e.g., solr.StandardTokenizerFactory).

Categories
  • Standard: STANDARD, CLASSIC, WHITESPACE, KEYWORD
  • Pattern-based: PATTERN, SIMPLE_PATTERN, SIMPLE_PATTERN_SPLIT
  • Path/Email: PATH_HIERARCHY, UAX29_URL_EMAIL
  • Language-specific: THAI, KOREAN, JAPANESE, ICU
  • N-gram: EDGE_NGRAM, NGRAM
  • OpenNLP: OPENNLP
Example
from taiyo.schema import SolrFieldType, Analyzer, Tokenizer
from taiyo.schema.enums import SolrTokenizerFactory

field_type = SolrFieldType(
    name="text_standard",
    solr_class="solr.TextField",
    analyzer=Analyzer(
        tokenizer=Tokenizer(solr_class=SolrTokenizerFactory.STANDARD)
    )
)
Reference

https://solr.apache.org/guide/solr/latest/indexing-guide/tokenizers.html

Tokenizer

Bases: BaseModel

Tokenizer for text analysis.

Splits text into tokens.

Example
from taiyo.schema.enums import SolrTokenizerFactory

# Using enum
tokenizer = Tokenizer(solr_class=SolrTokenizerFactory.STANDARD)

# Or using name
tokenizer = Tokenizer(name="standard")

# Or using string class
tokenizer = Tokenizer(solr_class="solr.StandardTokenizerFactory")

build(format='xml')

Build tokenizer definition.

Parameters:

  format (str): Output format - only "json" is supported for components. Default: 'xml'

Returns:

  Dict[str, Any]: dict representation

validate_class(v) classmethod

Accept both enum and string values.