signalwire.search

Copyright (c) 2025 SignalWire

This file is part of the SignalWire SDK.

Licensed under the MIT License. See LICENSE file in the project root for full license information.

SignalWire Agents Local Search Module

This module provides local search capabilities for the SignalWire Agents SDK. It requires additional dependencies that can be installed with:

pip install signalwire-sdk[search]           # Basic search
pip install signalwire-sdk[search-full]      # + Document processing
pip install signalwire-sdk[search-nlp]       # + Advanced NLP
pip install signalwire-sdk[search-all]       # All features

MODEL_ALIASES = {'mini': 'sentence-transformers/all-MiniLM-L6-v2', 'base': 'sentence-transformers/all-mpnet-base-v2', 'large': 'sentence-transformers/all-mpnet-base-v2'} module-attribute

DEFAULT_MODEL = MODEL_ALIASES['mini'] module-attribute

__all__ = ['DEFAULT_MODEL', 'MODEL_ALIASES', 'DocumentProcessor', 'IndexBuilder', 'SearchEngine', 'SearchIndexMigrator', 'SearchService', 'preprocess_document_content', 'preprocess_query', 'resolve_model_alias'] module-attribute

SearchIndexMigrator

Migrate search indexes between different backends

verbose = verbose instance-attribute

__init__(verbose=False)

Initialize the migrator

Parameters:

- verbose (bool, default: False): Enable verbose output

migrate_sqlite_to_pgvector(sqlite_path, connection_string, collection_name, overwrite=False, batch_size=100)

Migrate a .swsearch SQLite index to pgvector

Parameters:

- sqlite_path (str, required): Path to .swsearch file
- connection_string (str, required): PostgreSQL connection string
- collection_name (str, required): Name for the pgvector collection
- overwrite (bool, default: False): Whether to overwrite an existing collection
- batch_size (int, default: 100): Number of chunks to insert at once

Returns:

- dict[str, Any]: Migration statistics

migrate_pgvector_to_sqlite(connection_string, collection_name, output_path, batch_size=100)

Migrate a pgvector collection to SQLite .swsearch format

Parameters:

- connection_string (str, required): PostgreSQL connection string
- collection_name (str, required): Name of the pgvector collection
- output_path (str, required): Output .swsearch file path
- batch_size (int, default: 100): Number of chunks to fetch at once

Returns:

- dict[str, Any]: Migration statistics

get_index_info(index_path)

Get information about a search index

Parameters:

- index_path (str, required): Path to index file or pgvector collection identifier

Returns:

- dict[str, Any]: Index information including type, config, and statistics

DocumentProcessor

__init__(*args, **kwargs)

IndexBuilder

__init__(*args, **kwargs)

SearchEngine

__init__(*args, **kwargs)

SearchService

__init__(*args, **kwargs)

resolve_model_alias(model_name)

Resolve model alias to full model name

Parameters:

- model_name (str, required): Model name or alias (mini, base, large)

Returns:

- str: Full model name

preprocess_query(*args, **kwargs)

preprocess_document_content(*args, **kwargs)

document_processor

logger = get_logger(__name__) module-attribute

DocumentProcessor

Enhanced document processor with smart chunking capabilities

chunking_strategy = chunking_strategy instance-attribute
max_sentences_per_chunk = max_sentences_per_chunk instance-attribute
chunk_size = chunk_size instance-attribute
split_newlines = split_newlines instance-attribute
semantic_threshold = semantic_threshold instance-attribute
topic_threshold = topic_threshold instance-attribute
chunk_overlap = chunk_overlap instance-attribute
__init__(chunking_strategy='sentence', max_sentences_per_chunk=5, chunk_size=50, chunk_overlap=10, split_newlines=None, index_nlp_backend='nltk', verbose=False, semantic_threshold=0.5, topic_threshold=0.3)

Initialize document processor

Parameters:

- chunking_strategy (str, default: 'sentence'): Strategy for chunking documents:
  - 'sentence': Sentence-based chunking with overlap
  - 'sliding': Sliding window with word-based chunks
  - 'paragraph': Natural paragraph boundaries
  - 'page': Page-based chunking (for PDFs)
  - 'semantic': Semantic similarity-based chunking
  - 'topic': Topic modeling-based chunking
  - 'qa': Question-answer optimized chunking
  - 'json': JSON structure-aware chunking
  - 'markdown': Markdown structure-aware chunking with code block detection
- max_sentences_per_chunk (int, default: 5): For the sentence strategy
- chunk_size (int, default: 50): For the sliding strategy, words per chunk
- chunk_overlap (int, default: 10): For the sliding strategy, overlap in words
- split_newlines (int | None, default: None): For the sentence strategy, split on multiple newlines (optional)
- index_nlp_backend (str, default: 'nltk'): NLP backend for indexing
- verbose (bool, default: False): Whether to enable verbose logging
- semantic_threshold (float, default: 0.5): Similarity threshold for semantic chunking
- topic_threshold (float, default: 0.3): Similarity threshold for topic chunking

create_chunks(content, filename, file_type)

Create chunks from document content using specified chunking strategy

Parameters:

- content (str, required): Document content as a string. This should be the actual content, not a file path
- filename (str, required): Name of the file (for metadata)
- file_type (str, required): File extension/type

Returns:

- list[dict[str, Any]]: List of chunk dictionaries
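
To make the 'sliding' strategy's chunk_size and chunk_overlap parameters concrete, here is a minimal sketch of word-based sliding-window chunking. The function name `sliding_window_chunks` is hypothetical and this is not the SDK's implementation; it only assumes that chunks are measured in whitespace-separated words, as the parameter descriptions suggest.

```python
def sliding_window_chunks(text: str, chunk_size: int = 50, chunk_overlap: int = 10) -> list[str]:
    """Hypothetical sketch: split text into overlapping word windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Made-up input: 120 numbered words.
sample = " ".join(f"w{i}" for i in range(120))
demo_chunks = sliding_window_chunks(sample, chunk_size=50, chunk_overlap=10)
print([len(c.split()) for c in demo_chunks])
```

With the defaults, each window starts 40 words after the previous one, so consecutive chunks share 10 words.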

index_builder

logger = get_logger(__name__) module-attribute

IndexBuilder

Build searchable indexes from document directories

model_name = model_name instance-attribute
chunking_strategy = chunking_strategy instance-attribute
max_sentences_per_chunk = max_sentences_per_chunk instance-attribute
chunk_size = chunk_size instance-attribute
chunk_overlap = chunk_overlap instance-attribute
split_newlines = split_newlines instance-attribute
index_nlp_backend = index_nlp_backend instance-attribute
verbose = verbose instance-attribute
semantic_threshold = semantic_threshold instance-attribute
topic_threshold = topic_threshold instance-attribute
backend = backend instance-attribute
connection_string = connection_string instance-attribute
model = None instance-attribute
doc_processor = DocumentProcessor(chunking_strategy=chunking_strategy, max_sentences_per_chunk=max_sentences_per_chunk, chunk_size=chunk_size, chunk_overlap=chunk_overlap, split_newlines=split_newlines, index_nlp_backend=(self.index_nlp_backend), verbose=(self.verbose), semantic_threshold=(self.semantic_threshold), topic_threshold=(self.topic_threshold)) instance-attribute
__init__(model_name='sentence-transformers/all-mpnet-base-v2', chunking_strategy='sentence', max_sentences_per_chunk=5, chunk_size=50, chunk_overlap=10, split_newlines=None, index_nlp_backend='nltk', verbose=False, semantic_threshold=0.5, topic_threshold=0.3, backend='sqlite', connection_string=None)

Initialize the index builder

Parameters:

- model_name (str, default: 'sentence-transformers/all-mpnet-base-v2'): Name of the sentence transformer model to use
- chunking_strategy (str, default: 'sentence'): Strategy for chunking documents ('sentence', 'sliding', 'paragraph', 'page', 'semantic', 'topic', 'qa', 'json')
- max_sentences_per_chunk (int, default: 5): For the sentence strategy
- chunk_size (int, default: 50): For the sliding strategy, words per chunk
- chunk_overlap (int, default: 10): For the sliding strategy, overlap in words
- split_newlines (int | None, default: None): For the sentence strategy, split on multiple newlines (optional)
- index_nlp_backend (str, default: 'nltk'): NLP backend for indexing
- verbose (bool, default: False): Whether to enable verbose logging
- semantic_threshold (float, default: 0.5): Similarity threshold for semantic chunking
- topic_threshold (float, default: 0.3): Similarity threshold for topic chunking
- backend (str, default: 'sqlite'): Storage backend ('sqlite' or 'pgvector')
- connection_string (str | None, default: None): PostgreSQL connection string for the pgvector backend

build_index_from_sources(sources, output_file, file_types, exclude_patterns=None, languages=None, tags=None, overwrite=False)

Build complete search index from multiple sources (files and directories)

Parameters:

- sources (list[Path], required): List of Path objects (files and/or directories)
- output_file (str, required): Output .swsearch file path
- file_types (list[str], required): List of file extensions to include for directories
- exclude_patterns (list[str] | None, default: None): Glob patterns to exclude
- languages (list[str] | None, default: None): List of languages to support
- tags (list[str] | None, default: None): Global tags to add to all chunks

build_index(source_dir, output_file, file_types, exclude_patterns=None, languages=None, tags=None)

Build complete search index from a single directory

Parameters:

- source_dir (str, required): Directory to scan for documents
- output_file (str, required): Output .swsearch file path
- file_types (list[str], required): List of file extensions to include
- exclude_patterns (list[str] | None, default: None): Glob patterns to exclude
- languages (list[str] | None, default: None): List of languages to support
- tags (list[str] | None, default: None): Global tags to add to all chunks

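
As an illustration of how file_types and exclude_patterns could interact when a directory is scanned, the sketch below filters files by extension and then drops anything matching a glob pattern. The helper `select_files` is hypothetical, and matching patterns against the path relative to the source directory is an assumption; the SDK may match differently.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def select_files(source_dir, file_types, exclude_patterns=None):
    """Hypothetical sketch: pick files by extension, then apply glob excludes."""
    wanted = {ext.lower().lstrip(".") for ext in file_types}
    patterns = exclude_patterns or []
    selected = []
    for path in sorted(Path(source_dir).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower().lstrip(".") not in wanted:
            continue
        # Assumption: patterns are matched against the path relative to source_dir.
        relative = path.relative_to(source_dir).as_posix()
        if any(fnmatch(relative, pattern) for pattern in patterns):
            continue
        selected.append(path)
    return selected


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "drafts").mkdir()
    for name in ("guide.md", "notes.txt", "drafts/wip.md"):
        (root / name).write_text("example")
    found = [p.relative_to(root).as_posix() for p in select_files(root, ["md"], ["drafts/*"])]

print(found)
```
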
validate_index(index_file)

Validate an existing search index

migration

logger = get_logger(__name__) module-attribute

SearchIndexMigrator

Migrate search indexes between different backends

verbose = verbose instance-attribute
__init__(verbose=False)

Initialize the migrator

Parameters:

- verbose (bool, default: False): Enable verbose output

migrate_sqlite_to_pgvector(sqlite_path, connection_string, collection_name, overwrite=False, batch_size=100)

Migrate a .swsearch SQLite index to pgvector

Parameters:

- sqlite_path (str, required): Path to .swsearch file
- connection_string (str, required): PostgreSQL connection string
- collection_name (str, required): Name for the pgvector collection
- overwrite (bool, default: False): Whether to overwrite an existing collection
- batch_size (int, default: 100): Number of chunks to insert at once

Returns:

- dict[str, Any]: Migration statistics
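
The batch_size parameter implies that chunks are moved in fixed-size groups rather than one at a time. A minimal sketch of that batching pattern follows; `iter_batches` is a hypothetical helper and the chunk dictionaries are made-up data, not the migrator's internal code.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 100) -> Iterator[Sequence[T]]:
    """Hypothetical sketch: yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up chunks standing in for rows read from an index.
example_chunks = [{"id": i} for i in range(250)]
batch_sizes = [len(batch) for batch in iter_batches(example_chunks, batch_size=100)]
print(batch_sizes)  # [100, 100, 50]
```

Larger batches mean fewer round trips to the database at the cost of more memory per insert.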

migrate_pgvector_to_sqlite(connection_string, collection_name, output_path, batch_size=100)

Migrate a pgvector collection to SQLite .swsearch format

Parameters:

- connection_string (str, required): PostgreSQL connection string
- collection_name (str, required): Name of the pgvector collection
- output_path (str, required): Output .swsearch file path
- batch_size (int, default: 100): Number of chunks to fetch at once

Returns:

- dict[str, Any]: Migration statistics

get_index_info(index_path)

Get information about a search index

Parameters:

- index_path (str, required): Path to index file or pgvector collection identifier

Returns:

- dict[str, Any]: Index information including type, config, and statistics

models

MODEL_ALIASES = {'mini': 'sentence-transformers/all-MiniLM-L6-v2', 'base': 'sentence-transformers/all-mpnet-base-v2', 'large': 'sentence-transformers/all-mpnet-base-v2'} module-attribute

DEFAULT_MODEL = MODEL_ALIASES['mini'] module-attribute

resolve_model_alias(model_name)

Resolve model alias to full model name

Parameters:

- model_name (str, required): Model name or alias (mini, base, large)

Returns:

- str: Full model name
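
A minimal sketch of alias resolution, using the alias table documented in this module. The function `resolve_alias_sketch` is a hypothetical stand-in for resolve_model_alias, and returning unknown names unchanged is an assumption that the documentation above does not confirm.

```python
# Alias table copied from the MODEL_ALIASES attribute documented in this module.
MODEL_ALIASES = {
    "mini": "sentence-transformers/all-MiniLM-L6-v2",
    "base": "sentence-transformers/all-mpnet-base-v2",
    "large": "sentence-transformers/all-mpnet-base-v2",
}
DEFAULT_MODEL = MODEL_ALIASES["mini"]


def resolve_alias_sketch(model_name: str) -> str:
    """Hypothetical stand-in for resolve_model_alias.

    Assumption: a name that is not a known alias is returned as-is.
    """
    return MODEL_ALIASES.get(model_name, model_name)


print(resolve_alias_sketch("base"))
print(resolve_alias_sketch("my-org/custom-model"))  # made-up model name
```

Note that, per the table, 'base' and 'large' currently resolve to the same model.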

pgvector_backend

PGVECTOR_AVAILABLE = True module-attribute

logger = get_logger(__name__) module-attribute

PgVectorBackend

PostgreSQL pgvector backend for search indexing and retrieval

connection_string = connection_string instance-attribute
conn = None instance-attribute
__init__(connection_string)

Initialize pgvector backend

Parameters:

- connection_string (str, required): PostgreSQL connection string

create_schema(collection_name, embedding_dim=768)

Create database schema for a collection

Parameters:

- collection_name (str, required): Name of the collection
- embedding_dim (int, default: 768): Dimension of embeddings

store_chunks(chunks, collection_name, config)

Store document chunks in the database

Parameters:

- chunks (list[dict[str, Any]], required): List of processed chunks with embeddings
- collection_name (str, required): Name of the collection
- config (dict[str, Any], required): Configuration metadata

get_stats(collection_name)

Get statistics for a collection

list_collections()

List all collections in the database

delete_collection(collection_name)

Delete a collection and its data

close()

Close database connection

PgVectorSearchBackend

PostgreSQL pgvector backend for search operations

connection_string = connection_string instance-attribute
collection_name = _sanitize_collection_name(collection_name) instance-attribute
table_name = f'chunks_{self.collection_name}' instance-attribute
conn = None instance-attribute
config = self._load_config() instance-attribute
__init__(connection_string, collection_name)

Initialize search backend

Parameters:

- connection_string (str, required): PostgreSQL connection string
- collection_name (str, required): Name of the collection to search
search(query_vector, enhanced_text, count=5, similarity_threshold=0.0, tags=None, keyword_weight=None)

Perform hybrid search (vector + keyword + metadata).

NOTE: As of the unified-pipeline refactor, production traffic flows through SearchEngine.search() which calls fetch_candidates() here and then runs all post-processing (scoring, dedup, diversity) in SearchEngine. This method remains as a self-contained search path for direct backend use and test coverage, but does NOT receive the SearchEngine's exact-match boost, filename diversity, or match-type diversity logic. For consistent behavior, call SearchEngine.search().

Parameters:

- query_vector (list[float], required): Embedding vector for the query
- enhanced_text (str, required): Processed query text for keyword search
- count (int, default: 5): Number of results to return
- similarity_threshold (float, default: 0.0): Minimum similarity score
- tags (list[str] | None, default: None): Filter by tags
- keyword_weight (float | None, default: None): Manual keyword weight (0.0-1.0). If None, uses default weighting

Returns:

- list[dict[str, Any]]: List of search results with scores and metadata

fetch_candidates(query_vector, enhanced_text, count, similarity_threshold=0.0, tags=None, original_query=None)

Fetch raw candidates with per-source signal scores.

Runs vector/keyword/metadata/filename searches, applies similarity threshold to raw vector scores pre-merge (keeps threshold intuitive), and merges into a candidate list keyed by chunk id. Does NOT compute final scores, boost exact matches, dedupe, or apply diversity - those run uniformly in SearchEngine for every backend.

Result shape per candidate (matches the sqlite path):

{
    'id',
    'content',
    'metadata': {filename, section, tags, **custom},
    'search_type',
    'vector_score' (if vector matched),
    'vector_distance' (if vector matched),
    'sources': {source_type: True, ...},
    'source_scores': {source_type: raw_score, ...},
}
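
The merge step described above can be sketched as follows: hits from each source are combined into one candidate per chunk id, with the similarity threshold applied to raw vector scores before merging. The function `merge_candidates`, the input layout, and the scores are hypothetical; the sketch also omits the 'search_type' and 'vector_distance' fields for brevity.

```python
from typing import Any


def merge_candidates(
    per_source: dict[str, list[dict[str, Any]]],
    similarity_threshold: float = 0.0,
) -> list[dict[str, Any]]:
    """Hypothetical sketch: merge per-source hits into candidates keyed by chunk id."""
    merged: dict[Any, dict[str, Any]] = {}
    for source_type, hits in per_source.items():
        for hit in hits:
            # The threshold is applied to raw vector scores only, before merging.
            if source_type == "vector" and hit["score"] < similarity_threshold:
                continue
            entry = merged.setdefault(hit["id"], {
                "id": hit["id"],
                "content": hit["content"],
                "metadata": hit.get("metadata", {}),
                "sources": {},
                "source_scores": {},
            })
            entry["sources"][source_type] = True
            entry["source_scores"][source_type] = hit["score"]
            if source_type == "vector":
                entry["vector_score"] = hit["score"]
    return list(merged.values())


# Made-up hits: chunk 1 matches both sources, chunk 2 falls below the threshold.
example_hits = {
    "vector": [
        {"id": 1, "content": "a", "score": 0.82},
        {"id": 2, "content": "b", "score": 0.10},
    ],
    "keyword": [
        {"id": 1, "content": "a", "score": 0.40},
        {"id": 3, "content": "c", "score": 0.70},
    ],
}
merged_candidates = merge_candidates(example_hits, similarity_threshold=0.3)
print([c["id"] for c in merged_candidates])
```

No final score is computed here, which mirrors the division of labor described above: scoring, boosting, deduplication, and diversity are left to SearchEngine.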

get_stats()

Get statistics for the collection

close()

Close database connection

query_processor

logger = get_logger(__name__) module-attribute

stopwords_language_map = {'en': 'english', 'es': 'spanish', 'fr': 'french', 'de': 'german', 'it': 'italian', 'pt': 'portuguese', 'nl': 'dutch', 'ru': 'russian', 'ar': 'arabic', 'da': 'danish', 'fi': 'finnish', 'hu': 'hungarian', 'no': 'norwegian', 'ro': 'romanian', 'sv': 'swedish', 'tr': 'turkish'} module-attribute

pos_mapping = {'NOUN': wn.NOUN, 'VERB': wn.VERB, 'ADJ': wn.ADJ, 'ADV': wn.ADV, 'PROPN': wn.NOUN} module-attribute

detect_language(text)

Detect the language of the input text. This is a simple implementation that can be enhanced with the langdetect library.

load_spacy_model(language)

Load the spaCy model for the given language. Returns None if spaCy is not available or the model is not found.

set_global_model(model)

Legacy function - adds model to cache instead of setting globally

vectorize_query(query, model=None, model_name=None)

Vectorize a query using sentence transformers. Returns a numpy array of embeddings.

Parameters:

- query (str, required): Query string to vectorize
- model (Any, default: None): Optional pre-loaded model instance. If not provided, the cached model is used
- model_name (str | None, default: None): Optional model name to use if loading a new model

ensure_nltk_resources()

Download required NLTK resources if not already present

get_wordnet_pos(spacy_pos)

Map spaCy POS tags to WordNet POS tags.

get_synonyms(word, pos_tag, max_synonyms=5)

Get synonyms for a word using WordNet

remove_duplicate_words(input_string)

Remove duplicate words from the input string while preserving the order and punctuation.
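
One way to remove repeated words while keeping the first occurrence and the original order is sketched below. The name `remove_duplicate_words_sketch` is hypothetical, and comparing words case-insensitively with surrounding punctuation ignored is an assumption; the SDK function may treat case and punctuation differently.

```python
def remove_duplicate_words_sketch(input_string: str) -> str:
    """Hypothetical sketch: keep the first occurrence of each word, in order."""
    seen = set()
    kept = []
    for token in input_string.split():
        # Assumption: compare case-insensitively and ignore surrounding punctuation.
        key = token.lower().strip(".,;:!?\"'()")
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(token)  # the surviving token keeps its own punctuation
    return " ".join(kept)


deduped = remove_duplicate_words_sketch("reset password reset my Password, please")
print(deduped)  # reset password my please
```

This kind of cleanup matters after synonym expansion, where the same term can be added to the query more than once.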

preprocess_query(query, language='en', pos_to_expand=None, max_synonyms=5, debug=False, vector=False, vectorize_query_param=False, nlp_backend=None, query_nlp_backend='nltk', model_name=None, preserve_original=True)

Advanced query preprocessing with language detection, POS tagging, synonym expansion, and vectorization

Parameters:

- query (str, required): Input query string
- language (str, default: 'en'): Language code ('en', 'es', etc.) or 'auto' for detection
- pos_to_expand (list[str] | None, default: None): List of POS tags to expand with synonyms
- max_synonyms (int, default: 5): Maximum synonyms per word
- debug (bool, default: False): Enable debug output
- vector (bool, default: False): Include vector embedding in output
- vectorize_query_param (bool, default: False): If True, just vectorize without other processing
- nlp_backend (str | None, default: None): DEPRECATED, use query_nlp_backend instead
- query_nlp_backend (str, default: 'nltk'): NLP backend for query processing ('nltk' for fast, 'spacy' for better quality)

Returns:

- dict[str, Any]: Dict containing processed query, language, POS tags, and optionally vector

preprocess_document_content(content, language='en', nlp_backend=None, index_nlp_backend='nltk')

Preprocess document content for better searchability

Parameters:

- content (str, required): Document content to process
- language (str, default: 'en'): Language code for processing
- nlp_backend (str | None, default: None): DEPRECATED, use index_nlp_backend instead
- index_nlp_backend (str, default: 'nltk'): NLP backend for document processing ('nltk' for fast, 'spacy' for better quality)

Returns:

- dict[str, Any]: Dict containing enhanced text and extracted keywords

search_engine

NDArray = np.ndarray module-attribute

logger = get_logger(__name__) module-attribute

SearchEngine

Hybrid search engine for vector and keyword search

backend = backend instance-attribute
model = model instance-attribute
index_path = index_path instance-attribute
config = self._load_config() instance-attribute
embedding_dim = int(self.config.get('embedding_dimensions', 768)) instance-attribute
__init__(backend='sqlite', index_path=None, connection_string=None, collection_name=None, model=None)

Initialize search engine

Parameters:

- backend (str, default: 'sqlite'): Storage backend ('sqlite' or 'pgvector')
- index_path (str | None, default: None): Path to .swsearch file (for the sqlite backend)
- connection_string (str | None, default: None): PostgreSQL connection string (for the pgvector backend)
- collection_name (str | None, default: None): Collection name (for the pgvector backend)
- model (Any, default: None): Optional sentence transformer model

search(query_vector, enhanced_text, count=3, similarity_threshold=0.0, tags=None, keyword_weight=None, original_query=None)

Unified hybrid search: backends fetch candidates, SearchEngine post-processes.

The flow is backend-agnostic:
  1. fetch_candidates(k = count * search_multiplier): per-source signals
  2. _process_candidates: score / boost / dedupe / diversity
  3. Return the top count results

Backends (sqlite, pgvector, future) only implement candidate fetching. All quality logic (max-signal scoring with agreement boost, exact-match boosting, content dedup, filename diversity, match-type diversity) runs once here and therefore applies to every backend uniformly.
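
A rough sketch of the post-processing idea, max-signal scoring with an agreement boost followed by content deduplication and a top-count cut, is shown below. The function names, the 0.1 boost value, and the sample scores are all hypothetical; the actual _process_candidates logic (including exact-match boosting and the diversity passes) is not reproduced here.

```python
def score_candidate(source_scores, agreement_boost=0.1):
    """Hypothetical scoring: strongest signal wins, extra agreeing sources add a boost."""
    best = max(source_scores.values(), default=0.0)
    extra_sources = max(len(source_scores) - 1, 0)
    return best * (1.0 + agreement_boost * extra_sources)


def rank_candidates(candidates, count=3):
    """Hypothetical sketch: score, drop duplicate content, return the top count."""
    scored = [{**c, "score": score_candidate(c["source_scores"])} for c in candidates]
    scored.sort(key=lambda c: c["score"], reverse=True)

    seen_content = set()
    ranked = []
    for cand in scored:
        if cand["content"] in seen_content:
            continue  # content dedup: keep only the best-scoring copy
        seen_content.add(cand["content"])
        ranked.append(cand)
    return ranked[:count]


# Made-up candidates in the shape returned by a backend's fetch_candidates.
example_candidates = [
    {"id": 1, "content": "alpha", "source_scores": {"vector": 0.80, "keyword": 0.50}},
    {"id": 2, "content": "beta", "source_scores": {"vector": 0.85}},
    {"id": 3, "content": "alpha", "source_scores": {"keyword": 0.60}},
    {"id": 4, "content": "gamma", "source_scores": {"keyword": 0.30}},
]
top_ids = [c["id"] for c in rank_candidates(example_candidates, count=3)]
print(top_ids)
```

In this made-up data, candidate 1 outranks candidate 2 despite a lower vector score because two sources agree on it, and candidate 3 is dropped as duplicate content.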

Parameters:

- query_vector (list[float], required): Embedding vector for the query
- enhanced_text (str, required): Processed query text for keyword search
- count (int, default: 3): Number of results to return
- similarity_threshold (float, default: 0.0): Minimum similarity score (applied to raw vector scores pre-merge so the threshold is intuitive)
- tags (list[str] | None, default: None): Filter by tags
- keyword_weight (float | None, default: None): Accepted for API stability; scoring is max-signal-wins with agreement boost, so this is currently a no-op
- original_query (str | None, default: None): Original query for exact matching / filename match

Returns:

- list[dict[str, Any]]: List of search results with scores and metadata

get_stats()

Get statistics about the search index

search_service

logger = get_logger('search_service') module-attribute

SearchRequest

query = query instance-attribute
index_name = index_name instance-attribute
count = count instance-attribute
similarity_threshold = similarity_threshold instance-attribute
tags = tags instance-attribute
language = language instance-attribute
__init__(query, index_name='default', count=3, similarity_threshold=0.0, tags=None, language=None)

SearchResult

content = content instance-attribute
score = score instance-attribute
metadata = metadata instance-attribute
__init__(content, score, metadata)

SearchResponse

results = results instance-attribute
query_analysis = query_analysis instance-attribute
__init__(results, query_analysis=None)
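
The three request and response classes above are listed without descriptions. The dataclass sketch below mirrors their documented attributes and defaults to show how they relate; the class names with the `Sketch` suffix and all type annotations are assumptions, since the reference gives no types for these classes.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class SearchRequestSketch:
    """Hypothetical mirror of SearchRequest; the types are assumed."""
    query: str
    index_name: str = "default"
    count: int = 3
    similarity_threshold: float = 0.0
    tags: Optional[list[str]] = None
    language: Optional[str] = None


@dataclass
class SearchResultSketch:
    """Hypothetical mirror of SearchResult; the types are assumed."""
    content: str
    score: float
    metadata: dict[str, Any]


@dataclass
class SearchResponseSketch:
    """Hypothetical mirror of SearchResponse; the types are assumed."""
    results: list[SearchResultSketch]
    query_analysis: Optional[dict[str, Any]] = None


# Made-up request and response values.
request = SearchRequestSketch(query="how do I reset my password", tags=["faq"])
response = SearchResponseSketch(
    results=[SearchResultSketch("Open Settings and choose Reset.", 0.91, {"filename": "faq.md"})]
)
print(asdict(request))
print(asdict(response))
```
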

SearchService

Local search service with HTTP API supporting both SQLite and pgvector backends

port = port instance-attribute
backend = backend instance-attribute
connection_string = connection_string instance-attribute
indexes = indexes instance-attribute
search_engines = {} instance-attribute
model = None instance-attribute
security = SecurityConfig(config_file=config_file, service_name='search') instance-attribute
app = None instance-attribute
__init__(port=8001, indexes=None, basic_auth=None, config_file=None, backend='sqlite', connection_string=None)
search_direct(query, index_name='default', count=3, distance=0.0, tags=None, language=None)

Direct search method (non-async) for programmatic use

start(host='0.0.0.0', port=None, ssl_cert=None, ssl_key=None)

Start the service with optional HTTPS support.

Parameters:

- host (str, default: '0.0.0.0'): Host to bind to
- port (int | None, default: None): Port to bind to (falls back to self.port)
- ssl_cert (str | None, default: None): Path to SSL certificate file (overrides environment)
- ssl_key (str | None, default: None): Path to SSL key file (overrides environment)

stop()

Stop the service (placeholder for cleanup)