# Embeddings
An embedding model converts text into a vector representation. The quality of the embedding model is crucial for the quality of the search results. You can configure multiple embedding models in your Django settings and use them for different fields in your documents.
## Configuration
### Default Embedding Model
Configure the default embedding model, which is used whenever a field does not name a specific model:
```python
SEMANTIC_SEARCH = {
    "default_embeddings": {
        "model": "django_semantic_search.embeddings.SentenceTransformerModel",
        "configuration": {
            "model_name": "sentence-transformers/all-MiniLM-L6-v2",
        },
    },
}
```
### Named Embedding Models
You can define multiple named embedding models to use for different fields:
```python
SEMANTIC_SEARCH = {
    "embedding_models": {
        "title_model": {
            "model": "django_semantic_search.embeddings.SentenceTransformerModel",
            "configuration": {
                "model_name": "sentence-transformers/all-mpnet-base-v2",
                "document_prompt": "Title: ",
            },
        },
        "content_model": {
            "model": "django_semantic_search.embeddings.OpenAIEmbeddingModel",
            "configuration": {
                "model": "text-embedding-3-small",
            },
        },
    },
    ...
}
```
Then reference these models in your document definitions:
```python
@register_document
class BookDocument(Document):
    class Meta:
        model = Book
        indexes = [
            VectorIndex("title", embedding_model="title_model"),
            VectorIndex("content", embedding_model="content_model"),
            VectorIndex("summary"),  # Will use default_embeddings
        ]
```
**Note:** Fields without a specified `embedding_model` will use the model defined in `default_embeddings`.
## Supported Models
Currently, django-semantic-search supports the following embedding models:
### Sentence Transformers
The Sentence Transformers library provides a way to convert text into a vector representation. There are over 5,000 pre-trained models available, so you can choose the one that best fits your needs.
One of the available models is `all-MiniLM-L6-v2`, a lightweight model that offers a good balance between search quality and resource consumption.
#### `django_semantic_search.embeddings.SentenceTransformerModel`
Bases: `DenseTextEmbeddingModel`
Sentence-transformers model for embedding text. It is a wrapper around the sentence-transformers library. Users rarely need to use this class directly; instead, they specify it in the Django settings.
Requirements: the `sentence-transformers` package.
Usage:
```python
SEMANTIC_SEARCH = {
    "default_embeddings": {
        "model": "django_semantic_search.embeddings.SentenceTransformerModel",
        "configuration": {
            "model_name": "sentence-transformers/all-MiniLM-L6-v2",
        },
    },
    ...
}
```
Some models accept prompts to be used for the document and query. These prompts are used as additional instructions for the model to generate embeddings. For example, if the `document_prompt` is set to `"Doc: "`, the model will generate embeddings with the prompt `"Doc: "` followed by the document text. Similarly, the `query_prompt` is used for the query, if set.
```python
SEMANTIC_SEARCH = {
    "default_embeddings": {
        "model": "django_semantic_search.embeddings.SentenceTransformerModel",
        "configuration": {
            "model_name": "sentence-transformers/all-MiniLM-L6-v2",
            "document_prompt": "Doc: ",
            "query_prompt": "Query: ",
        },
    },
    ...
}
```
Source code in src/django_semantic_search/embeddings/sentence_transformers.py
##### `__init__(model_name, document_prompt=None, query_prompt=None)`
Initialize the sentence-transformers model.
Some models accept prompts to be used for the document and query. These prompts are used as additional instructions for the model to generate embeddings. For example, if the `document_prompt` is set to `"Doc: "`, the model will generate embeddings with the prompt `"Doc: "` followed by the document text.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_name` | `str` | name of the model to use. | *required* |
| `document_prompt` | `Optional[str]` | prompt to use for the document, defaults to None. | `None` |
| `query_prompt` | `Optional[str]` | prompt to use for the query, defaults to None. | `None` |
Source code in src/django_semantic_search/embeddings/sentence_transformers.py
##### `embed_document(document)`
Embed a document into a vector.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `document` | `str` | document to embed. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `DenseVector` | document embedding. |
Source code in src/django_semantic_search/embeddings/sentence_transformers.py
##### `embed_query(query)`
Embed a query into a vector.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `query` | `str` | query to embed. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `DenseVector` | query embedding. |
Source code in src/django_semantic_search/embeddings/sentence_transformers.py
##### `vector_size()`
Return the size of the individual embedding.
Returns:

| Type | Description |
| --- | --- |
| `int` | size of the embedding. |
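For completeness, here is a minimal sketch of using the model directly rather than through the settings. The import path is an assumption derived from the dotted path `django_semantic_search.embeddings.SentenceTransformerModel` used in the configuration above.

```python
# Minimal sketch: direct use of the sentence-transformers wrapper.
# The import path is assumed from the settings string shown above.
from django_semantic_search.embeddings import SentenceTransformerModel

model = SentenceTransformerModel(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    document_prompt="Doc: ",    # prepended to each document before embedding
    query_prompt="Query: ",     # prepended to each query before embedding
)

document_vector = model.embed_document("Django makes it easy to build web apps.")
query_vector = model.embed_query("web framework for Python")

# embed_document() and embed_query() return dense vectors of length vector_size().
print(model.vector_size(), len(document_vector), len(query_vector))
```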
### OpenAI
OpenAI provides powerful embedding models through their API. The default model is `text-embedding-3-small`, which offers a good balance between quality and cost.
To use OpenAI embeddings, first install the required dependency, the `openai` package. Then configure the model in your Django settings:
```python
SEMANTIC_SEARCH = {
    "default_embeddings": {
        "model": "django_semantic_search.embeddings.OpenAIEmbeddingModel",
        "configuration": {
            "model": "text-embedding-3-small",
            "api_key": "your-api-key",  # Optional if set in env
        },
    },
    ...
}
```
The API key can also be provided through the `OPENAI_API_KEY` environment variable.
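For example, a configuration that relies solely on the environment variable simply omits the key; this sketch is identical to the example above except for the missing `api_key`:

```python
SEMANTIC_SEARCH = {
    "default_embeddings": {
        "model": "django_semantic_search.embeddings.OpenAIEmbeddingModel",
        "configuration": {
            "model": "text-embedding-3-small",
            # no "api_key" here; the model falls back to OPENAI_API_KEY
        },
    },
}
```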
#### `django_semantic_search.embeddings.OpenAIEmbeddingModel`
Bases: `DenseTextEmbeddingModel`
OpenAI text embedding model that uses the OpenAI API to generate dense embeddings.
Requirements: the `openai` package.
Usage:
```python
SEMANTIC_SEARCH = {
    "default_embeddings": {
        "model": "django_semantic_search.embeddings.OpenAIEmbeddingModel",
        "configuration": {
            "model": "text-embedding-3-small",
            "api_key": "your-api-key",  # Optional if set in env
        },
    },
    ...
}
```
Source code in src/django_semantic_search/embeddings/openai.py
##### `__init__(model='text-embedding-3-small', api_key=None, **kwargs)`
Initialize the OpenAI embedding model.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str` | OpenAI model to use for embeddings | `'text-embedding-3-small'` |
| `api_key` | `Optional[str]` | OpenAI API key. If not provided, will look for the `OPENAI_API_KEY` env variable | `None` |
| `kwargs` | | Additional kwargs passed to the OpenAI client | `{}` |
Source code in src/django_semantic_search/embeddings/openai.py
##### `embed_document(document)`
Embed a document into a vector.
##### `embed_query(query)`
Embed a query into a vector.
##### `vector_size()`
Return the size of the individual embedding.
Source code in src/django_semantic_search/embeddings/openai.py
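As with the sentence-transformers wrapper, the class can be used directly. A minimal sketch, assuming the import path matches the dotted path used in the settings and that `OPENAI_API_KEY` is set in the environment:

```python
from django_semantic_search.embeddings import OpenAIEmbeddingModel

# No api_key argument: the key is read from the OPENAI_API_KEY environment variable.
model = OpenAIEmbeddingModel(model="text-embedding-3-small")

query_vector = model.embed_query("semantic search with Django")
print(model.vector_size(), len(query_vector))
```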
### FastEmbed
FastEmbed is a lightweight and efficient embedding library that supports both dense and sparse embeddings. It provides fast, accurate embeddings suitable for production use.
#### Installation
To use FastEmbed embeddings, install the required dependency, the `fastembed` package.
#### Dense Embeddings
For dense embeddings, configure FastEmbed in your Django settings:
```python
SEMANTIC_SEARCH = {
    "default_embeddings": {
        "model": "django_semantic_search.embeddings.FastEmbedDenseModel",
        "configuration": {
            "model_name": "BAAI/bge-small-en-v1.5",
        },
    },
    ...
}
```
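FastEmbed can also be registered as a named model and assigned to individual fields, as described under Named Embedding Models above. A sketch, where the name `fast_model` is chosen purely for illustration:

```python
SEMANTIC_SEARCH = {
    "embedding_models": {
        "fast_model": {  # arbitrary name, referenced from VectorIndex below
            "model": "django_semantic_search.embeddings.FastEmbedDenseModel",
            "configuration": {
                "model_name": "BAAI/bge-small-en-v1.5",
            },
        },
    },
    # ... other SEMANTIC_SEARCH settings ...
}
```

A document field can then opt into it with `VectorIndex("title", embedding_model="fast_model")`.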
#### `django_semantic_search.embeddings.FastEmbedDenseModel`
Bases: `DenseTextEmbeddingModel`
FastEmbed dense embedding model that uses the FastEmbed library to generate dense embeddings.
Requirements: the `fastembed` package.
Usage:
```python
SEMANTIC_SEARCH = {
    "default_embeddings": {
        "model": "django_semantic_search.embeddings.FastEmbedDenseModel",
        "configuration": {
            "model_name": "BAAI/bge-small-en-v1.5",
        },
    },
    ...
}
```
Source code in src/django_semantic_search/embeddings/fastembed.py
##### `__init__(model_name, **kwargs)`
Initialize the FastEmbed dense model.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_name` | `str` | name of the model to use | *required* |
| `kwargs` | | additional kwargs passed to FastEmbed | `{}` |
Source code in src/django_semantic_search/embeddings/fastembed.py
##### `embed_document(document)`
Embed a document into a vector.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `document` | `str` | document to embed. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `DenseVector` | document embedding. |
Source code in src/django_semantic_search/embeddings/fastembed.py
##### `embed_query(query)`
Embed a query into a vector.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `query` | `str` | query to embed. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `DenseVector` | query embedding. |
Source code in src/django_semantic_search/embeddings/fastembed.py
##### `vector_size()`
Return the size of the individual embedding.
Returns:

| Type | Description |
| --- | --- |
| `int` | size of the embedding. |
Source code in src/django_semantic_search/embeddings/fastembed.py
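A minimal sketch of direct use, again assuming the import path mirrors the dotted path used in the settings:

```python
from django_semantic_search.embeddings import FastEmbedDenseModel

model = FastEmbedDenseModel(model_name="BAAI/bge-small-en-v1.5")

document_vector = model.embed_document("FastEmbed generates dense vectors efficiently.")
print(model.vector_size(), len(document_vector))
```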
#### Sparse Embeddings (Coming Soon)
**Note:** Sparse embeddings support is currently under development and is not yet available in django-semantic-search. While FastEmbed itself supports sparse embeddings (such as BM25), the integration is still in progress and will ship in a future release.