Weaviate Integration
Weaviate is the vector database powering AtlasML's semantic search and similarity features. This guide covers how AtlasML connects to Weaviate, defines its collections, reads and writes embeddings, and handles errors.
What is Weaviate?
Weaviate is an open-source vector database that stores and searches high-dimensional vectors efficiently.
Key Features
- Vector Search: Find similar items using semantic similarity
- Hybrid Search: Combine vector and keyword search
- GraphQL API: Flexible query language
- Scalable: Handles millions of vectors
- Schema-Based: Structured data with properties
Why AtlasML Uses Weaviate
- Semantic Search: Find similar competencies by meaning, not just keywords
- Fast Retrieval: Sub-100ms queries even with thousands of items
- Flexible Schema: Easy to add new properties without migrations
- Vector Storage: Native support for embeddings
- Python SDK: First-class Python support
Connection Setup
Environment Variables
Configure Weaviate connection in .env:
WEAVIATE_HOST=https://weaviate.example.com # include scheme
WEAVIATE_PORT=443 # REST port
WEAVIATE_API_KEY=your-weaviate-api-key # optional for secured clusters
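A minimal sketch of how these variables might be read into a settings object. `WeaviateSettings` and `load_weaviate_settings` are hypothetical names for illustration, not necessarily what `atlasml.config` defines. The sketch assumes the scheme is split off `WEAVIATE_HOST`, since the connection code passes the host and an `http_secure` flag separately.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class WeaviateSettings:  # hypothetical stand-in for settings.weaviate
    host: str
    port: int
    scheme: str
    api_key: Optional[str] = None


def load_weaviate_settings(env: Mapping[str, str] = os.environ) -> WeaviateSettings:
    """Parse the WEAVIATE_* variables; the scheme is taken from WEAVIATE_HOST."""
    raw_host = env["WEAVIATE_HOST"]
    # Assumption: a host given without a scheme is treated as plain HTTP.
    parsed = urlparse(raw_host if "://" in raw_host else f"http://{raw_host}")
    return WeaviateSettings(
        host=parsed.hostname or raw_host,
        port=int(env["WEAVIATE_PORT"]),
        scheme=parsed.scheme,
        api_key=env.get("WEAVIATE_API_KEY") or None,  # empty string means "no key"
    )
```

Passing a plain dictionary instead of `os.environ` makes the parser easy to exercise in tests.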
Connection in Code
File: atlasml/clients/weaviate.py
from atlasml.config import get_settings
import weaviate
from weaviate.auth import AuthApiKey

settings = get_settings()
auth_credentials = AuthApiKey(api_key=settings.weaviate.api_key) if settings.weaviate.api_key else None
client = weaviate.connect_to_custom(
    http_host=settings.weaviate.host,
    http_port=settings.weaviate.port,
    http_secure=(settings.weaviate.scheme == "https"),
    grpc_host=None,  # REST-only
    grpc_port=None,
    auth_credentials=auth_credentials,
)
Singleton Pattern
AtlasML uses a singleton to ensure one client instance:
class WeaviateClientSingleton:
    _instances = {}

    @classmethod
    def get_instance(cls, weaviate_settings=None) -> WeaviateClient:
        if weaviate_settings is None:
            weaviate_settings = get_settings().weaviate
        # One client per (host, port) pair
        key = (weaviate_settings.host, weaviate_settings.port)
        if key not in cls._instances:
            cls._instances[key] = WeaviateClient(weaviate_settings)
        return cls._instances[key]
Why singleton?
- Reuse connection across requests
- Avoid connection overhead
- Thread-safe connection pooling
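The `get_instance` method shown above takes no lock, so two threads that call it at the same time could each build a client. If that matters in your deployment, one option is double-checked locking. The sketch below is illustrative only: `KeyedSingleton` is a hypothetical helper, not part of AtlasML, and the factory stands in for constructing a `WeaviateClient`.

```python
import threading
from typing import Any, Callable, Dict, Hashable


class KeyedSingleton:
    """Caches one instance per key; a lock keeps creation race-free."""

    def __init__(self, factory: Callable[[Hashable], Any]) -> None:
        self._factory = factory
        self._instances: Dict[Hashable, Any] = {}
        self._lock = threading.Lock()

    def get_instance(self, key: Hashable) -> Any:
        # Fast path: no lock is needed once the instance exists.
        instance = self._instances.get(key)
        if instance is None:
            with self._lock:
                # Re-check inside the lock: another thread may have won the race.
                instance = self._instances.get(key)
                if instance is None:
                    instance = self._factory(key)
                    self._instances[key] = instance
        return instance
```

The key plays the same role as the `(host, port)` tuple in `WeaviateClientSingleton`.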
Getting the Client
from atlasml.clients.weaviate import get_weaviate_client
client = get_weaviate_client()
Collections (Schemas)
Available Collections
AtlasML defines three main collections:
from enum import Enum


class CollectionNames(str, Enum):
    EXERCISE = "Exercise"
    COMPETENCY = "Competency"
    SEMANTIC_CLUSTER = "SemanticCluster"
Collection Schemas
1. Competency Collection
COMPETENCY_SCHEMA = {
    "class": "Competency",
    "description": "Competency records with embeddings",
    "vectorizer": "none",  # We provide vectors manually
    "properties": [
        {"name": "competency_id", "dataType": ["int"], "description": "Unique competency ID"},
        {"name": "title", "dataType": ["text"], "description": "Competency title"},
        {"name": "description", "dataType": ["text"], "description": "Competency description"},
        {"name": "course_id", "dataType": ["int"], "description": "Associated course ID"},
    ],
}
Example Object:
{
  "uuid": "a1b2c3d4-...",
  "properties": {
    "competency_id": 42,
    "title": "Object-Oriented Programming",
    "description": "Understanding OOP principles",
    "course_id": 1
  },
  "vector": [0.123, -0.456, 0.789, ...] // 1536 dimensions
}
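Because the vectorizer is `none`, Weaviate receives whatever vector and properties the caller supplies, so it can help to check an object before inserting it. The following is a sketch under assumptions: `validate_object` is a hypothetical helper, not part of AtlasML, and it only understands the three `dataType` strings used in this guide's schemas (`int`, `text`, `int[]`).

```python
from typing import Any, Dict, List, Sequence


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


# Maps the dataType strings used in this guide to Python-side checks.
_TYPE_CHECKS = {
    "int": _is_int,
    "text": lambda v: isinstance(v, str),
    "int[]": lambda v: isinstance(v, list) and all(_is_int(i) for i in v),
}


def validate_object(schema: Dict[str, Any], properties: Dict[str, Any],
                    vector: Sequence[float], expected_dim: int) -> List[str]:
    """Return a list of problems; an empty list means the object looks valid."""
    problems = []
    declared = {p["name"]: p["dataType"][0] for p in schema["properties"]}
    for name, value in properties.items():
        if name not in declared:
            problems.append(f"unknown property: {name}")
        elif not _TYPE_CHECKS[declared[name]](value):
            problems.append(f"{name}: expected {declared[name]}")
    if len(vector) != expected_dim:
        problems.append(f"vector has {len(vector)} dimensions, expected {expected_dim}")
    return problems
```

For the example object, `expected_dim` would be 1536.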
2. Exercise Collection
EXERCISE_SCHEMA = {
    "class": "Exercise",
    "description": "Exercise records with competency associations",
    "vectorizer": "none",
    "properties": [
        {"name": "exercise_id", "dataType": ["int"], "description": "Unique exercise ID"},
        {"name": "title", "dataType": ["text"], "description": "Exercise title"},
        {"name": "description", "dataType": ["text"], "description": "Exercise description"},
        {"name": "competency_ids", "dataType": ["int[]"], "description": "Associated competency IDs"},
        {"name": "course_id", "dataType": ["int"], "description": "Course ID"},
    ],
}
3. SemanticCluster Collection
SEMANTIC_CLUSTER_SCHEMA = {
    "class": "SemanticCluster",
    "description": "Semantic clusters of competencies",
    "vectorizer": "none",
    "properties": [
        {"name": "cluster_id", "dataType": ["text"], "description": "Unique cluster identifier"},
        {"name": "course_id", "dataType": ["int"], "description": "Course ID"},
    ],
}
Auto-Creation
Collections are created automatically on startup:
def _ensure_collections_exist(self):
    for collection_name, schema in COLLECTION_SCHEMAS.items():
        # Only create collections that are missing, so restarts are safe
        if not self.client.collections.exists(collection_name):
            self.client.collections.create_from_dict(schema)
            logger.info(f"✅ Created collection: {collection_name}")
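The check-then-create loop above is idempotent: running it a second time creates nothing. The sketch below exercises that property without a running Weaviate instance. `FakeCollections` is a hypothetical in-memory stand-in that mimics only the `exists` and `create_from_dict` calls used above, and `ensure_collections_exist` is a free-function variant written for illustration.

```python
from typing import Any, Dict, List


class FakeCollections:
    """Hypothetical in-memory stand-in for client.collections (testing only)."""

    def __init__(self) -> None:
        self.created: Dict[str, Dict[str, Any]] = {}

    def exists(self, name: str) -> bool:
        return name in self.created

    def create_from_dict(self, schema: Dict[str, Any]) -> None:
        self.created[schema["class"]] = schema


def ensure_collections_exist(collections: Any, schemas: Dict[str, Dict[str, Any]]) -> List[str]:
    """Create only the missing collections and return the names that were created."""
    newly_created = []
    for name, schema in schemas.items():
        if not collections.exists(name):
            collections.create_from_dict(schema)
            newly_created.append(name)
    return newly_created
```

Returning the created names makes the startup step easy to assert on in a unit test.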
CRUD Operations
Create (Add Embeddings)
Insert a vector with properties:
from atlasml.clients.weaviate import get_weaviate_client, CollectionNames

client = get_weaviate_client()
uuid = client.add_embeddings(
    collection_name=CollectionNames.COMPETENCY.value,
    embeddings=[0.1, 0.2, 0.3, ...],  # 1536-dim vector
    properties={
        "competency_id": 100,
        "title": "Machine Learning Basics",
        "description": "Introduction to ML concepts",
        "course_id": 2
    }
)
print(f"Created object: {uuid}")
Returns: UUID of the created object
Use Cases:
- Saving new competencies
- Storing exercise embeddings
- Creating clusters
Read (Get Embeddings)
Get by UUID
obj = client.get_embeddings(
    collection_name=CollectionNames.COMPETENCY.value,
    id="a1b2c3d4-..."
)
print(obj["properties"]["title"])
print(obj["vector"][:5])  # First 5 dimensions
Get All Objects
all_competencies = client.get_all_embeddings(
    collection_name=CollectionNames.COMPETENCY.value
)
for comp in all_competencies:
    print(f"{comp['properties']['competency_id']}: {comp['properties']['title']}")
Returns: List of objects with vectors and properties
Use Cases:
- Exporting data
- Bulk processing
- Analytics
Get by Property Filter
course_1_competencies = client.get_embeddings_by_property(
    collection_name=CollectionNames.COMPETENCY.value,
    property_name="course_id",
    property_value=1
)
print(f"Found {len(course_1_competencies)} competencies in course 1")
Use Cases:
- Filtering by course
- Finding specific competency IDs
- Scoped queries
Get by Multiple Properties
results = client.search_by_multiple_properties(
    collection_name=CollectionNames.COMPETENCY.value,
    property_filters={
        "course_id": 1,
        "competency_id": 42
    }
)
Logic: AND condition (all filters must match)
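To make the AND semantics concrete, here is an in-memory sketch of the same rule applied to objects shaped like the results in this guide. `match_all` is a hypothetical helper for illustration; the real filtering is done by Weaviate, not in Python.

```python
from typing import Any, Dict, Iterable, List


def match_all(objects: Iterable[Dict[str, Any]],
              property_filters: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep objects whose properties equal every filter value (logical AND)."""
    return [
        obj for obj in objects
        if all(obj["properties"].get(name) == value
               for name, value in property_filters.items())
    ]
```

An object that matches `course_id` but not `competency_id` is excluded, and an empty filter dictionary matches every object.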
Update (Modify Properties)
Update an existing object:
success = client.update_property_by_id(
    collection_name=CollectionNames.COMPETENCY.value,
    id="a1b2c3d4-...",
    properties={
        "title": "Updated Title",
        "description": "Updated description"
    },
    vector=[0.5, 0.6, 0.7, ...]  # Optional: update vector too
)
if success:
    print("Update successful")
Parameters:
- id: Object UUID
- properties: Properties to update (merged with existing)
- vector: Optional new vector
Use Cases:
- Updating competency descriptions
- Regenerating embeddings
- Fixing data
Delete (Remove Objects)
Delete by property filter:
deleted_count = client.delete_by_property(
    collection_name=CollectionNames.COMPETENCY.value,
    property_name="competency_id",
    property_value=100
)
print(f"Deleted {deleted_count} objects")
Returns: Number of deleted objects
Use Cases:
- Removing deleted competencies
- Cleaning up test data
- Data synchronization
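For the data synchronization use case, one possible approach is to compare the competency IDs in the source of truth with those already stored, then insert and delete only the differences. This is a sketch, not AtlasML's actual sync logic, and `plan_sync` is a hypothetical helper.

```python
from typing import Iterable, Set, Tuple


def plan_sync(source_ids: Iterable[int], stored_ids: Iterable[int]) -> Tuple[Set[int], Set[int]]:
    """Return (ids_to_insert, ids_to_delete) that make the store match the source."""
    source, stored = set(source_ids), set(stored_ids)
    return source - stored, stored - source
```

Each ID in the second set could then be passed to `delete_by_property` with `property_name="competency_id"`.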
Vector Search
Similarity Search
Find similar vectors:
# 1. Generate query embedding
from atlasml.ml.embeddings import EmbeddingGenerator
generator = EmbeddingGenerator()
query_vector = generator.generate_embeddings_openai("Python programming")

# 2. Get all competencies for comparison
competencies = client.get_embeddings_by_property(
    collection_name=CollectionNames.COMPETENCY.value,
    property_name="course_id",
    property_value=1
)

# 3. Compute similarity
from atlasml.ml.similarity_measures import compute_cosine_similarity
results = []
for comp in competencies:
    similarity = compute_cosine_similarity(query_vector, comp["vector"])
    results.append({
        "competency": comp["properties"],
        "similarity": similarity
    })

# 4. Sort by similarity (highest first)
results.sort(key=lambda x: x["similarity"], reverse=True)

# 5. Return top results
top_5 = results[:5]
for r in top_5:
    print(f"{r['competency']['title']}: {r['similarity']:.3f}")
Output:
Python Programming Fundamentals: 0.892
Object-Oriented Programming in Python: 0.845
Data Structures in Python: 0.723
...
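The implementation of `compute_cosine_similarity` is not shown in this guide. As a reference point, cosine similarity is the dot product of two vectors divided by the product of their lengths. The pure-Python sketch below is an assumption about what such a function computes, not AtlasML's code. `cosine_similarity` and `top_k` are hypothetical names, and returning 0.0 for a zero vector is a choice made here for simplicity.

```python
import heapq
import math
from typing import Any, Dict, Iterable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); defined as 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], objects: Iterable[Dict[str, Any]],
          k: int = 5) -> List[Tuple[float, Dict[str, Any]]]:
    """Return the k (similarity, object) pairs with the highest similarity."""
    scored = ((cosine_similarity(query, obj["vector"]), obj) for obj in objects)
    # nlargest avoids sorting the full list when only a few results are needed.
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

Using `heapq.nlargest` instead of a full sort is a small optimization when `k` is much smaller than the number of candidates.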
Schema Management
Check if Collection Exists
exists = client.client.collections.exists("Competency")
print(f"Competency collection exists: {exists}")
Create Collection
client.client.collections.create_from_dict({
    "class": "MyCollection",
    "vectorizer": "none",
    "properties": [
        {"name": "my_property", "dataType": ["text"]}
    ]
})
Delete Collection
client.delete_all_data_from_collection("Competency")
# Deletes and recreates empty collection
Recreate Collection
client.recreate_collection("Competency")
# Uses schema from COLLECTION_SCHEMAS
Connection Lifecycle
Startup
On application startup (lifespan in app.py):
from contextlib import asynccontextmanager


@asynccontextmanager
async def lifespan(app):
    # Initialize Weaviate client
    client = get_weaviate_client()

    # Check connection
    if client.is_alive():
        logger.info("🔌 Weaviate: Connected")
    else:
        logger.error("❌ Weaviate: Connection failed")

    # Ensure collections exist
    client._ensure_collections_exist()

    yield  # App is running

    # Shutdown: close connection
    client.close()
Health Check
is_alive = client.is_alive()
# Returns: True if responsive, False otherwise
Close Connection
client.close()
# Releases connection resources
Error Handling
Custom Exceptions
from atlasml.clients.weaviate import (
    WeaviateConnectionError,
    WeaviateOperationError
)
WeaviateConnectionError
Raised when connection fails:
try:
    client = WeaviateClient(settings)
except WeaviateConnectionError as e:
    logger.error(f"Failed to connect: {e}")
    # Fallback: retry or use cached data
WeaviateOperationError
Raised when operations fail:
try:
    uuid = client.add_embeddings(...)
except WeaviateOperationError as e:
    logger.error(f"Failed to add embedding: {e}")
    # Handle error: retry, log, notify
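The "retry" option mentioned in the comments above could be implemented with exponential backoff. The sketch below is one possible shape, not AtlasML's actual policy. `retry` is a hypothetical helper, and the attempt count and delays are illustrative defaults.

```python
import time
from typing import Any, Callable, Tuple, Type


def retry(operation: Callable[[], Any],
          retry_on: Tuple[Type[BaseException], ...],
          attempts: int = 3,
          base_delay: float = 0.5,
          sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call operation(); on a listed exception, wait base_delay * 2**attempt and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(base_delay * 2 ** attempt)
```

A call might look like `retry(lambda: client.add_embeddings(...), (WeaviateOperationError,))`. The `sleep` parameter exists so tests can record the delays instead of actually waiting.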
Exception Handling Pattern
@router.post("/suggest")
async def suggest_competencies(request: SuggestCompetencyRequest):
    try:
        client = get_weaviate_client()
        results = client.get_embeddings_by_property(...)
        return results
    except WeaviateConnectionError:
        raise HTTPException(503, "Database unavailable")
    except WeaviateOperationError as e:
        logger.error(f"Weaviate error: {e}")
        raise HTTPException(500, "Failed to query database")
Performance Considerations
Query Optimization
- Filter Early: Use property filters to reduce result set
- Limit Results: Don't fetch more than needed
- Batch Operations: Insert multiple objects at once
- Connection Pooling: Reuse client instance (singleton)
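For the "Batch Operations" point, a common first step is to split the input into fixed-size chunks so that each chunk can be embedded and inserted together. The helper below is a generic sketch. `chunked` is a hypothetical name, and how a chunk is then sent to Weaviate depends on what the AtlasML client wrapper exposes, which this guide does not show.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items; the last one may be shorter."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Because it works on any iterable, the helper can also consume a generator without loading every competency into memory first.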
Indexing
Weaviate automatically indexes:
- Vector data (HNSW algorithm)
- Property values (inverted index)
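Query Time:
- Vector search: approximately O(log n), since HNSW is an approximate nearest-neighbor index
- Property filter: near-constant-time lookup for indexed properties, plus time proportional to the number of matches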
Scaling
For large datasets (>1M vectors):
- Use Weaviate clustering
- Optimize maxConnections and efConstruction in HNSW config
- Consider sharding by course_id
Debugging
Enable Debug Logging
import logging
logging.getLogger("weaviate").setLevel(logging.DEBUG)
Inspect Collection
collection = client.client.collections.get("Competency")
print(f"Collection: {collection.name}")
print(f"Config: {collection.config}")
Count Objects
competencies = client.get_all_embeddings("Competency")
print(f"Total competencies: {len(competencies)}")
View Sample Object
competencies = client.get_all_embeddings("Competency")
if competencies:
    sample = competencies[0]
    print(f"UUID: {sample['uuid']}")
    print(f"Properties: {sample['properties']}")
    print(f"Vector dimensions: {len(sample['vector'])}")
---
## Common Patterns
### Pattern 1: Save Competency with Embedding
```python
def save_competency(competency: Competency):
    # 1. Generate embedding
    generator = EmbeddingGenerator()
    embedding = generator.generate_embeddings_openai(competency.description)

    # 2. Save to Weaviate
    client = get_weaviate_client()
    uuid = client.add_embeddings(
        collection_name=CollectionNames.COMPETENCY.value,
        embeddings=embedding,
        properties={
            "competency_id": competency.id,
            "title": competency.title,
            "description": competency.description,
            "course_id": competency.course_id
        }
    )
    return uuid
```
### Pattern 2: Find Similar Competencies
def find_similar(description: str, course_id: int, top_n: int = 5):
    # 1. Generate query embedding
    generator = EmbeddingGenerator()
    query_embedding = generator.generate_embeddings_openai(description)

    # 2. Get competencies for course
    client = get_weaviate_client()
    competencies = client.get_embeddings_by_property(
        collection_name=CollectionNames.COMPETENCY.value,
        property_name="course_id",
        property_value=course_id
    )

    # 3. Compute similarities
    from atlasml.ml.similarity_measures import compute_cosine_similarity
    results = []
    for comp in competencies:
        similarity = compute_cosine_similarity(query_embedding, comp["vector"])
        results.append((comp, similarity))

    # 4. Sort and return top N
    results.sort(key=lambda x: x[1], reverse=True)
    return results[:top_n]
### Pattern 3: Batch Insert
def batch_insert_competencies(competencies: list[Competency]):
    generator = EmbeddingGenerator()
    client = get_weaviate_client()
    for comp in competencies:
        embedding = generator.generate_embeddings_openai(comp.description)
        client.add_embeddings(
            collection_name=CollectionNames.COMPETENCY.value,
            embeddings=embedding,
            properties={
                "competency_id": comp.id,
                "title": comp.title,
                "description": comp.description,
                "course_id": comp.course_id
            }
        )
Best Practices
1. Always Use Property Filters
# ✅ Good - Filter by course_id first
competencies = client.get_embeddings_by_property(
    collection_name="Competency",
    property_name="course_id",
    property_value=1
)

# ❌ Bad - Get all then filter in Python
all_comps = client.get_all_embeddings("Competency")
filtered = [c for c in all_comps if c["properties"]["course_id"] == 1]
2. Reuse Client Instance
# ✅ Good - Use singleton
client = get_weaviate_client()
# ❌ Bad - Create new instance
client = WeaviateClient() # Don't do this!
3. Handle Connection Errors
# ✅ Good - Handle errors
try:
results = client.get_embeddings_by_property(...)
except WeaviateConnectionError:
# Fallback or retry
logger.error("Weaviate unavailable")
# ❌ Bad - Let exception propagate
results = client.get_embeddings_by_property(...)
4. Use Appropriate Vector Dimensions
# ✅ Good - Consistent dimensions
# OpenAI: 1536 dims, Local model: 384 dims
embedding = generator.generate_embeddings_openai(text) # 1536
# ❌ Bad - Mixing dimensions
embedding1 = generator.generate_embeddings_openai(text) # 1536
embedding2 = generator.generate_embeddings(text) # 384
# Can't compare these!
5. Clean Up Test Data
# After tests
def cleanup_test_data():
client = get_weaviate_client()
client.delete_by_property(
collection_name="Competency",
property_name="course_id",
property_value=999 # Test course ID
)
Next Steps
- ML Pipelines: Learn how embeddings are generated
- Modules: Deep dive into weaviate.py implementation
- Troubleshooting: Debug Weaviate issues
- System Design: Understand Weaviate's role
Resources
- Weaviate Documentation: https://weaviate.io/developers/weaviate
- Python Client: https://weaviate.io/developers/weaviate/client-libraries/python
- Vector Search: https://weaviate.io/developers/weaviate/search/similarity
- Schema Configuration: https://weaviate.io/developers/weaviate/config-refs/schema