RAG Pipeline Guide

Build a retrieval-augmented generation (RAG) pipeline using SovereignEG's embeddings and chat completions APIs.

What is RAG?

RAG lets a model answer questions using your own data. Instead of relying only on what the model learned during training, you:

Embed your documents into vectors
Store vectors in a database
Retrieve relevant chunks when a user asks a question
Generate an answer grounded in retrieved context

Full example

from openai import OpenAI
import numpy as np
 
client = OpenAI(api_key="sk-...", base_url="https://backend.sovereigneg.com/v1")
 
# Step 1: Embed your documents
documents = [
    "SovereignEG serves AI models through an OpenAI-compatible API with EGP billing.",
    "The catalog includes open-weight chat and embedding models.",
    "Rate limits are configured per API key and organization.",
    "Billing is denominated in Egyptian pounds (EGP).",
]
 
doc_embeddings = []
for doc in documents:
    response = client.embeddings.create(
        model="embeddinggemma-300m",  # embedding model ID currently live in the catalog
        input=doc
    )
    doc_embeddings.append(response.data[0].embedding)
 
# Step 2: Embed the user's question
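# This question is not covered by the documents above, so the system prompt
# in Step 4 should lead the model to say the answer is not in the context.
question = "What is the latency from Cairo?"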
q_response = client.embeddings.create(model="embeddinggemma-300m", input=question)
q_embedding = q_response.data[0].embedding
 
# Step 3: Find most relevant documents (cosine similarity)
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
 
scores = [cosine_sim(q_embedding, d) for d in doc_embeddings]
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
context = "\n".join([documents[i] for i in top_indices])
 
# Step 4: Generate answer with context
response = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[
        {
            "role": "system",
            "content": f"Answer based on this context:\n\n{context}\n\nIf the answer is not in the context, say so."
        },
        {"role": "user", "content": question}
    ]
)
 
print(response.choices[0].message.content)
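
Step 3 above always passes the three highest-scoring documents to the model, even when none of them is related to the question. The sketch below adds a minimum-similarity cutoff so unrelated documents are left out of the context. The helper name `select_context` and the default `min_score=0.3` are illustrative assumptions, not part of the SovereignEG API; a suitable threshold depends on the embedding model and should be tuned on your own data.

```python
def select_context(scores, documents, k=3, min_score=0.3):
    """Return up to k documents, best first, whose score is at least min_score.

    `scores[i]` is the similarity between the question and `documents[i]`.
    The 0.3 default is a placeholder assumption, not a recommended value.
    """
    if len(scores) != len(documents):
        raise ValueError("scores and documents must have the same length")
    # Rank by score only; ties keep their original order.
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score >= min_score]
```

In the full example, this could replace the `top_indices` lines with `context = "\n".join(select_context(scores, documents))`. An empty result means nothing passed the cutoff, and the system prompt then steers the model toward saying the answer is not in the context.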

Production RAG with a vector database

For production, use a vector database instead of in-memory search:

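# Using ChromaDB (reuses `client`, `documents`, and `q_embedding` from the full example)
import chromadb
 
# chromadb.Client() keeps data in memory only; use chromadb.PersistentClient(path=...) to store it on disk
chroma = chromadb.Client()
# get_or_create_collection avoids an error if "docs" already exists
collection = chroma.get_or_create_collection("docs")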
 
# Add documents with embeddings from SovereignEG
for i, doc in enumerate(documents):
    embedding = client.embeddings.create(model="embeddinggemma-300m", input=doc)
    collection.add(
        ids=[f"doc-{i}"],
        embeddings=[embedding.data[0].embedding],
        documents=[doc]
    )
 
# Query
results = collection.query(
    query_embeddings=[q_embedding],
    n_results=3
)
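
The query above returns a dictionary whose values are lists of lists, with one inner list per query embedding. A minimal sketch of turning that result into a chat prompt follows. It tags each chunk with its ID so the model can cite its sources. The helper names `build_cited_context` and `build_messages` are hypothetical, and the prompt wording is only one possible phrasing.

```python
def build_cited_context(results):
    """Format the first query's hits as '[id] text' lines.

    `results` is the dict returned by collection.query(...); index 0
    selects the hits for the first (here, the only) query embedding.
    """
    ids = results["ids"][0]
    docs = results["documents"][0]
    return "\n".join(f"[{doc_id}] {text}" for doc_id, text in zip(ids, docs))


def build_messages(question, results):
    """Build chat messages that ground the answer in the retrieved chunks."""
    system = (
        "Answer based only on the context below. "
        "Cite the bracketed IDs of the chunks you used. "
        "If the answer is not in the context, say so.\n\n"
        + build_cited_context(results)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The returned list can be passed as `messages=build_messages(question, results)` to `client.chat.completions.create`, as in Step 4 of the full example.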

Arabic RAG

For Arabic content, choose a multilingual embedding model that is currently live in the catalog. The flow is the same as for English:

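arabic_docs = [
    # "SovereignEG offers an OpenAI-compatible API and billing in Egyptian pounds."
    "تقدم SovereignEG واجهة برمجة متوافقة مع OpenAI وفوترة بالجنيه المصري.",
    # "The multilingual models in the catalog support Arabic and English in the same pipeline."
    "النماذج متعددة اللغات في الكتالوج تدعم العربية والإنجليزية في نفس المسار.",
]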
 
for doc in arabic_docs:
    embedding = client.embeddings.create(model="embeddinggemma-300m", input=doc)
    # Store in vector DB...

Best practices

Chunk size: 256-512 tokens per chunk is a good starting point for most documents; tune it against your own retrieval results
Overlap: Use a 50-100 token overlap between consecutive chunks to avoid losing context at boundaries
Top-K: Retrieve 3-5 chunks. More isn't always better, because extra chunks can dilute relevance
System prompt: Always instruct the model to answer based only on the retrieved context
Citations: Ask the model to cite which chunks it used
Reranking: For high-quality results, use a cross-encoder reranker after initial retrieval
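
The chunk size and overlap guidance above can be sketched as a sliding window. This version approximates tokens with whitespace-separated words, which is an assumption. The embedding model's own tokenizer will count differently, so treat the sizes as rough. The function names `chunk_tokens` and `chunk_text` are illustrative.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


def chunk_text(text, chunk_size=256, overlap=50):
    """Chunk text using whitespace-separated words as a stand-in for tokens."""
    words = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(words, chunk_size, overlap)]
```

Each string returned by `chunk_text` can then be embedded and stored in the same way as the entries of `documents` in the full example.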

Managed RAG agents

Managed RAG agents are on the roadmap but not yet available. Until then, build RAG with the embeddings and chat completions endpoints shown above.