RAG Pipeline Guide

Build a retrieval-augmented generation (RAG) pipeline using SovereignEG's embeddings and chat completions APIs.

What is RAG?

RAG lets your AI answer questions using your own data. Instead of relying only on the model's training data, you:

  1. Embed your documents into vectors
  2. Store vectors in a database
  3. Retrieve relevant chunks when a user asks a question
  4. Generate an answer grounded in retrieved context

Full example

from openai import OpenAI
import numpy as np
 
client = OpenAI(api_key="sk-...", base_url="https://sovereigneg.com/v1")
 
# Step 1: Embed your documents
documents = [
    "SovereignEG serves AI models through an OpenAI-compatible API with EGP billing.",
    "The catalog includes open-weight chat and embedding models.",
    "Rate limits are configured per API key and organization.",
    "Billing is denominated in Egyptian pounds (EGP).",
]
 
doc_embeddings = []
for doc in documents:
    response = client.embeddings.create(
        model="...",  # live embed id
        input=doc
    )
    doc_embeddings.append(response.data[0].embedding)
 
# Step 2: Embed the user's question
question = "What is the latency from Cairo?"
q_response = client.embeddings.create(model="...", input=question)  # live embed id
q_embedding = q_response.data[0].embedding
 
# Step 3: Find most relevant documents (cosine similarity)
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
 
scores = [cosine_sim(q_embedding, d) for d in doc_embeddings]
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
context = "\n".join([documents[i] for i in top_indices])
 
# Step 4: Generate answer with context
response = client.chat.completions.create(
    model="...",
    messages=[
        {
            "role": "system",
            "content": f"Answer based on this context:\n\n{context}\n\nIf the answer is not in the context, say so."
        },
        {"role": "user", "content": question}
    ]
)
 
print(response.choices[0].message.content)
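
Once there are more than a handful of documents, the per-document loop in Step 3 can be replaced by a single matrix operation. The following is a minimal sketch, assuming the embeddings are plain lists of floats with no all-zero vectors; `top_k_cosine` is a hypothetical helper name, not part of any SDK.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k documents most similar to the query."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    # Normalize once so that a dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending, so reverse it and keep the first k indices.
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]
```

With the variables from the example above, `top_k_cosine(q_embedding, doc_embeddings)` would yield the same ranking as the `scores` and `top_indices` lines.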

Production RAG with a vector database

For production, use a vector database instead of in-memory search:

# Using ChromaDB
import chromadb
 
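# chromadb.Client() keeps data in memory only; chromadb.PersistentClient(path=...) writes it to disk
chroma = chromadb.Client()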
collection = chroma.create_collection("docs")
 
# Add documents with embeddings from SovereignEG
for i, doc in enumerate(documents):
    embedding = client.embeddings.create(model="...", input=doc)  # live embed id
    collection.add(
        ids=[f"doc-{i}"],
        embeddings=[embedding.data[0].embedding],
        documents=[doc]
    )
 
# Query
results = collection.query(
    query_embeddings=[q_embedding],
    n_results=3
)
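
The query result can feed Step 4 in the same way as the in-memory context. The sketch below assumes the result is shaped like Chroma's query response: a dict whose `ids` and `documents` values hold one inner list per query embedding. `build_context` is a hypothetical helper, and `sample_results` is a stand-in used only for illustration.

```python
def build_context(results, query_index=0):
    """Join retrieved chunks into one context string, labeling each with its id."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    # Labeling chunks lets the prompt ask the model to cite the ids it used.
    return "\n".join(f"[{doc_id}] {text}" for doc_id, text in zip(ids, docs))

# Stand-in for the value returned by collection.query(...), for illustration only.
sample_results = {
    "ids": [["doc-0", "doc-3"]],
    "documents": [["First chunk.", "Second chunk."]],
}
context = build_context(sample_results)
```

With a real collection, pass the `results` from the query and use the returned string as `context` in the system prompt.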

Arabic RAG

Pick a multilingual embedding model from the live catalog so that Arabic text is embedded well:

arabic_docs = [
    "تقدم SovereignEG واجهة برمجة متوافقة مع OpenAI وفوترة بالجنيه المصري.",
    "النماذج متعددة اللغات في الكتالوج تدعم العربية والإنجليزية في نفس المسار.",
]
 
for doc in arabic_docs:
    embedding = client.embeddings.create(model="...", input=doc)  # live embed id
    # Store in vector DB...

Best practices

  1. Chunk size: 256-512 tokens per chunk is a reasonable starting point for most documents
  2. Overlap: Use 50-100 token overlap between chunks to avoid losing context at boundaries
  3. Top-K: Retrieve 3-5 chunks. More isn't always better; extra chunks can dilute relevance
  4. System prompt: Always instruct the model to answer based on context only
  5. Citations: Ask the model to cite which chunks it used
  6. Reranking: For high-quality results, use a cross-encoder reranker after initial retrieval
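
Items 1 and 2 can be sketched as a sliding window. This version counts whitespace-separated words as a rough stand-in for tokens, which is an assumption; a real pipeline would count tokens with the tokenizer of the chosen embedding model. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=256, overlap=50):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each returned chunk can then be embedded and stored the same way as the entries in `documents`.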

Managed RAG agents

Managed RAG agents are on the roadmap. Today, build RAG with the embeddings and chat endpoints shown above.