Arabic & MENA Guide

SovereignEG is built for MENA. This guide covers best practices for Arabic NLP, dialect handling, and getting the most out of Arabic-capable models in the live catalog.

Choosing a model for Arabic

Always start from the Model Library — only live models are callable. In practice:

Task	Where to look	Why
Arabic conversation	Live chat models with multilingual training (e.g. Qwen-class)	Strong Arabic without a separate integration
Arabic + English mixed	Same models as above; sort by context window if documents are long	Handles code-switching and bilingual prompts
Arabic with heavy reasoning	Larger live chat models (70B class)	Better instruction-following on complex Arabic tasks
Arabic retrieval (RAG)	Live embedding models (e.g. BGE-M3 class)	Multilingual vectors for Arabic + English corpora

Model IDs change as the catalog grows. Use GET /v1/models or the Model Library — never hardcode an id from a blog post.
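One way to avoid hardcoded ids is to resolve the model at startup from the live list. The following is a minimal sketch, assuming the catalog is reachable through an OpenAI-compatible SDK (where GET /v1/models corresponds to `client.models.list()`). The helper name `pick_live_model` and the preference substrings are hypothetical and only illustrate the idea.

```python
from typing import Iterable, Optional


def pick_live_model(live_ids: Iterable[str], preferences: Iterable[str]) -> Optional[str]:
    """Return the first live model id that contains a preferred substring.

    Preferences are tried in order, so earlier entries win. Matching is
    case-insensitive. Returns None when nothing in the catalog matches.
    """
    ids = list(live_ids)
    for wanted in preferences:
        for model_id in ids:
            if wanted.lower() in model_id.lower():
                return model_id
    return None


# Usage sketch (needs network access and credentials, so it is left commented out).
# Assumption: OPENAI_BASE_URL and OPENAI_API_KEY point at your SovereignEG account.
# from openai import OpenAI
# client = OpenAI()
# live_ids = [m.id for m in client.models.list()]
# model = pick_live_model(live_ids, ["qwen", "gpt-oss"])
```

Failing loudly when the helper returns None is usually safer than falling back to a stale id.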

System prompt in Arabic

Use an Arabic system prompt for Arabic tasks. This sets tone and language from the first token:

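from openai import OpenAI

# Assumption: SovereignEG is reached through the OpenAI-compatible Python SDK,
# with OPENAI_BASE_URL and OPENAI_API_KEY set for your account.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-oss-20b",  # example id; confirm it is live via GET /v1/models
    messages=[
        {
            "role": "system",
            # "You are a smart assistant who speaks Modern Standard Arabic. Answer briefly and accurately."
            "content": "أنت مساعد ذكي يتحدث العربية الفصحى. أجب بشكل مختصر ودقيق."
        },
        # "What are the main challenges facing startups in Egypt?"
        {"role": "user", "content": "ما هي أهم التحديات التي تواجه الشركات الناشئة في مصر؟"}
    ]
)
print(response.choices[0].message.content)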

Dialect handling

Strong multilingual chat models generally follow dialect instructions when you set them in the system prompt:

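# Egyptian dialect
response = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[
        # System: "You are an assistant who speaks in the Egyptian dialect."
        {"role": "system", "content": "أنت مساعد يتحدث باللهجة المصرية."},
        # User: "What are the best restaurants in Cairo?"
        {"role": "user", "content": "إيه أحسن مطاعم في القاهرة؟"}
    ]
)

# Gulf dialect
response = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[
        # System: "You are an assistant who speaks in the Gulf dialect."
        {"role": "system", "content": "أنت مساعد يتحدث باللهجة الخليجية."},
        # User: "Where should I go in Dubai?"
        {"role": "user", "content": "وين أروح في دبي؟"}
    ]
)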

Test a few live candidates on your own prompts — dialect quality varies by model family.
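A side-by-side comparison can be as simple as sending the same prompts to each candidate and reading the replies together. This is a rough sketch, assuming `client` exposes the same `chat.completions.create` call used in the examples above. The function name `compare_models` is hypothetical.

```python
from typing import Any, Dict, List


def compare_models(client: Any, model_ids: List[str], system_prompt: str,
                   user_prompts: List[str]) -> Dict[str, List[str]]:
    """Send the same prompts to each candidate model and collect the replies.

    Returns a dict mapping each model id to its replies, in prompt order.
    """
    results: Dict[str, List[str]] = {}
    for model_id in model_ids:
        replies = []
        for prompt in user_prompts:
            response = client.chat.completions.create(
                model=model_id,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": prompt},
                ],
            )
            replies.append(response.choices[0].message.content)
        results[model_id] = replies
    return results
```

Have a native speaker of the target dialect review the collected replies, since automatic checks rarely capture dialect quality.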

Arabic tokenization

Arabic text tokenizes differently per model family. Tokenizers with little Arabic in their training data often spend more tokens per Arabic word than tokenizers from multilingual models trained with Arabic in the mix:

Model family	Tokenizer	Typical tokens for "مرحبا بك في مصر" (4 words)
Multilingual (Qwen-class)	Multilingual BPE	~8
Llama-class	Byte-level BPE	~12

Cost tip: Compare the EGP-per-1M-token rates in the Model Library and run a short Arabic fixture through your top two live models before committing to one.
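The arithmetic behind that comparison is small enough to sketch. The example below assumes each response reports token counts (in the OpenAI-compatible SDK these are `response.usage.prompt_tokens` and `response.usage.completion_tokens`) and that input and output tokens may be priced separately. The function names and the rates in the usage lines are hypothetical placeholders, not real catalog prices.

```python
def tokens_per_word(text: str, token_count: int) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return token_count / len(words)


def cost_egp(prompt_tokens: int, completion_tokens: int,
             input_rate_per_1m: float, output_rate_per_1m: float) -> float:
    """Cost of one request in EGP, given EGP-per-1M-token rates."""
    return (prompt_tokens * input_rate_per_1m
            + completion_tokens * output_rate_per_1m) / 1_000_000


phrase = "مرحبا بك في مصر"
print(tokens_per_word(phrase, 8))    # 2.0 tokens per word
print(tokens_per_word(phrase, 12))   # 3.0 tokens per word

# Hypothetical rates: 10 EGP per 1M input tokens, 30 EGP per 1M output tokens.
print(cost_egp(1200, 300, 10.0, 30.0))
```

A model with a lower per-token rate can still cost more per request if its tokenizer needs more tokens for the same Arabic text, so compare the final EGP figure, not the rate alone.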

Right-to-left (RTL) display

When displaying Arabic output in your UI, set the text direction explicitly and use a font with good Arabic coverage:

.arabic-output {
  direction: rtl;
  text-align: right;
  font-family: 'IBM Plex Arabic', 'Noto Sans Arabic', sans-serif;
  line-height: 1.8;
}
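The CSS above fits output that is known to be Arabic. When replies can be Arabic, English, or mixed, the direction can be chosen per message instead. The sketch below uses a first-strong-character heuristic, similar in spirit to HTML's `dir="auto"`, built on the standard library's `unicodedata.bidirectional`. The function name `base_direction` is hypothetical.

```python
import unicodedata


def base_direction(text: str, default: str = "ltr") -> str:
    """Return "rtl" or "ltr" based on the first strongly directional character.

    Digits, spaces, and punctuation are not strong, so they are skipped.
    Falls back to `default` when the text has no strong character.
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # strong right-to-left (Hebrew-type or Arabic letter)
            return "rtl"
        if bidi == "L":          # strong left-to-right (Latin and similar scripts)
            return "ltr"
    return default


print(base_direction("مرحبا بك في مصر"))  # rtl
print(base_direction("Hello مصر"))        # ltr
```

The result can drive which class or `dir` attribute the UI applies to each message container.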

Data residency

Standard requests route through vetted model providers today. Egypt-hosted sovereign deployments are available for regulated workloads — contact us to discuss data residency. This matters for:

Government contracts — many GCC governments require in-region data processing
Banking & finance — regulatory requirements for data residency
Healthcare — patient data must stay in-region in many MENA jurisdictions
Corporate compliance — internal policies on data sovereignty

Local currency

All usage is billed in EGP by default. The dashboard shows costs in Egyptian pounds, so there are no USD conversion surprises.