Arabic & MENA Guide
SovereignEG is built for MENA. This guide covers best practices for Arabic NLP, dialect handling, and getting the most out of Arabic-capable models in the live catalog.
Choosing a model for Arabic
Always start from the Model Library — only live models are callable. In practice:
| Task | Where to look | Why |
|---|---|---|
| Arabic conversation | Live chat models with multilingual training (e.g. Qwen-class) | Strong Arabic without a separate integration |
| Arabic + English mixed | Same — sort by context window if docs are long | Handles code-switching and bilingual prompts |
| Arabic with heavy reasoning | Larger live chat models (70B class) | Better instruction-following on complex Arabic tasks |
| Arabic retrieval (RAG) | Live embedding models (e.g. BGE-M3 class) | Multilingual vectors for Arabic + English corpora |
Model IDs change as the catalog grows. Use GET /v1/models or the Model Library — never hardcode an id from a blog post.
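As a minimal sketch of resolving a model id at runtime instead of hardcoding it, the helper below picks the first live id that matches an ordered list of preferences. The function names and the substring-matching rule are illustrative assumptions, not part of the platform. `list_live_ids` assumes an OpenAI-compatible SDK client whose `models.list()` call maps to GET /v1/models.

```python
def pick_live_model(live_ids, preferences):
    """Return the first live id containing a preferred substring, or None.

    Preferences are checked in order, so earlier entries win.
    """
    for wanted in preferences:
        for model_id in live_ids:
            if wanted.lower() in model_id.lower():
                return model_id
    return None


def list_live_ids(client):
    """Fetch live model ids (assumes an OpenAI-compatible client)."""
    return [model.id for model in client.models.list()]
```

In use, something like `pick_live_model(list_live_ids(client), ["qwen", "llama"])` would return a currently callable id or `None`, so the caller can fail loudly when a preferred family leaves the catalog.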
System prompt in Arabic
Use an Arabic system prompt for Arabic tasks. This sets tone and language from the first token:
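# Assumes `client` is an OpenAI-compatible SDK client already configured for your account.
response = client.chat.completions.create(
    model="...",  # live model id from the catalog
    messages=[
        {
            "role": "system",
            # "You are a smart assistant who speaks Modern Standard Arabic. Answer concisely and accurately."
            "content": "أنت مساعد ذكي يتحدث العربية الفصحى. أجب بشكل مختصر ودقيق."
        },
        # "What are the biggest challenges facing startups in Egypt?"
        {"role": "user", "content": "ما هي أهم التحديات التي تواجه الشركات الناشئة في مصر؟"}
    ]
)
Dialect handling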
Strong multilingual chat models generally follow dialect instructions when you set them in the system prompt:
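# Egyptian dialect
response = client.chat.completions.create(
    model="...",  # live model id from the catalog
    messages=[
        # "You are an assistant who speaks the Egyptian dialect."
        {"role": "system", "content": "أنت مساعد يتحدث باللهجة المصرية."},
        # "What are the best restaurants in Cairo?"
        {"role": "user", "content": "إيه أحسن مطاعم في القاهرة؟"}
    ]
)

# Gulf dialect
response = client.chat.completions.create(
    model="...",  # live model id from the catalog
    messages=[
        # "You are an assistant who speaks the Gulf dialect."
        {"role": "system", "content": "أنت مساعد يتحدث باللهجة الخليجية."},
        # "Where should I go in Dubai?"
        {"role": "user", "content": "وين أروح في دبي؟"}
    ]
)
Test a few live candidates on your own prompts, since dialect quality varies by model family.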
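One way to run that comparison is to send the same dialect-pinned conversation to each candidate and read the replies side by side. This is a sketch under assumptions: the helper names are hypothetical, the system prompts are the ones used in this guide, and `compare_models` assumes an OpenAI-compatible client like the one in the examples above.

```python
# System prompts taken from this guide; the dictionary keys are illustrative labels.
DIALECT_PROMPTS = {
    "msa": "أنت مساعد ذكي يتحدث العربية الفصحى. أجب بشكل مختصر ودقيق.",
    "egyptian": "أنت مساعد يتحدث باللهجة المصرية.",
    "gulf": "أنت مساعد يتحدث باللهجة الخليجية.",
}


def build_messages(dialect, user_text):
    """Return a chat `messages` list with the dialect pinned in the system prompt."""
    if dialect not in DIALECT_PROMPTS:
        raise ValueError(f"unknown dialect: {dialect!r}")
    return [
        {"role": "system", "content": DIALECT_PROMPTS[dialect]},
        {"role": "user", "content": user_text},
    ]


def compare_models(client, model_ids, dialect, user_text):
    """Send the same prompt to each candidate model and collect replies by model id."""
    replies = {}
    for model_id in model_ids:
        response = client.chat.completions.create(
            model=model_id,
            messages=build_messages(dialect, user_text),
        )
        replies[model_id] = response.choices[0].message.content
    return replies
```

Keeping the prompts in one table means every candidate sees identical instructions, so differences in the replies reflect the model rather than the wording.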
Arabic tokenization
Arabic text tokenizes differently across model families. Byte-level tokenizers with little Arabic in their vocabulary often spend more tokens per Arabic word than multilingual tokenizers trained with Arabic in the mix:
| Model family | Tokenizer | Typical tokens for "مرحبا بك في مصر" (4 words, "Welcome to Egypt") |
|---|---|---|
| Multilingual (Qwen-class) | Multilingual BPE | ~8 |
| Llama-class | Byte-level BPE | ~12 |
Cost tip: Compare EGP-per-1M rates in the Model Library and run a short Arabic fixture through your top two live models before committing to one.
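The arithmetic behind that comparison is tokens divided by one million, times the per-1M rate. The sketch below shows why a lower rate does not always win when a tokenizer spends more tokens on Arabic. The model labels, token totals, and EGP rates are made-up placeholders. In practice the token counts would come from running your fixture, for example from the `usage` field of an OpenAI-compatible response, which is an assumption about the API.

```python
def cost_egp(tokens, rate_egp_per_million):
    """Cost in EGP for `tokens` at a rate quoted per 1M tokens."""
    return tokens / 1_000_000 * rate_egp_per_million


def cheapest(candidates):
    """Return (label, cost) for the lowest fixture cost.

    candidates maps a model label to (tokens_for_fixture, rate_egp_per_million).
    """
    costs = {
        label: cost_egp(tokens, rate)
        for label, (tokens, rate) in candidates.items()
    }
    best = min(costs, key=costs.get)
    return best, costs[best]


# Placeholder numbers only: same Arabic fixture, different tokenizers and rates.
candidates = {
    "model-a": (8_000_000, 10.0),   # fewer tokens, higher rate
    "model-b": (12_000_000, 8.0),   # more tokens, lower rate
}
print(cheapest(candidates))  # ('model-a', 80.0)
```

Here the model with the higher rate is still cheaper (80 EGP against 96 EGP) because it needs a third fewer tokens for the same text.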
Right-to-left (RTL) display
When displaying Arabic output in your UI:
.arabic-output {
  direction: rtl;
  text-align: right;
  font-family: 'IBM Plex Arabic', 'Noto Sans Arabic', sans-serif;
  line-height: 1.8;
}
Data residency
Standard requests route through vetted model providers today. Egypt-hosted sovereign deployments are available for regulated workloads — contact us to discuss data residency. This matters for:
- Government contracts — many GCC governments require in-region data processing
- Banking & finance — regulatory requirements for data residency
- Healthcare — patient data must stay in-region in many MENA jurisdictions
- Corporate compliance — internal policies on data sovereignty
Local currency
All usage is billed in EGP by default. The dashboard shows costs in Egyptian pounds, so there are no USD conversion surprises.