A farmer asks: “আমার বেগুন গাছের পাতা কুঁকড়ে যাচ্ছে, কী করব?” (“The leaves of my brinjal plant are curling up, what should I do?”) The voice bot doesn't guess. It retrieves page 47 of বেগুন চাষ পদ্ধতি, ২০২৩ (Brinjal Cultivation Methods, 2023), a manual published by the উপপরিচালক, কৃষি সম্প্রসারণ অধিদপ্তর (Deputy Director, Department of Agricultural Extension), and reads out the approved treatment.
But how did that manual — originally a scanned PDF with faded Bangla text — become something an AI can search in milliseconds? The answer lies in a sophisticated document pipeline that combines OCR, normalization, chunking, and vector embedding.
This is the invisible infrastructure behind every domain‑specific Bangla AI assistant.
Government departments, banks, NGOs, and universities have accumulated decades of knowledge in files, manuals, and reports. But most of it sits on paper or in scanned PDFs that no search engine, let alone an AI, can read.
Before you can build an AI assistant, you must liberate this knowledge.
KrishokBondhu, the agricultural AI assistant, processed over 2,500 pages of farming manuals. Here's how they did it.
Optical Character Recognition (OCR) for Bangla is notoriously difficult. The script has complex conjuncts, multiple vowel signs, and similar-looking characters (প, থ, etc.). Generic OCR (like Tesseract) performs poorly on Bangla — often below 60% accuracy.
KrishokBondhu used a specialized Bangla OCR engine trained on thousands of government documents, built to handle exactly these conjuncts, vowel signs, and look-alike characters.
Result: >95% character accuracy on clean documents, >85% on faded ones.
OCR output is messy. It might produce multiple variants of the same character (e.g., different Unicode encodings for “ক”). Normalization collapses these variants into a single canonical form, so that identical words always yield identical byte sequences.
This step is critical for retrieval accuracy. Without normalization, a search for “পোকা” (pest) might miss documents where the same word was OCRed into a visually identical but byte-different Unicode sequence.
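The standard fix is Unicode normalization. A minimal sketch using Python's standard-library `unicodedata` (the production pipeline likely layers Bangla-specific cleanup on top of this):

```python
import unicodedata

def normalize_bangla(text: str) -> str:
    # NFC maps every canonically equivalent sequence to one canonical form.
    return unicodedata.normalize("NFC", text)

# Bangla য় exists both as a single precomposed code point (U+09DF) and as
# য + nukta (U+09AF U+09BC). They render identically but compare unequal.
precomposed = "\u09DF"
decomposed = "\u09AF\u09BC"

assert precomposed != decomposed
# After normalization both spellings map to the same byte sequence, so a
# search for one form matches text that was OCRed in the other.
assert normalize_bangla(precomposed) == normalize_bangla(decomposed)
```

Note that for these Indic nukta letters, NFC actually canonicalizes toward the decomposed sequence (they are Unicode composition exclusions); what matters for retrieval is only that both variants land on the same canonical form.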
You can't throw a 200-page PDF at an AI and expect it to find the right answer. The text must be split into chunks — small, self-contained pieces that can be retrieved independently.
KrishokBondhu split each manual into chunks averaging around 300 words: small enough to retrieve precisely, large enough to contain a complete piece of advice.
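The idea can be sketched as a simple overlapping word window. The ~300-word size comes from the article; the 50-word overlap and the word-window strategy itself are illustrative assumptions, not KrishokBondhu's actual method:

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows so advice isn't cut in half."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already covers the tail
    return chunks

# A 700-word document yields three overlapping ~300-word chunks.
doc = " ".join(f"word{i}" for i in range(700))
chunks = chunk_words(doc)
print(len(chunks), len(chunks[0].split()))
```

Production systems usually refine this by splitting on section headings and sentence boundaries rather than raw word counts.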
Each chunk is converted into a vector embedding — a mathematical representation that captures its meaning. When a farmer asks a question, that query is also converted to a vector, and the system finds the chunks with the most similar vectors.
Bangla embedding challenge: Most pre-trained embeddings (like those from OpenAI) are English-centric. They perform poorly on Bangla semantic search. KrishokBondhu used a fine-tuned Bangla embedding model, trained on a corpus of agricultural texts.
Result: Retrieval accuracy improved by 27% compared to generic multilingual embeddings.
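"Most similar vectors" is typically measured with cosine similarity. A minimal pure-Python sketch; the 3-dimensional vectors below are made-up toys (real embeddings have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Imagine these came from a Bangla embedding model (purely illustrative):
query        = [0.9, 0.1, 0.0]   # "পোকা দমন" (pest control)
pest_chunk   = [0.8, 0.2, 0.1]   # a chunk about pest treatment
credit_chunk = [0.0, 0.1, 0.9]   # a chunk about crop loans

# The pest-control chunk scores higher, so it gets retrieved.
assert cosine(query, pest_chunk) > cosine(query, credit_chunk)
```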
The embeddings are stored in a vector database (such as Pinecone or Weaviate) optimized for fast similarity search. When a query comes in, it is embedded with the same model, compared against the stored chunks, and the top-ranked chunks are handed to the language model as context.
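That query path can be sketched as brute-force nearest-neighbour search; real vector databases use approximate indexes (e.g. HNSW) for speed, but the logic is the same. The vectors and chunk texts below are made-up placeholders:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Stand-in store of (embedding, chunk text) pairs. In production these come
# from the fine-tuned Bangla embedding model and the chunked manuals.
store = [
    ([0.9, 0.1, 0.0], "Treat leaf curl with a neem-based spray ..."),
    ([0.1, 0.9, 0.0], "Recommended brinjal varieties for winter ..."),
    ([0.0, 0.2, 0.9], "Irrigation schedule for sandy soil ..."),
]

def retrieve(query_vec, k=2):
    # 1. score every chunk, 2. rank by similarity, 3. return the top-k texts
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

top = retrieve([0.8, 0.2, 0.1])
print(top[0])  # the leaf-curl chunk ranks first
```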
Once the pipeline was set up, KrishokBondhu could ingest a new 100-page manual in under 3 hours. The entire 2,500-page corpus was processed in about a week.
Today, that corpus powers a voice bot that serves thousands of farmers daily.
The same pipeline works for any document-heavy domain: government circulars, banking policy, NGO training materials, university archives.
🏦 Banking example: A leading bank processed 15 years of circulars (1,200+ documents) into a RAG pipeline. Now, branch officers can ask: “এনআরবি অ্যাকাউন্ট খোলার নিয়ম কী?” (“What are the rules for opening an NRB account?”) and get the exact circular reference in seconds.
Speaklar provides this document pipeline end to end: you send us PDFs; we return a voice-ready knowledge base.
📊 Speed benchmark: Speaklar's pipeline processes 100 pages per minute on standard hardware. A 500-page manual is voice-ready in under an hour.
Imagine a new government circular released today. By tomorrow, it's in the voice bot's knowledge base. That's where we're heading — with automated ingestion pipelines that monitor official websites and update vector stores in near real-time.
📄 Turn your documents into a voice-ready knowledge base
Speaklar demo → Upload a PDF. Get a voice bot. It's that simple.
📄 কাগজের জ্ঞান এখন কণ্ঠে — ডকুমেন্ট পাইপলাইনেই সমাধান (Paper knowledge, now in voice: the document pipeline is the solution)
🔍 Learn more about document pipelines at speaklar.com
Keywords: Bengali OCR for AI, document processing Bangladesh, AI knowledge base creation, Bangla PDF to text · based on KrishokBondhu deployment 2026