
How OCR and Document Pipelines Power Domain-Specific Bangla AI Assistants

Turning 2,500 pages of scanned manuals into a voice bot that farmers trust — the untold story of knowledge base engineering.

A farmer asks: “আমার বেগুন গাছের পাতা কুঁকড়ে যাচ্ছে, কী করব?” (“My eggplant leaves are curling up, what should I do?”) The voice bot doesn't guess. It retrieves page 47 of the উপপরিচালক, কৃষি সম্প্রসারণ অধিদপ্তর, বেগুন চাষ পদ্ধতি, ২০২৩ manual (Deputy Director, Department of Agricultural Extension, Eggplant Cultivation Methods, 2023) and reads out the approved treatment.

But how did that manual — originally a scanned PDF with faded Bangla text — become something an AI can search in milliseconds? The answer lies in a sophisticated document pipeline that combines OCR, normalization, chunking, and vector embedding.

This is the invisible infrastructure behind every domain‑specific Bangla AI assistant.

The challenge: Bangladesh's knowledge is trapped in paper

Government departments, banks, NGOs, and universities have accumulated decades of knowledge in files, manuals, and reports. But most of it is:

- scanned or paper-only, invisible to any search engine
- unstructured, with no consistent headings or metadata
- scattered across offices, with no central digital archive

Before you can build an AI assistant, you must liberate this knowledge.

The KrishokBondhu pipeline: a case study

KrishokBondhu, the agricultural AI assistant, processed over 2,500 pages of farming manuals. Here's how they did it.

📄 Step 1: OCR — turning images into text

Optical Character Recognition (OCR) for Bangla is notoriously difficult. The script has complex conjuncts, multiple vowel signs, and similar-looking characters (প, থ, etc.). Generic OCR (like Tesseract) performs poorly on Bangla — often below 60% accuracy.

KrishokBondhu used a specialized Bangla OCR engine, trained on thousands of government documents. It handles:

- complex conjuncts (যুক্তাক্ষর) that generic engines misread
- vowel signs that attach above, below, or around the consonant
- visually similar characters such as প and থ
- faded, low-contrast print in older scans
Result: >95% character accuracy on clean documents, >85% on faded ones.
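Accuracy figures like these are typically measured as character-level agreement against a hand-checked ground truth. A minimal sketch of that measurement using only Python's standard library (the sample strings are hypothetical, chosen to show a single প/থ confusion):

```python
from difflib import SequenceMatcher

def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Approximate character accuracy: the fraction of ground-truth
    characters that the OCR output reproduced in order."""
    matcher = SequenceMatcher(None, ground_truth, ocr_output)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(ground_truth), 1)

# Hypothetical example: OCR confused প with থ in the first character
truth = "পোকা দমন পদ্ধতি"
ocr   = "থোকা দমন পদ্ধতি"
print(round(character_accuracy(truth, ocr), 2))  # 14 of 15 characters match
```

A production evaluation would aggregate this over many labeled pages, but the metric itself is this simple.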

🔤 Step 2: Normalization — cleaning the text

OCR output is messy. It might produce multiple variants of the same character (e.g., different Unicode encodings for “ক”). Normalization:

- maps each character sequence to a single canonical Unicode form
- collapses stray whitespace and strips OCR noise characters
- repairs broken conjuncts and misplaced vowel signs
This step is critical for retrieval accuracy. Without normalization, a search for “পোকা” (pest) can miss documents where the word was OCRed into a visually identical string built from different Unicode codepoints.
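The codepoint-variant problem is conventionally solved with canonical (NFC) normalization. In Bangla, for instance, the vowel sign ো (U+09CB) can also arrive from OCR as the two-codepoint sequence U+09C7 + U+09BE, which renders identically but fails string matching. A minimal normalization pass, using only Python's standard library:

```python
import re
import unicodedata

def normalize_bangla(text: str) -> str:
    """Canonicalize OCR output: compose Unicode variants to NFC,
    then collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

composed   = "\u0995\u09CB"          # কো with the precomposed vowel sign
decomposed = "\u0995\u09C7\u09BE"    # same glyph, two-codepoint vowel sign
print(composed == decomposed)                                       # False
print(normalize_bangla(composed) == normalize_bangla(decomposed))   # True
```

After normalization the two spellings compare equal, so an index built on normalized text no longer splits one word across several byte-level variants.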

✂️ Step 3: Chunking — splitting into meaningful pieces

You can't throw a 200-page PDF at an AI and expect it to find the right answer. The text must be split into chunks — small, self-contained pieces that can be retrieved independently.

KrishokBondhu used semantic chunking: text was split along the manuals' own section and sub-section boundaries, so each chunk carries one complete piece of advice rather than an arbitrary slice of characters.
Average chunk size: ~300 words — enough to contain a complete piece of advice.

📊 Step 4: Embedding — making chunks searchable

Each chunk is converted into a vector embedding — a mathematical representation that captures its meaning. When a farmer asks a question, that query is also converted to a vector, and the system finds the chunks with the most similar vectors.

Bangla embedding challenge: Most pre-trained embeddings (like those from OpenAI) are English-centric. They perform poorly on Bangla semantic search. KrishokBondhu used a fine-tuned Bangla embedding model, trained on a corpus of agricultural texts.

Result: Retrieval accuracy improved by 27% compared to generic multilingual embeddings.
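Whatever model produces the vectors, the retrieval mechanism is the same: semantic closeness becomes cosine similarity between embeddings. A toy illustration with hand-made three-dimensional vectors (the numbers are invented for the example, not real model output):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend embedding dimensions stand for "pest", "soil", "credit"
chunk_pest_control = [0.9, 0.1, 0.0]
chunk_fertilizer   = [0.1, 0.9, 0.0]
query_leaf_insects = [0.8, 0.2, 0.1]

print(cosine(query_leaf_insects, chunk_pest_control) >
      cosine(query_leaf_insects, chunk_fertilizer))  # True: pest chunk wins
```

A fine-tuned Bangla model improves retrieval precisely by placing Bangla queries and the chunks that answer them closer together in this vector space.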

🗄️ Step 5: Vector database — storage and retrieval

The embeddings are stored in a vector database (like Pinecone or Weaviate) optimized for fast similarity search. When a query comes in:

1. The query is embedded with the same model used for the chunks.
2. The database returns the chunks whose vectors are most similar to the query vector.
3. Those chunks, together with the question, go to the language model, which composes the answer.
The result: from paper to voice in 72 hours

Once the pipeline was set up, KrishokBondhu could ingest a new 100-page manual in under 3 hours. The entire 2,500-page corpus was processed in about a week.

Today, that corpus powers a voice bot that serves thousands of farmers daily.

Beyond agriculture: document pipelines for every sector

The same pipeline works for:

- 🏦 Banking: circulars, policy manuals, product terms
- 🏛️ Government: service procedures, forms, and notices
- 🤝 NGOs: training materials and field manuals
- 🎓 Universities: regulations, syllabi, and administrative guides
🏦 Banking example: A leading bank processed 15 years of circulars (1,200+ documents) into a RAG pipeline. Now, branch officers can ask: “এনআরবি অ্যাকাউন্ট খোলার নিয়ম কী?” (“What are the rules for opening an NRB account?”) and get the exact circular reference in seconds.

Common pitfalls and how to avoid them

- Relying on generic OCR. Off-the-shelf engines often fall below 60% accuracy on Bangla; use an engine trained on Bangla documents.
- Skipping normalization. Visually identical words encoded with different codepoints silently break retrieval.
- Chunks that are too big or too small. Aim for self-contained pieces of roughly 300 words that each answer one question.
- English-centric embeddings. They perform poorly on Bangla semantic search; use a Bangla-tuned model.
How Speaklar's document pipeline helps

Speaklar provides an end-to-end document pipeline:

- Bangla-aware OCR for scanned PDFs and images
- Unicode normalization and text cleanup
- retrieval-tuned chunking
- embedding and vector storage, ready for voice queries
You send us PDFs; we return a voice-ready knowledge base.

📊 Speed benchmark: Speaklar's pipeline processes 100 pages per minute on standard hardware. A 500-page manual is voice-ready in under an hour.

The future: real-time document ingestion

Imagine a new government circular released today. By tomorrow, it's in the voice bot's knowledge base. That's where we're heading — with automated ingestion pipelines that monitor official websites and update vector stores in near real-time.

📄 Turn your documents into a voice-ready knowledge base

Speaklar demo →

Upload a PDF. Get a voice bot. It's that simple.

📄 কাগজের জ্ঞান এখন কণ্ঠে — ডকুমেন্ট পাইপলাইনেই সমাধান (Paper knowledge, now in voice: the document pipeline is the solution)


🔍 Learn more about document pipelines at speaklar.com
Based on the KrishokBondhu deployment, 2026.