Add PDFs and Documents as Chatbot Knowledge in Alee
Upload PDFs and documents to train your Alee chatbot. Learn how text is extracted and chunked, plus best file practices for accurate answers.
PDFs and documents are where most businesses keep the answers customers actually want: the product brochure, the service manual, the refund policy, the membership terms. Alee lets you upload those files directly so your chatbot answers from their exact wording, cites its sources, and says it does not know rather than making things up. This guide covers uploading documents, what happens to your text behind the scenes, and the file practices that separate a sharp bot from a confused one.
Why upload documents instead of retyping them
Your knowledge already exists. A gym has a membership PDF, an ecommerce store has a returns policy doc, a coach has a programme outline, an agency has a one-page service deck. Re-typing all of that into a text box is slow and goes stale the moment the file changes.
When you upload the source file itself:
- The exact wording your legal or ops team approved is what the bot answers from.
- You can re-upload a fresh version when the policy changes, instead of editing scattered FAQ entries.
- Tables, headings and structured sections come across far better than a hand-typed summary.
- Multiple documents stack into one "knowledge brain," so the bot pulls the right answer from the right file automatically.
A website, a sitemap, YouTube videos and pasted FAQs all live in the same brain alongside your documents. Documents are just one of several knowledge sources you can mix.
What happens to a document after you upload it
Knowing how the pipeline works makes the best practices later in this guide easier to follow. Alee uses Advanced RAG (retrieval-augmented generation), and every document goes through the same pipeline:
- Text extraction. Alee reads the words out of your file. For a normal, text-based PDF this is exact. The raw text is what matters, not the layout, so fonts and page design are discarded.
- Chunking. The extracted text is split into small, overlapping passages, typically a few paragraphs each. Each chunk is meant to be a self-contained idea, so a single question can be answered from one or two chunks.
- Embedding. Every chunk is turned into a vector embedding, a numeric fingerprint of its meaning, and stored in a pgvector index, the knowledge brain.
- Retrieval at question time. When a visitor asks something, Alee embeds their question, finds the closest-matching chunks, and an LLM writes an answer grounded only in those chunks, with sources shown.
- Grounding check and caching. Each answer is self-checked to confirm it is actually supported by your content before it is sent. If the answer is not in your documents, the bot says it does not know rather than guessing. Repeat or similar questions are served from a cache, so they come back instantly.
The practical takeaway: the bot is only as good as the chunks it can retrieve. Clean, clearly-worded documents chunk and retrieve accurately. Messy scans and walls of unlabeled text do not.
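To make the pipeline above concrete, here is a minimal Python sketch of chunking with overlap and retrieval with an "I do not know" cut-off. It is an illustration under loud assumptions, not Alee's implementation: the function names, the chunk sizes, the `min_score` threshold and especially the `embed` function (a toy word-count vector standing in for a real embedding model) are all hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_words=40, overlap_words=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=2, min_score=0.2):
    """Return the best-matching chunks, or [] when nothing is close enough."""
    q = embed(question)
    scored = sorted(((cosine(q, embed(c)), c) for c in chunks), reverse=True)
    return [chunk for score, chunk in scored[:top_k] if score >= min_score]


if __name__ == "__main__":
    chunks = [
        "The three month plan includes unlimited group classes.",
        "Parking is available behind the studio.",
    ]
    print(retrieve("Is parking available?", chunks))
    print(retrieve("Do they offer personal training?", chunks))  # nothing matches
```

An empty result from `retrieve` is the point where a grounded bot would answer that it does not know. It also shows why wording matters: a chunk can only be found if it shares meaning with the question, which is what clear headings and complete sentences give it.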
Step by step: upload a PDF or document
- Sign in to Alee and open the bot you want to train. If you do not have one yet, you can start free and create your first bot in a minute.
- Open the bot's Sources area (the section where knowledge sources are listed) and choose to add a new source.
- Pick the PDF / document (file upload) option rather than the website, sitemap, YouTube or paste-text options.
- Select the file from your computer, or drag and drop it into the upload area.
- Confirm the upload and let Alee process it. The file moves through extraction, chunking and embedding; this usually takes seconds to a minute or two depending on length.
- Wait for the source to show as ready or indexed in the Sources list. Once it is, the bot can answer from it.
- Test it. Open the bot's preview and ask 3 to 5 questions you know are answered inside that document. Confirm the wording is right and that a source is attached.
Repeat for every document you want in the brain. You can add sources any time, and the brain simply grows.
Adding several documents at once
If you have a stack of files (say a brochure, a price sheet and a returns policy), upload them as separate sources rather than merging them into one giant PDF. Separate sources keep each topic cleanly bounded, make retrieval more precise, and let you update or remove one document later without disturbing the others.
Best file practices (this is what makes the bot accurate)
The single biggest factor in answer quality is the quality of the file you feed in. Follow these and your bot will feel sharp.
Use text-based files, not scans
A PDF exported from Word, Google Docs, Canva or a design tool contains real, selectable text. A PDF that is actually a photo or scan of a printed page is just an image, so there is no text to extract. Quick test: open the file and try to select a sentence with your cursor. If you can highlight the words, you are good; if you cannot, it is a scanned image.
If your only copy is a scan, run it through OCR first (Google Docs and many free tools convert a scan to selectable text), then upload the converted version. This matters a lot in India, where many policy and brochure PDFs are scanned from print.
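If you have many files to check, the cursor test can be approximated in code. The sketch below assumes you have already pulled the text of each page with whatever PDF library you use (that step is not shown); `looks_scanned` and its thresholds are hypothetical names and values chosen for illustration, not part of Alee.

```python
def looks_scanned(page_texts, min_chars_per_page=50):
    """Guess whether a PDF is a scan from its per-page extracted text.

    page_texts: list of strings, one per page, produced by your own
    PDF text extractor (hypothetical input, not an Alee API).
    Returns True when most pages yielded almost no text.
    """
    if not page_texts:
        return True  # nothing extracted at all: treat as image-only
    sparse_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    return sparse_pages / len(page_texts) > 0.5


if __name__ == "__main__":
    print(looks_scanned(["", "  ", ""]))         # pages with no real text
    print(looks_scanned(["word " * 100] * 3))    # pages full of text
```

A file flagged this way is a candidate for OCR before upload. The threshold is a rough heuristic, so spot-check borderline files by hand.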
Prefer clean, structured documents
- Keep clear headings and section titles. They give each chunk a natural topic and improve retrieval.
- Write in plain, complete sentences. The bot answers best from prose, not cryptic bullet fragments.
- Spell out anything a customer would ask about: prices, timings, eligibility, steps, contact details.
- Expand abbreviations and internal jargon at least once, so the meaning is searchable.
Watch out for tables, columns and image-only content
Simple tables usually extract fine, but very wide or nested tables can flatten confusingly. If a pricing or comparison table is critical, paste a short plain-text version of the same facts as a separate text source so the bot has an unambiguous copy. Multi-column newsletter layouts can scramble reading order too, so a single-column document is safer. And text living inside an image, infographic or chart will not be read at all; put those facts in actual text.
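One way to produce that plain-text copy of a critical table is to turn each row into a full sentence. The helper below is a minimal sketch; `table_to_sentences` is a hypothetical name, and the plan names and values in the sample are made up purely for illustration.

```python
def table_to_sentences(rows, subject_key):
    """Turn table rows (dicts) into one plain sentence per row.

    rows: list of dicts mapping column name to cell value.
    subject_key: the column that names what the row is about.
    """
    sentences = []
    for row in rows:
        subject = row[subject_key]
        facts = [
            f"{field.lower()} is {value}"
            for field, value in row.items()
            if field != subject_key and str(value).strip()  # skip empty cells
        ]
        if facts:
            sentences.append(f"{subject}: " + "; ".join(facts) + ".")
    return sentences


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    pricing = [
        {"Plan": "Monthly plan", "Price": "1000 per month", "Freeze allowed": "no"},
        {"Plan": "Quarterly plan", "Price": "2700 per quarter", "Freeze allowed": "yes"},
    ]
    for line in table_to_sentences(pricing, "Plan"):
        print(line)
```

Pasting the resulting sentences as a text source gives each fact its own self-contained line, so a chunk that mentions the plan also carries its price and rules.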
Keep each document focused and reasonably sized
A 6-page brochure or a 20-page manual is ideal. A single 300-page mega-PDF mixing ten unrelated topics still works, but splitting it into logical documents gives tighter, more relevant retrieval. Remove cover pages, blank pages and pure-decoration spreads where it is easy.
Strip out what you do not want the bot to repeat
The bot can quote anything in the file. Before uploading, remove internal notes, draft watermarks, staff phone numbers, or confidential pricing you do not want a public visitor to see.
Supported documents and a worked example
Alongside PDFs you can train on the other content types Alee supports: a website URL, a whole sitemap, YouTube video transcripts, and raw text or FAQs you paste. For the document workflow specifically, a typical mix looks like this.
Worked example: a yoga studio in Pune
The owner wants the bot to handle enquiries while the front desk is busy. She uploads:
- membership-plans.pdf (a clean export from Canva, real text) with prices and what each plan includes.
- class-schedule.pdf with timings by day.
- cancellation-policy.pdf with the refund and freeze rules.
She also pastes a short FAQ as a text source ("Do you have parking?", "Is there a trial class?") because those answers were not written down anywhere yet.
A visitor asks: "How much is the 3-month plan and can I freeze it if I travel?" Alee retrieves the price chunk from the membership PDF and the freeze-rule chunk from the cancellation PDF, then writes one grounded answer citing both and caches it so the next similar question comes back instantly. When someone asks "Do you offer personal training?" and nothing covers it, the bot honestly says it does not have that information instead of inventing a price.
Keeping documents up to date
Documents go stale, so treat your Sources list like a living shelf:
- When a price list or policy changes, upload the new version and remove the outdated source so the bot never quotes an old number.
- After any update, re-test with a couple of questions to confirm the new wording is being used.
- Use the question-triage inbox and the Top Questions list in your analytics to spot gaps. If visitors keep asking something your documents do not cover, that is your cue to add or improve a source.
How document sources fit your plan
Every plan, free or paid, can train on documents; the difference between plans is how many bots and how many monthly messages you get, not the file types. Free includes 1 bot and 200 messages a month, which is plenty to load a few PDFs and try it on a live site. As you grow, Pro, Agency and Scale add more bots and capacity, and the Agency plan lets you white-label and run separate bots for separate clients, each with its own documents. See pricing for the full breakdown, browse more guides, or compare the approach in Alee vs SiteGPT.
Frequently asked questions
Can Alee read a scanned PDF or a photo of a document?
Not directly, because a scan is an image with no extractable text. Run it through an OCR tool first to convert it into selectable text, then upload that version so Alee can chunk and index the words.
How many documents can I add to one bot?
You can add many documents to a single bot, and they all combine into one knowledge brain. The practical limit is your plan's message capacity, not a document count, so upload as many relevant files as you need and keep each one focused.
Will the bot mix up answers from different documents?
No. Each answer is retrieved from the closest-matching chunks and self-checked for grounding before it is sent, so the bot pulls the right passage from the right file and cites its source. If nothing in your documents answers the question, it says it does not know rather than guessing.
Ready to turn your brochures, manuals and policies into a chatbot that answers in your exact words? [Start free with Alee](/signup) and upload your first document today.