What OCR Does, and When You Need It
A scanned page or a photographed document is just an image. To a computer it is a grid of pixels, not letters, so you can't select, copy or search the words in it. Optical Character Recognition (OCR) bridges that gap: it analyses the shapes in the image and reconstructs the actual text. That's what lets you lift a quote out of a scanned contract, make an old report searchable, or paste figures from a faxed invoice without retyping them.
This is the key difference from the PDF to Word tool: that one extracts an existing text layer and works only on born-digital PDFs, whereas OCR creates a text layer from scratch for documents that never had one.
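One way to picture the distinction is a check for whether a page already carries usable text. The sketch below is illustrative only, not how either tool is known to decide: `TextItem` is a hypothetical, cut-down version of the items that pdf.js returns from `page.getTextContent()`, and `needsOcr` is a made-up helper name.

```typescript
// Hypothetical shape: mirrors the `items` array that pdf.js returns from
// page.getTextContent(), reduced to the single field used here.
interface TextItem {
  str: string;
}

// A page "needs OCR" when its text layer is missing or holds only whitespace.
// An empty list (no text layer at all) also returns true.
function needsOcr(items: TextItem[]): boolean {
  return items.every((item) => item.str.trim() === "");
}
```

A born-digital page would typically yield non-empty items and skip OCR, while a pure scan would yield none.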
How In-Browser OCR Works
First, the pdf.js engine renders each page of your PDF to a high-resolution image inside the browser. Then Tesseract.js, a JavaScript and WebAssembly port of the open-source Tesseract OCR engine (long sponsored by Google), examines each image and recognises the characters. The resulting plain text is assembled page by page into a downloadable file.
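The render, recognise and assemble steps can be sketched as a small pipeline. This is a minimal sketch under stated assumptions rather than the tool's actual code: `renderPage` and `recognizeText` are hypothetical stand-ins for the pdf.js rendering call and the Tesseract.js recognition call, passed in as parameters so the flow can be read (and run) without a browser.

```typescript
// Hypothetical stand-ins for the two engines; neither type is a real library API.
type PageImage = { pageNumber: number; width: number; height: number };
type RenderPage = (pageNumber: number) => Promise<PageImage>;
type RecognizeText = (image: PageImage) => Promise<string>;

async function ocrDocument(
  pageCount: number,
  renderPage: RenderPage,
  recognizeText: RecognizeText
): Promise<string> {
  const pages: string[] = [];
  // Pages are processed one at a time, in order (PDF pages are numbered from 1).
  for (let pageNumber = 1; pageNumber <= pageCount; pageNumber++) {
    const image = await renderPage(pageNumber); // step 1: page -> image
    const text = await recognizeText(image); // step 2: image -> text
    pages.push(text.trim());
  }
  // Step 3: assemble page by page. The blank-line separator is an arbitrary choice.
  return pages.join("\n\n");
}
```

In a real page, `renderPage` would draw onto a canvas and `recognizeText` would hand that canvas to an OCR worker; the sequential loop keeps memory use low at the cost of speed.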
Crucially, all of this happens on your own device: your scanned document is never uploaded. The only downloads are the open-source OCR engine and its language data, fetched once on the first run (around 10–20 MB) and then cached by your browser for later runs.
Getting the Best Results
OCR accuracy depends heavily on input quality. Use the highest-resolution scan you have, make sure the page is straight and evenly lit, and pick the language that matches the document: English for an English page, Thai for a Thai page, or both for a mixed document. Clean printed text reads very well; faint photocopies, heavy backgrounds and handwriting are much harder and may need a clearer source.
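The language choice can be expressed as a short helper. Tesseract identifies languages by codes such as `eng` and `tha` and conventionally joins several with `+` (for example `eng+tha`). The names `LANGUAGE_CODES` and `tesseractLanguages` below are hypothetical, and how this tool passes the selection to its engine is an assumption.

```typescript
// Tesseract language codes for the two languages named above.
const LANGUAGE_CODES = { english: "eng", thai: "tha" } as const;
type Language = keyof typeof LANGUAGE_CODES;

// Build the language string: duplicates are dropped, codes are joined with "+".
function tesseractLanguages(selected: Language[]): string {
  if (selected.length === 0) {
    throw new Error("Select at least one language");
  }
  const codes = selected.map((name) => LANGUAGE_CODES[name]);
  const unique = codes.filter((code, index) => codes.indexOf(code) === index);
  return unique.join("+");
}
```

For a mixed document, `tesseractLanguages(["english", "thai"])` yields `"eng+tha"`. Loading an extra language generally means more language data to download and slower recognition, so it is worth selecting only the languages actually on the page.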