Machinalis: OCR with Django and Tesseract

In this post we want to show you how easy it is to do OCR with Django and Tesseract, using the tesserocr library.

Installation

Tesserocr requires fairly recent versions of tesseract-ocr and leptonica. On Ubuntu these can be installed with:

$ sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev

Depending on your environment, you might have to build these packages from source. Their respective documentation explains how to do it. Next, you have to install the project’s requirements:

(venv) $ pip3 install Cython==0.24.1
(venv) $ pip3 install -r ocr_with_django/requirements.txt

and run the necessary steps to set up the Django site:

(venv) $ cd ocr_with_django/
(venv) $ python manage.py migrate
(venv) $ python manage.py collectstatic --noinput
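Before starting the site, it can be worth confirming that tesserocr actually sees the Tesseract installation. The following is a minimal sketch, not part of the project: the helper name tesseract_status is made up for illustration, and it only relies on the tesseract_version() and get_languages() functions exposed by tesserocr.

```python
def tesseract_status():
    """Return version and language info, or None if tesserocr is missing."""
    try:
        import tesserocr
    except ImportError:
        return None

    # get_languages() returns the tessdata path and the list of
    # language codes found there (for example 'eng').
    tessdata_path, languages = tesserocr.get_languages()
    return {
        "version": tesserocr.tesseract_version(),
        "tessdata": tessdata_path,
        "languages": languages,
    }


if __name__ == "__main__":
    print(tesseract_status() or "tesserocr is not installed")
```

An empty language list usually means that the trained data files were not found, in which case recognition will not work even though the import succeeds.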

We’ve included a Vagrantfile so you can see the site in action for yourself. Once the VM is up and running, open http://localhost:8000, select an image containing text and hit “Go!”:

https://raw.githubusercontent.com/abarto/ocr-with-django/master/screenshot.png

All the code is available on GitHub.

OcrView

The OCR requests are handled by the OcrView Django view in the documents app:

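# documents/views.py
from django.http import JsonResponse
from django.views.generic import View
from PIL import Image, ImageFilter
from tesserocr import PyTessBaseAPI


class OcrView(View):
    def post(self, request, *args, **kwargs):
        # The context managers release the Tesseract API and the image.
        with PyTessBaseAPI() as api:
            with Image.open(request.FILES['image']) as image:
                # Sharpen the upload before handing it to Tesseract.
                sharpened_image = image.filter(ImageFilter.SHARPEN)
                api.SetImage(sharpened_image)
                utf8_text = api.GetUTF8Text()

        return JsonResponse({'utf8_text': utf8_text})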

We take the uploaded image, process it using a Pillow filter (we sharpen it), and pass along the result to the Tesseract OCR API through tesserocr.
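The same pipeline can be tried outside of Django, which is handy for checking how well a given image is recognized. This is only a sketch: the function name ocr_file and its lang parameter are hypothetical and not part of the project. It additionally calls MeanTextConf(), one of the PyTessBaseAPI methods, to report how confident Tesseract is about the result.

```python
def ocr_file(path, lang="eng"):
    """Return (text, mean_confidence) for the image stored at `path`."""
    # Imported here so this module can still be imported on machines
    # where Pillow or tesserocr are not installed.
    from PIL import Image, ImageFilter
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI(lang=lang) as api:
        with Image.open(path) as image:
            # Same preprocessing as the view: sharpen, then recognize.
            api.SetImage(image.filter(ImageFilter.SHARPEN))
            text = api.GetUTF8Text()
            # Average confidence over the recognized words, from 0 to 100.
            confidence = api.MeanTextConf()

    return text, confidence
```

Calling ocr_file("scan.png") on a local image (the file name is just an example) returns the recognized text together with the confidence value, which could be used to flag poor scans.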

We tried to keep the view as simple as possible (no Form, no validation) to focus only on the OCR process. If you read the PyTessBaseAPI docstrings you’ll see that there are many more things you can do with the image and the recognition result.
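If you wanted to add the validation the view leaves out, a small helper could check the upload before it reaches Tesseract. The sketch below is an assumption on our part, not code from the project: the name validate_upload, the list of accepted content types and the 5 MiB limit are all made up for illustration. In a view, the two arguments would come from the content_type and size attributes of the uploaded file in request.FILES.

```python
# Assumed limits, chosen only for illustration.
ALLOWED_CONTENT_TYPES = {"image/png", "image/jpeg", "image/tiff"}
MAX_UPLOAD_BYTES = 5 * 1024 * 1024


def validate_upload(content_type, size):
    """Return a list of error messages; an empty list means the upload is OK."""
    errors = []
    if content_type not in ALLOWED_CONTENT_TYPES:
        errors.append("unsupported content type: %s" % content_type)
    if size <= 0:
        errors.append("empty file")
    elif size > MAX_UPLOAD_BYTES:
        errors.append("file larger than %d bytes" % MAX_UPLOAD_BYTES)
    return errors
```

A view using it could return a JsonResponse with status 400 and the error list whenever the result is not empty. Keep in mind that the content type is reported by the client, so this is a first filter rather than a guarantee that the file is an image.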

ocr_form.js

The form page is very simple, and everything is done with a single Ajax call:

$(document).ready(function() {
    var $imageInput = $("[data-js-image-input]");
    var $imageContainer = $("[data-js-image-container]");
    var $resultContainer = $("[data-js-result-container]");
    $imageInput.change(function(event) {
        event.stopPropagation();
        event.preventDefault();
        var file = event.target.files[0];

        var fileReader = new FileReader();
        fileReader.onload = (function(theFile) {
            return function(event) {
                $imageContainer.html('<img class="image" src="' + event.target.result + '">');
            };
        })(file);
        fileReader.readAsDataURL(file);
    });
    $("[data-js-go-button]").click(function(event) {
        event.stopPropagation();
        event.preventDefault();
        var data = new FormData();
        data.append('image', $imageInput[0].files[0]);
        $.post({
            url: "/ocr/",
            data: data,
            cache: false,
            contentType: false,
            processData: false
        }).done(function(data) {
            console.log(data);
            $resultContainer.removeClass("result-default result-error");
            $resultContainer.addClass("result-success");
            $resultContainer.text(data.utf8_text);
        })
        .fail(function(jqXHR) {
            $resultContainer.removeClass("result-default result-success");
            $resultContainer.addClass("result-error");
            $resultContainer.html('I AM ERROR');
        });
    });
});
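The request sent by the form can also be reproduced from a script, for instance to process a folder of scans. Here is a rough sketch using only the Python standard library. The helper names encode_image_field and post_image are hypothetical, and it assumes that the endpoint accepts the POST without a CSRF token, which is something you would need to check against your own setup.

```python
import json
import urllib.request
import uuid


def encode_image_field(filename, payload, content_type="image/png"):
    """Encode one file as multipart/form-data under the field name 'image'."""
    boundary = uuid.uuid4().hex
    head = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="image"; filename="{f}"\r\n'
        "Content-Type: {c}\r\n\r\n"
    ).format(b=boundary, f=filename, c=content_type).encode("utf-8")
    tail = "\r\n--{b}--\r\n".format(b=boundary).encode("utf-8")
    return head + payload + tail, "multipart/form-data; boundary=" + boundary


def post_image(url, filename, payload):
    """POST the image bytes to the OCR endpoint and return the recognized text."""
    body, content_type = encode_image_field(filename, payload)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        # The view answers with a JSON object holding a 'utf8_text' key.
        return json.loads(response.read().decode("utf-8"))["utf8_text"]
```

With the site running, post_image("http://localhost:8000/ocr/", "scan.png", image_bytes) would return the same text the page shows in the result container.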

Conclusions

As you can see, everything is pretty simple as most of the heavy lifting is done by Pillow and Tesseract.

Integrating OCR into a Django site can make it easier to search existing physical documents, by combining bulk scanning with OCR and text indexing, and to plug in natural language processing tools like our very own iepy.

Feedback

As usual, comments, suggestions and pull requests are more than welcome. We’re also interested in hearing about other use cases for OCR.
