Machine Vision has come a long way since the days of “how can a computer recognize this image as an apple.” There are many tools available that make it easy to identify the contents of an image. This topic was covered in the previous article Image Recognition in Python and SQL Server, which presented a solution for programmatically identifying an image by its contents. Optical Character Recognition (OCR) takes this a step further by allowing developers to extract the text presented in an image. Once extracted, that text can be indexed and searched. We will be covering this topic in today’s Python programming tutorial.
What is OCR?
OCR, or Optical Character Recognition, was quite a hot topic in the long-past days of digitizing paper artifacts such as documents, newspapers and other physical media. As paper went by the wayside, OCR remained an active research topic but briefly moved to the back burner as a “pop culture technology.”
The use of screenshots as a note-taking method changed that trajectory. Consumers of information typically do not want to download PowerPoint presentations and search through them. They simply take photos of the slides they are interested in and save them for later. Recognizing the text in these photos has become a standard feature of most photo management software. But how would a developer integrate this technology into his or her own software project?
Google’s Tesseract offering gives software developers access to “commercial grade” OCR software at a “bargain basement” price. Tesseract is open-source and offered under the Apache 2.0 license, which gives developers a wide berth in how this software can be included in their own offerings. This software development tutorial will focus on implementing Tesseract within an Ubuntu Linux environment, since this is the easiest environment for a beginner to work in.
OCR is Not a Silver Bullet
Before getting into the technical details, it is important to dispense with the idea that OCR can always magically read all of the text in an image. Even with decades of hard work going into researching this, there are still instances in which OCR may not be the best solution for text extraction. Different OCR software may also be necessary depending on the use case. Tesseract in particular may require additional “training” (its jargon) to be better at reading text data from images. Tesseract generally works best with images of 300 dpi (dots per inch) or higher. This is typically printing quality as opposed to web quality. You may also need to “massage” an input image before it can be read correctly.
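To make the dpi point concrete, here is a minimal sketch of the arithmetic involved in enlarging a low-resolution image toward the 300 dpi mark. The function names upscale_factor and scaled_size are our own illustrations and are not part of Tesseract or any library. Enlarging an image does not add detail, so treat this as one possible “massaging” step rather than a guaranteed fix.

```python
from typing import Tuple


def upscale_factor(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Return the factor by which to enlarge an image to reach target_dpi.

    Hypothetical helper: returns 1.0 when the image already meets the target.
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def scaled_size(width: int, height: int, factor: float) -> Tuple[int, int]:
    """Return the new (width, height) in pixels after scaling by factor."""
    return (round(width * factor), round(height * factor))


if __name__ == "__main__":
    # A 72 dpi slide image needs to be enlarged roughly 4.17 times.
    factor = upscale_factor(72)
    print(factor, scaled_size(1024, 768, factor))
```

The resulting size could then be handed to an image library's resize routine, such as Pillow's Image.resize, before the image is passed to Tesseract.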
However, out of the box, Tesseract can be “good enough” for the purposes of extracting just enough text from an image in order to accomplish what you may need to do in your software application.
How to Install Tesseract
Installing Tesseract on Debian-based Linux distributions such as Kali or Ubuntu is easy, because it is available as a software package. Use the following command:
$ sudo apt install tesseract-ocr
If you run into issues installing Tesseract in this manner, you may need to update your Linux installation as follows:
$ sudo apt update -y; sudo apt upgrade -y
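Many other Linux distributions, as well as Homebrew on macOS, also offer Tesseract as a package. On platforms where no package or prebuilt installer is available, it will be necessary to build from source.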
How to Run Tesseract from the Command Line
Once Tesseract is installed, it can be run directly from a terminal. Consider the following images, along with the text output generated by Tesseract. To display the extracted text in standard output, use the following command:
$ tesseract imageFile stdout
Here are some example outputs, along with the original image with text. These come from slides that are typically the kinds that students might take pictures of in a classroom setting:
In each of the examples above, the text which “did not quite” get captured accurately is highlighted with red rectangles. This is likely due to the presentation quality image dpi (72 dpi) used for these images. As you can see below, some images are read better than others:
Note: The above is not a defect in Tesseract. It is possible to “train” Tesseract to recognize different fonts. Also, if you are scanning documents, you can configure your scanner to read at higher dpi levels.
Programmatic Text Extraction in Python with pytesseract
Naturally, extracting text within the context of a program is the next logical step. While it is always possible to use system calls from within Python or some other language in order to execute the Tesseract program, it is far more elegant to use an API to handle such calls instead.
One important thing to note: While it is not “verboten” to call Tesseract via system calls in a programming language, you must take care to ensure that no unchecked user input is passed to that system call. If no such checks are performed, then it is possible for an external user to run commands on your system with a well-constructed filename or other information.
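If you do go the system-call route, the following is a minimal sketch of one way to keep user input away from a shell. It passes an argument list to subprocess.run instead of a command string, so characters such as ";" in a filename are never interpreted as commands. The helper names build_tesseract_command and ocr_with_subprocess and the ALLOWED_SUFFIXES set are our own assumptions for illustration, not part of Tesseract.

```python
import subprocess
from pathlib import Path

# Assumption: the same image types used elsewhere in this tutorial.
ALLOWED_SUFFIXES = {".gif", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_tesseract_command(image_path: str) -> list:
    """Return an argument list for tesseract; never a shell string."""
    path = Path(image_path)
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError("unsupported file type: " + repr(image_path))
    # Resolving to an absolute path means the argument can never start
    # with "-" and be mistaken for a command-line option.
    return ["tesseract", str(path.resolve()), "stdout"]


def ocr_with_subprocess(image_path: str, timeout_seconds: int = 5) -> str:
    """Run tesseract without a shell and return the extracted text."""
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True,   # collect stdout and stderr
        text=True,             # decode output as text
        timeout=timeout_seconds,
        check=True,            # raise CalledProcessError on failure
    )
    return result.stdout
```

Because no shell is involved, a filename such as "x; rm -rf ~.png" is handed to Tesseract as a single (nonexistent) file name rather than being executed.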
The Python module pytesseract provides a wrapper to the Tesseract application. pytesseract can be installed via the command:
$ pip3 install pytesseract
Note that if you access Python 3.x via the python command as opposed to python3, you will need to use the command:
$ pip install pytesseract
The following sample code will extract all the text it can find from any image file in the current directory using Python and pytesseract:
#!/usr/bin/python3
# mass-ocr-images.py

from PIL import Image
import os
import pytesseract
import sys

# You must specify the full path to the tesseract executable.
# In Linux, you can get this by using the command:
# which tesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

def main(argv):
    for filename in os.listdir("."):
        if str(filename) not in ['.', '..']:
            nameParts = str(filename).split(".")
            if nameParts[-1].lower() in ["gif", "png", "jpg", "jpeg", "tif", "tiff"]:
                # Calls to the API should always be bounded by a timeout, just in case.
                try:
                    print("Found filename [" + str(filename) + "]")
                    ocrText = pytesseract.image_to_string(str(filename), timeout=5)
                    print(ocrText)
                    print("")
                except Exception as err:
                    print("Processing of [" + str(filename) + "] failed due to error [" + str(err) + "]")

if __name__ == "__main__":
    main(sys.argv[1:])
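One detail of the listing above is worth a closer look: splitting the name on "." means that a file literally named "png", with no extension at all, would pass the check, and directories are not skipped. The following is a minimal sketch of a stricter filter using pathlib. The function name find_image_files is our own and is not part of pytesseract.

```python
from pathlib import Path

# The same extensions the listing above checks for, written as suffixes.
IMAGE_SUFFIXES = {".gif", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def find_image_files(directory="."):
    """Return sorted paths of regular files whose suffix is a known image type."""
    return sorted(
        entry
        for entry in Path(directory).iterdir()
        # is_file() skips directories; suffix is empty for names without a dot.
        if entry.is_file() and entry.suffix.lower() in IMAGE_SUFFIXES
    )
```

Each returned path could then be passed, as a string, to pytesseract.image_to_string in place of the names returned by os.listdir.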
Using a Database to Store Images and Extracted Text in Python
We can use a database to store both the images and the extracted text. This allows developers to write an application that searches against the text and reports which image matches it. The following code extends the first listing by saving the collected data into a MariaDB database:
#!/usr/bin/python3
# ocr-import-images.py

from PIL import Image
import mysql.connector
import os
import pytesseract
import shutil
import sys

# You must specify the full path to the tesseract executable.
# In Linux, you can get this by using the command:
# which tesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

def main(argv):
    try:
        conn = mysql.connector.connect(user="rd_user", password='myPW1234%', host="127.0.0.1", port=63306, database="RazorDemo")
        cursor = conn.cursor()
        for filename in os.listdir("."):
            if str(filename) not in ['.', '..']:
                nameParts = str(filename).split(".")
                if nameParts[-1].lower() in ["gif", "png", "jpg", "jpeg", "tif", "tiff"]:
                    # Calls to the API should always be bounded by a timeout, just in case.
                    try:
                        print("Found filename [" + str(filename) + "]")
                        ocrText = pytesseract.image_to_string(str(filename), timeout=5)
                        fout = open("temp.txt", "w")
                        fout.write(ocrText)
                        fout.close()

                        # Insert the database record:
                        sql0 = "insert into Images (file_name) values (%s)"
                        values0 = [str(filename)]
                        cursor.execute(sql0, values0)
                        conn.commit()

                        # We need the primary key identifier created by the last insert
                        # so we can insert the extracted text and binary data.
                        lastInsertID = cursor.lastrowid
                        print("Rcdid of insert is [" + str(lastInsertID) + "]")

                        # We need to copy the image file and the text file to a directory
                        # that is readable by the database.
                        shutil.copyfile("temp.txt", "/tmp/db-tmp/temp.txt")
                        shutil.copyfile(str(filename), "/tmp/db-tmp/" + str(filename))

                        # Also, FILE privileges may be needed for the MariaDB user account:
                        # grant file on *.* to 'rd_user'@'%';
                        # flush privileges;
                        sql1 = "update Images set extracted_text=LOAD_FILE(%s), file_data=LOAD_FILE(%s) where rcdid=%s"
                        values1 = ["/tmp/db-tmp/temp.txt", "/tmp/db-tmp/" + str(filename), str(lastInsertID)]
                        cursor.execute(sql1, values1)
                        conn.commit()

                        os.remove("/tmp/db-tmp/temp.txt")
                        os.remove("/tmp/db-tmp/" + str(filename))
                    except Exception as err:
                        print("Processing of [" + str(filename) + "] failed due to error [" + str(err) + "]")
        cursor.close()
        conn.close()
    except Exception as err:
        print("Processing failed due to error [" + str(err) + "]")

if __name__ == "__main__":
    main(sys.argv[1:])
The Python code example above interacts with a MariaDB table that has the following structure:
create table Images (
    rcdid int not null auto_increment primary key,
    file_name varchar(255) not null,
    extracted_text longtext null,
    file_data longblob null
);
In the table definition above, longtext and longblob were chosen because those data types are intended to hold large volumes of text or binary data, respectively.
How to Load File Data into MariaDB
Loading binary or non-standard text into any database can pose all sorts of challenges, especially if text encoding is a concern. In most popular RDBMS, binary data is almost never inserted into or updated in a database record via a typical insert statement that is used for other kinds of data. Instead, specialized statements are used for such tasks.
For MariaDB, in particular, the FILE privilege is required for any such operations. It is not included when a typical GRANT statement grants privileges on a single database to a user account. Instead, FILE is a global privilege that must be granted on *.* with a separate statement. To do this in MariaDB for the rd_user account used in our second code example, it will be necessary to log into MariaDB with its root account and execute the following commands:
grant file on *.* to 'rd_user'@'%';
flush privileges;
Once the FILE privilege is granted, the LOAD_FILE() function can be used to load longtext or longblob data into a particular existing record. The following examples show how to attach longtext or longblob data to an existing record in a MariaDB database:
-- For the extracted text, which can contain non-standard characters.
-- Replace the right-hand rcdid with the ID of the record to update.
update Images set extracted_text=LOAD_FILE('/tmp/test.txt') where rcdid=rcdid;

-- For the binary image data
update Images set file_data=LOAD_FILE('/tmp/myImage.png') where rcdid=rcdid;
If you use a typical select * statement on this data after running these updates, then you will get a result that is not terribly useful:
Instead, select substrings of the data:
The result of this query is more useful, at least for ensuring the records populated:
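A query of this kind might look like the following sketch, issued from Python. The column aliases text_preview and image_bytes, the 40-character preview length and the function name preview_rows are our own assumptions. The cursor argument is expected to be a cursor like the one created with mysql.connector in the second listing.

```python
# Assumption: a 40-character preview is enough to confirm the text loaded,
# and the byte length is enough to confirm the image loaded.
PREVIEW_SQL = (
    "select rcdid, file_name, "
    "substring(extracted_text, 1, 40) as text_preview, "
    "length(file_data) as image_bytes "
    "from Images"
)


def preview_rows(cursor):
    """Run the preview query on an open cursor and return all rows."""
    cursor.execute(PREVIEW_SQL)
    return cursor.fetchall()
```

Each returned row is a tuple of the record ID, the file name, the start of the extracted text and the size of the stored image in bytes.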
To extract this data back into files, use specialized select statements, as shown below:
select extracted_text into dumpfile '/tmp/ppt-slide-3-text.txt' from Images where rcdid=117;
select file_data into dumpfile '/tmp/Sample PPT Slide 3.png' from Images where rcdid=117;
Note that, aside from writing the output to the files above, there is no “special output” from either of these queries.
The directory into which the files will be created must be writable by the user account under which the MariaDB daemon is running. Files with the same names as extracted files cannot already exist.
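If the application issuing these queries runs on the same machine as the database, it can check both conditions up front. The following is a minimal sketch of such a check. The function name check_dumpfile_target is our own, and because the checks run as the current user, they only approximate what the MariaDB daemon's account would see.

```python
import os


def check_dumpfile_target(path):
    """Raise an exception if path looks unusable as an INTO DUMPFILE target.

    Hypothetical helper: checks are made as the current user, not as the
    account that the MariaDB daemon runs under.
    """
    directory = os.path.dirname(path) or "."
    if os.path.exists(path):
        raise FileExistsError("target already exists: " + path)
    if not os.path.isdir(directory):
        raise NotADirectoryError("no such directory: " + directory)
    if not os.access(directory, os.W_OK):
        raise PermissionError("directory is not writable: " + directory)
```

Calling check_dumpfile_target('/tmp/ppt-slide-3-text.txt') before running the first query above would catch a leftover file from a previous run.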
The file data should match what was initially loaded:
The image will also match:
The image as read from the database, including using the fim command to view it.
These SQL statements can then be incorporated into an external application for the purposes of retrieving this information.
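As one possible shape for such an application, the sketch below reads a record through an ordinary select and writes the files on the client side, rather than using INTO DUMPFILE on the server. This is a swapped-in approach, not the method shown above. The function name export_image_record is our own, and the cursor is assumed to come from a mysql.connector connection like the one in the second listing.

```python
def export_image_record(cursor, rcdid, text_path, image_path):
    """Write one record's extracted text and image data to local files."""
    cursor.execute(
        "select extracted_text, file_data from Images where rcdid=%s",
        (rcdid,),
    )
    row = cursor.fetchone()
    if row is None:
        raise LookupError("no Images record with rcdid " + str(rcdid))
    extracted_text, file_data = row

    # Depending on the driver, text columns may arrive as bytes.
    if isinstance(extracted_text, (bytes, bytearray)):
        extracted_text = extracted_text.decode("utf-8", errors="replace")

    with open(text_path, "w", encoding="utf-8") as text_file:
        text_file.write(extracted_text or "")
    with open(image_path, "wb") as image_file:
        image_file.write(file_data or b"")
```

Because the files are written by the application, this variant does not depend on the database daemon having write access to the target directory.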
How to Query Images in a Database
With the images loaded into the database, along with their extracted OCR text, conventional SQL queries can be used to find a particular image:
Note that, in MariaDB, conventional text comparisons do not work with longtext columns. These must be cast as varchars.
This gives the following output:
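A search of this kind could be issued from Python along the lines of the following sketch. The function name build_text_search is our own. The sketch casts to CHAR, which MariaDB's CAST accepts as a character type, and it assumes the server uses the default backslash escape character for LIKE patterns.

```python
def build_text_search(term):
    """Return (sql, params) for a substring search on the extracted text.

    LIKE wildcards in the search term are escaped so that they match
    literally. Assumes backslash is the LIKE escape character.
    """
    escaped = (
        term.replace("\\", "\\\\")
        .replace("%", "\\%")
        .replace("_", "\\_")
    )
    sql = (
        "select rcdid, file_name from Images "
        "where cast(extracted_text as char) like %s"
    )
    return sql, ("%" + escaped + "%",)
```

The returned pair can be passed straight to a mysql.connector cursor, for example cursor.execute(*build_text_search("apple")), so the search term is sent as a parameter and never concatenated into the SQL string.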
Final Thoughts on Extracting Text from Images with Python and MariaDB
Google’s Tesseract offering makes it easy to incorporate on-the-fly OCR into your applications, so that your users can more readily extract and use the text contained in their images. And Tesseract can go much further than its out-of-the-box behavior: for all of the “gibberish” results shown above, a developer can train Tesseract to read characters that it cannot recognize, further extending its usability.
Given how technologies such as AI are becoming more mainstream, it is not too far of a stretch to imagine that OCR will only get better and easier to do as time goes on, but Tesseract is a great starting point. OCR is already used for complex tasks like “on the fly” language translation. That was considered unimaginable not that long ago. Who knows how much further we can go?