Daily News Circuit
“Unlocking Hidden Messages from Photos with Python: An Exploration of Image-to-Text Extraction”

admin
Last updated: 2023/02/13 at 5:28 PM
By admin 16 Min Read

Contents
  • What is OCR?
  • OCR is Not a Silver Bullet
  • How to Install Tesseract
  • How to Run Tesseract from the Command Line
  • Programmatic Text Extraction in Python with pytesseract
  • Using a Database to Store Images and Extracted Text in Python
  • How to Load File Data into MariaDB
  • How to Query Images in a Database
  • Final Thoughts on Extracting Text from Images with Python and MariaDB

Machine Vision has come a long way since the days of “how can a computer recognize this image as an apple.” Many tools are now available that can identify the contents of an image. This topic was covered in the previous article Image Recognition in Python and SQL Server, which presented a solution for programmatically identifying an image by its contents. Optical Character Recognition (OCR) takes this a step further by allowing developers to extract the text presented in an image, which makes that text indexable and searchable. We will be covering this topic in today’s Python programming tutorial.

You can read more about image recognition in our tutorial: Image Recognition in Python and SQL Server.

What is OCR?

OCR, or Optical Character Recognition, was a hot topic in the days of digitizing paper artifacts such as documents, newspapers, and other physical media. As paper fell by the wayside, OCR remained an active research area but moved to the back burner as a “pop culture technology.”

The use of screenshots as a note-taking method changed that trajectory. Consumers of information typically do not want to download PowerPoint presentations and search through them. They simply take photos of the slides they are interested in and save them for later. Recognizing the text in these photos has become a standard feature of most photo management software. But how would a developer integrate this technology into his or her own software project?

Tesseract, an open-source OCR engine whose development Google sponsored for many years, gives software developers access to “commercial grade” OCR software at a “bargain basement” price. Tesseract is offered under the Apache 2.0 license, which gives developers wide latitude in how this software can be included in their own offerings. This software development tutorial will focus on implementing Tesseract within an Ubuntu Linux environment, since this is the easiest environment for a beginner to work in.

OCR is Not a Silver Bullet

Before getting into the technical details, it is important to dispense with the idea that OCR can always magically read all of the text in an image. Even with decades of research behind it, there are still instances in which OCR may not be the best solution for text extraction, and different OCR software may be necessary depending on the use case. Tesseract in particular may require additional “training” (its jargon) to get better at reading text data from images. Tesseract generally works best with images of 300 dpi (dots per inch) or higher, which is typically print quality as opposed to web quality. You may also need to “massage” an input image before it can be read correctly.

However, out of the box, Tesseract can be “good enough” for the purposes of extracting just enough text from an image in order to accomplish what you may need to do in your software application.
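As one illustration of “massaging” an image, the sketch below converts a low-resolution image to grayscale and enlarges it toward the 300 dpi mark before OCR. It assumes the Pillow library is installed. The function names, the 72 dpi default, and the choice of Lanczos resampling are illustrative assumptions rather than anything Tesseract requires, and enlarging an image will not always improve recognition.

```python
def upscale_factor(source_dpi, target_dpi=300):
    """Return how much to enlarge an image so it approximates target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    # Never shrink: images already at or above the target are left alone.
    return max(1.0, target_dpi / source_dpi)


def prepare_for_ocr(in_path, out_path, source_dpi=72):
    """Write a grayscale, enlarged copy of in_path to out_path."""
    # Pillow is imported here so upscale_factor() stays usable without it.
    from PIL import Image, ImageOps

    factor = upscale_factor(source_dpi)
    with Image.open(in_path) as img:
        gray = ImageOps.grayscale(img)
        new_size = (round(gray.width * factor), round(gray.height * factor))
        enlarged = gray.resize(new_size, Image.LANCZOS)
        # Record the nominal resolution in formats that support it (e.g. PNG, TIFF).
        enlarged.save(out_path, dpi=(300, 300))
```

A call such as prepare_for_ocr("slide.png", "slide-300dpi.png") (hypothetical file names) would produce a copy that can then be passed to Tesseract in place of the original.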


How to Install Tesseract

Installing Tesseract on Debian-based Linux is easy, because it is available as a software package. For Debian-based Linux distributions such as Kali or Ubuntu, use the following command:

$ sudo apt install tesseract-ocr

If you run into issues installing Tesseract in this manner, you may need to update your Linux installation as follows:

$ sudo apt update -y; sudo apt upgrade -y

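Packages are also available for many other platforms (for example, Homebrew on macOS provides a tesseract formula). Where no package exists for your Linux distribution, Windows, or macOS setup, it will be necessary to build from source.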

How to Run Tesseract from the Command Line

Once Tesseract is installed, it can be run directly from a terminal. Consider the following images, along with the text output generated by Tesseract. To display the extracted text in standard output, use the following command:

$ tesseract imageFile stdout

Here are some example outputs, along with the original image with text. These come from slides that are typically the kinds that students might take pictures of in a classroom setting:

Example 1

[Image: Python Text Extraction from Images]

Example 2

[Image: How to extract text from images in Python]

Example 3

[Image: Python Text Extraction Tutorial]

In each of the examples above, the text which “did not quite” get captured accurately is highlighted with red rectangles. This is likely due to the presentation quality image dpi (72 dpi) used for these images. As you can see below, some images are read better than others:

Example 4

[Image: Extract text in Python]

Note: The above is not a defect in Tesseract. It is possible to “train” Tesseract to recognize different fonts. Also, if you are scanning documents, you can configure your scanner to read at higher dpi levels.

Programmatic Text Extraction in Python with pytesseract

Naturally, extracting text within the context of a program is the next logical step. While it is always possible to use system calls from within Python or some other language in order to execute the Tesseract program, it is far more elegant to use an API to handle such calls instead.

One important thing to note: While it is not “verboten” to call Tesseract via system calls in a programming language, you must take care to ensure that no unchecked user input is passed to that system call. If no such checks are performed, then it is possible for an external user to run commands on your system with a well-constructed filename or other information.
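For readers who do call the Tesseract program directly, the following is a minimal sketch of one way to keep user-supplied file names from being interpreted as commands. It relies on Python's standard subprocess module with an argument list and no shell. The function names and the extension allow-list are assumptions made for illustration, and the sketch assumes a tesseract executable is on the PATH.

```python
import subprocess
from pathlib import Path

# Extensions accepted by this sketch (an illustrative allow-list).
IMAGE_SUFFIXES = {".gif", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_tesseract_command(image_path, tesseract_cmd="tesseract"):
    """Return the argument list for: tesseract <image> stdout"""
    path = Path(image_path)
    if path.suffix.lower() not in IMAGE_SUFFIXES:
        raise ValueError(f"Not a supported image file: {image_path!r}")
    # An absolute path cannot be mistaken for a command-line option
    # (for example, a file literally named "-foo.png").
    return [tesseract_cmd, str(path.resolve()), "stdout"]


def run_tesseract(image_path, timeout=5):
    """Run tesseract on one image and return the recognized text."""
    cmd = build_tesseract_command(image_path)
    # No shell=True: each list element reaches tesseract as a single
    # argument, so characters such as ";" or "$" are never interpreted.
    result = subprocess.run(cmd, capture_output=True, text=True,
                            timeout=timeout, check=True)
    return result.stdout
```

Because the command is passed as a list, a file named "slide; rm -rf x.png" is handed to Tesseract as one literal file name rather than as two shell commands.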

The Python module pytesseract provides a wrapper to the Tesseract application. pytesseract can be installed via the command:

$ pip3 install pytesseract

Note that if you access Python 3.x via the python command as opposed to python3, you will need to use the command:

$ pip install pytesseract

The following sample code will extract all the text it can find from any image file in the current directory using Python and pytesseract:

#!/usr/bin/python3

# mass-ocr-images.py

from PIL import Image
import os
import pytesseract
import sys

# You must specify the full path to the tesseract executable.
# In Linux, you can get this by using the command:
# which tesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

def main(argv):
    for filename in os.listdir("."):
        if str(filename) not in ['.', '..']:
            nameParts = str(filename).split(".")
            if nameParts[-1].lower() in ["gif", "png", "jpg", "jpeg", "tif", "tiff"]:
                # Calls to the API should always be bounded by a timeout, just in case.
                try:
                    print("Found filename [" + str(filename) + "]")
                    ocrText = pytesseract.image_to_string(str(filename), timeout=5)
                    print(ocrText)
                    print("")
                except Exception as err:
                    print("Processing of [" + str(filename) + "] failed due to error [" + str(err) + "]")

if __name__ == "__main__":
    main(sys.argv[1:])


Using a Database to Store Images and Extracted Text in Python

We can use a database to store both the images and the extracted text. This will allow for developers to write an application that can search against the text and tell us which image matches this text. The following code extends the first listing by saving the collected data into a MariaDB database:

#!/usr/bin/python3

# ocr-import-images.py

from PIL import Image
import mysql.connector
import os
import pytesseract
import shutil
import sys

# You must specify the full path to the tesseract executable.
# In Linux, you can get this by using the command:
# which tesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

def main(argv):
    try:
        conn = mysql.connector.connect(user="rd_user", password='myPW1234%', host="127.0.0.1", port=63306,
                                       database="RazorDemo")
        cursor = conn.cursor()
        for filename in os.listdir("."):
            if str(filename) not in ['.', '..']:
                nameParts = str(filename).split(".")
                if nameParts[-1].lower() in ["gif", "png", "jpg", "jpeg", "tif", "tiff"]:
                    # Calls to the API should always be bounded by a timeout, just in case.
                    try:
                        print("Found filename [" + str(filename) + "]")
                        ocrText = pytesseract.image_to_string(str(filename), timeout=5)
                        with open("temp.txt", "w") as fout:
                            fout.write(ocrText)
                        # Insert the database record:
                        sql0 = "insert into Images (file_name) values (%s)"
                        values0 = [str(filename)]
                        cursor.execute(sql0, values0)
                        conn.commit()
                        # We need the primary key identifier created by the last insert so we can insert the extracted
                        # text and binary data.
                        lastInsertID = cursor.lastrowid
                        print("Rcdid of insert is [" + str(lastInsertID) + "]")
                        # We need to copy the image file and the text file to a directory that is readable by the
                        # database.
                        shutil.copyfile("temp.txt", "/tmp/db-tmp/temp.txt")
                        shutil.copyfile(str(filename), "/tmp/db-tmp/" + str(filename))
                        # Also, FILE privileges may be needed for the MariaDB user account:
                        # grant file on *.* to 'rd_user'@'%';
                        # flush privileges;
                        sql1 = "update Images set extracted_text=LOAD_FILE(%s), file_data=LOAD_FILE(%s) where rcdid=%s"
                        values1 = ["/tmp/db-tmp/temp.txt", "/tmp/db-tmp/" + str(filename), str(lastInsertID)]
                        cursor.execute(sql1, values1)
                        conn.commit()
                        os.remove("/tmp/db-tmp/temp.txt")
                        os.remove("/tmp/db-tmp/" + str(filename))
                    except Exception as err:
                        print("Processing of [" + str(filename) + "] failed due to error [" + str(err) + "]")
        cursor.close()
        conn.close()
    except Exception as err:
        print("Processing failed due to error [" + str(err) + "]")

if __name__ == "__main__":
    main(sys.argv[1:])


The Python code example above interacts with a MariaDB table that has the following structure:

create table Images
(rcdid int not null auto_increment primary key,
file_name varchar(255) not null,
extracted_text longtext null,
file_data longblob null);

In the code example above, longtext and longblob were chosen because those data types are intended to hold large volumes of text or binary data, respectively.

How to Load File Data into MariaDB

Loading binary or non-standard text into any database can pose all sorts of challenges, especially if text encoding is a concern. In most popular RDBMS, binary data is almost never inserted into or updated in a database record via a typical insert statement that is used for other kinds of data. Instead, specialized statements are used for such tasks.

For MariaDB, in particular, the FILE privilege is required for any such operations. It is not assigned by a typical GRANT statement that grants privileges on a single database to a user account. Instead, FILE is a global privilege that must be granted on *.* with a separate set of commands. To do this in MariaDB for the rd_user account used in our second code example, log into MariaDB with its root account and execute the following commands:

grant file on *.* to 'rd_user'@'%';
flush privileges;

Once the FILE privilege is granted, the LOAD_FILE() function can be used to load longtext or longblob data into a particular existing record. LOAD_FILE() reads the file on the database server and returns NULL if the file cannot be read. The following example shows how to attach longtext or longblob data to an existing record in a MariaDB database:

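-- Replace 117 with the rcdid of the record to update. (A condition such as
-- rcdid=rcdid would match, and overwrite, every row.)

-- For the extracted text, which can contain non-standard characters.
update Images set extracted_text=LOAD_FILE('/tmp/test.txt') where rcdid=117;

-- For the binary image data
update Images set file_data=LOAD_FILE('/tmp/myImage.png') where rcdid=117;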

If you use a typical select * statement on this data after running these updates, then you will get a result that is not terribly useful:

[Image: Python text from image extraction tutorial]

Instead, select substrings of the data:

[Image: MariaDB query]

The result of this query is more useful, at least for ensuring the records populated:

[Image: Python Text Extraction from Images]
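The original query is available only as a screenshot, so the following is a guess at an equivalent, written as a Python sketch. It assumes a DB-API cursor such as the one returned by mysql.connector in the earlier listing. The function names, the 60-character preview length, and the column aliases are illustrative assumptions. It uses MariaDB's LEFT() and LENGTH() functions to return a short text preview and the stored image size instead of the raw column contents.

```python
# Preview query: a short slice of the text and the byte count of the image,
# rather than the full longtext/longblob values.
PREVIEW_SQL = (
    "select rcdid, file_name, "
    "left(extracted_text, %s) as text_preview, "
    "length(file_data) as image_bytes "
    "from Images order by rcdid"
)


def preview_images(cursor, preview_chars=60):
    """Return (rcdid, file_name, text_preview, image_bytes) rows."""
    cursor.execute(PREVIEW_SQL, (preview_chars,))
    return cursor.fetchall()


def unpopulated(rows):
    """Return the rcdids whose text or image column is still NULL.

    LOAD_FILE() yields NULL when the server cannot read the file, so a
    None in either column suggests the update did not take effect.
    """
    return [row[0] for row in rows if row[2] is None or row[3] is None]
```

Calling unpopulated(preview_images(cursor)) after an import run gives a quick list of records worth re-checking.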

To extract this data back into files, use specialized select statements, as shown below:

select extracted_text into dumpfile '/tmp/ppt-slide-3-text.txt' from Images where rcdid=117;
select file_data into dumpfile '/tmp/Sample PPT Slide 3.png' from Images where rcdid=117;

Note that, aside from writing the output to the files above, there is no “special output” from either of these queries.

The directory into which the files will be created must be writable by the user account under which the MariaDB daemon is running. Files with the same names as extracted files cannot already exist.

The file data should match what was initially loaded:

[Image: Extracting text from images]

The image will also match:

[Image: Python text processing tutorial]

The image as read back from the database, viewed with the fim command.

These SQL statements can then be incorporated into an external application to retrieve this information.
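As a sketch of what such an application might look like, the Python function below fetches the stored image through a DB-API cursor and writes it on the client side. This is a different technique from INTO DUMPFILE, which writes the file on the database server. The function name and the LookupError behavior are illustrative assumptions, and the cursor is assumed to come from a connection like the one in the earlier listing.

```python
from pathlib import Path


def export_image(cursor, rcdid, out_dir):
    """Write the image stored under rcdid into out_dir and return its path."""
    cursor.execute(
        "select file_name, file_data from Images where rcdid = %s", (rcdid,)
    )
    row = cursor.fetchone()
    if row is None or row[1] is None:
        raise LookupError(f"No image data stored for rcdid {rcdid}")
    file_name, file_data = row
    # Keep only the final path component so a stored name cannot
    # point outside out_dir.
    target = Path(out_dir) / Path(file_name).name
    # Mode "xb" refuses to overwrite an existing file, similar to the
    # restriction on INTO DUMPFILE described above.
    with open(target, "xb") as fout:
        fout.write(file_data)
    return target
```

For example, export_image(cursor, 117, "/tmp") would recreate the file for record 117 under its original name, provided no file with that name already exists there.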

How to Query Images in a Database

With the images loaded into the database, along with their extracted OCR text, conventional SQL queries can be used to find a particular image:

[Image: MariaDB query examples]

Note that, in MariaDB, conventional text comparisons may not behave as expected with longtext columns. Casting the column to a character type, for example with CAST(extracted_text AS CHAR), avoids this.

This gives the following output:

[Image: Python Image processing tutorial]
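The search described in this section can also be issued from Python. The sketch below is an assumption about what such a query might look like, since the original appears only as a screenshot. It takes a DB-API cursor, casts the longtext column to CHAR, and escapes LIKE wildcards so that the phrase is matched literally. The function names are hypothetical, and the escaping assumes MariaDB's default backslash escape character.

```python
SEARCH_SQL = (
    "select rcdid, file_name from Images "
    "where cast(extracted_text as char) like %s "
    "order by rcdid"
)


def like_pattern(phrase):
    """Wrap phrase in wildcards, escaping any wildcards it already contains."""
    escaped = (
        phrase.replace("\\", "\\\\")
        .replace("%", "\\%")
        .replace("_", "\\_")
    )
    return "%" + escaped + "%"


def find_images_containing(cursor, phrase):
    """Return (rcdid, file_name) rows whose OCR text contains phrase."""
    # The phrase travels as a bound parameter, never as part of the SQL text.
    cursor.execute(SEARCH_SQL, (like_pattern(phrase),))
    return cursor.fetchall()
```

Passing the phrase as a parameter also guards against the injection risk mentioned earlier for system calls, because user input is never concatenated into the statement.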

Final Thoughts on Extracting Text from Images with Python and MariaDB

Tesseract makes it straightforward to incorporate on-the-fly OCR into your applications, so that your users can more readily extract and use the text contained in their images. Tesseract can also go much further than its out-of-the-box behavior. For all of the “gibberish” results shown above, a developer can train Tesseract to read characters that it does not recognize, further extending its usability.

Given how technologies such as AI are becoming more mainstream, it is not too far of a stretch to imagine that OCR will only get better and easier to do as time goes on, but Tesseract is a great starting point. OCR is already used for complex tasks like “on the fly” language translation. That was considered unimaginable not that long ago. Who knows how much further we can go?
