Fetch Text from Image & PDF Using Selenium Java | Devstringx Technologies

Devstringx Technologies
Jul 15, 2021

In this blog, we will learn how to fetch text from images and PDF files.

This blog covers:

  • Read Text From Image Using OCR with Tesseract (tess4j)
  • Reading PDF Text Using PDFUtil
  • Save PDF as Image Using PDFUtil
  • Extract Images From PDF Using PDFUtil

Fetch Text From Image:

To fetch text from an image, we use Optical Character Recognition (OCR) with Tesseract, called from Java through the tess4j wrapper. Tesseract supports Unicode (UTF-8).

  • First, create a folder named “tesseract” in the project and put the trained data files in it. Trained data for the supported languages is available at the URL below:

https://github.com/tesseract-ocr/tessdata

For English, download eng.traineddata and place it in the tesseract folder of your project.

  • Add the following Maven dependency for Tesseract (tess4j):

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.4</version>
</dependency>

  • The following Java code fetches text from an image:

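// Imports: java.io.File, net.sourceforge.tess4j.ITesseract, net.sourceforge.tess4j.Tesseract
ITesseract image = new Tesseract();
// Folder that contains eng.traineddata
image.setDatapath("Location for TessData Folder");
image.setLanguage("eng");
// doOCR throws TesseractException if recognition fails
String str1 = image.doOCR(new File("Location Of Image"));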
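Tesseract can only load a language whose trained data file is present in the data path, so it may help to check the folder before calling setDatapath. The sketch below uses only the Java standard library. The class name TessDataCheck and the method hasLanguage are hypothetical names chosen for illustration and are not part of tess4j; a temporary folder stands in for the project's tesseract folder.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TessDataCheck {

    // Returns true if <dataPath>/<language>.traineddata exists as a regular file.
    public static boolean hasLanguage(Path dataPath, String language) {
        Path trainedData = dataPath.resolve(language + ".traineddata");
        return Files.isRegularFile(trainedData);
    }

    public static void main(String[] args) throws IOException {
        // A temporary folder stands in for the project's "tesseract" folder.
        Path dataPath = Files.createTempDirectory("tesseract");
        Files.createFile(dataPath.resolve("eng.traineddata"));

        System.out.println(hasLanguage(dataPath, "eng")); // prints true
        System.out.println(hasLanguage(dataPath, "fra")); // prints false
    }
}
```

In a test, a failed check could be turned into a clear error message before any OCR call is made.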

Fetch Text From PDF:

  • Add the following Maven dependency for PDFUtil:

<dependency>
    <groupId>com.testautomationguru.pdfutil</groupId>
    <artifactId>pdf-util</artifactId>
    <version>0.0.3</version>
</dependency>

  • The following Java code reads text from a PDF:

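// Import: com.testautomationguru.utility.PDFUtil
String pdfLocation = "Location where we have PDF File";
PDFUtil pdfUtil = new PDFUtil();
// getText can throw IOException if the file cannot be read
String text = pdfUtil.getText(pdfLocation);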
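Text extracted from a PDF (or returned by OCR) may contain line breaks and uneven spacing, which can make exact string comparisons fail. Here is a minimal sketch of a whitespace-tolerant check, assuming the text has already been extracted. TextMatcher, normalize, and containsNormalized are hypothetical helper names, and the sample string is made up for the demo.

```java
import java.util.Locale;

public class TextMatcher {

    // Collapses every run of whitespace (spaces, tabs, line breaks) into one space.
    public static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Case-insensitive, whitespace-tolerant containment check.
    public static boolean containsNormalized(String extracted, String expected) {
        String haystack = normalize(extracted).toLowerCase(Locale.ROOT);
        String needle = normalize(expected).toLowerCase(Locale.ROOT);
        return haystack.contains(needle);
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by pdfUtil.getText(pdfLocation).
        String text = "Invoice   Number:\n  INV-1001\r\nTotal:\t250.00";

        System.out.println(normalize(text));
        // prints: Invoice Number: INV-1001 Total: 250.00
        System.out.println(containsNormalized(text, "invoice number: inv-1001"));
        // prints: true
    }
}
```

The same helper could be applied to the string returned by doOCR in the image section.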

  • The following Java code saves a PDF as images:

String folderLocation = "Location Where we need to save Image";
String pdfLocation = "Location where we have PDF File";
PDFUtil pdfUtil = new PDFUtil();
// Folder where the page images are written
pdfUtil.setImageDestinationPath(folderLocation);
// Saves the pages of the PDF as images
pdfUtil.savePdfAsImage(pdfLocation);

  • The following Java code extracts the images embedded in a PDF:

String folderLocation = "Location Where we need to save Image";
String pdfLocation = "Location where we have PDF File";
PDFUtil pdfUtil = new PDFUtil();
// Folder where the extracted images are written
pdfUtil.setImageDestinationPath(folderLocation);
// Extracts the images embedded in the PDF
pdfUtil.extractImages(pdfLocation);
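Images written to the destination folder can then be passed to the OCR step from the first section. Below is a rough sketch of that loop, assuming the images are written as .png files (an assumption; adjust the extension if your output differs). ImageFolderOcr and the OcrEngine interface are hypothetical names. In a real project the engine would be something like file -> tesseract.doOCR(file); the demo uses a stub so that it runs without tess4j.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ImageFolderOcr {

    // Hypothetical abstraction over the OCR call, e.g. file -> tesseract.doOCR(file).
    public interface OcrEngine {
        String read(File image) throws Exception;
    }

    // Runs the engine on every file in the folder with the given extension,
    // in file-name order, and returns file name -> recognized text.
    public static Map<String, String> readAll(Path folder, String extension, OcrEngine engine)
            throws Exception {
        List<Path> images;
        try (Stream<Path> files = Files.list(folder)) {
            images = files
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(extension))
                    .sorted()
                    .collect(Collectors.toList());
        }
        Map<String, String> results = new LinkedHashMap<>();
        for (Path image : images) {
            results.put(image.getFileName().toString(), engine.read(image.toFile()));
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Temporary folder with made-up file names, standing in for folderLocation.
        Path folder = Files.createTempDirectory("pdf-images");
        Files.createFile(folder.resolve("page_2.png"));
        Files.createFile(folder.resolve("page_1.png"));
        Files.createFile(folder.resolve("notes.txt"));

        // Stub engine so the sketch runs without tess4j installed.
        OcrEngine stub = image -> "text of " + image.getName();

        readAll(folder, ".png", stub)
                .forEach((name, text) -> System.out.println(name + " -> " + text));
    }
}
```

Keeping the OCR call behind a small interface also makes the loop easy to exercise in unit tests without real image files.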


Originally published at https://www.devstringx.com on July 08, 2021.
