Fetch Text from Image & PDF Using Selenium Java | Devstringx Technologies
In this blog, we will learn how to fetch text from images and PDF files.
This Blog Contains:
- Read Text From Image Using OCR with Tesseract (tess4j)
- Reading PDF Text Using PDFUtil
- Save PDF as Image Using PDFUtil
- Extract Images From PDF Using PDFUtil
Fetch Text From Image:
To fetch text from an image, we use Optical Character Recognition (OCR) with Tesseract through its Java wrapper, Tess4J (tess4j). Tesseract supports Unicode (UTF-8) text.
- First, create a folder named “tesseract” in your project and put the trained data file in that folder. Trained data for any supported language is available at the URL below:
https://github.com/tesseract-ocr/tessdata
For English, download eng.traineddata and put it into the “tesseract” folder in your project.
- Add the Maven dependency below for Tesseract (tess4j):
<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>4.5.4</version>
</dependency>
- Below is Java code to fetch text from image:
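// Imports: java.io.File, net.sourceforge.tess4j.ITesseract, net.sourceforge.tess4j.Tesseract
ITesseract image = new Tesseract();
image.setDatapath("Location for TessData Folder"); // folder that contains eng.traineddata
image.setLanguage("eng");
// doOCR throws TesseractException, so call it inside try/catch or a method that declares it
String str1 = image.doOCR(new File("Location Of Image"));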
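OCR cannot work if the data path does not contain the trained data file for the selected language. The following is a minimal sketch, using only the Java standard library, of a check that could run before calling doOCR. The names TessdataCheck, trainedDataFile and hasTrainedData are hypothetical and not part of Tess4J, and the sketch assumes the file is named <language>.traineddata, as in the tessdata repository linked above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Tess4J.
final class TessdataCheck {

    // Builds the expected location of the trained data file: <dataPath>/<language>.traineddata
    static Path trainedDataFile(String dataPath, String language) {
        return Paths.get(dataPath, language + ".traineddata");
    }

    // Returns true only if the expected trained data file exists as a regular file.
    static boolean hasTrainedData(String dataPath, String language) {
        return Files.isRegularFile(trainedDataFile(dataPath, language));
    }

    public static void main(String[] args) {
        // Assumed layout from the steps above: a "tesseract" folder in the project root.
        String dataPath = args.length > 0 ? args[0] : "tesseract";
        System.out.println("eng.traineddata present: " + hasTrainedData(dataPath, "eng"));
    }
}
```

If the check returns false, it is probably better to fail the test early with a clear message than to let the OCR call fail later.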
Fetch Text From PDF:
- Add the Maven dependency below for PDFUtil:
<dependency>
<groupId>com.testautomationguru.pdfutil</groupId>
<artifactId>pdf-util</artifactId>
<version>0.0.3</version>
</dependency>
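- The Java code below reads text from a PDF:
// Import: com.testautomationguru.utility.PDFUtil
String pdfLocation = "Location where we have PDF File";
PDFUtil pdfUtil = new PDFUtil();
// getText declares IOException, so handle or declare it
String text = pdfUtil.getText(pdfLocation);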
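Text extracted from a PDF (or returned by OCR) usually carries line breaks and irregular spacing that follow the page layout, so an exact string comparison in a test tends to be brittle. The sketch below shows one possible way to normalize whitespace before checking for an expected phrase. The class ExtractedText and its methods are hypothetical helpers, not part of pdf-util or Tess4J, and the sample string only stands in for the value returned by getText.

```java
// Hypothetical helper for illustration; not part of pdf-util or Tess4J.
final class ExtractedText {

    // Collapses every run of whitespace (spaces, tabs, line breaks) into one space and trims the ends.
    static String normalize(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }

    // Returns true if the normalized extracted text contains the normalized expected phrase.
    static boolean containsPhrase(String extracted, String expected) {
        return normalize(extracted).contains(normalize(expected));
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the text returned by pdfUtil.getText(pdfLocation).
        String text = "Invoice   Number:\n 12345\r\nTotal:\t99.00";
        System.out.println(normalize(text));
        System.out.println(containsPhrase(text, "Invoice Number: 12345"));
    }
}
```

The comparison stays case-sensitive on purpose; whether to also ignore case depends on what the test is meant to verify.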
- The Java code below saves the pages of a PDF as images:
String folderLocation = "Location Where we need to save Image";
String pdfLocation = "Location where we have PDF File";
PDFUtil pdfUtil = new PDFUtil();
// Images are written to this folder
pdfUtil.setImageDestinationPath(folderLocation);
// savePdfAsImage declares IOException, so handle or declare it
pdfUtil.savePdfAsImage(pdfLocation);
- The Java code below extracts the images embedded in a PDF:
String folderLocation = "Location Where we need to save Image";
String pdfLocation = "Location where we have PDF File";
PDFUtil pdfUtil = new PDFUtil();
// Extracted images are written to this folder
pdfUtil.setImageDestinationPath(folderLocation);
// extractImages declares IOException, so handle or declare it
pdfUtil.extractImages(pdfLocation);
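After saving pages or extracting images, it can be useful to confirm what was actually written to the destination folder. The sketch below lists files with a given extension using only the Java standard library. The class SavedImages is a hypothetical helper, not part of pdf-util, and the "png" extension used in main is an assumption about the output format, so adjust it to match the files you see in the folder.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of pdf-util.
final class SavedImages {

    // Returns the sorted names of regular files in the folder whose names end with
    // "." + extension (case-insensitive). Returns an empty list if the folder is missing.
    static List<String> list(String folderLocation, String extension) {
        String suffix = "." + extension.toLowerCase(Locale.ROOT);
        File[] files = new File(folderLocation).listFiles(
                (dir, name) -> name.toLowerCase(Locale.ROOT).endsWith(suffix));
        List<String> names = new ArrayList<>();
        if (files != null) { // listFiles returns null when the folder does not exist
            for (File file : files) {
                if (file.isFile()) {
                    names.add(file.getName());
                }
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Pass the same folder that was given to setImageDestinationPath.
        String folderLocation = args.length > 0 ? args[0] : ".";
        System.out.println(list(folderLocation, "png")); // "png" is an assumed extension
    }
}
```

An empty list after a save or extract call suggests that the destination path or the extension does not match what was written.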
Originally published at https://www.devstringx.com on July 08, 2021.