Search for a word in PDF using Java

You may need to search for specific text in Word or PDF documents. As a Java developer, you can search for any text in PDF documents programmatically. In this article, you will learn how to search for a word in PDF documents using Java.

The following topics are covered in this article:

  • Java API for Searching Text
  • Search Text in PDF using Java
  • Get a Free License

Java API for Searching Text

I will be using the GroupDocs.Search for Java API to search PDF documents. It performs text search operations across popular document formats such as PDF, Word, Excel, PowerPoint, and many more. You can fetch the information you need from files, documents, emails, and archives using this API. It also enables you to create and merge multiple indexes. You can use simple, Boolean, regular expression (regex), fuzzy, and other types of queries to search through indexes quickly.

Download and Configure

You can download the JAR of the API, or add the following pom.xml configuration to your Maven-based Java application, to try the code examples below.

<repository>
	<id>GroupDocsJavaAPI</id>
	<name>GroupDocs Java API</name>
	<url>http://repository.groupdocs.com/repo/</url>
</repository>
<dependency>
	<groupId>com.groupdocs</groupId>
	<artifactId>groupdocs-search</artifactId>
	<version>20.11</version>
</dependency>

Search Text in PDF using Java

You can easily search any text or a specific word in your PDF documents by following the simple steps mentioned below:

  • Create an Index
  • Specify the path to the index folder
  • Subscribe to index events
  • Add files to Index by calling the add method
  • Perform a search using the search method
  • Use SearchResult and print summary
  • Highlight the searched results in the output using the highlight method

The following code sample shows how to search for a word in a PDF document using Java.

The code sample above generates the following output:

Documents found: 1
Total occurrences found: 6
	Document: C:\Files\Lorem ipsum.pdf
	Occurrences: 6

Generated HTML file can be opened with Internet browser.
The file can be found by the following path:
C:\Output\Highlighted.html
Search for a word in PDF document using Java

The Index and Index Event

The Index class is the main class for indexing documents and searching through them. An index can be created in memory or on disk by calling the constructor of this class. I have created it on disk so that it can be reused.

To receive information about indexing errors, I have subscribed to the ErrorOccurred event. It reports any errors that occur while the files are being indexed.

Add Files to Index

The add method of the Index class adds a single file, or all files in a folder and its subfolders, given an absolute or relative path. All the documents on the given path will be indexed.
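To make "all files in a folder and its subfolders" concrete, here is a minimal plain-Java sketch of a recursive folder scan. It is not the GroupDocs.Search API and does not show how the add method works internally; the class name FolderScan and its collect method are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: plain Java only, not part of GroupDocs.Search.
class FolderScan {

    // Returns every regular file under root (subfolders included) whose
    // name ends with the given extension, compared case-insensitively.
    static List<Path> collect(Path root, String extension) throws IOException {
        String suffix = extension.toLowerCase(Locale.ROOT);
        // Files.walk keeps directory handles open, so close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(suffix))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Calling FolderScan.collect(root, ".pdf") on a folder lists the PDF files that a recursive scan would reach, which can be handy for checking what you are about to index.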

Perform a Search Operation

The Index class provides various search methods to perform the search operation. You can search by simple keyword or by defining a SearchQuery.

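The SearchResult class provides details of a search result matching a search query. Some of its methods are described here:

  • getDocumentCount() returns the number of documents found
  • getOccurrenceCount() returns the total number of occurrences found
  • getFoundDocument(int index) returns the found document at the given index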
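The relationship between the per-document and total counts in a search summary can be illustrated with a small plain-Java sketch. It works on in-memory strings instead of indexed PDFs, and it is not the GroupDocs.Search API; the class SearchSummary, its methods, and the whole-word, case-insensitive matching rule are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: plain Java only, not part of GroupDocs.Search.
class SearchSummary {

    // Counts whole-word, case-insensitive occurrences of word in text.
    static int countOccurrences(String text, String word) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Builds a summary over a map of document name to extracted text.
    // Only documents with at least one occurrence are listed.
    static String summarize(Map<String, String> documents, String word) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        int total = 0;
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            int n = countOccurrences(entry.getValue(), word);
            if (n > 0) {
                hits.put(entry.getKey(), n);
                total += n;
            }
        }
        StringBuilder out = new StringBuilder();
        out.append("Documents found: ").append(hits.size()).append('\n');
        out.append("Total occurrences found: ").append(total).append('\n');
        for (Map.Entry<String, Integer> hit : hits.entrySet()) {
            out.append("\tDocument: ").append(hit.getKey()).append('\n');
            out.append("\tOccurrences: ").append(hit.getValue()).append('\n');
        }
        return out.toString();
    }
}
```

The total is simply the sum of the per-document counts, which is why a single matching document reports the same number on both lines.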

Highlight the Search Results

The HtmlHighlighter class facilitates highlighting the search results in an entire document text formatted in HTML.

The highlight method of the Index class generates HTML output highlighting occurrences of found terms. You can find more details about “Highlighting Search Results” in the documentation.
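As a rough illustration of what HTML highlighting involves, the plain-Java sketch below wraps each occurrence of a word in a mark element and escapes the rest of the text. It is not the GroupDocs.Search implementation; the class HtmlHighlightSketch, the choice of the mark tag, and the whole-word matching rule are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: plain Java only, not part of GroupDocs.Search.
class HtmlHighlightSketch {

    // Escapes the characters that would otherwise be read as HTML markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Returns a small HTML page in which every whole-word,
    // case-insensitive occurrence of word is wrapped in <mark>.
    static String highlight(String text, String word) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        StringBuilder html = new StringBuilder("<html><body><p>");
        int last = 0;
        while (matcher.find()) {
            // Copy the text between the previous match and this one.
            html.append(escape(text.substring(last, matcher.start())));
            html.append("<mark>").append(escape(matcher.group())).append("</mark>");
            last = matcher.end();
        }
        // Copy whatever follows the final match.
        html.append(escape(text.substring(last)));
        html.append("</p></body></html>");
        return html.toString();
    }
}
```

Writing the returned string to a file such as Highlighted.html gives a page that a browser renders with the matches marked.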

Get a Free License

You can try the API without evaluation limitations by requesting a free temporary license.

Conclusion

In this article, you have learned how to search for a word in a PDF document using Java. You can learn more about the GroupDocs.Search for Java API from the documentation. If you have any questions, please feel free to contact us on the forum.
