Search Text in PDF Documents using C#


You may need to find a particular piece of information, a phrase, or a single word in your documents. As a C# developer, you can search for any text in PDF documents programmatically from your .NET applications. In this article, you will learn how to search text in PDF documents using C#.

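The following topics are covered in this article:

  • C# API for Searching Text
  • Search Text in PDF Documents using C#
  • Case-Sensitive Text Search in PDF using C#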

C# API for Searching Text

For searching text in PDF documents, I will be using the GroupDocs.Search for .NET API. It lets you perform text search operations on popular document formats such as PDF, Word, Excel, PowerPoint, and many more, and fetch the information you need from files, documents, emails, and archives. You can create and merge multiple indexes and search through them using simple, Boolean, Regular Expression (Regex), Fuzzy, and other types of queries.

You can either download the DLL of the API or install it from NuGet by running the following command in the Package Manager Console:

Install-Package GroupDocs.Search

Search Text in PDF Documents using C#

You can search for any text or a specific word in your PDF documents programmatically by following the simple steps mentioned below:

  • Create an instance of the Index class
  • Specify the path to the index folder
  • Subscribe to the index events
  • Add PDF files to the Index by calling the Add() method
  • Define a search query
  • Perform a search using the Search() method with the search query
  • Use the SearchResult to print a summary
  • Highlight the searched results in the output using the Highlight() method

The following code sample shows how to search text in PDF documents using C#.
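A minimal sketch of the steps listed above, assuming the GroupDocs.Search NuGet package is referenced. The index folder name and the query word "Lorem" are hypothetical placeholders; only sample.pdf and Highlighted.html appear in this article's output. The single-argument FileOutputAdapter constructor is also an assumption that matches releases that ship HtmlHighlighter; other releases may expect a different signature, so check the API reference for your version.

```csharp
using System;
using GroupDocs.Search;
using GroupDocs.Search.Common;
using GroupDocs.Search.Highlighters;
using GroupDocs.Search.Results;
// Alias avoids a clash with System.Index on newer .NET targets.
using Index = GroupDocs.Search.Index;

class SearchPdfSketch
{
    static void Main()
    {
        // Hypothetical locations; adjust to your environment.
        string indexFolder = @"C:\Files\Files\Index";
        string documentPath = @"C:\Files\Files\sample.pdf";
        string highlightedPath = @"C:\Files\Files\Highlighted.html";

        // Create the index on disk in the given folder.
        Index index = new Index(indexFolder);

        // Subscribe to indexing errors before adding any files.
        index.Events.ErrorOccurred += (sender, args) => Console.WriteLine(args.Message);

        // Index the PDF file.
        index.Add(documentPath);

        // Placeholder query word (not taken from the article).
        string query = "Lorem";
        SearchResult result = index.Search(query);

        // Print a summary of the result.
        Console.WriteLine("Documents found: " + result.DocumentCount);
        Console.WriteLine("Total occurrences found: " + result.OccurrenceCount);
        for (int i = 0; i < result.DocumentCount; i++)
        {
            FoundDocument document = result.GetFoundDocument(i);
            Console.WriteLine("\tDocument: " + document.DocumentInfo.FilePath);
            Console.WriteLine("\tOccurrences: " + document.OccurrenceCount);
        }

        // Highlight the occurrences of the first found document in an HTML file.
        if (result.DocumentCount > 0)
        {
            FoundDocument first = result.GetFoundDocument(0);
            // Assumption: constructor taking only the output path (version dependent).
            FileOutputAdapter outputAdapter = new FileOutputAdapter(highlightedPath);
            HtmlHighlighter highlighter = new HtmlHighlighter(outputAdapter);
            index.Highlight(first, highlighter);

            Console.WriteLine();
            Console.WriteLine("Generated HTML file can be opened with Internet browser.");
            Console.WriteLine("The file can be found by the following path:");
            Console.WriteLine(highlightedPath);
        }
    }
}
```

The counts printed depend on the document and on the query you choose.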

The above code sample will generate the following output:

Documents found: 1
Total occurrences found: 4
        Document: C:\Files\Files\sample.pdf
        Occurrences: 4

Generated HTML file can be opened with Internet browser.
The file can be found by the following path:
C:\Files\Files\Highlighted.html
Figure: The searched text highlighted in the generated HTML output.

The Index and Index Events

The Index class is the main class that provides functionality to index the documents and search through them. An index can be created in memory or on disk by calling the constructor of this class. In the above code example, I have created the index on disk so that it can be reused.
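The two ways of creating an index can be sketched as follows. The folder path is a hypothetical placeholder, and the sketch assumes that the parameterless constructor creates the in-memory index, which is how I understand the API; verify it against the reference for your version.

```csharp
using GroupDocs.Search;
// Alias avoids a clash with System.Index on newer .NET targets.
using Index = GroupDocs.Search.Index;

class IndexCreationSketch
{
    static void Main()
    {
        // In-memory index: nothing is written to disk, so it is lost when the process exits.
        Index memoryIndex = new Index();

        // On-disk index: stored in the given folder (hypothetical path) so it can be reused.
        Index diskIndex = new Index(@"C:\Files\Files\Index");
    }
}
```

An in-memory index suits one-off searches; an on-disk index avoids re-indexing the same documents on every run.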

The ErrorOccurred event reports any errors that occur while files are being indexed. Subscribe to it to receive information about indexing errors.

Add Files to the Index

The Add() method of the Index class adds a single file, or all files in a specified folder and its subfolders, given an absolute or relative path. All documents found at that path are indexed.

Perform a Search Operation

The Index class provides various Search methods to perform the search operation. You can search by providing a simple keyword or by defining a SearchQuery.
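The two query forms might look like the sketch below. The paths and the word "Lorem" are hypothetical placeholders. The factory method SearchQuery.CreateWordQuery and the Search overload that takes a SearchQuery together with SearchOptions are written from my recollection of the API, so treat them as assumptions and confirm them in the API reference.

```csharp
using System;
using GroupDocs.Search;
using GroupDocs.Search.Common;
using GroupDocs.Search.Options;
using GroupDocs.Search.Results;
// Alias avoids a clash with System.Index on newer .NET targets.
using Index = GroupDocs.Search.Index;

class QueryFormsSketch
{
    static void Main()
    {
        // Hypothetical index folder and document path.
        Index index = new Index(@"C:\Files\Files\Index");
        index.Add(@"C:\Files\Files\sample.pdf");

        // 1. Simple keyword passed as text.
        SearchResult byKeyword = index.Search("Lorem");

        // 2. The same word expressed as a SearchQuery object
        //    (assumed factory method and overload).
        SearchQuery wordQuery = SearchQuery.CreateWordQuery("Lorem");
        SearchResult byObject = index.Search(wordQuery, new SearchOptions());

        Console.WriteLine("Keyword query occurrences: " + byKeyword.OccurrenceCount);
        Console.WriteLine("Object query occurrences: " + byObject.OccurrenceCount);
    }
}
```

The object form is mainly useful when a query is assembled in code from several parts rather than typed as a string.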

The SearchResult class provides details of a search result matching a search query, such as the number of documents found and the number of occurrences in each of them.

Highlight the Search Results

The HtmlHighlighter class highlights the search results in the full text of a document formatted as HTML.

The Highlight() method of the Index class generates HTML output highlighting occurrences of found terms. You can find more details about “Highlighting Search Results” in the documentation.

Case-Sensitive Text Search in PDF using C#

You can search your PDF documents for a phrase or a word while distinguishing between uppercase and lowercase letters by following the steps below:

  • Create an instance of the Index class
  • Specify the path to the index folder
  • Add PDF files to the Index by calling the Add() method
  • Create an instance of the SearchOptions
  • Set the UseCaseSensitiveSearch property to true
  • Define a search query
  • Perform a search using the Search() method with the search query and the SearchOptions
  • Use the SearchResult to print a summary

The following code sample shows how to perform a case-sensitive text search in a PDF document using C#.
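A minimal sketch of the case-sensitive steps listed above, assuming the GroupDocs.Search NuGet package is referenced. The index folder and the query word "Lorem" are hypothetical placeholders; sample.pdf is the file named in this article's output.

```csharp
using System;
using GroupDocs.Search;
using GroupDocs.Search.Options;
using GroupDocs.Search.Results;
// Alias avoids a clash with System.Index on newer .NET targets.
using Index = GroupDocs.Search.Index;

class CaseSensitiveSearchSketch
{
    static void Main()
    {
        // Hypothetical index folder; the document path is the one used in this article.
        Index index = new Index(@"C:\Files\Files\Index");
        index.Add(@"C:\Files\Files\sample.pdf");

        // Enable case-sensitive matching for this search.
        SearchOptions options = new SearchOptions();
        options.UseCaseSensitiveSearch = true;

        // Placeholder query word (not taken from the article).
        string query = "Lorem";
        SearchResult result = index.Search(query, options);

        // Print a summary of the result.
        Console.WriteLine("Documents found: " + result.DocumentCount);
        Console.WriteLine("Total occurrences found: " + result.OccurrenceCount);
        for (int i = 0; i < result.DocumentCount; i++)
        {
            FoundDocument document = result.GetFoundDocument(i);
            Console.WriteLine("\tDocument: " + document.DocumentInfo.FilePath);
            Console.WriteLine("\tOccurrences: " + document.OccurrenceCount);
        }
    }
}
```

With the option enabled, a query for "Lorem" is not expected to match "lorem" or "LOREM", which is why a case-sensitive search can report fewer occurrences than the default search.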

Documents found: 1
Total occurrences found: 2
        Document: C:\Files\Files\sample.pdf
        Occurrences: 2

The SearchOptions class provides options to perform the search operations. The UseCaseSensitiveSearch property of this class allows you to perform a case-sensitive search for a word or text.

Get a Free License

You can try the API without evaluation limitations by requesting a free temporary license.

Conclusion

In this article, you have learned how to search text in a PDF document using C#, and how to perform a case-sensitive text search in a PDF document. You can learn more about the GroupDocs.Search for .NET API from the documentation. If you have any questions, please feel free to contact us on the forum.
