Convert PDF to Text in C#

Earlier posts on this blog covered converting PDF to HTML and PDF to images programmatically. This article shows how to convert PDF to text in C# using a .NET OCR library. As a .NET developer, you can also use this library to convert files to other popular file formats, and it offers a range of features for automating text extraction from PDF documents. Here we walk through the steps and the code needed to extract text from a scanned PDF file.

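The following points will be covered in this article:

  1. .NET PDF to TXT Conversion - OCR Library Installation
  2. How to Convert PDF to Text with OCR in C#
  3. Extract Text from PDF with OCR in C# - Advanced Options
  4. Get a Free License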

.NET PDF to TXT Conversion - OCR Library Installation

The library is distributed as the Aspose.OCR NuGet package. It recognizes text in scanned documents and comes with documentation covering development and usage.

To install it in your .NET project, you can either download the DLL files or run the following command in the NuGet Package Manager Console.

Install-Package Aspose.OCR

How to Convert PDF to Text with OCR in C#

Extracting text from a scanned PDF file takes only a few lines of C# code.

Please follow the steps below:

  1. Create an instance of the AsposeOcr class.
  2. Initialize an instance of the DocumentRecognitionSettings class to recognize images from the PDF.
  3. Set the DetectAreas property to enable automatic text area detection.
  4. Call the RecognizePdf method to extract text from the scanned PDF document and assign the result to a list of RecognitionResult objects.

Copy & paste the following code to convert PDF to TEXT in C#.
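
The four steps above can be sketched as follows. This is a minimal sketch, assuming the types live in the Aspose.OCR namespace and that each RecognitionResult exposes its text through a RecognitionText property; check both against the version of the library you install. The file names are hypothetical, and saving the output with File.WriteAllLines is one possible choice rather than part of the steps.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Aspose.OCR; // assumed namespace of the OCR types named in the steps

class PdfToTextExample
{
    static void Main()
    {
        // Hypothetical input and output paths; replace with your own files.
        string pdfPath = "scanned.pdf";
        string txtPath = "scanned.txt";

        // Step 1: create the OCR engine.
        AsposeOcr api = new AsposeOcr();

        // Steps 2-3: configure PDF recognition and enable
        // automatic text area detection.
        DocumentRecognitionSettings settings = new DocumentRecognitionSettings();
        settings.DetectAreas = true;

        // Step 4: run recognition and collect the results.
        List<RecognitionResult> results = api.RecognizePdf(pdfPath, settings);

        // Assumption: RecognitionText holds the text recognized for one result.
        // Each result is written as one entry in the output text file.
        File.WriteAllLines(txtPath, results.Select(r => r.RecognitionText));
    }
}
```

If text areas are detected poorly on your documents, setting DetectAreas to false and comparing the output is a reasonable first experiment.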

Extract Text from PDF with OCR in C# - Advanced Options

In this section, we explore the library further. It can also recognize a scanned PDF supplied as a stream instead of a file path.

The following are the steps:

  1. Create an instance of the AsposeOcr class.
  2. Create an instance of the MemoryStream class to hold the PDF for recognition.
  3. Open the source file with a FileStream.
  4. Invoke the CopyTo method to copy the file's bytes into the memory stream.
  5. Create an instance of the DocumentRecognitionSettings class to recognize images from the PDF.
  6. Call the RecognizePdf method with the memory stream and assign the result to a list of RecognitionResult objects.

The code snippet below shows how to extract text from PDF with OCR in C# with an advanced approach:
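
The six steps above can be sketched as follows. This is a minimal sketch, assuming the Aspose.OCR namespace, a RecognizePdf overload that accepts a MemoryStream, and a RecognitionText property on each result. The file name is hypothetical, and rewinding the stream before recognition is a defensive choice that is not part of the steps.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Aspose.OCR; // assumed namespace of the OCR types named in the steps

class PdfStreamToTextExample
{
    static void Main()
    {
        // Hypothetical input path; replace with your own file.
        string pdfPath = "scanned.pdf";

        // Step 1: create the OCR engine.
        AsposeOcr api = new AsposeOcr();

        // Steps 2-3: create the memory stream and open the source file.
        using (MemoryStream memory = new MemoryStream())
        using (FileStream file = new FileStream(pdfPath, FileMode.Open, FileAccess.Read))
        {
            // Step 4: copy the file's bytes into the memory stream.
            file.CopyTo(memory);

            // Rewind so reading starts at the first byte
            // (harmless if the library rewinds the stream itself).
            memory.Position = 0;

            // Step 5: default settings for PDF recognition.
            DocumentRecognitionSettings settings = new DocumentRecognitionSettings();

            // Step 6: recognize the PDF held in the stream.
            List<RecognitionResult> results = api.RecognizePdf(memory, settings);

            // Assumption: RecognitionText holds the text recognized for one result.
            for (int i = 0; i < results.Count; i++)
            {
                Console.WriteLine($"Result {i + 1}: {results[i].RecognitionText}");
            }
        }
    }
}
```

The stream variant is useful when the PDF does not come from disk, for example when it arrives as an upload or is read from a database.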

Get a Free License

You can get a free temporary license to try the API without evaluation limitations.

Summing up

You have learned how to convert PDF to text in C# programmatically, both from a file path and from a stream. You may visit the documentation to learn about the library's other features. This guide should help if you want to add a PDF to text converter to your application. conholdate.com publishes new blog posts regularly, so please stay in touch for updates.

Ask a question

You can post your questions on our forum.

FAQs

How do I convert a PDF to text programmatically?

You can convert PDF to text in C# using this .NET OCR library. It exposes the RecognizePdf method, which extracts the text from a PDF document.

What is the easiest way to convert PDF to text?

Calling the RecognizePdf method, as described in this article, takes only a few lines of C# code. You may visit the documentation of this library to learn about other methods for extracting data from scanned PDF files programmatically.
