Extract Text from DOC or DOCX using C#

Much of the data we work with lives as text in documents, images, and web pages, so extracting that text is a common requirement. You may need to extract text or images from Word or PDF documents. As a C# developer, you can extract text from documents programmatically. In this article, you will learn how to extract text from DOC or DOCX documents using C#.

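The following topics are covered in this article:

  • C# API for Text Extraction
  • Extract Text from DOCX using C#
  • Get Formatted Text from DOCX using C#
  • Extract Formatted Text from Pages using C#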

C# API for Text Extraction

I will be using the GroupDocs.Parser for .NET API to extract text from DOCX documents. It extracts text, metadata, and images from documents in supported formats such as Word, PDF, Excel, and PowerPoint. Text can be extracted in raw, formatted, or structured form.

You can either download the DLL of the API or install it using NuGet.

Install-Package GroupDocs.Parser

Extract Text from DOCX using C#

You can parse any supported document and extract its text by following these steps:

  • Create an instance of the Parser class
  • Specify the file path
  • Call the GetText method of the Parser class to extract text
  • Get the results in a TextReader object
  • Show the results by calling the ReadToEnd method of the TextReader class

The following code sample shows how to extract text from a DOCX file using C#.
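
A minimal sketch of these steps, assuming the GroupDocs.Parser NuGet package is installed. The path `sample.docx` and the class name `ExtractTextExample` are placeholders chosen for illustration, and the null check assumes that GetText returns null when text extraction is not supported for the file:

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

class ExtractTextExample
{
    static void Main()
    {
        // "sample.docx" is a placeholder path; point it at your own document.
        using (Parser parser = new Parser("sample.docx"))
        {
            // GetText returns a TextReader over the document text.
            // Assumption: the reader is null when extraction is unsupported.
            using (TextReader reader = parser.GetText())
            {
                Console.WriteLine(reader == null
                    ? "Text extraction isn't supported"
                    : reader.ReadToEnd());
            }
        }
    }
}
```

Both the Parser and the TextReader are wrapped in `using` blocks so that the file handle is released as soon as the text has been read.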

The Parser class is the main class of the API; it provides the parsing functionality and extracts text and images. I specified the input file path in its constructor.

The GetText() method of the Parser class extracts the text from the specified document.

Get Formatted Text from DOCX using C#

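You can parse a Word document and extract its text without losing style formatting by following these steps:

  • Create an instance of the Parser class
  • Specify the file path
  • Create a FormattedTextOptions object and set the extraction mode to HTML
  • Call the GetFormattedText method of the Parser class
  • Read the result from the returned TextReader object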

The following code sample shows how to extract formatted text from a DOCX file using C#.
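
A minimal sketch of formatted extraction, assuming the GroupDocs.Parser NuGet package is installed and that `FormattedTextOptions` and `FormattedTextMode` live in the `GroupDocs.Parser.Options` namespace. The path `sample.docx` and the class name are placeholders:

```csharp
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

class ExtractFormattedTextExample
{
    static void Main()
    {
        // "sample.docx" is a placeholder path; point it at your own document.
        using (Parser parser = new Parser("sample.docx"))
        {
            // Ask for the document text as HTML so that style formatting is kept.
            var options = new FormattedTextOptions(FormattedTextMode.Html);

            // Assumption: the reader is null when formatted extraction is unsupported.
            using (TextReader reader = parser.GetFormattedText(options))
            {
                Console.WriteLine(reader == null
                    ? "Formatted text extraction isn't supported"
                    : reader.ReadToEnd());
            }
        }
    }
}
```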

The FormattedTextOptions class provides the options used for formatted text extraction, such as the extraction mode. I set the mode to HTML, so the document text is extracted as HTML.

The GetFormattedText() method of the Parser class extracts formatted text from the specified document.

Extract Formatted Text from Pages using C#

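You can parse a Word document and extract formatted text from specific pages by following these steps:

  • Create an instance of the Parser class
  • Specify the file path
  • Use the Features property to check that formatted text extraction is supported
  • Get the number of pages in the document
  • Call the GetFormattedText method for each page, passing the page index
  • Read each result from the returned TextReader object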

The following code sample shows how to extract formatted text from pages one by one using C#.
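
A minimal sketch of page-by-page extraction, assuming the GroupDocs.Parser NuGet package is installed. The calls `GetDocumentInfo()` and `PageCount`, the `Features.FormattedText` flag, and the zero-based page index are not described in this article; they are used here as I understand the library's API, so verify them against the API reference. The path `sample.docx` and the class name are placeholders:

```csharp
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

class ExtractFormattedTextFromPagesExample
{
    static void Main()
    {
        // "sample.docx" is a placeholder path; point it at your own document.
        using (Parser parser = new Parser("sample.docx"))
        {
            // Stop early if the document does not support formatted text extraction.
            if (!parser.Features.FormattedText)
            {
                Console.WriteLine("Formatted text extraction isn't supported");
                return;
            }

            // Assumption: GetDocumentInfo exposes the page count of the document.
            var info = parser.GetDocumentInfo();
            if (info == null || info.PageCount == 0)
            {
                Console.WriteLine("The document has no pages");
                return;
            }

            var options = new FormattedTextOptions(FormattedTextMode.Html);

            // Assumption: page indexes are zero-based.
            for (int page = 0; page < info.PageCount; page++)
            {
                Console.WriteLine("Page {0}/{1}", page + 1, info.PageCount);

                using (TextReader reader = parser.GetFormattedText(page, options))
                {
                    Console.WriteLine(reader == null ? string.Empty : reader.ReadToEnd());
                }
            }
        }
    }
}
```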

The Parser class provides a Features property, which is an instance of the Features class. Use it to check whether a feature is supported for the document. You may read more about supported features in the “Get Supported Features” section.

Get a Free License

You can try the API without evaluation limitations by requesting a free temporary license.

Conclusion

In this article, you have learned how to extract text from Word documents using C#. You can learn more about the GroupDocs.Parser for .NET API from the documentation. If you have any questions, please feel free to contact us on the forum.