Extract Text from Word Documents using Java

You may need to extract text from Word documents, for example to index, search, or analyze their content. As a Java developer, you can extract text from DOC or DOCX files programmatically. In this article, you will learn how to extract text from Word documents using Java.

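The following topics are covered in this article:

  • Java API to Extract Text from Word Documents
  • Extract Text from Word Documents using Java
  • Extract Text from Specific Pages of a Word Document using Java
  • Get Highlight from Word Documents using Java
  • Extract Formatted Text from DOCX using Java
  • Extract Text by Table of Contents using Java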

Java API to Extract Text from Word Documents

To extract text from DOC or DOCX files, we will use the GroupDocs.Parser for Java API. It extracts text, metadata, and images from popular file formats such as Word, PDF, Excel, and PowerPoint. It also supports extracting raw, formatted, and structured text from files in the supported formats.

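You can download the JAR of the API, or add the following configuration to the pom.xml of your Maven-based Java application to try the code examples in this article. The repository entry belongs inside the <repositories> element and the dependency entry inside the <dependencies> element.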

<repository>
	<id>GroupDocsJavaAPI</id>
	<name>GroupDocs Java API</name>
	<url>https://repository.groupdocs.com/repo/</url>
</repository>
<dependency>
	<groupId>com.groupdocs</groupId>
	<artifactId>groupdocs-parser</artifactId>
	<version>21.2</version> 
</dependency>

Extract Text from Word Documents using Java

You can parse any Word document and extract text by following the simple steps mentioned below:

  • Firstly, load the DOCX file using the Parser class.
  • Then, call the Parser.getText() method to extract text from the loaded document.
  • Get results of the Parser.getText() method in the TextReader class object.
  • Finally, call the TextReader.readToEnd() method to read all characters from the current position to the end of the text reader and return them as one string.

The following code sample shows how to extract text from a DOCX file using Java.
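The steps above can be sketched as follows. This is a minimal example rather than a definitive implementation: it assumes the groupdocs-parser dependency is on the classpath, the file name sample.docx is a placeholder, and the package names reflect the 21.x releases and should be checked against the API reference for your version.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        // "sample.docx" is a placeholder path; replace it with your own document
        try (Parser parser = new Parser("sample.docx")) {
            // getText() returns null when text extraction is not supported for the format
            try (TextReader reader = parser.getText()) {
                System.out.println(reader == null
                        ? "Text extraction isn't supported"
                        : reader.readToEnd());
            }
        }
    }
}
```

Both the parser and the reader hold resources, so the sketch wraps them in try-with-resources blocks to make sure they are closed.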

Extract Text from Specific Pages of a Word Document using Java

You can also parse a Word document and extract text from a specific page, or from each page in turn.

The following code sample shows how to extract text from pages one by one using Java.
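A minimal sketch of page-by-page extraction is given below. It assumes the groupdocs-parser dependency is on the classpath and uses a placeholder file name, sample.docx. It relies on Parser.getFeatures(), Parser.getDocumentInfo(), and the Parser.getText(int) overload that takes a zero-based page index; verify these calls and the package names against the API reference for your version.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import com.groupdocs.parser.options.IDocumentInfo;

public class ExtractTextByPage {
    public static void main(String[] args) throws Exception {
        // "sample.docx" is a placeholder path; replace it with your own document
        try (Parser parser = new Parser("sample.docx")) {
            // Stop early if the format does not support text extraction
            if (!parser.getFeatures().isText()) {
                System.out.println("Text extraction isn't supported");
                return;
            }

            IDocumentInfo info = parser.getDocumentInfo();
            int pageCount = info.getPageCount();
            if (pageCount == 0) {
                System.out.println("The document has no pages");
                return;
            }

            // Page indexes are zero-based
            for (int page = 0; page < pageCount; page++) {
                System.out.println(String.format("Page %d/%d", page + 1, pageCount));
                try (TextReader reader = parser.getText(page)) {
                    System.out.println(reader.readToEnd());
                }
            }
        }
    }
}
```

To read a single page only, call parser.getText(pageIndex) once with the index you need instead of looping.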

Get Highlight from Word Documents using Java

A highlight is a short fragment of text, typically used in search functionality to show the context around a found term. You can extract a highlight from a document at a given position.

The following code sample shows how to extract a highlight from a document using Java.
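A minimal sketch of highlight extraction is given below. It assumes the groupdocs-parser dependency is on the classpath and uses a placeholder file name, sample.docx. The start position (0), the direction flag (true), and the maximum length (8) are illustrative values chosen for this example, not values prescribed by the API.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.HighlightItem;
import com.groupdocs.parser.options.HighlightOptions;

public class ExtractHighlight {
    public static void main(String[] args) throws Exception {
        // "sample.docx" is a placeholder path; replace it with your own document
        try (Parser parser = new Parser("sample.docx")) {
            // Start at position 0, read to the right of it (true),
            // and limit the highlight to 8 characters
            HighlightItem highlight = parser.getHighlight(0, true, new HighlightOptions(8));

            // getHighlight() returns null when highlight extraction is not supported
            if (highlight == null) {
                System.out.println("Highlight extraction isn't supported");
                return;
            }

            System.out.println(String.format("At %d: %s",
                    highlight.getPosition(), highlight.getText()));
        }
    }
}
```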

The sample prints the position and text of the extracted highlight, for example:

At 0: Overview

Extract Formatted Text from DOCX using Java

You can parse Word documents and extract text without losing the style formatting.

The following code sample shows how to extract formatted text from a DOCX file using Java.
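A minimal sketch of formatted text extraction is given below. It assumes the groupdocs-parser dependency is on the classpath and uses a placeholder file name, sample.docx. HTML is chosen as the output mode purely for illustration; FormattedTextMode offers other modes, such as Markdown.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import com.groupdocs.parser.options.FormattedTextMode;
import com.groupdocs.parser.options.FormattedTextOptions;

public class ExtractFormattedText {
    public static void main(String[] args) throws Exception {
        // "sample.docx" is a placeholder path; replace it with your own document
        try (Parser parser = new Parser("sample.docx")) {
            // Ask for the text as HTML so that styles such as bold and headings are preserved
            FormattedTextOptions options = new FormattedTextOptions(FormattedTextMode.Html);

            // getFormattedText() returns null when formatted extraction is not supported
            try (TextReader reader = parser.getFormattedText(options)) {
                System.out.println(reader == null
                        ? "Formatted text extraction isn't supported"
                        : reader.readToEnd());
            }
        }
    }
}
```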

Extract Text by Table of Contents using Java

You can extract text from the document by the table of contents by following the simple steps mentioned below:

  • Firstly, load the DOCX file using the Parser class.
  • Then, call the Parser.getToc() method to extract a table of contents as a collection of TocItem class objects. The TocItem represents the item which is used in the table of contents extraction functionality.
  • Now, check if the collection is not null.
  • Then, iterate over the TocItem collection and call the TocItem.extractText() method to extract the text of the section to which each TocItem object refers.
  • Get results in the TextReader class object.
  • Finally, call the TextReader.readToEnd() method to read all the text.

The following code sample shows how to extract text by the table of contents from Word documents using Java.
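The steps above can be sketched as follows. This is a minimal example under stated assumptions: the groupdocs-parser dependency is on the classpath, and sample.docx is a placeholder for a document that actually contains a table of contents.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import com.groupdocs.parser.data.TocItem;

public class ExtractTextByToc {
    public static void main(String[] args) throws Exception {
        // "sample.docx" is a placeholder path; replace it with your own document
        try (Parser parser = new Parser("sample.docx")) {
            Iterable<TocItem> tocItems = parser.getToc();

            // getToc() returns null when table of contents extraction is not supported
            if (tocItems == null) {
                System.out.println("Table of contents extraction isn't supported");
                return;
            }

            for (TocItem item : tocItems) {
                // Print the text of the section this item points to
                try (TextReader reader = item.extractText()) {
                    System.out.println("----");
                    System.out.println(reader.readToEnd());
                }
            }
        }
    }
}
```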

Get a Free License

You can try the API without evaluation limitations by requesting a free temporary license.

Conclusion

In this article, you have learned how to extract text from Word documents using Java, both for the whole document and page by page. You have also seen how to extract formatted text from a DOCX file, how to extract text by the table of contents, and how to extract a highlight from a document. You can learn more about the GroupDocs.Parser for Java API in the documentation. If you have any questions, please feel free to contact us on the forum.