Applications often need to extract text from Word documents, for example to search or process their content. As a Java developer, you can extract text from DOC or DOCX files programmatically. In this article, you will learn how to extract text from Word documents using Java.
The following topics are covered in this article:
- Java API to Extract Text from Word Documents
- Extract Text from Word Documents using Java
- Extract Text from Specific Pages of a Word Document using Java
- Get Highlight from Word Documents using Java
- Extract Formatted Text from DOCX using Java
- Extract Text by Table of Contents using Java
Java API to Extract Text from Word Documents
To extract text from DOC or DOCX files, we will use the GroupDocs.Parser for Java API. It extracts text, metadata, and images from popular file formats such as Word, PDF, Excel, and PowerPoint. It also supports extraction of raw, formatted, and structured text from files in the supported formats.
You can download the JAR of the API, or add the following repository and dependency entries to the pom.xml of your Maven-based Java application (inside its repositories and dependencies elements, respectively) to try the code examples below.
<repository>
    <id>GroupDocsJavaAPI</id>
    <name>GroupDocs Java API</name>
    <url>https://repository.groupdocs.com/repo/</url>
</repository>
<dependency>
    <groupId>com.groupdocs</groupId>
    <artifactId>groupdocs-parser</artifactId>
    <version>21.2</version>
</dependency>
Extract Text from Word Documents using Java
You can parse any Word document and extract text by following the simple steps mentioned below:
- Firstly, load the DOCX file using the Parser class.
- Then, call the Parser.getText() method to extract text from the loaded document.
- Get results of the Parser.getText() method in the TextReader class object.
- Finally, call the TextReader.readToEnd() method to read all characters from the current position to the end of the text reader and return them as one string.
The following code sample shows how to extract text from a DOCX file using Java.
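A minimal sketch of these steps, assuming the GroupDocs.Parser dependency shown earlier is on the classpath. The file name sample.docx and the class name ExtractText are placeholders, not part of the API.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        // "sample.docx" is a placeholder path; point it at your own document.
        try (Parser parser = new Parser("sample.docx")) {
            // Extract the document text into a reader.
            // The reader is null when text extraction is not supported.
            try (TextReader reader = parser.getText()) {
                System.out.println(reader == null
                        ? "Text extraction isn't supported"
                        : reader.readToEnd());
            }
        }
    }
}
```

Both Parser and TextReader are opened in try-with-resources blocks so that the underlying file is released even if reading fails.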
Extract Text from Specific Pages of a Word Document using Java
You can parse a Word document and extract text from a specific page by following the simple steps mentioned below:
- Firstly, load the DOCX file using the Parser class.
- Then, use Parser.getFeatures().isText() to check whether the document supports the text extraction feature. See the API documentation for more about supported features.
- Now, call the Parser.getDocumentInfo() method to get general information about the document, such as file type, page count, and size.
- Get results of the Parser.getDocumentInfo() method in the IDocumentInfo interface object.
- Then, check that IDocumentInfo.getPageCount(), which returns the total number of document pages, is not zero.
- Iterate over all the pages and call the Parser.getText() method with each page index to extract that page's text into a TextReader class object.
- Finally, show results by calling the TextReader.readToEnd() method to read the extracted text.
The following code sample shows how to extract text from pages one by one using Java.
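A sketch of the page-by-page steps, under the same assumptions as before: the GroupDocs.Parser dependency is available, and sample.docx and the class name ExtractTextByPage are placeholders. Page indexes are treated as zero-based here.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import com.groupdocs.parser.options.IDocumentInfo;

public class ExtractTextByPage {
    public static void main(String[] args) throws Exception {
        // "sample.docx" is a placeholder path.
        try (Parser parser = new Parser("sample.docx")) {
            // Stop early if the format does not support text extraction.
            if (!parser.getFeatures().isText()) {
                System.out.println("Document doesn't support text extraction.");
                return;
            }

            // General document information, including the page count.
            IDocumentInfo info = parser.getDocumentInfo();
            if (info.getPageCount() == 0) {
                System.out.println("Document has no pages.");
                return;
            }

            // Extract and print the text of each page in turn.
            for (int page = 0; page < info.getPageCount(); page++) {
                System.out.println(String.format("Page %d/%d", page + 1, info.getPageCount()));
                try (TextReader reader = parser.getText(page)) {
                    System.out.println(reader.readToEnd());
                }
            }
        }
    }
}
```

To read a single page instead of all of them, call parser.getText() once with the index of that page after the same checks.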
Get Highlight from Word Documents using Java
A highlight is a short fragment of text, typically used in search functionality to show the context around the found text. You can extract a highlight from a document by following the simple steps mentioned below:
- Firstly, load the DOCX file using the Parser class.
- Create an instance of the HighlightOptions class and pass the maximum length to its constructor to extract a fixed-length highlight.
- Then, call the Parser.getHighlight() method with the start position and the HighlightOptions object to extract a highlight from the document as an object of the HighlightItem class.
- Finally, call the HighlightItem.getPosition() and HighlightItem.getText() methods to get the position and text of the highlight.
The following code sample shows how to extract a highlight from a document using Java.
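A sketch of these steps, assuming the GroupDocs.Parser dependency is available. The file name, the class name, and the maximum length of 8 are illustrative choices. As far as I know, the getHighlight() overload also takes a boolean that selects the extraction direction (true for forward from the start position); treat that argument as an assumption and confirm it against the API reference.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.HighlightItem;
import com.groupdocs.parser.options.HighlightOptions;

public class ExtractHighlight {
    public static void main(String[] args) throws Exception {
        // "sample.docx" is a placeholder path.
        try (Parser parser = new Parser("sample.docx")) {
            // Extract a highlight of at most 8 characters starting at position 0.
            // The boolean argument (assumed: true = forward direction) should be
            // checked against the API reference.
            HighlightItem highlight = parser.getHighlight(0, true, new HighlightOptions(8));

            // A null result means highlight extraction is not supported.
            if (highlight == null) {
                System.out.println("Highlight extraction isn't supported");
                return;
            }

            // Print the position and the text of the highlight.
            System.out.println(String.format("At %d: %s",
                    highlight.getPosition(), highlight.getText()));
        }
    }
}
```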
Printed as its position followed by its text, an extracted highlight looks like this:
At 0: Overview
Extract Formatted Text from DOCX using Java
You can parse Word documents and extract text without losing the style formatting by following the simple steps mentioned below:
- Firstly, load the DOCX file using the Parser class.
- Define the FormattedTextOptions and set the FormattedTextMode to HTML, which extracts the document text as HTML-formatted text.
- Then, call the Parser.getFormattedText() method to extract formatted text.
- Get results of the Parser.getFormattedText() method in the TextReader class object.
- Finally, call the TextReader.readToEnd() method to read all the text.
The following code sample shows how to extract formatted text from a DOCX file using Java.
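A sketch of the formatted-text steps, assuming the GroupDocs.Parser dependency is available; sample.docx and the class name ExtractFormattedText are placeholders.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import com.groupdocs.parser.options.FormattedTextMode;
import com.groupdocs.parser.options.FormattedTextOptions;

public class ExtractFormattedText {
    public static void main(String[] args) throws Exception {
        // "sample.docx" is a placeholder path.
        try (Parser parser = new Parser("sample.docx")) {
            // Request HTML output so that style formatting is preserved as markup.
            FormattedTextOptions options = new FormattedTextOptions(FormattedTextMode.Html);

            // The reader is null when formatted text extraction is not supported.
            try (TextReader reader = parser.getFormattedText(options)) {
                System.out.println(reader == null
                        ? "Formatted text extraction isn't supported"
                        : reader.readToEnd());
            }
        }
    }
}
```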
Extract Text by Table of Contents using Java
You can extract text from the document by the table of contents by following the simple steps mentioned below:
- Firstly, load the DOCX file using the Parser class.
- Then, call the Parser.getToc() method to extract a table of contents as a collection of TocItem class objects. The TocItem represents the item which is used in the table of contents extraction functionality.
- Now, check that the collection is not null.
- Then, iterate over the collection of TocItem objects and call the TocItem.extractText() method on each to extract the text of the document section to which the TocItem object refers.
- Get the results in a TextReader class object.
- Finally, call the TextReader.readToEnd() method to read all the text.
The following code sample shows how to extract text by the table of contents from Word documents using Java.
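A sketch of the table-of-contents steps, assuming the GroupDocs.Parser dependency is available; sample.docx and the class name ExtractTextByToc are placeholders, and the "----" separator is only there to make the output easier to read.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;
import com.groupdocs.parser.data.TocItem;

public class ExtractTextByToc {
    public static void main(String[] args) throws Exception {
        // "sample.docx" is a placeholder path.
        try (Parser parser = new Parser("sample.docx")) {
            // Extract the table of contents as a collection of TocItem objects.
            Iterable<TocItem> tocItems = parser.getToc();

            // A null collection means table of contents extraction is not supported.
            if (tocItems == null) {
                System.out.println("Table of contents extraction isn't supported");
                return;
            }

            // Print the text of the section each TocItem refers to.
            for (TocItem tocItem : tocItems) {
                try (TextReader reader = tocItem.extractText()) {
                    System.out.println("----");
                    System.out.println(reader.readToEnd());
                }
            }
        }
    }
}
```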
Get a Free License
You can try the API without evaluation limitations by requesting a free temporary license.
Conclusion
In this article, you have learned how to extract text from Word documents using Java, including how to extract text from specific pages and how to extract formatted text from a DOCX file programmatically. This article also explained how to extract text by the table of contents and how to extract a highlight from a document. You can learn more about the GroupDocs.Parser for Java API in its documentation. If you have any questions, please feel free to contact us on the forum.