Reading and parsing HTML documents lets you extract information from online content, whether you’re building a web scraper, analyzing web content, or simply pulling some text from a web page. This article covers how to read HTML in Java.
Configure HTML Reader API in Java
You can download the API from the Downloads page, or add Conholdate.Total for Java to your project using the following Maven configuration:
<dependency>
    <groupId>com.conholdate</groupId>
    <artifactId>conholdate-total</artifactId>
    <version>23.8</version>
    <type>pom</type>
</dependency>
Why Read HTML in Java?
Before we dive into the technical aspects, let’s briefly discuss why reading HTML in Java is important:
- Data Extraction: HTML is the basis of the web, and many websites store valuable information within their HTML structure. By parsing and extracting data from HTML documents, you can automate data collection and analysis.
- Web Scraping: When building web scraping tools, Java provides the capability to access and scrape HTML content, enabling you to gather data from websites for purposes such as market research or data analysis.
- Custom Web Applications: Java can be used to create custom web applications and services that interact with HTML content, allowing you to build web-based tools or integrate external data sources.
Read an HTML File in Java
HTML (Hypertext Markup Language) is the standard language for creating web pages. When working with HTML in Java, your goal might be to extract data from a web page, manipulate its structure, or analyze its content. To accomplish these tasks, you need to read and parse HTML effectively. You can read an HTML file in Java by following the steps below:
- Load the input HTML file using the HTMLDocument class.
- Read the markup by calling getOuterHTML() on the document element.
The following sample code shows how to read HTML files in Java:
String documentPath = "document.html";
// Load an HTML file
var document = new com.aspose.html.HTMLDocument(documentPath);
// Write the document content to the output stream
System.out.println(document.getDocumentElement().getOuterHTML());
Navigate HTML File to Read HTML Contents in Java
You can iterate through HTML data by navigating through different nodes as outlined in the following steps:
- Load the HTML content with the HTMLDocument class.
- Access the first child of the document body.
- Navigate the subsequent sibling nodes and read their contents.
The code snippet below elaborates on how to navigate HTML to read HTML in Java:
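As a minimal sketch of the traversal pattern in the steps above, the example below walks the direct children of the body with getFirstChild() and getNextSibling(). It is an illustration under assumptions rather than the vendor's own sample: it uses the JDK's built-in W3C DOM classes (org.w3c.dom) as a stand-in for HTMLDocument so that it runs without extra dependencies, it expects well-formed XHTML input, and the class name BodyNodeWalker and method readBodyChildren are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: walks the direct children of <body> using the
// JDK's W3C DOM API. Assumes the input is well-formed XHTML.
public class BodyNodeWalker {

    // Returns one "tagName: text" entry per element directly under <body>.
    public static List<String> readBodyChildren(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xhtml)));

        List<String> lines = new ArrayList<>();
        Node body = document.getElementsByTagName("body").item(0);
        if (body == null) {
            return lines; // no <body> element, nothing to walk
        }

        // Start at the first child and follow the sibling chain to the end.
        for (Node node = body.getFirstChild(); node != null; node = node.getNextSibling()) {
            // Skip text and comment nodes (for example whitespace between tags).
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                lines.add(node.getNodeName() + ": " + node.getTextContent().trim());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><h1>Title</h1><p>First <b>paragraph</b>.</p></body></html>";
        for (String line : readBodyChildren(xhtml)) {
            System.out.println(line);
        }
    }
}
```

The same loop shape (first child, then next sibling until null) should carry over to any DOM-style document object. Real-world HTML is often not well-formed XML, which is one reason to use a dedicated HTML parser instead of the XML parser used in this sketch.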
Read HTML File as String in Java
You can parse data from HTML files as a string and export it to text format with the steps listed below:
- Create an object of the HTMLDocument class.
- Read the text content of the document body.
- Write the text to an output TXT file.
The following code sample explains how to read an HTML file as String in Java:
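import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// dataDir is assumed to hold the path of the folder that contains document.html
// Initialize an HTMLDocument object from a file path
com.aspose.html.HTMLDocument document = new com.aspose.html.HTMLDocument(dataDir + "document.html");
// Read the text content of the document body
String text = document.getBody().getTextContent();
// Export the text to a TXT file, encoded explicitly as UTF-8 (Files.write throws IOException)
String path = dataDir + "webpage.txt";
Files.write(Paths.get(path), text.getBytes(StandardCharsets.UTF_8));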
Read HTML Online
You can parse HTML pages to extract text or images online without installing any specialized application. Use the HTML Reader and Parser Online tool and follow its simple steps to read HTML data.
Free Evaluation License
You may request a free temporary license to evaluate the API to its full capacity.
Summing Up
Reading HTML in Java is an essential skill for various web-related tasks, from data extraction and web scraping to building custom web applications. By following the steps and examples outlined in this blog post, you can start working with HTML in Java and unlock a wide range of possibilities for your projects. In case of any queries, please feel free to reach out to us at the forum.