Read HTML in Java

Reading and parsing HTML documents helps you extract information from online content, whether you're building a web scraper, analyzing web content, or simply need to pull some text from a web page. This article covers how to read HTML in Java.

Configure HTML Reader API in Java

You can download the API from the Downloads page or add Conholdate.Total for Java to your project using the following Maven configuration:

<dependency>
    <groupId>com.conholdate</groupId>
    <artifactId>conholdate-total</artifactId>
    <version>23.8</version>
    <type>pom</type>
</dependency>

Why Read HTML in Java?

Before we dive into the technical aspects, let’s briefly discuss why reading HTML in Java is important:

Data Extraction: HTML is the basis of the web, and many websites store valuable information within their HTML structure. By parsing and extracting data from HTML documents, you can automate data collection and analysis.

Web Scraping: When building web scraping tools, Java provides the capability to access and scrape HTML content, enabling you to gather data from websites for various purposes, such as market research or data analysis.

Custom Web Applications: Java can be used to create custom web applications and services that interact with HTML content, allowing you to build web-based tools or integrate external data sources.

Read an HTML File in Java

HTML (Hypertext Markup Language) is the standard language for creating web pages. When working with HTML in Java, your goal might be to extract data from a web page, manipulate its structure, or analyze its content. To accomplish these tasks, you need to read and parse HTML effectively. You can read an HTML file in Java by following the steps below:

  • Load the input HTML file using the HTMLDocument class.
  • Read the HTML markup by calling getOuterHTML() on the document element.

The following sample code shows how to read HTML files in Java:

String documentPath = "document.html";
// Load an HTML file
var document = new com.aspose.html.HTMLDocument(documentPath);
// Print the document's markup to standard output
System.out.println(document.getDocumentElement().getOuterHTML());
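To illustrate what "outer" markup means (an element serialized together with its own tags), here is a minimal sketch that uses only the JDK's built-in XML parser and transformer rather than the HTMLDocument class. It assumes well-formed, XML-style markup, which real-world HTML often is not, and the class and method names (OuterMarkupSketch, parse, outerMarkup) are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of any HTML library.
final class OuterMarkupSketch {

    // Parses well-formed (XML-style) markup into a W3C DOM document.
    static Document parse(String wellFormedMarkup) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
    }

    // Serializes an element, including its own start and end tags.
    static String outerMarkup(Element element) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Force plain XML output and skip the <?xml ...?> declaration.
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(element), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><p>Hello</p></body></html>");
        System.out.println(outerMarkup(doc.getDocumentElement()));
    }
}
```

A dedicated HTML parser remains the better choice for pages that are not well-formed, since the JDK XML parser rejects unclosed tags and similar constructs.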

You can iterate through HTML data by navigating through different nodes as outlined in the following steps:

  • Load the HTML content with the HTMLDocument class.
  • Access the first child of the BODY element.
  • Move through the sibling nodes and read the content of each one.

The code snippet below shows how to navigate HTML nodes and read their content in Java:

// Prepare HTML code
var htmlCode = "<span>Hello</span> <span>World!</span>";
// Initialize a document from the prepared code
var document = new com.aspose.html.HTMLDocument(htmlCode, ".");
// Get the reference to the first child (first SPAN) of the BODY
var element = document.getBody().getFirstChild();
System.out.println(element.getTextContent()); // output: Hello
// Get the reference to the whitespace between html elements
element = element.getNextSibling();
System.out.println(element.getTextContent()); // output: ' '
// Get the reference to the second SPAN element
element = element.getNextSibling();
System.out.println(element.getTextContent()); // output: World!
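The same first-child and next-sibling traversal can be sketched with the org.w3c.dom interfaces that ship with the JDK, which may help when experimenting without the library on the classpath. This is only an illustration under two assumptions: the markup is well-formed XML, and it is wrapped in an explicit body element because the JDK parser does not add one. SiblingWalkSketch and childTexts are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; uses only the JDK's W3C DOM implementation.
final class SiblingWalkSketch {

    // Returns the text content of every direct child of the root element,
    // including whitespace-only text nodes that sit between elements.
    static List<String> childTexts(String wellFormedMarkup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
        List<String> texts = new ArrayList<>();
        // Start at the first child and follow the sibling chain until it ends.
        for (Node node = doc.getDocumentElement().getFirstChild();
                node != null;
                node = node.getNextSibling()) {
            texts.add(node.getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String markup = "<body><span>Hello</span> <span>World!</span></body>";
        for (String text : childTexts(markup)) {
            // Brackets make the whitespace-only node visible.
            System.out.println("[" + text + "]");
        }
    }
}
```

Looping until getNextSibling() returns null avoids hard-coding the number of nodes, so the walk also works when the body holds more than three children.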

Read HTML File as String in Java

You can parse data from HTML files as a string and export it to text format with the steps listed below:

  • Create an object of the HTMLDocument class.
  • Parse the text contents of the HTML data.
  • Export the output TXT file.

The following code sample explains how to read an HTML file as String in Java:

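// dataDir is assumed to hold the path of the folder that contains document.html
// Requires java.nio.file.Files, java.nio.file.Paths and java.nio.charset.StandardCharsets
// Initialize an HTMLDocument object from the file path
com.aspose.html.HTMLDocument document = new com.aspose.html.HTMLDocument(dataDir + "document.html");
// Read the text content of the document body
String text = document.getBody().getTextContent();
// Export the extracted text to a TXT file, using UTF-8 explicitly
String path = dataDir + "webpage.txt";
Files.write(Paths.get(path), text.getBytes(StandardCharsets.UTF_8));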
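If you only need the raw markup as a String, with tags left in place and no parsing, the JDK alone is enough. The sketch below assumes Java 11 or later (for Files.readString) and a UTF-8 encoded file; RawHtmlStringSketch and readMarkup are hypothetical names, and the temporary file exists only to keep the example self-contained.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; reads markup without interpreting it.
final class RawHtmlStringSketch {

    // Returns the file's full contents, tags included, decoded as UTF-8.
    static String readMarkup(Path htmlFile) throws IOException {
        return Files.readString(htmlFile, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Create a small sample file so the example runs on its own.
        Path sample = Files.createTempFile("sample", ".html");
        try {
            Files.writeString(sample, "<p>Hello</p>", StandardCharsets.UTF_8);
            System.out.println(readMarkup(sample));
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```

Unlike getTextContent(), this returns the markup verbatim, so it suits cases where the HTML is passed on to another tool rather than read for its text.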

Read HTML Online

You can parse HTML pages and data to extract text or images online without needing to install any specialized application. Just use the following HTML Reader and Parser Online tool and follow simple steps to read HTML data.

Free Evaluation License

You may request a free temporary license to evaluate the API to its full capacity.

Summing Up

Reading HTML in Java is an essential skill for various web-related tasks, from data extraction and web scraping to building custom web applications. By following the steps and examples outlined in this blog post, you can start working with HTML in Java and unlock a wide range of possibilities for your projects. In case of any queries, please feel free to reach out to us at the forum.

See Also