Document parsing with Jsoup

Last updated : Jul 30, 2023 12:00 AM

jsoup is a Java library for working with HTML and XML markup. It provides an API to extract and manipulate markup data, allowing us to scrape and parse HTML and XML from a URL, file, or string.

Install JSOUP with Maven

If you use Maven to manage project dependencies, insert the code below into the dependencies section of your POM file.

<dependency>
   <groupId>org.jsoup</groupId>
   <artifactId>jsoup</artifactId>
   <version>1.15.3</version>
</dependency>

Install JSOUP with Gradle

implementation 'org.jsoup:jsoup:1.15.3'

This tutorial covers most of the APIs provided by jsoup. Each method is described with a code example. Feel free to download the entire project from the link provided at the end of the tutorial. The sample HTML we use to test the API methods is listed below.

<html>
<head>
<title>Page Title</title>
</head>
<body>
   <div id='toc' class='toc first'>Table of content</div>
   <div id='index' class='toc'>
      <table>
         <tr><td>Name</td><td>Address</td></tr>
      </table>
   </div>
   <div id='dynamic' class='toc'>Original Content</div>
   <custom class='customClass commonClass'>Custom tag</custom>
   <p>Parsed HTML into a doc.</p>
   <a href='https://www/example.com/link1' class='hide'>Link 1</a>
   <a href='https://www/example.com/link2' class='normal'>Link 2</a>
   <a href='https://www/example.com/link3' class='normal'>Link 3</a>
   <a href='https://www/example.com/link4' class='normal'>Link 4</a>
   <a href='https://www/example.com/link5' class='normal'>Link 5</a>
   <div>
      <form action='/submit' name='myinputs' id='form'><input type='text' value='Name'/></form>
   </div>
   <input type='text' name='text box 1'/><button type='submit' class='hidden' value='Submit 1'/>
   <input type='text' name='text box 2'/><button type='submit' value='Submit 2'/>
   <input type='text' name='text box 3'/><button type='submit' value='Submit 3'/>
   <input type='text' name='text box 4'/><button type='submit' class='hidden' value='Submit 4'/>
   <input type='text' name='text box 5'/><button type='submit' value='Submit 5'/>
   <customForm>
      <form action='/submit' name='customInputs' id='custom'>
      <input type='text' value='Lance'/></form>
   </customForm>
</body>
</html>

Before interacting with HTML elements, we must parse the HTML string into a document model. This document is similar to the browser's DOM, or the document object we refer to in JavaScript. jsoup represents it with the org.jsoup.nodes.Document class.

Document document = Jsoup.parse(html);
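
As a rough sketch of the parse step, the example below parses a small HTML string and reads values back from the resulting Document. The class name ParseSketch, the inline HTML, and the base URL https://example.com/ are placeholders made up for illustration. The two-argument overload Jsoup.parse(html, baseUri) is shown on the assumption that the markup contains relative links that should resolve against a known base.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseSketch {
   // Parses an HTML string and returns the text of its <title> element
   static String titleOf(String html) {
      Document doc = Jsoup.parse(html);
      return doc.title();
   }

   // Parses with a base URI so that relative href values can be resolved by absUrl()
   static String firstLinkAbsUrl(String html, String baseUri) {
      Document doc = Jsoup.parse(html, baseUri);
      return doc.select("a").first().absUrl("href");
   }

   public static void main(String[] args) {
      String html = "<html><head><title>Page Title</title></head>"
            + "<body><a href='/link1'>Link 1</a></body></html>";
      System.out.println(titleOf(html));                                 // Page Title
      System.out.println(firstLinkAbsUrl(html, "https://example.com/")); // https://example.com/link1
   }
}
```

Without a base URI, absUrl() can only return a value for links that are already absolute, which is the case for the links in the sample HTML above.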

The code below sets up the scaffolding needed to test our methods. We will add API calls to the parseHtml() method as we progress through this tutorial. Since the jsoup calls in each example can modify the HTML document, we call resetAndReloadDocument() to bring the document object back to its original state before the next example.

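import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JSoup {
   static String html = ""; // Insert the sample HTML above here
   static Document document = Jsoup.parse(html);

   public static void main(String[] args) {
      parseHtml();
   }

   public static void parseHtml() {
      // The test methods below go here
   }

   // Re-parses the original HTML so the next example starts from a clean document
   public static void resetAndReloadDocument() {
      document = Jsoup.parse(html);
   }
}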

Getting page title

document.title() retrieves the string contents of the document's title element.

System.out.println(document.title());

Output :

Page Title

Find elements from the document

document.select(query) finds elements that match the Selector query. Matched elements can include this element and its children.

Elements elements = document.select("a");
Element a1 = elements.get(0);
System.out.println(a1.absUrl("href"));
a1 = elements.get(1);
System.out.println(a1.absUrl("href"));
a1 = elements.get(2);
System.out.println(a1.absUrl("href"));

Output :

https://www/example.com/link1
https://www/example.com/link2
https://www/example.com/link3
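
Fetching elements one index at a time gets repetitive once there are more than a few matches. The sketch below collects the href of every match in one call and narrows the match with a class selector. The class name LinkListSketch and the helper hrefs are hypothetical; the selector "a.normal" assumes the class names used in the sample HTML.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class LinkListSketch {
   // Returns the href value of every element matched by the CSS selector
   static List<String> hrefs(String html, String selector) {
      Document doc = Jsoup.parse(html);
      Elements links = doc.select(selector);
      // eachAttr gathers the attribute from each matched element that has it
      return links.eachAttr("href");
   }

   public static void main(String[] args) {
      String html = "<a href='https://www/example.com/link1' class='hide'>Link 1</a>"
            + "<a href='https://www/example.com/link2' class='normal'>Link 2</a>"
            + "<a href='https://www/example.com/link3' class='normal'>Link 3</a>";
      // "a.normal" matches only anchors whose class attribute contains "normal"
      for (String href : hrefs(html, "a.normal")) {
         System.out.println(href);
      }
   }
}
```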

Find element by id

document.getElementById(id) finds the first element with the matching ID, starting with this element. To get the inner HTML of the element, use document.getElementById(id).html().

System.out.println(document.getElementById("toc"));

Output :

<div id="toc" class="toc first">
 Table of content
</div>
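
getElementById returns null when no element has the requested ID, so chaining .html() directly can throw a NullPointerException. A minimal sketch of a guarded lookup follows; the class name ByIdSketch, the helper innerHtmlOrDefault, and the fallback value are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ByIdSketch {
   // Returns the inner HTML of the element with the given ID, or the fallback if it is missing
   static String innerHtmlOrDefault(String html, String id, String fallback) {
      Document doc = Jsoup.parse(html);
      Element el = doc.getElementById(id);
      return el == null ? fallback : el.html();
   }

   public static void main(String[] args) {
      String html = "<div id='toc' class='toc first'>Table of content</div>";
      System.out.println(innerHtmlOrDefault(html, "toc", "(none)"));     // Table of content
      System.out.println(innerHtmlOrDefault(html, "missing", "(none)")); // (none)
   }
}
```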

Find element by attribute key and value

document.getElementsByAttributeValue(key, value) finds elements that have an attribute with the specific value. Key and value are case insensitive. Use document.getElementsByAttributeValue("class", "customClass commonClass").html() to get inner html.

System.out.println(document.getElementsByAttributeValue("class", "customClass commonClass"));

Output :

<custom class="customClass commonClass">
 Custom tag
</custom>

Replace elements inner html

element.html(html) clears the existing inner HTML and sets the new value as the inner HTML.

Element element = document.getElementById("dynamic");
element.html("Replaced content");
System.out.println(element.toString());

Output :

<div id="dynamic" class="toc">
 Replaced content
</div>

Find element by selector query

document.select(query) finds elements that match the selector query. Matched elements can contain their children.

elements = document.select("custom");
element = elements.get(0);
System.out.println(element.html());
element.html("Replaced custom tag content");
System.out.println(element.html());

Output :

Custom tag
Replaced custom tag content

Append an element to the end of the list

elements.add(Element e) appends the specified element to the end of this list.

elements = document.select("custom");
element = new Element("anotherCustomTag");
element.html("Another custom tag");
elements.add(element);
System.out.println(elements);

Output :

<custom class="customClass commonClass">
 Replaced custom tag content
</custom>
<anotherCustomTag>
 Another custom tag
</anotherCustomTag>
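
One point worth illustrating: Elements is a Java list, so add() changes the list and not the parsed document. The sketch below contrasts adding to the list with attaching the element to the document through appendChild. The class name ListVersusDomSketch and the tag name extra are hypothetical, chosen only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ListVersusDomSketch {
   // Adds a new element to the Elements list only, then checks whether the document contains it
   static boolean inDocumentAfterListAdd(String html) {
      Document doc = Jsoup.parse(html);
      Elements list = doc.select("custom");
      list.add(new Element("extra").text("Extra"));
      // The list grew, but the new element has no parent in the document
      return !doc.select("extra").isEmpty();
   }

   // Attaches a new element to the document body, then checks whether the document contains it
   static boolean inDocumentAfterAppendChild(String html) {
      Document doc = Jsoup.parse(html);
      doc.body().appendChild(new Element("extra").text("Extra"));
      return !doc.select("extra").isEmpty();
   }

   public static void main(String[] args) {
      String html = "<custom class='customClass commonClass'>Custom tag</custom>";
      System.out.println(inDocumentAfterListAdd(html));     // false
      System.out.println(inDocumentAfterAppendChild(html)); // true
   }
}
```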

Add an element to the specified location

elements.add(index, Element) inserts the element at the specified index in this list. The element currently at that position (if any) and all subsequent elements are shifted one position to the right.

elements = document.select("custom");
element = new Element("custom_0").html("Another custom tag 0");
elements.add(0, element);
System.out.println(elements);
System.out.println("---------------------------");
element = new Element("custom_1");
element.html("Another custom tag 1");
elements.add(2, element);
System.out.println(elements);

Output :

<custom_0>
 Another custom tag 0
</custom_0>
<custom class="customClass commonClass">
 Replaced custom tag content
</custom>
---------------------------
<custom_0>
 Another custom tag 0
</custom_0>
<custom class="customClass commonClass">
 Replaced custom tag content
</custom>
<custom_1>
 Another custom tag 1
</custom_1>

Append a collection of elements

elements.addAll(Collection<? extends Element> c) appends all of the DOM elements in the collection to the end of the list.

elements = document.select("custom");
List<Element> collection = new ArrayList<>();
element = new Element("custom_0");
element.html("Another custom tag 0");
collection.add(element);
element = new Element("custom_1");
element.html("Another custom tag 1");
collection.add(element);
elements.addAll(collection);
System.out.println(elements);

Output :

<custom class="customClass commonClass">
 Replaced custom tag content
</custom>
<custom_0>
 Another custom tag 0
</custom_0>
<custom_1>
 Another custom tag 1
</custom_1>

Append a collection of elements to a specified index position

elements.addAll(int index, Collection<? extends Element> c) inserts all of the DOM elements in the collection into the list, starting at the specified index position.

elements = document.select("custom");
collection = new ArrayList<>();
element = new Element("custom_0");
element.html("Another custom tag 0");
collection.add(element);
element = new Element("custom_1");
element.html("Another custom tag 1");
collection.add(element);
elements.addAll(0,collection);
System.out.println(elements);

Output :

<custom_0>
 Another custom tag 0
</custom_0>
<custom_1>
 Another custom tag 1
</custom_1>
<custom class="customClass commonClass">
 Replaced custom tag content
</custom>

Add a class name to an element

elements.addClass(String className) adds the class name to every matched element's class attribute.

elements = document.select("custom");
elements.addClass("dynamicClass");
System.out.println(elements);

Output :

<custom class="customClass commonClass dynamicClass">
 Custom tag
</custom>
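
addClass has a counterpart, removeClass, and together they cover most class-attribute updates. The sketch below applies both and reads the resulting class attribute back. The class name ClassNameSketch and the helper classesAfter are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class ClassNameSketch {
   // Adds one class and removes another on every <custom> element,
   // then returns the class attribute of the first match
   static String classesAfter(String html) {
      Document doc = Jsoup.parse(html);
      Elements els = doc.select("custom");
      els.addClass("dynamicClass");
      els.removeClass("commonClass");
      return els.first().className();
   }

   public static void main(String[] args) {
      String html = "<custom class='customClass commonClass'>Custom tag</custom>";
      System.out.println(classesAfter(html)); // customClass dynamicClass
   }
}
```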

Add element after given element

elements.after(String html) inserts the HTML after each matched element's outer HTML.

elements = document.select("custom");
elements.after("<anotherCustom>Another custom element</anotherCustom>");
System.out.println(document);

Output :

...
<custom class="customClass commonClass">Custom tag</custom>
<anothercustom>Another custom element</anothercustom>
...

Add an element to the end of the inner HTML

elements.append(String html) adds the supplied HTML to the end of each matched element's inner HTML.

elements = document.select("custom");
elements.append("<anotherCustom>Another custom element</anotherCustom>");
System.out.println(elements);

Output :

<custom class="customClass commonClass">
 Custom tag
 <anothercustom>
  Another custom element
 </anothercustom>
</custom>

Get attribute value of an element

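element.attr(String attributeKey) gets the value of the attribute with the given key. If the element does not have that attribute, an empty string is returned. The Elements variant, elements.attr(String attributeKey), returns the value from the first matched element that has the attribute.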

element = document.getElementById("dynamic");
System.out.println(element.attr("class"));

Output :

toc

Set attribute

element.attr(String attributeKey, String attributeValue) sets an attribute value on the element. If the element already has an attribute with the key, its value is updated. Otherwise, the new attribute is added.

element = document.getElementById("dynamic");
element.attr("class", "newClass");
System.out.println(element);

Output :

<div id="dynamic" class="newClass">
 Original Content
</div>
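
The getter and setter forms of attr are often used together with hasAttr and removeAttr. The following sketch runs all four on a single element. The class name AttributeSketch, the data-state attribute, and the pipe-separated summary string are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AttributeSketch {
   // Updates, adds, and removes attributes on the element with id "dynamic",
   // then summarizes the result as "class|data-state|hasId|missingIsEmpty"
   static String describe(String html) {
      Document doc = Jsoup.parse(html);
      Element el = doc.getElementById("dynamic");
      el.attr("class", "newClass");   // key exists, so the value is updated
      el.attr("data-state", "ready"); // key is new, so the attribute is added
      el.removeAttr("id");            // the attribute is dropped from the element
      return el.attr("class") + "|" + el.attr("data-state") + "|"
            + el.hasAttr("id") + "|" + el.attr("missing").isEmpty();
   }

   public static void main(String[] args) {
      String html = "<div id='dynamic' class='toc'>Original Content</div>";
      System.out.println(describe(html)); // newClass|ready|false|true
   }
}
```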

Insert element before given element

elements.before(String html) inserts the supplied HTML before each matched element's outer HTML.

elements = document.select("custom");
elements.before("<before>Insert before</before>");
System.out.println(document);

Output :

...
<before>
   Insert before
  </before>
  <custom class="customClass commonClass">
   Custom tag
  </custom>
...
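
The difference between before, after, and append is easiest to see side by side: the first two add siblings, while append adds a child. The sketch below applies all three to one paragraph and lists the resulting tag names. The class name InsertPositionsSketch, both helpers, and the b, i, and span tags are placeholders chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class InsertPositionsSketch {
   // Applies before/after/append to every <p> and returns the parsed document
   static Document insertAround(String html) {
      Document doc = Jsoup.parse(html);
      Elements target = doc.select("p");
      target.before("<b>before</b>");      // new sibling ahead of <p>
      target.after("<i>after</i>");        // new sibling following <p>
      target.append("<span>inside</span>"); // new last child of <p>
      return doc;
   }

   // Joins the tag names of the direct children of the given element
   static String childTags(Element parent) {
      StringBuilder sb = new StringBuilder();
      for (Element child : parent.children()) {
         sb.append(child.tagName()).append(' ');
      }
      return sb.toString().trim();
   }

   public static void main(String[] args) {
      Document doc = insertAround("<p>text</p>");
      System.out.println(childTags(doc.body()));              // b p i
      System.out.println(childTags(doc.select("p").first())); // span
   }
}
```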

Remove all elements

elements.clear() removes all of the DOM elements from this list. The list will be empty after this call returns.

elements = document.select("custom");
System.out.println(elements);
elements.clear();
System.out.println(elements.isEmpty());

Output :

<custom class="customClass commonClass">
 Custom tag
</custom>
true

Deep copy DOM elements

elements.clone() creates a deep copy of these elements.

elements = document.select("a");
Elements cloned = elements.clone();
System.out.println(cloned);

Output :

<a href="https://www/example.com/link1" class="hide">Link 1</a>
<a href="https://www/example.com/link2" class="normal">Link 2</a>
<a href="https://www/example.com/link3" class="normal">Link 3</a>
<a href="https://www/example.com/link4" class="normal">Link 4</a>
<a href="https://www/example.com/link5" class="normal">Link 5</a>
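
Because the copy is deep, changes made to the cloned elements should not show up in the originals. A small sketch of that behavior follows; the class name CloneSketch and the replacement text "Changed" are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class CloneSketch {
   // Clones the matched links, edits the first clone, and returns
   // the text of the first original link and the first cloned link
   static String[] originalAndCopyText(String html) {
      Document doc = Jsoup.parse(html);
      Elements original = doc.select("a");
      Elements copy = original.clone();
      copy.first().text("Changed"); // only the cloned element is modified
      return new String[] { original.first().text(), copy.first().text() };
   }

   public static void main(String[] args) {
      String html = "<a href='https://www/example.com/link1' class='hide'>Link 1</a>";
      String[] texts = originalAndCopyText(html);
      System.out.println(texts[0]); // Link 1
      System.out.println(texts[1]); // Changed
   }
}
```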

Check if specific element present

elements.contains(Element element) returns true if and only if this list contains at least one element e that equals the specified element.

element = document.select("a").get(1);
elements = document.select("a");
System.out.println(elements.contains(element));
element = document.select("custom").get(0);
System.out.println(elements.contains(element));

Output :

true
false

Check if all elements present

elements.containsAll(Collection<?> c) returns true if this element list contains all of the elements in the specified collection.

elements = document.select("a");
cloned = document.select("a");
System.out.println(elements.containsAll(cloned));

Output :

true

Remove all child nodes from element

elements.empty() removes all child nodes from each matched element. This is similar to setting the inner HTML of each element to empty.

elements = document.select("div");
elements.empty();
System.out.println(elements);

Output :

<div id="toc" class="toc first"></div>
<div id="index" class="toc"></div>
<div id="dynamic" class="toc"></div>
<div></div>

Get an element as an Elements object

elements.eq(int index) gets the nth matched element as an Elements object.

elements = document.select("a");
System.out.println(elements.eq(0));

Output :

<a href="https://www/example.com/link1" class="hide">Link 1</a>

Get the first matched element

elements.first() gets the first matched element.

elements = document.select("a");
System.out.println(elements.first());

Output :

<a href="https://www/example.com/link1" class="hide">Link 1</a>
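
eq and first behave differently when nothing matches, which matters when a selector may come back empty. The sketch below shows both cases. The class name IndexAccessSketch, the index 99, and the comma-separated summary are hypothetical, used only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class IndexAccessSketch {
   // Summarizes eq() and first() as "sizeOfEq1,sizeOfEq99,firstOnEmptyIsNull"
   static String summarize(String html) {
      Elements links = Jsoup.parse(html).select("a");
      Elements second = links.eq(1);      // an Elements object wrapping the second match
      Elements outOfRange = links.eq(99); // an empty Elements object, no exception
      // first() on an empty result returns null rather than throwing
      Element none = Jsoup.parse("<p>no links</p>").select("a").first();
      return second.size() + "," + outOfRange.size() + "," + (none == null);
   }

   public static void main(String[] args) {
      String html = "<a href='https://www/example.com/link1'>Link 1</a>"
            + "<a href='https://www/example.com/link2'>Link 2</a>";
      System.out.println(summarize(html)); // 1,0,true
   }
}
```

By contrast, elements.get(99) on the same list would throw an IndexOutOfBoundsException, since it is the plain list accessor.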

Check if at least one element has a specified attribute set

elements.hasAttr(String attributeKey) checks if any of the matched elements have this attribute set.

elements = document.select("div");
System.out.println(elements.hasAttr("class"));

Output :

true

Check if the specified class exists

elements.hasClass(String className) checks if any of the matched elements have this class name set in their class attribute.

elements = document.select("div");
System.out.println(elements.hasClass("toc"));
System.out.println(elements.hasClass("cot"));

Output :

true
false

Get combined inner HTML of elements

elements.html() retrieves the combined inner HTML of all matched elements.

elements = document.select("div");
System.out.println(elements.html());

Output :

Table of content
<table>
 <tbody>
  <tr>
   <td>Name</td>
   <td>Address</td>
  </tr>
 </tbody>
</table>
Original Content
<form action="/submit" name="myinputs" id="form">
 <input type="text" value="Name">
</form>
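
When only the visible text is needed, text() is usually a better fit than html(), since it drops the markup. The sketch below returns both views of the same element for comparison. The class name HtmlVersusTextSketch and the inline HTML with a nested b tag are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class HtmlVersusTextSketch {
   // Returns the text-only view and the inner-HTML view of the first <div>
   static String[] views(String html) {
      Element div = Jsoup.parse(html).select("div").first();
      return new String[] { div.text(), div.html() };
   }

   public static void main(String[] args) {
      String html = "<div id='toc'><b>Table</b> of content</div>";
      String[] v = views(html);
      System.out.println(v[0]); // text only: Table of content
      System.out.println(v[1]); // inner HTML, which still contains the <b> tag
   }
}
```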

Set inner HTML of all elements

elements.html(String html) sets the inner HTML of each matched element.

elements = document.select("div");
System.out.println(elements.html("New HTML"));

Output :

<div id="toc" class="toc first">New HTML</div>
<div id="index" class="toc">New HTML</div>
<div id="dynamic" class="toc">New HTML</div>
<div>New HTML</div>

By: Lance
