jsoup is a Java library for working with HTML and XML markup. It provides an API to extract and manipulate data, letting us fetch and parse HTML and XML from a URL, file, or string.
If you use Maven to manage project dependencies, add the following dependency to your pom.xml file:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.15.3</version>
</dependency>
If you use Gradle, add the following line to the dependencies block of your build file:
implementation 'org.jsoup:jsoup:1.15.3'
This tutorial covers most of the APIs provided by jsoup. Each method is illustrated with a code example. Feel free to download the entire project from the link provided at the end of the tutorial. The sample HTML we use to test the API methods is listed below.
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div id='toc' class='toc first'>Table of content</div>
<div id='index' class='toc'>
<table>
<tr><td>Name</td><td>Address</td></tr>
</table>
</div>
<div id='dynamic' class='toc'>Original Content</div>
<custom class='customClass commonClass'>Custom tag</custom>
<p>Parsed HTML into a doc.</p>
<a href='https://www/example.com/link1' class='hide'>Link 1</a>
<a href='https://www/example.com/link2' class='normal'>Link 2</a>
<a href='https://www/example.com/link3' class='normal'>Link 3</a>
<a href='https://www/example.com/link4' class='normal'>Link 4</a>
<a href='https://www/example.com/link5' class='normal'>Link 5</a>
<div>
<form action='/submit' name='myinputs' id='form'><input type='text' value='Name'/></form>
</div>
<input type='text' name='text box 1'/><button type='submit' class='hidden' value='Submit 1'/>
<input type='text' name='text box 2'/><button type='submit' value='Submit 2'/>
<input type='text' name='text box 3'/><button type='submit' value='Submit 3'/>
<input type='text' name='text box 4'/><button type='submit' class='hidden' value='Submit 4'/>
<input type='text' name='text box 5'/><button type='submit' value='Submit 5'/>
<customForm>
<form action='/submit' name='customInputs' id='custom'>
<input type='text' value='Lance'/></form>
</customForm>
</body>
</html>
Before interacting with HTML elements, we must parse the HTML string into a document model.
This document is similar to the browser's DOM, or the document object we refer to in JavaScript.
jsoup's document is derived from
Document document = Jsoup.parse(html);
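As a minimal sketch of this parsing step, the snippet below parses a short HTML string and reads back the title and the body text. The class name ParseFromString and the sample markup are illustrative assumptions, not part of the tutorial project.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseFromString {
    // Parses an HTML string into a jsoup Document.
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title>Page Title</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>");
        System.out.println(doc.title());       // text of the <title> element
        System.out.println(doc.body().text()); // combined text inside <body>
    }
}
```

jsoup fills in missing html, head, and body elements while parsing, so even a fragment such as a lone paragraph yields a complete document.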
The code below sets up the scaffolding needed to test our methods. We will add API calls to the parseHtml() method.
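import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoup {
    static String html = ""; // Insert the above HTML here
    static Document document = Jsoup.parse(html);

    public static void main(String[] args) {
        parseHtml();
    }

    public static void parseHtml() {
        // The test snippets below go here
    }
}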
System.out.println(document.title());
Output :
Page Title
Elements elements = document.select("a");
Element a1 = elements.get(0);
System.out.println(a1.absUrl("href"));
a1 = elements.get(1);
System.out.println(a1.absUrl("href"));
a1 = elements.get(2);
System.out.println(a1.absUrl("href"));
Output :
https://www/example.com/link1
https://www/example.com/link2
https://www/example.com/link3
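The links in the sample HTML are already absolute, so absUrl returns them unchanged. The sketch below shows what happens with a relative href, assuming a hypothetical base URI of https://example.com/ passed to Jsoup.parse; the class and method names are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsUrlSketch {
    // Resolves the href of the first link against the given base URI.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='/link1'>Link 1</a>";
        // With a base URI, the relative href becomes an absolute URL.
        System.out.println(resolveFirstLink(html, "https://example.com/"));
        // Without a usable base URI, absUrl returns an empty string.
        System.out.println(resolveFirstLink(html, "").isEmpty());
    }
}
```

When the page is fetched with Jsoup.connect(url).get(), the base URI is set from the fetched location, so an explicit base is mainly needed when parsing strings.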
System.out.println(document.getElementById("toc"));
Output :
<div id="toc" class="toc first">
Table of content
</div>
System.out.println(document.getElementsByAttributeValue("class", "customClass commonClass"));
Output :
<custom class="customClass commonClass">
Custom tag
</custom>
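Note that getElementsByAttributeValue compares the whole attribute value, so the call above only matches because both class names are given in the same order. As a hedged sketch (the class and helper names are illustrative), the snippet below contrasts it with getElementsByClass, which matches a single class name.

```java
import org.jsoup.Jsoup;

class ClassLookupSketch {
    static final String HTML =
            "<custom class='customClass commonClass'>Custom tag</custom>";

    // Counts elements whose class attribute equals the given value as a whole.
    static int countByAttributeValue(String value) {
        return Jsoup.parse(HTML).getElementsByAttributeValue("class", value).size();
    }

    // Counts elements that carry the given class name among their classes.
    static int countByClass(String className) {
        return Jsoup.parse(HTML).getElementsByClass(className).size();
    }

    public static void main(String[] args) {
        System.out.println(countByAttributeValue("customClass commonClass")); // full value
        System.out.println(countByAttributeValue("customClass"));             // partial value
        System.out.println(countByClass("commonClass"));                      // single class
    }
}
```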
Element element = document.getElementById("dynamic");
element.html("Replaced content");
System.out.println(element.toString());
Output :
<div id="dynamic" class="toc">
Replaced content
</div>
elements = document.select("custom");
element = elements.get(0);
System.out.println(element.html());
element.html("Replaced custom tag content");
System.out.println(element.html());
Output :
Custom tag
Replaced custom tag content
elements = document.select("custom");
element = new Element("anotherCustomTag");
element.html("Another custom tag");
elements.add(element);
System.out.println(elements);
Output :
<custom class="customClass commonClass">
Replaced custom tag content
</custom>
<anotherCustomTag>
Another custom tag
</anotherCustomTag>
elements = document.select("custom");
element = new Element("custom_0").html("Another custom tag 0");
elements.add(0, element);
System.out.println(elements);
System.out.println("---------------------------");
element = new Element("custom_1");
element.html("Another custom tag 1");
elements.add(2, element);
System.out.println(elements);
Output :
<custom_0>
Another custom tag 0
</custom_0>
<custom class="customClass commonClass">
Replaced custom tag content
</custom>
---------------------------
<custom_0>
Another custom tag 0
</custom_0>
<custom class="customClass commonClass">
Replaced custom tag content
</custom>
<custom_1>
Another custom tag 1
</custom_1>
elements = document.select("custom");
List<Element> collection = new ArrayList<>();
element = new Element("custom_0");
element.html("Another custom tag 0");
collection.add(element);
element = new Element("custom_1");
element.html("Another custom tag 1");
collection.add(element);
elements.addAll(collection);
System.out.println(elements);
Output :
<custom class="customClass commonClass">
Replaced custom tag content
</custom>
<custom_0>
Another custom tag 0
</custom_0>
<custom_1>
Another custom tag 1
</custom_1>
elements = document.select("custom");
collection = new ArrayList<>();
element = new Element("custom_0");
element.html("Another custom tag 0");
collection.add(element);
element = new Element("custom_1");
element.html("Another custom tag 1");
collection.add(element);
elements.addAll(0,collection);
System.out.println(elements);
Output :
<custom_0>
Another custom tag 0
</custom_0>
<custom_1>
Another custom tag 1
</custom_1>
<custom class="customClass commonClass">
Replaced custom tag content
</custom>
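In the add and addAll examples above, each fresh select("custom") returns only the original element: the added elements live in the Elements list, not in the document. The sketch below illustrates that difference under stated assumptions; the class name, the extra tag, and the one-line sample HTML are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ListVersusDom {
    static final String HTML = "<custom>Custom tag</custom>";

    // Adds an element to the selected list only.
    // Returns {size of the list, number of <extra> elements in the document}.
    static int[] addToList() {
        Document doc = Jsoup.parse(HTML);
        Elements selected = doc.select("custom");
        selected.add(new Element("extra"));
        return new int[] { selected.size(), doc.select("extra").size() };
    }

    // Appends a child to <body>, which changes the document itself.
    static int appendToDocument() {
        Document doc = Jsoup.parse(HTML);
        doc.body().appendElement("extra");
        return doc.select("extra").size();
    }

    public static void main(String[] args) {
        int[] counts = addToList();
        System.out.println(counts[0] + " in list, " + counts[1] + " in document");
        System.out.println(appendToDocument() + " in document after appendElement");
    }
}
```

To change the page itself, use methods that operate on nodes, such as appendElement, append, before, or after.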
elements = document.select("custom");
elements.addClass("dynamicClass");
System.out.println(elements);
elements = document.select("custom");
elements.after("<anotherCustom>Another custom element</anotherCustom>");
System.out.println(document);
Output :
...
<custom class="customClass commonClass">Custom tag</custom>
<anothercustom>Another custom element</anothercustom>
...
elements = document.select("custom");
elements.append("<anotherCustom>Another custom element</anotherCustom>");
System.out.println(elements);
Output :
<custom class="customClass commonClass">
Custom tag
<anothercustom>
Another custom element
</anothercustom>
</custom>
element = document.getElementById("dynamic");
System.out.println(element.attr("class"));
Output :
toc
element = document.getElementById("dynamic");
element.attr("class", "newClass");
System.out.println(element);
Output :
<div id="dynamic" class="newClass">
Original Content
</div>
elements = document.select("custom");
elements.before("<before>Insert before</before>");
System.out.println(document);
Output :
...
<before>
Insert before
</before>
<custom class="customClass commonClass">
Custom tag
</custom>
...
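The after, append, and before calls above differ only in where the new markup lands relative to each matched element. The following sketch applies all three to one element and lists the resulting sibling order; the wrapper div with id box and the tag names used for the inserted markup are assumptions made for the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class InsertPositions {
    // Builds a small document and applies before, after, and append to <custom>.
    static Document build() {
        Document doc = Jsoup.parse("<div id='box'><custom>Custom tag</custom></div>");
        Elements custom = doc.select("custom");
        custom.before("<before>Insert before</before>");  // previous sibling
        custom.after("<after>Insert after</after>");      // next sibling
        custom.append("<inside>Appended child</inside>"); // last child of <custom>
        return doc;
    }

    // Returns the tag names of the direct children of the wrapper, in order.
    static String childOrder() {
        StringBuilder order = new StringBuilder();
        for (Element child : build().getElementById("box").children()) {
            order.append(child.tagName()).append(' ');
        }
        return order.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(childOrder());
        System.out.println(build().select("custom > inside").size());
    }
}
```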
elements = document.select("custom");
System.out.println(elements);
elements.clear();
System.out.println(elements.isEmpty());
Output :
<custom class="customClass commonClass">
Custom tag
</custom>
true
elements.clone() creates a deep copy of these elements.
elements = document.select("a");
Elements cloned = elements.clone();
System.out.println(cloned);
Output :
<a href="https://www/example.com/link1" class="hide">Link 1</a>
<a href="https://www/example.com/link2" class="normal">Link 2</a>
<a href="https://www/example.com/link3" class="normal">Link 3</a>
<a href="https://www/example.com/link4" class="normal">Link 4</a>
<a href="https://www/example.com/link5" class="normal">Link 5</a>
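To make "deep copy" concrete, the sketch below changes an attribute on the cloned elements and checks that the originals in the document are untouched. The class name and the shortened sample links are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class CloneSketch {
    // Returns {class of the original first link, class of its clone}
    // after modifying only the cloned elements.
    static String[] classesAfterEditingClone() {
        Document doc = Jsoup.parse(
                "<a href='#1' class='normal'>Link 1</a><a href='#2' class='normal'>Link 2</a>");
        Elements links = doc.select("a");
        Elements cloned = links.clone();
        cloned.attr("class", "changed"); // affects the copies only
        return new String[] { links.first().className(), cloned.first().className() };
    }

    public static void main(String[] args) {
        String[] classes = classesAfterEditingClone();
        System.out.println("original: " + classes[0]);
        System.out.println("clone:    " + classes[1]);
    }
}
```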
element = document.select("a").get(1);
elements = document.select("a");
System.out.println(elements.contains(element));
element = document.select("custom").get(0);
System.out.println(elements.contains(element));
Output :
true
false
elements = document.select("a");
cloned = document.select("a");
System.out.println(elements.containsAll(cloned));
Output :
true
elements = document.select("div");
elements.empty();
System.out.println(elements);
Output :
<div id="toc" class="toc first"></div>
<div id="index" class="toc"></div>
<div id="dynamic" class="toc"></div>
<div></div>
elements = document.select("a");
System.out.println(elements.eq(0));
Output :
<a href="https://www/example.com/link1" class="hide">Link 1</a>
elements.first() gets the first matched element.
elements = document.select("a");
System.out.println(elements.first());
Output :
<a href="https://www/example.com/link1" class="hide">Link 1</a>
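Although eq(0) and first() print the same markup here, they return different types: eq gives back an Elements list, while first gives back a single Element. The sketch below, using made-up sample links and a hypothetical class name, also shows how each behaves when nothing matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class EqVersusFirst {
    static Document doc() {
        return Jsoup.parse("<a>Link 1</a><a>Link 2</a>");
    }

    public static void main(String[] args) {
        Elements links = doc().select("a");

        Elements second = links.eq(1);  // a list holding just the second match
        Elements missing = links.eq(5); // index out of range: an empty list
        Element first = links.first();  // a single Element
        Element none = doc().select("table").first(); // nothing matched: null

        System.out.println(second.text());
        System.out.println(missing.isEmpty());
        System.out.println(first.text());
        System.out.println(none == null);
    }
}
```

Because first() can return null, check the result before calling methods on it when a selector may match nothing.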
elements = document.select("div");
System.out.println(elements.hasAttr("class"));
Output :
true
elements = document.select("div");
System.out.println(elements.hasClass("toc"));
System.out.println(elements.hasClass("cot"));
Output :
true
false
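Both hasAttr and hasClass on an Elements list return true as soon as any one element matches; in the sample HTML the last div has no class attribute at all, yet the results above are true. A minimal sketch of an "every element" check follows; allHaveClass is a hypothetical helper, not a jsoup method.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class AnyMatchSketch {
    static Elements divs() {
        return Jsoup.parse("<div class='toc'>A</div><div>B</div>").select("div");
    }

    // True only when every element in the list carries the class.
    static boolean allHaveClass(Elements elements, String className) {
        for (Element element : elements) {
            if (!element.hasClass(className)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Elements divs = divs();
        System.out.println(divs.hasClass("toc"));           // any element matches
        System.out.println(allHaveClass(divs, "toc"));      // every element matches?
        System.out.println(divs.last().hasAttr("class"));   // the second div alone
    }
}
```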
elements = document.select("div");
System.out.println(elements.html());
Output :
Table of content
<table>
<tbody>
<tr>
<td>Name</td>
<td>Address</td>
</tr>
</tbody>
</table>
Original Content
<form action="/submit" name="myinputs" id="form">
<input type="text" value="Name">
</form>
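As the output above shows, html() on an Elements list merges the inner HTML of every match into a single string, which makes it hard to tell where one element ends. When the content is needed per element, one option is eachText(), sketched below with made-up sample markup and a hypothetical class name.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

class PerElementContent {
    static Elements divs() {
        return Jsoup.parse("<div>One</div><div>Two</div>").select("div");
    }

    public static void main(String[] args) {
        Elements divs = divs();

        // Inner HTML of all matched elements combined into one string.
        System.out.println(divs.html());

        // One list entry per element that has text.
        List<String> texts = divs.eachText();
        for (String text : texts) {
            System.out.println("text: " + text);
        }
    }
}
```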
elements = document.select("div");
System.out.println(elements.html("New HTML"));
Output :
<div id="toc" class="toc first">New HTML</div>
<div id="index" class="toc">New HTML</div>
<div id="dynamic" class="toc">New HTML</div>
<div>New HTML</div>