Updated April 3, 2023
Introduction to Jsoup Parser
Parsing HTML is a common requirement, and libraries for it exist in many languages, notably Python and Java. jsoup is a popular Java library for parsing HTML. It can fetch a page from a URL, parse the markup, and let you extract and manipulate the resulting data. jsoup implements the WHATWG HTML5 specification, so it parses HTML in much the same way as modern browsers do, which is one of its main strengths. Its key capabilities include scraping HTML from a specific URL, extracting data, and manipulating the elements of an HTML document.
JSOUP Parse HTML String
jsoup can also parse HTML supplied directly as a string. The string is passed to the jsoup parser, and the resulting document can then be queried for the elements and items within it. This is useful when the HTML is built or received in memory rather than read from a file or a URL. The jsoup API is organized into six packages, and the classes most relevant to parsing fall into three groups:
1) Jsoup
2) Document
3) Element
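As a minimal sketch of how these three classes fit together: Jsoup is the entry point, its parse() method returns a Document, and individual tags are handled as Element objects. The class name, method names, and sample markup below are hypothetical and exist only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupClassesSketch {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<html><head><title>Sample</title></head>"
        + "<body><a id=\"home\" href=\"/index.html\">Home</a></body></html>";

    // Jsoup.parse() turns the string into a Document; an Element exposes a tag's text.
    static String linkText(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.getElementById("home");
        return link.text();
    }

    // An Element can also be modified in place; here its href attribute is rewritten.
    static String retargetLink(String html, String newHref) {
        Document doc = Jsoup.parse(html);
        Element link = doc.getElementById("home");
        link.attr("href", newHref);
        return link.attr("href");
    }

    public static void main(String[] args) {
        System.out.println(linkText(HTML));                   // Home
        System.out.println(retargetLink(HTML, "/home.html")); // /home.html
    }
}
```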
The parser makes every attempt to produce a clean parse from the HTML it is given, whether or not the HTML string is well-formed. One case it handles is unclosed tags, where the opening tag is present but the closing tag is missing or misplaced. Another is implicit tags, where the markup omits an element that the parser then supplies; the jsoup documentation gives the example of a bare table cell being wrapped in the table structure it needs. Finally, the parser reliably builds the document-level structure, so the result has an html element containing a head and a body.
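The following sketch illustrates two of these behaviors on deliberately malformed input: unclosed paragraph tags, and a bare text fragment with no document structure. The class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LenientParsingSketch {
    // Counts the <p> elements jsoup builds from the given markup.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // Reports whether the parsed document has both a head and a body element.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head() != null && doc.body() != null;
    }

    // Returns the text that ends up inside the body element.
    static String bodyText(String html) {
        return Jsoup.parse(html).body().text();
    }

    public static void main(String[] args) {
        // Two opening <p> tags and no closing tags: jsoup still builds two paragraphs.
        System.out.println(paragraphCount("<p>Lorem <p>Ipsum")); // 2
        // A bare text fragment: jsoup still creates the head and body around it.
        System.out.println(hasHeadAndBody("Just some text"));    // true
        System.out.println(bodyText("Just some text"));          // Just some text
    }
}
```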
Let’s consider an example of parsing an HTML string with jsoup. The steps are as follows:
1) Import the Jsoup class from the org.jsoup package.
2) Import the Document class from the org.jsoup.nodes package.
3) Next, declare a string variable that holds the HTML to be parsed. The entire HTML markup goes inside this string. Once parsed, a document consists of Elements and TextNodes, along with a few other miscellaneous node types. In terms of inheritance, Document extends Element, which in turn extends Node, and TextNode is also a subclass of Node. An Element holds a list of child nodes and has one parent Element; it also provides a filtered list of its child Elements only.
String string_var = "The corresponding HTML string has to be placed here for reference";
4) Then declare a Document object. The static Jsoup.parse() method translates the string variable declared above into a Document, a structured representation of the HTML.
Document doc_item = Jsoup.parse(string_var);
5) Next, declare a variable for the title. The title() method of the Document returns the text of the document's title element.
String title_obj = doc_item.title();
6) Lastly, retrieve the contents of the HTML body. The body() method of the Document returns the body element, and calling text() on that element returns its text content.
String body_obj = doc_item.body().text();
The values created above can be printed to the console to inspect the title and body of the HTML.
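The steps above can be assembled into one small program. This is a minimal sketch: the class name, helper methods, and sample HTML string are hypothetical, and the body is read with body().text() so that it can be stored in a String.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseHtmlStringSketch {
    // Hypothetical HTML string standing in for the markup to be parsed.
    static final String SAMPLE =
        "<html><head><title>Sample page</title></head>"
        + "<body><p>Hello from the body.</p></body></html>";

    // Parses the string and returns the text of its title element.
    static String titleOf(String html) {
        Document doc_item = Jsoup.parse(html);
        return doc_item.title();
    }

    // Parses the string and returns the text content of its body element.
    static String bodyTextOf(String html) {
        Document doc_item = Jsoup.parse(html);
        return doc_item.body().text();
    }

    public static void main(String[] args) {
        System.out.println("title: " + titleOf(SAMPLE));   // title: Sample page
        System.out.println("body: " + bodyTextOf(SAMPLE)); // body: Hello from the body.
    }
}
```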
JSOUP Examples
Example #1
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class Jsoupprintfileexample {
    public static void main(String[] args) throws IOException {
        // Fetch the page over HTTP and parse it into a Document.
        // (Jsoup.parse(File, String) is for local files, not URLs.)
        Document doc_item = Jsoup.connect("https://www.codebudz.com/").get();
        // Read the text of the page's title element.
        String title_item = doc_item.title();
        System.out.println("value of title is : " + title_item);
    }
}
Output:
Explanation:
The above program parses an HTML page and retrieves its title. In other words, jsoup goes through the HTML page and returns the value of its title tag. The program imports org.jsoup.Jsoup and org.jsoup.nodes.Document, and the main class is declared below the imports. Inside main, a Document object is created for the URL whose title is to be retrieved. The title is then obtained with the document's title() method and stored in the String variable title_item, and the value of title_item is printed to the console. For the URL https://www.codebudz.com/, jsoup returns codebudz as the title after parsing the page.
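The same title lookup also works for an HTML file stored on disk, which is the case Jsoup.parse(File, String) is meant for. The sketch below writes a hypothetical page to a temporary file so that it is self-contained; the class name, method name, and file contents are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseHtmlFileSketch {
    // Writes the markup to a temporary file, parses that file, and returns its title.
    static String titleViaTempFile(String html) {
        try {
            Path tmp = Files.createTempFile("sample", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                // Parse the local file, decoding it as UTF-8.
                Document doc_item = Jsoup.parse(tmp.toFile(), "UTF-8");
                return doc_item.title();
            } finally {
                // Clean up the temporary file whether or not parsing succeeded.
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Local page</title></head><body></body></html>";
        System.out.println("value of title is : " + titleViaTempFile(html));
    }
}
```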
Conclusion – Jsoup parser
This article explained what jsoup is, how it is used, and the main classes associated with it, and walked through an example of jsoup in action.