Learn the Basics of HTML: A Step-by-Step Guide

HTML (HyperText Markup Language) is the language used to create webpages and is an essential part of web development. It is easy to learn and can be used to create simple or complex websites. This guide will provide a step-by-step introduction to the basics of HTML so you can get started creating your own webpages.

Understanding HTML Tags

HTML tags are used to structure the content on a webpage. They are written in angle brackets and usually come in pairs: an opening tag and a closing tag. For example, <h1> is the opening tag for a heading and </h1> is the closing tag. Each tag has its own purpose and can be used to add structure, formatting, and other elements to a webpage.

Creating Your First Webpage

Now that you understand HTML tags, it’s time to create your first webpage. Start by creating a new file in your text editor and save it as “index.html”. This will be the main page of your website. You will then need to add some basic HTML tags such as <html>, <head>, and <body>.

Once you have added the basic HTML tags, you can start adding content to your webpage. You can use text, images, videos, links, and other elements to create an engaging user experience. To add content, you will need to use specific HTML tags such as <p> for paragraphs, <img> for images, and <a> for links. You can also use CSS (Cascading Style Sheets) to style your webpage with colors, fonts, backgrounds, etc.
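As a rough illustration of opening and closing tag pairs, the following Python sketch keeps a minimal page in a string and uses the standard library's html.parser module to check that every opening tag is closed again. The page content, the PAGE and VOID_TAGS names, and the TagChecker class are all made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical minimal page; the text and file names are placeholders.
PAGE = """<!DOCTYPE html>
<html>
  <head>
    <title>My First Page</title>
  </head>
  <body>
    <h1>Hello, world</h1>
    <p>This is a paragraph.</p>
    <img src="photo.jpg" alt="A photo">
    <a href="https://example.com">A link</a>
  </body>
</html>"""

# Void elements such as <img> have no closing tag in HTML.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class TagChecker(HTMLParser):
    """Tracks open tags on a stack and records every tag name seen."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.seen = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        self.seen.append(tag)
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # A closing tag must match the most recently opened tag.
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


checker = TagChecker()
checker.feed(PAGE)
checker.close()
is_well_formed = checker.balanced and not checker.stack
print(checker.seen)
print("all tags closed:", is_well_formed)
```

Saving the PAGE string to a file named index.html and opening it in a browser would display the heading, paragraph, image placeholder and link.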

Learning HTML is an important skill for anyone interested in web development or creating their own website. With this guide as a starting point, you should now have all the tools you need to get started with HTML and begin creating your own webpages.

This text was generated using a large language model, and select text has been reviewed and moderated for purposes such as readability.

XPath uses path expressions to select nodes or node-sets in an XML document. The node is selected by following a path or steps.

The XML Example Document

We will use the following XML document in the examples below.

Selecting Nodes

The most useful path expressions are:

  • nodename (selects all nodes with the name "nodename")
  • / (selects from the root node)
  • // (selects nodes in the document from the current node that match the selection, no matter where they are)
  • . (selects the current node)
  • .. (selects the parent of the current node)
  • @ (selects attributes)

Predicates are used to find a specific node or a node that contains a specific value.

Predicates are always embedded in square brackets.

In the table below we have listed some path expressions with predicates and the result of the expressions:

Selecting Unknown Nodes

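XPath wildcards can be used to select unknown XML nodes: * matches any element node, @* matches any attribute node, and node() matches any node of any kind.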

Selecting Several Paths

By using the | operator in an XPath expression you can select several paths.
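The ideas in this section (predicates, wildcards and selecting several paths) can be tried out with Python's standard xml.etree.ElementTree module, which implements only a limited subset of XPath. This is a minimal sketch: the bookstore document, its titles and its prices are invented for illustration and are not the example document of the original tutorial. Because ElementTree has no | operator, the last step joins two result lists instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
XML = """<bookstore>
  <book>
    <title lang="en">First Book</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Second Book</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# Predicate by position: the first <book> child of the context node.
first_title = root.find("./book[1]/title").text

# Predicate by position: the last <book> child.
last_title = root.find("./book[last()]/title").text

# Predicate by attribute value: every <title> whose lang attribute is "en".
english = [t.text for t in root.findall(".//title[@lang='en']")]

# Wildcard: all child elements of the first <book>, whatever their names.
child_names = [child.tag for child in root.findall("./book[1]/*")]

# Stand-in for "//title | //price": ElementTree lacks |, so join two lists.
titles_and_prices = root.findall(".//title") + root.findall(".//price")

print(first_title, last_title, english, child_names, len(titles_and_prices))
```

Note that a real XPath union returns nodes in document order, whereas the joined list here holds all titles first and all prices second.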

Selecting content on a web page with XPath

Overview

Teaching: 30 min
Exercises: 15 min

Questions

  • How can I select a specific element on a web page?
  • What is XPath and how can I use it?

Objectives

  • Introduce XPath queries
  • Explain the structure of an XML or HTML document
  • Explain how to view the underlying HTML content of a web page in a browser
  • Explain how to run XPath queries in a browser
  • Introduce the XPath syntax
  • Use the XPath syntax to select elements on this web page

Before we delve into web scraping proper, we will first spend some time introducing some of the techniques that are required to indicate exactly what should be extracted from the web pages we aim to scrape.

The material in this section was adapted from the XPath and XQuery Tutorial written by Kim Pham (@tolloid) for the July 2016 Library Carpentry workshop in Toronto.

Introduction

XPath (which stands for XML Path Language) is an expression language used to specify parts of an XML document. XPath is rarely used on its own; rather, it is used within software and languages that are aimed at manipulating XML documents, such as XSLT, XQuery or the web scraping tools that will be introduced later in this lesson. XPath can also be used in documents with a structure that is similar to XML, like HTML.

Markup Languages

XML and HTML are markup languages. This means that they use a set of tags or rules to organise and provide information about the data they contain. This structure helps to automate the processing, editing, formatting, displaying, printing, etc. of that information.

XML documents store data in plain text format. This provides a software- and hardware-independent way of storing, transporting, and sharing data. XML format is an open format, meant to be software agnostic. You can open an XML document in any text editor and the data it contains will be shown as it is meant to be represented. This allows for exchange between incompatible systems and easier conversion of data.

XML and HTML

Note that HTML and XML have a very similar structure, which is why XPath can be used almost interchangeably to navigate both HTML and XML documents. Strictly speaking, HTML is not a dialect of XML: only XHTML, the XML serialization of HTML, has to be well-formed XML. In practice, HTML parsers build the same kind of node tree from a page, and that tree is what XPath navigates.

An XML document follows basic syntax rules:

  • An XML document is structured using nodes, which include element nodes, attribute nodes and text nodes
  • XML element nodes must have an opening and closing tag, e.g. <catfood> opening tag and </catfood> closing tag
  • XML tags are case sensitive, e.g. <catfood> does not equal <catFood>
  • XML elements must be properly nested, e.g. <a><b></b></a> is valid but <a><b></a></b> is not
  • Text nodes (data) are contained inside the opening and closing tags
  • XML attribute nodes contain values that must be quoted, e.g. <catfood type="basic"></catfood>
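These rules can be checked mechanically. The sketch below, which assumes Python and its standard xml.etree.ElementTree parser, feeds a few small strings to the parser and records which ones are accepted. The is_well_formed helper and the sample strings (including the <name> element) are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested, with a quoted attribute value: accepted.
good = '<catfood type="basic"><name>Kibble</name></catfood>'

# Closing tags in the wrong order: rejected.
bad_nesting = "<catfood><name>Kibble</catfood></name>"

# Tags are case sensitive, so </catFood> does not close <catfood>.
bad_case = "<catfood>Kibble</catFood>"

# Attribute value without quotes: rejected.
bad_attribute = "<catfood type=basic></catfood>"

results = {
    "good": is_well_formed(good),
    "bad_nesting": is_well_formed(bad_nesting),
    "bad_case": is_well_formed(bad_case),
    "bad_attribute": is_well_formed(bad_attribute),
}
print(results)

# The three node kinds from the list above, as ElementTree exposes them:
element = ET.fromstring(good)
print(element.tag)                # element node name
print(element.attrib)             # attribute nodes as a dictionary
print(element.find("name").text)  # text node inside <name>
```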

XPath Expressions

XPath is written using expressions. Expressions consist of values, e.g., 368, and operators, e.g., +, that will return a single value. 368 + 275 is an example of an expression. It will return the value 643. In programming terminology, this is called evaluating, which simply means reducing down to a single value. A single value with no operators, e.g. 35, can also be called an expression, though it will evaluate only to its existing value, e.g. 35.

Using XPath is similar to using advanced search in a library catalogue, where the structured nature of bibliographic information allows us to specify which metadata fields to query. For example, if we want to find books about Shakespeare but not works by him, we can limit our search function to the subject field only.

When we use XPath, we do not need to know in advance what the data we want looks like (as we would with regular expressions, where we need to know the pattern of the data). Since XML documents are structured into fields called nodes, XPath makes use of that structure to navigate through the nodes to select the data we want. We just need to know in which nodes within an XML file the data we want to find resides. When XPath expressions are evaluated on XML documents, they return objects containing the nodes that you specify.

XPath always assumes structured data.

Now let’s start using XPath.

Navigating through the HTML node tree using XPath

A popular way to represent the structure of an XML or HTML document is the node tree.

In an HTML document, everything is a node:

  • The entire document is a document node
  • Every HTML element is an element node
  • The text inside HTML elements is held in text nodes

The nodes in such a tree have a hierarchical relationship to each other. We use the terms parent, child and sibling to describe these relationships:

  • In a node tree, the top node is called the root (or root node)
  • Every node has exactly one parent, except the root (which has no parent)
  • A node can have zero, one or several children
  • Siblings are nodes with the same parent
  • The sequence of connections from node to node is called a path

Paths in XPath are defined using slashes ( / ) to separate the steps in a node connection sequence, much like URLs or Unix directories.
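To make the parent, child, sibling and path vocabulary concrete, here is a minimal Python sketch using the standard xml.etree.ElementTree module. The small page, the parent_of table and the path_to helper are assumptions made for this illustration; ElementTree elements do not remember their parent, so the sketch builds that lookup itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, kept well-formed so the XML parser accepts it.
HTML = """<html>
  <head><title>Example</title></head>
  <body>
    <h1>Heading</h1>
    <p>First</p>
    <p>Second</p>
  </body>
</html>"""

root = ET.fromstring(HTML)

# Map every element to its parent; the root has no entry.
parent_of = {child: parent for parent in root.iter() for child in parent}


def path_to(element):
    """Return the slash-separated path from the root down to element."""
    steps = []
    while element is not None:
        steps.append(element.tag)
        element = parent_of.get(element)
    return "/" + "/".join(reversed(steps))


body = root.find("body")
h1 = body.find("h1")

children = [child.tag for child in body]                  # children of <body>
siblings = [e.tag for e in parent_of[h1] if e is not h1]  # siblings of <h1>
paths = [path_to(e) for e in root.iter()]                 # one path per element

print(children)
print(siblings)
print(paths)
```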

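In XPath, all expressions are evaluated based on a context node. The context node is the node from which a path starts. The default context is the root node, indicated by a single slash (/).

The most useful path expressions are a node name (selects all nodes with that name), / (a path that starts with a slash begins at the root node), // (selects matching nodes anywhere below the current node), . (the current node), .. (the parent of the current node) and @ (an attribute).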

Navigating through a webpage with XPath using a browser console

We will use the HTML code that describes this very page you are reading as an example. By default, a web browser interprets the HTML code to determine what markup to apply to the various elements of a document, and the code is invisible. To make the underlying code visible, all browsers have a function to display the raw HTML content of a web page.

Display the source of this page

Using your favourite browser, display the HTML source code of this page. Tip: in most browsers, all you have to do is right-click anywhere on the page and select the “View Page Source” option (“Show Page Source” in Safari). Another tab should open with the raw HTML that makes up this page. See if you can locate its various elements, and this challenge box in particular.

Using the Safari browser

If you are using Safari, you must first turn on the “Develop” menu in order to view the page source, and use the functions that we will use later in this section. To do so, navigate to Safari > Preferences and in the Advanced tab select the “Show Develop in menu bar” option. Note: In recent versions of Safari you must first turn on the “Develop” menu (in Preferences) and then navigate to Develop > Show Javascript Console and then click on the “Console” tab.

The HTML structure of the page you are currently reading looks something like this (most text and elements have been removed for clarity):

We can see from the source code that the title of this page is in a title element that is itself inside the head element, which is itself inside an html element that contains the entire content of the page.

Say we wanted to tell a web scraper to look for the title of this page. We would use this information to indicate the path the scraper would need to follow as it navigates through the HTML content of the page to reach the title element. XPath allows us to do that.

We can run XPath queries directly from within all major modern browsers, by enabling the built-in JavaScript console.

Display the console in your browser

In Firefox, use the Tools > Web Developer > Web Console menu item. In Chrome, use the View > Developer > JavaScript Console menu item. In Safari, use the Develop > Show Error Console menu item. If your Safari browser doesn’t have a Develop menu, you must first enable this option in the Preferences, see above.

Here is what the console looks like in the Firefox browser:

JavaScript console in Firefox

For now, don’t worry too much about error messages if you see any in the console when you open it. The console should display a prompt with a > character ( >> in Firefox) inviting you to type commands.

The syntax to run an XPath query within the JavaScript console is $x("XPATH_QUERY"). For example, $x("/html/head/title/text()") should return an array containing the text node that holds the title of this page.

The output can vary slightly based on the browser you are using. For example in Chrome, you have to “open” the return object by clicking on it in order to view its contents.

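Let’s look closer at the XPath query used in the example above: /html/head/title/text(). The first / indicates the root of the document. With that query, we told the browser to start at the root of the document, navigate to the html node, then to the head node inside it, then to the title node inside that, and finally to select the text node contained in the title element.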

Using this syntax, XPath thus allows us to determine the exact path to a node.
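The same step-by-step navigation can be reproduced outside the browser. The sketch below assumes Python's standard xml.etree.ElementTree module, which supports only part of XPath: it has no text() function and paths run relative to an element, so /html/head/title/text() becomes the path ./head/title followed by the .text attribute. The HTML string is a made-up stand-in, not the real source of this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page source; not the real lesson page.
HTML = """<html>
  <head><title>Selecting content with XPath</title></head>
  <body>
    <div>
      <h1 id="introduction">Introduction</h1>
      <h1 id="markup-languages">Markup Languages</h1>
    </div>
  </body>
</html>"""

root = ET.fromstring(HTML)  # root is the <html> element itself

# /html/head/title: root already is <html>, so the path starts at "."
title = root.find("./head/title")

# text(): ElementTree exposes the text node through the .text attribute.
title_text = title.text

# One-based indexing: h1[1] is the first <h1> child of the <div>.
first_h1 = root.find("./body/div/h1[1]")

print(title_text)
print(first_h1.get("id"))
```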

Select the “Introduction” title

Write an XPath query that selects the “Introduction” title above and try running it in the console.

Tip: if a query returns multiple elements, the syntax element[1] can be used. Note that XPath uses one-based indexing, therefore the first element has index 1, the second has index 2 etc.

Solution

$x("/html/body/div/article/h1[1]") should produce something similar to

    <- Array [ <h1#introduction> ]

Before we look into other ways to reach a specific HTML node using XPath, let’s start by looking closer at how nodes are arranged within a document and what their relationships with each other are.

For example, to select the blockquote nodes of this page, we can write $x("/html/body/div/blockquote"), which produces an array of objects.

This selects all the blockquote elements that are under html/body/div. If we want instead to select all blockquote elements in this document, we can use the // syntax: $x("//blockquote"). This produces a longer array of objects.

Why is the second array longer? If you look closely into the array that is returned by the $x("//blockquote") query above, you should see that it contains objects like <blockquote.solution> that were not included in the results of the first query. Why is this so? Tip: Look at the source code and see how the challenges and solutions elements are organised.

We can use the class attribute of certain elements to filter down results. For example, looking at the list of blockquote elements returned by the previous query, and by looking at this page’s source, we can see that the blockquote elements on this page are of different classes (challenge, solution, callout, etc.).

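To refine the above query to get all the blockquote elements of the challenge class, we can type $x("//blockquote[@class='challenge']"), which returns only the blockquote elements whose class attribute is challenge.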
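The difference between a full path, the // syntax and an attribute filter can be seen on a small document. This is a sketch in Python using the standard xml.etree.ElementTree module; the markup below is a simplified, hypothetical stand-in for this page, with one blockquote nested inside another.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the structure of the lesson page.
HTML = """<html>
  <body>
    <div>
      <blockquote class="callout"><p>A note</p></blockquote>
      <blockquote class="challenge">
        <p>A task</p>
        <blockquote class="solution"><p>An answer</p></blockquote>
      </blockquote>
    </div>
  </body>
</html>"""

root = ET.fromstring(HTML)

# Full path: only blockquotes that are direct children of html/body/div.
direct = root.findall("./body/div/blockquote")

# // syntax: blockquotes at any depth.
anywhere = root.findall(".//blockquote")

# Attribute predicate: only blockquotes of the challenge class.
challenges = root.findall(".//blockquote[@class='challenge']")

direct_classes = [b.get("class") for b in direct]
all_classes = [b.get("class") for b in anywhere]
challenge_classes = [b.get("class") for b in challenges]

print(direct_classes)
print(all_classes)
print(challenge_classes)
```

One caveat: [@class='challenge'] compares the whole attribute value, so an element with class="challenge extra" would not match.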

Select the “Introduction” title by ID

In a previous challenge, we were able to select the “Introduction” title because we knew it was the first h1 element on the page. But what if we didn’t know how many such elements were on the page? In other words, is there a different attribute that allows us to uniquely identify that title element? Using the path expressions introduced above, rewrite your XPath query to select the “Introduction” title without using the [1] index notation.

Tips:

  • Look at the source of the page or use the “Inspect element” function of your browser to see what other information would enable us to uniquely identify that element.
  • The syntax for selecting an element like <div id="mytarget"> is div[@id = 'mytarget'].

Solution

$x("/html/body/div/h1[@id='introduction']") should produce something similar to

    <- Array [ <h1#introduction> ]

Select this challenge box

Using an XPath query in the JavaScript console of your browser, select the element that contains the text you are currently reading on this page.

Tips:

  • In principle, id attributes in HTML are unique on a page. This means that if you know the id of the element you are looking for, you should be able to construct an XPath that looks for this value without having to worry about where in the node tree the target element is located.
  • The syntax for selecting an element like <div id="mytarget"> is div[@id = 'mytarget'].
  • Remember that XPath queries are relative to a context node, and by default that node is the root node.
  • Use the // syntax to select for elements regardless of where they are in the tree.
  • The syntax to select the parent element relative to a context node is ..
  • The $x(...) JavaScript syntax will always return an array of nodes, regardless of the number of nodes returned by the query. Contrary to XPath, JavaScript uses zero-based indexing, so the syntax to get the first element of that array is therefore $x(...)[0].

Make sure you select this entire challenge box. If the result of your query displays only the title of this box, have a second look at the HTML structure of the document and try to figure out how to “expand” your selection to the entire challenge box.

Solution

Let’s have a look at the HTML code of this page, around this challenge box (using the “View Source” option in our browser). The code looks something like this:

    <!doctype html>
    <html lang="en">
      <head>
        (...)
      </head>
      <body>
        <div class="container">
          (...)
          <blockquote class="challenge">
            <h2 id="select-this-challenge-box">Select this challenge box</h2>
            <p>Using an XPath query in the JavaScript console of your browser...</p>
            (...)
          </blockquote>
          (...)
        </div>
      </body>
    </html>

We know that the id attribute should be unique, so we can use this to select the h2 element inside the challenge box:

    $x("//h2[@id = 'select-this-challenge-box']/..")[0]

This should return something like

    <- <blockquote class="challenge">

Let’s walk through that syntax:

  • $x(" This function tells the browser we want it to execute an XPath query.
  • // Look anywhere in the document…
  • h2 … for an h2 element …
  • [@id = 'select-this-challenge-box'] … that has an id attribute set to select-this-challenge-box …
  • .. and select the parent node of that h2 element
  • ") This is the end of the XPath query.
  • [0] Select the first element of the resulting array (since $x() returns an array of nodes and we are only interested in the first one).

By hovering on the object returned by your XPath query in the console, your browser should helpfully highlight that object in the document, enabling you to make sure you got the right one:

Advanced XPath syntax

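Operators are used to compare nodes. There are mathematical operators and boolean operators, and an operator can give a boolean (true/false) value as a result. Some useful ones are = (equal to), != (not equal to), <, <=, > and >= (comparisons), and and or (boolean combinations), and | (the union of two node-sets).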

Predicates are used to find a specific node or a node that contains a specific value.

Predicates are always embedded in square brackets, and are meant to provide additional filtering information to bring back nodes. You can filter on a node by using operators or functions.

XPath wildcards can be used to select unknown XML nodes.

In-text search

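XPath can do in-text searching using functions such as contains() and starts-with(). XPath 2.0 and later also support regular expressions with the matches() function, but that function is not part of XPath 1.0, which is the version browsers implement. Note: in-text searching is case-sensitive!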

Complete syntax: XPath Axes

XPath axes are the fuller syntax for writing XPath paths. They provide all of the different ways to specify a path by describing more fully the relationships between nodes and their connections. The XPath specification describes 13 different axes:

  • self -- the context node itself
  • child -- the children of the context node
  • descendant -- all descendants (children+)
  • parent -- the parent (empty if at the root)
  • ancestor -- all ancestors from the parent to the root
  • descendant-or-self -- the union of descendant and self
  • ancestor-or-self -- the union of ancestor and self
  • following-sibling -- siblings to the right
  • preceding-sibling -- siblings to the left
  • following -- all following nodes in the document, excluding descendants
  • preceding -- all preceding nodes in the document, excluding ancestors
  • attribute -- the attributes of the context node
  • namespace -- the namespace nodes of the context node

XPath Axes Image Credit: SAMS Teach Yourself XSLT in 21 Days

Oftentimes, the elements we are looking for on a page have no ID attribute or other uniquely identifying features, so the next best thing is to aim for neighboring elements that we can identify more easily and then use node relationships to get from those easy to identify elements to the target elements.

For example, the node tree image above has no uniquely identifying feature like an ID attribute. However, it is just below the section header “Navigating through the HTML node tree using XPath”. Looking at the source code of the page, we see that that header is an h2 element with the id navigating-through-the-html-node-tree-using-xpath.
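The idea of starting from an easy-to-identify header and walking to a neighbour can be sketched in Python. The standard xml.etree.ElementTree module does not implement the following-sibling axis, so the hypothetical following_siblings helper below emulates it with a parent lookup table. The markup is an invented simplification in which the image is a direct sibling of the header; on the real page it may be nested differently.

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup: the image is assumed to be a sibling of the h2.
HTML = """<div>
  <h2 id="introduction">Introduction</h2>
  <p>Some text</p>
  <h2 id="navigating-through-the-html-node-tree-using-xpath">Navigating</h2>
  <p>A popular way to represent the structure</p>
  <img src="node-tree.png" alt="node tree" />
  <p>More text</p>
</div>"""

root = ET.fromstring(HTML)
parent_of = {child: parent for parent in root.iter() for child in parent}


def following_siblings(element):
    """Return the siblings that come after element, in document order."""
    siblings = list(parent_of[element])
    return siblings[siblings.index(element) + 1:]


# Step 1: find the header through its unique id attribute.
header = root.find(".//h2[@id='navigating-through-the-html-node-tree-using-xpath']")

# Step 2: among its following siblings, keep the img elements.
images = [e for e in following_siblings(header) if e.tag == "img"]
first_image = images[0]

print(first_image.get("src"))
```

In full XPath, and under the same assumption about the markup, the equivalent query would be //h2[@id='navigating-through-the-html-node-tree-using-xpath']/following-sibling::img[1].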

Note: Firefox sometimes cleans up the HTML of a page before displaying it, meaning that the DOM tree we can access through the console might not reflect the actual source code. <tbody> elements are typically not reliable. The Scrapy documentation has more on the topic.

Key Points

  • XML and HTML are markup languages. They provide structure to documents.
  • XML and HTML documents are made out of nodes, which form a hierarchy.
  • The hierarchy of nodes inside a document is called the node tree.
  • Relationships between nodes are: parent, child, sibling.
  • XPath queries are constructed as paths going up or down the node tree.
  • XPath queries can be run in the browser using the $x() function.

XPath Tutorial

XPath is a query language that is used for traversing through an XML document. It is used commonly to search particular elements or attributes with matching patterns.

This tutorial explains the basics of XPath. It contains chapters discussing all the basic components of XPath with suitable examples.

This tutorial has been designed for beginners to help them understand the basic concepts related to XPath. It will give you enough understanding of XPath that you can take yourself to higher levels of expertise.

Prerequisites

Before proceeding with this tutorial, you should have basic knowledge of XML, HTML, and JavaScript.

An introduction to XPath: How to get started

Let's start with the obvious question: what is XPath? XPath is a powerful language that is often used for scraping the web. It allows you to select nodes or compute values from an XML or HTML document and is actually one of the languages that you can use to extract web data using Scrapy.

The other is CSS and while CSS selectors are a popular choice, XPath can actually allow you to do more.

With XPath, you can extract data based on text elements' contents, and not only on the page structure. So when you are scraping the web and you run into a hard-to-scrape website, XPath may just save the day (and a bunch of your time!).

This introductory XPath tutorial will walk you through the basic concepts of XPath, which are crucial to a good understanding of it, before diving into more complex use cases.

Note: You can use the XPath playground to experiment with XPath. Just paste the HTML samples provided in this post and play with the expressions.

Consider this HTML document:

XPath handles any XML/HTML document as a tree. This tree's root node is not part of the document itself. It is in fact the parent of the document element node (<html> in the case of the HTML above). This is what the XPath tree for the HTML document looks like:

As you can see, there are many node types in an XPath tree:

  • Element node: represents an HTML element, a.k.a an HTML tag.
  • Attribute node: represents an attribute from an element node, e.g. the “href” attribute in <a href="http://www.example.com">example</a>.
  • Comment node: represents comments in the document ( <!-- … --> ).
  • Text node: represents the text enclosed in an element node ( example in <p>example</p> ).

Distinguishing between these different types is useful to understand how XPath expressions work. Now let's start digging into XPath.

Here is how we can select the title element from the page above using an XPath expression:

    /html/head/title

This is what we call a location path. It allows us to specify the path from the context node (in this case the root of the tree) to the element we want to select, as we do when addressing files in a file system. The location path above has three location steps, separated by slashes. It roughly means: start from the ‘html’ element, look for a ‘head’ element underneath, and a ‘title’ element underneath that ‘head’. The context node changes in each step. For example, the head node is the context node when the last step is being evaluated.

However, we usually don't know or don’t care about the full explicit node-by-node path; we just care about the nodes with a given name. We can select them using:

    //title

This means: look in the whole tree, starting from the root of the tree (//) and select only those nodes whose name matches title. In this example, // is the axis and title is the node test.

In fact, the expressions we've just seen are using XPath's abbreviated syntax. Translating //title to the full syntax we get:

    /descendant-or-self::node()/child::title

So, // in the abbreviated syntax is short for descendant-or-self, which means the current node or any node below it in the tree. This part of the expression is called the axis and it specifies a set of nodes to select from, based on their direction on the tree from the current context (downwards, upwards, on the same tree level). Other examples of axes are parent, child, ancestor, etc. We’ll dig more into this later on.

The next part of the expression, node(), is called a node test, and it contains an expression that is evaluated to decide whether a given node should be selected or not. In this case, it selects nodes of all types. Then we have another axis, child, which means go to the child nodes from the current context, followed by another node test, which selects the nodes named title.

So, the axis defines where in the tree the node test should be applied, and the nodes that match the node test will be returned as a result.
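To see that the abbreviated and the full form describe the same selection, the following Python sketch spells out the two steps of /descendant-or-self::node()/child::title by hand and compares the result with the shorthand. It assumes the standard xml.etree.ElementTree module, whose iter() method visits an element and everything below it. The HTML string is a hypothetical example page.

```python
import xml.etree.ElementTree as ET

# Hypothetical example page, kept well-formed for the XML parser.
HTML = """<html>
  <head><title>My page</title></head>
  <body>
    <h2>Welcome</h2>
    <p>This is the first paragraph.</p>
  </body>
</html>"""

root = ET.fromstring(HTML)

# Abbreviated form: ".//title" searches everything below the <html> element.
abbreviated = root.findall(".//title")

# Full form, step by step:
#   descendant-or-self::node() -> the context element and every element below it
#   child::title               -> the children of each of those named "title"
full = [
    child
    for node in root.iter()   # axis: descendant-or-self
    for child in node         # axis: child
    if child.tag == "title"   # node test: the name must be "title"
]

same = abbreviated == full
print([e.text for e in full], same)
```

ElementTree only iterates over element nodes here, which is enough for this comparison because text and comment nodes cannot have a title child.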

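You can test nodes against their name or against their type.

Name tests match nodes by name: for example, //title selects every title element and //p selects every p element.

Node type tests match nodes by kind: for example, //text() selects text nodes, //comment() selects comment nodes and //node() selects nodes of any type.

We can also combine name and node tests in the same expression. For example:

    //p/text()

This expression selects the text nodes from inside p elements. In the HTML snippet shown above, it would select "This is the first paragraph.".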

Now, let’s see how we can further filter and specify things. Consider this HTML document:

Say we want to select only the first li node from the snippet above. We can do this with:

    //li[position() = 1]

The expression surrounded by square brackets is called a predicate and it filters the node-set returned by //li (that is, all li nodes from the document) using the given condition. In this case, it checks each node's position using the position() function, which returns the position of the current node in the resulting node-set (notice that positions in XPath start at 1, not 0). We can abbreviate the expression above to:

    //li[1]

Both XPath expressions above would select the following element:

Check out a few more predicate examples:
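A handful of predicate forms can be tried in Python with the standard xml.etree.ElementTree module, which supports positional, attribute and child-element predicates but not the position() function itself. The list below is a made-up snippet, not the document used in the original post.

```python
import xml.etree.ElementTree as ET

# Made-up list used only for this sketch.
HTML = """<ul>
  <li class="quote">Quote 1</li>
  <li class="quote">Quote 2 with <a href="#">a link</a></li>
  <li>Plain item</li>
</ul>"""

root = ET.fromstring(HTML)

# li[1]: the first li among its parent's li children (positions start at 1).
first = root.find(".//li[1]").text

# li[last()]: the last li among its parent's li children.
last = root.find(".//li[last()]").text

# li[@class]: li elements that have a class attribute, whatever its value.
with_class = [li.text for li in root.findall(".//li[@class]")]

# li[a]: li elements that have at least one <a> child.
with_link = root.findall(".//li[a]")

print(first)
print(last)
print(with_class)
print(len(with_link))
```

The .text attribute only holds the text before the first child element, which is why the second item shows up as "Quote 2 with " in with_class.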

So, a location path is basically composed of steps, which are separated by / and each step can have an axis, a node test, and a predicate. Here we have an expression composed of two steps, each one with axis, node test, and predicate:

And here is the same expression, written using the non-abbreviated syntax:

We can also combine multiple XPath expressions in a single one using the union operator |. For example, we can select all a and h2 elements in the document above using this expression:

    //a | //h2

Now, consider this HTML document:

Say we want to select only the a elements whose link points to an HTTPS URL. We can do it by checking their href attribute:

    //a[starts-with(@href, "https")]

This expression first selects all the a elements from the document and, for each of those elements, checks whether its href attribute starts with "https". We can access any node attribute using the @attributename syntax.
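The standard Python xml.etree.ElementTree module does not provide starts-with(), so this sketch emulates the attribute check in plain Python: the @href lookup becomes element.get("href") and the function call becomes str.startswith(). The links in the snippet are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented links used only for this sketch.
HTML = """<div>
  <a href="https://example.com/secure">Secure</a>
  <a href="http://example.com/plain">Plain</a>
  <a href="https://example.org/">Also secure</a>
  <a name="anchor">No href</a>
</div>"""

root = ET.fromstring(HTML)

# Select all <a> elements, then keep those whose href starts with "https".
# get("href", "") returns an empty string when the attribute is missing.
https_links = [a for a in root.iter("a") if a.get("href", "").startswith("https")]
https_urls = [a.get("href") for a in https_links]

# A predicate with just @href only checks that the attribute exists.
with_href = root.findall(".//a[@href]")

print(https_urls)
print(len(with_href))
```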

Here we have a few additional examples using attributes:

More on axes

We've seen only two types of axes so far:

  • descendant-or-self
  • child

But there's plenty more where they came from and we'll see a few examples. Consider this HTML document:

Now we want to extract only the first paragraph after each of the titles. To do that, we can use the following-sibling axis, which selects all the siblings after the context node. Siblings are nodes that are children of the same parent; for example, all child nodes of the body tag are siblings. This is the expression:

    //h1/following-sibling::p[1]

In this example, the context node to which the following-sibling axis is applied is each of the h1 nodes from the page.

What if we want to select only the text that is right before the footer? We can use the preceding-sibling axis:

    //div[@id='footer']/preceding-sibling::text()[1]

In this case, we are selecting the first text node before the div footer ("A single paragraph, with no markup").

XPath also allows us to select elements based on their text content. We can use such a feature, along with the parent axis, to select the parent of the p element whose text is "Footer text":

    //p[text()="Footer text"]/..

The expression above selects <div id="footer"><p>Footer text</p></div>. As you may have noticed, we used .. here as a shortcut to the parent axis.

As an alternative to the expression above, we could use:

    //*[p/text()="Footer text"]

It selects, from all elements, the ones that have a p child whose text is "Footer text", getting the same result as the previous expression.
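Both approaches can be checked against each other in Python. The standard xml.etree.ElementTree module has no text() function, but it offers two close relatives: [.='text'] tests an element's own text content and [tag='text'] tests the text content of a child. The page below is a hypothetical reconstruction with a footer div, written for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with a footer, written for this sketch.
HTML = """<html>
  <body>
    <h1>Title</h1>
    <p>First paragraph</p>
    <div id="footer"><p>Footer text</p></div>
  </body>
</html>"""

root = ET.fromstring(HTML)

# Find the <p> whose text is "Footer text", then step up with ".."
via_parent = root.findall(".//p[.='Footer text']/..")

# Find any element that has a <p> child whose text is "Footer text".
via_child_test = root.findall(".//*[p='Footer text']")

print([e.get("id") for e in via_parent])
print(via_parent == via_child_test)
```

Both queries land on the same div element, mirroring the two XPath expressions discussed above.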

You can find additional axes in the XPath specification: https://www.w3.org/TR/xpath/#axes

Wrapping up this XPath tutorial

XPath is very powerful and this post is just an introduction to the basic concepts. If you want to learn more about it, check out these resources:

  • http://zvon.org/comp/r/tut-XPath_1.html
  • http://fr.slideshare.net/scrapinghub/xpath-for-web-scraping
  • XPath tips from the web scraping trenches

And stay tuned, because we will post a series with more XPath tips from the trenches in the following months.

Practical XPath for Web Scraping

XPath is a technology that uses path expressions to select nodes or node-sets in an XML document (or in our case an HTML document). Even if XPath is not a programming language in itself, it allows you to write an expression which can directly point to a specific HTML element, or even tag attribute, without the need to manually iterate over any element lists.

It looks like the perfect tool for web scraping right? At ScrapingBee we love XPath! ❤️

In our previous article about web scraping with Python we already briefly addressed XPath expressions. And now it's time to dig a bit deeper into this subject.

Why learn XPath

  • Knowing how to use basic XPath expressions is a must-have skill when extracting data from a web page
  • It's more powerful than CSS selectors (e.g. you can reference parent elements)
  • It allows you to navigate the DOM in any direction
  • Can match text inside HTML elements

Entire books have been written on XPath, and this article does not claim to be a comprehensive guide to every aspect of the subject. Rather, it is an introduction to XPath, and we will see through real examples how you can use it in your web scraping projects.

But first, let's talk a little about the DOM.

Document Object Model

I am going to assume you already know HTML, so this is just a small refresher.

As you already know, a web page is a document structured with a hierarchy of HTML tags, which describe the overall page layout (e.g. paragraphs, lists) and contain the relevant content (e.g. text, links, images). Let's check out a basic HTML page to understand what the Document Object Model is.

As you can see from the image (and the line indentation provides another hint), the HTML document can be viewed as a tree. And that's exactly what most HTML parsers (e.g. your web browser) will do: they parse the HTML content into an internal tree representation, which is called the DOM, the Document Object Model.

The following image is a screenshot of Chrome's developer tools and shows the DOM in its textual representation, which - in our example - is quite similar to our HTML code.

One thing to keep in mind: although the DOM tree in our example is quite similar to our HTML code, there's no guarantee that this will always be the case, and the DOM tree may vary greatly from the HTML code the server originally sent. Why is that, you ask? Our good old friend JavaScript.

As long as there is no JavaScript involved, the DOM tree will mostly match what the server sent, however with JavaScript all bets are off and the DOM tree may have been heavily manipulated by it. Especially SPAs often only send a basic HTML skeleton, which then gets "enriched" by JavaScript. Take Twitter for example, whenever you scroll to the bottom of the page, some JavaScript code will fetch new tweets and will append them to the page, and by that, to the DOM tree.

Now that we have learned (or rather refreshed) the basics of HTML and the DOM, we can dive into XPath.

XPath Syntax

First let's have a look at some XPath vocabulary:

  • Nodes - there are different types of nodes: the root node, element nodes, attribute nodes, and so-called atomic values, which is a synonym for text nodes in an HTML document.
  • Parents - the immediate element containing the current element. Each element node has one parent. In our example above, html is the parent of head and body, whereas body is the parent of the site's actual content.
  • Children - the immediate elements contained by the current element. Element nodes can have any number of children. In our example, h1 and the two p elements are all children of body.
  • Siblings - nodes on the same level as the current element. In our example, head and body are siblings (in their function as children of html), as are h1 and the two p elements (in their function as children of body).
  • Ancestors - a list of all parent elements of the current element.
  • Descendants - a list of all child elements (with their own children) of the current element.

Following is a list of the fundamental syntax elements, which you will use to assemble your XPath expressions.
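As a rough illustration of the fundamental syntax elements, here is a minimal sketch using Python's standard-library xml.etree.ElementTree module, which supports only a subset of XPath. The sample page is hypothetical and merely mirrors the html/head/body/h1/p structure described above. Because ElementTree evaluates paths relative to an element, the expressions start with a dot (.//p), whereas in a browser or in Selenium you would typically write //p.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, mirroring the html/head/body/h1/p structure described above.
PAGE = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Welcome</h1>
    <p class="intro">First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# "/" steps from a node to its direct children
h1 = root.find("./body/h1")

# "//" searches all descendants, at any depth
paragraphs = root.findall(".//p")

# "*" matches any element; here: all direct children of <html>
children = [el.tag for el in root.findall("./*")]

# ".." moves up to the parent; here: the (deduplicated) parents of all <p> elements
parents = [el.tag for el in root.findall(".//p/..")]

# "@" refers to an attribute (ElementTree supports it only inside predicates)
intro = root.find(".//p[@class='intro']")

print(h1.text, [p.text for p in paragraphs], children, parents, intro.text)
```

The same expressions, written with a leading // instead of .//, should behave the same way in a full XPath engine.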

XPath Predicates

XPath also supports predicates, which allow you to filter the list of elements you got with your original expression. Predicates are appended to your XPath expression in square brackets, as in [mypredicate].
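To make predicates a bit more concrete, here is a small sketch, again with the standard-library ElementTree module and its limited XPath subset. The list markup, the IDs, and the class name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical list markup, used only for illustration.
DOC = """
<ul>
  <li id="first">one</li>
  <li>two</li>
  <li class="highlight">three</li>
</ul>
"""
ul = ET.fromstring(DOC)

# Positional predicate: positions are counted from 1, not 0
first = ul.find("./li[1]")

# last() picks the final matching sibling
last = ul.find("./li[last()]")

# [@class] keeps elements that carry a class attribute at all
with_class = ul.findall("./li[@class]")

# [@id='first'] keeps elements whose attribute equals the given value
by_id = ul.find("./li[@id='first']")

print(first.text, last.text, [el.text for el in with_class], by_id.text)
```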

XPath Examples

All right, now that we have covered the basic syntax, let's check out a few examples based on the HTML code from our previous example.

XPath In The Browser

Fortunately, browsers support XPath natively, so just open your favourite website, press F12 to open the developer tools, and switch over to the Elements/Inspector tab to show the current page's DOM tree.

Now, just press Ctrl/Cmd + F and you should get a DOM search field where you can enter any XPath expression and, upon Enter, your browser should highlight the next match.

💡 The developer tools also provide a convenient way to get the XPath expression for any DOM element. Just right-click a DOM element and copy the XPath.

XPath with Python

There are lots of Python packages with support for XPath.

For the following examples, we are going to use Selenium with Chrome in headless mode. Please check out Scraping Single Page Application with Python for more details on how to set up the environment.

1. E-commerce product data extraction

In this example, we will load an Amazon product page and then use a couple of XPath expressions to select the product name, its price, and its product image.

While all the browser setup calls can be fascinating (after all, we really run a full-fledged browser instance with that code), we really want to focus on the following expressions in this tutorial.

  • //span[@id="productTitle"]
  • //div[@id="corePrice_feature_div"]//span[@data-a-color="price"]/span[1]
  • //div[@id="imgTagWrapperId"]/img

All three expressions are relative ones (note the //), which means we are selecting elements from the entire DOM tree without specifying a fully absolute path. Each expression also uses a predicate to filter based on the elements' IDs.

  • The first expression simply selects a <span> tag with the ID "productTitle". This should give us the product name.
  • The second expression selects a <div> tag with the ID "corePrice_feature_div", then searches its descendants for a <span data-a-color="price"> tag and uses that tag's first immediate <span> child for the product price.
  • Last but not least, the image URL. Here, we search for a <div id="imgTagWrapperId"> tag and select its immediate <img> child.

Our example was still relatively easy because we had the luxury of HTML IDs which should be unique. If, for example, you were to filter for HTML classes, you may have to pay more attention and might have to resort to absolute paths.
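The three expressions can be tried out without a browser. The following is only a sketch: it swaps Selenium for the standard-library ElementTree module and runs against a hypothetical, heavily simplified snippet (the product name, price, and image URL are invented), not against Amazon's real markup. ElementTree needs the leading dot because it evaluates paths relative to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a product page; real markup is far larger.
PAGE = """
<html><body>
  <span id="productTitle">  Example Mug  </span>
  <div id="corePrice_feature_div">
    <div>
      <span data-a-color="price"><span>$12.99</span><span>$12.99</span></span>
    </div>
  </div>
  <div id="imgTagWrapperId"><img src="https://example.com/mug.jpg"/></div>
</body></html>
"""
root = ET.fromstring(PAGE)

# Product name: the <span> with the ID "productTitle"
name = root.find('.//span[@id="productTitle"]').text.strip()

# Price: first <span> child of the price <span> somewhere below the price <div>
price = root.find(
    './/div[@id="corePrice_feature_div"]//span[@data-a-color="price"]/span[1]'
).text

# Image URL: src attribute of the <img> directly inside the wrapper <div>
image_url = root.find('.//div[@id="imgTagWrapperId"]/img').get("src")

print(name, price, image_url)
```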

2. A generic approach to submit login forms with XPath

When you scrape sites, you often have to authenticate against the site. While login forms have different styles and layouts, they usually follow a similar format, with one text field for the username, another one for the password, and finally one submit button.

Even if the format is the same, the DOM structure will differ from site to site - and that's exactly where we can employ XPath and its DOM navigation capabilities to create a "generic" authentication function. Our function will take a Selenium driver object, a URL, a username, and a password and will use all of that to log you into the site.

All right, what have we exactly done here?

  • We simply loaded the specified URL into our driver object with the get() method. Pretty straightforward so far, right?
  • Now we used the expression //input[@type="password"] to find an <input type="password" /> tag and, as we boldly assume it's our one and only password field 😎, entered the provided password (send_keys()). Still easy, isn't it? Don't fret, it's getting more complicated, as we are now searching backwards.
  • Next up, finding the username field. Starting from the password field, we went backwards in the DOM (preceding::input) and tried to find the immediate previous <input /> field which is not hidden. Again, we boldly assume it's our username field and enter the provided username (send_keys()). The hidden part is rather important here, as forms tend to contain such additional fields, for example with one-time tokens to prevent Cross-Site Request Forgery attempts. If we did not exclude such hidden fields, we'd select the wrong input element.
  • We filled in the form and only need to submit it, but for that we should first find the form. We used ancestor::form to find the <form> tag which encloses our password field.
  • We made it. We found the form element, filled in the relevant authentication elements, and now only need to submit the form. For that, we used *[@type="submit"] to get, from within the form context, any tag of the "submit" type and click it. Done and dusted! 🥳

Please do keep in mind that, while this example will work with many sites and will save you the time of analysing each login page manually, it's primarily still a basic showcase for XPath and there will be plenty of sites where it won't work (e.g. combined sign-up/sign-in pages), so please don't use it as a drop-in solution for all your scrapers.

3. Handling and filtering HTML content

If you are a regular reader of our blog and its tutorials, you will certainly have noticed that we very much like to provide samples on scraping Hacker News, and of course, we'd like to continue this tradition in this article as well.

So, let's scrape, once again, the first three pages of https://news.ycombinator.com/newest, but this time with an XPath twist. 🥂

Once more, let us take a step-by-step deep-dive into what exactly we did here.

  • We started with lots of initialisation 😅
  • Then, we loaded our start page https://news.ycombinator.com/newest
  • Based on the HTML content we received, we used an XPath expression with two predicates and parent pointers to get the grandparent of the selected elements
  • Now, we looped over all found elements and stored their IDs in an array, together with the link details of a child <a> tag
  • We said we wanted the first three pages, right? So, let's find the next button, click it, and GOTO 2
  • We have collected now quite a bit of information, so it would be a waste not to print it at least, wouldn't it?
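The "two predicates and parent pointers" step can be sketched as follows. This is an assumption-laden illustration, not the code used for this article: the table markup, the class names "title" and "storylink", and the row IDs are hypothetical, the real Hacker News page differs, and the standard-library ElementTree module stands in for Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; the real Hacker News page differs.
HTML = """
<table>
  <tr id="101"><td class="title"><a class="storylink" href="https://example.com/a">Story A</a></td></tr>
  <tr id="102"><td class="subtext"><a href="user?id=x">x</a></td></tr>
  <tr id="103"><td class="title"><a class="storylink" href="https://example.com/b">Story B</a></td></tr>
</table>
"""
table = ET.fromstring(HTML)

# Two predicates narrow down the anchors; "/../.." then climbs from each
# anchor to its parent <td> and on to its grandparent <tr>.
rows = table.findall(".//td[@class='title']/a[@class='storylink']/../..")

# Store each row's ID together with the link details of its child <a> tag
stories = []
for row in rows:
    link = row.find(".//a")
    stories.append({"id": row.get("id"), "title": link.text, "url": link.get("href")})

for story in stories:
    print(story)
```

Selecting the rows directly would be simpler, but the detour through the anchors shows how .. lets an expression walk back up the tree.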

As before, in a real-world setting we could optimise that code to some extent (e.g. we would not need to search for the anchor tags, only to then go straight to their table row parents), but the point of this exercise was of course to show more XPath use cases, and occasionally there definitely are sites with an HTML structure requiring such acrobatics.

XPath is a very versatile, compact, and expressive tool when it comes to XML (and for that matter HTML) and is often more powerful than CSS selectors, which are very similar in nature of course.

While XPath expressions may seem complicated at the beginning, the really challenging bit often is not the expression itself, but getting the right path: one that is precise enough to select the desired element and, at the same time, flexible enough to not immediately break when there are minor changes in the DOM tree.

At ScrapingBee, as we mentioned at the beginning of the article, we really love XPath and CSS selectors and our scraping API makes heavy use of both technologies.

💡 While it can be fun to play and tweak XPath expressions, it can still take some significant time out of your business day. If you want an easier solution, please check out our no-code scraping platform. The first 1,000 requests are on us, of course.

I hope you enjoyed this article. If you're interested in more information on CSS selectors, please check out this BeautifulSoup tutorial.

We also wrote an article about XPath vs CSS selectors, so don't hesitate to check it out.

Happy Scraping!


Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.


Guru99

XPath in Selenium: How to Find & Write? (Text, Contains, AND)

Krishna Rungta

What is XPath in Selenium?

XPath in Selenium is an XML path used for navigation through the HTML structure of the page. It is a syntax or language for finding any element on a web page using an XML path expression. XPath can be used for both HTML and XML documents to find the location of any element on a webpage using the HTML DOM structure.

In Selenium automation, if elements cannot be found by the general locators like id, class, name, etc., then XPath is used to find an element on the web page.

In this tutorial, we will learn about XPath and the different XPath expressions used to find complex or dynamic elements, whose attributes change dynamically on refresh or other operations.

XPath Syntax

XPath contains the path of the element situated on the web page. The standard syntax for creating an XPath expression is:

Xpath=//tagname[@attribute='value']

The basic format of XPath in Selenium is explained below.

Basic Format of XPath

  • // : Selects matching nodes anywhere in the document, regardless of their position.
  • Tagname: Tag name of the particular node.
  • @: Selects an attribute.
  • Attribute: Attribute name of the node.
  • Value: Value of the attribute.

To find elements on web pages accurately, there are different types of locators:

Types of XPath

There are two types of XPath:

1) Absolute XPath

2) Relative XPath

Absolute XPath:

It is the direct way to find the element, but the disadvantage of absolute XPath is that if any changes are made in the path of the element, the XPath fails.

The key characteristic of an absolute XPath is that it begins with a single forward slash (/), which means the element is selected starting from the root node.

Below is an example of an absolute XPath expression for the element shown in the screen below.

NOTE: You can practice the following XPath exercise on this http://demo.guru99.com/test/selenium-xpath.html


/html/body/div[2]/div[1]/div/h4[1]/b

/html[1]/body[1]/div[2]/div[1]/div[1]/h4[1]/b[1]


Relative Xpath:

Relative XPath starts from the middle of the HTML DOM structure. It begins with a double forward slash (//) and can search for elements anywhere on the webpage, so there is no need to write a long XPath from the root. Relative XPath is usually preferred because it does not depend on the complete path from the root element.

Below is the example of a relative XPath expression of the same element shown in the below screen. This is the common format used to find element by XPath.

Relative XPath: //div[@class='featured-box cloumnsize1']//h4[1]//b[1]
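To see why the relative form is more robust, consider the following sketch. It uses Python's standard-library ElementTree module instead of Selenium, and a hypothetical page that only loosely resembles the demo site. ElementTree evaluates paths relative to an element, so the absolute-style path is written from the root element as ./body/... rather than /html/body/...

```python
import xml.etree.ElementTree as ET

# Hypothetical page structure, loosely modelled on the expressions above.
PAGE = """
<html><body>
  <div>header</div>
  <div>
    <div class="featured-box cloumnsize1"><h4><b>Testing</b></h4></div>
  </div>
</body></html>
"""
root = ET.fromstring(PAGE)

# Absolute style: every step from the top of the document is spelled out
absolute_style = "./body/div[2]/div[1]/h4[1]/b"
# Relative style: anchor on an attribute, wherever the element lives
relative_style = ".//div[@class='featured-box cloumnsize1']//h4[1]//b[1]"

before_abs = root.find(absolute_style)
before_rel = root.find(relative_style)

# Simulate a small layout change: a new <div> is added at the top of <body>
root.find("./body").insert(0, ET.Element("div"))

after_abs = root.find(absolute_style)  # positions shifted, path no longer matches
after_rel = root.find(relative_style)  # still finds the same element

print(before_abs.text, before_rel.text, after_abs, after_rel.text)
```

After the layout change, the absolute-style path returns nothing, while the relative one still resolves to the same <b> element.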


What are XPath axes?

XPath axes search different nodes in an XML document relative to the current context node. XPath axes are the methods used to find dynamic elements that cannot otherwise be found by a normal XPath method because they have no ID, class name, name, etc.

Axes methods are used to find elements that change dynamically on refresh or other operations. A few axes methods are commonly used in Selenium WebDriver, such as child, parent, ancestor, sibling, preceding, self, etc.

How To Write Dynamic XPath In Selenium WebDriver

1) Basic XPath:

An XPath expression selects nodes or a list of nodes on the basis of attributes like ID, Name, Classname, etc. from the XML document, as illustrated below.

Xpath=//input[@name='uid']

Here is a link to access the page http://demo.guru99.com/test/selenium-xpath.html


Some more basic xpath expressions:

2) Contains():

Contains() is a function used in XPath expressions. It is used when the value of an attribute changes dynamically, for example, login information.

The contains() function can find an element using partial text, as shown in the XPath example below.

In this example, we try to identify the element by using just a partial value of the attribute. In the XPath expression below, the partial value ‘sub’ is used in place of the full value ‘submit’. It can be observed that the element is found successfully.

The complete value of ‘type’ is ‘submit’, but only the partial value ‘sub’ is used.

Xpath=//*[contains(@type,'sub')]

The complete value of ‘name’ is ‘btnLogin’, but only the partial value ‘btn’ is used.

Xpath=//*[contains(@name,'btn')]

In the above expression, we have taken ‘name’ as the attribute and ‘btn’ as a partial value. This will find 2 elements (LOGIN & RESET), as their ‘name’ attributes contain ‘btn’.

Similarly, in the expression below, we have taken ‘id’ as the attribute and ‘message’ as a partial value. This will find 2 elements (‘User-ID must not be blank’ & ‘Password must not be blank’), as their ‘id’ attributes contain ‘message’.

Xpath=//*[contains(@id,'message')]

In the expression below, we have taken the text of the link as the value to match and ‘here’ as a partial value. This will find the link (‘here’), as it displays the text ‘here’.

Xpath=//*[contains(text(),'here')]

Xpath=//*[contains(@href,'guru99.com')]

3) Using OR & AND:

In an OR expression, two conditions are used, and an element is found if the first condition, the second condition, or both are true. At least one condition must be true to find the element.

The XPath expression below identifies the elements for which one or both conditions are true.

Xpath=//*[@type='submit' or @name='btnReset']

This highlights both elements: the “LOGIN” element, which matches on the attribute ‘type’, and the “RESET” element, which matches on the attribute ‘name’.

In an AND expression, two conditions are used, and both must be true to find the element. It fails to find the element if either condition is false.

Xpath=//input[@type='submit' and @name='btnLogin']

The above expression highlights the ‘LOGIN’ element, as it matches on both the ‘type’ and the ‘name’ attribute.

4) XPath Starts-with

XPath starts-with() is a function used for finding a web element whose attribute value changes on refresh or through other dynamic operations on the webpage. In this method, the starting text of the attribute is matched to find the element whose attribute value changes dynamically. You can also find elements whose attribute value is static (does not change).

For example, suppose the ID of a particular element changes dynamically like:

Id="message12"

Id="message345"

Id="message8769"

and so on, but the initial text is the same. In this case, we use the starts-with expression.

In the expression below, there are two elements with an id starting with “message” (i.e., ‘User-ID must not be blank’ & ‘Password must not be blank’). In the example below, XPath finds those elements whose ‘ID’ starts with ‘message’.

Xpath=//label[starts-with(@id,'message')]


5) XPath Text() Function

The XPath text() function is a built-in part of XPath (not of Selenium WebDriver itself) and is used to locate elements based on the text of a web element. It helps to find elements by exact text, matching against the text nodes of the element. The text to be matched is given as a string.

In this expression, with the text function, we find the element with an exact text match as shown below. In our case, we find the element with the text “UserID”.

Xpath=//td[text()='UserID']
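Exact text matching can also be tried outside Selenium. The sketch below assumes a hypothetical two-row login table (the field names are invented) and uses Python's standard-library ElementTree module, which does not support text() in predicates. Instead it offers the closely related [.='UserID'] form, which compares the complete text content of the element.

```python
import xml.etree.ElementTree as ET

# Hypothetical login table; the field names are invented for illustration.
FORM = """
<table>
  <tr><td>UserID</td><td><input name="uid" type="text"/></td></tr>
  <tr><td>Password</td><td><input name="password" type="password"/></td></tr>
</table>
"""
table = ET.fromstring(FORM)

# [.='UserID'] keeps only <td> elements whose complete text equals "UserID"
label_cells = table.findall(".//td[.='UserID']")

# [td='UserID'] keeps rows that have a <td> child with that text, which
# gives a handle on the input field sitting next to the label
row = table.find(".//tr[td='UserID']")
field = row.find(".//input")

print(len(label_cells), field.get("name"))
```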


XPath axes methods:

These XPath axes methods are used to find the complex or dynamic elements. Below we will see some of these methods.

To illustrate these XPath axes methods, we will use the Guru99 bank demo site.

1) Following:

Selects all elements in the document that come after the closing tag of the current node [the UserID input box is the current node], as shown in the below screen.

Xpath=//*[@type='text']//following::input

There are 3 “input” nodes matched by using the “following” axis: the password field, the login button, and the reset button. If you want to focus on a particular element, you can use the XPath below:

Xpath=//*[@type='text']//following::input[1]

You can change the XPath as required by using [1], [2], and so on.

With the index ‘1’, the expression finds the ‘Password’ input box element.

2) Ancestor:

The ancestor axis selects all ancestor elements (grandparent, parent, etc.) of the current node, as shown in the below screen.

In the expression below, we are finding the ancestor elements of the current node (the “ENTERPRISE TESTING” node).

Xpath=//*[text()='Enterprise Testing']//ancestor::div

There are 13 “div” nodes matched by using the “ancestor” axis. If you want to focus on a particular element, you can use the XPath below, where you change the number 1, 2 as per your requirement:

Xpath=//*[text()='Enterprise Testing']//ancestor::div[1]

You can change the XPath as required by using [1], [2], and so on.

3) Child:

Selects all children elements of the current node (Java) as shown in the below screen.

Xpath=//*[@id='java_technologies']//child::li

There are 71 “li” nodes matched by using the “child” axis. If you want to focus on a particular element, you can use the XPath below:

Xpath=//*[@id='java_technologies']//child::li[1]

You can change the XPath as required by using [1], [2], and so on.

4) Preceding:

Selects all nodes that come before the current node, as shown in the below screen.

The expression below identifies all the input elements before the “LOGIN” button, that is, the Userid and password input elements.

Xpath=//*[@type='submit']//preceding::input

There are 2 “input” nodes matched by using the “preceding” axis. If you want to focus on a particular element, you can use the XPath below:

Xpath=//*[@type='submit']//preceding::input[1]

5) Following-sibling:

Selects the following siblings of the context node. Siblings are at the same level as the current node, as shown in the below screen. It will find the elements after the current node.

Xpath=//*[@type='submit']//following-sibling::input

One input node is matched by using the “following-sibling” axis.

6) Parent:

Selects the parent of the current node as shown in the below screen.

Xpath=//*[@id='rt-feature']//parent::div

There are 65 “div” nodes matched by using the “parent” axis. If you want to focus on a particular element, you can use the XPath below:

Xpath=//*[@id='rt-feature']//parent::div[1]

7) Self:

Selects the current node; ‘self’ indicates the node itself, as shown in the below screen.

One node is matched by using the “self” axis. It always finds only one node, as it represents the element itself.

Xpath=//*[@type='password']//self::input

8) Descendant:

Selects the descendants of the current node as shown in the below screen.

The expression below identifies all the descendant elements of the current element (the ‘Main body surround’ frame element), which means all nodes below it (child node, grandchild node, etc.).

Xpath=//*[@id='rt-feature']//descendant::a

There are 12 “link” nodes matched by using the “descendant” axis. If you want to focus on a particular element, you can use the XPath below:

Xpath=//*[@id='rt-feature']//descendant::a[1]

XPath is required to find an element on the web page in order to perform an operation on that particular element.

  • XPath axes are the methods used to find dynamic elements that cannot otherwise be found by a normal XPath method.
  • An XPath expression selects nodes or a list of nodes on the basis of attributes like ID, Name, Classname, etc. from the XML document.
