The Agenty CSS extractor offers multiple extract types for pulling data out of an HTML element. You can define the CSS selector with the Agenty Chrome extension or manually in the scraping agent editor, and then choose the extract type in the Extract drop-down box, without writing any code. The following 5 extract types are available in a scraping agent when the field type is set to CSS.
Let’s explore them …
- TEXT
- HTML
- ATTR
- InnerHTML
- OuterText
TEXT
The TEXT extractor extracts the plain text of the element matched by the given CSS selector. For example, to extract only the text of the p tag from the sample HTML below, use the .page > p selector and set the extract type to TEXT in the Agenty CSS extractor.
Sample HTML :
<div class="page">
<h1>Some Heading</h1>
<p>This is some text with <a href="https://www.domain.com">link to other page.</a></p>
</div>
Selector :
.page > p
Type : CSS
Extract : TEXT
Result :
This is some text with link to other page.
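The TEXT behaviour can be approximated outside Agenty with a few lines of Python. This is only a sketch using the standard library, not Agenty's own implementation: ElementTree has no CSS selector support, so the path ./p is an assumed stand-in for .page > p, and the sample is written with its outer div closed so that it parses.

```python
import xml.etree.ElementTree as ET

# Sample markup from above, with the outer <div> closed so it parses as XML.
SAMPLE = """<div class="page">
<h1>Some Heading</h1>
<p>This is some text with <a href="https://www.domain.com">link to other page.</a></p>
</div>"""

root = ET.fromstring(SAMPLE)  # root is the element with class "page"
p = root.find("./p")          # stand-in for the CSS selector ".page > p"

# TEXT: concatenate every text node under the element, dropping all tags.
text = "".join(p.itertext())
print(text)
```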
HTML
The HTML extract type extracts the complete HTML source of the element matched by the given selector. For example, to extract the complete p tag from the sample HTML below, use the .page > p selector and set the extract type to HTML in the CSS extractor.
Sample HTML :
<div class="page">
<h1>Some Heading</h1>
<p>This is some text with <a href="https://www.domain.com">link to other page.</a></p>
</div>
Selector :
.page > p
Type : CSS
Extract : HTML
Result :
<p>This is some text with <a href="https://www.domain.com">link to other page.</a></p>
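As a rough illustration of what the HTML extract type returns, the selected element can be serialized together with its own tag. The snippet is a sketch with Python's standard library and not Agenty's implementation; the path ./p is an assumed stand-in for .page > p.

```python
import xml.etree.ElementTree as ET

# Sample markup from above, with the outer <div> closed so it parses as XML.
SAMPLE = """<div class="page">
<h1>Some Heading</h1>
<p>This is some text with <a href="https://www.domain.com">link to other page.</a></p>
</div>"""

root = ET.fromstring(SAMPLE)
p = root.find("./p")  # stand-in for the CSS selector ".page > p"

# HTML: serialize the element itself, opening and closing tags included.
# tostring() also emits the whitespace that follows the element, so strip it.
outer_html = ET.tostring(p, encoding="unicode").strip()
print(outer_html)
```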
ATTR
The attribute extractor extracts the value of any attribute on the selected element. The ATTR option is mostly used to extract hyperlinks and image URLs, but it can extract any attribute value if you provide the name of the attribute. For example, to extract the href link from the sample HTML given below, use .page > p > a as the selector (the a tag sits inside the p tag, so it is not a direct child of .page) and then the ATTR option with href as the name of the attribute whose value you want.
Sample HTML :
<div class="page">
<h1>Some Heading</h1>
<p>This is some text with <a href="https://www.domain.com">link to other page.</a></p>
</div>
Selector :
.page > p > a
Type : CSS
Extract : ATTR
Attribute: href
Result :
https://www.domain.com
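The same attribute lookup can be sketched in Python with the standard library. This is an illustration only, not Agenty's implementation. In the sample, the link is nested inside the p tag, so the assumed ElementTree path ./p/a walks through the paragraph to reach it.

```python
import xml.etree.ElementTree as ET

# Sample markup from above, with the outer <div> closed so it parses as XML.
SAMPLE = """<div class="page">
<h1>Some Heading</h1>
<p>This is some text with <a href="https://www.domain.com">link to other page.</a></p>
</div>"""

root = ET.fromstring(SAMPLE)
a = root.find("./p/a")  # the link is a child of <p>, which is a child of .page

# ATTR: read one named attribute from the selected element.
href = a.get("href")
print(href)
```

Any other attribute name works the same way; a.get() returns None when the attribute is absent.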
InnerHTML
The InnerHTML extract type extracts the inner HTML of the element matched by the given selector. For example, to extract the content of the p tag from the sample HTML given below without the p tag itself in the result, only its inner portion, use the .page > p selector and set the extract type to InnerHTML in the CSS extractor.
Sample HTML :
<div class="page">
<h1>Some Heading</h1>
<p>This is some text with <a href="https://www.domain.com">link to other page.</a></p>
</div>
Selector :
.page > p
Type : CSS
Extract : InnerHTML
Result :
This is some text with <a href="https://www.domain.com">link to other page.</a>
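The difference from the HTML extract type is that the element's own tag is left out. A minimal sketch of that idea with Python's standard library (not Agenty's implementation; ./p is an assumed stand-in for .page > p):

```python
import xml.etree.ElementTree as ET

# Sample markup from above, with the outer <div> closed so it parses as XML.
SAMPLE = """<div class="page">
<h1>Some Heading</h1>
<p>This is some text with <a href="https://www.domain.com">link to other page.</a></p>
</div>"""

root = ET.fromstring(SAMPLE)
p = root.find("./p")  # stand-in for the CSS selector ".page > p"

# InnerHTML: the element's leading text plus each child serialized with its
# tags (tostring() also includes the text that follows each child).
inner_html = (p.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in p
)
print(inner_html)
```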
OuterText
The OuterText extract type extracts only the element's own text, by first deleting all the child elements of the selected element. This helps when the target HTML contains text you want to extract, but that text is not wrapped in a child element of its own that a nested selector could target, so you need a way to pick that text alone instead of the entire TEXT of the selector.
For example, the sample HTML given below has a discount value: Discount (15%). This is plain text content after the br tag and doesn’t have its own HTML tag that could be used to extract the discount value in a scraping agent field.
If you use the .price selector with the extract type TEXT, the result is $49 Discount (15%) $41.65 (because the TEXT extractor extracts the entire text of the given selector, and the div tags with the old-price and new-price classes are also part of the .price selector).
Here, we need the OuterText option, so that the extractor first deletes all the child elements from the given selector and then extracts the leftover text.
Sample HTML :
<div class="page">
<h1>Some Heading</h1>
<div class="price">
<div class="old-price">$49</div>
<br/> Discount (15%)
<div class="new-price">$41.65</div>
</div>
</div>
Selector :
.price
Type : CSS
Extract : OuterText
Result :
Discount (15%)
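To make the contrast between TEXT and OuterText concrete, here is a small Python sketch using only the standard library. It is not Agenty's implementation: the ElementTree path with a class predicate is an assumed stand-in for the .price selector, and whitespace is collapsed for readability.

```python
import xml.etree.ElementTree as ET

# Sample markup from above, with both <div> elements closed so it parses as XML.
SAMPLE = """<div class="page">
<h1>Some Heading</h1>
<div class="price">
<div class="old-price">$49</div>
<br/> Discount (15%)
<div class="new-price">$41.65</div>
</div>
</div>"""

root = ET.fromstring(SAMPLE)
price = root.find("./div[@class='price']")  # stand-in for ".price"

def squash(s: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(s.split())

# TEXT, for comparison: every text node under the element, children included.
full_text = squash("".join(price.itertext()))

# OuterText: only the text that belongs to the element itself, that is, its
# leading text plus the text following each child; child contents are ignored.
pieces = [price.text or ""] + [child.tail or "" for child in price]
outer_text = squash("".join(pieces))

print(full_text)   # $49 Discount (15%) $41.65
print(outer_text)  # Discount (15%)
```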