Scraping Web API Data Using JSONPath Query Selectors

Agenty allows you to extract data from JSON web APIs using JSONPath query selectors. JSONPath is a query language for JSON that lets us refer to parts of a JSON object structure in the same way that XPath expressions do for XML documents.

You can apply JSONPath expressions in a scraping agent to refer to specific objects or elements in the response, and extract any field from the JSON by using JSON as the field type.

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is widely used by web APIs to present data in a structured way or to integrate with other apps. Many websites offer API access, so this capability matters in any web scraping tool: being able to scrape JSON or XML directly from an API allows Agenty users to centralize all their data collection agents on a single platform.

In this tutorial, we will learn how to scrape data from a web API using JSONPath query selectors.

JSON Example

For this demo, I have created this JSON example page: https://cdn.agenty.com/examples/json/json-example-1.json. If you look at the JSON content, it is an array of objects where each object has five properties (rank, cnt, tt, au, yr) and their corresponding values.

[
  {
    "rank": 1,
    "cnt": 5,
    "tt": "The Great Gatsby",
    "au": "F. Scott Fitzgerald",
    "yr": "1990"
  },
  {
    "rank": 2,
    "cnt": 5,
    "tt": "The Grapes of Wrath",
    "au": "John Steinbeck",
    "yr": "1972"
  },
  {
    "rank": 3,
    "cnt": 5,
    "tt": "The Catcher in the Rye",
    "au": "J.D. Salinger",
    "yr": "1993"
  },
  {
    "rank": 4,
    "cnt": 4,
    "tt": "Invisible Man",
    "au": "Ralph Ellison",
    "yr": "1988"
  },
  {
    "rank": 5,
    "cnt": 4,
    "tt": "The Sound and the Fury",
    "au": "William Faulkner",
    "yr": "1987"
  },
  {
    "rank": 6,
    "cnt": 4,
    "tt": "The Sun Also Rises",
    "au": "",
    "yr": "1988"
  },
  {
    "rank": 7,
    "cnt": 4,
    "tt": "Things Fall Apart",
    "au": "Chinua Achebe",
    "yr": "1996"
  },
  {
    "rank": 8,
    "cnt": 4,
    "tt": "Lolita",
    "au": "Vladimir Vladimirovich Nabokov",
    "yr": "1983"
  },
  {
    "rank": 9,
    "cnt": 4,
    "tt": "A Passage to India",
    "au": "E. M. Forster",
    "yr": "1984"
  },
  {
    "rank": 10,
    "cnt": 4,
    "tt": "1984",
    "au": "George Orwell",
    "yr": "1977"
  },
  {
    "rank": 11,
    "cnt": 4,
    "tt": "Beloved",
    "au": "Toni Morrison",
    "yr": "1987"
  },
  {
    "rank": 12,
    "cnt": 4,
    "tt": "Native Son",
    "au": "Richard T. Wright",
    "yr": "1940"
  },
  {
    "rank": 13,
    "cnt": 4,
    "tt": "Catch-22",
    "au": "Joseph Heller",
    "yr": ""
  },
  {
    "rank": 14,
    "cnt": 4,
    "tt": "Go Tell it on the Mountain",
    "au": "James Baldwin",
    "yr": "1954"
  },
  {
    "rank": 15,
    "cnt": 4,
    "tt": "On the Road",
    "au": "Jack Kerouac",
    "yr": "1991"
  },
  {
    "rank": 16,
    "cnt": 3,
    "tt": "ULYSSES",
    "au": "James Joyce",
    "yr": "1961"
  },
  {
    "rank": 17,
    "cnt": 3,
    "tt": "Don Quixote",
    "au": "Miguel de Cervantes",
    "yr": "1982"
  },
  {
    "rank": 18,
    "cnt": 3,
    "tt": "To the Lighthouse",
    "au": "Virginia Woolf",
    "yr": "1982"
  },
  {
    "rank": 19,
    "cnt": 3,
    "tt": "Madame Bovary",
    "au": "Gustave Flaubert",
    "yr": "1998"
  },
  {
    "rank": 20,
    "cnt": 3,
    "tt": "An American Tragedy",
    "au": "Theodore Dreiser",
    "yr": "2000"
  }
]
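To get a feel for this structure before writing any selectors, the response can be inspected with a few lines of Python. This is only a minimal sketch and not part of Agenty: it embeds three of the objects above as a string instead of fetching the URL, and the variable names are illustrative.

```python
import json

# Three objects copied from the example above (ranks 1, 6 and 13),
# embedded as a string so the sketch runs without network access.
SAMPLE = """
[
  {"rank": 1, "cnt": 5, "tt": "The Great Gatsby", "au": "F. Scott Fitzgerald", "yr": "1990"},
  {"rank": 6, "cnt": 4, "tt": "The Sun Also Rises", "au": "", "yr": "1988"},
  {"rank": 13, "cnt": 4, "tt": "Catch-22", "au": "Joseph Heller", "yr": ""}
]
"""

books = json.loads(SAMPLE)

# The top level is an array (a Python list) of objects (dicts).
print(type(books).__name__, len(books))

# Every object carries the same five property names.
print(sorted(books[0].keys()))

# rank and cnt are JSON numbers while yr is a string; missing values
# show up as empty strings rather than null.
missing = [(b["rank"], key) for b in books for key, value in b.items() if value == ""]
print(missing)
```

Note that some values, such as au for rank 6 and yr for rank 13, are empty strings, so a selector still returns a value for those objects.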

Since this is not an HTML page, we cannot use our Chrome Extension to generate CSS selectors automatically.

Instead, we need to create our agent manually and then edit it in Agenty to add or update the fields and the URL. So let’s create a placeholder agent from the samples, or you can create one from any website.

Create Agent

  1. Go to agents
  2. Click on New Agent
  3. Get any of the example agents available in the Sample Agent section. (We are going to edit the URL, selectors, fields, etc., so any demo agent will do as a starting point.)

JSONPath Reference

We will use JSONPath selectors to extract the individual property values from the JSON objects. You can use any online JSONPath testing tool to build or test your selectors. I am going to use jsonpath.com in this example because it shows the result instantly as we type the selector: enter the sample JSON in the Inputs box, and the tool displays the result as we type the query selector. Here is a reference for the basic JSONPath syntax:

Expression | Description
$ | The root object or array.
object.property or ['object']['property'] | Selects the property named property in the object named object. Note: Use the bracket notation if the name of the property includes special characters (for example, spaces), or begins with a character other than A..Za..z_.
[n] | Selects the n-th element from an array. Indexes start from 0.
[n1, n2] | Selects the array items n1 and n2. Returns a list.
..property | Performs a deep scan for the specified property in all available objects. Always returns a list, even for a single match.
* | Wildcard. Selects all elements in an object or array, regardless of their names or indexes.

Note: All JSONPath expressions (including property names and values) are case-sensitive. See more examples in the JsonPath GitHub repository.
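To make the reference concrete, here is a rough sketch of what each expression selects, written as plain Python over the first three objects of the example. It uses no JSONPath library; the deep_scan helper and the variable names are illustrative stand-ins, not how Agenty or jsonpath.com evaluate selectors.

```python
import json

# First three objects from the example JSON.
books = json.loads("""
[
  {"rank": 1, "cnt": 5, "tt": "The Great Gatsby", "au": "F. Scott Fitzgerald", "yr": "1990"},
  {"rank": 2, "cnt": 5, "tt": "The Grapes of Wrath", "au": "John Steinbeck", "yr": "1972"},
  {"rank": 3, "cnt": 5, "tt": "The Catcher in the Rye", "au": "J.D. Salinger", "yr": "1993"}
]
""")


def deep_scan(node, name):
    """Collect every value stored under `name` at any depth (like ..name)."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                found.append(value)
            found.extend(deep_scan(value, name))
    elif isinstance(node, list):
        for item in node:
            found.extend(deep_scan(item, name))
    return found


root = books                          # $         the root array
first = books[0]                      # $[0]      indexes start from 0
picked = [books[0], books[2]]         # $[0,2]    two items, returned as a list
first_title = books[0]["tt"]          # $[0].tt   or  $[0]['tt']
all_items = list(books)               # $[*]      wildcard over the array
all_titles = deep_scan(books, "tt")   # $..tt     deep scan, always a list
```

Because property names are case-sensitive, deep_scan(books, "TT") returns an empty list here.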

JSONPath Notation

A JSONPath expression describes the path to a single property or set of properties in a JSON structure. So, we can use any of the following notations:

Dot notation

For example, $.[*].rank extracts the rank property from all objects.

Bracket notation

For example, $[0]['rank'] extracts the rank property from the first object in the array only.

So, using the above reference, if we want to extract the rank field from the JSON API, our JSONPath query selector will be $.[*].rank

Similarly, for the next field cnt, the JSONPath will be $.[*].cnt

And for the tt field, it will be $.[*].tt

We can create the selectors for the au and yr fields in the same way: $.[*].au and $.[*].yr
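All five selectors share one shape: take every object in the root array and read one property from it. The sketch below mimics that in Python for the au and yr selectors. The select_from_each function is a hypothetical helper that understands only this single shape (with or without the dot after $); a real JSONPath engine supports far more.

```python
import json
import re

# First three objects from the example JSON.
books = json.loads("""
[
  {"rank": 1, "cnt": 5, "tt": "The Great Gatsby", "au": "F. Scott Fitzgerald", "yr": "1990"},
  {"rank": 2, "cnt": 5, "tt": "The Grapes of Wrath", "au": "John Steinbeck", "yr": "1972"},
  {"rank": 3, "cnt": 5, "tt": "The Catcher in the Rye", "au": "J.D. Salinger", "yr": "1993"}
]
""")

# Matches "$.[*].name" (as written in this tutorial) and "$[*].name".
SELECTOR = re.compile(r"^\$\.?\[\*\]\.([A-Za-z_][A-Za-z0-9_]*)$")


def select_from_each(items, selector):
    """Return the named property from every object that has it."""
    match = SELECTOR.match(selector)
    if match is None:
        raise ValueError(f"unsupported selector: {selector}")
    name = match.group(1)
    return [item[name] for item in items if name in item]


authors = select_from_each(books, "$.[*].au")
years = select_from_each(books, "$.[*].yr")
print(authors)
print(years)
```

Each selector produces one list of values, one value per object, which is why every field in the agent gets its own selector.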

Add Fields

Now that we have tested the JSONPath query selectors for all the fields we want to scrape, we need to edit the scraping agent and add these fields with the field type set to JSON.

Edit the scraping agent by clicking on the Edit tab on the scraping agent page.

  1. Add a new field and give it a name, for example rank.
  2. Select JSON as the field Type and paste your JSONPath query selector in the JSON Path box.
  3. Similarly, add the next field cnt and enter its JSONPath query selector in the JSON Path box.
  4. Add the next field tt and enter its JSONPath query selector in the JSON Path box.
  5. Add the next field au and enter its JSONPath query selector in the JSON Path box.
  6. Add the next field yr and enter its JSONPath query selector in the JSON Path box.
  7. In the same way, we can add any number of fields and enter their JSONPath query selectors by selecting JSON as the field type. This tells Agenty to use the JSON parser to extract those fields.
  8. Save the scraping agent configuration. (Remember, saving the agent only updates the configuration; we need to re-run the agent by clicking the Start button for the changes to be reflected in the result.)
  9. Change the SOURCE URL to the page with the JSON content, if it is not set already: https://cdn.agenty.com/examples/json/json-example-1.json. Or upload a list if you have multiple URLs.
  10. Re-run your agent to refresh the result according to the changes made in the agent configuration.
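Conceptually, the field setup above is a mapping from field names to JSONPath selectors, with each selector producing one column of the result. The following sketch reproduces that idea in Python and writes the rows as CSV. The FIELDS dictionary and the column helper are hypothetical illustrations, not Agenty's configuration format or parser, and the sketch assumes values are lined up into rows by their position in the array.

```python
import csv
import io
import json

# Three objects from the example JSON (ranks 5, 6 and 7).
books = json.loads("""
[
  {"rank": 5, "cnt": 4, "tt": "The Sound and the Fury", "au": "William Faulkner", "yr": "1987"},
  {"rank": 6, "cnt": 4, "tt": "The Sun Also Rises", "au": "", "yr": "1988"},
  {"rank": 7, "cnt": 4, "tt": "Things Fall Apart", "au": "Chinua Achebe", "yr": "1996"}
]
""")

# Hypothetical field list: field name -> JSONPath selector.
FIELDS = {
    "rank": "$.[*].rank",
    "cnt": "$.[*].cnt",
    "tt": "$.[*].tt",
    "au": "$.[*].au",
    "yr": "$.[*].yr",
}


def column(items, selector):
    """Evaluate only the "$.[*].<property>" shape used in this tutorial."""
    prefix = "$.[*]."
    if not selector.startswith(prefix):
        raise ValueError(f"unsupported selector: {selector}")
    name = selector[len(prefix):]
    # Use an empty string when an object lacks the property,
    # so every column keeps the same length.
    return [item.get(name, "") for item in items]


# One list of values per field, then one dict per row.
columns = {field: column(books, path) for field, path in FIELDS.items()}
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(FIELDS))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Adding another field is then just one more entry in the mapping, which mirrors adding a field with its own selector in the agent editor.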

Execution

Once the job has completed, we can see the JSON scraping result in the Result tab. We can also add any number of URLs whose APIs return a similar structure to scrape data from them.
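The same idea extends to a list of URLs that all return the same structure. As a rough sketch of that loop outside Agenty, the code below fetches each URL with Python's standard library and collects two of the fields. The function names, the source column, and the choice of fields are assumptions made for illustration; nothing is fetched until scrape is called.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and parse the body as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def scrape(urls, fetch=fetch_json):
    """Collect rank and tt from every object returned by every URL.

    `fetch` can be swapped for a stub, for example when testing offline.
    """
    rows = []
    for url in urls:
        for item in fetch(url):
            rows.append({
                "source": url,            # which URL the row came from
                "rank": item.get("rank"),
                "tt": item.get("tt"),
            })
    return rows
```

Calling scrape(["https://cdn.agenty.com/examples/json/json-example-1.json"]) would return one row per object in the example JSON, provided the URL is reachable.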

Try it out

We know you want to try it out, so we’ve uploaded this agent to the scraping agent demos. Log in to your account, go to New agent, and then open the Sample agents tab.

Now, click on the Get it button to clone the agent in your account.
