How to Scrape JSON Response with Scrapy


JSON (JavaScript Object Notation) is a widely used data format that many APIs return. This tutorial shows how to scrape JSON responses with the Scrapy framework in Python.

Scrapy is a well-known Python framework for web scraping that offers a straightforward and efficient way to extract data from websites. It is useful for tasks such as data mining and information processing, and in addition to general-purpose web crawling it can retrieve data from APIs.

Table of Contents

  1. Install Scrapy
  2. Building Scrapy Project and Spider
  3. Code Explanation
  4. Running Spider

1. Install Scrapy

To install Scrapy, simply enter the following command in your command line or terminal:

> pip install scrapy

Let’s now explore an example that demonstrates how to extract data from jService.io, specifically to obtain a random question and its corresponding answer. You can access the API endpoint for this example at: https://jservice.io/api/random.


The data returned by the API endpoint has the following structure:

[
  {
    "id": 23249,
    "answer": "a gun (revolver)",
    "question": "You're not allowed to use this invention of Samuel Colt in the park named for him in Hartford, Connecticut",
    "value": 200,
    "airdate": "1991-06-17T19:00:00.000Z",
    "created_at": "2022-12-30T18:47:08.052Z",
    "updated_at": "2022-12-30T18:47:08.052Z",
    "category_id": 2208,
    "game_id": 7533,
    "invalid_count": null,
    "category": {
      "id": 2208,
      "title": "usa",
      "created_at": "2022-12-30T18:47:04.495Z",
      "updated_at": "2022-12-30T18:47:04.495Z",
      "clues_count": 30
    }
  }
] 
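As a rough illustration of how this structure maps onto Python objects, the sketch below decodes an abbreviated copy of the sample payload with the standard json module. The variable names are only for illustration, and several fields are omitted to keep it short.

```python
import json

# Abbreviated copy of the sample payload shown above (some fields omitted).
sample_text = """
[
  {
    "id": 23249,
    "answer": "a gun (revolver)",
    "question": "You're not allowed to use this invention of Samuel Colt in the park named for him in Hartford, Connecticut",
    "value": 200,
    "category": {"id": 2208, "title": "usa", "clues_count": 30}
  }
]
"""

records = json.loads(sample_text)  # the top-level JSON array becomes a list
first = records[0]  # each JSON object becomes a dict
category_title = first["category"]["title"]  # nested object -> nested dict

print(type(records).__name__, len(records))
print(first["answer"])
print(category_title)
```

The point to notice is that the API wraps the single question in a list, so the first element has to be selected before any field can be read.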

2. Building Scrapy Project and Spider

To get started, create a new Scrapy project with the scrapy startproject command followed by the project name. Run the command below in the terminal to create a project named jservice.

> scrapy startproject jservice

New Scrapy project 'jservice', using template directory 'C:\Program Files\Python311\Lib\site-packages\scrapy\templates\project', created in:
    C:\scrapy\jservice

You can start your first spider with:
    cd jservice
    scrapy genspider example example.com

Once the project has been created, generate a spider for it with the scrapy genspider command. Run the commands below in the terminal to create a new spider named question.

> cd jservice
> scrapy genspider question jservice.io
Created spider 'question' using template 'basic' in module:
  jservice.spiders.question

To begin, open your preferred code editor. In this example, I will be using VS Code.

Next, navigate to the jservice/spiders/question.py file. Inside this file, you will find the following initial code snippet.

import scrapy

class QuestionSpider(scrapy.Spider):
    name = "question"
    allowed_domains = ["jservice.io"]
    start_urls = ["https://jservice.io"]

    def parse(self, response):
        pass

To fetch and scrape the JSON response from the jService API, replace the contents of this file with the code below.

import json

import scrapy


class QuestionSpider(scrapy.Spider):
    name = "question"

    def start_requests(self):
        # Request the API endpoint directly instead of the site's home page.
        start_url = "https://jservice.io/api/random"
        yield scrapy.Request(url=start_url, callback=self.parse)

    def parse(self, response):
        # The response body is a JSON array, so decode it into a Python list.
        api_data = json.loads(response.text)

        # Take the question and answer from the first object in the list.
        question_text = api_data[0]["question"]
        answer_text = api_data[0]["answer"]

        yield {
            "question": question_text,
            "answer": answer_text,
        }

To prevent Scrapy from skipping the request with a forbidden error, open the jservice/settings.py file and set ROBOTSTXT_OBEY = False. This setting controls whether Scrapy honors the site's robots.txt rules.

3. Code Explanation

We define a Scrapy spider called QuestionSpider and set its name to question. The name is used to identify the spider when running it.

The start_requests method is a generator function that yields the initial requests to be sent. In this case, we create a scrapy.Request object with the URL of the jService.io API endpoint for retrieving a random question. We specify the parse method as the callback function, which will be called when the response is received.

The parse method is the callback function that receives the response from the API request. Here, we load the JSON data from the response using json.loads(response.text). Assuming the response returns an array of objects, we extract the question and answer fields from the first object and store them in variables question_text and answer_text, respectively.
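The spider indexes api_data[0] directly, which assumes the body is valid JSON and the array is never empty. A more defensive version of that extraction step could look like the sketch below. The helper name extract_question_and_answer is hypothetical and not part of Scrapy; it uses only the standard library.

```python
import json


def extract_question_and_answer(body_text):
    """Return a dict with the first clue's question and answer, or None."""
    try:
        data = json.loads(body_text)
    except json.JSONDecodeError:
        return None  # the body was not valid JSON
    if not isinstance(data, list) or not data:
        return None  # expected a non-empty JSON array
    first = data[0]
    if not isinstance(first, dict):
        return None  # expected the first element to be a JSON object
    question = first.get("question")
    answer = first.get("answer")
    if question is None or answer is None:
        return None  # one of the required fields is missing
    return {"question": question, "answer": answer}


print(extract_question_and_answer('[{"question": "Q", "answer": "A"}]'))
print(extract_question_and_answer("[]"))
```

Inside the spider, parse could pass response.text to a helper like this and yield the result only when it is not None, so one malformed response does not raise an exception in the callback.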

Finally, we yield a Python dictionary containing the extracted data. This dictionary will be passed to the Scrapy pipeline for further processing or saving to an output file.
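To give an idea of what "further processing" can mean, here is a minimal sketch of an item pipeline that trims whitespace from string fields. The class name CleanTextPipeline is hypothetical, and the sketch assumes the conventional process_item(self, item, spider) method and a plain dict item like the one the spider yields.

```python
class CleanTextPipeline:
    """Hypothetical pipeline that trims whitespace from string fields."""

    def process_item(self, item, spider):
        # Copy the items first so the dict can be updated while looping.
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()
        return item


# Quick check outside Scrapy; the spider argument is unused here, so None is passed.
cleaned = CleanTextPipeline().process_item(
    {"question": "  As a verb, thumb is a synonym for this ", "answer": "hitchhiking\n"},
    spider=None,
)
print(cleaned)
```

A pipeline only runs if it is enabled in the ITEM_PIPELINES setting. Assuming the class lives in the project's pipelines.py file, the entry would be something like {"jservice.pipelines.CleanTextPipeline": 300} in jservice/settings.py.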

4. Running Spider

To run this spider, navigate to the project root folder and execute the following command:

> scrapy crawl question -o output.json

This command runs the spider and saves the scraped data to a file named output.json. The content of the output.json file will look like this:

[
    {"question": "As a verb, thumb is a synonym for this", "answer": "hitchhiking"}
]

You can change the output format to CSV, JSON lines, or another supported format by changing the extension of the file name passed to the -o argument, for example output.csv.
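If you choose the JSON lines format, each scraped item is written as one JSON object per line, so the file is read line by line rather than with a single json.load call. The sketch below assumes the spider was run with an output file such as output.jl. The read_json_lines helper is hypothetical, and the demo writes a temporary file that stands in for real spider output.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Load a JSON lines file: one JSON object per non-empty line."""
    items = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items


# Demo with a temporary file standing in for real spider output.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "output.jl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(
            '{"question": "As a verb, thumb is a synonym for this", "answer": "hitchhiking"}\n'
        )
    demo_items = read_json_lines(demo_path)

print(demo_items[0]["answer"])
```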
