Spiders

In the Scrapy framework, Spiders are the core component where you define the custom behaviour for crawling and parsing pages. They are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites).

Understanding Spiders

A Spider is the part of your Scrapy application that is in charge of processing a response and extracting the structured data. It's also responsible for finding new URLs to follow and creating new requests (Request objects) from them.

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        urls = ['http://example.com', 'http://example2.com']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # We'll implement the parsing part here.
        pass

In the code above, we've defined a simple spider whose start_requests method yields a request for each URL. The parse method is where we'll extract data from the resulting responses.
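
For example, a filled-in parse method might pull a few pieces of data out of the page with CSS selectors and yield them as a dictionary. This is only a sketch: the title::text and p::text selectors are placeholders, and a real spider would target whatever markup the site actually uses.

    def parse(self, response):
        # Extract a few fields with CSS selectors (placeholder selectors).
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'paragraphs': response.css('p::text').getall(),
        }

Each dictionary yielded from a callback becomes a scraped item that Scrapy can pass through its item pipelines or write out via the feed exports.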

Spider Attributes

Let's go through the key attributes of a Scrapy Spider:

  1. name: A string which defines the name of the spider. It must be unique within a project.

  2. allowed_domains: An optional list of strings containing domains that this spider is allowed to crawl.

  3. start_urls: A list of URLs the spider starts crawling from when no particular start requests are defined.

class MySpider(scrapy.Spider):
    name = 'my_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

Spider Methods

Spiders have several methods, but these are the most important ones:

  1. start_requests(): This method must return an iterable of Request objects. It is the first method Scrapy calls in your spider; the default implementation generates one Request per URL in start_urls, roughly as in the sketch after this list.

  2. parse(response): This method is called with the downloaded response of each request for which no other callback was specified. It must return a dictionary, an Item object, a Request object, or an iterable of these.
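
For reference, the default start_requests() that scrapy.Spider provides behaves roughly like the sketch below: it simply turns each entry in start_urls into a Request. Treat this as an approximation; the exact built-in implementation can vary between Scrapy releases.

    def start_requests(self):
        # Approximation of the built-in default: one Request per start URL.
        # parse() handles the responses because it is the default callback.
        # dont_filter=True mirrors the built-in behaviour, so duplicate
        # start URLs are not skipped by the duplicate filter.
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True)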

Here's a simple example of a spider that starts by fetching a URL and then extracts and follows the links found in the <a> elements:

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        for link in response.css('a::attr(href)').getall():
            yield scrapy.Request(response.urljoin(link), callback=self.parse)
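
A common pattern is to combine both responsibilities in one callback: yield items for the data on the current page and yield new requests for the pages still to visit. The sketch below assumes a hypothetical listing page with an h1 heading and an a.next pagination link; adapt the selectors to the real site.

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'my_spider'
        start_urls = ['http://example.com']

        def parse(self, response):
            # Yield an item for the current page (placeholder selector).
            yield {'url': response.url, 'title': response.css('h1::text').get()}

            # Follow a hypothetical "next page" link, if one exists.
            next_page = response.css('a.next::attr(href)').get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Here response.follow is a convenience that resolves relative URLs, much like response.urljoin, before building the new Request.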

Using Spider Arguments

You can provide command line arguments to your spiders by using the -a option when running the scrapy crawl command. These arguments are passed to the Spider's __init__ method and become spider attributes by default.

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, category=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [f'http://example.com/{category}']

You can run this spider using scrapy crawl my_spider -a category=electronics.
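
Because the arguments become spider attributes, you can also read them without overriding __init__ at all. The sketch below uses getattr with a fallback default; the category attribute and the URL pattern are only illustrative assumptions.

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'my_spider'

        def start_requests(self):
            # -a category=... shows up as self.category; fall back to a default.
            category = getattr(self, 'category', 'electronics')
            yield scrapy.Request(f'http://example.com/{category}', callback=self.parse)

        def parse(self, response):
            # Parsing logic for the category page would go here.
            pass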

In this tutorial, we've learned about Scrapy's concept of Spiders, their attributes, and methods. We also explored how to create a basic Spider that crawls pages and extracts data. This foundational knowledge will help you build more complex and powerful web scraping tools using Scrapy. Happy crawling!