Spiders
In the Scrapy framework, Spiders are the core component where you define the custom behaviour for crawling and parsing pages. They are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites).
Understanding Spiders
A Spider is the part of your Scrapy application that is in charge of processing a response and extracting the structured data. It's also responsible for finding new URLs to follow and creating new requests (Request objects) from them.
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        urls = ['http://example.com', 'http://example2.com']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # We'll implement the parsing part here.
        pass
In the code above, we've defined a simple spider that can fetch responses from the URLs in the start_requests method. The parse method will be used to extract data from these responses.
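To make that concrete, here is one possible way the parse method could be filled in. This is a minimal sketch: the CSS selectors ('h1::text', 'p::text') and the field names in the yielded dictionary are assumptions about the page structure, not part of the original example.

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Hypothetical extraction: pull the page title and all paragraph text
        # with CSS selectors and yield the result as a plain dict (an "item").
        yield {
            'url': response.url,
            'title': response.css('h1::text').get(),
            'paragraphs': response.css('p::text').getall(),
        }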
Spider Attributes
Let's go through the key attributes of a Scrapy Spider:
name: A string which defines the name of the spider. It must be unique within a project.
allowed_domains: An optional list of strings containing the domains that this spider is allowed to crawl.
start_urls: A list of URLs from which the spider begins to crawl when no particular URLs are specified.
class MySpider(scrapy.Spider):
    name = 'my_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']
Spider Methods
Spiders have several methods, but these are the most important ones:
start_requests(): This method must return an iterable of Request objects. It is the first method Scrapy calls in your spider and, by default, it generates Request objects from the start_urls.
parse(response): This method is called with the response of each request. It must return a dictionary, an Item object, a Request object, or an iterable of these.
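Since parse can also return Item objects, here is a hedged sketch of what that might look like. ProductItem, its fields, and the CSS selectors are invented for illustration and are not defined anywhere else in this tutorial.

import scrapy

class ProductItem(scrapy.Item):
    # Hypothetical item with two fields; scrapy.Field() declares each one.
    name = scrapy.Field()
    price = scrapy.Field()

class ProductSpider(scrapy.Spider):
    name = 'product_spider'
    start_urls = ['http://example.com/products']

    def parse(self, response):
        # The selectors below are assumptions about the page markup.
        for product in response.css('div.product'):
            item = ProductItem()
            item['name'] = product.css('h2::text').get()
            item['price'] = product.css('span.price::text').get()
            yield item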
Here's a simple example of a spider that starts by fetching a URL and then extracts and follows the links found in the <a> elements:
class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        for link in response.css('a::attr(href)').getall():
            yield scrapy.Request(response.urljoin(link), self.parse)
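As a side note, the same link-following pattern can be written with response.follow, which accepts relative URLs directly, so the explicit urljoin call isn't needed. A minimal equivalent sketch of the parse method:

    def parse(self, response):
        # response.follow resolves relative URLs against response.url for us.
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, callback=self.parse)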
Using Spider Arguments
You can provide command line arguments to your spiders by using the -a option when running the scrapy crawl command. These arguments are passed to the Spider's __init__ method and become spider attributes by default.
class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [f'http://example.com/{category}']
You can run this spider using scrapy crawl my_spider -a category=electronics.
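Keep in mind that spider arguments arrive as strings, so anything numeric needs an explicit conversion. A small sketch of that idea, where the limit argument is invented for illustration:

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, category=None, limit='10', *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [f'http://example.com/{category}']
        # -a limit=50 arrives as the string '50', so convert it before use.
        self.limit = int(limit)

Such a spider could be run with scrapy crawl my_spider -a category=electronics -a limit=50.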
In this tutorial, we've learned about Scrapy's concept of Spiders, their attributes, and methods. We also explored how to create a basic Spider that crawls pages and extracts data. This foundational knowledge will help you build more complex and powerful web scraping tools using Scrapy. Happy crawling!