Introduction to Spiders
Scrapy is an open-source and collaborative web crawling framework for Python. It's used for data mining, information processing, and historical archival. The term "Spiders" in Scrapy refers to the classes which define how a certain site (or a group of websites) will be scraped.
What are Spiders?
In Scrapy, Spiders are the core component where you define the custom behavior for crawling and parsing pages. They are the classes that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make.
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        self.log('Visited %s' % response.url)
In the above example, MySpider is a spider class where:

- name: identifies the Spider. It must be unique within a project; that is, you can't set the same name for different Spiders.
- start_urls: a list of URLs the Spider will begin crawling from when no particular URLs are specified.
- parse(): a method called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse, which holds the page content and has further helpful methods to handle it.
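Beyond logging, parse() typically extracts data using the response's selector shortcuts. Below is a minimal sketch; the spider name, the URL, and the h1 selector are illustrative assumptions, not part of the example above:

import scrapy

class TitlesSpider(scrapy.Spider):
    name = 'titles'  # hypothetical spider name for illustration
    start_urls = ['http://example.com']

    def parse(self, response):
        # response is a TextResponse, so .css() is available for extraction.
        # The 'h1::text' selector is an assumption about the page layout.
        for title in response.css('h1::text').getall():
            yield {'title': title}

Each dict yielded from parse() becomes a scraped item that Scrapy passes through its item pipeline or writes to an output feed.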
Basic Scrapy Spider
The simplest form of a Scrapy spider simply logs each response it receives:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
In this example, the spider named example.com would start crawling from example.com's homepage. The allowed_domains list restricts the crawl: requests for URLs outside those domains are filtered out by Scrapy's offsite middleware. Note that this spider only logs each response; to follow links, parse() must yield further requests, as shown in the sketch below.
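One common pattern for following links is response.follow, which resolves relative URLs and returns a new Request. A minimal sketch (the spider name and the a::attr(href) selector are illustrative assumptions):

import scrapy

class FollowSpider(scrapy.Spider):
    name = 'follow_example'  # hypothetical name for illustration
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
        # Yield a request for every link found on the page. Requests that
        # leave allowed_domains are dropped by the offsite middleware.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)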
Spider Arguments
Sometimes, you may want your spider to take arguments from the command line. In this case, you can override the __init__ method as follows:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
You can now run the spider with the -a option to pass the argument:
scrapy crawl myspider -a category=electronics
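Alternatively, because Scrapy's default __init__ already copies -a arguments onto the spider as attributes, you can leave __init__ untouched and build the requests in start_requests(). A minimal sketch under the same assumptions (the category URL scheme is illustrative):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # 'category' is set as an attribute by Scrapy when passed via -a;
        # getattr supplies a fallback if the argument is omitted.
        category = getattr(self, 'category', 'default')
        url = 'http://www.example.com/categories/%s' % category
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.log('Visited %s' % response.url)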
Conclusion
That's it for the basics of Scrapy spiders. Remember, spiders are the heart of your Scrapy web crawler, and defining them correctly is critical for successfully navigating and extracting data from your target websites. In the upcoming sections, we will dive deeper into some advanced topics related to Scrapy spiders.