Introduction to Spiders
Scrapy is an open-source and collaborative web crawling framework for Python. It's used for data mining, information processing, and historical archival. The term "Spiders" in Scrapy refers to the classes which define how a certain site (or a group of websites) will be scraped.
What are Spiders?
In Scrapy, Spiders are the core component where you define the custom behavior for crawling and parsing pages. They are the classes that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make.
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        self.log('Visited %s' % response.url)
In the above example, MySpider is a spider class where:

- name: identifies the Spider. It must be unique within a project; that is, you can't set the same name for different Spiders.
- start_urls: a list of URLs the Spider will begin crawling from when no particular URLs are specified.
- parse(): a method called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse, which holds the page content and has further helpful methods to handle it.
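Beyond logging, parse() typically extracts data using the response's selector shortcuts. Below is a minimal sketch; the spider name, the URL, and the h1 selector are illustrative assumptions, not part of the example above:

import scrapy

class TitlesSpider(scrapy.Spider):
    name = 'titles'  # hypothetical spider name for illustration
    start_urls = ['http://example.com']

    def parse(self, response):
        # response is a TextResponse, so .css() is available for extraction.
        # The 'h1::text' selector is an assumption about the page layout.
        for title in response.css('h1::text').getall():
            yield {'title': title}

Each dict yielded from parse() becomes a scraped item that Scrapy passes through its item pipeline or writes to an output feed.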
Basic Scrapy Spider
The simplest form of a Scrapy spider simply logs each response it receives:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
In this example, the spider named example.com would start crawling from example.com's homepage. The allowed_domains list restricts the crawl: requests for URLs outside those domains are filtered out by Scrapy's offsite middleware. Note that this spider only logs each response; to follow links, parse() must yield further requests, as shown in the sketch below.
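One common pattern for following links is response.follow, which resolves relative URLs and returns a new Request. A minimal sketch (the spider name and the a::attr(href) selector are illustrative assumptions):

import scrapy

class FollowSpider(scrapy.Spider):
    name = 'follow_example'  # hypothetical name for illustration
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
        # Yield a request for every link found on the page. Requests that
        # leave allowed_domains are dropped by the offsite middleware.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)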
Spider Arguments
Sometimes, you may want your spider to take arguments from the command line. In this case, you can override the __init__ method as follows:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
You can now run the spider with the -a option to pass the argument:
scrapy crawl myspider -a category=electronics
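Alternatively, because Scrapy's default __init__ already copies -a arguments onto the spider as attributes, you can leave __init__ untouched and build the requests in start_requests(). A minimal sketch under the same assumptions (the category URL scheme is illustrative):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # 'category' is set as an attribute by Scrapy when passed via -a;
        # getattr supplies a fallback if the argument is omitted.
        category = getattr(self, 'category', 'default')
        url = 'http://www.example.com/categories/%s' % category
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.log('Visited %s' % response.url)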
Conclusion
That's it for the basics of Scrapy spiders. Remember, spiders are the heart of your Scrapy web crawler, and defining them correctly is critical for successfully navigating and extracting data from your target websites. In the upcoming sections, we will dive deeper into some advanced topics related to Scrapy spiders.