Understanding Basic Spider
Scrapy is a powerful open-source web crawling framework written in Python. One of its fundamental components is the Spider: a Python class where you define how to crawl a site and parse its pages for structured data.
What is a Spider?
In Scrapy, a Spider is the code you write to tell Scrapy how to navigate through a website and extract the data you need. You can think of it as a web bot that systematically browses the internet and collects data.
Creating a Basic Spider
Here's a simple Scrapy spider to scrape quotes from the website http://quotes.toscrape.com:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
This Spider starts at http://quotes.toscrape.com/page/1/, iterates over the quotes on the page, and yields a Python dictionary for each one containing the quote text, author, and tags.
Spider Components
Let's break down the components of a Scrapy Spider:
name: This identifies the Spider. It must be unique within a project, and it is the name you use to target the spider when running a crawl.

start_urls: A list of URLs the Spider begins crawling from. When the Spider is opened for scraping, it makes its first requests to these URLs.

parse(): The default callback Scrapy uses to process downloaded responses, unless you specify a different one. This is where the Spider extracts data from the page.
Selectors
Selectors are a mechanism for extracting data from the HTML source. In the example above, we used CSS selectors (response.css). You could also use XPath selectors via response.xpath.
For example, to select the text of a quote we use 'span.text::text' as our CSS selector. This tells Scrapy to look for <span> elements with the class "text" and extract the text within them.
Running the Spider
To run your spider, use the scrapy crawl command in your terminal, followed by the name of the Spider. In our case, that would be scrapy crawl quotes.
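In practice you run this from the project root (the directory containing scrapy.cfg), and you can also ask Scrapy to write the yielded items to a file. The filename below is just an example; the -O flag overwrites the file on each run and infers the format from the extension:

```shell
# Run the spider named "quotes"; assumes Scrapy is installed
# and the command is issued from the project root.
scrapy crawl quotes

# Same crawl, but write the yielded items to a JSON file.
scrapy crawl quotes -O quotes.json
```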
Remember that Spiders are classes that Scrapy instantiates and uses to scrape information from web pages. They define both how to perform the crawl (i.e., which links to follow) and how to extract structured data from the pages (i.e., scraping items).
In the next sections, we'll look at how to handle more complex scenarios, such as following links, dealing with pagination, and storing the scraped data.