Introduction to Selectors
What are Selectors?
Selectors are a feature of Scrapy that allows us to pinpoint the data we want to scrape from a webpage. They let us "select" certain parts of an HTML document specified either by XPath or CSS expressions. XPath and CSS are the languages used for navigating through elements and attributes in HTML and XML documents.
XPath and CSS
XPath (XML Path Language) is a query language for selecting nodes from an XML document. XPath expressions can also be applied to HTML documents.
CSS (Cascading Style Sheets) is a stylesheet language used for describing the look and formatting of a document written in HTML or XML. CSS selectors are patterns used to select elements.
Scrapy Selectors
In Scrapy, selectors are instances of the Selector class, which in turn is a thin wrapper around the parsel library. Scrapy selectors provide the following methods for data extraction:
.xpath(query): Returns a list of selectors, one for each node in the document that matches the XPath query.
.css(query): Returns a list of selectors, one for each node in the document that matches the CSS query.
.re(regex): Returns a list of strings extracted by applying the regular expression regex.
.get(): Returns the result of the selector serialized as a single string. When called on a list of selectors, it returns the first match, or None if there is none.
.getall(): Returns all results of the selector serialized as a list of strings.
Basic Usage of Selectors
Here's an example of how you might use selectors in a Scrapy spider:
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.example.com']

    def parse(self, response):
        # Yield one item per post title found on the page.
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').get()}

        # Follow each link inside div.prev-post and parse it the same way.
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)
In this example, response.css('h2.entry-title') and title.css('a ::text').get() both use CSS selectors. ::text is not part of standard CSS; it is a pseudo-element supported by Scrapy selectors that selects text nodes rather than elements. The .get() method returns the first match, or None if nothing matches.
Conclusion
Selectors are a powerful feature of Scrapy that enable precise, flexible, and efficient data extraction. They are a fundamental part of writing Scrapy spiders, and a thorough understanding of how they work will greatly aid your web scraping endeavors.