Introduction to Selectors

What are Selectors?

Selectors are a feature of Scrapy that allows us to pinpoint the data we want to scrape from a webpage. They let us "select" certain parts of an HTML document, specified by either XPath or CSS expressions. XPath expressions and CSS selectors are two ways of addressing elements and attributes in HTML and XML documents.

XPath and CSS

XPath (XML Path Language) is a query language for selecting nodes from an XML document. Because an HTML document can be parsed into the same kind of node tree, XPath expressions work on HTML as well.

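CSS (Cascading Style Sheets) is a stylesheet language used for describing the look and formatting of a document written in HTML or XML. CSS selectors are the patterns a stylesheet uses to pick the elements a rule applies to; Scrapy reuses the same patterns to pick the elements to extract.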

Scrapy Selectors

In Scrapy, selectors are instances of the Selector class, which in turn is a thin wrapper around the parsel library. Scrapy selectors provide the following methods for data extraction:

  • xpath(query): Returns a SelectorList containing one selector for each node in the document that matches the XPath query.
  • css(query): Returns a SelectorList containing one selector for each node in the document that matches the CSS query.
  • re(regex): Returns a list of strings extracted by applying the regular expression regex.
  • get(): Returns the first matching result serialized as a string, or None if nothing matched.
  • getall(): Returns all matching results serialized as a list of strings.

Basic Usage of Selectors

Here's an example of how you might use selectors in a Scrapy spider:

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.example.com']

    def parse(self, response):
        # Each matched <h2> is itself a selector, so it can be queried again.
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').get()}

        # Follow the link to the previous post and parse it with this method.
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)

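In this example, response.css('h2.entry-title') and title.css('a ::text').get() both use CSS selectors. ::text is not part of standard CSS; it is a pseudo-element added by Scrapy (through parsel) that selects text nodes instead of elements. Written with a space, as in 'a ::text', it matches every text node inside the a element, including text in nested tags, whereas 'a::text' matches only the text directly inside a. The .get() method returns the first match, or None if nothing matched.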

Conclusion

Selectors are a powerful feature of Scrapy that enable precise, flexible, and efficient data extraction. They are a fundamental part of writing Scrapy spiders, and a thorough understanding of how they work will greatly aid your web scraping endeavors.