Scraping a Page

Scrapy is a popular, powerful, and versatile Python framework for web scraping. This tutorial walks you through installing Scrapy and using it to scrape a web page.

Prerequisites

This tutorial assumes that you already have Python installed on your computer. If not, you can download it from Python's official website.

Installing Scrapy

Before using Scrapy, you need to install it. Run the following command in your terminal:

pip install Scrapy

Your First Scrapy Spider

A Scrapy Spider is a Python class where you define how to scrape information from a website (or a group of websites). Let's create our first spider:

import scrapy

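class MyFirstSpider(scrapy.Spider):
    # Unique name used to refer to this spider on the command line.
    name = "my_first_spider"

    def start_requests(self):
        # Pages to fetch when the spider starts.
        urls = [
            'http://quotes.toscrape.com/page/1/',
        ]
        for url in urls:
            # Schedule each request; parse() handles the response.
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends with ".../page/1/", so the second-to-last segment is the page number.
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        # Write the raw response body (bytes) to disk.
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')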

In this example, the spider requests http://quotes.toscrape.com/page/1/ and saves the raw HTML of the response to a file named quotes-1.html. It does not extract any quotes yet.
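The file name is built from the second-to-last segment of the URL. A minimal sketch of that string handling, runnable without Scrapy and assuming the same URL as in the spider:

```python
# Same string handling as in parse(), applied to a plain URL string.
url = "http://quotes.toscrape.com/page/1/"

# The trailing slash leaves an empty last element,
# so the page number sits at index -2.
segments = url.split("/")
page = segments[-2]
filename = f"quotes-{page}.html"

print(segments)
print(filename)
```

This approach depends on the URL ending in a slash. Without the trailing slash, index -2 would pick up "page" instead of the page number.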

Running the Spider

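The scrapy crawl command works only inside a Scrapy project, which scrapy startproject creates; the spider file belongs in that project's spiders directory. (A standalone spider file can instead be run with scrapy runspider followed by the file path.) To run the spider, navigate to the project's root directory and run the following command: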

scrapy crawl my_first_spider

This should start the spider, which will send a GET request to the specified URL, handle the response in the parse() method, and save the page content in a file.

Extracting Data

The real power of Scrapy comes from its ability to extract data from websites. Let's modify the parse() method to extract quotes from the webpage:

def parse(self, response):
    for quote in response.css('div.quote'):
        text = quote.css('span.text::text').get()
        author = quote.css('span small::text').get()
        tags = quote.css('div.tags a.tag::text').getall()
        yield {
            'text': text,
            'author': author,
            'tags': tags,
        }

This extracts the quote text, the author's name, and the associated tags for each quote on the page, and yields one Python dictionary per quote.

Conclusion

And that's it! You've successfully scraped your first webpage with Scrapy. This is only a basic introduction; Scrapy's capabilities are much more extensive. With this foundation, you can move on to more complex projects that scrape more data from more websites.
