Scraping a Page
Scrapy is a popular open-source Python framework for web scraping and crawling. This tutorial walks through installing Scrapy and using it to scrape a webpage.
Prerequisites
This tutorial assumes that you already have Python installed on your computer. If not, you can download it from Python's official website.
Installing Scrapy
Scrapy is installed with pip. Run the following command in your terminal:
pip install Scrapy
Your First Scrapy Spider
A Scrapy Spider is a Python class where you define how to scrape information from a website (or a group of websites). Let's create our first spider:
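import scrapy


class MyFirstSpider(scrapy.Spider):
    # Unique name used to refer to this spider on the command line
    name = "my_first_spider"

    def start_requests(self):
        # URLs the spider starts crawling from
        urls = [
            'http://quotes.toscrape.com/page/1/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The second-to-last URL segment is the page number, e.g. "1"
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        # response.body is bytes, so open the file in binary mode
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')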
In this example, the spider requests http://quotes.toscrape.com/page/1/ and saves the raw page content to a file named quotes-1.html. It does not extract individual quotes yet.
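The file name comes from plain string handling on the URL, which can be tried without Scrapy. The sketch below isolates that step; the helper name page_filename and the page/2 URL are illustrative assumptions, not part of the spider.

```python
def page_filename(url: str) -> str:
    """Build the output file name the same way the spider's parse() does."""
    # Splitting on "/" leaves an empty string after the trailing slash,
    # so the page number is the second-to-last element.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


print(page_filename("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("http://quotes.toscrape.com/page/2/"))  # quotes-2.html
```

This approach depends on the trailing slash: for a URL ending in /page/1 with no slash, the second-to-last element would be "page" rather than "1".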
Running the Spider
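The scrapy crawl command only works inside a Scrapy project, which you can create with scrapy startproject, with the spider file saved in the project's spiders directory. To run the spider, navigate to the project's root directory and run the following command: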
scrapy crawl my_first_spider
This should start the spider, which will send a GET request to the specified URL, handle the response in the parse() method, and save the page content in a file.
Extracting Data
Saving raw HTML is only a first step. Scrapy's main strength is extracting structured data from pages. Let's replace the spider's parse() method with one that extracts the quotes from the webpage:
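def parse(self, response):
    # Each quote on the page is wrapped in a <div class="quote"> element
    for quote in response.css('div.quote'):
        # ::text selects the text inside the matched element
        text = quote.css('span.text::text').get()
        author = quote.css('span small::text').get()
        # getall() returns a list with every match, not just the first
        tags = quote.css('div.tags a.tag::text').getall()
        yield {
            'text': text,
            'author': author,
            'tags': tags,
        }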
This extracts the quote text, the author's name, and the associated tags for each quote on the page, and yields one Python dictionary per quote.
Conclusion
And that's it! You've successfully scraped your first webpage with Scrapy. This is just a basic introduction; Scrapy's capabilities are much more extensive. With this foundation, you can move on to more complex projects that scrape more data from more websites.