Defining and Using Items
Scrapy is an open-source web-crawling framework written in Python. It lets you write spiders that crawl web pages and extract structured data from them. In this tutorial, we'll focus on defining and using Items in Scrapy.
What is an Item?
In Scrapy, an Item is a simple container for the scraped data. It is a dict-like Python object that defines a set of named fields used to carry that data.
Defining an Item
To start using Items, you need to define them. This involves creating a class that subclasses scrapy.Item and declaring a Field for each piece of data you want to capture. Here is an example of how to define an item:
import scrapy

class ExampleItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
In this example, ExampleItem has two fields, title and url. You can define as many fields as you need.
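Because Items behave like dictionaries, you can create, read, and update them much as you would a plain dict. Here is a minimal sketch you could run in a Python shell, assuming the ExampleItem definition above; one difference from a plain dict is that assigning to a field you haven't declared raises a KeyError.

item = ExampleItem(title='Example Domain', url='http://example.com')

print(item['title'])   # Example Domain
print(dict(item))      # {'title': 'Example Domain', 'url': 'http://example.com'}

try:
    item['author'] = 'Alice'   # 'author' was never declared on ExampleItem
except KeyError as e:
    print(e)                   # e.g. 'ExampleItem does not support field: author'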
Using Items in Spiders
Let's see how to use these items in a spider.
import scrapy
from myproject.items import ExampleItem

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://example.com']

    def parse(self, response):
        item = ExampleItem()
        item['title'] = response.css('title::text').get()
        item['url'] = response.url
        return item
In this spider, we first import our ExampleItem. In the parse method, we create a new ExampleItem instance and fill it with data: a CSS selector extracts the page title, and the URL comes straight from the response.
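In practice, a parse method usually yields several items per page rather than returning a single one. The following sketch is purely illustrative: it assumes a hypothetical listing page where each result is an <a> element inside li.result, which is not the actual markup of example.com.

import scrapy
from myproject.items import ExampleItem

class ManyItemsSpider(scrapy.Spider):
    name = "many_items"
    start_urls = ['http://example.com']

    def parse(self, response):
        # Assumed markup: <li class="result"><a href="...">Title</a></li>
        for link in response.css('li.result a'):
            item = ExampleItem()
            item['title'] = link.css('::text').get()
            item['url'] = response.urljoin(link.attrib.get('href', ''))
            yield item

Each yielded item is handed to Scrapy for further processing, and a run such as scrapy crawl many_items -o items.json would collect them all into a JSON file.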
Processing Items
Once a spider has scraped and returned an Item, the item is typically written to a file or a database, or sent over the network. Before storing items, however, you may want to clean and validate the data. This is done with Item Pipelines, a powerful Scrapy feature. Here's a simple example of an Item Pipeline:
class ExamplePipeline:
    def process_item(self, item, spider):
        item['title'] = item['title'].upper()
        return item
This pipeline takes the title field from the item and converts it to uppercase. It then returns the item, which can be stored or sent over the network.
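Pipelines only run after they are enabled in your project's settings.py via the ITEM_PIPELINES setting, where the integer (0-1000) controls the order in which pipelines are applied, lower numbers first. The sketch below also adds a hypothetical ValidationPipeline that drops items missing a title, using Scrapy's DropItem exception:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 200,  # hypothetical validation step
    'myproject.pipelines.ExamplePipeline': 300,
}

# pipelines.py
from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        # Discard any item that was scraped without a title
        if not item.get('title'):
            raise DropItem("Missing title in %s" % item)
        return item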
Conclusion
Items are a fundamental part of Scrapy, providing a straightforward and versatile way to represent and handle the data you're scraping. By defining and using items, you can structure and process your scraped data in a way that suits your project's needs.
This tutorial covered the basics of defining and using items in Scrapy. As you become more comfortable with these concepts, you'll be able to build more complex and powerful scraping tools.