
Defining and Using Items

Scrapy is an open-source web-crawling framework written in Python. It allows users to write spider programs to crawl web pages and extract structured data from them. In this tutorial, we'll be focusing on defining and using items in Scrapy.

What is an Item?

In Scrapy, an Item is a simple container for scraped data. It behaves like a Python dict, but it only accepts the fields declared on its class, so a misspelled field name raises an error instead of silently creating a new key.

Defining an Item

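To start using Items, you need to define them. This involves creating a Python class that subclasses scrapy.Item and declaring each field as a scrapy.Field() attribute. A Field does not enforce a data type; it declares the field name and can carry optional metadata. Here is an example of how to define an item: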

import scrapy

class ExampleItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

In this example, ExampleItem has two fields, title and url. You can define as many fields as you need.

Using Items in Spiders

Let's see how to use these items in a spider.

import scrapy
from myproject.items import ExampleItem

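class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://example.com']

    def parse(self, response):
        item = ExampleItem()
        # get() returns the first match, or None if nothing matches
        item['title'] = response.css('title::text').get()
        # the URL comes from the response itself, not from a selector
        item['url'] = response.url
        return item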

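In this spider, we first import our ExampleItem. In the parse method, we create a new ExampleItem instance and fill it with data: a CSS selector extracts the text of the page's title element, and response.url supplies the page URL. Note that get() returns None when the selector matches nothing, so title may be empty for pages without a title element.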

Processing Items

After a spider returns an Item, Scrapy passes it through any enabled Item Pipelines, and the item is then typically written to a file, stored in a database, or sent over the network. Item Pipelines are the place to clean and validate the data before it is stored. Here's a simple example of an Item Pipeline:

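class ExamplePipeline:
    def process_item(self, item, spider):
        # the title may be None if the selector matched nothing
        if item.get('title'):
            item['title'] = item['title'].upper()
        return item

This pipeline converts the item's title field to uppercase when a title is present, then returns the item so that later pipelines can process it and it can be stored or sent over the network. For Scrapy to run a pipeline, its class must be listed in the project's ITEM_PIPELINES setting.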

Conclusion

Items are a fundamental part of Scrapy, providing a straightforward and versatile way to represent and handle the data you're scraping. By defining and using items, you can structure and process your scraped data in a way that suits your project's needs.

This tutorial covered the basics of defining and using items in Scrapy. As you become more comfortable with these concepts, you'll be able to build more complex and powerful scraping tools.