Pipeline Examples

Scrapy's Item Pipeline is a powerful feature that lets you process and filter the data your spiders have scraped. To help solidify your understanding, let's work through some practical pipeline examples.

What is a Pipeline in Scrapy?

A pipeline in Scrapy is a series of processing steps applied to the items your spiders return. Each item is passed sequentially through every enabled component, and these components, ordinary Python classes implementing a small interface, are called "pipeline classes".
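
To make that concrete, here is the minimal skeleton of a pipeline class (the name MyPipeline is just a placeholder): the only method Scrapy requires is process_item.

class MyPipeline:
    def process_item(self, item, spider):
        # Return the item (possibly modified) to pass it along to the
        # next component, or raise scrapy.exceptions.DropItem to discard it.
        return item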

Simple Pipeline Example

We'll start with a simple pipeline example. Suppose we have scraped some data about books, and we want to remove any books whose price is less than $50. We can write a pipeline class for this:

from scrapy.exceptions import DropItem

class PriceFilterPipeline:
    def process_item(self, item, spider):
        # Drop any book priced under $50; a dropped item is not
        # processed by any further pipeline components.
        if item['price'] < 50:
            raise DropItem("Item price below 50")
        return item

In the above code, process_item is the method Scrapy calls for every item that passes through this pipeline component. If the item's price is less than $50, we raise the DropItem exception (imported from scrapy.exceptions), and the item is dropped and not processed by any further pipeline components.
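
For context, here is a minimal sketch of a spider that could feed this pipeline. It assumes the demo site books.toscrape.com, and the CSS selectors are illustrative; the key point is that each yielded item carries a numeric price field for the pipeline to compare against:

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                # Strip the currency symbol so the pipeline can
                # compare the price numerically.
                'price': float(book.css('p.price_color::text').get().lstrip('£')),
            }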

Pipeline to Store Items in MongoDB

Now, suppose we want to store our scraped data in a MongoDB database. We can create a pipeline class that establishes a connection to the database and inserts items into it:

import pymongo

class MongoPipeline:

    collection_name = 'scrapy_items'

    def open_spider(self, spider):
        # Called once when the spider starts: open the MongoDB connection.
        self.client = pymongo.MongoClient('localhost', 27017)
        self.db = self.client["mydatabase"]

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.client.close()

    def process_item(self, item, spider):
        # Insert a copy of the item, then return it so that later
        # pipeline components can still process it.
        self.db[self.collection_name].insert_one(dict(item))
        return item

In the above code, the open_spider method is called when the spider is opened, which is where we establish our connection to MongoDB. The close_spider method is called when the spider is closed, which is where we close the connection. In the process_item method, we insert the item into the database and return it so that any later pipeline components still receive it.
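
Hardcoding the host and database name works for a demo, but a common refinement, shown in the official Scrapy documentation, is to read them from your project settings via the from_crawler class method. A sketch, assuming you define MONGO_URI and MONGO_DATABASE yourself in settings.py:

import pymongo

class MongoPipeline:

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the pipeline, passing the crawler,
        # which gives us access to the project settings.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'mydatabase'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    # close_spider and process_item are unchanged from the version above.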

Activating Pipelines

To activate a pipeline, you need to add it to the ITEM_PIPELINES setting in your Scrapy project's settings.py file. The ITEM_PIPELINES setting is a dictionary where the keys are the import paths of the pipeline classes and the values are integers (conventionally in the 0-1000 range) that determine the order in which the pipelines run. Lower values run first:

ITEM_PIPELINES = {
    'myproject.pipelines.PriceFilterPipeline': 300,
    'myproject.pipelines.MongoPipeline': 400,
}

In this setting, PriceFilterPipeline will run before MongoPipeline because it has a lower value.
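
ITEM_PIPELINES set in settings.py applies project-wide. If you want a pipeline to run for only one spider, you can override the setting per spider with the custom_settings attribute (spider body omitted for brevity):

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    # custom_settings overrides the project settings for this spider
    # only, so PriceFilterPipeline runs here while other spiders keep
    # the project-wide pipeline configuration.
    custom_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipelines.PriceFilterPipeline': 300,
        },
    }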

This concludes our tutorial on Scrapy Item Pipeline examples. Remember, pipelines are incredibly powerful tools that can help you process, validate, and persist the data you scrape. So, feel free to explore and experiment with them to meet your specific data scraping needs.