Scrapy with Splash

Introduction

Scrapy is a powerful and flexible web scraping framework that allows us to extract structured data from the web. Splash, on the other hand, is a lightweight, scriptable browser that allows us to handle JavaScript on websites we want to scrape. In this tutorial, we will learn how to use Scrapy with Splash to scrape data from JavaScript-heavy websites.

Prerequisites

  • Basic understanding of Python
  • Familiarity with HTML
  • Scrapy installed on your system
  • Docker installed on your system (for Splash)

Setting Up Splash

Splash is delivered as a Docker image which means you can run it regardless of your operating system. To run Splash, open your terminal and enter the following command:

docker run -p 8050:8050 scrapinghub/splash

Now, Splash should be accessible at http://localhost:8050.
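You can exercise Splash's HTTP API directly before wiring it into Scrapy: the /render.html endpoint returns the JavaScript-rendered HTML for a given URL. A minimal sketch of building such a request URL (the target URL and wait value are placeholders):

```python
from urllib.parse import urlencode

# Base address of the Splash container started above.
splash_base = "http://localhost:8050"

# 'url' is the page to render; 'wait' gives JavaScript time to run.
params = {"url": "http://example.com", "wait": 0.5}
render_url = f"{splash_base}/render.html?{urlencode(params)}"
print(render_url)
```

Opening this URL in a browser (or fetching it with any HTTP client) while the container is running should return the rendered page, which is a quick way to confirm Splash is working.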

Installing Scrapy-Splash

Scrapy-Splash is a Scrapy middleware that provides Splash integration to Scrapy. To install it, run:

pip install scrapy-splash

Configuring Scrapy to Use Splash

To configure Scrapy to use Splash, we need to modify the settings.py file in our Scrapy project. Here is what you need to add:

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
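If your project also uses Scrapy's HTTP cache, the scrapy-splash documentation recommends swapping in a Splash-aware cache storage backend as well:

```python
# Only needed if HTTP caching is enabled in your project.
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```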

Making Requests with Splash

To make a request with Splash, we use scrapy_splash.SplashRequest instead of scrapy.Request. Here is an example:

from scrapy_splash import SplashRequest

def start_requests(self):
    url = 'http://example.com'
    yield SplashRequest(url, self.parse_result)

Handling JavaScript with Splash

Splash allows us to interact with JavaScript-heavy websites. Here is an example of how to wait for JavaScript to load:

script = """
function main(splash, args)
    splash:set_user_agent('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1')
    splash:go(args.url)
    splash:wait(0.5)
    return splash:html()
end
"""

def start_requests(self):
    url = 'http://example.com'
    yield SplashRequest(url, self.parse_result, endpoint='execute', args={'lua_source': script})

In this script, splash:go(args.url) navigates to the page, splash:wait(0.5) waits for half a second, and splash:html() returns the HTML of the page.
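The args dict passed to SplashRequest is exposed inside the Lua script as the args table, so values like the wait time can be made configurable per request instead of hard-coded. A sketch, where the 2-second wait is just an illustrative value:

```python
# Lua script that reads its wait time from the request's args table.
script = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(args.wait)
    return splash:html()
end
"""

# Sent as SplashRequest(url, callback, endpoint='execute', args=splash_args);
# every key here becomes a field on the Lua args table.
splash_args = {"lua_source": script, "wait": 2.0}
print(splash_args["wait"])
```

This keeps the script generic: the same Lua source can serve fast and slow pages by varying the wait argument per request.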

Conclusion

Scrapy and Splash are a powerful combination that lets us scrape almost any website, no matter how heavily it relies on JavaScript. This tutorial only scratches the surface of what's possible, so I encourage you to explore the documentation of both Scrapy and Splash for more advanced features.