
Web Grater

A very simple Scrapy app for finding broken hrefs/images on your website. I've used this on a number of websites to find broken content, and hopefully it can help you too.

How do I use this?

The project is dockerized, but if you don't want to use Docker you can make a virtualenv instead. First, update the site you want to crawl in web_grater/spiders/grater_spider.py. After that you can just run:

make up && make crawl

This prints every URL the scraper can find, with the broken ones in red.
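To illustrate the red output, here is a minimal standalone sketch using ANSI escape codes. It is not the project's code: the function name `format_result`, the sample URLs, and the rule that any status of 400 or above counts as broken are all assumptions made for the example.

```python
# ANSI escape codes understood by most terminals.
RED = "\033[31m"
RESET = "\033[0m"


def format_result(url: str, status: int) -> str:
    """Return a report line for the URL, wrapped in red if the status is an error.

    Assumption: a response status of 400 or above means the link is broken.
    """
    line = f"{status} {url}"
    if status >= 400:
        return f"{RED}{line}{RESET}"
    return line


if __name__ == "__main__":
    # Hypothetical crawl results, for illustration only.
    results = [
        ("https://example.com/", 200),
        ("https://example.com/missing.png", 404),
    ]
    for url, status in results:
        print(format_result(url, status))
```

In a Scrapy spider the status would typically come from `response.status` in a callback, but how this project wires that up is not shown here.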

In grater_spider.py there is a CustomLinkExtractor with a method called extract_links. There is some commented-out code that shows how you can search for a broken link during link extraction and print its parent page in yellow.
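The idea of finding the parent page of a broken link can be sketched without Scrapy, using only the standard library. This is an illustration of the technique, not the commented-out code in grater_spider.py: `LinkCollector` and `find_parent` are hypothetical names, and the example assumes you already know the broken URL and have the page's HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# ANSI escape codes understood by most terminals.
YELLOW = "\033[33m"
RESET = "\033[0m"


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href and src attributes of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def find_parent(page_url, html, broken_url):
    """Print the page in yellow and return True if it links to broken_url."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    if broken_url in collector.links:
        print(f"{YELLOW}{page_url} links to {broken_url}{RESET}")
        return True
    return False


if __name__ == "__main__":
    # Hypothetical page content and broken URL, for illustration only.
    page = '<a href="/about">About</a><img src="img/logo.png">'
    find_parent("https://example.com/", page, "https://example.com/img/logo.png")
```

In the project itself the equivalent check would presumably live inside extract_links, where each extracted link can be compared against the broken URL while the parent response is still at hand.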