Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 24 |
1 files changed, 24 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..411a8d8
--- /dev/null
+++ b/README.md
@@ -0,0 +1,24 @@
+# Web Grater
+
+A very simple Scrapy app for finding broken hrefs/images on your website. I've
+used this on a number of websites to help find broken content and hopefully it
+can help you.
+
+## How do I use this?
+
+The project is dockerized, but if you don't want to use Docker you can make a
+virtualenv. First you need to update the site you want to crawl in
+`web_grater/spiders/grater_spider.py`. After that you can just run:
+
+```
+make up && make crawl
+```
+
+This will print all the URLs the scraper can find, with the broken ones in
+red.
+
+## How do I find which page contains the broken link?
+
+In `grater_spider.py` there is a `CustomLinkExtractor` with a method called
+`extract_links`. There is some commented-out code that shows how you can search
+for the broken link during link extraction and print the parent page in yellow.
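
The last README section describes finding the parent page of a broken link during link extraction. The commented-out code in `grater_spider.py` is not part of this diff, so the following is only a minimal, Scrapy-free sketch of the same idea using the Python standard library. The names `LinkCollector` and `find_parent_pages`, the `pages` dictionary, and the example URLs are all hypothetical and are not taken from the project.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# ANSI escape codes for yellow terminal output.
YELLOW = "\033[93m"
RESET = "\033[0m"


class LinkCollector(HTMLParser):
    """Collects absolute URLs from href and src attributes on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page they appear on.
                self.links.append(urljoin(self.base_url, value))


def find_parent_pages(broken_url, pages):
    """Return the URLs of the pages that link to broken_url.

    `pages` maps a page URL to its HTML source (a hypothetical input
    shape chosen for this sketch, not the project's data model).
    """
    parents = []
    for page_url, html in pages.items():
        collector = LinkCollector(page_url)
        collector.feed(html)
        if broken_url in collector.links:
            parents.append(page_url)
    return parents


if __name__ == "__main__":
    example_pages = {
        "https://example.com/": '<a href="/missing">gone</a>',
        "https://example.com/about": '<a href="/">home</a>',
    }
    for parent in find_parent_pages("https://example.com/missing", example_pages):
        print(f"{YELLOW}{parent}{RESET}")
```

In the project itself the same membership check would presumably sit inside `CustomLinkExtractor.extract_links`, where the page being parsed is already at hand, so no separate pass over the pages is needed.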