# Web Grater

A very simple Scrapy app for finding broken hrefs/images on your website. I've
used this on a number of websites to help find broken content, and hopefully it
can help you.

## How do I use this?

The project is dockerized, but if you don't want to use Docker you can make a
virtualenv instead. First, update the site you want to crawl in
`web_grater/spiders/grater_spider.py`. After that you can just run:

```
make up && make crawl
```
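
Before running the commands above, the spider has to point at your site. This
README doesn't show the contents of `grater_spider.py`, so the following is only
a sketch that assumes the spider uses Scrapy's conventional `start_urls` and
`allowed_domains` class attributes. `SITE_URL` is a hypothetical name and
`www.example.com` is a placeholder.

```python
from urllib.parse import urlparse

# Hypothetical placeholder: replace with the site you want to crawl.
SITE_URL = "https://www.example.com/"

# Values a Scrapy spider conventionally declares as class attributes.
# Assumption: grater_spider.py follows this convention.
start_urls = [SITE_URL]

# Restricting the crawl to the site's own host keeps the spider from
# wandering off onto every external site it finds a link to.
allowed_domains = [urlparse(SITE_URL).hostname]

if __name__ == "__main__":
    print("start_urls =", start_urls)
    print("allowed_domains =", allowed_domains)
```

Copy the two resulting values into the spider class if it is laid out this way.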

The crawl prints every URL the scraper can find, with the broken ones shown in
red.
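
To give a rough idea of what that output amounts to, here is a minimal sketch of
coloring a result line with ANSI escape codes. It is not the project's actual
code: `format_result` is a hypothetical helper, and treating any HTTP status of
400 or above as "broken" is an assumption.

```python
# ANSI escape codes understood by most terminals.
RED = "\033[91m"
RESET = "\033[0m"


def format_result(url, status):
    """Return the line to print for a crawled URL.

    Assumption: a URL counts as broken when its HTTP status is >= 400.
    Broken URLs are wrapped in red; everything else is left uncolored.
    """
    line = f"{status} {url}"
    if status >= 400:
        return f"{RED}{line}{RESET}"
    return line


if __name__ == "__main__":
    print(format_result("https://example.com/", 200))
    print(format_result("https://example.com/missing", 404))
```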

## How do I find which page contains the broken link?

In `grater_spider.py` there is a `CustomLinkExtractor` with a method called
`extract_links`. It contains some commented-out code that shows how you can
search for the broken link during link extraction and print the parent page in
yellow.
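
The idea can be sketched without Scrapy at all. The example below is not the
commented-out code from the repo. It is a standalone illustration using only
the standard library, and `LinkCollector`, `page_contains_link` and
`report_parent` are hypothetical names. In the real spider the same check would
presumably sit inside `extract_links`, where both the page and its links are
available.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

YELLOW = "\033[93m"
RESET = "\033[0m"


class LinkCollector(HTMLParser):
    """Collect the href and src attribute values found on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def page_contains_link(page_url, html, broken_url):
    """Return True if the page links to broken_url (relative links resolved)."""
    collector = LinkCollector()
    collector.feed(html)
    return any(urljoin(page_url, link) == broken_url for link in collector.links)


def report_parent(page_url, html, broken_url):
    """Print the parent page in yellow when it contains the broken link."""
    if page_contains_link(page_url, html, broken_url):
        print(f"{YELLOW}{broken_url} found on {page_url}{RESET}")
        return True
    return False


if __name__ == "__main__":
    sample = '<a href="/missing">gone</a><img src="logo.png">'
    report_parent("https://example.com/about/", sample, "https://example.com/missing")
```

Relative links are resolved against the page URL before comparing, so you can
search for the broken URL exactly as the crawl printed it.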