Diffstat (limited to 'README.md')
-rw-r--r--  README.md  24
1 file changed, 24 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..411a8d8
--- /dev/null
+++ b/README.md
@@ -0,0 +1,24 @@
+# Web Grater
+
+A very simple Scrapy app for finding broken hrefs/images on your website. I've
+used it on a number of websites to track down broken content, and hopefully it
+can help you too.
+
+## How do I use this?
+
+The project is dockerized, but if you don't want to use Docker you can make a
+virtualenv instead. First, update the site you want to crawl in
+`web_grater/spiders/grater_spider.py` (there is a sketch of what that looks
+like below). After that you can just run:
+
+```
+make up && make crawl
+```
+
+This will print all the URLs the spider finds and highlight the broken ones in
+red.
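+
+If you are not sure what to change in the spider, here is a minimal sketch of
+what a spider like the one in `web_grater/spiders/grater_spider.py` might look
+like. The class name, domain, and callback below are illustrative placeholders,
+not the project's actual code:
+
+```
+from scrapy.linkextractors import LinkExtractor
+from scrapy.spiders import CrawlSpider, Rule
+
+
+class GraterSpider(CrawlSpider):
+    name = "grater"
+    # Point these at the site you want to check for broken hrefs/images.
+    allowed_domains = ["example.com"]
+    start_urls = ["https://example.com/"]
+
+    # Follow every link on the site and hand each crawled page to parse_item.
+    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)
+
+    def parse_item(self, response):
+        # The real spider does its broken-link/image checking around here.
+        self.logger.info("Crawled %s (status %s)", response.url, response.status)
+```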
+
+## How do I find which page contains the broken link?
+
+In `grater_spider.py` there is a `CustomLinkExtractor` with a method called
+`extract_links`. It contains some commented-out code that shows how you can
+search for the broken link during link extraction and print the parent page in
+yellow.
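+
+The actual commented-out code may differ, but the idea is roughly the sketch
+below: subclass Scrapy's stock `LinkExtractor` and, while extracting links from
+a page, watch for the broken URL and print the page that contains it. The URL
+and the yellow ANSI escape here are illustrative placeholders:
+
+```
+from scrapy.linkextractors import LinkExtractor
+
+# The broken link you are hunting for -- replace with the real URL.
+BROKEN_URL = "https://example.com/broken-page"
+
+
+class CustomLinkExtractor(LinkExtractor):
+    def extract_links(self, response):
+        links = super().extract_links(response)
+        for link in links:
+            if link.url == BROKEN_URL:
+                # `response` is the parent page being processed; print it in yellow.
+                print(f"\033[93mBroken link found on: {response.url}\033[0m")
+        return links
+```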