Web Crawler
A simple web crawler for retrieving site links.
This package is a simple web crawler, designed to take a website and extract the file
links it can find in the HTML the site provides.
Crawling is restricted to the source domain by default; this can be changed using the
restrict_domain option of the crawl method, as sketched below.
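A minimal usage sketch: the Crawler class name and the options-array signature are assumptions, while the crawl method and the restrict_domain option come from this README.

```php
<?php
// Hypothetical sketch: "Crawler" and the options-array signature are
// assumptions; only crawl() and restrict_domain are named in this README.
require 'vendor/autoload.php';

$crawler = new Crawler();

// Default behaviour: the crawl stays on the source domain.
$crawler->crawl('https://example.com');

// Assumed way to lift the restriction and follow external links.
$crawler->crawl('https://example.com', ['restrict_domain' => false]);
```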
It was built for handling known self-linking sites, although I will add controls to prevent
external crawling when required.
It is simple to use and solves some of the issues others have run into when building simple
crawlers.
Supported
- Scanning and retrieving web pages.
- Reading and extracting all links in a web page.
- Deducing whether a link points to a directory or to a file.
- Storing file and directory locations (web locations).
- Handling relative and absolute URLs.
- Timing crawls.
- Providing a minimal count statistic.
- Exporting the collected data as an array.
- Exporting the collected data as JSON (see the sketch after this list).
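A sketch of the export features, again with assumed names: toArray() and toJson() are hypothetical method names; the README only states that the collected data can be exported as an array and as JSON.

```php
<?php
// Hypothetical sketch: toArray() and toJson() are assumed method names for
// the array/JSON export features listed above.
require 'vendor/autoload.php';

$crawler = new Crawler();
$crawler->crawl('https://example.com');

$links = $crawler->toArray();  // collected files and directories as an array
$json  = $crawler->toJson();   // the same data as a JSON string

file_put_contents('crawl-results.json', $json);
```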
Warning
Use this at your own risk. Please don't crawl sites whose owners are not expecting it; the risk is entirely yours.
Simple Test Script
A simple script for testing is included.