NAME

crawl-web-for-images.pl

SYNOPSIS

# Start with the URLS in seed_urls.txt and write a verbose logfile.
crawl-web-for-images.pl  --input-file=seed_urls.txt --logfile=getting-images.log --verbose --timestamps

# Quit after downloading 100 images or after running for about two
# hours, whichever is sooner.
crawl-web-for-images.pl -i seed_urls.txt --image-count=100 --timelimit=2h

# Download about ten images per day until the script is killed.
crawl-web-for-images.pl -i seed_urls.txt --daily=10 

# Specify a target directory to download twenty pictures a day to.
crawl-web-for-images.pl -i seed_urls.txt --dir /home/user/Pictures/random -D 20

# Only download images if their area in square pixels is at least 1600.
crawl-web-for-images.pl -i seed_urls.txt --area=1600 

# Wait an average of an hour between downloads, varying randomly.
crawl-web-for-images.pl -i seed_urls.txt --wait=1h --random-wait

# The same, but don't do exponential backoff after errors since our baseline
# weight is already pretty long.
crawl-web-for-images.pl -i seed_urls.txt --w 1h -r --noexponential

# Exit the first time we have a problem downloading or saving anything.
crawl-web-for-images.pl -i seed_urls.txt --max-errors=1

# Apply a set of regular expression weights to all URLs found.
crawl-web-for-images.pl -i seed_urls.txt --weights=weights.tab

DESCRIPTION

crawl-web-for-images.pl will randomly crawl the web, starting with one or more specified seed URLs, and download random image files. It will generally visit several different sites, gathering image URLs from <img> tags, before it starts downloading any actual images. As it gathers more image tag URLs, it will download images relatively more often and HTML pages relatively less often. It won't save the HTML pages to disk, only the images it finds, except in debug mode where it has some problem parsing the HTML.

It allows you to limit which sites to visit and images to download in several ways. However, even with the most carefully constructed seed URL list and weights, there's no way to be sure that the crawler won't wander into a site with porn images and download them. Be sure you're okay with the risk of that before you run the scripts.

It honors robots.txt and won't visit pages or download images it's not allowed to.

OPTIONS

Where to get images, where to put them

Any command-line arguments that start with 'http' will be treated as seed URLs. Alternatively, you can list seed URLs in a text file with the command line option:

-i --input-file=[filename]

Points to a file containing a list of seed URLs to start our web crawl with. See "FILES" for more detail.

-d --dir=[directory]

The directory to download image files to. It's also the default directory for the logfile if a full path isn't specified for it. Defaults to the current directory.

If this doesn't exist yet, the script will try to create it.

How long to run, how many images to get

If none of these options are specified, the crawler will sleep

-c --image-count=[integer]

Exit after downloading this many images.

-t --timelimit=[integer]([unit])

Exit after running this amount of time. The time limit can take an integer argument (i.e. so many seconds), or it can be an integer followed by a letter specifying the units. The units allowed for --timelimit are:

s     seconds (same as leaving units off)
m     minutes
h     hours
d     days
-D --daily=[integer]

This argument represents the approximate number of images to download per day. The amount of time to pause between attempts to download URLs is set based on this count, weighted by the number of web pages vs. image URLs the robot has in its lists of URLs. Early on when it has only web page URLs, it will download one every few seconds. Once it has accumulated some image URLs, it will download less often, and when it has plenty of image URLs, it will spend most of its time downloading images and little time downloading more web pages, so the time to sleep between GET requests will approach a limit of 24 hours / number of images per day. Assuming no unsuccessful attempts, it will download roughly this many images in each 24-hour period, and keep running indefinitely unless you specify an --image-count or --timelimit option.

-w --wait=[integer]([time unit])

This specified the amount of time to sleep between downloads. Like the --timelimit, it can take a number of seconds or a number followed by a time unit (s, m, h or d).

This is ignored if you specify a --daily count, as that will cause the algorithm to calculate its own sleep time according to the daily image quota and the changing proportions between the number of web page URLs and image URLs in memory.

If neither this option nor --daily are specified, the default sleep time is 15 minutes.

-r --random-wait

If this is set, the sleep time will randomly vary between half the time specified by --wait and one and a half times that long.

Error handling

-e --max-errors=[integer]

Up until this many nonfatal errors, the script will keep trying to work. After this many errors, it will quit. Types of nonfatal errors include HTTP GET requests failing. The default maximum error limit is 5.

--exponential =item --noexponential

Exponential backoff is on by default; it means that after an HTTP request fails, the script will double its wait time before trying another such request, and continue doubling the wait time until it either succeeds in getting a page or reaches the maximum errors and exits. Once a request succeeds, the wait time is restored to normal.

Messages and logging

-q --quiet

Write nothing to the terminal.

-v --verbose

Write extra-detailed messages.

-l --logfile=[filename]

Write output to the specified logfile. Messages will also be written to the terminal unless --quiet is specified.

--debug

Write debug messages, and if there is a problem parsing an HTML file, save the HTML file to the download directory.

-T --timestamps

Include the time and date with each message written to the log file.

Which images to get

--weights=[filename]

Specifies a weights file consisting of regular expressions and weights, separated by tabs. If a URL for an image or web page matches one or more of the regular expressions, the weights will be applied to it in determining how many copies of the URL are inserted into the list. A weight of 0 means don't visit URLs matching that regular expression, 2 means insert two copies into the list, etc. If multiple regular expressions match, their weights are multiplied together.

-A --area=integer

Don't download images if we are able to determine their size and their area is less than this many square pixels.

-H --height=integer

Don't download images if we are able to determine their size and their height is less than this many pixels.

-W --width=integer

Don't download images if we are able to determine their size and their width is less than this many pixels.

--or

Download this image if either of the --height or --width criteria are satisfied.

Other options

-b --balancing=[algorithm](lowerbound,upperbound)

This option selects the balancing algorithm that determines whether to visit another web page looking for URLs or download an image from among the image URLs we've collected during the main event loop.

The algorithm can be one of three strings: linear, log, or equal, followed by an optional pair of numbers separated by a comma, the lower bound and upper bound supplied to said algorithm.

The default algorithm is log, the default lower bound is 4.61 (log 100) and the default upper bound is 9.21 (log 10000). You probably don't want to change this option unless you've read and understood the relevant code in the main function and set_pausetime(). The default is the default for a reason; I just wanted to have a way to switch between algorithms during development and testing without having to edit the code.

--help

Get brief help.

-M --manual

Get detailed help.

FILES

Seed URLs file

This can simply be a list of URLs, one per line, but if you want the crawler to be more likely to visit some of them than others, you can also specify weights -- an integer separated by whitespace from the URL. E.g.,

http://americangallery.wordpress.com          15
http://www.americanartarchives.com/artzybasheff.htm   20
http://goldenagepaintings.blogspot.com/       30

The numbers aren't percentages; they're the number of copies of that URL which are initially seeded into the list of URLs. No matter how many copies of a given URL are seeded, the robot won't visit that URL more than once; the weight just affects the initial probability of visiting that website rather than one of the other seed URLs, or (a little later) the probability of visiting that seed site rather than one of the various URLs gathered from the pages already visited.

If a line just has an URL with no seed weight, the default weight is 10.

This format also allows comments, beginning with # and continuing to the end of the line.

Weights file

If the --weights argument is used, the following filename will be read and interpreted as a weights file, where each line consists of a regular expression, a tab, and a weight (a nonnegative number). If a URL found in a page we visit matches any of the regular expressions in the weights file, then the corresponding weight will be applied; 0 means to exclude URLs matching that regular expression, 0.5 means to give those URLs half the default probability of being visited, 2 means to give them double the default probability, and so on. If multiple regular expressions match a single URL, the applied weights are multiplied.

(The script basically does this by inserting zero or more copies of the URL into the list of web pages or images to get, depending on the weight and a suitable random factor if the weight is not an integer.)

Some example weights

^[a-z]+://[^/]+/?$     10      # give preference to visiting a new site for the first time

\.webp$        0       # no .webp because Eye of Gnome doesn't support it

tumblr.*avatar.*png    0
tumblr\.png    0

# give higher weight to higher-resolution versions of images on blogspot sites
blogspot.*s1600        3
blogspot.*s400 0.5
blogspot.*s320 0.4

# block these if they're in the filename (not the domain name)
# they tend to be banner adverts?
blogger[^/]*$  0
yahoo[^/]*$    0
google[^/]*$   0

# don't try to edit Wikipedia pages!
action=edit    0
action=history 0

# or submit forms on any site
submit\?       0
blogspot\.com/search   0
delete-comment 0

(?i)thumbnail  0
_sm\.(gif|png|jpg)     0
_small\.(gif|png|jpg)  0

As you can see, I mostly use the weights file to block out undesirable images or pages that it is not suitable or worthwhile to visit, though in many cases those would be blocked in the first place by the site's robots.txt, which this script honors.

See WeightRandomList.html for more information.

DEPENDENCIES

This script uses the following Perl modules:

strict, warnings, constant, Cwd, Pod::Usage, and Getopt::Long are in the standard library.

LWP::UserAgent, WWW::RobotRules, and HTTP::Response are available from CPAN.

ImageSites.pm and WeightRandomList.pm are included with this script.

BUGS AND LIMITATIONS

This script doesn't use any image libraries to parse the images downloaded and make sure they are what the file extension says they are or check the size. Sometimes you'll get HTML file (mostly some sort of "image not found" page) saved as a .png or .jpg or whatever file type, and sometimes you'll get a .png saved as a .jpg or vice versa because the file extension in the <img> tag was wrong.

So decisions based on --area, --height and --width use heuristics from the <img> tag attributes and the filename (e.g., looking for a substring like '200x300'), not the actual image size.

The HTML parsing is very basic, just searching for <a href> and <img> tags with regular expressions.

AUTHOR

Jim Henry III, http://jimhenry.conlang.org/software/

LICENSE

This script is free software; you may redistribute it and/or modify it under the same terms as Perl itself.

TO DO LIST

Find a suitable image library and check size, file type, etc. before saving images.

Use HTML::Parser to find links and image tags more reliably.

Test alt text and title elements against regexes and apply weights.

Add some sort of whitelist option (another file, or simply a flag to treat the domain names in the seed URLs file as a whitelist?) or option to look for an existing blacklist on the web.