I recently came across PragPub and noticed they had an archive of back issues available for perusal. Before subscribing, I thought I’d download the entire archive to see what’s already been published. Since there are nearly 50 issues available, this looked like a great opportunity for web scraping. My first instinct was to Google whether anyone had already come up with a solution. Indeed, there were a couple of scripts!

Then it hit me…

All I needed to do was check the download link. If the URLs followed a simple, predictable pattern, wget would be all I needed. Upon inspection, they are perfect for this situation:

https://pragprog.com/magazines/download/1.epub

Each issue is just a sequential number, so wget handles everything:

wget https://pragprog.com/magazines/download/{1..49}.epub

A one-liner!
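Incidentally, the {1..49} part isn’t a wget feature at all; it’s bash brace expansion. The shell rewrites the pattern into 49 separate URLs before wget ever runs, so wget simply receives 49 arguments. You can see the expansion for yourself with echo (shown here with a smaller range):

```shell
# bash expands {1..3} into three separate words before the command runs,
# so echo (or wget) receives one argument per URL.
echo https://pragprog.com/magazines/download/{1..3}.epub
# → https://pragprog.com/magazines/download/1.epub https://pragprog.com/magazines/download/2.epub https://pragprog.com/magazines/download/3.epub
```

Note that brace expansion is a bash/zsh feature; in a plain POSIX sh you would need seq or an explicit loop instead.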

Now compare that to the results Google turned up. The first is a Perl script that requires several CPAN modules.

#!/usr/bin/env perl
use URI;
use URI::URL;
use Web::Scraper;
use LWP::Simple;

my $pp = "http://pragprog.com/magazines";
my $ok = "epub";

my $mags = scraper {
    process "span.link", "mags[]" => scraper {
        process "a", link => '@href';
    };
};

my $res = $mags->scrape(URI->new($pp));

for my $mag (@{$res->{mags}}) {
    my $url   = $mag->{link};
    my @parts = (new URI::URL $url)->path_components;
    my $file  = $parts[3];
    if ( ! -e $file && $file =~ /$ok/ ) {
        print "SAVING FILE:\t$file\n";
        getstore($url, $file);
    }
}

The second is a Python script that actually uses wget!

#!/usr/bin/env python
"""
find download links to all PragPub magazines

usage:
    $ ./pragpub_get.py [pdf | html | epub | mobi] > pragpub.lst
    $ wget -c -i pragpub.lst
"""
import sys, re, urllib

if len(sys.argv) > 1:
    ext = sys.argv[1]
else:
    ext = 'pdf'

url = 'http://pragprog.com/magazines'
pattern = re.compile(r'"(%s/download/.+?\.%s)"' % (url, ext), flags=re.IGNORECASE)

page = 1
while True:
    html = urllib.urlopen(url + '?page=' + str(page)).read()
    links = pattern.findall(html)
    if not links:
        break
    for l in links:
        print urllib.urlopen(l).geturl()
    page += 1
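That script is Python 2 (urllib.urlopen and the print statement), but the heart of it, the regex that pulls download links out of the listing page, works unchanged under Python 3. Here is the same pattern run against a hypothetical snippet of the listing page’s HTML (the markup below is made up for illustration; the real page may differ):

```python
import re

url = 'http://pragprog.com/magazines'
ext = 'epub'
# Same pattern the script above builds: a quoted download URL ending in .epub.
pattern = re.compile(r'"(%s/download/.+?\.%s)"' % (url, ext), flags=re.IGNORECASE)

# Hypothetical fragment of the magazine listing page, for illustration only.
html = '<span class="link"><a href="http://pragprog.com/magazines/download/12.epub">epub</a></span>'

links = pattern.findall(html)
print(links)  # → ['http://pragprog.com/magazines/download/12.epub']
```

The quotes in the pattern anchor the match to the href attribute, and the non-greedy .+? keeps it from swallowing everything between two links on the same line.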

Now, I should say that the best tool for a job is often the one you have on hand, or the one you are most familiar with. I don’t mean to criticize the authors of these scripts; I’m sure the scripts worked for them (and that’s why I haven’t named the authors). What I want to point out is that knowing the capabilities of the tools already present on most modern systems can be of great use in situations like these.

It is worthwhile to keep in mind the simplicity of our UNIX forefathers.