Lab2 - HTTP crawler


Keywords

client, TCP, IP, BSD sockets, HTTP, protocol, download, resume, spider, crawler, bot, RFC

Objectives

[Image: Illiterate-xenomorphs.jpg]

The Department of Heritage Preservation is looking for talents who can prepare the planet for the xenomorph invasion. The first test challenge is to write a program that can make copies of a subset of the Internet.

Those who successfully pass this test will be given additional instructions. Don't contact us, we'll contact you.

Good luck!

  • Write a program that can make a copy of a web-site that can be viewed offline;
  • Understand how a protocol works by reading its specifications;
  • Learn how to do some basic protocol reverse-engineering using a network sniffer.

Overview

  • The program will be given a web-site address (source, ex: http://info.railean.net/) and a target directory (destination, ex: D:\Sites\Inforail) to which the site's pages and files will be saved.
  • The program will use the HTTP protocol to retrieve the data.
  • Additional parameters can be taken either from the command line or from a configuration file.
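
For example, the parameters could be taken from the command line as sketched below; the option names and defaults are only assumptions made for the illustration, not something the assignment prescribes (Python's argparse is used here purely for the sketch):

  import argparse

  # Hypothetical command-line interface; flag names and defaults are illustrative.
  parser = argparse.ArgumentParser(description="Make an offline copy of a web-site over HTTP")
  parser.add_argument("source", help="site address, e.g. http://info.railean.net/")
  parser.add_argument("destination", help="target directory, e.g. D:\\Sites\\Inforail")
  parser.add_argument("--max-size", type=int, default=5, help="skip files larger than N MB")
  parser.add_argument("--limit", type=int, default=0, help="download speed cap in KB/s (0 = unlimited)")
  args = parser.parse_args()
  print(args.source, args.destination, args.max_size, args.limit)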



Generic requirements

  • The program must rely on the BSD sockets API, not some other library which is an abstraction on top of BSD sockets.
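
To make this concrete, here is a minimal sketch of an HTTP GET performed directly over a TCP socket, using Python's socket module (a thin wrapper around the BSD sockets API); the fetch function and the host used are only illustrations:

  import socket

  def fetch(host, path="/", port=80):
      # Open a TCP connection to the web server.
      with socket.create_connection((host, port), timeout=10) as sock:
          # "Connection: close" lets us read until the server closes the socket.
          request = (
              "GET " + path + " HTTP/1.1\r\n"
              "Host: " + host + "\r\n"
              "Connection: close\r\n"
              "\r\n"
          )
          sock.sendall(request.encode("ascii"))
          # Collect the raw response: status line, headers and body.
          chunks = []
          while True:
              data = sock.recv(4096)
              if not data:
                  break
              chunks.append(data)
      return b"".join(chunks)

  raw = fetch("info.railean.net")
  headers, _, body = raw.partition(b"\r\n\r\n")
  print(headers.decode("iso-8859-1"))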


Client requirements

You don't have to write a program that will [attempt to] copy the entire Internet to your hard drive :-), so naturally, some constraints are added to simplify the problem:

  • Process only the links that are on the same domain as the source.
  • Download only the following file types:
    • Images: {PNG, JPG, BMP}
    • Archives: {ZIP, RAR}
    • Documents: {PDF, DOC, ODF, DOCX, TXT, RTF}
  • Ignore (i.e. do not download) files that are larger than N MB in size (N can be adjusted)
  • For files smaller than N MB, resume support must be implemented: if the connection is broken while a file is being downloaded, you must be able to resume the download from the point where it stopped, rather than download the whole file from scratch (a sketch of the relevant HTTP headers follows this list)
  • When the program is done, it must show a report which lists all the pages/files that could not be processed correctly for various reasons (ex: connection timed out, dead link, the resource was moved, etc)
  • The web-server must be under the impression that your program is "Netscape Navigator 4.2" running on "Windows 95"
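
Here is the promised sketch: resuming maps onto the Range header (defined in RFC 2616), and the reported browser identity is simply whatever you put into the User-Agent header. The exact User-Agent string below is only a plausible guess, not a format required by the assignment:

  def build_request(host, path, offset=0):
      lines = [
          "GET " + path + " HTTP/1.1",
          "Host: " + host,
          # Pretend to be an old browser, as the assignment requires.
          "User-Agent: Mozilla/4.2 (compatible; Netscape Navigator 4.2; Win95)",
          "Connection: close",
      ]
      if offset > 0:
          # Ask only for the bytes that are still missing; a server that
          # supports ranges replies with "206 Partial Content".
          lines.append("Range: bytes=" + str(offset) + "-")
      return "\r\n".join(lines) + "\r\n\r\n"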


Notes

  • The program does not need a GUI; a command-line application will do.
  • It does not matter whether your application uses threads, multiple instances of itself, or a single thread. The objective of this assignment is to learn how to implement a protocol, so multi-threading is not relevant in this context. However, feel free to experiment with different approaches if you are curious how they will affect the performance of your program.


Suggestions

The program should be built from a set of reusable functions. Recommendations:

  • DownloadFile(source, destination) - implements the download feature, as well as the resume functionality;
  • GetFileSize(source) - returns the size of a file stored on an HTTP server (a sketch follows this list);
  • DownloadFileFragment(source, offset, size) - this function can be used inside DownloadFile; it downloads size bytes starting from offset.
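
As an illustration of GetFileSize, the sketch below sends a HEAD request and reads Content-Length from the reply; it assumes the server answers HEAD requests and reports the size, which is not guaranteed for every resource:

  import socket

  def get_file_size(host, path, port=80):
      request = (
          "HEAD " + path + " HTTP/1.1\r\n"
          "Host: " + host + "\r\n"
          "Connection: close\r\n\r\n"
      )
      with socket.create_connection((host, port), timeout=10) as sock:
          sock.sendall(request.encode("ascii"))
          response = b""
          while True:
              data = sock.recv(4096)
              if not data:
                  break
              response += data
      # Look for the Content-Length header in the reply.
      for line in response.split(b"\r\n"):
          if line.lower().startswith(b"content-length:"):
              return int(line.split(b":", 1)[1])
      return None  # the server did not report a size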

Grading policy

Assuming that everything works right,

  • 8 - for making your program able to download a file via HTTP;
  • 9 - for implementing resume support;
  • [a] 10 - if everything else works too;
  • [b] 10 - if you add the possibility to limit the download speed to a value specified in KB/s.
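
For option [b], one simple approach is to read the body in fixed-size chunks and sleep whenever the average rate gets ahead of the configured limit; a rough sketch (the function and parameter names are invented for the illustration):

  import time

  def receive_throttled(sock, out_file, limit_kbps):
      max_bytes_per_sec = limit_kbps * 1024
      start = time.monotonic()
      received = 0
      while True:
          data = sock.recv(4096)
          if not data:
              break
          out_file.write(data)
          received += len(data)
          # Time we should have spent at the configured rate vs. time actually spent.
          expected = received / max_bytes_per_sec
          elapsed = time.monotonic() - start
          if expected > elapsed:
              time.sleep(expected - elapsed)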

A partially implemented feature that does not work 100% right may still be accepted as fully implemented, provided that your report describes the nature of the problem and offers some guesses about what the solution might be.

Examples

import this