Lab2 - HTTP crawler


Keywords

client, TCP, IP, BSD sockets, HTTP, protocol, download, resume, spider, crawler, bot, RFC

Objectives

[Image: Illiterate-xenomorphs.jpg]

The Department of Heritage Preservation is looking for talents who can prepare the planet for the xenomorph invasion. The first test challenge is to write a program that can make copies of a subset of the Internet.

Those who successfully pass this test will be given additional instructions. Don't contact us, we'll contact you.

Good luck!

  • Write a program that can make a copy of a web-site that can be viewed offline;
  • Understand how a protocol works by reading its specifications;
  • Learn how to do some basic protocol reverse-engineering using a network sniffer.

Overview

  • The program will be given a web-site address (source, ex: http://info.railean.net/) and a target directory (destination, ex: D:\Sites\Inforail) to which the site's pages and files will be saved.
  • The program will use the HTTP protocol to retrieve the data.
  • Additional parameters can be taken either from the command line or from a configuration file.
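
For example, the parameters could be taken from the command line as sketched below; the option names and defaults are only assumptions made for the illustration, not something the assignment prescribes (Python's argparse is used here purely for the sketch):

  import argparse

  # Hypothetical command-line interface; flag names and defaults are illustrative.
  parser = argparse.ArgumentParser(description="Make an offline copy of a web-site over HTTP")
  parser.add_argument("source", help="site address, e.g. http://info.railean.net/")
  parser.add_argument("destination", help="target directory, e.g. D:\\Sites\\Inforail")
  parser.add_argument("--max-size", type=int, default=5, help="skip files larger than N MB")
  parser.add_argument("--limit", type=int, default=0, help="download speed cap in KB/s (0 = unlimited)")
  args = parser.parse_args()
  print(args.source, args.destination, args.max_size, args.limit)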



Generic requirements

  • The program must rely on the BSD sockets API, not some other library which is an abstraction on top of BSD sockets.
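
To make this concrete, here is a minimal sketch of an HTTP GET performed directly over a TCP socket, using Python's socket module (a thin wrapper around the BSD sockets API); the fetch function and the host used are only illustrations:

  import socket

  def fetch(host, path="/", port=80):
      # Open a TCP connection to the web server.
      with socket.create_connection((host, port), timeout=10) as sock:
          # "Connection: close" lets us read until the server closes the socket.
          request = (
              "GET " + path + " HTTP/1.1\r\n"
              "Host: " + host + "\r\n"
              "Connection: close\r\n"
              "\r\n"
          )
          sock.sendall(request.encode("ascii"))
          # Collect the raw response: status line, headers and body.
          chunks = []
          while True:
              data = sock.recv(4096)
              if not data:
                  break
              chunks.append(data)
      return b"".join(chunks)

  raw = fetch("info.railean.net")
  headers, _, body = raw.partition(b"\r\n\r\n")
  print(headers.decode("iso-8859-1"))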


Client requirements

You don't have to write a program that will [attempt to] copy the entire Internet to your hard drive :-), so naturally, some constraints are added to simplify the problem:

  • Process only the links that are on the same domain as the source.
  • Download only the following file types:
    • Images: {PNG, JPG, BMP}
    • Archives: {ZIP, RAR}
    • Documents: {PDF, DOC, ODF, DOCX, TXT, RTF}
  • Ignore (i.e. do not download) files that are larger than N MB in size (N can be adjusted)
  • For files smaller than N MB, resume support must be implemented: if the connection is broken while a file is being downloaded, you must be able to resume the download from the point where it stopped, rather than download the whole file from scratch (a sketch of the relevant HTTP headers follows this list)
  • When the program is done, it must show a report which lists all the pages/files that could not be processed correctly for various reasons (ex: connection timed out, dead link, the resource was moved, etc)
  • The web-server must be under the impression that your program is "Netscape Navigator 4.2" running on "Windows 95"
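
Here is the promised sketch: resuming maps onto the Range header (defined in RFC 2616), and the reported browser identity is simply whatever you put into the User-Agent header. The exact User-Agent string below is only a plausible guess, not a format required by the assignment:

  def build_request(host, path, offset=0):
      lines = [
          "GET " + path + " HTTP/1.1",
          "Host: " + host,
          # Pretend to be an old browser, as the assignment requires.
          "User-Agent: Mozilla/4.2 (compatible; Netscape Navigator 4.2; Win95)",
          "Connection: close",
      ]
      if offset > 0:
          # Ask only for the bytes that are still missing; a server that
          # supports ranges replies with "206 Partial Content".
          lines.append("Range: bytes=" + str(offset) + "-")
      return "\r\n".join(lines) + "\r\n\r\n"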


Notes

  • The program does not need a GUI; a command-line application will do.
  • It does not matter whether your application uses threads, multiple instances of itself, or a single thread. The objective of this assignment is to learn how to implement a protocol, so multi-threading is not relevant in this context. However, feel free to experiment with different approaches if you are curious how they will affect the performance of your program.


Suggestions

The program should be built from a set of reusable functions. Recommendations:

  • DownloadFile(source, destination) - implements the download feature, as well as the resume functionality;
  • GetFileSize(source) - returns the size of a file stored on an HTTP server (a sketch follows this list);
  • DownloadFileFragment(source, offset, size) - this function can be used inside DownloadFile; it downloads size bytes starting from offset.
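
As an illustration of GetFileSize, the sketch below sends a HEAD request and reads Content-Length from the reply; it assumes the server answers HEAD requests and reports the size, which is not guaranteed for every resource:

  import socket

  def get_file_size(host, path, port=80):
      request = (
          "HEAD " + path + " HTTP/1.1\r\n"
          "Host: " + host + "\r\n"
          "Connection: close\r\n\r\n"
      )
      with socket.create_connection((host, port), timeout=10) as sock:
          sock.sendall(request.encode("ascii"))
          response = b""
          while True:
              data = sock.recv(4096)
              if not data:
                  break
              response += data
      # Look for the Content-Length header in the reply.
      for line in response.split(b"\r\n"):
          if line.lower().startswith(b"content-length:"):
              return int(line.split(b":", 1)[1])
      return None  # the server did not report a size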

Grading policy

Assuming that everything works right,

  • 8 - for making your program able to download a file via HTTP;
  • 9 - for implementing resume support;
  • [a] 10 - if everything else works too;
  • [b] 10 - if you add the possibility to limit the download speed to a value specified in KB/s.
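
For option [b], one simple approach is to read the body in fixed-size chunks and sleep whenever the average rate gets ahead of the configured limit; a rough sketch (the function and parameter names are invented for the illustration):

  import time

  def receive_throttled(sock, out_file, limit_kbps):
      max_bytes_per_sec = limit_kbps * 1024
      start = time.monotonic()
      received = 0
      while True:
          data = sock.recv(4096)
          if not data:
              break
          out_file.write(data)
          received += len(data)
          # Time we should have spent at the configured rate vs. time actually spent.
          expected = received / max_bytes_per_sec
          elapsed = time.monotonic() - start
          if expected > elapsed:
              time.sleep(expected - elapsed)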

A partially implemented feature that does not work 100% right may still be accepted as fully implemented, provided that your report describes the nature of the problem and offers some guesses about what the solution might be.

Examples

import this