Lab2 - HTTP extensions

#include <Lab2 - HTTP crawler>

The generic requirements are taken from the standard assignment mentioned above; the grading policy is adjusted as follows [assuming that everything works correctly]:

  • 8 - for making your program able to download a file via HTTP;
  • 9 - for implementing resume support;
  • 10 - for all of the above + implementing one of the features described in the sections below.

There are no constraints, except one:

  • Use BSD sockets; high-level wrappers are not allowed.
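
A download with resume support can be sketched roughly as follows. This is only an illustration, not a reference solution: it assumes plain HTTP on port 80, a server that honours the Range header, and a response that is not sent with chunked transfer encoding. The function names build_request, parse_head and download are made up for this example. Python's socket module is used here as a thin layer over the BSD socket calls.

```python
import os
import socket

def build_request(host, path, offset=0):
    """Return the bytes of a GET request; offset > 0 asks the server to resume."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}", "Connection: close"]
    if offset > 0:
        lines.append(f"Range: bytes={offset}-")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

def parse_head(head):
    """Split a raw response head into (status code, dict of lowercase headers)."""
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        if ":" in line:
            name, value = line.split(":", 1)
            headers[name.strip().lower()] = value.strip()
    return status, headers

def download(host, path, filename, port=80):
    """Fetch path into filename, continuing a partial file if one exists."""
    offset = os.path.getsize(filename) if os.path.exists(filename) else 0
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect((host, port))
        sock.sendall(build_request(host, path, offset))
        data = b""
        while b"\r\n\r\n" not in data:  # read until the end of the headers
            chunk = sock.recv(4096)
            if not chunk:
                raise ConnectionError("connection closed before the headers ended")
            data += chunk
        head, body = data.split(b"\r\n\r\n", 1)
        status, _headers = parse_head(head)
        if status == 206:      # server agreed to resume: append
            mode = "ab"
        elif status == 200:    # server ignored Range: start over
            mode = "wb"
        else:
            raise RuntimeError(f"unexpected HTTP status {status}")
        with open(filename, mode) as out:
            out.write(body)
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                out.write(chunk)
    finally:
        sock.close()
```

A status of 206 means the server is sending only the missing part, so the file is opened for appending; a status of 200 means the whole file is coming again. Redirects, chunked responses and an already complete file (where a server would typically answer 416) are left out of this sketch.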

Shill-o-scope

The Internet is full of shills - your mission is to reveal them. Write a tool that analyzes the list of contributors to a Wikipedia article, as well as the list of their other contributions.

The result must be a graph that answers these questions:

  • which users edited an article?
  • what else have they edited?

This way you can easily spot the accounts that were created with the sole purpose of promoting the agenda of a single entity (e.g. a politician, a party, or a company and its product). In contrast, edits coming from accounts with a diverse set of interests are less likely to be fishy.

You get bonus points for visualizing the graph in a user-friendly way, e.g. (though it doesn't have to be that pretty):

[Image: Shilloscope.png]

In the picture above it is clear that Marfusha Agafieva, Magarizza, PA3ot and N-gree Dorque are only editing articles related to Partidul Retrograd din Petrograd, whereas Pasakistakapustalaudamas has a more diverse range of interests. What about Murziq? That's up to the analyst to figure out - your software should only render a graph; labeling editors as "shill" or "honest contributor" is outside the scope.


The input of your program is:

  • a list of articles and/or
  • a list of usernames or IP addresses that edits were made from
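
One possible in-memory shape for such a graph is a mapping from each editor to the set of articles they touched. The sketch below assumes the edits have already been collected as (user, article) pairs; build_graph and to_dot are hypothetical names, and the output is Graphviz DOT text, which is just one of many ways to get a picture.

```python
from collections import defaultdict

def build_graph(edits):
    """Map each user to the set of articles they edited, from (user, article) pairs."""
    graph = defaultdict(set)
    for user, article in edits:
        graph[user].add(article)
    return graph

def quote(name):
    """Quote a node name for DOT, escaping backslashes and double quotes."""
    return '"' + name.replace("\\", "\\\\").replace('"', '\\"') + '"'

def to_dot(graph):
    """Render the user-article edges as an undirected Graphviz graph."""
    lines = ["graph edits {"]
    for user in sorted(graph):
        for article in sorted(graph[user]):
            lines.append(f"  {quote(user)} -- {quote(article)};")
    lines.append("}")
    return "\n".join(lines)

# Made-up sample data: B edits two articles, A only one.
sample = [("A", "X"), ("B", "X"), ("B", "Y")]
print(to_dot(build_graph(sample)))
```

An editor whose set contains a single article (or articles about a single entity) stands out immediately once the graph is drawn; deciding what that means remains the analyst's job.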

Note: all names are fictitious, coincidences are just coincidences.

Music MP3 finder

The program must be able to find and download a song (in MP3 format) from the Internet, given a title and artist name.

Example:

  • If the program is told to look for "Bart Claessen - Elf" - it will attempt to find this song, download it and save it in a local directory.
  • If the song is not found, a "Song not found" error message must be returned.
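
The input handling around the search might look like the sketch below. It assumes the request arrives as one "Artist - Title" string; the names SongNotFound, parse_query, search_terms and local_filename are invented for this example, and the actual search against a backend is deliberately not shown because the site's URL layout is not described here.

```python
import re
from urllib.parse import quote_plus

class SongNotFound(Exception):
    """Raised when no download candidate was found for the request."""

def parse_query(text):
    """Split 'Artist - Title' into (artist, title)."""
    artist, sep, title = text.partition(" - ")
    if not sep or not artist.strip() or not title.strip():
        raise ValueError("expected input of the form 'Artist - Title'")
    return artist.strip(), title.strip()

def search_terms(artist, title):
    """URL-encode the words so they can be placed in a query string."""
    return quote_plus(f"{artist} {title}")

def local_filename(artist, title):
    """Build a file name, replacing characters that are unsafe in file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", f"{artist} - {title}") + ".mp3"

def pick_first(candidates):
    """Return the first candidate URL, or report that the song was not found."""
    if not candidates:
        raise SongNotFound("Song not found")
    return candidates[0]
```
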
Test cases

Testing is very important, so here is a list of test cases for you:

  • Michael Cassette - Ghost In The Machine
  • Parhelia - Perpetual Motion
  • Dandy Warhols - Bohemian like you
  • Many many many more test cases


Implementation recommendations
  • Use Z-music.ru as a backend for searching and downloading files
  • Sssshh! Don't tell anyone about the site


Advice ripper

This site is a great source of advice for engineers. Unfortunately, it takes a lot of time to read all the pages.

Your mission is to develop a program that:

  • retrieves all the advice from the site and stores it in a large, properly formatted data file™, for quicker and more efficient learning;
  • is polite, and won't knock the server down with frequent requests;
  • is persistent, and makes sure that all the information is retrieved even if the server temporarily refuses to serve the client.
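
The "polite" and "persistent" requirements could be combined along the lines of the sketch below: a fixed pause between pages, and retries with a growing pause when a request fails. Everything here is an assumption made for illustration: the name fetch_all, the fetch callable (whatever socket-based function retrieves one page), the one-second delay and the limit of five attempts.

```python
import time

def fetch_all(urls, fetch, delay=1.0, max_tries=5, sleep=time.sleep):
    """Fetch every URL in turn, pausing between requests and retrying failures."""
    results = {}
    for url in urls:
        wait = delay
        for attempt in range(1, max_tries + 1):
            try:
                results[url] = fetch(url)
                break
            except OSError:          # refused, reset, timed out, ...
                if attempt == max_tries:
                    raise            # give up on this run after max_tries failures
                sleep(wait)
                wait *= 2            # back off: wait twice as long next time
        sleep(delay)                 # politeness pause before the next page
    return results
```

Doubling the pause after each failure keeps the client from hammering a server that is already struggling. A program that must never give up could save its progress and retry later instead of re-raising the error.
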
Implementation recommendations

Podcast downloader

The 7th Avenue project is an excellent source of podcasts - interviews with great people, with all that wisdom distilled for us and served for free.

Your mission is to develop a program that:

  • retrieves a list of available podcasts in this format: date, title, file-size
  • downloads the selected podcast when the user asks for it (the podcast is specified by its number in the list above)
Implementation recommendations
  • Who said you had to parse HTML? Me? No, I didn't say that!
  • XML, on the other hand, is a good candidate for the title of a large, properly formatted data file
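
If the site publishes an RSS 2.0 feed (an assumption, it is not confirmed here), the date, title and file size can be read from the pubDate, title and enclosure elements of each item. The function name list_podcasts is made up for this sketch, and the feed text is assumed to have been downloaded already.

```python
import xml.etree.ElementTree as ET

def list_podcasts(rss_text):
    """Return one dict per feed item that carries an enclosure (the audio file)."""
    root = ET.fromstring(rss_text)
    episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # not a downloadable episode
        length = enclosure.get("length", "")
        episodes.append({
            "date": item.findtext("pubDate", default=""),
            "title": item.findtext("title", default=""),
            "size": int(length) if length.isdigit() else 0,  # bytes, 0 if unknown
            "url": enclosure.get("url", ""),
        })
    return episodes

def format_list(episodes):
    """Number the episodes so that the user can pick one by its number."""
    return [f"{n}. {e['date']}, {e['title']}, {e['size']} bytes"
            for n, e in enumerate(episodes, 1)]
```

The number shown to the user is simply the position in the list, so downloading episode k means fetching the url field of the k-th entry.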


Wallpaper finder

People are stuck with the wallpapers shipped with the operating system, but they're looking for something new. Who said you don't have the power to change that?


Your mission is to develop a program that goes to the Internet in search of a wallpaper, finds one, downloads it, and saves it to a local directory. The following functionality has to be provided:

  • download wallpapers of a specific size (e.g. 800x600, 1024x768, 1280x1024, etc.);
  • keyword pic - a keyword [or set of keywords] is entered, and the program tries its best to find a wallpaper that is somehow related to the given words;
  • "I'm feeling lucky" mode - the program downloads whatever it feels like downloading; as long as the retrieved image is of the specified size, it is accepted as a good wallpaper.
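
Checking that a retrieved image really has the requested size does not require downloading the whole file. As a sketch limited to PNG files (JPEG needs a different parser), the width and height sit in the IHDR chunk within the first 24 bytes. The names parse_size, png_size and is_good_wallpaper are invented for this example.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def parse_size(text):
    """Turn a string such as '1024x768' into the tuple (1024, 768)."""
    width, height = text.lower().split("x")
    return int(width), int(height)

def png_size(data):
    """Return (width, height) from the start of a PNG file, or None if not a PNG."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return None
    # Bytes 16..23 hold width and height as big-endian 32-bit unsigned integers.
    return struct.unpack(">II", data[16:24])

def is_good_wallpaper(data, wanted):
    """True if the image data is a PNG of exactly the wanted size string."""
    return png_size(data) == parse_size(wanted)
```
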


Implementation recommendations
  • To ensure that you can write "No people were harmed while developing and testing this software" in your project report, make sure that these keywords are banned from the searches: {goatse, tubgirl}
  • Getting wallpapers via RSS
  • Yahoo search API
  • Simpledesktops - minimalistic site, a collection of high-res wallpapers, easy to parse

Guest submissions

Advanced desktop Wikipedia searcher

The simplest way to search the Wikipedia is to go to its home page and type the keywords into the search box. This often takes too much effort, which is why there are a number of browser plug-ins and stand-alone command-line programs that search the Wikipedia in a less cumbersome manner.

Your mission is to develop an application which takes the following input:

  • N keywords to search for (N ≥ 1);
  • a number M, not greater than N (if not specified, M is the same as N);
  • a boolean value InOrder (False by default).

The application should produce links to at most the first K pages matching the search criterion (K may be hard-coded, but should preferably be configurable). The following conditions apply:

  • only the pages with paragraphs which contain M out of N keywords match the search criterion;
  • if InOrder == True only the pages with paragraphs which contain the keywords in the order in which they were specified match the search criterion;
  • the search should only be done within the body of the page, excluding the control elements.
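
The matching rule for a single paragraph can be sketched as below. The conditions above leave some room for interpretation, so this example makes its own assumptions: keywords are single words, matching is case-insensitive, and with InOrder set at least M of the keywords must occur in the paragraph in the order given (checked with a longest-common-subsequence computation). The names words, lcs_length and paragraph_matches are hypothetical.

```python
import re

def words(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"\w+", text.lower())

def lcs_length(keywords, tokens):
    """Length of the longest common subsequence of the two word lists."""
    prev = [0] * (len(tokens) + 1)
    for kw in keywords:
        cur = [0]
        for j, tok in enumerate(tokens, 1):
            if tok == kw:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def paragraph_matches(paragraph, keywords, m=None, in_order=False):
    """True if the paragraph contains at least m of the keywords."""
    keywords = [k.lower() for k in keywords]
    m = len(keywords) if m is None else m   # M defaults to N
    tokens = words(paragraph)
    if in_order:
        return lcs_length(keywords, tokens) >= m
    present = set(tokens)
    return sum(1 for k in set(keywords) if k in present) >= m
```

For example, the paragraph "The quick brown fox" matches the keywords quick, cat, fox with M = 2 and InOrder set, because quick and fox appear in that order; it does not match with M = 3.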

The links may be accompanied by the corresponding matching paragraphs; the number of paragraphs to show (which is greater than *or equal to* zero) is at your discretion.

Implementation recommendations
  • Crawling the Wikipedia to find the pages matching the search criterion is just too brutal; consider filtering the results produced by the Wikipedia search engine.