Cryptographic hashes

Keywords

cryptography, hash, md5, sha1, sha2, sha256, sha512, digest, collision.

Objective

Understand the purpose of hashing algorithms and create a tool that solves a problem using one of them.

Directory integrity checker

This program keeps an eye on the contents of a directory, notifying you when something inside it has changed. It is run at regular intervals by a scheduler, comparing the current state of the directory with a previous snapshot and reporting any differences it finds.

Thus, if your system is hacked and malware is planted in the file system, or if existing files are modified to include malicious code, you'll know right away.

The software must provide the following bonus functionality:

  • exclude certain files or directories by
    • size (skip files above a certain size, given in bytes)
    • mask (skip paths that match a pattern such as *.txt or temp_*.bin); a sketch of this test follows the list
  • -silent - this command-line argument ensures that nothing is printed on the screen if no differences were found
  • the report must be in the form of a list, each line containing the full path of a modified file
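
As a sketch of the size and mask tests above (assuming Python's fnmatch and os modules; should_skip is an illustrative helper name, not a required interface):

import fnmatch
import os

def should_skip(path, max_size=None, masks=()):
    #skip files above the size limit, if one was given
    if max_size is not None and os.path.getsize(path) > max_size:
        return True
    #skip paths whose file name matches any of the exclusion masks
    name = os.path.basename(path)
    return any(fnmatch.fnmatch(name, mask) for mask in masks)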

Examples of use:

dircheck /var/www
dircheck /var/www -maxsize 60000
dircheck /var/www -maxsize 2048 -exclude *.tmp *.bak
dircheck /var/www -silent

Example of output:

* /var/www/index.php (modified)
- /var/www/serious.html (deleted)
+ /var/www/nothing/to/see/here/move.along (created)
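
The command lines shown above could be parsed along these lines (a sketch assuming Python's argparse module; the single-dash long options mirror the examples of use):

import argparse

parser = argparse.ArgumentParser(prog='dircheck')
parser.add_argument('path', help='directory to check')
parser.add_argument('-maxsize', type=int, help='skip files above this size, in bytes')
parser.add_argument('-exclude', nargs='+', default=[], help='skip paths matching these masks')
parser.add_argument('-silent', action='store_true', help='print nothing if no differences were found')
args = parser.parse_args()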

Recommendations

  • Traverse the directory recursively
  • Store your snapshots in SQLite databases or in pickled dictionaries (if you use Python); SQLite makes it easy to answer questions like "give me a list of files that are in A but not in B"
  • Some files can be so large that they won't fit into RAM, so you have to process them in smaller chunks; see the sketch after this list
  • Do not build your own scheduler; use cron or Windows Task Scheduler to run the tool at regular intervals
  • Sometimes you can skip hashing a file: if its size has changed, the hash has obviously changed too
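
A minimal sketch of the snapshot-and-compare core, assuming pickled dictionaries that map each path to a (size, digest) pair; take_snapshot, compare, SHA-256 and the 64 KiB chunk size are all illustrative choices, not requirements:

import hashlib
import os
import pickle

CHUNK = 65536  #read files in 64 KiB pieces so they never have to fit into RAM

def file_digest(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(CHUNK), b''):
            h.update(chunk)
    return h.hexdigest()

def take_snapshot(root):
    #map every file under `root` to its size and digest
    snapshot = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            snapshot[path] = (os.path.getsize(path), file_digest(path))
    return snapshot

def compare(old, new):
    #dictionary key views support set operations, which makes the
    #"in A but not in B" queries one-liners
    for path in sorted(new.keys() - old.keys()):
        print('+ %s (created)' % path)
    for path in sorted(old.keys() - new.keys()):
        print('- %s (deleted)' % path)
    for path in sorted(old.keys() & new.keys()):
        if old[path] != new[path]:
            print('* %s (modified)' % path)

A run would then load the previous snapshot with pickle.load, call compare against a fresh take_snapshot, and pickle.dump the fresh snapshot for the next run.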

Dedupe

A duplicate finder that analyzes a given directory and prints groups of identical files stored under different names or paths.

Examples of use:

dedupe /home/murzilka

Example of output:

8cff6a87456225afc3b0bd8fecb8c515
/home/murzilka/photos/photo-one.jpg
/home/murzilka/archive/images/fotografia-unu.JPEG
07239defa7db6e7f1c221953c68fe609
/home/murzilka/warez/prodigy - mindfields.mp3
/home/murzilka/music/mooz0n-cumparat-legal/the prodigy - mindfields (original mix).mp3
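
A minimal sketch of the grouping step, reusing the chunked-reading pattern from above (dedupe is an illustrative name; MD5 matches the digests in the example output):

import hashlib
import os
from collections import defaultdict

def dedupe(root):
    #group file paths by the MD5 digest of their contents
    groups = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.md5()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(65536), b''):
                    h.update(chunk)
            groups[h.hexdigest()].append(path)
    #only groups with more than one path are duplicates
    for digest, paths in groups.items():
        if len(paths) > 1:
            print(digest)
            for path in paths:
                print(path)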


Grading policy

Assuming that everything works right,

  • 7 - for being able to print the hash of a given file to stdout;
  • 8 - same as 7, but knowing how to do it for files larger than the size of your RAM;
  • 9 - for implementing the basic features;
  • 10 - for the basics + bonus features.


Self-test questions

  • Enumerate several hashing algorithms
  • What is a collision? Why do collisions exist?
  • When is it a good idea to use MD5?
  • How many bits are there in a SHA1 digest?
  • Which hashing algorithm is the fastest? Why? How do you know?
  • Why are multiple rounds of hashing sometimes applied?
  • What must an attacker do in order to fool the discussed directory checker and avoid the detection of a maliciously altered file?

Examples

>>> import hashlib

#hash a string on the spot; note that hashlib operates on bytes, not str
>>> hashlib.md5(b'test').hexdigest()
'098f6bcd4621d373cade4e832627b4f6' #view the digest as a hexadecimal string
>>> hashlib.md5(b'test').digest()
b"\t\x8fk\xcdF!\xd3s\xca\xdeN\x83&'\xb4\xf6" #get the digest as raw bytes

#hash a longer string by feeding it one chunk at a time;
#this is handy when dealing with large volumes of data
>>> h = hashlib.md5()
>>> h.update(b'te')
>>> h.update(b'st')
>>> h.digest()
b"\t\x8fk\xcdF!\xd3s\xca\xdeN\x83&'\xb4\xf6"
>>> h.hexdigest()
'098f6bcd4621d373cade4e832627b4f6'
>>>
>>> h.update(b'.') #at this point the data we have actually hashed is `test.`
>>> #observe that even though the difference is just one character, the hash is
>>> #entirely different
>>> h.hexdigest()
'8cff6a87456225afc3b0bd8fecb8c515'

#hash the same string in one go and get the same output
>>> hashlib.md5(b'test.').hexdigest()
'8cff6a87456225afc3b0bd8fecb8c515'
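
The same chunked update() pattern extends naturally to files, which is how you hash data larger than your RAM (the file name below is only an illustration):

#hash a file one chunk at a time
>>> h = hashlib.md5()
>>> with open('snapshot.pickle', 'rb') as f:
...     for chunk in iter(lambda: f.read(65536), b''):
...         h.update(chunk)
...
>>> h.hexdigest() #the resulting digest depends on the file's contents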