3. Cryptographic hash function
Hash functions
You might have heard about hash functions that are used in hash tables. The hash functions were initially made in the 1950s for faster data lookups.
A hash function takes in an input and produces an output called hash.
"Orange butter" ---> [ Hash function ] ---> 15
The hash function's output, that is, the hash or also called the hash value gets stored in a data structure known as hash table.
Hash tables are like an index of a book, we can look at the index to find the page number for a chapter and quickly go to that chapter, we don't need to go through every page.
Suppose, if we want to search for a user named "Orange butter". We pass the string "Orange butter" as input to the hash function. We get 15 as output (the hash value).
"Orange butter" ---> [ Hash function ] ---> 15
Now, we can access the 15th index of the hash table in constant time O(1)
because hash table's
indexes (also called buckets) are an array. We didn't need to go through all the users in sequence or in a
binary search fashion. Hash table lookups
are O(1)
. Super fast.
All of the above was about hash functions used for hash table's purposes.
As time passed, transistors came along and computers started reaching consumer's hands. People started sharing software programs and files over network and floppy disks.
Programmers realised that they needed some way to verify if a file's contents are not tampered with. That is, to verify whether the bits of a file are not modified by any third-party. This is called file integrity check -- whether the bits of a file are accurately integrated.
This need for integrity checks led to the widespread development of cryptographic hash functions.
Cryptographic hash functions
Cryptographic hash functions share similar spirit with the non-cryptographic hash function we discussed above, but the cryptographic hash functions are designed with totally different objectives in mind so their internal design and implementation are different from the non-cryptographic ones.
Cryptographic hash functions given an input produces a fixed-length hash.
In the below code editor, try running the code. Change the input a little bit and try running it again. You will observe that even a smallest change like a space generates an entirely different hash.
Every cryptographic hash function must have the below three properties --
- They must be deterministic, which means they must return the same hash value when the same input is passed to the hash function.
- It must be impossible to take the hash and figure out the input from the hash. Cryptographic hash functions are a one-way function. There must be no way at all to turn the hash value back to the input.
- For almost every input there should be a unique hash, no two differing input should share the same hash. If two input share the same hash, it's called a hash collision.
If a hash collision can be generated for a cryptographic hash function in a feasible way, then that cryptographic hash function is meant to be insecure.
For example, a popular cryptographic hash function SHA-1 suffered a hash collision on 2017. It retired its use on web browsers and digital signatures.
Currently, SHA-256 and SHA-512 are cryptographically secure. By being cryptographically secure, it means that a hash collision can not be generated for SHA-256 hash in a practical way. If you are given a SHA-256 hash, to find a hash collision for it using brute-force, it will take approx. 10^59 years! That's just impractical to compute.
For a cryptographic hash function to be secure, it must be impractical even for a large computing farm to be able to generate a hash collision for a given hash.
Let us understand the 256 and 512 in SHA-256 and SHA-512. The 256 in SHA-256 denotes the hash output size (or also called hash length) in bits. So this means, SHA-256 will produce a hash output of 256 bits. Similarly, the SHA-512 will produce a hash output of 512 bits. ----------------------------------------------------------------------- SHA-256("curious cryptography") = b9b6062f154f139f5bd54b36a1eaf5c2182ed9521c8225c59f60bbd5127f9671 The above hash output is in hexadecimal format. If you count the number of hex characters in the above SHA-256 hash, its 64 characters and one hexadecimal character occupies 4 bits. So, 64 total characters * 4 bits per character = 256 bits. That's why SHA-256.
We understand that it must be impractical to generate a hash collision. But what would happen if we were able to generate a hash collision in a practical way...? What bad would happen?
To understand that we need to talk about real world use cases of cryptographic hash functions. Once we understand them, we will get why it's super important for a cryptographic hash function to be immune to hash collisions.
Real world uses of cryptographic hash functions
Git commits and objects
Git extensively used SHA-1 hash function. When we make a git commit
, git collects information
about
the current directory's files, current timestamp and author details, it passes all this data to a SHA-1 hash
function, which produces a commit hash like 27f0f577ce740bc302eb8bf55faf293c15a2131c
.
Now imagine, if an attacker somehow manages to edit files in a particular fashion that creates a commit hash which already exists in the repo (a hash collision), other developers might unknowingly pull and run the malicious code, thinking it's a legitimate version.
The above situation is quite imaginary because a malicious code cannot be directly committed by an outsider, it needs to be accepted & merged by your team via a pull request. But, you get the idea why it's bad if somehow someone manages to produce a hash collision.
Don't worry, Git has decided to migrate to a more cryptographically secure hash function.
Package's integrity
As we already hinted above, integrity is one of the major use cases for cryptographic hash functions. In the Ubuntu package distribution system, there are several mirrors of the package archive. Mirrors are created off the main archive to serve local geography. For example, MIT has its own mirror of the Ubuntu package archive, so MIT students can download Ubuntu packages very quickly without querying the far away main archive.
The main Ubuntu archive publishes hash values of their packages called checksums. Checksum
is
another name for
the
hash value. When packages are downloaded via apt
from a mirror, a checksum is generated by
passing
the
downloaded package
to
the same hash function that was used on the main Ubuntu archive. If the checksum matches to that of the main
Ubuntu
archive, it means the downloaded package's content is exactly the same as that of the main Ubuntu site. So,
it
can
be safely installed and run.
Imagine if a bad actor creates a mirror, uploads a modified package 😈 that produces the same hash (a hash
collision) as the main Ubuntu
archive's package. Then when you install that package, apt
is not going to
complain since the
hash
of the downloaded package and that of the main Ubuntu archive are same, apt
thinks that the
downloaded package's content and that of main Ubuntu archive are same as well. This could lead to malicious
code being run on your computer.
So, hash collisions are the biggest enemy for a cryptographic hash function. In the case of non-cryptographic hash functions that we use with hash tables, if a hash collision occurs, it's not that big of a deal; the hash collision is just going to affect the performance of the lookup. But for cryptographic hash functions, hash collisions are intolerable.
Go to next lesson: Password hashing