#️⃣ Hash functions

Hash functions

You might have heard about hash functions that are used in hash tables. The hash functions were initially made in the 1950s for faster data lookups.

non-cryptographic hash function — A hash table

A hash function takes in an input and produces an output called hash that serves as a shortcut to access the input data much faster next time we need it.

Hash tables are like an index of a book, we can look at the index to find the page number for a chapter and quickly go to that chapter, we don't need to go through every page.

From the above image, if we want to search for a user named "Orange butter". We pass the string "Orange butter" as input to the same hash function we used earlier when we created the hash table. We get 15 as output (the hash value).

        "Orange butter"  --->  [ Hash function ] ---> 15

Now, we can access the 15th index of the hash table in constant time O(1) because hash table's indexes (also called buckets) are an array. We didn't need to go through all the users in sequence or in a binary search fashion. Hash table lookups are O(1). Super fast.

That's about hash functions used for hash table's purposes.

As time passed, transistors came along and computers started reaching consumer's hands. People started sharing software programs and files over network and floppy disks.

Programmers realised that they needed some way to verify if a file's contents are not tampered with. That is, to verify whether the bits of a file are not modified by any third-party. This is called file integrity check -- whether the bits of a file are accurately integrated.

This need for integrity checks led to the widespread development of cryptographic hash functions.

Cryptographic hash functions

Cryptographic hash functions share similar spirit with the non-cryptographic hash function we discussed above, but the cryptographic hash functions are designed with totally different objectives in mind so their internal design and implementation are different from the non-cryptographic ones.

Cryptographic hash functions given an input produces a fixed-length hash.

In the below code editor, try running the code. Change the input a little bit and try running it again. You will observe that even a smallest change like a space generates an entirely different hash.

Every cryptographic hash function must have the below three properties --

They must be deterministic, which means they must return the same hash value when the same input is passed to the hash function.
It must be impossible to take the hash and figure out the input from the hash. Cryptographic hash functions are a one-way function. There must be no way at all to turn the hash value back to the input.
For almost every input there should be a unique hash, no two differing input should share the same hash. If two input share the same hash, it's called a hash collision.

hash collision illustration — A hash collision

If a hash collision is detected in a cryptographic hash function, then that cryptographic hash function is meant to be insecure. For example, a popular cryptographic hash function SHA-1 suffered a hash collision on 2017 that retired its use on web browsers.

Currently, SHA-256 and SHA-512 are cryptographically secure. Let us understand the 256 and 512 in these SHAs.

        The SHA-256 will produce a hash of 256 bits (32 bytes). 
        The 256 in SHA denotes the hash output size (or also called hash length) in bits.

        Similarly, the SHA-512 will produce a hash of 512 bits (64 bytes). 

        -----------------------------------------------------------------------

        SHA-256("curious cryptography") = 
        b9b6062f154f139f5bd54b36a1eaf5c2182ed9521c8225c59f60bbd5127f9671

        The above hash output is in hexadecimal format.
        We can't use the UTF-8 encoding because hash output bytes are random, 
        they won't necessarily map to a UTF-8 character, so trying to encode
        to UTF-8 will error out.
        That's why, the hash outputs are always shown in hexadecimal!
        
        If you count the number of hex characters in the above hash, 
        its 64 characters.
        One hexadecimal character occupies 4 bits, and total hash size is 256 bits.
        So, number of hex characters in SHA-256 should be 256 / 4 = 64 hexadecimal characters.

You might wonder, why hash collisions make a cryptographic hash function insecure? To understand that we need to talk about real world use cases of cryptographic hash functions.

Real world uses of cryptographic hash functions

Git commits and objects

git log hash — git log of one of my repo's

Git extensively used SHA-1 hash function. When we make a git commit, git collects information about the current directory's files, current timestamp and author details, it passes all this data to a SHA-1 hash function, which produces a commit hash like 27f0f577ce740bc302eb8bf55faf293c15a2131c.

Now imagine, if an attacker somehow manages to edit files in a particular fashion that creates a commit hash which already exists in the repo (a hash collision), other developers might unknowingly pull and run the malicious code, thinking it's a legitimate version.

The above situation is quite imaginary because a malicious code cannot be directly committed by an outsider, it needs to be accepted & merged by your team via a pull request. But, you get the idea why it's bad if somehow someone manages to produce a hash collision.

Don't worry, Git has decided to migrate to a more cryptographically secure hash function.

Package's integrity

file integrity illustration — Package integrity process

As we already hinted above, integrity is one of the major use cases for cryptographic hash functions. In the Ubuntu package distribution system, there are several mirrors of the package archive. Mirrors are created off the main archive to serve local geography. For example, MIT has its own mirror of the Ubuntu package archive, so MIT students can download Ubuntu packages very quickly without querying the far away main archive.

The main Ubuntu archive publishes hash values of their packages called checksums. Checksum is another name for the hash value. When packages are downloaded via apt from a mirror, a checksum is generated by passing the downloaded package to the same hash function that was used on the main Ubuntu archive. If the checksum matches to that of the main Ubuntu archive, it means the downloaded package's content is exactly the same as that of the main Ubuntu site. So, it can be safely installed and run.

Imagine if a bad actor creates a mirror, uploads a modified package 😈 that produces the same hash (a hash collision) as the main Ubuntu archive's package. Then when you install that package, apt is not going to complain since the hash of the downloaded package and that of the main Ubuntu archive are same, apt thinks that the downloaded package's content and that of main Ubuntu archive are same as well.

So, hash collisions are the biggest enemy for a cryptographic hash function. In the case of non-cryptographic hash functions that we use with hash tables, if a hash collision occurs, it's not that big of a deal; the hash collision is just going to affect the performance of the lookup. But for cryptographic hash functions, hash collisions are intolerable.

🌟 Hash hash hash. Please, fill a short feedback form for the hashes around you.