Hash Functions in Web3 (SHA-256, Keccak-256, Merkle Trees)

Hash Functions in Web3 (SHA-256, Keccak-256, Merkle Trees)

Preimage resistance, collisions, avalanche, and how hashes secure blocks, addresses, storage, logs, and proofs across modern chains.

TL;DR: A cryptographic hash maps arbitrary data to a fixed-size digest while resisting
preimage, second-preimage, and collision attacks. Bitcoin standardizes on
SHA-256 (often applied twice), while the EVM stack uses Keccak-256 (the pre-NIST variant many people casually call “SHA-3”).
Hashes link blocks, derive addresses and function selectors, anchor commitments, and power Merkle proofs and Patricia tries.
Good engineering means domain separation, structured encoding, and clear Merkle conventions (ordering, padding, and odd leaves).
Avoid classic pitfalls (length-extension, ambiguous abi.encodePacked, endianness mistakes); use HMAC or EIP-712 when authenticating.

1) Security Properties and the Avalanche Effect

A cryptographic hash compresses any input into a fixed-length output (for example 256 bits) that looks random to an attacker with bounded resources. The strength of a hash family is usually summarized by three properties:

  • Preimage resistance: given a digest H, it should be infeasible to find any m such that hash(m) = H. Work factor ≈ 2n for an n-bit hash.
  • Second-preimage resistance: given m, it is infeasible to find a different m′ with hash(m′) = hash(m).
  • Collision resistance: infeasible to find any pair (m, m′) with the same digest. Because of the birthday paradox, cost is ≈ 2n/2 not 2n.

The avalanche effect means a one-bit change in input flips about half the output bits on average. That sensitivity is critical for tamper detection, block linking, and Merkle structures. A toy example:

"Hello" → 3615f80c9…    // illustration, not full digest
"hello" → 2cf24dba5…    // radically different

Security levels in practice. A 128-bit collision resistance target is generally considered strong. That implies a 256-bit digest (2256 preimage; 2128 collision). Bitcoin and Ethereum land here with 256-bit digests.

2) How Modern Hashes Are Built (Merkle–Damgård and Sponge)

Most blockchain tooling uses either Merkle Damgård or sponge constructions.

  • SHA-256 (SHA-2): Merkle–Damgård with a Davies–Meyer compression function, fixed IV, and length-padding. If you naïvely compute hash(key || message) to “authenticate,” an attacker can exploit length-extension. That is why we use HMAC (hash the key on both sides of the message) for message authentication.
  • Keccak-256: a sponge construction. It absorbs input into a large permutation state and then squeezes output. The EVM opcode KECCAK256 implements the pre-NIST variant often called “Keccak,” not the NIST standardized SHA-3 bit-padding. Sponge design avoids the classic Merkle Damgård length-extension property, but you still need domain separation and clear encodings.

Mental model: SHA-256 is the gravity in Bitcoin land and pervades tools around it. Keccak-256 is the default primitive in EVM land: addresses, event topics, function selectors, storage keys, and RLP-trie hashing all hinge on Keccak.

3) Where Hashes Appear On Chain

  • Block linking: Each block header commits to the previous header’s hash. On Bitcoin, miners find a nonce such that doubleSHA256(header) is below the target. The double application is a legacy defense and a pragmatic “belt and suspenders.”
  • Ethereum addresses: the last 20 bytes of keccak256 of the uncompressed secp256k1 public key. The mixed-case checksum (EIP-55) uses another keccak256 of the hex string to decide which letters to uppercase.
  • Function selectors: first 4 bytes of keccak256("transfer(address,uint256)"). Event topics are hashes of event signatures, enabling log indexing and filtering.
  • Commit and reveal: commit now by posting a hash; reveal later to prove integrity (auctions, randomness beacons, allowlists). Hashes make commitments binding and hidden until reveal.
  • EIP-712: structured hashing of typed data, explicitly scoped by a domain separator (name, version, chain id, and verifying contract), which prevents cross-contract or cross-chain signature replay.
  • Bloom filters: Ethereum block headers include a log bloom to accelerate topic searches; it uses multiple hashings to set bits in a 2048-bit filter (probabilistic, with false positives).

4) Ethereum Storage Hashing and Tries

Ethereum maintains several Merkle–Patricia tries (MPT): a state trie mapping account addresses to account objects, a storage trie per account mapping 256-bit slots to 256-bit values, and a transaction and receipt trie per block. Keccak-256 is the core hash; RLP is the canonical encoding. A few practical notes you will hit in audits and client integrations:

  • Storage slots: a simple state variable at slot p is keyed by p. A mapping from key k at slot p stores its value at keccak256(encode(k) ++ encode(p)). Dynamic arrays and nested mappings have their own conventions. Getting this wrong yields “proof does not verify” bugs.
  • Proofs: EVM JSON-RPC proof methods (for example, account and storage proofs) return the trie nodes so light clients or L2s can verify values against a known state root.
  • Verkle tomorrow: future Ethereum roadmap swaps MPT for Verkle trees (vector commitments) to reduce witness sizes. Hashes still feature prominently, but the commitment scheme changes. Keep an eye on tooling if you verify proofs on chain.
// Mapping storage slot (concept)
// slot = keccak256(abi.encode(key, mappingSlot))
bytes32 slot = keccak256(abi.encode(key, uint256(mappingSlot)));

5) Binary Merkle Trees and Proofs

A Merkle tree hashes leaves, then pairs of children repeatedly until a single root remains. Proving “leaf L is in the tree with root R” needs only the sibling hashes along the path. That proof is logarithmic in the number of leaves, which is why Merkle proofs are perfect for airdrops, inclusion checks, and light clients.

L1  L2  L3  L4
 │   │   │   │
H1  H2  H3  H4           // hash leaves
  \ /     \ /
   A       B             // parents
      \   /
        ROOT             // root = hash(A || B)

Conventions you must fix and document:

  • Pair ordering: many “airdrop” trees hash left || right. Some choose “sorted pairs” (hash min(left,right) || max(left,right)) so the root is independent of insertion order. Pick one and commit to it in your code and docs.
  • Odd leaves: either duplicate the last node or promote it unchanged; Bitcoin duplicates, some application trees promote. Your verifier must match the builder.
  • Hash function: Keccak-256 for EVM app trees is common; Bitcoin uses double SHA-256 and little-endian encodings for its transaction tree.
  • Multi-proofs and sparse trees: bundle many leaves into a single proof that reuses siblings. For key–value maps with fixed depth, sparse Merkle trees commit to 2d leaves with default zero hashes and prove membership or non-membership succinctly.
// Solidity-style verification (sorted pairs; conceptual)
function verify(bytes32 leaf, bytes32[] memory proof, bytes32 root) internal pure returns (bool ok) {
  bytes32 h = leaf;
  for (uint i = 0; i < proof.length; i++) {
    bytes32 s = proof[i];
    if (h < s) { h = keccak256(abi.encodePacked(h, s)); }
    else       { h = keccak256(abi.encodePacked(s, h)); }
  }
  return h == root;
}

6) Common Pitfalls and Secure Patterns

  • Length-extension on Merkle–Damgård: never authenticate with hash(key || msg). Use HMAC (HMAC-SHA-256 etc.) or EIP-712 for user-visible signing. While sponge designs do not have the same property, you still need domain separation and schema discipline.
  • Domain separation: prepend context tags and versions so digests cannot cross contexts. EIP-191 and EIP-712 make this explicit for the EVM.
  • abi.encodePacked collisions: concatenating dynamic types without separators can collide (for example ("ab","c") vs ("a","bc")). Prefer abi.encode for structured hashing unless you have only fixed-width types.
  • Keccak versus SHA-3: Ethereum uses Keccak-256 (pre-NIST). Many off-chain “sha3” libraries compute SHA-3-256 with different padding. Pick the exact function your on-chain code expects.
  • Endianness and casting: treat digests as byte strings; only interpret as numbers if the protocol defines an endianness. Bitcoin’s wire format is little-endian in places; do not mix conventions.
  • String normalization: hashing “Equal” and “equal ” (trailing space) yields different digests. Normalize inputs (lowercasing where appropriate), and prefer bytes rather than human strings in protocol-critical paths.
  • Salting and binding: include a salt or caller address and chain id in commitments so equal values from different users or chains do not collide semantically.

7) Solidity and Client Implementation Notes

  • Structured hashing: prefer keccak256(abi.encode(...)) for tuples. Use abi.encodePacked only with fixed-width types and explicit separators.
  • OpenZeppelin: audited MerkleProof helpers exist. For sparse trees, multi-proofs, or incremental trees, use audited libraries rather than rolling your own.
  • Storage layout tests: write tests that compute mapping and array slots both in Solidity and in your off-chain script (ethers.js, viem, or python). Mismatches here cause hair-pulling bugs.
  • Commit–reveal: bind the commitment to msg.sender and include a salt:
    bytes32 commit = keccak256(abi.encode(msg.sender, value, salt));
  • Events and topics: remember that the first topic is the event signature hash; if you change parameter types or names in the signature string, the topic changes too.
  • Assembly hashing (advanced): hashing large memory ranges can be more gas-efficient with inline assembly, but readability and safety matter more than micro-optimizations in most apps.

8) ZK-Friendly Hashes (Poseidon, MiMC, Rescue)

When you build zero-knowledge circuits, Keccak and SHA-256 are expensive to prove because their bitwise operations do not map cleanly to arithmetic constraints. ZK-friendly hashes trade raw CPU speed for proving efficiency on elliptic curves or finite fields.

  • Poseidon / Poseidon2: Sponge-like design over prime fields with carefully chosen S-boxes. Excellent for Merkle trees inside circuits (rollups, mixers, privacy wallets).
  • MiMC: Minimal Multiplicative Complexity, simple S-box, compact constraints; security parameters must be tuned and modern variants are preferred.
  • Rescue / Rescue-Prime / Griffin: Algebraic permutations designed for a balance of security and constraint count.

Pattern: Use Poseidon-style trees inside your circuit and Keccak-256 at the EVM boundary. Publish both roots when needed, or carry a small on-chain verifier that maps Poseidon roots into Keccak commitments.

9) Performance and Gas Notes

  • Hash only what you must: pre-hash offline and submit roots on chain. For example, precompute your airdrop tree and store the root and a Merkle-ized bitmap for claimed leaves.
  • Avoid redundant hashing: cache roots per epoch; do not recompute the same tree inside multiple transactions.
  • Proof compression: multi-proofs can reduce calldata size when verifying many leaves at once. Measure gas; for small lists a basic proof may be cheaper.
  • Slot hashing is “free enough” but not zero: keccak256 is an opcode with cost proportional to input size. Avoid hashing huge dynamic arrays on chain.
  • Events vs storage: if you only need public auditability (not contract reads), emitting digests as events can be cheaper than persisting them in storage.

10) Tooling, Test Vectors, and Debugging

Most hash mismatches are tooling mismatches. A quick checklist saves hours:

  • Exact primitive: In Node, ethers.utils.keccak256 operates on hex data; do not give it a UTF-8 string unless you first toUtf8Bytes. In Python, pysha3 provides Keccak; hashlib.sha3_256 is NIST SHA-3 (different padding).
  • Encoding: abi.encode vs abi.encodePacked; integers are left-padded to 32 bytes. Strings and bytes are length-prefixed under abi.encode.
  • End-to-end tests: Produce known test vectors and assert equality in Solidity and off-chain tests (Foundry/Hardhat/viem).
  • Explorers: Etherscan and others display function selectors and event topics; use these to sanity-check your computed hashes.
// JavaScript (ethers v6) — structured hashing
import { keccak256, encodeAbiParameters, toBytes, toUtf8Bytes } from "viem";
const encoded = encodeAbiParameters(
  [{ type: "address" }, { type: "uint256" }, { type: "bytes32" }],
  ["0xabc...def", 123n, "0x"+ "00".repeat(32)]
);
const digest = keccak256(encoded);

// Wrong: hashing the hex string directly
// keccak256("0xabc123") ≠ keccak256(hexBytes)

11) Design Patterns: Commit–Reveal, Salts, and Nonces

Hashes let you commit to data without revealing it, and later open the commitment. A few sturdy patterns:

  • Commit–reveal with binding: bind to the caller and context so others cannot reuse your commitment.
    bytes32 public commit;
    function commitValue(bytes32 c) external { commit = c; }
    function reveal(uint256 v, bytes32 salt) external {
      require(keccak256(abi.encode(msg.sender, v, salt)) == commit, "bad reveal");
      // use v safely
    }
    
  • Nonce hygiene: include a nonce or epoch to prevent replay across time. Reset nonces per user or per route to keep logic simple.
  • Domain separation for signatures: use EIP-712 domain separators (name, version, chain id, verifying contract) so signed digests cannot hop to other contracts or chains.
  • HMAC when you share a secret: if two services share a secret key and need message authentication, use HMAC-SHA-256 or HMAC-SHA-512, not bare hashing.
  • HKDF for key derivation: when deriving multiple keys from a single seed, use HKDF and include a context string (purpose, version) so subkeys are independent.

12) Cheat Sheet: Which Hash When?

  • EVM internals, addresses, selectors, storage: Keccak-256.
  • Bitcoin headers and Merkle roots: SHA-256 (often double).
  • Authenticate messages with a shared secret: HMAC-SHA-256 (or 512).
  • Typed user signatures: EIP-712 (Keccak-256 over structured encodings and domain).
  • On-chain airdrop trees: Binary Merkle with Keccak-256 (document sorted vs left–right and odd-leaf rule).
  • In-circuit Merkle trees and privacy systems: Poseidon or other ZK-friendly hash inside the proof, Keccak-256 on the EVM edge.

13) FAQ and Misconceptions

Q: Is Keccak-256 the same as SHA-3-256?
A: No. The EVM uses Keccak with pre-NIST padding. SHA-3-256 standardized by NIST uses different padding constants. Many libraries label functions “sha3” even when they compute Keccak. Always match what the chain expects.

Q: Why did Bitcoin choose double SHA-256?
A: Historical defense in depth and attempts to reduce some early design concerns about length-extension or accidental collisions. It remains part of consensus and therefore cannot change.

Q: Can I use hashes for randomness?
A: Not safely by themselves. keccak256(block.timestamp) is manipulable by miners or block producers. Use VRFs or commit–reveal with multiple independent parties and delays.

Q: Are collisions a realistic risk for 256-bit hashes?
A: Not with current knowledge and compute, but protocol bugs and poor encodings routinely create practical collisions in structured hashing. Most “hash failures” are engineer-made, not cryptanalytic breakthroughs.

Q: Do I ever hash twice on Ethereum?
A: Rarely needed. Double hashing is more a Bitcoin convention. On the EVM, use one Keccak unless you are matching a specific format or mitigating a very specific malleability.

Quick check

  1. Name the three core properties of cryptographic hashes and which one has the birthday bound.
  2. Explain in one sentence why hash(key || msg) is unsafe for authentication with SHA-256.
  3. List three conventions you must fix and document when publishing a Merkle root.
  4. Why do many Ethereum projects use keccak256(abi.encode(...)) instead of abi.encodePacked?
  5. Give one reason to prefer Poseidon inside a ZK circuit and how you would bridge it to the EVM.
Show answers
  • Preimage, second-preimage, and collision resistance; collisions face the birthday bound (~2n/2 for an n-bit hash).
  • Merkle–Damgård hashes are vulnerable to length-extension, letting attackers compute hash(key || msg || extra) without knowing the key.
  • Pair ordering (left–right vs sorted), odd-leaf rule (duplicate vs promote), and the exact hash function and input encoding (Keccak-256 vs SHA-256, byte order).
  • It prevents ambiguous concatenations across dynamic types and cleanly matches cross-language structured encodings.
  • Poseidon has far fewer constraints to prove in ZK; publish a Poseidon root for the circuit and a Keccak-256 commitment or on-chain verifier that checks a mapping between the two.

Go deeper

  • Concept lectures: Merkle–Damgård and length-extension, sponge capacity and rate, HMAC proofs, birthday attacks, sparse and Verkle trees.
  • Engineering lectures: EIP-712 typed data hashing, EIP-55 checksums, RLP and trie hashing, multi-proof design, endianness pitfalls.
  • Security lectures: domain separation strategies, commitment schemes for auctions, audited Merkle libraries, cross-client parity tests and vectors.
  • ZK lectures: Poseidon and Rescue design, constraint counts, building a circuit-friendly Merkle tree and bridging to the EVM.

Next: zero-knowledge proofs — prove facts without revealing secrets.


Next: Zero-Knowledge Proofs →