
Indexing and Querying: The Graph, Subgraphs and GraphQL (Complete Guide)

Indexing and querying with The Graph, subgraphs, and GraphQL is how serious Web3 products turn raw on-chain activity into fast dashboards, analytics, and clean APIs. This complete guide explains the full indexing pipeline: how to design entities that match product questions, how to write deterministic mappings that survive reorgs, how to query GraphQL without performance traps, and how to run indexing infrastructure like real production software.

TL;DR

  • A subgraph is an event-driven indexing program that transforms contract events into structured entities exposed through GraphQL.
  • The three core components are: schema (entities), manifest (data sources and handlers), and mappings (deterministic transforms).
  • Determinism is the law: no randomness, no external calls, no “current time,” and IDs must be reproducible during replay.
  • Chain reorgs are normal. Your indexing logic must be reorg-safe and idempotent.
  • Performance is mainly modeling: precompute snapshots, avoid query-time aggregation, and use cursor pagination instead of large skip.
  • Production readiness means testing, monitoring, versioning, and migration discipline, not just “it compiles.”
  • If you need foundations first, start with Blockchain Technology Guides, then deepen your systems view in Blockchain Advanced Guides.

Prerequisite reading: think in events, then think in product questions

Indexing is easiest when you already understand how transactions create logs and how dApps read state. If you want a clean base for logs, events, and the “why” behind deterministic execution, start with Blockchain Technology Guides. Then come back here and treat this article like backend engineering for on-chain data.

Indexing in human terms: what is actually happening

Blockchains store facts, not answers. A chain can tell you that a swap happened, but it will not directly answer “what is the daily volume per pool” or “which wallets are net buyers this week” or “how many unique traders interacted with this contract.” Those questions require computation, relationships, and time-series modeling.

Most developers start by calling RPC methods or scanning logs in the browser. That works until it does not. The moment you have a real product, you need consistent responses, pagination, and query patterns that do not time out. You need a read-optimized model.

The Graph is one of the most widely used patterns for this. It lets you define a subgraph, which is basically an indexing program: listen to events, process them deterministically, store structured entities, and expose them via GraphQL. Your front end stops scanning logs and starts querying entities. That is the shift.

  • Raw chain (facts in logs): correct, but not query-friendly for product needs.
  • Subgraph (structured entities): deterministic transforms into read-optimized tables.
  • GraphQL (precise queries): fetch exactly what the UI needs, with pagination and filters.

Where indexing lives in the Web3 stack

A typical Web3 product stack has three “data planes”:

  • Chain state: what the chain knows right now, accessible through RPC calls and contract reads.
  • Event history: what happened over time, accessible via logs and block scanning.
  • Product model: what your app needs, expressed as entities like trades, positions, rewards, users, snapshots, and alerts.

The Graph bridges the event history plane to the product model plane. That is why it is so powerful. But it also means your modeling decisions become product decisions. If you model poorly, your app feels slow or inaccurate. If you model well, your app feels instant and reliable.

The Graph data pipeline, in practical view: events become entities, and entities become a GraphQL API that your UI can query.

  • Blockchain: blocks, transactions, logs, contract calls.
  • Indexing node: reorg-aware replay, deterministic handlers, entity writes.
  • Entity store: read-optimized model (Users, Trades, Pools, Snapshots).
  • GraphQL API: filter, paginate, and fetch exactly what the UI needs.

Your front end stops scanning logs and starts querying entities.

What is The Graph and what is a subgraph?

The Graph is an indexing and query layer for on-chain data. A subgraph is a specific indexing definition, like a program that describes what to index and how to shape it.

A subgraph typically includes:

  • Schema: your database-like entity model defined in GraphQL types.
  • Manifest: the list of contracts and handlers, including network, start blocks, and ABI details.
  • Mappings: deterministic handler code that transforms events and calls into entity writes.

The big idea is simple: events are your input stream, entities are your output model, and GraphQL is your query interface.

Why products fail without indexing discipline

Many Web3 products ship a UI first and treat data as an afterthought. In practice the priority should be reversed: data quality drives trust, and trust drives retention.

Without indexing discipline, you typically see:

  • inconsistent numbers (volume differs per page load)
  • slow dashboards (queries are doing aggregation on the fly)
  • missing relationships (no clean way to join users, pools, and trades)
  • reorg bugs (duplicate entities, wrong totals after chain reorganizations)
  • front-end hacks (client-side state that tries to patch data gaps)

The fix is to design your subgraph as if it is a backend service. Because it is.

Subgraph anatomy: the folder layout that stays sane

A clean repository layout reduces future pain. Your goal is to keep indexing logic predictable and testable.

./abis/
  Exchange.json
  ERC20.json
./schema.graphql
./subgraph.yaml
./src/
  mappings/
    exchange.ts
    token.ts
  helpers/
    decimals.ts
    ids.ts
    time.ts
./tests/
  exchange.test.ts
  fixtures.ts
./package.json
./tsconfig.json

A practical rule: mapping handlers should be short and boring. Push complexity into helpers where you can test and reason about it. Handlers should look like “load entity, update fields, save.”
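That "load entity, update fields, save" shape can be sketched with a toy in-memory store. This is illustrative only: real mappings use graph-ts entity classes and the indexer's store, not a Map.

```typescript
// Toy sketch of the "load entity, apply delta, save" handler shape.
type Entity = { id: string; txCount: number };

const store = new Map<string, Entity>();

function loadOrCreate(id: string): Entity {
  // load the entity if it exists, otherwise create it with defaults
  return store.get(id) ?? { id, txCount: 0 };
}

function handleEvent(id: string): void {
  const e = loadOrCreate(id); // load
  e.txCount += 1;             // apply the delta
  store.set(e.id, e);         // save
}

handleEvent("0xpool");
handleEvent("0xpool");
console.log(store.get("0xpool")!.txCount); // 2
```

Everything that is not a load, a delta, or a save belongs in a helper.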

Schema design: model for product questions, not for logs

The schema is where most subgraphs either become powerful or become painful. A log-first schema mirrors event structure. A product-first schema mirrors how users interact with data.

Start with questions:

  • What does the UI need on the dashboard?
  • What filters will the user apply?
  • What lists need pagination?
  • What aggregates should load instantly?
  • What historical charts do we show?

Then model entities that answer those questions without heavy query-time computation.

Entities and relationships that feel natural

Common entities across DeFi and analytics:

  • User: wallet address plus derived fields like trade count, total volume, first seen timestamp.
  • Token: address, symbol, decimals, and derived metrics like total volume and holders count (if you track it).
  • Pool: trading pair, fee tier, and rolling aggregates.
  • Trade: single swap with deterministic ID tied to transaction hash and log index.
  • Snapshot: daily or hourly entity storing precomputed metrics for charts.

Entity model intuition, modeled for reads: Users trade in Pools using Tokens, and Snapshots power charts.

  • User: id = address
  • Trade: id = txHash-logIndex; links to user, pool, and tokens
  • Pool: id = pool address
  • Token: id = token address
  • Snapshot: id = poolId-dayStartTimestamp; stores volume, txCount, fees, and liquidity stats
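As a sketch, entities like these could be declared in schema.graphql roughly as follows. Field names are illustrative assumptions, not taken from any specific protocol:

```graphql
type User @entity {
  id: ID!                     # lowercased wallet address
  tradeCount: BigInt!
  totalVolume: BigDecimal!
  firstSeen: BigInt!
}

type Token @entity {
  id: ID!                     # token address
  symbol: String!
  decimals: Int!
}

type Pool @entity {
  id: ID!                     # pool address
  token0: Token!
  token1: Token!
  txCount: BigInt!
  trades: [Trade!]! @derivedFrom(field: "pool")
}

type Trade @entity {
  id: ID!                     # txHash-logIndex
  pool: Pool!
  user: User!
  amount0: BigDecimal!
  amount1: BigDecimal!
  timestamp: BigInt!
}

type PoolDaySnapshot @entity {
  id: ID!                     # poolId-dayStartTimestamp
  pool: Pool!
  dayStartTimestamp: BigInt!
  volume0: BigDecimal!
  volume1: BigDecimal!
  txCount: BigInt!
}
```

The `@derivedFrom` field gives you reverse lookups (a pool's trades) without storing a growing trade list on the Pool entity itself.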

Entity IDs: the non-negotiable rules

ID strategy is one of the biggest sources of bugs. In subgraphs, IDs must be stable and reproducible. If you generate IDs differently across replays, reorg rollbacks will create duplicates or corrupt totals.

| Entity | Good ID strategy | Why it works | Bad strategy | Failure mode |
| --- | --- | --- | --- | --- |
| Trade / Swap | txHash + "-" + logIndex | Unique per event, deterministic | timestamp + counter | Duplicates during reorg or replay |
| User | lowercased address | Stable identity | firstSeenBlock + address | Identity changes if startBlock changes |
| Snapshot | poolId + "-" + dayStartTimestamp | Stable time bucket | blockNumber bucket | Chart flicker and inconsistent aggregation |
| Position | protocolPositionId | Matches protocol identity | derived hash without full inputs | Collisions and hard-to-debug merges |
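The good ID strategies above boil down to pure functions of event and protocol data. A minimal sketch (illustrative helper names, not a library API):

```typescript
// Deterministic id builders: same inputs always produce the same id.
function tradeId(txHash: string, logIndex: number): string {
  return txHash + "-" + logIndex.toString(); // unique per log, replay-safe
}

function userId(address: string): string {
  return address.toLowerCase(); // normalize casing so one wallet is one entity
}

function snapshotId(poolId: string, dayStartTimestamp: number): string {
  return poolId + "-" + dayStartTimestamp.toString(); // stable time bucket
}

console.log(tradeId("0xabc", 3)); // "0xabc-3"
console.log(userId("0xAbC") === userId("0xabc")); // true
```

Nothing here depends on wall-clock time, counters, or configuration, which is exactly what makes the ids survive rollback and replay.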

Manifest design: define what you index and when you start

The manifest is where you declare the data sources your indexer will follow. It includes the network, contract address, ABI, start block, and the handlers you want to run.

The start block matters more than people think. Starting too early can increase indexing time by days. Starting too late can break derived metrics and “first seen” logic. A good approach is:

  • start at the deployment block of the contract you care about
  • include the earliest block where events you rely on exist
  • use templates for dynamic contracts created later (factory patterns)

specVersion: 0.0.5
schema:
  file: ./schema.graphql

dataSources:
  - kind: ethereum/contract
    name: Factory
    network: mainnet
    source:
      address: "0xFactory..."
      abi: Factory
      startBlock: 12345678
    mapping:
      kind: ethereum/events
      apiVersion: 0.0.7
      language: wasm/assemblyscript
      entities:
        - Pool
        - Token
      abis:
        - name: Factory
          file: ./abis/Factory.json
        - name: Pool
          file: ./abis/Pool.json
        - name: ERC20
          file: ./abis/ERC20.json
      eventHandlers:
        - event: PoolCreated(indexed address,indexed address,address)
          handler: handlePoolCreated
      file: ./src/mappings/factory.ts

templates:
  - kind: ethereum/contract
    name: PoolTemplate
    network: mainnet
    source:
      abi: Pool
    mapping:
      kind: ethereum/events
      apiVersion: 0.0.7
      language: wasm/assemblyscript
      entities:
        - Pool
        - Trade
        - Snapshot
      abis:
        - name: Pool
          file: ./abis/Pool.json
      eventHandlers:
        - event: Swap(indexed address,uint256,uint256,uint256,uint256)
          handler: handleSwap
        - event: Mint(indexed address,uint256,uint256)
          handler: handleMint
        - event: Burn(indexed address,uint256,uint256)
          handler: handleBurn
      file: ./src/mappings/pool.ts

Mappings and determinism: what you can and cannot do

Mappings are where events become entities. In The Graph model, mapping code runs in a deterministic environment. That constraint is the whole point. It guarantees that if the indexer replays blocks, you get the same result.

That means:

  • You cannot call external APIs.
  • You cannot use randomness.
  • You cannot read local files or system time.
  • You should not rely on “current block” as a global mutable variable.

The only truth is in the event and block data you are processing.
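The contrast is easiest to see side by side. A hypothetical sketch, using a plain TypeScript shape for block data:

```typescript
// Determinism in practice: derive values from event/block data only.
type Block = { timestamp: number; number: number };

// BAD: replaying the same block later returns a different value,
// so the indexer cannot reproduce the entity on replay.
function badFirstSeen(): number {
  return Date.now(); // non-deterministic environment read
}

// GOOD: the same block input always yields the same output.
function goodFirstSeen(block: Block): number {
  return block.timestamp;
}

console.log(goodFirstSeen({ timestamp: 1699920000, number: 18500000 })); // 1699920000
```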

A clean handler pattern

A handler should read like a receipt. Load the minimal set of entities, apply the delta, then save. If the logic is complex, move calculations into helpers.

import { BigInt, BigDecimal } from "@graphprotocol/graph-ts"
import { Swap as SwapEvent } from "../generated/templates/PoolTemplate/Pool"
import { Pool, Trade, PoolDaySnapshot } from "../generated/schema"
import { toDecimal } from "../helpers/decimals"
import { dayStartTimestamp } from "../helpers/time"

export function handleSwap(event: SwapEvent): void {
  // Deterministic unique id
  const tradeId = event.transaction.hash.toHex() + "-" + event.logIndex.toString()

  // Load or create pool
  let pool = Pool.load(event.address.toHex())
  if (pool == null) {
    pool = new Pool(event.address.toHex())
    pool.txCount = BigInt.fromI32(0)
    pool.volume0 = BigDecimal.zero()
    pool.volume1 = BigDecimal.zero()
  }

  // Create trade entity
  const trade = new Trade(tradeId)
  trade.pool = pool.id
  trade.sender = event.params.sender
  // NOTE: 18 decimals assumed for brevity; real mappings should use cached token decimals
  trade.amount0 = toDecimal(event.params.amount0, 18)
  trade.amount1 = toDecimal(event.params.amount1, 18)
  trade.timestamp = event.block.timestamp
  trade.blockNumber = event.block.number
  trade.txHash = event.transaction.hash

  // Update aggregates
  pool.txCount = pool.txCount.plus(BigInt.fromI32(1))
  pool.volume0 = pool.volume0.plus(trade.amount0.abs())
  pool.volume1 = pool.volume1.plus(trade.amount1.abs())

  // Update daily snapshot
  const dayStart = dayStartTimestamp(event.block.timestamp)
  const snapId = pool.id + "-" + dayStart.toString()
  let snap = PoolDaySnapshot.load(snapId)
  if (snap == null) {
    snap = new PoolDaySnapshot(snapId)
    snap.pool = pool.id
    snap.dayStartTimestamp = dayStart
    snap.volume0 = BigDecimal.zero()
    snap.volume1 = BigDecimal.zero()
    snap.txCount = BigInt.fromI32(0)
  }
  snap.volume0 = snap.volume0.plus(trade.amount0.abs())
  snap.volume1 = snap.volume1.plus(trade.amount1.abs())
  snap.txCount = snap.txCount.plus(BigInt.fromI32(1))

  // Save in a predictable order
  trade.save()
  snap.save()
  pool.save()
}

Reorg safety: why your totals break and how to stop it

Reorgs are not an edge case. They are a normal property of distributed consensus. A subgraph must behave correctly if the chain replaces the last N blocks.

In practice, reorg-safe indexing is about two habits:

  • Idempotent writes: creating and updating entities in a way that replays cleanly without duplicates.
  • Deterministic identity: entity IDs that do not change across replay.

If you use txHash-logIndex IDs for event entities, then when a block is replayed the event will recreate the same entity ID. If the event disappears due to reorg, the indexer can roll back the write. That is what you want.

Reorg handling is rollback then replay: your mapping logic must produce the same entities when blocks are replayed.

  • Canonical chain head: ... B100, B101, B102, B103.
  • Reorg happens: B102 and B103 are replaced, and their entity writes are rolled back.
  • Replay on new blocks: ... B100, B101, B102', B103'.
  • Success condition: IDs and aggregates remain consistent after rollback and replay. No duplicates, no drift, no "double counted" volume.

Handler types: event handlers, call handlers, block handlers

Most subgraphs are event-driven, and that is usually correct. Events are stable, cheap to process, and designed for consumers. But there are cases where you need call handlers or block handlers.

| Handler type | Best for | Common use | Risk | Mitigation |
| --- | --- | --- | --- | --- |
| Event handlers | Most indexing | Swaps, mints, burns, transfers | Missing data if protocol forgets events | Cross-check with calls in rare cases |
| Call handlers | Non-event state changes | Configuration reads at init, fee tier logic | More complexity, can be missed if calls revert | Use selectively, keep logic minimal |
| Block handlers | Periodic snapshots | Hourly metrics, moving averages | High cost if too frequent | Use interval, store only what you need |

Dynamic data sources: factory patterns and templates

Many protocols create new contracts over time, like pools created by a factory. If you hardcode every pool address in the manifest, you lose.

Templates solve this. You listen to the factory event, then create a new data source instance for the new contract address. From that point, your mapping handlers can index events from the newly created contract.

import { BigInt } from "@graphprotocol/graph-ts"
import { PoolCreated } from "../generated/Factory/Factory"
import { Pool } from "../generated/schema"
import { PoolTemplate } from "../generated/templates"

export function handlePoolCreated(event: PoolCreated): void {
  const poolAddress = event.params.pool
  // Create a new data source for the pool
  PoolTemplate.create(poolAddress)

  // Persist pool entity metadata
  const pool = new Pool(poolAddress.toHex())
  pool.token0 = event.params.token0
  pool.token1 = event.params.token1
  pool.createdAt = event.block.timestamp
  pool.createdBlock = event.block.number
  pool.txCount = BigInt.fromI32(0)
  pool.save()
}

Decimals and math: avoid subtle errors that destroy trust

Most dashboard trust issues are not advanced. They are decimals. Token amounts in events are usually integers in the smallest unit. If you forget to scale by decimals, your volume is wrong by orders of magnitude.

Practical approach:

  • store raw amounts when you need exact precision
  • store derived decimals for display and analytics
  • cache token decimals on first encounter

import { BigInt, BigDecimal } from "@graphprotocol/graph-ts"

export function toDecimal(value: BigInt, decimals: i32): BigDecimal {
  const precision = BigInt.fromI32(10).pow(u8(decimals))
  return value.toBigDecimal().div(precision.toBigDecimal())
}
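For intuition, here is a hypothetical TypeScript analogue of that helper you can run locally. Real mappings should stay in graph-ts BigDecimal; this version converts to a JS number at the end, which loses precision for very large amounts:

```typescript
// Illustrative analogue of toDecimal: scale a raw integer amount by 10^decimals.
function toDecimal(raw: bigint, decimals: number): number {
  let precision = BigInt(1);
  for (let i = 0; i < decimals; i++) {
    precision *= BigInt(10); // build 10^decimals without exponentiation
  }
  const whole = raw / precision; // integer part (truncates toward zero)
  const frac = raw % precision;  // remainder in smallest units (same sign as raw)
  return Number(whole) + Number(frac) / Number(precision);
}

// A USDC-style amount with 6 decimals: 1,500,000 raw units is 1.5 tokens
console.log(toDecimal(BigInt(1500000), 6)); // 1.5
```

Forgetting this step on a 6-decimal token inflates every number by a factor of a million, which is exactly the "volume wrong by orders of magnitude" failure above.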

GraphQL querying: how to query without burning yourself

GraphQL is powerful because the client requests exactly what it needs. But it can still be misused. Querying strategy determines UI latency.

The most common performance trap is large pagination using skip. skip works, but it becomes slow on large datasets because the system must walk over skipped rows.

Cursor pagination: the default for scalable lists

Cursor pagination in The Graph usually means ordering by a stable field (often id) and using id_gt to fetch the next page.

query Trades($pool: String!, $cursor: String!) {
  trades(
    where: { pool: $pool, id_gt: $cursor }
    orderBy: id
    orderDirection: asc
    first: 200
  ) {
    id
    amount0
    amount1
    timestamp
    txHash
  }
}
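On the client side, that query becomes a loop: fetch a page, remember the last id, ask for everything greater than it. A sketch with an in-memory stand-in for the GraphQL request (the `fetchPage` function is a mock, not a client library API):

```typescript
// Cursor pagination loop over subgraph-style data, simulated in memory.
type Trade = { id: string; amount0: string };

const ALL_TRADES: Trade[] = [
  { id: "0xaaa-0", amount0: "1.0" },
  { id: "0xaaa-1", amount0: "2.0" },
  { id: "0xbbb-0", amount0: "3.0" },
];

// Simulates: trades(where: { id_gt: $cursor }, orderBy: id, first: $first)
function fetchPage(cursor: string, first: number): Trade[] {
  return ALL_TRADES
    .filter((t) => t.id > cursor)
    .sort((a, b) => (a.id < b.id ? -1 : 1))
    .slice(0, first);
}

function fetchAll(pageSize: number): Trade[] {
  const out: Trade[] = [];
  let cursor = ""; // empty string sorts before every id
  for (;;) {
    const page = fetchPage(cursor, pageSize);
    out.push(...page);
    if (page.length < pageSize) break;   // short page means we reached the end
    cursor = page[page.length - 1].id;   // advance cursor to the last id seen
  }
  return out;
}

console.log(fetchAll(2).length); // 3
```

Unlike skip-based paging, each request does the same amount of work no matter how deep into the list you are.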

Filtering: pick fields that match how you search

Your schema should expose fields that support filters your UI will use. For example:

  • filter trades by pool
  • filter trades by user
  • filter snapshots by pool and dayStartTimestamp
  • filter users by firstSeen timestamp range (if you store it)

GraphQL anti-patterns that create slow pages

Query anti-patterns to avoid

  • deep nested relationships on large lists, especially when each item pulls its own nested list
  • using large skip values as “page number” pagination
  • query-time aggregation like summing a list of trades to compute volume
  • fetching huge fields you do not render
  • sorting by non-stable fields when you need consistent pagination

Snapshots: the secret weapon for fast dashboards

If your dashboard displays volume charts, transaction counts, fees, or unique traders per day, do not compute those at query time. Compute them during indexing and store them as snapshot entities.

Snapshot design is straightforward:

  • choose a time bucket (hour, day)
  • derive a deterministic bucket key (day start timestamp)
  • store the aggregate metrics you want to render
  • update them in event handlers

This transforms “chart needs 30 days of volume” into “query 30 snapshot entities,” which is stable and fast.
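The bucket key itself is just integer math on the block timestamp. A minimal sketch of the day-bucket derivation:

```typescript
// Deterministic day buckets: floor the block timestamp to UTC midnight.
const SECONDS_PER_DAY = 86400;

function dayStartTimestamp(ts: number): number {
  return ts - (ts % SECONDS_PER_DAY);
}

function snapshotId(poolId: string, ts: number): string {
  return poolId + "-" + dayStartTimestamp(ts).toString();
}

// Two events in the same UTC day share one bucket, so they update one snapshot
const a = dayStartTimestamp(1699920050);
const b = dayStartTimestamp(1699999999);
console.log(a === b); // true
```

Because the key depends only on the timestamp, replayed blocks hit the same snapshot entity, keeping charts stable across reorgs.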

Production is not optional: monitoring, testing, and incident playbooks

If a subgraph backs a real product, you must treat it like a service. That means you need visibility into:

  • sync height (how far behind head you are)
  • handler error rates and exceptions
  • query error rate and latency
  • reindex time (how long it takes to rebuild)
  • data quality checks (sanity checks for totals and counts)

Testing: matchstick and deterministic fixtures

Testing matters because subtle logic errors can persist silently. A good test suite:

  • builds mock events with realistic inputs
  • runs handlers
  • asserts entity fields and aggregate changes
  • tests edge cases like “first event creates entity”
  • tests idempotency by replaying events
// Test idea (conceptual)
1) create PoolCreated mock event
2) call handlePoolCreated
3) assert Pool entity exists with correct token fields
4) create Swap mock event
5) call handleSwap
6) assert Trade entity created with txHash-logIndex id
7) assert Pool aggregates updated
8) replay Swap event
9) assert no duplicate and aggregates behave as expected

Ops checklist: what you should have before calling it production

| Area | What "good" looks like | What breaks if ignored |
| --- | --- | --- |
| Schema | Entities match product questions, snapshots exist for charts | Slow UI, inconsistent metrics |
| ID strategy | txHash-logIndex for events, stable bucket keys for snapshots | Duplicates and drift after reorgs |
| Mappings | Small handlers, helper functions, deterministic logic | Hard-to-debug data corruption |
| Pagination | Cursor-based, stable ordering fields | Timeouts on large datasets |
| Testing | Fixtures for key events and edge cases, idempotency checks | Silent regressions after refactors |
| Monitoring | Sync lag, handler exceptions, query latency dashboards | Outages without visibility |
| Runbooks | Steps for reindex, stuck sync, contract upgrade, migration | Panic during incidents |

Migrations and upgrades: how to evolve a subgraph without breaking users

Protocols evolve. Contracts upgrade. Event signatures change. You will need to version your subgraph and migrate schemas. The most important mindset is to treat schema changes like API changes.

Practical migration strategies:

  • Additive changes first: add new fields and entities without removing old ones.
  • Backfill via reindex: when fields require historical data, expect a reindex or grafting strategy.
  • Versioned endpoints: keep old subgraph versions alive until clients migrate.
  • Explicit deprecation: mark fields as deprecated in documentation and UI, then remove later.
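For the grafting strategy, the subgraph manifest supports building a new version on top of an existing deployment's data up to a given block. A hedged sketch, where the base deployment ID and block are placeholders:

```yaml
# Sketch: graft the new version onto a previous deployment, then index
# only blocks after `block` with the new mapping logic.
specVersion: 0.0.5
features:
  - grafting
graft:
  base: QmPreviousDeploymentId   # placeholder: deployment id of the old subgraph
  block: 15000000                # placeholder: copy entities up to this block
```

Grafting trades a full reindex for faster iteration, but it also copies forward any data bugs in the base, so reserve it for additive changes.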

Security and data integrity: your index can lie if you let it

Indexing is not only a performance layer. It is also an integrity layer. When users see numbers, they believe them. If your index drifts, trust drops fast.

Integrity hazards:

  • incorrect decimals scaling
  • missing events due to wrong ABI or handler mismatch
  • double counting during reorg issues
  • incomplete indexing due to bad startBlock
  • logic errors in aggregation

Integrity guardrails you should implement

  • Sanity checks: volume should not jump by 1000x without explanation
  • Consistency checks: sum of snapshot volumes should match total volume
  • Replay tests: rerun handlers on fixtures to confirm idempotency
  • Schema constraints: avoid nullable fields for required facts
  • Data quality dashboards: track known invariants
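The consistency check above is cheap to automate. A sketch of the invariant as a standalone function (names are illustrative, and the tolerance is an assumption for decimal rounding):

```typescript
// Data-quality invariant: snapshot volumes must sum to the pool's running total.
type Snapshot = { volume: number };

function checkPoolInvariant(totalVolume: number, snaps: Snapshot[]): boolean {
  const summed = snaps.reduce((acc, s) => acc + s.volume, 0);
  return Math.abs(summed - totalVolume) < 1e-9; // small tolerance for rounding
}

console.log(checkPoolInvariant(6, [{ volume: 1 }, { volume: 2 }, { volume: 3 }])); // true
```

Run checks like this on a schedule against the live GraphQL endpoint; a failing invariant is an early warning that totals have drifted.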

Real scenarios: how to model common Web3 products

The easiest way to learn subgraph design is to attach it to real product scenarios. Below are common patterns and how to model them without painting yourself into a corner.

Scenario A: DEX analytics dashboard

Product questions:

  • What is volume per pool per day?
  • Who are the top traders?
  • What is fee revenue estimate?
  • What are the latest swaps?

Model:

  • Pool entity with rolling totals and metadata
  • Trade entity per swap event
  • PoolDaySnapshot entity for charts
  • User entity with derived totals

Critical performance move: compute snapshots in handlers. Do not query all trades to render charts.

Scenario B: NFT marketplace activity

NFT marketplaces often have complex event graphs: listings, bids, cancels, purchases, transfers. The trick is to normalize identities and represent lifecycle states.

  • Listing entity keyed by order hash or marketplace ID
  • Bid entity keyed similarly
  • Sale entity keyed by txHash-logIndex
  • NFT entity keyed by contract + tokenId
  • User entity with buy and sell stats

Lifecycle modeling rule: keep “stateful” entities like Listing up to date, but store immutable facts like Sale as separate entities.

Scenario C: lending positions and health factors

Lending protocols generate events for deposits, borrows, repays, liquidations. The index should represent positions and their changes over time.

  • Position entity keyed by user + market
  • PositionEvent entities for timeline (deposit, borrow, repay)
  • Market entity for aggregated totals
  • Snapshots for market charts

A warning: health factor is often derived from on-chain prices. If you need near-real-time health factor, you may need to read oracle state. Treat that carefully and do not over-index expensive reads.

How TokenToolHub can use subgraphs in a “scan first” workflow

Indexing is a force multiplier for safety tooling. If you already believe in “scan first,” a subgraph can help answer questions that raw contract scans alone cannot answer, such as:

  • How concentrated is trading volume across wallets?
  • Are there repeated patterns of buys followed by immediate sells?
  • Does volume spike correlate with a small cluster of addresses?
  • How fast does liquidity move after launches?
  • Are approvals or interactions coming from suspicious sources?

A subgraph does not replace a contract analyzer. It complements it by turning event history into searchable structure. If you want the “scan first” habit for contracts, start with Token Safety Checker. For deeper ecosystem knowledge, your guides can route users from basics to advanced strategies without overwhelming them.

A practical playbook: build a subgraph like a professional

If you want one mental model to follow, treat your subgraph as a pipeline with guardrails. Here is the process as pseudocode.

// Subgraph build playbook (mental pseudocode)

define_product_questions()
design_entities_for_reads()

for each required_contract:
  identify_events()
  set_start_block()

choose_id_strategies()
add_snapshots_for_charts()

implement_handlers():
  load_or_create_entities()
  apply_deltas()
  write_snapshots()
  save_in_stable_order()

add_tests():
  create_mock_events()
  run_handlers()
  assert_entities()
  replay_and_check_idempotency()

validate_queries():
  use_cursor_pagination()
  avoid_skip_for_large_lists()
  avoid_query_time_aggregation()

deploy():
  monitor_sync_lag()
  alert_on_handler_errors()
  document_runbooks()
  version_for_upgrades()

Common mistakes and how to avoid them

Most subgraph failures follow a predictable pattern. Fixing them early saves weeks.

  • Modeling logs instead of product: you end up with a schema nobody wants to query.
  • No snapshots: your charts require expensive queries and feel slow.
  • Bad IDs: duplicates and drift show up after reorgs.
  • Large skip pagination: the UI becomes slow as dataset grows.
  • Decimals ignored: volume and prices become meaningless.
  • No tests: refactors break totals without anyone noticing.
  • No monitoring: a stuck index becomes a silent outage.

Do you always need The Graph?

Not always. If your product is small, a simple backend that scans logs might be enough. If you need complex analytics, multi-entity relationships, and stable GraphQL queries, a subgraph model is often worth it.

A practical decision rule:

  • If your UI is mostly “read state now,” RPC reads may be enough.
  • If your UI is “show history, rankings, charts,” you need indexing.
  • If you need both, you often use both: RPC reads for live state and subgraph for history and aggregates.

FAQs

What is a subgraph in simple terms?

A subgraph is an indexing definition that listens to blockchain events and transforms them into structured entities that you can query via GraphQL.

Why is txHash-logIndex the best ID for events?

Because it is deterministic and unique per log. Replays and reorgs can recreate the same entity identity without duplicates.

Why are snapshots important?

Snapshots precompute aggregates like daily volume and transaction count, so charts and dashboards can load fast without scanning trades at query time.

What is the biggest GraphQL mistake in subgraph-based apps?

Using skip-based pagination at scale and doing aggregation in queries. Cursor pagination and snapshot entities solve most performance pain.

How do I keep a subgraph reliable over time?

Versioning, testing, monitoring, and disciplined migrations. Treat it like a backend service with runbooks and data quality checks.


Closing reminder: indexing is not optional plumbing. It is product infrastructure. If you want dashboards that feel instant and numbers that users trust, model for reads, enforce determinism, add snapshots, paginate with cursors, and operate your subgraph like a real backend service.

About the author: Wisdom Uche Ijika
Founder @TokenToolHub | Web3 Technical Researcher, Token Security & On-Chain Intelligence | Helping traders and investors identify smart contract risks before interacting with tokens