Tutorial: Your First Knowledge Graph

This tutorial walks through building a simple knowledge graph from a CSV file of gene-disease associations. You'll learn the complete workflow: creating configurations, running Tablassert, and examining the output.

Time: 5-10 minutes

Prerequisites

  • Tablassert installed (see Installation)
  • Required databases: dbssert, pubmed_db, pmc_db
  • Basic familiarity with YAML

The Data

We have a CSV file with gene-disease associations:

gene_symbol,disease_name,p_value,sample_size
TP53,lung cancer,0.001,450
BRCA1,breast cancer,0.0001,1200
EGFR,colorectal cancer,0.005,680
KRAS,pancreatic cancer,0.002,320

Goal: Transform this into KGX-compliant nodes and edges.

Step 1: Create the Data File

Save the CSV as tutorial-data.csv:

cat > tutorial-data.csv <<'EOF'
gene_symbol,disease_name,p_value,sample_size
TP53,lung cancer,0.001,450
BRCA1,breast cancer,0.0001,1200
EGFR,colorectal cancer,0.005,680
KRAS,pancreatic cancer,0.002,320
EOF
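
To confirm the file was written correctly, you can optionally print it back as an aligned table (column is a standard util-linux/BSD tool):

column -t -s, < tutorial-data.csv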

Step 2: Create Table Configuration

Create tutorial-table.yaml:

template:
  syntax: TC3
  status: alpha
  source:
    kind: text
    local: ./tutorial-data.csv
    url: https://example.com/data.csv
    row_slice:
      - 1
      - auto
    delimiter: ","
  statement:
    subject:
      method: column
      encoding: gene_symbol
      prioritize:
        - Gene
    predicate: associated_with
    object:
      method: column
      encoding: disease_name
      prioritize:
        - Disease
  provenance:
    repo: PUBMED
    publication: PMID12345678
    contributors:
      - kind: curation
        name: Tutorial Example
        date: 27 JAN 2026
        organizations:
          - Example Institute
        comment: Synthetic tutorial data
  annotations:
    - annotation: p value
      method: column
      encoding: p_value
    - annotation: sample size
      method: column
      encoding: sample_size

What this does:

  • source: Reads the CSV, skipping the header row (row_slice starts at 1)
  • statement: Creates edges where genes (subject) are associated_with diseases (object)
  • subject/object: Use the column method to read the gene_symbol and disease_name columns
  • prioritize: Tells entity resolution to prefer Gene/Disease categories
  • annotations: Adds p-value and sample size as edge attributes
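
For example, if the same data arrived as a tab-separated file, only the source block would need to change. This is a minimal sketch, assuming the delimiter key accepts a tab the same way it accepts a comma; tutorial-data.tsv is a hypothetical tab-separated copy of the data:

source:
  kind: text
  local: ./tutorial-data.tsv  # hypothetical TSV copy of the same data
  delimiter: "\t"             # assumption: tab is accepted like any other single-character separator
  row_slice:
    - 1      # skip the header row, as above
    - auto   # read through the last row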

Step 3: Create Graph Configuration

Create tutorial-graph.yaml:

syntax: GC2
name: TUTORIAL_KG
version: 1.0.0
tables:
  - ./tutorial-table.yaml
dbssert: /path/to/dbssert.duckdb
pubmed_db: /path/to/PubMed.db
pmc_db: /path/to/PMCSuppCaptions.db

Important: Replace the database paths with your actual paths.

What this does:

  • name/version: Output files will be named TUTORIAL_KG_1.0.0.nodes.ndjson and TUTORIAL_KG_1.0.0.edges.ndjson
  • tables: The list of table configurations to process
  • dbssert/pubmed_db/pmc_db: Paths to the entity resolution and provenance databases
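
Before running, it's worth confirming that the three database files actually exist at the paths you supplied. A quick shell check, using the placeholder paths from the config above:

for db in /path/to/dbssert.duckdb /path/to/PubMed.db /path/to/PMCSuppCaptions.db; do
  [ -f "$db" ] && echo "found:   $db" || echo "missing: $db"
done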

Step 4: Run Tablassert

tablassert-cli -i tutorial-graph.yaml

What happens:

  1. Loads the graph configuration
  2. For each table config:
       • Downloads the file (or uses the local copy)
       • Applies row slicing
       • Resolves entities (genes and diseases)
       • Validates with the QC pipeline (exact → fuzzy → BERT)
       • Creates a subgraph parquet file
  3. Aggregates all subgraphs
  4. Exports the NDJSON files
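
When the run finishes, the two files named in Step 3 should appear in the working directory:

ls -lh TUTORIAL_KG_1.0.0.nodes.ndjson TUTORIAL_KG_1.0.0.edges.ndjson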

Step 5: Examine Output

Nodes file:

head -n 3 TUTORIAL_KG_1.0.0.nodes.ndjson

Example output:

{"id":"HGNC:11998","name":"TP53","category":["biolink:Gene"],"taxon":"NCBITaxon:9606"}
{"id":"MONDO:0008903","name":"lung cancer","category":["biolink:Disease"]}
{"id":"HGNC:1100","name":"BRCA1","category":["biolink:Gene"],"taxon":"NCBITaxon:9606"}

Edges file:

head -n 2 TUTORIAL_KG_1.0.0.edges.ndjson

Example output:

{"id":"uuid:...","subject":"HGNC:11998","predicate":"biolink:associated_with","object":"MONDO:0008903","p value":0.001,"sample size":450}
{"id":"uuid:...","subject":"HGNC:1100","predicate":"biolink:associated_with","object":"MONDO:0005041","p value":0.0001,"sample size":1200}

Understanding the Transformation

Input: Text strings ("TP53", "lung cancer")

Entity Resolution:

  • "TP53" → HGNC:11998 (Gene)
  • "lung cancer" → MONDO:0008903 (Disease)
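
You can spot-check that a resolved CURIE landed in the nodes file with grep (the pattern relies on the key order shown in the example output above):

grep -c '"id":"HGNC:11998"' TUTORIAL_KG_1.0.0.nodes.ndjson   # prints 1 if the node is present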

Quality Control:

  • Stage 1: Exact match check
  • Stage 2: Fuzzy matching (if needed)
  • Stage 3: BERT semantic similarity (if needed)

Output: KGX-compliant nodes and edges with:

  • Standardized identifiers (CURIEs)
  • Biolink categories and predicates
  • Provenance metadata
  • Edge annotations

What You Learned

  • Table configuration defines data sources and transformations
  • Graph configuration orchestrates multiple tables
  • Entity resolution maps text to standardized identifiers
  • QC pipeline validates mappings across three stages
  • Output is KGX-compliant NDJSON ready for NCATS Translator

Next Steps