Graph Configuration Reference

Graph configurations orchestrate the processing of multiple table configurations into a single knowledge graph output.

Purpose

A graph configuration file specifies:

- Output knowledge graph name and version
- List of table configurations to process
- Database locations for entity resolution and provenance

Schema

Required Fields

| Field | Type | Description |
|---|---|---|
| syntax | String | Configuration version (must be "GC2") |
| name | String | Knowledge graph name (used in output filename) |
| version | String | Knowledge graph version (used in output filename) |
| tables | List[Path] | Paths to table configuration YAML files |
| dbssert | Path | Path to DuckDB entity resolution database |
| pubmed_db | Path | Path to SQLite PubMed metadata database |
| pmc_db | Path | Path to SQLite PMC figure captions database |
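
The required fields can be checked before a run. The following is a minimal sketch, not Tablassert's own validator: `REQUIRED_FIELDS` and `validate_graph_config` are hypothetical names, and it assumes the YAML has already been parsed into a plain dictionary.

```python
# Hypothetical pre-flight check; not part of Tablassert itself.
REQUIRED_FIELDS = ("syntax", "name", "version", "tables", "dbssert", "pubmed_db", "pmc_db")


def validate_graph_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = [f"missing required field: {field}" for field in REQUIRED_FIELDS if field not in config]
    # The syntax field must be exactly the string GC2.
    if "syntax" in config and config["syntax"] != "GC2":
        problems.append(f"syntax must be GC2, got {config['syntax']!r}")
    # tables is documented as a list of paths, not a single path.
    if "tables" in config and not isinstance(config["tables"], list):
        problems.append("tables must be a list of paths")
    return problems


# Parsed form of a minimal configuration (values are illustrative).
example = {
    "syntax": "GC2",
    "name": "MY_GRAPH",
    "version": "1.0.0",
    "tables": ["./my-table.yaml"],
    "dbssert": "/data/dbssert.duckdb",
    "pubmed_db": "/data/PubMed.db",
    "pmc_db": "/data/PMCSuppCaptions.db",
}
print(validate_graph_config(example))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.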

Field Details

syntax: "GC2"

Configuration syntax version. Must be "GC2" for version 6.0.0.

name: string

Output knowledge graph name. Used as prefix for NDJSON files.

Example: name: MULTIOMICS_KG produces MULTIOMICS_KG_{version}.nodes.ndjson

version: string

Output knowledge graph version. Appended to the name in the NDJSON filenames, as in {name}_{version}.nodes.ndjson.

Common values: "1.0.0", "UNSTABLE", "BETA"

tables: list[path]

List of table configuration file paths. Each path can be absolute or relative to the graph config location.

Each table config defines:

- Data source (Excel/CSV/TSV)
- Entity resolution rules
- Provenance information
- Edge annotations

See Table Configuration for details.

dbssert: path

Path to DuckDB database for entity resolution. This database contains:

- Synonym mappings (text → CURIE)
- Biolink categories
- Taxonomic information
- Source provenance (which database provided the mapping)

pubmed_db: path

Path to SQLite database with PubMed metadata:

- MeSH terms
- Authors
- Journal information
- Publication dates

Used to enrich edges with MeSH annotations.

pmc_db: path

Path to SQLite database with PubMed Central figure captions.

Used when provenance specifies PMC publications.

Path Resolution

Paths can be:

- Absolute: /home/user/data/dbssert.duckdb
- Relative to graph config: ./tables/table1.yaml
- Relative to current directory: ../configs/table.yaml
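
The path forms listed here can be illustrated with a short sketch. It is not Tablassert's resolver: `resolve_path` is a hypothetical helper, and because this section does not state which base directory takes precedence for relative paths, the base is passed in explicitly (the graph config's directory or the current directory).

```python
from pathlib import Path


def resolve_path(raw: str, base_dir: Path) -> Path:
    """Return absolute paths unchanged; anchor relative ones at base_dir."""
    path = Path(raw)
    return path if path.is_absolute() else base_dir / path


# Hypothetical location of the graph configuration file.
graph_config = Path("/configs/graphs/graph.yaml")

# An absolute path ignores the base directory.
print(resolve_path("/home/user/data/dbssert.duckdb", graph_config.parent))
# A relative path is joined onto the chosen base directory.
print(resolve_path("./tables/table1.yaml", graph_config.parent))
```

Passing `Path.cwd()` as `base_dir` instead gives the current-directory behavior. Note that `..` components are kept as written; the sketch does not normalize them.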

Minimal Example

syntax: GC2
name: MY_GRAPH
version: 1.0.0
tables:
  - ./my-table.yaml
dbssert: /data/dbssert.duckdb
pubmed_db: /data/PubMed.db
pmc_db: /data/PMCSuppCaptions.db

Multi-Table Example

syntax: GC2
name: MULTIOMICS_KG
version: UNSTABLE
tables:
  - /configs/gene-disease-associations.yaml
  - /configs/drug-targets.yaml
  - /configs/protein-interactions.yaml
dbssert: /databases/dbssert.duckdb
pubmed_db: /databases/PubMed.db
pmc_db: /databases/PMCSuppCaptions.db

Processing Flow

When you run tablassert-cli -i graph.yaml:

  1. Load graph configuration - Parse YAML, validate schema
  2. For each table in tables:
     - Load table configuration
     - Download source file (if URL specified)
     - Apply transformations
     - Resolve entities using dbssert
     - Validate with QC pipeline
     - Create subgraph parquet file
  3. Aggregate subgraphs - Merge all parquet files
  4. Add provenance - Query pubmed_db and pmc_db for metadata
  5. Export NDJSON - Generate {name}_{version}.nodes.ndjson and .edges.ndjson
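
The ordering of this flow, with per-table stages repeated for each entry in tables and graph-level stages run once at the end, can be sketched as follows. This only models the order of stages; `plan_run` and the stage labels are hypothetical and do not call any Tablassert code.

```python
# Stages repeated once per table configuration (labels taken from the flow above).
PER_TABLE_STAGES = (
    "load table configuration",
    "download source file (if URL specified)",
    "apply transformations",
    "resolve entities using dbssert",
    "validate with QC pipeline",
    "create subgraph parquet file",
)

# Stages run once, after every table has produced its subgraph.
GRAPH_STAGES = ("aggregate subgraphs", "add provenance", "export NDJSON")


def plan_run(tables: list[str]) -> list[str]:
    """Return the ordered list of stages for a run over the given tables."""
    plan = ["load graph configuration"]
    for table in tables:
        plan.extend(f"{table}: {stage}" for stage in PER_TABLE_STAGES)
    plan.extend(GRAPH_STAGES)
    return plan


for step in plan_run(["gene-disease-associations.yaml", "drug-targets.yaml"]):
    print(step)
```

The point of the sketch is that aggregation, provenance, and export happen only after all tables have been processed.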

Output Files

Given this configuration:

name: EXAMPLE_KG
version: 2.0.0

Produces:

- EXAMPLE_KG_2.0.0.nodes.ndjson
- EXAMPLE_KG_2.0.0.edges.ndjson
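
The naming rule can be written as a small helper. This is an illustrative sketch of the {name}_{version} pattern described in this reference; `output_filenames` is a hypothetical name, not a Tablassert function.

```python
def output_filenames(name: str, version: str) -> tuple[str, str]:
    """Build the nodes and edges NDJSON filenames from name and version."""
    stem = f"{name}_{version}"
    return f"{stem}.nodes.ndjson", f"{stem}.edges.ndjson"


nodes_file, edges_file = output_filenames("EXAMPLE_KG", "2.0.0")
print(nodes_file)
print(edges_file)
```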

Real-World Example

From MOKGV6.yaml:

syntax: GC2
name: MULTIOMICS_KG
version: UNSTABLE
tables:
  - /local_raid1/sgoetz/STORE/CONFIG/TABLASSERT/TABLE/V6/ALAMV6.yaml
dbssert: /local_raid1/sgoetz/CODE/DBSSERT/dbssert.duckdb
pubmed_db: /local_raid1/sgoetz/DBSTORE/local_raid1/sgoetz/DBSTORE/PUBMED/PubMed.db
pmc_db: /local_raid1/sgoetz/DBSTORE/local_raid1/sgoetz/DBSTORE/CAPTIONS/PMCSuppCaptions.db

This processes a single table configuration (ALAMV6.yaml) into a knowledge graph named MULTIOMICS_KG_UNSTABLE.

Next Steps