Table Configuration Reference

Table configurations define how Tablassert transforms tabular data (Excel, CSV, TSV) into knowledge graph assertions.

Purpose

A table configuration specifies:

- Data source location and format
- How to extract subject-predicate-object triples
- Entity resolution rules (taxonomic filtering, category preferences)
- Provenance metadata
- Optional edge annotations

Template vs Sections

Table configurations support two patterns:

Pattern 1: Template Only

Use when processing a single table with one output.

template:
  syntax: TC3
  source: {...}
  statement: {...}
  provenance: {...}

Pattern 2: Template + Sections

Use when processing variations of the same data (different columns, predicates, etc.) while sharing common configuration.

template:
  syntax: TC3
  source: {...}  # Shared by all sections
  provenance: {...}  # Shared by all sections

sections:
  - statement:  # Section 1: Gene-Disease
      subject: {encoding: gene_column}
      predicate: associated_with
      object: {encoding: disease_column}

  - statement:  # Section 2: Gene-Pathway
      subject: {encoding: gene_column}
      predicate: participates_in
      object: {encoding: pathway_column}

Merge Behavior (fastmerge)

Each section inherits from the template and overrides specific fields. How a field is merged depends on its value type:

Dictionaries: Recursive merge, section overrides template keys

template:
  statement:
    subject: {encoding: A}
    predicate: related_to

sections:
  - statement:
      predicate: associated_with  # Overrides, subject stays "A"

Lists: Concatenation (extends)

template:
  statement:
    subject:
      prioritize: [Gene]

sections:
  - statement:
      subject:
        prioritize: [Protein]  # Result: [Gene, Protein]

Scalars: Section replaces template

template:
  syntax: TC3

sections:
  - syntax: TC2  # Overrides (not recommended)
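
The three rules above (recursive merge for dictionaries, concatenation for lists, replacement for scalars) can be sketched in Python. This is only an illustration of the described semantics, not Tablassert's actual fastmerge code; the function name `merge_section` and the sample dictionaries are hypothetical.

```python
from copy import deepcopy


def merge_section(template, section):
    """Hypothetical sketch of the merge rules described above."""
    merged = deepcopy(template)  # never mutate the shared template
    for key, value in section.items():
        current = merged.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            # Dictionaries: recursive merge, section keys win
            merged[key] = merge_section(current, value)
        elif isinstance(current, list) and isinstance(value, list):
            # Lists: template items first, then section items
            merged[key] = current + deepcopy(value)
        else:
            # Scalars (or mismatched types): section replaces template
            merged[key] = deepcopy(value)
    return merged


template = {
    "syntax": "TC3",
    "statement": {
        "subject": {"encoding": "A", "prioritize": ["Gene"]},
        "predicate": "related_to",
    },
}
section = {
    "statement": {
        "subject": {"prioritize": ["Protein"]},
        "predicate": "associated_with",
    }
}

merged = merge_section(template, section)
print(merged["statement"]["subject"]["prioritize"])  # ['Gene', 'Protein']
print(merged["statement"]["predicate"])              # associated_with
```

Copying the template before merging keeps one section's overrides from leaking into the next section.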

Use Cases

Single output: Template only

template:
  source: {kind: text, local: data.csv}
  statement: {...}

Multiple predicates, same source:

template:
  source: {kind: excel, local: data.xlsx}
  provenance: {publication: PMC123}

sections:
  - statement: {predicate: treats}
  - statement: {predicate: prevents}

Multiple columns, shared provenance:

template:
  source: {kind: text, local: data.csv}
  provenance: {publication: PMID456}
  statement:
    subject: {encoding: gene_symbol}

sections:
  - statement: {object: {encoding: column_A}}
  - statement: {object: {encoding: column_B}}

Configuration Schema

Template Metadata

Field   Type    Required  Description
syntax  String  Yes       Configuration version (must be "TC3")
status  String  No        Development status: "alpha", "beta", "stable"

Source

Defines the data file location and format.

Excel Source

Field      Type              Required  Description
kind       String            Yes       Must be "excel"
local      Path              Yes       Local file path for caching
url        URL               Yes       Download URL (HTTP/HTTPS)
sheet      String            No        Sheet name (default: "Sheet1")
row_slice  List[Int|"auto"]  No        Row range: [start, end] or [start, "auto"]
rows       List[Int]         No        Specific rows to include
reindex    List[Reindex]     No        Conditional row filtering

Example:

source:
  kind: excel
  local: ./data/mydata.xlsx
  url: https://example.com/data.xlsx
  sheet: "Sheet1"
  row_slice:
    - 2  # Start at row 2 (skip header)
    - auto  # Read to end

Text Source (CSV/TSV)

Field      Type              Required  Description
kind       String            Yes       Must be "text"
local      Path              Yes       Local file path for caching
url        URL               Yes       Download URL
delimiter  String            No        Column delimiter (default: ",")
row_slice  List[Int|"auto"]  No        Row range
rows       List[Int]         No        Specific rows
reindex    List[Reindex]     No        Conditional filtering

Example:

source:
  kind: text
  local: ./data/mydata.tsv
  url: https://example.com/data.tsv
  delimiter: "\t"
  row_slice:
    - 1
    - auto

Reindexing (Conditional Filtering)

Filter rows based on column values.

Field       Type              Description
column      String            Column name to evaluate
comparison  String            Operator: "eq", "ne", "lt", "le", "gt", "ge"
comparator  String|Int|Float  Value to compare against

Example:

reindex:
  - column: p_value
    comparison: lt
    comparator: 0.05  # Keep rows where p_value < 0.05
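
As a rough illustration of how a reindex rule selects rows, the sketch below maps each comparison name onto Python's `operator` module. The helper `apply_reindex` and the sample rows are hypothetical, and combining multiple rules with a logical AND is an assumption rather than documented Tablassert behavior.

```python
import operator

# Comparison names from the table above, mapped to standard operators
COMPARISONS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
}


def apply_reindex(rows, rules):
    """Hypothetical helper: keep rows that satisfy every rule (assumed AND)."""
    kept = rows
    for rule in rules:
        compare = COMPARISONS[rule["comparison"]]
        kept = [
            row for row in kept
            if compare(row[rule["column"]], rule["comparator"])
        ]
    return kept


rows = [
    {"gene": "TP53", "p_value": 0.01},
    {"gene": "BRCA1", "p_value": 0.2},
    {"gene": "EGFR", "p_value": 0.049},
]
rules = [{"column": "p_value", "comparison": "lt", "comparator": 0.05}]

kept = apply_reindex(rows, rules)
print([row["gene"] for row in kept])  # ['TP53', 'EGFR']
```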

Statement (Triple Definition)

Defines subject-predicate-object relationships.

Field       Type             Required  Description
subject     NodeEncoding     Yes       Subject entity configuration
predicate   String           Yes       Biolink predicate (e.g., "associated_with")
object      NodeEncoding     Yes       Object entity configuration
qualifiers  List[Qualifier]  No        Edge qualifiers (context)

Example:

statement:
  subject:
    method: column
    encoding: gene_symbol
    prioritize: [Gene]
  predicate: treats
  object:
    method: column
    encoding: disease_name
    prioritize: [Disease]

NodeEncoding

Defines how to extract and resolve entities.

Field            Type              Required  Description
method           String            Yes       "value" (literal) or "column" (column reference)
encoding         String|Int|Float  Yes       Literal value or column name
taxon            Int               No        NCBI Taxon ID for filtering (e.g., 9606 for human)
prioritize       List[String]      No        Preferred Biolink categories
avoid            List[String]      No        Excluded Biolink categories
regex            List[Regex]       No        Pattern replacements
remove           List[String]      No        Strings to filter out
prefix           String            No        Add prefix to values
suffix           String            No        Add suffix to values
explode_by       String            No        Delimiter to split multi-value cells
transformations  List[Math]        No        Mathematical transformations

Method: Value vs Column

method: value - Use a literal value

subject:
  method: value
  encoding: CHEBI:41774  # All rows get this CURIE

method: column - Reference a column

Excel columns use letters converted to column_N:

- Column A → column_1 or just "A"
- Column B → column_2 or just "B"

subject:
  method: column
  encoding: A  # Read from column A
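
The letter-to-column_N mapping can be sketched as follows. Only the A → column_1 and B → column_2 cases come from this reference; the handling of multi-letter columns (for example AA → column_27) follows the usual spreadsheet convention and is an assumption here, as is the helper name `excel_letter_to_name`.

```python
def excel_letter_to_name(letters):
    """Hypothetical helper: convert an Excel column letter to column_N."""
    if not letters:
        raise ValueError("column letter must not be empty")
    index = 0
    for ch in letters.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"not an Excel column letter: {letters!r}")
        # Bijective base-26: A=1 ... Z=26, AA=27 (assumed convention)
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return f"column_{index}"


print(excel_letter_to_name("A"))   # column_1
print(excel_letter_to_name("B"))   # column_2
print(excel_letter_to_name("AA"))  # column_27
```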

CSV/TSV columns use header names:

subject:
  method: column
  encoding: gene_symbol  # Read from "gene_symbol" column

Taxonomic Filtering

taxon: int - Filter entities by organism

subject:
  encoding: gene_column
  taxon: 9606  # Only human genes (Homo sapiens)

Common taxon IDs:

- 9606 - Homo sapiens (human)
- 10090 - Mus musculus (mouse)
- 7227 - Drosophila melanogaster (fruit fly)

Category Prioritization

prioritize: list[category] - Prefer specific Biolink categories

subject:
  encoding: A
  prioritize:
    - Gene
    - Protein

If "TP53" maps to both Gene and Protein, prefer Gene.

avoid: list[category] - Exclude specific categories

subject:
  encoding: organism_name
  prioritize:
    - OrganismTaxon
  avoid:
    - Gene

Prevents misclassifying organism names as genes.

Text Transformations

regex: list[{pattern, replacement}] - Pattern-based replacements

subject:
  encoding: A
  regex:
    - pattern: ".*g__"
      replacement: ""  # Remove genus prefix
    - pattern: ";s__"
      replacement: " "  # Replace species separator

Patterns are applied in the order listed.
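
To make the sequential behavior concrete, here is a small sketch that applies the two rules from the example with Python's `re.sub`. Whether Tablassert uses Python's `re` syntax is an assumption, and both the helper `apply_regex_rules` and the sample taxonomy string are made up for illustration.

```python
import re


def apply_regex_rules(value, rules):
    """Hypothetical helper: apply each pattern/replacement pair in order."""
    for rule in rules:
        value = re.sub(rule["pattern"], rule["replacement"], value)
    return value


rules = [
    {"pattern": ".*g__", "replacement": ""},   # remove everything up to the genus prefix
    {"pattern": ";s__", "replacement": " "},   # replace the species separator
]

cleaned = apply_regex_rules("k__Bacteria;g__Escherichia;s__coli", rules)
print(cleaned)  # Escherichia coli
```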

remove: list[string] - Filter out specific strings

subject:
  encoding: A
  remove:
    - "^NA "  # Remove rows starting with "NA "

prefix / suffix - Add text

object:
  encoding: identifier
  prefix: "CUSTOM:"  # "123" → "CUSTOM:123"

Multi-Value Handling

explode_by: string - Split delimited values into multiple rows

object:
  encoding: pathway_list
  explode_by: ";"  # "P1;P2;P3" → 3 separate edges
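
The effect of explode_by can be pictured as one input row becoming several, one per delimited value. The sketch below works on plain dictionaries; the helper `explode_rows` and the sample row are hypothetical, and details such as whitespace trimming are left out because this reference does not specify them.

```python
def explode_rows(rows, column, delimiter):
    """Hypothetical helper: emit one row per delimited value in `column`."""
    exploded = []
    for row in rows:
        for part in str(row[column]).split(delimiter):
            # Copy the other fields unchanged, replace only the split column
            exploded.append({**row, column: part})
    return exploded


rows = [{"gene": "TP53", "pathway_list": "P1;P2;P3"}]
edges = explode_rows(rows, "pathway_list", ";")

for edge in edges:
    print(edge["gene"], edge["pathway_list"])
```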

Mathematical Transformations

transformations: list[{function, arguments}]

Available functions: add, subtract, multiply, divide, log, exp, abs, round

Use the "values" token to reference the column's values:

annotations:
  - annotation: "adjusted p value"
    method: column
    encoding: p_value
    transformations:
      - function: multiply
        arguments: ["values", 0.05]  # p_value * 0.05
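
One way to read the "values" token is as a placeholder that is replaced by each cell value before the function is applied. The sketch below covers only the four two-argument arithmetic functions; the helper `apply_transformations`, the assumption that argument order is significant, and the assumption that steps are chained in list order are illustrative and not confirmed by this reference.

```python
import operator

# Subset of the available functions (log, exp, abs, round omitted here)
FUNCTIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def apply_transformations(values, transformations):
    """Hypothetical helper: substitute the "values" token and apply each step."""
    for step in transformations:
        func = FUNCTIONS[step["function"]]
        left, right = step["arguments"]
        values = [
            func(
                v if left == "values" else left,
                v if right == "values" else right,
            )
            for v in values
        ]
    return values


steps = [{"function": "multiply", "arguments": ["values", 0.05]}]
scaled = apply_transformations([0.2, 1.0], steps)
print(scaled)
```

Under this reading, `arguments: [1, "values"]` with `divide` would compute 1 / value rather than value / 1.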

Qualifiers

Add context to edges (anatomical location, species, etc.).

Field                    Type    Description
qualifier                String  Biolink qualifier (e.g., "species_context")
(inherits NodeEncoding)          All NodeEncoding fields available

Example:

qualifiers:
  - qualifier: species_context
    method: value
    encoding: NCBITaxon:9606

Provenance

Required metadata about the data source.

Field         Type               Required  Description
repo          String             Yes       Repository: "PMC", "PUBMED", "DOI"
publication   String             Yes       Identifier (e.g., "PMC11708054", "PMID123")
contributors  List[Contributor]  Yes       Curation information

Contributor fields:

Field          Type          Required  Description
kind           String        Yes       "curation", "analysis", "validation"
name           String        Yes       Contributor name
date           String        Yes       Date (free format)
organizations  List[String]  No        Affiliations
comment        String        No        Notes

Example:

provenance:
  repo: PMC
  publication: PMC11708054
  contributors:
    - kind: curation
      name: Skye Lane Goetz
      date: 09 JAN 2025
      organizations:
        - Institute for Systems Biology
        - CalPoly SLO
      comment: Migrated from TC2 to TC3

Annotations

Optional edge attributes (statistical metadata, notes, etc.).

Field                Type    Description
annotation           String  Attribute name (e.g., "p value", "sample size")
(inherits Encoding)          All Encoding fields available (method, encoding, regex, etc.)

Example:

annotations:
  - annotation: p value
    method: column
    encoding: C  # Read from column C

  - annotation: sample size
    method: value
    encoding: 450  # Literal value for all edges

  - annotation: multiple testing correction method
    method: value
    encoding: "Benjamini Hochberg"

Complete Example

Minimal table configuration:

template:
  syntax: TC3
  status: alpha

  source:
    kind: text
    local: ./data.csv
    url: https://example.com/data.csv
    row_slice: [1, auto]
    delimiter: ","

  statement:
    subject:
      method: column
      encoding: gene
      prioritize: [Gene]
    predicate: associated_with
    object:
      method: column
      encoding: disease
      prioritize: [Disease]

  provenance:
    repo: PUBMED
    publication: PMID12345678
    contributors:
      - kind: curation
        name: Example User
        date: 27 JAN 2026

  annotations:
    - annotation: p value
      method: column
      encoding: p_val

Next Steps