Table Configuration Reference

Table configurations define how Tablassert transforms tabular data (Excel, CSV, TSV) into knowledge graph assertions.

Purpose

A table configuration specifies:

- Data source location and format
- How to extract subject-predicate-object triples
- Entity resolution rules (taxonomic filtering, category preferences)
- Provenance metadata
- Optional edge annotations

Template vs Sections

Table configurations support two patterns:

Pattern 1: Template Only

Use when processing a single table with one output.

template:
  syntax: TC3
  source: {...}
  statement: {...}
  provenance: {...}

Pattern 2: Template + Sections

Use when processing variations of the same data (different columns, predicates, etc.) while sharing common configuration.

template:
  syntax: TC3
  source: {...}  # Shared by all sections
  provenance: {...}  # Shared by all sections

sections:
  - statement:  # Section 1: Gene-Disease
      subject: {method: column, encoding: A}
      predicate: associated_with
      object: {method: column, encoding: B}

  - statement:  # Section 2: Gene-Pathway
      subject: {method: column, encoding: A}
      predicate: participates_in
      object: {method: column, encoding: C}

Merge Behavior (fastmerge)

Sections inherit from the template and override specific fields:

Dictionaries: Recursive merge, section overrides template keys

template:
  statement:
    subject: {encoding: A}
    predicate: related_to

sections:
  - statement:
      predicate: associated_with  # Overrides, subject stays "A"

Lists: Concatenation (extends)

template:
  statement:
    subject:
      prioritize: [Gene]

sections:
  - statement:
      subject:
        prioritize: [Protein]  # Result: [Gene, Protein]

Scalars: Section replaces template

template:
  syntax: TC3

sections:
  - syntax: TC2  # Overrides (not recommended)
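The three rules can be sketched in a few lines of Python (an illustrative stand-in for fastmerge, not its actual implementation):

```python
def merge(template, section):
    """Merge a section over a template: dicts merge recursively,
    lists concatenate, and scalars from the section win."""
    if isinstance(template, dict) and isinstance(section, dict):
        out = dict(template)
        for key, value in section.items():
            out[key] = merge(template[key], value) if key in template else value
        return out
    if isinstance(template, list) and isinstance(section, list):
        return template + section  # lists extend
    return section  # scalars: section replaces template

merged = merge(
    {"statement": {"subject": {"prioritize": ["Gene"]}, "predicate": "related_to"}},
    {"statement": {"subject": {"prioritize": ["Protein"]}, "predicate": "associated_with"}},
)
# merged["statement"]["predicate"] == "associated_with"
# merged["statement"]["subject"]["prioritize"] == ["Gene", "Protein"]
```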

Use Cases

Single output: Template only

template:
  source: {kind: text, local: data.csv, url: https://example.com/data.csv}
  statement: {...}

Multiple predicates, same source:

template:
  source: {kind: excel, local: data.xlsx, url: https://example.com/data.xlsx}
  provenance: {repo: PMC, publication: 123, contributors: [{name: Example User, date: 27 JAN 2026}]}

sections:
  - statement: {predicate: treats}
  - statement: {predicate: prevents}

Multiple columns, shared provenance:

template:
  source: {kind: text, local: data.csv, url: https://example.com/data.csv}
  provenance: {repo: PMID, publication: 456, contributors: [{name: Example User, date: 27 JAN 2026}]}
  statement:
    subject: {method: column, encoding: A}

sections:
  - statement: {object: {method: column, encoding: B}}
  - statement: {object: {method: column, encoding: C}}

Configuration Schema

Template Metadata

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| syntax | String | No | Configuration version. Defaults to "TC3". |
| status | String | No | Development status. Defaults to "alpha"; allowed values are "alpha", "beta", "primetime". |

Source

Defines the data file location and format.

Excel Source

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| kind | String | No | Source kind. Model default is "excel", but specify it explicitly in configs. |
| local | Path | Yes | Local file path for caching |
| url | URL | Yes | Download URL (HTTP/HTTPS) |
| sheet | String | No | Sheet name. Defaults to "Sheet1". |
| row_slice | List[Int\|"auto"] | No | Two-value zero-based crop bounds: [start, stop]. Each value may be an integer or "auto". |
| rows | List[Int] | No | Zero-based row indices to keep after any row_slice crop. |
| reindex | List[Reindex] | No | Conditional row filtering |

Example:

source:
  kind: excel
  local: ./data/mydata.xlsx
  url: https://example.com/data.xlsx
  sheet: "Sheet1"
  row_slice:
    - 1  # Start at the second physical row
    - auto  # Read to end

Text Source (CSV/TSV)

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| kind | String | No | Source kind. Model default is "text", but specify it explicitly in configs. |
| local | Path | Yes | Local file path for caching |
| url | URL | Yes | Download URL |
| delimiter | String | No | Field delimiter. Defaults to ",". |
| row_slice | List[Int\|"auto"] | No | Two-value zero-based crop bounds: [start, stop]. Each value may be an integer or "auto". |
| rows | List[Int] | No | Zero-based row indices to keep after any row_slice crop. |
| reindex | List[Reindex] | No | Conditional filtering |

Example:

source:
  kind: text
  local: ./data/mydata.tsv
  url: https://example.com/data.tsv
  delimiter: "\t"
  row_slice:
    - 1
    - auto

Reindexing (Conditional Filtering)

Filter rows based on column values.

| Field | Type | Description |
| --- | --- | --- |
| column | String | Source column letters to evaluate (A-ZZZ) |
| comparison | String | Operator. Defaults to "ne"; allowed values are "eq", "ne", "lt", "le", "gt", "ge". |
| comparator | String\|Int\|Float | Value to compare against |

Example:

reindex:
  - column: C
    comparison: lt
    comparator: 0.05  # Keep rows where column C < 0.05
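Conceptually, reindex keeps the rows for which the comparison holds. A minimal Python sketch (illustrative, not Tablassert's code) using the operator module:

```python
import operator

# Map the six allowed comparison names to Python operators.
OPS = {"eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
       "le": operator.le, "gt": operator.gt, "ge": operator.ge}

def apply_reindex(rows, column, comparison, comparator):
    """Keep rows whose value in `column` satisfies the comparison."""
    op = OPS[comparison]
    return [row for row in rows if op(row[column], comparator)]

rows = [{"C": 0.01}, {"C": 0.2}, {"C": 0.04}]
kept = apply_reindex(rows, "C", "lt", 0.05)
# kept == [{"C": 0.01}, {"C": 0.04}]
```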

Statement (Triple Definition)

Defines subject-predicate-object relationships.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| subject | NodeEncoding | Yes | Subject entity configuration |
| predicate | String | No | Biolink predicate. Defaults to "related_to". |
| object | NodeEncoding | Yes | Object entity configuration |
| qualifiers | List[Qualifier] | No | Edge qualifiers (context) |

Example:

statement:
  subject:
    method: column
    encoding: A
    prioritize: [Gene]
  predicate: treats
  object:
    method: column
    encoding: B
    prioritize: [Disease]

NodeEncoding

Defines how to extract and resolve entities.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| method | String | No | "value" (literal) or "column" (source column letters). Defaults to "value". |
| encoding | String\|Int\|Float | Yes | Literal value or source column letters, depending on method |
| taxon | Int | No | NCBI Taxon ID for filtering (e.g., 9606 for human) |
| prioritize | List[String] | No | Preferred Biolink categories (must be valid Categories enum values such as Gene, Protein) |
| avoid | List[String] | No | Excluded Biolink categories (must be valid Categories enum values) |
| regex | List[Regex] | No | Pattern replacements |
| fill | String | No | Null-filling strategy: "forward", "backward", "min", "max", "mean", "zero", "one" |
| remove | List[String] | No | Strings to filter out |
| prefix | String | No | Add prefix to values |
| suffix | String | No | Add suffix to values |
| explode_by | String | No | Delimiter to split multi-value cells |
| transformations | List[Math] | No | Mathematical transformations |

Method: Value vs Column

method: value - Use a literal value

subject:
  method: value
  encoding: CHEBI:41774  # All rows get this CURIE

method: column - Reference a source column

Source files are read without headers, so column references are Excel-style letters:

- Column A -> "A"
- Column B -> "B"
- Column AA -> "AA"

subject:
  method: column
  encoding: A  # Read from column A

At runtime those letters are converted internally to Polars column names such as column_1, but those internal names are not valid configuration values.
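For reference, the letter scheme follows Excel's base-26 column numbering. A hypothetical helper shows the mapping (the helper itself is not part of Tablassert):

```python
def letters_to_index(letters: str) -> int:
    """Convert Excel-style column letters to a zero-based index
    (A -> 0, B -> 1, ..., Z -> 25, AA -> 26)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1

print(letters_to_index("A"))   # 0
print(letters_to_index("AA"))  # 26
```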

Taxonomic Filtering

taxon: int - Filter entities by organism

subject:
  method: column
  encoding: A
  taxon: 9606  # Only human genes (Homo sapiens)

Common taxon IDs:

- 9606 - Homo sapiens (human)
- 10090 - Mus musculus (mouse)
- 7227 - Drosophila melanogaster (fruit fly)

Category Prioritization

prioritize: list[category] - Prefer specific Biolink categories

subject:
  encoding: A
  prioritize:
    - Gene
    - Protein

If "TP53" maps to both Gene and Protein, Gene is preferred.

avoid: list[category] - Exclude specific categories

subject:
  method: column
  encoding: A
  prioritize:
    - OrganismTaxon
  avoid:
    - Gene

Prevents misclassifying organism names as genes.

Text Transformations

regex: list[{pattern, replacement}] - Pattern-based replacements

subject:
  encoding: A
  regex:
    - pattern: ".*g__"
      replacement: ""  # Remove genus prefix
    - pattern: ";s__"
      replacement: " "  # Replace species separator

Executed in order.

Regex dialect: Patterns are passed directly to Polars str.replace_all(), which uses the Rust regex crate, so only features supported by that engine work. In particular, backreferences (\1) and lookarounds ((?=...), (?<=...), (?!...), (?<!...)) are not supported and will raise an error at parse time. Stick to character classes, anchors (^, $), quantifiers, alternation (a|b), and non-capturing groups ((?:...)) if grouping is needed. If a transformation is too complex to express, prefer chaining several simple substitutions or capturing the residual context in a miscellaneous notes annotation instead.
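As a sanity check of the two substitutions above, shown here with Python's re module (whose syntax agrees with the Rust regex crate for simple patterns like these):

```python
import re

# Chain simple substitutions, mirroring the regex list above.
value = "k__Bacteria;g__Escherichia;s__coli"
value = re.sub(r".*g__", "", value)  # remove genus prefix
value = re.sub(r";s__", " ", value)  # replace species separator
# value == "Escherichia coli"
```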

remove: list[string] - Filter out specific strings

subject:
  encoding: A
  remove:
    - "^NA "  # Remove rows starting with "NA "

The same regex constraints apply as for the regex field: Polars-compatible patterns only, no backreferences or lookarounds.

prefix / suffix - Add text

object:
  encoding: identifier
  prefix: "CUSTOM:"  # "123" → "CUSTOM:123"

Null Handling

fill: string - Fill null values using a strategy

Available strategies:

- "forward" - Fill nulls with previous non-null value
- "backward" - Fill nulls with next non-null value
- "min" - Fill with column minimum
- "max" - Fill with column maximum
- "mean" - Fill with column mean
- "zero" - Fill with 0
- "one" - Fill with 1

subject:
  method: column
  encoding: A
  fill: forward  # Propagate values down through null rows
annotations:
  - annotation: expression_level
    method: column
    encoding: C
    fill: mean  # Replace nulls with column average
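The strategies are plain column operations; for instance, forward and mean filling behave like this Python sketch (illustrative, not the Polars calls Tablassert actually uses):

```python
def fill_forward(values):
    """Replace each None with the most recent non-null value."""
    out, last = [], None
    for v in values:
        last = v if v is not None else last
        out.append(last)
    return out

def fill_mean(values):
    """Replace each None with the mean of the non-null values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

print(fill_forward(["TP53", None, None, "BRCA1"]))  # ['TP53', 'TP53', 'TP53', 'BRCA1']
print(fill_mean([1.0, None, 3.0]))                  # [1.0, 2.0, 3.0]
```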

Multi-Value Handling

explode_by: string - Split delimited values into multiple rows

object:
  method: column
  encoding: B
  explode_by: ";"  # "P1;P2;P3" → 3 separate edges
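The effect is one output row, and hence one edge, per delimited value, as in this illustrative sketch:

```python
def explode(rows, column, delimiter):
    """Duplicate each row once per delimited value in `column`."""
    out = []
    for row in rows:
        for part in str(row[column]).split(delimiter):
            out.append({**row, column: part})
    return out

rows = [{"A": "GENE1", "B": "P1;P2;P3"}]
exploded = explode(rows, "B", ";")
# Three edges: GENE1 -> P1, GENE1 -> P2, GENE1 -> P3
```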

Mathematical Transformations

transformations: list[{function, arguments}]

Available functions: copysign, pow

Use the "values" token to reference column values in transformations.
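These correspond to Python's built-in pow and math.copysign, with "values" standing for the column's values. An illustrative sketch of the two operations (the exact argument syntax is defined by the configuration schema):

```python
import math

values = [2.0, 4.0]

# pow with the "values" token as the base: square each value.
squared = [pow(v, 2) for v in values]              # [4.0, 16.0]

# copysign: give each value the sign of a second argument.
signed = [math.copysign(v, -1.0) for v in values]  # [-2.0, -4.0]
```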

Qualifiers

Add context to edges (anatomical location, species, etc.).

| Field | Type | Description |
| --- | --- | --- |
| qualifier | String | Biolink qualifier from the Qualifiers enum (e.g., "species_context_qualifier") |
| (inherits NodeEncoding) | | All NodeEncoding fields available |

Example:

qualifiers:
  - qualifier: species_context_qualifier
    method: value
    encoding: NCBITaxon:9606

Provenance

Required metadata about the data source.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| repo | String | No | Repository. Defaults to "PMC"; allowed values are "PMC", "PMID". |
| publication | String | Yes | Repository-local identifier appended to repo: (e.g., "11708054", "123"). When repo is "PMC", the value must start with PMC followed by digits (e.g., "PMC11708054", "PMC123"). |
| contributors | List[Contributor] | Yes | Curation information |

Contributor fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| kind | String | No | Contributor role. Defaults to "curation"; allowed values are "curation", "validation", "tool". |
| name | String | Yes | Contributor name |
| date | String | Yes | Date (free format) |
| organizations | List[String] | No | Affiliations |
| comment | String | No | Notes |

Example:

provenance:
  repo: PMC
  publication: PMC11708054
  contributors:
    - kind: curation
      name: Skye Lane Goetz
      date: 09 JAN 2025
      organizations:
        - Institute for Systems Biology
        - CalPoly SLO
      comment: Migrated from TC2 to TC3

Annotations

Optional edge attributes (statistical metadata, notes, etc.).

| Field | Type | Description |
| --- | --- | --- |
| annotation | String | Attribute name (e.g., "p value", "sample size") |
| (inherits Encoding) | | All Encoding fields available (method, encoding, regex, etc.) |

Example:

annotations:
  - annotation: p value
    method: column
    encoding: C  # Read from column C

  - annotation: sample size
    method: value
    encoding: 450  # Literal value for all edges

  - annotation: multiple testing correction method
    method: value
    encoding: "Benjamini Hochberg"

  # Freetext catch-all for context that doesn't fit a structured field —
  # study caveats, units, post-hoc notes, anything you'd otherwise lose.
  - annotation: miscellaneous notes
    method: value
    encoding: "Values are log2 fold-change relative to vehicle control; n=3 biological replicates per arm"

Tip: When source data carries information that can't be cleanly mapped to a structured annotation (assay-specific caveats, non-standard units, qualitative observations), add a miscellaneous notes annotation rather than forcing it into another field or dropping it. It accepts both method: value (one note for the whole table) and method: column (per-row notes from the source).

Complete Example

Minimal table configuration:

template:
  syntax: TC3
  status: alpha

  source:
    kind: text
    local: ./data.csv
    url: https://example.com/data.csv
    row_slice: [1, auto]
    delimiter: ","

  statement:
    subject:
      method: column
      encoding: A
      prioritize: [Gene]
    predicate: associated_with
    object:
      method: column
      encoding: B
      prioritize: [Disease]

  provenance:
    repo: PMID
    publication: 12345678
    contributors:
      - kind: curation
        name: Example User
        date: 27 JAN 2026

  annotations:
    - annotation: p value
      method: column
      encoding: C

Next Steps