Entity Resolution (fullmap)

The fullmap module provides entity resolution: it maps text strings to standardized identifiers (CURIEs) for biological entities.

version4()

Primary entity resolution function using DuckDB queries against the dbssert database.

Function Signature

def version4(
  lf: pl.LazyFrame,
  col: str,
  conn: object,
  taxon: Optional[str],
  prioritize: Optional[list[Categories]],
  avoid: Optional[list[Categories]],
  section_hash: str,
  config_file: str,
  tag: str = " one"
) -> pl.LazyFrame

Parameters

lf: pl.LazyFrame

Input LazyFrame containing the data to process. The frame is collected internally at explicit points so that DuckDB queries and joins can run.

col: str

Column name containing text strings to resolve.

conn: object

DuckDB connection to the entity resolution database.

This database contains:

- Synonym mappings (text → CURIE)
- Preferred entity names
- Biolink categories
- NCBI Taxon IDs
- Source databases and versions

taxon: Optional[str]

Optional NCBI Taxon ID for filtering results.

Example: "9606" filters to human-specific entities.

prioritize: Optional[list[Categories]]

Optional list of Biolink categories to prefer when multiple matches exist.

Example: [Categories.Gene, Categories.Protein] prefers gene/protein mappings.

avoid: Optional[list[Categories]]

Optional list of Biolink categories to exclude from results.

Example: [Categories.Gene] prevents gene mappings.

tag: str (default: " one")

Suffix for NLP processing level column.

The function looks for both:

- col (original text, case-preserved)
- col + tag (normalized text, typically lowercase)

Default " one" means it uses level-one text processing (lowercase, stripped).
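A minimal sketch of how the two text columns relate, assuming level-one processing is exactly lowercasing plus stripping surrounding whitespace, as described above. The helpers level_one and with_level_one are hypothetical names for illustration and are not part of fullmap; the real pipeline may prepare this column differently.

```python
def level_one(text: str) -> str:
    """Hypothetical level-one normalization: strip whitespace, then lowercase."""
    return text.strip().lower()


def with_level_one(row: dict, col: str, tag: str = " one") -> dict:
    """Return a copy of `row` with the normalized companion column added."""
    out = dict(row)
    # The companion column is named by appending the tag to the column name.
    out[col + tag] = level_one(row[col])
    return out


row = with_level_one({"gene_symbol": " TP53 "}, col="gene_symbol")
print(row)
```

Note that the default tag begins with a space, so the companion column for "gene_symbol" is named "gene_symbol one".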

section_hash: str / config_file: str

Context fields used for operational logging when unmatched values are encountered.

Return Value

Returns a Polars LazyFrame with these columns added:

Column                 Description            Example
{col}                  CURIE identifier       "HGNC:11998"
{col} name             Preferred entity name  "TP53"
{col} category         Biolink category       "biolink:Gene"
{col} taxon            NCBI Taxon ID          "NCBITaxon:9606"
{col} source           Source database        "HGNC"
{col} source version   Database version       "2025-01"
{col} nlp level        NLP processing level   0 or 1
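As a small illustration of the naming scheme, the sketch below derives the expected output column names for a given col. The helper output_columns and the OUTPUT_SUFFIXES list are hypothetical; they only restate the table above and are not part of the fullmap API.

```python
# Suffixes copied from the return-value table; the empty suffix is the
# CURIE column, which reuses the input column name.
OUTPUT_SUFFIXES = [
    "",
    " name",
    " category",
    " taxon",
    " source",
    " source version",
    " nlp level",
]


def output_columns(col: str) -> list[str]:
    """Hypothetical helper: list the columns expected after resolution."""
    return [col + suffix for suffix in OUTPUT_SUFFIXES]


columns = output_columns("gene_symbol")
print(columns)
```

A list like this can be handy for selecting or validating the resolved columns downstream.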

DuckDB Query

The function executes a complex SQL query that:

  1. Ranks matches by:
     - Category priority (if prioritize specified)
     - NLP level (exact case match preferred over normalized)
     - Source confidence

  2. Filters by:
     - Taxon ID (if specified)
     - Category avoidance (if specified)

  3. Deduplicates to one CURIE per row per input string
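The rank, filter, and deduplicate steps can be sketched in plain Python. This is not the SQL that version4 runs; it is an illustration under stated assumptions: the Candidate record, the numeric confidence score, the strict taxon equality check, and the "EXAMPLE:" CURIEs are all hypothetical. Filtering happens before ranking here, which gives the same result as ranking first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    text: str             # input string that produced the match
    curie: str
    category: str
    taxon: Optional[str]
    nlp_level: int        # 0 = exact case, 1 = normalized
    confidence: float     # hypothetical source-confidence score, higher is better


def resolve(candidates, taxon=None, prioritize=(), avoid=()):
    # Filter: drop avoided categories and, if a taxon is given, other taxa.
    kept = [
        c for c in candidates
        if c.category not in avoid and (taxon is None or c.taxon == taxon)
    ]

    def rank(c):
        # Smaller tuples sort first: prioritized category, then level 0,
        # then higher confidence.
        return (c.category not in prioritize, c.nlp_level, -c.confidence)

    # Deduplicate: the best-ranked candidate for each input string wins.
    best = {}
    for c in sorted(kept, key=rank):
        best.setdefault(c.text, c)
    return best


candidates = [
    Candidate("TP53", "HGNC:11998", "biolink:Gene", "9606", 0, 0.90),
    Candidate("TP53", "EXAMPLE:protein-1", "biolink:Protein", "9606", 0, 0.95),
    Candidate("TP53", "EXAMPLE:mouse-gene-1", "biolink:Gene", "10090", 1, 0.90),
]

preferred = resolve(candidates, taxon="9606", prioritize=["biolink:Gene"])
avoided = resolve(candidates, taxon="9606", avoid=["biolink:Gene"])
print(preferred["TP53"].curie, avoided["TP53"].curie)
```

With prioritize set, the gene candidate wins even though the protein candidate has a higher confidence score, because category priority is compared first.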

Example Usage

import duckdb
import polars as pl

from tablassert.enums import Categories
from tablassert.fullmap import version4

# Open DuckDB connection
conn = duckdb.connect("/data/dbssert.duckdb", read_only=True)

# LazyFrame with data to resolve
lf = pl.scan_parquet("data.parquet")

# Resolve gene symbols to CURIEs
result = version4(
  lf=lf,
  col="gene_symbol",
  conn=conn,
  taxon="9606",  # Human only
  prioritize=[Categories.Gene],
  avoid=[Categories.Protein],
  section_hash="tutorial-section",
  config_file="tutorial-table.yaml",
  tag=" one"
)

# Result LazyFrame includes:
# - gene_symbol: "HGNC:11998"
# - gene_symbol name: "TP53"
# - gene_symbol category: "biolink:Gene"
# - etc.

NLP Processing Levels

Level 0 (exact case):

- Column: col
- Matches preserve case
- Preferred for acronyms, gene symbols

Level 1 (normalized):

- Column: col + " one"
- Lowercased, whitespace stripped
- Preferred for disease names, free text

The function tries level 0 first, then level 1.

Rows without a valid CURIE are filtered from the returned frame.

Case-Dependent Behavior

"TP53" vs "tp53":

- Level 0: Only matches the "TP53" synonym
- Level 1: Matches any case variant

This preserves specificity for case-sensitive identifiers while allowing case-insensitive matching for general terms.
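The level 0 then level 1 fallback, including the dropping of unmatched rows, can be sketched as follows. The EXACT and NORMALIZED dictionaries and the lookup helper are hypothetical stand-ins for the DuckDB synonym queries, and the sketch assumes level 1 is simply a lowercase, whitespace-stripped comparison.

```python
from typing import Optional, Tuple

# Hypothetical synonym tables; the real lookups are DuckDB queries.
EXACT = {"TP53": "HGNC:11998"}        # level 0: case-preserved synonyms
NORMALIZED = {"tp53": "HGNC:11998"}   # level 1: lowercased, stripped synonyms


def lookup(text: str) -> Optional[Tuple[str, int]]:
    """Return (curie, nlp_level), trying level 0 before level 1."""
    if text in EXACT:
        return EXACT[text], 0
    key = text.strip().lower()
    if key in NORMALIZED:
        return NORMALIZED[key], 1
    return None  # no valid CURIE at either level


resolved = {}
for text in ["TP53", "tp53", "not a gene"]:
    hit = lookup(text)
    if hit is not None:  # unmatched inputs are dropped
        resolved[text] = hit

print(resolved)
```

The recorded level shows which path produced each match, which corresponds to the "{col} nlp level" column in the returned frame.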

Provenance Tracking

Every resolved entity includes:

- Source database (HGNC, MONDO, ChEBI, etc.)
- Source version (database snapshot date)
- Matched synonym (which text triggered the match)

This enables auditing and quality control.
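One simple audit is to count resolved entities per source and version. The sketch below assumes the resolved rows have been collected into plain dictionaries; the records, their keys, and the "MONDO:EXAMPLE-1" placeholder CURIE are illustrative only and do not come from the real database.

```python
from collections import Counter

# Illustrative records; "MONDO:EXAMPLE-1" is a placeholder, not a real CURIE.
records = [
    {"curie": "HGNC:11998", "source": "HGNC", "source version": "2025-01"},
    {"curie": "HGNC:11998", "source": "HGNC", "source version": "2025-01"},
    {"curie": "MONDO:EXAMPLE-1", "source": "MONDO", "source version": "2025-01"},
]

# Count how many resolved rows each (source, version) pair contributed.
by_source = Counter((r["source"], r["source version"]) for r in records)

for (source, version), count in sorted(by_source.items()):
    print(f"{source} {version}: {count}")
```

A summary like this makes it easy to spot a table that resolved against an unexpected source or an outdated snapshot.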

Integration with QC

Entity resolution output is validated by fullmap_audit() from the qc module before being included in the knowledge graph.

See Quality Control for details.

Next Steps