Entity Resolution (fullmap)

The fullmap module provides entity resolution functionality, mapping text strings to standardized biological entities (CURIEs).

version4()

Primary entity resolution function using DuckDB queries against the dbssert database.

Function Signature

def version4(
  p: Path,
  col: str,
  dbssert: Path,
  taxon: Optional[str],
  prioritize: Optional[list[Categories]],
  avoid: Optional[list[Categories]],
  tag: str = " one"
) -> pl.DataFrame

Parameters

p: Path

Path to the input parquet file containing the DataFrame to process.

col: str

Column name containing text strings to resolve.

dbssert: Path

Path to the DuckDB entity resolution database.

This database contains:

- Synonym mappings (text → CURIE)
- Preferred entity names
- Biolink categories
- NCBI Taxon IDs
- Source databases and versions

taxon: Optional[str]

Optional NCBI Taxon ID for filtering results.

Example: "9606" filters to human-specific entities.

prioritize: Optional[list[Categories]]

Optional list of Biolink categories to prefer when multiple matches exist.

Example: [Categories.Gene, Categories.Protein] prefers gene/protein mappings.

avoid: Optional[list[Categories]]

Optional list of Biolink categories to exclude from results.

Example: [Categories.Gene] prevents gene mappings.

tag: str (default: " one")

Suffix for NLP processing level column.

The function looks for both:

- col (original text, case-preserved)
- col + tag (normalized text, typically lowercase)

Default " one" means it uses level-one text processing (lowercase, stripped).
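As a rough illustration of the two columns described above, the sketch below derives a level-one string and the name of its companion column. The helpers `level_one` and `companion_column` are hypothetical and not part of fullmap; the normalization tablassert actually applies may differ from this assumption.

```python
def level_one(text: str) -> str:
    """Assumed level-one normalization: strip surrounding whitespace, then lowercase."""
    return text.strip().lower()


def companion_column(col: str, tag: str = " one") -> str:
    """Name of the normalized column expected next to `col`."""
    return col + tag


# Hypothetical input row: the original text plus its normalized companion.
col = "gene_symbol"
row = {col: " TP53 "}
row[companion_column(col)] = level_one(row[col])
print(row)
```

Note that the default tag begins with a space, so the companion of "gene_symbol" is "gene_symbol one".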

Return Value

Returns a Polars DataFrame with these columns added:

| Column               | Description          | Example                       |
|----------------------|----------------------|-------------------------------|
| {col}                | CURIE identifier     | "HGNC:11998"                  |
| {col} name           | Preferred entity name | "TP53"                       |
| {col} category       | Biolink category     | "biolink:Gene"                |
| {col} taxon          | NCBI Taxon ID        | "NCBITaxon:9606"              |
| {col} source         | Source database      | "HGNC"                        |
| {col} source version | Database version     | "2025-01"                     |
| {col} nlp level      | NLP processing level | 0 or 1                        |
| {col} synonym        | Matched synonym      | "TP53" or "tumor protein p53" |
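The naming pattern in the table can be spelled out for a concrete column. This is a minimal sketch: `output_columns` is a hypothetical helper, and the suffixes are copied from the table rather than from the fullmap source.

```python
def output_columns(col: str) -> list[str]:
    """List the result column names the table above describes for `col`."""
    suffixes = [
        "",                 # CURIE identifier
        " name",            # preferred entity name
        " category",        # Biolink category
        " taxon",           # NCBI Taxon ID
        " source",          # source database
        " source version",  # database version
        " nlp level",       # 0 or 1
        " synonym",         # matched synonym
    ]
    return [col + suffix for suffix in suffixes]


print(output_columns("gene_symbol"))
```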

DuckDB Query

The function executes a complex SQL query that:

  1. Ranks matches by:
     - Category priority (if prioritize specified)
     - NLP level (exact case match preferred over normalized)
     - Source confidence

  2. Filters by:
     - Taxon ID (if specified)
     - Category avoidance (if specified)

  3. Deduplicates to one CURIE per row per input string
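The rank, filter, and deduplicate steps above can be sketched with a window function. This is not the query version4() runs: it uses Python's built-in sqlite3 in place of DuckDB so it runs without extra dependencies, the table and column names are hypothetical, the "EXAMPLE:" CURIEs are made-up placeholders, and source confidence is left out of the ranking.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema; the real dbssert layout is not shown in this document.
con.execute(
    "CREATE TABLE synonyms "
    "(synonym TEXT, curie TEXT, category TEXT, taxon TEXT, nlp_level INTEGER)"
)
con.execute("CREATE TABLE inputs (text TEXT, text_one TEXT)")
con.executemany(
    "INSERT INTO synonyms VALUES (?, ?, ?, ?, ?)",
    [
        ("TP53", "HGNC:11998", "biolink:Gene", "9606", 0),
        ("tp53", "HGNC:11998", "biolink:Gene", "9606", 1),
        ("tp53", "EXAMPLE:protein", "biolink:Protein", "9606", 1),  # placeholder
        ("tp53", "EXAMPLE:mouse", "biolink:Gene", "10090", 1),      # placeholder
    ],
)
con.executemany("INSERT INTO inputs VALUES (?, ?)", [("TP53", "tp53"), ("Tp53", "tp53")])

sql = """
WITH candidates AS (
  SELECT i.text, s.curie, s.category, s.nlp_level
  FROM inputs i
  JOIN synonyms s
    ON (s.nlp_level = 0 AND s.synonym = i.text)
    OR (s.nlp_level = 1 AND s.synonym = i.text_one)
  WHERE s.taxon = :taxon          -- filter: taxon
    AND s.category != :avoid      -- filter: avoided category
),
ranked AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY text
    ORDER BY CASE WHEN category = :prioritize THEN 0 ELSE 1 END,  -- category priority
             nlp_level                                            -- exact case first
  ) AS rn
  FROM candidates
)
SELECT text, curie, category, nlp_level
FROM ranked
WHERE rn = 1                      -- deduplicate: keep the best match per input
ORDER BY text
"""
params = {"taxon": "9606", "avoid": "biolink:Protein", "prioritize": "biolink:Gene"}
rows = con.execute(sql, params).fetchall()
for r in rows:
    print(r)
```

In this toy data "TP53" resolves through its exact-case synonym (level 0), while "Tp53" has no exact match and falls back to the normalized synonym (level 1).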

Example Usage

from tablassert.fullmap import version4
from tablassert.enums import Categories
from pathlib import Path

# Resolve gene symbols to CURIEs
result = version4(
  p=Path("data.parquet"),
  col="gene_symbol",
  dbssert=Path("/data/dbssert.duckdb"),
  taxon="9606",  # Human only
  prioritize=[Categories.Gene],
  avoid=[Categories.Protein],
  tag=" one"
)

# Result DataFrame includes:
# - gene_symbol: "HGNC:11998"
# - gene_symbol name: "TP53"
# - gene_symbol category: "biolink:Gene"
# - etc.

NLP Processing Levels

Level 0 (exact case):

- Column: col
- Matches preserve case
- Preferred for acronyms, gene symbols

Level 1 (normalized):

- Column: col + " one"
- Lowercased, whitespace stripped
- Preferred for disease names, free text

The function tries level 0 first, then level 1.

Case-Dependent Behavior

"TP53" vs "tp53":

- Level 0: Only matches the "TP53" synonym
- Level 1: Matches any case variant

This preserves specificity for case-sensitive identifiers while allowing case-insensitive matching for general terms.
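The level 0 then level 1 order can be illustrated with two small lookup tables. This is a sketch only: the dictionaries and the `resolve` helper are hypothetical stand-ins for the database lookup, and the normalization is assumed to be strip plus lowercase.

```python
from typing import Optional

# Toy synonym tables, one per NLP level (contents are illustrative).
LEVEL0 = {"TP53": "HGNC:11998"}
LEVEL1 = {"tp53": "HGNC:11998", "tumor protein p53": "HGNC:11998"}


def resolve(text: str) -> Optional[tuple[str, int]]:
    """Return (curie, nlp_level), trying the exact-case table first."""
    if text in LEVEL0:
        return LEVEL0[text], 0
    normalized = text.strip().lower()
    if normalized in LEVEL1:
        return LEVEL1[normalized], 1
    return None


for query in ["TP53", "tp53", "Tumor Protein P53", "BRCA1"]:
    print(query, "->", resolve(query))
```

Only the exact string "TP53" is reported at level 0; the other case variants resolve at level 1, and a string absent from both tables stays unresolved.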

Provenance Tracking

Every resolved entity includes:

- Source database (HGNC, MONDO, ChEBI, etc.)
- Source version (database snapshot date)
- Matched synonym (which text triggered the match)

This enables auditing and quality control.
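One way such an audit might look is sketched below. The rows are made-up examples that follow the column naming from the Return Value section, and the checks are illustrative rather than what the qc module performs.

```python
from collections import Counter

# Made-up resolved rows, keyed by the documented output column names.
rows = [
    {
        "gene_symbol": "HGNC:11998",
        "gene_symbol source": "HGNC",
        "gene_symbol nlp level": 0,
        "gene_symbol synonym": "TP53",
    },
    {
        "gene_symbol": "HGNC:11998",
        "gene_symbol source": "HGNC",
        "gene_symbol nlp level": 1,
        "gene_symbol synonym": "tumor protein p53",
    },
]

# How many resolutions came from each source database.
by_source = Counter(r["gene_symbol source"] for r in rows)

# Synonyms that matched only after normalization, which may merit a closer look.
normalized_only = [
    r["gene_symbol synonym"] for r in rows if r["gene_symbol nlp level"] == 1
]

print(by_source)
print(normalized_only)
```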

Integration with QC

Entity resolution output is validated by fullmap_audit() from the qc module before being included in the knowledge graph.

See Quality Control for details.

Next Steps