Entity Resolution (fullmap)
The fullmap module provides entity resolution functionality, mapping text strings to standardized biological entities (CURIEs).
version4()
Primary entity resolution function using DuckDB queries against the dbssert database.
Function Signature
def version4(
    p: Path,
    col: str,
    dbssert: Path,
    taxon: Optional[str],
    prioritize: Optional[list[Categories]],
    avoid: Optional[list[Categories]],
    tag: str = " one"
) -> pl.DataFrame
Parameters
p: Path
Path to the input parquet file containing the DataFrame to process.
col: str
Column name containing text strings to resolve.
dbssert: Path
Path to the DuckDB entity resolution database.
This database contains:
- Synonym mappings (text → CURIE)
- Preferred entity names
- Biolink categories
- NCBI Taxon IDs
- Source databases and versions
taxon: Optional[str]
Optional NCBI Taxon ID for filtering results.
Example: "9606" filters to human-specific entities.
prioritize: Optional[list[Categories]]
Optional list of Biolink categories to prefer when multiple matches exist.
Example: [Categories.Gene, Categories.Protein] prefers gene/protein mappings.
avoid: Optional[list[Categories]]
Optional list of Biolink categories to exclude from results.
Example: [Categories.Gene] prevents gene mappings.
tag: str (default: " one")
Suffix that, appended to col, names the column holding the NLP-processed text.
The function looks for both:
- col (original text, case-preserved)
- col + tag (normalized text, typically lowercase)
Default " one" means it uses level-one text processing (lowercase, stripped).
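The relationship between the two columns can be sketched as follows. This is a minimal illustration, assuming level-one processing is just whitespace stripping plus lowercasing as described above. The helpers `level_one` and `add_level_one` are hypothetical and not part of the fullmap module, and rows are plain dictionaries rather than a Polars DataFrame.

```python
def level_one(text: str) -> str:
    """Assumed level-one normalization: strip outer whitespace, then lowercase."""
    return text.strip().lower()


def add_level_one(rows: list[dict], col: str, tag: str = " one") -> list[dict]:
    """Return copies of the rows with a `col + tag` entry holding normalized text."""
    return [{**row, col + tag: level_one(row[col])} for row in rows]


# Hypothetical input: the original text stays untouched under `col`.
rows = add_level_one(
    [{"gene_symbol": " TP53 "}, {"gene_symbol": "Brca1"}],
    col="gene_symbol",
)
print(rows[0])
```

With the default tag, the normalized text lands in a column named "gene_symbol one", which is the name version4() would look up alongside "gene_symbol".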
Return Value
Returns a Polars DataFrame with these columns added:
| Column | Description | Example |
|---|---|---|
| {col} | CURIE identifier | "HGNC:11998" |
| {col} name | Preferred entity name | "TP53" |
| {col} category | Biolink category | "biolink:Gene" |
| {col} taxon | NCBI Taxon ID | "NCBITaxon:9606" |
| {col} source | Source database | "HGNC" |
| {col} source version | Database version | "2025-01" |
| {col} nlp level | NLP processing level | 0 or 1 |
| {col} synonym | Matched synonym | "TP53" or "tumor protein p53" |
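Because every output column name is derived from col, downstream code can compute the expected names instead of hard-coding them. The sketch below assumes the suffixes are exactly those listed in the table. `SUFFIXES` and `output_columns` are hypothetical names, not part of the module.

```python
# Suffixes taken from the table above; the empty suffix is the CURIE column itself.
SUFFIXES = [
    "",
    " name",
    " category",
    " taxon",
    " source",
    " source version",
    " nlp level",
    " synonym",
]


def output_columns(col: str) -> list[str]:
    """Build the column names version4() is documented to produce for `col`."""
    return [col + suffix for suffix in SUFFIXES]


columns = output_columns("gene_symbol")
print(columns)
```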
DuckDB Query
The function executes a complex SQL query that:
- Ranks matches by:
    - Category priority (if prioritize is specified)
    - NLP level (exact case match preferred over normalized)
    - Source confidence
- Filters by:
    - Taxon ID (if specified)
    - Category avoidance (if specified)
- Deduplicates to one CURIE per row per input string
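The rank, filter, and deduplicate steps can be sketched with a window function. This is not the query version4() runs. It uses Python's built-in sqlite3 in place of DuckDB so that it runs without extra dependencies, and the tables, the columns (including `source_rank` as a stand-in for source confidence), and the rows are all assumptions made for illustration. It also takes a single prioritized and a single avoided category rather than lists.

```python
import sqlite3

# Hypothetical, simplified schema: NOT the real dbssert layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inputs (raw TEXT, raw_one TEXT);
CREATE TABLE synonyms (
    synonym TEXT, nlp_level INTEGER, curie TEXT,
    category TEXT, taxon TEXT, source_rank INTEGER
);
-- Illustrative rows only.
INSERT INTO inputs VALUES ('TP53', 'tp53'), ('Tp53', 'tp53');
INSERT INTO synonyms VALUES
    ('TP53', 0, 'HGNC:11998',       'biolink:Gene',    'NCBITaxon:9606',  1),
    ('TP53', 0, 'UniProtKB:P04637', 'biolink:Protein', 'NCBITaxon:9606',  1),
    ('tp53', 1, 'HGNC:11998',       'biolink:Gene',    'NCBITaxon:9606',  1),
    ('tp53', 1, 'NCBIGene:22059',   'biolink:Gene',    'NCBITaxon:10090', 2);
""")

QUERY = """
WITH ranked AS (
    SELECT i.raw, s.curie, s.category, s.nlp_level,
           ROW_NUMBER() OVER (
               PARTITION BY i.raw
               ORDER BY CASE WHEN s.category = :prefer THEN 0 ELSE 1 END,  -- category priority
                        s.nlp_level,                                       -- exact case first
                        s.source_rank                                      -- assumed confidence
           ) AS rn
    FROM inputs AS i
    JOIN synonyms AS s
      ON (s.nlp_level = 0 AND s.synonym = i.raw)
      OR (s.nlp_level = 1 AND s.synonym = i.raw_one)
    WHERE s.taxon = :taxon          -- taxon filter
      AND s.category <> :avoid      -- category avoidance
)
SELECT raw, curie, category, nlp_level
FROM ranked
WHERE rn = 1                        -- keep one CURIE per input string
ORDER BY raw
"""

params = {"prefer": "biolink:Gene", "avoid": "biolink:Disease", "taxon": "NCBITaxon:9606"}
rows = conn.execute(QUERY, params).fetchall()
print(rows)
```

In this toy data, "TP53" resolves through the level 0 synonym, while "Tp53" has no exact-case synonym and falls back to the level 1 match. The mouse entry is removed by the taxon filter.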
Example Usage
from tablassert.fullmap import version4
from tablassert.enums import Categories
from pathlib import Path

# Resolve gene symbols to CURIEs
result = version4(
    p=Path("data.parquet"),
    col="gene_symbol",
    dbssert=Path("/data/dbssert.duckdb"),
    taxon="9606",  # Human only
    prioritize=[Categories.Gene],
    avoid=[Categories.Protein],
    tag=" one"
)

# Result DataFrame includes:
# - gene_symbol: "HGNC:11998"
# - gene_symbol name: "TP53"
# - gene_symbol category: "biolink:Gene"
# - etc.
NLP Processing Levels
Level 0 (exact case):
- Column: col
- Matches preserve case
- Preferred for acronyms, gene symbols
Level 1 (normalized):
- Column: col + " one"
- Lowercased, whitespace stripped
- Preferred for disease names, free text
The function tries level 0 first, then level 1.
Case-Dependent Behavior
"TP53" vs "tp53":
- Level 0: Only matches "TP53" synonym
- Level 1: Matches any case variant
This preserves specificity for case-sensitive identifiers while allowing case-insensitive matching for general terms.
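The two-level fallback can be sketched in plain Python. The lookup tables and the `resolve` helper are hypothetical stand-ins for the synonym mappings in the dbssert database, and the sketch assumes level 1 normalization is stripping plus lowercasing.

```python
from typing import Optional

# Hypothetical synonym tables, one per NLP level.
EXACT = {"TP53": "HGNC:11998"}                                       # level 0: case preserved
NORMALIZED = {"tp53": "HGNC:11998", "tumor protein p53": "HGNC:11998"}  # level 1: lowercase


def resolve(text: str) -> Optional[tuple[str, int]]:
    """Return (CURIE, nlp level), trying the exact-case match before the normalized one."""
    if text in EXACT:
        return EXACT[text], 0
    key = text.strip().lower()
    if key in NORMALIZED:
        return NORMALIZED[key], 1
    return None


for candidate in ["TP53", "tp53", "Tumor Protein P53", "not a gene"]:
    print(candidate, "->", resolve(candidate))
```

Only the exact string "TP53" is reported at level 0. Every other case variant resolves at level 1, and unknown text returns None.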
Provenance Tracking
Every resolved entity includes:
- Source database (HGNC, MONDO, ChEBI, etc.)
- Source version (database snapshot date)
- Matched synonym (which text triggered the match)
This enables auditing and quality control.
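One simple audit is to count resolved rows per source and version. The sketch below works on plain dictionaries keyed by the documented column names. The `provenance_counts` helper and the sample rows (including the MONDO version string) are made up for illustration.

```python
from collections import Counter


def provenance_counts(rows: list[dict], col: str) -> Counter:
    """Count resolved rows per (source database, source version) pair."""
    return Counter(
        (row[f"{col} source"], row[f"{col} source version"]) for row in rows
    )


# Hypothetical resolved rows using the documented column naming.
resolved = [
    {"term source": "HGNC", "term source version": "2025-01", "term synonym": "TP53"},
    {"term source": "HGNC", "term source version": "2025-01", "term synonym": "tumor protein p53"},
    {"term source": "MONDO", "term source version": "2025-01", "term synonym": "breast cancer"},
]
counts = provenance_counts(resolved, "term")
print(counts.most_common())
```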
Integration with QC
Entity resolution output is validated by fullmap_audit() from the qc module before being included in the knowledge graph.
See Quality Control for details.
Next Steps
- Quality Control - Multi-stage validation
- Configuration - How to specify prioritize/avoid in YAML