Batch Resolution (lib)
The lib module exposes resolve_many(), a high-level convenience function for resolving an iterable of entity strings to CURIEs without requiring manual LazyFrame construction, NLP preprocessing, or DuckDB shard management.
It wraps the lower-level resolve() pipeline — applying level_one and level_two normalization, opening all 10 DuckDB shard connections, executing entity resolution, and returning results as a plain Python list of row dictionaries.
resolve_many()
Standalone batch entity resolution function. Accepts a column name, an iterable of text strings, and a path to the datassert database, then returns resolved CURIEs and metadata as a list of row dictionaries.
Function Signature
```python
def resolve_many(
    col: str,
    entities: Iterable[str],
    datassert: Path,
    taxon: Optional[str] = None,
    prioritize: Optional[list[Categories]] = None,
    avoid: Optional[list[Categories]] = None,
    column_context: bool = True,
) -> list[dict[str, Any]]
```
Parameters
col: str
Column name used internally to label the Polars Series and DataFrame columns during resolution. This name propagates through the NLP and resolution pipeline and determines the keys in each returned row dictionary.
For example, if col="gene", each returned row dictionary will contain keys like "gene", "gene name", "gene category", etc.
entities: Iterable[str]
An iterable of text strings to resolve. Each string is treated as a candidate entity name that will be normalized and matched against the datassert synonym database. Accepts any iterable — lists, tuples, generators, sets, etc.
Examples: ["TP53", "BRCA1", "EGFR"], ("aspirin", "ibuprofen"), or a generator expression.
datassert: Path
Filesystem path to the root of the datassert database directory. The function expects a data/ subdirectory containing 10 DuckDB shard files (0.duckdb through 9.duckdb).
Each shard contains:
- Synonym mappings (text → CURIE)
- Preferred entity names
- Biolink categories
- NCBI Taxon IDs
- Source databases and versions
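Before calling `resolve_many()`, it can be useful to verify that the shard layout described above actually exists. A minimal sketch, assuming only the documented `data/0.duckdb` through `data/9.duckdb` layout — `check_shards` is a hypothetical helper, not part of the library, which simply fails at `duckdb.connect()` time when a shard is missing:

```python
from pathlib import Path

def check_shards(datassert: Path) -> list[Path]:
    """Return the ten expected shard paths, raising if any is missing.

    Hypothetical helper for pre-flight validation; resolve_many() itself
    performs no such check and relies on duckdb.connect() to fail.
    """
    shards = [datassert / "data" / f"{i}.duckdb" for i in range(10)]
    missing = [p for p in shards if not p.exists()]
    if missing:
        raise FileNotFoundError(f"Missing shard files: {missing}")
    return shards
```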
taxon: Optional[str] (default: None)
Optional NCBI Taxon ID for filtering results to a specific organism.
Example: "9606" restricts matches to human-specific entities. When None, no taxon filtering is applied and matches from all organisms are returned.
prioritize: Optional[list[Categories]] (default: None)
Optional list of Biolink categories to prefer when multiple matches exist for the same input term. Categories listed here receive higher ranking scores during resolution.
Example: [Categories.Gene, Categories.Protein] prefers gene and protein mappings over other categories like diseases or chemicals.
avoid: Optional[list[Categories]] (default: None)
Optional list of Biolink categories to exclude from results entirely. Any match belonging to an avoided category is filtered out before ranking.
Example: [Categories.Gene] prevents gene mappings from appearing in the output, even if they would otherwise be the best match.
column_context: bool (default: True)
Controls category-frequency tie-breaking when multiple matches exist for a term. When True, the resolution query adds a category frequency score and prefers the category that appears most frequently across all terms in the batch. When False, frequency-based tie-breaking is disabled.
This is useful when resolving a column of related entities (e.g., all genes) — the shared context helps disambiguate terms that map to multiple categories.
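The idea behind this tie-breaking can be illustrated in plain Python. This is a sketch of the concept only — the real scoring happens inside the DuckDB resolution query, and the competing `MONDO` candidate below is hypothetical:

```python
from collections import Counter

# Candidate matches as (input term, CURIE, category) tuples.
candidates = [
    ("TP53", "HGNC:11998", "biolink:Gene"),
    ("TP53", "MONDO:0000001", "biolink:Disease"),  # hypothetical competing match
    ("BRCA1", "HGNC:1100", "biolink:Gene"),
    ("EGFR", "HGNC:3236", "biolink:Gene"),
]

# Category frequency across the whole batch (the "column context").
freq = Counter(cat for _, _, cat in candidates)

def best_match(term: str) -> tuple[str, str, str]:
    """Prefer the candidate whose category is most common in the batch."""
    options = [c for c in candidates if c[0] == term]
    return max(options, key=lambda c: freq[c[2]])

best_match("TP53")  # the Gene match wins: genes dominate this batch
```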
Return Value
Returns a list[dict[str, Any]] — one dictionary per resolved entity. The list is produced by calling polars.DataFrame.to_dicts() on the collected resolution output.
Each dictionary contains the following keys (where {col} is the value of the col parameter):
| Key | Description | Example Value |
|---|---|---|
| `original {col}` | Original input text before normalization | `"TP53"` |
| `{col}` | CURIE identifier | `"HGNC:11998"` |
| `{col} name` | Preferred entity name | `"TP53"` |
| `{col} category` | Biolink category (prefixed) | `"biolink:Gene"` |
| `{col} taxon` | NCBI Taxon ID (prefixed) | `"NCBITaxon:9606"` |
| `{col} source` | Source database | `"HGNC"` |
| `{col} source version` | Database version | `"2025-01"` |
| `{col} nlp level` | NLP processing level used for match | `1` or `2` |
Important: Only entities that successfully resolve to a CURIE are included in the output. Unresolved entities are filtered out by resolve(). The returned list may therefore be shorter than the input iterable.
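Because of this filtering, callers that need to report failures must diff the input against the output themselves. A minimal sketch using the documented `original {col}` key — `find_unresolved` is a hypothetical helper, shown here against a mocked result list so no database is needed:

```python
from typing import Any

def find_unresolved(entities: list[str], rows: list[dict[str, Any]], col: str) -> set[str]:
    """Report which input strings are absent from the resolved output."""
    resolved = {row[f"original {col}"] for row in rows}
    return set(entities) - resolved

# Mocked resolve_many() output: only "TP53" resolved.
rows = [{"original gene": "TP53", "gene": "HGNC:11998"}]
find_unresolved(["TP53", "NOTAGENE"], rows, "gene")  # → {"NOTAGENE"}
```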
Pipeline Internals
resolve_many() executes the following steps internally:
1. Series construction — Wraps the input iterable in a `pl.Series` with the given column name, then converts it to a single-column `pl.LazyFrame`.
2. NLP normalization — Applies `level_one()` (whitespace stripping + lowercasing) and `level_two()` (non-word character removal via `\W+`) to produce the two normalized columns required by `resolve()`.
3. DuckDB connection management — Opens all 10 shard connections inside a `contextlib.ExitStack`, ensuring every connection is properly closed when resolution completes or if an error occurs.
4. Entity resolution — Delegates to `fullmap.resolve()`, which queries the sharded DuckDB database, ranks matches by category priority, preferred-name exactness, NLP level, and category frequency, then deduplicates to one CURIE per input string.
5. Collection and conversion — Collects the lazy result into an eager `pl.DataFrame` and converts it to a list of row dictionaries via `to_dicts()`.
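The connection-management step above can be sketched with the same `ExitStack` pattern the source describes. This uses dummy context managers in place of real `duckdb.connect()` calls; only the shard count and the `ExitStack` idiom come from the documentation, everything else is mocked:

```python
from contextlib import ExitStack, contextmanager

@contextmanager
def fake_connection(shard: int):
    # Stand-in for duckdb.connect(f"data/{shard}.duckdb").
    yield f"conn-{shard}"

def with_all_shards() -> list[str]:
    with ExitStack() as stack:
        # Enter all ten shard connections; ExitStack remembers each one.
        conns = [stack.enter_context(fake_connection(i)) for i in range(10)]
        # ... resolution queries would run against conns here ...
        return conns
    # All ten connections are closed on exit, even if a query raised.

with_all_shards()  # opens and cleanly closes 10 dummy connections
```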
Example Usage
Basic Gene Resolution
```python
from pathlib import Path
from typing import Any
from tablassert.lib import resolve_many
from tablassert.enums import Categories

datassert: Path = Path("/path/to/datassert")
result: list[dict[str, Any]] = resolve_many(
    col="gene",
    entities=["TP53", "BRCA1", "EGFR", "KRAS"],
    datassert=datassert,
    taxon="9606",
    prioritize=[Categories.Gene],
)
# result[0] → {"original gene": "TP53", "gene": "HGNC:11998", "gene name": "TP53", ...}
# result[1] → {"original gene": "BRCA1", "gene": "HGNC:1100", "gene name": "BRCA1", ...}
```
Disease Resolution With Category Avoidance
```python
from pathlib import Path
from typing import Any
from tablassert.lib import resolve_many
from tablassert.enums import Categories

datassert: Path = Path("/path/to/datassert")
result: list[dict[str, Any]] = resolve_many(
    col="disease",
    entities=["diabetes mellitus", "breast cancer", "alzheimer disease"],
    datassert=datassert,
    avoid=[Categories.Gene, Categories.Protein],
)
# result[0] → {"original disease": "diabetes mellitus", "disease": "MONDO:0005015", ...}
# result[1] → {"original disease": "breast cancer", "disease name": "breast cancer", ...}
```
Chemical Resolution Without Column Context
```python
from pathlib import Path
from typing import Any
from tablassert.lib import resolve_many

datassert: Path = Path("/path/to/datassert")
result: list[dict[str, Any]] = resolve_many(
    col="chemical",
    entities=["aspirin", "metformin", "ibuprofen"],
    datassert=datassert,
    column_context=False,
)
```
Consuming Results
```python
import polars as pl
from pathlib import Path
from typing import Any
from tablassert.lib import resolve_many

datassert: Path = Path("/path/to/datassert")
result: list[dict[str, Any]] = resolve_many(
    col="gene",
    entities=["TP53", "BRCA1"],
    datassert=datassert,
    taxon="9606",
)

# Convert back to a Polars DataFrame
df: pl.DataFrame = pl.DataFrame(result)

# Or iterate over resolved rows
for row in result:
    print(f"{row['gene name']} → {row['gene']}")
```
Comparison With resolve()
| Aspect | `resolve_many()` | `resolve()` |
|---|---|---|
| Module | `tablassert.lib` | `tablassert.fullmap` |
| Input | Plain iterable of strings | Pre-normalized `pl.LazyFrame` |
| NLP | Applied automatically | Must be applied upstream |
| Connections | Managed internally via `ExitStack` | Must be opened externally |
| Output | `list[dict[str, Any]]` | `pl.LazyFrame` |
| Logging | Uses default (`log=True`) | Configurable |
| Context params | Not exposed (`section_hash`, `config_file`, `tag`) | Fully configurable |
| Use case | Standalone batch lookups, scripting, notebooks | Internal pipeline integration |
resolve_many() is designed for ad-hoc and programmatic use — scripts, notebooks, and one-off lookups. For pipeline integration where you need full control over logging, context metadata, and lazy evaluation, use resolve() directly.
NLP Processing
resolve_many() applies both NLP normalization levels before resolution:
Level one — level_one(lf, col):
- Strips leading/trailing whitespace
- Converts to lowercase
- Output column: {col} (overwrites the original)
Level two — level_two(lf, col):
- Removes all non-word characters (\W+ → "") from the level-one result
- Output column: {col} two
Both levels are queried during resolution. Level one (exact case-insensitive match) is preferred; level two is used as a fallback for terms with punctuation or special characters.
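The two levels can be expressed in plain Python for reference. This is an illustrative equivalent only, assuming the behavior documented above — the actual `level_one()`/`level_two()` operate on Polars LazyFrame columns, not single strings:

```python
import re

def level_one_py(text: str) -> str:
    # Level one: strip leading/trailing whitespace, lowercase.
    return text.strip().lower()

def level_two_py(text: str) -> str:
    # Level two: remove all non-word characters (\W+) from the level-one result.
    return re.sub(r"\W+", "", level_one_py(text))

level_one_py("  Alzheimer Disease ")    # → "alzheimer disease"
level_two_py("  Alzheimer's Disease ")  # → "alzheimersdisease"
```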
Error Handling
- If the `datassert` path does not contain the expected shard files, `duckdb.connect()` will raise an `IOException`.
- If `entities` is empty, the function returns `[]`.
- The `ExitStack` ensures all 10 DuckDB connections are closed even if resolution raises an exception.
- Unresolved entities are silently filtered from the output (logged at INFO level by default via `resolve()`).
Integration
resolve_many() is a self-contained entry point. It does not require any prior setup beyond having a datassert database available. For full pipeline builds, use the CLI (tablassert build-knowledge-graph) which orchestrates resolution through the Tcode class.
Next Steps
- Entity Resolution — Lower-level `resolve()` function details
- Quality Control — Multi-stage validation of resolved entities
- Configuration — YAML-driven entity resolution settings