# Datassert

Datassert is a high-performance CLI for building a DuckDB-backed assertion store from NCATS Translator BABEL export files, with a focus on fast local builds and simple command-driven workflows. It produces the entity-resolution database used by Tablassert, containing biological synonyms, CURIEs, Biolink categories, taxon IDs, and source provenance, enabling `resolve()` to map free-text strings to standardized identifiers.
## Installation

```sh
# Install CLI from GitHub
go install github.com/SkyeAv/datassert@latest

# Verify install
datassert --help
```
## Build Command

```sh
# Build a Datassert database (downloads BABEL data automatically)
datassert build
```

The `build` command automatically downloads BABEL exports from RENCI (https://stars.renci.org/var/babel_outputs), processes them, and produces sharded DuckDB databases.
### Flags

| Flag | Required | Default | Description |
|---|---|---|---|
| `--skip-downloads` / `-s` | No | `false` | Skip the BABEL download phase (use previously downloaded files) |
| `--use-existing-parquets` / `-p` | No | `false` | Use existing Parquet files to rebuild DuckDB databases |
## Data Pipeline

- Download — BABEL class and synonym files are downloaded from RENCI and split into LZ4-compressed NDJSON chunks under `./datassert/downloads/`.
- Lookup — Class files (`*.ndjson.lz4`) are read to build an in-memory equivalent-identifier lookup.
- Parquet Staging — Synonym files are processed with the lookup, quality-controlled, and written as sharded Parquet files to `./datassert/parquets/`.
- DuckDB Generation — Parquet files are loaded into 10 sharded DuckDB databases under `./datassert/data/`.
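The downloaded chunks can be inspected directly if you want to see what the pipeline consumes. A minimal sketch, assuming the chunks are standard LZ4-frame-compressed NDJSON readable by the Python `lz4` package; the file name below is hypothetical:

```python
# Sketch: peek at one downloaded BABEL chunk.
# Assumes LZ4-frame compression; the chunk file name is illustrative.
import json

import lz4.frame

with lz4.frame.open("./datassert/downloads/chunk-0.ndjson.lz4", mode="rt") as fh:
    first = json.loads(next(fh))  # NDJSON: one JSON object per line
    print(first)
```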
## Examples

```sh
# Full build (download, process, and generate databases)
datassert build

# Skip downloads if BABEL files were already fetched
datassert build --skip-downloads

# Rebuild DuckDB databases from existing Parquet files
datassert build --use-existing-parquets
```
## Runtime Behavior

- Displays progress bars for the download, class lookup, synonym processing, and DuckDB build phases.
- Uses 90% of available CPUs for concurrent processing.
- Downloads are retried up to 3 times on failure, with a 10-second backoff between attempts.
- All working files are stored under `./datassert/`.
## Output Artifacts

- 10 sharded DuckDB databases are written to `./datassert/data/{0..9}.duckdb`.
- Each shard contains `SOURCES`, `CATEGORIES`, `CURIES`, and `SYNONYMS` tables, deduplicated, sorted, and indexed for query performance.
- Staging Parquet files are written to `./datassert/parquets/{0..9}/`.
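As a quick sanity check after a build, the shards can be opened read-only and their tables listed. A sketch using the `duckdb` Python package; this is not part of the Datassert CLI itself:

```python
# Sketch: verify every shard exists and exposes the four expected tables.
import duckdb

EXPECTED = {"SOURCES", "CATEGORIES", "CURIES", "SYNONYMS"}

for i in range(10):
    con = duckdb.connect(f"./datassert/data/{i}.duckdb", read_only=True)
    tables = {row[0] for row in con.execute("SHOW TABLES").fetchall()}
    assert EXPECTED <= tables, f"shard {i} is missing {EXPECTED - tables}"
    con.close()
```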
Terms are routed to shards deterministically via `xxhash64(term) % 10`, so a given string always hits the same shard.
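This routing can be reproduced in Python to find which shard holds a given term without opening all ten databases. A sketch, assuming an unseeded xxh64 over the UTF-8 bytes of the term (the seed and any term normalization are assumptions, not documented behavior):

```python
# Sketch: mirror xxhash64(term) % 10 shard routing.
# Assumes the default (zero) seed and UTF-8 encoding of the term.
import xxhash

def shard_for(term: str, n_shards: int = 10) -> int:
    return xxhash.xxh64(term.encode("utf-8")).intdigest() % n_shards

print(shard_for("tp53"))  # the same string always maps to the same shard
```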
## Schema

Each shard contains four tables:

| Table | Key Columns | Description |
|---|---|---|
| `SYNONYMS` | `SYNONYM`, `CURIE_ID`, `SOURCE_ID` | Text synonym → CURIE mapping |
| `CURIES` | `CURIE_ID`, `CURIE`, `PREFERRED_NAME`, `TAXON_ID`, `CATEGORY_ID` | Canonical identifiers and preferred names |
| `CATEGORIES` | `CATEGORY_ID`, `CATEGORY_NAME` | Biolink category names |
| `SOURCES` | `SOURCE_ID`, `SOURCE_NAME`, `SOURCE_VERSION` | Source database and version provenance |
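The schema implies a straightforward join path from a synonym to its canonical identifier and provenance. A sketch of a manual lookup against a single shard (the shard choice and the example synonym are illustrative; in practice `resolve()` performs the lookup for you):

```python
# Sketch: resolve one synonym by hand via the four-table join.
import duckdb

con = duckdb.connect("./datassert/data/0.duckdb", read_only=True)
rows = con.execute(
    """
    SELECT c.CURIE, c.PREFERRED_NAME, c.TAXON_ID,
           cat.CATEGORY_NAME, s.SOURCE_NAME, s.SOURCE_VERSION
    FROM SYNONYMS AS syn
    JOIN CURIES AS c ON c.CURIE_ID = syn.CURIE_ID
    JOIN CATEGORIES AS cat ON cat.CATEGORY_ID = c.CATEGORY_ID
    JOIN SOURCES AS s ON s.SOURCE_ID = syn.SYNONYM IS NOT NULL AND s.SOURCE_ID = syn.SOURCE_ID
    WHERE syn.SYNONYM = ?
    """.replace("syn.SYNONYM IS NOT NULL AND s.SOURCE_ID = ", ""),  # keep the join simple
    ["tp53"],
).fetchall()
print(rows)
con.close()
```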
## Usage in Graph Config

The `datassert:` field in a GC2 graph configuration points to the directory containing the shards. Tablassert opens all 10 shards at startup and passes the connections to `resolve()`.

```yaml
# graph-config.yaml (GC2)
syntax: GC2
name: my-graph
version: "1.0"
datassert: /path/to/datassert/  # directory containing data/0..9.duckdb
tables:
  - ./TABLE/my-table.yaml
```
## Programmatic Usage

When calling `resolve()` directly, open the shard connections yourself:

```python
import duckdb

from tablassert.fullmap import resolve

datassert_dir = "/path/to/datassert"

# Open all 10 shards read-only; pass these connections to resolve().
conns = [
    duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
    for i in range(10)
]
```

See Entity Resolution for the full `resolve()` API.