Drug pricing and classification data is the operational backbone of every formulary management system. For many PBMs and health plans, that data comes from Medi-Span, Wolters Kluwer's comprehensive drug data platform. Getting the weekly data ingestion pipeline right from the start is critical because every downstream process depends on it: formulary updates, claims adjudication, rebate calculations, and regulatory reporting.
This article covers the architecture patterns we have found most effective for building reliable, maintainable Medi-Span data integration pipelines.
Understanding the Data
Medi-Span data feeds typically include several core file types delivered weekly:
- Master Drug Data Base (MDDB) - The comprehensive drug file containing NDCs, GPI codes (Generic Product Identifier), drug names, strengths, forms, and packaging information.
- Price Files - AWP (Average Wholesale Price), WAC (Wholesale Acquisition Cost), and other pricing benchmarks. These change weekly and are critical for claims pricing.
- Drug-Drug Interaction - Pre-computed interaction pairs with severity levels and clinical monographs.
- Drug-Disease Contraindication - Mappings between drugs and ICD-10 codes where the drug is contraindicated.
- Duplicate Therapy - Identification of drugs in the same therapeutic class for duplication alerts.
The files are typically flat-file format (fixed-width or pipe-delimited) and can be large. The full MDDB file contains hundreds of thousands of records. Weekly updates include both full replacement files and delta (change-only) files.
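Before designing any schema, it helps to see what handling a single record looks like in code. The sketch below parses one pipe-delimited record into named fields; the field layout and sample values here are illustrative placeholders, not the actual MDDB record layout, which is defined in the Medi-Span file documentation for your licensed feed.

```python
import csv
from io import StringIO

def parse_pipe_delimited(raw_text, field_names):
    """Parse pipe-delimited feed text into dicts keyed by field name."""
    reader = csv.reader(StringIO(raw_text), delimiter="|")
    return [dict(zip(field_names, row)) for row in reader]

# Hypothetical field order -- consult the vendor record layout spec.
FIELDS = ["ndc", "gpi", "drug_name", "strength", "awp_unit_price"]

sample = "00071015523|58200060100310|ATORVASTATIN|10 MG|4.2500"
records = parse_pipe_delimited(sample, FIELDS)
```

Fixed-width variants of the same files need offset-based slicing instead of a delimiter, but the downstream staging logic is identical once records are dicts.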
Pattern 1: Immutable Staging with Version Control
The most important architectural decision is to never overwrite production data directly with incoming feed data. Instead, load each weekly feed into an immutable staging area with a version identifier:
-- Schema: versioned staging tables
CREATE TABLE staging.drug_master (
    feed_version    DATE NOT NULL,  -- e.g., 2025-05-05
    ndc             VARCHAR(11) NOT NULL,
    gpi             VARCHAR(14),
    drug_name       VARCHAR(200),
    strength        VARCHAR(50),
    dosage_form     VARCHAR(50),
    route           VARCHAR(50),
    awp_unit_price  DECIMAL(12,4),
    wac_unit_price  DECIMAL(12,4),
    obsolete_date   DATE,
    loaded_at       TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (feed_version, ndc)
);

-- Index for efficient version comparison
CREATE INDEX idx_drug_master_ndc ON staging.drug_master(ndc);
This pattern gives you several critical capabilities: rollback to any previous week's data, side-by-side comparison between weeks, and a complete audit trail of every data change.
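The load step itself is append-only: every week's records are inserted under a new `feed_version`, and prior versions are never updated or deleted. A minimal loader sketch, assuming a DB-API connection such as psycopg2 and records parsed into dicts (the helper names here are illustrative):

```python
def staging_rows(feed_version, records):
    """Map parsed feed records to parameter tuples for the staging insert."""
    return [
        (feed_version, r["ndc"], r.get("gpi"),
         r.get("drug_name"), r.get("awp_unit_price"))
        for r in records
    ]

def load_feed_to_staging(conn, feed_version, records):
    """Append one week's feed to staging; prior versions are never touched."""
    sql = """
        INSERT INTO staging.drug_master
            (feed_version, ndc, gpi, drug_name, awp_unit_price)
        VALUES (%s, %s, %s, %s, %s)
    """
    with conn.cursor() as cur:
        cur.executemany(sql, staging_rows(feed_version, records))
    conn.commit()
```

Because the insert never mutates existing rows, rollback is simply pointing production views or promotion logic at an earlier `feed_version`.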
Pattern 2: Change Detection Pipeline
Once the staging data is loaded, a change detection step identifies what actually changed from the previous week. This is where most naive implementations go wrong, because they try to do a full table swap instead of understanding the delta:
def detect_changes(current_version, previous_version, db):
    """Identify new, modified, and discontinued NDCs.

    Each version is filtered in its own CTE before the FULL OUTER JOIN.
    Filtering the current version in the outer WHERE clause would
    silently drop DISCONTINUED rows, whose current-side columns are all
    NULL and therefore never match the version predicate.
    """
    query = """
        WITH curr AS (
            SELECT * FROM staging.drug_master WHERE feed_version = %s
        ), prev AS (
            SELECT * FROM staging.drug_master WHERE feed_version = %s
        )
        SELECT
            COALESCE(c.ndc, p.ndc) AS ndc,
            CASE
                WHEN p.ndc IS NULL THEN 'NEW'
                WHEN c.ndc IS NULL THEN 'DISCONTINUED'
                ELSE 'MODIFIED'
            END AS change_type,
            c.awp_unit_price AS new_price,
            p.awp_unit_price AS old_price,
            c.gpi AS new_gpi,
            p.gpi AS old_gpi
        FROM curr c
        FULL OUTER JOIN prev p ON c.ndc = p.ndc
        -- IS DISTINCT FROM treats NULL as a comparable value; a plain !=
        -- returns NULL (and so misses the change) when either side is NULL.
        WHERE p.ndc IS NULL
           OR c.ndc IS NULL
           OR c.awp_unit_price IS DISTINCT FROM p.awp_unit_price
           OR c.gpi IS DISTINCT FROM p.gpi
           OR c.drug_name IS DISTINCT FROM p.drug_name
    """
    return db.execute(query, [current_version, previous_version])
Pattern 3: Impact Assessment Layer
Raw data changes are not actionable until they are assessed against the current formulary. A price change on an NDC that is not on your formulary is noise. A GPI reclassification on a Tier 1 drug with $50M in annual claims is a critical event. The impact assessment layer connects data changes to business context:
def assess_impact(changes, formulary, claims_data):
    """Score each change by business impact."""
    assessments = []
    for change in changes:
        ndc = change["ndc"]
        formulary_entry = formulary.get(ndc)
        if not formulary_entry:
            continue  # Not on our formulary
        annual_spend = claims_data.get_annual_spend(ndc)
        member_count = claims_data.get_unique_members(ndc)
        impact_score = 0
        alerts = []
        if change["change_type"] == "DISCONTINUED":
            impact_score = 100  # Always critical
            alerts.append("NDC discontinued - replacement needed")
        elif change["change_type"] == "MODIFIED":
            # Falsy check also guards against dividing by a zero old price.
            if change.get("new_price") and change.get("old_price"):
                price_delta = (
                    (change["new_price"] - change["old_price"])
                    / change["old_price"]
                )
                if abs(price_delta) > 0.10:
                    impact_score += 50
                    alerts.append(f"Price changed {price_delta:.1%}")
            if change.get("new_gpi") != change.get("old_gpi"):
                impact_score += 40
                alerts.append("GPI reclassification")
        assessments.append({
            "ndc": ndc,
            "tier": formulary_entry["tier"],
            "annual_spend": annual_spend,
            "member_count": member_count,
            "impact_score": impact_score,
            "alerts": alerts,
        })
    return sorted(assessments, key=lambda x: x["impact_score"], reverse=True)
Pattern 4: Automated Validation Gates
Before any data moves from staging to production, automated validation gates must pass. These catch data quality issues that would otherwise propagate through the entire system:
- Record count validation: The new feed should be within 2% of the previous feed's record count. A 50% drop indicates a truncated file.
- Price reasonableness: Flag any NDC with a price change greater than 200% for manual review.
- Referential integrity: All GPI codes in the drug file should exist in the GPI reference table.
- Completeness checks: Required fields (NDC, drug name, GPI) should never be null.
If any gate fails, the pipeline halts and alerts the operations team. The previous week's data remains in production until the issue is resolved. This is why the immutable staging pattern is essential: you always have a clean fallback.
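Most of these gates reduce to simple checks over the staged records. A sketch of the first three, assuming records are dicts with the staging columns and using the thresholds from the list above (2% count drift, 200% price swing); the referential-integrity gate is omitted here because it needs the GPI reference table:

```python
def run_validation_gates(current, previous):
    """Return a list of gate-failure messages; empty means all gates passed."""
    failures = []

    # Record count validation: within 2% of last week's count.
    if previous:
        drift = abs(len(current) - len(previous)) / len(previous)
        if drift > 0.02:
            failures.append(f"record count drifted {drift:.1%}")

    # Completeness: required fields must never be null/empty.
    for rec in current:
        for field in ("ndc", "drug_name", "gpi"):
            if not rec.get(field):
                failures.append(f"missing {field} on NDC {rec.get('ndc')}")
                break

    # Price reasonableness: flag swings greater than 200% week-over-week.
    prev_prices = {r["ndc"]: r.get("awp_unit_price") for r in previous}
    for rec in current:
        old = prev_prices.get(rec["ndc"])
        new = rec.get("awp_unit_price")
        if old and new and abs(new - old) / old > 2.0:
            failures.append(f"price change >200% on NDC {rec['ndc']}")

    return failures
```

Returning messages rather than raising lets the pipeline report every failing gate in a single alert instead of stopping at the first one.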
Scheduling and Monitoring
Medi-Span data typically arrives on a predictable weekly schedule (often Monday or Tuesday). The ingestion pipeline should run automatically on a schedule with enough buffer time to handle delivery delays. A typical schedule looks like:
- Monday 6 AM: Check for new data files in the SFTP drop location
- Monday 6:15 AM: Load into staging tables with version tag
- Monday 6:30 AM: Run validation gates
- Monday 7:00 AM: Run change detection and impact assessment
- Monday 7:30 AM: Send impact report to formulary team
- Monday 9:00 AM: Promote to production (automatically if all gates pass, or after manual approval for high-impact changes)
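The steps above can be sketched as a sequential runner that halts on the first failure, leaving the previous week's data in production. This is a minimal orchestration sketch, not a replacement for a real scheduler such as cron or Airflow; the step names are illustrative:

```python
import logging

def run_weekly_pipeline(steps):
    """Run (name, callable) steps in order; each returns True on success.

    Halting on the first failure is what makes the validation gates
    effective: nothing downstream of a failed gate ever runs.
    """
    for name, step in steps:
        logging.info("running step: %s", name)
        if not step():
            logging.error("step failed, halting pipeline: %s", name)
            return False
    return True
```

Wiring the load, validate, detect, assess, and promote functions into this list keeps the schedule logic in one place and the halt-on-failure behavior uniform.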
Building this pipeline correctly on the first pass avoids months of technical debt. The patterns here are not complex, but they require discipline: immutable staging, change detection, impact assessment, and automated validation. Skip any one of these and the pipeline becomes a source of operational risk rather than operational confidence.