Enterprise Document Intelligence & Compliance Automation
Processing hundreds of enterprise documents through AI pipelines required thousands of lines of glue code. A 42-line YAML file replaced it.
The Problem
Every new AI document pipeline started the same way: a standalone Python script that coupled ingestion, transformation, and output into a single monolith. Each team rebuilt the same patterns from scratch —API authentication, error handling, parallelism, retry logic.
A 1,500-line async Python script for vision-based document enrichment was the tipping point. It cloned Git repositories, enriched markdown with vision models, rewrote links, and indexed into FAISS —all tightly coupled. Changing the embedding model meant rewriting the indexing logic. Swapping the vector store meant rewriting the output layer. Prompt engineers needed developer involvement to iterate on extraction quality.
Scaling any of these scripts to distributed execution required a full rewrite. There was no reuse between pipelines, no separation of concerns, and no way to compose existing capabilities into new workflows without writing code.
Architecture
The Pipeline SDK (PDK) follows a linear Source -> Processors -> Destination execution model. Pipelines are defined entirely in YAML and executed by a runner that manages Spark sessions, component lifecycle, and status tracking. Every component operates on a canonical DataFrame schema, meaning any source can feed any processor, and any processor can feed any destination.
Every component reads and writes the same four-column DataFrame:
| Column | Type | Purpose |
|---|---|---|
document_id | string | SHA256 hash uniquely identifying the document |
document_uri | string | Source location (e.g., file://, box://, scar://) |
document_metadata | map<string,string> | Extensible key-value metadata accumulated across stages |
document_content | string | Document body —text, JSON, or markdown |
Key Decision
PDK wraps SimpleProcessor implementations in Spark UDFs rather than requiring component authors to write Spark code directly. This means any Python function can become a distributed processor, but it introduces a serialization boundary —component configs must round-trip through JSON, ruling out file handles or database connections in config objects. Components initialize these resources in __init__, which runs once per Spark partition.
For operations that need cross-document context (joins, aggregations, windowing), the SparkAware* base classes provide full DataFrame and SparkSession access.
Use Cases
PDK supports seven pipeline configurations in production. Four are detailed below; the framework also handles knowledge base indexing with vector search, stream processing with link validation, and multi-document fusion for cross-file analysis.
Document Intelligence with LLMs
Ingests documents from Box, applies multi-modal conversion (text + images), and runs LLM-based structured extraction. The converter routes PDFs to PyMuPDF and Office formats to Docling automatically via nested component composition.
source:
class_name: pipelines.sources.box.BoxSource
config:
dir_id: "364548964496"
extension: "pdf,docx"
converter:
class_name: pipelines.converters.unified_converter.UnifiedDocumentConverter
config:
extract_images: true
image_dpi: 150
processors:
- class_name: pipelines.processors.watsonxai.WatsonxAIProcessor
config:
model_id: meta-llama/llama-3-3-70b-instruct
max_new_tokens: 131000
temperature: 0
- Nested component composition: The
UnifiedDocumentConverteris configured inside the source, routing file types to the appropriate converter automatically - Multi-modal processing: The converter extracts both text and base64-encoded images, which the LLM can analyze together
- Model flexibility: Swapping models (Granite, LLaMA, Mistral) requires changing one YAML field —no code changes
Security Compliance Scoring
The most complex pipeline in the ecosystem. Reads compliance data from four Excel spreadsheets, normalizes hostnames, scores hosts across six security dimensions, aggregates to squad level, enriches with live SRE data from four external APIs, runs AI analysis, and generates multi-format scorecards.
Scoring thresholds and RAG bands are defined in a separate YAML file, allowing security teams to adjust without touching code:
patch_management:
checks:
devices_vulnerable: 2
overdue_patches: 3
missing_edr: 2
no_data_penalty: 5
rag_thresholds:
dimension:
green_max: 0
amber_max: 3
total:
green_max: 5
amber_max: 15
- Deep processor chaining: Five SparkAware processors execute sequentially, each transforming the DataFrame and accumulating metadata across stages
- External data enrichment: The SRE enrichment processor calls four external APIs (ServiceNow incidents, CMDB relationships, Instana events, vector store research)
- Multi-format output: The destination generates Excel workbooks with RAG color coding, HTML summaries via Jinja2, and JSON exports —all from a single component
Documentation Publishing
Converts documents and publishes them directly to GitHub as pull requests for MkDocs sites. When no transformation is needed, the processor stage is skipped entirely.
source:
class_name: pipelines.sources.file_system.FileSystemSource
config:
root_dir: examples/data
file_name_pattern: "*.md"
converter:
class_name: pipelines.converters.docling_file_converter.DoclingFileConverter
processors: []
destination:
class_name: pipelines.destinations.github_mkdocs.GitHubMkDocsDestination
config:
github_mkdocs:
repo_url: "<GITHUB_REPO_URL>"
branch_name: "<FEATURE_BRANCH_NAME>"
dest_branch: "<DESTINATION_BRANCH_NAME>"
- Zero-processor pipelines:
processors: []skips the processing stage entirely when no transformation is needed - Git-native destination: Clones the repo, creates a feature branch, commits the document, pushes, and opens a pull request —all automatically
- Template-driven content: Commit messages, PR titles, and PR bodies support template variables (
{document_id},{original_filename},{changed_files_markdown})
SRE Knowledge Enrichment with Vision
A 1,500-line standalone async Python script was decomposed into four reusable PDK components. Spark partitions replaced asyncio.gather for parallelism, and YAML configuration replaced CLI arguments.
| Script Pattern | PDK Equivalent |
|---|---|
| Global state, singletons | Per-partition instantiation in __init__ |
asyncio.gather + semaphores | Spark partition-level parallelism |
| CLI args + environment variables | YAML config + Pydantic validation |
| Per-file try/except with continue | UDF wrapper + error tracking |
| Monolithic function (clone -> enrich -> index) | Source -> Processor -> Destination chain |
- Monolith decomposition: Each responsibility in the original script became one PDK component —tightly coupled functions stay together, independently useful functions become separate components
- Multi-modal vision processing: Detects image references in markdown, resolves local and external images, and routes to different code paths based on image count
- Composable destinations: The same source and processors can feed FAISS, Milvus, local files, or GitHub Pages with no code changes —swap one YAML block
Performance and Scalability
Execution Model
Every component type supports two execution strategies. The runner inspects the class hierarchy to automatically choose the correct one:
- UDF (
Simple*base classes): Wrapped in a PySpark UDF, executed per-row on executors. Best for stateless per-document operations like API calls and LLM inference - DataFrame (
SparkAware*base classes): Receives SparkSession + DataFrame, returns DataFrame. Best for joins, aggregations, and multi-document operations
Approximate per-document latency varies significantly by processor type, with LLM API calls dominating:
Scaling Strategy
| Component | Parallelism | Bottleneck | Strategy |
|---|---|---|---|
| BoxSource | Per-file UDF | Box API rate limits | Batch via merge_documents mode |
| WatsonxAIProcessor | Per-document UDF | LLM API latency (~2-10s/doc) | Scale Spark executors |
| VisionEnrichmentProcessor | Per-document UDF | Vision API latency (~5-15s/doc) | Scale executors; tune max_images_per_call |
| NormalizationProcessor | Full DataFrame join | Memory for pandas conversion | Increase driver memory |
| MilvusDestination | Per-document chunk+embed | Embedding generation | Batch embedding APIs |
| KafkaBatchAdapter | Spark partitions | Consumer throughput | Increase Kafka partitions |
Key insight: For LLM-based processors, the bottleneck is always API latency, not framework overhead. Adding Spark executors directly scales LLM throughput —each executor makes independent API calls in parallel.
Tradeoffs
- Cold start per partition: Each Spark partition re-instantiates the component from serialized config, including API client authentication. Fewer partitions reduce cold starts but limit parallelism
- Synchronous UDF execution: Unlike standalone scripts using
asyncio.gather, UDFs execute synchronously. Spark provides inter-document parallelism at the partition level; for intra-document concurrency, components useThreadPoolExecutorinsideprocess() - Serialization boundary: All component configuration must round-trip through JSON. File handles, database connections, and closures are ruled out in configs. Large system prompts should use file references rather than inline config
- Statelessness contract: Global singletons are replaced by per-partition instances. This eliminates contention but increases resource usage. Broadcast variables offer a future optimization path
Customization
Zero-Code: YAML Assembly
Many pipeline variations require no code changes:
- Swap converters: Change
class_namefromDoclingFileConvertertoPyMuPDFConverterfor lighter-weight PDF processing - Change LLM providers: Replace
WatsonxAIProcessorwithBoxAIProcessororOllamaProcessorby changing the processor block - Switch destinations: Move from local files to Milvus, Box, or GitHub Pages by changing one YAML block
- Adjust scoring thresholds: Security teams modify
scoring_config.yamlwithout touching pipeline definitions
LLM-based processors treat prompts as configuration, enabling prompt engineers to iterate without developer involvement:
processors:
- class_name: pipelines.processors.watsonxai.WatsonxAIProcessor
config:
model_id: meta-llama/llama-3-3-70b-instruct
temperature: 0
top_p: 0.45
system_prompt: |
You are a Universal File Insight Extractor...
# CONTROLS
FORMAT=MD | EVIDENCE=QUOTE | PII=redact
# CHAIN OF THOUGHT
Phase 1: Knowledge Discovery
Phase 2: Knowledge Aggregation
Phase 3: Final Document Generation
Minimal-Code: New Components
Building a new component requires extending one base class and implementing one method. The framework handles Spark UDF wrapping, serialization, error handling, and metadata propagation —no Spark knowledge required.
class MySource(SimpleSource):
def list_documents(self) -> list[Document]:
# Return documents to process
def read_document(self, file_uri: str) -> tuple[DocumentContent, PipelineComponentResult]:
# Read and return document content
class MyProcessor(SimpleProcessor):
def process(self, document_id, document_uri, document_metadata, document_content) -> DocumentContent:
# Transform content and/or metadata
class MyDestination(SimpleDestination):
def write(self, document_id, document_uri, document_metadata, document_content) -> None:
# Persist the document
Component Ecosystem
The PDK ecosystem spans sources, processors, destinations, and supporting infrastructure:
| Service | Role in Pipeline |
|---|---|
| Box.com | Document source and destination (developer token + JWT auth) |
| IBM watsonx.ai | LLM inference (text + vision) and embeddings |
| Ollama | Local LLM inference and embeddings |
| Milvus | Vector database for RAG pipelines |
| FAISS | Local vector indexing for RAG pipelines |
| Apache Kafka | Streaming document source |
| GitHub | Documentation publishing via PR automation |
| PostgreSQL | Metadata storage and pipeline execution history |
| Redis | Caching layer for processors with TTL |
| ServiceNow / Instana | SRE enrichment for compliance pipelines |
Results
Development Velocity:
- 1,500 lines of async Python replaced by 42 lines of YAML
- Prompt engineers iterate on extraction quality without developer involvement
- New pipeline assembly from existing components in hours, not weeks
Operational Scale:
- 6 sources, 14+ processors, 6 destinations composable in any combination
- Dual execution model scales from single-machine to multi-executor Spark clusters
- 4 converters handle 9 file types: PDF, DOCX, PPTX, XLSX, DOC, PPT, XLS, HTML, Markdown
Separation of Concerns:
- Infrastructure engineers manage Spark cluster configuration
- Component developers implement focused single-responsibility classes
- Prompt engineers tune LLM behavior in YAML
- Domain experts define scoring thresholds in external config files
Technology Stack
Framework: PySpark, Python, YAML/Pydantic configuration
AI and ML: watsonx.ai, Box AI, Ollama, Milvus, FAISS
Document Processing: Docling, PyMuPDF, LangChain chunkers, HuggingFace tokenizers
Infrastructure: Apache Kafka, PostgreSQL, Redis, MinIO/S3, Apache Iceberg
Integrations: Box.com, GitHub, ServiceNow, Instana, Ansible
Conclusion
The core insight behind PDK is that a canonical DataFrame schema, a dual execution model, and YAML-driven composition can separate concerns cleanly enough that infrastructure engineers, component developers, prompt engineers, and domain experts each operate independently. The pipeline YAML serves as the integration point, and the UDF execution model enforces this separation architecturally —statelessness per partition, serializable configuration, and deterministic document IDs ensure components remain portable and composable.
The component ecosystem continues to grow as new sources and destinations are added. Each new component immediately becomes composable with every existing one, compounding the value of the framework with every addition.