Enterprise Document Intelligence & Compliance Automation

Processing hundreds of enterprise documents through AI pipelines required thousands of lines of glue code. A 42-line YAML file replaced it.

The Problem

Every new AI document pipeline started the same way: a standalone Python script that coupled ingestion, transformation, and output into a single monolith. Each team rebuilt the same patterns from scratch —API authentication, error handling, parallelism, retry logic.

A 1,500-line async Python script for vision-based document enrichment was the tipping point. It cloned Git repositories, enriched markdown with vision models, rewrote links, and indexed into FAISS —all tightly coupled. Changing the embedding model meant rewriting the indexing logic. Swapping the vector store meant rewriting the output layer. Prompt engineers needed developer involvement to iterate on extraction quality.

Scaling any of these scripts to distributed execution required a full rewrite. There was no reuse between pipelines, no separation of concerns, and no way to compose existing capabilities into new workflows without writing code.

Architecture

The Pipeline SDK (PDK) follows a linear Source -> Processors -> Destination execution model. Pipelines are defined entirely in YAML and executed by a runner that manages Spark sessions, component lifecycle, and status tracking. Every component operates on a canonical DataFrame schema, meaning any source can feed any processor, and any processor can feed any destination.

Every component reads and writes the same four-column DataFrame:

Column	Type	Purpose
`document_id`	`string`	SHA256 hash uniquely identifying the document
`document_uri`	`string`	Source location (e.g., `file://`, `box://`, `scar://`)
`document_metadata`	`map<string,string>`	Extensible key-value metadata accumulated across stages
`document_content`	`string`	Document body —text, JSON, or markdown

Key Decision

PDK wraps SimpleProcessor implementations in Spark UDFs rather than requiring component authors to write Spark code directly. This means any Python function can become a distributed processor, but it introduces a serialization boundary —component configs must round-trip through JSON, ruling out file handles or database connections in config objects. Components initialize these resources in __init__, which runs once per Spark partition.

For operations that need cross-document context (joins, aggregations, windowing), the SparkAware* base classes provide full DataFrame and SparkSession access.

Use Cases

PDK supports seven pipeline configurations in production. Four are detailed below; the framework also handles knowledge base indexing with vector search, stream processing with link validation, and multi-document fusion for cross-file analysis.

Document Intelligence with LLMs

Ingests documents from Box, applies multi-modal conversion (text + images), and runs LLM-based structured extraction. The converter routes PDFs to PyMuPDF and Office formats to Docling automatically via nested component composition.

source:
  class_name: pipelines.sources.box.BoxSource
  config:
    dir_id: "364548964496"
    extension: "pdf,docx"
    converter:
      class_name: pipelines.converters.unified_converter.UnifiedDocumentConverter
      config:
        extract_images: true
        image_dpi: 150

processors:
  - class_name: pipelines.processors.watsonxai.WatsonxAIProcessor
    config:
      model_id: meta-llama/llama-3-3-70b-instruct
      max_new_tokens: 131000
      temperature: 0

Nested component composition: The UnifiedDocumentConverter is configured inside the source, routing file types to the appropriate converter automatically
Multi-modal processing: The converter extracts both text and base64-encoded images, which the LLM can analyze together
Model flexibility: Swapping models (Granite, LLaMA, Mistral) requires changing one YAML field —no code changes

Security Compliance Scoring

The most complex pipeline in the ecosystem. Reads compliance data from four Excel spreadsheets, normalizes hostnames, scores hosts across six security dimensions, aggregates to squad level, enriches with live SRE data from four external APIs, runs AI analysis, and generates multi-format scorecards.

Scoring thresholds and RAG bands are defined in a separate YAML file, allowing security teams to adjust without touching code:

patch_management:
  checks:
    devices_vulnerable: 2
    overdue_patches: 3
    missing_edr: 2
  no_data_penalty: 5

rag_thresholds:
  dimension:
    green_max: 0
    amber_max: 3
  total:
    green_max: 5
    amber_max: 15

Deep processor chaining: Five SparkAware processors execute sequentially, each transforming the DataFrame and accumulating metadata across stages
External data enrichment: The SRE enrichment processor calls four external APIs (ServiceNow incidents, CMDB relationships, Instana events, vector store research)
Multi-format output: The destination generates Excel workbooks with RAG color coding, HTML summaries via Jinja2, and JSON exports —all from a single component

Documentation Publishing

Converts documents and publishes them directly to GitHub as pull requests for MkDocs sites. When no transformation is needed, the processor stage is skipped entirely.

source:
  class_name: pipelines.sources.file_system.FileSystemSource
  config:
    root_dir: examples/data
    file_name_pattern: "*.md"
    converter:
      class_name: pipelines.converters.docling_file_converter.DoclingFileConverter

processors: []

destination:
  class_name: pipelines.destinations.github_mkdocs.GitHubMkDocsDestination
  config:
    github_mkdocs:
      repo_url: "<GITHUB_REPO_URL>"
      branch_name: "<FEATURE_BRANCH_NAME>"
      dest_branch: "<DESTINATION_BRANCH_NAME>"

Zero-processor pipelines: processors: [] skips the processing stage entirely when no transformation is needed
Git-native destination: Clones the repo, creates a feature branch, commits the document, pushes, and opens a pull request —all automatically
Template-driven content: Commit messages, PR titles, and PR bodies support template variables ({document_id}, {original_filename}, {changed_files_markdown})

SRE Knowledge Enrichment with Vision

A 1,500-line standalone async Python script was decomposed into four reusable PDK components. Spark partitions replaced asyncio.gather for parallelism, and YAML configuration replaced CLI arguments.

Script Pattern	PDK Equivalent
Global state, singletons	Per-partition instantiation in `__init__`
`asyncio.gather` + semaphores	Spark partition-level parallelism
CLI args + environment variables	YAML config + Pydantic validation
Per-file try/except with continue	UDF wrapper + error tracking
Monolithic function (clone -> enrich -> index)	Source -> Processor -> Destination chain

Monolith decomposition: Each responsibility in the original script became one PDK component —tightly coupled functions stay together, independently useful functions become separate components
Multi-modal vision processing: Detects image references in markdown, resolves local and external images, and routes to different code paths based on image count
Composable destinations: The same source and processors can feed FAISS, Milvus, local files, or GitHub Pages with no code changes —swap one YAML block

Performance and Scalability

Execution Model

Every component type supports two execution strategies. The runner inspects the class hierarchy to automatically choose the correct one:

UDF (Simple* base classes): Wrapped in a PySpark UDF, executed per-row on executors. Best for stateless per-document operations like API calls and LLM inference
DataFrame (SparkAware* base classes): Receives SparkSession + DataFrame, returns DataFrame. Best for joins, aggregations, and multi-document operations

Approximate per-document latency varies significantly by processor type, with LLM API calls dominating:

Scaling Strategy

Component	Parallelism	Bottleneck	Strategy
BoxSource	Per-file UDF	Box API rate limits	Batch via `merge_documents` mode
WatsonxAIProcessor	Per-document UDF	LLM API latency (~2-10s/doc)	Scale Spark executors
VisionEnrichmentProcessor	Per-document UDF	Vision API latency (~5-15s/doc)	Scale executors; tune `max_images_per_call`
NormalizationProcessor	Full DataFrame join	Memory for pandas conversion	Increase driver memory
MilvusDestination	Per-document chunk+embed	Embedding generation	Batch embedding APIs
KafkaBatchAdapter	Spark partitions	Consumer throughput	Increase Kafka partitions

Key insight: For LLM-based processors, the bottleneck is always API latency, not framework overhead. Adding Spark executors directly scales LLM throughput —each executor makes independent API calls in parallel.

Tradeoffs

Cold start per partition: Each Spark partition re-instantiates the component from serialized config, including API client authentication. Fewer partitions reduce cold starts but limit parallelism
Synchronous UDF execution: Unlike standalone scripts using asyncio.gather, UDFs execute synchronously. Spark provides inter-document parallelism at the partition level; for intra-document concurrency, components use ThreadPoolExecutor inside process()
Serialization boundary: All component configuration must round-trip through JSON. File handles, database connections, and closures are ruled out in configs. Large system prompts should use file references rather than inline config
Statelessness contract: Global singletons are replaced by per-partition instances. This eliminates contention but increases resource usage. Broadcast variables offer a future optimization path

Customization

Zero-Code: YAML Assembly

Many pipeline variations require no code changes:

Swap converters: Change class_name from DoclingFileConverter to PyMuPDFConverter for lighter-weight PDF processing
Change LLM providers: Replace WatsonxAIProcessor with BoxAIProcessor or OllamaProcessor by changing the processor block
Switch destinations: Move from local files to Milvus, Box, or GitHub Pages by changing one YAML block
Adjust scoring thresholds: Security teams modify scoring_config.yaml without touching pipeline definitions

LLM-based processors treat prompts as configuration, enabling prompt engineers to iterate without developer involvement:

processors:
  - class_name: pipelines.processors.watsonxai.WatsonxAIProcessor
    config:
      model_id: meta-llama/llama-3-3-70b-instruct
      temperature: 0
      top_p: 0.45
      system_prompt: |
        You are a Universal File Insight Extractor...
        # CONTROLS
        FORMAT=MD | EVIDENCE=QUOTE | PII=redact
        # CHAIN OF THOUGHT
        Phase 1: Knowledge Discovery
        Phase 2: Knowledge Aggregation
        Phase 3: Final Document Generation

Minimal-Code: New Components

Building a new component requires extending one base class and implementing one method. The framework handles Spark UDF wrapping, serialization, error handling, and metadata propagation —no Spark knowledge required.

class MySource(SimpleSource):
    def list_documents(self) -> list[Document]:
        # Return documents to process
    def read_document(self, file_uri: str) -> tuple[DocumentContent, PipelineComponentResult]:
        # Read and return document content

class MyProcessor(SimpleProcessor):
    def process(self, document_id, document_uri, document_metadata, document_content) -> DocumentContent:
        # Transform content and/or metadata

class MyDestination(SimpleDestination):
    def write(self, document_id, document_uri, document_metadata, document_content) -> None:
        # Persist the document

Component Ecosystem

The PDK ecosystem spans sources, processors, destinations, and supporting infrastructure:

Service	Role in Pipeline
Box.com	Document source and destination (developer token + JWT auth)
IBM watsonx.ai	LLM inference (text + vision) and embeddings
Ollama	Local LLM inference and embeddings
Milvus	Vector database for RAG pipelines
FAISS	Local vector indexing for RAG pipelines
Apache Kafka	Streaming document source
GitHub	Documentation publishing via PR automation
PostgreSQL	Metadata storage and pipeline execution history
Redis	Caching layer for processors with TTL
ServiceNow / Instana	SRE enrichment for compliance pipelines

Results

Development Velocity:

1,500 lines of async Python replaced by 42 lines of YAML
Prompt engineers iterate on extraction quality without developer involvement
New pipeline assembly from existing components in hours, not weeks

Operational Scale:

6 sources, 14+ processors, 6 destinations composable in any combination
Dual execution model scales from single-machine to multi-executor Spark clusters
4 converters handle 9 file types: PDF, DOCX, PPTX, XLSX, DOC, PPT, XLS, HTML, Markdown

Separation of Concerns:

Infrastructure engineers manage Spark cluster configuration
Component developers implement focused single-responsibility classes
Prompt engineers tune LLM behavior in YAML
Domain experts define scoring thresholds in external config files

Technology Stack

Framework: PySpark, Python, YAML/Pydantic configuration

AI and ML: watsonx.ai, Box AI, Ollama, Milvus, FAISS

Document Processing: Docling, PyMuPDF, LangChain chunkers, HuggingFace tokenizers

Infrastructure: Apache Kafka, PostgreSQL, Redis, MinIO/S3, Apache Iceberg

Integrations: Box.com, GitHub, ServiceNow, Instana, Ansible

Conclusion

The core insight behind PDK is that a canonical DataFrame schema, a dual execution model, and YAML-driven composition can separate concerns cleanly enough that infrastructure engineers, component developers, prompt engineers, and domain experts each operate independently. The pipeline YAML serves as the integration point, and the UDF execution model enforces this separation architecturally —statelessness per partition, serializable configuration, and deterministic document IDs ensure components remain portable and composable.

The component ecosystem continues to grow as new sources and destinations are added. Each new component immediately becomes composable with every existing one, compounding the value of the framework with every addition.