Features

WebRobot provides a comprehensive set of features for building and managing agentic ETL pipelines. This page summarizes those capabilities.

Core Features

🚀 Spark-Native Processing

Distributed Computing: Leverage Apache Spark's distributed processing capabilities
Scalability: Handle data from gigabytes to petabytes
Performance: Optimized for speed and efficiency
Resource Management: Intelligent resource allocation and optimization

🤖 AI-Powered Intelligence

Intelligent Stages: LLM-powered stages that adapt to changing web structures
Natural Language Processing: Convert natural language descriptions to executable pipelines
Auto-Programming: Python extensions for dynamic stage generation
Context-Aware Extraction: Intelligent data extraction with minimal configuration

🔌 API-First Architecture

RESTful API: Complete programmatic control via REST API
SDK Support: Official SDKs for multiple programming languages
Webhooks: Real-time notifications for job status and events
Integration Ready: Easy integration with existing tools and workflows

🧩 Maximum Extensibility

Custom Plugins: Build and deploy custom plugins for technical partners
Python Extensions: Dynamic row transforms without compilation
Attribute Resolvers: Custom methods for flexible data extraction
Custom Actions: Extend browser interactions with custom action factories
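
As a rough illustration of the kind of logic a Python Extension row transform might hold, the sketch below assumes a row is exposed as a plain dict and the transform returns a new dict. The function name `normalize_price`, the `price` and `price_value` fields, and the dict-in/dict-out signature are all hypothetical. The actual WebRobot extension interface is not described on this page and may differ.

```python
from typing import Any, Dict


def normalize_price(row: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical row transform: parse a scraped price string into a float.

    Assumes the row is a plain dict; this is not the documented WebRobot signature.
    """
    out = dict(row)  # copy so the input row is not mutated
    raw = str(row.get("price", "")).strip()

    # Keep digits and separators; drop currency symbols and spaces.
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch in ".,")

    # Treat a comma as the decimal separator only when no dot is present;
    # otherwise treat commas as thousands separators.
    if "," in cleaned and "." not in cleaned:
        cleaned = cleaned.replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")

    try:
        out["price_value"] = float(cleaned)
    except ValueError:
        out["price_value"] = None  # unparseable or missing price
    return out
```

Because the transform is an ordinary function with no compile step, it can be exercised locally on sample rows before it is registered anywhere.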

🌐 Multi-Source Integration

Web Sources: Intelligent web scraping with browser automation
Databases: Connect to PostgreSQL, MySQL, MongoDB, and more
APIs: REST and GraphQL API integration
Streaming: Real-time data ingestion from Kafka, MQTT, and more

📊 Enterprise Features

Monitoring: Comprehensive logging and monitoring capabilities
Security: Enterprise-grade authentication and authorization
Multi-tenancy: Support for multiple organizations and projects
Audit Trail: Complete audit logging for compliance

Advanced Features

Agentic Capabilities

Pipeline Generation: AI agents that generate pipelines from natural language
Auto-Setup: Automated configuration and setup of interactive actions
Context Learning: Agents learn from documentation and examples
Error Recovery: Intelligent error handling and recovery

Vertical Solutions

LLM Fine-tuning: Datasets for training and fine-tuning LLMs
Price Comparison: Real-time price monitoring and comparison
Sports Betting: Surebet detection and arbitrage opportunities
Real Estate: Property clustering and market analysis

Developer Experience

CLI Tools: Command-line interface for pipeline management
IDE Integration: Support for popular IDEs and editors
Testing: Built-in testing and validation tools
Documentation: Comprehensive documentation and examples

AI-Assisted Development

Claude Code Plugin: MCP server + skill set for AI-assisted pipeline building and administration. Claude Code is our recommended environment for vibe coding, particularly for the development of technical partner plugins.
Cursor IDE Support: Native MCP tool integration for listing jobs, running pipelines, and inspecting logs from your editor
Skills: /webrobot-admin, /webrobot-pipeline, /webrobot-plugin-dev, /webrobot-python-extension
AI Agent Workflow: Generate Python Extensions at runtime, register via API, reference in YAML — no compilation

Partner Plugin System

Plugin Marketplace: Technical partners can upload custom ETL and API plugins
Plugin SDK: Scala traits (WSourceStage, WTransformStage, WSinkStage, WFilterStage, WAggregateStage) + Java REST API plugin interface
CI/CD Integration: Jenkins pipeline with automatic JAR upload to MinIO and DB registration
Plugin Manifest: Declarative manifest.json with stage schema, Flyway migrations, and org scoping

Ray Platform (coming soon)

WebRobot is extending its backend with a Ray-based distributed computing layer, complementing the existing Spark engine with capabilities tailored for AI workloads and real-time event-driven architectures.

Training & Fine-tuning

Ray Train and Ray Data will power distributed model training and LLM fine-tuning pipelines, integrated with the same project/job model used for ETL workloads.

Inference & Agentic Execution

Ray Serve will host inference endpoints for custom models. Ray's actor model will support distributed agentic workflows — long-running agents that coordinate across multiple nodes, consume events, and drive pipeline executions autonomously.

Distributed Trading Engine

The Ray layer will serve as the backbone for real-time trading and arbitrage use cases, enabling low-latency event processing and coordination across geographically distributed workers.

Sports Betting — Real-Time Odds Pipeline

The surebet detection vertical will use Ray to monitor live odds from multiple bookmakers in real time. Detected events feed a Kafka queue, which in turn drives a Spark Structured Streaming job for continuous arbitrage calculation and alerting.

Bookmaker APIs → Ray workers (real-time odds collection)
                       ↓
                  Kafka topic
                       ↓
           Spark Structured Streaming
                       ↓
          Surebet detection & alerts
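
The arbitrage calculation at the end of this flow can be sketched in plain Python, independent of Ray, Kafka, or Spark. The sketch uses the standard surebet condition: take the best decimal odds for each outcome across bookmakers, and if the sum of their reciprocals is below 1, a guaranteed-return stake split exists. The function names, the input layout (bookmaker to outcome to odds), and the sample figures are assumptions for illustration, not WebRobot's actual implementation.

```python
from typing import Dict, Optional


def best_odds(quotes: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return the highest decimal odds per outcome across all bookmakers.

    quotes maps bookmaker -> outcome -> decimal odds (hypothetical layout).
    """
    best: Dict[str, float] = {}
    for book_odds in quotes.values():
        for outcome, odds in book_odds.items():
            if odds > best.get(outcome, 0.0):
                best[outcome] = odds
    return best


def surebet(odds: Dict[str, float], total_stake: float = 100.0) -> Optional[dict]:
    """Return margin and stake split if the odds form a surebet, else None."""
    implied = sum(1.0 / o for o in odds.values())  # sum of implied probabilities
    if implied >= 1.0:
        return None  # no arbitrage: the combined book is not under-round
    # Stakes proportional to implied probability equalize the payout on every outcome.
    stakes = {k: total_stake * (1.0 / o) / implied for k, o in odds.items()}
    return {"margin": 1.0 / implied - 1.0, "stakes": stakes}


if __name__ == "__main__":
    quotes = {
        "book_a": {"home": 2.10, "away": 1.80},
        "book_b": {"home": 1.90, "away": 2.05},
    }
    print(surebet(best_odds(quotes)))
```

In a streaming job, a check like this would run per event on the latest odds snapshot. A production version would also need to account for stake limits, commissions, and odds that change before bets are placed.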

What's Next?

Check out our documentation to see all features and improvements.
