Features

WebRobot provides a set of features for building and managing agentic ETL pipelines, from Spark-native processing and AI-powered stages to a partner plugin system.

Core Features

🚀 Spark-Native Processing

  • Distributed Computing: Leverage Apache Spark's distributed processing capabilities
  • Scalability: Handle data from gigabytes to petabytes
  • Performance: Optimized for speed and efficiency
  • Resource Management: Intelligent resource allocation and optimization

🤖 AI-Powered Intelligence

  • Intelligent Stages: LLM-powered stages that adapt to changing web structures
  • Natural Language Processing: Convert natural language descriptions to executable pipelines
  • Auto-Programming: Python extensions for dynamic stage generation
  • Context-Aware Extraction: Intelligent data extraction with minimal configuration

🔌 API-First Architecture

  • RESTful API: Complete programmatic control via REST API
  • SDK Support: Official SDKs for multiple programming languages
  • Webhooks: Real-time notifications for job status and events
  • Integration Ready: Easy integration with existing tools and workflows

🧩 Maximum Extensibility

  • Custom Plugins: Build and deploy custom plugins for technical partners
  • Python Extensions: Dynamic row transforms without compilation
  • Attribute Resolvers: Custom extraction methods for flexible data extraction
  • Custom Actions: Extend browser interactions with custom action factories

🌐 Multi-Source Integration

  • Web Sources: Intelligent web scraping with browser automation
  • Databases: Connect to PostgreSQL, MySQL, MongoDB, and more
  • APIs: REST and GraphQL API integration
  • Streaming: Real-time data ingestion from Kafka, MQTT, and more

📊 Enterprise Features

  • Monitoring: Comprehensive logging and monitoring capabilities
  • Security: Enterprise-grade authentication and authorization
  • Multi-tenancy: Support for multiple organizations and projects
  • Audit Trail: Complete audit logging for compliance

Advanced Features

Agentic Capabilities

  • Pipeline Generation: AI agents that generate pipelines from natural language
  • Auto-Setup: Automated configuration and setup of interactive actions
  • Context Learning: Agents learn from documentation and examples
  • Error Recovery: Intelligent error handling and recovery

Vertical Solutions

  • LLM Fine-tuning: Datasets for training and fine-tuning LLMs
  • Price Comparison: Real-time price monitoring and comparison
  • Sports Betting: Surebet detection and arbitrage opportunities
  • Real Estate: Property clustering and market analysis

Developer Experience

  • CLI Tools: Command-line interface for pipeline management
  • IDE Integration: Support for popular IDEs and editors
  • Testing: Built-in testing and validation tools
  • Documentation: Comprehensive documentation and examples

AI-Assisted Development

  • Claude Code Plugin: MCP server + skill set for AI-assisted pipeline building and administration. Claude Code is our recommended environment for vibe coding, particularly for the development of technical partner plugins.
  • Cursor IDE Support: Native MCP tool integration — list jobs, run pipelines, inspect logs from your editor
  • Skills: /webrobot-admin, /webrobot-pipeline, /webrobot-plugin-dev, /webrobot-python-extension
  • AI Agent Workflow: Generate Python Extensions at runtime, register via API, reference in YAML — no compilation
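
The "no compilation" workflow above centers on a row transform written in plain Python. The following is a minimal sketch of what such a transform might look like, assuming rows are passed in and returned as dictionaries. The function name `normalize_price`, the field names `price_raw` and `price`, and the dict-in/dict-out contract are all hypothetical and are not taken from the WebRobot API. Registration via the API and the YAML reference are left out because their formats are not described on this page.

```python
import re
from typing import Any, Dict

# Matches the first number in a string, allowing "." or "," as the decimal mark.
# Assumption: no thousands separators (e.g. "1,299.00" is NOT handled correctly).
_PRICE_RE = re.compile(r"(\d+(?:[.,]\d+)?)")


def normalize_price(row: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical row transform: parse a scraped price string into a float.

    Returns a copy of the row with a new "price" field; the input is not mutated.
    """
    out = dict(row)
    match = _PRICE_RE.search(str(row.get("price_raw", "")))
    out["price"] = float(match.group(1).replace(",", ".")) if match else None
    return out
```

Because the transform is an ordinary function over a dictionary, it can be unit-tested locally before it is registered anywhere, for example with `normalize_price({"price_raw": "EUR 19,99"})`.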

Partner Plugin System

  • Plugin Marketplace: Technical partners can upload custom ETL and API plugins
  • Plugin SDK: Scala traits (WSourceStage, WTransformStage, WSinkStage, WFilterStage, WAggregateStage) + Java REST API plugin interface
  • CI/CD Integration: Jenkins pipeline with automatic JAR upload to MinIO and DB registration
  • Plugin Manifest: Declarative manifest.json with stage schema, Flyway migrations, and org scoping

Ray Platform (coming soon)

WebRobot is extending its backend with a Ray-based distributed computing layer, complementing the existing Spark engine with capabilities tailored for AI workloads and real-time event-driven architectures.

Training & Fine-tuning

Ray Train and Ray Data will power distributed model training and LLM fine-tuning pipelines, integrated with the same project/job model used for ETL workloads.

Inference & Agentic Execution

Ray Serve will host inference endpoints for custom models. Ray's actor model will support distributed agentic workflows — long-running agents that coordinate across multiple nodes, consume events, and drive pipeline executions autonomously.

Distributed Trading Engine

The Ray layer will serve as the backbone for real-time trading and arbitrage use cases, enabling low-latency event processing and coordination across geographically distributed workers.

Sports Betting — Real-Time Odds Pipeline

The surebet detection vertical will use Ray to monitor live odds from multiple bookmakers in real time. Detected events feed a Kafka queue, which in turn drives a Spark Structured Streaming job for continuous arbitrage calculation and alerting.

Bookmaker APIs → Ray workers (real-time odds collection) → Kafka topic → Spark Structured Streaming → Surebet detection & alerts
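
The arbitrage calculation at the end of this pipeline rests on simple arithmetic: with decimal odds, a surebet exists when the implied probabilities (1 / odds) of all outcomes, each taken at the best available bookmaker, sum to less than 1. A minimal sketch of that check follows. It is a standalone illustration, not WebRobot's implementation, and the function name `surebet_stakes` and its input shape are assumptions.

```python
from typing import Dict, Optional


def surebet_stakes(best_odds: Dict[str, float], bankroll: float) -> Optional[Dict[str, float]]:
    """Split `bankroll` across outcomes so every outcome pays the same amount.

    `best_odds` maps each outcome to the best decimal odds found for it.
    Returns None when the odds do not admit an arbitrage.
    """
    # Implied probability of each outcome at its best price.
    implied = {outcome: 1.0 / odds for outcome, odds in best_odds.items()}
    total = sum(implied.values())
    if total >= 1.0:
        return None  # the combined book has no margin in the bettor's favor
    # Staking proportionally to implied probability equalizes the payout,
    # which is bankroll / total regardless of which outcome occurs.
    return {outcome: bankroll * p / total for outcome, p in implied.items()}
```

In a streaming job, a function like this would run once per event over the latest best odds for each outcome. Fees, stake limits, and odds that move between detection and placement are ignored here.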

What's Next?

Check out our documentation to see all features and improvements.

Released under the MIT License.