Introduction
Incedo Lighthouse™ began as an experimental proof of concept (PoC) to validate that AI could automate problem discovery and insight generation through KPI Trees. Over time, it evolved into a robust, multi-tenant SaaS platform deployed on AWS, enabling enterprise customers to discover problems through AI-powered analytics.
This blog outlines the technical and strategic journey of transforming Incedo Lighthouse™ from an early prototype into a production-grade SaaS platform, highlighting the key AWS services leveraged, the architectural decisions made, and the lessons learned along the way.
Phase 1: Business Planning & Strategy
At inception, Incedo Lighthouse™ was designed to address a common pain point: how can large organizations (enterprises with 1,000+ employees) discover problems in their business without manual data analysis?
We started by identifying our target personas:
- Business users who need quick insights into operational breakdowns, digestible in under 2 minutes
- C-level executives who need an at-a-glance view of business performance
- Data analysts who need to trace anomalies to their root cause
- Data engineers who require consistent data pipelines and model performance
We aligned our product goals to AWS’s SaaS Journey Framework, ensuring we had a strong foundation:
- Define measurable KPIs (metrics)
- Identify filters and cohorts
- Identify compliance and isolation requirements for multi-tenancy
Phase 2: MVP Development & Early Architecture
The MVP was built using a monolithic backend and basic Python data pipelines. Key capabilities included:
- Reading and transforming Excel, CSV, and other data sources
- Validating schemas and generating KPI Trees (a minimal sketch follows this list)
- Returning insights via an API consumed by a React Frontend
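To make this concrete, here is a minimal sketch of the MVP-era validation step. The column names and dtypes are hypothetical illustrations, not the product's actual schemas, which were tenant-specific.

```python
import pandas as pd

# Hypothetical expected schema; real schemas were defined per tenant.
EXPECTED_SCHEMA = {"order_id": "int64", "region": "object", "revenue": "float64"}

def load_and_validate(path: str) -> pd.DataFrame:
    """Read an uploaded CSV and check it against the expected schema."""
    df = pd.read_csv(path)
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    # Coerce dtypes so downstream KPI aggregations behave predictably
    return df.astype(EXPECTED_SCHEMA)
```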
The architecture included:
- React: frontend
- Spring Boot API: REST endpoints for fetching data from database tables
- Python layer: data science batch jobs using Pandas to generate insights and run machine learning tasks
- PostgreSQL: central data store
Although functional, this approach could not handle the volume, variety, and concurrency of real-world production scenarios (processing 0.5M+ records per hour). Memory and CPU constraints and slow batch execution soon became apparent.
Phase 3: Replatforming for Scale
We adopted a modular microservices architecture and replaced bottlenecks with scalable AWS-native services. The replatforming effort focused on decomposing the monolith into specialized services, each owning a distinct step in the data processing pipeline. This not only enabled better fault isolation but also allowed services to scale independently.
Microservices Introduced:
- Core API Service
- Built with Spring Boot
- Accepts user-uploaded files
- Tenant management
- Stores files and their metadata in PostgreSQL
- Login Service
- Provides all authentication services
- Isolates authorization complexity
- Data Science Service
- Implemented using PySpark jobs on Amazon EMR
- Performs schema validation, typecasting, date conversions, and missing value imputation
- Dynamically adjusts resources based on file size and complexity using EMR auto scaling
- Writes transformed data to disk in Parquet format with partitioned paths (see the PySpark sketch after this list)
- Detects anomalies
- React Service
- Serves the React frontend through Amazon CloudFront, which is AWS native
- Used to create and update KPI Trees
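The transformation steps in the Data Science Service roughly follow the shape below. This is a sketch, not the production job: the column names, S3 paths, and tenant/date partitioning scheme are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lighthouse-transform").getOrCreate()

# Paths and columns are placeholders; tenant_id would come from upload metadata.
df = spark.read.csv("s3://example-bucket/uploads/orders.csv", header=True)

df = (
    df.withColumn("revenue", F.col("revenue").cast("double"))           # typecasting
      .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))  # date conversion
      .fillna({"revenue": 0.0})                                         # missing value imputation
      .withColumn("tenant_id", F.lit("tenant_acme"))                    # illustrative tenant tag
)

# Partitioned Parquet lets downstream KPI jobs prune by tenant and date
df.write.mode("overwrite").partitionBy("tenant_id", "order_date") \
    .parquet("s3://example-bucket/curated/orders/")
```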
By decomposing the system into these modular components, we enabled independent scaling, faster deployment cycles, and better maintainability. Each service can be versioned, monitored, and secured individually, allowing Incedo Lighthouse™ to meet the performance and compliance expectations of a modern SaaS offering.
Core Technology Changes:
- Replaced Pandas batch jobs with PySpark on AWS EMR, sized at 4 executors (8 GB RAM each) and a 16 GB driver (a configuration sketch follows the metrics below)
- Deployed AI models for real-time anomaly detection and clustering
- Adopted additional core AWS services, including Amazon GuardDuty, AWS Network Firewall, Amazon API Gateway, Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Simple Storage Service (Amazon S3), Amazon CloudWatch, AWS CloudTrail, Amazon Cognito, Amazon Relational Database Service (Amazon RDS), NAT Gateway, Amazon Route 53, Amazon CloudFront, and AWS Web Application Firewall (AWS WAF), among others
- Performance metrics:
- Average throughput: 0.8M records processed per hour under peak load
- API response time: up to 3,000 ms for standard queries
- Scalability limits: Tested up to 4M records/hour with auto-scaling EMR clusters
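For reference, the executor sizing above maps to a Spark configuration along these lines; in production the equivalent values were supplied to spark-submit on the EMR cluster, with EMR auto scaling adjusting the cluster itself.

```python
from pyspark.sql import SparkSession

# Mirrors the sizing described above: 4 executors x 8 GB, 16 GB driver.
spark = (
    SparkSession.builder.appName("lighthouse-batch")
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "16g")
    .getOrCreate()
)
```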
Phase 4: Multi-Tenancy and SaaS Readiness
We selected a schema-based multi-tenancy model in PostgreSQL to isolate tenant configurations while keeping operational overhead low.
IAM roles and resource policies ensured tenant-level data isolation, and our onboarding workflow automatically provisioned schemas for each new tenant (a provisioning sketch follows the list below). Key SaaS capabilities included:
- Automated provisioning
- Role-based access control
- Custom metadata management per tenant
- Tenant-specific configurations for model thresholds, KPI rules, and transformation logic
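A simplified sketch of the schema provisioning step, assuming psycopg2 and a per-tenant kpi_config table; both the connection string and the table layout are illustrative, not the production design.

```python
import psycopg2
from psycopg2 import sql

def provision_tenant(conn, tenant_schema: str) -> None:
    """Create an isolated schema and a per-tenant config table."""
    with conn.cursor() as cur:
        cur.execute(sql.SQL("CREATE SCHEMA IF NOT EXISTS {}")
                    .format(sql.Identifier(tenant_schema)))
        cur.execute(sql.SQL(
            "CREATE TABLE IF NOT EXISTS {}.kpi_config "
            "(key TEXT PRIMARY KEY, value JSONB)"
        ).format(sql.Identifier(tenant_schema)))
    conn.commit()

conn = psycopg2.connect("dbname=lighthouse user=admin")  # placeholder DSN
provision_tenant(conn, "tenant_acme")
```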
Security Compliance:
- Data protection: encryption at rest and in transit
- Access control: IAM role-based access, least privilege
- Monitoring: Amazon GuardDuty, AWS CloudTrail, AWS Config
- Network security: AWS WAF, security groups
- Compliance frameworks: AWS Well-Architected security pillar
Phase 5: CI/CD and Operational Readiness
We formalized development workflows using:
- GitHub Actions for CI/CD
- SonarQube for code quality gates
- Pre-commit hooks and branch protection
- EKS: Infrastructure was managed with AWS CDK and CloudFormation. We containerized all microservices using Docker and deployed them on Amazon Elastic Kubernetes Service (Amazon EKS), which gave us an environment with advanced security features and efficient horizontal scaling across services. Each microservice runs as a container in its own Kubernetes pod, allowing better fault isolation, autoscaling, and zero-downtime deployments. Helm charts package and deploy Kubernetes resources consistently across environments, simplifying version management, parameterization, and rollout strategies for each service. Kubernetes ConfigMaps, Secrets, and Horizontal Pod Autoscalers manage configuration and scale services on metrics such as CPU usage or request volume. AWS Secrets Manager holds protected credentials, and AWS Systems Manager Parameter Store handles configuration management (a retrieval sketch follows).
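At startup, each service pulls its credentials and configuration roughly as follows; the secret and parameter names are placeholders, not our actual naming scheme.

```python
import json
import boto3

secrets = boto3.client("secretsmanager")
ssm = boto3.client("ssm")

# Fetch database credentials from AWS Secrets Manager
db_creds = json.loads(
    secrets.get_secret_value(SecretId="lighthouse/db-credentials")["SecretString"]
)

# Fetch pipeline configuration from AWS Systems Manager Parameter Store
batch_size = int(
    ssm.get_parameter(Name="/lighthouse/pipeline/batch-size")["Parameter"]["Value"]
)
```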
Enhanced Tracking and Observability:
- Amazon CloudWatch for logs and metrics
- Amazon GuardDuty and Config for security posture
- AWS WAF to protect APIs
- Alarms and dashboards via Amazon CloudWatch (an example alarm definition follows this list)
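As an illustration, an alarm on API Gateway 5XX errors can be defined programmatically; the API name, threshold, and SNS topic ARN below are assumptions, not our exact values.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="lighthouse-api-5xx",
    Namespace="AWS/ApiGateway",
    MetricName="5XXError",
    Dimensions=[{"Name": "ApiName", "Value": "lighthouse-core-api"}],  # assumed name
    Statistic="Sum",
    Period=300,                    # evaluate over 5-minute windows
    EvaluationPeriods=1,
    Threshold=10,                  # alert on more than 10 errors per window
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],    # placeholder ARN
)
```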
Phase 6: Real-Time Anomaly Detection
To increase responsiveness, we introduced a real-time anomaly detection pipeline:
- Ingest data continuously via Amazon Kinesis or Kafka
- Apply window-based transformations using Spark Streaming
- Store results back in PostgreSQL
This evolution positioned Incedo Lighthouse™ as a proactive analytics engine.
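A condensed sketch of this streaming path, assuming a Kafka source and Spark Structured Streaming; the topic name, event schema, window sizes, and JDBC target are all illustrative.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("lighthouse-streaming").getOrCreate()

event_schema = StructType([
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Continuous ingestion from a Kafka topic (broker and topic are placeholders)
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "kpi-events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Window-based transformation: per-metric mean and stddev over 5-minute windows
windowed = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "metric")
    .agg(F.avg("value").alias("mean"), F.stddev("value").alias("std"))
    .select(F.col("window.start").alias("window_start"),
            F.col("window.end").alias("window_end"),
            "metric", "mean", "std")
)

def write_to_postgres(batch_df, batch_id):
    # JDBC URL, table, and credentials are placeholders
    batch_df.write.jdbc(
        "jdbc:postgresql://db:5432/lighthouse", "kpi_window_stats",
        mode="append", properties={"user": "admin", "password": "..."},
    )

windowed.writeStream.outputMode("append").foreachBatch(write_to_postgres).start()
```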
Phase 7: Cost Optimization
Operating large Spark jobs and ML inference at scale can be expensive. We adopted:
- Spot Instances for EMR jobs
- S3 Intelligent Tiering for data storage
These optimizations reduced costs by roughly 40% within two months of replatforming (exact figures are confidential).
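The S3 side of this is a one-time lifecycle rule; here is a sketch using a placeholder bucket, prefix, and transition age.

```python
import boto3

s3 = boto3.client("s3")

# Move curated data to Intelligent-Tiering after 30 days (bucket/prefix are placeholders)
s3.put_bucket_lifecycle_configuration(
    Bucket="example-lighthouse-data",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "intelligent-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "curated/"},
            "Transitions": [{"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}],
        }]
    },
)
```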
Phase 8: Go-To-Market and Beyond
After internal production validation, we prepared for public onboarding:
- Implemented SaaS trial tenant flow
- Integrated with AWS Marketplace
- Added support workflows with auto-escalation via email/Amazon SNS (a minimal publish sketch follows this list)
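The escalation hook itself reduces to an SNS publish; the topic ARN and message content below are placeholders.

```python
import boto3

sns = boto3.client("sns")

sns.publish(
    TopicArn="arn:aws:sns:us-east-1:111122223333:support-escalations",  # placeholder
    Subject="[Lighthouse] Support ticket escalated",
    Message="Ticket breached its first-response SLA and was escalated to tier 2.",
)
```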
System reliability data:
- Uptime: 99.95% over the past 12 months
- Recovery Time Objective (RTO): < 15 minutes for service disruptions
- Redundancy measures: Multi-AZ deployments for PostgreSQL and EKS clusters, daily backups with point-in-time recovery
Lessons Learned
- Build SaaS mindset from day 1
Designing Incedo Lighthouse™ with a SaaS-first mindset helped us ensure that every tenant, regardless of size, had access to a consistent, scalable experience. This meant investing early in capabilities such as tenant onboarding automation, isolated configurations, self-service features, and role-based access control. It also meant architecting for self-serve provisioning, licensing control, and operational visibility across tenants.
- Use AWS managed services wherever possible
Rather than managing Spark clusters, ML model deployments, or workflow engines ourselves, we relied on services that auto-scale, recover automatically, and integrate natively with IAM and CloudWatch. This allowed us to focus on application logic rather than infrastructure.
- Design for failure and scale early
One early challenge was the failure of batch jobs during peak usage (when processing exceeds 5K records/minute). We addressed this by implementing retry mechanisms, using AWS Step Functions for fault-tolerant workflows, and monitoring each microservice with alerts. By embracing eventual consistency, idempotent operations, and queue-based decoupling, we made the system markedly more resilient (a minimal retry sketch appears after these lessons). This mindset enabled us to absorb 2x spikes in load without service degradation.
- Start with real use cases
Rather than building abstract data capabilities, we built Incedo Lighthouse™ features around actual business scenarios, like root cause analysis for delayed shipments or fluctuating revenues. These use cases helped define what metrics mattered, how anomaly detection thresholds should be tuned, and what visualizations users needed. They kept us grounded and ensured product-market fit from the outset.
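A minimal sketch of the retry pattern referenced in the "design for failure" lesson; the real workflows used AWS Step Functions, so the decorator below is only an in-process illustration of the idea.

```python
import time
from functools import wraps

def retry(attempts: int = 3, base_delay: float = 1.0):
    """Retry a failed call with exponential backoff."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise
                    time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        return wrapper
    return decorator

@retry(attempts=3)
def process_batch(batch_id: str) -> None:
    # Idempotent by design: re-running the same batch_id must be safe
    ...
```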
Customer Success Metrics:
- Use cases: Automated anomaly detection for supply chain delays, financial performance monitoring
- Quantifiable benefits: Reduced issue resolution time by 45%, increased operational efficiency by 30%
- Average implementation timeframe: 2-4 weeks from onboarding to production usage
Conclusion
The journey from PoC to production for Incedo Lighthouse™ has been an iterative, insight-rich experience. By closely aligning with AWS SaaS best practices, adopting serverless and elastic compute, and embracing DevOps and cost optimization early, we built a scalable AI SaaS platform that’s helping organizations discover root causes faster and smarter.
As we expand into generative AI, deeper anomaly detection, and smarter metric exploration, Incedo Lighthouse™ continues to shine a light on what matters most: actionable insights, at scale.