A systematic review of AI-based data analytics pipelines for large-scale systems
Keywords:
AI-Based Data Analytics, Large-Scale Systems, Data Pipelines, Distributed Computing, Real-Time Analytics, Machine Learning Integration.
Abstract
The exponential growth of data generated by large-scale systems has necessitated advanced analytics pipelines capable of processing and analyzing data and extracting insights in real time. This paper presents a systematic review of AI-based data analytics pipelines for large-scale systems, focusing on their architectural design, computational frameworks, and role in enabling scalable and intelligent data processing. The study synthesizes existing literature on distributed data processing, machine learning integration, and real-time analytics to identify key components, design patterns, and performance optimization strategies. Central to the review is the examination of end-to-end analytics pipelines, including data ingestion, preprocessing, model training, inference, and deployment, all of which are enhanced through artificial intelligence techniques. The integration of AI enables automated feature engineering, anomaly detection, predictive modeling, and adaptive learning, significantly improving the efficiency and accuracy of analytics workflows. Furthermore, the review explores the role of modern technologies such as data lakehouses, stream processing engines, and cloud-native architectures in supporting high-throughput and low-latency analytics. The paper also highlights critical challenges, including scalability constraints, data quality issues, model interpretability, and system interoperability, which affect the performance and reliability of AI-driven pipelines. Emphasis is placed on governance mechanisms and monitoring frameworks that ensure data integrity, compliance, and continuous system optimization. The findings reveal that AI-based analytics pipelines provide significant advantages in handling large-scale data environments but require careful architectural design and integration strategies to achieve optimal performance. This systematic review contributes to the field by offering a comprehensive understanding of AI-driven analytics pipelines and identifying emerging trends and research opportunities for enhancing scalability, automation, and decision intelligence in large-scale systems.