Big Data analysis has been a very hot and active research during the past few years. It is getting hard to efficiently execute data analysis task with traditional data warehouse solutions.
Parallel processing platforms and parallel dataflow systems running on top of them are increasingly popular. They have greatly improved the throughput of data analysis tasks. The trade-off is the consumption of more computation resources. Tens or hundreds of nodes run together to execute one task.
However, it might still take hours or even days to complete a task. It is very important to improve resource utilization and computation efficiency. According to research conducted by Microsoft, there exists around 30% of common sub-computations in usual workloads. Computation redundancy is a waste of time and resources.
Apache Pig is a parallel dataflow system runs on top of Apache Hadoop, which is a parallel processing platform. Pig/Hadoop is one of the most popular combinations used to do large scale data processing.
This project proposed a framework which materializes and reuses previous computation results to avoid computation redundancy on top of Pig/Hadoop. The idea came from the materialized view technique in Relational Databases. Computation outputs were selected and stored in the Hadoop File System due to their large size.
The execution statistics of the outputs were stored in MySQL Cluster. The framework used a plan matcher and rewriter component to find the maximally shared common-computation with the query from MySQL Cluster, and rewrite the query with the materialized outputs. The framework was evaluated with the TPC-H Benchmark.
The results showed that execution time had been significantly reduced by avoiding redundant computation. By reusing sub-computations, the query execution time was reduced by 65% on average; while it only took around 30 ˜ 45 seconds when reuse whole computations. Besides, the results showed that the overhead is only around 25% on average.
📚 Over 50,000 Research Thesis
📱 100% Offline: No internet needed
📝 Over 98 Departments
🔍 Thesis-to-Journal Publication
🎓 Undergraduate/Postgraduate Thesis
📥 Instant Whatsapp/Email Delivery
The project titled "Applying Machine Learning Techniques to Detect Financial Fraud in Online Transactions" aims to address the critical issue of detec...
The project titled "Anomaly Detection in IoT Networks Using Machine Learning Algorithms" focuses on addressing the critical challenge of detecting ano...
The project titled "Applying Machine Learning Algorithms for Predicting Stock Market Trends" aims to explore the application of machine learning algor...
The project titled "Applying Machine Learning Algorithms for Sentiment Analysis in Social Media Data" focuses on utilizing machine learning algorithms...
The project titled "Applying Machine Learning for Predictive Maintenance in Industrial IoT Systems" focuses on leveraging machine learning techniques ...
The project, "Implementation of a Machine Learning Algorithm for Predicting Stock Prices," aims to leverage the power of machine learning techniques t...
The project titled "Development of an Intelligent Traffic Management System using Machine Learning Algorithms" aims to revolutionize the traditional t...
No response received....
The project titled "Applying Machine Learning for Intrusion Detection in IoT Networks" aims to address the increasing cybersecurity threats targeting ...