1. INTRODUCTION

The pursuit of comprehensive, efficient, and scalable data analytics, together with the one-size-does-not-fit-all dictum, has given rise to a plethora of data processing platforms (platforms, for short). These specialized platforms include DBMSs, NoSQL systems, and MapReduce-like platforms. In fact, under the umbrella of NoSQL alone, there are reportedly over 200 different platforms. Each excels in specific aspects, allowing applications to achieve high performance and scalability. For example, while Spark supports select queries, Postgres can execute them much faster by using indices. However, Postgres is not as good as Spark for general-purpose batch processing, where parallel full scans are the key performance factor. Several studies have shown such performance differences [20, 32, 36, 50, 57].

Moreover, today's data analytics is moving beyond the limits of a single platform. For example: (i) IBM reported that North York hospital needs to process 50 diverse datasets, which run on a dozen different platforms [35]; (ii) airlines need to analyze large datasets, which are produced by different departments, are of different formats, and reside on multiple data sources, to produce global reports for decision makers [9]; (iii) oil & gas companies need to process large amounts of diverse data spanning various platforms [19, 34]; (iv) several data warehouse applications require data to be moved from a MapReduce-like system into a DBMS for further analysis [27, 53]; and (v) using multiple platforms for machine learning improves performance significantly [20, 36].

To cope with these new requirements, developers (or data scientists) have to write ad hoc programs and scripts to integrate different platforms. This is not only a tedious, time-consuming, and costly task, but it also requires knowledge of the intricacies of the different platforms to achieve high efficiency and scalability. Some systems have appeared with the goal of facilitating platform integration [2, 4, 10, 12]. Nonetheless, they all require a good deal of expertise from developers, who still need to decide which processing platform to use for each task at hand. Recent research has taken steps towards transparent cross-platform execution [15, 28, 32, 43, 55, 56], but it lacks several important aspects: these efforts usually do not map tasks to platforms automatically; they do not consider complex data movement (i.e., with data transformations) among platforms [28, 32]; and most of them focus on specific applications [15, 43, 55].

Therefore, there is a clear need for a systematic approach to enable efficient cross-platform data processing, i.e., the use of multiple data processing platforms. The holy grail would be to replicate the success of DBMSs for cross-platform data processing: users simply send tasks expressing the logic of their applications, and the cross-platform system decides on which platform(s) to execute each task with the goal of minimizing its cost (e.g., runtime or monetary cost). In other words, users focus on the high-level logic of their applications, while the cross-platform system takes care of the low-level execution details.
Building a cross-platform system is challenging on numerous fronts: (i) a cross-platform system not only has to effectively find all the suitable platforms for a given task, but it also has to choose the most efficient one; (ii) cross-platform settings are characterized by high uncertainty, as the different platforms are autonomous and thus one has little control over them; (iii) the performance gains of using multiple platforms should compensate for the added cost of moving data across platforms; (iv) it is crucial to achieve inter-platform parallelism to prevent slow platforms from dominating the execution time; and (v) the system should be extensible to new platforms and application requirements.

In this report, we present Rheem, the first general-purpose cross-platform system to tackle all of the above challenges. The goal of Rheem is to enable applications and users to run data analytic tasks efficiently on one or more data processing platforms. To do so, it decouples applications from platforms, as shown in Figure 1: applications issue their tasks to Rheem, which in turn decides on which platform(s) to execute them. As of today, Rheem supports a variety of platforms: Spark, Flink, JavaStreams, Postgres, GraphX, GraphChi, and Giraph. We are currently testing Rheem in a large international airline company and in a biomedical research institute. In the former case, we aim at seamlessly integrating all the data analytic activity governing an aircraft; in the latter, we aim at reducing the effort scientists need to build data analytic pipelines while at the same time speeding up their execution.

Several papers present different aspects of Rheem: the vision behind it [17], its optimizer [39], its inequality join algorithm [38], and a couple of its applications [36, 37]. Two demo papers showcase the benefits of Rheem [16] and its interface [44]. This report aims at presenting the complete design of Rheem and how all its pieces work together.
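To make the decoupling concrete, the sketch below expresses a word-count task against a platform-agnostic API and leaves the platform choice to the system. It is loosely modeled on Rheem's Java interface, but the class and method names (RheemContext, JavaPlanBuilder, the plugin registration calls) should be read as illustrative assumptions rather than the exact API.

    import java.util.Arrays;
    import java.util.Collection;

    // A minimal sketch of a platform-agnostic word-count task, loosely modeled
    // on Rheem's Java API; class and method names are illustrative assumptions.
    public class WordCount {
        public static void main(String[] args) {
            // Register the platforms Rheem may choose from; the optimizer,
            // not the user, decides which one(s) actually run the task.
            RheemContext context = new RheemContext()
                    .withPlugin(Java.basicPlugin())    // JavaStreams
                    .withPlugin(Spark.basicPlugin());  // Apache Spark

            Collection<Tuple2<String, Integer>> counts = new JavaPlanBuilder(context)
                    .readTextFile("hdfs://host/corpus.txt")
                    .flatMap(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                    .map(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Tuple2::getField0,
                            (a, b) -> new Tuple2<>(a.getField0(), a.getField1() + b.getField1()))
                    .collect();

            counts.forEach(System.out::println);
        }
    }

Note that nothing in the task pins it to Spark or JavaStreams; the same plan can be routed to either platform, or split across both.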


In summary, in Section 2 we identify four situations in which support for cross-platform data processing is needed. In each case, we use a real application to show experimentally the benefits of cross-platform data processing with Rheem.


In Section 3, we present Rheem's data and processing model and show how it shields users from the intricacies of the underlying platforms. Rheem provides flexible operator mappings that allow it to better exploit the underlying platforms. Moreover, thanks to its extensible design, users can add new platforms and operators with very little effort, as sketched below.
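To give a flavor of such a mapping, the hypothetical sketch below translates a platform-agnostic Map operator into an execution operator for a JavaStreams-like backend. The interfaces shown (ExecutionOperator, MapOperator, JavaStreamsMapping) are illustrative assumptions, not Rheem's actual extension API.

    import java.util.function.Function;
    import java.util.stream.Stream;

    // Hypothetical sketch of an operator mapping; the interface and class
    // names are illustrative assumptions, not Rheem's actual extension API.
    interface ExecutionOperator<I, O> {
        Stream<O> execute(Stream<I> input);
    }

    // Platform-agnostic operator: what the application writes against.
    final class MapOperator<I, O> {
        final Function<I, O> udf;
        MapOperator(Function<I, O> udf) { this.udf = udf; }
    }

    // One mapping per platform turns the abstract operator into an executable
    // one. Here, a JavaStreams-like backend applies the UDF via Stream.map.
    final class JavaStreamsMapping {
        static <I, O> ExecutionOperator<I, O> map(MapOperator<I, O> op) {
            return input -> input.map(op.udf);
        }
    }

Under this scheme, adding a new platform amounts to supplying one such mapping for each Rheem operator the platform can execute.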


Then, in Section 4, we discuss the main components that make Rheem novel: a cost-based cross-platform optimizer that takes data movement costs into account, a progressive optimization mechanism to deal with inconsistent cardinality estimates, and a learning tool that relieves users from the burden of tuning the cost model.
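As a schematic illustration of the first component (and not Rheem's actual optimizer code), the sketch below shows why data movement must be part of the cost comparison: the platform that executes an operator fastest may still lose once the cost of shipping the data to it is added.

    import java.util.Comparator;
    import java.util.List;

    // Schematic sketch only: choosing an execution platform by total cost,
    // where the total includes the cost of moving data between platforms.
    // All names here are illustrative, not Rheem's optimizer API.
    final class PlatformChoice {
        record Estimate(String platform, double executionCost, double dataMovementCost) {
            double totalCost() { return executionCost + dataMovementCost; }
        }

        static Estimate pickCheapest(List<Estimate> candidates) {
            // Ignoring movement cost could favor a fast platform that is
            // expensive to reach; hence we minimize execution + movement.
            return candidates.stream()
                    .min(Comparator.comparingDouble(Estimate::totalCost))
                    .orElseThrow();
        }

        public static void main(String[] args) {
            System.out.println(pickCheapest(List.of(
                    new Estimate("Spark", 10.0, 8.0),      // fast, but data must be shipped in
                    new Estimate("Postgres", 14.0, 0.0)))  // slower, but data is already there
                    .platform()); // -> Postgres
        }
    }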


In Section 5, we present Rheem's interfaces, with which users can easily code and run data analytic tasks. In particular, we present a dataflow language (RheemLatin) and a visual integrated development environment (Rheem Studio).


In Section 6, we detail three examples of real Rheem plans to better illustrate how developers can build applications using these interfaces.


Section 8 summarizes the limitations of Rheem. Finally, we discuss related work in Section 9 and conclude with some open problems in Section 10.

