Beside functional aspects, designers of Data-Intensive Applications have to consider various quality aspects that are specific to the applications processing huge volumes of data with high throughput and running in clusters of (many) physical machines. A broad set of non-functional aspects positioned in the areas of performance and safety should be included at the early stage of the design process to guarantee high-quality software development.

The evaluation of the correctness of such applications, and when functional and non-functional aspects are both involved, is definitely not trivial. In the case of Data-Intensive Applications, the inherent distributed architecture, the software stratification and the computational paradigm implementing the logic of the applications pose new questions on the criteria that should be considered to evaluate their correctness.

 

Data-intensive applications are commonly realized through independent computational nodes that are managed by a supervisor providing resources allocation and node synchronization functionalities. Message exchange is guaranteed by an underlying network infrastructure over which the (data-intensive) framework might implement suitable mechanisms to guarantee the correct message transfer among the nodes. The logic of the application is the tip of the iceberg of a very complex software architecture which the developer cannot completely govern. Between the application code and the deployed running executables there are many interconnected layers, offering abstractions and running control automatisms, that are not visible to the developers (such as, for instance, the containerization mechanisms, the cluster manager, etc.).

Besides the architectural aspects of the framework, the functionality of data-intensive applications requires, in some cases, a careful analysis of the notion of correctness adopted to evaluate the outcomes. This is the case, for instance, of streaming applications. The functionality of streaming applications is defined through the combination and concatenation of operations on streams, i.e., infinite sequences of messages originated from external data sources or by the computational nodes constituting the application. The operations can transform a stream into a new stream or can aggregate a result by reducing a stream into data. Sometimes, the operations are defined over portions of streams, called windows, that partition the streams on the basis of specific grouping criteria of the messages in the stream. The complexity and the variety of parameters defining the operations make the definition of the streaming transformation semantics not obvious and the assessment of their correctness far from being trivial.

In DICE, the evaluation of correctness concerns “safety” aspects of data intensive applications. Verification of safety properties is done automatically by means of a model checking analysis that the designer performs at design time. The formal abstraction which models the application behavior is first extracted from the application UML diagrams and later verified to check the existence of incorrect executions, i.e., executions that do not conform with specific criteria identifying the required behavior. Time and the ordering relation among the events of the application are the main aspects characterizing the formalism used for verification, that is based on specific extensions of Linear Temporal Logic (LTL). As already pointed out, since the technological framework affects the definition of correctness to be adopted for evaluating the final application, the formal modeling devised for DICE verification combines an abstraction of functional aspects with a simplified representation of the computational paradigm adopted to implement the application.

DICE verification is carried out by D-verT and focuses on Apache Storm and (soon) Spark, two baseline technologies for streaming and batch applications. The computational mechanism they implement is captured by means of logical formulae that, when instantiated, given a specific DTSM application model, represent the executions of the Storm (or Spark) application. The analyses that the user can perform, from the DICE IDE, are bottleneck analysis of Storm applications and worst time analysis of Spark applications (the latter is a work in progress).

In the first case, the developer can verify the existence of a node of a Storm application that cannot process the incoming workload with a timely computation. In such a case, the node is likely to be a bottleneck node for the application that might cause memory saturation and drop the overall performance. In the second case, the developer can perform a worst case analysis of the total time span required by a Spark application to complete a job. The overall job time, that must meet a given deadline at runtime, can be evaluated on the basis of a task time estimation, for the physical resources available in the cluster, that must be known before running the verification.

Details about verification techniques can be found in Deliverable D3.5 – Verification tool Initial Version and on the DICE Github repository.

Related material:

  1. Francesco Marconi, Marcello M. Bersani, Madalina Erascu, Matteo Rossi:
    Towards the Formal Verification of Data-Intensive Applications Through Metric Temporal Logic. ICFEM 2016
  2. Francesco Marconi, Marcello Maria Bersani and Matteo Rossi: Formal Verification of Storm Topologies through D-verT. SAC 2017

Marcello M. Bersani and Verification team (PMI)