PV226 Lasaris Seminar
Towards Antifragile Critical Infrastructure Systems
(Introduction)
Hind Bangui, Barbora Buhnova and Bruno Rossi*
* brossi@mail.muni.cz
Department of Computer Systems and Communications,
Lasaris (Lab of Software Architectures and Information Systems)
Masaryk University, Brno
www.lasaris.cz
2/18
(: Off-topic first :)
3/18
Where is Software Engineering heading?
ICSE 2022 submission results from Andreas Zeller
4/18
Student Research Competition at SAC’22
I suggest that interested students have a look at
the ACM SRC page → https://src.acm.org
5/18
(: Back to the topic :)
6/18
Where is Software Engineering heading?
https://iansommerville.com/technology/research-impact
7/18
Where is Software Engineering heading?
8/18
The Traditional View of Software Reliability & Resilience
●
Reliability is the probability that a system will work as designed
●
Resiliency can be described as the ability of a system to self-heal after damage, failure, load, or attack
●
Some assumptions in SE:
– Faults in Software / Hardware might lead to failures
– We can try to predict and take countermeasures based on the analysis of past history
– All models assume monotonic behaviour (i.e., that there is no concept drift)
– We can adapt systems based on our models of failure detection / location
The traditional view from I. Sommerville
9/18
The Traditional View of Software Reliability & Resilience
Figure: number of failures over a period of time
Review metric: how many faults were detected in the reviewed product?
X = A/B
A = absolute number of faults detected in review
B = estimated number of faults to be detected in review (using past history or a reference model)
●
We do not know, or cannot search through, the whole space of failures
●
We build models and use proxies (such as faults) to estimate failures and adapt systems ex post
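As a small worked example of the X = A/B review metric above, the following sketch computes the ratio for a few reviews. The function name, the module names, and the numbers are hypothetical and only illustrate the arithmetic.

```python
def fault_detection_ratio(detected: int, estimated: float) -> float:
    """Return X = A / B, where A is the number of faults detected in review
    and B is the estimated number of faults to be detected."""
    if estimated <= 0:
        raise ValueError("estimated number of faults must be positive")
    return detected / estimated


# Hypothetical review data: (A, B) per reviewed product.
reviews = {"module_a": (8, 10), "module_b": (3, 12)}
ratios = {name: fault_detection_ratio(a, b) for name, (a, b) in reviews.items()}

for name, x in ratios.items():
    print(f"{name}: X = {x:.2f}")
```

A ratio well below 1 suggests the review found fewer faults than the reference model expected, which is only as trustworthy as the estimate B itself.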
10/18
Defects Prediction as proxies
●
We can build models that predict defects in code
→ Dominik Arne Rebro, Software Metrics for Software Defects Prediction. Master’s Thesis, FI MU (supervisor: Bruno Rossi), to be defended in 2022.
F-measure distribution
●
It is assumed that more defects lead to more failures
●
We look into the code and improve it to avoid future failures (e.g., to see which modules require more attention)
●
We need to develop/refactor, redeploy, etc. This is an old view of how software systems are built
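The F-measure mentioned above is the usual way to score such a defect predictor. The sketch below computes it from per-module labels; the labels are made up for illustration and are not taken from the thesis.

```python
def f_measure(actual, predicted):
    """F1 score: harmonic mean of precision and recall for boolean labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)        # defective, flagged
    fp = sum(1 for a, p in pairs if not a and p)    # clean, flagged
    fn = sum(1 for a, p in pairs if a and not p)    # defective, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels per module: True means "defective".
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
score = f_measure(actual, predicted)
print(f"F-measure: {score:.3f}")
```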
11/18
Software Reliability Growth Models (SRGM)
●
We can fit candidate models to the cumulative failure data to see which one gives the best estimate of our failures; no single model fits all projects
→ Rossi, B., Russo, B., & Succi, G. (2010). Modelling failures occurrences of open source software with reliability growth. In IFIP International Conference on Open Source Systems (pp. 268-280). Springer, Berlin, Heidelberg
→ Chren, S., Micko, R., Buhnova, B., & Rossi, B. (2019). STRAIT: a tool for automated software reliability growth analysis. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). IEEE.
→ Radoslav Mičko, Software Reliability Growth Models for Open Source Software. Master’s Thesis, FI MU, 2022.
→ Chren, S., Micko, R., Buhnova, B., & Rossi, B. Applicability of Software Reliability Growth Models to Open Source Software, to appear.
Fitting all models for a project
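To make the curve-fitting idea concrete, here is a minimal sketch that fits one well-known SRGM, the Goel-Okumoto model mu(t) = a(1 - exp(-b t)), to cumulative failure counts by least squares. The slide does not say which models or fitting procedure were used, so the model choice, the grid search over b, and the synthetic data are all assumptions made for illustration.

```python
import math


def goel_okumoto(t, a, b):
    """Expected cumulative failures at time t: a * (1 - exp(-b * t))."""
    return a * (1.0 - math.exp(-b * t))


def fit_goel_okumoto(times, cumulative, b_grid):
    """Least-squares fit. For each candidate b the optimal a has a closed
    form, so only b needs to be searched. Returns (a, b, sse)."""
    best = None
    for b in b_grid:
        g = [1.0 - math.exp(-b * t) for t in times]
        denom = sum(x * x for x in g)
        if denom == 0:
            continue
        a = sum(y * x for y, x in zip(cumulative, g)) / denom
        sse = sum((y - a * x) ** 2 for y, x in zip(cumulative, g))
        if best is None or sse < best[2]:
            best = (a, b, sse)
    return best


# Synthetic data generated from known parameters (a=100, b=0.05).
times = list(range(1, 31))
observed = [goel_okumoto(t, 100.0, 0.05) for t in times]
b_grid = [i / 1000 for i in range(1, 1001)]
a_hat, b_hat, sse = fit_goel_okumoto(times, observed, b_grid)
print(f"a = {a_hat:.2f}, b = {b_hat:.3f}, SSE = {sse:.3g}")
```

Repeating the same procedure with other model families and comparing the residual error is one simple way to see which model suits a given project.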
12/18
Self-healing Systems
●
Modern systems of systems embrace failures
●
Have monitoring capabilities and can self-adapt to emerging situations
●
Can take action to restart services / processes → e.g., the circuit breaker pattern
Examples are Microservices
→ Štefanko, M., Chaloupka, O., Rossi, B., van Sinderen, M., & Maciaszek, L. (2019). The Saga pattern in a reactive microservices environment. In Proc. 14th Int. Conf. Softw. Technologies (ICSOFT 2019) (pp. 483-490). Prague, Czech Republic: SciTePress.
→ Zezulka, M., Chaloupka, O., & Rossi, B. (2021). Integrating Distributed Tracing into the Narayana Transaction Manager. In COMPLEXIS (pp. 55-62).
https://www.javacodegeeks.com/2016/01/self-healing-systems.html
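The circuit breaker pattern mentioned above can be sketched in a few lines. This is a simplified illustration, not the implementation of any particular framework: the class name, thresholds, and the injectable clock are assumptions chosen to keep the example self-contained.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure counter.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of piling requests onto a struggling service; after the timeout, a single successful trial call closes the circuit again.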
13/18
Self-healing Systems
What we are missing is the capability of systems to learn when self-adapting to failure
Learn from failures, take countermeasures, and self-adapt
This can be part of System-of-Systems modelling of “emerging behaviour”
14/18
Using Simulations to learn expected behaviour (1/2)
●
In previous work we created a testing management platform for Smart Grids based on the Mosaik
framework for co-simulations
●
We extended Mosaik with the disconnect method to remove edges from the dataflow graph and the entity
graph → A simple way to simulate node failures
→ Mihal, P., Schvarcbacher, M., Rossi, B., & Pitner, T. (2022). Smart grids co-simulations: Survey & research directions. Sustainable Computing: Informatics and Systems.
→ Schvarcbacher, M., Hrabovská, K., Rossi, B., & Pitner, T. (2018). Smart grid testing management platform (SGTMP). Applied Sciences, 8(11), 2278.
→ Gryga, L., & Rossi, B. (2021). Co-simulation of Smart Grids: Dynamically Changing Topologies in Failure Scenarios. In COMPLEXIS (pp. 63-69).
Smart Grids Testing Processes
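The idea of simulating a node failure by removing edges from a dataflow graph can be illustrated without Mosaik itself. The sketch below is not the Mosaik API and does not reproduce the disconnect extension described above, whose signature is not shown on the slide; the class, method, and node names are hypothetical.

```python
from collections import deque


class DataflowGraph:
    """Toy directed graph standing in for a co-simulation dataflow graph."""

    def __init__(self):
        self.edges = {}  # node -> set of downstream nodes

    def connect(self, src, dst):
        self.edges.setdefault(src, set()).add(dst)
        self.edges.setdefault(dst, set())

    def disconnect(self, node):
        """Remove every edge into or out of node, simulating its failure."""
        if node in self.edges:
            self.edges[node].clear()
        for targets in self.edges.values():
            targets.discard(node)

    def reachable(self, start):
        """Return all nodes that still receive data originating at start."""
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in self.edges.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


grid = DataflowGraph()
grid.connect("generator", "substation")
grid.connect("substation", "house_1")
grid.connect("substation", "house_2")

before = grid.reachable("generator")
grid.disconnect("substation")  # simulate the substation failing
after = grid.reachable("generator")
print(sorted(before), sorted(after))
```

Comparing reachability before and after the disconnect shows which consumers lose their data feed in the failure scenario.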
15/18
Using Simulations to learn expected behaviour (2/2)
●
What about comparing results from simulations and “real runs” to determine expected behaviours?
●
Systems can learn from both real runs and simulations → AI can help in determining the best course of action
●
Simulation → failure vs. reality → failure?
→ Cioroaica, E., Blanco, J., Rossi, B. Timing Model for Predictive Simulation, conference paper, under revision.
→ Bangui, H., Buhnova, B., & Rossi, B. Shifting Towards Antifragile Critical Infrastructure Systems. In Proceedings of the 7th International Conference on Internet of Things, Big Data and Security (IoTBDS 2022).
Needs a definition of what an anomaly is as well
Can be done at design time or at runtime (in real time)
TM = Temporal Model
TDT = Temporal Digital Twin
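One simple way to compare simulation results with real runs is to flag the points where the two diverge. The sketch below assumes an anomaly is an absolute deviation above a fixed tolerance; that definition, the function name, and the timing values are illustrative assumptions, since the slide only notes that a definition of anomaly is needed.

```python
def find_anomalies(simulated, observed, tolerance):
    """Return indices where |observed - simulated| exceeds the tolerance."""
    if len(simulated) != len(observed):
        raise ValueError("traces must have the same length")
    return [
        i
        for i, (s, o) in enumerate(zip(simulated, observed))
        if abs(o - s) > tolerance
    ]


# Hypothetical response times in milliseconds for the same five steps.
simulated_ms = [10.0, 12.0, 11.0, 10.5, 12.5]
observed_ms = [10.2, 11.8, 19.0, 10.4, 12.6]
anomalies = find_anomalies(simulated_ms, observed_ms, tolerance=1.0)
print("steps deviating from the simulation:", anomalies)
```

The same check could run at design time on recorded traces or at runtime on a sliding window, depending on where the comparison is needed.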
16/18
Maybe we need to move forward
from the concept of resilience…
17/18
This is where the next talk starts
18/18
Hind Bangui will have all the answers :)