PV226 Lasaris Seminar
Towards Antifragile Critical Infrastructure Systems (Introduction)
Hind Bangui, Barbora Buhnova and Bruno Rossi*
* brossi@mail.muni.cz
Department of Computer Systems and Communications, Lasaris (Lab of Software Architectures and Information Systems)
Masaryk University, Brno
www.lasaris.cz

2/18 (: Off-topic first :)

3/18 Where is Software Engineering heading?
ICSE 2022 submissions results from Andreas Zeller

4/18 Student Research Competition at SAC’22
I suggest that interested students have a look at the ACM SRC page → https://src.acm.org

5/18 (: Back to the topic :)

6/18 Where is Software Engineering heading?
https://iansommerville.com/technology/research-impact

7/18 Where is Software Engineering heading?

8/18 The Traditional View of Software Reliability & Resilience
● Reliability is the probability that a system will work as designed
● Resilience can be described as the ability of a system to self-heal after damage, failure, load, or attack
● Some assumptions in SE:
– Faults in software / hardware might lead to failures
– We can try to predict failures and take countermeasures based on the analysis of past history
– All models are based on monotonic behaviour (i.e., they assume that there is no concept drift)
– We can adapt systems based on our models of failure detection / location
The traditional view from I. Sommerville

9/18 The Traditional View of Software Reliability & Resilience
Nr. of failures over a period of time
How many faults were detected in the reviewed product?
X = A/B
A = absolute number of faults detected in review
B = number of estimated faults to be detected in review (using past history or a reference model)
● We do not know or cannot search through the whole space of failures
● We build models and use proxies (such as faults) to estimate the failures and adapt systems ex-post

10/18 Defects Prediction as proxies
● We can have prediction models that tell us which defects to expect in code
→ Software Metrics for Software Defects Prediction, Master’s Thesis, Dominik Arne Rebro – supervisor: Bruno Rossi, to be defended at FI MU 2022.
Figure: F-measure distribution
● It is assumed that the more defects → the more failures
● Look into the code and improve it to avoid future failures (e.g., to see which modules require more attention)
● We need to develop/refactor, redeploy, etc.
This is an old view of how software systems are built

11/18 Software Reliability Growth Models (SRGM)
● We can try to fit the cumulative failure data curve to see which models give a better estimate of our failures - clearly impossible to get one-size-fits-all models
→ Rossi, B., Russo, B., & Succi, G. (2010). Modelling failures occurrences of open source software with reliability growth. In IFIP International Conference on Open Source Systems (pp. 268-280). Springer, Berlin, Heidelberg.
→ Chren, S., Micko, R., Buhnova, B., & Rossi, B. (2019). STRAIT: a tool for automated software reliability growth analysis. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). IEEE.
→ Mičko, R. (2022). Software Reliability Growth Models for Open Source Software. Master’s Thesis, FI MU.
→ Chren, S., Micko, R., Buhnova, B., & Rossi, B. Applicability of Software Reliability Growth Models to Open Source Software, to appear.
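The curve-fitting idea on the SRGM slide can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the weekly failure counts are synthetic, the two candidate models (Goel-Okumoto and Musa-Okumoto logarithmic) are simply well-known SRGMs and not necessarily those used in the cited works, and the grid search over `b` is a simplification of the nonlinear least-squares or maximum-likelihood fitting a real tool would use.

```python
import math

def goel_okumoto_shape(t, b):
    """Shape of the Goel-Okumoto model: mu(t) = a * (1 - exp(-b * t))."""
    return 1.0 - math.exp(-b * t)

def musa_okumoto_shape(t, b):
    """Shape of the Musa-Okumoto logarithmic model: mu(t) = a * ln(1 + b * t)."""
    return math.log(1.0 + b * t)

def fit(shape, times, cumulative, b_grid):
    """Least-squares fit of mu(t) = a * shape(t, b).

    For a fixed b the model is linear in a, so the best a has a closed form;
    b itself is chosen by a plain grid search. Returns (a, b, sse).
    """
    best = None
    for b in b_grid:
        g = [shape(t, b) for t in times]
        denom = sum(x * x for x in g)
        if denom == 0.0:
            continue
        a = sum(y * x for y, x in zip(cumulative, g)) / denom
        sse = sum((y - a * x) ** 2 for y, x in zip(cumulative, g))
        if best is None or sse < best[2]:
            best = (a, b, sse)
    return best

# Synthetic cumulative failure counts per week (illustrative data only).
times = list(range(1, 11))
cumulative = [26, 45, 59, 70, 78, 83, 88, 91, 93, 95]

MODELS = {
    "goel_okumoto": goel_okumoto_shape,
    "musa_okumoto": musa_okumoto_shape,
}
b_grid = [i / 100 for i in range(1, 201)]  # candidate b values 0.01 .. 2.00

results = {name: fit(shape, times, cumulative, b_grid)
           for name, shape in MODELS.items()}
best_model = min(results, key=lambda name: results[name][2])

for name, (a, b, sse) in results.items():
    print(f"{name}: a={a:.1f} b={b:.2f} SSE={sse:.2f}")
print("best fit:", best_model)
```

Comparing the sum of squared errors per model is one simple way to rank candidates for a single project; a different project's failure curve may well favour a different model, which is the point made on the slide.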
Figure: Fitting all models for a project

12/18 Self-healing Systems
● Modern systems of systems embrace failures
● They have monitoring capabilities and can self-adapt to emerging situations
● They can take action to restart or isolate failing services / processes → e.g., the circuit breaker pattern
Examples are microservices
→ Štefanko, M., Chaloupka, O., Rossi, B., van Sinderen, M., & Maciaszek, L. (2019). The Saga pattern in a reactive microservices environment. In Proc. 14th Int. Conf. Softw. Technologies (ICSOFT 2019) (pp. 483-490). Prague, Czech Republic: SciTePress.
→ Zezulka, M., Chaloupka, O., & Rossi, B. (2021). Integrating Distributed Tracing into the Narayana Transaction Manager. In COMPLEXIS (pp. 55-62).
https://www.javacodegeeks.com/2016/01/self-healing-systems.html

13/18 Self-healing Systems
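The circuit breaker pattern mentioned for self-healing systems can be sketched as follows. This is a minimal, single-threaded illustration, not taken from any of the cited works: the class name, the `max_failures` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # failures tolerated before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"              # one trial call is allowed through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of hitting a service known to be unhealthy.
            raise CircuitOpenError("service temporarily isolated")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result

if __name__ == "__main__":
    def flaky():
        raise ConnectionError("backend down")

    breaker = CircuitBreaker(max_failures=2, reset_timeout=5.0)
    for _ in range(3):
        try:
            breaker.call(flaky)
        except (ConnectionError, CircuitOpenError) as exc:
            print(type(exc).__name__, "->", breaker.state)
```

Note that the breaker only isolates a failing dependency and probes it again after a timeout; it reacts to failures with a fixed rule and does not learn from them.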
What we are missing is the capability of systems to learn when self-adapting to failure
Learning from failures, taking countermeasures, and self-adapting
This can be part of System-of-Systems modelling of “emerging behaviour”

14/18 Using Simulations to learn expected behaviour (1/2)
● In previous work we created a testing management platform for Smart Grids based on the Mosaik framework for co-simulations
● We extended Mosaik with a disconnect method to remove edges from the dataflow graph and the entity graph → a simple way to simulate node failures
→ Mihal, P., Schvarcbacher, M., Rossi, B., & Pitner, T. (2022). Smart grids co-simulations: Survey & research directions. Sustainable Computing: Informatics and Systems.
→ Schvarcbacher, M., Hrabovská, K., Rossi, B., & Pitner, T. (2018). Smart grid testing management platform (SGTMP). Applied Sciences, 8(11), 2278.
→ Gryga, L., & Rossi, B. (2021). Co-simulation of Smart Grids: Dynamically Changing Topologies in Failure Scenarios. In COMPLEXIS (pp. 63-69).
Figure: Smart Grids Testing Processes

15/18 Using Simulations to learn expected behaviour (2/2)
● What about comparing results from simulations and “real runs” to determine expected behaviours?
● Systems can learn from running the system and the simulation → AI can help in determining what could be the best course of action
● Simulation → failure vs reality → failure?
→ Cioroaica, E., Blanco, J., Rossi, B. Timing Model for Predictive Simulation, conference paper, under revision.
→ Bangui, H., Buhnova, B., Rossi, B. Shifting Towards Antifragile Critical Infrastructure Systems. In Proceedings of the 7th International Conference on Internet of Things, Big Data and Security (IoTBDS 2022).
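Comparing a simulation with a “real run” can be sketched, under strong simplifying assumptions, as a step-by-step comparison of two time series. The function name, the `Deviation` record, the fixed `tolerance`, and the sample load values below are all hypothetical; they are not part of Mosaik or of the cited papers.

```python
from dataclasses import dataclass

@dataclass
class Deviation:
    step: int         # index of the time step
    simulated: float  # value predicted by the simulation
    observed: float   # value measured in the real run

def find_deviations(simulated, observed, tolerance):
    """Return the steps where the real run departs from the simulation
    by more than `tolerance` (a hypothetical, fixed anomaly threshold)."""
    if len(simulated) != len(observed):
        raise ValueError("both runs must cover the same time steps")
    return [
        Deviation(step, sim, obs)
        for step, (sim, obs) in enumerate(zip(simulated, observed))
        if abs(obs - sim) > tolerance
    ]

if __name__ == "__main__":
    # Illustrative load values (kW) for one grid node; not real measurements.
    simulated_load = [10.0, 10.5, 11.0, 11.0, 10.5]
    observed_load = [10.1, 10.4, 14.2, 11.1, 10.6]
    for dev in find_deviations(simulated_load, observed_load, tolerance=1.0):
        print(f"step {dev.step}: expected {dev.simulated}, got {dev.observed}")
```

The fixed threshold is the weakest part of this sketch: what counts as a deviation worth reacting to depends on how an anomaly is defined for the system at hand.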
Needs a definition of what an anomaly is as well
Can be done at design time or at runtime (in real time)
TM = Temporal Model
TDT = Temporal Digital Twin

16/18 Maybe we need to move forward from the concept of resilience…
17/18 This is where the next talk starts
18/18 Hind Bangui will have all the answers :)