Data Mining Statistical Methods
C&RT trees and Neural networks
Martin Sebera
Faculty of Sports Studies, Masaryk University

•ORCID: 0000-0003-3750-1549
•Scopus Author ID: 36165218300
•ResearcherID: M-9818-2018
•https://www.muni.cz/en/people/55084-martin-sebera

CONTENT
•Standard statistical methods
•Disadvantages, limitations ☹
•Data mining methods – classification, regression
•Benefits ☺
•C&RT trees, MLP neural networks
•Examples

Short list of frequently used methods
•Basic statistical characteristics (mean, median, standard deviation, …)
•t-test and ANOVA (analysis of variance)
•Correlation analysis and factor analysis
•Linear regression and the chi-squared (χ²) test
•Assumptions: normality of data, homogeneity of variances (→ parametric vs. nonparametric methods); nominal, ordinal and categorical variables cannot be combined in one model

Classification and regression
•Classification - the process of assigning patterns to given classes based on the features of the classified object
•Regression - the process of fitting a function to the data

Classification
•E.g. linear classifier, k-NN (k-nearest neighbors) classifier, trees, neural networks, ...
[Figure: KNN]

Regression (approximation)

Classification and regression tools
•Tools that are used for regression can almost always be used for classification.
•Not all classification tools are usable for regression.
•Artificial neural networks (MLP, RBF, ...)
•Trees and their variants
•Genetic and evolutionary techniques
•Classical discriminant methods - k-nearest neighbors, Bayesian classifier, support-vector machine (SVM), ...
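The remark that one tool can often serve both tasks can be illustrated with k-NN. The following is only a minimal sketch in plain Python, not the method used in the examples later in the deck: the toy data, the function names and the choice of Euclidean distance are all assumptions made for illustration. The same neighbour search feeds a majority vote (classification) or a mean (regression).

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def k_nearest(train, query, k):
    """Return the targets of the k training points closest to `query`."""
    ranked = sorted(train, key=lambda row: dist(row[0], query))
    return [target for _, target in ranked[:k]]


def knn_classify(train, query, k=3):
    """Classification: majority vote among the k nearest neighbours."""
    votes = Counter(k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: mean target value of the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(neighbours) / len(neighbours)


# Hypothetical toy data: (features, target) pairs.
labelled = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
            ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B")]
numeric = [((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 6.0), ((10.0,), 20.0)]

predicted_class = knn_classify(labelled, (1.1, 1.0), k=3)
predicted_value = knn_regress(numeric, (2.1,), k=3)
print(predicted_class, predicted_value)
```

Only the last step differs between the two functions, which is why a regression-capable tool usually adapts to classification with little effort.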
Data preparation
•Editing data to the standard dataset format
[Figure: table]

Data preprocessing
•Filtering - most often of noise (moving average; median filter for impulse-noise removal)
•Completion of missing data - only as a last resort; missing values are filled in from the mean or, respectively, the most frequent value
•Normalization - the goal is to unify the numerical ranges of values

Data preparation
•Dividing data into training, validation and test sets
•Training set - known output → classifier learning
•Validation set - known output, which we do not provide to the classifier (the classification result is compared with the real output) → validation
•Test set - known output; we measure the success of the classifier
•Most often the available data (patterns) are divided in the ratio 2 : 1 : 1

Classification and regression trees
•Principle: gradual hierarchical division of the data space into subgroups, so that the leaves of the resulting tree contain homogeneous groups of data belonging (in the case of classification) to one class.
•Based on the gradual division of the feature space (similar to searching in a botanical key).
•Classify an object into the corresponding class based on its features
•Simple, fast
•Easy visualization → good interpretation of results
•More resistant to outliers and missing values.
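The preprocessing and splitting steps from the Data preprocessing and Data preparation slides can be sketched as follows. This is a hedged illustration in plain Python: the function names, the window size of 3, the min-max form of normalization and the sample values are assumptions, not details taken from the slides.

```python
import random
from statistics import median


def median_filter(signal, window=3):
    """Replace each value by the median of its neighbourhood,
    which suppresses isolated spikes (impulse noise)."""
    half = window // 2
    return [median(signal[max(0, i - half):i + half + 1])
            for i in range(len(signal))]


def min_max_normalize(column):
    """Rescale a list of numbers to the common range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def split_2_1_1(patterns, seed=0):
    """Shuffle the patterns and split them 2 : 1 : 1
    into training, validation and test sets."""
    shuffled = list(patterns)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) // 2
    n_valid = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


noisy = [1, 1, 9, 1, 1]                 # hypothetical signal with one spike
filtered = median_filter(noisy)

heights_cm = [150, 160, 170, 180, 190]  # hypothetical measurements
scaled = min_max_normalize(heights_cm)

patterns = list(range(20))              # stand-in for 20 data rows
train, valid, test = split_2_1_1(patterns)
print(filtered, scaled, len(train), len(valid), len(test))
```

Shuffling before the split is a design choice that guards against any ordering in the data file; a fixed seed keeps the split reproducible.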
Classification accuracy
•We can influence the accuracy of the classification, and thus the structure of the tree, through:
•misclassification cost matrix
•a priori probabilities of the individual classes (priors)
•proportional representation of patterns of individual classes in the data (case weights, count variable)
•a change in one parameter affects the others (they are different expressions of the same thing)

Classification accuracy
•ROC curve (receiver operating characteristic) - the area under the plotted curve expresses the quality of the classification
•A graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
Sensitivity = TP / (TP + FN)
Specificity = TN / (FP + TN)
Positive predictive value = TP / (TP + FP)
Negative predictive value = TN / (FN + TN)
Efficiency = (TP + TN) / (TP + TN + FP + FN)
TP – true positive, TN – true negative, FP – false positive, FN – false negative

Classification trees (nonparametric method)
•Samples are classified linearly and hierarchically into a finite (small) predetermined number of classes.
•Sequence of decisions.
•The root contains the entire data set.
•Two (binary tree) or more branches grow from each node. Each leaf represents one of the groups.
•Construction: choose the variable that divides the data into the most homogeneous subgroups possible.

Classification trees (nonparametric method)
•Stopping the growth of the tree.
There are several ways to stop it:
•further subdivision is not statistically significant
•the size of the error in the subnodes (growth stops once the classification error in the subnodes reaches the specified value)
•number of samples in the terminal node
•number of terminal nodes

C&RT
•C&RT is a binary tree built by repeatedly splitting a node into two child nodes.
•Each internal node, starting with the root, represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).
•The leaf nodes of the tree contain an output variable (y), which is used to make a prediction.

Example 1
•Kratochvíl, J., Plch, L., Sebera, M., & Koritáková, E. (2020). Evaluation of untrustworthy journals: Transition from formal criteria to a complex view. Learned Publishing, 33(3), 308–322. https://doi-org.ezproxy.muni.cz/10.1002/leap.1299

Predatory journals: criteria
•Unambiguous determination of article processing charges
•Affiliation of editorial board members
•Journal ISSN on its website
•Review time
•Description of peer review
•Proclamation of indexing in WoS/Scopus/ERIH/Medline
•Accessibility of full texts
•Publisher

•An evaluation of 259 biomedical journals using the list of 8 criteria
•The criteria most often not met were sufficient editorial information and declaration of article processing charges.

•Vít, M., Reguli, Z., Sebera, M., Cihounková, J., & Bugala, M. (2016). Predictors of children's successful defence against adult attacker. Archives of Budo, 12, 141–150.
•The paper is based on the presumption that the probability of a child's successful defence against an adult attacker is influenced by a variety of variables with different predictive values.
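The C&RT splitting rule and the accuracy measures from the Classification accuracy slide can be sketched together on one numeric variable. This is an illustrative sketch only: the slides do not name a homogeneity measure, so Gini impurity (a common choice for classification trees) is assumed here, and the data, function names and the single-split "tree" are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(xs, ys):
    """Find the split point on one numeric variable that minimises
    the weighted Gini impurity of the two child nodes."""
    values = sorted(set(xs))
    best_threshold, best_score = None, float("inf")
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate midway between neighbours
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


def majority(labels):
    """Most frequent class in a leaf; used as the leaf's prediction."""
    return max(set(labels), key=labels.count)


def confusion_metrics(actual, predicted, positive=1):
    """Measures from the slide; assumes every denominator is non-zero."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (fn + tn),
        "efficiency": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical data: one input variable x and a binary output y.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1]

threshold, score = best_split(xs, ys)
left_class = majority([y for x, y in zip(xs, ys) if x <= threshold])
right_class = majority([y for x, y in zip(xs, ys) if x > threshold])
predicted = [left_class if x <= threshold else right_class for x in xs]
metrics = confusion_metrics(ys, predicted)
print(threshold, metrics)
```

A full C&RT tree would repeat `best_split` recursively in each child node, over all input variables, until one of the stopping rules listed above applies.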
EXAMPLE 2
288 defense situations were evaluated by 6 self-defense experts on 6 criteria.

•The best predictors: Active defence, Escape and Technical means
•Communication and Safe distance keeping varied in the fifth position.
•Guard position was found to be the weakest predictor.

Neural network (NN)
•A tool for nonlinear modeling.
•Many inputs generate an output that is a nonlinear function of the weighted sum of these inputs.
•The weights assigned to each of the inputs are obtained through a learning process in which the generated outputs are compared with the so-called target outputs.
•The deviations between the known values and the generated outputs serve as feedback for adjusting the weights.

Neural network
•NN is a method in artificial intelligence
•NN teaches computers to process data in a way that is inspired by the human brain.
•It is a type of machine learning process, called deep learning, that uses interconnected nodes or neurons in a layered structure that resembles the human brain.
•In other words, it is a very complex regression with one dependent variable and many independent variables

Neural network - MLP
•Multilayer perceptron (MLP): a class of feed-forward neural networks
•3 types of layers - the input layer, the hidden layer and the output layer
•Activation functions: define how the weighted sum of the inputs is transformed into an output from a node or nodes in a layer of the network

EXAMPLE 3
•Bernaciková, M., Kumstát, M., Buresová, I., Kapounková, K., Struhár, I., Sebera, M., & Paludo, A. C. (2022). Diagnosing and preventing chronic fatigue in Czech youth athletes: mobile application.
•Frontiers in Physiology, section Exercise Physiology.
In press.

[Figure 1 – A simple neural network, MLP 30-11-1 (30 input, 11 hidden and 1 output neurons). From Bernaciková M., Kumstát M., Buresová I., Kapounková K., Struhár I., Sebera M. & Paludo, A. C., Diagnosing and preventing chronic fatigue in Czech youth athletes: mobile application.]

Software
•STATISTICA: TIBCO Software Inc. (2020). Data Science Workbench, version 14. http://tibco.com.
•SPSS 28: IBM SPSS Statistics, 28.0.0.0 (190)
•I have to learn "R": language and environment for statistical computing, https://www.r-project.org/

Sources
•Kratochvíl, J., Plch, L., Sebera, M., & Koritáková, E. (2020). Evaluation of untrustworthy journals: Transition from formal criteria to a complex view. Learned Publishing, 33(3), 308–322. https://doi-org.ezproxy.muni.cz/10.1002/leap.1299
•Vít, M., Reguli, Z., Sebera, M., Cihounková, J., & Bugala, M. (2016). Predictors of children's successful defence against adult attacker. Archives of Budo, 12, 141–150.
•Bernaciková, M., Kumstát, M., Buresová, I., Kapounková, K., Struhár, I., Sebera, M., & Paludo, A. C. (2022). Diagnosing and preventing chronic fatigue in Czech youth athletes: mobile application. Frontiers in Physiology, section Exercise Physiology. In press.

All of this is statistical games ☺
Remember: the most important and most difficult part of any statistical calculation is the factual interpretation of the results!
Conclusion

•Thank you for your attention