Machine Learning
Jan Rygl
rygl@fi.muni.cz
jan.rygl@phonexia.com

Outline
1. What is machine learning
2. Common concepts
3. Techniques overview
4. Recommendations

Machine learning
Machine learning is the science of getting computers to act without being explicitly programmed. (Coursera)
The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output value within an acceptable range. (whatis.com)

ML consists of:
1. goal definition
2. data acquisition
3. data preprocessing
4. feature extraction
5. applying the ML model
6. output postprocessing (optional)
7. analysing the results
8. deploying the trained model

Goal definition
Problem type:
- multi-class classification (exactly one label per document),
- multi-label classification (zero to many labels per document),
- clustering (known or unknown number of clusters),
- regression (predict the correct numeric value),
- verification (compute the similarity of two documents),
- on-line learning (keep training the system on new inputs),
- reinforcement learning (update the system with delayed feedback, after inputs have been classified),
- ...

Goal definition
Scoring: Which results are good/acceptable/bad? What has priority? Am I allowed to make mistakes? Am I allowed to leave some inputs unanswered?

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 * Precision * Recall / (Precision + Recall)
Accuracy = (TP + TN) / (all documents)
Balanced accuracy = (TP / (TP + FN) + TN / (TN + FP)) / 2

Example: Bank searching for account hijacks
We want high recall (find all possible frauds).
We also need high precision: we have 10 people checking accounts manually, so we can accept at most 500 false positive cases per day.
The metric consists of two parts:
- precision > 98% (2% of flagged accounts corresponds to roughly 500 manual checks per day),
- the highest possible recall.
A possible score: recall if precision > 98%, otherwise 0.

Example: Training a language model
We want high accuracy.
For English and German we have many test examples; for smaller languages such as Czech we have fewer.
With plain accuracy we can end up with a model that is perfect for English and German and bad for the other languages.
We don't care about precision and recall as such, we care about correct results for every language.
Use balanced accuracy or F-measure: the most correct results possible for each language (F-measure is harder to explain to a customer).
Balanced accuracy is the average of per-class (here per-language) recall.
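None of these metrics has to be implemented by hand. A minimal sketch using scikit-learn's metrics module (assuming a reasonably recent scikit-learn; the labels below are made up for illustration only):

from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score)

# Hypothetical gold labels and predictions (1 = fraud, 0 = normal account).
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

print("precision        ", precision_score(y_true, y_pred))        # TP / (TP + FP)
print("recall           ", recall_score(y_true, y_pred))           # TP / (TP + FN)
print("F-measure        ", f1_score(y_true, y_pred))
print("accuracy         ", accuracy_score(y_true, y_pred))
print("balanced accuracy", balanced_accuracy_score(y_true, y_pred))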
Goal definition
Computing power and time:
- Do we have a GPU? Enough RAM? How many processor cores?
- Should the system be parallel or serial? Should the system be scalable?
- Is the system on-line (response in milliseconds)?
Market research (a state-of-the-art analysis) is recommended before starting any experiment: which algorithms need big machines, which techniques support on-line training, etc.

Data acquisition
Beware of GDPR: http://www.eugdpr.org/
- Buy (legal!) data.
- Collect and annotate.
- Crawl the web yourself vs. pay somebody to crawl it for you.
- Ask your partners/customers for data.
For specific tasks (analysis of bank accounts), a single data source is enough (account logs from the bank).
For general tasks (entity detection), many data sources will be needed (known entity collections, texts similar to the analysed texts).

Data preprocessing
- Deduplication.
- Remove meta-data and other information connected to labels that will not be present in real data, e.g. authorship attribution of anonymous e-mails trained on signed data.
- Cleaning (remove noise such as images, tables, quotes; not always applicable, depends on the task):
  - topic recognition: ignore tables with numbers, convert images to keywords (title, label), keep quotes;
  - authorship recognition: replace images and tables by an IMG tag, remove quotes.
- Possibly language analysis (tokenization, morphological and syntactic analysis, semantic analysis) and linking with other databases.
Start with something simple in the beginning. Use existing tools, don't reinvent the wheel. For example, compare a plain split, NLTK's word_tokenize and a small hand-written tokenizer:

from nltk.tokenize import word_tokenize

def my_cool_tokeniser(text):
    # Pad selected punctuation with spaces, then split on whitespace.
    for char in '(),.-':
        text = text.replace(char, ' ' + char + ' ')
    return [token for token in text.split(' ') if token and not token.isspace()]

text = "Ernest Joyce (c. 1875-1940) was a Royal Naval seaman."
print(text.split())
print(word_tokenize(text))
print(my_cool_tokeniser(text))

['Ernest', 'Joyce', '(c.', '1875-1940)', 'was', 'a', 'Royal', 'Naval', 'seaman.']
['Ernest', 'Joyce', '(', 'c.', '1875-1940', ')', 'was', 'a', 'Royal', 'Naval', 'seaman', '.']
['Ernest', 'Joyce', '(', 'c', '.', '1875', '-', '1940', ')', 'was', 'a', 'Royal', 'Naval', 'seaman', '.']

Which text output is better? Which tokenizer is better?

Feature extraction
- An iterative process.
- Begin with simple features and add complexity in later experiments if you are not successful.
- Think about the features -- many features help only for one data source.
- Features shouldn't be able to describe every document in the training set perfectly: e.g. a bag of words over all words of long books can match 100% of the training data easily. Features should capture some generalization, a rule.
- Use feature selection methods (e.g. entropy based) if you have too many features (depends on computing power and the number of instances).

Feature extraction
Good starters:
- bag of words
- stop words
- word n-grams
- character n-grams
Always try to precompute features from your own data; don't use public lists of best words / stop words / character n-grams. They are topic and style dependent.
Feature extraction requires a tuning data set:
- the data don't have to be annotated,
- they need to have the same format and style as the data to be classified,
- they must not be part of the train or test set.
If you extract features from the training data, you can overtrain -- train accuracy will be high, but test data won't be recognized well.
If you extract features from the test data, you are cheating -- the evaluation doesn't make any sense.
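A minimal sketch of these starter features using scikit-learn's CountVectorizer (the tiny corpus is made up for illustration); both word and character n-grams are handled by the same class:

from sklearn.feature_extraction.text import CountVectorizer

# A made-up tuning corpus; in practice use unannotated texts of the same
# format and style as the documents you will classify.
corpus = [
    "the bank confirmed the transfer",
    "the transfer was blocked by the bank",
    "please confirm your account details",
]

# Bag of words / word n-grams (unigrams and bigrams here).
word_vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 2))
word_features = word_vectorizer.fit_transform(corpus)
print(sorted(word_vectorizer.vocabulary_))

# Character n-grams (3-grams here), more robust to typos and morphology.
char_vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
char_features = char_vectorizer.fit_transform(corpus)
print(char_features.shape)

Stop-word candidates can likewise be derived from the tuning corpus (e.g. the most frequent tokens) instead of being taken from a public list.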
Preparing data for machine learning
Check that you have enough data. Prepare:
- a tune set (unlabeled, see feature extraction),
- a train set,
- a test set.
The train set is usually bigger than the test set, but the test set must be representative. Use a different source of data for testing if possible, and add to it some data of the same type as the training documents.

Not enough data
If you don't have enough data to build big enough train and test sets, use one of the following techniques:
Random sampling:
- repeat N times (e.g. N=10, n=50, ...):
  - select randomly 90% of the instances and use them for training,
  - the rest of the instances is used for testing,
  - evaluate the trained model;
- compile all evaluations (average, min, ...); the system performance is the compilation of the performances over the random samplings.
N-fold cross-validation:
- divide the data into N groups (folds),
- repeat N times (common values are 4 and 10): the N-th fold is used for testing, all other folds are merged into the training set, then evaluate the trained model,
- compile all evaluations, the same as in the previous technique.

Applying a machine learning model
1. Select the correct model.
2. Use the correct model correctly.
Select the "lightest" possible model which is able to process the data => you will save computing and programming time.
For Python programmers, I definitely recommend sklearn: http://scikit-learn.org/
A lazy programmer is a good programmer. Implementing an ML algorithm is good practice for school seminars, but students' implementations are less efficient and usually introduce bugs.

from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

ML models
Supervised:
- Naive Bayes
- Linear classifiers
- Decision trees
- Random forests
- SVM
- Neural networks
- ...
Unsupervised:
- Clustering algorithms (K-means, hierarchical clustering)
- Neural networks
- Outlier detection
- ...

Supervised ML
- Typical for smaller data.
- Data annotation is possible.
- We know what we want to find/predict.
- Usually not applicable for Facebook- and Google-like companies, but great for smaller companies and problems.

Scenario 1
- I don't have time to implement or to wait.
- I don't need to explain the results to somebody else.
- It should perform reasonably well.
Try Naive Bayes.

Naive Bayes (sklearn source)
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features. Given a class variable y and a dependent feature vector x_1 through x_n, Bayes' theorem states the following relationship:

P(y | x_1, ..., x_n) = P(y) * P(x_1, ..., x_n | y) / P(x_1, ..., x_n)

import time
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
gnb = GaussianNB()
start = time.time()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print("Duration: %0.2f seconds" % (time.time() - start))
print("Number of mislabeled points out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))

Scenario 2
- I have a lot of binary/multi-value features (e.g. a color feature taking values like green, yellow, brown).
- I want to explain the decision.
- Tuning should be intuitive.
Try decision trees.

Decision trees (sklearn source)
Decision Trees are a supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

(Output of the corresponding cell:)
Duration: ... seconds
Number of mislabeled points out of a total ... points : ...
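A minimal sketch in the same style as the Naive Bayes example above, again on the iris data (an illustration, not necessarily the original cell):

import time

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()
tree = DecisionTreeClassifier()
start = time.time()
y_pred = tree.fit(iris.data, iris.target).predict(iris.data)
print("Duration: %0.2f seconds" % (time.time() - start))
print("Number of mislabeled points out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))

# The learned rules can be printed for explanation purposes
# (export_text is available in newer scikit-learn versions).
print(export_text(tree, feature_names=list(iris.feature_names)))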
Scenario 3
If any of the following applies:
- I don't want a black box; the customer (a court, a security agency, a clever CTO) wants the system's decisions explained,
- I can devote time to analysing the outputs,
- I need fast predictions,
then linear classifiers are a perfect match.

import time
from sklearn import datasets
from sklearn.linear_model import SGDClassifier

iris = datasets.load_iris()
sgd = SGDClassifier()
start = time.time()
y_pred = sgd.fit(iris.data, iris.target).predict(iris.data)
print("Duration: %0.2f seconds" % (time.time() - start))
print("Number of mislabeled points out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))

Duration: ... seconds
Number of mislabeled points out of a total 150 points : ...

Analysis of the trained weights:
- positive value: the higher the value, the higher the chance of the label,
- negative value: the higher the absolute value, the lower the chance of the label,
- the label with the highest scalar product of weights and features is selected.

for idx, label in enumerate(iris.target_names):
    print('Label %s' % label)
    print(' '.join('weight[%s]=%0.2f' % (name, weight)
                   for name, weight in zip(iris.feature_names, sgd.coef_[idx])))

Label setosa
weight[sepal length (cm)]=... weight[sepal width (cm)]=... weight[petal length (cm)]=... weight[petal width (cm)]=...
Label versicolor
weight[sepal length (cm)]=... weight[sepal width (cm)]=... weight[petal length (cm)]=... weight[petal width (cm)]=...
Label virginica
weight[sepal length (cm)]=... weight[sepal width (cm)]=... weight[petal length (cm)]=... weight[petal width (cm)]=...

Scenario 4
Bigger data set, the previous methods don't work. More powerful techniques are needed:
- we want some readability: random forests,
- a black box is enough, but we don't have GPUs for neural networks: SVM.

Random forests
- There are tools for explaining a forest's decisions (the most common paths in the trees).
- The results can be analysed and are human readable.
- With a growing number of trees and a deeper structure, the power of the forest grows.
- Better for multi-value features.
- Can fail horribly for periodic functions, e.g. the sine function.

import time
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
clf = RandomForestClassifier(max_depth=2, random_state=0)
start = time.time()
y_pred = clf.fit(iris.data, iris.target).predict(iris.data)
print("Feature weights:", clf.feature_importances_)
print("Duration: %0.2f seconds" % (time.time() - start))
print("Number of mislabeled points out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))

Feature weights: [...]
Duration: ... seconds
Number of mislabeled points out of a total 150 points : ...

Support vector machines
For a long time the best ML for NLP problems. Currently the popularity of SVM is decreasing because of deep learning.
Features of SVM:
- hard to select the correct parameters (parameter tuning is required),
- slower than the previous methods, except for the linear kernel,
- works as a black box,
- more powerful; the number of features can be higher than the number of training samples.

Citing (sklearn):
The advantages of support vector machines are:
- Effective in high dimensional spaces.
- Still effective in cases where the number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
- If the number of features is much greater than the number of samples, avoiding over-fitting when choosing Kernel functions and the regularization term is crucial.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities in the sklearn documentation).
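Since parameter selection is the hard part of SVMs, a small sketch of a grid search over C, gamma and the kernel with scikit-learn's GridSearchCV (the parameter ranges are illustrative, not from the slides), again on the iris data:

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = datasets.load_iris()

# Candidate parameter values; the ranges are illustrative only.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1],
    "kernel": ["linear", "rbf"],
}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy: %0.3f" % search.best_score_)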
Scenario 5
We have a complex, difficult problem, OR we have images or other problems with binary features.
Use neural networks.

Neural networks
Steps:
Think about the features and how to extract them. E.g., using a bag of words containing 1000 words, the input layer consists of 1000 inputs:
- each sentence is represented as a vector of zeros with ones on the positions corresponding to the words in the sentence (one-hot encoding), or
- each sentence is represented as a vector of float numbers between 0 and 1: each word is encoded as 1000 floats, and each word's encoding differs from every other word's encoding.
Think about how to design the neural network:
- how many layers,
- size of each layer,
- neural network type,
- which components should be used.
Start small and simple and add complexity. Becoming an expert in neural networks is a big project:
- testing a hypothesis is very slow and you need good hardware,
- you end up playing with many configurations, components, parameters, tricks, ...
- the best solutions are usually found by luck -- the space of all neural networks is too big.
Play with the input data -- neural networks need a lot of data -- generate data automatically, add noise to existing data, or use NNs on unsupervised problems.
Sklearn supports only the simplest neural networks, e.g.:

from sklearn.neural_network import MLPClassifier

Multi-layer Perceptron classifier. This model optimizes the log-loss function using LBFGS or stochastic gradient descent.

Neural networks
To use neural networks in Python, you have two reasonable possibilities:
Keras with a Theano backend:
- "Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research."
- More user friendly, but the documentation is not great.
TensorFlow from Google: https://www.tensorflow.org/
- Ineffective on a small number of GPUs.
- The most effective on big projects with many GPUs available.
- Harder to use and learn.

Unsupervised algorithms
More creativity is needed. Manual result analysis is a must.

Data/result analysis
Use some baseline algorithm to set up expectations, e.g. guess labels randomly or guess the most common label (see the sketch below).
If you are not happy with the achieved results, analyse:
- your system (search for bugs, use logging for each component, check all components' inputs and outputs manually and with asserts),
- if the system learns weights, check which features are used and which are ignored,
- outputs: which classes are wrong, which are OK,
- inputs: what happens with fewer inputs, with sampling, with a randomized order, ...
- features: do they cover all cases?
If the results are better than you expected: again, check everything; the chance of a bug or of a flawed dataset is big.
If everything is perfect: if you have time, still check.
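A minimal sketch of such a baseline with scikit-learn's DummyClassifier ("most_frequent" always guesses the most common label, "uniform" guesses randomly); the iris data stands in for a real problem:

from sklearn import datasets
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

for strategy in ("most_frequent", "uniform"):
    baseline = DummyClassifier(strategy=strategy)
    scores = cross_val_score(baseline, iris.data, iris.target, cv=5)
    print("%s baseline accuracy: %0.2f" % (strategy, scores.mean()))

Any real model should beat both baselines by a clear margin; if it does not, suspect a bug or a flawed dataset.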
Deploying the trained model
Speeding up:
- more effective feature extraction,
- better data structures,
- using GPUs instead of CPUs for some techniques.
Reducing memory:
- iterators instead of lists,
- read structures from disk instead of holding them in RAM.
Writing result parsers:
- converting ML outputs to the desired format.
Documentation and evaluation:
- achieved precision/recall/accuracy, strong and weak points of the system,
- recommended system configuration,
- an estimate of the experiment duration for sample data set sizes.
Interface:
- a good CLI or GUI, a notebook, ...
Historically, models were rewritten in C++ for deployment; currently the speed is almost the same.

Questions?

Control checks
You are analysing the mood of a person from the first sentence of an opener message (example message omitted). Recommend some features.
You are given 100 annotated e-mails and 100 e-mails without labels to solve a classification problem. How do you divide the data into a training set, a testing set and a tuning set?
You have a small project on classification of salesman performance (the influence of the number of calls, appointments, overtimes and other factors on the sales of the examined person): which ML technique would you use and why? Select one from NN, SVM, Random forests, Decision trees, Linear classifiers, Naive Bayes, ...

Thank you