Subjects clustering
Jakub Vonšovský, Brno 2015

Motivation
● I completed Graphical Design and Typography – this area interests me, so what next?
● A recommender system for IS that offers subjects similar to those already completed is in development (Hana Bydžovská)
● Each cluster contains subjects similar to each other

Introduction
● A 3-phase algorithm: data preparation, distance measurement and cluster building
● The first two phases are written in Python (own work), about 45 KB in total
● The clustering algorithm is an external Java application used with its author's permission; my part is the common interface, the cluster quality metrics and an output that is understandable for our purposes

Phase I – Data grabbing
● The first version used supplied CSV files – a small set of subjects (63) that was also out of date
● Currently, whenever the list of subjects changes (now all subjects at the Faculty of Informatics are included), information about literature, annotations, … is downloaded
● For time reasons, downloading of grades was removed / not completed, as the required URL differs from semester to semester and from faculty to faculty

Phase I – Parsing demo

Phase II – Distance matrix
● For each attribute – teachers, fields, prerequisites, field constraints, supervisors, literature, objectives and syllabus (and grades) – the distance between two subjects is calculated
● These values are then weighted and summed into a single number
● Since in this case a bigger number means a better match, it has to be converted into a distance value where smaller means better

Phase II – Normalization
● For each row in the matrix:
● The first method uses the similarity to itself as a base number and subtracts all values from it
● The second method tries to improve the first one by normalizing the values into the interval <0 + ε, 1>
● The third method uses the logistic function to avoid values getting close to 0 or 1:
  P(t) = 1 / (1 + e^(-t))

Calculating distance – Jaccard Index
● J(A, B) = |A ∩ B| / |A ∪ B|
● A simple metric used for most attributes (authors of literature, fields, …)
● For teachers the set elements also carry weights – lecturer and deputy count 1, seminar tutor 0.5, and assistants have no value as they probably have no relationship with the subject area itself and fluctuate
● At first it was used for text as well, but TF-IDF was added later

Calculating distance – TF-IDF
● The TF-IDF of a word in a document is its number of occurrences divided by a number computed from the occurrences of that word across all documents (in this context a document is the syllabus or objectives text)
● This value is computed for every word in every document (even if it is zero)
● Cosine similarity is then run on these vectors, giving again a single similarity number

Calculating distance – text generally
● How to store words in Czech, a fusional language?
  – The first, simplistic approach uses only words longer than 3 letters and stores at most their first 6 letters
  – The second, "smarter" approach uses a stemming application I wrote several years ago
● After a few tests it turned out that TF-IDF is better than Jaccard and storing 6-letter prefixes is better than stemming (am I a bad / dumb programmer?) – a rough sketch of the TF-IDF comparison is shown below
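To make the TF-IDF step concrete, here is a minimal sketch of how the syllabus/objectives texts could be turned into TF-IDF vectors and compared with cosine similarity. This is an illustration, not the implementation used in the application: the function names, the input format (a dict mapping a subject code to its list of stored word prefixes) and the particular IDF formula (log of N over document frequency) are assumptions for the sketch.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Build a sparse TF-IDF vector for each document.

    `documents` is a dict {subject_code: list_of_tokens}, where tokens are
    the stored word prefixes (hypothetical input format).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    df = Counter()
    for tokens in documents.values():
        df.update(set(tokens))

    vectors = {}
    for code, tokens in documents.items():
        tf = Counter(tokens)
        total = len(tokens) or 1
        vectors[code] = {
            token: (count / total) * math.log(n_docs / df[token])
            for token, count in tf.items()
        }
    return vectors

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * vec_b.get(token, 0.0) for token, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Comparing every subject against every other subject with vectors of this kind is what gives the quadratic behaviour mentioned in the next point.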
● However, TF-IDF has high complexity – O(s² · w) for s subjects and w words – so the run takes about 20 minutes instead of 1.5, making it a bad bottleneck

Calculating distance – prerequisites
● A problematic part of the application – for example, it sometimes produces a higher similarity number for some other subject than for the subject itself
● When one subject is a prerequisite of another, this is rewarded by a high weight in the final sum
● The same holds for "banned" subjects – they have to be punished (by a high weight)
● Subjects with no relationship have the prerequisite weight set to 0

Normalization results
● After manual evaluation it seems that the first, "reverse values" method is again better than the other two "smarter" approaches
● On the test data the other two methods, for example, put "Artificial Intelligence I" next to "Essentials of General Logic"; another example, on the full data, is "Bioinformatics I" next to "Computers and Ergonomy"
● Maybe sparsely distributed values work better than dense ones, since the density is created artificially

Phase III – Clustering
● After the first two phases the application launches a clustering algorithm based on Murtagh's average-linkage clustering, whose only input parameter is the maximum cluster size (threshold)
● The Davies-Bouldin and Dunn indices were implemented to measure cluster "quality", but they turned out not to be very usable in our context – the two measures often contradicted each other, "good" numbers came out for not very usable clusters, …

Results

Three methods

Future?
● Some improvement suggestions:
  – Download grades and implement a proper metric for them
  – Replace the Jaccard Index with more advanced statistical methods
  – More clustering algorithms?

Thank you for your attention

Discussion helper I
● 6-letter prefixes:
                   PV080–PV017       PV080–IA101
  TF-IDF           0.319536851569    0.0131464657433
  JI syllabus      0.148648648649    0.0112359550562
  JI objectives    0.149253731343    0.0173913043478
● Basic stemming, same subject pairs:
  TF-IDF           0.227518352002    0.00310845584597
  JI syllabus      0.121495327103    0.0462962962963
  JI objectives    0.148936170213    0.030303030303

Discussion helper II
● Threshold          3.0     3.5     4.0     4.5     5.0
  Biggest cluster    11      23      54      127     350
  Davies-Bouldin     0.01    0.031   0.078   0.201   0.416
  Dunn               0.015   0.014   0.014   0.013   0.224
● Threshold          0.35    0.45    0.55    0.65    0.75    0.85    0.95
  Biggest cluster    11      20      52      106     133     150     418
  Davies-Bouldin     0.007   0.016   0.046   0.089   0.120   0.146   0.563
  Dunn               0.107   0.107   0.102   0.1     0.191   0.409   0.546
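For completeness, below is a minimal sketch of how the Davies-Bouldin and Dunn indices reported in the tables above can be computed when only pairwise distances are available. It is a medoid-based variant (the application works with a distance matrix rather than coordinates) and an illustration only, not the implementation that produced the numbers above; `clusters` (lists of subject codes) and `dist` (the pairwise distance function) are assumed inputs, and degenerate cases (identical medoids, all-singleton clusterings) are not handled.

```python
import itertools

def avg(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def medoid(cluster, dist):
    """Member with the smallest average distance to the rest of its cluster."""
    return min(cluster, key=lambda a: avg(dist(a, b) for b in cluster if b != a))

def davies_bouldin(clusters, dist):
    """Davies-Bouldin index (lower is better), medoid-based variant."""
    medoids = [medoid(c, dist) for c in clusters]
    # Intra-cluster scatter: average distance of members to their medoid.
    scatter = [avg(dist(m, x) for x in c) for c, m in zip(clusters, medoids)]
    ratios = []
    for i in range(len(clusters)):
        # Worst-case (scatter_i + scatter_j) / separation over all other clusters.
        ratios.append(max(
            (scatter[i] + scatter[j]) / dist(medoids[i], medoids[j])
            for j in range(len(clusters)) if j != i
        ))
    return avg(ratios)

def dunn(clusters, dist):
    """Dunn index (higher is better): min inter-cluster distance / max cluster diameter."""
    diameters = [
        max((dist(a, b) for a, b in itertools.combinations(c, 2)), default=0.0)
        for c in clusters
    ]
    min_separation = min(
        dist(a, b)
        for c1, c2 in itertools.combinations(clusters, 2)
        for a in c1 for b in c2
    )
    return min_separation / max(diameters)
```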