Text mining in reports from student stays in Czech enterprises Abstract— This research aims at finding differences between the organizations during innovation processes We analyze reports from student stay in the Czech enterprises with respect of innovation in the enterprise, or of potential of innovations. In this paper we analyze the student reports by means of natural language processing tools. We show which words are frequently used and how representative the students reports are. We show how text classification can be used for characterizing a company and what are the keywords - with respect to the company, with respect to the innovation. We compare the results with an observation of an expert. Company; Innovation process; Text mining I. Introduction The constant and rapid changes occurring in the markets force the companies look for the new ways how to survive and to be competitive. The customers make higher demands on the products they buy and on the other services. Products have to be introduced on the market faster and have to meet individual demands. Therefore it is necessary to bring new ideas and approaches to business and innovation is necessary in today conditions. A waste of text documents, like newspaper and magazine articles, reports etc., written by managers, newspapermen but also customer blogs or web pages – are available for most of companies. In this paper we explore some of them – reports written by students of Technical University of Liberec after their stay at a company. Organization of long term industrial trainee is realized during 17 year with cooperating companies. These companies include the automotive, ICT, services. At present, this cooperation is supported and financed by the European Social Fond Operational Program Human Resources 3.2. THEORY AND PRACTISE - support for university students to obtain internships for employers, transferring practical experience in teaching (CZ.1.07/2.2.00/07.0321). We use text mining methods for discovery of patterns that may be typical for company, or may characterize the company as innovative. The current business environment is characterized as highly turbulent, influenced by modern information and communication technologies, globalization, short innovation and production cycles and employees’ mobility. It is not easy to compete in such an environment; organizations have to utilize their corporate resources to the greatest possible extent. Such resources include finance, employees, tangible assets, technologies and also knowledge. As stated by P. Drucker (2001): “The most important, and indeed the truly unique, contribution of management in the 20th century was the fifty-fold increase in the productivity of the 'manual worker' in manufacturing. The most important contribution management needs to make in the 21st century is similarly to increase the productivity of 'knowledge work' and the 'knowledge worker.” In a similar manner, Nonaka (1995) states: “In an economy whose only certainty is uncertainty, knowledge is the only source to gain permanent competitive advantage.” Why do some entrepreneurs find successful business opportunities while others do not? One of the reasons is that they try to find new opportunities and try to innovate their business. Innovation research has progressed over more recent years. We can find a number of factors at three levels of analysis—the individual, work group, and the organization. (West, 2001, 2002; King & Anderson, 2002). Most innovations will be a mixture of emergent processes, adopted and adapted procedures which are in common usage elsewhere, and ideas which become sharpened over time by realistic limitations imposed by the organization (e.g., profitability, practicality of use, way of knowledge sharing, …), and so innovation researchers have almost exclusively focused upon cases and processes of relative novelty in organizations (West, 2002). In this paper we analyze the student reports by means of natural language processing tools. We show which words are frequently used and how representative the students reports are. We show how text classification can be used for characterizing a company and what are the keywords - with respect to the company, with respect to the innovation. We compare the results with an observation of an expert. In the following text we first describe the input data in Section 2. Section 3 concerns the most frequent words. In Section 4 we show how relevant the student reports are. In the following section we show which kind of reports are the most relevant and which the less relevant. Section 6 concerns prediction of innovative potencies of a company. We conclude with summary of the results and plans for future work. II. Data collection The reports document the realization during one year student’s education industrial trainee. This industrial trainee is part of studying bachelor course Computing and Business on Faculty of Economics Technical University of Liberec. The reports concern four companies. As a part of information may be private we will assigne letters A, B, C and D to them. A (iteg) is SME that…, B (autocont) is one of the biggest hardware sellers and software houses in Czech republic. C is a car producer, D (flores) is SME that … Reports include information about internal and external situations of company. It means how employees communicate between themselves, with customers and suppliers, how the management motivates employees, how the innovation processes are realized, how the company develop their internal and external processes, how the information and communication systems are used and some other detail information. We analyzed 138 texts that describe companies A, B, C and D from different point of view. The document collection contained 29074 tokens (words and diacritics) and 7323 different words. III. Descriptive characteristics For each company we first built an ordered list of words. Then we compared those list and looked for words that are frequent for a company and infrequent for the others. As each report describe a company from one of focuses (e.g. overview, technology, management, motivation, communication, collaboration, software use, education, brief evaluation of the stay etc.) we also analyzed the frequent words with respect to the focus. E.g. for B (a software house) it was the word solution that was the most frequent between substantives and adjectives. This word almost not appeared in text about other companies. Similarly for D, it was the word product when focused on Motivation. For C, the car producer, the word company was much more frequent then for A, B and D. It correspond with an expert observations. B is now more focusing into selling solutions instead of selling hardware and software. D (flores) is a new company that is developing a new product. In the C car factory, all processes are standardized and it takes usually longer time to change it. IV. results The observation described in the previous section brought only indirect evidence for relevancy of the reports. That is why we looked for direct evidence. We would try to learn classifier that would recognize which report concerned which company. The learning set contained 138 document. Each document was represented as a bag of words, i.e. each example was represented by a vector of length of 7323 (the number of unique words in the collection). Each item of the vector was equal to a numeric characteristics (importance) of the particular word. We tested three possibilities, from the simplest ot he most complex – Boolean (word appeared/not appeared in the document), a frequency of the word, and term-frequency/inverse-document-frequency (TF/IDF). We chose a word frequency, for its simplicity, because the overall accuracy was almost the same as for TF/IDF. We used decision trees, Bayes learning, SVM and instance-based learning and 10-fold cross validation. Overall accuracy was between in range of 67-88% with the highest accuracy for multinomial naïve Bayes classifier. When all words that contain the company ame and names of its proprietary products, accuracy decreased to 84% but that decrease is quite small. We also checked whether there is difference between men and women, however the difference was not strong enough. V. What kind of a report is relevant It was observed by the expert that some of document might be more important for company recognition then the other, and that it depends on the focus that the author used. In this experiment we always removed one kind of a focus. The biggest difference in accuracy was observed for two cases – for overview and for brief evaluation of a stay. After removing the text of brief evaluation the accuracy little increased, on about 3%. It may be explained by the fact that this text usually does not bring any information that concern the company itself. After removing an overview, what is actually an introduction of a company and brief description of the goal of the stay, the accuracy decreased on about 7%. VI. VII. Can we predict how inovative the company is? Innovations are principal for long-term growth of a company. Two companies, actually SMEs - flores and iteg - are very active. On the other side – two big companies rather concentrate to conservative solutions. In the last experiment we checked whether this fact – innovation - can be discovered automatically from the reports. We built two classes, the first containing two SMEs, the other containing the rest. We again used the same pre-processing methods as in the previous experiments and the same learning algorithms. For multinomial naïve bayes the overall accuracy reached 88% and we can conclude that the potential of innovations can be induced form the text. We also analyzed which words appeared to be most important for this discrimination: positive keywords, i.e. words that are frequent for SMEs concern projects, presentation. On the opposite side, words company, helpful have been typical for conservative companies. VIII.Conclusion In this paper we analyzed reports from student stays in four Czech enterprises. We showed what words are the most typical, how accyrate is the prediction of a company from the text and how accurately the innovative potencies may be predicted. As future work we plan to extend the document collecrtion with the information from the web (web presentations, news that concerns a company). We also intend to employ natural language processing tools – morphological disambiguation and shallow syntax analysis. Acknowledgment We thank This work has been partially supported by … Masaryk University . [DEL: :DEL] [DEL: :DEL] [DEL: :DEL] [DEL: :DEL] [DEL: :DEL] [DEL: :DEL] [DEL: :DEL] [DEL: :DEL] [DEL: :DEL] References Drucker, P.F. (2001), Management in 21st century, Management Press, Praha, ISBN 80-7261-021-X, p. 129 King, N., & Anderson, N. (2002). Managing innovation and change: A critical guide for organizations. London: Thompson Nonaka, I., Takeuchi, H. (1995), The Knowledge Creating Company, New York, Oxford Press, 1995, ISBN 0-19-509269-4 West, M. A. (2001). The human team: basic motivations and innovations. In N. Anderson, D. S. Ones, H. K. West, M. A. (2002). Sparkling fountains or stagnant ponds: an integrative model of creativity and innovation implementation within groups. Applied Psychology: An International Review, 51, 355–386 [1] Han J., Kamber M.: Data Mining: Concepts and Techniques. Elsevier 2006. [2] Mitchell T.: Machine Learning. McGraw Hill, New York, 1997. [3] … Proceedings of Znalosti 2010 Czech-Slovak AI conference, Jindřichův Hradec, 2010. [4] Witten I.H., Frank E.: Data Mining. Practical Machine Learning Tools and Techniques. Elsevier 2005 ________________________________