Unit 3 Statistics Task 1 Data - collocations Complete the list of collocations with the noun ´data´ with other examples. ADJ. E.g. We have collected the raw data and are about to begin analysing it. QUANT. E.g. One vital item of data was missing from the table. VERB + DATA E.g. They are not allowed to hold data on people's private finances. DATA + VERB E.g. This data reflects the magnitude of the problem. DATA + NOUN acquisition, capture, collection | entry, input | storage | access, retrieval | analysis, handling, management, manipulation, processing | exchange, interchange, transfer, transmission | protection, security | source | archive, bank (also databank), base (also database), file | system PREP. in the ~ We have found some very interesting things in the data. | ~ about Data about patients is only released with their permission. | ~ for We have no data for southern Spain. | ~ from My aim is to synthesize data from all the surveys. | ~ on data on the effects of pollution PHRASES the acquisition/handling/storage, etc. of data, a source of data (http://oxforddictionary.so8848.com/search?word=data) Task 2 “Lies, damned lies and statistics” A) Try to decide what is suspicious about the following statements: We hear a prediction that life expectancy will reach 150 years in the next century, based on simple extrapolation from increases over the past 100 years. B) Data Data are the raw material on which the discipline of statistics is built as well as the raw material from which individual statistics themselves are calculated. These data are typically numbers. In fact, however, data are more than merely numbers. To be useful, the numbers must be associated with some (1) ___________. For example, we need to know what the measurements are measurements of, and just what has been counted. To produce valid and accurate results when we (2) _________ our statistical analysis, we also need to know something about how the values have been (3) _________ . Did everyone we asked give answers to a questionnaire, or did only some people answer? If only some answered, are they properly (4) ____________ of the population of people we wish to make a statement about, or is the sample distorted in some way? We also need to know if a measuring instrument is (5)___________ and whether the data are up (6) _____ ________ . There is an infinite number of such questions which could be asked, and we need to be alert for any which could influence the conclusions we (7) ___________ . C) Measures of central tendency Measures of central tendency are numbers that describe what is average or typical of the distribution. These measures include the mean, median and mode and standard deviation. Match each definition with the type of average: a) It is the value such that half the numbers in the data set are larger and half are smaller b) It is the value taken most frequently in a sample c) It is the value found by adding all the numbers up and dividing by how many there are Which of the statistics would you choose when talking about average salaries? Which of the statistics would be suitable when presenting the number of children per family? D) Put the sentences into the correct order. Dispersion A) To illustrate, suppose that we have a set of a million and one numbers, taking the values 0,1,2,3,4,…, 1,000,000. Both the mean and the median of this set of values are 500,000. B) Averages, such as the mean and the median, provide single numerical summaries of collections of numerical values. They are useful because they give an indication of the general size of the values in the data. C) But, it is readily apparent that this is not a very ´representative´ value of the set. At the extremes, one value in the set is half a million larger and one value is half a million smaller than the mean and median. D) However, single summary values can be misleading. In particular, single values might deviate substantially from individual values in a set of numbers. E) Standard deviation Standard deviation is a number that indicates how much each of the values in the distribution deviates from the mean (or centre) of the distribution. If the data points are all close to the mean, then the standard deviation is close to ______. If many data points are far from the mean, then the standard deviation is ______ . If all the data values are equal, then the standard deviation is __________ . Here is how to calculate standard deviation. Based on the description, write the mathematical formula for standard deviation. - Calculate the mean (M) - Subtract the mean from each subject´s score (X) - Square the answer - Sum the squared scores (Σ) - Divide by the number of participants minus 1 - Take the square root of the answer F) Read the text about normal distribution and identify one piece of information that is false: A normal distribution of data means that most of the examples in a set of data are close to the ´average´, while a few examples are at one extreme or the other. Normal distribution graphs have these characteristics: - The curve has a single peak - It is bell-shaped - The mean (average) lies at the centre of the distribution and the distribution is symmetrical around the mean - The two ´tails´ of the distribution cross the x axis G) Skewness Measures of dispersion tell us how much the individual values deviate from each other. But they do not tell us in what way they deviate. In particular, they do not tell us if the larger deviations tend to be for the larger values or the smaller values in the data set. E.g. if there are five company employees, four of whom earned about $10,000 per year and one earned around ten times that, the measure of dispersion (the standard deviation, for example) would tell us that the values were quite widely spread out, but would not tell us that one of the values was much larger than the others. Indeed, the standard deviation for the five values $90,000, $89,999, $89,998, $889,997 and $1 is exactly the same as for the original five values. What is different is that the anomalous value (the $1value) is now very small instead of very large. To detect this difference, we need another statistic to summarize the data, one which picks up on and measures the asymmetry in the distribution of values. One kind of asymmetry in distribution of values is called skewness. Our original employee salary example, with one anomalously large value of $99,999, is right skewed because the distribution has a long ´tail´ stretching out to the single very large value. This distribution has many smaller values and very few larger values. In contrast, the distribution of values in which $1 is the anomaly, is left skewed, because the bulk of the values bunch together and there is a long tail stretching down to the single very small value. Draw a right and a left skewed curve. Task 3 Assignment Performance analyst Amy is a performance analyst for a professional football club. One of her key job roles is to analyse football matches to see how well the team and players are performing from a statistics perspective. ´Professional football is such big business now that football teams are looking for as much detail as they can about the performance of the team as a whole, and of the individual players, so that they can start to see who is performing well, and who isn´t. This has lots of influence over team selections as the statistics that I produce are often good indicators of whether or not a player is tired and needs a rest, or even if they look like they´re not putting in enough effort. I can also give players detailed breakdowns of what they have done during the game. For example, how many passes they have completed, how far they have run, how many shots they´ve had on target. I keep a record of this for them over the course of the season and we can use this to help develop the players. As well as being able to analyse players and our own team, I analyse lots of games of opposition teams if we have a big game coming up (for example, I might look at their star striker and be able to give the staff a breakdown of where they prefer to shoot from and how they score most of their goals) which the manager and the coaching staff find really useful – and sometimes they might even get me to compare the performance statistics of a couple of players that they´re interested in buying. One of the key problems I face is that sometimes the playing and coaching staff don´t understand some of the statistic that I present them with, so I do need to spend some time with them to explain what is going on.´ What type of statistics do you think would be useful to use in this type of sporting environment? How do you think you could make the statistics easier to understand? How much do you think players would appreciate this type of feedback?