Manipulating variables

MANIPULATING VARIABLES - WHY?

Often, when you are working with data, you need to make changes to your variables. This may be for a number of reasons. For example, you may have an income variable that is measured in pounds, but you would prefer to have income grouped into a number of categories, or even percentiles, instead. You may be examining educational qualifications, but you are only really interested in studying whether people went to university or not, and therefore you would collapse a variable into two categories (i.e. went to university, didn't go to university). You may have a very detailed ethnicity or race variable that was collected as an open-ended question - that is, the survey respondents could write in whatever they pleased. Clearly a categorical variable with hundreds of possible responses would be very nearly impossible to work with and therefore you may want to collapse these into more manageable categories. There are many reasons you would want to change the way your variables are presented.

One special instance is the case of missing values. When people answer surveys, they don't always answer all the questions, either because they would rather not or because not all the questions may be relevant to them. Survey researchers use a variety of techniques to capture the instances where a respondent may refuse to answer a question or that question isn't appropriate for the particular respondent. Often these values are recorded in the data spreadsheet as negative numbers, such as -1 for 'refused' or -9 for 'inapplicable'. Sometimes, they are stored as high numbers such as 99 or 999. There is no rule about this - it depends on the conventions of the particular researcher or research institute who collected the data.

MISSING VALUES

You can recode the missing values to one or more of the 27 values Stata recognizes as meaning 'missing'. These 27 are a dot (.) and a dot followed by a letter (.a to .z).
Stata actually treats the dot as an extremely high number, which is why you have to be careful when using the greater than symbol (>) if the variable has missing values. In our example data the data collectors decided to use negative numbers (i.e. -7, -8) to indicate non-response for a variety of reasons. The table below shows the output of the variable educ using the tabulate command (see Chapter 5).

    . tabulate educ

                  highest educational |
                        qualification |      Freq.     Percent        Cum.
    ----------------------------------+-----------------------------------
                              missing |         19        0.19        0.19
                     proxy respondent |        352        3.43        3.61
                        higher degree |        122        1.19        4.80
                         first degree |        598        5.83       10.63
                          teaching qf |        225        2.19       12.82
                      other higher qf |      1,207       11.76       24.58
                           nursing qf |        215        2.09       26.68
                         gce a levels |        985        9.60       36.27
                gce o levels or equiv |      2,086       20.32       56.60
           commercial qf, no o levels |        349        3.40       60.00
         cse grade 2-5,scot grade 4-5 |        411        4.00       64.00
                       apprenticeship |        262        2.55       66.55
                             other qf |         84        0.82       67.37
                                no qf |      3,349       32.63      100.00
    ----------------------------------+-----------------------------------
                                Total |     10,264      100.00

As you can see from the frequency table there are 19 cases with 'missing' values and 352 cases where this question was asked of a 'proxy respondent'. A proxy respondent is someone in the household who answers questions on the respondent's behalf if he or she is not able to participate. If we tabulate again, with the option nol (nolabel) asking for the values rather than the labels of the categories, we get:

    . tabulate educ, nol

        highest |
    educational |
    qualificati |
             on |      Freq.     Percent        Cum.
    ------------+-----------------------------------
             -9 |         19        0.19        0.19
             -7 |        352        3.43        3.61
              1 |        122        1.19        4.80
              2 |        598        5.83       10.63
              3 |        225        2.19       12.82
              4 |      1,207       11.76       24.58
              5 |        215        2.09       26.68
              6 |        985        9.60       36.27
              7 |      2,086       20.32       56.60
              8 |        349        3.40       60.00
              9 |        411        4.00       64.00
             10 |        262        2.55       66.55
             11 |         84        0.82       67.37
             12 |      3,349       32.63      100.00
    ------------+-----------------------------------
          Total |     10,264      100.00

We see that -9 is the value given to 'missing' cases and -7 to 'proxy respondent' cases. For this variable we would choose to recode both -9 and -7 to be missing values. There are a number of ways to specify missing values. The first uses the recode command, where we recode both the negative numbers to a dot.

    recode educ -9=. -7=.

or

    recode educ -9/-7=.

The forward slash / is Stata notation for a range of values - in the above example, values -9 through -7. Remember to put the lowest value first. There are no cases with a value of -8, but that doesn't matter.

If you wanted to keep the reasons for the missing values separate then you could use the dot followed by a letter values such as

    recode educ -9=.a -7=.b

Also, you can recode more than one variable at a time if the variables in your list have the same values to change. For example, if our variables education (educ) and housing tenure (tenure) had the same range of missing values to recode, we could use

    recode educ tenure (-9/-1=.)

or

    recode educ tenure (-9=.a) (-7=.b)

Where there is more than one variable in the list, the recode value must be in parentheses. The recode command is also used to manipulate the non-missing values of variables, and this is covered in the next section.

You must be sure that the 'missing' values do not have a meaningful value. That is, in our example data, some of the negative values you may think to recode to missing values can represent a real value, such as zero, for a variable. To illustrate this we shall look at the variable ncigs, which measures how many cigarettes a respondent smoked per day. First, let us get some descriptive information on the variable.

    . ta ncigs

              number of |
             cigarettes |
                 smoked |      Freq.     Percent        Cum.
    --------------------+-----------------------------------
        missing or wild |         20        0.19        0.19
           inapplicable |      6,949       67.70       67.90
       proxy respondent |        352        3.43       71.33
                refused |          1        0.01       71.34
             don't know |          1        0.01       71.35
    less than 1 per day |         41        0.40       71.75
                      1 |         39        0.38       72.13
                      2 |         55        0.54       72.66
                      3 |         50        0.49       73.15
                      4 |         31        0.30       73.45
                      5 |        162        1.58       75.03
                      6 |         59        0.57       75.60
                      7 |         53        0.52       76.12
                      8 |         62        0.60       76.72
                      9 |         14        0.14       76.86
                     10 |        501        4.88       81.74
    etc
                     60 |          9        0.09       99.99
                     80 |          1        0.01      100.00
    --------------------+-----------------------------------
                  Total |     10,264      100.00

We can see that several respondents smoked less than one per day. We can find out the variable values as well.

    . ta ncigs, nol

      number of |
     cigarettes |
         smoked |      Freq.     Percent        Cum.
    ------------+-----------------------------------
             -9 |         20        0.19        0.19
             -8 |      6,949       67.70       67.90
             -7 |        352        3.43       71.33
             -2 |          1        0.01       71.34
             -1 |          1        0.01       71.35
              0 |         41        0.40       71.75
              1 |         39        0.38       72.13
              2 |         55        0.54       72.66
              3 |         50        0.49       73.15
              4 |         31        0.30       73.45
              5 |        162        1.58       75.03
              6 |         59        0.57       75.60
              7 |         53        0.52       76.12
              8 |         62        0.60       76.72
              9 |         14        0.14       76.86
             10 |        501        4.88       81.74
    etc
             60 |          9        0.09       99.99
             80 |          1        0.01      100.00
    ------------+-----------------------------------
          Total |     10,264      100.00

All respondents were asked the question 'Do you smoke?'. If they answered 'Yes', then they were asked the second question, 'How many cigarettes do you smoke a day?'. If they answered 'No' to the first question then they were not asked the second question. This can be seen by the number of respondents with a value of -8 in the variable ncigs. This number should correspond to the number of respondents who answered 'No' to the first question. As well, there is a value of 0, which is labelled as less than 1 per day. But these are occasional smokers, so are they really zeros? If you wanted a variable that indicated the number of cigarettes smoked per day then you could recode the negative values of ncigs in this way:

    recode ncigs 0=1 -9=. -8=0 -7=. -2=. -1=.

Or, if you wanted to keep the reasons for missing values separate:

    recode ncigs 0=1 -9=.a -8=0 -7=.b -2=.c -1=.d

[...]

    mvdecode _all, mv(-8)

etc.
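To see why the choice of missing-value code matters, the do-file sketch below uses a made-up six-case data set. The id variable and all of the values are hypothetical and are not taken from our example data; only the variable names ncigs and tenure are borrowed from it. The sketch recodes non-response codes to extended missing values, shows how the greater than symbol behaves when missing values are present, and then uses mvdecode on a single variable.

```stata
* A do-file sketch with made-up data (all values are hypothetical)
clear
input id ncigs tenure
1  -9   1
2  -8   2
3  -7  -9
4   0   3
5  10  -1
6  20   2
end

* Keep the reasons for non-response apart with extended missing values
recode ncigs (-9 = .a) (-7 = .b) (-8 = 0) (0 = 1)

* Missing values count as very high numbers, so > picks them up too
count if ncigs > 10
assert r(N) == 3        // case 6 plus the two missing cases
count if ncigs > 10 & !missing(ncigs)
assert r(N) == 1        // case 6 only

* mvdecode sets a range of codes to missing (.) in one step
mvdecode tenure, mv(-9/-1)
assert missing(tenure) if inlist(id, 3, 5)
```

The assert lines simply stop the do-file if a count is not what we expect, which is a cheap way of checking a recode before moving on.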
Also, the mvdecode command can be used with single variables or lists of variables and does not have to be used with _all:

    mvdecode mastat, mv(-9/-1)
    mvdecode tenure-fimn, mv(-9/-7)

The use of negative numbers to indicate values likely to be changed to missing is good practice. It has its limitations, as we have shown above. Another time when negative numbers are not appropriate is when you have standardized scores, as these variables have negative numbers as legitimate values. As we mentioned previously, some data sets use high numbers such as 99 or 999 to indicate non-response (no answer). Additional care is needed when recoding these to missing and especially using the global recode command mvdecode. For example, your data may have a variable for age that uses 999 as the indicator for non-response as well as a variable for five categories of marital status that uses 9 as the value for non-response. If you simply used the global mvdecode _all command for the range 9/999 then all respondents aged 9 and over would be given a missing value in the age variable. Whatever system is used to indicate non-response, you need to exercise care when declaring values to be missing.

It is very important to note that there is no way to undo the mvdecode command. You are well advised to be absolutely sure you want to run this command. And you should never overwrite your original data sets - always work with copies.

As you may have gathered by now, there are usually half a dozen different ways to do what you want to do in Stata. As you become more familiar with its possibilities you will find the way that suits you best. In the numerous courses that we have taught on Stata, not one has passed without a student showing us something that we didn't know!

CREATING NEW VARIABLES OUT OF EXISTING VARIABLES

You may want to make new variables out of one or more existing variables.
For example, you might want a variable that denotes women who are self-employed. To make this variable, you would need information from two variables that already existed in your data set: gender (sex) and job status (jbstat).

When you are creating variables, you often need to use mathematical expressions. The following are recognized by Stata:

    +    addition
    -    subtraction
    *    multiplication
    /    division
    ^    raise to the power

Parentheses are used to control the order in which calculations are done. Parentheses may be nested within other parentheses, many layers deep.

Before moving on to more advanced variable manipulation, it is necessary to understand the logical operators used in Stata.

    !     not                      >     greater than
    ==    equal (two equals signs) >=    greater than or equal to
    !=    not equal                <     less than
    &     and                      <=    less than or equal to
    |     or

Stata also recognizes a myriad of mathematical functions. The ones you will probably use most are:

    sqrt(x)            square root
    log(x) or ln(x)    natural logarithm
    exp(x)             exponent (e to power)
    int(x)             integer

Type help functions for the full range of functions supported by Stata.

Box 3.1: Shortening commands
The shortening of commands can be used on almost all [...]

    [...]
     living as couple |       674          0          0 |       674
              widowed |         0        866          0 |       866
             divorced |         0        434          0 |       434
            separated |         0        189          0 |       189
        never married |         0          0      2,081 |     2,081
    ------------------+---------------------------------+----------
                Total |     6,683      1,489      2,081 |    10,253

Generating a new variable with information from one existing variable is the simplest type of variable creation. Returning to our example of a variable that indicates women who are self-employed, we could do this in a number of different ways; we first use step-by-step approaches so you can see how the different commands work, and then we use a single command to demonstrate Stata's use of logic in constructing new variables.
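Before building the self-employed women variable, it may help to try the operators and functions listed earlier in this section on a small made-up data set. In this do-file sketch the variable names income and hours, and all of the values, are hypothetical; they are not part of our example data.

```stata
* A do-file sketch with made-up data (names and values are hypothetical)
clear
input income hours
400 40
900 36
  . 20
end

gen hourly   = income / hours     // division
gen income2  = income^2           // raise to the power
gen root     = sqrt(income)       // square root
gen lnincome = ln(income)         // natural logarithm
gen whole    = int(hourly)        // integer part only

* Any calculation on a missing value gives a missing result
assert hourly == 10 in 1
assert missing(hourly) in 3

* Logical operators combine conditions after if
count if income >= 500 & !missing(income)
assert r(N) == 1                  // only the second case
count if hours < 30 | income == 400
assert r(N) == 2                  // the first and third cases
```

Note that the first count needs the !missing(income) element: without it, the third case would also be counted because its missing income is treated as a very high number.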
The self-employed have the value 1 in the jbstat variable (10 categories) and the missing values have been recoded to a dot (.), while women have the value 2 in the sex variable and there are no missing values on the sex variable. The step-by-step way is to generate a new variable (fem1) and set all values to missing (.), and then to replace the values as they meet specific conditions. The command replace allows you to change the contents of a variable, so to speak.

    gen fem1=.
    replace fem1=1 if jbstat==1 & sex==2
    replace fem1=0 if jbstat>=2 & jbstat<=12
    replace fem1=0 if sex==1 & jbstat==1

This will return a value of 1 for all women who are self-employed. We are then left to decide what to do with the rest of the values. A variable must take more than one value to vary. Often, a researcher will use dummy variables to denote 1 for the category of interest and 0 for cases that do not meet this condition, so we replace the new variable with zeros where the cases are women who are not self-employed and all men. The third line replaces those who have an employment status but are not self-employed with a 0. The fourth line replaces men who are self-employed with a 0.

The step-by-step way of creating this new variable is cumbersome, but it does help to understand what is actually happening in your data with each command. However, we hope that you are able to move quickly on to use more efficient commands. We can first reduce the four commands to three:

    gen fem2=1 if jbstat==1 & sex==2
    replace fem2=0 if jbstat>=2 & jbstat<=12
    replace fem2=0 if sex==1 & jbstat==1

In the first command line the generate (gen) command creates a new variable fem2 with cases equal to 1 for women who are self-employed, and then the rest of the cases are automatically set as missing (.). The last two command lines are the same as in the first way, replacing all cases that are not self-employed or are self-employed men with a zero.
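The step-by-step logic can be checked on a handful of made-up cases. The do-file sketch below is only an illustration: the six cases are hypothetical, the new variable is called fem so that it does not clash with anything in the example data, and we assume the 10 job status categories are numbered 1 to 10.

```stata
* A do-file sketch with made-up cases (values are hypothetical)
clear
input jbstat sex
1 2
1 1
3 2
4 1
. 2
. 1
end

gen fem = .                                    // start with everything missing
replace fem = 1 if jbstat == 1 & sex == 2      // self-employed women
replace fem = 0 if jbstat >= 2 & jbstat <= 10  // not self-employed
replace fem = 0 if sex == 1 & jbstat == 1      // self-employed men

list, clean

* Cases with no job status stay missing rather than becoming zeros
assert missing(fem) if missing(jbstat)
assert fem == 1 in 1
assert fem == 0 in 2/4
```

Because the third command gives an upper limit for jbstat, the two cases with a missing job status are not swept into the zero category.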
Note here that we do not use jbstat !=1 (jbstat not equal to one) as this would code the cases who are missing on jbstat to zero. We can now reduce the three commands to two:

    gen fem3=1 if jbstat==1 & sex==2
    replace fem3=0 if ((jbstat>=2 & jbstat<=10) | ///
       sex==1) & !missing(jbstat)

In this way the last two lines are combined to replace cases who are not self-employed or are men as zeros. It is necessary to include the last element of the command line - & !missing(jbstat) - in order not to include men who have missing values on the jbstat variable in the zeros. This element means 'not missing values on the jbstat variable'.

The most efficient way to create this new variable is to use Stata's logic for constructing new variables, which means it can be done in one command line:

    gen fem4=jbstat==1 & sex==2 if ///
       !missing(jbstat, sex)

In this form of the gen command, cases who are self-employed (jbstat==1) and women (sex==2) are automatically given the value 1 while all other cases are given zeros except for those who are missing on either jbstat or sex - if !missing(jbstat, sex) - in which case they are given a missing value (.). Note the use of the if in this way rather than the & in the last line of making the fem3 variable. Including the variable sex in the parentheses after !missing is not strictly necessary as we know in our data that there are no missing values on that variable.

A check on the summary statistics (see Chapter 5) of the new variables shows that they are identical:

    . su fem*

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            fem1 |      9912    .0204802    .1416433          0          1
            fem2 |      9912    .0204802    .1416433          0          1
            fem3 |      9912    .0204802    .1416433          0          1
            fem4 |      9912    .0204802    .1416433          0          1

DUMMY VARIABLES

One very common reason to manipulate pre-existing variables is to create dummy variables, or dichotomous variables, with the values of 0 and 1. Dummy variables are a special case of nominal variables.
Remember, nominal variables are those which measure qualitative characteristics such as sex, religion, or marital status. There are two reasons why we use dummy variables, and they are based upon what you should already know about nominal variables:

• The only measure of central tendency that makes sense for a nominal variable is the mode. The mean and median for nominal variables are not meaningful. Although you can get Stata or any other software program to produce a mean and a median for nominal variables, that doesn't mean they have any meaningful value.
• There are lots of nominal-type characteristics that are important predictors of the things that social scientists are interested in - but because we can't determine a 'meaningful mean' or median, we need to make a special adjustment so that we can use nominal variables in multivariate statistics.

See Box 3.2 if you've heard of dummy variables but aren't exactly sure what they are or how to interpret them.

Box 3.2: Dummy variables
Let's look at a typical dummy variable: sex. Sex is a nominal variable and a predictor of many dependent variables [...] 0 = male. When we transform our variable into a dummy variable like this, it becomes possible to calculate a mean that we can interpret. If we add up the values of the variable sex above and divide by the number of cases, we get:

    (1 + 1 + 0 + 0 + 0 + 1 + 1 + 0 + 1 + 0)/10 = 0.5

Because we've transformed the variable into 1s and 0s, what this number represents now is the proportion of the sample that takes the value 1 on the dummy variable; that is, the proportion of the sample that is female. So typically, you would see a dummy variable such as this presented in a table of descriptive statistics like this:

                        |  Mean   Standard Dev.   Minimum   Maximum
    Sex (1 = female)    |  0.5    ...             0         1

Sometimes the note '(1 = female)' isn't presented and the authors just put 'Female' instead of 'Sex' as the variable name, assuming that readers will understand that it [...]

[...]

    [...]=3 & ///
       !missing(educ)

(d) Use the generate command in one line:

    gen degree4=educ<=2 if !missing(educ)

It is very important in (c) and (d) to have the !missing(educ) element at the end - note that one is a & and the other is an if. If you make another variable (degree5) where you omit this part from (d), and then do a frequency distribution on both, you will see that they are different.

    gen degree5=educ<=2
    tab1 degree4 degree5

    . tab1 degree4 degree5

    -> tabulation of degree4

        degree4 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      9,173       92.72       92.72
              1 |        720        7.28      100.00
    ------------+-----------------------------------
          Total |      9,893      100.00

    -> tabulation of degree5

        degree5 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      9,544       92.99       92.99
              1 |        720        7.01      100.00
    ------------+-----------------------------------
          Total |     10,264      100.00

Where have the missing values gone? Without the !missing(educ) element of the command, Stata has put all the missing values into the 0 category of the dummy variable degree5. This is because Stata recognizes missing values (dot, .a, etc.) as a very high number, and every time we use < or > we must ensure that we tell Stata not to include missing values!

You may wonder why, in (d), Stata knows to make a variable coded 1 and 0. Nowhere in the code are 0 and 1 explicitly stated, like in the other examples. This has to do with how Stata understands double and single equal signs. In (d), we are setting a condition and if it is met, Stata logically assigns 1 to the condition and 0 to the cases where the condition is not met. See Box 3.3 on double [...]

Now we have created five variables measuring whether or not [...]

Box 3.3: Double and single equals signs
New users to Stata often get confused as to when single (=) or double (==) equals signs should be used. A simple rule of thumb is that a single equals sign is used to assign the value on the right [...] to the left. For example, in the command [...] a new variable [...] has been created which is the same as the variable [...]. We have told Stata that [...] is going to be the same. A double equals sign, on the other hand, is used to check whether the value to the right is the same as the value to the left. It is used for comparison purposes, rather than [...]. For example, the command

    summarize if sex==2

would return summary statistics for all variables in the data set [...]. So if you are going to set the values of a variable to something, use a single equals sign, and if you are checking for equality, use the double equals sign.

Often when we want to include nominal (categorical) variables in statistical estimations, it is necessary to create a series of dummy variables from the original variable. This is done easily by combining the tab and gen commands. For example, if we wanted to create a series of dummies for the job status variable:

    tab jbstat, gen(jobstat)

If you look at the bottom of your variable list, you will see that a number of dummy variables have been created (jobstat1, jobstat2, jobstat3, etc). You can do a frequency distribution of these variables like any other variable. You will see that they are all dichotomized into a 0/1 format.

LABELLING VARIABLES

From our previous example on page 50 we have a new marital status variable (mastat2). Having done the frequency distribution (tab) of the variable, we can see that there are no labels attached to the values of the variable.
Now we want to label the variable and values so that we don't forget what each number stands for. Labelling in Stata is a bit different for those used to labelling variables in other statistical software packages. There are essentially three steps:

1. Create a label for the variable itself.
2. Define a label for values.
3. Attach that label to a specific variable.

We first label our variable:

    label var mastat2 "simpler marital status"

Now our variable mastat2 has a label 'simpler marital status'. If you tab mastat2 you will see this new label and it will also be in your variable list.

Now we want to label the values of the categories of mastat2. To do this use the command label define. First, we have to define a name for our labels. We are calling these labels 'marital'.

    lab def marital 1 "partnered" 2 "prev married" ///
       3 "single"

If you make an error when you are using this command and try to do it again, you will likely get a message that the labels have already been defined. Stata will not let you overwrite labels unless it is sure that you actually want to! You can assure Stata that you intend to overwrite labels by adding the option modify. Try:

    lab def marital 1 "partnered" 2 "prev married" ///
       3 "single", modify

It is important to realize that at this point you have defined a label, but this label is not yet attached to any variable categories. The label exists separately from any variable. If you type label list you will see all the labels in your data set. The label for 'marital' will be there with all the category values defined as we have entered them above.

Now we must attach these category labels to the variable using the procedure label values.

    lab val mastat2 marital

This tells Stata to attach the values we assigned to the label 'marital' to the variable mastat2. Once value labels have been defined they can be applied to other variables. If we now tabulate the new marital status variable we get:

    . ta mastat2

         simpler |
         marital |
          status |      Freq.     Percent        Cum.
    -------------+-----------------------------------
       partnered |      6,683       65.18       65.18
    prev married |      1,489       14.52       79.70
          single |      2,081       20.30      100.00
    -------------+-----------------------------------
           Total |     10,253      100.00

This may seem like a nonsense way of doing labels, but it does have one really great benefit. If you have numerous variables with the same codes (e.g. 'yes' and 'no'), it is very easy to label them.

We have created some dummy variables earlier in the chapter: one for self-employed women and one for degrees. To label them all, we would just define a label:

    lab def yesno 1 "yes" 0 "no"

and then attach these values to our dummy variables:

    lab val fem1 yesno
    lab val degree4 yesno
    ... etc

If you were using multiple dummy variables in your data set, you can easily see how quick it would be to label them all. For ways of changing a number of variable names at once, see the commands in Box 3.4.

Box 3.4: Renaming variables
If a variable already has a name you might want to change it. It could be that the original one isn't as informative as it should be; some of the older data sets were very restricted in the number of characters they could use to name variables, so it's not uncommon to see variable names such as std0068 for the number of children in the house, which it would be more convenient to change to numchd. For this example you would use:

    rename std0068 numchd

The rename command is suitable when you only have a few variables to change as you can only change one variable per rename command. If you have more than half a dozen variables to change it is worth installing the renvars command. Type:

    [...]

There are a lot of options with renvars but mostly it allows you to rename multiple variables in one command. Let's continue with our example above where a data set has variable names std0068, std0069 and std0070 and you want to change these to more meaningful names.
Typing

    renvars std0068 std0069 std0070 \ numchd ///
       chdage chdsex

would change the variable names listed to the left of the backslash (\) to the new variable names to the right of the backslash in the respective order. If the original variables are consecutive in the data set then you can just put the first and last separated by a dash (as with many other Stata commands), for example:

    renvars std0068-std0070 \ numchd chdage chdsex

Another useful command to be aware of is renpfix. This command can either change variable name stubs, such as std in our example, or remove them completely. If you wanted to change the std stub to chd for our three example variables, you would type

    renpfix std chd

You would now have three variables: chd0068, chd0069 and chd0070. If you typed

    renpfix st

Stata would remove the st stub from all variables that start with that stub, in this case leaving three variables named d0068, d0069 and d0070. All variables that start with the stub will be changed, so be careful with this command. Note that you cannot use variable names that start with a number.

A common use for renpfix is if you have panel data and each wave of data has variables that start with the same letter. In the British Household Panel Survey all the variables in the first wave start with a, second wave start with b, third wave start with c, and so on. When you put data sets together in a panel format the variables need to be named the same in each, so renpfix can be used to remove the starting letter (stub) from all variable names.

MORE ON GENERATING NEW VARIABLES

Stata has a second, 'extended' generate command that allows you to create numerous types of variables from your existing data both for analysis and for data management. The command is egen and as you get more familiar with Stata you will probably find yourself using this command more and more as it is very powerful. There are a considerable number of uses for egen so we cannot cover all of them here.
Instead we use a few examples to illustrate its utility. Remember that you can type help egen to find out all of the ways to use the command.

The group function of egen creates a new variable by combining categories of two or more variables. Let's take a simple example of two variables both with only two categories. First we make a variable that indicates those who are married compared to all others:

    recode mastat (1=1) (2/max=0), gen(married)

We have included the (1=1) for clarity but it is not strictly necessary for this recode. Now we want to create a new variable that has the following four categories: married men, married women, unmarried men, and unmarried women. If we crosstabulated the variables married and sex we would see that:

    ta sex married

    . ta sex married

               |   RECODE of mastat
               |   (marital status)
           sex |         0          1 |     Total
    -----------+----------------------+----------
          male |     1,886      2,947 |     4,833
        female |     2,369      3,062 |     5,431
    -----------+----------------------+----------
         Total |     4,255      6,009 |    10,264

From this table we can see the numbers in each of the four categories we are interested in, but we don't have them in a single variable. The long way to do this would be to use the gen and replace commands with if statements such as:

    gen sexmar1=1 if sex==1&married==0
    replace sexmar1=2 if sex==1&married==1
    replace sexmar1=3 if sex==2&married==0
    replace sexmar1=4 if sex==2&married==1

    . gen sexmar1=1 if sex==1&married==0
    (8378 missing values generated)
    . replace sexmar1=2 if sex==1&married==1
    (2947 real changes made)
    . replace sexmar1=3 if sex==2&married==0
    (2369 real changes made)
    . replace sexmar1=4 if sex==2&married==1
    (3062 real changes made)

Then if we label the categories and tabulate the new variable:

    lab def sexmar 1 "unmarried men" ///
       2 "married men" 3 "unmarried women" ///
       4 "married women"
    lab val sexmar1 sexmar
    ta sexmar1

    . ta sexmar1

            sexmar1 |      Freq.     Percent        Cum.
    ----------------+-----------------------------------
      unmarried men |      1,886       18.37       18.37
        married men |      2,947       28.71       47.09
    unmarried women |      2,369       23.08       70.17
      married women |      3,062       29.83      100.00
    ----------------+-----------------------------------
              Total |     10,264      100.00

If you compare the categories of this new variable with the crosstabulation of sex and married above you can see the new categories represent each cell of the crosstabulation. However, using the egen command considerably shortens this process:

    egen sexmar2=group(sex married)
    ta sexmar2

    . ta sexmar2

     group(sex |
      married) |      Freq.     Percent        Cum.
    -----------+-----------------------------------
             1 |      1,886       18.37       18.37
             2 |      2,947       28.71       47.09
             3 |      2,369       23.08       70.17
             4 |      3,062       29.83      100.00
    -----------+-----------------------------------
         Total |     10,264      100.00

We ordered our replace commands to re-create the process that the egen group command uses to make its categories so that you can easily compare the results. The first category of the first variable in the list is taken first and then groups are made with that and the categories of the second variable. So, in the sex variable 1 = men and 2 = women, and in the married variable 0 = unmarried and 1 = married. Therefore the categories of the new variable are 1 = men unmarried, 2 = men married, 3 = women unmarried, 4 = women married. So we can use the label 'sexmar' with this variable as well:

    lab val sexmar2 sexmar
    ta sexmar2

    . ta sexmar2

          group(sex |
           married) |      Freq.     Percent        Cum.
    ----------------+-----------------------------------
      unmarried men |      1,886       18.37       18.37
        married men |      2,947       28.71       47.09
    unmarried women |      2,369       23.08       70.17
      married women |      3,062       29.83      100.00
    ----------------+-----------------------------------
              Total |     10,264      100.00

Obviously, this gets more complicated if the variables have more than two categories and if you use more than two variables. It's important to understand how Stata orders categories of variables in a list, not only for this command but for many others as well, such as the by and bysort commands that we cover in Chapter 4.

The second function of egen we cover here is rownonmiss.
This function tells Stata to look across a number of variables and, for each case (row), create a new variable that shows how many non-missing values there are. Non-missing means any value which has not been designated as missing (. or .a, .b, etc). Remember that you need to have coded the values to missing prior to using this command. In this example, we take three General Health Questionnaire (GHQ) items, ghqa, ghqb and ghqd, to show how this command works by creating a new variable called obs. Then to show the frequencies of the new variable we need to tabulate it:

    egen obs=rownonmiss(ghqa ghqb ghqd)
    ta obs

    . egen obs=rownonmiss(ghqa ghqb ghqd)
    . ta obs

            obs |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        530        5.16        5.16
              1 |          1        0.01        5.17
              2 |         14        0.14        5.31
              3 |      9,719       94.69      100.00
    ------------+-----------------------------------
          Total |     10,264      100.00

This table shows us that 9719 people answered all three questions, 14 answered two (note that this doesn't show you which two), 1 answered only one question, and 530 did not answer any of the three questions. Another way of looking at this is that those given a zero on the variable obs are missing on all three questions. To show this, we tabulate the three variables for those who have a zero:

    tab1 ghqa ghqb ghqd if obs==0

    . tab1 ghqa ghqb ghqd if obs==0

    -> tabulation of ghqa if obs==0
    no observations

    -> tabulation of ghqb if obs==0
    no observations

    -> tabulation of ghqd if obs==0
    no observations

This confirms that those 530 cases have not answered any of the three questions. The values of obs can be used in a number of ways when creating and manipulating variables, especially when creating scales (see below) and deciding how many questions need to be answered to be included. It may also be used to select cases for analysis or to manage data sets when some cases are kept or dropped.

We carry on using the three GHQ items to demonstrate another function of egen.
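The two egen functions covered so far, group and rownonmiss, can be tried out on a few made-up cases. In the do-file sketch below, the item names q1 to q3, the new variable names and all of the values are hypothetical; only sex and married follow the coding used in our example.

```stata
* A do-file sketch with made-up cases (names and values are hypothetical)
clear
input sex married q1 q2 q3
1 0 2 3 1
1 1 . 2 2
2 0 . . .
2 1 1 . 4
end

* group() numbers the combinations in sort order: sex first, then married
egen sexmar = group(sex married)

* rownonmiss() counts the non-missing answers in each row
egen answered = rownonmiss(q1 q2 q3)

list, clean

assert sexmar == _n              // combinations 1 to 4, one case in each
assert answered == 3 in 1
assert answered == 2 in 2
assert answered == 0 in 3        // missing on all three items
assert answered == 2 in 4
```

With only one case per combination it is easy to see from the listing which pair of values each group number stands for, which is worth doing before attaching value labels.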
This function, rowmean, creates a new variable with a value of the mean of the variables specified. The default method for this function creates a mean for any cases that have at least one non-missing value. So, a mean might be created for someone who has only answered one of the questions. You need to decide if that is something you wish to do. So, we will use the obs variable created by the rownonmiss function to refine the variable creation. We start with the default method:

    egen mean1=rowmean(ghqa ghqb ghqd)

    . egen mean1=rowmean(ghqa ghqb ghqd)
    (530 missing values generated)

Stata tells us that the new variable, mean1, has been created with 530 missing values. Refer back to the table of obs and see that 530 cases were missing on all three questions and these are the ones that are missing on this new variable. While you are looking at the table of obs, you can see that one person answered only one of the three questions. If we wanted to exclude that person from having a mean then we can use the values of obs to condition the rowmean function by restricting it to only those who answered two or three questions:

    egen mean2=rowmean(ghqa ghqb ghqd) if obs>1

    . egen mean2=rowmean(ghqa ghqb ghqd) if obs>1
    (531 missing values generated)

The output shows 531 missing values in the new variable, mean2, which reflects the exclusion of the person who only answered one question. To follow this example through to its logical conclusion we now use the rowmean function and the values of the obs variable to calculate a mean for only those who answered all three questions:

    egen mean3=rowmean(ghqa ghqb ghqd) if obs==3

    . egen mean3 = rowmean(ghqa ghqb ghqd) if obs==3
    (545 missing values generated)

Now for the new variable, mean3, there are 545 missing values, 14 more than in mean2.

[Output table: item-level results from the alpha command for the opfam items; the last column is labelled 'alpha']

Of particular interest is the last column labelled 'alpha'. It would be better labelled as 'alpha if item removed', because that is what it is telling us. If we look down to opfamg, we can see that the alpha of the scale would improve to 0.7152 if we took this item out. Alpha can only tell us this information for items one at a time - it won't tell us the effect on alpha of removing two or three items at once. So it seems that opfamg isn't such a great measure for our proposed scale. And if we look at it, we can see that the item is somewhat different from the other items because it is asking about general child upbringing, rather than issues pertaining specifically to employment. It should be stressed that you should always check the items in your scale and make sure that they make theoretical sense. The mathematical techniques involved in scale construction do not do this for you!

So what happens if we take out opfamg?

    . alpha opfama opfamb opfamc opfamd opfame opfamf opfamh ///
        opfami, item

[Output table: item-level results for the eight remaining items, with opfama, opfamb and opfamf signed + and opfamc, opfamd, opfame, opfamh and opfami signed -]

We can see that the alpha has improved and that no further removal of individual items will increase our alpha. As we haven't created the scale variable yet, we could ask Stata to create it for us after it computed the reliability using the gen option.

    alpha opfama opfamb opfamc opfamd opfame ///
        opfamf opfamh opfami, gen(famscale1)

The gen option tells Stata to create a new variable (we have named it famscale1). Unless we tell Stata otherwise, it will go ahead and reverse code items. If you are ever in a position where you don't want Stata to do this, typing the option asis will tell Stata not to reverse code any items. You can also use the option std to get Stata to create the new variable in standardized form (a mean of 0 and a standard deviation of 1).

It is very important to note how Stata handles missing data in the alpha command. In other commands that we have talked about so far in this book, cases are deleted from the analysis if they are missing on at least one of the variables under consideration. So when we tab sex age, if a case is missing on the age variable, it will not be included in the tabulation. It is deleted 'casewise'. Similarly, we could create the scale just by simply adding up all the variables using the gen command (after reverse coding manually). Now, the cases where there was missing data on one or more of the scale items would not be included. The default in the gen option, however, is to create a score for every observation where there is a response to at least one item. In other words, a scale value could be created for someone who answered only one of the eight opfam variables. The score is calculated by dividing the summative score by the total number of items available for the specific case. If you find this objectionable (we do!), you may want to employ the option casewise so that cases with any missing values on the scale items are deleted. Let's see how this changes our results.

    alpha opfama opfamb opfamc opfamd opfame ///
        opfamf opfamh opfami, gen(famscale2) casewise

    . alpha opfama opfamb opfamc opfamd opfame ///
        opfamf opfamh opfami, gen(famscale2) casewise

    Test scale = mean(unstandardized items)
    Reversed items: opfamc opfamd opfame opfamh
                    opfami

    Average interitem covariance:       .2540916
    Number of items in the scale:              8
    Scale reliability coefficient:        0.7166

The results suggest a slightly better alpha. But we also know that cases are only included if respondents answered all the items in our scale. If you don't want to be so stringent, you can set an alternative minimum number of items that must be non-missing in order for the scale to be constructed. Let's say we decided that a person must have answered at least five of the eight items in order to be included in the scale. We would use the option min.

    alpha opfama opfamb opfamc opfamd opfame ///
        opfamf opfamh opfami, gen(famscale3) min(5)

When we create a scale, we are just really adding up items and, as such, we would rightly expect that if we are adding up 8 items, all of which have values of 1 to 5, our scale will have a minimum of 8 and a maximum of 40. People who score 8 would be very conservative, while those around the 40 mark would be rather liberal.

    recode opfamc opfamd opfame opfamh opfami ///
        (1=5) (2=4) (3=3) (4=2) (5=1)
    gen famscale4= opfama+ opfamb+ opfamc+ ///
        opfamd+ opfame+ opfamf+ opfamh+ opfami

However, if we summarize our new variables (famscale1, famscale2 and famscale3) and compare them with famscale4, created by manually reverse coding and using gen to add the items (note the use of the wildcard * that saves typing all the variables), we get:

    su famscale*

[Output table: summary statistics for famscale1 (Obs 9718), famscale2 (Obs 9515), famscale3 (Obs 9657) and famscale4 (Obs 9515)]

What is happening? In order to understand these values, you need to understand how Stata constructs the scale with the alpha command. When an item is 'reverse scored' it isn't recoded so that 1 becomes 5, 2 becomes 4, etc.
What happens is that a negative sign is placed in front of all the original values so that the original values are changed to:

    Variable   Direction   New values
    opfama         +        1  2  3  4  5
    opfamb         +        1  2  3  4  5
    opfamc         -       -5 -4 -3 -2 -1
    opfamd         -       -5 -4 -3 -2 -1
    opfame         -       -5 -4 -3 -2 -1
    opfamf         +        1  2  3  4  5
    opfamh         -       -5 -4 -3 -2 -1
    opfami         -       -5 -4 -3 -2 -1

This logic produces an overall total, for those who answered all eight items, with a minimum of -22 and a maximum of 10. If we divide these scores by 8 (the total number of items), we get -2.75 and 1.25, which match the minimum and maximum values in the data shown for famscale2, which used the casewise option so that only cases with data on all eight items were included. The theoretical minimum and maximum values for famscale1, created using the default settings for the gen option, are -5 and 5 as it is possible for respondents to answer just one item and be included in the scale. You can see from the Obs column that there are about 200 more cases in the famscale1 variable than in the famscale2 variable.

Determining the theoretical minimum and maximum values of the scale is a little more complex for famscale3, which used the option min(5) to tell Stata to use only cases that have answers to five or more items. If we take the five lowest possible scores, which are the five reverse coded items, then we get -25/5 = -5 as a minimum. The five highest possible scores are the three unaltered items and two reverse coded items: 5+5+5-1-1 = 13, so 13/5 = 2.6 is the maximum. You can see that the actual minimum and maximum values in the data fall short of these extremes.

However, if we correlate all four scales we see that they are mathematically equivalent for the cases that were included on each pair of scales. The decision rests with you, as the analyst, as to how many answers you need for someone to obtain an overall scale score.

                 | famsca~1 famsca~2 famsca~3 famsca~4
       famscale1 |   1.0000
                 |     9718
       famscale2 |   1.0000   1.0000
                 |     9515     9515
       famscale3 |   1.0000   1.0000   1.0000
                 |     9657     9515     9657
       famscale4 |   1.0000   1.0000   1.0000   1.0000
                 |     9515     9515     9515     9515

For an account of using a scale in an applied research project, see Box 3.5.

Box 3.5: An example of using a scale in a project

In a project funded by the Department of Health we conducted a series of analyses investigating the consequences of early child-bearing on outcomes later in life. These were later published as papers examining housing,1 partners and partnerships,2 and gender differences in outcomes.3 The data we used were from the British Cohort Study of 1970, which has followed approximately 15,000 children from birth in 1970 until the present day. Part of the analysis was to construct a measure of childhood behaviour prior to any childbearing. The cohort was interviewed at age 10 in 1980. At this time their teachers were also asked about their behaviour in school. We decided to use teacher reported behaviour as well as parent reported behaviour. We took all the answers from the teachers and examined the correlations between the variables and, after some thought and analysis, settled on a scale of five items that measured the child's concentration on educational tasks, their popularity with peers, the number of friends, their level of co-operation with peers, and the extent that the teachers were able to negotiate with the child. All items were measured using a 'thermometer' type response ranging from 1 to 47. These five items had correlations between 0.33 and 0.84, which suggested that they might make a reasonable scale without all measuring the same thing. We standardized all five items to a mean of 0 and a standard deviation of 1 using the egen command to make five new variables: zone, ztwo, zthree, zfour, zfive. Then we used the alpha command to determine that the resulting scale had a Cronbach's alpha of 0.82, which was very satisfactory for a five-item scale.
We used the item option on the alpha command to see if omitting items would improve the alpha value, and the results indicated that no significant improvement could be gained. To create the final score for each child we used the egen command again to take a mean of the five items where a low score indicated poor behaviour.

The final measure of childhood behaviour was a significant predictor of early parenthood, but the regression models indicated an interaction effect with gender (see Chapters 8 and 9 for regression and interaction effects) where this behaviour predicted early motherhood but not early fatherhood. Then we looked at when the cohort was 30 years old and we found that this behaviour measure was significantly associated with a range of outcomes including their educational attainment, labour force participation, pay, social class, house ownership, and receipt of benefits.

1 Ermisch, J.F. and Pevalin, D.J. (2004) Early childbearing and housing choices. Journal of Housing Economics, 13: 170-194.
2 Ermisch, J.F. and Pevalin, D.J. (2005) Early motherhood and later partnerships. Journal of Population Economics, 18: 469-489.
3 Robson, K. and Pevalin, D.J. (2007) Gender differences in the predictors and outcomes of young parenthood. Research in Social Stratification and Mobility, 25: 205-218.

DEMONSTRATION EXERCISE

In this demonstration exercise we use some of the techniques covered in this and each of the subsequent chapters to conduct a series of data analyses exploring the question of social variations in mental health among working age adults. Our measure of mental health is the 12-item General Health Questionnaire, which is a scale that can be constructed in a number of ways. In this demonstration we start by using the GHQ in the form of a scale that ranges from 0 to 36, with higher scores indicating poorer mental health.
The factors we are interested in using are sex, age, marital status, employment status, number of own children in the household and region of the country.

We start by opening the example data file (exampledata.dta) from our default directory. Before opening the data file we increase the memory available to Stata to 50 Mb using the set mem command:

    version 10
    set mem 50m
    cd "C:\project folder"
    use exampledata.dta

Next we use the keep command to retain only the individual level variables we need for this analysis and then recode all the negative values to missing. We keep the individual identifier (pid) and the household identifier (hid) so that we can match on household level information in the next chapter.

    keep pid hid ghq* sex age mastat jbstat nchild
    mvdecode _all,mv(-9/-1)

Stata returns the output below. You can see that after the mvdecode command, Stata tells you how many missing values were assigned for each variable. For example, the variable jbstat had 352 missing values while the variable ghql had 580 missing values. From this you can deduce that the variables pid, hid, sex, age, mastat and nchild do not have any missing values.

    . mvdecode _all,mv(-9/-1)
          jbstat: 352 missing values generated
            ghqa: 536 missing values generated
            ghqb: 536 missing values generated
            ghqc: 545 missing values generated
            ghqd: 534 missing values generated
            ghqe: 535 missing values generated
            ghqf: 546 missing values generated
            ghqg: 534 missing values generated
            ghqh: 532 missing values generated
            ghqi: 534 missing values generated
            ghqj: 577 missing values generated
            ghqk: 587 missing values generated
            ghql: 580 missing values generated

We now create the GHQ scale. We recode the items using the wildcard (*) to save listing all twelve variables; an alternative would be to use the dash, as in recode ghqa-ghql. We then check the reliability of the scale using the alpha command with the item option, which gives us results for each of the GHQ items as an internal reliability check. The overall alpha is shown at the bottom right of the table; all the signs are positive and all items have similar item-rest correlations. The last column shows that the overall alpha value could not be improved by dropping any item. We should be reasonably happy with the reliability of this scale.

    . recode ghq* (4=3) (3=2) (2=1) (1=0)
    [...]
    . gen ghqscale=ghqa+ghqb+ghqc+ghqd+ghqe+ghqf ///
        +ghqg+ghqh+ghqi+ghqj+ghqk+ghql
    (651 missing values generated)
    . lab var ghqscale "ghq 0-36"
    . su ghqscale

        Variable |       Obs        Mean    Std. Dev.       Min        Max
        ghqscale |      9613     10.7712      4.9143          0         36

We now construct another variable based on the GHQ items. This one uses a coding of 0-0-1-1 for each of the items and then adds up the items to a maximum of 12. Then a threshold of 4 or more is used to make a dichotomous indicator. First we recode all the 12 GHQ items and then sum them to create a new variable called d_ghq.

    recode ghq* (0/1=0) (2/3=1)
    gen d_ghq=ghqa+ghqb+ghqc+ghqd+ghqe+ghqf ///
        +ghqg+ghqh+ [...]

[...]

    keep if age>=18 & age<=65
    su age

The output shows how many cases (observations) are deleted from the data set after implementing the keep command. From the descriptive statistics for the variable age, we see that now we only have 8163 cases in our data. Remember that there are no missing cases in the age variable.

    . keep if age>=18 & age<=65
    (2101 observations deleted)
    . su age

        Variable |       Obs        Mean    Std. Dev.       Min        Max
             age |      8163    39.32733    13.08993         18         65

Next we recode the age variable into three categories and use the gen option in the recode command to create a new variable called agecat. We then label the new variable and its categories. We then use the tab command to produce a frequency table of the new variable to check if our recode and labelling have come out correctly.
    recode age (18/32=1) (33/50=2) ///
        (51/65=3),gen(agecat)
    lab var agecat "age categories"
    lab def agelab 1 "18-32 years" ///
        2 "33-50 years" 3 "51-65 years"
    lab val agecat agelab
    tab agecat

After the recode command, Stata tells us that there are 8163 (all cases) differences between the original variable age and the newly created variable agecat. As we numbered the categories of agecat 1, 2, and 3 and there is no one in the data under 18 years of age, it is not surprising that for all cases the values of age and agecat are different. The frequency table produced from the tab command shows the number and percentage of the cases in each of the three age categories.

    . recode age (18/32=1) (33/50=2) ///
        (51/65=3),gen(agecat)
    (8163 differences between age and agecat)
    . lab var agecat "age categories"
    . lab def agelab 1 "18-32 years" ///
        2 "33-50 years" 3 "51-65 years"
    . lab val agecat agelab
    . tab agecat

            age |
     categories |      Freq.     Percent        Cum.
    18-32 years |      2,956       36.21       36.21
    33-50 years |      3,336       40.87       77.08
    51-65 years |      1,871       22.92      100.00
          Total |      8,163      100.00

Our next step is to recode the sex variable into a dummy variable that indicates female cases. We use the recode command with the gen option again. We label the new variable and its categories, then produce a frequency table to check our recode.

    tab sex
    tab sex,nol
    recode sex (1=0) (2=1),gen(female)
    lab var female "female indicator"
    lab def sexlab 0 "male" 1 "female"
    lab val female sexlab
    tab female

We see the frequency table of the sex variable but we need to see what numbers lie underneath the category labels of male and female. We use the tab command with the nol option.

    . ta sex

            sex |      Freq.     Percent        Cum.
           male |      3,914       47.95       47.95
         female |      4,249       52.05      100.00
          Total |      8,163      100.00

    . ta sex,nol

            sex |      Freq.     Percent        Cum.
              1 |      3,914       47.95       47.95
              2 |      4,249       52.05      100.00
          Total |      8,163      100.00

    . recode sex (1=0) (2=1),gen(female)
    (8163 differences between sex and female)
    . lab var female "female indicator"
    . lab def sexlab 0 "male" 1 "female"
    . lab val female sexlab
    . ta female

         female |
      indicator |      Freq.     Percent        Cum.
           male |      3,914       47.95       47.95
         female |      4,249       52.05      100.00
          Total |      8,163      100.00

To reduce the number of marital status categories, we recode the marital status variable (mastat) into a new variable called marst2 with four categories, where 1 = single, 2 = married/cohabiting, 3 = separated/divorced and 4 = widowed. We need to see what the categories and the numbering are in the original marital status variable (mastat). We do this by using the tab command with the nol option. We then recode, create the new variable and label the new variable and its categories.

    tab mastat
    tab mastat,nol
    recode mastat (6=1) (1/2=2) (4/5=3) ///
        (3=4),gen(marst2)
    lab var marst2 "marital status 4 categories"
    lab def marlab 1 "single" 2 "married" ///
        3 "sep/div" 4 "widowed"
    lab val marst2 marlab
    tab marst2

The output for these commands is similar to that above. The exact process and commands you use to recode variables may vary from this, but we strongly advise you to have a system that allows you to check your recoding as you go along.

    . tab mastat

      marital status |      Freq.     Percent        Cum.
             married |      5,132       62.87       62.87
    living as couple |        654        8.01       70.88
             widowed |        189        2.32       73.20
            divorced |        397        4.86       78.06
           separated |        172        2.11       80.17
       never married |      1,619       19.83      100.00
               Total |      8,163      100.00

    . tab mastat,nol

      marital status |      Freq.     Percent        Cum.
                   1 |      5,132       62.87       62.87
                   2 |        654        8.01       70.88
                   3 |        189        2.32       73.20
                   4 |        397        4.86       78.06
                   5 |        172        2.11       80.17
                   6 |      1,619       19.83      100.00
               Total |      8,163      100.00

    . recode mastat (6=1) (1/2=2) (4/5=3) ///
        (3=4),gen(marst2)
    (7509 differences between mastat and marst2)
    . lab var marst2 "marital status 4 categories"
    . lab def marlab 1 "single" 2 "married" ///
        3 "sep/div" 4 "widowed"
    . lab val marst2 marlab
    . tab marst2

        marital |
       status 4 |
     categories |      Freq.     Percent        Cum.
         single |      1,619       19.83       19.83
        married |      5,786       70.88       90.71
        sep/div |        569        6.97       97.68
        widowed |        189        2.32      100.00
          Total |      8,163      100.00

We now create the employment status variable:

    ta jbstat
    ta jbstat,nol
    recode jbstat (1/2=1) (3=2) (7=3) (6=4) ///
        (9=4) (5=5) (8=5) (4=6) (10=.), gen(empstat)
    lab var empstat "employment status"
    lab def emplab 1 "employed" 2 "unemployed" ///
        3 "longterm sick" 4 "studying" ///
        5 "family care" 6 "retired"
    lab val empstat emplab
    ta empstat

    . ta jbstat

       current labour force |
                     status |      Freq.     Percent        Cum.
              self employed |        731        9.26        9.26
             in paid employ |      4,844       61.39       70.65
                 unemployed |        505        6.40       77.05
                    retired |        403        5.11       82.16
                family care |        900       11.41       93.56
                 ft student |        202        2.56       96.12
    long term sick/disabled |        244        3.09       99.21
            on matern leave |         13        0.16       99.38
           govt trng scheme |         22        0.28       99.66
             something else |         27        0.34      100.00
                      Total |      7,891      100.00

    . ta jbstat,nol

        current |
         labour |
          force |
         status |      Freq.     Percent        Cum.
              1 |        731        9.26        9.26
              2 |      4,844       61.39       70.65
              3 |        505        6.40       77.05
              4 |        403        5.11       82.16
              5 |        900       11.41       93.56
              6 |        202        2.56       96.12
              7 |        244        3.09       99.21
              8 |         13        0.16       99.38
              9 |         22        0.28       99.66
             10 |         27        0.34      100.00
          Total |      7,891      100.00

    . recode jbstat (1/2=1) (3=2) (7=3) (6=4) ///
    > (9=4) (5=5) (8=5) (4=6) (10=.), gen(empstat)
    (6260 differences between jbstat and empstat)
    . lab var empstat "employment status"
    . lab def emplab 1 "employed" 2 "unemployed" ///
    > 3 "longterm sick" 4 "studying" ///
    > 5 "family care" 6 "retired"
    . lab val empstat emplab
    . ta empstat

       employment |
           status |      Freq.     Percent        Cum.
         employed |      5,575       70.89       70.89
       unemployed |        505        6.42       77.31
    longterm sick |        244        3.10       80.42
         studying |        224        2.85       83.27
      family care |        913       11.61       94.88
          retired |        403        5.12      100.00
            Total |      7,864      100.00

Next we collapse the variable for number of children into fewer categories:

    su nchild
    recode nchild (0=1) (1/2=2) (3/9=3), ///
        gen(numchd)
    lab var numchd "children 3 categories"
    lab def chdlab 1 "none" 2 "one or two" ///
        3 "three or more"
    lab val numchd chdlab
    ta numchd

    . su nchild

        Variable |       Obs        Mean    Std. Dev.       Min        Max
          nchild |      8163    .6659316    1.019895          0          9

    . recode nchild (0=1) (1/2=2) (3/9=3), ///
        gen(numchd)
    (6508 differences between nchild and numchd)
    . lab var numchd "children 3 categories"
    . lab def chdlab 1 "none" 2 "one or two" ///
        3 "three or more"
    . lab val numchd chdlab
    . ta numchd

       children 3 |
       categories |      Freq.     Percent        Cum.
             none |      5,182       63.48       63.48
       one or two |      2,443       29.93       93.41
    three or more |        538        6.59      100.00
            Total |      8,163      100.00

When we have completed our recoding we produce descriptive statistics for all the variables that we will use in our future analyses.

    su ghqscale d_ghq female age agecat marst2 ///
        empstat numchd

We can use the output of descriptive statistics to see if our variables have the right number of categories and cases (observations). The output below shows that some of the variables have fewer valid observations than the 8163 in our total sample. This is due to the GHQ items and the employment status variable having a number of people who did not respond to the questions and so have been coded as missing values.

    . su ghqscale d_ghq female age agecat marst2 ///
        empstat numchd

        Variable |       Obs        Mean    Std. Dev.       Min        Max
        ghqscale |      7714    10.76407    4.987117          0         36
           d_ghq |      7714    .1870625     .389987          0          1
          female |      8163    .5205194    .4996094          0          1
             age |      8163    39.32733    13.08993         18         65
          agecat |      8163    1.867083    .7574497          1          3
          marst2 |      8163    1.917677    .5949101          1          4
         empstat |      7864     1.93235     1.64757          1          6
          numchd |      8163    1.431092    .6140945          1          3

Finally, we use the keep command again to retain only the variables we wish to use in further analyses.
The order command lets us order the variables in the data set if this is something you prefer. The compress command stores the data set in the smallest amount of space, and then the save command saves our new data set to our default directory for future use.

    keep pid hid ghqscale d gh