Manipulating variables
MANIPULATING VARIABLES - WHY?
Often, when you are working with data, you need to make changes to your variables. This may be for a number of reasons. For example, you may have an income variable that is measured in pounds, but you would prefer to have income grouped into a number of categories, or even percentiles, instead. You may be examining educational qualifications, but you are only really interested in studying whether people went to university or not, and therefore you would collapse a variable into two categories (i.e. went to university, didn't go to university). You may have a very detailed ethnicity or race variable that was collected as an open-ended question - that is, the survey respondents could write in whatever they pleased. Clearly a categorical variable with hundreds of possible responses would be very nearly impossible to work with and therefore you may want to collapse these into more manageable categories. There are many reasons you would want to change the way your variables are presented. One special instance is the case of missing values.
MISSING VALUES
When people answer surveys, they don't always answer all the questions, either because they would rather not or because not all the questions are relevant to them. Survey researchers use a variety of techniques to capture the instances where a respondent refuses to answer a question or where the question isn't appropriate for that respondent. Often these values are recorded in the data spreadsheet as negative numbers, such as -1 for 'refused' or -9 for 'inapplicable'. Sometimes they are stored as high numbers, such as 99 or 999. There is no rule about this: it depends on the conventions of the particular researcher or research institute who collected the data.
You can recode the missing values to one or more of the 27 values Stata recognizes as meaning 'missing'. These 27 are a dot (.) and a dot followed by a letter (.a to .z). Stata treats all of these missing values as extremely high numbers, larger than any real value, which is why you have to be careful when using the greater than symbol (>) if the variable has missing values.
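The point about the greater than symbol can be checked with a minimal sketch. The four income values below are made up for illustration and are not taken from the example data; the variable names id and income are likewise hypothetical.

```stata
* Toy data: hypothetical values, not from the survey
clear
input id income
1 1200
2 3500
3 .
4 800
end

* Missing (.) counts as larger than any number,
* so this count includes the case with missing income (3 cases)
count if income > 1000

* Excluding missing values explicitly leaves only observed incomes (2 cases)
count if income > 1000 & !missing(income)
```

Adding & !missing(varname) to any condition that uses > or >= is one simple habit that avoids this trap.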
In our example data the data collectors decided to use negative numbers (i.e. -7, -8) to indicate non-response for a variety of reasons. The table below shows the output of the variable educ using the tabulate command (see Chapter 5).
. tabulate educ

           highest educational |
                 qualification |      Freq.     Percent        Cum.
-------------------------------+-----------------------------------
                       missing |         19        0.19        0.19
              proxy respondent |        352        3.43        3.61
                 higher degree |        122        1.19        4.80
                  first degree |        598        5.83       10.63
                   teaching qf |        225        2.19       12.82
               other higher qf |      1,207       11.76       24.58
                    nursing qf |        215        2.09       26.68
                  gce a levels |        985        9.60       36.27
         gce o levels or equiv |      2,086       20.32       56.60
    commercial qf, no o levels |        349        3.40       60.00
 cse grade 2-5, scot grade 4-5 |        411        4.00       64.00
                apprenticeship |        262        2.55       66.55
                      other qf |         84        0.82       67.37
                         no qf |      3,349       32.63      100.00
-------------------------------+-----------------------------------
                         Total |     10,264      100.00
As you can see from the frequency table there are 19 cases with 'missing' values and 352 cases where this question was asked of a 'proxy respondent'. A proxy respondent is someone in the household who answers questions on the respondent's behalf if he or she is not able to participate. If we tabulate again, with the option nol (nolabel) asking for the values rather than the labels of the categories, we get:
. tabulate educ, nol

    highest educational |
          qualification |      Freq.     Percent        Cum.
------------------------+-----------------------------------
                     -9 |         19        0.19        0.19
                     -7 |        352        3.43        3.61
                      1 |        122        1.19        4.80
                      2 |        598        5.83       10.63
                      3 |        225        2.19       12.82
                      4 |      1,207       11.76       24.58
                      5 |        215        2.09       26.68
                      6 |        985        9.60       36.27
                      7 |      2,086       20.32       56.60
                      8 |        349        3.40       60.00
                      9 |        411        4.00       64.00
                     10 |        262        2.55       66.55
                     11 |         84        0.82       67.37
                     12 |      3,349       32.63      100.00
------------------------+-----------------------------------
                  Total |     10,264      100.00
We see that -9 is the value given to 'missing' cases and -7 to 'proxy respondent' cases. For this variable we would choose to recode both -9 and -7 to be missing values.
There are a number of ways to specify missing values. The first uses the recode command, where we recode both the negative numbers to a dot.
recode educ -9=. -7=.
or
recode educ -9/-7=.
The forward slash / is Stata notation for a range of values - in the above example, values -9 through -7. Remember to put the lowest value first. There are no cases with a value of -8, but that doesn't matter.
If you wanted to keep the reasons for the missing values separate then you could use the dot followed by a letter values, such as
recode educ -9=.a -7=.b
Also, you can recode more than one variable at a time if the variables in your list have the same values to change. For example, if our variables education (educ) and housing tenure (tenure) had the same range of missing values to recode, we could use
recode educ tenure (-9/-1=.)
or
recode educ tenure (-9=.a) (-7=.b)
Where there is more than one variable in the list, the recode values must be in parentheses. The recode command is also used to manipulate the non-missing values of variables, and this is covered in the next section.
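To see what keeping the reasons for missing values separate looks like in practice, here is a small sketch. The five values of educ are invented for illustration; only the codes -9 and -7 follow the conventions described above.

```stata
* Toy data: hypothetical values of educ
clear
input educ
-9
-7
1
2
12
end

* Keep the two reasons for non-response apart
recode educ (-9=.a) (-7=.b)

* The missing option shows .a and .b as separate rows
tabulate educ, missing

* A specific missing code can be tested with ==
count if educ == .a

* missing() is true for ., .a, .b and all other missing codes
count if missing(educ)
```

Without the missing option, tabulate would show only the three valid values, so the two kinds of missing value would be indistinguishable in the output.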
You must be sure that the 'missing' values do not have a meaningful value. In our example data, some of the negative values that you might think to recode to missing can in fact represent a real value, such as zero, for a variable. To illustrate this we shall look at the variable ncigs, which measures how many cigarettes a respondent smoked per day.
First, let us get some descriptive information on the variable:
ta ncigs

           number of |
   cigarettes smoked |      Freq.     Percent        Cum.
---------------------+-----------------------------------
     missing or wild |         20        0.19        0.19
        inapplicable |      6,949       67.70       67.90
    proxy respondent |        352        3.43       71.33
             refused |          1        0.01       71.34
          don't know |          1        0.01       71.35
 less than 1 per day |         41        0.40       71.75
                   1 |         39        0.38       72.13
                   2 |         55        0.54       72.66
                   3 |         50        0.49       73.15
                   4 |         31        0.30       73.45
                   5 |        162        1.58       75.03
                   6 |         59        0.57       75.60
                   7 |         53        0.52       76.12
                   8 |         62        0.60       76.72
                   9 |         14        0.14       76.86
                  10 |        501        4.88       81.74
etc
                  60 |          9        0.09       99.99
                  80 |          1        0.01      100.00
---------------------+-----------------------------------
               Total |     10,264      100.00
We can see that several respondents smoked less than one per day. We can find out the variable values as well.
ta ncigs, nol

           number of |
   cigarettes smoked |      Freq.     Percent        Cum.
---------------------+-----------------------------------
                  -9 |         20        0.19        0.19
                  -8 |      6,949       67.70       67.90
                  -7 |        352        3.43       71.33
                  -2 |          1        0.01       71.34
                  -1 |          1        0.01       71.35
                   0 |         41        0.40       71.75
                   1 |         39        0.38       72.13
                   2 |         55        0.54       72.66
                   3 |         50        0.49       73.15
                   4 |         31        0.30       73.45
                   5 |        162        1.58       75.03
                   6 |         59        0.57       75.60
                   7 |         53        0.52       76.12
                   8 |         62        0.60       76.72
                   9 |         14        0.14       76.86
                  10 |        501        4.88       81.74
etc
                  60 |          9        0.09       99.99
                  80 |          1        0.01      100.00
---------------------+-----------------------------------
               Total |     10,264      100.00
All respondents were asked the question 'Do you smoke?'. If they answered 'Yes', then they were asked the second question, 'How many cigarettes do you smoke a day?'. If they answered 'No' to the first question then they were not asked the second question. This can be seen by the number of respondents with a value of -8 in the variable ncigs. This number should correspond to the number of respondents who answered 'No' to the first question. As well, there is a value of 0, which is labelled as less than 1 per day. But these are occasional smokers, so are they really zeros?
If you wanted a variable that indicated the number of cigarettes smoked per day then you could recode the negative values of ncigs in this way:
recode ncigs 0=1 -9=. -8=0 -7=. -2=. -1=.
Or, if you wanted to keep the reasons for missing values separate:
recode ncigs 0=1 -9=.a -8=0 -7=.b -2=.c -1=.d
In the second way we have kept -9=.a and -7=.b as they are for the educ variable, as in our data -9 and -7 denote the same reason for missing values in all of the variables. Both of these recodes will produce the frequency table of the ncigs variable below. The differences in coding the missing values are only shown when the missing option is used in the tabulate command, which we cover in Chapter 5.
. ta ncigs, nol

           number of |
   cigarettes smoked |      Freq.     Percent        Cum.
---------------------+-----------------------------------
                   0 |      6,949       70.26       70.26
                   1 |         80        0.81       71.07
                   2 |         55        0.56       71.63
                   3 |         50        0.51       72.13
                   4 |         31        0.31       72.45
                   5 |        162        1.64       74.08
                   6 |         59        0.60       74.68
                   7 |         53        0.54       75.22
                   8 |         62        0.63       75.84
                   9 |         14        0.14       75.99
                  10 |        501        5.07       81.05
etc
                  60 |          9        0.09       99.99
                  80 |          1        0.01      100.00
---------------------+-----------------------------------
               Total |      9,890      100.00
There is a useful command that will convert all the negative values to missing values in one step. But before you use this global recode, you must make sure that all the values that you are going to specify are indeed logically deemed to be system missing and that you want to recode all of the missing values to a dot (.) only, rather than keep the reasons for missing values separate. The command is mvdecode (missing value decode). The variable list _all stands for all variables, and the option mv specifies which value(s) to treat as missing.
mvdecode _all, mv(-9/-1)
The shortened range for the missing values (-9/-1) does not work on some Stata configurations. If this is the case with your set-up, then you can enter the missing values one at a time:
mvdecode _all, mv(-9)
mvdecode _all, mv(-8) etc.
Also, the mvdecode command can be used with single variables or lists of variables and does not have to be used with _all:
mvdecode mastat, mv(-9/-1)
mvdecode tenure-fimn, mv(-9/-7)
The use of negative numbers to indicate values likely to be changed to missing is good practice, but it has its limitations, as we have shown above. Another time when negative numbers are not appropriate is when you have standardized scores, as these variables have negative numbers as legitimate values. As we mentioned previously, some data sets use high numbers such as 99 or 999 to indicate non-response (no answer). Additional care is needed when recoding these to missing, especially when using the global recode command mvdecode. For example, your data may have a variable for age that uses 999 as the indicator for non-response as well as a variable for five categories of marital status that uses 9 as the value for non-response. If you simply used the global mvdecode _all command for the range 9/999 then all respondents aged 9 and over would be given a missing value in the age variable. Whatever system is used to indicate non-response, you need to exercise care when declaring values to be missing.
It is very important to note that there is no way to undo the mvdecode command. You are well advised to be absolutely sure you want to run this command. And you should never overwrite your original data sets - always work with copies.
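One cautious way of working, sketched below, is to declare each non-response code only for the variable it belongs to, and to take a temporary snapshot with preserve first so that the data can be brought back with restore. The four cases are invented for illustration; the codes 999 for age and 9 for marital status follow the example in the text.

```stata
* Toy data: hypothetical values
clear
input age mastat
34 1
9 2
999 3
61 9
end

* Take a snapshot of the data that restore can bring back
preserve

* Declare each non-response code only for its own variable
mvdecode age, mv(999)
mvdecode mastat, mv(9)

* Check the result: one missing value on each variable,
* and the respondent aged 9 keeps a valid age
count if missing(age)
count if missing(mastat)
list

* Bring back the data as they were when preserve was issued
restore
```

The snapshot taken by preserve is only temporary, so it complements rather than replaces keeping an untouched copy of the original data file.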
As you may have gathered by now, there are usually half a dozen different ways to do what you want to do in Stata. As you become more familiar with its possibilities you will find the way that suits you best. In the numerous courses that we have taught on Stata, not one has passed without a student showing us something that we didn't know!
CREATING NEW VARIABLES OUT OF EXISTING VARIABLES
You may want to make new variables out of one or more existing variables. For example, you might want a variable that denotes women who are self-employed. To make this variable, you would need information from two variables that already existed in your data set: gender (sex) and job status (jbstat).
When you are creating variables, you often need to use mathematical expressions. The following are recognized by Stata:
+    addition
-    subtraction
*    multiplication
/    division
^    raise to the power
Parentheses are used to control the order in which calculations are done. Parentheses may be nested within other parentheses, many layers deep.
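A quick way to get a feel for the operators and the effect of parentheses is to try them with the display command, which evaluates an expression and prints the result. The numbers below are arbitrary.

```stata
* Multiplication is done before addition: 2 + 12 = 14
display 2 + 3 * 4

* Parentheses force the addition first: 5 * 4 = 20
display (2 + 3) * 4

* Raising to a power: 2 * 2 * 2 = 8
display 2^3

* Nested parentheses are worked from the inside out: (6 / 3)^2 = 4
display ((10 - 4) / 3)^2
```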
Before moving on to more advanced variable manipulation, it is necessary to understand the logical operators used in Stata.
!     not	>     greater than
==    equal (two equals signs)	>=    greater than or equal to
!=    not equal	<     less than
&     and	<=    less than or equal to
|     or
Stata also recognizes a myriad of mathematical functions. The ones you probably use most are:
sqrt(x)   square root	log(x) or ln(x)   natural logarithm
exp(x)    exponent (e to the power x)	int(x)   integer part
Type help functions for the full range of functions supported by Stata.
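The logical operators and functions listed above can also be tried out with display. A condition that is true is shown as 1 and a condition that is false as 0; the numbers used here are arbitrary.

```stata
* Comparisons return 1 when true and 0 when false
display 5 > 3
display 5 == 3

* and (&) needs both sides to be true; or (|) needs at least one
display (5 > 3) & (2 != 2)
display (5 > 3) | (2 != 2)

* not (!) reverses a condition
display !(5 > 3)

* Some common functions
display sqrt(16)
display exp(0)
display ln(1)
display int(7.9)
```

The nine lines display 1, 0, 0, 1, 0, 4, 1, 0 and 7 respectively.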
Box 3.1: Shortening commands
The shortening of commands can be used on almost all commands. The manuals or the help command will tell you what is the shortest accepted form of the command, as shown by the underlined part of the word. For example, tabulate can be shortened to ta. Some commands cannot be shortened.
However, with help, you must spell the name of the command completely, so help tabulate will yield you results, while help tab will produce a message stating that the command could not be found. Help brings up the online help system for the command. Think of it as the computer version of the Stata manuals, as it often contains the same information! As well, Stata's search facility looks for the term you indicate in help files, Stata Technical Bulletins, and Stata FAQs if your Stata is web active.
Probably the most common command used to create new variables from pre-existing variables is generate. The command generate can be shortened to gen, ge or even g. To use the generate command, you type generate followed by the name of the new variable and then the expression that defines the value of the new variable. For example, if we wanted to create a variable that measured yearly income, we could multiply our monthly income variable (fimn) by 12:
gen yearlyincome=fimn*12
Likewise, we could generate a variable that roughly equated to weekly income by dividing monthly income by, say, 4.2:
ge weeklyincome=fimn/4.2
We can use the recode command to make a new variable out of an existing variable. Suppose we wanted to make a simplified variable where the only categories were 'partnered', 'never married', and 'previously married' from the original marital status variable (mastat). 'Partnered' would include married and cohabiting couples, while 'previously married' would include all divorced, separated, and widowed people. We could do this by:
recode mastat 1/2=1 3/5=2 6=3
But this would mean that we lose the original variable categories, so we advise that you create new variables when recoding. Let's call our new variable mastat2, so now the command will be:
recode mastat 1/2=1 3/5=2 6=3, gen(mastat2)
The gen option to the recode command creates a new variable with the recoded categories, so you will keep the original variable and its categories as well.
ta mastat2

    mastat2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      6,683       65.11       65.11
          2 |      1,489       14.51       79.62
          3 |      2,092       20.38      100.00
------------+-----------------------------------
      Total |     10,264      100.00
You should always check the variables you create. One way is to crosstabulate them, provided the number of categories is manageable:
. ta mastat mastat2

                 |             mastat2
  marital status |         1          2          3 |     Total
-----------------+---------------------------------+----------
         married |     6,009          0          0 |     6,009
living as couple |       674          0          0 |       674
         widowed |         0        866          0 |       866
        divorced |         0        434          0 |       434
       separated |         0        189          0 |       189
   never married |         0          0      2,081 |     2,081
-----------------+---------------------------------+----------
           Total |     6,683      1,489      2,081 |    10,253
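When a variable has too many categories for a crosstabulation to be readable, the same check can be written as a set of assert commands, which stay silent when the condition holds for every case and stop with an error when it does not. The sketch below uses six made-up cases, one for each category of mastat.

```stata
* Toy data: one hypothetical case per category of mastat
clear
input mastat
1
2
3
4
5
6
end

recode mastat (1/2=1) (3/5=2) (6=3), gen(mastat2)

* Each assert is silent if the recode did what we intended
assert mastat2 == 1 if inlist(mastat, 1, 2)
assert mastat2 == 2 if inrange(mastat, 3, 5)
assert mastat2 == 3 if mastat == 6
```

This is a complement to the crosstabulation rather than a replacement for it: the table shows you what happened, while assert only confirms what you expected.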
Generating a new variable with information from one existing variable is the simplest type of variable creation. Returning to our example of a variable that indicates women who are self-employed, we could do this in a number of different ways; we first show step-by-step approaches so you can see how the different commands work, and then we use a single command to demonstrate Stata's use of logic in constructing new variables. The self-employed have the value 1 in the jbstat variable (10 categories) and the missing values have been recoded to a dot (.), while women have the value 2 in the sex variable and there are no missing values on the sex variable.
The step-by-step way is to generate a new variable (fem1) and set all values to missing (.), and then to replace the values as they meet specific conditions. The command replace allows you to change the contents of a variable, so to speak.
gen fem1=.
replace fem1=1 if jbstat==1 & sex==2
replace fem1=0 if jbstat>=2 & jbstat<=12
replace fem1=0 if sex==1 & jbstat==1
This will return a value of 1 for all women who are self-employed. You are then left to decide what to do with the rest of the values. A variable must take more than one value to vary. Often, a researcher will use dummy variables to denote 1 for the category of interest and 0 for cases that do not meet this condition, so we replace the new variable with zeros where the cases are women who are not self-employed and all men. The third line replaces those who have an employment status but are not self-employed with a 0. The fourth line replaces men who are self-employed with a 0. The step-by-step way of creating this new variable is cumbersome, but it does help to understand what is actually happening in your data with each command. However, we hope that you are able to move quickly on to use more efficient commands. We can first reduce the four commands to three:
gen fem2=1 if jbstat==1 & sex==2
replace fem2=0 if jbstat>=2 & jbstat<=12
replace fem2=0 if sex==1 & jbstat==1
In the first command line the generate (gen) command creates a new variable fem2 with cases equal to 1 for women who are self-employed, and then the rest of the cases are automatically set as missing (.). The last two command lines are the same as in the first way, replacing all cases that are not self-employed or are self-employed men with a zero. Note here that we do not use jbstat!=1 (jbstat not equal to one) as this would code the cases who are missing on jbstat to zero.
We can now reduce the three commands to two:
gen fem3=1 if jbstat==1 & sex==2
replace fem3=0 if ((jbstat>=2 & jbstat<=10) | ///
sex==1) & !missing(jbstat)
In this way the last two lines are combined to replace cases who are not self-employed or are men as zeros. It is necessary to include the last element of the command line - & !missing(jbstat) - in order not to include men who have missing values on the jbstat variable in the zeros. This element means 'not missing values on the jbstat variable'.
The most efficient way to create this new variable is to use Stata's logic for constructing new variables, which means it can be done in one command line:
gen fem4=jbstat==1 & sex==2 if ///
!missing(jbstat, sex)
In this form of the gen command, cases who are self-employed (jbstat==1) and women (sex==2) are automatically given the value 1 while all other cases are given zeros except for those who are missing on either jbstat or sex - if !missing(jbstat, sex) - in which case they are given a missing value (.). Note the use of the if in this way rather than the & in the last line of making the fem3 variable. Including the variable sex in the parentheses after !missing is not strictly necessary as we know in our data that there are no missing values on that variable.
A check on the summary statistics (see Chapter 5) of the new variables shows that they are identical:
. su fem*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        fem1 |      9912    .0204802    .1416433          0          1
        fem2 |      9912    .0204802    .1416433          0          1
        fem3 |      9912    .0204802    .1416433          0          1
        fem4 |      9912    .0204802    .1416433          0          1
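Identical summary statistics are strong evidence, but they do not prove that the variables agree case by case. A stricter check, sketched here on six made-up cases, is to assert that the step-by-step version and the one-line version are equal for every case, including those left as missing.

```stata
* Toy data: hypothetical combinations of job status and sex,
* including two cases with missing job status
clear
input jbstat sex
1 2
1 1
3 2
3 1
. 2
. 1
end

* Step-by-step version
gen fem1 = .
replace fem1 = 1 if jbstat == 1 & sex == 2
replace fem1 = 0 if jbstat >= 2 & jbstat <= 10
replace fem1 = 0 if sex == 1 & jbstat == 1

* One-line version
gen fem4 = jbstat == 1 & sex == 2 if !missing(jbstat, sex)

* Silent if the two versions agree on every case
assert fem1 == fem4
list
```

In this toy data only the first case is coded 1, the next three are coded 0, and the two cases with missing jbstat stay missing on both variables.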
DUMMY VARIABLES
One very common reason to manipulate pre-existing variables is to create dummy variables, or dichotomous variables, with the values of 0 and 1. Dummy variables are a special case of nominal variables. Remember, nominal variables are those which measure qualitative characteristics such as sex, religion, or marital status. There are two reasons why we use dummy variables, and they are based upon what you should already know about nominal variables:
• The only measure of central tendency that makes sense for a nominal variable is the mode. The mean and median for nominal variables are not meaningful. Although you can get Stata or any other software program to produce a mean and a median for nominal variables, that doesn't mean they have any meaningful value.
• There are lots of nominal-type characteristics that are important predictors of the things that social scientists are interested in - but because we can't determine a 'meaningful mean' or median, we need to make a special adjustment so that we can use nominal variables in multivariate statistics.
See Box 3.2 if you've heard of dummy variables but aren't exactly sure what they are or how to interpret them.
Box 3.2: Dummy variables
Let us take a typical dummy variable: sex. Sex is a nominal variable and a predictor of many dependent variables in which we are interested: pay, poverty, family structure, etc. But sex is nominal, which limits what we can do with it in terms of statistics. However, there are ways of including nominal variables in statistical analysis through the use of these dummy variables. If we take the example of sex, transforming it into a dummy variable would mean that we actually pick a category of sex and code it 1 (e.g. female), leaving the other sex (in this case, male) to be coded 0. Why would we do this?
Let's look at it in a spreadsheet.
[Spreadsheet: ID and sex for ten respondents]
In this example, ID just refers to the personal identifier of the respondent, and sex is the dummy variable where 1 = female and 0 = male. When we transform our variable into a dummy variable like this, it becomes possible to calculate a mean that we can interpret. If we add up the values of the variable sex above and divide by the number of cases, we get:
(sum of the ten values of sex)/10 = 6/10 = 0.6
Because we've transformed the variable into 1s and 0s, what this number represents now is the proportion of the sample that takes the value 1 on the dummy variable; that is, the proportion of the sample that is female. So typically, you would see a dummy variable such as this presented in a table of descriptive statistics like this:
Variable	Mean	Standard Dev.	Minimum	Maximum
Sex (1 = female)	0.6	-	0	1
Sometimes the note '(1 = female)' isn't presented and the authors just put 'Female' instead of 'Sex' as the variable name, assuming that readers will understand that it is a dummy variable:
Variable	Mean	Standard Dev.	Minimum	Maximum
Female	0.6	-	0	1
So, both of these examples mean exactly the same thing: 60% of the sample is female. There is also the additional hint that the variable is a 'dummy' because the minimum is reported as 0 and the maximum is reported as 1. The standard deviation for dummy variables is not interpretable, although some authors occasionally put it in.
Often, dummy variables are created out of yes/no questionnaire items as well (which are also dichotomous). Suppose you saw something like this:
Variable	Mean
Capital punishment should be reinstated	0.15
Single parent	0.45
You could understand it to mean that 15% of the sample answered 'Yes' to a questionnaire item which asked if they think capital punishment should be reinstated. Similarly, 45% of this sample were single parents when the only possible categories in the variable were 1 = yes, 0 = no.
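The idea that the mean of a dummy variable is a proportion is easy to verify for yourself. The sketch below enters ten made-up cases, six of them coded 1, and asks for summary statistics; the variable name female and the order of the values are assumptions made for illustration only.

```stata
* Toy data: ten hypothetical respondents, 1 = female, 0 = male
clear
input id female
1 1
2 1
3 0
4 0
5 1
6 1
7 0
8 1
9 0
10 1
end

* The mean of a 0/1 variable is the proportion coded 1
summarize female

* The same proportion appears in the Percent column for category 1
tabulate female
```

With six of the ten cases coded 1, summarize reports a mean of .6, a minimum of 0 and a maximum of 1.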
Nominal variables with more than two categories
Nominal variables often have more than two categories. Let us consider the example of a measure of where people live or their region of residence with a variable that has the following categories: 1 = North, 2 = South, 3 = East, 4 = West. If you want to include region as an independent variable in your study you need to make some adjustments to it. You must make dummy variables for as many categories as there are in the original variable. In other words, we need to make four dummy variables to measure region in this example. So what we do is:
• make a variable North where 1 = North, 0 = all other responses;
• make a variable South where 1 = South, 0 = all other responses;
• make a variable East where 1 = East, 0 = all other responses;
• make a variable West where 1 = West, 0 = all other responses.
Let's look at our original variable in a spreadsheet:
ID	Region
1	3
2	2
3	4
4	1
5	1
6	1
7	2
8	3
9	2
10	2
So we have 3 cases in the North, 4 in the South, 2 in the East, and 1 in the West. When we convert region into dummy variables, the spreadsheet will look like this:
ID	Region	North	South	East	West
1	3	0	0	1	0
2	2	0	1	0	0
3	4	0	0	0	1
4	1	1	0	0	0
5	1	1	0	0	0
6	1	1	0	0	0
7	2	0	1	0	0
8	3	0	0	1	0
9	2	0	1	0	0
10	2	0	1	0	0
We have the original variable here and the four dummy variables. If we compute the mean of the dummy variables we will get:
North: (0 + 0 + 0 + 1 + 1 + 1 + 0 + 0 + 0 + 0)/10 = 0.3
South: (0 + 1 + 0 + 0 + 0 + 0 + 1 + 0 + 1 + 1)/10 = 0.4
East: (1 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 0)/10 = 0.2
West: (0 + 0 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0)/10 = 0.1
These means represent the proportion of those who had the value 1 on each of the dummy variables. In a table, you would see this presented typically as:
Variable	Mean	Standard Dev.	Minimum	Maximum
Region
North	0.3	-	0	1
South	0.4	-	0	1
East	0.2	-	0	1
West	0.1	-	0	1
Now we have four dummy variables measuring region. In our sample, 30% live in the North, 40% in the South, 20% in the East and 10% in the West. Note that if you add up these percentages, you get 100%, and likewise if you add up the numbers in the 'mean' column, you will end up with 1.0.
You may wonder why the variable sex can be converted into just one dummy variable and region required four. The reason is that when we convert a dichotomous variable (which only has two categories in the original) we only need to present one variable for the category we choose to code as '1'. It is assumed that whatever the other category for the dichotomous variable was will be represented as the zero.
You will want to note that when you use dummy variables in a multivariate analysis, one group always is omitted as the 'reference category'. In the case of sex, male is omitted and the coefficients for 'female' would be compared to those for 'male'. In the example of region, one category would have to be omitted (e.g. North) and the other categories compared to the omitted category when interpreting the results. This will be discussed further when we are including dummy variables in multivariate estimations in Chapters 8 and 9.
As we've said before, and will say again, there is often more than one way to complete a task in Stata that gets you to the right answer. The creation of variables is a case in point. Let us show you some different ways of creating dummy variables.
Suppose we wanted to create a dummy variable for those who have a university degree called degree from the educ variable where there are 12 categories of qualifications, with 1 and 2 being degrees and the missing values recoded to a dot (.). We could do this in a number of more or less efficient ways.
(a)   Create a new variable and recode it:
generate degree1=educ
recode degree1 1/2=1 3/max=0
(b)   Recode educ and make the new variable at the same time:
recode educ (1/2=1) (3/max=0), ///
gen(degree2)
(c)   Use generate and replace commands:
gen degree3=1 if educ<=2
replace degree3=0 if educ>=3 & ///
!missing(educ)
(d)   Use the generate command in one line:
gen degree4=educ<=2 if !missing(educ)
It is very important in (c) and (d) to have the !missing(educ) element at the end - note that one is a & and the other is an if. If you make another variable (degree5) where you omit this part from (d), and then do a frequency distribution on both, you will see that they are different.
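gen degree5=educ<=2
tab1 degree4 degree5

-> tabulation of degree4

    degree4 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      9,173       92.72       92.72
          1 |        720        7.28      100.00
------------+-----------------------------------
      Total |      9,893      100.00

-> tabulation of degree5

    degree5 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      9,544       92.99       92.99
          1 |        720        7.01      100.00
------------+-----------------------------------
      Total |     10,264      100.00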
Where have the missing values gone? Without the !missing(educ) element of the command, Stata has put all the missing values into the 0 category of the dummy variable degree5. This is because Stata treats missing values (dot, .a, etc.) as very high numbers, and every time we use < or > we must ensure that we tell Stata not to include missing values!
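The difference between degree4 and degree5 can be reproduced on a handful of cases. The five values of educ below are made up for illustration, with one of them missing.

```stata
* Toy data: hypothetical values of educ, one of them missing
clear
input educ
1
2
5
12
.
end

* With the !missing() element the missing case stays missing
gen degree4 = educ <= 2 if !missing(educ)

* Without it the missing case is silently coded 0
gen degree5 = educ <= 2

list

* Count the cases wrongly placed in the 0 category (1 case here)
count if degree5 == 0 & missing(educ)
```

A count of this kind is a quick safeguard to run whenever a new variable has been built from a condition using < or >.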
You may wonder why, in (d), Stata knows to make a variable coded 1 and 0. Nowhere in the code are 0 and 1 explicitly stated, like in the other examples. This has to do with how Stata understands double and single equals signs. In (d), we are setting a condition and if it is met, Stata logically assigns 1 to the condition and 0 to the cases where the condition is not met. See Box 3.3 on double and single equals signs.
Now we have created five variables measuring whether or not a respondent has a degree. We can get rid of the variables we no longer need using the drop command.
Box 3.3: Double and single equals signs
New users to Stata often get confused as to when single (=) or double (==) equals signs should be used. A simple rule of thumb is that a single equals sign is used to assign the value at the right of the sign to the left. For example, when a command creates a new variable as a copy of the variable educ, the single equals sign tells Stata that the new variable is going to be the same as educ.
A double equals sign, on the other hand, is used to check whether the value to the right is the same as the value to the left. It is used for comparison purposes, rather than assignment purposes. For example, the command
summarize if sex==2
would return summary statistics of all variables in the data set for women only.
So, if you are going to set the values of a variable to something, use a single equals sign, and if you are checking for equality, use the double equals sign. As with anything else, the more you practice, the more natural it becomes to tell when to use the double equals sign and when to use the single equals sign.
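The two kinds of equals sign can be seen side by side in a short sketch. The three cases and the variable names age2 and female are made up for illustration; sex is coded 2 for women as in the example data.

```stata
* Toy data: hypothetical values
clear
input sex age
2 34
1 51
2 27
end

* Single equals sign: assign the values of age to a new variable
gen age2 = age

* Double equals sign: a comparison used in an if qualifier
summarize age if sex == 2

* Both together: the result of the comparison sex == 2 (1 or 0)
* is assigned to the new variable female
gen female = sex == 2
list
```

The last command is the same construction as in (d) above: the single equals sign does the assigning, and what it assigns is the 1 or 0 produced by the comparison. Because sex has no missing values here, no !missing() element is needed.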
Often when we want to include nominal (categorical) variables in statistical estimations, it is necessary to create a series of dummy variables from the original variable. This is done easily by combining the tab and gen commands. For example, if we wanted to create a series of dummies for the job status variable:
tab jbstat, gen(jobstat)
If you look at the bottom of your variable list, you will see that a number of dummy variables have been created (jobstat1, jobstat2, jobstat3, etc.). You can do a frequency distribution of these variables like any other variable. You will see that they are all dichotomized into a 0/1 format.
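The same approach reproduces the region dummies from Box 3.2. The sketch below enters the ten cases from that spreadsheet, lets tabulate create the dummies, and then renames them; the stub name reg and the new names north, south, east and west are our own choices for illustration.

```stata
* The ten cases from the region example in Box 3.2
clear
input id region
1 3
2 2
3 4
4 1
5 1
6 1
7 2
8 3
9 2
10 2
end

* Creates reg1 to reg4, one dummy per category of region
tabulate region, gen(reg)

* Give the dummies more meaningful names
rename reg1 north
rename reg2 south
rename reg3 east
rename reg4 west

* The means are the proportions in each region
summarize north south east west
```

The means reported by summarize are .3, .4, .2 and .1, matching the hand calculations in the box.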
LABELLING VARIABLES
From our previous recode example we have a new marital status variable (mastat2). Having done the frequency distribution (tab) of the variable, we can see that there are no labels attached to the values of the variable. Now we want to label the variable and values so that we don't forget what each number stands for. Labelling in Stata is a bit different for those used to labelling variables in other statistical software packages. There are essentially three steps:
1. Create a label for the variable itself.
2. Define a label for values.
3. Attach that label to a specific variable.
We first label our variable:
label var mastat2 "simpler marital status"
Now our variable mastat2 has a label 'simpler marital status'. If you tab mastat2 you will see this new label and it will also be in your variable list.
Now we want to label the values of the categories of mastat2. To do this use the command label define. First, we have to define a name for our labels. We are calling these labels 'marital'.
lab def marital 1 "partnered" 2 "prev married" ///
    3 "single"
If you make an error when you are using this command and try to do it again, you will likely get a message that the labels have already been defined. Stata will not let you overwrite labels unless it is sure that you actually want to! You can assure Stata that you intend to overwrite labels by adding the option modify. Try:
lab def marital 1 "partnered" 2 "prev married" ///
    3 "single", modify
It is important to realize that at this point you have defined a label, but this label is not yet attached to any variable categories. The label exists separately from any variable. If you type label list you will see all the labels in your data set. The label for 'marital' will be there with all the category values defined as we have entered them above.
Now we must attach these category labels to the variable using the command label values.
lab val mastat2 marital
This tells Stata to attach the values we assigned to the label 'marital' to the variable mastat2. Once value labels have been defined they can be applied to other variables. If we now tabulate the new marital status variable we get:
. ta mastat2

     simpler |
     marital |
      status |      Freq.     Percent        Cum.
-------------+-----------------------------------
   partnered |      6,683       65.18       65.18
prev married |      1,489       14.52       79.70
      single |      2,081       20.30      100.00
-------------+-----------------------------------
       Total |     10,253      100.00
This may seem like a nonsense way of doing labels, but it does have one really great benefit. If you have numerous variables with the same codes (e.g. 'yes' and 'no'), it is very easy to label them.
We have created some dummy variables earlier in the chapter: one for self-employed women and one for degrees. To label them all, we would just define a label:
lab def yesno 1 "yes" 0 "no"
and then attach these values to our dummy variables:
lab val feml yesno
lab val degree4 yesno
... etc.
If you were using multiple dummy variables in your data set, you can easily see how quick it would be to label them all.
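A minimal sketch of attaching one defined label to several dummy variables at once is shown here. The variable names d1 to d3 are hypothetical, and the sketch assumes a version of Stata in which label values accepts a list of variables.

```stata
* Toy dummies; d1-d3 are hypothetical names
clear
input d1 d2 d3
1 0 0
0 1 1
1 1 0
end

label define yesno 1 "yes" 0 "no"

* One command: a list of variables followed by the label name
label values d1 d2 d3 yesno

* The same thing written as a loop over the variables
foreach v of varlist d1-d3 {
    label values `v' yesno
}

label list yesno
tabulate d1
```

Because the label 'yesno' exists separately from any variable, changing its definition later (with the modify option) updates the display of every variable it is attached to.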
For ways of changing a number of variable names at once, see the commands in Box 3.4.
Box 3.4: Renaming variables
If a variable already has a name you might want to change it. It could be that the original one isn't as informative as it should be; some of the older data sets were very restricted in the number of characters they could use to name variables, so it's not uncommon to see variable names such as std0068 for the number of children in the house, which it would be more convenient to change to numchd. For this example you would use:
rename std0068 numchd
The rename command is suitable when you only have a few variables to change as you can only change one variable per rename command. If you have more than half a dozen variables to change it is worth installing the renvars command. Type:
findit renvars
and then follow the instructions to install the command.
There are a number of options with renvars but mostly it allows you to rename multiple variables in one command. Let's continue with our example above where a data set has variable names std0068, std0069 and std0070 and you want to change these to more meaningful names. Typing
renvars std0068 std0069 std0070 \ numchd ///
    chdage chdsex
would change the variable names listed to the left of the backslash (\) to the new variable names to the right of the backslash in the respective order. If the original variables are consecutive in the data set then you can just put the first and last separated by a dash (as with many other Stata commands), for example:
renvars std0068-std0070 \ numchd chdage chdsex
Another useful command to be aware of is renpfix. This command can either change variable name stubs, such as std in our example, or remove them completely. If you wanted to change the std stub to chd for our three example variables, you would type
renpfix std chd
You would now have three variables: chd0068, chd0069 and chd0070. If you typed
renpfix st
Stata would remove the st stub from all variables that start with that stub, in this case leaving three variables named d0068, d0069 and d0070. All variables that start with the stub will be changed, so be careful with this command. Note that you cannot use variable names that start with a number.
A common use for renpfix is if you have panel data and each wave of data has variables that start with the same letter. In the British Household Panel Survey all the variables in the first wave start with a, second wave start with b, third wave start with c, and so on. When you put data sets together in a panel format the variables need to be named the same in each, so renpfix can be used to remove the starting letter (stub) from all variable names.
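As an aside that goes beyond the box: in more recent versions of Stata (this sketch assumes Stata 12 or later) the built-in rename command itself accepts groups of names and wildcards, so the user-written commands above may not be needed. The data values below are made up.

```stata
* Toy data using the box's example variable names; values are hypothetical
clear
input std0068 std0069 std0070
2 5 1
0 . .
end

* Several variables in one command: (old names) (new names)
rename (std0068 std0069 std0070) (numchd chdage chdsex)
describe

* Put the original names back, then change the stub with a wildcard
rename (numchd chdage chdsex) (std0068 std0069 std0070)
rename std* chd*
describe
```

If you are working in an older version, the renvars and renpfix commands described in the box remain the way to do this.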
MORE ON GENERATING NEW VARIABLES
Stata has a second, 'extended' generate command that allows you to create numerous types of variables from your existing data both for analysis and for data management. The command is egen, and as you get more familiar with Stata you will probably find yourself using this command more and more as it is very powerful. There are a considerable number of uses for egen so we cannot cover all of them here. Instead we use a few examples to illustrate its utility. Remember that you can type help egen to find out all of the ways to use the command.
The group function of egen creates a new variable by combining categories of two or more variables. Let's take a simple example of two variables both with only two categories. First we make a variable that indicates those who are married compared to all others:
recode mastat (1=1) (2/max=0), gen(married)
We have included the (1=1) for clarity but it is not strictly necessary for this recode. Now we want to create a new variable that has the following four categories: married men, married women, unmarried men, and unmarried women. If we crosstabulated the variables married and sex we would see that:
ta sex married

. ta sex married

           |   RECODE of mastat
           |   (marital status)
       sex |         0          1 |     Total
-----------+----------------------+----------
      male |     1,886      2,947 |     4,833
    female |     2,369      3,062 |     5,431
-----------+----------------------+----------
     Total |     4,255      6,009 |    10,264
From this table we can see the numbers in each of the four categories we are interested in, but we don't have them in a single variable. The long way to do this would be to use the gen and replace commands with if statements such as:
gen sexmar1=1 if sex==1&married==0
replace sexmar1=2 if sex==1&married==1
replace sexmar1=3 if sex==2&married==0
replace sexmar1=4 if sex==2&married==1

. gen sexmar1 = 1 if sex==1&married==0
(8378 missing values generated)
. replace sexmar1 = 2 if sex==1&married==1
(2947 real changes made)
. replace sexmar1 = 3 if sex==2&married==0
(2369 real changes made)
. replace sexmar1 = 4 if sex==2&married==1
(3062 real changes made)
Then if we label the categories and tabulate the new variable:
lab def sexmar 1 "unmarried men" ///
    2 "married men" 3 "unmarried women" ///
    4 "married women"
lab val sexmar1 sexmar
ta sexmar1

. lab def sexmar 1 "unmarried men" ///
> 2 "married men" 3 "unmarried women" ///
> 4 "married women"
. lab val sexmar1 sexmar
. ta sexmar1

        sexmar1 |      Freq.     Percent        Cum.
----------------+-----------------------------------
  unmarried men |      1,886       18.37       18.37
    married men |      2,947       28.71       47.09
unmarried women |      2,369       23.08       70.17
  married women |      3,062       29.83      100.00
----------------+-----------------------------------
          Total |     10,264      100.00
If you compare the categories of this new variable with the crosstabulation of sex and married above you can see the new categories represent each cell of the crosstabulation. However, using the egen command considerably shortens this process:
egen sexmar2=group(sex married)
ta sexmar2

. ta sexmar2

  group(sex |
   married) |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,886       18.37       18.37
          2 |      2,947       28.71       47.09
          3 |      2,369       23.08       70.17
          4 |      3,062       29.83      100.00
------------+-----------------------------------
      Total |     10,264      100.00
We ordered our replace commands to re-create the process that the egen group function uses to make its categories so that you can easily compare the results. The rule is that the first category of the first variable in the list is taken first and then groups are made with that and the categories of the second variable. So, in the sex variable 1 = men and 2 = women, and in the married variable 0 = unmarried and 1 = married. Therefore the categories of the new variable are 1 = men unmarried, 2 = men married, 3 = women unmarried, 4 = women married. So we can use the label 'sexmar' with this variable as well:
lab val sexmar2 sexmar
ta sexmar2
. ta sexmar2

     group(sex |
      married) |      Freq.     Percent        Cum.
----------------+-----------------------------------
  unmarried men |      1,886       18.37       18.37
    married men |      2,947       28.71       47.09
unmarried women |      2,369       23.08       70.17
  married women |      3,062       29.83      100.00
----------------+-----------------------------------
          Total |     10,264      100.00

Obviously, this gets more complicated if the variables have more than two categories and if you use more than two variables. It's important to understand how Stata orders categories of variables in a list, not only for this command but for many others as well, such as the by and bysort commands that we cover in Chapter 4.
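The ordering rule for the group function can be checked on a tiny made-up data set. This sketch also uses the label option of the group function, which asks Stata to build value labels for the new variable from the labels of the original variables; the data, label names and the new variable name are assumptions for illustration.

```stata
* Toy data; values and label names are hypothetical
clear
input sex married
1 0
1 1
2 0
2 1
2 1
end

label define sexlab 1 "male" 2 "female"
label values sex sexlab
label define marlab 0 "unmarried" 1 "married"
label values married marlab

* Categories run through married within each category of sex
egen sexmar3 = group(sex married), label
tabulate sexmar3
list
```

With only two categories in each variable the order is easy to predict, but with more categories or more variables the label option saves defining the labels by hand.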
The second function of egen we cover here is rownonmiss. This function tells Stata to look across a number of variables and, for each case (row), create a new variable that shows how many non-missing values there are. Non-missing means any value which has not been designated as missing (. or .a, .b, etc.). Remember that you need to have coded the values to missing prior to using this command. In this example, we take three General Health Questionnaire (GHQ) items, ghqa, ghqb and ghqd, to show how this command works by creating a new variable called obs. Then to show the frequencies of the new variable we need to tabulate it:
egen obs=rownonmiss(ghqa ghqb ghqd)
ta obs

. egen obs = rownonmiss(ghqa ghqb ghqd)
. ta obs

        obs |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        530        5.16        5.16
          1 |          1        0.01        5.17
          2 |         14        0.14        5.31
          3 |      9,719       94.69      100.00
------------+-----------------------------------
      Total |     10,264      100.00
This table shows us that 9719 people answered all three questions, 14 answered two (note that this doesn't show you which two), 1 answered only one question, and 530 did not answer any of the three questions. Another way of looking at this is that those given a zero on the variable obs are missing on all three questions. To show this, we tabulate the three variables for those who have a zero:
tab1 ghqa ghqb ghqd if obs==0

. tab1 ghqa ghqb ghqd if obs==0

-> tabulation of ghqa if obs==0
no observations

-> tabulation of ghqb if obs==0
no observations

-> tabulation of ghqd if obs==0
no observations

This confirms that those 530 cases have not answered any of the three questions.
The values of obs can be used in a number of ways when creating and manipulating variables, especially when creating scales (see below) and deciding how many questions need to be answered to be included. It may also be used to select cases for analysis or to manage data sets when some cases are kept or dropped.
We carry on using the three GHQ items to demonstrate another function of egen. This function, rowmean, creates a new variable with a value of the mean of the variables specified. The default method for this function creates a mean for any cases that have at least one non-missing value. So, a mean might be created for someone who has only answered one of the questions. You need to decide if that is something you wish to do. So, we will use the obs variable created by the rownonmiss function to refine the variable creation. We start with the default method:
egen mean1=rowmean(ghqa ghqb ghqd)

. egen mean1=rowmean(ghqa ghqb ghqd)
(530 missing values generated)

Stata tells us that the new variable, mean1, has been created with 530 missing values. Refer back to the table of obs and see that 530 cases were missing on all three questions and these are the ones that are missing on this new variable. While you are looking at the table of obs, you can see that one person answered only one of the three questions. If we wanted to exclude that person from having a mean then we can use the values of obs to condition the rowmean function by restricting it to only those who answered two or three questions:
egen mean2=rowmean(ghqa ghqb ghqd) if obs>1

. egen mean2=rowmean(ghqa ghqb ghqd) if obs>1
(531 missing values generated)
The output shows 531 missing values in the new variable, mean2, which reflects the inclusion of the person who only answered one question. To follow this example through to its logical conclusion we now use the rowmean function and the values of the obs variable to calculate a mean for only those who answered all three questions:
egen mean3=rowmean(ghqa ghqb ghqd) if obs==3

. egen mean3 = rowmean(ghqa ghqb ghqd) if obs==3
(545 missing values generated)

Now for the new variable, mean3, there are 545 missing values, 14 more than in mean2, which shows that the 14 people who only answered two questions are now given missing values. We now show the descriptives of all three newly created mean value variables:
su mean*

. su mean*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       mean1 |      9734    1.976645    .4311008          1          4
       mean2 |      9733    1.976643    .4311229          1          4
       mean3 |      9719    1.976193    .4302754          1          4
This output shows the different number of non-missing observations in each of the mean score variables. The relatively small number of cases that did not answer all three questions means that their exclusion has very little effect on the mean scores for the sample. However, there are other reasons why you may want to exclude these cases or you may be happy to include them.
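The way rownonmiss and rowmean treat missing values can be seen in miniature with four made-up cases that answer three, two, one and none of three questions. The variable names q1 to q3 and the new variable names are hypothetical.

```stata
* Toy data; q1-q3 are hypothetical questionnaire items
clear
input q1 q2 q3
1 2 3
2 . 4
. . 3
. . .
end

* Number of non-missing answers per case: 3, 2, 1, 0
egen nobs = rownonmiss(q1 q2 q3)

* Default: a mean for anyone with at least one answer
egen m_any = rowmean(q1 q2 q3)

* Only cases with two or three answers
egen m_two = rowmean(q1 q2 q3) if nobs > 1

* Only cases with all three answers
egen m_all = rowmean(q1 q2 q3) if nobs == 3

list
```

The third case has a mean under the default method even though it rests on a single answer, which is exactly the situation the if condition is there to control.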
CREATING A SCALE
It is often useful to create a scale out of a number of variables in a data set to measure a more general concept. You may, for example, have 10 different variables that measure depression, but find that they can be grouped together to form a single overall measure of depression. There are, mathematically, many ways to create a scale, but one common way is to simply add up the variables to create a summed scale.
We have a number of GHQ (Goldberg and Williams 1988) measures in our sample data set. If we add these up and create a summed measure, we would do it like this:
gen ghq= ghqa+ ghqb+ ghqc+ ghqd+ ghqe+ ///
    ghqf+ ghqg+ ghqh+ ghqi+ ghqj+ ghqk+ ghql
This command might look a little cumbersome, so let's use the egen command to do the same thing:
egen obs2=rownonmiss(ghqa-ghql)
egen ghq2=rowtotal(ghqa-ghql) if obs2==12
Compare the two:
su ghq ghq2

. su ghq ghq2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         ghq |      9613    22.77125    4.914182         12         48
        ghq2 |      9613    22.77125    4.914182         12         48

Be careful how you use the rowtotal function in egen as it does different things with missing values in some circumstances between versions 9 and 10, so it would be best to check using help egen. In this example it's not a problem as we are using the obs2 variable to determine which cases to add across the variables.
The variables used to make up this scale all have values from 1 to 4 with high values that are associated with an undesirable characteristic, such as being depressed, being under strain, or having lessened ability to face problems. So the resulting scale has a minimum of 12 and a maximum of 48. You should note that individuals who are missing on even one of the composite items are given a 'missing' value in the overall scale because of the condition obs2==12.
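The difference between adding items with gen and with the rowtotal function can be sketched on three made-up items. The comments describe the behaviour of current versions of Stata, which is an assumption about your installation; as noted above, help egen is the place to check. All variable names are hypothetical.

```stata
* Toy data; a, b and c are hypothetical items
clear
input a b c
1 2 3
4 . 2
. . .
end

* Adding with gen: missing if any item is missing
generate sum_gen = a + b + c

* rowtotal treats missing items as zero in current versions
egen sum_egen = rowtotal(a b c)

* Restricting rowtotal to complete cases reproduces the gen result
egen n = rownonmiss(a b c)
egen sum_full = rowtotal(a b c) if n == 3

list
```

The second and third cases receive totals from the unrestricted rowtotal, even though one of them answered nothing, which is why the condition on the number of non-missing items matters.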
The GHQ is a well-established scale and we can be fairly confident of its reliability and validity. But what if we are working with data that are less familiar?
It is very important that you are familiar with the variables in your proposed scale and that they are all in the 'same direction'.
The difference between Stata and other statistical software programs is that if you were making a satisfaction scale in SPSS, for example, it would be important that all the variables used to create this scale have the same 'direction' of measurement. If some are coded in the opposite order (to measure dissatisfaction, for example), you would have to recode these items so that all the measures were in the same direction (i.e. measuring the extent of satisfaction). Stata, however, has the ability to examine variables and 'reverse score' items that it thinks are reverse coded using the command alpha. The default setting in Stata is to empirically determine the relationship and reverse the scorings for any that enter (i.e. are correlated with other items) negatively. We will return to this topic shortly.
But how do you know if it is a good scale? There are several ways of assessing the reliability of a scale. One of the most common is Cronbach's alpha, which is what the scaling command alpha is based upon. For all proposed scale items, alpha computes the inter-item correlations (or covariances) for all pairs of variables in the variable list. The command will also return a Cronbach's alpha statistic for the new scale. Cronbach's alpha statistic ranges from 0 to 1, with values closer to one indicating a 'better', more internally consistent scale. It should be noted, however, that scales comprised of a high number of variables will have higher alpha values than scales with the same inter-item correlations but with fewer variables. So the number of items in a scale must be kept in mind when assessing the alpha value. In general, a value of about 0.70 or higher is considered acceptable for a scale.
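The dependence of alpha on the number of items can be seen from the usual textbook formula, given here as a sketch rather than as a statement of exactly how Stata computes it. The symbols are our own notation: k is the number of items, c-bar the average inter-item covariance, v-bar the average item variance and r-bar the average inter-item correlation.

```latex
\alpha = \frac{k\,\bar{c}}{\bar{v} + (k-1)\,\bar{c}}
\qquad\text{or, for standardized items,}\qquad
\alpha = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}

% Illustration with an assumed average correlation of 0.2:
% k = 5:   \alpha = (5 \times 0.2)/(1 + 4 \times 0.2)  = 1.0/1.8 \approx 0.56
% k = 10:  \alpha = (10 \times 0.2)/(1 + 9 \times 0.2) = 2.0/2.8 \approx 0.71
```

With the same average correlation, doubling the number of items lifts alpha from about 0.56 to about 0.71, which is why a given alpha is less impressive for a long scale than for a short one.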
Let's try a different set of variables - those that assess opinions on women and their role in the home and workplace. To find out how good our measure is, we can ask Stata to report a reliability coefficient for us:
alpha opfama opfamb opfamc opfamd opfame ///
    opfamf opfamg opfamh opfami
or
alpha opfama-opfami
Note that if you use the latter of these two alpha commands, the variables must be in this order in your variable list.

. alpha opfama-opfami

Test scale = mean(unstandardized items)
Reversed items:  opfamc opfamd opfame opfamh opfami

Average interitem covariance:     .204454
Number of items in the scale:           9
Scale reliability coefficient:     0.6958

We are given an alpha of 0.6958, which is a little on the low side for a nine-item scale. We can also see that Stata has listed a number of reversed items. This means that Stata has decided that these five items are 'negatively' scored compared to the other items in the scale, and as such, has 'reverse coded' them to fit in with the theme of the scale.
Let's examine the items as they were asked to the survey respondents.
opfama:   A pre-school child is likely to suffer if his or her mother works.
opfamb:   All in all, family life suffers when the woman has a full-time job.
opfamc:   A woman and her family would all be happier if she goes out to work.
opfamd:   Both the husband and wife should both contribute to the household income.
opfame:   Having a full-time job is the best way for a woman to be an independent person.
opfamf:   A husband's job is to earn money; a wife's job is to look after the home and family.
opfamg:   Children need a father to be as closely involved in their upbringing as the mother.
opfamh:   Employers should make special arrangements to help mothers combine jobs and childcare.
opfami:   A single parent can bring up children as well as a couple.
The response categories for all items were: (1) strongly agree, (2) agree, (3) neither agree nor disagree, (4) disagree and (5) strongly disagree.
As the items are, if a person strongly agreed that pre-school children suffered if a mother worked (opfama), he or she would get a score of 1. However, such a person would be unlikely to strongly agree with the statement that a woman and her family would be happier if she worked (opfamc). In this case, the respondent might strongly disagree with such a statement, giving a score of 5. However, both these opinions reflect a tendency to be 'conservative' in opinions about gender roles in a family. Stata has picked up that people who answered in certain ways on items like opfama and opfamb (where high scores reflect more 'liberal' opinions about gender roles) were likely to have 'reversed' scores on items opfamc, opfamd, opfame, opfamh and opfami (where high scores reflect a more 'conservative' orientation). As the items that were 'reverse scored' were more conservative, this means that higher scores on our new scale are associated with more liberal opinions.
But does Stata always get it right? It is the case that the five items highlighted by Stata as being reverse coded are 'opposite' in direction to opfama, opfamb and opfamf. But you could argue that agreeing with opfamg may also indicate more liberal views. Stata, however, hasn't picked this up.
The problem isn't something with Stata. The algorithm with which Stata works to decide which items should be reverse coded relies on the item's correlations with the other scale items. If it is negatively correlated, it becomes reverse coded.
Let's create a correlation matrix of these items:
[Correlation matrix of the items opfama to opfami; the command and its output are not legible in this copy.]
You can see from this matrix that the reverse coded items (opfamc, opfamd, opfame, opfamh, opfami) are negatively correlated with opfama, opfamb and opfamf (and positively with the other 'reverse coded' items). On the other hand, opfamg does not follow this general pattern, which is why it wasn't reverse coded by Stata.
There are a couple of possible reasons for this. The most likely is that it is not a good item for your scale - that it somehow does not 'fit' as well as the other items. You can check this by using the option item:
alpha opfama-opfami, item
. alpha opfama-opfami, item

Test scale = mean(unstandardized items)

[Item table with one row per item and the columns Item, Obs, Sign, item-test correlation, item-rest correlation, average interitem covariance and alpha. Most cell values are not legible in this copy. The alpha entry for opfamg is 0.7152; the test scale row shows an average interitem covariance of .204454 and an alpha of 0.6958.]

Of particular interest is the last column, labelled 'alpha'. It would be better labelled as 'alpha if item removed', because that is what it is telling us. If we look down to opfamg, we can see that the alpha of the scale would improve to 0.7152 if we took this item out. Alpha can only tell us this information for items one at a time - it won't tell us the effect on alpha of removing two or three items at once.
So it seems that opfamg isn't such a great measure for our proposed scale. And if we look at it, we can see that the item is somewhat different from the other items because it is asking about general child upbringing, rather than issues pertaining specifically to employment.
It should be stressed that you should always check the items in your scale and make sure that they make theoretical sense. The mathematical techniques involved in scale construction do not do this for you!
So what happens if we take out opfamg?

. alpha opfama opfamb opfamc opfamd opfame opfamf opfamh ///
> opfami, item

[Item table for the eight remaining items, followed by the test scale row. The Sign column shows opfama (+), opfamb (+), opfamc (-), opfamd (-), opfame (-), opfamf (+), opfamh (-) and opfami (-); the remaining cell values are not legible in this copy.]
We can see that the alpha has improved and that no further removal of individual items will increase our alpha.
As we haven't created the scale variable yet, we could ask Stata to create it for us after it has computed the reliability, using the gen option.
alpha opfama opfamb opfamc opfamd opfame ///
    opfamf opfamh opfami, gen(famscale1)
The gen option tells Stata to create a new variable (we have named it famscale1). Unless we tell Stata otherwise, it will go ahead and reverse code items. If you are ever in a position where you don't want Stata to do this, typing the option asis will tell Stata not to reverse code any items. You can also use the option std to get Stata to create the new variable in standardized form (a mean of 0 and a standard deviation of 1).
It is very important to note how Stata handles missing data in the alpha command. In other commands that we have talked about so far in this book, cases are deleted from the analysis if they are missing on at least one of the variables under consideration. So when we tab sex age, if a case is missing on the age variable, it will not be included in the tabulation. It is deleted 'casewise'. Similarly, we could create the scale just by simply adding up all the variables using the gen command (after reverse coding manually). Then the cases where there was missing data on one or more of the scale items would not be included. The default in the gen option, however, is to create a score for every observation where there is a response to at least one item. In other words, a scale value could be created for someone who answered only one of the eight opfam variables. The score is calculated by dividing the summative score by the total number of items available for the specific case.
If you find this objectionable (we do!), you may want to employ the option casewise so that cases with any missing values on the scale items are deleted. Let's see how this changes our results.
alpha opfama opfamb opfamc opfamd opfame ///
    opfamf opfamh opfami, gen(famscale2) casewise

. alpha opfama opfamb opfamc opfamd opfame ///
> opfamf opfamh opfami, gen(famscale2) casewise

Test scale = mean(unstandardized items)
Reversed items:  opfamc opfamd opfame opfamh opfami

Average interitem covariance:     .2540916
Number of items in the scale:            8
Scale reliability coefficient:      0.7166

The results suggest a slightly better alpha. But we also know that cases are only included if respondents answered all the items in our scale.
If you don't want to be so stringent, you can set an alternative minimum number of items that must be non-missing in order for the scale to be constructed. Let's say we decided that a person must have answered at least five of the eight items in order to be included in the scale; we would use the option min.
alpha opfama opfamb opfamc opfamd opfame ///
    opfamf opfamh opfami, gen(famscale3) min(5)
When we create a scale, we are just really adding up items and, as such, we would rightly expect that if we are adding up 8 items, all of which have values of 1 to 5, our scale will have a minimum of 8 and a maximum of 40. People who score 8 would be very conservative, while those around the 40 mark would be rather liberal.
recode opfamc opfamd opfame opfamh opfami ///
    (1=5) (2=4) (3=3) (4=2) (5=1)
gen famscale4= opfama+ opfamb+ opfamc+ ///
    opfamd+ opfame+ opfamf+ opfamh+ opfami
However, if we summarize our new variables (famscale1, famscale2 and famscale3) and compare them with famscale4, created by manually reverse coding and using gen to add the items (note the use of the wildcard * that saves typing all the variables), we get:
su famscale*

    Variable |       Obs
-------------+----------
   famscale1 |      9718
   famscale2 |      9515
   famscale3 |      9657
   famscale4 |      9515

[The Mean, Std. Dev., Min and Max columns of this output are not reliably legible in this copy.]
What is happening?
In order to understand these values, you need to understand how Stata constructs the scale with the alpha command. When an item is 'reverse scored' it isn't recoded so that 1 becomes 5, 2 becomes 4, etc. What happens is that a negative sign is placed in front of all the original values so that the original values are changed to:
Variable   Direction   New values
opfama         +        1  2  3  4  5
opfamb         +        1  2  3  4  5
opfamc         -       -5 -4 -3 -2 -1
opfamd         -       -5 -4 -3 -2 -1
opfame         -       -5 -4 -3 -2 -1
opfamf         +        1  2  3  4  5
opfamh         -       -5 -4 -3 -2 -1
opfami         -       -5 -4 -3 -2 -1
This logic produces an overall total, for those who answered all eight items, with a minimum of -22 and a maximum of 10. If we divide these scores by 8 (the total number of items), we get -2.75 and 1.25, which match the minimum and maximum values in the data shown for famscale2, which used the casewise option so that only cases with data on all eight items were included.
The theoretical minimum and maximum values for famscale1, created using the default settings for the gen option, are -5 and 5 as it is possible for respondents to answer just one item and be included in the scale. You can see from the Obs column that there are about 200 more cases in the famscale1 variable than in the famscale2 variable.
Determining the theoretical minimum and maximum values of the scale is a little more complex for famscale3, which used the option min(5) to tell Stata to use only cases that have answers to five or more items. If we take the five lowest possible scores, which are the five reverse coded items, then we get -25/5 = -5 as a minimum. The five highest possible scores are the three unaltered items and two reverse coded items: 5+5+5-1-1 = 13, so 13/5 = 2.6 is the maximum. You can see that the actual minimum and maximum values in the data fall short of these extremes.
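The arithmetic behind these theoretical limits can be checked with Stata's display command. This is only a restatement of the calculations in the text, with three unreversed items scored 1 to 5 and five reversed items scored -5 to -1.

```stata
* All eight items answered (casewise): sum divided by 8
display (3*1 + 5*(-5))/8
display (3*5 + 5*(-1))/8

* Default gen option: a single answered item is enough
display (-5)/1
display 5/1

* min(5): the five lowest and the five highest possible item scores
display (5*(-5))/5
display (3*5 + 2*(-1))/5
```

The six results are -2.75, 1.25, -5, 5, -5 and 2.6, matching the limits given above for the casewise, default and min(5) versions of the scale.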
However, if we correlate all four scales we see that they are mathematically equivalent for the cases that were included on each pair of scales. The decision rests with you, as the analyst, as to how many answers you need for someone to obtain an overall scale score.
             | famsca~1 famsca~2 famsca~3 famsca~4
-------------+------------------------------------
   famscale1 |   1.0000
             |     9718
             |
   famscale2 |   1.0000   1.0000
             |     9515     9515
             |
   famscale3 |   1.0000   1.0000   1.0000
             |     9657     9515     9657
             |
   famscale4 |   1.0000   1.0000   1.0000   1.0000
             |     9515     9515     9515     9515

For an account of using a scale in an applied research project, see Box 3.5.
Box 3.5: An example of using a scale in a project
In a project funded by the Department of Health we conducted a series of analyses investigating the consequences of early child-bearing on outcomes later in life. These were later published as papers examining housing,1 partners and partnerships,2 and gender differences in outcomes.3 The data we used were from the British Cohort Study of 1970 which has followed approximately 15,000 children from birth in 1970 until the present day. Part of the analysis was to construct a measure of childhood behaviour prior to any childbearing. The cohort was interviewed at age 10 in 1980. At this time their teachers were also asked about their behaviour in school. We decided to use teacher reported behaviour as well as parent reported behaviour. We took all the answers from the teachers and examined the correlations between the variables and, after some thought and analysis, settled on a scale of five items that measured the child's concentration on educational tasks, their popularity with peers, the number of friends, their level of co-operation with peers, and the extent that the teachers were able to negotiate with the child. All items were measured using a 'thermometer' type response ranging from 1 to 47.
These five items had correlations between 0.33 and 0.84, which suggested that they might make a reasonable scale without all measuring the same thing. We standardized all five items to a mean of 0 and a standard deviation of 1 using the egen command to make five new variables: zone, ztwo, zthree, zfour, zfive. Then we used the alpha command to determine that the resulting scale had a Cronbach's alpha of 0.82, which was very satisfactory for a five-item scale. We used the item option on the alpha command to see if omitting items would improve the alpha value, and the results indicated that no significant improvement could be gained. To create the final score for each child we used the egen command again to take a mean of the five items where a low score indicated poor behaviour.

The final measure of childhood behaviour was a significant predictor of early parenthood, but the regression models indicated an interaction effect with gender (see Chapters 8 and 9 for regression and interaction effects) where this behaviour predicted early motherhood but not early fatherhood. Then we looked at when the cohort was 30 years old and we found that this behaviour measure was significantly associated with a range of outcomes including their educational attainment, labour force participation, pay, social class, house ownership, and receipt of benefits.

1 Ermisch, J.F. and Pevalin, D.J. (2004) Early childbearing and housing choices. Journal of Housing Economics, 13: 170-194.
2 Ermisch, J.F. and Pevalin, D.J. (2005) Early motherhood and later partnerships. Journal of Population Economics, 18: 469-489.
3 Robson, K. and Pevalin, D.J. (2007) Gender differences in the predictors and outcomes of young parenthood. Research in Social Stratification and Mobility, 25: 205-218.
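The steps described in Box 3.5 can be sketched in a few lines. The data here are simulated purely for illustration: the item names t1 to t5, the seed and the way the ratings are generated are our assumptions and not the cohort data, so the alpha value will not match the 0.82 reported in the box.

```stata
* Simulated stand-in for five teacher ratings (hypothetical items t1-t5)
clear
set seed 12345
set obs 500
gen common = rnormal()
forvalues i = 1/5 {
    gen t`i' = round(24 + 8*common + 6*rnormal())
}

* Standardize each item to a mean of 0 and a standard deviation of 1
egen zone   = std(t1)
egen ztwo   = std(t2)
egen zthree = std(t3)
egen zfour  = std(t4)
egen zfive  = std(t5)

* Cronbach's alpha, with the item option to see the effect of dropping items
alpha zone ztwo zthree zfour zfive, item

* Final score: the mean of the five standardized items for each case
egen behav = rowmean(zone ztwo zthree zfour zfive)
su behav
```

Note that rowmean() uses whatever items are available for a case, so a score is produced even when some items are missing.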
DEMONSTRATION EXERCISE
In this demonstration exercise we use some of the techniques covered in this and each of the subsequent chapters to conduct a series of data analyses exploring the question of social variations in mental health among working age adults. Our measure of mental health is the 12-item General Health Questionnaire, which is a scale that can be constructed in a number of ways. In this demonstration we start by using the GHQ in the form of a scale that ranges from 0 to 36, with higher scores indicating poorer mental health. The factors we are interested in using are sex, age, marital status, employment status, number of own children in the household and region of the country.
We start by opening the example data file (exampledata.dta) from our default directory. Before opening the data file we increase the memory available to Stata to 50 Mb using the set mem command:
version 10
set mem 50m
cd "C:\project folder"
use exampledata.dta
Next we use the keep command to retain only the individual level variables we need for this analysis and then recode all the negative values to missing. We keep the individual identifier (pid) and the household identifier (hid) so that we can match on household level information in the next chapter.
keep pid hid ghq* sex age mastat jbstat nchild
mvdecode _all, mv(-9/-1)
Stata returns the output below. You can see that after the mvdecode command, Stata tells you how many missing values were assigned for each variable. For example, the variable jbstat had 352 missing values while the variable ghql had 580 missing values. From this you can deduce that the variables pid, hid, sex, age, mastat and nchild do not have any missing values.
. mvdecode _all, mv(-9/-1)
      jbstat: 352 missing values generated
        ghqa: 536 missing values generated
        ghqb: 536 missing values generated
        ghqc: 545 missing values generated
        ghqd: 534 missing values generated
        ghqe: 535 missing values generated
        ghqf: 546 missing values generated
        ghqg: 534 missing values generated
        ghqh: 532 missing values generated
        ghqi: 534 missing values generated
        ghqj: 577 missing values generated
        ghqk: 587 missing values generated
        ghql: 580 missing values generated
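To see what mvdecode does on a small scale, the sketch below types in four made-up cases (the variables id, q1 and q2 and the codes -9, -7 and -1 are invented for illustration). It also shows how particular codes can be sent to the extended missing values .a and .b rather than to the plain dot, should you wish to keep the reasons for missingness apart.

```stata
* Four made-up cases with negative codes standing for non-response
clear
input id q1 q2
1  3 -9
2 -1  2
3  4  1
4  2 -7
end

* Keep a copy of q2 so that both approaches can be compared
gen q2b = q2

* Every code from -9 to -1 becomes the system missing value (.)
mvdecode q1 q2, mv(-9/-1)

* Alternatively, send specific codes to different extended missing values
mvdecode q2b, mv(-9=.a \ -7=.b)

list
tab q2b, missing
```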
We now create the GHQ scale from the 12 items in the data set. The items are coded from 1 to 4 but we want to make a scale that goes from 0 to 36, so we need to recode all the 12 GHQ items to go from 0 to 3 for each item. We use the wildcard (*) to save listing all 12 items; another way would be to use the dash, as in recode ghqa-ghql. We then check the reliability of the scale using the alpha command with the item option to give us more detail on each of the GHQ items.

recode ghq* (4=3) (3=2) (2=1) (1=0)
alpha ghq*, item

After the recode command, Stata reports the number of changes made to each of the GHQ items. The alpha output is our reliability check. The overall alpha value is shown at the bottom right of the table; the signs are all positive and all items have similar item-rest correlations. The right-hand column shows that the overall alpha value could not be improved by dropping any item. We should be reasonably happy with the reliability of this scale.
. recode ghq* (4=3) (3=2) (2=1) (1=0)
[Stata output: one line for each item, ghqa to ghql, reporting the number of changes made.]

. alpha ghq*, item
[Stata output: a table with one row for each item, ghqa to ghql, giving the sign, item-test correlation, item-rest correlation, average inter-item covariance and the alpha value if that item were dropped, followed by a final row for the test scale as a whole.]
The next command creates the scale using the gen command. As we have shown in this chapter, you could use the alpha command with a gen option but we prefer to construct the scale manually in this example.
gen ghqscale=ghqa+ghqb+ghqc+ghqd+ghqe+ghqf ///
    +ghqg+ghqh+ghqi+ghqj+ghqk+ghql
lab var ghqscale "ghq 0-36"
su ghqscale
In this part of the output, Stata lets us know that in creating the scale 651 missing values have been generated in the new variable (ghqscale). This is because the gen command only creates a new variable for those cases that have non-missing values on all 12 items. The next line labels the new variable and then we use the su command to display the descriptive statistics of the new variable, which shows that we have 9613 cases with a new scale score. There is further discussion of descriptive statistics commands in Chapter 5.
. gen ghqscale = ghqa+ghqb+ghqc+ghqd+ghqe ///
> +ghqf+ghqg+ghqh+ghqi+ghqj+ghqk+ghql
(651 missing values generated)

. lab var ghqscale "ghq 0-36"

. su ghqscale

    Variable |     Obs        Mean    Std. Dev.    Min    Max
-------------+-------------------------------------------------
    ghqscale |    9613    10.77125     4.91438       0     36
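The behaviour of gen with missing values is easy to see on a few typed-in cases. The sketch below uses three invented items (a, b and c) and contrasts the plain sum with two egen functions; it is meant only to illustrate the point and is not part of the analysis.

```stata
* Three made-up items; the second case has a missing answer on b
clear
input a b c
1 2 3
0 . 2
3 3 3
end

* A plain sum is missing whenever any one item is missing
gen total = a + b + c

* rownonmiss() counts how many items were answered
egen nvalid = rownonmiss(a b c)

* rowtotal() treats missing answers as zero, so it always returns a sum
egen total_any = rowtotal(a b c)

list
```

Because rowtotal() counts a missing answer as zero, it would understate the score of anyone who skipped an item. This is one reason to prefer the plain sum here and to accept the loss of incomplete cases.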
We now construct another variable based on the GHQ items. This one uses a coding of 0-0-1-1 for each of the items and then adds up the items to a maximum of 12. Then a threshold of 4 or more is used to make a dichotomous indicator. First we recode all the 12 GHQ items and then sum them to create a new variable called d_ghq.
* ghqa-ghql is used here because ghq* would now also match ghqscale
recode ghqa-ghql (0/1=0) (2/3=1)
gen d_ghq=ghqa+ghqb+ghqc+ghqd+ghqe+ghqf ///
    +ghqg+ghqh+ghqi+ghqj+ghqk+ghql
ta d_ghq
. ta d_ghq

      d_ghq |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      4,933       51.32       51.32
          1 |      1,423       14.80       66.12
          2 |        873        9.08       75.20
          3 |        600        6.24       81.44
          4 |        447        4.65       86.09
          5 |        341        3.55       89.64
          6 |        260        2.70       92.34
          7 |        210        2.18       94.53
          8 |        162        1.69       96.21
          9 |        103        1.07       97.28
         10 |        112        1.17       98.45
         11 |         94        0.98       99.43
         12 |         55        0.57      100.00
------------+-----------------------------------
      Total |      9,613      100.00
The tabulation shows us that the recode and summing have been done correctly. Now we recode the d_ghq variable into a dichotomous indicator where 1 equals those with a GHQ score of 4 or more:
recode d_ghq (0/3=0) (4/12=1)
ta d_ghq
. ta d_ghq

      d_ghq |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      7,829       81.44       81.44
          1 |      1,784       18.56      100.00
------------+-----------------------------------
      Total |      9,613      100.00
The tabulation of the dichotomous GHQ indicator shows that 18.56% of the current cases in the data set are at or above the threshold.
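The recode command leaves missing values alone, which is what we want here. If you prefer to build an indicator like this with gen and a comparison, remember that Stata treats missing as larger than any number. The sketch below uses five invented scores (the variable names score, naive and caseness are ours) to show the problem and one way round it.

```stata
* Five made-up scores; the last case has no score
clear
input score
0
3
4
12
.
end

* A bare comparison counts missing as greater than 4,
* so the case with no score is wrongly flagged
gen byte naive = score >= 4

* Restricting to non-missing scores leaves that case missing
gen byte caseness = score >= 4 if !missing(score)

list
```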
As we are interested in variations of mental health for those aged 18 to 65, we use the keep command to retain only those cases within that age range. Compare this use of the keep command - keeping cases - with the other use earlier in this example when it was used to keep variables. Similarly, the drop command can be used to drop variables or cases depending on how the command is formatted. We then produce descriptive statistics of the age variable to see how many cases we have left in our data.
keep if age>=18 & age<=65
su age
The output shows how many cases (observations) are deleted from the data set after implementing the keep command. From the descriptive statistics for the variable age, we see that now we only have 8163 cases in our data. Remember that there are no missing cases in the age variable.
. keep if age>=18 & age<=65
(2101 observations deleted)

. su age

    Variable |     Obs        Mean    Std. Dev.    Min    Max
-------------+-------------------------------------------------
         age |    8163    39.32733    13.08993      18     65
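An equivalent way of writing the same condition is the inrange() function. The sketch below uses six invented ages, one of them missing, to show that both forms select the same cases; the upper limit (age<=65) is what excludes the missing case in the first form, since missing counts as a very high number.

```stata
* Six made-up ages, including one missing value
clear
input age
17
18
40
65
66
.
end

* Both conditions count the same three cases
count if age >= 18 & age <= 65
count if inrange(age, 18, 65)

* A lower limit on its own would also count the missing case
count if age >= 18

keep if inrange(age, 18, 65)
su age
```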
Next we recode the age variable into three categories and use the gen option in the recode command to create a new variable called agecat. We then label the new variable and its categories. We then use the tab command to produce a frequency table of the new variable to check if our recode and labelling have come out correctly.
recode age (18/32=1) (33/50=2) ///
    (51/65=3), gen(agecat)
lab var agecat "age categories"
lab def agelab 1 "18-32 years" ///
    2 "33-50 years" 3 "51-65 years"
lab val agecat agelab
tab agecat
After the recode command, Stata tells us that there are 8163 (all cases) differences between the original variable age and the newly created variable agecat. As we numbered the categories of agecat 1, 2, and 3 and there is no one in the data under 18 years of age, it is not surprising that for all cases the values of age and agecat are different. The frequency table produced from the tab command shows the number and percentage of the cases in each of the three age categories.
. recode age (18/32=1) (33/50=2) ///
> (51/65=3), gen(agecat)
(8163 differences between age and agecat)

. lab var agecat "age categories"

. lab def agelab 1 "18-32 years" ///
> 2 "33-50 years" 3 "51-65 years"

. lab val agecat agelab

. tab agecat

        age |
 categories |      Freq.     Percent        Cum.
------------+-----------------------------------
18-32 years |      2,956       36.21       36.21
33-50 years |      3,336       40.87       77.08
51-65 years |      1,871       22.92      100.00
------------+-----------------------------------
      Total |      8,163      100.00
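A frequency table tells you how many cases fall in each category, but not whether the cut points landed where you intended. One possible check, sketched below on six invented ages that sit on the category boundaries, is to use the assert command, which stops with an error if any case breaks the stated rule, and tabstat to show the lowest and highest age within each category.

```stata
* Six made-up ages placed on the category boundaries
clear
input age
18
32
33
50
51
65
end

recode age (18/32=1) (33/50=2) (51/65=3), gen(agecat)

* Each assert is silent when the rule holds for every case
assert agecat == 1 if inrange(age, 18, 32)
assert agecat == 2 if inrange(age, 33, 50)
assert agecat == 3 if inrange(age, 51, 65)

* Lowest and highest age, and number of cases, in each category
tabstat age, by(agecat) stat(min max n)
```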
Our next step is to recode the sex variable into a dummy variable that indicates female cases. We use the recode command with the gen option again. We label the new variable and its categories, then produce a frequency table to check our recode.
tab sex
tab sex, nol
recode sex (1=0) (2=1), gen(female)
lab var female "female indicator"
lab def sexlab 0 "male" 1 "female"
lab val female sexlab
tab female
We see the frequency table of the sex variable but we need to see what numbers lie underneath the category labels of male and female. We use the tab command with the nol option.
. ta sex

        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |      3,914       47.95       47.95
     female |      4,249       52.05      100.00
------------+-----------------------------------
      Total |      8,163      100.00

. ta sex, nol

        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      3,914       47.95       47.95
          2 |      4,249       52.05      100.00
------------+-----------------------------------
      Total |      8,163      100.00

. recode sex (1=0) (2=1), gen(female)
(8163 differences between sex and female)

. lab var female "female indicator"

. lab def sexlab 0 "male" 1 "female"

. lab val female sexlab

. ta female

     female |
  indicator |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |      3,914       47.95       47.95
     female |      4,249       52.05      100.00
------------+-----------------------------------
      Total |      8,163      100.00
To reduce the number of marital status categories, we recode the marital status variable (mastat) into a new variable called marst2 and have four categories, where 1 = single, 2 = married/cohabiting, 3 = separated/divorced and 4 = widowed. We need to see what the categories and the numbering are in the original marital status variable (mastat). We do this by using the tab command with the nol option. We then recode, create the new variable and label the new variable and its categories.
tab mastat
tab mastat, nol
recode mastat (6=1) (1/2=2) (4/5=3) ///
    (3=4), gen(marst2)
lab var marst2 "marital status 4 categories"
lab def marlab 1 "single" 2 "married" ///
    3 "sep/div" 4 "widowed"
lab val marst2 marlab
tab marst2
The output for these commands is similar to that above. The exact process and commands you use to recode variables may vary from this, but we strongly advise you to have a system that allows you to check your recoding as you go along.
. tab mastat

  marital status |      Freq.     Percent        Cum.
-----------------+-----------------------------------
         married |      5,132       62.87       62.87
living as couple |        654        8.01       70.88
         widowed |        189        2.32       73.20
        divorced |        397        4.86       78.06
       separated |        172        2.11       80.17
   never married |      1,619       19.83      100.00
-----------------+-----------------------------------
           Total |      8,163      100.00

. tab mastat, nol

    marital |
     status |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      5,132       62.87       62.87
          2 |        654        8.01       70.88
          3 |        189        2.32       73.20
          4 |        397        4.86       78.06
          5 |        172        2.11       80.17
          6 |      1,619       19.83      100.00
------------+-----------------------------------
      Total |      8,163      100.00

. recode mastat (6=1) (1/2=2) (4/5=3) ///
> (3=4), gen(marst2)
(7509 differences between mastat and marst2)

. lab var marst2 "marital status 4 categories"

. lab def marlab 1 "single" 2 "married" ///
> 3 "sep/div" 4 "widowed"

. lab val marst2 marlab

. tab marst2

    marital |
   status 4 |
 categories |      Freq.     Percent        Cum.
------------+-----------------------------------
     single |      1,619       19.83       19.83
    married |      5,786       70.88       90.71
    sep/div |        569        6.97       97.68
    widowed |        189        2.32      100.00
------------+-----------------------------------
      Total |      8,163      100.00
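One simple system for checking a recode that shuffles the category numbers, as this one does, is to cross-tabulate the original variable against the new one. The sketch below uses seven invented cases rather than the example data; each row of the resulting table should have cases in one column only.

```stata
* Seven made-up cases covering all six original categories
clear
input mastat
1
2
3
4
5
6
6
end

recode mastat (6=1) (1/2=2) (4/5=3) (3=4), gen(marst2)

* Rows are the original codes, columns the new ones;
* every row should have cases in a single column
tab mastat marst2
```

In the example data the same cross-tabulation would show, for instance, all 1,619 never married cases in the first column of the new variable.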
We now create the employment status variable:
ta jbstat
ta jbstat,nol
recode jbstat (1/2=1) (3=2) (7=3) (6=4) ///
    (9=4) (5=5) (8=5) (4=6) (10=.), gen(empstat)
lab var empstat "employment status"
lab def emplab 1 "employed" 2 "unemployed" ///
    3 "longterm sick" 4 "studying" ///
    5 "family care" 6 "retired"
lab val empstat emplab
ta empstat
. ta jbstat

   current labour force |
                 status |      Freq.     Percent        Cum.
------------------------+-----------------------------------
          self employed |        731        9.26        9.26
         in paid employ |      4,844       61.39       70.65
             unemployed |        505        6.40       77.05
                retired |        403        5.11       82.16
            family care |        900       11.41       93.56
             ft student |        202        2.56       96.12
long term sick/disabled |        244        3.09       99.21
        on matern leave |         13        0.16       99.38
       govt trng scheme |         22        0.28       99.66
         something else |         27        0.34      100.00
------------------------+-----------------------------------
                  Total |      7,891      100.00

. ta jbstat, nol

    current |
     labour |
      force |
     status |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        731        9.26        9.26
          2 |      4,844       61.39       70.65
          3 |        505        6.40       77.05
          4 |        403        5.11       82.16
          5 |        900       11.41       93.56
          6 |        202        2.56       96.12
          7 |        244        3.09       99.21
          8 |         13        0.16       99.38
          9 |         22        0.28       99.66
         10 |         27        0.34      100.00
------------+-----------------------------------
      Total |      7,891      100.00

. recode jbstat (1/2=1) (3=2) (7=3) (6=4) ///
> (9=4) (5=5) (8=5) (4=6) (10=.), gen(empstat)
(6260 differences between jbstat and empstat)

. lab var empstat "employment status"

. lab def emplab 1 "employed" 2 "unemployed" ///
> 3 "longterm sick" 4 "studying" ///
> 5 "family care" 6 "retired"

. lab val empstat emplab

. ta empstat

   employment |
       status |      Freq.     Percent        Cum.
--------------+-----------------------------------
     employed |      5,575       70.89       70.89
   unemployed |        505        6.42       77.31
longterm sick |        244        3.10       80.42
     studying |        224        2.85       83.27
  family care |        913       11.61       94.88
      retired |        403        5.12      100.00
--------------+-----------------------------------
        Total |      7,864      100.00
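The rule (10=.) sends the 27 'something else' cases to missing, which is why the table of empstat has 7,864 cases against 7,891 for jbstat. By default the tab command leaves missing values out; adding the missing option brings them into the table. The sketch below shows this on five invented cases with a cut-down version of the recode.

```stata
* Five made-up cases, one already missing and one coded 10
clear
input jbstat
1
2
4
10
.
end

* A shortened version of the recode used above
recode jbstat (1/2=1) (4=6) (10=.), gen(empstat)

* The missing option adds a row for the missing cases
tab empstat, missing

* The same information as a simple count
count if missing(empstat)
```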
Next we collapse the variable for number of children into fewer categories:
su nchild
recode nchild (0=1) (1/2=2) (3/9=3), ///
    gen(numchd)
lab var numchd "children 3 categories"
lab def chdlab 1 "none" 2 "one or two" ///
    3 "three or more"
lab val numchd chdlab
ta numchd
. su nchild

    Variable |     Obs        Mean    Std. Dev.    Min    Max
-------------+-------------------------------------------------
      nchild |    8163    .6659316    1.019895       0      9

. recode nchild (0=1) (1/2=2) (3/9=3), ///
> gen(numchd)
(6508 differences between nchild and numchd)

. lab var numchd "children 3 categories"

. lab def chdlab 1 "none" 2 "one or two" ///
> 3 "three or more"

. lab val numchd chdlab

. ta numchd

   children 3 |
   categories |      Freq.     Percent        Cum.
--------------+-----------------------------------
         none |      5,182       63.48       63.48
   one or two |      2,443       29.93       93.41
three or more |        538        6.59      100.00
--------------+-----------------------------------
        Total |      8,163      100.00
When we have completed our recoding we produce descriptive statistics for all the variables that we will use in our future analyses.
su ghqscale d_ghq female age agecat marst2 ///
    empstat numchd
We can use the output of descriptive statistics to see if our variables have the right number of categories and cases (observations). The output below shows that some of the variables have fewer valid observations than the 8163 in our total sample. This is due to the GHQ items and the employment status variable having a number of people who did not respond to the questions and so have been coded as missing values.
. su ghqscale d_ghq female age agecat marst2 ///
> empstat numchd

    Variable |     Obs        Mean    Std. Dev.    Min    Max
-------------+-------------------------------------------------
    ghqscale |    7714    10.76407    4.987117       0     36
       d_ghq |    7714    .1870625     .389987       0      1
      female |    8163    .5205194    .4996094       0      1
         age |    8163    39.32733    13.08993      18     65
      agecat |    8163    1.867083    .7574497       1      3
      marst2 |    8163    1.917677    .5949101       1      4
     empstat |    7864     1.93235     1.64757       1      6
      numchd |    8163    1.431092    .6140945       1      3
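The Obs column already tells us how many cases are missing on each variable: 8163 - 7714 = 449 for the two GHQ measures and 8163 - 7864 = 299 for empstat. If you want Stata to report the missing cases directly, one possible approach is sketched below on four invented cases (the variables x and y are ours): a loop counts the missing values in each variable and egen's rowmiss() function counts them within each case.

```stata
* Four made-up cases with some missing values
clear
input x y
1 2
. 3
4 .
. .
end

* Number of missing values in each variable
foreach v of varlist x y {
    quietly count if missing(`v')
    display "`v': " r(N) " missing"
}

* Number of missing values within each case
egen nmiss = rowmiss(x y)
tab nmiss
```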
Finally, we use the keep command again to retain only the variables we wish to use in further analyses. The order command lets us order the variables in the data set if this is something you prefer. The compress command stores the data set in the smallest amount of space, and then the save command saves our new data set to our default directory for future use.
keep pid hid ghqscale d_ghq female age ///
    agecat marst2 empstat numchd
order pid hid ghqscale d_ghq female age ///
    agecat marst2 empstat numchd
compress
save demodata1.dta, replace
. keep pid hid ghqscale d_ghq female age ///
> agecat marst2 empstat numchd

. order pid hid ghqscale d_ghq female age ///
> agecat marst2 empstat numchd

. compress
ghqscale was float now byte

. save demodata1.dta, replace
(note: file demodata1.dta not found)
file demodata1.dta saved
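The message from compress reflects the fact that gen stores new variables as float unless told otherwise, even when the values are small whole numbers. The sketch below reproduces this on an invented variable (score) and saves the result to a temporary file, so that nothing is written to your project folder; the tempfile name demo is our own choice.

```stata
* Five made-up cases holding small whole numbers stored as float
clear
set obs 5
gen float score = 3*_n

* Storage type before and after compress
describe score
compress
describe score

* Save to a temporary file that Stata deletes when the do-file ends
tempfile demo
save "`demo'"
```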