Interpreting dummy variables in multiple regression in Stata

The way you are interpreting the coefficients is not quite right. The general interpretation of the coefficient on a dummy variable in a multiple regression is "the expected (or average) difference in the dependent variable between those with $1$ and those with $0$ values of that dummy variable, holding other independent variables constant."

If you, for example, only have these two $A$ and $B$ variables as predictors, then the interpretation of the coefficient on variable $B$ is "the expected difference in the dependent variable between someone with the value of $1$ and someone with the value of $0$ of variable $B$, if they both had the same value of variable $A$." That same value of variable $A$ can be anything, not just $0$.

> Variable A can be present (i.e., 1) only when Variable B is present (1). I am wondering how I can interpret the estimated coefficient for variable B, because the coefficient for B represents the presence of B while A is 0, which logically does not make sense.

I think in these two sentences you may have meant to say that you cannot interpret the effect of $A$ while $B$ is $0$, because $A$ cannot be present when $B$ is $0$. But the regression model does not know that, and the $\beta$'s (marginal effects) it provides apply regardless of the value of the other variable. So you cannot interpret those coefficients as effects of one variable at a specific value of the other.

It is a pretty simple exercise to figure out what each coefficient represents in this kind of regression:

$$ Y_i = \beta_0 + \beta_1A_i + \beta_2B_i + \epsilon_i $$

where both $A$ and $B$ are binary dummy variables. You can just plug in the possible values of those variables and see what you get, like this:

If $A = 0$ and $B = 0$: The expected value of $Y$ will be $\beta_0$.

If $A = 0$ and $B = 1$: The expected value of $Y$ will be $\beta_0 + \beta_2$.

If $A = 1$ and $B = 0$: The expected value of $Y$ will be $\beta_0 + \beta_1$.

If $A = 1$ and $B = 1$: The expected value of $Y$ will be $\beta_0 + \beta_1 + \beta_2$.

So, we see that $\beta_2$ is the average difference in $Y$ between individuals with $B = 0$ and $B = 1$, regardless of whether $A = 0$ or $A = 1$.
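To make that last step explicit, here is the subtraction of the $B = 0$ case from the $B = 1$ case at each value of $A$, using only the four expected values listed above:

$$ \begin{aligned} A = 0:&\quad (\beta_0 + \beta_2) - \beta_0 = \beta_2 \\ A = 1:&\quad (\beta_0 + \beta_1 + \beta_2) - (\beta_0 + \beta_1) = \beta_2 \end{aligned} $$

The difference is $\beta_2$ in both rows, which is why this model cannot report a $B$ effect that depends on $A$.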

If you actually want to incorporate the knowledge that $A$ can be present only when $B$ is present into your model, to allow yourself to interpret results the way you want, you can modify the regression to something like this:

$$ Y_i = \beta_0 + \beta_1A_i + \beta_2B_i + \beta_3A_iB_i + \epsilon_i $$

If you perform the same exercise as above with this equation, you'll see that different combinations of coefficients represent two separate quantities: the average difference in $Y$ between someone with $B = 1$ and someone with $B = 0$ when both have $A = 0$, and the same difference when both have $A = 1$.
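Spelled out for the interaction model, the plug-in exercise looks like this. It uses nothing beyond the equation above:

$$ \begin{aligned} A = 0,\ B = 0:&\quad \beta_0 \\ A = 0,\ B = 1:&\quad \beta_0 + \beta_2 \\ A = 1,\ B = 0:&\quad \beta_0 + \beta_1 \\ A = 1,\ B = 1:&\quad \beta_0 + \beta_1 + \beta_2 + \beta_3 \end{aligned} $$

So the difference between $B = 1$ and $B = 0$ is $\beta_2$ when $A = 0$ and $\beta_2 + \beta_3$ when $A = 1$.

One caveat, offered as an observation rather than as part of the original answer: if $A = 1$ never occurs together with $B = 0$ in the data, then $A_iB_i = A_i$ for every observation. In that case $\beta_1$ and $\beta_3$ cannot be estimated separately, and software will typically drop one of the two terms.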

How do I create dummy variables?

Title   Creating dummy variables
Author William Gould, StataCorp

A dummy variable is a variable that takes on the values 1 and 0; 1 means something is true (such as age < 25, sex is male, or in the category “very much”).

Dummy variables are also called indicator variables.

As we will see shortly, in most cases, if you use factor-variable notation, you do not need to create dummy variables.

In cases where factor variables are not the answer, you may use generate to create one dummy variable at a time and tabulate to create a set of dummies at once.

Using factor variables instead of generating dummy variables

I have a discrete variable, size, that takes on discrete values from 0 to 4:

 . tabulate size

       size |      Freq.     Percent        Cum.
------------+-----------------------------------
  miniature |         19       19.00       19.00
      small |         37       37.00       56.00
     normal |         30       30.00       86.00
      large |         12       12.00       98.00
       huge |          2        2.00      100.00
------------+-----------------------------------
      Total |        100      100.00

If I want a dummy for all levels of size except for a comparison group or base level, I do not need to create 4 dummies. Using [U] factor variables, I may type

        . summarize i.size

or use an estimator

        . regress y x i.size

If I want to use a dummy that is 1 if size is large (size==3) and 0 otherwise, I type

        . regress y x 3.size

If I want to make the comparison group, or base level, of size be size==3 instead of the default size==0, I type

        . regress y x ib3.size

You can also use factor-variable notation to refer to categorical variables, their interactions, or interactions between categorical and continuous variables.

For example, I can specify the interaction of each level of size (except the base level) and the continuous variable x by typing

        . regress y x i.size#c.x

The c. instructs Stata that variable x is continuous.
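As a hedged illustration, the sketch below fits an interaction model on auto.dta, an example dataset that ships with Stata. The variables price, mpg, and foreign belong to that dataset and stand in for y, x, and size here. The model is written two ways: term by term, and with the ## operator, which is shorthand for the main effects plus the interaction. Both commands should give the same fit.

```stata
* Load the example dataset that ships with Stata
sysuse auto, clear

* Main effects and interaction written out term by term
regress price i.foreign c.mpg i.foreign#c.mpg
scalar b_long = _b[1.foreign#c.mpg]

* Same model using the ## shorthand
regress price i.foreign##c.mpg
scalar b_short = _b[1.foreign#c.mpg]

* The two interaction coefficients should agree
display b_long " " b_short
```

The interaction coefficient is the difference in the mpg slope between foreign and domestic cars.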

In all the cases above, you did not need to create a variable.

Moreover, many of Stata's postestimation facilities, including in particular the margins command, are aware of factor variables and will handle them elegantly when making computations.
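A minimal sketch of what that looks like in practice, again assuming the auto.dta example dataset (price, mpg, and rep78 are its variables, not the y, x, and size used above):

```stata
* Load the example dataset that ships with Stata
sysuse auto, clear

* rep78 (repair record, 1-5) enters as a factor variable; level 1 is the base
regress price mpg i.rep78

* Joint test that all rep78 indicators are zero
testparm i.rep78

* Predicted price at each level of rep78, averaging over the observed mpg values
margins rep78
```

Because rep78 was entered with factor-variable notation, margins knows it is categorical and reports one prediction per level, with no indicator variables created by hand.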

There are some instances where creating dummies might be worthwhile. We illustrate these below.

Using generate to create dummy variables

You could type

        . generate young = 0 
        . replace young = 1 if age<25

or

        . generate young = (age<25)

This statement does the same thing as the first two statements. age<25 is an expression, and Stata evaluates it, returning 1 if the expression is true and 0 if it is false.

If you have missing values in your data, it is better to type

        . generate young = 0 
        . replace young = 1 if age<25
        . replace young = . if missing(age)

or

        . generate young = (age<25) if !missing(age) 

Stata treats a missing value as positive infinity, so the expression age<25 evaluates to 0, not missing, when age is missing. (If the expression were age>25, the expression would evaluate to 1 when age is missing.)
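The difference is easy to see on a toy dataset. The three ages below, and the variable name young_naive, are made up for illustration:

```stata
* Three hypothetical observations; the third age is missing
clear
input age
20
30
.
end

* Naive version: the missing age compares as larger than 25, so the result is 0
generate young_naive = age < 25

* Guarded version: young stays missing where age is missing
generate young = age < 25 if !missing(age)

list
```

In the listing, the third row has young_naive equal to 0 and young equal to missing.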

You do not have to type the parentheses around the expression.

        . generate young = age<25 if !missing(age)

is good enough. Here are some more illustrations of generating dummy variables:

        . generate male = sex==1

        . generate top = answer=="very much"

        . generate eligible = sex=="male" & (age>55 | (age>40 & enrolled)) if !missing(age)

In the above line, enrolled is itself a dummy variable—a variable taking on values zero and one. We could have typed & enrolled==1, but typing & enrolled is good enough.

Just as Stata returns 1 for true and 0 for false, Stata assumes that 1 means true and that 0 means false.
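One caveat follows from this convention, shown here on made-up data. Stata treats any nonzero value as true, and a missing value is nonzero, so a bare condition such as & enrolled is also satisfied where enrolled is missing. The sketch below assumes a hypothetical three-row dataset:

```stata
* Hypothetical data: one enrolled, one not enrolled, one unknown
clear
input enrolled
1
0
.
end

* A bare condition is true for any nonzero value, and missing is nonzero,
* so this counts rows 1 and 3
count if enrolled

* Spelling out ==1 leaves the missing row out, so this counts row 1 only
count if enrolled == 1
```

If enrolled can be missing, typing & enrolled==1 is the safer choice.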

Using tabulate to create dummy variables

tabulate with the generate() option will generate whole sets of dummy variables.

Say that variable group takes on the values 1, 2, and 3. If you type

        . tabulate group

you will see a frequency table of how many times group takes on each of those values. If you type

        . tabulate group, generate(g)

you will see the table, and tabulate will create variables named g1, g2, and g3 that take on values 1 and 0, g1 being 1 when group==1, g2 being 1 when group==2, and g3 being 1 when group==3. Watch:

 . list

      +-------+
      | group |
      |-------|
   1. |     1 |
   2. |     3 |
   3. |     2 |
   4. |     1 |
   5. |     2 |
      |-------|
   6. |     2 |
      +-------+

 . tabulate group, generate(g)
 
       group |      Freq.     Percent        Cum.
 ------------+-----------------------------------
           1 |          2       33.33       33.33
           2 |          3       50.00       83.33
           3 |          1       16.67      100.00
 ------------+-----------------------------------
       Total |          6      100.00

 . list

      +----------------------+
      | group   g1   g2   g3 |
      |----------------------|
   1. |     1    1    0    0 |
   2. |     3    0    0    1 |
   3. |     2    0    1    0 |
   4. |     1    1    0    0 |
   5. |     2    0    1    0 |
      |----------------------|
   6. |     2    0    1    0 |
      +----------------------+

What you name the variable is up to you. If we had typed

        . tabulate group, generate(res)

the new variables would have been named res1, res2, and res3.

It is also not necessary for the variable being tabulated to take sequential values or even be integers. Here is another example:

 . list

      +------+
      |    x |
      |------|
   1. |   -1 |
   2. | 3.14 |
   3. |    8 |
   4. |   -1 |
   5. |    8 |
      +------+

 . tabulate x, generate(xval)

           x |      Freq.     Percent        Cum.
 ------------+-----------------------------------
          -1 |          2       40.00       40.00
        3.14 |          1       20.00       60.00
           8 |          2       40.00      100.00
 ------------+-----------------------------------
       Total |          5      100.00

 . list

      +------------------------------+
      |    x   xval1   xval2   xval3 |
      |------------------------------|
   1. |   -1       1       0       0 |
   2. | 3.14       0       1       0 |
   3. |    8       0       0       1 |
   4. |   -1       1       0       0 |
   5. |    8       0       0       1 |
      +------------------------------+

You can find out what the values are from describe:

 . describe

 Contains data
   obs:             5                          
  vars:             4                          
  size:            55 
 ------------------------------------------------------------------------
               storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------
 x               float  %9.0g                  
 xval1           byte   %8.0g                  x==    -1.0000
 xval2           byte   %8.0g                  x==     3.1400
 xval3           byte   %8.0g                  x==     8.0000
 ------------------------------------------------------------------------
 Sorted by:  
      Note:  dataset has changed since last saved

Finally, tabulate can be used with string variables:

 . list

      +-----------+
      |    result |
      |-----------|
   1. |      good |
   2. |       bad |
   3. |      good |
   4. | excellent |
   5. |       bad |
      +-----------+

 . tabulate result, generate(res)
 
          result |      Freq.     Percent        Cum.
 ----------------+-----------------------------------
             bad |          2       40.00       40.00
       excellent |          1       20.00       60.00
            good |          2       40.00      100.00
 ----------------+-----------------------------------
           Total |          5      100.00

 . describe

 Contains data
   obs:             5                          
  vars:             4                          
  size:           110 (99.9% of memory free)
 ------------------------------------------------------------------------
               storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------
 result          str15  %15s                   
 res1            byte   %8.0g                  result==bad
 res2            byte   %8.0g                  result==excellent
 res3            byte   %8.0g                  result==good
 ------------------------------------------------------------------------
 Sorted by:  
      Note:  dataset has changed since last saved
 
 . list

      +--------------------------------+
      |    result   res1   res2   res3 |
      |--------------------------------|
   1. |      good      0      0      1 |
   2. |       bad      1      0      0 |
   3. |      good      0      0      1 |
   4. | excellent      0      1      0 |
   5. |       bad      1      0      0 |
      +--------------------------------+
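Once such a set exists, it can be used in an estimation command by leaving one indicator out as the comparison group. The sketch below assumes the auto.dta example dataset that ships with Stata; the stub name rep is an arbitrary choice. It checks that the hand-made indicators reproduce the factor-variable fit:

```stata
* Load the example dataset that ships with Stata
sysuse auto, clear

* Create rep1-rep5, one indicator per observed level of rep78
tabulate rep78, generate(rep)

* Leave rep1 out so that level 1 is the comparison group
regress price mpg rep2 rep3 rep4 rep5
scalar b_dummy = _b[rep3]

* Factor-variable version of the same model (its base level is also 1)
regress price mpg i.rep78
scalar b_factor = _b[3.rep78]

* The two coefficients for level 3 should agree
display b_dummy " " b_factor
```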

How do you interpret the dummy variable coefficient in regression?

In the analysis, each dummy variable is compared with the reference group. For example, if the dependent variable is income and the dummy variables code political affiliation, a positive regression coefficient means that income is higher for that affiliation than for the reference group; a negative coefficient means that it is lower.

Does Stata recognize dummy variables?

Dummy (logical) variables in Stata take values of 0, 1 and missing. The most common use of dummy variables is in modelling, for instance using regression (we will use this as a general example below).

How do you interpret multiple regression values?

In the multiple regression equation $\hat{Y} = b_0 + b_1X_1 + \dots + b_pX_p$, $\hat{Y}$ is the predicted or expected value of the dependent variable, $X_1$ through $X_p$ are $p$ distinct independent or predictor variables, $b_0$ is the predicted value of $Y$ when all of the independent variables ($X_1$ through $X_p$) are equal to zero, and $b_1$ through $b_p$ are the estimated regression coefficients.