
NLSY79 Bibliography Example

This section describes the three primary components of the NLSY79 codebook system and discusses the important types of information found within each. An additional codebook supplement exists for the Geocode data file.


The codebook is the principal element of the NLSY79 documentation system and contains information intended to be complete and self-explanatory for each variable in a data file. The software accompanying the NLSY79 data sets allows easy access to each variable's codebook information and permits the user to print a codebook extract for preselected variables.

Every variable is presented within the NLSY79 documentation as a block of information called a "codeblock." Each codeblock entry depicts the following important information: 

  • reference number
  • variable title
  • coding information
  • frequency distribution
  • location within the data file
  • reference to the questionnaire item or source of the variable
  • information on the derivation of created variables

Users will find that NLSY79 CAPI codeblocks present greater detail on each variable, including universe totals, universe skip patterns, and range of acceptable values information. Each of these terms is described more completely below. Codeblocks for many variables include special notes containing additional information designed to assist in the accurate use of data from that variable.

Codebooks are arranged in reference number order. As a general rule, raw questionnaire items appear first for a given survey year, followed by items from such instruments as the Information Sheet and Employer Supplement. Variables from the main body of the questionnaire are followed by created or constructed variables drawn from an external data source, such as the County & City Data Book.

Beginning with the 1993 CAPI surveys, questions relating to each job/employer, which were formerly located within the unique Employer Supplements, are merged with the main questionnaire items. A comparison of the reference number assignments used for the 1988 PAPI and 1993 CAPI variables appears in Table 1 and provides users with a sample set of reference numbers. Users should note that not all survey year assignments will be ordered in precisely this manner.

Table 1. NLSY79 1988 and 1993 Reference Number Assignment

| 1988 PAPI reference numbers | 1988 PAPI content | 1993 CAPI reference numbers | 1993 CAPI content |
| --- | --- | --- | --- |
| R25000.-R28927. | All Raw, Edited and Created Variables | R41001.-R44308. | All Raw, Edited and Created Variables |
| R25000.-R27467. | Questionnaire Items | R41001.-R43988. | Questionnaire Items including the Employer Supplement series |
| R27469.-R27501. | Information Sheet Items | R43989.-R44036. | Information Sheet Items |
| R27506.-R27609. | Household Record | R44037.-R44126. | Household Record |
| R27610.-R28254. | Employer Supplement (ES) [1] | | |
| R28255.-R28371. | Children's Record Form | R44127.-R44162. | Children's Record Form |
| R28372.-R28690. | Childhood Residence Calendar [2] | | |
| R28704.-R28729. | Created Variables | R44163.-R44205. | Created Variables |
| R28735.-R28811. | Supplemental Fertility File Variables | | |
| R28825.-R28927. | Geocode Variables | R44206.-R44308. | Geocode Variables |

Note: PAPI refers to paper-and-pencil interviews, which were conducted with the NLSY79 during 1979-92. CAPI, or computer-assisted personal interviews, began for the full NLSY79 cohort in 1993.
[1] Beginning in 1993, variables from the employer supplement series are included within the raw questionnaire items.
[2] The childhood residence retrospective was unique to 1988.

The following figures give users an example of codebook pages before (Figure 1) and after (Figure 2) CAPI implementation.

Figure 1. NLSY79 Sample PAPI Codeblock

Figure 2. NLSY79 Sample CAPI Codeblock

Coding Information: Each codeblock entry presents the set of legitimate codes that a variable may assume along with a text entry describing the codes. 

Dichotomous variables (those answered yes/no) are uniformly coded "Yes" = 1, "No" = 0. Other dichotomous variables have frequently been reformulated so that this convention may be followed.

Discrete (Categorical), as in the case of the NLSY79 example, the variable 'Activity Most of Survey Week CPS Item'.

  7. OTHER

Continuous (Quantitative), as in the case of hourly rate of pay in the example above. These variables have continuous data but are presented in the codebook using a convenient frequency distribution. NLSY79 users will note that most valid data are positive numbers. Special cases are flagged by negative numbers in the NLSY79. See Appendix 13 in the NLSY79 Codebook Supplement for more detail on the handling of negative numbers in the data files. The following conventions have been used throughout the data:

| Response type | Code |
| --- | --- |
| Valid Skip | -4 |
| Invalid Skip | -3 |
| Don't Know | -2 |
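The negative-code convention can be handled with a small lookup when reading the data. The following is a minimal sketch, not part of the NLSY79 software: the function name `split_value` is hypothetical, and the codes -1 (refusal) and -5 (noninterview) are taken from other passages of this guide rather than from the list above.

```python
from typing import Optional, Tuple

# Negative codes named in this guide. A raw value equal to one of these
# is treated as missing; everything else is treated as valid data.
MISSING_CODES = {
    -1: "refusal",
    -2: "don't know",
    -3: "invalid skip",
    -4: "valid skip",
    -5: "noninterview",
}


def split_value(raw: int) -> Tuple[Optional[int], Optional[str]]:
    """Return (valid_value, missing_reason); exactly one of the two is None."""
    if raw in MISSING_CODES:
        return None, MISSING_CODES[raw]
    return raw, None
```

Keeping the reason alongside the value lets an analysis distinguish respondents who legitimately skipped a question from those who refused or did not know.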

Coding information for a given variable in the NLSY79 codeblock is:

  1. not necessarily consistent with the codes found within the questionnaire and
  2. not necessarily consistent for the same variable across years. Use only the codebook coding information for analysis.

Frequency Distribution: In the case of discrete (categorical) variables, frequency counts are normally shown in the first column to the left of the code categories.  In the case of continuous (quantitative) variables, a distribution of the variable is presented using a convenient class interval. The format of these distributions varies.

Derivations: The decision rules employed in the creation of main file constructed variables have been included, whenever possible, in the codebook under the title "DERIVATIONS." This information enables researchers to determine whether available constructs are appropriate to their needs. In the case of the example NLSY79 variable in Figure 1, no derivation is shown because these variables are picked up directly from the interview schedule. Certain variables will contain a reference to an appendix for the decision rules that were used in creating the variable.

Questionnaire Item: "Questionnaire item" is a generic term identifying the printed source of data for a given variable. A questionnaire item may be a question, a check item, or an interviewer's reference item appearing within one of the survey instruments.

The questionnaire location for NLSY79 entries appears either in parentheses or brackets directly after the reference number, for example R04434. (SO6D1314).  The five questionnaire item numbering conventions used in the codebook are described in the Survey Instruments section (see especially Table 2).

Before the adoption of CAPI, if an NLSY79 variable was not taken directly from one of the survey instruments, the questionnaire location contained an asterisk (*) in the codebook. The following categories of variables had no questionnaire numbers: 

  1. assigned identification numbers for the respondent, child, or family unit;
  2. all derived or constructed variables;
  3. variables from the following special surveys: Profiles (ASVAB), the School Survey, and the Transcript Survey;
  4. variables found on constructed data files such as the Supplemental Fertility File (area of interest "Fertility and Relationship History/Created"); and
  5. variables drawn from an external data source such as those found on the Geocode files.   

In CAPI years, survey staff assign a question name that is not used in the questionnaire. This name remains the same in subsequent rounds, so similar created variables can be easily located.

Section, deck, and question numbers have been somewhat arbitrarily assigned to the information and questions found in special survey instruments such as the Household Screener, Information Sheet, Children's Record Forms, Household Interview Forms, and the Employer Supplements. The section and deck numbers for these special survey items were numbered sequentially after the main survey items and their specific order varies each year. The exception to this is the assignment of the deck numbers for the Employer Supplements. Question numbering is discussed earlier in the Survey Instruments section (see especially Table 3).

Universe Information: Universe information was attached to select 1979-92 variables. Beginning with the 1993 CAPI interviews, the amount of universe information was expanded to include:

  1. Universe Totals: Two totals are presented: 
    • the sum of the frequency counts for each coding category is presented below the individual codes; and
    • the sum of the valid responses plus missing response counts of "refusals," "don't knows," and "invalid skips" can be found in the TOTAL==========> field.  The number of respondents who legitimately did not respond to a question, that is, "valid skips (-4)" and "noninterviews (-5)," are also depicted.
  2. Universe Skip Patterns: The following detailed universe information will enable researchers to easily trace the flow of respondents both backward and forward through various parts of the CAPI questionnaire items included in the codebook:

    "Go to Reference # XXXXX.," appended to certain coding categories, indicates that respondents selecting that answer category were routed to the next question specified.

    "Lead In(s) Reference # XXXXX." identifies the question or questions immediately preceding the codeblock question through which the universe of respondents was routed. Each lead-in reference number is followed by the relevant response value indicators, (Default), (ALL), [1:1], [1:6], and so forth.  For example:

| Lead-in indicator | Meaning |
| --- | --- |
| R41000. (All) | This means that all cases where R41000. is asked will branch to the current question. This does not imply all respondents are asked question R41000. |
| R41000. (Default) | This means that the default path of control from question R41000. is to branch to the current question, but there may be conditions under which a different path would be taken. |
| R41000. [1:6] | This means that whenever the response category for question R41000. takes on the values one to six inclusive, the next question is the current question record. |
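The three kinds of lead-in indicator can be read mechanically. The sketch below is illustrative only: the function names `parse_indicator` and `routes_here` are hypothetical, and it assumes indicators are written exactly as "(All)", "(Default)", or "[low:high]".

```python
import re
from typing import Optional, Tuple

_RANGE = re.compile(r"^\[(\d+):(\d+)\]$")


def parse_indicator(text: str) -> Tuple[str, Optional[int], Optional[int]]:
    """Classify a lead-in indicator as 'all', 'default', or 'range'."""
    cleaned = text.strip()
    if cleaned.lower() == "(all)":
        return ("all", None, None)
    if cleaned.lower() == "(default)":
        return ("default", None, None)
    match = _RANGE.match(cleaned)
    if match:
        return ("range", int(match.group(1)), int(match.group(2)))
    raise ValueError(f"unrecognized lead-in indicator: {text!r}")


def routes_here(text: str, response: int) -> Optional[bool]:
    """True/False when the indicator decides routing; None for (Default),
    where other skip conditions may override the default path."""
    kind, low, high = parse_indicator(text)
    if kind == "all":
        return True
    if kind == "range":
        return low <= response <= high
    return None
```

Returning `None` for "(Default)" reflects that the codebook alone does not say which respondents followed the default path.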

"Default Next Question" specifies the next question that all respondents of the current codeblock will be asked unless some other skip condition indicates otherwise.

Valid Values Range:  Depicted below the frequency distribution is information relating to the range of valid values for that particular distribution. "MINIMUM" indicates the smallest recorded value exclusive of "NA" and "DK." "MAXIMUM" indicates the largest recorded value. The computer-assisted interview contains internal range checks that limit responses to those between predesignated values, alert interviewers to verify unusual values, and bolster the information provided by the traditional minimum and maximum fields (see, for example, Figure 2 above).

Maximum and Minimum Fields: The MIN and MAX fields define the range, that is, the lower limit and the upper limit, of data values for a given question. A MAX of $156,359 on an income question, for example, means that this value was the highest value recorded.

Hardmax and Hardmin Fields: Hard Maximum and Hard Minimum fields denote the highest and lowest values that were accepted by the CAPI program. A Hardmax of 500,000 and a Hardmin of 0 on an income question indicate that no values above $500,000 or values lower than zero (no income) can be accepted. Dates, such as month/day/year of the respondent's last interview [lintdate] and current interview [curdate], are used as Hardmin and Hardmax values in order to restrict responses to certain questions to values within that range. Responses outside this range must be entered by the interviewer in the comment field.

Softmax and Softmin Fields: Softmax and Softmin fields cover ranges where an answer may exceed reasonable limits yet remain within the absolute limits and are acceptable after verification. A Softmax set to $80,000 on an income question will cause the machine to "beep" and a warning to appear on the screen. Interviewers are thus alerted that the value is unusual and the respondent's answer should be verified.
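The interplay of hard and soft limits can be sketched as a single check. This is an illustration under stated assumptions, not the CAPI program's logic: the function name `check_response` and the return labels are hypothetical, and values exactly equal to a limit are assumed to be acceptable.

```python
from typing import Optional


def check_response(
    value: float,
    hardmin: float,
    hardmax: float,
    softmin: Optional[float] = None,
    softmax: Optional[float] = None,
) -> str:
    """Classify an entered value against hard and soft range limits."""
    # Outside the hard limits: the entry cannot be accepted at all.
    if value < hardmin or value > hardmax:
        return "reject"
    # Inside the hard limits but beyond a soft limit: accept after verification.
    if softmin is not None and value < softmin:
        return "verify"
    if softmax is not None and value > softmax:
        return "verify"
    return "accept"
```

With the income figures used above (Hardmin 0, Hardmax 500,000, Softmax 80,000), an answer of 90,000 would prompt verification, while 600,000 would be refused.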

Restricted Income Values: Confidentiality issues restrict release of all income values. To ensure respondent confidentiality, the values of income variables exceeding particular limits are truncated and the upper limits converted to a set maximum value.

  1. From 1979 through 1984, the upper limit on income variables was $75,000, and any amounts exceeding $75,000 were converted to $75,001
  2. Beginning in 1985, the upper limit on income amounts was increased to $100,000 due to inflation and the advancing age of the cohort, and amounts exceeding $100,000 were converted to $100,001
  3. Beginning in 1996, the top two percent of respondents with valid values were averaged and that average value replaced all values in the top range
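The fixed-ceiling rule in the first two items can be sketched as follows. The function name `fixed_income_topcode` is hypothetical, and the sketch deliberately covers only 1979-1988, because other passages of this guide describe averaging-based rules for later years.

```python
def fixed_income_topcode(amount: int, survey_year: int) -> int:
    """Apply the fixed income ceiling: values above it become ceiling + 1."""
    if not 1979 <= survey_year <= 1988:
        raise ValueError("this sketch covers the fixed-ceiling years 1979-1988 only")
    ceiling = 75_000 if survey_year <= 1984 else 100_000
    # Values at or below the ceiling (including negative missing-data
    # codes) pass through unchanged.
    return ceiling + 1 if amount > ceiling else amount
```

For example, an income of $80,000 would be released as $75,001 in 1982 but unchanged in 1986, which is why the change in ceiling can show up as an apparent rise in mean income.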

Users should be aware of these changes in the income ceiling if they are carrying out longitudinal analyses with these data. Upward trends in mean income statistics may reflect this change in the ceiling value. More information about truncation is available in the "Income" section of this guide.

Restricted Asset Values: Confidentiality issues also restrict release of all asset values. To ensure respondent confidentiality, the values of asset variables exceeding particular limits are truncated and the upper limits converted to a set maximum value. The asset amounts have different upper limits, and the types of variables and limits for those variables are as follows:

  1. Starting in 1985, mortgage amounts, the market value of residential property, debt on residential property, miscellaneous debt, and the total market value of assets that exceeded $150,000 were converted to $150,001; the market value of and debt on a farm or business, and savings, that exceeded $500,000 were converted to $500,001; and the market value of and debt on vehicles that exceeded $30,000 were converted to $30,001
  2. Beginning in 1989, the amounts exceeding the upper limits mentioned above were assigned the average value of all values exceeding the limits, in an effort to more accurately reflect the true range of income and asset values
  3. Beginning in 1996, the top two percent of respondents with valid values were averaged and that average value replaced all values in the top range

Users should be aware of these changes in the asset ceiling if they are carrying out longitudinal analyses with these data. Upward trends in mean asset statistics may reflect this change in the ceiling value. More information about truncation is available in the "Assets" section of this guide.

Verbatim: Generally during the PAPI years, when an NLSY79 variable was taken directly from the questionnaire, the verbatim of the question appeared beneath the variable title. If a question is the source for more than one variable, the first variable contains the verbatim while subsequent variables prompt the user to refer back to the variable containing the verbatim. The following verbatim responses appear for reference numbers R03194. and R03195. and demonstrate this convention.

R03194. 'In Which Months of 1979 Did You (or Your Husband/Wife) Receive Supplemental Security Income? January 80 INT'
R03195. 'See R (3194.) February'

Codebook Supplements and Other Technical Documentation

The Other Documentation section of the website includes several items that provide additional information about the NLSY79 survey. There are two NLSY79 codebook supplements. The first supplement, the NLSY79 Codebook Supplement, contains a series of attachments and appendices, variable creation procedures, supplementary coding categories, and derivations for selected variables on the main NLSY79 data files. Information provided within this document is not available in the NLSY79 codebooks, nor will it be found on the documentation files on the NLSY79 data sets. The other supplement contains comparable information specific to the NLSY79 Geocode data files. The Technical Sampling Report describes the selection of the NLSY79 sample and provides additional statistical information. Finally, the School & Transcript Surveys Documentation provides technical information about those special data collections.

Error Updates

Prior to working with an NLSY79 data file, users should make every effort to acquire information on current data or documentation errors. A variety of methods are used to notify users of errors in the data files or documentation and to provide those persons who acquired an NLSY79 data set directly from the Center for Human Resource Research with corrected information. 

Errata can be accessed by following links for the cohort of interest.

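This appendix provides two sets of details about asset and debt variables within the NLSY79: the revision process for asset and debt variables, and details about the computed net worth variables.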

Revision Process for Asset and Debt Variables

In the spring of 2008, a revised set of asset and debt variables was released to the public. These revised asset and debt variables fixed a number of problems with the NLSY79 data by eliminating some implausible outliers, generating uniform topcodes for all rounds, and constructing a total net worth variable. This section provides details on the revision process.

What Users See: Prior to the spring 2008 release users saw a single asset or debt question for each item in the wealth section of the questionnaire. For example, in 1987 the questionnaire asked each respondent who owned a home or apartment the market value of their residential property. The questionnaire asked respondents "About how much do you think this property would sell for on today's market?" Until the spring of 2008, the respondent answers were found in a single 1987 variable that had the following R and Q numbers:

R23627.00    [Q1947] (TRUNC)

After the revision was done, two more asset variables were added to the data set based on the same underlying property responses.  The two new variables are

R23627.01    [*Created] (TRUNC) (REVISED)
R23627.02    [*Created] (TRUNC) (IMPUTED)

What is the difference between R23627.00, R23627.01 and R23627.02? The variable that ends in (.00), R23627.00, is the original variable in the dataset and is left so that researchers can reproduce previous results. The variable that ends in (.01), R23627.01, is a new variable that uses a revised topcoding algorithm (see Step 6 below). By revising the variable, researchers are now provided with extra information previously unavailable. The variable that ends in (.02), R23627.02, is a new variable that imputes missing and unknown responses where possible and also uses the revised topcoding algorithm.

There are two new variables because some users will not want to use imputed data. The (.01) variables are cleaned and re-topcoded but do not have any imputed values. The (.02) variables have as many missing or unknown values imputed as possible. In general, the survey staff recommends that users without a strong preference should use the (.02) asset or debt value that ends in the label "(TRUNC) (IMPUTED)."

Table 1 gives an example using the 1987 property value question of how seven different types of cases were handled by the revision and imputation process. Please note the "$" symbols are not in the NLSY79 data but are added to make it easier to read the table.

Table 1. Examples of How NLSY79 Asset/Debt Data Were Modified

| Public ID | Original R23627.00 | Revised R23627.01 | Imputed R23627.02 | Explanation |
| --- | --- | --- | --- | --- |
| 200 | $150,001 | $276,984 | $276,984 | Originally above the topcode and the value is still above the topcode, but the topcode is now higher, revealing more information. |
| 40 | $150,001 | $153,000 | $153,000 | Originally above topcode and now below topcode. Value is no longer topcoded. |
| 9083 | -1 | -1 | $93,333 | Originally a 'refused' response. Now contains an imputed value. |
| 205 | -2 | -2 | $276,984 | Originally a 'don't know' response. Now contains an imputed value, but since the imputed value is above the topcode, the topcode is used as the value. |
| 526 | -3 | -3 | $100,000 | Originally an invalid skip. Now contains an imputed value. |
| 2 | -4 | -4 | $0 | Originally a valid skip. Since a valid skip means the respondent does not have the asset, the item is changed to zero. |
| 1336 | -2 | -2 | -2 | Originally a 'don't know' response. Since it was not possible to impute a value, the value was left as a 'don't know' response. |

Not every asset or debt variable has a new revised or imputed offspring. Instead, to keep the project manageable, only 15 asset/debt categories were created in each year. These 15 categories match up exactly with the categories found in the NLSY79 wealth module that was used in the 1990s. The categories are: Home Value, Mortgage Value, Property Debt Value, Cash Saving, Stocks/Bonds, Trusts, Business Assets, Business Debts, Car Value, Car Debt, Possession Value, Other Debt Value, IRA, 401K, Certificate of Deposit Value. However, starting in 2000 and then in 2004, the wealth module became more complex. For these later rounds each asset/debt category corresponds to multiple individual asset/debt variables.

For example, in 2004 respondents were asked to report the values of two homes. Their values are combined to form the "home value" category. Similarly, in 2004 the "stocks/bonds" category represents the individually-reported values of government bonds, mutual funds, life insurance surrender values, stocks, corporate bonds, and money owed to the respondent.

Details of the Revision and Imputation

In addition to creating these combined variables, the NLS asset and debt revision project carried out six other steps. This six-step process started with cleaning the raw data and culminated in a new net worth variable and new top coding for most respondents. The details of the six steps are as follows:

Step 1 -- Cleaning Raw Data

The original raw data has a number of out-of-range codes. These out-of-range codes were originally given the topcode value when released to the public. Examination of the out-of-range cases suggests most of these out-of-range flags were data entry mistakes and not actually out of range. Most of these out-of-range codes occurred in the 1988 and 1989 surveys, but this issue arises in other PAPI years. All out-of-range codes were changed to an "invalid skip" (-3) in the revised (.01) variables. If possible, these variables were then imputed in the (.02) variable. Researchers are able to determine which items were incorrectly marked as out of range by looking for items that were top coded originally and then changed to a -3 value in the revised (.01) variables.
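The check described in the last sentence can be sketched as a simple filter. Everything named here is hypothetical: the function `flagged_as_out_of_range`, the row keys, and the idea that the original and revised values have already been loaded side by side. The old topcode value must be supplied by the user for the variable and year in question.

```python
from typing import Dict, List

INVALID_SKIP = -3


def flagged_as_out_of_range(
    rows: List[Dict[str, int]], old_topcode_value: int
) -> List[int]:
    """Return public IDs whose original (.00) value was the old topcode value
    but whose revised (.01) value is now an invalid skip (-3)."""
    return [
        row["public_id"]
        for row in rows
        if row["original"] == old_topcode_value and row["revised"] == INVALID_SKIP
    ]
```

Cases returned by this filter are those where the original topcode most likely reflected a data entry flag rather than a genuinely large value.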

Step 2 -- Unfolding brackets

Unfolding brackets were used for four asset/debt categories in 2000 and for all categories in 2004. These unfolding brackets were not used prior to 2000. Unfolding brackets are used if a respondent fails to report a particular asset's or liability's value. For example, suppose a respondent refuses or does not know the value of his certificate of deposit (CD). The respondent is first asked if his CD is worth more than an entry amount, which is $10,000 for some respondents and $20,000 for others. If the value is not above the entry amount, the respondent is asked if the value of his CD is $5,000 or more. If the value is above the entry amount, he is asked if the value would amount to $30,000 or more. These three questions result in four potential reported ranges: below $5,000; between $5,000 and the entry amount; between the entry amount and $30,000; and above $30,000.

Whenever an unfolding bracket was used, we replaced the reported range with the median value among respondents whose directly reported value falls in the given range. For example, respondents who revealed via unfolding brackets that their CDs are valued below $5,000 were assigned the median CD value among all respondents who reported directly (not via unfolding brackets) a value between $0 and $4,999. The 2004 median values used for each bracket for each asset/debt category are shown in Table 2 and Table 3.
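The bracket logic and the median replacement can be sketched for the CD example. This is a simplified illustration, not the survey's code: the function names are hypothetical, the $5,000 and $30,000 cut points come from the CD example above, and boundary handling at the entry amount is simplified to "lower bound inclusive, upper bound exclusive".

```python
from statistics import median
from typing import Optional, Sequence, Tuple

LOW_CUT = 5_000    # follow-up threshold when the value is not above the entry amount
HIGH_CUT = 30_000  # follow-up threshold when the value is above the entry amount

Bounds = Tuple[int, Optional[int]]  # (lower inclusive, upper exclusive or None)


def bracket_bounds(above_entry: bool, follow_up_yes: bool, entry_amount: int) -> Bounds:
    """Translate the two yes/no answers into one of the four reported ranges."""
    if above_entry:
        return (HIGH_CUT, None) if follow_up_yes else (entry_amount, HIGH_CUT)
    return (LOW_CUT, entry_amount) if follow_up_yes else (0, LOW_CUT)


def impute_from_bracket(bounds: Bounds, direct_reports: Sequence[float]) -> Optional[float]:
    """Median of directly reported values that fall inside the bracket,
    or None when no direct report falls in that range."""
    lower, upper = bounds
    in_range = [v for v in direct_reports if v >= lower and (upper is None or v < upper)]
    return median(in_range) if in_range else None
```

A respondent who answers "no" to both questions lands in the lowest range and receives the median of direct reports below $5,000.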

Table 2. 2004 Median Values Used to Impute Unfolding Brackets

| Asset/debt item | Low | Middle 1 | Middle 2 | High |
| --- | --- | --- | --- | --- |
| Items worth more than $1k | $2,000 | $7,000 | $15,000 | $50,000 |
| Credit card debt | $1,500 | $7,000 | $15,000 | $40,000 |
| Student loans for R/SP | $1,500 | $7,000 | $15,000 | $40,000 |
| Student loans for children | $2,300 | $8,000 | $15,000 | $35,000 |
| Debt to business | $700 | $7,000 | $16,000 | $50,000 |
| Other debts | $2,000 | $7,000 | $16,500 | $50,000 |
| Business value | $1,500 | $6,500 | $20,000 | $200,000 |
| Business debt | $3,000 | $8,000 | $17,000 | $140,000 |
| Government bonds | $1,000 | $6,000 | $15,000 | $50,000 |
| Mutual funds | $2,000 | $9,000 | $20,000 | $60,000 |
| Life Insurance | $2,000 | $7,425 | $20,000 | $100,000 |
| Corporate bonds | $500 | $6,500 | $14,000 | $100,000 |
| Money owed to R | $1,000 | $6,000 | $16,000 | $50,000 |
| Home value | $2,000 | $10,000 | $20,000 | $160,000 |
| Property debt | $2,000 | $6,500 | $20,000 | $44,500 |
| 2nd Home value | $3,000 | $10,000 | $20,000 | $140,000 |
| 2nd Mortgage | $2,500 | $7,750 | $20,000 | $100,000 |
| 2nd Property debt | $1,500 | $10,000 | $20,000 | $90,000 |
| Value of cars, trucks | $2,200 | $9,000 | $20,000 | $40,000 |
| Debt on cars, trucks | $2,500 | $8,000 | $17,000 | $32,000 |
| Value of other vehicles | $3,000 | $8,000 | $16,000 | $45,000 |
| Debt of other vehicles | $3,000 | $8,000 | $16,500 | $50,350 |
| R retirement plan | $2,000 | $8,000 | $20,000 | $61,500 |
| Spouse retirement plan | $2,000 | $8,000 | $20,000 | $70,000 |

Table 3. 2004 Median Values Used to Impute Unfolding Brackets for Retirement Items

| Asset item | Low | High |
| --- | --- | --- |
| Roth IRA | $5,000 | $26,500 |
| Coverdell IRA | $4,400 | $20,000 |
| Keogh plan | $5,000 | $25,000 |
| Variable annuities | $5,000 | $32,000 |
| 529 plans | $3,500 | $34,500 |
| Other tax advantaged plans | $6,000 | $50,000 |

Step 3 -- Bracketing Interpolation of Items

The next step imputed missing item values using a simple algorithm that takes advantage of the longitudinal aspect of the NLSY79 data. We linearly interpolated any missing value that had a set of bracketing values available, by which we mean known values from any "before" interview and any "after" interview. A "missing value" refers to any situation where the respondent reports holding a particular asset/debt, but does not report its value (directly or via unfolding brackets).

There are two bracketing cases. The first is when bracketing values are available for cash-related asset/debt categories (cash savings, stocks/bonds, trusts, other debt, IRAs, 401ks, CDs). In this case we considered as a valid bracketing value any instance when the respondent reports holding this asset/debt and gives a value, or any instance where the respondent reports not having this asset/debt, in which case we assign a value of zero.

The second bracketing case is for property-related asset/debt categories (home value, mortgage, property debt, business assets, business debt, car value, car debt, possessions). Unlike in the first case, we only used as a valid bracketing value any instance when the respondent reports holding this asset/debt and gives a value. If the respondent reports not having this asset/debt, we did not assign a value of zero to the bounding observation, as that would be an improbably low valuation to use for interpolation. For example, suppose the respondent reports owning a house in years t+4 and t+8 but not in year t (recall asset questions are asked every four years). The respondent gives a house value in t+8 but refuses to provide this information in t+4. Because he/she didn't own a house in t, the value of the house owned is zero in t. To estimate house value in t+4 by averaging the respondent's house values in t and t+8 would surely generate an underestimate. Hence, we do not use the zero house value in t as a bounding observation.  

When the missing value was not centered in time between the two known values, the imputed value was still linearly interpolated between them, so it lies proportionally closer to the value from the nearer interview. This algorithm mirrors the procedure used in the Netherlands Socio-Economic Panel for their asset and debt data.
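Linear interpolation between a "before" and an "after" observation can be sketched as below. The function name `interpolate` and the (year, value) tuple layout are hypothetical; the sketch assumes the caller has already selected valid bracketing values according to the two cases described above.

```python
from typing import Tuple

Observation = Tuple[int, float]  # (survey year, item value)


def interpolate(year: int, before: Observation, after: Observation) -> float:
    """Linearly interpolate an item value for a year between two known values."""
    (year0, value0), (year1, value1) = before, after
    if not year0 < year < year1:
        raise ValueError("the target year must lie strictly between the bracketing years")
    # Fraction of the time gap that has elapsed at the target year.
    weight = (year - year0) / (year1 - year0)
    return value0 + weight * (value1 - value0)
```

For instance, with known values of $10,000 in 1988 and $20,000 in 1992, a missing 1989 value would be imputed as $12,500, one quarter of the way between the two.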

Step 4 -- Linear Extrapolation of Items

The primary drawback to the above bracketing interpolation is that it provides no method of estimating an item's value if the item is either the first or last in a series. For example, if a respondent provides information on his car's value in 1985, 1986, and 1987, states he does not know its value in 1988 and then drops out of the survey, there is no bracketing observation for 1988.

To estimate these missing starting and ending points we fit the known data using ordinary least squares (OLS) and then extrapolated to determine the missing value. The (respondent-specific) regression we estimated was:

$$\text{Item Value}_{it} = a_i + b_i\,\text{Year}_{it} + u_{it},$$

using non-missing values for this asset/debt item for respondent i. As an example, assume a respondent stated he owned a vehicle in 1985 but did not know its value. Then assume this respondent in 1986, 1987 and 1988 said his vehicle was worth $14,000, $10,000 and $8,000 respectively. The OLS imputation regression would be run with the following values:

| Year | Item value |
| --- | --- |
| 1986 | $14,000 |
| 1987 | $10,000 |
| 1988 | $8,000 |


The resulting computation for the missing year (1985) is:

Item Value = 5,971,666.5 - (3,000 * 1985) = $16,666.50, 

so we used $16,666.50 as the imputed value for 1985. Because the NLSY79 data does not contain any fractional data, all cents values were rounded.
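The vehicle example can be reproduced with a small least-squares fit. This is a sketch in plain Python, not the code used by survey staff, and the function names are hypothetical. Exact arithmetic gives a slope of -3,000 and a 1985 prediction of about $16,666.67; small differences from figures computed with a rounded intercept are rounding effects.

```python
from typing import List, Tuple

Point = Tuple[int, float]  # (year, reported item value)


def fit_line(points: List[Point]) -> Tuple[float, float, float]:
    """Ordinary least squares of value on year.
    Returns (mean_year, mean_value, slope)."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two non-missing values are required")
    mean_year = sum(x for x, _ in points) / n
    mean_value = sum(y for _, y in points) / n
    sxx = sum((x - mean_year) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("observations must come from at least two different years")
    sxy = sum((x - mean_year) * (y - mean_value) for x, y in points)
    return mean_year, mean_value, sxy / sxx


def extrapolate(points: List[Point], year: int) -> float:
    """Predict the item value for a year outside the observed series."""
    mean_year, mean_value, slope = fit_line(points)
    # Centered form avoids subtracting two very large numbers.
    return mean_value + slope * (year - mean_year)


vehicle = [(1986, 14_000), (1987, 10_000), (1988, 8_000)]
imputed_1985 = round(extrapolate(vehicle, 1985))  # rounded to whole dollars
```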

We imposed two types of restrictions on the data used for each respondent-specific regression. First, we require that two or more non-missing values were available. A non-missing value for cash-related asset/debt categories (cash savings, stocks/bonds, trusts, other debt, IRAs, 401ks, CDs) is any value reported by the respondent or a zero value if the respondent states they do not have this asset/debt. For property-related asset/debt categories (home value, mortgage, property debt, business assets, business debt, car value, car debt, possessions) only values reported by the respondent are used. Users should note that this mimics the two types of bracketing strategies described in step 3. Any values imputed from steps 2-3 are treated as non-missing values and used in the regression.

Second, to run the regression the respondent must also have reported an item value in the next closest wealth interview to ensure our estimates are relatively precise. For example, if the respondent did not know the value of his vehicle in 2004, we must have a known (reported or imputed) vehicle value in 2000 (the closest year, given that asset data were not collected in 2002) for the imputation to occur.

We also imposed additional restrictions on the predicted values that arise from these respondent-specific regressions. If the predicted value arising from the regression was negative, we did not use it as an imputation because respondents cannot report a negative asset or debt. In addition, if the predicted value was more than twice as large as the item value reported in the nearest year, it was not used as an imputation and no value was created. These rules are designed to ensure our extrapolated values are not too extreme relative to the other observations.
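The two restrictions on predicted values amount to a simple acceptance test. The sketch below uses a hypothetical function name, `accept_extrapolation`, and assumes the nearest-year value has already been looked up.

```python
def accept_extrapolation(predicted: float, nearest_year_value: float) -> bool:
    """Return True when an extrapolated value may be used as an imputation."""
    # Respondents cannot report a negative asset or debt.
    if predicted < 0:
        return False
    # Reject predictions more than twice the value reported in the nearest year.
    if predicted > 2 * nearest_year_value:
        return False
    return True
```

In the vehicle example, a prediction near $16,667 against a nearest-year value of $14,000 passes both tests.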

User Note: It is important to understand that the revised variables will potentially change with the addition of each subsequent round of wealth data. Revisions can occur because additional data will sometimes allow us to impute a missing value via step 3 (bracketing interpolation) rather than step 4 (linear extrapolation). This situation is similar to the NLSY79 work history data, which sometimes change when new information becomes available.

Step 5 -- Creating Total Net Worth Variables

The new data also include a "created net worth" variable for each survey year in which an asset module was fielded. This series was computed simply by combining the revised asset and debt series using the following equation for each respondent:


If any of the revised items are missing because they could not be imputed, the computed net worth variable is set to missing. Note that each respondent is asked about 15 types of assets and debts in each round. There might be some types of assets/debts that the respondent reports not holding, some where he gives a value, some where we impute a value, and some where we are unable to impute a value. If any asset/debt falls into the last category, we do not compute the total net worth variable for that particular respondent-year case. While we do not compute a total net worth in these cases, the revised series are designed to let researchers do it easily, in part because all respondents who do not own an asset or debt have a zero in the revised series, instead of a -4.
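The missing-value rule can be sketched as follows. This is an illustration under explicit assumptions: the dictionary keys are hypothetical names for the 15 categories listed earlier, the split into assets and debts is inferred from the category names, and net worth is taken to be total assets minus total debts.

```python
from typing import Dict, Optional

# Assumed classification of the 15 categories (inferred from their names).
ASSETS = (
    "home_value", "cash_saving", "stocks_bonds", "trusts", "business_assets",
    "car_value", "possessions", "ira", "k401", "cd",
)
DEBTS = ("mortgage", "property_debt", "business_debt", "car_debt", "other_debt")


def net_worth(items: Dict[str, int]) -> Optional[int]:
    """Assets minus debts, or None when any item could not be imputed."""
    total = 0
    for name in ASSETS + DEBTS:
        value = items[name]
        if value < 0:
            # A negative code means the item is still missing.
            return None
        total += value if name in ASSETS else -value
    return total
```

Because respondents who do not hold an item carry a zero rather than -4 in the revised series, those items simply contribute nothing to the sum.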

Step 6 -- Revision of Top Codes

The last step calculated new and consistent top codes for the wealth data.

The NLSY79 has used three basic types of top coding algorithms for financial data. In the early years of the survey (up to 1988), every answer above a specified cutoff value, such as $100,000 for some variables, was recoded to the truncation value plus one dollar, such as $100,001. Unfortunately, this algorithm results in a sharp downward bias in the sample mean because the right tail of the distribution is truncated. In the middle years (1989 to 1994), a new algorithm was implemented that replaced all values above the hard cutoff with the average of all outlying values. Starting with the 1996 data, a third approach was used. In this approach the hard cutoff was eliminated and the cutoff became the value that would shield the top two percent of respondents. All values in the top two percent were averaged, and that average replaced the values above the top code.
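The three approaches can be compared with a short sketch. It illustrates the descriptions above and is not the code used to produce the public files; the function names are hypothetical, and the way the top two percent is counted (rounding, with ties at the threshold included in the tail) is an assumption.

```python
from statistics import mean
from typing import List


def topcode_truncate(values: List[float], cutoff: float) -> List[float]:
    """Early years: values above the cutoff become cutoff + 1."""
    return [cutoff + 1 if v > cutoff else v for v in values]


def topcode_mean_above_cutoff(values: List[float], cutoff: float) -> List[float]:
    """Middle years: values above a hard cutoff become the mean of those values."""
    tail = [v for v in values if v > cutoff]
    if not tail:
        return list(values)
    tail_mean = mean(tail)
    return [tail_mean if v > cutoff else v for v in values]


def topcode_top_share(values: List[float], share: float = 0.02) -> List[float]:
    """1996 onward: the top `share` of values become their own mean."""
    if not values:
        return []
    # Assumption: the number of shielded cases is the rounded share, at least 1.
    n_top = max(1, round(len(values) * share))
    threshold = sorted(values, reverse=True)[n_top - 1]
    # Ties at the threshold are treated as part of the tail.
    tail_mean = mean(v for v in values if v >= threshold)
    return [tail_mean if v >= threshold else v for v in values]


print(topcode_truncate([50_000, 100_000, 250_000], 100_000))
# [50000, 100000, 100001]
```

The first function shows why the early method biases the mean downward: every outlying value collapses to just above the cutoff, whereas the other two preserve the mean of the tail.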

We revised the top codes for three reasons: the NLSY79 has used a variety of methods, a number of researchers have complained about the lack of information above a hard cutoff, and the data cleaning steps dramatically changed a number of the highest values. We re-topcoded only home and vehicle values, because homes and vehicles are clearly identifiable objects that could be used to re-identify respondents. Other asset or debt categories are no longer topcoded because it is difficult to use them to identify a particular respondent.

If the variable was previously topcoded, we re-topcoded it using the top two percent approach described above. When calculating the top two percent, we did not include individuals whose values were set to zero because they did not own the item or have the debt.
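The effect of leaving out the zeros can be shown with a brief sketch. The function name is hypothetical, and the counting of the top two percent (rounded, with ties included) is an assumption rather than the documented procedure.

```python
from statistics import mean
from typing import List


def retopcode_owners(values: List[float], share: float = 0.02) -> List[float]:
    """Top-code the top `share` of non-zero values; zeros are left untouched."""
    # Zeros mark respondents who do not own the item, so they are excluded
    # from the base used to find the top two percent.
    owners = sorted((v for v in values if v != 0), reverse=True)
    if not owners:
        return list(values)
    n_top = max(1, round(len(owners) * share))  # assumption: rounded count
    threshold = owners[n_top - 1]
    tail_mean = mean(v for v in owners if v >= threshold)
    return [tail_mean if v != 0 and v >= threshold else v for v in values]


# 100 non-owners (zeros) and 100 owners with values 1..100.
sample = [0] * 100 + list(range(1, 101))
result = retopcode_owners(sample)
print(result[-3:])  # [98, 99.5, 99.5]
```

Had the zeros been counted, the top two percent of this 200-case sample would cover four cases instead of two, so more owners would lose their reported values.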

Details about Computed Net Worth Variables

The following information describes the intermediate sets of variables that were used to create the high-level net worth figures. These are currently the only NLSY79 variables that are imputed, meaning that missing information has been filled in. The intermediate sets of variables are useful for researchers who want to probe a particular aspect of a respondent’s financial life, such as their debts or ownership of vehicles.


The asset and debt variables can be thought of as a pyramid with either three or four levels. The base of the pyramid contains a large number of raw variables, which ask respondents if they have a particular asset or liability. The middle layers of the pyramid contain summary categories of asset and debt variables. The top layer of the pyramid contains the net worth of the respondent and their family. This document describes the middle layer’s summary categories.

When the first net worth calculations were done during the late 1990s, the NLSY79 asset and debt questions were relatively simple. There were 15 mid-level groups of questions that asked each respondent whether they held an asset or debt and then asked for its value. The 15 mid-level categories were as follows:

            1) Home value
            2) Mortgages
            3) Other residential debt
            4) Value of farm/business/real estate
            5) Debts of farm/business/real estate
            6) Market value of vehicles
            7) Debt of vehicles
            8) Value of stocks/bonds/mutual funds
            9) Value of CDs
            10) Value of trusts
            11) Value of IRAs
            12) Value of 401ks and 403bs
            13) Value of cash savings
            14) Value of other assets like jewelry/collections
            15) Value of all other debts like credit cards/student loans

Each of the 15 mid-level categories corresponded directly to questions asked in the mid-1990s. Questions asked in the 1980s followed the same format but asked about fewer than 15 categories. Table 4 shows the reference numbers, question names and titles from the 1996 survey for each of the fifteen groups:

Table 4. Underlying Data in Round 17 in 1996 Used to Create 15 Mid-Level Asset and Debt Variables

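| Mid-Level Category | R Num | Title |
|---|---|---|
| Home value | | Mkt Val Res Property R-Sp Own 96 |
| Mortgages | | Amount R-Sp Owe On Res Property 96 |
| Other residential debt | | Amt Oth Debt R-Sp Owes On Res Prop 96 |
| Value of cash savings | | Amount In Savings Accounts 96 |
| Value of IRAs | | Amount In Iras-Keough 96 |
| 401Ks / FA_6_IMPUTED | | Amount In Tax-Defrd Plans 96 |
| Value of CDs | | Amount In CDs, Loans, Mortg 96 |
| Value of stocks/bonds/mutual funds | | Mkt Val Of Stocks, Bonds R-Sp Have 96 |
| Value of trusts | | Total Val Of Estate, Invest Trust 96 |
| Value of farm/business/real estate | | Ttl Mkt Val Farm, Bsns, Oth Prop? 96 |
| Debts of farm/business/real estate | | Ttl Amt Debts, Liablty Farm, Bsns 96 |
| Debt of vehicles | | Amt R-Sp Owe On Vehicles 96 |
| Market value of vehicles | | Mkt Val Of Vehicles R-Sp Own 96 |
| Value of other assets like jewelry/collections | | Ttl Mkt Val Items Over $500 96 |
| Value of all other debts like credit cards/student loans | | Total Amt R-Sp Owe To Creditors 96 |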

The year 2000 was a transition year for the wealth module. In this year extra questions were asked of respondents who stated they did not know particular values. These extra questions asked respondents to provide rough brackets if possible. For example, if a respondent stated they had stocks, bonds or mutual funds but did not know the exact value, they were asked if the amount fell in one of the following ranges: less than $1,000, between $1,000 and $4,000, between $4,000 and $15,000, or above $15,000. In addition, respondents who owned a farm, business or investment real estate were asked what percentage of the property they owned. The percentage question allowed for individuals to be co-owners and partial owners. Previously, the asset and debt questions assumed the individual always owned the entire business.

The Qnames for the year 2000 followed the same pattern as the data shown in Table 4. For example, the year 2000 stock/bond/mutual fund question is qname Q13-125. Bracketing questions associated with a qname had the letters A, B, C appended. So, in this example, the stock, bond and mutual fund bracketing questions are found in Q13-125A, Q13-125B, and Q13-125C.
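The naming pattern can be expressed in a couple of lines. The helper below is a hypothetical convenience for building lookup lists and assumes the A, B, C suffix rule holds for the qname supplied; check the results against the NLS Investigator before relying on them.

```python
from typing import List, Sequence


def bracket_qnames(qname: str, suffixes: Sequence[str] = ("A", "B", "C")) -> List[str]:
    """Build the bracketing-question qnames for a year 2000 wealth item."""
    return [f"{qname}{suffix}" for suffix in suffixes]


print(bracket_qnames("Q13-125"))  # ['Q13-125A', 'Q13-125B', 'Q13-125C']
```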

Then, starting with the year 2004 survey, the number of questions greatly expanded and the wealth module became much more complex. For these later rounds, each asset/debt category corresponds to multiple sets of individual variables. To keep the project manageable, the same mid-level asset/debt categories were created in each year even though the number of questions was expanded.

In addition, questions about trusts were dropped, since few people appeared to be trust fund recipients. Because the NLS survey staff had discussed adding the trust questions back in, a high-level trust variable was included so that there was a continuous sequence of wealth variables. However, to date (2016), the underlying trust questions have not been brought back, so these high-level trust variables have only a zero or -5 value from 2004 onward.

For example, in 2004 respondents were asked to report the values of up to two primary homes, instead of just one home. The two values were then combined to form the “home value” input to the total net worth calculation. Similarly, the “stocks/bonds” category in the total net worth formula in the 2004 survey represents the individually-reported values of government bonds, mutual funds, life insurance surrender values, stocks, corporate bonds, and money owed to the respondent. In addition, most asset/debt categories contain questions that include unfolding brackets and other methods to improve respondent response to difficult questions when a respondent stated they did not know a precise answer.

As an example of the complexity, Table 5 shows the nine NLSY79 variables used in 2004 to create the home value variable, NFA_1A_Imputed “Market Value of Residential Property R/Spar Own (Trunc) (Imputed).”

Table 5: Underlying Data Used in 2004 Survey to Create Home Value Variable

| R Number | Title |
|---|---|
| | Mkt Val Res Property R/Sp Own 2004 |
| | Est Market Value Of Residential Property R/Spar Own |
| | Est Market Value Of Residential Property R/Spar Own |
| | Market Value Of Residence In 2003 More Than Entry Amount |
| | Market Value Of Residence In 2003 More Than $30k |
| | Market Value of (2nd) Residential Property R/Spouse Own |
| | Est Market Value Of (2nd) Residential Property R/Spar Own |
| | Market Value Of (2nd) Residence In 2003 More Than Entry Amount |
| | Market Value Of (2nd) Residence In 2003 More Than $30k |

Table 6 shows all of the key variables used to generate the 14 mid-level categories starting with Round 19 in 2000. To keep Table 6 understandable, only the most important Qnames for each variable are shown. Additional questions, shown in Table 5, which asked respondents ranges and bracketing amounts, are not shown in Table 6. The range and bracketing variables have Qnames with the same root as--but different suffixes from--the items shown in Table 6.

Table 6: Data Underlying Mid-Level Asset and Debt Categories Used Beginning in Round 20

| Mid-Level Category | Title | Qname |
|---|---|---|
| Home value | Value of 1st Home | |
| | Value of 2nd Home/Time Share | |
| Mortgages | Mortgage on 1st Home | |
| | Mortgage on 2nd Home/Time Share | |
| Other residential debt | Other Property Debt on 1st Home | |
| | Other Property Debt on 2nd Home/Time Share | |
| Value of cash savings | Total Amount In Checking, Savings, And Money Market Funds | |
| Value of stocks/bonds/mutual funds | Total Money If R-Spar Cashed In US Government Savings bonds | FA_3A |
| | Total Money If R-Spar Sold Mutual Funds | FA_4A |
| | Total Money If Insurance Policies Cashed | FA_5A |
| | Money R-Spar Have If Sold/Paid Amt Owe On Stock | FA_9A |
| | Amt Of $ If Cash/Pay Off Securities/ Bonds | FA_10A |
| | R-Spouse/Partner Owed Money From Personal Or Mortgage Loans | FA_11A |
| Value of trusts | These variables contain only 0s and -5s since the underlying questions about trust values were not asked. Because there was a chance the underlying questions would be brought back, the variables were included to ensure a consistent sequence. | |
| Value of farm/business/real estate | Market Value Of Farm in 2003, Excluding Crops Held Under Commodity Credit Loans | |
| | Percentage Of Farm Owned By R Or Spouse | Q13-FJT-12B |
| | Market Value Of Business Professional Practice | Q13-BPPJT-11_TRUNC |
| | Percentage Of Professional Practice That R Owns | Q13-BPPJT-12B |
| | Market Value Of R Share Of Professional Practice | Q13-BPPJT-12E |
| | Market Value Of Additional Real Estate | Q13-REJT-11_TRUNC |
| | Percentage Of Real Estate R-Spouse Own | Q13-REJT-12B |
| | Market Value Of R Share Of Real Estate | Q13-REJT-12E |
| | Total Market Value Of Farm/Business/Other Property R/Spouse Own | |
| Debts of farm/business/real estate | Total Amount Of Debts Owed On Farm | Q13-FJT-12 |
| | Total Amount Of Debts Owed On Professional Practice | |
| | Total Amount Of Debts On Farm/Business/Other Property R/Spouse Owe | |
| Market value of vehicles | Market Value Of Vehicle | NFA_4C_TRUNC |
| | Current Value Of Vehicle | SC_12A.01 |
| | Market Value Of Other Personal Use Vehicles | NFA_5C |
| Value of CDs | Total Money If R-Spouse Cashed In Certificate of Deposits/CDs | |
| Debt of vehicles | R Or Spouse Owe Money On Vehicle | NFA_4E |
| | Total Amount Owed By R-Spouse On Vehicle After Last Car Payment | |
| | Total Amount Owed By R-Spar On All Other Personal Use Vehicles | |
| | Balance Owed On Vehicle After Last Payment | |
| Value of other assets like jewelry/collections | Market Value Of Collections Worth $1000 Or More | |
| | Market Value Of Individual R-Spouse Items Worth $1000 Or More | |
| Value of all other debts like credit cards/student loans | Total Balance Owed On All Credit Card Accounts Together | |
| | Total Amount R-Spouse Owes On Student Loans | |
| | Total Amount Owed On Student Loans For Children | |
| | Total Amount R-Spouse Owes To Other Businesses After Most Recent Payment | |
| Value of IRAs | Total Money If Tax Advantage Account Cashed | FA_8D_TRUNC |
| 401Ks / FA_6_IMPUTED | Total Value Of Emp-Sponsored Retiremt Plans | FA_6E |
| | Tot Balance Of Spar-Emp Sponsored Retiremt Plans | FA_7C |

Researchers who do not want to use the imputed values or who want to impute the values themselves can find all the original non-imputed respondent data in the NLSY79 dataset in the NLS Investigator (www.nlsinfo.org/investigator). Original variables are typically located very close to the questions which end in “IMPUTED” and often have the original qname without any suffix. One method for finding these questions is to determine the R number of the imputed value and then use the NLS Investigator to show the closest page of reference numbers.
