This section describes these three primary components of the NLSY79 codebook system and discusses the important types of information found within each. An additional codebook supplement exists for the Geocode data file.
The codebook is the principal element of the NLSY79 documentation system and contains information intended to be complete and self-explanatory for each variable in a data file. The software accompanying the NLSY79 data sets allows easy access to each variable's codebook information and permits the user to print a codebook extract for preselected variables.
Every variable is presented within the NLSY79 documentation as a block of information called a "codeblock." Each codeblock entry depicts the following important information:
- reference number
- variable title
- coding information
- frequency distribution
- location within the data file
- reference to the questionnaire item or source of the variable
- information on the derivation of created variables
Users will find that NLSY79 CAPI codeblocks present greater detail on each variable, including universe totals, universe skip patterns, and range of acceptable values information. Each of these terms is described more completely below. Codeblocks for many variables include special notes containing additional information designed to assist in the accurate use of data from that variable.
Codebooks are arranged in reference number order. As a general rule, raw questionnaire items appear first for a given survey year, followed by items from such instruments as the Information Sheet and Employer Supplement. Variables from the main body of the questionnaire are followed by created or constructed variables drawn from an external data source, such as the County & City Data Book.
Beginning with the 1993 CAPI surveys, questions relating to each job/employer, which were formerly located within the unique Employer Supplements, are merged with the main questionnaire items. A comparison of the reference number assignments used for the 1988 PAPI and 1993 CAPI variables appears in Table 1 and provides users with a sample set of reference numbers. Users should note that not all survey year assignments will be ordered in precisely this manner.
Table 1. NLSY79 1988 and 1993 Reference Number Assignment
|1988 PAPI||1993 CAPI|
|R25000.-R28927.||All Raw, Edited and Created Variables||R41001.-R44308.||All Raw, Edited and Created Variables|
|R25000.-R27467.||Questionnaire Items||R41001.-R43988.||Questionnaire Items including the Employer Supplement series|
|R27469.-R27501.||Information Sheet Items||R43989.-R44036.||Information Sheet Items|
|R27506.-R27609.||Household Record||R44037.-R44126.||Household Record|
|R27610.-R28254.||Employer Supplement (ES) 1|
|R28255.-R28371.||Children's Record Form||R44127.-R44162.||Children's Record Form|
|R28372.-R28690.||Childhood Residence Calendar 2|
|R28704.-R28729.||Created Variables||R44163.-R44205.||Created Variables|
|R28735.-R28811.||Supplemental Fertility File Variables|
|R28825.-R28927.||Geocode Variables||R44206.-R44308.||Geocode Variables|
|Note: PAPI refers to paper-and-pencil interviews which were conducted with the NLSY79 during 1979-92. CAPI or computer-assisted personal interviews began for the full NLSY79 cohort in 1993.|
|1 Beginning in 1993, variables from the employer supplement series are included within the raw questionnaire items.|
|2 The childhood residence retrospective was unique to 1988.|
The following figures give users an example of codebook pages before (Figure 1) and after (Figure 2) CAPI implementation.
Figure 1. NLSY79 Sample PAPI Codeblock
Figure 2. NLSY79 Sample CAPI Codeblock
Coding Information: Each codeblock entry presents the set of legitimate codes that a variable may assume along with a text entry describing the codes.
Dichotomous variables (those answered yes/no) are uniformly coded "Yes" = 1, "No" = 0. Other dichotomous variables have frequently been reformulated so that this convention may be followed.
Discrete (Categorical), as in the case of the NLSY79 example, the variable 'Activity Most of Survey Week CPS Item'.
- WITH A JOB, NOT AT WORK
- LOOKING FOR WORK
- KEEPING HOUSE
- GOING TO SCHOOL
- UNABLE TO WORK
Continuous (Quantitative), as in the case of hourly rate of pay in the example above. These variables have continuous data but are presented in the codebook using a convenient frequency distribution. NLSY79 users will note that most valid data are positive numbers. Special cases are flagged by negative numbers in the NLSY79. See Appendix 13 in the NLSY79 Codebook Supplement for more detail on the handling of negative numbers in the data files. The following conventions have been used throughout the data:
|Noninterview||-5|
|Valid Skip||-4|
|Invalid Skip||-3|
|Don't Know||-2|
|Refusal||-1|
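The negative-code conventions above lend themselves to a small lookup when reading the data. The following is a minimal sketch, assuming values arrive as plain integers; the names `MISSING_CODES` and `split_valid` are illustrative and are not part of any NLSY79 software.

```python
# Negative codes reserved for special cases, as listed in the conventions above.
MISSING_CODES = {
    -1: "Refusal",
    -2: "Don't Know",
    -3: "Invalid Skip",
    -4: "Valid Skip",
    -5: "Noninterview",
}


def split_valid(values):
    """Separate valid responses from flagged ones (hypothetical helper).

    Returns the list of valid values and a count of each special code found.
    """
    valid = []
    flagged = {}
    for value in values:
        if value in MISSING_CODES:
            label = MISSING_CODES[value]
            flagged[label] = flagged.get(label, 0) + 1
        else:
            valid.append(value)
    return valid, flagged


# Made-up responses for one variable across seven respondents.
valid, flagged = split_valid([12, -4, 0, -1, 7, -4, -5])
```

Keeping the special codes out of any arithmetic matters because a mean or sum that includes -4 or -5 would be silently biased downward.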
Coding information for a given variable in the NLSY79 codeblock is:
- not necessarily consistent with the codes found within the questionnaire and
- not necessarily consistent for the same variable across years. Use only the codebook coding information for analysis.
Frequency Distribution: In the case of discrete (categorical) variables, frequency counts are normally shown in the first column to the left of the code categories. In the case of continuous (quantitative) variables, a distribution of the variable is presented using a convenient class interval. The format of these distributions varies.
Derivations: The decision rules employed in the creation of main file constructed variables have been included, whenever possible, in the codebook under the title "DERIVATIONS." This information enables researchers to determine whether available constructs are appropriate to their needs. In the case of the example NLSY79 variable in Figure 1, no derivation is shown because these variables are picked up directly from the interview schedule. Certain variables will contain a reference to an appendix for the decision rules that were used in creating the variable.
Questionnaire Item: "Questionnaire item" is a generic term identifying the printed source of data for a given variable. A questionnaire item may be a question, a check item, or an interviewer's reference item appearing within one of the survey instruments.
The questionnaire location for NLSY79 entries appears either in parentheses or brackets directly after the reference number, for example R04434. (SO6D1314). The five questionnaire item numbering conventions used in the codebook are described in the Survey Instruments section (see especially Table 2).
Before the adoption of CAPI, if an NLSY79 variable was not taken directly from one of the survey instruments, the questionnaire location contained an asterisk (*) in the codebook. The following categories of variables had no questionnaire numbers:
- assigned identification numbers for the respondent, child, or family unit;
- all derived or constructed variables;
- variables from the following special surveys: Profiles (ASVAB), the School Survey, and the Transcript Survey;
- variables found on constructed data files such as the Supplemental Fertility File (area of interest "Fertility and Relationship History/Created"); and
- variables drawn from an external data source such as those found on the Geocode files.
In CAPI years, survey staff assign a question name that is not used in the questionnaire. This name remains the same in subsequent rounds, so similar created variables can be easily located.
Section, deck, and question numbers have been somewhat arbitrarily assigned to the information and questions found in special survey instruments such as the Household Screener, Information Sheet, Children's Record Forms, Household Interview Forms, and the Employer Supplements. The section and deck numbers for these special survey items were numbered sequentially after the main survey items and their specific order varies each year. The exception to this is the assignment of the deck numbers for the Employer Supplements. Question numbering is discussed earlier in the Survey Instruments section (see especially Table 3).
Universe Information: Universe information was attached to select 1979-92 variables. Beginning with the 1993 CAPI interviews, the amount of universe information was expanded to include:
- Universe Totals: Two totals are presented:
- the sum of the frequency counts for each coding category is presented below the individual codes; and
- the sum of the valid responses plus missing response counts of "refusals," "don't knows," and "invalid skips" can be found in the TOTAL==========> field. The number of respondents who legitimately did not respond to a question, that is, "valid skips (-4)" and "noninterviews (-5)," are also depicted.
- Universe Skip Patterns: The following detailed universe information will enable researchers to easily trace the flow of respondents both backward and forward through various parts of the CAPI questionnaire items included in the codebook:
"Go to Reference # XXXXX.," appended to certain coding categories, indicates that respondents selecting that answer category were routed to the next question specified.
"Lead In(s) Reference # XXXXX." identifies the question or questions immediately preceding the codeblock question through which the universe of respondents was routed. Each lead-in reference number is followed by the relevant response value indicators, (Default), (ALL), [1:1], [1:6], and so forth. For example:
|R41000. (All)||This means that all cases where R41000. is asked will branch to the current question. This does not imply all respondents are asked question R41000.|
|R41000. (Default)||This means that the default path of control from question R41000. is to branch to the current question, but there may be conditions under which a different path would be taken.|
|R41000. [1:6]||This means that whenever the response category for question R41000. takes on the values one to six inclusive, the next question is the current question record.|
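The response value indicators in the examples above can be read mechanically. The sketch below is one possible interpretation, assuming the indicator text is available as a string; the function `routes_to_current` is hypothetical, and treating "(Default)" as always routing is a simplification, since other skip conditions may override the default path.

```python
import re


def routes_to_current(indicator, response):
    """Return True if `response` to the lead-in question leads to the current question.

    `indicator` is the text that follows a lead-in reference number,
    for example "(All)", "(Default)", or "[1:6]".
    """
    text = indicator.strip()
    if text.lower() in ("(all)", "(default)"):
        # (All): every case asked the lead-in question branches here.
        # (Default): this is the default path; exceptions are not modeled.
        return True
    match = re.fullmatch(r"\[(-?\d+):(-?\d+)\]", text)
    if match is None:
        raise ValueError(f"unrecognized indicator: {indicator!r}")
    low, high = int(match.group(1)), int(match.group(2))
    return low <= response <= high  # the range is inclusive at both ends


in_range = routes_to_current("[1:6]", 6)
out_of_range = routes_to_current("[1:6]", 7)
```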
"Default Next Question" specifies the next question that all respondents of the current codeblock will be asked unless some other skip condition indicates otherwise.
Valid Values Range: Depicted below the frequency distribution is information relating to the range of valid values for that particular distribution. "MINIMUM" indicates the smallest recorded value exclusive of "NA" and "DK." "MAXIMUM" indicates the largest recorded value. The computer-assisted interview contains internal range checks that limit responses to those between predesignated values, alert interviewers to verify unusual values, and bolster the information provided by the traditional minimum and maximum fields (see, for example, Figure 2 above).
Maximum and Minimum Fields: The MIN and MAX fields define the range, that is, the lower limit and the upper limit, of data values for a given question. A MAX of $156,359 on an income question, for example, means that this value was the highest value recorded.
Hardmax and Hardmin Fields: Hard Maximum and Hard Minimum fields denote the highest and lowest values that were accepted by the CAPI program. A Hardmax of 500,000 and a Hardmin of 0 on an income question indicate that no values above $500,000 or values lower than zero (no income) can be accepted. Dates, such as month/day/year of the respondent's last interview [lintdate] and current interview [curdate], are used as Hardmin and Hardmax values in order to restrict responses to certain questions to values within that range. Responses outside this range must be entered by the interviewer in the comment field.
Softmax and Softmin Fields: Softmax and Softmin fields cover ranges where an answer may exceed reasonable limits yet remain within the absolute limits and are acceptable after verification. A Softmax set to $80,000 on an income question will cause the machine to "beep" and a warning to appear on the screen. Interviewers are thus alerted that the value is unusual and the respondent's answer should be verified.
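The interaction between the hard and soft limits can be illustrated with a short sketch. It uses the income figures quoted above (Hardmin 0, Hardmax 500,000, Softmax 80,000); the function name `check_response`, the three return labels, and the choice to treat the limits themselves as acceptable are assumptions made for illustration, not a description of the actual CAPI program.

```python
def check_response(value, hardmin, hardmax, softmin=None, softmax=None):
    """Classify a keyed-in value against hard and soft range limits.

    Returns "reject" if the value is outside the hard limits, "verify" if it
    is inside the hard limits but outside a soft limit, and "accept" otherwise.
    """
    if value < hardmin or value > hardmax:
        return "reject"  # cannot be entered in the answer field
    if softmin is not None and value < softmin:
        return "verify"  # unusual: interviewer confirms with the respondent
    if softmax is not None and value > softmax:
        return "verify"
    return "accept"


# Income question with the limits used as examples in the text.
limits = {"hardmin": 0, "hardmax": 500_000, "softmax": 80_000}
typical = check_response(45_000, **limits)
unusual = check_response(120_000, **limits)
impossible = check_response(600_000, **limits)
```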
Restricted Income Values: Confidentiality issues restrict release of all income values. To ensure respondent confidentiality, the values of income variables exceeding particular limits are truncated and the upper limits converted to a set maximum value.
- From 1979 through 1984, the upper limit on income variables was $75,000, and any amounts exceeding $75,000 were converted to $75,001
- Beginning in 1985, the upper limit on income amounts was increased to $100,000 due to inflation and the advancing age of the cohort, and amounts exceeding $100,000 were converted to $100,001
- Beginning in 1996, the top two percent of respondents with valid values were averaged and that average value replaced all values in the top range
Users should be aware of these changes in the income ceiling if they are carrying out longitudinal analyses with these data. Upward trends in mean income statistics may reflect this change in the ceiling value. More information about truncation is available in the "Income" section of this guide.
Restricted Asset Values: Confidentiality issues also restrict release of all asset values. To ensure respondent confidentiality, the values of asset variables exceeding particular limits are truncated and the upper limits converted to a set maximum value. The asset amounts have different upper limits, and the types of variables and limits for those variables are as follows:
- Starting in 1985, all mortgage, market value of residential property, debt on residential property, miscellaneous debt, and total market value of assets amounts of more than $150,000 were converted to $150,001; market value and debt amounts for a farm or business, and savings amounts, of more than $500,000 were converted to $500,001; and market value and debt amounts for vehicles of more than $30,000 were converted to $30,001
- Beginning in 1989, the amounts exceeding the upper limits mentioned above were assigned the average value of all values exceeding the limits, in an effort to more accurately reflect the true range of income and asset values
- Beginning in 1996, the top two percent of respondents with valid values were averaged and that average value replaced all values in the top range
Users should be aware of these changes in the asset ceiling if they are carrying out longitudinal analyses with these data. Upward trends in mean asset statistics may reflect this change in the ceiling value. More information about truncation is available in the "Assets" section of this guide.
Verbatim: Generally during the PAPI years, when an NLSY79 variable was taken directly from the questionnaire, the verbatim of the question appeared beneath the variable title. If a question is the source for more than one variable, the first variable contains the verbatim while subsequent variables prompt the user to refer back to the variable containing the verbatim. The following verbatim responses appear for reference numbers R03194. and R03195. and demonstrate this convention.
R03194. 'In Which Months of 1979 Did You (or Your Husband/Wife) Receive Supplemental Security Income? January 80 INT'
R03195. 'See R (3194.) February'
Codebook Supplements and Other Technical Documentation
The Other Documentation section of the website includes several items that provide additional information about the NLSY79 survey. There are two NLSY79 codebook supplements. The first supplement, the NLSY79 Codebook Supplement, contains a series of attachments and appendices, variable creation procedures, supplementary coding categories, and derivations for selected variables on the main NLSY79 data files. Information provided within this document is not available in the NLSY79 codebooks, nor will it be found on the documentation files on the NLSY79 data sets. The other supplement contains comparable information specific to the NLSY79 Geocode data files. The Technical Sampling Report describes the selection of the NLSY79 sample and provides additional statistical information. Finally, the School & Transcript Surveys Documentation provides technical information about those special data collections.
Prior to working with an NLSY79 data file, users should make every effort to acquire information on current data or documentation errors. A variety of methods are used to notify users of errors in the data files or documentation and to provide those persons who acquired an NLSY79 data set directly from the Center for Human Resource Research with corrected information.
Errata can be accessed by following links for the cohort of interest.
This appendix provides two sets of details about asset and debt variables within the NLSY79:
Revision Process for Asset and Debt Variables
In the spring of 2008, a revised set of asset and debt variables was released to the public. These revised asset and debt variables fixed a number of problems with the NLSY79 data by eliminating some implausible outliers, generating uniform topcodes for all rounds, and constructing a total net worth variable. This section provides details on the revision process.
What Users See: Prior to the spring 2008 release users saw a single asset or debt question for each item in the wealth section of the questionnaire. For example, in 1987 the questionnaire asked each respondent who owned a home or apartment the market value of their residential property. The questionnaire asked respondents "About how much do you think this property would sell for on today's market?" Until the spring of 2008, the respondent answers were found in a single 1987 variable that had the following R and Q numbers:
R23627.00 [Q1947] (TRUNC)
After the revision was done, two more asset variables were added to the data set based on the same underlying property responses. The two new variables are
R23627.01 [*Created] (TRUNC) (REVISED)
R23627.02 [*Created] (TRUNC) (IMPUTED)
What is the difference between R23627.00, R23627.01 and R23627.02? The variable that ends in (.00), R23627.00, is the original variable in the dataset and is left so that researchers can reproduce previous results. The variable that ends in (.01), R23627.01, is a new variable that uses a revised topcoding algorithm (see Step 6 below). By revising the variable, researchers are now provided with extra information previously unavailable. The variable that ends in (.02), R23627.02, is a new variable that imputes missing and unknown responses where possible and also uses the revised topcoding algorithm.
There are two new variables because some users will not want to use imputed data. The (.01) variables are cleaned and re-topcoded but do not have any imputed values. The (.02) variables have as many missing or unknown values imputed as possible. In general, the survey staff recommends that users without a strong preference should use the (.02) asset or debt value that ends in the label "(TRUNC) (IMPUTED)."
Table 1 gives an example using the 1987 property value question of how seven different types of cases were handled by the revision and imputation process. Please note the "$" symbols are not in the NLSY79 data but are added to make it easier to read the table.
Table 1. Examples of How NLSY79 Asset/Debt Data Were Modified
|Public ID||Original R23627.00||Revised R23627.01||Imputed R23627.02||Explanation|
|200||$150,001||$276,984||$276,984||Originally above the topcode and the value is still above the topcode but the topcode is now higher, revealing more information.|
|40||$150,001||$153,000||$153,000||Originally above topcode and now below topcode. Value is no longer topcoded.|
|9083||-1||-1||$93,333||Originally a 'refused' response. Now contains an imputed value.|
|205||-2||-2||$276,984||Originally a 'don't know' response. Now contains an imputed value, but since the imputed value is above the topcode, the topcode is used as the value.|
|526||-3||-3||$100,000||Originally an invalid skip. Now contains an imputed value.|
|2||-4||-4||$0||Originally a valid skip. Since a valid skip means the respondent does not have the asset, the item is changed to zero.|
|1336||-2||-2||-2||Originally a 'don't know' response. Since it was not possible to impute value, the value was left as a 'don't know' response.|
Not every asset or debt variable has a new revised or imputed offspring. Instead, to keep the project manageable, only 15 asset/debt categories were created in each year. These 15 categories match up exactly with the categories found in the NLSY79 wealth module that was used in the 1990s. The categories are: Home Value, Mortgage Value, Property Debt Value, Cash Saving, Stocks/Bonds, Trusts, Business Assets, Business Debts, Car Value, Car Debt, Possession Value, Other Debt Value, IRA, 401K, Certificate of Deposit Value. However, starting in 2000 and then in 2004, the wealth module became more complex. For these later rounds each asset/debt category corresponds to multiple individual asset/debt variables.
For example, in 2004 respondents were asked to report the values of two homes. Their values are combined to form the "home value" category. Similarly, in 2004 the "stocks/bonds" category represents the individually-reported values of government bonds, mutual funds, life insurance surrender values, stocks, corporate bonds, and money owed to the respondent.
Details of the Revision and Imputation
In addition to creating these combined variables, the NLS asset and debt revision project carried out six other steps. This six-step process started off with cleaning the raw data and culminated in a new net worth variable and new top coding for most respondents. The details of the six steps are as follows:
Step 1 -- Cleaning Raw Data
The original raw data has a number of out-of-range codes. These out-of-range codes were originally given the topcode value when released to the public. Examination of the out-of-range cases suggests most of these out-of-range flags were data entry mistakes and not actually out of range. Most of these out-of-range codes occurred in the 1988 and 1989 surveys, but this issue arises in other PAPI years. All out-of-range codes were changed to an "invalid skip" (-3) in the revised (.01) variables. If possible, these variables were then imputed in the (.02) variable. Researchers are able to determine which items were incorrectly marked as out of range by looking for items that were top coded originally and then changed to a -3 value in the revised (.01) variables.
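The check described at the end of Step 1 can be sketched as a simple comparison of the original and revised series. The example assumes the original topcode value for the item is known (here $150,001, the value shown for the 1987 property question); the case labels and the helper `flagged_out_of_range` are made up for illustration.

```python
INVALID_SKIP = -3  # code given to out-of-range entries in the revised (.01) series


def flagged_out_of_range(original, revised, original_topcode):
    """True if an item was topcoded originally and is now an invalid skip."""
    return original == original_topcode and revised == INVALID_SKIP


# Hypothetical cases: (.00) value and (.01) value for the same item.
cases = [
    {"case": "A", "original": 150_001, "revised": 276_984},  # still a real value
    {"case": "B", "original": 150_001, "revised": -3},       # out-of-range flag
    {"case": "C", "original": -3, "revised": -3},            # invalid skip all along
]
flagged = [
    c["case"]
    for c in cases
    if flagged_out_of_range(c["original"], c["revised"], 150_001)
]
```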
Step 2 -- Unfolding brackets
Unfolding brackets were used for four asset/debt categories in 2000 and for all categories in 2004. These unfolding brackets were not used prior to 2000. Unfolding brackets are used if a respondent fails to report a particular asset's or liability's value. For example, suppose a respondent refuses or does not know the value of his certificate of deposit (CD). The respondent is first asked if his CD is worth more than an entry amount, which is $10,000 for some respondents and $20,000 for others. If the value is not above the entry amount, the respondent is asked if the value of his CD is $5,000 or more. If the value is above the entry amount, he is asked if the value would amount to $30,000 or more. These three questions result in four potential reported ranges: below $5,000; between $5,000 and the entry amount; between the entry amount and $30,000; and above $30,000.
Whenever an unfolding bracket is used, we replaced the reported range with the median value among respondents whose reported value falls in the given range. For example, respondents who revealed via unfolding brackets that their CDs are valued below $5,000 were assigned the median CD value among all respondents who report directly (not via unfolding brackets) a value between $0 and $4,999. The 2004 median values used for each bracket for each asset/debt category are shown in Table 2 and Table 3.
Table 2. 2004 Median Values Used to Impute Unfolding Brackets
|Asset/debt item||Low||Middle 1||Middle 2||High|
|Items worth more than $1k||$2,000||$7,000||$15,000||$50,000|
|Credit card debt||$1,500||$7,000||$15,000||$40,000|
|Student loans for R/SP||$1,500||$7,000||$15,000||$40,000|
|Student loans for children||$2,300||$8,000||$15,000||$35,000|
|Debt to business||$700||$7,000||$16,000||$50,000|
|Money owed to R||$1,000||$6,000||$16,000||$50,000|
|2nd Home value||$3,000||$10,000||$20,000||$140,000|
|2nd Property debt||$1,500||$10,000||$20,000||$90,000|
|Value of cars, trucks||$2,200||$9,000||$20,000||$40,000|
|Debt on cars, trucks||$2,500||$8,000||$17,000||$32,000|
|Value of other vehicles||$3,000||$8,000||$16,000||$45,000|
|Debt of other vehicles||$3,000||$8,000||$16,500||$50,350|
|R retirement plan||$2,000||$8,000||$20,000||$61,500|
|Spouse retirement plan||$2,000||$8,000||$20,000||$70,000|
Table 3. 2004 Median Values Used to Impute Unfolding Brackets for Retirement Items
|Other tax advantaged plans||$6,000||$50,000|
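The unfolding-bracket procedure from Step 2 can be sketched using the CD example: the answers to the follow-up questions are mapped to a dollar range, and the range is then replaced by the median of directly reported values that fall inside it. The function names, the sample of direct reports, and the decision to treat each range as including its lower bound and excluding its upper bound are all assumptions of this sketch.

```python
from statistics import median


def bracket_for(above_entry, follow_up_yes, entry, low_cut=5_000, high_cut=30_000):
    """Map unfolding-bracket answers to a (lower, upper) range; upper None = open."""
    if above_entry:
        # Follow-up asked whether the value is $30,000 or more.
        return (high_cut, None) if follow_up_yes else (entry, high_cut)
    # Follow-up asked whether the value is $5,000 or more.
    return (low_cut, entry) if follow_up_yes else (0, low_cut)


def bracket_median(direct_reports, lower, upper):
    """Median of directly reported values inside the range, or None if empty."""
    in_range = [
        v for v in direct_reports if v >= lower and (upper is None or v < upper)
    ]
    return median(in_range) if in_range else None


# Hypothetical CD values reported directly by other respondents.
direct_reports = [1_000, 2_500, 4_000, 6_000, 12_000, 25_000, 40_000]

lower, upper = bracket_for(above_entry=False, follow_up_yes=False, entry=10_000)
imputed = bracket_median(direct_reports, lower, upper)
```

Here the respondent revealed a value below $5,000, so the imputed amount is the median of the three direct reports under that threshold.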
Step 3 - Bracketing Interpolation of Items
The next step was to impute missing item values using a simple algorithm that takes advantage of the longitudinal aspect of the NLSY79 data. We linearly interpolated any missing value that had a set of bracketing values available, by which we mean known values from any "before" interview and any "after" interview. A "missing value" refers to any situation where the respondent reports holding a particular asset/debt, but does not report its value (directly or via unfolding brackets).
There are two bracketing cases. The first is when bracketing values are available for cash-related asset/debt categories (cash savings, stocks/bonds, trusts, other debt, IRAs, 401ks, CDs). In this case we considered as a valid bracketing value any instance when the respondent reports holding this asset/debt and gives a value, or any instance where the respondent reports not having this asset/debt, in which case we assign a value of zero.
The second bracketing case is for property-related asset/debt categories (home value, mortgage, property debt, business assets, business debt, car value, car debt, possessions). Unlike in the first case, we only used as a valid bracketing value any instance when the respondent reports holding this asset/debt and gives a value. If the respondent reports not having this asset/debt, we did not assign a value of zero to the bounding observation, as that would be an improbably low valuation to use for interpolation. For example, suppose the respondent reports owning a house in years t+4 and t+8 but not in year t (recall asset questions are asked every four years). The respondent gives a house value in t+8 but refuses to provide this information in t+4. Because he/she didn't own a house in t, the value of the house owned is zero in t. To estimate house value in t+4 by averaging the respondent's house values in t and t+8 would surely generate an underestimate. Hence, we do not use the zero house value in t as a bounding observation.
If the missing value was not centered between two known values, the imputed value is linearly interpolated between the two. This algorithm mirrors the procedure used in the Netherlands Socio-Economic Panel for their asset and debt data.
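The two bracketing cases and the interpolation itself can be sketched as follows. The category names come from the lists in Step 3; the function names and the dollar amounts in the example are hypothetical, and this is an illustration of the rule as described rather than the project's actual code.

```python
CASH_RELATED = {
    "cash savings", "stocks/bonds", "trusts", "other debt", "IRAs", "401ks", "CDs",
}


def bracketing_value(category, holds_item, reported_value):
    """Return a usable bracketing value, or None if this interview cannot be used."""
    if holds_item:
        return reported_value  # None when the respondent gave no value
    if category in CASH_RELATED:
        return 0  # not holding a cash-related item counts as a zero value
    return None  # property-related items: zero is not used as a bound


def interpolate(year, before, after):
    """Linearly interpolate a value for `year` between two (year, value) pairs."""
    (y0, v0), (y1, v1) = before, after
    return v0 + (v1 - v0) * (year - y0) / (y1 - y0)


# Hypothetical home values: known in 1990 and 1998, missing in 1992.
estimate = interpolate(1992, before=(1990, 80_000), after=(1998, 120_000))
```

Because 1992 sits a quarter of the way between the two known interviews, the estimate moves a quarter of the way from the earlier value to the later one.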
Step 4 - Linear Extrapolation of Items
The primary drawback to the above bracketing interpolation is that it provides no method of estimating an item's value if the item is either the first or last in a series. For example, if a respondent provides information on his car's value in 1985, 1986, and 1987, states he does not know its value in 1988 and then drops out of the survey, there is no bracketing observation for 1988.
To estimate these missing starting and ending points we fit the known data using ordinary least squares (OLS) and then extrapolated to determine the missing value. The (respondent-specific) regression we estimated was:
$$\text{Item Value}_{it} = a_i + b_i \, \text{Year}_{it} + u_{it},$$
using non-missing values for this asset/debt item for respondent i. As an example, assume a respondent stated he owned a vehicle in 1985 but did not know its value. Then assume this respondent in 1986, 1987 and 1988 said his vehicle was worth $14,000, $10,000 and $8,000 respectively. The OLS imputation regression would be run with the following values:
|Year||Item Value|
|1986||$14,000|
|1987||$10,000|
|1988||$8,000|
The resulting computation for the missing year (1985) is:
Item Value = 5,971,666.5 - (3,000 * 1985) = $16,666.50,
so we used $16,666.50 as the imputed value for 1985. Because the NLSY79 data do not contain any fractional values, all cents values were rounded.
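The worked example can be reproduced with a closed-form least squares fit on the three known vehicle values. This is a minimal sketch in plain Python; the helper `ols_line` is illustrative. Computed in double precision, the intercept comes out near 5,971,666.67 and the 1985 estimate near $16,666.67, a few cents away from the figures quoted above, which presumably reflects rounding in the original calculation.

```python
def ols_line(points):
    """Fit value = intercept + slope * year by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Known vehicle values from the example: 1986, 1987, and 1988.
known = [(1986, 14_000), (1987, 10_000), (1988, 8_000)]
intercept, slope = ols_line(known)

# Extrapolate back to the missing year.
estimate_1985 = intercept + slope * 1985
```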
We imposed two types of restrictions on the data used for each respondent-specific regression. First, we required that two or more non-missing values be available. A non-missing value for cash-related asset/debt categories (cash savings, stocks/bonds, trusts, other debt, IRAs, 401ks, CDs) is any value reported by the respondent or a zero value if the respondent states they do not have this asset/debt. For property-related asset/debt categories (home value, mortgage, property debt, business assets, business debt, car value, car debt, possessions) only values reported by the respondent are used. Users should note that this mimics the two types of bracketing strategies described in step 3. Any values imputed from steps 2-3 are treated as non-missing values and used in the regression.
Second, to run the regression the respondent must also have reported an item value in the next closest wealth interview to ensure our estimates are relatively precise. For example, if the respondent did not know the value of his vehicle in 2004, we must have a known (reported or imputed) vehicle value in 2000 (the closest year, given that asset data were not collected in 2002) for the imputation to occur.
We also imposed additional restrictions on the predicted values that arise from these respondent-specific regressions. If the predicted value arising from the regression is negative, we did not use it as an imputation because respondents cannot report a negative asset or debt. In addition, if the predicted value is more than twice as large as the item value reported in the nearest year, it was not used as an imputation and no value is created. These rules are designed to ensure our extrapolated values are not too extreme relative to the other observations.
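The restrictions on the regression inputs and on the predicted values can be expressed as two small checks. This is a sketch under stated assumptions: the function names are hypothetical, `known` is assumed to map interview years to values that were reported or imputed in steps 2-3, and the dollar amounts are made up.

```python
def can_extrapolate(known, target_year, wealth_years):
    """Check the two data restrictions before running the regression."""
    if len(known) < 2:
        return False  # need two or more non-missing values
    others = [y for y in wealth_years if y != target_year]
    nearest = min(others, key=lambda y: abs(y - target_year))
    return nearest in known  # a value must exist in the closest wealth interview


def usable_prediction(predicted, nearest_value):
    """Check the two restrictions on the extrapolated value."""
    if predicted < 0:
        return False  # negative assets or debts cannot be reported
    if predicted > 2 * nearest_value:
        return False  # more than twice the value in the nearest year
    return True


# Car value reported in 1985-1987 (hypothetical amounts), unknown in 1988.
wealth_years = [1985, 1986, 1987, 1988]
known = {1985: 9_000, 1986: 8_000, 1987: 7_000}
ok_to_run = can_extrapolate(known, 1988, wealth_years)
ok_to_use = usable_prediction(6_000, known[1987])
```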
User Note: It is important to understand that the revised variables will potentially change with the addition of each subsequent round of wealth data. Revisions can occur because additional data will sometimes allow us to impute a missing value via step 3 (bracketing interpolation) rather than step 4 (linear extrapolation). This situation is similar to the NLSY79 work history data, which sometimes change when new information becomes available.
Step 5 - Creating Total Net Worth Variables
The new data also include a "created net worth" variable for each survey year in which an asset module was fielded. This series was computed simply by combining the revised asset and debt series using the following equation for each respondent:
NET WORTH = HOME VALUE - MORTGAGE - PROPERTY DEBT + CASH SAVING + STOCKS/BONDS + TRUSTS + BUSINESS ASSETS - BUSINESS DEBT + CAR VALUE - CAR DEBT + POSSESSIONS - OTHER DEBT + IRAs + 401Ks + CDs.
If any of the revised items are missing because they could not be imputed, the computed net worth variable is set to missing. Note that each respondent is asked about 15 types of assets and debts in each round. There might be some types of assets/debts that the respondent reports not holding, some where he gives a value, some where we impute a value, and some where we are unable to impute a value. If any asset/debt falls into the last category, we do not compute the total net worth variable for that particular respondent-year case. While we do not compute a total net worth in these cases, the revised series are designed to let researchers do it easily, in part because all respondents who do not own an asset or debt have a zero in the revised series, instead of a -4.
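The net worth equation and its missing-value rule can be sketched as follows. The 15 category names are taken from the equation above; representing each respondent-year as a dictionary, and treating an absent key or any negative code as "could not be imputed", are assumptions of this example.

```python
ASSETS = [
    "home value", "cash saving", "stocks/bonds", "trusts", "business assets",
    "car value", "possessions", "IRAs", "401Ks", "CDs",
]
DEBTS = ["mortgage", "property debt", "business debt", "car debt", "other debt"]


def net_worth(items):
    """Sum assets minus debts; return None if any of the 15 items is missing."""
    total = 0
    for name in ASSETS + DEBTS:
        value = items.get(name)
        if value is None or value < 0:
            return None  # one missing item makes net worth missing
        total += value if name in ASSETS else -value
    return total


# Hypothetical respondent: items not held are zero in the revised series.
respondent = dict.fromkeys(ASSETS + DEBTS, 0)
respondent.update(
    {"home value": 120_000, "mortgage": 70_000, "car value": 8_000, "cash saving": 2_000}
)
complete_case = net_worth(respondent)

# Same respondent with a "don't know" (-2) that could not be imputed.
incomplete_case = net_worth(dict(respondent, trusts=-2))
```

A researcher who prefers a partial total could replace the early return with a different rule; the zero coding of items not held is what makes that straightforward.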
Step 6 -- Revision of Top Codes
The last step calculated new and consistent top codes for the wealth data.
The NLSY79 has used three basic types of top coding algorithms for financial data. In the early years of the survey (up to 1988), every answer to NLSY79 questions that resulted in a response above a specified cutoff value, such as $100,000 for some variables, was recoded to the truncation value plus one dollar, such as $100,001. Unfortunately this algorithm results in a sharp downward bias in the sample mean because the right tail of the distribution is truncated. In the middle years (1989 to 1994) a new algorithm was implemented, replacing all values above the hard cutoff with the average of all outlying values. Starting with the 1996 data, a third approach was used. In this approach the hard cutoff was eliminated and the cutoff became the value which would shield the top two percent of respondents. All values in the top two percent were averaged and that averaged value replaced values above the top code.
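The effect of the first two algorithms on the sample mean can be seen in a small sketch. The five dollar amounts are made up, and the function names are illustrative; only the $100,000 cutoff comes from the text.

```python
def topcode_plus_one(values, cutoff):
    """Early-years rule: values above the cutoff become cutoff + 1."""
    return [cutoff + 1 if v > cutoff else v for v in values]


def topcode_mean_above(values, cutoff):
    """Middle-years rule: values above the cutoff become their own average."""
    above = [v for v in values if v > cutoff]
    if not above:
        return list(values)
    mean_above = sum(above) / len(above)
    return [mean_above if v > cutoff else v for v in values]


def mean(values):
    return sum(values) / len(values)


# Hypothetical reported amounts, two of which exceed the cutoff.
values = [20_000, 50_000, 90_000, 150_000, 250_000]

true_mean = mean(values)
truncated_mean = mean(topcode_plus_one(values, 100_000))
averaged_mean = mean(topcode_mean_above(values, 100_000))
```

The truncation rule pulls the mean well below the true value, while replacing outliers with their average leaves the mean unchanged and still hides individual amounts.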
Because the NLSY79 has used a variety of methods, because a number of researchers have complained about the lack of information above a hard cutoff, and because the data cleaning steps dramatically changed a number of the highest values, we re-topcoded the home and vehicle values. These two categories remain topcoded because homes and vehicles are clearly identifiable objects that could be used to re-identify respondents. Other asset or debt categories are no longer topcoded because it is difficult to use them to identify a particular respondent.
If the variable was previously topcoded, we re-topcoded it using the top 2% described above. When calculating the top 2%, we did not include individuals whose values were set to zero because they did not own the item or have the debt.
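A minimal sketch of the top 2% rule, with non-owners left out of the calculation, might look like the following. Several details are assumptions made for illustration: the function name, the use of `None` for missing values, rounding the size of the top group down, treating all values as non-negative, and replacing any value tied with the cutoff.

```python
import math
from typing import List, Optional


def retopcode_top_two_percent(
    values: List[Optional[float]], share: float = 0.02
) -> List[Optional[float]]:
    """Replace the top `share` of owners' values with their average.

    Zeros (non-owners) and None (missing) are ignored when locating
    the top group and are returned unchanged.
    """
    owners = sorted(v for v in values if v is not None and v != 0)
    k = math.floor(len(owners) * share)  # size of the top group (assumed rule)
    if k == 0:
        return list(values)
    top = owners[-k:]
    cutoff = top[0]                # smallest value inside the top group
    replacement = sum(top) / k     # average of the top group
    return [
        replacement if (v is not None and v >= cutoff) else v
        for v in values
    ]
```

Excluding the zeros matters: if non-owners were counted, the top group would be a larger number of cases and the cutoff would fall much lower in the distribution of actual owners.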
Details about Computed Net Worth Variables
The following information describes the intermediate sets of variables that were used to create the high-level net worth figures. These are currently the only NLSY79 variables that are imputed, meaning that missing information has been filled in. The intermediate sets of variables are useful for researchers who want to probe a particular aspect of a respondent’s financial life, such as their debts or ownership of vehicles.
The asset and debt variables can be thought of as a pyramid with either three or four levels. The base of the pyramid contains a large number of raw variables, which ask respondents if they have a particular asset or liability. The middle layers of the pyramid contain summary categories of asset and debt variables. The top layer of the pyramid contains the net worth of the respondent and their family. This document describes the middle layer’s summary categories.
When the first Net Worth calculations were done during the late 1990s, the NLSY79 asset and debt questions were relatively simple. There were 15 mid-level groups of questions that asked each respondent to report if they owned an asset or debt and then asked the particular value. The 15 mid-level categories were as follows:
1) Home value
2) Mortgage debt
3) Other residential debt
4) Value of farm/business/real estate
5) Debts of farm/business/real estate
6) Market value of vehicles
7) Debt of vehicles
8) Value of stocks/bonds/mutual funds
9) Value of CDs
10) Value of trusts
11) Value of IRAs
12) Value of 401ks and 403bs
13) Value of cash savings
14) Value of other assets like jewelry/collections
15) Value of all other debts like credit cards/student loans
Each of the 15 mid-level categories corresponded directly to questions asked in the mid-1990s. Questions asked in the 1980s followed the same format but asked about fewer than 15 categories. Table 4 shows the reference numbers, question names and titles from the 1996 survey for each of the fifteen groups:
Table 4. Underlying Data in Round 17 in 1996 Used to Create 15 Mid-Level Asset and Debt Variables
| Category | Qname | Title |
|---|---|---|
| HOME VALUE | NFA_1A_IMPUTED | Mkt Val Res Property R-Sp Own 96 |
| MORTGAGE | NFA_1B_IMPUTED | Amount R-Sp Owe On Res Property 96 |
| PROPERTY DEBT | NFA_1C_IMPUTED | Amt Oth Debt R-Sp Owes On Res Prop 96 |
| CASH SAVING | FA_1A_IMPUTED | Amount In Savings Accounts 96 |
| IRAs | FA_8_IMPUTED | Amount In Iras-Keough 96 |
| 401Ks | FA_6_IMPUTED | Amount In Tax-Defrd Plans 96 |
| CDs | FA_2A_IMPUTED | Amount In CDs, Loans, Mortg 96 |
| STOCKS/BONDS | FA_9A_IMPUTED | Mkt Val Of Stocks, Bonds R-Sp Have 96 |
| TRUSTS | TRUST IMPUTED | Total Val Of Estate, Invest Trust 96 |
| BUSINESS ASSETS | Q13-131_IMPUTED | Ttl Mkt Val Farm, Bsns, Oth Prop? 96 |
| BUSINESS DEBT | Q13-132_IMPUTED | Ttl Amt Debts, Liablty Farm, Bsns 96 |
| CAR DEBT | NFA_4F_IMPUTED | Amt R-Sp Owe On Vehicles 96 |
| CAR VALUE | NFA_4C_IMPUTED | Mkt Val Of Vehicles R-Sp Own 96 |
| POSSESSIONS | NFA_6E_IMPUTED | Ttl Mkt Val Items Over $500 96 |
| OTHER DEBT | DEBT_1A_IMPUTED | Total Amt R-Sp Owe To Creditors 96 |
The year 2000 was a transition year for the wealth module. In this year, extra questions were asked of respondents who stated they did not know particular values. These extra questions asked respondents to provide rough brackets if possible. For example, if a respondent stated they had stocks, bonds, or mutual funds but did not know the exact value, they were asked whether the amount fell in one of the following ranges: less than $1,000, between $1,000 and $4,000, between $4,000 and $15,000, or above $15,000. In addition, respondents who owned a farm, business, or investment real estate were asked what percentage of it they owned. The percentage question allowed for individuals to be co-owners and partial owners. Previously, the asset and debt questions assumed the individual always owned the entire business.
The Qnames for the year 2000 followed the same pattern as the data shown in Table 4. For example, the year 2000 stock/bond/mutual fund question is qname Q13-125. Bracketing questions associated with a qname had the letters A, B, C appended. So, in this example, the stock, bond and mutual fund bracketing questions are found in Q13-125A, Q13-125B, and Q13-125C.
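The bracket ranges and the bracketing qname pattern described above can be sketched as follows. Both helper names are hypothetical, and the handling of amounts that fall exactly on a boundary ($1,000, $4,000, or $15,000) is an assumption, since the ranges as stated share their endpoints.

```python
from typing import List


def stock_bracket(amount: float) -> str:
    """Assign an amount to one of the year 2000 stock/bond/mutual fund ranges.

    Boundary handling is assumed, not documented.
    """
    if amount < 1_000:
        return "less than $1,000"
    if amount < 4_000:
        return "between $1,000 and $4,000"
    if amount <= 15_000:
        return "between $4,000 and $15,000"
    return "above $15,000"


def bracket_qnames(qname: str, suffixes: str = "ABC") -> List[str]:
    """Build the bracketing qnames by appending A, B, C to a root qname."""
    return [qname + suffix for suffix in suffixes]


# The stock/bond/mutual fund example from the text.
stock_bracket_qnames = bracket_qnames("Q13-125")
```

The actual set of bracketing qnames for any given question should be confirmed in the NLS Investigator.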
Then, starting with the year 2004 survey, the number of questions greatly expanded and the wealth module became much more complex. For these later rounds, each asset/debt category corresponds to multiple sets of individual variables. To keep the project manageable, the same mid-level asset/debt categories were created in each year even though the number of questions was expanded.
In addition, questions about trusts were dropped, since few people appeared to be trust fund recipients. Because the NLS survey staff had discussed adding the trust questions back in, a high-level trust variable was included so that there would be a continuous sequence of wealth variables. However, to date (2016), the underlying trust questions have not been brought back, so these high-level trust variables contain only a value of zero or -5 from 2004 onward.
For example, in 2004 respondents were asked to report the values of up to two primary homes, instead of just one home. The two values were then combined to form the “home value” input to the total net worth calculation. Similarly, the “stocks/bonds” category in the total net worth formula in the 2004 survey represents the individually-reported values of government bonds, mutual funds, life insurance surrender values, stocks, corporate bonds, and money owed to the respondent. In addition, most asset/debt categories contain questions that include unfolding brackets and other methods to improve respondent response to difficult questions when a respondent stated they did not know a precise answer.
As an example of the complexity, Table 5 shows the nine NLSY79 variables used in 2004 to create the home value variable, NFA_1A_Imputed “Market Value of Residential Property R/Spar Own (Trunc) (Imputed).”
Table 5: Underlying Data Used in 2004 Survey to Create Home Value Variable
NFA_1A_IMPUTED “MARKET VALUE OF RESIDENTIAL PROPERTY R/SPAR OWN (TRUNC) (IMPUTED)”
Mkt Val Res Property R/Sp Own 2004
Est Market Value Of Residential Property R/Spar Own
Est Market Value Of Residential Property R/Spar Own
Market Value Of Residence In 2003 More Than Entry Amount
Market Value Of Residence In 2003 More Than $30k
Market Value of (2nd) Residential Property R/Spouse Own
Est Market Value Of (2nd) Residential Property R/Spar Own
Market Value Of (2nd) Residence In 2003 More Than Entry Amount
Market Value Of (2nd) Residence In 2003 More Than $30k
Table 6 shows all of the key variables used to generate the 14 mid-level categories starting with Round 19 in 2000. To keep Table 6 understandable, only the most important Qnames for each variable are shown. Additional questions, shown in Table 5, which asked respondents ranges and bracketing amounts, are not shown in Table 6. The range and bracketing variables have Qnames with the same root as--but different suffixes from--the items shown in Table 6.
Table 6: Data Underlying Mid-Level Asset and Debt Categories Used Beginning in Round 20
| Category / Imputed Variable | Underlying Variable Title | Qname |
|---|---|---|
| HOME VALUE / NFA_1A_IMPUTED | Value of 1st Home | |
| | Value of 2nd Home/Time Share | |
| MORTGAGE / NFA_1B_IMPUTED | Mortgage on 1st Home | |
| | Mortgage on 2nd Home/Time Share | |
| PROPERTY DEBT / NFA_1C_IMPUTED | Other Property Debt on 1st Home | |
| | Other Property Debt on 2nd Home/Time Share | |
| CASH SAVING / FA_1A_IMPUTED | Total Amount In Checking, Savings, And Money Market Funds | |
| STOCKS/BONDS / FA_9A_IMPUTED | Total Money If R-Spar Cashed In US Government Savings bonds | FA_3A |
| | Total Money If R-Spar Sold Mutual Funds | FA_4A |
| | Total Money If Insurance Policies Cashed | FA_5A |
| | Money R-Spar Have If Sold/Paid Amt Owe On Stock | FA_9A |
| | Amt Of $ If Cash/Pay Off Securities/ Bonds | FA_10A |
| | R-Spouse/Partner Owed Money From Personal Or Mortgage Loans | FA_11A |
| TRUSTS / TRUST IMPUTED | These variables contain only 0s and -5s since the | |
| BUSINESS ASSETS / Q13-131_IMPUTED | Market Value Of Farm in 2003, | |
| | Percentage Of Farm Owned By R Or Spouse | Q13-FJT-12B |
| | Market Value Of Business Professional Practice | Q13-BPPJT-11_TRUNC |
| | Percentage Of Professional Practice That R Owns | Q13-BPPJT-12B |
| | Market Value Of R Share Of Professional Practice | Q13-BPPJT-12E |
| | Market Value Of Additional Real Estate | Q13-REJT-11_TRUNC |
| | Percentage Of Real Estate R-Spouse Own | Q13-REJT-12B |
| | Market Value Of R Share Of Real Estate | Q13-REJT-12E |
| | Total Market Value Of | |
| BUSINESS DEBT / Q13-132_IMPUTED | Total Amount Of Debts Owed On Farm | Q13-FJT-12 |
| | Total Amount Of Debts Owed | |
| | Total Amount Of Debts | |
| CAR VALUE / NFA_4C_IMPUTED | Market Value Of Vehicle | NFA_4C_TRUNC |
| | Current Value Of Vehicle | SC_12A.01 |
| | Market Value Of Other Personal Use Vehicles | NFA_5C |
| CDs / FA_2A_IMPUTED | Total Money If R-Spouse Cashed In | |
| CAR DEBT / NFA_4F_IMPUTED | R Or Spouse Owe Money On Vehicle | NFA_4E |
| | Total Amount Owed By R-Spouse | |
| | Total Amount Owed By R-Spar | |
| | Balance Owed On Vehicle | |
| POSSESSIONS / NFA_6E_IMPUTED | Market Value Of Collections | |
| | Market Value Of Individual R-Spouse Items | |
| OTHER DEBT / DEBT_1A_IMPUTED | Total Balance Owed | |
| | Total Amount R-Spouse Owes | |
| | Total Amount Owed | |
| | Total Amount R-Spouse Owes To Other Businesses | |
| IRAs / FA_8_IMPUTED | Total Money If Tax Advantage Account Cashed | FA_8D_TRUNC |
| 401Ks / FA_6_IMPUTED | Total Value Of Emp-Sponsored Retiremt Plans | FA_6E |
| | Tot Balance Of Spar-Emp Sponsored Retiremt Plans | FA_7C |
Researchers who do not want to use the imputed values or who want to impute the values themselves can find all the original non-imputed respondent data in the NLSY79 dataset in the NLS Investigator (www.nlsinfo.org/investigator). Original variables are typically located very close to the questions which end in “IMPUTED” and often have the original qname without any suffix. One method for finding these questions is to determine the R number of the imputed value and then use the NLS Investigator to show the closest page of reference numbers.
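As a starting point for such a search, a candidate original qname can be derived by dropping the suffix from the imputed qname. The helper below is a hypothetical convenience, not an NLS tool, and its result is only a guess: the text says the original qname "often" lacks a suffix, so each candidate should be checked in the NLS Investigator.

```python
def candidate_original_qname(imputed_qname: str, suffix: str = "_IMPUTED") -> str:
    """Strip a trailing "_IMPUTED" to get a candidate original qname.

    Returns the input unchanged if it does not end with the suffix.
    """
    if imputed_qname.upper().endswith(suffix):
        return imputed_qname[: -len(suffix)]
    return imputed_qname


# Candidates for two of the imputed variables listed in the tables.
home_value_candidate = candidate_original_qname("NFA_1A_IMPUTED")
business_assets_candidate = candidate_original_qname("Q13-131_IMPUTED")
```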