To enhance my statistical and programming know-how, I set out to find data sets that would interest and/or be relevant to psychological researchers and utilize new (to me) programming languages.
One great resource I found, Open Psychology Data, includes ~35 data sets related to numerous personality measures, most with demographic and some with cognitive variables. All data are anonymous, and you're free to take the online questionnaire yourself if you'd like to familiarize yourself with the assessment. Data may be downloaded as a .csv file along with a .txt variable dictionary.
The Authoritarian Personality
I chose to work with the Right-Wing Authoritarian Scale (RWAS) dataset (given our current political climate), which includes additional personality and demographic information provided by ~10,000 respondents.
Authoritarianism is a personality trait characterized by a strong belief in absolute obedience and/or submission to external authority figures, typically extending to the administration of oppression and harsh punishment of those who do not conform to that authority (Adorno et al., 1950). A more recent conceptualization is Altemeyer's (1998) right-wing authoritarianism (RWA), which describes politically conservative followers who display three traits: 1) a high degree of submission to the established authorities in their society, 2) high levels of aggression in the name of those authorities, and 3) a high degree of conventionalism. To assess the degree of these traits, Altemeyer developed the Right-Wing Authoritarian Scale (RWAS), a 22-item self-report measure with item scores ranging from 1 to 9, with higher scores indicating higher degrees of RWA. (The full questionnaire and scoring procedure may be found here.)
Taking a look at the variable dictionary, you can see that the entire RWAS is available as well as the accompanying response times for each of the RWAS items. Additionally, a number of demographic variables as well as validity questions are exported. Finally, questions assessing basic personality dimensions are also accessible.
The Five-Factor Model of Personality
While a number of theoretical frameworks exist for understanding and categorizing personality, the most common and robust is the five-factor model of personality (McCrae & Costa, 1987). This model proposes that personality is best explained using 5 trait dimensions, with individuals falling at any point along each dimension. The five factors are: 1) Openness to new experiences, 2) Conscientiousness, 3) Extraversion, 4) Agreeableness and 5) Neuroticism (or, inversely, emotional stability).
A number of self-report measures have been developed to assess these 5 factors, typically with many items per dimension for the sake of accuracy. For brevity and usability, researchers have also developed much shorter assessments. Our current dataset uses the Ten Item Personality Inventory (TIPI), which uses just 10 Likert-scale questions to assess how strongly you identify with each of the personality domains (Gosling et al., 2003).
Importing and Preparing our Data
Now that we've covered a bit of theory, we'll begin by importing our libraries and data set (named 'data'):
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#import the dataset
data = pd.read_csv('/Users/lospina/Desktop/ML Tutorials/data.csv')
These libraries allow us to manipulate data (pandas), perform mathematical computations (numpy), and visualize our data (matplotlib).
However, we see a warning message when we import our dataset:
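The warning in question is pandas' DtypeWarning; depending on your pandas version, it reads something like:
DtypeWarning: Columns (73) have mixed types. Specify dtype option on import or set low_memory=False.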
Some internet sleuthing determined that pandas couldn't settle on a single data type for column 73: it reads large files in chunks, and this column came back with mixed types. Working with a mixed-type column is slower and ambiguous, so to avoid any computational lag we can explicitly assign a type to this variable and pandas will no longer have to guess.
#What is column 73?
data.iloc[:,73]
#Convert 'IP_country' into a string variable (assign the result back so the change sticks)
data['IP_country'] = data['IP_country'].astype(str)
#Let's look at our data
data
We see that column 73 is the country of respondent origin. We'll convert it to a string variable, until we decide what to do with this variable next. Looking at our dataset, we can see there are 90 variables and 9,881 observations (practically unheard of for psychology researchers)!
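As an aside, you can avoid the warning altogether by telling pandas the column's type (or turning off chunked, low-memory parsing) when the file is read. Here is a minimal sketch, assuming the same file path:
#Alternative: declare the column's type at import time so pandas never has to guess
data = pd.read_csv('/Users/lospina/Desktop/ML Tutorials/data.csv',
                   dtype={'IP_country': str})
#Or simply read the whole file in one pass: pd.read_csv(..., low_memory=False)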
Missing Values
We can now start cleaning our data. Let's begin by counting how many missing (or null) values each variable has. In Python, missing values appear as "NaN". If a variable is missing 20% or more of its values, it is probably not informative and can be removed from the data. Here, our respondents' country of origin (IP_country) and major are missing 8,670 and 3,423 values, respectively; let's remove both columns:
#Count number of missing values
data.isnull().sum()
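#Or view the missing counts as a share of all rows (handy for the 20% rule of thumb)
data.isnull().mean() * 100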
#Remove columns with missing data
data = data.drop(['IP_country', 'major'], axis=1)
#Descriptive analysis
data.describe()
We can run a basic descriptive analysis on our data to see each variable's minimum and maximum scores (and spot potential outliers). Doing so reveals a problem:
The minimum values for our authoritarian variables of interest (RWAS variables denoted as 'Q') are zero; however, based on the measure's scoring rubric, the lowest possible value for these questions is '1'. If you look at the time spent answering the RWAS questions (denoted as 'E' variables), you can see that these values are also zero. What likely happened is that respondents skipped these questions, whether unintentionally or intentionally. We will have to identify anyone who skipped all of the questions and remove those observations, and also count how many '0' values there are for each 'Q' question.
#Find rows that contain zero for all 'Q' questions: start by viewing rows where Q1 is zero
data.loc[data['Q1'] == 0]
#Drop these three rows by their position (index) in the dataset:
data = data.drop(data.index[[599, 6031, 6242]])
#Let's count how many rows we have where at least one zero occurs for each of
#the Q variables:
data[['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11', 'Q12',
'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
'Q22']].eq(0).sum()
#How many total rows have at least one zero value in the Q variables:
data.loc[(data['Q1']==0) | (data['Q2']==0) | (data['Q3']==0) | (data['Q4']==0) |
(data['Q5']==0) | (data['Q6']==0) | (data['Q7']==0) | (data['Q8']==0) |
(data['Q9']==0) | (data['Q10']==0) | (data['Q11']==0) | (data['Q12']==0) |
(data['Q13']==0) | (data['Q14']==0) | (data['Q15']==0)| (data['Q16']==0) |
(data['Q17']==0) | (data['Q18']==0) | (data['Q19']==0) | (data['Q20']==0) |
(data['Q21']==0) | (data['Q22']==0)]
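As a side note, the same filter can be written far more compactly by comparing all the Q columns at once; a minimal sketch that returns the same rows:
#Equivalent, more compact version of the filter above
q_cols = ['Q' + str(i) for i in range(1, 23)]
data.loc[(data[q_cols] == 0).any(axis=1)]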
When we pull up the rows that contain zero values under Q1, we see three rows that contain zeroes for all Q questions. These individuals skipped every question; we can remove them from our data set by dropping them using their position (i.e., index) in the dataset.
When we count how many zero values occur for each Q question, we can see that the largest count for any single question is 29, well below our 20% cut-off for missing values. However, just to be sure, counting the total number of rows with a zero value in any of the Q questions reveals 198 such rows, still below our 20% rule of thumb. Rather than removing these people from the dataset altogether, we can replace their zeros with a measure of central tendency. For Likert-scale surveys, one can replace missing values with the variable's mean (Downey & King, 1998), which we will do here.
#Replace all instances of '0' in the Q variables with NaN
data[['Q1','Q2','Q3','Q4','Q5','Q6','Q7','Q8','Q9','Q10','Q11','Q12','Q13',
'Q14','Q15','Q16','Q17','Q18','Q19','Q20','Q21','Q22']] = data[['Q1','Q2',
'Q3','Q4','Q5','Q6','Q7','Q8','Q9','Q10','Q11','Q12','Q13','Q14','Q15',
'Q16','Q17','Q18','Q19','Q20','Q21','Q22']].replace(0, np.nan)
#Fill missing values with the mean of each column
data.fillna(data.mean(), inplace=True)
#Double-check: View descriptives of your numerical variables
pd.set_option('display.max_columns',90)
data.describe(include='all')
You will find that our personality measure (the TIPI) has the same issue: descriptive analyses demonstrate a minimum score of '0', yet answers should range from 1 to 7. We should run the same procedure here: count the zero values, and if they number less than 20% (which they do), conduct a mean imputation so that we don't lose these observations.
#How many 0's are there in each of the TIPI variables?
data[['TIPI1', 'TIPI2', 'TIPI3', 'TIPI4', 'TIPI5', 'TIPI6', 'TIPI7', 'TIPI8',
'TIPI9', 'TIPI10']].eq(0).sum()
#How many total rows have at least one zero value in the TIPI variables?
data.loc[(data['TIPI1']==0) | (data['TIPI2']==0) | (data['TIPI3']==0) |
(data['TIPI4']==0) | (data['TIPI5']==0) | (data['TIPI6']==0) |
(data['TIPI7']==0) | (data['TIPI8']==0) | (data['TIPI9']==0) |
(data['TIPI10']==0)]
#Replace all instances of '0' in the TIPI variables with NaN
data[['TIPI1','TIPI2','TIPI3','TIPI4','TIPI5','TIPI6','TIPI7','TIPI8','TIPI9',
'TIPI10']] = data[['TIPI1','TIPI2','TIPI3','TIPI4','TIPI5','TIPI6','TIPI7',
'TIPI8','TIPI9','TIPI10']].replace(0, np.nan)
#Fill missing values with the mean of each column
data.fillna(data.mean(), inplace=True)
#Double-check: View descriptives of your numerical variables
pd.set_option('display.max_columns',90)
data.describe(include='all')
Additional demographic variables were collected, and should also be reviewed. Most of these variables are categorical, yet are numerically coded in the dataset. This is actually a good thing since we will need them to be numerical once we start building our prediction models. However, again we find the same issue wherein zeros are present, which represent missing data. If you wish to conduct a data imputation, remember that the mean is not appropriate; for categorical variables, impute using the variable's mode. For example:
#Frequency count of 'education'
data['education'].value_counts()
#Replace all instances of '0' in 'education' with NaN
data['education'] = data['education'].replace(0, np.nan)
#Fill missing values with the mode of 'education'
data['education'] = data['education'].fillna(data['education'].mode()[0])
#Conduct this procedure for all demographic variables...
#Frequency count of 'familysize'
data['familysize'].value_counts()
#Replace all instances of '0' in 'familysize' with NaN
data['familysize'] = data['familysize'].replace(0, np.nan)
#Fill missing values with the mean of each column
data.fillna(data.mean(), inplace=True)
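#Remove implausibly large family sizes (the 69- and 100-child reports discussed below)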
data = data[data['familysize'] < 60]
#Frequency count of 'age'
data['age'].value_counts()
#Some people report they're 265 years old!
# Remove people who are older than a certain age; let's arbitrarily use 95
data = data[data['age'] < 95]
One note about the 'familysize' variable...
The variable 'familysize' specifically asks: "Including you, how many children did your mother have?"
First, we see that 172 respondents reported '0' children. This is incorrect, since the question requires the respondent to include themselves in the count (i.e., the minimum value must be '1'). We should conduct a mean imputation to replace these zeroes.
Curiously, we see that one respondent reported their mother having 69 children, while two respondents' mothers birthed 100 (!) children. The Guinness Book of World Records reports the most prolific mother, Mrs. Vassilyev, as having birthed 69 children. Since Mrs. Vassilyev passed away in 1782, we can assume that this respondent (and those who reported '100') incorrectly reported their family size. Most importantly, these answers are so ludicrous that I seriously suspect the validity of the rest of their answers. Therefore, I have removed these three individuals from the dataset altogether.
If you repeat the same procedure for age, you'll see a similar issue: one individual claims they are 265 years old! Therefore, I removed individuals from the data set who claimed to be older than 95 years old (i.e., an arbitrary cut-off).
Computing Total Scores
Now that we've cleaned our data, we can compute total scores for the RWAS and TIPI dimensions:
#Create your Authoritarian total score:
data['rwasTot'] = (data['Q3'] + data['Q4'] + data['Q5'] + data['Q6'] +
data['Q7'] + data['Q8'] + data['Q9'] + data['Q10'] + data['Q11'] + data['Q12']
+ data['Q13'] + data['Q14'] + data['Q15'] + data['Q16'] + data['Q17'] +
data['Q18'] + data['Q19'] + data['Q20'] + data['Q21'] + data['Q22'])
#Double-check: correct range for the RWAS total score is 20 - 180
data['rwasTot'].describe()
#Certain TIPI variables need to be reverse-scored; use a function
def reverseScoring(data, high, cols):
    '''Reverse scores on given columns
    data = data frame,
    high = highest score available + 1
    cols = the columns you want reversed, in list form'''
    data[cols] = high - data[cols]
    return data
#Columns to be reversed
cols = ['TIPI2', 'TIPI4', 'TIPI6', 'TIPI8', 'TIPI10']
#Create new dataset with reverse scores:
data = reverseScoring(data, 8, cols)
#Compute your new TIPI total variables:
data['extraversion'] = data['TIPI1'] + data['TIPI6']
data['agreeableness'] = data['TIPI2'] + data['TIPI7']
data['conscientiousness'] = data['TIPI3'] + data['TIPI8']
data['emotionalstability'] = data['TIPI4'] + data['TIPI9']
data['opennesstoexperience'] = data['TIPI5'] + data['TIPI10']
#Double-check: correct range for the TIPI total scores is 2 - 14
data.describe()
To compute the RWAS total score, we add the scored Q variables together (Q3 through Q22; per the scale's scoring instructions, the first two items are unscored warm-up questions, which is why the total ranges from 20 to 180). For the TIPI domain scores, we first need to reverse score some of the questions. Using Python, we can write a function that details the procedure: subtract each score in the specified columns from the highest possible value plus 1. (If the highest score in your column is 7, you pass 8 as your highest value, so a raw score of 2 becomes 8 - 2 = 6.) After reverse scoring, you can create your TIPI domain variables. Run descriptive analyses again to double-check that your ranges are appropriate for your assessments.
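If you'd rather not type out every column name, the same totals can be computed more compactly. Here is a minimal sketch of equivalent one-liners, using the column names above and assuming reverse scoring has already been applied:
#Compact equivalents of the total-score computations above (same results)
q_cols = ['Q' + str(i) for i in range(3, 23)]
data['rwasTot'] = data[q_cols].sum(axis=1)
tipi_pairs = {'extraversion': ['TIPI1', 'TIPI6'],
              'agreeableness': ['TIPI2', 'TIPI7'],
              'conscientiousness': ['TIPI3', 'TIPI8'],
              'emotionalstability': ['TIPI4', 'TIPI9'],
              'opennesstoexperience': ['TIPI5', 'TIPI10']}
for domain, items in tipi_pairs.items():
    data[domain] = data[items].sum(axis=1)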
Validity Check
Finally, the dataset includes a series of vocabulary questions ('VCL') that assess an individual's reading level, three of which are nonsense words. You can use these questions conservatively and remove observations for anyone whose reading level does not meet a pre-determined cut-off. I wish to maximize my sample (to increase power and build more complex models), so I will only remove respondents (n = 210) who reported knowing the definition of all three nonsense words. Perhaps their reading ability and/or understanding of the instructions is suboptimal, or they checked every word without actually reading them.
#Sum the appropriate VCL questions
data['vcheck'] = data['VCL6'] + data['VCL9'] + data['VCL12']
#How many respondents 'knew' the definitions of 3 nonsense words?
data['vcheck'].value_counts()
#Remove individuals who reported knowing fake words
data = data.query('vcheck != 3')
#Double-check to see that these respondents were removed:
data['vcheck'].describe()
#Export to CSV
data.to_csv('dataclean.csv', index=False)
Our dataset originally contained 9,881 rows; after cleaning, 9,661 observations remain. We can export our clean dataset as a new .csv file, which we can continue to use for all subsequent analyses.
This was a lengthy post, I admit; however, you'll find that manipulating and preparing your dataset is the most time-intensive aspect of data science. Now the fun can actually begin!
To view and/or download my Python Jupyter notebook, visit my Github page.
References
Adorno, T. W., Frenkel-Brunswik, E., Levinson, D. J., & Sanford, R. N. (1950). The Authoritarian Personality. Harper & Brothers.
Altemeyer, B. (1998). The other "authoritarian personality". Advances in Experimental Social Psychology, 30, 47-91.
Downey, R. G., & King, C. V. (1998). Missing data in Likert ratings: A comparison of replacement methods. The Journal of General Psychology, 125(2), 175-191.
Gosling, S. D., Rentfrow, P. J., & Swann, W. B., Jr. (2003). A very brief measure of the Big-Five personality domains. Journal of Research in Personality, 37(6), 504-528.
McCrae, R. R., & Costa, P. T. (1987). Validation of the five-factor model of personality across instruments and observers. Journal of Personality and Social Psychology, 52(1), 81.