Sanaa Igbokwe

Logo

Sanaa Igbokwe is a data analyst who illuminates insights and drives decision-making by unraveling complex datasets with precision and creativity. They know that data analysis is not just about numbers; it's about understanding human behavior and environmental justice.

View My LinkedIn Profile

UK Gender Pay Gap Analysis Final Report

Bea Igbokwe, Anusha Ramprasad, Nivedita Ravi
IST 462/652
December 7, 2022

%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Data Cleaning


To start the analysis of the UK gender pay gap, we read the csv files for the years 2019 through 2021, omitting columns that would not be utilized within our analysis. We determined the columns pertaining to income and employer size for each company would be our focus for this analysis.

These columns were EmployerName, DiffMeanHourlyPercent, DiffMedianHourlyPercent, DiffMeanBonusPercent, DiffMedianBonusPercent, MaleBonusPercent, FemaleBonusPercent, MaleLowerQuartile, FemaleLowerQuartile, MaleLowerMiddleQuartile, FemaleLowerMiddleQuartile, MaleUpperMiddleQuartile, FemaleUpperMiddleQuartile, MaleTopQuartile, FemaleTopQuartile, and EmployerSize– leaving us with 16 columns remaining from our original 27. Columns 2 through 4 focus on the difference of pay between men and women, measuring the mean/median of hourly/bonus pay, with a negative value in the dataset implying there are higher percentage of women with the higher pay. Columns 5 through 15 divide the hourly pay of employees into quartiles, then further divides them into male and female– these two columns for each division adding up to 100. For the last column, each row had a value out of a list to provide data on the employer size: “Less than 250”, “250 to 499”, “500 to 999”, “1000 to 4999”,”5000 to 19,999”, “20,000 or more”, “Not Provided”.


By observing the info for each dataset, we could conclude the size of our datasets and which columns contained null values. For the years 2020 and 2021, there were over 10,000 rows with 2019 having significantly less rows at a bit over 6,900 rows.

gpg_2021 = pd.read_csv("https://gender-pay-gap.service.gov.uk/viewing/download-data/2021")
gpg_21 = gpg_2021.drop(["Address", "PostCode", "CompanyNumber", "SicCodes", "ResponsiblePerson", "CompanyLinkToGPGInfo", "CurrentName", "DueDate","EmployerId", "DateSubmitted", "SubmittedAfterTheDeadline" ], axis = 1)
gpg_21.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10492 entries, 0 to 10491
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   EmployerName               10492 non-null  object 
 1   DiffMeanHourlyPercent      10492 non-null  float64
 2   DiffMedianHourlyPercent    10492 non-null  float64
 3   DiffMeanBonusPercent       7685 non-null   float64
 4   DiffMedianBonusPercent     7685 non-null   float64
 5   MaleBonusPercent           10492 non-null  float64
 6   FemaleBonusPercent         10492 non-null  float64
 7   MaleLowerQuartile          10295 non-null  float64
 8   FemaleLowerQuartile        10295 non-null  float64
 9   MaleLowerMiddleQuartile    10295 non-null  float64
 10  FemaleLowerMiddleQuartile  10295 non-null  float64
 11  MaleUpperMiddleQuartile    10295 non-null  float64
 12  FemaleUpperMiddleQuartile  10295 non-null  float64
 13  MaleTopQuartile            10295 non-null  float64
 14  FemaleTopQuartile          10295 non-null  float64
 15  EmployerSize               10492 non-null  object 
dtypes: float64(14), object(2)
memory usage: 1.3+ MB
gpg_20 = pd.read_csv("https://gender-pay-gap.service.gov.uk/viewing/download-data/2020", 
                     usecols=['DiffMeanBonusPercent',
 'DiffMeanHourlyPercent',
 'DiffMedianBonusPercent',
 'DiffMedianHourlyPercent',
 'EmployerName',
 'EmployerSize',
 'FemaleBonusPercent',
 'FemaleLowerMiddleQuartile',
 'FemaleLowerQuartile',
 'FemaleTopQuartile',
 'FemaleUpperMiddleQuartile',
 'MaleBonusPercent',
 'MaleLowerMiddleQuartile',
 'MaleLowerQuartile',
 'MaleTopQuartile',
 'MaleUpperMiddleQuartile'])
gpg_20.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10532 entries, 0 to 10531
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   EmployerName               10532 non-null  object 
 1   DiffMeanHourlyPercent      10532 non-null  float64
 2   DiffMedianHourlyPercent    10532 non-null  float64
 3   DiffMeanBonusPercent       7894 non-null   float64
 4   DiffMedianBonusPercent     7894 non-null   float64
 5   MaleBonusPercent           10532 non-null  float64
 6   FemaleBonusPercent         10532 non-null  float64
 7   MaleLowerQuartile          10332 non-null  float64
 8   FemaleLowerQuartile        10332 non-null  float64
 9   MaleLowerMiddleQuartile    10332 non-null  float64
 10  FemaleLowerMiddleQuartile  10332 non-null  float64
 11  MaleUpperMiddleQuartile    10332 non-null  float64
 12  FemaleUpperMiddleQuartile  10332 non-null  float64
 13  MaleTopQuartile            10332 non-null  float64
 14  FemaleTopQuartile          10332 non-null  float64
 15  EmployerSize               10532 non-null  object 
dtypes: float64(14), object(2)
memory usage: 1.3+ MB
gpg_19 = pd.read_csv("https://gender-pay-gap.service.gov.uk/viewing/download-data/2019", 
                     usecols=['DiffMeanBonusPercent',
 'DiffMeanHourlyPercent',
 'DiffMedianBonusPercent',
 'DiffMedianHourlyPercent',
 'EmployerName',
 'EmployerSize',
 'FemaleBonusPercent',
 'FemaleLowerMiddleQuartile',
 'FemaleLowerQuartile',
 'FemaleTopQuartile',
 'FemaleUpperMiddleQuartile',
 'MaleBonusPercent',
 'MaleLowerMiddleQuartile',
 'MaleLowerQuartile',
 'MaleTopQuartile',
 'MaleUpperMiddleQuartile'])
gpg_19.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6921 entries, 0 to 6920
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   EmployerName               6921 non-null   object 
 1   DiffMeanHourlyPercent      6921 non-null   float64
 2   DiffMedianHourlyPercent    6921 non-null   float64
 3   DiffMeanBonusPercent       5205 non-null   float64
 4   DiffMedianBonusPercent     5203 non-null   float64
 5   MaleBonusPercent           6921 non-null   float64
 6   FemaleBonusPercent         6921 non-null   float64
 7   MaleLowerQuartile          6921 non-null   float64
 8   FemaleLowerQuartile        6921 non-null   float64
 9   MaleLowerMiddleQuartile    6921 non-null   float64
 10  FemaleLowerMiddleQuartile  6921 non-null   float64
 11  MaleUpperMiddleQuartile    6921 non-null   float64
 12  FemaleUpperMiddleQuartile  6921 non-null   float64
 13  MaleTopQuartile            6921 non-null   float64
 14  FemaleTopQuartile          6921 non-null   float64
 15  EmployerSize               6921 non-null   object 
dtypes: float64(14), object(2)
memory usage: 865.2+ KB

Data Exploration of Employer Size

Next, we wanted to observe pay disparity based on the employer size. Focusing on the datasets for the years 2019 and 2021, we created visualizations to analyze how employer size changes the average hourly pay for men and women. To create comprehensive visualizations, we decided to use boxen plots that shows the distribution of each employer size value and order it from smallest to largest, with the last value being “Not Provided”. To use this variable to analyze pay disparity, we decided to focus on the difference in the average hourly pay, which provides a percentage for each employer indicating a numeric value that is negative or positive.

gpg_21['EmployerSize'].unique()
array(['1000 to 4999', '250 to 499', 'Less than 250', '5000 to 19,999',
       '500 to 999', 'Not Provided', '20,000 or more'], dtype=object)
sns.set(rc={'figure.figsize':(16.75,5.75)}, font_scale = 1.5)
sns.boxenplot(data=gpg_21, 
            x = "EmployerSize", 
            y = "DiffMeanHourlyPercent",
            order = ["Less than 250", "250 to 499", "500 to 999", "1000 to 4999",
                    "5000 to 19,999", "20,000 or more", "Not Provided"],
            palette = "gist_heat"
             ).set(title='2021 Mean Hourly Pay Based on Employer Size')
plt.savefig('EmplySize21.pdf')

From looking at the visualization, it appears there are extreme outliers for women, as the negative values represent bias for women. These outliers appear to be more apparent for employer sizes less than 5000. Despite the outliers representing women, the boxen plots show the average mean hourly pay favors men as the median line is above 0.

sns.set(rc={'figure.figsize':(16.75,5.75)}, font_scale = 1.5)
sns.boxenplot(data=gpg_19, 
            x = "EmployerSize", 
            y = "DiffMeanHourlyPercent",
            order = ["Less than 250", "250 to 499", "500 to 999", "1000 to 4999",
                    "5000 to 19,999", "20,000 or more", "Not Provided"],
            palette = "gist_heat"
             ).set(title='2019 Mean Hourly Pay Based on Employer Size')
plt.savefig('EmplySize19.pdf')

Similarly to the boxen plot for 2021, there are extreme outliers for the women who have an hourly pay largerly greater than the average. For this year, however, there is less of a disparity with the gap appearing to be smaller between men and women. Some factors for this could be related to the amount of data received for this year compared to 2021 and prehaps an increase in diversity of job roles in 2021.

Data Exploration of Employer

To explore gender bias amongst employers, we decided to focus on the extreme ends of our datasets. Looking at median hourly pay, we extracted the 5 employers who had the highest median hourly pay for men and the 5 employers who had the highest median hourly pay for women, then we created two small dataframes that can be used for graphs. For this data exploration we looked at the year 2021 to gain some insight before doing further analysis.

top5_21 = gpg_21.nlargest(5, 'DiffMedianHourlyPercent')
top5_21
EmployerName DiffMeanHourlyPercent DiffMedianHourlyPercent DiffMeanBonusPercent DiffMedianBonusPercent MaleBonusPercent FemaleBonusPercent MaleLowerQuartile FemaleLowerQuartile MaleLowerMiddleQuartile FemaleLowerMiddleQuartile MaleUpperMiddleQuartile FemaleUpperMiddleQuartile MaleTopQuartile FemaleTopQuartile EmployerSize
689 ATFC LIMITED 0.0 100.0 NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN 250 to 499
4343 HPI UK HOLDING LTD. 100.0 100.0 2.0 59.0 11.0 4.0 100.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0 250 to 499
5517 M. ANDERSON CONSTRUCTION LIMITED 100.0 100.0 100.0 100.0 14.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0 250 to 499
7197 PSJ FABRICATIONS LTD 100.0 100.0 100.0 100.0 3.7 0.0 100.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0 Less than 250
4369 HULL COLLABORATIVE ACADEMY TRUST 45.0 93.0 NaN NaN 0.0 0.0 3.0 97.0 3.0 97.0 17.0 83.0 17.0 83.0 Not Provided
sns.barplot(data=top5_21,
            y = "EmployerName",
            x = 'DiffMedianHourlyPercent', 
            orient = "h",
           palette = "gist_heat"
           ).set(title="Top 5 Employers for Men")
[Text(0.5, 1.0, 'Top 5 Employers for Men')]

By observing this bar plot, it is concluded there is a gender bias amongst some employers that is most likely affected by the industry. From this vizualization, we can see several construction companies that have 100% median hourly pay for men, implying there really isn’t any women in this company to begin with.

last5_21 = gpg_21.nsmallest(5, 'DiffMedianHourlyPercent')
last5_21
EmployerName DiffMeanHourlyPercent DiffMedianHourlyPercent DiffMeanBonusPercent DiffMedianBonusPercent MaleBonusPercent FemaleBonusPercent MaleLowerQuartile FemaleLowerQuartile MaleLowerMiddleQuartile FemaleLowerMiddleQuartile MaleUpperMiddleQuartile FemaleUpperMiddleQuartile MaleTopQuartile FemaleTopQuartile EmployerSize
472 ANKH CONCEPTS HOSPITALITY MANAGEMENT LIMITED -379.6 -499.5 NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN 250 to 499
7256 QUEST PAY SOLUTIONS NE LIMITED -90.0 -131.0 NaN NaN 0.0 0.0 95.0 5.0 92.0 8.0 95.0 5.0 29.0 71.0 500 to 999
3449 FORTEL SERVICES LIMITED -184.2 -128.8 63.5 -6.7 12.9 30.4 93.4 6.6 99.0 1.0 96.1 3.9 99.5 0.5 1000 to 4999
7522 RLC (UK) LIMITED -40.9 -121.5 30.9 0.0 15.5 6.1 NaN NaN NaN NaN NaN NaN NaN NaN 250 to 499
6180 NCR UK GROUP LIMITED -53.0 -104.0 -105.0 -326.0 97.0 95.0 97.0 3.0 97.0 3.0 87.0 13.0 83.0 17.0 500 to 999
sns.barplot(data=last5_21,
            y = "EmployerName",
            x = 'DiffMedianHourlyPercent', 
            orient = "h",
           palette = "gist_heat"
           ).set(title='Top 5 Employers for Women')
[Text(0.5, 1.0, 'Top 5 Employers for Women')]

On the flipside, there are employers that appear to largely favor women. However, researching these companies in the visualization it is evident that the gender disparity is heavily influenced by the industry of the employer, as women mostly occupy financing and hospitality jobs.

Data Transformation for Visual Data Exploration

To create a comprehensive visualization, we decided to combine the results from the aforementioned analysis into one graph by merging the two small dataframes we created into one.

high_low_21 = pd.concat([top5_21, last5_21], axis = 0)
sns.set(rc={'figure.figsize':(12,5.75)}, font_scale = 1.5)
sns.barplot(data=high_low_21,
            y = "EmployerName",
            x = 'DiffMedianHourlyPercent', 
            orient = "h",
           palette = "gist_heat"
           ).set(title="Gender Pay Disparity by Employer")

plt.savefig('DiffMedianBar1.pdf')


Data Transformation for Final Analysis

For the final analysis we wanted to observe all of our data in one dataframe. To be able to create visualizations with all three of the datasets we created a column for each containing the year for that dataset. We used this new column to merge each dataframe together into one large dataframe. Now with one large dataframe we went ahead and filled in NAN values with the mean of each column to have data representing each row even if it wasn’t provided.

#ADDING A YEAR COLUMN  
gpg_19["Year"] = 2019
gpg_20["Year"] = 2020
gpg_21["Year"] = 2021
#MERGING DATAFRAMES ON THE YEAR COLUMN

merged = pd.concat([gpg_19, gpg_20], axis = 0)
final = pd.concat([merged,gpg_21], axis = 0)
final
EmployerName DiffMeanHourlyPercent DiffMedianHourlyPercent DiffMeanBonusPercent DiffMedianBonusPercent MaleBonusPercent FemaleBonusPercent MaleLowerQuartile FemaleLowerQuartile MaleLowerMiddleQuartile FemaleLowerMiddleQuartile MaleUpperMiddleQuartile FemaleUpperMiddleQuartile MaleTopQuartile FemaleTopQuartile EmployerSize Year
0 'PRIFYSGOL ABERYSTWYTH' AND 'ABERYSTWYTH UNIVE... 11.5 10.3 NaN NaN 0.0 0.0 53.0 47.0 41.0 59.0 40.0 60.0 62.0 38.0 1000 to 4999 2019
1 10 TRINITY SQUARE HOTEL LIMITED 8.7 10.3 29.6 54.5 90.5 90.5 47.9 52.1 56.3 43.7 78.9 21.1 66.7 33.3 250 to 499 2019
2 1LIFE MANAGEMENT SOLUTIONS LIMITED 11.0 -0.5 81.5 94.2 10.0 11.4 49.0 51.0 35.3 64.7 42.3 57.7 44.2 55.8 250 to 499 2019
3 1ST CHOICE STAFF RECRUITMENT LIMITED -2.3 0.0 -114.8 -249.3 1.1 0.4 50.8 49.2 67.7 32.3 62.9 37.1 50.0 50.0 250 to 499 2019
4 1ST HOME CARE LTD. -2.0 0.5 NaN NaN 0.0 0.0 10.0 90.0 8.0 92.0 9.0 91.0 9.0 91.0 250 to 499 2019
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10487 ZPG LIMITED 22.4 22.4 72.7 15.9 16.0 15.0 45.7 54.3 52.6 47.4 57.7 42.3 70.9 29.1 500 to 999 2021
10488 ZURICH EMPLOYMENT SERVICES LIMITED 26.5 24.4 55.4 36.0 96.9 95.5 35.9 64.1 43.0 57.0 53.1 46.9 67.8 32.2 1000 to 4999 2021
10489 ZURICH UK GENERAL SERVICES LIMITED 16.7 19.7 56.8 45.1 97.3 96.7 41.5 58.5 66.8 33.2 66.0 34.0 69.3 30.7 1000 to 4999 2021
10490 ZUTO LIMITED 16.0 6.0 10.0 35.0 66.0 75.0 64.0 36.0 58.0 42.0 71.0 29.0 70.0 30.0 250 to 499 2021
10491 ZWANENBERG FOOD GROUP UK LIMITED 18.7 2.1 76.9 0.0 57.7 42.3 48.7 51.3 59.4 40.6 62.3 37.7 65.0 35.0 500 to 999 2021

27945 rows × 17 columns

#REPLACING NULL VALUES WITH MEANS OF EACH PERCENTILE

meanBonus = final['DiffMeanBonusPercent'].mean()
medianBonus = final['DiffMedianBonusPercent'].mean()
mean_MaleLQ = final['MaleLowerQuartile'].mean()
mean_FemLQ = final['FemaleLowerQuartile'].mean()
mean_MaleLMQ = final['MaleLowerMiddleQuartile'].mean()
mean_FemLMQ = final['FemaleLowerMiddleQuartile'].mean()
mean_MaleUMQ = final['MaleUpperMiddleQuartile'].mean()
mean_FemUMQ = final['FemaleUpperMiddleQuartile'].mean()
mean_MaleTopQ = final['MaleTopQuartile'].mean()
mean_FemTopQ = final['FemaleTopQuartile'].mean()
final['DiffMeanBonusPercent'].fillna(value=meanBonus, inplace=True)
final['DiffMedianBonusPercent'].fillna(value=medianBonus, inplace=True)
final['MaleLowerQuartile'].fillna(value=mean_MaleLQ, inplace=True)
final['FemaleLowerQuartile'].fillna(value=mean_FemLQ, inplace=True)
final['MaleLowerMiddleQuartile'].fillna(value=mean_MaleLMQ, inplace=True)
final['FemaleLowerMiddleQuartile'].fillna(value=mean_FemLMQ, inplace=True)
final['MaleUpperMiddleQuartile'].fillna(value=mean_MaleUMQ, inplace=True)
final['FemaleUpperMiddleQuartile'].fillna(value=mean_FemUMQ, inplace=True)
final['MaleTopQuartile'].fillna(value=mean_MaleTopQ, inplace=True)
final['FemaleTopQuartile'].fillna(value=mean_FemTopQ, inplace=True)
final
EmployerName DiffMeanHourlyPercent DiffMedianHourlyPercent DiffMeanBonusPercent DiffMedianBonusPercent MaleBonusPercent FemaleBonusPercent MaleLowerQuartile FemaleLowerQuartile MaleLowerMiddleQuartile FemaleLowerMiddleQuartile MaleUpperMiddleQuartile FemaleUpperMiddleQuartile MaleTopQuartile FemaleTopQuartile EmployerSize Year
0 'PRIFYSGOL ABERYSTWYTH' AND 'ABERYSTWYTH UNIVE... 11.5 10.3 21.101073 3.523732 0.0 0.0 53.0 47.0 41.0 59.0 40.0 60.0 62.0 38.0 1000 to 4999 2019
1 10 TRINITY SQUARE HOTEL LIMITED 8.7 10.3 29.600000 54.500000 90.5 90.5 47.9 52.1 56.3 43.7 78.9 21.1 66.7 33.3 250 to 499 2019
2 1LIFE MANAGEMENT SOLUTIONS LIMITED 11.0 -0.5 81.500000 94.200000 10.0 11.4 49.0 51.0 35.3 64.7 42.3 57.7 44.2 55.8 250 to 499 2019
3 1ST CHOICE STAFF RECRUITMENT LIMITED -2.3 0.0 -114.800000 -249.300000 1.1 0.4 50.8 49.2 67.7 32.3 62.9 37.1 50.0 50.0 250 to 499 2019
4 1ST HOME CARE LTD. -2.0 0.5 21.101073 3.523732 0.0 0.0 10.0 90.0 8.0 92.0 9.0 91.0 9.0 91.0 250 to 499 2019
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10487 ZPG LIMITED 22.4 22.4 72.700000 15.900000 16.0 15.0 45.7 54.3 52.6 47.4 57.7 42.3 70.9 29.1 500 to 999 2021
10488 ZURICH EMPLOYMENT SERVICES LIMITED 26.5 24.4 55.400000 36.000000 96.9 95.5 35.9 64.1 43.0 57.0 53.1 46.9 67.8 32.2 1000 to 4999 2021
10489 ZURICH UK GENERAL SERVICES LIMITED 16.7 19.7 56.800000 45.100000 97.3 96.7 41.5 58.5 66.8 33.2 66.0 34.0 69.3 30.7 1000 to 4999 2021
10490 ZUTO LIMITED 16.0 6.0 10.000000 35.000000 66.0 75.0 64.0 36.0 58.0 42.0 71.0 29.0 70.0 30.0 250 to 499 2021
10491 ZWANENBERG FOOD GROUP UK LIMITED 18.7 2.1 76.900000 0.000000 57.7 42.3 48.7 51.3 59.4 40.6 62.3 37.7 65.0 35.0 500 to 999 2021

27945 rows × 17 columns

Data Exploration for Final Analysis

Finally, we wanted to analyze everything we explored so far at a smaller scale now on our final dataframe we created by merging the dataframes for the years 2019, 2020, and 2021. By merging these dataframes on the year column we were able to create visualizations that disply changes over the years, as well as analyze the data using the previously mentioned variables. First we looked at the hourly pay and bonus pay difference over the years, the male to female ratio of hourly pay based on the provided quartile columns, then analyzed the difference in hourly pay based on employer size.

Observing the Gap in Hourly Pay

To start the finalization of our analysis of pay disparity, we created some graphs looking at the difference in median and mean hourly pay, using the boxen plot as used at the start of analysis. With combined dataframes we were able to analyze this variable over the years 2019, 2020, and 2021 without having to measure it against another factor. Next, we created a histogram for the median hourly pay to gain insight of the pay disparity over all three years. We also observed the bonus pay variable to analyze for disparity, focusing on the median difference as the mean difference did not provide significant information.

sns.boxenplot(data=final, 
            y = "DiffMedianHourlyPercent", 
            x = "Year",
            palette = "gist_heat"
             ).set(title="Median Hourly Pay for 2019-2021")

plt.savefig('DiffMedianYear.pdf')

From the above boxen plot we observe that the values above 0 represent men and the values below 0 represent women. From the boxen plot above we can see that between the years 2019 to 2021, the year 2020 has extreme outliers for women when compared to the years 2019 and 2021.

sns.boxenplot(data=final, 
            y = "DiffMeanHourlyPercent", 
            x = "Year"
             ).set(title="Mean Hourly Pay for 2019-2021")
[Text(0.5, 1.0, 'Mean Hourly Pay for 2019-2021')]

The boxen plot for mean hourly pay shows similar trends like the previous boxen plot where men are favored more than women, with outliers representing some bias towards women. From the boxen plot we can clearly see that in the year 2019, there are more outliers than the other years indicating that in the year 2019 the mean hourly pay was a little more for women compared to 2020 and 2021.

final['DiffMedianHourlyPercent'].plot(kind='hist', 
                                      bins=75, 
                                      figsize=[12,6], 
                                      alpha=.6, 
                                      legend=True, 
                                      color = 'maroon'
                                     ).set(title="Median Hourly Pay")
[Text(0.5, 1.0, 'Median Hourly Pay')]

From the above histogram, we observe that the distribution of the difference in median hourly pay is severely-skewed to the left. Thus, we can infer that in most cases, men have higher hourly pay compared to women.

Bonus Pay

sns.barplot(data=final, 
            y = "DiffMedianBonusPercent", 
            x = "Year",
            palette = "gist_heat"
           ).set(title="Median Bonus Pay of Men & Women for Years 2019-2021")
plt.savefig('YearDiffMedBonus.pdf')

For 2019, we can observe that the percentage of median bonus pay is skewed towards men and that the ones for women aren’t included in the plot. In 2020, it can be seen that the median of women receiving bonus pay is 3% comparatively lower than men. Following that, in 2021, the median of women receiving bonus pay was 2% bonus. Thus, based on this output, we can infer that there is a slight improvement in the probability that women will receive bonus pay.

Male to Female Ratio of Hourly Pay

Next, we wanted to analyze the overall male to female hourly pay ratio of all three years together. To do this we calculated the average percentage of men and women in each quartile, assigning each value to a variable. We then used each variable to create a table to show two columns labeled male and female with each row showing the labeled quartile. By observing the table below, it is concluded that overall more men are in the top hourly pay quartile at 59.7% and more women are in the lower hourly pay quartile at 54.6%.

#FINDING THE MEANS FOR EACH PERCENTILE

MLQ_per = final['MaleLowerQuartile'].mean()
FLQ_per = final['FemaleLowerQuartile'].mean()
MaleLMQ_per = final['MaleLowerMiddleQuartile'].mean()
FemLMQ_per = final['FemaleLowerMiddleQuartile'].mean()
MaleUMQ_per = final['MaleUpperMiddleQuartile'].mean()
FemUMQ_per = final['FemaleUpperMiddleQuartile'].mean()
MaleTop_per = final['MaleTopQuartile'].mean()
FemTop_per = final['FemaleTopQuartile'].mean()
data = [{'Male': MLQ_per, 'Female': FLQ_per}, 
        {'Male': MaleLMQ_per , 'Female': FemLMQ_per}, 
        {'Male': MaleUMQ_per , 'Female':FemUMQ_per}, 
        {'Male': MaleTop_per, 'Female': FemTop_per}]
percent= pd.DataFrame(data, index=["Lower_Quartile",
                                   "LowerMiddle_Quartile",
                                   "UpperMiddle_Quartile",
                                   "Top_Quartile"])
percent
Male Female
Lower_Quartile 45.353374 54.646626
LowerMiddle_Quartile 49.841387 50.158613
UpperMiddle_Quartile 54.243593 45.756407
Top_Quartile 59.742682 40.257318

Gender Disparity by Employer

To continue our analysis we looked at a sample of employers, again at each extreme, with 5 employers who had the highest median hourly pay for men and the 5 employers who had the highest median hourly pay for women with all three years combined. We created one comprehensive bar plot using the same strategy used earlier in analysis.

final.small = final.nsmallest(5, 'DiffMedianHourlyPercent')
final.large = final.nlargest(5, 'DiffMedianHourlyPercent')
final.highlow = pd.concat([final.small,final.large], axis = 0)
/tmp/ipykernel_55/3621132997.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  final.small = final.nsmallest(5, 'DiffMedianHourlyPercent')
/tmp/ipykernel_55/3621132997.py:2: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  final.large = final.nlargest(5, 'DiffMedianHourlyPercent')
/tmp/ipykernel_55/3621132997.py:3: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  final.highlow = pd.concat([final.small,final.large], axis = 0)
final.highlow
EmployerName DiffMeanHourlyPercent DiffMedianHourlyPercent DiffMeanBonusPercent DiffMedianBonusPercent MaleBonusPercent FemaleBonusPercent MaleLowerQuartile FemaleLowerQuartile MaleLowerMiddleQuartile FemaleLowerMiddleQuartile MaleUpperMiddleQuartile FemaleUpperMiddleQuartile MaleTopQuartile FemaleTopQuartile EmployerSize Year
472 ANKH CONCEPTS HOSPITALITY MANAGEMENT LIMITED -379.6 -499.5 21.101073 3.523732 0.0 0.0 45.353374 54.646626 49.841387 50.158613 54.243593 45.756407 59.742682 40.257318 250 to 499 2021
6518 NSS CLEANING LIMITED -181.3 -487.2 -9087.300000 -14967.100000 12.3 5.3 98.200000 1.800000 100.000000 0.000000 100.000000 0.000000 67.300000 32.700000 250 to 499 2020
726 AUTO-SLEEPERS GROUP LIMITED -42.5 -220.3 10.300000 -5.900000 52.2 29.5 100.000000 0.000000 50.000000 50.000000 0.000000 100.000000 100.000000 0.000000 250 to 499 2020
727 AUTO-SLEEPERS INVESTMENTS LIMITED -42.5 -220.3 10.300000 -5.900000 52.2 29.5 100.000000 0.000000 50.000000 50.000000 0.000000 100.000000 100.000000 0.000000 250 to 499 2020
1747 DONALDSON TIMBER ENGINEERING LIMITED -54.2 -134.0 -39.500000 -393.500000 72.0 75.0 97.900000 2.100000 98.900000 1.100000 78.900000 21.100000 76.600000 23.400000 250 to 499 2019
989 BEERE ELECTRICAL SERVICES LIMITED 100.0 100.0 100.000000 100.000000 57.1 0.0 100.000000 0.000000 100.000000 0.000000 100.000000 0.000000 100.000000 0.000000 Less than 250 2020
4057 HARVEY NICHOLS (OWN BRAND) STORES LIMITED 100.0 100.0 56.600000 46.700000 22.2 56.9 100.000000 0.000000 100.000000 0.000000 100.000000 0.000000 100.000000 0.000000 Not Provided 2020
4061 HARVEY NICHOLS RESTAURANTS LIMITED 100.0 100.0 21.101073 3.523732 0.0 3.2 100.000000 0.000000 100.000000 0.000000 100.000000 0.000000 100.000000 0.000000 Not Provided 2020
4702 J.C.B.EARTHMOVERS LIMITED 100.0 100.0 28.300000 0.000000 96.1 90.9 45.353374 54.646626 49.841387 50.158613 54.243593 45.756407 59.742682 40.257318 250 to 499 2020
4715 J5C MANAGEMENT LIMITED 100.0 100.0 41.300000 23.000000 6.4 7.4 100.000000 0.000000 100.000000 0.000000 100.000000 0.000000 100.000000 0.000000 500 to 999 2020
sns.barplot(data=final.highlow,
            y = "EmployerName",
            x = 'DiffMedianHourlyPercent', 
            orient = "h",
           palette = "gist_heat")
plt.savefig('EmpDiff.pdf')

The bar plot represents the the 5 largest and 5 smallest Hourly median pay across various companies between the years 2019 and 2021. We can see that the companies to the left of 0 represent the companies which favor women over men as they have large negative values. The companies to the right of 0 represent companies with positive Median values which represent men getting paid more . From the above bar plot we can clearly see the companies that women would prefer to work in.

Gender Pay Disparity by Employer Size

Lastly, we analyzed the overall pay disparity based on employer size, focusing on the difference in the average hourly pay of men and women.

sns.set(rc={'figure.figsize':(16.75,8)}, font_scale = 1.5)
sns.boxenplot(data=final, 
              x = "EmployerSize", 
              y = "DiffMeanHourlyPercent",
              order = ["Less than 250", "250 to 499", "500 to 999", "1000 to 4999",
                        "5000 to 19,999", "20,000 or more", "Not Provided"],
              palette = "gist_heat"
             ).set(title="Mean Hourly Pay Based on Employer Size")
[Text(0.5, 1.0, 'Mean Hourly Pay Based on Employer Size')]

From the boxen plot above we can observe that the average hourly pay for women is higher in companies with less number of employees. Companies where the number of employees are larger tend to favor men more than women according to the boxen plot

Conclusion

After our analysis we came to a series of conclusions based on our results. By looking at the disparity in hourly pay, it is possible the increase in gender diversity also displays an increase in pay disparity with the outliers, women receiving a higher hourly pay, becoming more frequent. This possibility would actually be a positive outcome as it implies increased equity that cannot be as easily observed through the data we were using. Employers with less employees may be benefical for women receiving a higher hourly pay. Although the analysis still showed men received on average a higher hourly pay, the possibility of women recieving equitable income is more likely. Overall men make more hourly pay compared to women, with women more likely to make much less than men. The industry of the employer affects the gender diversity amongst companies, and as a result this showed extreme results with gender pay disparity. Men are more likely to receive bonus pay over women, however, the dominance of men receiving bonuses over women has decreased, likely because of the pandemic. The bonus pay women receive present as outliers compared to the range at which men receive bonuses, affecting the probability range of men or women receiving a bonus by a sigificant amount.