Overview

The purpose of this website is to help users who are struggling to choose a college major by answering the following two questions:

  • Do people who majored in a given field as undergraduates go on to earn a Master’s or PhD?

  • Which industries do they ultimately go into?

Our team randomly sampled
Our team developed a zoomable icicle diagram that lets users explore all majors while comparing their ‘popularity’ at the same time. To visualize all potential pathways from a chosen major, our team implemented a Sankey diagram, which also helps identify the most ‘popular’ path or paths.



Where does the data come from?

The dataset comes from the LinkedIn API as well as from scraping public data from the LinkedIn website. For our study, we focused on five attributes: current location, major, degree, date of graduation, and current industry. The content of this dataset is publicly available to all LinkedIn users.



Our data mining process

In the data mining process, we used Python to filter each field. First, we scraped data from the LinkedIn API and filtered the resulting JSON files. We used a program to omit rows with missing values. To find each person’s field of study, we matched the entries with regular expressions, using a list of base fields found online as a guideline. Next, we used the results of the previous step to obtain each person’s path of degrees, from bachelor’s to master’s to PhD to industry. A hash map was used to count the occurrences of each field and of each transition from one degree to another.
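The steps above can be sketched roughly as follows. This is a minimal illustration and not the team's actual code: the record layout (`major`, `degrees`, `industry`), the `BASE_FIELDS` patterns, and the function names are all assumptions made for the example, and Python's `collections.Counter` stands in for the hash map.

```python
import re
from collections import Counter

# Hypothetical base fields and the patterns used to recognize them.
BASE_FIELDS = {
    "Computer Science": re.compile(r"computer\s+science|\bcs\b", re.IGNORECASE),
    "Accounting": re.compile(r"account", re.IGNORECASE),
    "Chemistry": re.compile(r"chemi", re.IGNORECASE),
}

def normalize_major(raw):
    """Return the first base field whose pattern matches, else None."""
    for field, pattern in BASE_FIELDS.items():
        if pattern.search(raw):
            return field
    return None

def count_transitions(records):
    """Count each step of the path major -> degree(s) -> industry."""
    counts = Counter()
    for rec in records:
        # Skip rows with a missing major, degree list, or industry.
        if not rec.get("major") or not rec.get("degrees") or not rec.get("industry"):
            continue
        major = normalize_major(rec["major"])
        if major is None:
            continue
        path = [major] + rec["degrees"] + [rec["industry"]]
        # Each consecutive pair in the path is one transition.
        for src, dst in zip(path, path[1:]):
            counts[(src, dst)] += 1
    return counts

# Toy records, not taken from the real dataset.
records = [
    {"major": "B.S. in Computer Science", "degrees": ["Bachelor", "Master"], "industry": "Internet"},
    {"major": "Accounting and Finance", "degrees": ["Bachelor"], "industry": "Banking"},
    {"major": "", "degrees": ["Bachelor"], "industry": "Retail"},
    {"major": "computer science", "degrees": ["Bachelor"], "industry": "Internet"},
]
transitions = count_transitions(records)
```

Each key in `transitions` is a (source, target) pair and each value is how many people followed that step, which is the kind of weighted link a Sankey diagram needs as input.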



Data Quality Analysis

Completeness

Majors: Since this attribute was collected through manual entry, 81.36% of entries were missing. Because our main focus is the relationship between majors and industries, the major and industry columns must be complete. Our team therefore decided to drop the records with an empty major.

Degrees: The degree column is in a similar situation: the majority (81.25%) of entries are missing because the field is entered manually. However, the empty entries in the major and degree columns overlap significantly. After omitting the records (rows) with an empty major, only 8.51% of the values in the degree column remain empty.

Date of Graduation: Since the date of graduation is not our main concern, we decided to keep the 9.04% of rows with an empty graduation date.

Fields of Industry: Because LinkedIn users choose their industry from a fixed list, these entries need no regular-expression normalization. However, 2.98% of the values in this column are still missing. Our team chose to omit these empty entries because completeness is essential for our study.
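The completeness figures above are per-column shares of empty values, measured before and after dropping rows with an empty major. A minimal sketch of that calculation is shown here; the `missing_share` helper and the toy rows are assumptions made for illustration, not the real data or the team's code.

```python
def missing_share(rows, column):
    """Percentage of rows whose value for `column` is empty or absent."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if not row.get(column))
    return 100.0 * missing / len(rows)

# Toy rows, not the real dataset.
rows = [
    {"major": "Chemistry", "degree": "Bachelor", "industry": "Pharmaceuticals"},
    {"major": "", "degree": "", "industry": "Retail"},
    {"major": "", "degree": "", "industry": "Banking"},
    {"major": "Accounting", "degree": "", "industry": "Accounting"},
]

# Share of empty degrees across all rows.
before = missing_share(rows, "degree")

# Share of empty degrees once rows with an empty major are dropped.
with_major = [row for row in rows if row.get("major")]
after = missing_share(with_major, "degree")
```

In this toy data the share of empty degrees falls once the rows without a major are removed, because the two columns tend to be empty in the same rows. The same overlap explains the drop from 81.25% to 8.51% reported above.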

Coherence

This dataset is mostly coherent because the most popular majors in the icicle diagram are consistent with the ranking from CNN.

  • Computer Science

  • Accounting

  • General Management

  • Finance

  • Marketing

  • General Engineering

  • Economics

  • Law & Public Policy

  • General Engineering

  • General Management

Correctness

As shown in the following diagram, majors like accounting and computer science open up more possible career paths. This is consistent with reality, since skill sets like these are in demand across a wide range of industries.

[Figure: accounting Sankey diagram]

On the other hand, as the following Sankey diagram shows, a background like chemistry is in demand in only a limited range of industries.

[Figure: chemistry Sankey diagram]

Accountability

Since the data was scraped from LinkedIn and is publicly available to all LinkedIn users, the dataset satisfies the accountability factor.