GitHub Repository: robertopucp/1eco35_2022_2
Path: blob/main/Trabajo_grupal/WG1/qs_world_fs.ipynb
Kernel: Python 3
Visualizing QS World University Rankings from 2017 to 2022
QS World University Rankings is an annual publication of global university rankings by Quacquarelli Symonds. The QS ranking receives approval from the International Ranking Expert Group (IREG), and is viewed as one of the three most widely read university rankings in the world, along with the Academic Ranking of World Universities and the Times Higher Education World University Rankings. Quacquarelli Symonds (QS) is a UK company specialising in the analysis of higher education institutions around the world. In December 2003, Richard Lambert's review of university-industry collaboration in Britain for HM Treasury, the finance ministry of the United Kingdom, recommended world university rankings, which Lambert said would help the UK gauge the global standing of its universities. The first issue of the QS World Rankings was released in 2004 in partnership with Times Higher Education (THE) as the Times Higher Education - QS World University Rankings. In 2009, THE split with QS and went on to publish its own version of the rankings. QS has been publishing its university rankings in partnership with Elsevier.

Methodology

QS designed its rankings to assess performance according to what it believes to be key aspects of a university's mission: teaching, research, nurturing employability, and internationalisation. The methodological framework it follows assesses universities on six metrics:
  • Academic Reputation (40%)
  • Employer Reputation (10%)
  • Faculty/Student Ratio (20%)
  • Citations per faculty (20%)
  • International Faculty Ratio (5%)
  • International Student Ratio (5%)
More information about the methodology can be found here.
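To make the weighting concrete, the overall score can be read as a weighted sum of the six indicator scores. The sketch below assumes each indicator is already scaled to 0-100; the function name `overall_score` and the values in `example_scores` are invented for illustration and are not taken from QS data.

```python
# Indicator weights as listed above (they add up to 100%).
QS_WEIGHTS = {
    "academic_reputation": 0.40,
    "employer_reputation": 0.10,
    "faculty_student_ratio": 0.20,
    "citations_per_faculty": 0.20,
    "international_faculty_ratio": 0.05,
    "international_student_ratio": 0.05,
}


def overall_score(indicator_scores):
    """Weighted sum of indicator scores (each assumed to be on a 0-100 scale)."""
    missing = set(QS_WEIGHTS) - set(indicator_scores)
    if missing:
        raise ValueError(f"missing indicators: {sorted(missing)}")
    return sum(QS_WEIGHTS[name] * indicator_scores[name] for name in QS_WEIGHTS)


# Hypothetical indicator scores for a made-up university.
example_scores = {
    "academic_reputation": 90,
    "employer_reputation": 80,
    "faculty_student_ratio": 70,
    "citations_per_faculty": 60,
    "international_faculty_ratio": 50,
    "international_student_ratio": 40,
}

print(round(overall_score(example_scores), 2))
```

Because Academic Reputation carries 40% of the weight, a 10-point change in it moves the total by 4 points, while the same change in either international ratio moves it by only 0.5.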
About Data 📊🔎
The dataset was obtained by scraping the QS World University Rankings website with Python and Selenium.

Feature Description

The dataset has a total of 15 columns.
  • university - name of the university
  • year - year of ranking
  • rank_display - rank given to the university
  • score - score of the university based on the six key metrics mentioned above
  • link - link to the university profile page on QS website
  • country - country in which the university is located
  • city - city in which the university is located
  • region - continent in which the university is located
  • logo - link to the logo of the university
  • type - type of university (public or private)
  • research_output - quality of research at the university
  • student_faculty_ratio - number of students per faculty member
  • international_students - number of international students enrolled at the university
  • size - size category of the university (QS classifies size by student enrolment)
  • faculty_count - number of faculty or academic staff at the university
pip install geopandas
Successfully installed click-plugins-1.1.1 cligj-0.7.2 fiona-1.8.21 geopandas-0.10.2 munch-2.5.0 pyproj-3.2.1
Import Libraries 📚
# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
plt.rcParams['axes.edgecolor']='#FA6E4F'
plt.rcParams['font.family'] = 'monospace'
import seaborn as sns
import geopandas as gpd
import missingno as msno
import re
import warnings
warnings.filterwarnings("ignore")
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
Custom Color Palette 🎨
long_palette = ["#FA6E4F", "#F2CF59", "#FB8E7E", "#C5D7C0", "#8EC9BB", "#F8CA9D",
                '#F69EAF', '#8F8CBC', '#7C5396', '#EA6382', '#6BEAF3', '#5A9DE2',
                '#DDAD64', '#EA876B', '#B98174', '#357866', '#625586', '#647B99']
custom_palette1 = sns.color_palette(long_palette)
short_palette = ["#FA6E4F", "#F2CF59", "#FB8E7E", "#C5D7C0", "#8EC9BB", "#F8CA9D"]
custom_palette2 = sns.color_palette(short_palette)
watermelon_colors = ['#84e3c8', '#a8e6cf', '#dcedc1', '#ffd3b6', '#ffaaa5', '#ff8b94', '#ff7480']
custom_palette3 = sns.color_palette(watermelon_colors)
research_palette = ['#FA6E4F','#8EC9BB']
student_faculty_palette = ['#003f5c','#ff6361']
international_palette = ['#ffcf6a','#628d82']
sns.palplot(sns.color_palette(long_palette))
sns.palplot(sns.color_palette(short_palette))
sns.palplot(sns.color_palette(watermelon_colors))
Load and Explore data 🕵🏻‍♀️
university_df = pd.read_excel("/content/drive/MyDrive/DAE-PUCP/Docentes - policy paper/Base de datos/data2017_2022.xlsx")
FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/DAE-PUCP/Docentes - policy paper/Base de datos/data2017_2022.xlsx'
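The error above means the workbook was not at that path when the cell ran. One possible cause, which is an assumption here, is that Google Drive was not mounted in that session. A minimal sketch of checking for the file before loading it follows; the helper `find_data_file` and the second entry in `CANDIDATES` are hypothetical.

```python
from pathlib import Path


def find_data_file(candidates):
    """Return the first existing file among `candidates`, or raise a clear error."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    tried = "\n  ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"Data file not found. Tried:\n  {tried}")


# Hypothetical search order: the Drive path used in the notebook, then a local copy.
CANDIDATES = [
    "/content/drive/MyDrive/DAE-PUCP/Docentes - policy paper/Base de datos/data2017_2022.xlsx",
    "data2017_2022.xlsx",
]

try:
    data_path = find_data_file(CANDIDATES)
    print(f"Found data at {data_path}")
    # university_df = pd.read_excel(data_path)
except FileNotFoundError as err:
    print(err)
```

Failing early with the list of paths that were tried is easier to read than the long pandas traceback, and it stops the later cells from running against a missing `university_df`.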
university_df.head()
university_df.shape
university_df.info()
Data Cleaning and Preprocessing 🧹🔨
We can see from the dataset info() method that there are many null values across multiple columns. Let's take a look at the number of null values.
pd.DataFrame(university_df.isnull().sum(), columns=['No. of Missing values'])
missing_percent = round(university_df.isna().mean() * 100, 1)
pd.DataFrame(missing_percent[missing_percent > 0], columns=['% of Missing Values'])
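The percentage calculation works because `isna()` returns booleans and the mean of a boolean column is the fraction of `True` values. A small sketch on an invented frame (the `toy` data is made up and only stands in for `university_df`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for university_df (values are invented).
toy = pd.DataFrame({
    "university": ["A", "B", "C", "D"],
    "score": [90.1, np.nan, np.nan, np.nan],
    "city": ["Lima", None, "Cusco", "Piura"],
})

# isna() gives booleans; the column mean is the fraction of missing entries.
missing_percent = (toy.isna().mean() * 100).round(1)

# Keep only the columns that actually have missing values.
print(missing_percent[missing_percent > 0])
```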
Before handling the null values, let's see if there is any correlation between the missing values. I have used the missingno package, a simple Python package for visualizing missing data. Visualizing the correlation between missing values can give better insight into the missingness of the data. Learn more about missingness types here.
cmap = ListedColormap(custom_palette3, name='cmap1')
msno.heatmap(university_df, cmap=cmap, figsize=(13, 6), fontsize=14);
The correlation heatmap includes only the columns with missing values. The higher the correlation, the more the missingness in one column depends on the missingness in another. We can see that 'faculty_count' is significantly correlated with 'student_faculty_ratio' and 'international_students'. The other columns show little or no correlation.
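As far as I understand it, the heatmap is built from the correlation between per-column "is missing" indicators. A rough pandas equivalent on invented data, assuming that interpretation of `msno.heatmap` is right:

```python
import numpy as np
import pandas as pd

# Invented data: faculty_count and student_faculty_ratio tend to be missing together.
toy = pd.DataFrame({
    "faculty_count": [100, np.nan, 250, np.nan, 80, np.nan],
    "student_faculty_ratio": [12, np.nan, 9, np.nan, 15, 20],
    "city": ["a", "b", None, "d", "e", "f"],
    "year": [2017, 2018, 2019, 2020, 2021, 2022],  # no missing values
})

# 1 where a value is missing, 0 otherwise.
nullity = toy.isnull().astype(int)

# Keep only columns with some, but not all, values missing;
# a constant indicator has no defined correlation.
nullity = nullity.loc[:, nullity.nunique() > 1]

nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

A value near 1 means two columns are usually missing in the same rows, and a value near -1 means one tends to be present exactly when the other is missing.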
Since multiple columns have missing values, let's drop rows that have more than 4 missing values because we can't work with a university that's missing a lot of its attributes.
print(len(university_df[university_df.isnull().sum(axis=1) > 4]))
drop_index = university_df[university_df.isnull().sum(axis=1) > 4].index.tolist()
university_df.drop(drop_index, inplace=True)
print('Rows which have more than 4 null values have been dropped!')
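The same rule can also be written with `dropna(thresh=...)`, which keeps rows that have at least a given number of non-null values. The sketch below uses an invented six-column frame to show that "at most 4 missing" is the same as "at least `n_columns - 4` present":

```python
import numpy as np
import pandas as pd

# Invented 3x6 frame of floats.
toy = pd.DataFrame(np.arange(18, dtype=float).reshape(3, 6), columns=list("abcdef"))
toy.iloc[1, :5] = np.nan   # row 1: five missing values
toy.iloc[2, :4] = np.nan   # row 2: exactly four missing values

max_missing = 4

# Boolean-mask form, mirroring the cell above.
kept_mask = toy[toy.isnull().sum(axis=1) <= max_missing]

# dropna form: keep rows with at least (n_columns - max_missing) non-null values.
kept_thresh = toy.dropna(thresh=toy.shape[1] - max_missing)

print(kept_thresh.index.tolist())
```

Note that neither form resets the index, so the surviving rows keep their original labels.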
Let's drop the 'link' and 'logo' columns as they are hyperlinks. Although the 'score' column could be very useful for analysis, it's missing nearly 56% of its values. When I looked for these values on the QS website, I could see they have given a score only for the top 500 universities although 1000+ universities have been ranked. So, I'm ignoring this column as well.
university_df.drop(['link', 'logo', 'score'], axis=1, inplace=True)
Normalize the 'Very high' label in 'research_output', then convert the 'international_students', 'faculty_count' and 'rank_display' columns to numeric by removing the special characters in them.
university_df['research_output'] = university_df['research_output'].replace('Very high', 'Very High')
university_df['international_students'] = university_df['international_students'].apply(lambda x: float(str(x).replace(',','')))
university_df['faculty_count'] = university_df['faculty_count'].apply(lambda x: float(str(x).replace(',','')))
university_df['rank_display'] = university_df['rank_display'].apply(lambda x: float(re.sub(r'\W+', '', str(x))))
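One thing to watch for: if 'rank_display' contains ranges such as '601-650' (an assumption about the scraped data, which is not shown here), stripping non-word characters joins the two bounds into a single large number. A possible alternative, sketched below with a hypothetical `parse_rank` helper, keeps only the first number in the label:

```python
import math
import re


def parse_rank(value):
    """Return the first number found in a rank label, or NaN if there is none."""
    match = re.search(r"\d+", str(value))
    return float(match.group()) if match else math.nan


# Hypothetical labels in the styles a ranking table might use.
for label in ["1", "=5", "601-650", "1001+", None]:
    print(repr(label), "->", parse_rank(label))

# For comparison, stripping non-word characters joins the two bounds of a range.
print(re.sub(r"\W+", "", "601-650"))
```

Taking the lower bound of a range is a design choice, not the only option; the midpoint would work as well, as long as it is applied consistently before sorting by rank.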
Visualizing universities by year and type
year_df = university_df['year'].value_counts().sort_values()
fig, ax = plt.subplots(figsize=(10,4), dpi=90)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.tick_params(bottom=False)
ax.get_yaxis().set_visible(False)
sns.countplot(data=university_df, x='year', palette=custom_palette1);
# add values on top of each bar
ax.bar_label(ax.containers[0])
ax.set_xlabel('Year', fontsize=13, color = '#ff4800');
fig.suptitle('Number of universities ranked over the years', fontsize=15, color = '#ff4800');
With each year, more and more universities are considered for the rankings and 2022 has the highest number of universities.
type_df = university_df['type'].value_counts()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,5))
pie_bar_colors = ['#FB8E7E','#8EC9BB']
explode = [0,0.1]
ax1.pie(university_df['type'].value_counts().values, labels = university_df['type'].value_counts().index, explode=explode, colors=pie_bar_colors, autopct='%1.1f%%')
ax1.axis('equal')
ax2.bar(university_df['type'].value_counts().index, university_df['type'].value_counts().values, color=pie_bar_colors)
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.spines['left'].set_visible(False)
ax2.tick_params(axis='both', which='both', labelsize=10, left=False, bottom=False)
ax2.get_yaxis().set_visible(False)
plt.title("University Types", fontsize=15, color = '#ff4800');
ax2.bar_label(ax2.containers[0])
fig.tight_layout()
fig.subplots_adjust(wspace=0.7)
If you do a simple Google search, you can find many websites claiming that private universities are better than public universities because they tend to have better rankings. Well, that's not the case here. More than 80% of the universities ranked are public.
Distribution of universities across the world 🌏
Now, let's take a look at the geography of the universities.
Universities by Continents 🗺
university_df['region'] = university_df['region'].apply(lambda x: x.replace('Latin America', 'South America'))
region_sum = pd.DataFrame(university_df['region'].value_counts().reset_index())
# define colors
colors = ['#f8e3ca','#f8d1b4','#f7bf9e','#f6ad88','#f69b72','#f5895c']
cmap = ListedColormap(colors)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 10), dpi=150, gridspec_kw={'height_ratios': [1, 2]})
# create barplot
ax1.bar(region_sum['index'], region_sum['region'], color=colors[::-1])
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax1.spines['left'].set_visible(False)
ax1.tick_params(bottom=False)
ax1.get_yaxis().set_visible(False)
ax1.bar_label(ax1.containers[0])
fig.suptitle('Distribution of universities across continents', fontsize=13, color = '#ff4800');
# create worldmap
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
worldmap_df = world.set_index('continent').join(region_sum.set_index('index')).reset_index()
to_be_mapped = 'region'
legend_labels = region_sum['index'].tolist()[::-1]
worldmap_df.plot(column=to_be_mapped, cmap=cmap, linewidth=0.8, ax=ax2, edgecolors='0.8', legend=True, categorical=True, )
leg = ax2.get_legend()
for text, label in zip(leg.get_texts(), legend_labels):
    text.set_text(label)
leg.set_bbox_to_anchor((1.15,0.5))
# leg.edgecolors('#ff4800')
frame = leg.get_frame()
frame.set_edgecolor('#ff4800')
ax2.set_axis_off()
plt.subplots_adjust(wspace=0, hspace=0)
Europe has the largest number of ranked universities, though we have to keep in mind that the data places Russia in Europe although it spans both Europe and Asia. Europe is followed by Asia and North America.
Universities by Countries 🏫
print('Number of countries with ranked universities: ',university_df['country'].nunique())
Out of the 195 countries in the world, only 97 countries have universities that are ranked.
uni_df = university_df['university'].value_counts()
fig, ax = plt.subplots(figsize=(10,20), dpi=150)
sns.countplot(data=university_df, y='country', order=university_df.country.value_counts().index, palette=custom_palette1);
plt.xlabel('Number of universities', fontsize=12, color = '#ff4800')
plt.ylabel('Country', fontsize=12, color = '#ff4800')
plt.title("Distribution of universities across countries", fontsize=14, color = '#ff4800');
# plt.savefig('countrywise.png')
The United States has the most universities ranked over the years, followed by the United Kingdom and Germany.
Universities by Cities 🌃
sorted_df = university_df.sort_values(by='rank_display').drop_duplicates('university')
sorted_df = pd.DataFrame(sorted_df['city'].value_counts()[:20])
fig, ax = plt.subplots(figsize=(14,5), dpi=100)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.tick_params(bottom=False)
ax.get_yaxis().set_visible(False)
sns.barplot(data=sorted_df, y='city', x=sorted_df.index, palette=custom_palette2)
plt.xticks(rotation=90);
ax.bar_label(ax.containers[0])
ax.set_xlabel('City Name', fontsize=13, color = '#ff4800');
fig.suptitle('Distribution of universities across cities', fontsize=15, color = '#ff4800');
# plt.savefig('countrywise.png')
The graph above shows the 20 cities with the most unique ranked universities. London is an academic hotspot with a whopping 19 universities that are ranked globally!
Ranking of Top 10 Universities 🔝
# university_df.sort_values('rank_display')[:60]
top_unis = ['Massachusetts Institute of Technology (MIT) ', 'Stanford University', 'University of Oxford', 'Harvard University', 'University of Cambridge',
            'California Institute of Technology (Caltech)', 'ETH Zurich - Swiss Federal Institute of Technology', 'Imperial College London', 'UCL', 'University of Chicago']
topunis_df = university_df[university_df['university'].isin(top_unis)][['year','university','rank_display']].reset_index(drop=True)
fig = plt.figure(figsize=(15,15), dpi=100)
# one subplot per university, with the y-axis inverted so that rank 1 is at the top
for uni, i in zip(top_unis, range(1, 11)):
    new_df = topunis_df[topunis_df['university'] == uni]
    ax = fig.add_subplot(5, 2, i)
    ax.plot(new_df['year'], new_df['rank_display'], color='#003f5c', linewidth=1.5)
    plt.gca().invert_yaxis()
    ax.set_title(uni, color='#ff4800')
fig.subplots_adjust(wspace=0.2, hspace=0.6, top=0.92)
fig.suptitle('Ranking of top 10 universities from 2017 to 2022', fontsize=15, color = '#ff4800');
By taking a quick look at the dataframe, I have made a list of the top 10 universities ranked over the years. These 10 universities have a tendency to occupy the top 10 positions consistently.
  • MIT is the undisputed king of the QS Rankings, ranked number 1 every year.
  • Stanford and Harvard have dropped down this year for the first time since 2017.
  • University of Oxford, the oldest university in the English-speaking world, has jumped from Rank 6 to Rank 2.
  • Overall, UK universities have climbed in the rankings compared with the US universities, most of which have dropped down this year (2022).
  • Out of the top 10, only one university, ETH Zurich (Switzerland), is from a country other than the US or the UK.
QS World Rankings - Contributing Factors ⚖️
Let's explore the metrics used to gauge the universities. Out of the 6 metrics that have been used, only 3 are present in this dataset.
  • Research output - 20%
  • Student Faculty ratio - 20%
  • International students - 5%
So, our analysis will account only for 45% of the survey methodology.
Research Output 🔬
Next to teaching, academic research is viewed as a very important factor. Understanding research output can give us insights into how much the top universities prioritize it.
fig, ax = plt.subplots(figsize=(8,4), dpi=90)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.tick_params(bottom=False)
ax.get_yaxis().set_visible(False)
sns.countplot(data=university_df, x='research_output', hue='type', palette=research_palette);
for container in ax.containers:
    ax.bar_label(container)
plt.legend(edgecolor='#ff4800');
ax.set_xlabel('Research Output', fontsize=13, color = '#ff4800');
fig.suptitle('Research output of universities', fontsize=15, color = '#ff4800');
Clearly, most of the universities under consideration have "Very High" research output. Public universities outperform private universities in terms of research.
fig, ax = plt.subplots(figsize=(8,4), dpi=90)
sns.barplot(data=university_df, x='research_output', y='faculty_count', hue='type', ci=None, palette=research_palette)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(bottom=False)
ax.set_xlabel('Research Output', fontsize=13, color = '#ff4800')
ax.set_ylabel('Faculty Count', fontsize=13, color = '#ff4800')
ax.legend(edgecolor='#ff4800',bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
fig.suptitle('Research output Vs Faculty Count', fontsize=15, color = '#ff4800');
As far as the number of faculty is concerned,
  • Universities with "Very High" research output have more staff.
  • So, does this mean universities with a higher number of faculty do better research? Not necessarily. Universities with "Very High" research output may attract more accomplished academics and researchers because of their reputation, along with many other factors.
  • We can see that public universities with "Very High" and "Low" research output have a nearly equal number of staff.
  • Also, private universities with "Low" research output have more staff than those with "High" output.
  • This can mean that not every university puts an emphasis on research, even if it has a larger academic staff.
research_size = pd.DataFrame(university_df.groupby(['research_output']).apply(lambda df: df['size'].value_counts()))
research_size
research_size = research_size.reset_index().rename(columns={'level_1': 'size', 'size': 'count'})
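The rename from 'level_1' relies on how pandas happens to name the index levels after `groupby(...).apply(...)`, which can differ between pandas versions. A more direct way to get the same long-format counts, sketched on invented data (the category labels in `toy` are assumptions):

```python
import pandas as pd

# Invented rows standing in for university_df.
toy = pd.DataFrame({
    "research_output": ["Very High", "Very High", "High", "High", "High", "Low"],
    "size": ["XL", "L", "L", "L", "M", "S"],
})

# Count rows per (research_output, size) pair and name the count column explicitly.
research_size = (
    toy.groupby(["research_output", "size"])
       .size()
       .reset_index(name="count")
)
print(research_size)
```

The result already has the three columns the pointplot needs, so no renaming step is required.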
💡 A quick intro on how to interpret a pointplot.

A pointplot shows an estimate of mean value for a numeric variable by using scatter plot points. This can be particularly useful for comparing different levels of a categorical variable. The lines joining the pointplot can be used to judge the differences between slopes easily. For more info, refer here
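In other words, each point is a per-group mean (seaborn's default estimator), with one line per hue level. The numbers behind such a plot can be reproduced with a plain groupby; the data below is invented for illustration.

```python
import pandas as pd

# Invented observations: two categorical variables and one numeric variable.
toy = pd.DataFrame({
    "size": ["S", "S", "S", "M", "M", "L", "L"],
    "type": ["Public", "Public", "Private", "Public", "Private", "Public", "Private"],
    "student_faculty_ratio": [10.0, 12.0, 6.0, 14.0, 8.0, 18.0, 9.0],
})

# One point per (x category, hue level): the mean of the numeric variable.
point_estimates = (
    toy.groupby(["size", "type"])["student_faculty_ratio"]
       .mean()
       .unstack("type")
)
print(point_estimates)
```

Each column of `point_estimates` corresponds to one line in the pointplot, and comparing the slopes of those lines is what the plot makes easy.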

sns.catplot(x="research_output", y="count", kind="point", data=research_size, hue='size', palette=custom_palette1);
plt.xlabel('Research Output', fontsize=13, color = '#ff4800')
plt.ylabel('Count', fontsize=13, color = '#ff4800')
plt.title('Research output Vs Size of university', fontsize=15, color = '#ff4800');
The relationship between the size of the university and research output is fairly clear. Universities with "Very High" and "High" research output are larger in size compared to those with "Medium" and "Low".
Student Faculty Ratio 👩🏻‍🎓→🧑🏻‍🏫
Student Faculty Ratio is an interesting measure. According to QS, "It is usually cited by students as a metric of highest importance to them". The lower the ratio, the higher the performance. A faculty member with fewer students assigned to them can dedicate more focus and attention to each individual.
university_df['student_faculty_ratio'].describe()
  • On average, universities tend to have 13 students per faculty.
  • There are universities that have as low as 1 student per faculty.
  • At the other end, there are universities that have 67 students per faculty.
plt.figure(figsize=(10,3), dpi=100)
sns.histplot(data=university_df, x='student_faculty_ratio', bins=60, color=student_faculty_palette[1]);
plt.xlabel('Student Faculty Ratio', color = '#ff4800')
plt.ylabel('Count', color = '#ff4800')
plt.title('Distribution of Student Faculty Ratio', fontsize=15, color = '#ff4800');
We have a right-skewed distribution. The outliers don't seem to affect the mean much. Most of the universities have somewhere between 5 and 20 students per faculty.
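A quick way to check both claims numerically is to compare the mean with the median and look at the sample skewness: in a right-skewed distribution the mean sits above the median and the skewness is positive. The values below are invented and are not the dataset's ratios.

```python
import pandas as pd

# Invented ratios with one large outlier.
ratios = pd.Series([5, 6, 7, 8, 9, 10, 11, 12, 14, 67], dtype=float)

print("mean:  ", ratios.mean())
print("median:", ratios.median())
print("skew:  ", round(ratios.skew(), 2))
```

On the real column, the same three numbers (`mean()`, `median()`, `skew()`) show how far the outliers pull the mean away from the median.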
plt.figure(figsize=(8,4), dpi=100)
sns.boxplot(data=university_df, y='student_faculty_ratio', x='research_output', hue='type', palette=student_faculty_palette);
plt.xlabel('Research Output', fontsize=11, color = '#ff4800')
plt.ylabel('Student Faculty Ratio',color = '#ff4800')
plt.title('Student Faculty Ratio vs Research Output', color = '#ff4800');
Universities with "Very High" research output have a much lower student faculty ratio than the rest.
sns.catplot(x="size", y="student_faculty_ratio", kind="point", data=university_df, hue='type', palette=student_faculty_palette);
plt.xlabel('Size', color = '#ff4800')
plt.ylabel('Student Faculty Ratio', color = '#ff4800')
plt.title('Student Faculty Ratio Vs Size of university', fontsize=15, color = '#ff4800');
Private universities have a much lower student faculty ratio than public universities of the same size. Another interesting observation is that the average student faculty ratio seems to increase with the size of the university.
International Students 🌐
A university that attracts students from across the world demonstrates a global outlook and possesses multicultural diversity on its campus.
university_df['international_students'].describe()
  • On average, universities tend to have 1900+ international students.
  • There is a university with its international students intake as high as 31,000+. Let's take a look at it.
# idxmax() returns an index label, so look the row up by label with .loc
university_df.loc[university_df['international_students'].idxmax()]
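Because `idxmax()` returns an index label rather than a position, and rows were dropped earlier without resetting the index, `.loc` and `.iloc` can return different rows for the same value. A tiny illustration on invented data:

```python
import pandas as pd

# Invented data; university "B" has the most international students.
toy = pd.DataFrame({
    "university": ["A", "B", "C", "D"],
    "international_students": [100.0, 31000.0, 500.0, 900.0],
})

# Dropping a row leaves gaps in the index: labels are now 1, 2, 3.
toy = toy.drop(index=[0])

label = toy["international_students"].idxmax()   # an index label

by_label = toy.loc[label, "university"]          # label-based lookup
by_position = toy.iloc[label]["university"]      # position-based lookup

print("loc: ", by_label)
print("iloc:", by_position)
```

Here the label-based lookup finds the university with the maximum, while the position-based one lands on a neighbouring row.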
plt.figure(figsize=(10,3), dpi=100)
sns.histplot(data=university_df, x='international_students', bins=50, color=international_palette[0]);
plt.xlabel('International Students', color = '#ff4800')
plt.ylabel('Count', color = '#ff4800')
plt.title('Distribution of International Students', fontsize=15, color = '#ff4800');
We have a right-skewed distribution here as well. There are very few outliers. Most of the universities have an intake between 0 and 5000.
sns.catplot(kind='point', data=university_df, x='research_output', y='international_students', order=university_df['research_output'].value_counts().index, ci=None, hue='type', palette=international_palette);
plt.xlabel('Research Output', color = '#ff4800')
plt.ylabel('International Students', color = '#ff4800')
plt.title('International Students Vs Research Output', fontsize=15, color = '#ff4800');
International students tend to prefer public universities with "Very High" research output. Due to lower tuition fees compared to private ones? 🤔 Maybe. 🤷🏻‍♀️
Most popular country of choice for International Students ✅
And for the last part, which country is most popular among international students? Can you guess before going down?
intstu_country = pd.DataFrame(university_df.groupby(['country'], sort=False)['international_students'].sum().sort_values(ascending=False)[:10])
fig, ax = plt.subplots(figsize=(10,4), dpi=100)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.tick_params(left=False)
ax.get_xaxis().set_visible(False)
sns.barplot(data=intstu_country, x='international_students', y=intstu_country.index, palette=custom_palette1);
ax.bar_label(ax.containers[0], fmt = '%d')
ax.set_ylabel('Country', fontsize=13, color = '#ff4800');
fig.suptitle('Country of choice for International Students from 2017 - 2022', fontsize=15, color = '#ff4800');
It's USA 🇺🇸 closely followed by UK 🇬🇧!
If you've come down this far, "THANK YOU!". Let me know in the comments if you have any feedback, criticisms or concerns.
Credits and Acknowledgement

Did I say those words? If you like my work, do hit the upvote button! No, I didn't. 😁