GitHub Repository: robertopucp/1eco35_2022_2
Path: blob/main/Trabajo_grupal/WG1/qs_world_fs.ipynb
Kernel: Python 3

Visualizing QS World University Rankings from 2017 to 2022

QS World University Rankings is an annual publication of global university rankings by Quacquarelli Symonds (QS), a UK company specialising in the analysis of higher education institutions around the world. The QS ranking is approved by the International Ranking Expert Group (IREG) and is viewed as one of the three most widely read university rankings in the world, along with the Academic Ranking of World Universities and the Times Higher Education World University Rankings. In December 2003, Richard Lambert's review of university-industry collaboration in Britain for HM Treasury, the finance ministry of the United Kingdom, recommended world university rankings, which Lambert said would help the UK gauge the global standing of its universities. The first issue of the QS World Rankings was released in 2004 in partnership with Times Higher Education (THE) as the Times Higher Education - QS World University Rankings. In 2009, THE split from QS and went on to publish its own rankings. QS has since published its university rankings in partnership with Elsevier.

Methodology

QS designed its rankings to assess performance according to what it believes to be key aspects of a university's mission: teaching, research, nurturing employability, and internationalisation. The methodological framework assesses universities on six metrics:

Academic Reputation (40%)
Employer Reputation (10%)
Faculty/Student Ratio (20%)
Citations per faculty (20%)
International Faculty Ratio (5%)
International Student Ratio (5%)

More information about the methodology can be found here.
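
As a rough illustration of how the six weights combine, here is a minimal sketch that computes a weighted overall score. The indicator scores are made-up values on an assumed 0-100 scale, and `overall_score` is a hypothetical helper; the way QS normalises each indicator is not reproduced here.

```python
# Weights taken from the methodology list above (they sum to 100%).
QS_WEIGHTS = {
    "academic_reputation": 0.40,
    "employer_reputation": 0.10,
    "faculty_student_ratio": 0.20,
    "citations_per_faculty": 0.20,
    "international_faculty_ratio": 0.05,
    "international_student_ratio": 0.05,
}


def overall_score(indicator_scores, weights=QS_WEIGHTS):
    """Weighted sum of indicator scores (each assumed to be on a 0-100 scale)."""
    return sum(weights[name] * indicator_scores[name] for name in weights)


# Hypothetical indicator scores for a single university (not real data).
example_scores = {
    "academic_reputation": 90.0,
    "employer_reputation": 80.0,
    "faculty_student_ratio": 70.0,
    "citations_per_faculty": 60.0,
    "international_faculty_ratio": 50.0,
    "international_student_ratio": 40.0,
}

print(round(overall_score(example_scores), 2))
```

Because Academic Reputation carries 40% of the weight, a change in that single indicator moves the total twice as much as the same change in Faculty/Student Ratio.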

About Data 📊🔎

The dataset was obtained by scraping the QS World University Rankings website with Python and Selenium.

Feature Description

The dataset has a total of 15 columns.

university - name of the university
year - year of ranking
rank_display - rank given to the university
score - score of the university based on the six key metrics mentioned above
link - link to the university profile page on QS website
country - country in which the university is located
city - city in which the university is located
region - continent in which the university is located
logo - link to the logo of the university
type - type of university (public or private)
research_output - quality of research at the university
student_faculty_ratio - number of students per faculty member
international_students - number of international students enrolled at the university
size - size category of the university (QS classifies size by number of students)
faculty_count - number of faculty or academic staff at the university

In [ ]:

pip install geopandas

Import Libraries 📚

In [ ]:

# import necessary libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
plt.rcParams['axes.edgecolor']='#FA6E4F'
plt.rcParams['font.family'] = 'monospace'
import seaborn as sns
import geopandas as gpd
import missingno as msno
import re

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Custom Color Palette 🎨

In [ ]:

long_palette = ["#FA6E4F", "#F2CF59", "#FB8E7E", "#C5D7C0", "#8EC9BB", "#F8CA9D", '#F69EAF', '#8F8CBC', '#7C5396', '#EA6382', '#6BEAF3', '#5A9DE2', '#DDAD64', '#EA876B', '#B98174', '#357866', '#625586', '#647B99']
custom_palette1 = sns.color_palette(long_palette)

short_palette = ["#FA6E4F", "#F2CF59", "#FB8E7E", "#C5D7C0", "#8EC9BB", "#F8CA9D"]
custom_palette2 = sns.color_palette(short_palette)

watermelon_colors = ['#84e3c8', '#a8e6cf', '#dcedc1', '#ffd3b6', '#ffaaa5', '#ff8b94', '#ff7480']
custom_palette3 = sns.color_palette(watermelon_colors)

research_palette = ['#FA6E4F','#8EC9BB']

student_faculty_palette = ['#003f5c','#ff6361']

international_palette = ['#ffcf6a','#628d82']

In [ ]:

sns.palplot(sns.color_palette(long_palette))
sns.palplot(sns.color_palette(short_palette))
sns.palplot(sns.color_palette(watermelon_colors))

Load and Explore data 🕵🏻‍♀️

In [ ]:

university_df = pd.read_excel("/content/drive/MyDrive/DAE-PUCP/Docentes - policy paper/Base de datos/data2017_2022.xlsx")

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-6-446ab81f94af> in <module>
----> 1 university_df = pd.read_excel("/content/drive/MyDrive/DAE-PUCP/Docentes - policy paper/Base de datos/data2017_2022.xlsx")

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/DAE-PUCP/Docentes - policy paper/Base de datos/data2017_2022.xlsx'
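
The error above only says that the file could not be reached from the running session (for example, when the drive is not mounted). Here is a minimal sketch of a loader that checks the path first and fails with a clearer message. `load_rankings` and the relative path are hypothetical; only the file name comes from the cell above.

```python
from pathlib import Path

import pandas as pd


def load_rankings(path):
    """Read the rankings workbook, failing early with a clearer message."""
    path = Path(path)
    if not path.is_file():
        # In Colab, Google Drive must be mounted before /content/drive paths exist.
        raise FileNotFoundError(
            f"Rankings file not found: {path}. Check that the drive is mounted "
            "and that the path matches your own copy of the data."
        )
    return pd.read_excel(path)


# Hypothetical relative path; replace it with the location of your own copy.
try:
    rankings = load_rankings("data/data2017_2022.xlsx")
except FileNotFoundError as err:
    print(err)
```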

In [ ]:

university_df.head()

In [ ]:

university_df.shape

In [ ]:

university_df.info()

Data Cleaning and Preprocessing 🧹🔨

The output of the info() method shows that there are many null values across multiple columns. Let's take a look at the number of null values.

In [ ]:

pd.DataFrame(university_df.isnull().sum(), columns=['No. of Missing values'])

In [ ]:

missing_percent = round(university_df.isna().mean() * 100, 1)
pd.DataFrame(missing_percent[missing_percent > 0], columns=['% of Missing Values'])
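
To make the percentage calculation concrete, here is a small sketch on a toy frame. The values are made up and `toy_df` only stands in for `university_df`.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) standing in for university_df.
toy_df = pd.DataFrame({
    "university": ["A", "B", "C", "D"],
    "score": [95.0, np.nan, np.nan, np.nan],
    "city": ["X", "Y", None, "Z"],
})

# isna() gives True/False per cell; the column mean of booleans is the
# fraction of missing entries, so multiplying by 100 gives a percentage.
toy_missing = (toy_df.isna().mean() * 100).round(1)

# Keep only the columns that actually have missing values.
print(toy_missing[toy_missing > 0])
```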

Before handling the null values, let's see if there is any correlation between the missing values. I have used the missingno package, a simple Python package for missing data visualization. Visualizing the correlation between missing values can give better insight into the missingness of the data. Learn more about missingness types here.

In [ ]:

cmap = ListedColormap(custom_palette3, name='cmap1')
msno.heatmap(university_df, cmap=cmap, figsize=(13, 6), fontsize=14);

The correlation heatmap includes only the columns with missing values. The higher the correlation between two columns, the more strongly a missing value in one column is associated with a missing value in the other. We can see that 'faculty_count' has a significant correlation with 'student_faculty_ratio' and 'international_students'. Other columns have little or no significant correlation.
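
The same kind of correlation can be approximated with pandas alone. This is a minimal sketch on made-up data, assuming the heatmap correlates the missing/not-missing indicator of each column; it is not the package's own code.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values); two columns are always missing together.
toy_df = pd.DataFrame({
    "faculty_count": [10.0, np.nan, 30.0, np.nan, 50.0, 60.0],
    "student_faculty_ratio": [5.0, np.nan, 7.0, np.nan, 9.0, 4.0],
    "city": ["X", "Y", None, "Z", "W", None],
})

# Keep only the columns that have at least one missing value.
with_gaps = toy_df.loc[:, toy_df.isna().any()]

# Correlate the 0/1 missingness indicators column against column.
nullity_corr = with_gaps.isna().astype(int).corr()
print(nullity_corr.round(2))
```

A value of 1 means two columns are always missing in the same rows, while a negative value means one tends to be present when the other is missing.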

Since multiple columns have missing values, let's drop rows that have more than 4 missing values because we can't work with a university that's missing a lot of its attributes.

In [ ]:

print(len(university_df[university_df.isnull().sum(axis=1) > 4]))
drop_index = university_df[university_df.isnull().sum(axis=1) > 4].index.tolist()
university_df.drop(drop_index, inplace=True)
print('Rows which have more than 4 null values have been dropped!')
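
Here is a small sketch of the same row filter on a toy frame with made-up values. It also shows that `dropna(thresh=...)` expresses the same rule: "more than 4 missing" in a six-column frame means "fewer than 2 values present".

```python
import numpy as np
import pandas as pd

# Toy frame with six columns (made-up values).
toy_df = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [1, np.nan, 2],
    "c": [1, np.nan, np.nan],
    "d": [1, np.nan, 4],
    "e": [1, np.nan, np.nan],
    "f": [1, 6, np.nan],
})

# Count the missing values in each row: 0, 5 and 4 respectively.
nulls_per_row = toy_df.isnull().sum(axis=1)

# Keep rows with at most 4 missing values.
kept = toy_df[nulls_per_row <= 4]

# Same rule via dropna: require at least (n_columns - 4) non-missing values.
kept_alt = toy_df.dropna(thresh=toy_df.shape[1] - 4)

print(kept.index.tolist())
```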

Let's drop the 'link' and 'logo' columns as they are hyperlinks. Although the 'score' column could be very useful for analysis, it is missing nearly 56% of its values. When I looked for these values on the QS website, I could see that a score is given only for the top 500 universities, although 1000+ universities have been ranked. So, I'm ignoring this column as well.

In [ ]:

university_df.drop(['link', 'logo', 'score'], axis=1, inplace=True)

Converting the 'international_students', 'faculty_count' and 'rank_display' columns to numeric by removing the special characters in them, and making the 'research_output' labels consistent.

In [ ]:

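# make the inconsistent label 'Very high' match 'Very High'
university_df['research_output'] = university_df['research_output'].replace('Very high', 'Very High')
# strip thousands separators and convert to float (NaN stays NaN)
university_df['international_students'] = university_df['international_students'].apply(lambda x: float(str(x).replace(',','')))
university_df['faculty_count'] = university_df['faculty_count'].apply(lambda x: float(str(x).replace(',','')))
# keep the first run of digits so that a range such as '501-510' maps to its lower bound;
# deleting every special character instead would turn it into 501510
university_df['rank_display'] = university_df['rank_display'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)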
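
Rank strings need some care when they are turned into numbers. Here is a minimal sketch comparing two ways of doing it. The sample strings are assumed examples of tied, open-ended and range ranks, and `first_number` is a hypothetical helper.

```python
import re

# Assumed examples of how ranks may be displayed (ties, open-ended, ranges).
sample_ranks = ["1", "=5", "601+", "501-510"]

# Option 1: delete every non-word character, then convert.
# A range loses its hyphen and the two bounds are glued together.
stripped = [float(re.sub(r"\W+", "", rank)) for rank in sample_ranks]


# Option 2: keep only the first run of digits (the lower bound of a range).
def first_number(text):
    match = re.search(r"\d+", text)
    return float(match.group()) if match else float("nan")


lower_bounds = [first_number(rank) for rank in sample_ranks]

print(stripped)
print(lower_bounds)
```

The two options agree for single ranks and ties, and differ only for ranges, so it is worth checking which kinds of strings the column actually contains.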

Visualizing universities by year and type

In [ ]:

year_df = university_df['year'].value_counts().sort_values()
fig, ax = plt.subplots(figsize=(10,4), dpi=90)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.tick_params(bottom=False)
ax.get_yaxis().set_visible(False)

sns.countplot(data=university_df, x='year', palette=custom_palette1);

# add values on top of each bar
ax.bar_label(ax.containers[0])

ax.set_xlabel('Year', fontsize=13, color = '#ff4800');
fig.suptitle('Number of universities ranked over the years', fontsize=15, color = '#ff4800');

With each year, more and more universities are considered for the rankings and 2022 has the highest number of universities.

In [ ]:

type_df = university_df['type'].value_counts()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,5))

pie_bar_colors = ['#FB8E7E','#8EC9BB']
explode = [0,0.1]
ax1.pie(type_df.values, labels=type_df.index, explode=explode, colors=pie_bar_colors, autopct='%1.1f%%')
ax1.axis('equal')

ax2.bar(type_df.index, type_df.values, color=pie_bar_colors)
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.spines['left'].set_visible(False)
ax2.tick_params(axis='both', which='both', labelsize=10, left=False, bottom=False)
ax2.get_yaxis().set_visible(False)
plt.title("University Types", fontsize=15, color = '#ff4800');

ax2.bar_label(ax2.containers[0])

fig.tight_layout()
fig.subplots_adjust(wspace=0.7)

If you do a simple Google search, you can find many websites claiming that private universities are better than public universities because they tend to have better rankings. Well, that's not the case here, at least by count: more than 80% of the universities ranked are public.

Distribution of universities across the world 🌏

Now, let's take a look at the geography of the universities.

Universities by Continents 🗺

In [ ]:

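# .str.replace skips missing values instead of raising an error
university_df['region'] = university_df['region'].str.replace('Latin America', 'South America', regex=False)
# name the columns explicitly ('index' = continent, 'region' = count) so they do not depend on the pandas version
region_sum = university_df['region'].value_counts().rename_axis('index').reset_index(name='region')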

# define colors
colors = ['#f8e3ca','#f8d1b4','#f7bf9e','#f6ad88','#f69b72','#f5895c']
cmap = ListedColormap(colors)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 10), dpi=150, gridspec_kw={'height_ratios': [1, 2]})

# create barplot
ax1.bar(region_sum['index'], region_sum['region'], color=colors[::-1])
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax1.spines['left'].set_visible(False)
ax1.tick_params(bottom=False)
ax1.get_yaxis().set_visible(False)
ax1.bar_label(ax1.containers[0])

fig.suptitle('Distribution of universities across continents', fontsize=13, color = '#ff4800');

# create worldmap
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
worldmap_df = world.set_index('continent').join(region_sum.set_index('index')).reset_index()
to_be_mapped = 'region'
legend_labels = region_sum['index'].tolist()[::-1]

worldmap_df.plot(column=to_be_mapped, 
            cmap=cmap, 
            linewidth=0.8, 
            ax=ax2, 
            edgecolors='0.8', 
            legend=True, 
            categorical=True,
           )

leg = ax2.get_legend()
for text, label in zip(leg.get_texts(), legend_labels):
    text.set_text(label)

leg.set_bbox_to_anchor((1.15,0.5))
# leg.edgecolors('#ff4800')
frame = leg.get_frame()
frame.set_edgecolor('#ff4800')
ax2.set_axis_off()
plt.subplots_adjust(wspace=0, hspace=0)

Europe is the continent with the most ranked universities, though we have to keep in mind that the dataset places Russia in Europe although it spans both Europe and Asia. It is followed by Asia and North America.

Universities by Countries 🏫

In [ ]:

print('Number of countries with ranked universities: ',university_df['country'].nunique())

Out of the 195 countries in the world, only 97 countries have universities that are ranked.

In [ ]:

uni_df = university_df['university'].value_counts()

fig, ax = plt.subplots(figsize=(10,20), dpi=150)

sns.countplot(data=university_df, y='country', order=university_df.country.value_counts().index, palette=custom_palette1);
plt.xlabel('Number of universities', fontsize=12, color = '#ff4800')
plt.ylabel('Country', fontsize=12, color = '#ff4800')
plt.title("Distribution of universities across countries", fontsize=14, color = '#ff4800');

# plt.savefig('countrywise.png')

The United States has the most universities ranked over the years, followed by the United Kingdom and Germany.

Universities by Cities 🌃

In [ ]:

sorted_df = university_df.sort_values(by='rank_display').drop_duplicates('university')
sorted_df = pd.DataFrame(sorted_df['city'].value_counts()[:20])
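
The cell above relies on the order of the two steps: sorting by rank first means that `drop_duplicates` keeps each university's best-ranked row. Here is a small sketch on made-up data.

```python
import pandas as pd

# Toy frame (made-up values): the same university appears once per year.
toy_df = pd.DataFrame({
    "university": ["U1", "U1", "U2", "U3", "U3"],
    "year": [2021, 2022, 2022, 2021, 2022],
    "rank_display": [3.0, 2.0, 10.0, 50.0, 55.0],
    "city": ["London", "London", "London", "Paris", "Paris"],
})

# After sorting, the first row seen for each university is its best rank,
# and drop_duplicates keeps that first row by default.
best_rows = toy_df.sort_values(by="rank_display").drop_duplicates("university")

# Each university now counts once towards its city.
city_counts = best_rows["city"].value_counts()
print(city_counts)
```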

In [ ]:

fig, ax = plt.subplots(figsize=(14,5), dpi=100)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.tick_params(bottom=False)
ax.get_yaxis().set_visible(False)

sns.barplot(data=sorted_df, y='city', x=sorted_df.index, palette=custom_palette2)
plt.xticks(rotation=90);

ax.bar_label(ax.containers[0])

ax.set_xlabel('City Name', fontsize=13, color = '#ff4800');
fig.suptitle('Distribution of universities across cities', fontsize=15, color = '#ff4800');

# plt.savefig('countrywise.png')

The above graph shows the top 20 cities by number of unique universities. London is an academic hotspot with a whopping 19 universities that are ranked globally!

Ranking of Top 10 Universities 🔝

In [ ]:

# university_df.sort_values('rank_display')[:60]
top_unis = ['Massachusetts Institute of Technology (MIT) ', 'Stanford University', 'University of Oxford', 'Harvard University', 'University of Cambridge', 'California Institute of Technology (Caltech)', 'ETH Zurich - Swiss Federal Institute of Technology', 'Imperial College London', 'UCL', 'University of Chicago']
topunis_df = university_df[university_df['university'].isin(top_unis)][['year','university','rank_display']].reset_index(drop=True)

In [ ]:

fig = plt.figure(figsize=(15,15), dpi=100)

for i, uni in enumerate(top_unis, start=1):
    # sort by year so the line is drawn in chronological order
    new_df = topunis_df[topunis_df['university'] == uni].sort_values('year')
    ax = fig.add_subplot(5, 2, i)
    ax.plot(new_df['year'], new_df['rank_display'], color='#003f5c', linewidth=1.5)
    # invert the y-axis so that rank 1 appears at the top
    ax.invert_yaxis()
    ax.set_title(uni, color='#ff4800')
    
fig.subplots_adjust(wspace=0.2, hspace=0.6, top=0.92)
fig.suptitle('Ranking of top 10 universities from 2017 to 2022', fontsize=15, color = '#ff4800');

By taking a quick look at the dataframe, I have made a list of the top 10 universities ranked over the years. These 10 universities have a tendency to occupy the top 10 positions consistently.

MIT is the undisputed leader of the QS Rankings, ranked number 1 every year.
Stanford and Harvard have dropped down this year for the first time since 2017.
University of Oxford, the oldest university in the English-speaking world, has jumped from Rank 6 to Rank 2.
Overall, universities from the UK have climbed in the rankings, while most of the US universities have dropped down this year (2022).
Out of the top 10, only one university, ETH Zurich (Switzerland), is from a country other than the US or the UK.

QS World Rankings - Contributing Factors ⚖️

Let's explore the metrics used to gauge the universities. Out of the 6 metrics that have been used, only 3 are present in this dataset.

Research output - 20%
Student Faculty ratio - 20%
International students - 5%

So, our analysis will account only for 45% of the survey methodology.
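
The 45% figure is simply the sum of the three weights listed above. Here is a quick check, using the metric names as this notebook labels them.

```python
# Weights (in percent) of the three metrics available in the dataset, as listed above.
available_weights = {
    "research_output": 20,
    "student_faculty_ratio": 20,
    "international_students": 5,
}

covered = sum(available_weights.values())
not_covered = 100 - covered

print(f"Covered: {covered}% of the methodology, not covered: {not_covered}%")
```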

Research Output 🔬

Next to teaching, academic research is viewed as a very important factor. Understanding research output can give us insights into how the top universities prioritize it.

In [ ]:

fig, ax = plt.subplots(figsize=(8,4), dpi=90)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)

ax.tick_params(bottom=False)
ax.get_yaxis().set_visible(False)

sns.countplot(data=university_df, x='research_output', hue='type', palette=research_palette);

for container in ax.containers:
    ax.bar_label(container)

plt.legend(edgecolor='#ff4800');
ax.set_xlabel('Research Output', fontsize=13, color = '#ff4800');
fig.suptitle('Research output of universities', fontsize=15, color = '#ff4800');

Clearly, most of the universities under consideration have "Very High" research output. By count, public universities outperform private universities in terms of research.
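
Raw counts favour whichever group is larger, so it can help to look at shares within each university type as well. Here is a minimal sketch with `pd.crosstab` on made-up data; the numbers are not from the dataset.

```python
import pandas as pd

# Toy frame (made-up values): 8 public and 2 private universities.
toy_df = pd.DataFrame({
    "type": ["Public"] * 8 + ["Private"] * 2,
    "research_output": ["Very High"] * 4 + ["High"] * 4 + ["Very High"] * 2,
})

# Raw counts per type and research category.
counts = pd.crosstab(toy_df["type"], toy_df["research_output"])

# Share of each research category within each type (every row sums to 1).
share_by_type = pd.crosstab(
    toy_df["type"], toy_df["research_output"], normalize="index"
)

print(counts)
print(share_by_type)
```

In this toy example the public group has more "Very High" universities in absolute terms, while the private group has the higher share of them.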

In [ ]:

fig, ax = plt.subplots(figsize=(8,4), dpi=90)

sns.barplot(data=university_df, x='research_output', y='faculty_count', hue='type', ci=None, palette=research_palette)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(bottom=False)
ax.set_xlabel('Research Output', fontsize=13, color = '#ff4800')
ax.set_ylabel('Faculty Count', fontsize=13, color = '#ff4800')

ax.legend(edgecolor='#ff4800',bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
fig.suptitle('Research output Vs Faculty Count', fontsize=15, color = '#ff4800');

As far as faculty numbers are concerned,

Universities with "Very High" research output have more staff.

So, does this mean universities with a higher number of faculty do better research? Not necessarily. Universities with "Very High" research output may attract more accomplished academics and researchers because of their reputation along with many other factors.

We can see that public universities with "Very High" and "Low" research output have nearly equal numbers of staff.
Also, private universities with "Low" research output have more staff than those with "High" output.

This may mean that not every university puts an emphasis on research, even when it has a large academic staff.

In [ ]:

research_size = pd.DataFrame(university_df.groupby(['research_output']).apply(lambda df: df['size'].value_counts()))

In [ ]:

research_size

In [ ]:

research_size = research_size.reset_index().rename(columns={'level_1': 'size', 'size': 'count'})
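
An alternative way to build the same kind of count table is to group by both columns and name the count column directly, which avoids renaming 'level_1' afterwards. This is a sketch on made-up data; the size codes are assumed examples.

```python
import pandas as pd

# Toy frame (made-up values; size codes are assumed examples).
toy_df = pd.DataFrame({
    "research_output": ["Very High", "Very High", "Very High", "High", "High"],
    "size": ["L", "L", "XL", "M", "L"],
})

# One row per (research_output, size) pair with the number of universities in it.
research_size_alt = (
    toy_df.groupby(["research_output", "size"])
    .size()
    .reset_index(name="count")
)

print(research_size_alt)
```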

💡 A quick intro on how to interpret pointplot.

A pointplot shows an estimate of mean value for a numeric variable by using scatter plot points. This can be particularly useful for comparing different levels of a categorical variable. The lines joining the pointplot can be used to judge the differences between slopes easily. For more info, refer here
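
The marker positions in a pointplot can be reproduced without plotting: with the default settings, each point is the mean of the numeric variable within one category. Here is a small sketch on made-up values.

```python
import pandas as pd

# Toy frame (made-up values; size codes are assumed examples).
toy_df = pd.DataFrame({
    "size": ["S", "S", "L", "L"],
    "student_faculty_ratio": [6.0, 8.0, 14.0, 18.0],
})

# The height of each point: the mean of the numeric column per category.
point_estimates = toy_df.groupby("size")["student_faculty_ratio"].mean()
print(point_estimates)
```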

In [ ]:

sns.catplot(x="research_output", y="count", kind="point", data=research_size, hue='size', palette=custom_palette1);
plt.xlabel('Research Output', fontsize=13, color = '#ff4800')
plt.ylabel('Count', fontsize=13, color = '#ff4800')
plt.title('Research output Vs Size of university', fontsize=15, color = '#ff4800');

The relationship between the size of the university and research output is fairly clear. Universities with "Very High" and "High" research output are larger in size compared to those with "Medium" and "Low" output.

Student Faculty Ratio 👩🏻‍🎓→🧑🏻‍🏫

Student Faculty Ratio is an interesting measure. According to QS, "It is usually cited by students as a metric of highest importance to them". The lower the ratio, the better the university does on this metric. A faculty member with fewer students assigned to them can dedicate more focus and attention to each individual.

In [ ]:

university_df['student_faculty_ratio'].describe()

On average, universities have about 13 students per faculty member.
Some universities have as few as 1 student per faculty member.
Others have as many as 67 students per faculty member.

In [ ]:

plt.figure(figsize=(10,3), dpi=100)
sns.histplot(data=university_df, x='student_faculty_ratio', bins=60, color=student_faculty_palette[1]);
plt.xlabel('Student Faculty Ratio', color = '#ff4800')
plt.ylabel('Count', color = '#ff4800')
plt.title('Distribution of Student Faculty Ratio', fontsize=15, color = '#ff4800');

We have a right-skewed distribution. The outliers don't seem to affect the mean much. Most of the universities have somewhere between 5 and 20 students per faculty member.
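
One way to check how much the outliers pull the mean is to compare it with the median. Here is a minimal sketch on made-up ratios; the values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Made-up ratios: most values are small, one is very large.
ratios = pd.Series([6, 8, 9, 10, 11, 12, 13, 15, 67], dtype=float)

# In a right-skewed sample the mean sits above the median,
# and the sample skewness is positive.
print("mean:  ", round(ratios.mean(), 2))
print("median:", ratios.median())
print("skew:  ", round(ratios.skew(), 2))
```

If the mean and the median of 'student_faculty_ratio' are close to each other, the long right tail has little influence on the average.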

In [ ]:

plt.figure(figsize=(8,4), dpi=100)
sns.boxplot(data=university_df, y='student_faculty_ratio', x='research_output', hue='type', palette=student_faculty_palette);
plt.xlabel('Research Output', fontsize=11, color = '#ff4800')
plt.ylabel('Student Faculty Ratio',color = '#ff4800')
plt.title('Student Faculty Ratio vs Research Output', color = '#ff4800');

Universities with "Very High" research output have a much lower student faculty ratio than the rest.

In [ ]:

sns.catplot(x="size", y="student_faculty_ratio", kind="point", data=university_df, hue='type', palette=student_faculty_palette);
plt.xlabel('Size', color = '#ff4800')
plt.ylabel('Student Faculty Ratio', color = '#ff4800')
plt.title('Student Faculty Ratio Vs Size of university', fontsize=15, color = '#ff4800');

Private universities have a much lower student faculty ratio than public universities across sizes. Another interesting observation is that the average student faculty ratio seems to increase with the size of the university.

International Students 🌐

A university that attracts students from across the world demonstrates a global outlook and possesses multicultural diversity on its campus.

In [ ]:

university_df['international_students'].describe()

On average, universities have 1,900+ international students.
One university has an international student intake as high as 31,000+. Let's take a look at it.

In [ ]:

# idxmax() returns an index label, so select the row with .loc rather than .iloc
university_df.loc[university_df['international_students'].idxmax()]
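
Since rows were dropped earlier without resetting the index, labels and positions no longer line up. `idxmax()` returns a label, which pairs with `.loc`, whereas `.iloc` expects a position. Here is a small sketch of the difference on made-up data.

```python
import pandas as pd

# Toy frame (made-up values).
toy_df = pd.DataFrame({
    "university": ["A", "B", "C", "D", "E"],
    "international_students": [100.0, 900.0, 300.0, 800.0, 500.0],
})

# Dropping a row leaves a gap in the index labels: 0, 2, 3, 4.
toy_df = toy_df.drop(index=[1])

# idxmax() returns the index label of the largest value (here the label 3).
label = toy_df["international_students"].idxmax()

by_label = toy_df.loc[label, "university"]      # the row with the maximum
by_position = toy_df.iloc[label]["university"]  # the fourth row, a different one

print(by_label, by_position)
```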

In [ ]:

plt.figure(figsize=(10,3), dpi=100)
sns.histplot(data=university_df, x='international_students', bins=50, color=international_palette[0]);
plt.xlabel('International Students', color = '#ff4800')
plt.ylabel('Count', color = '#ff4800')
plt.title('Distribution of International Students', fontsize=15, color = '#ff4800');

We have a right-skewed distribution here as well. There are very few outliers. Most of the universities have an intake between 0 and 5,000.

In [ ]:

sns.catplot(kind='point', data=university_df, x='research_output', y='international_students', order=university_df['research_output'].value_counts().index, ci=None, hue='type', palette=international_palette);
plt.xlabel('Research Output', color = '#ff4800')
plt.ylabel('International Students', color = '#ff4800')
plt.title('International Students Vs Research Output', fontsize=15, color = '#ff4800');

International students tend to prefer public universities with "Very High" research output. Due to lower tuition fees compared to private ones? 🤔 Maybe. 🤷🏻‍♀️

Most popular country of choice for International Students ✅

And for the last part, which country is most popular among international students? Can you guess before going down?

In [ ]:

intstu_country = pd.DataFrame(university_df.groupby(['country'], sort=False)['international_students'].sum().sort_values(ascending=False)[:10])
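
Note that this sum runs over every row, so a university contributes once for each year it appears. Here is a minimal sketch on made-up data contrasting the all-years total with a single-year total; the country names and numbers are placeholders.

```python
import pandas as pd

# Toy frame (made-up values): one university per country, two years each.
toy_df = pd.DataFrame({
    "country": ["X", "X", "Y", "Y"],
    "year": [2021, 2022, 2021, 2022],
    "international_students": [100.0, 120.0, 150.0, 50.0],
})

# Summing over every row adds up all years, as in the cell above.
all_years = toy_df.groupby("country")["international_students"].sum()

# Restricting to the latest year counts each university only once.
latest = toy_df[toy_df["year"] == toy_df["year"].max()]
latest_only = latest.groupby("country")["international_students"].sum()

print(all_years)
print(latest_only)
```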

In [ ]:

fig, ax = plt.subplots(figsize=(10,4), dpi=100)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.tick_params(left=False)
ax.get_xaxis().set_visible(False)

sns.barplot(data=intstu_country, x='international_students', y=intstu_country.index, palette=custom_palette1);

ax.bar_label(ax.containers[0], fmt = '%d')

ax.set_ylabel('Country', fontsize=13, color = '#ff4800');
fig.suptitle('Country of choice for International Students from 2017 - 2022', fontsize=15, color = '#ff4800');

It's the USA 🇺🇸, closely followed by the UK 🇬🇧!

If you've come down this far, "THANK YOU!". Let me know in the comments if you have any feedback, criticisms or concerns.

Credits and Acknowledgement

Did I say those words? If you like my work, do hit the upvote button! No, I didn't. 😁
