CoCalc -- LAB WORK Pandas Data Preparation & EDA.ipynb

GitHub Repository: suyashi29/python-su
Path: blob/master/ML Classification using Python/LAB WORK Pandas Data Preparation & EDA.ipynb

Kernel: Python 3 (ipykernel)

In [2]:

import pandas as pd
import numpy as np

np.random.seed(42)

data = {
    'EmployeeID': range(101, 121),
    'Age': [25, 28, 32, 45, 41, 38, np.nan, 29, 34, 36, 40, 22, 27, 30, np.nan, 33, 44, 39, 26, 31],
    'Department': ['IT', 'HR', 'Finance', 'IT', 'Marketing', 'Finance', 'IT', 'HR', 
                   'Finance', 'IT', 'Marketing', 'Finance', np.nan, 'HR', 'IT', 
                   'Finance', 'IT', 'Marketing', 'HR', np.nan],
    'ExperienceYears': [1, 3, 5, 10, np.nan, 7, 9, 2, 4, np.nan, 12, 1, 6, 3, 5, np.nan, 8, 11, 2, 4],
    'TrainingHours': [15, 20, np.nan, 40, 25, 35, 18, np.nan, 22, 28, 30, 12, 17, 24, 26, 33, np.nan, 29, 21, 23],
    'SkillScore': [70, 76, 80, 90, np.nan, 85, 88, 72, 78, 92, 95, 68, np.nan, 74, 82, 87, 91, np.nan, 77, 79],
    'PerformanceRating': [3, 4, 4, 5, np.nan, 4, 5, 3, np.nan, 5, 5, 2, 3, 4, np.nan, 4, 5, 4, 3, 4]
}

df_lab = pd.DataFrame(data)
df_lab


LAB QUESTIONS (12 Tasks)

Participants must solve every task with pandas, working on the `df_lab` dataset created above.

Section A: Basic Exploration

1️⃣ Display the first 7 rows and the last 5 rows.

Hint: Use .head() and .tail().
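As a minimal sketch of the hint, the snippet below slices the top and bottom of a frame. `df_demo` is a hypothetical stand-in with the same number of rows as `df_lab`; its `Age` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_lab: 20 rows, same EmployeeID range.
df_demo = pd.DataFrame({
    "EmployeeID": range(101, 121),
    "Age": np.arange(20, 40),
})

first_seven = df_demo.head(7)  # first 7 rows
last_five = df_demo.tail(5)    # last 5 rows

print(first_seven)
print(last_five)
```

Both methods default to 5 rows when called without an argument, so only `.head(7)` needs an explicit count here.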

2️⃣ Print the summary of all column datatypes and missing values.

Hint: .info(), .isnull().sum().
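One possible way to combine the two hinted calls is sketched here on a small, made-up frame (`df_demo` is an assumption, not the lab data). `.info()` prints dtypes and non-null counts, while `.isnull().sum()` returns the missing count per column as a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_lab with a few gaps.
df_demo = pd.DataFrame({
    "Age": [25, np.nan, 32, 45],
    "Department": ["IT", "HR", np.nan, np.nan],
    "SkillScore": [70, 76, 80, 90],
})

df_demo.info()                           # prints dtypes and non-null counts
missing_counts = df_demo.isnull().sum()  # Series: column name -> number of NaNs
print(missing_counts)
```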

3️⃣ Show descriptive statistics for numerical columns.

Hint: .describe().
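A short illustration of what `.describe()` returns, using a hypothetical two-column frame: by default only numeric columns are summarized, so the text column is left out.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
df_demo = pd.DataFrame({
    "Age": [20, 30, 40],
    "Department": ["IT", "HR", "IT"],
})

stats = df_demo.describe()  # count, mean, std, min, quartiles, max
print(stats)
```

Passing `include="all"` would also summarize non-numeric columns, if that is wanted.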

Section B: Cleaning & Missing Value Handling

4️⃣ Identify which 3 columns have the highest number of missing values.
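One way to approach this, sketched on a made-up frame (`df_demo` and its columns `A` to `D` are hypothetical), is to count missing values per column and keep the three largest counts.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a different number of gaps in each column.
df_demo = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],  # 3 missing
    "B": [1, 2, np.nan, np.nan],       # 2 missing
    "C": [1, 2, 3, np.nan],            # 1 missing
    "D": [1, 2, 3, 4],                 # 0 missing
})

# Count NaNs per column, then keep the three largest counts.
top_missing = df_demo.isnull().sum().nlargest(3)
print(top_missing)
```

In `df_lab` as defined above, four columns (`ExperienceYears`, `TrainingHours`, `SkillScore`, `PerformanceRating`) tie at three missing values each, so the "top 3" depends on tie-breaking. `nlargest` keeps the first occurrences by default; `keep="all"` would return every tied column instead.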

5️⃣ Fill missing values in:

- Age → with the mean
- Department → with the mode
- TrainingHours → with the median

(Participants decide which pandas method is appropriate.)
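The three fill rules above could be sketched as follows. This is one possible approach using `fillna`, shown on a hypothetical miniature frame rather than `df_lab`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature with one gap per column.
df_demo = pd.DataFrame({
    "Age": [20, 30, np.nan, 40],
    "Department": ["IT", "HR", "IT", np.nan],
    "TrainingHours": [10, np.nan, 20, 60],
})

# Numeric columns: fill with a statistic computed from the non-missing values.
df_demo["Age"] = df_demo["Age"].fillna(df_demo["Age"].mean())
df_demo["TrainingHours"] = df_demo["TrainingHours"].fillna(
    df_demo["TrainingHours"].median()
)

# Categorical column: .mode() returns a Series (ties are possible), so take the first entry.
df_demo["Department"] = df_demo["Department"].fillna(df_demo["Department"].mode()[0])

print(df_demo)
```

Assigning the result back to the column avoids relying on `inplace=True`, which can behave unexpectedly on column selections.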

6️⃣ Remove rows where both `SkillScore` AND `PerformanceRating` are missing.
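A hedged sketch of the "both missing" rule: `dropna` with `how="all"` and a `subset` drops a row only when every listed column is NaN. The frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: only the row at index 2 is missing both values.
df_demo = pd.DataFrame({
    "SkillScore": [70, np.nan, np.nan, 90],
    "PerformanceRating": [3, 4, np.nan, np.nan],
})

# how="all" drops a row only if ALL columns in `subset` are NaN.
cleaned = df_demo.dropna(subset=["SkillScore", "PerformanceRating"], how="all")
print(cleaned)
```

Using `how="any"` (the default) would instead drop rows where either column is missing, which is a stricter rule than the task asks for.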

Section C: Column Operations

7️⃣ Create a new column: `TrainingEfficiency` = SkillScore / TrainingHours.
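The ratio can be computed with plain column arithmetic, as in this small sketch on made-up values. Rows where either operand is NaN produce NaN, so the result depends on whether missing values were filled beforehand.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the last row has no TrainingHours.
df_demo = pd.DataFrame({
    "SkillScore": [70, 80, 90],
    "TrainingHours": [14, 20, np.nan],
})

# Element-wise division; NaN propagates to the new column.
df_demo["TrainingEfficiency"] = df_demo["SkillScore"] / df_demo["TrainingHours"]
print(df_demo)
```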

8️⃣ Create another column: `Category`

Rules:

If PerformanceRating ≥ 4 → “High”
Else → “Low”

(Use np.where() or apply().)
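The rule above might look like this with `np.where`, shown on hypothetical ratings. Note one assumption baked into this sketch: a missing rating compares as False, so it falls into "Low" unless it is handled separately.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, including one missing value.
df_demo = pd.DataFrame({"PerformanceRating": [3, 4, 5, np.nan]})

# np.where(condition, value_if_true, value_if_false), applied element-wise.
df_demo["Category"] = np.where(df_demo["PerformanceRating"] >= 4, "High", "Low")
print(df_demo)
```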

Section D: Filtering & Sorting

9️⃣ Filter employees who:

Work in IT
Have SkillScore > 85
Have ExperienceYears ≥ 5

(Multiple conditions required.)
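The three conditions can be combined with boolean masks, as in this sketch on a hypothetical frame. Each comparison needs its own parentheses because `&` binds more tightly than `==`, `>`, and `>=`.

```python
import pandas as pd

# Hypothetical employees; only the first row satisfies all three conditions.
df_demo = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "IT"],
    "SkillScore": [90, 80, 95, 88],
    "ExperienceYears": [10, 9, 7, 3],
})

mask = (
    (df_demo["Department"] == "IT")
    & (df_demo["SkillScore"] > 85)
    & (df_demo["ExperienceYears"] >= 5)
)
filtered = df_demo[mask]
print(filtered)
```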

🔟 Sort the dataset by `ExperienceYears` (descending) and `SkillScore` (ascending).
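A mixed-direction sort can be expressed by passing one flag per key to `sort_values`. The frame below is hypothetical and includes a tie and a missing value to show how both are handled.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two rows tie on ExperienceYears, one row is missing it.
df_demo = pd.DataFrame({
    "ExperienceYears": [5, 10, 5, np.nan],
    "SkillScore": [80, 90, 70, 60],
})

# One ascending flag per sort key: ExperienceYears descending, SkillScore ascending.
sorted_df = df_demo.sort_values(
    by=["ExperienceYears", "SkillScore"],
    ascending=[False, True],
)
print(sorted_df)
```

Rows with a missing sort key are placed last by default (`na_position="last"`), regardless of sort direction.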

Section E: Grouping & Aggregation

1️⃣1️⃣ Compute the average:

SkillScore by Department
PerformanceRating by Department

(Use .groupby().)
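Both averages can come from a single `groupby`, as this sketch on hypothetical data suggests. By default, rows with a missing `Department` are excluded from the groups, and `mean` skips NaN values within each group.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing department and one missing rating.
df_demo = pd.DataFrame({
    "Department": ["IT", "IT", "HR", np.nan],
    "SkillScore": [80, 90, 70, 60],
    "PerformanceRating": [4, np.nan, 3, 5],
})

# Select both columns after grouping so one call yields both averages.
dept_means = df_demo.groupby("Department")[["SkillScore", "PerformanceRating"]].mean()
print(dept_means)
```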

Section F: Final Step

1️⃣2️⃣ Save the cleaned dataset as `Employee_Cleaned.csv`.
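Writing the file is a single `to_csv` call. The sketch below uses a hypothetical two-row frame and writes into a temporary directory (an assumption made so the example leaves no files behind), then reads the file back to confirm the round trip.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned dataset.
df_demo = pd.DataFrame({"EmployeeID": [101, 102], "Department": ["IT", "HR"]})

# A temporary directory keeps this sketch from touching the working directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "Employee_Cleaned.csv")
    df_demo.to_csv(csv_path, index=False)  # index=False omits the row index column
    reloaded = pd.read_csv(csv_path)

print(reloaded)
```

In the lab itself, calling `to_csv("Employee_Cleaned.csv", index=False)` on the cleaned frame would write the file next to the notebook.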
