suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/ML Classification using Python/LAB WORK Pandas Data Preparation & EDA.ipynb
Kernel: Python 3 (ipykernel)
import pandas as pd
import numpy as np

np.random.seed(42)

# Employee dataset with deliberately missing values (np.nan) for the lab tasks
data = {
    'EmployeeID': range(101, 121),
    'Age': [25, 28, 32, 45, 41, 38, np.nan, 29, 34, 36,
            40, 22, 27, 30, np.nan, 33, 44, 39, 26, 31],
    'Department': ['IT', 'HR', 'Finance', 'IT', 'Marketing', 'Finance', 'IT', 'HR', 'Finance', 'IT',
                   'Marketing', 'Finance', np.nan, 'HR', 'IT', 'Finance', 'IT', 'Marketing', 'HR', np.nan],
    'ExperienceYears': [1, 3, 5, 10, np.nan, 7, 9, 2, 4, np.nan,
                        12, 1, 6, 3, 5, np.nan, 8, 11, 2, 4],
    'TrainingHours': [15, 20, np.nan, 40, 25, 35, 18, np.nan, 22, 28,
                      30, 12, 17, 24, 26, 33, np.nan, 29, 21, 23],
    'SkillScore': [70, 76, 80, 90, np.nan, 85, 88, 72, 78, 92,
                   95, 68, np.nan, 74, 82, 87, 91, np.nan, 77, 79],
    'PerformanceRating': [3, 4, 4, 5, np.nan, 4, 5, 3, np.nan, 5,
                          5, 2, 3, 4, np.nan, 4, 5, 4, 3, 4]
}

df_lab = pd.DataFrame(data)
df_lab

LAB QUESTIONS (12 Tasks)

Participants must solve these questions with Pandas, using the dataset df_lab.


Section A: Basic Exploration

1️⃣ Display the first 7 rows and the last 5 rows.

Hint: Use .head() and .tail().
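As a minimal sketch of the hint, the snippet below uses a hypothetical 10-row frame named `toy` (an assumption for illustration, not df_lab) to show how `.head(n)` and `.tail(n)` select rows by position.

```python
import pandas as pd

# Hypothetical stand-in for df_lab: 10 rows so the two slices are easy to check
toy = pd.DataFrame({"EmployeeID": range(101, 111), "Age": range(25, 35)})

first_seven = toy.head(7)  # rows at positions 0..6
last_five = toy.tail(5)    # rows at positions 5..9

print(first_seven)
print(last_five)
```

The same two calls apply unchanged to df_lab.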

2️⃣ Print the summary of all column datatypes and missing values.

Hint: .info(), .isnull().sum().
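A short sketch of the two hinted calls on a hypothetical three-row frame `toy` (illustrative data, not df_lab): `.info()` prints dtypes and non-null counts, while `.isnull().sum()` returns the missing-value count per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_lab with one missing value per column
toy = pd.DataFrame({
    "Age": [25, np.nan, 32],
    "Department": ["IT", np.nan, "HR"],
    "SkillScore": [70.0, 76.0, np.nan],
})

toy.info()                    # prints column dtypes and non-null counts
missing = toy.isnull().sum()  # Series: column name -> number of missing values
print(missing)
```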

3️⃣ Show descriptive statistics for numerical columns.

Hint: .describe().
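A minimal sketch, assuming a hypothetical frame `toy` with one numeric and one text column: on a mixed frame, `.describe()` summarizes only the numeric columns by default.

```python
import pandas as pd

# Hypothetical stand-in for df_lab: one numeric column, one text column
toy = pd.DataFrame({"Age": [20, 30, 40], "Department": ["IT", "HR", "IT"]})

stats = toy.describe()  # count, mean, std, min, quartiles, max for numeric columns
print(stats)
```

Passing `include="all"` would also summarize the text column, which is not required by this task.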


Section B: Cleaning & Missing Value Handling

4️⃣ Identify which 3 columns have the highest number of missing values.
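One possible approach, sketched on a hypothetical frame `toy` (illustrative values, not df_lab): count missing values per column, then keep the three largest counts with `Series.nlargest`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_lab with 0, 1, 3 and 2 missing values per column
toy = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4],
    "Age": [25, np.nan, 32, 45],
    "ExperienceYears": [np.nan, np.nan, np.nan, 10],
    "TrainingHours": [15, np.nan, np.nan, 40],
})

missing_counts = toy.isnull().sum()  # missing values per column
top3 = missing_counts.nlargest(3)    # three columns with the most missing values
print(top3)
```

When several columns tie on the count, `nlargest` keeps the ones that appear first, so it is worth printing the full `missing_counts` as well before settling on three.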

5️⃣ Fill missing values in:

  • Age → with mean

  • Department → with mode

  • TrainingHours → with median

(Participants decide the correct Pandas method.)
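A sketch of one way to do the three fills, using `fillna` on a hypothetical frame `toy` (illustrative values, not df_lab). Assigning the result back to the column is assumed here instead of `inplace=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_lab with one gap in each column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Department": ["IT", np.nan, "IT", "HR"],
    "TrainingHours": [10.0, 30.0, np.nan, 20.0],
})

# Numeric column: replace gaps with the column mean (30.0 here)
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: mode() returns a Series, so take its first entry ("IT" here)
toy["Department"] = toy["Department"].fillna(toy["Department"].mode()[0])

# Numeric column: replace gaps with the column median (20.0 here)
toy["TrainingHours"] = toy["TrainingHours"].fillna(toy["TrainingHours"].median())

print(toy)
```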

6️⃣ Remove rows where both SkillScore AND PerformanceRating are missing.
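The "both missing" condition can be sketched with `dropna(subset=..., how="all")`, shown here on a hypothetical frame `toy` (illustrative values, not df_lab). With `how="all"`, a row is dropped only when every column listed in `subset` is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_lab: only employee 102 lacks both values
toy = pd.DataFrame({
    "EmployeeID": [101, 102, 103, 104],
    "SkillScore": [70, np.nan, np.nan, 90],
    "PerformanceRating": [3, np.nan, 4, np.nan],
})

# Drop a row only if SkillScore AND PerformanceRating are both NaN
cleaned = toy.dropna(subset=["SkillScore", "PerformanceRating"], how="all")
print(cleaned)
```

Using `how="any"` instead would also remove employees 103 and 104, which is stricter than the task asks for.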


Section C: Column Operations

7️⃣ Create a new column: TrainingEfficiency = SkillScore / TrainingHours.
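A minimal sketch of the column arithmetic on a hypothetical frame `toy` (illustrative values, not df_lab). Division is element-wise, and a missing value on either side produces a missing result.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_lab with a gap in each input column
toy = pd.DataFrame({
    "SkillScore": [80.0, 90.0, np.nan],
    "TrainingHours": [20.0, np.nan, 30.0],
})

# Element-wise ratio; rows with a NaN in either column stay NaN
toy["TrainingEfficiency"] = toy["SkillScore"] / toy["TrainingHours"]
print(toy)
```

If the missing values have not been filled beforehand, the new column inherits those gaps.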

8️⃣ Create another column: Category

Rules:

  • If PerformanceRating ≥ 4 → “High”

  • Else → “Low”

(Use np.where() or apply().)
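The rule can be sketched with `np.where` on a hypothetical frame `toy` (illustrative values, not df_lab). Note that a comparison against NaN evaluates to False, so a missing rating falls into the "Low" branch in this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_lab, including one missing rating
toy = pd.DataFrame({"PerformanceRating": [5, 4, 3, np.nan]})

# "High" where the rating is at least 4, otherwise "Low" (NaN >= 4 is False)
toy["Category"] = np.where(toy["PerformanceRating"] >= 4, "High", "Low")
print(toy)
```

Whether an unrated employee should count as "Low" is a judgment call; filling or dropping missing ratings first avoids the question.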


Section D: Filtering & Sorting

9️⃣ Filter employees who:

  • Work in IT

  • Have SkillScore > 85

  • Have ExperienceYears ≥ 5

(Multiple conditions required.)
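Combining the three conditions can be sketched with boolean masks joined by `&`, shown on a hypothetical frame `toy` (illustrative values, not df_lab). Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical stand-in for df_lab: only employee 101 meets all three conditions
toy = pd.DataFrame({
    "EmployeeID": [101, 102, 103, 104],
    "Department": ["IT", "IT", "HR", "IT"],
    "SkillScore": [90, 85, 95, 88],
    "ExperienceYears": [10, 7, 9, 3],
})

mask = (
    (toy["Department"] == "IT")
    & (toy["SkillScore"] > 85)        # strictly greater, so 85 is excluded
    & (toy["ExperienceYears"] >= 5)   # 5 or more years
)
selected = toy[mask]
print(selected)
```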

🔟 Sort the dataset by ExperienceYears (descending) and SkillScore (ascending).
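A sketch of a two-key sort with mixed directions, on a hypothetical frame `toy` (illustrative values, not df_lab). `sort_values` accepts a list for `ascending`, one entry per sort key, and places missing values last by default.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_lab with a tie and a missing value in ExperienceYears
toy = pd.DataFrame({
    "EmployeeID": [101, 102, 103, 104],
    "ExperienceYears": [5, 10, 5, np.nan],
    "SkillScore": [82, 90, 78, 70],
})

# ExperienceYears descending, then SkillScore ascending within ties
ordered = toy.sort_values(
    by=["ExperienceYears", "SkillScore"],
    ascending=[False, True],
)
print(ordered)
```

The result is a new frame; the original row order of `toy` is unchanged unless the result is assigned back.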


Section E: Grouping & Aggregation

1️⃣1️⃣ Compute the average:

  • SkillScore by Department

  • PerformanceRating by Department

(Use .groupby().)
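Both averages can be computed in one pass, sketched here on a hypothetical frame `toy` (illustrative values, not df_lab). By default `groupby` skips rows whose group key is missing, and `mean` skips missing values inside each group.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_lab with a missing department and a missing score
toy = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", np.nan],
    "SkillScore": [80, 90, 70, np.nan, 60],
    "PerformanceRating": [4, 5, 3, 3, 2],
})

# One row per department, one column per averaged metric
averages = toy.groupby("Department")[["SkillScore", "PerformanceRating"]].mean()
print(averages)
```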


Section F: Final Step

Save the cleaned dataset as Employee_Cleaned.csv.
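A minimal sketch of the export, assuming a hypothetical two-row frame `cleaned` and a temporary directory so the example does not touch the working folder. In the lab itself the file name alone would write to the notebook's current directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned df_lab
cleaned = pd.DataFrame({"EmployeeID": [101, 102], "Department": ["IT", "HR"]})

# Temporary directory is an assumption made for this sketch only
out_path = os.path.join(tempfile.mkdtemp(), "Employee_Cleaned.csv")

# index=False keeps the row index out of the file
cleaned.to_csv(out_path, index=False)

# Read the file back to confirm what was written
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Without `index=False`, the CSV gains an extra unnamed column holding the old row labels.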