GitHub Repository: suyashi29/python-su
Path: blob/master/Data Science using Python/Day 3 Pandas Lab.ipynb
Kernel: Python 3 (ipykernel)

Generate employee data with 1000 rows and 7 columns.

  • columns=['Name', 'Gender', 'Salary', 'Work Location', 'Age', 'Rating', 'Job Role']

  • Define lists for job roles, locations, and ratings

job_roles = ['Software Engineer', 'Data Analyst', 'Project Manager', 'Marketing Specialist', 'HR Manager', 'Financial Analyst', 'Sales Executive', 'Customer Support', 'Graphic Designer', 'Product Manager']
locations = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose']
ratings = [1, 2, 3, 4, 5]

Work on the following pandas functions using the data generated above:

  • Load a CSV file into a pandas DataFrame and display the first 5 rows.

  • Get the shape of the DataFrame (number of rows and columns).

  • Check for missing values in the DataFrame and handle them appropriately (two common approaches are sketched right after this list).

  • Filter the DataFrame to only include rows where a specific column meets a certain condition (e.g., age > 30).

  • Sort the DataFrame based on a specific column in ascending order.

  • Add a new column to the DataFrame based on a calculation from existing columns.

  • Group the DataFrame by a categorical variable and calculate summary statistics for each group (e.g., mean, median, count).

  • Merge two DataFrames based on a common key column.

  • Remove duplicate rows from the DataFrame.

  • Rename columns in the DataFrame to make them more descriptive.

  • Select specific columns from the DataFrame and create a new DataFrame with only those columns.

  • Reset the index of the DataFrame to default integer index.
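For the missing-value task, dropping rows is only one option. The snippet below is a minimal, self-contained sketch of the two common approaches; the tiny demo DataFrame is purely illustrative and is not the generated employee data.

import pandas as pd
import numpy as np

# Illustrative frame with a couple of missing values (not the employee data)
demo = pd.DataFrame({'Name': ['Ann', 'Bob', None],
                     'Salary': [50000, np.nan, 70000]})

print(demo.isnull().sum())                        # count missing values per column

dropped = demo.dropna()                           # option 1: drop rows with any missing value
filled = demo.fillna({'Name': 'Unknown',          # option 2: fill with sensible defaults
                      'Salary': demo['Salary'].median()})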

from faker import Faker
import random
import pandas as pd

# Initialize Faker
fake = Faker()

# Define lists for job roles, locations, and ratings
job_roles = ['Software Engineer', 'Data Analyst', 'Project Manager', 'Marketing Specialist', 'HR Manager', 'Financial Analyst', 'Sales Executive', 'Customer Support', 'Graphic Designer', 'Product Manager']
locations = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose']
ratings = [1, 2, 3, 4, 5]

# Generate sample employee data (1000 rows, 7 columns)
employee_data = []
for _ in range(1000):
    name = fake.name()
    gender = fake.random_element(elements=('Male', 'Female'))
    salary = random.randint(30000, 200000)
    work_location = random.choice(locations)
    age = random.randint(20, 65)
    rating = random.choice(ratings)
    job_role = random.choice(job_roles)
    employee_data.append((name, gender, salary, work_location, age, rating, job_role))

# Create DataFrame
df = pd.DataFrame(employee_data, columns=['Name', 'Gender', 'Salary', 'Work Location', 'Age', 'Rating', 'Job Role'])

# Write DataFrame to CSV file
df.to_csv('employee_details.csv', index=False)
import pandas as pd

# Load the CSV file into a DataFrame and display the first 5 rows
emp = pd.read_csv('employee_details.csv')
emp.head(5)
# Work on the DataFrame loaded from the CSV above

# Shape of the DataFrame (rows, columns)
print(emp.shape)

# Check for missing values and drop any rows that contain them
print(emp.isnull().sum())
emp.dropna(inplace=True)

# Filter rows where Age > 30
filtered_emp = emp[emp['Age'] > 30]

# Sort by Salary in ascending order
sorted_emp = emp.sort_values(by='Salary')

# Add a new column calculated from existing columns
emp['Salary per Rating Point'] = emp['Salary'] / emp['Rating']

# Group by Job Role and compute summary statistics for Salary
grouped_emp = emp.groupby('Job Role').agg({'Salary': ['mean', 'median', 'count']})

# Merge with a second DataFrame on a common key (per-role average salary)
role_avg = emp.groupby('Job Role', as_index=False)['Salary'].mean().rename(columns={'Salary': 'Role Avg Salary'})
merged_emp = pd.merge(emp, role_avg, on='Job Role')

# Remove duplicate rows
emp.drop_duplicates(inplace=True)

# Rename a column to make it more descriptive
emp.rename(columns={'Rating': 'Performance Rating'}, inplace=True)

# Select specific columns into a new DataFrame
selected_emp = emp[['Name', 'Salary', 'Job Role']]

# Reset the index to the default integer index
emp.reset_index(drop=True, inplace=True)
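The merge step above joins emp against an aggregate derived from itself. For the literal two-DataFrame case in the task list, one sketch is a separate lookup table keyed on 'Job Role'; the dept_lookup table and its 'Department' values here are hypothetical and only for illustration.

import pandas as pd

# Hypothetical lookup table; in practice this could come from another CSV
dept_lookup = pd.DataFrame({
    'Job Role': ['Software Engineer', 'Data Analyst', 'HR Manager'],
    'Department': ['Engineering', 'Analytics', 'Human Resources']
})

# Left join keeps every employee row; roles without a lookup entry get NaN in 'Department'
emp_with_dept = pd.merge(emp, dept_lookup, on='Job Role', how='left')
emp_with_dept[['Name', 'Job Role', 'Department']].head()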