GitHub Repository: suyashi29/python-su
Path: blob/master/Data Science using Python/Day 3 Pandas Lab.ipynb
Kernel: Python 3 (ipykernel)

Generate employee data with 1000 rows and 7 columns.

  • columns=['Name', 'Gender', 'Salary', 'Work Location', 'Age', 'Rating', 'Job Role']

  • Define lists for job roles, locations, and ratings

job_roles = ['Software Engineer', 'Data Analyst', 'Project Manager', 'Marketing Specialist', 'HR Manager', 'Financial Analyst', 'Sales Executive', 'Customer Support', 'Graphic Designer', 'Product Manager']
locations = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose']
ratings = [1, 2, 3, 4, 5]

Work on the following pandas functions using the data generated above:

  • Load a CSV file into a pandas DataFrame and display the first 5 rows.

  • Get the shape of the DataFrame (number of rows and columns).

  • Check for missing values in the DataFrame and handle them appropriately (two common approaches are sketched right after this list).

  • Filter the DataFrame to only include rows where a specific column meets a certain condition (e.g., age > 30).

  • Sort the DataFrame based on a specific column in ascending order.

  • Add a new column to the DataFrame based on a calculation from existing columns.

  • Group the DataFrame by a categorical variable and calculate summary statistics for each group (e.g., mean, median, count).

  • Merge two DataFrames based on a common key column.

  • Remove duplicate rows from the DataFrame.

  • Rename columns in the DataFrame to make them more descriptive.

  • Select specific columns from the DataFrame and create a new DataFrame with only those columns.

  • Reset the index of the DataFrame to default integer index.
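For the missing-value task, dropping rows is only one option. The snippet below is a minimal, self-contained sketch of the two common approaches; the tiny demo DataFrame is purely illustrative and is not the generated employee data.

import pandas as pd
import numpy as np

# Illustrative frame with a couple of missing values (not the employee data)
demo = pd.DataFrame({'Name': ['Ann', 'Bob', None],
                     'Salary': [50000, np.nan, 70000]})

print(demo.isnull().sum())                        # count missing values per column

dropped = demo.dropna()                           # option 1: drop rows with any missing value
filled = demo.fillna({'Name': 'Unknown',          # option 2: fill with sensible defaults
                      'Salary': demo['Salary'].median()})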

from faker import Faker
import random
import pandas as pd

# Initialize Faker
fake = Faker()

# Define lists for job roles, locations, and ratings
job_roles = ['Software Engineer', 'Data Analyst', 'Project Manager', 'Marketing Specialist', 'HR Manager', 'Financial Analyst', 'Sales Executive', 'Customer Support', 'Graphic Designer', 'Product Manager']
locations = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose']
ratings = [1, 2, 3, 4, 5]

# Generate sample employee data (1000 rows, 7 columns)
employee_data = []
for _ in range(1000):
    name = fake.name()
    gender = fake.random_element(elements=('Male', 'Female'))
    salary = random.randint(30000, 200000)
    work_location = random.choice(locations)
    age = random.randint(20, 65)
    rating = random.choice(ratings)
    job_role = random.choice(job_roles)
    employee_data.append((name, gender, salary, work_location, age, rating, job_role))

# Create DataFrame
df = pd.DataFrame(employee_data, columns=['Name', 'Gender', 'Salary', 'Work Location', 'Age', 'Rating', 'Job Role'])

# Write DataFrame to CSV file
df.to_csv('employee_details.csv', index=False)
import pandas as pd

# Load the CSV file into a DataFrame and display the first 5 rows
emp = pd.read_csv('employee_details.csv')
emp.head(5)
# Work on the DataFrame loaded from the CSV above

# Shape of the DataFrame (rows, columns)
print(emp.shape)

# Check for missing values and drop any rows that contain them
print(emp.isnull().sum())
emp.dropna(inplace=True)

# Filter rows where Age > 30
filtered_emp = emp[emp['Age'] > 30]

# Sort by Salary in ascending order
sorted_emp = emp.sort_values(by='Salary')

# Add a new column calculated from existing columns
emp['Salary per Rating Point'] = emp['Salary'] / emp['Rating']

# Group by Job Role and compute summary statistics for Salary
grouped_emp = emp.groupby('Job Role').agg({'Salary': ['mean', 'median', 'count']})

# Merge with a second DataFrame on a common key (per-role average salary)
role_avg = emp.groupby('Job Role', as_index=False)['Salary'].mean().rename(columns={'Salary': 'Role Avg Salary'})
merged_emp = pd.merge(emp, role_avg, on='Job Role')

# Remove duplicate rows
emp.drop_duplicates(inplace=True)

# Rename a column to make it more descriptive
emp.rename(columns={'Rating': 'Performance Rating'}, inplace=True)

# Select specific columns into a new DataFrame
selected_emp = emp[['Name', 'Salary', 'Job Role']]

# Reset the index to the default integer index
emp.reset_index(drop=True, inplace=True)
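The merge step above joins emp against an aggregate derived from itself. For the literal two-DataFrame case in the task list, one sketch is a separate lookup table keyed on 'Job Role'; the dept_lookup table and its 'Department' values here are hypothetical and only for illustration.

import pandas as pd

# Hypothetical lookup table; in practice this could come from another CSV
dept_lookup = pd.DataFrame({
    'Job Role': ['Software Engineer', 'Data Analyst', 'HR Manager'],
    'Department': ['Engineering', 'Analytics', 'Human Resources']
})

# Left join keeps every employee row; roles without a lookup entry get NaN in 'Department'
emp_with_dept = pd.merge(emp, dept_lookup, on='Job Role', how='left')
emp_with_dept[['Name', 'Job Role', 'Department']].head()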