Thursday, June 27, 2024

Generating Synthetic Sales Data

To generate a synthetic dataset with the Faker library for the previous 101 visualization examples, we'll write a Python script that fills each of the specified columns with random values. Because the values are random, the dataset is artificial and does not represent any real-world data.

First, make sure the Faker library is installed. The script also uses pandas. You can install both with pip:

bash code

pip install Faker pandas

Let's generate the dataset with the required columns:

python code

import random
from datetime import datetime, timedelta

import pandas as pd
from faker import Faker

# Set random seeds for reproducibility. Faker keeps its own random
# generator, so seeding the random module alone is not enough.
random.seed(42)
Faker.seed(42)

# Initialize Faker and the date range for purchases
fake = Faker()
start_date = datetime(2020, 1, 1)
end_date = datetime(2022, 1, 1)

# Create empty lists to store the generated data
order_ids = []
customer_ids = []
product_ids = []
purchase_dates = []
product_categories = []
quantities = []
total_sales = []
genders = []
marital_statuses = []
price_per_unit = []
customer_types = []
ages = []

# Number of rows (data points) to generate

num_rows = 10000

# Generate the dataset

for _ in range(num_rows):
    order_ids.append(fake.uuid4())
    customer_ids.append(fake.uuid4())
    product_ids.append(fake.uuid4())
    # Pick a random day between start_date and end_date (inclusive)
    purchase_date = start_date + timedelta(
        days=random.randint(0, (end_date - start_date).days))
    purchase_dates.append(purchase_date)
    product_categories.append(fake.random_element(
        elements=('Electronics', 'Clothing', 'Books', 'Home', 'Beauty')))
    quantities.append(random.randint(1, 10))
    # Drawn independently of Quantity and Price_Per_Unit
    total_sales.append(random.uniform(10, 500))
    # Only 'Male' and 'Female' will be added
    genders.append(fake.random_element(elements=('Male', 'Female')))
    marital_statuses.append(fake.random_element(
        elements=('Single', 'Married', 'Divorced', 'Widowed')))
    price_per_unit.append(random.uniform(5, 50))
    customer_types.append(fake.random_element(
        elements=('New Customer', 'Returning Customer')))
    ages.append(random.randint(18, 80))  # Random ages between 18 and 80

# Create a DataFrame from the generated lists

df = pd.DataFrame({
    'Order_ID': order_ids,
    'Customer_ID': customer_ids,
    'Product_ID': product_ids,
    'Purchase_Date': purchase_dates,
    'Product_Category': product_categories,
    'Quantity': quantities,
    'Total_Sales': total_sales,
    'Gender': genders,
    'Marital_Status': marital_statuses,
    'Price_Per_Unit': price_per_unit,
    'Customer_Type': customer_types,
    'Age': ages,
})

# Save the DataFrame to a CSV file

df.to_csv('ecommerce_sales.csv', index=False)

# Display the first few rows of the generated dataset

print(df.head())

This code generates a DataFrame with the columns 'Order_ID', 'Customer_ID', 'Product_ID', 'Purchase_Date', 'Product_Category', 'Quantity', 'Total_Sales', 'Gender', 'Marital_Status', 'Price_Per_Unit', 'Customer_Type', and 'Age', and saves it to ecommerce_sales.csv. You can now use this dataset for data visualization and analysis, and apply the previous 101 visualization examples to it. Remember that the dataset is synthetic and should only be used for learning or testing. For real-world analysis, use genuine, representative data.
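One thing to keep in mind when plotting: the script draws Total_Sales independently of Quantity and Price_Per_Unit, so the three columns will not agree with each other. If you would like them to be consistent, one option is to compute the total from the other two. Here is a minimal sketch using only the standard library. The helper name `make_order_values` and the rounding to two decimals are assumptions for illustration, not part of the script above.

```python
import random


def make_order_values(rng):
    """Return (quantity, price_per_unit, total_sales) for one synthetic order.

    Hypothetical helper, not part of the original script.
    """
    quantity = rng.randint(1, 10)
    price = round(rng.uniform(5, 50), 2)  # rounding is an assumption
    # Derive the total instead of drawing it independently
    total = round(quantity * price, 2)
    return quantity, price, total


# A local generator keeps this example separate from the global seed
rng = random.Random(42)
rows = [make_order_values(rng) for _ in range(5)]
for quantity, price, total in rows:
    print(quantity, price, total)
```

Inside the generation loop, the three returned values could be appended to quantities, price_per_unit, and total_sales in place of the three separate random draws.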
