Exploratory Data Analysis Pipeline

This notebook demonstrates a complete EDA workflow in Maxima: load a dataset, inspect its structure, compute summary statistics, visualize distributions and relationships, and filter for subsets of interest. Tabular work uses the dataframes package and plotting uses ax-plots.

In [1]:

load("numerics")$
load("dataframes")$
load("dataframes-duckdb")$
load("ax-plots")$

Loading Data

In [2]:

T : df_read_csv("../../data/sales.csv")$
print("Shape:", df_table_shape(T))$
print("Columns:", df_table_names(T))$

Shape: [100,6]
Columns: ["date","region","product","units","price","revenue"]

In [3]:

df_table_head(T)

Out[3]:

$\begin{array}{llllll} \textbf{date} & \textbf{region} & \textbf{product} & \textbf{units} & \textbf{price} & \textbf{revenue} \\\\ \textit{str} & \textit{str} & \textit{str} & \textit{f64} & \textit{f64} & \textit{f64} \\\\ \hline \text{2024-01-01T00:00:00.000000Z} & \text{East} & \text{Widget} & 78.00 & 14.49 & 1130. \\\\ \text{2024-01-08T00:00:00.000000Z} & \text{North} & \text{Gadget} & 52.00 & 27.99 & 1455. \\\\ \text{2024-01-08T00:00:00.000000Z} & \text{South} & \text{Gizmo} & 31.00 & 39.99 & 1240. \\\\ \text{2024-01-15T00:00:00.000000Z} & \text{East} & \text{Gadget} & 95.00 & 22.49 & 2137. \\\\ \text{2024-01-15T00:00:00.000000Z} & \text{West} & \text{Widget} & 38.00 & 12.99 & 493.6 \\\\ \hline \text{5 rows} \times \text{6 cols} \end{array}$

Summary Statistics

df_describe computes count, mean, standard deviation, min, quartiles (25%, 50%, 75%), and max for every numeric column.

In [4]:

df_describe(T)

Out[4]:

$\begin{array}{llll} \textbf{stat} & \textbf{units} & \textbf{price} & \textbf{revenue} \\\\ \textit{str} & \textit{f64} & \textit{f64} & \textit{f64} \\\\ \hline \text{count} & 100.0 & 100.0 & 100.0 \\\\ \text{mean} & 86.72 & 25.99 & 2329. \\\\ \text{std} & 44.07 & 11.17 & 1722. \\\\ \text{min} & 19.00 & 9.990 & 189.8 \\\\ \text{25\%} & 43.75 & 15.99 & 1085. \\\\ \text{50\%} & 86.50 & 24.74 & 1875. \\\\ \text{75\%} & 113.3 & 34.62 & 3134. \\\\ \text{max} & 190.0 & 49.99 & 8598. \\\\ \hline \text{8 rows} \times \text{4 cols} \end{array}$
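
To see what each row of this summary means, the same statistics can be recomputed by hand on a small sample. The following is a minimal sketch using Maxima's bundled descriptive package rather than dataframes. It uses the five units values shown by df_table_head(T) above as sample data, and the variable names are illustrative. Whether df_describe uses the sample (n-1) standard deviation and the same quantile interpolation is an assumption that this notebook does not confirm.

```maxima
load("descriptive")$

/* units values from the first five rows shown by df_table_head(T) */
units : [78, 52, 31, 95, 38]$

n   : length(units);          /* count */
m   : mean(units);            /* arithmetic mean */
s   : float(std1(units));     /* sample standard deviation (n-1 denominator) */
lo  : lmin(units);            /* minimum */
q1  : quantile(units, 1/4);   /* first quartile (linear interpolation) */
med : median(units);          /* 50% quantile */
q3  : quantile(units, 3/4);   /* third quartile */
hi  : lmax(units);            /* maximum */
```

If the standard deviation from such a check differs slightly from the df_describe value for the same data, trying std (population, n denominator) in place of std1 is a quick way to tell which convention is in use.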

Distribution of Revenue

A histogram shows how revenue values are spread across the dataset.

In [5]:

ax_draw2d(
  ax_histogram(df_table_column(T, "revenue")),
  title="Revenue Distribution",
  xlabel="Revenue", ylabel="Count",
  grid=true, nbins=15
)$

[Figure: histogram titled "Revenue Distribution"; x-axis: Revenue, y-axis: Count]
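
As a rough guide to reading the histogram: if the 15 bins requested with nbins=15 are equal-width and span the observed range (an assumption, since the binning rule used by ax_histogram is not stated here), the min and max revenue reported by df_describe give an approximate bin width of

$$ w = \frac{\text{max} - \text{min}}{\text{nbins}} = \frac{8598 - 189.8}{15} \approx 560.5 $$

so each bar would cover a revenue interval of roughly 560.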

Sales by Region

Group the data by region and compute total revenue and average units sold per region.

In [6]:

by_region : df_group_by(T, "region")$
region_totals : df_summarize(by_region,
  "total_revenue", lambda([revenue], np_sum(revenue)),
  "avg_units", lambda([units], np_mean(units))
)$
region_totals;

Out[6]:

$\begin{array}{lll} \textbf{region} & \textbf{total\_revenue} & \textbf{avg\_units} \\\\ \textit{str} & \textit{f64} & \textit{f64} \\\\ \hline \text{East} & 1.1290e+5 & 122.6 \\\\ \text{North} & 8.9261e+4 & 97.31 \\\\ \text{South} & 1.2042e+4 & 31.13 \\\\ \text{West} & 1.8711e+4 & 45.29 \\\\ \hline \text{4 rows} \times \text{3 cols} \end{array}$
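
Conceptually, grouping and summarizing splits the rows by key, applies an aggregate to each group, and combines the results. The sketch below illustrates that idea with plain Maxima lists and built-ins only. It is not how the dataframes package implements df_group_by or df_summarize. The helper group_sum is hypothetical, and the sample rows are the region and revenue values from the df_table_head(T) output above.

```maxima
/* [region, revenue] pairs from the first five rows of the table */
rows : [["East", 1130], ["North", 1455], ["South", 1240],
        ["East", 2137], ["West", 493.6]]$

/* hypothetical helper: sum the second field over rows whose first field is key */
group_sum(rows, key) := block([matching],
  matching : sublist(rows, lambda([row], is(first(row) = key))),
  apply("+", map(second, matching)))$

/* split: distinct keys; apply and combine: one [key, total] pair per key */
regions : unique(map(first, rows))$
totals  : makelist([r, group_sum(rows, r)], r, regions);
```

Swapping the `apply("+", ...)` step for a mean over the matching rows would give the per-group average in the same way.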

In [7]:

ax_draw2d(
  ax_bar(
    df_to_string_list(df_table_column(region_totals, "region")),
    np_to_list(df_table_column(region_totals, "total_revenue"))
  ),
  title="Total Revenue by Region",
  ylabel="Revenue ($)", grid=true
)$

Product Analysis

How many sales records does each product have, and what are its average price and total units sold?

In [8]:

df_value_counts(df_table_column(T, "product"))

Out[8]:

$\begin{array}{ll} \textbf{value} & \textbf{count} \\\\ \textit{str} & \textit{f64} \\\\ \hline \text{Widget} & 42.00 \\\\ \text{Gadget} & 32.00 \\\\ \text{Gizmo} & 26.00 \\\\ \hline \text{3 rows} \times \text{2 cols} \end{array}$

In [9]:

by_product : df_group_by(T, "product")$
product_stats : df_summarize(by_product,
  "avg_price", lambda([price], np_mean(price)),
  "total_units", lambda([units], np_sum(units))
)$
product_stats;

Out[9]:

$\begin{array}{lll} \textbf{product} & \textbf{avg\_price} & \textbf{total\_units} \\\\ \textit{str} & \textit{f64} & \textit{f64} \\\\ \hline \text{Widget} & 15.16 & 3512. \\\\ \text{Gadget} & 27.97 & 2888. \\\\ \text{Gizmo} & 41.07 & 2272. \\\\ \hline \text{3 rows} \times \text{3 cols} \end{array}$

In [10]:

ax_draw2d(
  ax_bar(
    df_to_string_list(df_table_column(product_stats, "product")),
    np_to_list(df_table_column(product_stats, "avg_price"))
  ),
  title="Average Price by Product",
  ylabel="Price ($)", grid=true
)$

Scatter: Units vs Revenue

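Do more units sold correspond to higher revenue? In this dataset revenue is units times price (in the first row, 78 × 14.49 ≈ 1130), so a scatter plot of units against revenue shows how much of the variation in revenue tracks sales volume rather than price.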

In [11]:

ax_draw2d(
  color=blue, marker_size=5,
  points(np_to_list(df_table_column(T, "units")),
         np_to_list(df_table_column(T, "revenue"))),
  title="Units Sold vs Revenue",
  xlabel="Units", ylabel="Revenue",
  grid=true
)$
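
To put a number on the relationship the scatter plot shows, a Pearson correlation coefficient can be computed alongside it. This is a minimal sketch with Maxima built-ins only. The helpers avg and pearson are hypothetical names, not part of dataframes or ax-plots, and the sample data are the units and revenue values from the five rows shown by df_table_head(T), with revenue as displayed (rounded to four significant digits).

```maxima
/* units and revenue from the first five rows of the table */
xs : [78, 52, 31, 95, 38]$
ys : [1130, 1455, 1240, 2137, 493.6]$

/* arithmetic mean of a list */
avg(v) := apply("+", v) / length(v)$

/* Pearson correlation: sum of products of deviations,
   divided by the square root of the product of summed squared deviations */
pearson(a, b) := block([da, db],
  da : a - avg(a),   /* list minus scalar is elementwise */
  db : b - avg(b),
  float(apply("+", da * db)
        / sqrt(apply("+", da * da) * apply("+", db * db))))$

r : pearson(xs, ys);
```

On the full table, the same function could be applied to the lists obtained with np_to_list(df_table_column(T, "units")) and the corresponding revenue column, as in the plotting cell above. A value near 1 indicates that revenue rises almost linearly with units.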

Filtering: High-Value Sales

Filter the dataset to keep only rows where revenue exceeds 3000.

In [12]:

high_value : df_filter(T, lambda([revenue], is(revenue > 3000)))$
print("High-value sales:", df_table_shape(high_value))$
df_table_head(high_value);

High-value sales: [26,6]

Out[12]:

$\begin{array}{llllll} \textbf{date} & \textbf{region} & \textbf{product} & \textbf{units} & \textbf{price} & \textbf{revenue} \\\\ \textit{str} & \textit{str} & \textit{str} & \textit{f64} & \textit{f64} & \textit{f64} \\\\ \hline \text{2024-01-22T00:00:00.000000Z} & \text{North} & \text{Gizmo} & 68.00 & 44.99 & 3059. \\\\ \text{2024-02-05T00:00:00.000000Z} & \text{East} & \text{Gadget} & 112.0 & 29.99 & 3359. \\\\ \text{2024-02-26T00:00:00.000000Z} & \text{East} & \text{Gizmo} & 88.00 & 42.99 & 3783. \\\\ \text{2024-03-11T00:00:00.000000Z} & \text{East} & \text{Gizmo} & 96.00 & 37.99 & 3647. \\\\ \text{2024-04-01T01:00:00.000000+01:00} & \text{North} & \text{Gizmo} & 74.00 & 46.99 & 3477. \\\\ \hline \text{5 rows} \times \text{6 cols} \end{array}$

Summary

With dataframes for tabular manipulation and ax-plots for visualization, Maxima provides a complete exploratory data analysis workflow: load data, inspect structure, compute statistics, visualize distributions and relationships, and filter for subsets of interest.