SQL Analytics in Maxima

This notebook shows how to run DuckDB SQL from within Maxima for data analysis. The dataframes-duckdb package lets you register dataframe tables with DuckDB, query them with full SQL (window functions, CTEs, and joins), and then work with the results in Maxima's computer algebra system.

In [1]:

load("numerics")$
load("dataframes")$
load("dataframes-duckdb")$
load("ax-plots")$

Loading and Registering Data

Load the CSV into a dataframe table, then register it so DuckDB can query it by name.

In [2]:

sales : df_read_csv("../../data/sales.csv")$
df_register(sales, "sales")$
print("Registered 'sales' table:", df_table_shape(sales))$

Registered 'sales' table: [100,6]
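
A quick way to confirm that the registration worked is to count the rows through SQL and compare with the shape printed above. This is a minimal sketch: it assumes the cells above have run and that df_sql accepts any SELECT statement, as it does in the rest of this notebook. The names row_check and n_rows are chosen here for illustration.

```maxima
/* Count rows via SQL; the "sales" name only resolves if df_register succeeded. */
row_check : df_sql("SELECT COUNT(*) AS n_rows FROM sales")$
row_check;
```

The count should agree with the 100 rows reported by df_table_shape.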

Basic SQL Queries

Aggregate and filter data using familiar SQL syntax.

In [3]:

df_sql("SELECT region, COUNT(*) as n, SUM(revenue) as total_rev FROM sales GROUP BY region ORDER BY total_rev DESC")

Out[3]:

$\begin{array}{lll} \textbf{region} & \textbf{n} & \textbf{total\_rev} \\ \textit{str} & \textit{f64} & \textit{f64} \\ \hline \text{East} & 35.00 & 1.1290e+5 \\ \text{North} & 32.00 & 8.9261e+4 \\ \text{West} & 17.00 & 1.8711e+4 \\ \text{South} & 16.00 & 1.2042e+4 \\ \hline \text{4 rows} \times \text{3 cols} \end{array}$

In [4]:

df_sql("SELECT product, AVG(price) as avg_price, AVG(units) as avg_units FROM sales WHERE region = 'North' GROUP BY product")

Out[4]:

$\begin{array}{lll} \textbf{product} & \textbf{avg\_price} & \textbf{avg\_units} \\ \textit{str} & \textit{f64} & \textit{f64} \\ \hline \text{Gizmo} & 44.49 & 101.8 \\ \text{Widget} & 16.99 & 97.83 \\ \text{Gadget} & 29.49 & 93.83 \\ \hline \text{3 rows} \times \text{3 cols} \end{array}$
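
The WHERE clause above filters rows before they are grouped. To filter on an aggregate instead, SQL provides HAVING, which is applied after grouping. The sketch below keeps only regions with at least 20 sales rows; the threshold and the name big_regions are illustrative choices, not part of the original analysis.

```maxima
/* WHERE filters rows before grouping; HAVING filters groups after aggregation. */
big_regions : df_sql("
  SELECT region, COUNT(*) AS n, AVG(revenue) AS avg_rev
  FROM sales
  GROUP BY region
  HAVING COUNT(*) >= 20
  ORDER BY n DESC
")$
big_regions;
```

Given the row counts in Out[3], only East (35 rows) and North (32 rows) pass this filter.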

Window Functions

Window functions compute values such as running totals and rankings over a set of related rows without collapsing them: unlike GROUP BY, every input row stays in the output.

In [5]:

df_sql("SELECT region, product, revenue, SUM(revenue) OVER (PARTITION BY region ORDER BY revenue) as running_total FROM sales ORDER BY region, revenue LIMIT 15")

Out[5]:

$\begin{array}{llll} \textbf{region} & \textbf{product} & \textbf{revenue} & \textbf{running\_total} \\ \textit{str} & \textit{str} & \textit{f64} & \textit{f64} \\ \hline \text{East} & \text{Widget} & 1130. & 1130. \\ \text{East} & \text{Widget} & 1228. & 2358. \\ \text{East} & \text{Widget} & 1237. & 3594. \\ \text{East} & \text{Widget} & 1511. & 5105. \\ \text{East} & \text{Widget} & 1518. & 6623. \\ \text{East} & \text{Widget} & 1533. & 8156. \\ \text{East} & \text{Widget} & 1574. & 9730. \\ \text{East} & \text{Widget} & 1619. & 1.1349e+4 \\ \text{East} & \text{Widget} & 1733. & 1.3082e+4 \\ \text{East} & \text{Widget} & 1871. & 1.4953e+4 \\ \cdots & \cdots & \cdots & \cdots \\ \text{East} & \text{Widget} & 2241. & 2.1229e+4 \\ \text{East} & \text{Widget} & 2374. & 2.3603e+4 \\ \text{East} & \text{Gadget} & 2419. & 2.6022e+4 \\ \hline \text{15 rows} \times \text{4 cols} \end{array}$
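
One subtlety of the running total above: when a window has an ORDER BY but no explicit frame, standard SQL uses a RANGE frame ending at the current row, so rows with equal revenue are treated as peers and receive the same running total. If a strict row-by-row accumulation is wanted, the frame can be spelled out with ROWS. A sketch under the same assumptions as the cell above (the alias running_total_rows and the name running_rows are illustrative):

```maxima
/* Explicit ROWS frame: accumulate one row at a time, even when revenues tie. */
running_rows : df_sql("
  SELECT region, product, revenue,
         SUM(revenue) OVER (
           PARTITION BY region
           ORDER BY revenue
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
         ) AS running_total_rows
  FROM sales
  ORDER BY region, revenue
  LIMIT 15
")$
running_rows;
```

If no two rows in a region share the same revenue, both forms give identical results. With ties, the order among the tied rows is not determined by this query, so the intermediate totals for those rows may vary between runs.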

In [6]:

df_sql("SELECT region, product, revenue, RANK() OVER (PARTITION BY region ORDER BY revenue DESC) as rank FROM sales ORDER BY region, rank LIMIT 20")

Out[6]:

$\begin{array}{llll} \textbf{region} & \textbf{product} & \textbf{revenue} & \textbf{rank} \\ \textit{str} & \textit{str} & \textit{f64} & \textit{f64} \\ \hline \text{East} & \text{Gizmo} & 8598. & 1.000 \\ \text{East} & \text{Gizmo} & 8138. & 2.000 \\ \text{East} & \text{Gadget} & 6368. & 3.000 \\ \text{East} & \text{Gadget} & 5374. & 4.000 \\ \text{East} & \text{Gadget} & 4768. & 5.000 \\ \text{East} & \text{Gizmo} & 4591. & 6.000 \\ \text{East} & \text{Gizmo} & 4493. & 7.000 \\ \text{East} & \text{Gizmo} & 4218. & 8.000 \\ \text{East} & \text{Gizmo} & 3943. & 9.000 \\ \text{East} & \text{Gizmo} & 3905. & 10.00 \\ \cdots & \cdots & \cdots & \cdots \\ \text{East} & \text{Gadget} & 2669. & 18.00 \\ \text{East} & \text{Gadget} & 2521. & 19.00 \\ \text{East} & \text{Widget} & 2473. & 20.00 \\ \hline \text{20 rows} \times \text{4 cols} \end{array}$
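
Because LIMIT 20 cuts off the sorted output, the result above only reaches the East region. A common follow-up is "top N per group". A window function cannot be filtered in the WHERE clause of the same SELECT, since WHERE is evaluated before window functions, so the ranking goes into a subquery. The following is a sketch, with the names top3, ranked, and rnk chosen for illustration:

```maxima
/* Rank inside a subquery, then filter on the rank in the outer query. */
top3 : df_sql("
  SELECT region, product, revenue, rnk
  FROM (
    SELECT region, product, revenue,
           RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rnk
    FROM sales
  ) AS ranked
  WHERE rnk <= 3
  ORDER BY region, rnk
")$
top3;
```

RANK gives tied rows the same rank and then skips ahead, so a region can contribute more than three rows if revenues tie at the cutoff. ROW_NUMBER() in place of RANK() would return exactly three rows per region, at the cost of breaking ties arbitrarily.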

Common Table Expressions (CTEs)

CTEs let you build multi-step analyses in a single query. Here we compute per-region averages and compare each to the grand average.

In [7]:

result : df_sql("
  WITH region_stats AS (
    SELECT region, 
           AVG(revenue) as avg_rev,
           COUNT(*) as n
    FROM sales 
    GROUP BY region
  ),
  overall AS (
    SELECT AVG(revenue) as grand_avg FROM sales
  )
  SELECT rs.region, rs.avg_rev, rs.n,
         rs.avg_rev - o.grand_avg as diff_from_avg
  FROM region_stats rs, overall o
  ORDER BY rs.avg_rev DESC
")$
result;

Out[7]:

$\begin{array}{llll} \textbf{region} & \textbf{avg\_rev} & \textbf{n} & \textbf{diff\_from\_avg} \\ \textit{str} & \textit{f64} & \textit{f64} & \textit{f64} \\ \hline \text{East} & 3226. & 35.00 & 896.5 \\ \text{North} & 2789. & 32.00 & 460.3 \\ \text{West} & 1101. & 17.00 & -1228. \\ \text{South} & 752.6 & 16.00 & -1577. \\ \hline \text{4 rows} \times \text{4 cols} \end{array}$
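
As a check on the table above, the grand average can be recovered from the per-region rows as a weighted mean, with the row counts $n_r$ as weights. The worked numbers below use the region averages to two decimals (the unrounded values are printed in the final cell of this notebook); the symbols $n_r$ and $\overline{\text{rev}}_r$ are notation introduced here.

$$\text{grand\_avg} = \frac{\sum_r n_r\,\overline{\text{rev}}_r}{\sum_r n_r} \approx \frac{35(3225.60) + 32(2789.42) + 17(1100.66) + 16(752.60)}{100} \approx 2329.10$$

This reproduces the diff_from_avg column: for East, $3225.60 - 2329.10 \approx 896.5$. It also shows why the overall CTE averages over sales directly rather than over region_stats: the unweighted mean of the four region averages is about $1967.07$, which is not the grand average because the regions have different row counts.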

Joins

Create a lookup table with region metadata and join it to the sales data for enriched analysis.

In [8]:

region_meta : df_sql("SELECT * FROM (VALUES 
  ('North', 'Urban', 1.2), 
  ('South', 'Suburban', 0.9),
  ('East', 'Urban', 1.1), 
  ('West', 'Rural', 0.8)
) AS t(region, type, market_factor)")$
df_register(region_meta, "regions")$

In [9]:

df_sql("SELECT s.region, r.type, r.market_factor,
        SUM(s.revenue) as total_rev,
        SUM(s.revenue) * r.market_factor as adjusted_rev
 FROM sales s 
 JOIN regions r ON s.region = r.region
 GROUP BY s.region, r.type, r.market_factor
 ORDER BY adjusted_rev DESC")

Out[9]:

$\begin{array}{lllll} \textbf{region} & \textbf{type} & \textbf{market\_factor} & \textbf{total\_rev} & \textbf{adjusted\_rev} \\ \textit{str} & \textit{str} & \textit{f64} & \textit{f64} & \textit{f64} \\ \hline \text{East} & \text{Urban} & 1.100 & 1.1290e+5 & 1.2419e+5 \\ \text{North} & \text{Urban} & 1.200 & 8.9261e+4 & 1.0711e+5 \\ \text{West} & \text{Rural} & 0.8000 & 1.8711e+4 & 1.4969e+4 \\ \text{South} & \text{Suburban} & 0.9000 & 1.2042e+4 & 1.0837e+4 \\ \hline \text{4 rows} \times \text{5 cols} \end{array}$
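
An inner JOIN silently drops sales rows whose region has no entry in the lookup table, which would understate the totals. One way to guard against that is to count unmatched rows with a LEFT JOIN. A minimal sketch, assuming the sales and regions tables registered above; the names unmatched and n_unmatched are illustrative:

```maxima
/* Rows from sales with no partner in regions come back with r.region NULL. */
unmatched : df_sql("
  SELECT COUNT(*) AS n_unmatched
  FROM sales s
  LEFT JOIN regions r ON s.region = r.region
  WHERE r.region IS NULL
")$
unmatched;
```

Since the lookup table lists all four regions that appear in Out[3], the count here should be 0.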

Visualizing SQL Results

Query results are regular dataframe tables, so we can extract columns, convert them to Maxima lists, and pass them to ax-plots.

In [10]:

prod_rev : df_sql("SELECT product, SUM(revenue) as total FROM sales GROUP BY product ORDER BY total DESC")$
ax_draw2d(
  ax_bar(
    df_to_string_list(df_table_column(prod_rev, "product")),
    np_to_list(df_table_column(prod_rev, "total"))
  ),
  title="Total Revenue by Product",
  ylabel="Revenue ($)", grid=true
)$

[Figure: bar chart "Total Revenue by Product"; y-axis: Revenue ($)]

In [11]:

cross : df_sql("SELECT region, product, SUM(revenue) as rev FROM sales GROUP BY region, product ORDER BY region, product")$
cross;

Out[11]:

$\begin{array}{lll} \textbf{region} & \textbf{product} & \textbf{rev} \\ \textit{str} & \textit{str} & \textit{f64} \\ \hline \text{East} & \text{Gadget} & 4.0793e+4 \\ \text{East} & \text{Gizmo} & 4.5317e+4 \\ \text{East} & \text{Widget} & 2.6786e+4 \\ \text{North} & \text{Gadget} & 3.3048e+4 \\ \text{North} & \text{Gizmo} & 3.6185e+4 \\ \text{North} & \text{Widget} & 2.0028e+4 \\ \text{South} & \text{Gadget} & 3505. \\ \text{South} & \text{Gizmo} & 5487. \\ \text{South} & \text{Widget} & 3050. \\ \text{West} & \text{Gadget} & 5264. \\ \text{West} & \text{Gizmo} & 8902. \\ \text{West} & \text{Widget} & 4545. \\ \hline \text{12 rows} \times \text{3 cols} \end{array}$
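
The long-format table above has one row per region and product pair. For side-by-side comparison or grouped plots, a wide layout with one column per product can be easier to work with. The sketch below builds it with conditional aggregation, which is plain SQL; the product names are taken from the output above, and the names wide, gadget, gizmo, and widget are illustrative.

```maxima
/* One row per region, one revenue column per product. */
wide : df_sql("
  SELECT region,
         SUM(CASE WHEN product = 'Gadget' THEN revenue ELSE 0 END) AS gadget,
         SUM(CASE WHEN product = 'Gizmo'  THEN revenue ELSE 0 END) AS gizmo,
         SUM(CASE WHEN product = 'Widget' THEN revenue ELSE 0 END) AS widget
  FROM sales
  GROUP BY region
  ORDER BY region
")$
wide;
```

Each cell of the wide table should match the corresponding rev value in Out[11].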

Combining SQL with Symbolic Math

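Because query results come back as Maxima values, they can be checked or extended with the CAS. Here we compare, for each region, the product of the average units and the average price with the average revenue. The two should agree only approximately, since the average of a product is not in general the product of the averages.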

In [12]:

/* Per-region averages of units, price, and revenue */
stats : df_sql("SELECT region, AVG(units) as avg_u, AVG(price) as avg_p, AVG(revenue) as avg_r FROM sales GROUP BY region")$
/* Revenue should approximately equal units * price */
print("Revenue ~ units * price check:")$
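/* One pass per region: np_ref is called with 0-based indices here, df_string_column_ref with 1-based (i+1) */
for i : 0 thru 3 do block([u, p, r],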
  u : np_ref(df_table_column(stats, "avg_u"), i),
  p : np_ref(df_table_column(stats, "avg_p"), i),
  r : np_ref(df_table_column(stats, "avg_r"), i),
  print("  ", df_string_column_ref(df_table_column(stats, "region"), i+1),
        ": u*p =", float(u*p), "actual =", r)
)$

Revenue ~ units * price check:
   East : u*p = 3185.6314285714293 actual = 3225.6028571428574
   West : u*p = 1089.2702422145333 actual = 1100.664705882353
   North : u*p = 2778.5151562500005 actual = 2789.4175
   South : u*p = 714.5910937500001 actual = 752.5949999999999
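
The gaps in the output above (roughly 0.4% for North up to about 5% for South) are expected even if revenue equals units * price on every row. Writing each value as its mean plus a deviation gives $\overline{u\,p} = \bar{u}\,\bar{p} + \operatorname{cov}(u, p)$, so the product of the averages misses the covariance term. The sketch below shows the expansion symbolically and then lets SQL compute each piece per region. It assumes DuckDB's COVAR_POP aggregate (population covariance); the names u_bar, p_bar, du, dp, cov_check, and the column aliases are introduced here for illustration.

```maxima
/* Symbolic view: expand (mean + deviation) for units and price.
   When averaged, the terms linear in du or dp vanish, leaving u_bar*p_bar + avg(du*dp). */
expand((u_bar + du) * (p_bar + dp));

/* Numeric view: avg_up should equal prod_of_avgs + cov_up in each region. */
cov_check : df_sql("
  SELECT region,
         AVG(units * price)      AS avg_up,
         AVG(units) * AVG(price) AS prod_of_avgs,
         COVAR_POP(units, price) AS cov_up,
         AVG(revenue)            AS avg_r
  FROM sales
  GROUP BY region
  ORDER BY region
")$
cov_check;
```

If avg_up matches avg_r, the row-level relationship revenue = units * price holds on average, and the differences seen earlier are explained entirely by the covariance between units and price.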

Summary

The dataframes-duckdb package lets you run DuckDB SQL, including CTEs, window functions, and joins, on dataframe tables from inside Maxima. Query results come back as dataframe tables, so they can be plotted with ax-plots or checked and modeled further with Maxima's symbolic engine.