Tutorials - Simon Willison Data analysis with SQLite and Python

title: "Tutorials - Simon Willison: Data analysis with SQLite and Python"
Youtube_Thumbnail: https://img.youtube.com/vi/5TdIxxBPUSI/hqdefault.jpg
sources:
  - https://www.youtube.com/watch?v=5TdIxxBPUSI
media_link: https://www.youtube.com/watch?v=5TdIxxBPUSI
Authors:
  - "[[PyCon US]]"
  - "[[Simon Willison]]"
contentPublished: 2023-06-02
noteCreated: 2026-01-05
description: SQLite is the world's most widely used database and has been a part of the Python standard library since 2006. It continues to evolve and offer more capabili...
tags:
  - clippings
  - video
takeaways:
subjects:
  - "[[sqlite-utils]]"
  - "[[SQLite]]"
  - "[[datasette]]"
  - "[[Python]]"
  - "[[Data Analysis]]"
  - "[[CSV]]"
  - "[[JSON]]"
  - "[[GitHub Codespaces]]"
  - "[[Pandas]]"
  - "[[Vercel]]"
  - "[[PostgreSQL]]"
Status: 👀 Reading

Notes

00:02 Introduction to PyCon Tutorial

Largest conference attended in 3.5 years
Tutorial on data analysis with SQLite and Python
Handout available at: sqlite-tutorial-pycon-2023.readthedocs.io
Volunteers Felix and Nick available for help

01:14 Tutorial Overview

Topics covered:
- SQLite and Python from standard library
- Datasette - tool for data exploration, analysis, and publishing
- sqlite-utils library for productive SQLite work
- Publishing databases to the internet
- Datasette running in WebAssembly
- Advanced SQL topics
02:01: SQL approach - learning like a board game (pick up rules as you go)

02:30 Requirements & Setup

First half: Only Python 3 needed
Second half: Ability to pip install packages
GitHub Codespaces as backup development environment
03:07: Codespaces benefit - throwaway environment if things break

05:37 What is SQLite?

Fascinating piece of technology that doesn't get enough respect
Named after stalactites and stalagmites (not because it's lightweight)
06:12: Key characteristics:
- Around for at least 20 years
- Embedded database (not server-based like Postgres/MySQL)
- C library with function calls to file on disk
- Screamingly fast - no network overhead
06:50: Built into Python since 2006
07:12: Library of Congress considers it archival quality
07:31: Database is a single file (data.db) that works on any platform
08:22: Maximum size is 2.8 terabytes (doubled from 1.4TB about a year ago)
08:45: Many small queries are efficient - 200 SQL queries = 200 function calls (not network round trips)

09:32 Getting Started - Downloading Example Database

Example: content.db from datasette.io
Contains news, plugins, and other data from official Datasette website
11:06: File is 17 megabytes

11: 11 Basic Python SQLite Usage

Import sqlite3 from standard library
Connect to database: sqlite3.connect("content.db")
11:35: View tables: SELECT name FROM sqlite_master WHERE type='table'
12:10: Check SQLite version with sqlite_version() function
12:17: Running version 3.31.1 (within last 12 months)

12:23 Working with Cursors

db.execute() returns cursor object
cursor.fetchall() fetches all rows
Rows returned as Python tuples
12:53: Can iterate through cursor directly
13:17: Using db.row_factory = sqlite3.Row for dictionary-like objects
13:54: Convert rows to dictionaries with dict(row)

14:05 Creating Tables

SQLite inspired by Postgres
14:23: Restricted set of column types (much simpler):
- TEXT
- INTEGER
- REAL (floating point)
- BLOB (binary data)
14:36: No need to decide varchar(100) vs varchar(255)
14:59: Example: Creating a peps table with id, title, author, status, type, created, body

16:02 Inserting Data

GPT-4 used to write parser function (saved 1.5 minutes)
16:21: Example: Parsing PEP 20 (The Zen of Python by Tim Peters, 2004)
17:50: Using Parameterized Queries
- Use placeholders (?) to avoid SQL injection
- Pass values as tuple/list separately
- 18:26: Always parameterize queries when working with external data
19:17: Alternative: Named parameters with colon syntax (e.g., :pep, :title)

20: 23 Core SQL Operations

INSERT: Adding data to tables
UPDATE: Modifying existing rows
- 20:40: Example: UPDATE peps SET author = :author WHERE id = :id
DELETE: Removing rows
- 22:44: Example: DELETE FROM peps WHERE id = ?
21:31: Transactions: Use with db: context manager for automatic transaction handling
23:02: 90% of data manipulation language covered
23:21: Most complexity is in SELECT statements

25:29 Introducing Datasette

Tool built over past 5 years for data exploration
Installation: pip install datasette
26:00: Check version: datasette version (v0.64.2)
26:07: Start web server: datasette content.db (runs on port 8001)
26:26: GitHub Codespaces automatically detects and opens in new tab

26:51 Datasette Web Interface

Provides interface to explore any SQLite database
27:04: Shows all tables in database
28:34: Repos table: All GitHub repositories in the Datasette project (158 total)
29:36: GitHub-to-sqlite tool - imports GitHub data into SQLite
30:01: Pattern: Everything in life can be in SQLite for querying/joining
30:26: genome-to-sqlite - import genome data, query eye color with SQL

30:39 Congressional Legislators Database

Database of Congress members, senators, presidents, vice presidents
Data from 1793 onwards
31:50: Open source project gathering Congressional information
5.2 megabytes database file
33:03: Tables for legislators, executives, names, birthdays

33:45 Exploring Data in Datasette

Sort tables by columns (e.g., start date)
34:03: Earliest legislative term: 1789 (Anti-Administration party)
34:16: Pro-Administration party also existed

34:24 Foreign Keys & Relations

Richard Bassett name is a hyperlink
34:35: Foreign keys defined in table creation
Datasette automatically converts foreign keys to hyperlinks
34:53: Relational databases - data linking through to other data

35:40 Faceting - Favorite Datasette Feature

Count rows based on column values
35:56: Example: New York has 4,189 legislative terms (most of any state)
36:15: Facet by type: Representative vs Senator
36:22: Drill down - senators from Massachusetts
36:29: Massachusetts Senate: 38 Republican terms, 32 Democrat, 11 Whig
36:42: Faceting helps understand data at macro level

37:07 Query Builder Interface

Select columns and build filters
37:33: Example: Party contains "rep"
Returns Republican, Democratic-Republican, Independent Republican parties
37:58: Runs full query and shows all matching rows

38:35 Exporting Data

Multiple export formats available
CSV: 38:44 Ready for Excel/Tableau
JSON: 38:55 For JavaScript/Python applications
39:14: Everything in Datasette is an API - if you can see it, you can get it as JSON

39:25 URL-Based API

Specify parameters in URL (e.g., state=CA)
39:36: Change CA to FL in URL to get Florida data
39:47: Powerful API interface for anything in SQLite database
39:57: Nothing cannot be put in SQLite database

41:37 Custom SQL Queries

Click "View and edit SQL" to see underlying query
41:51: Format SQL to tidy up
42:03: Edit SQL directly - remove unwanted columns
42:24: LIKE operator with % wildcards for pattern matching

43:18 Named Parameters in Queries

Use :parameter_name syntax
43:38: Datasette adds form fields for parameters
44:03: Hide SQL - create custom applications as bookmarks
44:20: Share URL on Slack/email for interactive database applications

44:44 Data Journalism Use Case

Initial inspiration for Datasette
44:57: Make data available to reporters in newsroom
45:11: Few people load data, everyone else works off bookmarks
45:24: Send links for COVID numbers, senator data, etc.

45:38 SQL Query API

JSON API that takes SQL as parameter
46:07: Send SQL queries through API endpoint
46:39: Sounds like SQL injection vulnerability
46:56: Safe because:
- Publishing entire databases intentionally
- Read-only databases
- 47:07: 1-second time limit on queries
47:38: Very productive for JavaScript development - "glue SQL queries together"

47:54 Core Datasette Features Summary

Browse tables
Filter and facet
Export as JSON/CSV
48:14: Copyable plugin - export to LaTeX, GitHub markdown tables

48:49 Datasette Plugins

115 plugins available
49:36: Open source benefit - people add features without pull requests
49:53: Plugins let you experiment without affecting core project

50:25 datasette-cluster-map Plugin

First plugin ever written
Installation: datasette install datasette-cluster-map
50:37: Same as pip install but ensures correct virtual environment
50:56: Check plugins: datasette plugins
51:15: Automatically detects latitude/longitude columns and draws map
51:34: Example: Congressional offices map showing Pago Pago (American Samoa)
52:11: Learn new things about the world through demos

53:05 Exporting Data Options

Export current page (100 records)
53:12: Stream all rows - exports all 45,000 rows to CSV

53:49 SQLite Primary Keys

Can specify primary key when creating table
54:08: If no primary key specified, SQLite adds rowid automatically
54:20: rowid is auto-incrementing number (1, 2, 3...)
54:40: SELECT rowid, * to see hidden rowid column
54:59: Datasette automatically includes rowid when no defined primary key

55:19 Learning SQL with Datasette

Build queries with UI, then click "view and edit SQL"
55:43: See how SQL query works
56:02: Universities using Datasette for database education
56:14: Students explore data, then dip into SQL

56:28 Learn SQL Tutorial

Covers: SELECT, FROM, WHERE, ORDER BY, LIMIT
Named parameters
56:59: LIKE queries
Joins

57:05 SQL Joins

57:18: Speaker admits not being super comfortable with right joins
57:23: SQLite only added right join support 6 months ago
Everything doable with right join can be done with left join

57:42 Join Example - Executive Terms

Executive terms table has executive_id column
57:57: SELECT * FROM executives WHERE id=1 returns George Washington
58:13: Want to see name in the terms table

59:04 Join Syntax

JOIN executives ON executive_terms.executive_id = executives.id
59:17: Now returns name instead of just executive_id
59:30: SELECT * returns columns from both tables
59:48: Join structure: combine tables on matching criteria
60:18: Can nest multiple joins
60:23: This is an "inner join" (most basic kind)

60:36 Table Prefixes in Queries

Include full table names for readability
60:54: Like avoiding import * in Python
Makes it clear which table each column comes from

61:53 SQL as a Long-term Skill

Speaker learned SQL nearly 25 years ago
62:04: Also learned Perl, PHP, ActionScript - SQL is what still serves today
62:10: Like bash on Linux - pick up once, pays for decades
62:37: Basic SELECT syntax same across all database engines

62:49 ⭐ sqlite-utils Introduction

Tool built over past 2 years
63:15: Version 3.30 (many changes shipped)
63:22: Two things in one:
- Command line interface (CLI)
- Python library
63:36: Problem: Fastest way from "here's some data" to "data's in SQLite"
Installation: pip install sqlite-utils (or pipx, brew)

64:40 Manatee Data Example

64:53: Manatee carcass recovery locations in Florida
65:04: Florida Fish and Wildlife Conservation Commission data
65:10: Data since April 1974 (50 years)
65:18: 14,341 records
2.7 megabyte CSV file

66:02 Importing CSV to SQLite

Command: sqlite-utils insert manatees.db locations [file.csv] --csv -d
Structure: database file, table name, CSV file
66:21: --csv flag indicates CSV data
-d flag: Try to detect types (integers as integers, not strings)

66:35 CSV vs JSON for Data

66:42: CSV problem: null and blank string are indistinguishable
sqlite-utils imports blanks as blank strings
66:54: JSON preferred - has nulls, floating points represented correctly
67:00: Most data published as CSV, so tooling needed
67:06: Result: manatees.db file (5.4 megabytes)

67:57 User-Friendly SQLite Interfaces

Many Windows/Mac/GUI apps available (SQLite around 20 years)
68:15: Datasette plugins can upload CSV files directly
68:22: Speaker prefers command line for automation
68:34: Can stick recipes in bash scripts
68:42: GitHub Actions automations - download, convert, clean data

69:19 Viewing Database Schema

Command: sqlite-utils schema manatees.db
69:31: Shows table schema
69:38: X, Y, lat, long columns (4 ways to represent locations)

69:56 Exploring Manatee Data

13,000 rows
70:12: Facet data by mortality

70:50 Querying from Command Line

sqlite-utils manatees.db "SELECT ..."
71:03: Default output: tab-separated table
71:15: Without flags: outputs JSON
71:22: --format latex for LaTeX output on command line

71:40 Transform Command in `sqlite-utils`

Make alterations to table schema
72:20: Transform command is a more powerful ALTER TABLE SQL statement
72:27: SQLite ALTER TABLE limitations
- Can't reassign primary key
- Used to not rename columns (might be fixed)

72:53 How Transform Works Under the Hood

Creates new table (locations_temp) in new format
Copies old data to new table
Drops old table
Renames new table back to original name

73:10 Transform Operations

Rename columns: lat → latitude, long_ → longitude
73:28: Drop columns: created_user, last_user, X, Y, state, object_id
73:39: Reassign primary key to field_id column

74:05 Map Visualization

After renaming to latitude/longitude, map appears
74:15: Map shows 1,000 points by default
74:22: Can load all 14,000 points
Every manatee carcass in Florida since 1974
74:40: Florida Keys look safer than rest of Florida

75:09 Date Format Conversion

75:14: Speaker is European - dislikes American month/day/year format
75:20: Current format: MM/DD/YYYY (closer to European sensibilities)

75:27 ⭐ sqlite-utils convert Command

Convert values in columns to different formats
76:06: Takes Python expression as parameter
76:13: Converting rep_date, created_date, last_edited_date columns
76:20: Can put any Python expression
76:26: r.parsedate() function - uses dateutil under the hood
76:36: Converts any string to ISO formatted string
75:52: Result: ISO formatted dates (YYYY-MM-DD)

77:03 Recipe Namespace

r stands for recipes
77:20: Three built-in recipes:
- 77:26: json_split - splits comma-separated values into JSON array
- 77:31: parsedate, parsedatetime
77:43: Can use Python expressions: value.upper(), float(value) + 0.5

78:09 When to Use Python vs SQL

78:16: SQLite SQL is limited for manipulating data inside a cell.
- Can do string concatenation, basic math
  - (There is an extensive plugin system for SQLite)
- 78:22: Can't do advanced math operations
- 78:29: Missing advanced string manipulations
78:36: Postgres has rich SQL with regex, string operations
78:46: SQLite lets you register Python functions as SQL functions
79:08: Convert command defines Python function, registers it, runs SQL query

79:47 Registering Custom Functions

sqlite-utils query with --functions parameter
Pass block of Python code defining functions
80:05: Functions become available to SQLite query
80:10: Feature available in library too

80:24 ⭐ Database Normalization - Extract Command

80:33: Mortality column has duplicate data
80:40: Verified, Not Necropsied repeated thousands of times
80:47: Database normalization - don't duplicate same string thousands of times
81:06: Faceting shows decode and mortality have identical counts
81:16: Decode values: 1=human related, 4=watercraft, 8=perinatal

81:51 Extract Command Usage

sqlite-utils extract manatees.db locations DCODE MORTALITY \
  --rename MORTALITY name \
  --table mortality

sqlite-utils extract manatees.db locations decode mortality --rename mortality name
82:05: Creates new "mortality" table
82:11: Two tables now: locations and mortality
82:25: Locations now has mortality_id foreign key column
82:43: Foreign key reference defined in schema
82:49: sqlite-utils extracted data, created new table, rewritten original table
83:01: No longer duplicating data - using foreign key reference
83:08: Faceting still works - Datasette smart enough to pull official name

83:28 Benefits of Extract Tool

Haven't seen this refactoring tool in other systems
83:35: Use all the time with SQLite databases
83:41: Start with CSV, get to clean relational schema with few steps
83:47: Multiple tables with foreign keys all set up

84:21 Adding Foreign Keys to Existing Tables

❓ Question: Can you foreign key two CSV files?
- Yes, import both then use add-foreign-key command
84:39: sqlite-utils add-foreign-key mybooks.db books author_id authors id
84:60: SQLite doesn't have native ability to add foreign keys to existing tables
85:10: sqlite-utils has patterns baked in for this

85:23 Break Time

10 minute break
85:30: Next: Create database of PEPs and publish to internet
Then: Advanced SQL topics
Reconvene at 3:05pm

1:25:43: Demo: Create a database and publish to internet

105:21 SQLite vs Pandas - Tool Preference

105:27: "When all you have is a hammer everything's a nail"
When you have sqlite-utils, Python, Jupyter notebooks, and Datasette - everything becomes a SQLite problem
105:41: Not a heavy Pandas user - prefers SQL
105:53: Datasette is almost a competitor to Pandas
106:06: Many problems could be solved with either SQLite/Datasette or Pandas
Tends to lean towards tools being built

106:11 Why Python with SQL?

Python for manipulation and data cleaning layer
106:28: Tasks difficult in SQL: taking text files on disk and converting to format
SQL for querying at the end

106:46 Publishing to the Internet

106:52: GitHub Codespaces frustration - private by default
107:04: Free development environment but not free public hosting
107:17: Datasette's entire reason: help people publish data
107:22: Get data into shareable format as easily as possible
107:29: Going to publish peps.db (10 megabyte database)

107:51 Publishing with Vercel

Installation: npm install -g vercel
108:03: Using Vercel hosting platform
108:08: Works with many hosting platforms - Vercel has least steps
108:21: Encouraged to try even if haven't used Vercel before

108:27 Vercel Authentication & Setup

108:38: vercel login - gives sign-in options
Sign in with GitHub
108:49: Authenticate via webpage with code to copy/paste
109:01: Install plugin: datasette install datasette-publish-vercel
109:08: Plugin teaches Datasette how to deploy to Vercel

109:15 Publishing Database to Vercel

Command: datasette publish vercel notebook/peps.db --project peps
109:27: Uploads 10MB SQLite database to Vercel
109:34: Deploys full Datasette web application
109:41: Provides progress bar
109:47: Installing dependencies, running pip install
109:52: Generating build outputs
Deploying to Washington DC
Running checks on domains
109:59: Done! Available at peps-simonw.vercel.app

110:07 Published Database Features

Database created earlier, complete with search engine
Can search for "generators"
110:13: Running live on internet at peps-simonw.vercel.app
110:18: Can point custom domain names or assign aliases
110:31: About 30 seconds total (including software installation)
110:37: Database goes from local environment to public

110:42 JSON API

This is an API
110:50: Can build JS application needing JSON API for searching Python PEPs
110:57: Add ?_shape=array for nicer format
111:02: Every PEP matching "generators" as JSON API
Anyone in the world can use it

111:08 Vercel Platform Details

Based on Amazon Lambda functions
111:14: Scale to zero - no traffic = no cost
111:21: Very generous free tier - effectively free hosting for structured data
111:28: One catch: 50 megabyte maximum file size
111:34: For gigabyte databases: Google Cloud Run works similarly, handles up to 1GB easily

111:41 Datasette Creation Origin Story

111:47: This feature inspired creation of Datasette
111:53: Looking at Vercel (then called Zeit Now) before rebranding
112:00: Could this solve the newspaper problem?
112:05: Wanting good way to deploy structured data for people to use
112:12: Works really well
112:18: Deployed hundreds of applications to Vercel
Cost: $20/month max for Vercel plan with hundreds of projects
112:24: Really powerful way of sharing data

112:32 Other Publishing Options

112:38: Datasette can publish to Google Cloud Run and Heroku out of the box
112:44: Heroku canceled free plan - less exciting now
112:50: Favorite: Fly.io
Built plugin: datasette-publish-fly
112:56: Plugin system allows writing plugins for any hosting provider with good API
115:40: Datasette documentation has extensive coverage

113:12 Baked Data Pattern

Question about deployment process
113:28: datasette publish bundles database and application together
113:34: "Complete cheat"

113:40 Stateless Hosting Providers

Vercel, Amazon Lambda, Google Cloud Run designed as stateless
113:46: Will run code but if you want database - need to pay extra
Amazon RDS or Google database services
113:54: Hosting code that doesn't write to disk is really easy
114:00: Wrap container, launch for new request, shut down, scale with multiple copies
114:06: Very easy in 2023 to host stateless application
114:12: Stateful applications with disk access and backups - expensive and complicated

114:20 Read-Only Data Realization

Big realization: only care about read-only data
114:25: None of these accepting writes at all
114:31: If read-only, can "abuse" free hosting providers
114:38: SQLite built-in, runs in read-only mode
114:44: "Abusing" tools meant for simple applications
114:50: Stick 5MB of binary SQLite data in container - it just works
114:55: Called "baked data pattern" - baking data into application
115:02: Perfect example of why this pattern is interesting

115:08 Redeployment Strategy

New PEP published once or twice a month
115:14: Just redeploy the whole thing
Many projects deploy once a day - often enough for needed data
115:25: Don't worry about updating in place
Overwrite application with completely fresh deploy
New copy of database with changes baked in

115:47 Datasette Lite - WebAssembly Introduction

115:53: Experiment from last year, started as a joke
115:58: Pyodide project - runs Python entirely in browser
116:05: Compiles Python interpreter to WebAssembly
JupyterLite: "Most incredible piece of engineering"
116:12: Full Jupyter notebook stack in browser
116:25: Python via WebAssembly, all of Jupyter, pandas, numpy, matplotlib
116:37: Thought: wouldn't it be interesting if Datasette could run in browser?

116:44 Datasette Architecture

116:51: Very opinionated - NOT a JavaScript application
Very little JavaScript in Datasette
Some syntax highlighting
116:57: Clicking sort reloads whole page
117:03: Boring old HTML - simpler, faster to load, better accessibility
117:08: Funny to run pure server-side web application in browser using WebAssembly

117:14 Datasette Lite Demo

Built Datasette Lite
117:26: Open browser network tab to see what it does
117:32: Navigate to lite.datasette.io
117:38: Downloads:
- 2MB pyodide.asm.js (Python interpreter compiled to WebAssembly)
- 5MB pyodide.asm.data (Python standard library compiled to WebAssembly)
117:57: Downloads those files
118:03: Hits Python Package Index and runs pip install of Datasette
118:21: Pip install runs in browser, pulls latest Datasette from PyPI
118:27: Full Datasette with all features - faceting, SQL queries

118:35 Datasette Lite URLs

118:41: Fragment hash in URL
Everything after that is regular Datasette URLs
118:49: "Built as a bit of a joke"
118:55: Thought nobody would use it - loads 10MB before becoming usable
119:00: Forgot average React app is ~5MB
119:08: Waiting for few extra megabytes for full Python interpreter not restrictive

119:13 Genuinely Serverless

No server-side code running anywhere
119:19: Just static HTML, JavaScript, binary WebAssembly files
119:26: Nothing can break - no server to fill with logs or crash
119:31: Served off GitHub Pages - doesn't cost any money
119:37: Full interactive experience

119:43 Loading External Files

Built feature: give it URL to CSV, JSON, or SQLite database
119:49: If browser can fetch it, will load into Datasette Lite
119:56: Turns out to be really really useful

120:09 FiveThirtyEight Data Example

120:14: FiveThirtyEight publishes underlying data on GitHub
120:23: 538 data repository
120:31: fight-songs file - data about American college/university fight songs
Year written, etc.
120:37: CSV file

120:43 Loading CSV into Datasette Lite

Copy/paste CSV URL
Hit "Load CSV" button
120:50: Datasette fires up, fetches data from GitHub directly
Loads to a table
120:56: Now have Datasette interface on top of CSV data from URL
121:03: Can facet by official_song, student_writer
Can run SQL queries
121:09: Also bookmarkable

121:15 Sharing Datasette Lite Links

Send link to anyone else in the world
121:21: Their browser installs Datasette and Python, gets interpreter running, downloads CSV, loads in SQLite
All "bizarre crazy junk" in about 15 seconds
Then can interact with data same way

121:27 Workflow Simplification

Used to: download data, put in SQLite, publish to Vercel/Cloud Run
121:39: Now: for ad hoc data projects
Gather CSV/JSON data
121:45: Stick in GitHub gist
Link to Datasette Lite instance loading from GitHub gists

121:53 CORS and GitHub

121:59: JavaScript can only fetch from servers with special HTTP CORS header
122:05: Most hosting doesn't have this
Feed URL to most things on internet = error
122:11: GitHub has CORS header for all repository content and gists
122:16: GitHub = world's quickest CORS-friendly hosting for any data

122:32 Attribution Question

Question: Is there attribution?
122:40: There is not, but should be
Datasette has metadata mechanism for source/license info
122:47: Datasette Lite doesn't have it yet
122:52: "Lazy and haven't built it" - should build it, kind of rude

122:58 Red Pajama Training Data Example

123:05: Hobby: digging into training data behind AI models
123:12: New training set: Red Pajama
123:19: 2.6 terabytes of data for training language models
All of Wikipedia, archive.org, GitHub data, Stack Exchange
123:32: Wanted to investigate but not download 2.7TB

123:37 Clever Investigation Approach

Released as text file with 2,000 URLs
123:44: Each URL = 1 gigabyte blob
Want all data? Run wget and download all 2000 URLs = 2.7TB disk space
123:56: Instead: HTTP HEAD request against each URL
Get back just size in bytes
124:04: Run in 15 seconds - churn through all 2000, get sizes
124:09: Now have URLs and sizes

124:19 Creating Analysis Dataset

124:35: Turned into JSON file
124:40: JSON file has 2,000 records
Each: URL, size, megabytes, gigabytes
124:46: Made MB and GB calculations "feeling lazy"
124:53: Has path, split out top folder (tells source of data)

124:58 Analyzing with Datasette Lite

Have the gist
Go to Datasette Lite
Feed in URL
125:06: Now have a table
Very quick route: list of files → files + sizes → JSON gist → interactive interface
125:12: Can start playing with it
125:18: Facet by top_folder
125:23: Wikipedia was single file (all in one Wikipedia file)
C4 and Common Crawl most represented
125:31: From raw data to telling stories and understanding data in few steps as possible

125:44 Datasette Lite as Default Solution

Started as joke
125:51: Now default solution for publishing/working with few thousand records
125:58: Do it in gist, load in Datasette Lite
126:04: Don't have to think about provisioning servers or deploying server-side Python web app

126:10 Plugin Support in Datasette Lite

Datasette Lite supports plugins
126:16: Construct URL with ?install= and plugin name
126:23: Can do multiple times: install=this&install=that
126:30: On startup will pip install plugins
Now have new features (example: copyable plugin for LaTeX)
126:38: Can use directly with install=datasette-copyable in URL
126:52: Further reading documentation includes blog entry on how plugin support works
126:58: Ecosystem benefit - plugins written for Datasette work in Datasette Lite
127:04: Multiplies ecosystem value

127:24 Browser Storage Limits

Question: How much data can your browser hold?
127:29: Haven't tested upper limits yet - thinks it's gigabytes
127:42: Fascinated by large language models (ChatGPT)
127:47: WebLLM - runs ChatGPT-style model entirely in browser
127:53: Uses WebAssembly and WebGPU (new graphics card API in Chrome)
128:00: Loaded 2GB model file into browser - it worked
128:07: Believes 2GB SQLite file will work fine as well
128:14: Laptop has 64GB RAM - 2GB browser tab isn't a big deal
128:21: Download speed is bigger problem
128:26: Home internet loads 2GB in 1-2 minutes - feasible
128:31: Surprising how much you can get done in browser

128:37 Advanced SQL Introduction

128:50: Not teaching everything about SQL in next 20 minutes
128:57: Covering favorite advanced SQL features with enough detail to dive deeper later

129:08 SQL Aggregations

First topic - hardly even advanced
Helps build things up for later
129:15: Aggregation = calculating across groups of rows together
129:22: Doing calculations across groups of rows

129:30 Example: Presidential Terms by Party

Looking at executive terms table
129:36: For each party, how many presidential terms?
Democrat Party: 5
Democratic party: 21
129:43: Democratic-Republican: 7 (never heard of)
129:49: Key pattern: GROUP BY at end to group rows
129:55: Groups into chunks matching on that column
130:01: Can do aggregate functions like COUNT(*) across individual groups

130:06 Aggregate Functions

COUNT(*) most common
130:12: Other commonly used:
- MAX for maximum value
- MIN for minimum value
- SUM to add things up

130:26 Training Data Example with Aggregation

Showing Datasette Lite example
130:38: Escaping bug - ampersands where there should be "and"
130:45: Running aggregation across top_folders column
130:52: Summing total gigabytes as total_GB
130:57: Wikipedia: 111GB in one file
131:02: Stack Exchange: 74GB in one file
GitHub: 212GB across 98 files
131:07: Useful things to calculate
131:14: Worth exploring aggregates

131:19 Subqueries - More Fun Example

131:24: Concept in SQL called subqueries
131:31: Missed this for years - worked with MySQL
MySQL prior versions didn't support this feature
131:36: Didn't know this was possible
131:41: SQLite, Postgres, other SQLs have really good support

131:47 Subquery Example

Can say: SELECT something FROM so_and_so WHERE repo IN ...
131:53: Normally give comma-separated list of values
Instead: give another SELECT query
132:00: "Where repository is in list of IDs from repos where full_name is in list of repos from plugin_repos"

132:12 Breaking Down Subquery Example

Run as separate steps first
SELECT repo FROM plugin_repos
132:18: Get back list of repository names (strings)
132:24: SELECT id FROM repos WHERE full_name IN ...
132:31: full_name turned into ID
132:42: Want IDs for things where name is in that other table
132:47: Could normally do with JOIN
132:53: Joins across multiple tables quite difficult to follow
132:59: Anytime we get to a join, have to think about it
133:06: Using nested subqueries can actually follow easier
Can understand step by step: name of repos → list of IDs → list of those

133:14 Full Subquery Example

SELECT url FROM releases WHERE repo IN ...
133:21: Getting back URLs to releases
133:27: Most recent release (ordered by created)
133:33: Most recent releases of repositories in list of plugin repositories
133:39: Datasette website has two types: plugins and tools
133:44: Switch to tool_repos: swarm-to-sqlite, apple-notes-to-sqlite, db-to-sqlite
133:51: Same thing from plugin_repos: gets latest releases of plugins
133:59: Bit convoluted, probably should have broken down simpler

134:33 Scalar Subqueries - Even Cooler

Another type of subquery
134:44: For each repository, want:
- Name of repository
- 134:50: URL of latest release in that repository
134:55: Latest releases in other table
135:02: In SELECT clause itself, can run complicated query
Has to be LIMIT 1
135:23: Key idea: for each repository listed, running second SQL query inside loop
135:29: Fetch additional related data

135:35 Query Efficiency Considerations

Instinct: looks like terrible idea
135:42: Pulling 200 results, then doing another 200 SQL queries
135:47: SQLite optimizes that to certain extent
135:53: Database table with 200 rows - nothing can take longer than fraction of millisecond
135:58: Often deliberately write very inefficient queries
For size of data working with, honestly doesn't matter

136:09 Big Data and Inefficient Queries

Worked for larger company with Hive, Presto, Big Data Warehouse things
136:14: Fun thing: can write very inefficient SQL against them too
136:21: Whole reason for existing: break complex queries into map-reduces
136:26: Inefficient queries work well against Big Data warehouses
136:33: Would never do this in production for high-traffic web application
136:40: For this data size, can do lots with "laughably inefficient SQL loops"

136:47 Experimenting with Subqueries

Often default to weird subqueries to try them out
136:54: See what works and what doesn't
Always end up learning interesting new patterns
One more super fun example coming up

137:08 Common Table Expressions (CTEs)

All-time favorite SQL function/feature
137:13: Terrible name - not enough people aware of it
Called CTE (Common Table Expressions)
137:22: SQL's answer to abstraction
Like Python: duplicate code logic → put in function with nice name
137:29: Helps maintainability and productivity
137:34: SQL queries notorious for getting long
137:40: People have 500-line SQL queries left by previous employee
Nobody knows how it works, everyone terrified to touch it

137:52 Making SQL More Effective

Best way: break it up
137:59: Using Common Table Expressions
Way to create temporary table
138:04: Exists purely for duration of current query

138:12 CTE Example: Presidents Who Were VPs

Question: How many presidents originally served as vice presidents?
138:25: Using executive_terms table - mixes two types of office
Has presidents and vice presidents
138:32: Create table called "presidents"
138:37: Then SELECT * FROM presidents
138:43: Could do without CTE, get same result
138:48: Illustrating: giving this SELECT query (doing join and WHERE clause) the alias "presidents"
138:55: Quite complicated query
139:01: Later can select from it

139:07 Adding Multiple CTEs

Can add another: vice_presidents AS ...
Can SELECT * FROM vice_presidents
139:21: Or SELECT * FROM presidents
Now have two virtual tables
139:27: Can solve problem: SELECT DISTINCT name from presidents WHERE name appears in list of vice_presidents
139:32: Get back 15 rows
139:38: Each is president who was previously vice president

139:44 Data Quality Note

Suspicious of George Bush
139:50: Americans reusing names - George Bush and George H.W. Bush
139:57: Would have to dig in to verify proper relationship

140:02 Benefits of CTEs

Using technique: take very complicated queries
140:08: Turn into something where you can understand what's happening
140:14: "I love this"
140:21: Got more ambitious with problems solved in SQL vs Python once figured out how to use
Allows taking complex logic and abstracting to point where you can reason about it
140:27: Then do joins between them, unions, etc.
140:33: If don't use CTEs, very strongly recommend exploring them

140:43 CTEs vs Self Joins

Question: Could do with self join?
Yes, absolutely
140:50: Thinks it would be less readable
Hunch: would be harder to follow with self join
140:56: CTEs mostly readability hack
141:02: Great way to make it easier to look at query and understand what it's doing

141:08 JSON in SQLite

141:13: Another feature SQLite has baked in - incredibly valuable
141:19: Functions that know how to deal with JSON objects stored in database
141:25: Postgres has JSON type (JSON and JSONB) - optimized
141:31: SQLite doesn't - tells you to store JSON in text column
If you store JSON in text columns, can do interesting things with it
141:38: More importantly: SQLite can output JSON
141:44: Can use functions to build up JSON values to return

141:50 JSON Aggregation Example

Problem: get back list of all legislators
141:57: For each legislator, fetch all terms they served
All in one go
142:04: Normally: write Python code or pull back duplicate versions
Many ways to do this
142:17: Neat to say: want single row per legislator
142:22: In one column should be JSON list of objects from other table

142:33 JSON Query Structure

Four columns:
- Legislator's ID
- Name
- 142:39: JSON list of all terms served from related table
- 142:45: Number of terms (1, 13, etc.) - counter

142:50 How It Works

Using aggregate like shown earlier
142:57: SELECT these things grouped by legislators.id
143:02: Deduping and grouping on ID of legislator
Join against legislative_terms
143:09: Actually have duplicate records
Show duplicate records by removing GROUP BY

143:23 Duplicate Data Example

Actual data full of dupes
143:28: Watkins Abbott represented 13 times
Watkins Abbott had 13 terms in Congress
143:36: First step: GROUP BY legislators.id to dedupe
143:44: Do COUNT(*) - gives back ID, name, and number of terms
143:52: 13, 4, 3, 5, 1, 1, 5... (limited to 10 rows)
Get back all legislators with all counts

144:03 JSON_GROUP_ARRAY Magic

Real magic: adding aggregation called json_group_array
144:09: For every duplicate chunk of rows
144:14: Build array that joins them together
144:20: In array: put JSON object
144:26: JSON object has labels and columns (type, state, start, end, party)
144:31: Can see in output: type, state, start, end, party in JSON objects

144:37 Common Web Development Problem

2:25:01
Problem solving: always frustrating in web app development
144:42: Want to pull back all blog entries
144:48: For each one, list of tags on that blog entry
Don't want multiple SQL queries
144:54: Once figure out JSON aggregations
145:00: Can solve "most frustrating problem in web development"
145:05: SQL query does all that fetching

145:10 Article Reference

Wrote extensively about this
Article: "Returning related rows in a single query using JSON"
145:16: Shows how to do same thing in Postgres

145:23 Postgres JSON Syntax

Looks slightly different
145:29: json_agg and json_build_object
145:38: Same idea, syntax very slightly different
145:44: Really powerful trick
145:50: Favorite advanced SQL mechanism - solves problem fought many times as developer

146:05 Window Functions - Final Advanced Feature

Talked about CTEs and JSON
146:13: Final advanced feature: window functions
146:19: Much more powerful analytical problem
Want to do rolling averages over time
146:25: Specifically important for time series data
146:30: Need rolling averages or calculations by chunks of data as it moves along

146:53 Window Function Example Problem

For every repository in database, want:
- 146:59: Name of repository
- 147:06: Date of most recent release associated with repository
- 147:11: Three most recent releases as JSON array
147:17: Previously: showed JSON example returning all terms for each legislator
Hoped none served million terms - would break query
147:24: Some projects have had over 100 releases
Just want three most recent returned in query

147:31 Building the Window Function Query

147:39: Start with join - don't even need CTE
Joining releases against repos table
147:44: More than 1,000 rows
147:50: Datasette itself has had a lot of releases
147:55: This is raw data to start working with
148:00: Lot of deduping needed

148:07 Adding Rank Column

Going to add a rank
148:13: To select just three most recent, need indication
148:20: Incrementing integer for which release that was for those projects
148:27: Adding "little magic piece of syntax"
SQL calls this window function
148:34: Feature added to SQLite maybe 3 years ago, inspired by Postgres implementation

148:40 How Window Functions Work

For each release, create partition
148:46: Unique for each repository ID
Ordered by created date on releases in descending order
148:52: As "rel_rank"
Hard to see what that means from syntax
148:58: Look at output - added rel_rank column at end
149:04: Keeps going up for all of Datasette's releases (had 125)
149:12: Switch to csv-to-sqlite - only had 13 releases
Goes 1, 2, 3, 4, 5 up to 13
149:18: datasette-cluster-map had 1-12 releases

149:23 Filtering on Rank

149:29: In single result, have something else to filter on
149:36: "I just want the top three"
149:47: Can say SELECT * FROM cte WHERE rel_rank <= 3
150:06: Less than or equal to 3
150:13: Almost solved problem
For each repository, show three most recent
150:19: For Datasette: 3 rows (versions 6.4.2, 6.4.1, 6.4)
150:24: csv-to-sqlite: 3 rows, cluster-map: 3 rows
150:29: dbf-to-sqlite only ever had 1 release - shows up once
150:35: Most others had at least 3
datasette-cors: another with single release
150:41: This is almost what we need

150:49 Final Step: Adding JSON Aggregation

Taking advantage of Common Table Expression
151:01: Adding back json_group_array with json_object
That worked
151:08: For Datasette repo: 125 releases total
Only 3 we care about in recent_releases
151:14: Those 3 showing up with ID and name of creator
151:21: Quite complicated query - getting quite a lot done
151:28: Combining CTE + json_group_array + window functions
151:35: Pretty sophisticated result
151:41: If wanted to build custom application showing summarized data - most work done in this one SQL query

151:46 JSON Aggregation Performance Question

Question: What's computational cost of doing JSON aggregation in SQLite?
152:00: What's overhead of JSON construction in SQLite vs Python?
152:08: Haven't measured it - hunch is SQLite's faster
152:14: Everything in SQLite is C - incredibly well optimized
152:19: Benchmarking they're doing is absolutely extraordinary
152:25: Has benchmarked: JSON functions can crunch through gigabytes of data
152:32: Astonishingly fast at processing things

152:40 Paul Ford Quote on JSON

Quote from Paul Ford dropped into section
152:46: "We save the text and when you retrieve it we parse the JSON at 700 megabytes a second"
152:51: "You can do path queries against it, please stop overthinking it - this is a filing cabinet"
152:57: Paul Ford's trick: gets JSON from somewhere
Literally loads into SQLite table with one row per object with blob of JSON
153:04: Uses SQLite's built-in JSON features to extract bits, convert things
153:11: Often saves results to new database table
Does all data processing with JSON functions inside SQLite itself
153:19: "I'm not quite there yet - haven't embraced it to quite that extent"

153:24 JSON Performance Benefits

Phenomenally performant way of working with data
153:32: Most data these days comes in JSON shape
Database engine that can do JSON really well - SQLite does
153:37: Find yourself leaning on that more and more
153:44: Always try to get things into relational table eventually (for faceting, filtering)
153:50: Lot you can get done just using blobs of JSON with fast C library

153:57 Fun Demos - Personal Projects

Running close to time
154:03: Going to show fun demos - things built with Datasette
"Chance for me to show off some of my websites"

154:09 Niche Museums Website

Website for main hobby outside of technology
154:16: "I basically collect tiny museums"
154:22: Goal: whenever in new city, never been to Salt Lake City before
Very interesting tiny museums around here
154:30: Try to find smallest museums and go to those first
154:36: Really tiny museum - doesn't matter what it's about
Chances are person who collected stuff is there running museum
154:42: Get to have conversation with somebody who collects evidence of:
- Bigfoot
- Pez dispensers
- Little devil statues
154:48: "Super super rewarding"

154:54 Niche Museums Site Features

Website: where all tiny museums found and visited go
Can say "use my location"
155:00: Shows Donna Memorial State Park Visitor Center 450 miles away
Closest visited to here
155:06: "Clearly not explored Utah and Colorado at all yet"
This is just Datasette
155:12: Same Datasette web application seen earlier
155:18: Datasette lets you use custom templates
Custom template for home page makes it look like website about museums
155:24: Custom template knows how to put things on map
155:30: Search feature is just another canned query
Can tell because at nichemuseums.com/browse - it's Datasette
155:37: This is the website - there's the museums table
Deployed on Vercel, costs nothing to run
155:45: Has plugins
Every time new museum added, deploys new version of site - works fine

155:50 RSS Feed via SQL

Website has RSS feed
155:56: Click there - get XML back
Can subscribe to in feed reader
156:02: Actually defined using SQL query
Query called "browse_feed"
156:07: Giant monstrosity of SQL query
Using json_each and coalesce
156:13: Output of query: table with columns atom_id, atom_title, atom_updated
156:20: Datasette plugin called datasette-atom
Only fires when table has these columns
156:25: Adds .atom extension, turns into XML

156:31 Other Output Plugins

Few other plugins along those lines
datasette-ics: can output subscribable ICS calendar files
156:36: Kind of fun

156:42 SQL as Integration Language

Effectively using SQL as integration language
156:48: As developers, spend so much time getting data in one format and transforming to another
156:55: SQL's really good at that: SELECT something AS atom_id, something AS atom_link
Transforms into that other format
157:02: If have plugins that know how to output things in different shapes
Can pipe it all together
157:08: Out-of-the-box engine for reformatting data into different formats
Something played with a bunch - thinks is really neat

157:20 Blog Project - Combining Everything

Project illustrates what happens when you pull all this stuff together
157:26: Has a blog - running for 20 years
157:32: Been Django app for most of its life - Django and Postgres
Runs on Heroku
157:38: Works - got Django admin, love those features
These days likes to do things with SQLite and Datasette
157:45: Has copy of blog running at datasette.simonwillison.net
"My blog as a SQLite database"

157:51 GitHub Actions Automation

How it works: GitHub action script running
157:59: "26 star slash two - can anyone remember what that means?"
158:05: Every two hours
Every two hours grabs backup of database from Heroku as Postgres dump
158:11: Converts to SQLite database
158:18: Gets those tables
158:24: Redacts password column - don't want that out there
Runs few transforms to drop tables
158:31: At end runs datasette publish
158:38: To Cloud Run simonwillisonblog.db
158:44: With bunch of extra options, installs plugins
158:50: Every two hours latest version of blog gets published to Datasette instance
158:57: Can run SQL against all stuff, run queries and searches
All Datasette goodness available

159:02 Substack Newsletter Automation

Wanted to start Substack newsletter
159:07: For people who only ever use email to subscribe to blog
159:13: Didn't want to do any extra work - already putting stuff on blog
Wanted simplest possible way to get blog content into newsletter format
159:19: Substack do NOT give you API
159:26: Give you rich text editor to write content in
Do not give anything to help automate process
159:32: But you can copy and paste things into Substack editor

159:39 Observable Notebook Tool

Rich text editor - can do bold, links, headings, images
159:49: Built tool using Observable notebook
Called "blog-to-newsletter"
159:56: Giant bundle of JavaScript hitting Datasette API on blog
Fetches back all sorts of different bits of content
160:02: "Terrifying SQL query in here somewhere"
160:10: SQL query that reads from 5 different tables
Glues things together, sorts by, collect
160:16: All the CTE tricks

160:24 Newsletter Generation Process

Based on running that, get table of data
Has titles, URLs, different types of content
160:30: Bit of JavaScript at top glues it together into markdown
Actually HTML
160:36: Generates HTML, previews HTML
160:42: At very top: big gray "Copy" button
Click copy, go to Substack, hit paste
160:50: "That's my newsletter, I'm done"
160:56: Put that title in the title at top, add subheading, click Send
161:03: Sending newsletter once or twice a week takes about 1.5 minutes end to end

161:09 Technology Stack Summary

161:16: Blog in Django
GitHub Actions pulling from Postgres and converting to SQLite
161:22: Publish to Google Cloud Run
Run JavaScript in Observable to pull it all together
161:28: "It kind of just works" once one side-pieces those bits together
161:33: Thing likes about Observable notebooks: don't have to commit code to GitHub or open text editor
161:40: All hard work done directly in notebook interface
161:47: "Kind of cool" - nice illustration of projects you can do with all these components together

161:53 Q&A Session

Only three minutes left
Covered everything wanted to cover
Three minutes of questions

162:20 Question: Materialized Views

Question about materializing some views (e.g., "here's just the final third versions")
162:35: Didn't get into today: Datasette supports SQL views
162:42: SQLite supports views
162:49: Example: "recent_releases" view
Does join between releases and repos table
162:54: Bunch of stuff played around with earlier
163:00: Because defined as view, Datasette treats it like another table
Same interface - can facet by topics
163:06: Order by date, get things out as JSON
163:12: Behaves exactly like table except defined as extra SQL feature
163:18: "Super - something I find myself doing a lot"
163:24: Get bunch of raw data in normalized form, use views to create denormalized convenience layers
163:30: Can get that out as JSON as well
If want JSON API for very specific shape of data - quickest way is define view and do it that way

163:57 Question: SQLAlchemy vs Building Own Abstraction

Question: "If I don't want to build database abstraction layer, what about using SQLAlchemy?"
164:03: Good question
Never quite gelled with SQLAlchemy
164:08: SQLAlchemy is big and complicated
164:13: With Datasette, trying to minimize dependencies
164:19: Biggest dependency for Datasette: Jinja templating language
Feels has proved itself over many years
164:24: SQLAlchemy has proved itself as well - wouldn't be mistake to lock into SQLAlchemy
164:31: Worried about dependencies with backwards incompatible changes next year
164:38: Could be SQLAlchemy would help solve database abstraction problem well
164:45: Hunch: Datasette doesn't have many bits that would differ between databases
164:51: Mainly the way table page works - at least quite isolated
164:56: Refactoring that code at moment
Thinking maybe pluggable database backends wouldn't be as difficult as initially thought
165:02: If do pluggable database backends, first two: DuckDB and Postgres
165:08: If can do those two - 90% of data problems in world fit into SQLite, DuckDB, or Postgres

165:18 Closing Remarks

At time
165:24: "Thank you all very much for coming along"
Will be hanging around for another 15 minutes
165:31: Happy to talk about stuff more
Handout will stay available forever
Will probably update it, turn bits into more official Datasette tutorials
165:37: "Super keen on feedback on this stuff"
165:43: Anything that didn't make sense, could be easier, ideas for future projects
"I want to hear all of them"
Here for rest of PyCon
"Thank you very much"

Tutorials - Simon Willison Data analysis with SQLite and Python

Interactive graph

On this page

Notes

00:02 Introduction to PyCon Tutorial

01:14 Tutorial Overview

02:30 Requirements & Setup

05:37 What is SQLite?

09:32 Getting Started - Downloading Example Database

11: 11 Basic Python SQLite Usage

12:23 Working with Cursors

14:05 Creating Tables

16:02 Inserting Data

20: 23 Core SQL Operations

25:29 Introducing Datasette

26:51 Datasette Web Interface

30:39 Congressional Legislators Database

33:45 Exploring Data in Datasette

34:24 Foreign Keys & Relations

35:40 Faceting - Favorite Datasette Feature

37:07 Query Builder Interface

38:35 Exporting Data

39:25 URL-Based API

41:37 Custom SQL Queries

43:18 Named Parameters in Queries

44:44 Data Journalism Use Case

45:38 SQL Query API

47:54 Core Datasette Features Summary

48:49 Datasette Plugins

50:25 datasette-cluster-map Plugin

53:05 Exporting Data Options

53:49 SQLite Primary Keys

55:19 Learning SQL with Datasette

56:28 Learn SQL Tutorial

57:05 SQL Joins

57:42 Join Example - Executive Terms

59:04 Join Syntax

60:36 Table Prefixes in Queries

61:53 SQL as a Long-term Skill

62:49 ⭐ sqlite-utils Introduction

64:40 Manatee Data Example

66:02 Importing CSV to SQLite

66:35 CSV vs JSON for Data

67:57 User-Friendly SQLite Interfaces

69:19 Viewing Database Schema

69:56 Exploring Manatee Data

70:50 Querying from Command Line

71:40 Transform Command in sqlite-utils

72:53 How Transform Works Under the Hood

73:10 Transform Operations

74:05 Map Visualization

75:09 Date Format Conversion

75:27 ⭐ sqlite-utils convert Command

77:03 Recipe Namespace

78:09 When to Use Python vs SQL

79:47 Registering Custom Functions

80:24 ⭐ Database Normalization - Extract Command

81:51 Extract Command Usage

83:28 Benefits of Extract Tool

84:21 Adding Foreign Keys to Existing Tables

85:23 Break Time

1:25:43: Demo: Create a database and publish to internet

105:21 SQLite vs Pandas - Tool Preference

106:11 Why Python with SQL?

106:46 Publishing to the Internet

107:51 Publishing with Vercel

108:27 Vercel Authentication & Setup

109:15 Publishing Database to Vercel

110:07 Published Database Features

110:42 JSON API

111:08 Vercel Platform Details

111:41 Datasette Creation Origin Story

112:32 Other Publishing Options

113:12 Baked Data Pattern

113:40 Stateless Hosting Providers

114:20 Read-Only Data Realization

115:08 Redeployment Strategy

115:47 Datasette Lite - WebAssembly Introduction

116:44 Datasette Architecture

117:14 Datasette Lite Demo

118:35 Datasette Lite URLs

119:13 Genuinely Serverless

119:43 Loading External Files

120:09 FiveThirtyEight Data Example

120:43 Loading CSV into Datasette Lite

121:15 Sharing Datasette Lite Links

121:27 Workflow Simplification

121:53 CORS and GitHub

122:32 Attribution Question

122:58 Red Pajama Training Data Example

123:37 Clever Investigation Approach

124:19 Creating Analysis Dataset

124:58 Analyzing with Datasette Lite

125:44 Datasette Lite as Default Solution

126:10 Plugin Support in Datasette Lite

127:24 Browser Storage Limits

128:37 Advanced SQL Introduction

129:08 SQL Aggregations

129:30 Example: Presidential Terms by Party

130:06 Aggregate Functions

130:26 Training Data Example with Aggregation

131:19 Subqueries - More Fun Example

131:47 Subquery Example

132:12 Breaking Down Subquery Example

133:14 Full Subquery Example

134:33 Scalar Subqueries - Even Cooler

135:35 Query Efficiency Considerations

136:09 Big Data and Inefficient Queries

136:47 Experimenting with Subqueries

137:08 Common Table Expressions (CTEs)

137:52 Making SQL More Effective

138:12 CTE Example: Presidents Who Were VPs

139:07 Adding Multiple CTEs

139:44 Data Quality Note

140:02 Benefits of CTEs

140:43 CTEs vs Self Joins

141:08 JSON in SQLite

141:50 JSON Aggregation Example

142:33 JSON Query Structure

142:50 How It Works

143:23 Duplicate Data Example

144:03 JSON_GROUP_ARRAY Magic

144:37 Common Web Development Problem

145:10 Article Reference

145:23 Postgres JSON Syntax

146:05 Window Functions - Final Advanced Feature

146:53 Window Function Example Problem

147:31 Building the Window Function Query

148:07 Adding Rank Column

148:40 How Window Functions Work

149:23 Filtering on Rank

150:49 Final Step: Adding JSON Aggregation

151:46 JSON Aggregation Performance Question

152:40 Paul Ford Quote on JSON

153:24 JSON Performance Benefits

153:57 Fun Demos - Personal Projects

154:09 Niche Museums Website

154:54 Niche Museums Site Features

155:50 RSS Feed via SQL

156:31 Other Output Plugins

156:42 SQL as Integration Language

157:20 Blog Project - Combining Everything

157:51 GitHub Actions Automation

159:02 Substack Newsletter Automation

159:39 Observable Notebook Tool

160:24 Newsletter Generation Process

161:09 Technology Stack Summary

161:53 Q&A Session

162:20 Question: Materialized Views

163:57 Question: SQLAlchemy vs Building Own Abstraction

165:18 Closing Remarks