diff --git a/blog/sqlite-json-munge-2022-01-04.markdown b/blog/sqlite-json-munge-2022-01-04.markdown new file mode 100644 index 0000000..f662d4b --- /dev/null +++ b/blog/sqlite-json-munge-2022-01-04.markdown @@ -0,0 +1,357 @@ +--- +title: Bashing JSON into Shape with SQLite +date: 2022-01-04 +series: howto +tags: + - sqlite + - json +--- + +It is clear that most of the world has decided that they want to use JSON for +their public-facing API endpoints. However, most of the time you will need to +deal with storage engines that don't deal with JSON very well. This can be +confusing to deal with because you need to fit a square peg into a round hole. + +However, [SQLite](https://www.sqlite.org) added [JSON +functions](https://www.sqlite.org/json1.html) to allow you to munge and modify +JSON data in whatever creative ways you want. You can use these and SQLite +[triggers](https://www.sqlite.org/lang_createtrigger.html) in order to +automatically massage JSON into whatever kind of tables you want. Throw in +upserts and you'll be able to make things even more automated. This support +was added in SQLite 3.9.0 (released in 2015), so assuming Debian didn't disable +it for no good reason, you should be able to use it today. + +For this example, we're going to be querying publicly available +[JSONFeed](https://www.jsonfeed.org/) endpoints and turning that into SQL +tables. Let's start with a table schema that looks like this: + +```sql +CREATE TABLE IF NOT EXISTS jsonfeed_raw + ( feed_url TEXT PRIMARY KEY + , scrape_date TEXT NOT NULL DEFAULT (DATE('now')) + , raw TEXT NOT NULL + ); +``` + +[The scrape date is essentially the date that the JSONFeed row was inserted into +the database. This can be useful when writing other parts of the stack to +automatically query feeds for changes. This is left as an exercise to the +reader.](conversation://Mara/hacker) + +You can then insert things into the SQLite database using Python's `sqlite3` +module: + +```python +#!/usr/bin/env nix-shell +#! nix-shell -p python39 --run python + +import sqlite3 +import urllib.request + +con = sqlite3.connect("data.db") + +def get_feed(feed_url): + req = urllib.request.Request(feed_url, headers={"User-Agent": "Xe/feedfetch"}) + with urllib.request.urlopen(req) as response: + cur = con.cursor() + body = response.read() + cur.execute(""" + INSERT INTO jsonfeed_raw + (feed_url, raw) + VALUES + (?, json(?)) + """, (feed_url, body)) + con.commit() + print("got feed %s" % (feed_url)) + +get_feed("https://christine.website/blog.json") +``` + +So now let's play with the data! Let's load the database schema in with the +`sqlite3` command: + +```console +$ sqlite3 data.db < schema.sql +``` + +[The less-than symbol there is a redirect, it loads the data from `schema.sql` +as standard input to the `sqlite` command. See here +for more information on redirections.](conversation://Mara/hacker) + +Then run that python script to populate the database: + +```console +$ python ./jsonfeedfetch.py +got feed https://christine.website/blog.json +``` + +Then open up the SQLite command line: + +```console +$ sqlite3 data.db +SQLite version 3.36.0 2021-06-18 18:36:39 +Enter ".help" for usage hints. +sqlite> +``` + +And now we can play with a few of the JSON functions. First let's show off +[`json_extract`](https://www.sqlite.org/json1.html#the_json_extract_function). +This lets you pull a value out of a JSON object. For example, let's get the feed +title out of my website's JSONFeed: + +```console +sqlite> select json_extract(raw, '$.title') from jsonfeed_raw; +Xe's Blog +``` + +We can use this function to help us create a table that stores the metadata we +care about from a JSONFeed, such as this: + +```sql +CREATE TABLE IF NOT EXISTS jsonfeed_metadata + ( feed_url TEXT PRIMARY KEY + , title TEXT NOT NULL + , description TEXT + , home_page_url TEXT + , updated_at TEXT NOT NULL DEFAULT (DATE('now')) + ); +``` + +[If you ask my coworkers, they can confirm that I actually do real life +unironcally write SQL like that.](conversation://Cadey/coffee) + +Then we can populate that table with a query like this: + +```sql +INSERT INTO jsonfeed_metadata + ( feed_url + , title + , description + , home_page_url + , updated_at + ) +SELECT jsonfeed_raw.feed_url AS feed_url + , json_extract(jsonfeed_raw.raw, '$.title') AS title + , json_extract(jsonfeed_raw.raw, '$.description') AS description + , json_extract(jsonfeed_raw.raw, '$.home_page_url') AS home_page_url + , DATE('now') AS updated_at +FROM jsonfeed_raw; +``` + +[The `AS` keyword lets you bind values in a `SELECT` statement to names for use +elsewhere in the query. I don't know if it's _strictly_ needed, however it makes +the names line up and SQLite doesn't complain about it, so it's probably +fine.](conversation://Mara/hacker) + +Now this is workable, however you know what's easier than writing statements in +the SQLite console like that? Not having to! SQLite triggers allow us to run +database statements automatically when certain conditions happen. The main +condition we want to care about right now is when we insert new data. We can +turn that statement into an after-insert trigger like this: + +```sql +CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_ins + AFTER INSERT ON jsonfeed_raw + BEGIN + INSERT INTO jsonfeed_metadata + ( feed_url + , title + , description + , home_page_url + ) + VALUES ( NEW.feed_url + , json_extract(NEW.raw, '$.title') + , json_extract(NEW.raw, '$.description') + , json_extract(NEW.raw, '$.home_page_url') + ); + END; +``` + +Then we can run a few commands to nuke all the database state: + +```console +sqlite3> DELETE FROM jsonfeed_metadata; +sqlite3> DELETE FROM jsonfeed_raw; +``` + +And run that python script again, then the data should automatically show up: + +``` +sqlite3> SELECT * FROM jsonfeed_metadata; +https://christine.website/blog.json|Xe's Blog|My blog posts and rants about various technology things.|https://christine.website|2022-01-04 +``` + +It's like magic! + +However, if you run that python script again without deleting the rows, you will +get a primary key violation. We can fix this by turning the insert into an +[upsert](https://www.sqlite.org/lang_UPSERT.html) with something like this: + +```python +cur.execute(""" + INSERT INTO jsonfeed_raw + (feed_url, raw) + VALUES + (?, json(?)) + ON CONFLICT DO + UPDATE SET raw = json(?) +""", (feed_url, body, body)) +``` + +And also make a complementary update trigger for the `jsonfeed_raw` table: + +```sql +CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_upd + AFTER UPDATE ON jsonfeed_raw + BEGIN + INSERT INTO jsonfeed_metadata + ( feed_url + , title + , description + , home_page_url + ) + VALUES ( NEW.feed_url + , json_extract(NEW.raw, '$.title') + , json_extract(NEW.raw, '$.description') + , json_extract(NEW.raw, '$.home_page_url') + ) + ON CONFLICT DO + UPDATE SET + title = json_extract(NEW.raw, '$.title') + , description = json_extract(NEW.raw, '$.description') + , home_page_url = json_extract(NEW.raw, '$.home_page_url') + ; +``` + +[You should probably update the original trigger to be an upsert too. You can +follow this trigger as a guide. Be sure to `DROP TRIGGER jsonfeed_raw_upd;` +first though!](conversation://Mara/hacker) + +We can also scrape the feed items out too with `json_each`. `json_each` lets you +iterate a JSON array and returns SQLite rows for every value in that array. +Let's take this for example: + +```console +sqlite> select * from json_each('["foo", "bar"]'); +0|foo|text|foo|1||$[0]|$ +1|bar|text|bar|2||$[1]|$ +``` + +The schema for the temporary table that `json_each` (and the related +`json_tree`) uses can be found [here](https://www.sqlite.org/json1.html#jeach). +You can also grab things out of a list in an object with the second argument to +`json_each`, so you can do things like this: + +```console +sqlite> select * from json_each('{"spam": ["foo", "bar"]}', '$.spam'); +0|foo|text|foo|3||$.spam[0]|$.spam +1|bar|text|bar|4||$.spam[1]|$.spam +``` + +Using this, we can make a table for each of the feed items that looks something +like this: + +```sql +CREATE TABLE IF NOT EXISTS jsonfeed_posts + ( url TEXT PRIMARY KEY + , feed_url TEXT NOT NULL + , title TEXT NOT NULL + , date_published TEXT NOT NULL + ); +``` + +And then munge everything out of the data in the database with a query like +this: + +```sql +INSERT INTO jsonfeed_posts + ( url + , feed_url + , title + , date_published + ) +SELECT + json_extract(json_each.value, '$.url') AS url +, jsonfeed_raw.feed_url AS feed_url +, json_extract(json_each.value, '$.title') AS title +, json_extract(json_each.value, '$.date_published') AS date_published +FROM + jsonfeed_raw +, json_each(jsonfeed_raw.raw, '$.items'); +``` + +This will fetch all of the values of the `items` field in every JSONFeed and +then automatically populate them into the `jsonfeed_posts` table. However +turning this into a trigger with the naiive approach will not instantly work. + +Let's say we have the trigger form that looks like this: + +```sql +CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_upd_posts + AFTER INSERT ON jsonfeed_raw + BEGIN + INSERT INTO jsonfeed_posts + ( url + , feed_url + , title + , date_published + ) + SELECT + json_extract(json_each.value, '$.url') AS url + , NEW.feed_url AS feed_url + , json_extract(json_each.value, '$.title') AS title + , json_extract(json_each.value, '$.date_published') AS date_published + FROM json_each(NEW.raw, '$.items') + ON CONFLICT DO + UPDATE SET title = excluded.title + , date_published = excluded.date_published + ; + END; +``` + +If you paste this into your SQLite console, you'll get this error: + +``` +Error: near "DO": syntax error +``` + +This is actually due to a [parsing ambiguity in +SQLite](https://www.sqlite.org/lang_UPSERT.html). In order to fix this you will +need to add `WHERE TRUE` between the `FROM` and `ON CONFLICT` clauses of the +trigger: + +```sql +-- ... +FROM json_each(NEW.raw, '$.items') +WHERE TRUE +ON CONFLICT DO +-- ... +``` + +[And thus the day is saved by the wheretrue, the hidden apex predator of the +SQLite realm, a fated value that is only non-falsy at night. Weep in terror lest +it add you to its table of victims!](conversation://Numa/delet) + +[The correlating insert trigger change is also an exercise for the +reader.](conversation://Mara/hacker) + +Now you can add JSONFeeds how you want and all of the data will automatically be +updated. This can probably be vastly simplified further with the use of +[generated columns](https://dgl.cx/2020/06/sqlite-json-support), however this +should work admirably for most needs. + +SQLite is able to be a NOSQL database. It's good enough for your needs. If you +want to play with the code I wrote while writing this article, check it out +[here](https://git.io/JSDVR). This post was written live on +[twitch.tv](https://twitch.tv/princessxen). Please follow or subscribe to be +kept up to date on when I go live! + +The VOD for this post is [here](https://www.twitch.tv/videos/1253083566). The +corresponding YouTube upload is [here](https://youtu.be/zkM_lY65Lcw). It won't +be available immediately after this post goes live, but it will go up in time. + +Here is my favorite message from the chat while I was researching this post: + +> jbpratt: if you were married to sqlite, i'd be reporting domestic abuse. This +> is awesome