11 KiB
title | date | series | tags | ||
---|---|---|---|---|---|
Bashing JSON into Shape with SQLite | 2022-01-04 | howto |
|
It is clear that most of the world has decided that they want to use JSON for their public-facing API endpoints. However, most of the time you will need to deal with storage engines that don't deal with JSON very well. This can be confusing to deal with because you need to fit a square peg into a round hole.
However, SQLite added JSON functions to allow you to munge and modify JSON data in whatever creative ways you want. You can use these and SQLite triggers in order to automatically massage JSON into whatever kind of tables you want. Throw in upserts and you'll be able to make things even more automated. This support was added in SQLite 3.9.0 (released in 2015), so assuming Debian didn't disable it for no good reason, you should be able to use it today.
For this example, we're going to be querying publicly available JSONFeed endpoints and turning that into SQL tables. Let's start with a table schema that looks like this:
CREATE TABLE IF NOT EXISTS jsonfeed_raw
( feed_url TEXT PRIMARY KEY
, scrape_date TEXT NOT NULL DEFAULT (DATE('now'))
, raw TEXT NOT NULL
);
You can then insert things into the SQLite database using Python's sqlite3
module:
#!/usr/bin/env nix-shell
#! nix-shell -p python39 --run python
import sqlite3
import urllib.request
con = sqlite3.connect("data.db")
def get_feed(feed_url):
req = urllib.request.Request(feed_url, headers={"User-Agent": "Xe/feedfetch"})
with urllib.request.urlopen(req) as response:
cur = con.cursor()
body = response.read()
cur.execute("""
INSERT INTO jsonfeed_raw
(feed_url, raw)
VALUES
(?, json(?))
""", (feed_url, body))
con.commit()
print("got feed %s" % (feed_url))
get_feed("https://xeiaso.net/blog.json")
So now let's play with the data! Let's load the database schema in with the
sqlite3
command:
$ sqlite3 data.db < schema.sql
The less-than symbol there is a redirect, it loads the data from schema.sql
as standard input to the sqlite
command. See here
for more information on redirections.
Then run that python script to populate the database:
$ python ./jsonfeedfetch.py
got feed https://xeiaso.net/blog.json
Then open up the SQLite command line:
$ sqlite3 data.db
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite>
And now we can play with a few of the JSON functions. First let's show off
json_extract
.
This lets you pull a value out of a JSON object. For example, let's get the feed
title out of my website's JSONFeed:
sqlite> select json_extract(raw, '$.title') from jsonfeed_raw;
Xe's Blog
We can use this function to help us create a table that stores the metadata we care about from a JSONFeed, such as this:
CREATE TABLE IF NOT EXISTS jsonfeed_metadata
( feed_url TEXT PRIMARY KEY
, title TEXT NOT NULL
, description TEXT
, home_page_url TEXT
, updated_at TEXT NOT NULL DEFAULT (DATE('now'))
);
Then we can populate that table with a query like this:
INSERT INTO jsonfeed_metadata
( feed_url
, title
, description
, home_page_url
, updated_at
)
SELECT jsonfeed_raw.feed_url AS feed_url
, json_extract(jsonfeed_raw.raw, '$.title') AS title
, json_extract(jsonfeed_raw.raw, '$.description') AS description
, json_extract(jsonfeed_raw.raw, '$.home_page_url') AS home_page_url
, DATE('now') AS updated_at
FROM jsonfeed_raw;
Now this is workable, however you know what's easier than writing statements in the SQLite console like that? Not having to! SQLite triggers allow us to run database statements automatically when certain conditions happen. The main condition we want to care about right now is when we insert new data. We can turn that statement into an after-insert trigger like this:
CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_ins
AFTER INSERT ON jsonfeed_raw
BEGIN
INSERT INTO jsonfeed_metadata
( feed_url
, title
, description
, home_page_url
)
VALUES ( NEW.feed_url
, json_extract(NEW.raw, '$.title')
, json_extract(NEW.raw, '$.description')
, json_extract(NEW.raw, '$.home_page_url')
);
END;
Then we can run a few commands to nuke all the database state:
sqlite3> DELETE FROM jsonfeed_metadata;
sqlite3> DELETE FROM jsonfeed_raw;
And run that python script again, then the data should automatically show up:
sqlite3> SELECT * FROM jsonfeed_metadata;
https://xeiaso.net/blog.json|Xe's Blog|My blog posts and rants about various technology things.|https://xeiaso.net|2022-01-04
It's like magic!
However, if you run that python script again without deleting the rows, you will get a primary key violation. We can fix this by turning the insert into an upsert with something like this:
cur.execute("""
INSERT INTO jsonfeed_raw
(feed_url, raw)
VALUES
(?, json(?))
ON CONFLICT DO
UPDATE SET raw = json(?)
""", (feed_url, body, body))
And also make a complementary update trigger for the jsonfeed_raw
table:
CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_upd
AFTER UPDATE ON jsonfeed_raw
BEGIN
INSERT INTO jsonfeed_metadata
( feed_url
, title
, description
, home_page_url
)
VALUES ( NEW.feed_url
, json_extract(NEW.raw, '$.title')
, json_extract(NEW.raw, '$.description')
, json_extract(NEW.raw, '$.home_page_url')
)
ON CONFLICT DO
UPDATE SET
title = json_extract(NEW.raw, '$.title')
, description = json_extract(NEW.raw, '$.description')
, home_page_url = json_extract(NEW.raw, '$.home_page_url')
;
We can also scrape the feed items out too with json_each
. json_each
lets you
iterate a JSON array and returns SQLite rows for every value in that array.
Let's take this for example:
sqlite> select * from json_each('["foo", "bar"]');
0|foo|text|foo|1||$[0]|$
1|bar|text|bar|2||$[1]|$
The schema for the temporary table that json_each
(and the related
json_tree
) uses can be found here.
You can also grab things out of a list in an object with the second argument to
json_each
, so you can do things like this:
sqlite> select * from json_each('{"spam": ["foo", "bar"]}', '$.spam');
0|foo|text|foo|3||$.spam[0]|$.spam
1|bar|text|bar|4||$.spam[1]|$.spam
Using this, we can make a table for each of the feed items that looks something like this:
CREATE TABLE IF NOT EXISTS jsonfeed_posts
( url TEXT PRIMARY KEY
, feed_url TEXT NOT NULL
, title TEXT NOT NULL
, date_published TEXT NOT NULL
);
And then munge everything out of the data in the database with a query like this:
INSERT INTO jsonfeed_posts
( url
, feed_url
, title
, date_published
)
SELECT
json_extract(json_each.value, '$.url') AS url
, jsonfeed_raw.feed_url AS feed_url
, json_extract(json_each.value, '$.title') AS title
, json_extract(json_each.value, '$.date_published') AS date_published
FROM
jsonfeed_raw
, json_each(jsonfeed_raw.raw, '$.items');
This will fetch all of the values of the items
field in every JSONFeed and
then automatically populate them into the jsonfeed_posts
table. However
turning this into a trigger with the naiive approach will not instantly work.
Let's say we have the trigger form that looks like this:
CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_upd_posts
AFTER INSERT ON jsonfeed_raw
BEGIN
INSERT INTO jsonfeed_posts
( url
, feed_url
, title
, date_published
)
SELECT
json_extract(json_each.value, '$.url') AS url
, NEW.feed_url AS feed_url
, json_extract(json_each.value, '$.title') AS title
, json_extract(json_each.value, '$.date_published') AS date_published
FROM json_each(NEW.raw, '$.items')
ON CONFLICT DO
UPDATE SET title = excluded.title
, date_published = excluded.date_published
;
END;
If you paste this into your SQLite console, you'll get this error:
Error: near "DO": syntax error
This is actually due to a parsing ambiguity in
SQLite. In order to fix this you will
need to add WHERE TRUE
between the FROM
and ON CONFLICT
clauses of the
trigger:
-- ...
FROM json_each(NEW.raw, '$.items')
WHERE TRUE
ON CONFLICT DO
-- ...
The correlating insert trigger change is also an exercise for the reader.
Now you can add JSONFeeds how you want and all of the data will automatically be updated. This can probably be vastly simplified further with the use of generated columns, however this should work admirably for most needs.
SQLite is able to be a NOSQL database. It's good enough for your needs. If you want to play with the code I wrote while writing this article, check it out here. This post was written live on twitch.tv. Please follow or subscribe to be kept up to date on when I go live!
The VOD for this post is here. The corresponding YouTube upload is here. It won't be available immediately after this post goes live, but it will go up in time.
Here is my favorite message from the chat while I was researching this post:
jbpratt: if you were married to sqlite, i'd be reporting domestic abuse. This is awesome