---
title: "Bashing JSON into Shape with SQLite"
date: 2022-01-04
series: howto
tags:
  - sqlite
  - json
---
It is clear that most of the world has decided that they want to use JSON for their public-facing API endpoints. However, most of the time you will need to work with storage engines that don't handle JSON very well. This can be frustrating because you need to fit a square peg into a round hole.

However, SQLite added JSON functions to allow you to munge and modify JSON data in whatever creative ways you want. You can use these and SQLite triggers in order to automatically massage JSON into whatever kind of tables you want. Throw in upserts and you'll be able to make things even more automated. This support was added in SQLite 3.9.0 (released in 2015), so assuming Debian didn't disable it for no good reason, you should be able to use it today.
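If you want to be sure your build has it, a quick way to check is to call one of the JSON functions and see whether it errors out. A minimal sketch using Python's built-in sqlite3 module:

```python
import sqlite3

# If the JSON functions were compiled out, this raises sqlite3.OperationalError.
con = sqlite3.connect(":memory:")
has_json = con.execute("""SELECT json_valid('{"a": 1}')""").fetchone()[0]
print(has_json)  # 1 means the JSON functions are available
```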

For this example, we're going to be querying publicly available JSONFeed endpoints and turning that into SQL tables. Let's start with a table schema that looks like this:

CREATE TABLE IF NOT EXISTS jsonfeed_raw
  ( feed_url     TEXT  PRIMARY KEY
  , scrape_date  TEXT  NOT NULL  DEFAULT (DATE('now'))
  , raw          TEXT  NOT NULL
  );

The scrape date is essentially the date that the JSONFeed row was inserted into the database. This can be useful when writing other parts of the stack that automatically poll feeds for changes; that part is left as an exercise for the reader.
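For example, a refresh job could use scrape_date to find feeds that haven't been fetched today. This is a sketch against the schema above; the feed URLs are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE IF NOT EXISTS jsonfeed_raw
  ( feed_url     TEXT  PRIMARY KEY
  , scrape_date  TEXT  NOT NULL  DEFAULT (DATE('now'))
  , raw          TEXT  NOT NULL
  );
""")
# One feed scraped today, one scraped a long time ago.
con.execute("INSERT INTO jsonfeed_raw (feed_url, raw) VALUES ('https://fresh.example/feed.json', '{}')")
con.execute("""INSERT INTO jsonfeed_raw (feed_url, scrape_date, raw)
               VALUES ('https://stale.example/feed.json', '2020-01-01', '{}')""")

# Anything not scraped today is due for a refresh.
stale = con.execute(
    "SELECT feed_url FROM jsonfeed_raw WHERE scrape_date < DATE('now')"
).fetchall()
print(stale)  # only the stale feed shows up
```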

You can then insert things into the SQLite database using Python's sqlite3 module:

#!/usr/bin/env nix-shell
#! nix-shell -p python39 --run python

import sqlite3
import urllib.request

con = sqlite3.connect("data.db")

def get_feed(feed_url):
    req = urllib.request.Request(feed_url, headers={"User-Agent": "Xe/feedfetch"})
    with urllib.request.urlopen(req) as response:
        cur = con.cursor()
        # decode to str so SQLite stores TEXT rather than a BLOB
        body = response.read().decode("utf-8")
        cur.execute("""
           INSERT INTO jsonfeed_raw
             (feed_url, raw)
           VALUES
             (?, json(?))
        """, (feed_url, body))
        con.commit()
        print("got feed %s" % (feed_url))

get_feed("https://christine.website/blog.json")

So now let's play with the data! Let's load the database schema in with the sqlite3 command:

$ sqlite3 data.db < schema.sql

The less-than symbol there is a redirect: it feeds the contents of schema.sql to the sqlite3 command as standard input. See here for more information on redirections.

Then run the Python script to populate the database:

$ python ./jsonfeedfetch.py
got feed https://christine.website/blog.json

Then open up the SQLite command line:

$ sqlite3 data.db
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite>

And now we can play with a few of the JSON functions. First let's show off json_extract. This lets you pull a value out of a JSON object. For example, let's get the feed title out of my website's JSONFeed:

sqlite> select json_extract(raw, '$.title') from jsonfeed_raw;
Xe's Blog

We can use this function to help us create a table that stores the metadata we care about from a JSONFeed, such as this:

CREATE TABLE IF NOT EXISTS jsonfeed_metadata
  ( feed_url      TEXT  PRIMARY KEY
  , title         TEXT  NOT NULL
  , description   TEXT
  , home_page_url TEXT
  , updated_at    TEXT  NOT NULL  DEFAULT (DATE('now'))
  );

If you ask my coworkers, they can confirm that I actually do unironically write SQL like that in real life.

Then we can populate that table with a query like this:

INSERT INTO jsonfeed_metadata
            ( feed_url
            , title
            , description
            , home_page_url
            , updated_at
            )
SELECT jsonfeed_raw.feed_url AS feed_url
     , json_extract(jsonfeed_raw.raw, '$.title') AS title
     , json_extract(jsonfeed_raw.raw, '$.description') AS description
     , json_extract(jsonfeed_raw.raw, '$.home_page_url') AS home_page_url
     , DATE('now') AS updated_at
FROM jsonfeed_raw;

The AS keyword lets you bind values in a SELECT statement to names for use elsewhere in the query. I don't know if it's strictly needed here, but it makes the names line up and SQLite doesn't complain, so it's probably fine.

Now this is workable, but you know what's easier than typing statements into the SQLite console like that? Not having to! SQLite triggers let us run database statements automatically when certain events happen. The main event we care about right now is inserting new data. We can turn that statement into an after-insert trigger like this:

CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_ins
  AFTER INSERT ON jsonfeed_raw
  BEGIN
    INSERT INTO jsonfeed_metadata
                ( feed_url
                , title
                , description
                , home_page_url
                )
    VALUES ( NEW.feed_url
           , json_extract(NEW.raw, '$.title')
           , json_extract(NEW.raw, '$.description')
           , json_extract(NEW.raw, '$.home_page_url')
           );
  END;

Then we can run a few commands to nuke all the database state:

sqlite> DELETE FROM jsonfeed_metadata;
sqlite> DELETE FROM jsonfeed_raw;

And run that python script again, then the data should automatically show up:

sqlite> SELECT * FROM jsonfeed_metadata;
https://christine.website/blog.json|Xe's Blog|My blog posts and rants about various technology things.|https://christine.website|2022-01-04

It's like magic!

However, if you run that python script again without deleting the rows, you will get a primary key violation. We can fix this by turning the insert into an upsert with something like this:

cur.execute("""
    INSERT INTO jsonfeed_raw
      (feed_url, raw)
    VALUES
      (?, json(?))
    ON CONFLICT (feed_url) DO
      UPDATE SET raw = json(?)
""", (feed_url, body, body))

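As an aside, SQLite upserts can also refer to the row that failed to insert as excluded, which avoids binding body twice. Here's a sketch of the same upsert using excluded, run against an in-memory database (spelling out the conflict target (feed_url), which older SQLite versions require for DO UPDATE):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE jsonfeed_raw
  ( feed_url     TEXT  PRIMARY KEY
  , scrape_date  TEXT  NOT NULL  DEFAULT (DATE('now'))
  , raw          TEXT  NOT NULL
  );
""")

def upsert_feed(feed_url, body):
    # excluded.raw is the value the failed INSERT tried to write,
    # so the body only needs to be bound once.
    con.execute("""
        INSERT INTO jsonfeed_raw (feed_url, raw)
        VALUES (?, json(?))
        ON CONFLICT (feed_url) DO
          UPDATE SET raw = excluded.raw
    """, (feed_url, body))
    con.commit()

upsert_feed("https://example.com/feed.json", '{"title": "first"}')
upsert_feed("https://example.com/feed.json", '{"title": "second"}')
row = con.execute(
    "SELECT COUNT(*), json_extract(raw, '$.title') FROM jsonfeed_raw"
).fetchone()
print(row)  # (1, 'second'): still one row, now with the newer payload
```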
And also make a complementary update trigger for the jsonfeed_raw table:

CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_upd
  AFTER UPDATE ON jsonfeed_raw
  BEGIN
    INSERT INTO jsonfeed_metadata
                ( feed_url
                , title
                , description
                , home_page_url
                )
    VALUES ( NEW.feed_url
           , json_extract(NEW.raw, '$.title')
           , json_extract(NEW.raw, '$.description')
           , json_extract(NEW.raw, '$.home_page_url')
           )
    ON CONFLICT (feed_url) DO
       UPDATE SET
             title         = json_extract(NEW.raw, '$.title')
           , description   = json_extract(NEW.raw, '$.description')
           , home_page_url = json_extract(NEW.raw, '$.home_page_url')
           ;
  END;

You should probably update the original insert trigger to be an upsert too. You can follow this trigger as a guide. Be sure to DROP TRIGGER jsonfeed_raw_ins; first though, since CREATE TRIGGER IF NOT EXISTS won't replace an existing definition!
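Concretely, the upsert-ified insert trigger could look something like this sketch, exercised through Python's sqlite3 module so you can see it fire (the feed URL and title are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE jsonfeed_raw
  ( feed_url     TEXT  PRIMARY KEY
  , scrape_date  TEXT  NOT NULL  DEFAULT (DATE('now'))
  , raw          TEXT  NOT NULL
  );
CREATE TABLE jsonfeed_metadata
  ( feed_url      TEXT  PRIMARY KEY
  , title         TEXT  NOT NULL
  , description   TEXT
  , home_page_url TEXT
  , updated_at    TEXT  NOT NULL  DEFAULT (DATE('now'))
  );

-- The insert trigger, redone as an upsert.
DROP TRIGGER IF EXISTS jsonfeed_raw_ins;
CREATE TRIGGER jsonfeed_raw_ins
  AFTER INSERT ON jsonfeed_raw
  BEGIN
    INSERT INTO jsonfeed_metadata
                ( feed_url, title, description, home_page_url )
    VALUES ( NEW.feed_url
           , json_extract(NEW.raw, '$.title')
           , json_extract(NEW.raw, '$.description')
           , json_extract(NEW.raw, '$.home_page_url')
           )
    ON CONFLICT (feed_url) DO
      UPDATE SET title         = excluded.title
               , description   = excluded.description
               , home_page_url = excluded.home_page_url;
  END;
""")

con.execute("INSERT INTO jsonfeed_raw (feed_url, raw) VALUES (?, json(?))",
            ("https://example.com/feed.json", '{"title": "A Blog"}'))
title = con.execute("SELECT title FROM jsonfeed_metadata").fetchone()[0]
print(title)  # A Blog
```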

We can also scrape out the feed items with json_each, which lets you iterate over a JSON array and returns a SQLite row for every value in that array. Let's take this for example:

sqlite> select * from json_each('["foo", "bar"]');
0|foo|text|foo|1||$[0]|$
1|bar|text|bar|2||$[1]|$

The schema for the temporary table that json_each (and the related json_tree) uses can be found here. You can also grab things out of a list in an object with the second argument to json_each, so you can do things like this:

sqlite> select * from json_each('{"spam": ["foo", "bar"]}', '$.spam');
0|foo|text|foo|3||$.spam[0]|$.spam
1|bar|text|bar|4||$.spam[1]|$.spam

Using this, we can make a table for each of the feed items that looks something like this:

CREATE TABLE IF NOT EXISTS jsonfeed_posts
  ( url             TEXT  PRIMARY KEY
  , feed_url        TEXT  NOT NULL
  , title           TEXT  NOT NULL
  , date_published  TEXT  NOT NULL
  );

And then munge everything out of the data in the database with a query like this:

INSERT INTO jsonfeed_posts
            ( url
            , feed_url
            , title
            , date_published
            )
SELECT
  json_extract(json_each.value, '$.url') AS url
, jsonfeed_raw.feed_url AS feed_url
, json_extract(json_each.value, '$.title') AS title
, json_extract(json_each.value, '$.date_published') AS date_published
FROM
  jsonfeed_raw
, json_each(jsonfeed_raw.raw, '$.items');

This will fetch all of the values of the items field in every JSONFeed and automatically populate them into the jsonfeed_posts table. However, turning this into a trigger with the naive approach will not work right away.

Let's say we have the trigger form that looks like this:

CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_ins_posts
  AFTER INSERT ON jsonfeed_raw
  BEGIN
    INSERT INTO jsonfeed_posts
                ( url
                , feed_url
                , title
                , date_published
                )
    SELECT
        json_extract(json_each.value, '$.url') AS url
      , NEW.feed_url AS feed_url
      , json_extract(json_each.value, '$.title') AS title
      , json_extract(json_each.value, '$.date_published') AS date_published
    FROM json_each(NEW.raw, '$.items')
    ON CONFLICT DO
      UPDATE SET title = excluded.title
               , date_published = excluded.date_published
               ;
  END;

If you paste this into your SQLite console, you'll get this error:

Error: near "DO": syntax error

This is actually due to a parsing ambiguity in SQLite. In order to fix this you will need to add WHERE TRUE between the FROM and ON CONFLICT clauses of the trigger:

-- ...
FROM json_each(NEW.raw, '$.items')
WHERE TRUE
ON CONFLICT DO
-- ...

And thus the day is saved by the wheretrue, the hidden apex predator of the SQLite realm, a fated value that is only non-falsy at night. Weep in terror lest it add you to its table of victims!

The corresponding update trigger change is also left as an exercise for the reader.

Now you can add JSONFeeds however you want and all of the data will be updated automatically. This could probably be simplified further with generated columns, but this setup should work admirably for most needs.
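For the record, a generated column (available since SQLite 3.31.0) can compute a field like the title straight from the JSON with no trigger at all. A minimal sketch, using a hypothetical table name:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# title is computed from the stored JSON on the fly; no trigger needed.
con.execute("""
CREATE TABLE jsonfeed_raw2
  ( feed_url  TEXT  PRIMARY KEY
  , raw       TEXT  NOT NULL
  , title     TEXT  GENERATED ALWAYS AS (json_extract(raw, '$.title')) VIRTUAL
  );
""")
con.execute("INSERT INTO jsonfeed_raw2 (feed_url, raw) VALUES (?, json(?))",
            ("https://example.com/feed.json", '{"title": "A Blog"}'))
title = con.execute("SELECT title FROM jsonfeed_raw2").fetchone()[0]
print(title)  # A Blog
```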

SQLite can work as a NoSQL database, and it's probably good enough for your needs. If you want to play with the code I wrote while writing this article, check it out here. This post was written live on twitch.tv. Please follow or subscribe to be kept up to date on when I go live!

The VOD for this post is here. The corresponding YouTube upload is here. It won't be available immediately after this post goes live, but it will go up in time.

Here is my favorite message from the chat while I was researching this post:

jbpratt: if you were married to sqlite, i'd be reporting domestic abuse. This is awesome