forked from cadey/xesite
358 lines
11 KiB
Markdown
358 lines
11 KiB
Markdown
|
---
|
||
|
title: Bashing JSON into Shape with SQLite
|
||
|
date: 2022-01-04
|
||
|
series: howto
|
||
|
tags:
|
||
|
- sqlite
|
||
|
- json
|
||
|
---
|
||
|
|
||
|
It is clear that most of the world has decided that they want to use JSON for
|
||
|
their public-facing API endpoints. However, most of the time you will need to
|
||
|
deal with storage engines that don't deal with JSON very well. This can be
|
||
|
confusing to deal with because you need to fit a square peg into a round hole.
|
||
|
|
||
|
However, [SQLite](https://www.sqlite.org) added [JSON
|
||
|
functions](https://www.sqlite.org/json1.html) to allow you to munge and modify
|
||
|
JSON data in whatever creative ways you want. You can use these and SQLite
|
||
|
[triggers](https://www.sqlite.org/lang_createtrigger.html) in order to
|
||
|
automatically massage JSON into whatever kind of tables you want. Throw in
|
||
|
upserts and you'll be able to make things even more automated. This support
|
||
|
was added in SQLite 3.9.0 (released in 2015), so assuming Debian didn't disable
|
||
|
it for no good reason, you should be able to use it today.
|
||
|
|
||
|
For this example, we're going to be querying publicly available
|
||
|
[JSONFeed](https://www.jsonfeed.org/) endpoints and turning that into SQL
|
||
|
tables. Let's start with a table schema that looks like this:
|
||
|
|
||
|
```sql
|
||
|
CREATE TABLE IF NOT EXISTS jsonfeed_raw
|
||
|
( feed_url TEXT PRIMARY KEY
|
||
|
, scrape_date TEXT NOT NULL DEFAULT (DATE('now'))
|
||
|
, raw TEXT NOT NULL
|
||
|
);
|
||
|
```
|
||
|
|
||
|
[The scrape date is essentially the date that the JSONFeed row was inserted into
|
||
|
the database. This can be useful when writing other parts of the stack to
|
||
|
automatically query feeds for changes. This is left as an exercise to the
|
||
|
reader.](conversation://Mara/hacker)
|
||
|
|
||
|
You can then insert things into the SQLite database using Python's `sqlite3`
|
||
|
module:
|
||
|
|
||
|
```python
|
||
|
#!/usr/bin/env nix-shell
|
||
|
#! nix-shell -p python39 --run python
|
||
|
|
||
|
import sqlite3
|
||
|
import urllib.request
|
||
|
|
||
|
con = sqlite3.connect("data.db")
|
||
|
|
||
|
def get_feed(feed_url):
|
||
|
req = urllib.request.Request(feed_url, headers={"User-Agent": "Xe/feedfetch"})
|
||
|
with urllib.request.urlopen(req) as response:
|
||
|
cur = con.cursor()
|
||
|
body = response.read()
|
||
|
cur.execute("""
|
||
|
INSERT INTO jsonfeed_raw
|
||
|
(feed_url, raw)
|
||
|
VALUES
|
||
|
(?, json(?))
|
||
|
""", (feed_url, body))
|
||
|
con.commit()
|
||
|
print("got feed %s" % (feed_url))
|
||
|
|
||
|
get_feed("https://christine.website/blog.json")
|
||
|
```
|
||
|
|
||
|
So now let's play with the data! Let's load the database schema in with the
|
||
|
`sqlite3` command:
|
||
|
|
||
|
```console
|
||
|
$ sqlite3 data.db < schema.sql
|
||
|
```
|
||
|
|
||
|
[The less-than symbol there is a redirect, it loads the data from `schema.sql`
|
||
|
as standard input to the `sqlite` command. See <a
|
||
|
href="https://christine.website/blog/fun-with-redirection-2021-09-22">here</a>
|
||
|
for more information on redirections.](conversation://Mara/hacker)
|
||
|
|
||
|
Then run that python script to populate the database:
|
||
|
|
||
|
```console
|
||
|
$ python ./jsonfeedfetch.py
|
||
|
got feed https://christine.website/blog.json
|
||
|
```
|
||
|
|
||
|
Then open up the SQLite command line:
|
||
|
|
||
|
```console
|
||
|
$ sqlite3 data.db
|
||
|
SQLite version 3.36.0 2021-06-18 18:36:39
|
||
|
Enter ".help" for usage hints.
|
||
|
sqlite>
|
||
|
```
|
||
|
|
||
|
And now we can play with a few of the JSON functions. First let's show off
|
||
|
[`json_extract`](https://www.sqlite.org/json1.html#the_json_extract_function).
|
||
|
This lets you pull a value out of a JSON object. For example, let's get the feed
|
||
|
title out of my website's JSONFeed:
|
||
|
|
||
|
```console
|
||
|
sqlite> select json_extract(raw, '$.title') from jsonfeed_raw;
|
||
|
Xe's Blog
|
||
|
```
|
||
|
|
||
|
We can use this function to help us create a table that stores the metadata we
|
||
|
care about from a JSONFeed, such as this:
|
||
|
|
||
|
```sql
|
||
|
CREATE TABLE IF NOT EXISTS jsonfeed_metadata
|
||
|
( feed_url TEXT PRIMARY KEY
|
||
|
, title TEXT NOT NULL
|
||
|
, description TEXT
|
||
|
, home_page_url TEXT
|
||
|
, updated_at TEXT NOT NULL DEFAULT (DATE('now'))
|
||
|
);
|
||
|
```
|
||
|
|
||
|
[If you ask my coworkers, they can confirm that I actually do real life
|
||
|
unironcally write SQL like that.](conversation://Cadey/coffee)
|
||
|
|
||
|
Then we can populate that table with a query like this:
|
||
|
|
||
|
```sql
|
||
|
INSERT INTO jsonfeed_metadata
|
||
|
( feed_url
|
||
|
, title
|
||
|
, description
|
||
|
, home_page_url
|
||
|
, updated_at
|
||
|
)
|
||
|
SELECT jsonfeed_raw.feed_url AS feed_url
|
||
|
, json_extract(jsonfeed_raw.raw, '$.title') AS title
|
||
|
, json_extract(jsonfeed_raw.raw, '$.description') AS description
|
||
|
, json_extract(jsonfeed_raw.raw, '$.home_page_url') AS home_page_url
|
||
|
, DATE('now') AS updated_at
|
||
|
FROM jsonfeed_raw;
|
||
|
```
|
||
|
|
||
|
[The `AS` keyword lets you bind values in a `SELECT` statement to names for use
|
||
|
elsewhere in the query. I don't know if it's _strictly_ needed, however it makes
|
||
|
the names line up and SQLite doesn't complain about it, so it's probably
|
||
|
fine.](conversation://Mara/hacker)
|
||
|
|
||
|
Now this is workable, however you know what's easier than writing statements in
|
||
|
the SQLite console like that? Not having to! SQLite triggers allow us to run
|
||
|
database statements automatically when certain conditions happen. The main
|
||
|
condition we want to care about right now is when we insert new data. We can
|
||
|
turn that statement into an after-insert trigger like this:
|
||
|
|
||
|
```sql
|
||
|
CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_ins
|
||
|
AFTER INSERT ON jsonfeed_raw
|
||
|
BEGIN
|
||
|
INSERT INTO jsonfeed_metadata
|
||
|
( feed_url
|
||
|
, title
|
||
|
, description
|
||
|
, home_page_url
|
||
|
)
|
||
|
VALUES ( NEW.feed_url
|
||
|
, json_extract(NEW.raw, '$.title')
|
||
|
, json_extract(NEW.raw, '$.description')
|
||
|
, json_extract(NEW.raw, '$.home_page_url')
|
||
|
);
|
||
|
END;
|
||
|
```
|
||
|
|
||
|
Then we can run a few commands to nuke all the database state:
|
||
|
|
||
|
```console
|
||
|
sqlite3> DELETE FROM jsonfeed_metadata;
|
||
|
sqlite3> DELETE FROM jsonfeed_raw;
|
||
|
```
|
||
|
|
||
|
And run that python script again, then the data should automatically show up:
|
||
|
|
||
|
```
|
||
|
sqlite3> SELECT * FROM jsonfeed_metadata;
|
||
|
https://christine.website/blog.json|Xe's Blog|My blog posts and rants about various technology things.|https://christine.website|2022-01-04
|
||
|
```
|
||
|
|
||
|
It's like magic!
|
||
|
|
||
|
However, if you run that python script again without deleting the rows, you will
|
||
|
get a primary key violation. We can fix this by turning the insert into an
|
||
|
[upsert](https://www.sqlite.org/lang_UPSERT.html) with something like this:
|
||
|
|
||
|
```python
|
||
|
cur.execute("""
|
||
|
INSERT INTO jsonfeed_raw
|
||
|
(feed_url, raw)
|
||
|
VALUES
|
||
|
(?, json(?))
|
||
|
ON CONFLICT DO
|
||
|
UPDATE SET raw = json(?)
|
||
|
""", (feed_url, body, body))
|
||
|
```
|
||
|
|
||
|
And also make a complementary update trigger for the `jsonfeed_raw` table:
|
||
|
|
||
|
```sql
|
||
|
CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_upd
|
||
|
AFTER UPDATE ON jsonfeed_raw
|
||
|
BEGIN
|
||
|
INSERT INTO jsonfeed_metadata
|
||
|
( feed_url
|
||
|
, title
|
||
|
, description
|
||
|
, home_page_url
|
||
|
)
|
||
|
VALUES ( NEW.feed_url
|
||
|
, json_extract(NEW.raw, '$.title')
|
||
|
, json_extract(NEW.raw, '$.description')
|
||
|
, json_extract(NEW.raw, '$.home_page_url')
|
||
|
)
|
||
|
ON CONFLICT DO
|
||
|
UPDATE SET
|
||
|
title = json_extract(NEW.raw, '$.title')
|
||
|
, description = json_extract(NEW.raw, '$.description')
|
||
|
, home_page_url = json_extract(NEW.raw, '$.home_page_url')
|
||
|
;
|
||
|
```
|
||
|
|
||
|
[You should probably update the original trigger to be an upsert too. You can
|
||
|
follow this trigger as a guide. Be sure to `DROP TRIGGER jsonfeed_raw_upd;`
|
||
|
first though!](conversation://Mara/hacker)
|
||
|
|
||
|
We can also scrape the feed items out too with `json_each`. `json_each` lets you
|
||
|
iterate a JSON array and returns SQLite rows for every value in that array.
|
||
|
Let's take this for example:
|
||
|
|
||
|
```console
|
||
|
sqlite> select * from json_each('["foo", "bar"]');
|
||
|
0|foo|text|foo|1||$[0]|$
|
||
|
1|bar|text|bar|2||$[1]|$
|
||
|
```
|
||
|
|
||
|
The schema for the temporary table that `json_each` (and the related
|
||
|
`json_tree`) uses can be found [here](https://www.sqlite.org/json1.html#jeach).
|
||
|
You can also grab things out of a list in an object with the second argument to
|
||
|
`json_each`, so you can do things like this:
|
||
|
|
||
|
```console
|
||
|
sqlite> select * from json_each('{"spam": ["foo", "bar"]}', '$.spam');
|
||
|
0|foo|text|foo|3||$.spam[0]|$.spam
|
||
|
1|bar|text|bar|4||$.spam[1]|$.spam
|
||
|
```
|
||
|
|
||
|
Using this, we can make a table for each of the feed items that looks something
|
||
|
like this:
|
||
|
|
||
|
```sql
|
||
|
CREATE TABLE IF NOT EXISTS jsonfeed_posts
|
||
|
( url TEXT PRIMARY KEY
|
||
|
, feed_url TEXT NOT NULL
|
||
|
, title TEXT NOT NULL
|
||
|
, date_published TEXT NOT NULL
|
||
|
);
|
||
|
```
|
||
|
|
||
|
And then munge everything out of the data in the database with a query like
|
||
|
this:
|
||
|
|
||
|
```sql
|
||
|
INSERT INTO jsonfeed_posts
|
||
|
( url
|
||
|
, feed_url
|
||
|
, title
|
||
|
, date_published
|
||
|
)
|
||
|
SELECT
|
||
|
json_extract(json_each.value, '$.url') AS url
|
||
|
, jsonfeed_raw.feed_url AS feed_url
|
||
|
, json_extract(json_each.value, '$.title') AS title
|
||
|
, json_extract(json_each.value, '$.date_published') AS date_published
|
||
|
FROM
|
||
|
jsonfeed_raw
|
||
|
, json_each(jsonfeed_raw.raw, '$.items');
|
||
|
```
|
||
|
|
||
|
This will fetch all of the values of the `items` field in every JSONFeed and
|
||
|
then automatically populate them into the `jsonfeed_posts` table. However
|
||
|
turning this into a trigger with the naiive approach will not instantly work.
|
||
|
|
||
|
Let's say we have the trigger form that looks like this:
|
||
|
|
||
|
```sql
|
||
|
CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_upd_posts
|
||
|
AFTER INSERT ON jsonfeed_raw
|
||
|
BEGIN
|
||
|
INSERT INTO jsonfeed_posts
|
||
|
( url
|
||
|
, feed_url
|
||
|
, title
|
||
|
, date_published
|
||
|
)
|
||
|
SELECT
|
||
|
json_extract(json_each.value, '$.url') AS url
|
||
|
, NEW.feed_url AS feed_url
|
||
|
, json_extract(json_each.value, '$.title') AS title
|
||
|
, json_extract(json_each.value, '$.date_published') AS date_published
|
||
|
FROM json_each(NEW.raw, '$.items')
|
||
|
ON CONFLICT DO
|
||
|
UPDATE SET title = excluded.title
|
||
|
, date_published = excluded.date_published
|
||
|
;
|
||
|
END;
|
||
|
```
|
||
|
|
||
|
If you paste this into your SQLite console, you'll get this error:
|
||
|
|
||
|
```
|
||
|
Error: near "DO": syntax error
|
||
|
```
|
||
|
|
||
|
This is actually due to a [parsing ambiguity in
|
||
|
SQLite](https://www.sqlite.org/lang_UPSERT.html). In order to fix this you will
|
||
|
need to add `WHERE TRUE` between the `FROM` and `ON CONFLICT` clauses of the
|
||
|
trigger:
|
||
|
|
||
|
```sql
|
||
|
-- ...
|
||
|
FROM json_each(NEW.raw, '$.items')
|
||
|
WHERE TRUE
|
||
|
ON CONFLICT DO
|
||
|
-- ...
|
||
|
```
|
||
|
|
||
|
[And thus the day is saved by the wheretrue, the hidden apex predator of the
|
||
|
SQLite realm, a fated value that is only non-falsy at night. Weep in terror lest
|
||
|
it add you to its table of victims!](conversation://Numa/delet)
|
||
|
|
||
|
[The correlating insert trigger change is also an exercise for the
|
||
|
reader.](conversation://Mara/hacker)
|
||
|
|
||
|
Now you can add JSONFeeds how you want and all of the data will automatically be
|
||
|
updated. This can probably be vastly simplified further with the use of
|
||
|
[generated columns](https://dgl.cx/2020/06/sqlite-json-support), however this
|
||
|
should work admirably for most needs.
|
||
|
|
||
|
SQLite is able to be a NOSQL database. It's good enough for your needs. If you
|
||
|
want to play with the code I wrote while writing this article, check it out
|
||
|
[here](https://git.io/JSDVR). This post was written live on
|
||
|
[twitch.tv](https://twitch.tv/princessxen). Please follow or subscribe to be
|
||
|
kept up to date on when I go live!
|
||
|
|
||
|
The VOD for this post is [here](https://www.twitch.tv/videos/1253083566). The
|
||
|
corresponding YouTube upload is [here](https://youtu.be/zkM_lY65Lcw). It won't
|
||
|
be available immediately after this post goes live, but it will go up in time.
|
||
|
|
||
|
Here is my favorite message from the chat while I was researching this post:
|
||
|
|
||
|
> jbpratt: if you were married to sqlite, i'd be reporting domestic abuse. This
|
||
|
> is awesome
|