Using text embeddings to recommend similar articles

Posted: | Updated: | Tags: til aws cloud sqlite

To show related content on this blog I use Hugo’s in-built functionality, which works surprisingly well with no setup. I did, however, want to test out creating text embeddings from my posts and rank them by similarity. Before you continue reading, the usual disclaimer:

Do not take the information here as good or best practice. The purpose of this entry is to post my learnings in somewhat real-time.

I will use Amazon Titan Embeddings G1 - Text, available through Amazon Bedrock, and SQLite to store the results. The full code can be found towards the end of this post.

Generating the embeddings

To begin, I will iterate through each markdown file and get the post’s contents, excluding frontmatter. I can then pass the post content to LangChain’s embed_query() function; the response will be the embedding, which is then inserted into a SQLite database.

I’m not sure if a more appropriate option here would be to use the embed_documents()[1] function and pass an array of all the post content instead; I suppose it doesn’t matter. What might matter is that the model can take up to 8,000 tokens, so any input that crosses that limit will have to be segmented, especially since I do not chunk the input by related paragraphs or text.
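
As a sketch of how that segmentation might look, a naive splitter could greedily pack paragraphs into chunks under a rough character budget (chunk_text is a hypothetical helper and is not used anywhere below):

def chunk_text(text, max_chars=24000):
    # Hypothetical helper: split on blank lines and greedily pack
    # paragraphs into chunks under a character budget. At a rough
    # 3 characters per token, 24,000 characters stays under the
    # 8,000-token limit.
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

Each chunk could then be embedded separately and, for example, averaged to represent the whole post. Back to the actual loop over my posts: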

directory = "../blog.pesky.moe/content/posts"

total_chars = 0
for filename in os.listdir(directory):
    full_filename = os.path.join(directory, filename)
    if os.path.isfile(full_filename):
        data = frontmatter.load(full_filename)
        total_chars += len(data.content)

        print(filename)
        r = embeddings.embed_query(data.content)
        db.insert_embedding(filename, " ".join(str(e) for e in r))

total_tokens = total_chars / 6
print("Total characters", total_chars)
print("Total tokens", total_tokens)
print((total_tokens / 1000) * 0.0001, "USD")

Within the loop, I total up the number of characters to determine the estimated cost at the end. The model used is priced at 0.0001 USD per 1,000 input tokens. Using a rough estimate of 6 characters per token[2], I can get the total charges. For 84,152 characters across 43 posts, the estimated charges were 0.0014 USD. The actual charges reported through Cost Explorer were 0.0023 USD, so I’ve definitely done something wrong in the cost estimate.
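
Working backwards from the actual bill suggests the 6 characters per token estimate was too generous for my posts:

actual_usd = 0.0023
implied_tokens = (actual_usd / 0.0001) * 1000  # 23,000 tokens billed
chars_per_token = 84152 / implied_tokens       # roughly 3.7 characters per token

At closer to 3.7 characters per token the numbers line up with Cost Explorer, which seems plausible for markdown posts full of code and punctuation.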

Calculating the similarity score

Now that we have all the embeddings in a database, I will go through every entry and generate a cosine similarity score with every other entry. The closer the score is to 1.0, the more similar the original texts are, theoretically. This score is then entered into a new table within the database, a process I’ve shamelessly copied from Simon Willison’s approach to the problem. Another option here is to store the embeddings in a vector database, something I have yet to explore.

rows = db.get_embeddings()
for row_0 in rows:
    for row_1 in rows:
        similarity = cosine_similarity(
            str_float_list(row_0[1]), str_float_list(row_1[1])
        )
        db.insert_similarity(row_0[0], row_1[0], similarity)
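
Since cosine similarity is symmetric, the double loop above computes every pair twice, plus each entry against itself. A sketch of a variant with itertools.combinations that does each pair of posts only once (note it would drop the self-match rows, so the 1.0 rows in the queries below would disappear):

import itertools

rows = db.get_embeddings()
for row_0, row_1 in itertools.combinations(rows, 2):
    similarity = cosine_similarity(
        str_float_list(row_0[1]), str_float_list(row_1[1])
    )
    # insert both directions so lookups by either id still work
    db.insert_similarity(row_0[0], row_1[0], similarity)
    db.insert_similarity(row_1[0], row_0[0], similarity)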

Getting similar posts

We can now query the aptly named similarities table in the SQLite database, which holds all the scores. If you’d like to play along, you can download the SQLite database. I will start by looking for posts similar to my TIL walking through converting a video to a GIF using FFmpeg.

SELECT * FROM similarities
WHERE id = '2019-10-16-convert-video-gif.md'
ORDER BY score DESC
LIMIT 6;

The results, shown below, are pretty good. The closest match is, obviously, the article itself with a score of 1.0; the rest of the posts all involve FFmpeg.

| id | other_id | score |
| --- | --- | --- |
| 2019-10-16-convert-video-gif.md | 2019-10-16-convert-video-gif.md | 1.0 |
| 2019-10-16-convert-video-gif.md | 2023-08-17-ffmpeg-slideshow.md | 0.726564529113765 |
| 2019-10-16-convert-video-gif.md | 2019-10-16-trim-video.md | 0.652841672123232 |
| 2019-10-16-convert-video-gif.md | 2019-10-16-reverse-video-audio.md | 0.632045929017272 |
| 2019-10-16-convert-video-gif.md | 2021-11-28-trim-video-ffmpeg.md | 0.63162492385947 |
| 2019-10-16-convert-video-gif.md | 2019-10-16-merge-audio-video.md | 0.552731252729192 |
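
For actually surfacing related posts, the self-match just needs filtering out; a minimal helper sketch (get_similar is hypothetical and not part of the full code below):

def get_similar(db, post_id, limit=5):
    # Hypothetical helper: top related posts, excluding the post itself
    sql = """
        SELECT other_id, score FROM similarities
        WHERE id = ? AND other_id != ?
        ORDER BY score DESC
        LIMIT ?
    """
    db.cursor.execute(sql, (post_id, post_id, limit))
    return db.cursor.fetchall()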

Hugo’s built-in content similarity function returns the following articles:

  1. 2019-10-16-merge-audio-video.md
  2. 2019-10-16-reverse-video-audio.md
  3. 2019-10-16-trim-video.md
  4. 2023-08-17-ffmpeg-slideshow.md
  5. 2021-11-28-trim-video-ffmpeg.md

Both work just as well; let’s try another one. Here I’m looking for posts similar to NS Sprinters: Sprinter Light Train. The table below was returned from the embeddings.

| id | other_id | score |
| --- | --- | --- |
| 2023-08-27-ns-sprinter-slt.md | 2023-08-27-ns-sprinter-slt.md | 1.0 |
| 2023-08-27-ns-sprinter-slt.md | 2023-08-05-ns-sprinter-sgm.md | 0.902212084485836 |
| 2023-08-27-ns-sprinter-slt.md | 2023-12-05-ams-metro.md | 0.566933072994505 |
| 2023-08-27-ns-sprinter-slt.md | 2023-05-30-european-sleeper-launch.md | 0.514973856129535 |
| 2023-08-27-ns-sprinter-slt.md | 2023-10-22-slip-autumn-leaves.md | 0.484311661075243 |
| 2023-08-27-ns-sprinter-slt.md | 2023-01-29-steam-locomotives-at-disneyland-paris.md | 0.349748162030406 |

The list below was generated by Hugo.

  1. 2023-08-05-ns-sprinter-sgm.md
  2. 2023-06-24-elizabeth-line-trip.md
  3. 2023-05-30-european-sleeper-launch.md
  4. 2023-05-14-madrids-metro.md
  5. 2023-05-06-protos-valleilijn.md

The results this time were a bit more varied, but both returned 2023-08-05-ns-sprinter-sgm.md as the closest match. Both methods also returned 2023-05-30-european-sleeper-launch.md, which I found interesting; I’m not sure what triggered it in each case or if it was just a coincidence. Beyond those two, the related articles differ between the methods, but all are still railway related.

I don’t see a benefit in moving to text embeddings for this use case; it would only add extra complexity to the blog, an extra step to deployments, and generation costs on top of what Hugo is already capable of. I will continue experimenting with text embeddings and try to improve on my method here, but I will stick with Hugo to generate my related content.

Full code

import boto3
from langchain.embeddings import BedrockEmbeddings
import sqlite3
import frontmatter
import os

# (0) Setup clients and define helper functions
client = boto3.client("bedrock-runtime", region_name="us-east-1")
embeddings = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v1", client=client
)


def str_float_list(embedding_raw):
    # Parse a space-separated string of floats, as stored in the
    # embeddings table, back into a list of floats
    return [float(value) for value in embedding_raw.split(" ")]


def cosine_similarity(a, b):
    # Dot product of a and b divided by the product of their
    # magnitudes; 1.0 means the vectors point in the same direction
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)
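
# e.g. cosine_similarity([1.0, 0.0], [1.0, 0.0]) returns 1.0, while
# cosine_similarity([1.0, 0.0], [0.0, 1.0]) returns 0.0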


class sqliteDB(object):
    def __init__(self, db_location):
        self.__DB_LOCATION = db_location
        self.__connection = sqlite3.connect(self.__DB_LOCATION)
        self.cursor = self.__connection.cursor()
        self.create_tables()

    def create_tables(self):
        self.cursor.execute(
            """
            CREATE TABLE IF NOT EXISTS embeddings (
                id TEXT PRIMARY KEY,
                embedding TEXT
            )
        """
        )
        self.cursor.execute(
            """
            CREATE TABLE IF NOT EXISTS similarities (
               id TEXT,
               other_id TEXT,
               score FLOAT,
               PRIMARY KEY (id, other_id)
           )
        """
        )

    def insert_embedding(self, filename, embedding):
        sql = """
            INSERT INTO embeddings (id, embedding)
            VALUES (?, ?)
        """
        try:
            self.cursor.execute(sql, (filename, embedding))
            self.__connection.commit()
            return self.cursor.lastrowid
        except sqlite3.IntegrityError:
            return False  # skip duplicates

    def insert_similarity(self, filename, other_filename, score):
        sql = """
           INSERT INTO similarities (id, other_id, score)
           VALUES (?, ? ,?)
           ON CONFLICT (id, other_id) DO UPDATE SET score = ?
        """
        try:
            self.cursor.execute(sql, (filename, other_filename, score, score))
            self.__connection.commit()
        except sqlite3.IntegrityError:
            return False  # skip duplicates
        return self.cursor.lastrowid

    def get_embeddings(self):
        sql = """
            SELECT * FROM embeddings
        """
        self.cursor.execute(sql)
        return self.cursor.fetchall()

    def __del__(self):
        self.__connection.close()

    def close(self):
        self.__connection.close()


db = sqliteDB("embeddings.db")

# (1)   Iterate through markdown files and get embeddings. Each embedding is
#       then stored in the embeddings table within the SQLite DB.

directory = "../blog.pesky.moe/content/posts"

total_chars = 0
for filename in os.listdir(directory):
    full_filename = os.path.join(directory, filename)
    if os.path.isfile(full_filename):
        data = frontmatter.load(full_filename)
        total_chars += len(data.content)

        print(filename)
        r = embeddings.embed_query(data.content)
        db.insert_embedding(filename, " ".join(str(e) for e in r))

total_tokens = total_chars / 6
print("Total characters", total_chars)
print("Total tokens", total_tokens)
print((total_tokens / 1000) * 0.0001, "USD")  # double check if accurate

# (2)   Iterate through each embedding, compute its cosine similarity with
#       every embedding (including itself), and enter the score into the
#       similarities table.

rows = db.get_embeddings()
for row_0 in rows:
    for row_1 in rows:
        similarity = cosine_similarity(
            str_float_list(row_0[1]), str_float_list(row_1[1])
        )
        print(row_0[0], row_1[0], similarity)
        db.insert_similarity(row_0[0], row_1[0], similarity)

  1. Text embedding models, python.langchain.com

  2. Prepare the datasets, docs.aws.amazon.com

