PolyWolf's Blog

I Have "Numbers" Now

Published: 1/19/2025, 3:59:14 PM

My blog is hosted with CloudFront pointed at an S3 bucket, automatically updated whenever I push a commit to the GitHub repository containing the sources. Pretty simple in the end, but required a lot of reading AWS blog posts to get set up correctly initially. Now that it’s no longer on Cohost, I have the opportunity to look at something I haven’t been able to before: numbers. Unfortunately this required reading yet more AWS blog posts.

Where Do Numbers Come From?

From looking at an automatically-created CloudWatch (AWS’s logging/metrics collection service) dashboard, I could see that CloudFront (AWS’s CDN service) was already collecting… something.

CloudWatch dashboard showing total requests and total bytes downloaded from wolfgirl.dev and static.wolfgirl.dev

Great! So my website is getting hits! But on what pages? That was the question I wanted answered, and this dashboard would not or could not answer it.

Looking further, I found documentation for how to turn on CloudFront access logs, which would contain the information I wanted in the cs-uri-stem field, among others. The data in CloudWatch currently was just “metrics”, not “logs”, though. If I wanted to put logs in CloudWatch, it would be… some unknown amount of money, based on usage. Given an example on the CloudWatch pricing page, I estimated about $1/mo.

Now that seems pretty reasonable, no? But currently, I pay just $0.10/mo for S3 storage, and nothing for anything else, so logs alone would 10x my costs :(. I am ruthless about making sure I pay as little as possible for the cloud services I use, so this was a non-starter.

Fortunately, there is another option: CloudFront can put its access logs directly into an S3 bucket, which I could analyze & rotate at my own convenience! Except… how?

Where Do Numbers Go?

Initially, I thought “oh, I can just write a hacky Python script to download & parse them all”. But then the thought of writing a hacky Python script seemed really unappealing to me, so I lazyweb-ed on Mastodon to see if anyone knew of any pre-built solutions to parse CloudFront logs. Fortunately, someone (unlobito) responded, suggesting a program called GoAccess[1] that looked pretty good. Now, all I had to do was write some hacky Python that calls the following shell script!

#!/usr/bin/env bash

# Make it so if any individual step fails, the script exits early, among other things
set -euo pipefail
# Make it so ** acts as a recursive glob (not default on certain bash-es)
shopt -s globstar

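if [[ -z "${AWS_BUCKET_NAME+x}" ]]; then
  >&2 echo "Must get bucket passed as AWS_BUCKET_NAME"
  exit 1
fi

# Assumed locations (hypothetical defaults, adjust to taste). output_file must
# be an absolute path for the Docker bind mount further down.
output_folder="${PWD}/cloudfront-logs"
output_file="${PWD}/report.html"

mkdir -p "${output_folder}"
# The report file must exist before Docker bind-mounts it, or Docker creates a directory in its place
touch "${output_file}"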

# Download all logs from S3. Must also receive environment variables that allow this to work
aws s3 sync --delete "s3://${AWS_BUCKET_NAME}" "${output_folder}" 1>&2

# Generate the report with GoAccess running inside Docker.
# The glob sits outside the quotes so that bash actually expands it.
zcat "${output_folder}"/**/*.gz | docker run --rm -i -v "${output_file}:/report.html" -e LANG="${LANG}" allinurl/goaccess -a -o report.html --log-format CLOUDFRONT - 1>&2

echo "${output_file}"
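
One detail worth spelling out: every noisy step in the script redirects to stderr (1>&2), so the only thing left on stdout is the report path, and whatever calls the script can capture it cleanly. Here is a minimal sketch of that pattern. `generate_report` is a made-up stand-in for the real script, and the path it prints is a placeholder, not the actual output location.

```shell
# Hypothetical stand-in for the script above: chatter on stderr, result on stdout.
generate_report() {
  echo "syncing logs..." 1>&2
  echo "generating report..." 1>&2
  echo "/tmp/report.html"   # placeholder path, for illustration only
}

# Command substitution captures stdout only, so the progress messages
# still reach the terminal but stay out of the variable.
report_path="$(generate_report)"
echo "Report written to ${report_path}"
```

The same idea applies from any caller: read stdout for the path, and let stderr pass through for debugging.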

GoAccess is pretty fast, and the logs don’t take that long to download from S3[2], so I can get updated reports pretty easily. Now, drumroll please…

What’s My Most Viewed Page?

A table showing /blog/rss.xml, /, and /cybersec/rss.xml being the most-hit pages

It’s RSS.

Guess I have a few readers huh! Hi 👋😸
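
For a quick cross-check of a table like that without GoAccess, a small awk pipeline gets most of the way there. This is only a sketch, assuming CloudFront's standard tab-separated log layout, where cs-uri-stem is the 8th field and header lines start with `#`. The names `top_pages` and `log_dir` are hypothetical, picked for illustration.

```shell
# Count hits per path in CloudFront standard access logs read from stdin.
# Assumption: tab-separated fields with cs-uri-stem in column 8, and
# header lines ("#Version", "#Fields") starting with '#'.
# Optional first argument: how many rows to print (default 10).
top_pages() {
  awk -F'\t' '!/^#/ && NF >= 8 { hits[$8]++ }
              END { for (p in hits) print hits[p], p }' \
    | sort -k1,1nr -k2,2 \
    | sed -n "1,${1:-10}p"
}

# Example usage in bash with globstar enabled (log_dir is a hypothetical
# path to the synced logs):
#   zcat "${log_dir}"/**/*.gz | top_pages 5
```

The last stage uses `sed` rather than `head` so that the whole input is consumed; under `set -o pipefail`, a `head` that exits early can make the pipeline report a failure.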

Other Neat Insights I Didn’t Expect

I’ll try not to abuse my newfound power too hard. Nice to have it available tho.


  1. Somehow written in C, not Go?? lol

  2. Hopefully. I haven’t tested it on days’ or weeks’ worth of data, only hours.
