I Have "Numbers" Now

My blog is hosted with CloudFront pointed at an S3 bucket, automatically updated whenever I push a commit to the GitHub repository containing the sources. Pretty simple in the end, but it required a lot of reading AWS blog posts to get set up correctly initially. Now that it's no longer on Cohost, I have the opportunity to look at something I haven't been able to before: numbers. Unfortunately, this required reading yet more AWS blog posts.

Where Do Numbers Come From?

From looking at an automatically-created CloudWatch (AWS's logging/metrics collection service) dashboard, I could see that CloudFront (AWS's CDN service) was already collecting... something...

Great! So my website is getting hits! But on what pages? That was the question I wanted answered, and this dashboard would not or could not answer it.

Looking further, I found documentation for how to turn on CloudFront access logs, which would contain the information I wanted in the cs-uri-stem field, among others. The data currently in CloudWatch was just "metrics", not "logs", though. If I wanted to put logs in CloudWatch, it would be... some unknown amount of money, based on usage. Given an example on the CloudWatch pricing page, I estimated about $1/mo.
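For a sense of what that field gives you, here is a minimal sketch of tallying requests per page from CloudFront standard logs. It is not what this post ends up using. The function name `count_uri_stems` is made up for illustration, and it assumes the logs are already decompressed and carry the usual `#Fields:` header line naming the columns.

```shell
# Hypothetical helper: count requests per cs-uri-stem in uncompressed
# CloudFront standard log files passed as arguments (or read from stdin).
count_uri_stems() {
  awk '
    # The "#Fields:" header names the columns, separated by spaces.
    /^#Fields:/ {
      n = split($0, names, " ")
      for (i = 2; i <= n; i++) {
        if (names[i] == "cs-uri-stem") col = i - 1
      }
      next
    }
    # Skip any other header line, such as "#Version: 1.0".
    /^#/ { next }
    # Data lines are tab-separated; tally the cs-uri-stem column.
    col > 0 {
      split($0, fields, "\t")
      hits[fields[col]]++
    }
    END {
      for (uri in hits) printf "%d\t%s\n", hits[uri], uri
    }
  ' "$@" | sort -rn
}
```

Something like `zcat logs/*.gz | count_uri_stems` would then print one line per path, most-requested first. Looking the column up from the header, rather than hardcoding its position, keeps the sketch working if the field list differs.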

Now that seems pretty reasonable, no? But currently, I pay just $0.10/mo for S3 storage, and nothing for anything else, so logs alone would 10x my costs :(. I am ruthless about making sure I pay as little as possible for the cloud services I use, so this was a non-starter.

Fortunately, there is another option: CloudFront can put its access logs directly into an S3 bucket, which I could analyze & rotate at my own convenience! Except... how?
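On the "rotate" half, one option is to let S3 expire old log objects by itself with a lifecycle rule. This is an assumption about what rotation could look like, not something this post sets up. A rough sketch follows; the helper names `lifecycle_json` and `apply_log_expiry`, the rule ID, and the 30-day figure in the usage note are all placeholders.

```shell
# Hypothetical helper: print an S3 lifecycle configuration that expires
# every object in a bucket after the given number of days.
lifecycle_json() {
  days="$1"
  cat <<EOF
{
  "Rules": [
    {
      "ID": "expire-old-access-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": ${days} }
    }
  ]
}
EOF
}

# Hypothetical helper: apply that configuration to the log bucket.
# Needs credentials that are allowed to change the bucket lifecycle.
apply_log_expiry() {
  bucket="$1"
  days="$2"
  aws s3api put-bucket-lifecycle-configuration \
    --bucket "${bucket}" \
    --lifecycle-configuration "$(lifecycle_json "${days}")"
}
```

Calling `apply_log_expiry "${AWS_BUCKET_NAME}" 30` would ask S3 to delete logs older than 30 days, which also keeps the storage bill from creeping up.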

Where Do Numbers Go?

Initially, I thought "oh, I can just write a hacky Python script to download & parse them all". But then the thought of writing a hacky Python script seemed really unappealing to me, so I lazyweb-ed on Mastodon to see if anyone knew of any pre-built solutions to parse CloudFront logs. Fortunately someone responded (unlobito) suggesting a program called GoAccess¹ that looked pretty good. Now, all I had to do was write some hacky Python that calls the following shell script!

#!/usr/bin/env bash

# Make it so if any individual step fails, the script exits early, among other things
set -euo pipefail
# Make it so ** acts as a recursive glob (not default on certain bash-es)
shopt -s globstar

if [[ -z "${AWS_BUCKET_NAME:-}" ]]; then
  >&2 echo "Must get bucket passed as AWS_BUCKET_NAME"
  exit 1
fi

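output_folder="/tmp/${AWS_BUCKET_NAME}"
mkdir -p "${output_folder}"
output_file="/tmp/${AWS_BUCKET_NAME}.html"
# Create the report file up front so Docker bind-mounts it as a file;
# a missing host path would be created as a directory instead
touch "${output_file}"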

# Download all logs from S3. Must also receive environment variables that allow this to work
aws s3 sync --delete "s3://${AWS_BUCKET_NAME}" "${output_folder}" 1>&2

# Generate the report with GoAccess running inside Docker
# The glob stays outside the quotes so that bash actually expands it
zcat "${output_folder}"/**/*.gz | docker run --rm -i -v "${output_file}:/report.html" -e "LANG=${LANG}" allinurl/goaccess -a -o report.html --log-format CLOUDFRONT - 1>&2

echo "${output_file}"

GoAccess is pretty fast, and S3 doesn't take that long to download², so I can get updated reports pretty easily. Now, drumroll please...

What's My Most Viewed Page?

It's RSS.

Guess I have a few readers huh! Hi 👋😸

Other Neat Insights I Didn't Expect

Firefox is overrepresented (50%) compared to its overall market share (3%). Y'all are some nerds huh :3
The "bot problem" (automated vulnerability scanners, other crawlers) isn't quite as big a problem as I thought. RSS traffic is way higher.
The list of referring sites is pretty interesting; it's mostly feed syndicators, but seeing the range of ones I didn't know about is cool.

I'll try not to abuse my newfound power too hard. Nice to have it available tho.

¹ Somehow written in C, not Go?? lol
² Hopefully. I haven't tested it on days' or weeks' worth of data, only hours'.

PolyWolf's Blog

I Have "Numbers" Now
