My blog is hosted with CloudFront pointed at an S3 bucket, automatically updated whenever I push a commit to the GitHub repository containing the sources. Pretty simple in the end, but it required a lot of reading AWS blog posts to get set up correctly initially. Now that it's no longer on Cohost, I have the opportunity to look at something I haven't been able to before: numbers. Unfortunately this required reading yet more AWS blog posts.
Where Do Numbers Come From?
From looking at an automatically-created CloudWatch (AWS's logging/metrics collection service) dashboard, I could see that CloudFront (AWS's CDN service) was already collecting... something...

Great! So my website is getting hits! But on what pages? That was the question I wanted answered, and this dashboard would not or could not answer it.
Looking further, I found documentation for how to turn on CloudFront access logs, which would contain the information I wanted in the cs-uri-stem field, among others. The data currently in CloudWatch was just "metrics", not "logs", though. If I wanted to put logs in CloudWatch, it would cost... some unknown amount of money, based on usage. Given an example on the CloudWatch pricing page, I estimated about $1/mo.
Now that seems pretty reasonable, no? But currently, I pay just $0.10/mo for S3 storage, and nothing for anything else, so logs alone would 10x my costs :(. I am ruthless about making sure I pay as little as possible for the cloud services I use, so this was a non-starter.
Fortunately, there is another option: CloudFront can put its access logs directly into an S3 bucket, which I could analyze & rotate at my own convenience! Except... how?
Where Do Numbers Go?
Initially, I thought "oh, I can just write a hacky Python script to download & parse them all". But then the thought of writing a hacky Python script seemed really unappealing to me, so I lazyweb-ed on Mastodon to see if anyone knew of any pre-built solutions to parse CloudFront logs. Fortunately someone responded (unlobito) suggesting a program called GoAccess1 that looked pretty good. Now, all I had to do was write some hacky Python that calls the following shell script!
#!/usr/bin/env bash
# Make it so if any individual step fails, the script exits early, among other things
set -euo pipefail
# Make it so ** acts as a recursive glob (not default on certain bash-es)
shopt -s globstar
output_folder="/tmp/"
output_file="/tmp/.html"
# Download all logs from S3. Must also receive environment variables that allow this to work
# Generate the report with GoAccess running inside Docker
|
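For the "hacky Python that calls the shell script" part, here's a minimal sketch of what such a wrapper could look like. It's not my actual script: the function name `run_report` is made up for illustration, and it assumes the credentials reach the script as environment variables (which is what the comment in the script asks for).

```python
import os
import subprocess


def run_report(command, extra_env=None):
    """Run `command` (a list of argv strings) and return its stdout as text.

    `extra_env` is merged over the current environment, which is one way to
    hand AWS credentials to the script. A non-zero exit status raises
    subprocess.CalledProcessError, so a failed step is not silently ignored.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    result = subprocess.run(
        command,
        env=env,
        check=True,           # raise if the script exits non-zero
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output to str
    )
    return result.stdout
```

Calling it would look something like `run_report(["bash", "./generate_report.sh"], {"AWS_PROFILE": "blog-logs"})`, where both the script path and the profile name are placeholders.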
GoAccess is pretty fast, and S3 doesn't take that long to download2, so I can get updated reports pretty easily. Now, drumroll please...
What's My Most Viewed Page?

It's RSS.
Guess I have a few readers huh! Hi 👋😸
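If you're curious what "most viewed" boils down to, it's basically a tally of the cs-uri-stem field across every log line. Below is a rough Python sketch of that tally, roughly the hacky script I didn't want to write. It is not what GoAccess does internally. It assumes the CloudFront standard log layout: a `#Fields:` header line with space-separated column names, followed by tab-separated rows. The function name `count_paths` is made up.

```python
from collections import Counter


def count_paths(lines):
    """Tally cs-uri-stem values from CloudFront standard log lines."""
    fields = []
    counts = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#Fields:"):
            # Column names in the header are space-separated,
            # while the data rows that follow are tab-separated.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            # Skip blanks, other comment lines, and rows seen before a header.
            continue
        row = dict(zip(fields, line.split("\t")))
        path = row.get("cs-uri-stem")
        if path is not None:
            counts[path] += 1
    return counts
```

The log files land in S3 gzipped, so in practice you'd feed it something like `gzip.open(path, "rt")` per file and then look at `counts.most_common(10)`.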
Other Neat Insights I Didn't Expect
- Firefox is overrepresented (50%) compared to its overall market share (3%). Y'all are some nerds huh :3
- The "bot problem" (automated vulnerability scanners, other crawlers) isn't quite as big a problem as I thought. RSS traffic is way higher.
- The list of referring sites is pretty interesting; it's mostly feed syndicators, but seeing the range of ones I didn't know about is cool.
I'll try not to abuse my newfound power too hard. Nice to have it available tho.
