The Origin Story
Architecting a Slack bot with phenomenal cosmic powers
(in an itty-bitty living space)
This is the story of a very small Slack bot (Dibs On Stuff) and how I started taking it way too seriously.
Once upon a time in a land far below the equator, there lived a nocturnal infrastructure guy (me). He would spend all his time at work, toiling away on various AWS malarkey, bracing himself and the CFO for the monthly AWS bill.
One day he noticed the developers on Slack were chatting about who could use the staging server - they each wanted to publish their changes there for the product managers to test.
The conversation
This Slack back-and-forth seemed to happen almost daily. It went a little something like:
André: Hey is anyone using staging? I want to deploy PR #345
Jane: I'm still using it, can you wait till this afternoon?
André: Yeah, thanks
Jane: I'll let you know
André: Thanks Jane - hope you don't get caught in ... the rain
Jane: I won't André, and thank you for being such a caring ... comp-andre
André: 😂
Stephen: I'm grabbing it next! Andy/Jane/whoever will you let me know? Thanks! Regards Stephen.
André: yeah.
Jane: yep.
There turned out to be some pretty repetitive workplace dynamics going on.
- First, André needed to put himself out there and ask if a resource was available.
- Second, it's lucky that Jane was about and saw the message - otherwise her changes on the server would have gotten clobbered.
- Third, Jane had to remember that André needed it after her, and people would often forget to let André know (he never took it personally).
- And Stephen, well no-one can help Stephen. But presumably André would have remembered that Stephen wanted it too. Even though André would sometimes go out of his way to avoid communicating with Stephen.
The first architecture
The initial architecture was fairly straightforward. Create a Slack bot. Set up a webserver. Allow the webserver to store information about who was holding a resource (like the name of a server).
Luckily I had a little server running already and wrote some PHP (hey it was quick) to handle requests from Slack, and store them in an AWS S3 bucket. Not a terrible approach but it had some limitations:
- if multiple people tried to call dibs on something simultaneously, it was basically first in, first served. The second person's request could sometimes get overwritten (and dropped) by my S3 code.
- the server didn't have much grunt, and there was every likelihood that it would go down, run out of resources and halt, fill up its disk space, be vulnerable to a random DoS attack, or actually get hacked and have its data intercepted.
- there really wasn't a strategy for scaling. And there didn't need to be. It was just my workplace using it. Well that was before...
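The first limitation is the classic lost-update problem. Here is a minimal sketch in Python (not the original PHP) of how two overlapping read-modify-write cycles can drop a request. `BlobStore` is a hypothetical in-memory stand-in for one S3 object holding a queue, and the interleaving is hand-picked to show the failure:

```python
import copy


class BlobStore:
    """Hypothetical in-memory stand-in for a single S3 object holding a queue."""

    def __init__(self):
        self._queue = []

    def get(self):
        # Like an S3 GET: the caller receives its own copy of the data.
        return copy.deepcopy(self._queue)

    def put(self, queue):
        # Like an S3 PUT: the whole object is replaced, so the last write wins.
        self._queue = copy.deepcopy(queue)


def lost_update_demo():
    store = BlobStore()

    # Andre's request arrives first and Stephen's a moment later;
    # both read the (empty) queue before either write lands.
    andre_view = store.get()
    stephen_view = store.get()

    stephen_view.append("stephen")
    store.put(stephen_view)

    andre_view.append("andre")
    store.put(andre_view)  # replaces the object that contained "stephen"

    return store.get()


print(lost_update_demo())  # ['andre']
```

Neither write fails, so nobody finds out that Stephen's dibs vanished.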
The Slack App Directory
Getting my app into the Slack App Directory was a journey. But I think the end result is that they have quite a tidy marketplace. There were some hurdles. I had to write a Privacy Policy and a Terms of Service, and tiptoe through some fairly major compliance learnings (GDPR). They even requested that I change wording on this website: "The Slack App for Taking Turns" was not permitted.
However, once I was in - something quite strange happened: People started installing the app. Kind of a lot of people (some from big name companies). All of a sudden I was seeing a jump from 1 or 2 people using Dibs every day, to 100-200. This was amazing.
People contacted me, telling me what they wanted. People that were enthusiastic about something I built. I have had nothing but positive interactions. And the feature requests - well, I love making new features, it squishes my little dopamine receptors all over the place. But sooner or later - I was going to need a bigger boat.
Phenomenal scalability, itty-bitty maintenance
AWS Lambda functions, my best friend, my worst enemy, my old drinking buddy, the bad break-up - you name it. I've been stuffing around with Lambda functions for a long time now. It's a vicious love-hate cycle of discovery. Sometimes it's a nice discovery, like RAM over 1,769 MB giving you an entire vCPU to yourself. Sometimes it's a bad discovery, like cold starts, or automatic retries, or weirdly scoped variables that don't get reinitialised between invocations, or ... GAH - ok I'll stop.
Anyway - there are trade-offs. I ended up decommissioning the EC2 webserver altogether. Like actually shutting it down (it was great). And replacing it with: an API Gateway -> acting as a Lambda proxy -> through to a DynamoDB backend. This was possible because Lambda supports Docker image deploys - and so even though I didn't really want to roll forward with PHP, I was able to do a bit of Docker magic and run the same webserver PHP code on the Lambda runtime! (er, don't tell anyone?)
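For anyone who hasn't met the Lambda proxy pattern, here is a rough sketch of what sits behind the API Gateway. It is written in Python rather than the PHP-in-Docker setup described above, and the reply wording is made up. The event keys (`body`, `isBase64Encoded`) are what an API Gateway proxy integration passes in, and `user_name` and `text` are fields Slack sends with a slash command:

```python
import base64
import json
from urllib.parse import parse_qs


def handler(event, context):
    """Handle a Slack slash command delivered via an API Gateway proxy event."""
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")

    # Slack posts slash commands as application/x-www-form-urlencoded.
    form = parse_qs(body)
    user = form.get("user_name", ["someone"])[0]
    text = form.get("text", [""])[0]

    # Illustrative reply only; the real bot's wording is not shown in this post.
    reply = {"response_type": "in_channel", "text": f"{user} called dibs {text}"}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(reply),
    }


example = {"body": "user_name=jane&text=on+staging", "isBase64Encoded": False}
print(handler(example, None)["body"])
```

A real handler would also verify Slack's request signature before trusting the payload; that is left out here to keep the shape visible.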
Let's skip over the horrendous complexity of AWS API Gateways - they're not that hard once you've been working with them for years, and once they're set up they seem to be very hands-off. Let's skip to DynamoDB...
I wanted something to replace S3. At the time, S3 was eventually consistent and sort of slow. Slack bots have some weird limitations, and one of them, straight off the bat, is that your webserver needs to respond within 3 seconds. Otherwise the Slack user is presented with a timeout error. S3 is fast-ish, but the way I was using it, not so much. I also had a sneaking suspicion people were occasionally getting bitten by my lack of locking for the whole thing. So I replaced S3 with DynamoDB, which offers strongly consistent reads, conditional writes that can act as locks, and response times in the milliseconds. And I haven't looked back. The service absolutely flies now. Total transaction times are something like 0.08 of a second. Yum.
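One common way to get locking out of DynamoDB is optimistic locking with a conditional write. The sketch below assumes that approach; it is not necessarily how Dibs does it. The table name `dibs-queues` and the attributes `resource`, `holders` and `version` are hypothetical, and the item is assumed to already exist. The function only builds the request, so it runs without AWS credentials:

```python
def build_enqueue_request(resource, user, expected_version):
    """Build kwargs for DynamoDB UpdateItem that append `user` to a queue.

    The write succeeds only if the stored version still equals
    `expected_version`, so a concurrent writer cannot be silently overwritten.
    Table and attribute names are hypothetical.
    """
    return {
        "TableName": "dibs-queues",
        "Key": {"resource": {"S": resource}},
        "UpdateExpression": "SET #h = list_append(#h, :user), #v = :next",
        "ConditionExpression": "#v = :expected",
        "ExpressionAttributeNames": {"#h": "holders", "#v": "version"},
        "ExpressionAttributeValues": {
            ":user": {"L": [{"S": user}]},
            ":expected": {"N": str(expected_version)},
            ":next": {"N": str(expected_version + 1)},
        },
    }


request = build_enqueue_request("staging", "andre", expected_version=3)
print(request["ConditionExpression"])
```

With boto3 this would be sent as `boto3.client("dynamodb").update_item(**request)`. If someone else bumped the version first, DynamoDB rejects the write with a `ConditionalCheckFailedException`, and the caller can re-read the item and try again instead of clobbering the queue.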
Automatic timed-releases for dibs *off* stuff
The other thing that was nagging at me was the automatic Dibs-off functionality. In the scenario above, André could have called dibs on the staging server and been placed in the queue after Jane. Which is great, he doesn't mind waiting for Jane, but she still needs to manually call dibs-off before he gets the staging server.
Instead Jane could have used the automatic timed-release functionality to state that she only wanted to hold onto the staging server for (say) 2 hours, at which point it would automatically go to André. And likewise, André could have run
/dibs on staging for 6 hours
to indicate that he only wanted to hold onto the staging server long enough to annoy Stephen.
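Slack hands the backend everything after the command name, so the text to interpret here is "on staging for 6 hours". Below is a minimal parsing sketch. The grammar (an optional "for N minutes/hours/days" suffix) is my guess from the example above rather than the bot's real parser, and `parse_dibs_on` is a hypothetical name:

```python
import re
from datetime import timedelta

# Assumed grammar: "on <resource>" with an optional "for <N> <unit>" suffix.
_PATTERN = re.compile(
    r"^on\s+(?P<resource>\S+)"
    r"(?:\s+for\s+(?P<amount>\d+)\s+(?P<unit>minute|hour|day)s?)?$",
    re.IGNORECASE,
)


def parse_dibs_on(text):
    """Return (resource, hold_duration or None) for text like 'on staging for 6 hours'."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised command: {text!r}")
    duration = None
    if match.group("amount"):
        unit = match.group("unit").lower() + "s"  # timedelta takes plural keywords
        duration = timedelta(**{unit: int(match.group("amount"))})
    return match.group("resource"), duration


print(parse_dibs_on("on staging for 6 hours"))
```

A plain "on staging" comes back with a duration of `None`, meaning the holder has to call dibs-off manually.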
How does one set up these sorts of once-off future events from the backend? Well, you either set up a timed scheduled job (e.g. cron) which looks through every single queue, for every single customer, every single minute (sigh). Or you use a different sort of job scheduler - and so I'd been using the lovely Unix/Linux utility named "at".
For a good long stretch my little webserver was queuing up these little "at" alarms and firing them off when needed, releasing the conches automatically. It was effective, but not resilient - any sort of server downtime or bad networking moment would result in jobs getting missed. It felt fragile. One day I noticed that AWS had quietly rolled out EventBridge Scheduler, which supports one-off schedules with an at(...) expression, much like "at". Ah! What has two thumbs and loves serendipity? I ported the functionality over. Highly available, highly scalable job runners that were extremely reliable. Giddy-up.
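To give a feel for the port, here is a sketch of building a one-time EventBridge Scheduler request for a timed release. The schedule name, the payload shape and both ARNs are placeholders I made up; only the `at(yyyy-mm-ddThh:mm:ss)` expression and the parameter names come from the Scheduler API. The function just builds the request, so it runs without AWS:

```python
import json
from datetime import datetime, timedelta, timezone


def build_release_schedule(resource, user, now, hold_for):
    """Build kwargs for EventBridge Scheduler's CreateSchedule (one-time schedule).

    `now` must be a timezone-aware datetime. Names and ARNs are placeholders.
    """
    fire_at = (now + hold_for).astimezone(timezone.utc)
    return {
        "Name": f"dibs-off-{resource}-{user}",
        "ScheduleExpression": f"at({fire_at.strftime('%Y-%m-%dT%H:%M:%S')})",
        "ScheduleExpressionTimezone": "UTC",
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": "arn:aws:lambda:REGION:ACCOUNT_ID:function:dibs-off",  # placeholder
            "RoleArn": "arn:aws:iam::ACCOUNT_ID:role/dibs-scheduler",  # placeholder
            "Input": json.dumps({"resource": resource, "user": user}),
        },
    }


start = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
schedule = build_release_schedule("staging", "jane", start, timedelta(hours=2))
print(schedule["ScheduleExpression"])  # at(2024-01-01T12:00:00)
```

With boto3 the request would be sent as `boto3.client("scheduler").create_schedule(**schedule)`. Unlike the "at" queue on a single box, the schedule lives in a managed service, so a webserver outage no longer means a missed release.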
Where to now?
I am opening up the API - it'll help integrate the system with CI/CD pipelines. Imagine doing a deploy or pushing changes to a testing/QA server without having to announce it or keep asking if it's your turn. Ok - it's just a Slack bot, but maybe it could help unlock ...