DIBS ON STUFF
The Origin Story
Architecting a Slack bot with phenomenal cosmic powers
(in an itty-bitty living space)
This is the story of a very small Slack bot (Dibs On Stuff) and how I started taking it way too seriously. Once upon a time in a land far below the equator, there lived a grumpy, nocturnal infrastructure guy (me). This infra guy would spend all his time at work toiling away on various AWS malarkey, bracing himself and the CFO for the monthly AWS bill - all while trying to make his workplace's infrastructure more scalable, more robust, and more in line with the Twelve-Factor App suggestions. One day he noticed the developers on Slack chatting about who could use the demo server - they each wanted to publish their changes on the staging server for the product managers to test.
This Slack back-and-forth seemed to happen almost daily. It went a little something like:

André: Hey is anyone using staging? I want to deploy PR #345
Jane: I'm still using it, can you wait till this afternoon?
André: Yeah, thanks
Jane: I'll let you know
André: Thanks Jane - hope you don't get caught in ... the rain
Jane: I won't André, and thank you for being such a caring ... comp-andre
André: 😂
Stephen: I'm grabbing it next! Andy/Jane/whoever will you let me know? Thanks! Regards Stephen.
André: yeah.
Jane: yep.

There turned out to be some pretty repetitive workplace dynamics going on.
- First, André needed to put himself out there and ask if a resource was available (and André's pretty shy)
- Second, it's lucky that Jane was about and saw the message - otherwise her changes on the server would have gotten clobbered.
- Jane had to remember that André needed it after her (quite often people would simply forget to let André know things, but he never took it personally).
- And Stephen, well Stephen is Stephen, no-one can help Stephen. But presumably André would have remembered that Stephen wanted it too. Even though André does not like Stephen, and would go out of his way to avoid communicating with him.
The first architecture
The initial architecture was fairly straightforward. Create a Slack bot. Set up a webserver. Allow the webserver to store information about who was holding a resource (like the name of a server). Luckily I had a little server running already, and I wrote some PHP (hey, it was quick to deploy) to handle requests from Slack and store them in an AWS S3 bucket. I wouldn't have necessarily called this a terrible approach. But it had some limitations:
- if multiple people tried to call dibs on something simultaneously, it was basically first in, first served. The second person's request could sometimes get overwritten (and dropped) by my sloppy S3 object-writing code.
- the server didn't have much grunt, and there was every likelihood that it would go down, run out of resources and halt, fill up its disk space, fall victim to a random DoS attack, or actually get hacked and have its data intercepted.
- there really wasn't a strategy for scaling. And there didn't need to be. It was just my workplace using it. Well, that was before...
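The data model at the heart of all this is tiny: a per-resource queue of people who've called dibs. Here's a hedged sketch of that model in Python (the real service was PHP, and these class and method names are made up for illustration):

```python
from collections import defaultdict, deque

class DibsQueue:
    """Toy in-memory model of the per-resource claim queue.

    Hypothetical sketch only - the real service persisted this state
    (first to S3, later DynamoDB) rather than keeping it in memory.
    """

    def __init__(self):
        self._queues = defaultdict(deque)  # resource name -> queue of users

    def dibs_on(self, resource, user):
        q = self._queues[resource]
        if user in q:
            return f"{user} is already queued for {resource}"
        q.append(user)
        if q[0] == user:
            return f"{user} now holds {resource}"
        return f"{user} is #{len(q)} in line for {resource}"

    def dibs_off(self, resource, user):
        q = self._queues[resource]
        if not q or q[0] != user:
            return f"{user} doesn't hold {resource}"
        q.popleft()  # release; the next person in line takes over
        if q:
            return f"{resource} passes to {q[0]}"
        return f"{resource} is now free"
```

With this model, the scenario above collapses to: Jane calls `dibs_on("staging", "Jane")` and holds it, André queues behind her, and when Jane calls `dibs_off` the server passes to André automatically - no-one has to remember anything.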
The Slack App Directory
Phenomenal scalability, itty-bitty maintenance
AWS Lambda functions: my best friend, my worst enemy, my old drinking buddy, the bad break-up - you name it. I've been stuffing around with Lambda functions for a long time now. It's a vicious love-hate cycle of discovery. Sometimes it's a nice discovery, like finding that RAM over 1,769 MB gives you an entire vCPU to yourself; sometimes it's a bad discovery, like cold starts, or automatic retries, or weirdly scoped variables that don't get reinitialised between invocations, or ... GAH - ok I'll stop. Anyway - there are trade-offs.

I ended up decommissioning the EC2 webserver altogether. Like actually shutting it down (it was a glorious event). And replacing it with: an API Gateway -> acting as a Lambda proxy -> through to a DynamoDB backend. This was possible because Lambda supports Docker image deploys - and so even though I didn't really want to roll forward with PHP, I was able to do a bit of Docker magic and just run the same webserver PHP code on the Lambda runtime! (er, don't tell anyone?)

Let's skip over the horrendous complexity of API Gateways - they're not that hard once you've been working with them for years, and once they're set up, they seem to be very hands-off. Let's skip to DynamoDB...

I wanted something to replace S3. At the time, S3 was eventually consistent and sort of slow. Slack bots have some weird limitations, and one of them straight off the bat is that your webserver needs to respond in under 3 seconds. Otherwise the Slack user is presented with a timeout error. S3 is fast-ish, but the way I was using it, not so much. I also had a sneaking suspicion people were occasionally getting bitten by my lack of locking for the whole thing.

So I replaced S3 with DynamoDB, which was strongly consistent, had locking, and had response times in the milliseconds. And I haven't looked back. The service absolutely flies now. Total transaction times are something like 0.08 of a second. Yum.
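The "lack of locking" problem is the classic lost update: two people read the queue, both write back, and one write silently clobbers the other. DynamoDB fixes this with conditional writes (in the real API, a `PutItem` with a `ConditionExpression`). Here's a hedged, in-memory stand-in that shows both the bug and the fix - the store class and version counter are purely illustrative:

```python
class ConditionalStore:
    """In-memory stand-in for a versioned store (think: one DynamoDB item)."""

    def __init__(self):
        self._value, self._version = [], 0

    def read(self):
        return list(self._value), self._version

    def blind_write(self, value):
        # Roughly what my sloppy S3 code did: overwrite, no questions asked.
        self._value = value

    def conditional_write(self, value, expected_version):
        # Compare-and-set: only write if nobody got in since our read.
        if self._version != expected_version:
            return False  # caller must re-read and retry
        self._value, self._version = value, self._version + 1
        return True

def claim(store, user):
    """Append `user` to the queue with a compare-and-set retry loop."""
    while True:
        queue, version = store.read()
        if store.conditional_write(queue + [user], version):
            return

# The lost update: two "simultaneous" reads, two blind writes...
store = ConditionalStore()
q1, _ = store.read()
q2, _ = store.read()
store.blind_write(q1 + ["Jane"])
store.blind_write(q2 + ["André"])
print(store.read()[0])  # ['André'] - Jane's dibs was silently dropped

# ...versus conditional writes, where both claims survive.
store = ConditionalStore()
claim(store, "Jane")
claim(store, "André")
print(store.read()[0])  # ['Jane', 'André']
```

The retry loop is the key design choice: instead of locking readers out, the losing writer just re-reads the fresh queue and tries again, which suits a workload of tiny, fast transactions.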
Automatic timed-releases for dibs *off* stuff
The other thing that was nagging at me was the automatic dibs-off functionality. In the scenario up above, André could have called dibs on the staging server, and he would have been placed in the queue after Jane - which is great, he doesn't mind waiting for Jane (in the rain) - but Jane still needs to manually call dibs-off before he'll get the staging server. Instead, Jane could have used the automatic timed-release functionality to state that she only wanted to hold onto the staging server for (say) 2 hours, at which point it would automatically go to André. And likewise, André could have run /dibs on staging for 6 hours to indicate that he only wanted to hold onto the staging server long enough to annoy Stephen.
How does one set up these sorts of once-off future events from the backend? Well, you either set up a timed scheduled job (e.g. cron) which looks through every single queue, for every single customer, every single minute (sigh). Or you use a different sort of job scheduler - and so I'd been using the lovely Unix/Linux utility named "at".
For a good long stretch my little webserver was queuing up these little "at" alarms and firing them off when needed, releasing the conches automatically. It was effective, but not resilient - any sort of server downtime or bad networking moment could result in jobs getting missed. It felt fragile. One day I noticed that AWS had quietly rolled out the EventBridge Scheduler, which supports one-off schedules with the same syntax as "at". Ah! What has two thumbs and loves serendipity? I ported the functionality over. Highly available, highly scalable job runners that were extremely reliable. Giddy-up.
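EventBridge Scheduler's one-time schedules take an expression of the form `at(yyyy-mm-ddThh:mm:ss)`. A minimal sketch of building one - the helper below is real, runnable Python, while the commented-out `boto3` call shows roughly how it would be submitted (the schedule name, ARNs, and target payload are hypothetical):

```python
from datetime import datetime

def at_expression(when: datetime) -> str:
    """Format a datetime as an EventBridge Scheduler one-time expression."""
    return f"at({when.strftime('%Y-%m-%dT%H:%M:%S')})"

# Sketch of submitting the schedule (assumes AWS credentials, a target
# Lambda ARN, and an execution role - all hypothetical here):
#
# import boto3
# scheduler = boto3.client("scheduler")
# scheduler.create_schedule(
#     Name="dibs-off-staging",                     # hypothetical name
#     ScheduleExpression=at_expression(release_time),
#     FlexibleTimeWindow={"Mode": "OFF"},
#     Target={
#         "Arn": lambda_arn,
#         "RoleArn": role_arn,
#         "Input": '{"resource": "staging"}',      # hypothetical payload
#     },
# )
```

The appeal over cron is exactly what the "at" utility offered: one schedule per release event, fired once, no sweep over every queue every minute - except now AWS keeps the alarm clock running instead of my webserver.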
It's 2024 and the future is Pro
The last few months have been spent knuckling down on the new Pro/Enterprise features - things like API access should really open the system up for CI/CD pipelines, and the brand new architecture makes a knock-it-out-of-the-park difference in performance and reliability. Ok - it's just a Slack bot, but maybe it'll help a few organizations unlock their...