This week was a wild one for me. First full week back from vacation. And I volunteered to be the first person to take on the new “on-run” role that we’ve introduced in the Copy.ai engineering team.
We’ve expanded our team pretty rapidly, going from 4 engineers to 8 in the past few weeks. And our processes were designed for a small team to ship fast. One of the things we came to realize was that those processes aren’t sufficient to support this new team size.
So our head of engineering introduced a new role that will rotate weekly: on-run. Basically, this person is in charge of making sure builds go out, supporting things when needed, responding to incidents, etc.
So I figured I could take the first crack at it. Seemed interesting, and since I was coming off a vacation, the other engineers already had work in progress while I had a clean slate. Made sense for me to step up.
And then, chaos.
We started the week with an outage. There was an issue with our database slowing down, which caused a chain reaction in our app: all of a sudden, requests that used to take 1 or 2 seconds at most were taking 30 seconds and timing out. The system got more and more backed up, memory usage started spiking, and eventually the app became completely unresponsive.
So that was my Monday: figuring out what was happening, why it was happening, and coordinating with the other engineers to solve it. Ultimately, we realized we were over our database plan limits on Heroku, and our best guess was that Heroku had started throttling our access because of that.
So I got to learn how to upgrade a Heroku database instance while not breaking everything! It was actually pretty straightforward, just prepping the database ahead of time with a command from the CLI and then running the maintenance job that switched it over. So that was a fun start to the week!
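I didn’t keep a transcript of the exact commands, but the process matched Heroku’s documented follower-changeover flow. A rough sketch, where the plan name, app name, and the HEROKU_POSTGRESQL_PINK attachment name are all placeholders:

```bash
# Prep ahead of time: provision a database on the bigger plan as a
# follower of the current primary, then wait for it to come online
# and catch up on replication.
heroku addons:create heroku-postgresql:standard-2 --follow DATABASE_URL --app copy-app
heroku pg:wait --app copy-app
heroku pg:info --app copy-app   # check "Behind By" until the follower is caught up

# Switchover: stop traffic, detach the follower, and promote it to
# primary (this repoints DATABASE_URL), then bring the app back up.
heroku maintenance:on --app copy-app
heroku pg:unfollow HEROKU_POSTGRESQL_PINK --app copy-app
heroku pg:promote HEROKU_POSTGRESQL_PINK --app copy-app
heroku maintenance:off --app copy-app
```

The nice part is that all the slow replication work happens before maintenance mode, so the actual downtime is just the unfollow/promote step.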
Tuesday, we were focused on getting a new release out that would reduce our costs by a good amount. But we discovered some issues that needed solving first, so we spent a lot of time going back and forth on those.
And then on Wednesday, I got a notification that we needed to switch out any keys we were using in our preview/staging apps because of the GitHub/Heroku breach. We didn’t have any reason to suspect that our specific keys were accessed, but as a security precaution, we needed to roll all of them. So I started working through how each key was created, rolling each one, updating our deployment pipelines, and making sure everything continued to work.
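For the keys that lived in Heroku config vars, the rotation itself is quick once a fresh key has been generated with each provider; the slow part is the inventory. A hedged sketch, with made-up variable and app names:

```bash
# SOME_PROVIDER_API_KEY and the app name are hypothetical placeholders.
# After generating a fresh key with the provider, swap it into each
# preview/staging app and confirm it took effect.
heroku config:set SOME_PROVIDER_API_KEY="$NEW_KEY" --app copy-app-staging
heroku config:get SOME_PROVIDER_API_KEY --app copy-app-staging

# Once everything checks out, revoke the old key with the provider.
```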
Also, we were finally ready to deploy the cost-saving change, so in the middle of all of this, I ran a release to get it out, and then a second release for a few smaller things. Because of course, while all of this was going on, the other engineers were continuing to build features, which were then backing up in the pipeline because of how we run the deployment process. So we needed to start picking various tickets and releasing those, based on risk, priority, testing status, etc.

And then on Thursday, I created a new release for testing that contained probably the most cards we’ve ever had in a single release. There ended up being some bugs that needed fixing (which makes sense, because that’s what happens when you combine a whole bunch of features into a single release). We worked diligently, but we weren’t quite able to get that release out yesterday; it’s pretty close and should be ready for Monday.
I wanted to share this for a few different reasons.
It’s really important to release small, incremental changes to production when you can. We are working toward full continuous deployment, but we aren’t there yet. With a bunch of engineers shipping features, releases can back up and really slow things down. That’s something to be aware of as you start scaling teams.
As crazy as the week was, it was a great learning experience. I had to touch so many different pieces of the app to understand what was happening, what needed fixing, etc. And I was constantly engaged. Coming off a vacation, this was a high-octane way for me to get back into the swing of things. Sometimes, it’s difficult for me to come back into a routine, so this was really helpful.
Sometimes, things happen that are outside of your control. And when those things happen, you have to be able to address them. Not always easy, sure. But that’s part of software development. And these things tend to be great ways to learn a lot of stuff fast.
Of course, the flip side of all of this is that I haven’t gotten much done on my side projects this week. But I have been thinking about them and really evaluating specific pieces.
Revisiting my Vision for Feather
One of the things that has been top of mind for me has been what I want to do with Feather. It’s been sitting there and running while I put my focus elsewhere, but I do want to put some sort of plan in place for it. So in the little bits of time between the chaos, I’ve been thinking about what that plan might look like. And I realized something important.