Ok, so I got my Kanban board set up in Notion and about 10 things added. I knocked off the quick task of making sure that everything works locally after switching things around to use Postgres instances instead of SQLite. And then I picked up my next card: creating a Celery task that fetches user data from Twitter when a user logs in.
I started with an API call that does all the work: looking up the user, finding everyone that user currently follows, looking up each of those accounts to find their most recent tweet that is not a retweet or a reply, and then saving all of that to the database.
So the first step was to pull the code out of that function and put it into a Celery task, then call that task from inside the view. That works just fine (a rough sketch of the pattern follows the list below), but here are the issues I see with it:
I follow around 5000 people. That means the task takes almost 1.5 hours to run, because of the Twitter API rate limits.
It’s a dumb algorithm. By that, I mean it just does the same thing each time. Any time a user logs in, it grabs all the data.
If it fails, I end up having to redo the whole thing. Not ideal, especially if it gets 95% done and then needs to restart.
It’s a bad user experience. If someone comes to the site and signs in with Twitter, I can’t show them anything for 1.5 hours? Not great…
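For reference, here's roughly what that first pass looked like. This is a minimal sketch, not my actual code; every name in it is made up for illustration:

```python
# tasks.py -- the whole job, moved verbatim from the view into one big task
from celery import shared_task

def get_followed_accounts(user_id, token):
    """Placeholder for the Twitter API call listing who the user follows."""
    ...

def get_last_original_tweet(account_id, token):
    """Placeholder: an account's latest tweet that isn't a retweet or reply."""
    ...

@shared_task
def fetch_twitter_data(user_id, token):
    for account_id in get_followed_accounts(user_id, token):
        tweet = get_last_original_tweet(account_id, token)
        # ...save the account, the tweet, and the follow relationship here...

# views.py -- the login view just fires the task and responds immediately
def login_callback(user, token):
    fetch_twitter_data.delay(user.id, token)
```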
This is a great time to take a step back and talk about my approach to coding. I adopted an approach I heard fairly early in my career, and it has served me extremely well. It’s a three-part approach:
Make it work.
Make it right.
Make it fast.
When I start coding something, I’m not thinking about design patterns, I’m not thinking about performance, I’m not thinking about tests.
I’m focusing on one thing: can I make this work?
That’s why the first version of this API route was bad. It wasn’t meant to be good. All I was doing was making sure I could do things the way I thought I could. So everything went into a single function (with the exception of an interface I have around the Twitter API calls, which lives in a separate file as a class holding the API info I need). But all of the database reads/writes, the loops, etc. started out in that one function.
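If it helps to picture the layout, that wrapper looks something like this. A minimal sketch, assuming the v2 endpoints and bearer-token auth; the class and method names are mine:

```python
# twitter.py -- a sketch of the wrapper; names and endpoint are illustrative
import requests

class TwitterAPI:
    BASE = "https://api.twitter.com/2"

    def __init__(self, token):
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {token}"

    def following(self, user_id):
        # accounts this user follows
        resp = self.session.get(f"{self.BASE}/users/{user_id}/following")
        resp.raise_for_status()
        return resp.json().get("data", [])
```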
That’s all I needed to get things hooked up initially, and it gave me exactly what I needed to deploy everything to my production environment and confirm it all worked the way I expected.
So that’s step number 1. Now it’s time for steps 2 and 3. I’ll be honest, those two are a bit combined in my approach here, mostly because I’m familiar with this type of optimization problem: by making it right, I’m also making it fast. That’s experience I picked up building data pipelines, and I’m applying the same approach here.
So how to approach this one? I turn it into a DAG! That’s a Directed Acyclic Graph, and it’s one of the most powerful models you can use in software development.
Here’s (the middle of) a Twitter thread I wrote about DAGs while prepping for the course I taught on automation:
In a nutshell, a DAG is a great way to model how data moves through a system and how it changes over time. So I think about what data I have, what data I need, and the steps that get me from one to the other.
So here are the steps that I can define:
I have the authorization token and the Twitter id of the user who signed in. That’s passed into the API from the app when a user logs in.
First, I need to use that to determine which accounts the signed-in user follows. In this step I pass the token through to the Twitter API and get back the full list of accounts the user follows.
For each of those followed accounts, I then have to do a lookup to find that account’s last tweet (not a retweet or a reply).
I also need to update my database with the fact that the signed-in user follows that account.
So if I look at this flow, step 1 is what I’m given. That’s the initial node in the DAG. From there, I need to get all of the followed accounts, and that’s step 2. But that leaves me with N accounts to look up, so step 3 actually runs N times. That’s a good candidate to pull out into its own task, because that way I can more easily keep track of the success/failure of each individual task run. This is also a good place to smarten up the algorithm.
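In Celery terms, the DAG sketches out to something like this: a chain for steps 1 and 2, then a fan-out into a pair of small tasks per account for steps 3 and 4. Again, a minimal sketch, and every task name in it is hypothetical:

```python
from celery import chain, group, shared_task

@shared_task
def get_follows(user_id):
    """Step 2: one (paged) API call; returns the followed account ids."""
    ...

@shared_task
def lookup_last_tweet(account_id):
    """Step 3: latest non-retweet, non-reply tweet for one account."""
    ...

@shared_task
def save_follow(user_id, account_id):
    """Step 4: record that user_id follows account_id."""
    ...

@shared_task
def fan_out(account_ids, user_id):
    # the N-way branch of the DAG: a pair of small tasks per account
    jobs = []
    for account_id in account_ids:
        jobs.append(lookup_last_tweet.si(account_id))
        jobs.append(save_follow.si(user_id, account_id))
    group(jobs).delay()

# step 1 is the input, so kicking off the whole DAG is just:
def start(user_id):
    chain(get_follows.s(user_id), fan_out.s(user_id)).delay()
```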
So let’s zoom in on that step and see what we can do to make it better. First of all, we can enable retries, which keeps rerunning a task until it succeeds. That’s one potential way to handle rate limits: just keep trying until the limit resets. You can also specify how long to wait before retrying, which is worth thinking about too, so you’re not slamming an external system and potentially getting blocked.
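With Celery, most of that is configuration on the task itself. A minimal sketch, assuming my API wrapper raises a RateLimitError (my name for it) when Twitter says no:

```python
from celery import shared_task

class RateLimitError(Exception):
    """Hypothetical: raised by my API wrapper when Twitter returns a 429."""

@shared_task(
    autoretry_for=(RateLimitError,),
    retry_backoff=60,           # first retry after ~60s, then doubling
    retry_backoff_max=15 * 60,  # cap waits at the 15-minute rate-limit window
    retry_jitter=True,          # randomize waits so retries don't stampede
    max_retries=10,
)
def lookup_last_tweet(account_id):
    ...
```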
But there’s something else we can do too. If an account is already in our database with a tweet from within the last 3 months, I know I won’t need to add it to the output, so I don’t need to look it up again. That means popular accounts that tweet all the time only get looked up once every 3 months or so, which saves a lot of lookups.
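The check itself is simple. A sketch, assuming a users table with a last_tweet_at timestamp; the column and helper names are made up:

```python
from datetime import datetime, timedelta, timezone

FRESH_WINDOW = timedelta(days=90)  # "a tweet within the last 3 months"

def db_get_user(account_id):
    """Placeholder for the database lookup."""
    ...

def needs_lookup(account_id):
    user = db_get_user(account_id)
    if user is None or user.last_tweet_at is None:
        return True  # never seen them, or never found an original tweet
    # skip the API call entirely if their last tweet is recent enough
    return datetime.now(timezone.utc) - user.last_tweet_at > FRESH_WINDOW
```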
There’s also another thing we can do. The Twitter API applies rate limits per user, based on the token generated when the user logged in, but there’s also a separate app rate limit, which is usually equal to the per-user limit. So if a lookup fails because the user’s token is rate limited, I can fall back to the app token and retry the failed request. If a bunch of users were simultaneously pulling data and they all followed a lot of accounts, the app token could certainly hit its limit too, and pretty quickly, but I’m thinking I probably won’t have a ton of concurrent users. If I wanted to get really fancy, I could parallelize lookups across both tokens and run them all simultaneously, but that’s an optimization that’s probably not necessary at this point.
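The fallback amounts to a try/except around the lookup. Sketch only, reusing the hypothetical names from the earlier snippets:

```python
def lookup_with_fallback(account_id, user_token, app_token):
    try:
        return get_last_original_tweet(account_id, token=user_token)
    except RateLimitError:
        # the user token has exhausted its window; the app token
        # has its own separate budget, so retry the request with it
        return get_last_original_tweet(account_id, token=app_token)
```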
And step 4 also needs to run N times, but I can simply trigger a task that saves the follow relationship to the database at the same time the lookup is happening (that’s the save_follow task in the sketch above). I could do it in the main task, but this way is a bit nicer: each task does a single thing, so it follows the single responsibility principle.
In testing, this performs fairly well, and it should get more performant over time as more users sign up and their audiences overlap. I did uncover a bug while testing, though. I was logging in and out repeatedly, and because of the way I was doing it, my Twitter token would change each time. Since I was passing the token around between tasks, tasks that had been triggered with the old token started failing once the token changed. These tokens are fairly short-lived (2 hours), so the same thing could happen in normal usage. So I decided to save the token to the user table in step 1 and look it up in any step that needs it. That makes those tasks slightly slower but much more robust: even if the token changes, a task only breaks if the change lands between its database lookup and its API call. That might happen on 1 or 2 tasks, but not the majority.
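Concretely, the fix looks something like this: step 1 persists the token, and every later task reads it fresh instead of receiving it as an argument. Helper names are hypothetical:

```python
from celery import shared_task

@shared_task
def start_pipeline(user_id, token):
    save_token_for_user(user_id, token)  # step 1: upsert onto the user row
    ...

@shared_task
def lookup_last_tweet(account_id, user_id):
    # read the token fresh on every run, so a re-login can't strand the task
    token = load_token_for_user(user_id)
    return get_last_original_tweet(account_id, token=token)
```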
So that’s how I approach problems like this. It makes things a lot easier to maintain in the future and gives me a bunch of building blocks I can use later. That’s the best thing about designing systems this way: I can take steps out of one workflow and repurpose them in another. So the more I add, the more combinations I have, and I can build very complex workflows very quickly!
That’s it for this issue! Going to try to get some things done over the long holiday weekend and maybe start testing with users soon!
Also, stay tuned for a special issue coming in the next week or two. I’ve been thinking about various company ownership structures and might throw out some ideas to get feedback. I don’t think the current structures work particularly well, and I’d like to play around with some different ones. So keep an eye out for that.
And another idea I’ve been playing with: I’m considering running a cohort-based course on building in public, with a focus on goals, strategies, and tactics. It would be free for all subscribers to this newsletter, since the first cohort would be a test run of everything, with the hope of getting feedback on the content. If that’s something you’d be interested in, let me know! And if there are any issues you’re currently facing with building in public, please share those as well. That’s really helpful when planning out a course like this.
Keep building!