DjangoCon US 2017 – Taking Django Distributed by Andrew Godwin

(classical music) – Hi, everybody As Adam just told you I'm here to talk about taking Django distributed

But first of all a little bit about myself And that means making the slides advance, there we go So I as previously mentioned am a Django core developer I am perhaps most famous for working on South and migrations in the past I very gratefully handed off migrations to Marcus a while back and these days I work on Channels which we'll talk about in a second

My day job is a senior software engineer at Eventbrite, the ticketing company and I generally have a very bad tendency to run towards code on fire rather than away from it It's a problem I know I'm talking through it, it's fine So I have some bad news, computers hate you They really don't like you

And the second piece of bad news is this makes distributed computing very difficult And for a long time and I'm sure you will agree with me like in my past my reaction to this was very simple I'm gonna build a monolith It's going to be beautiful, have nice clean edges, all the code in one place We'll have one application, deploy to one server

And we can be very happy And this does work for a long time Monoliths run some of the most successful sites in the world And before I even start this talk properly, one piece of advice is don't necessarily move away from the monolith This talk is for you if you want to or if you think you have to and there are definitely reasons you should and I'll go into those, but it's not necessarily a bad thing up front

So if you have a monolith, you're thinking well, it's time to split things up a bit What do I do? I have all this code Usually you'll come in having an existing code base This talk is for those who are coming in with a big existing code base: you have all the code in place, you've probably got a few years or, if you're Eventbrite, almost a decade of code lying around that you want to take and wrestle and split up And this isn't particularly easy If you're starting from scratch it's a little bit easier

But I'd also temper starting from scratch in this way too, because there's a very big tendency to what I call over-architect, to take the best ideas and run with them and go we can build an amazing sharded system, distributed with Kubernetes, and it serves five people You don't want that So there are three aspects to taking a site or a project distributed The one we talk about most is probably code, or even databases if you're in that realm But I'm also going to talk about teams

One of the things I've learned in the last four years is that the way your team works and the way you engineer software at scale is very important You can't get anywhere with a team of 100 people if they don't talk to each other, if they don't understand what they're working on So I'll cover some of that as well So there is no one solution I can't give you a magical solution that you can walk out of this room and go back to your project and implement it and scale forever

If I could I probably wouldn't be here I'd be on a luxury yacht in the Caribbean sunning myself with $100 notes There is no magic solution You can't do this I'm here to try and give you both strategies that you could try and apply and also advice, things where like you oughta look at the patterns and recognize things happening before you see them

One of the things I try and do in talks these days is give you pointers to where to go and learn I can't cover a lot in 45 minutes, but I can try and give you a hint of what's out there, the things to go look for and research if they catch your eye or if you find yourself going down that path

And the reason there's no one solution is sites are very different There are all different kinds of load types and different kinds of implementations These are just some of the ideas of what you can have For example, if you're scaling Wikipedia, Wikipedia is an incredibly read-heavy site Most of the traffic to that site is people just going there, looking at an article and leaving again

A strategy that works well for Wikipedia because it's very read-heavy does not work well for something like Eventbrite where we're very write-heavy People come to us to buy tickets, to send us money A lot of what people do is very transaction-heavy and involves a lot of sort of writes and updates and so we really can't scale in the same way It's true for the kind of load too You can have a very predictable load which I'm sure when you have Wikipedia or Google, it's sort of a gentle curve as you'll see later, but you can have spiky load as well

You can have people going oh, well everyone's gonna arrive in the next 10 minutes for this blog post you put on Reddit, say, which is a very common thing that happens Or even, your thing hosts events every year and so everyone's going to arrive in the 10 minutes before tickets go on sale to try and buy them And there's also the chatty site This is often more a problem in game development, but it's becoming more of a problem on the web Previously the idea of having a website that would repeatedly send small messages backwards and forwards was unusual, and then Ajax came along and helped a bit, but it's still quite bulky, and now with WebSockets and smaller Ajax frameworks we're starting to see the problem of sites that are very chatty

They talk back and forth a lot and if you're in an area with high latency like say Australia where I was last week, that becomes a real problem But let's start with code What can we do with our code and how can we help deal with some of the problems of distribution? So first of all, you use Django, you have apps They're an amazing abstraction to use For many years I put all of my code in one app called Core and there it was, all the models, all the code, all the templates in one app

I just basically ignored the entire app system of Django This was fine for me as a single developer My blog is still this way There's one app called blog with all the geo tracking inside it and the place visiting and the talk track It's all in one app

But as you become part of a bigger team, apps are a really useful line to draw boundaries across Not just as a place to put models and code, but also to understand what your dependencies are, who's working on what, what the ownership is like And you want to really formalize those interfaces One of the problems with apps, and it's fine in the short term, is that it's very easy to just call functions and models directly from another app If I'm writing a polling app for my blog, I can just make the polling app call in and find random posts in my blog's post model

This is fine at small scale As you scale up, one of the biggest things with code is having very clean interfaces drawn between the different parts of your code This is useful for splitting up, as we'll see in a second, but it's also useful just generally for thinking and reasoning about the code When a code base gets to a certain size, no one person can understand the whole code base, and so you have to let yourself forget about pieces of the code and just think about them in the abstract, and that's only possible if you have abstract pieces with good rules about them And so these interfaces are very important for saying I can reason about this piece of code without remembering what's inside it, because otherwise my head will explode
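To make that concrete, here's a minimal sketch of what a formalized interface might look like, assuming a hypothetical `blog/api.py` module; the function names and the in-memory store are made up for illustration, and in a real project the bodies would call into the blog app's Django models.

```python
# blog/api.py -- a hypothetical formalized interface for the blog app.
# Other apps (like polling) import only this module, never blog's models.
import random

_POSTS = []  # stand-in for the blog app's Post table (illustrative only)

def add_post(title):
    """Create a post. The real version would call Post.objects.create()."""
    _POSTS.append({"title": title})

def get_random_posts(count):
    """Return up to `count` random posts as plain dicts, not model instances.

    Returning plain data keeps callers decoupled from blog's schema, so the
    code behind this interface can be rewritten or split out later.
    """
    return random.sample(_POSTS, min(count, len(_POSTS)))
```

The polling app then calls `get_random_posts(3)` rather than reaching into `blog.models.Post` directly, so the blog app stays free to change its models, or move behind a service boundary, without breaking callers.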

If you do have interfaces and you do get to the right scale, you can then choose to split along them And this is kind of where I say you don't have to go this far In many ways staying back here with formalized interfaces is good enough for most big companies, but if you want to go with separate machines, having them there gives you the perfect place to stick the cleaver and split apart your code base As a small example, you can imagine a site that has things like inventory and payments, like a ticketing site does Inventory is a name for having tickets, by the way

And there's a very clear split there if you built it correctly of saying oh, we can take the whole payment system which deals with banks and settling and all the sort of horrible stuff that goes on there and move that to one part We can take the inventory system, things like oh, you must sell exactly 30 tickets, the seat maps, move that over here And we can take the presentation layer, the rest of the logic, and keep it separate And even this sort of concept is very helpful Again you can reason about them separately

You can say like I don't want to know about payments because it's really complicated and takes a whole team, but I know that I can ask for a payment and get a confirmation back And one of the big problems when you split is communication, and this is kind of the reason Channels exists Channels often comes across as oh, this is made for WebSockets, it's Andrew's way to get WebSockets in, and that is true The gap I first saw in Django was oh, WebSockets need to happen, but the secret is that WebSockets are not that hard

There's plenty of good Python libraries to serve them The problem is not sockets The problem is making a system that lets you have sockets It's the idea of well, we can have people talk to a server As soon as we have two servers and they're trying to chat to each other, how do they talk server to server? That idea just doesn't exist

And that idea is the problem at bigger scale for services Imagine you have three services and you're like oh, okay, what we'll do is we'll have the three services just be like Django apps Each one will have a WSGI runner and you just call HTTP endpoints and have them give you JSON back Very decent model, no problem with it at small scale Definitely a good thing to go for

You have three to start with You then get five First of all it's a bit of a pentagram which is worrying, but secondly you can see it's gone from three to 10 interconnections And now what if you have 10 services Oh, oh dear, okay
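That jump from three to 10 connections isn't arbitrary: with every service talking directly to every other, the number of links grows as n(n−1)/2, which a couple of lines make obvious.

```python
def point_to_point_links(n):
    # Each of n services pairs with every other: n * (n - 1) / 2 links.
    return n * (n - 1) // 2

for n in (3, 5, 10):
    print(n, "services ->", point_to_point_links(n), "links")
# 3 services -> 3 links, 5 -> 10, 10 -> 45
```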

And this problem gets bigger I have heard, for example, that Uber have 2500 services You can imagine this model doesn't really work in that case And so the thing Channels goes for, which is not necessarily always a good fit but I think is for most cases, is a message bus or a service bus This is where rather than having interconnections, all your services and all your individual pieces talk to a common bus and that is how they collaborate and share information among each other
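As an illustration only, here's a toy in-process version of the idea; real systems, Channels' channel layers included, back this with something like Redis so that separate machines can share the bus.

```python
from collections import defaultdict, deque

class ServiceBus:
    """Toy message bus: services publish to named channels and read from
    their own channel, never holding a direct reference to each other."""

    def __init__(self):
        self._channels = defaultdict(deque)

    def send(self, channel, message):
        self._channels[channel].append(message)

    def receive(self, channel):
        """Pop the oldest message on `channel`, or None if it's empty."""
        queue = self._channels[channel]
        return queue.popleft() if queue else None

bus = ServiceBus()
# The web frontend doesn't know where the payments service lives:
bus.send("payments.charge", {"order_id": 42, "amount_cents": 1500})
# The payments service, wherever it runs, just reads its channel:
job = bus.receive("payments.charge")
```

Adding a fourth or tenth service doesn't add new point-to-point links, just new channel names on the same bus.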

Channels is a very good medium for this Eventbrite's service infrastructure is now being moved onto running not Channels in terms of sockets, but just channel layers, the underlying implementation of the communication stuff in there This is a really good way of starting to have that code separated out Now I'll give some more tips at the end for how this works, but for now we're going to go into databases and this is kind of the hardest part And this is not just because migrations exist, but people often come to me and go Andrew, why did you need migrations? We can just use Git for code

Can't we do the same for data? And the analogy is somewhat flawed It sounds lovely, like of course, there must be a similar solution for data, something that does for databases what Git did for code And it's not quite true The problem with data is the data turns out to be quite valuable and you can't just delete it and recreate it willy-nilly And also it's very, very big

We could have a system like Git with full versioning of all the database, but it would make it 10 or 100 times bigger When you have 60 or 70 gigabytes of data, that is not a feasible prospect and that's kinda why we don't go there And the same applies to scaling You can think about the same kind of things you do with code, and we'll show those in a second, but there are very different strategies for different kinds of writes And so the first thing you might think about is what we call vertical partitioning, which is a fancy name for giving each table its own database

The idea here is that say you have a couple of big tables; I have like a big users table and maybe a big images table and a big comments table Like a cheap Instagram, for example And if they're all kinda the same size, you can very cheaply use one third of the space on each machine by just putting one table on each machine The problem with this is that you can't split further than per table If you have one giant table especially, this just starts falling down almost as soon as you look at it
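In Django this kind of vertical partitioning can be expressed with a database router; this is a sketch, and the app labels and database aliases (`users_db` and so on) are made up and would need matching entries in `settings.DATABASES`, plus the router registered in `settings.DATABASE_ROUTERS`.

```python
# Sketch of a Django database router for vertical partitioning: each big
# app gets its own database. Aliases here are hypothetical and must match
# settings.DATABASES.
APP_TO_DB = {
    "users": "users_db",
    "images": "images_db",
    "comments": "comments_db",
}

class VerticalPartitionRouter:
    def db_for_read(self, model, **hints):
        return APP_TO_DB.get(model._meta.app_label, "default")

    def db_for_write(self, model, **hints):
        return APP_TO_DB.get(model._meta.app_label, "default")

    def allow_relation(self, obj1, obj2, **hints):
        # Cross-database foreign keys don't work, so only allow relations
        # between objects stored on the same database.
        return obj1._state.db == obj2._state.db
```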

And so the next strategy is a little bit different and I'm sure most of you have heard of it, which is replication: in particular having a single main database that you write to and lots of replicas that you read from This is a very common pattern in Django There are lots of third party apps that let you do this It does come with some caveats I'll cover those quickly

The main caveat is replication lag, which means that when you write information to the main database, it takes a little bit of time, usually under a second, hopefully under 100 milliseconds, for that information to replicate from the main database to one of the replicas and become available for you to read when you query the replica Now 100 milliseconds is not a long time, but it is if you're rendering a page In particular, one of the main problems people have is they will write to the main database and then just read from a replica straightaway without thinking about the lag, and they'll read back stuff that is old You may have just saved a new comment, and when you read the page back again the comment won't be in the replica, and you'll show the user the page without the thing they just submitted This is a very common problem and the solution to it is called database pinning What you do is you say okay, if somebody writes to a table inside one of our views, for the rest of that view we will pin reads to the same main database you wrote to, because then you'll get consistent information
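The pinning idea fits in a few lines; this is an illustrative sketch only (third-party apps implement it for real, usually with middleware and request-local state), and the database names are placeholders.

```python
import random

REPLICAS = ["replica-1", "replica-2"]  # placeholder database names

class RequestPinning:
    """Sketch of per-request database pinning: after the first write, all
    remaining reads in the request go to the primary, so the user always
    reads their own writes instead of racing replication lag."""

    def __init__(self):
        self._pinned = False

    def db_for_write(self):
        self._pinned = True     # any write pins the rest of the request
        return "primary"

    def db_for_read(self):
        if self._pinned:
            return "primary"    # consistent read-your-writes
        return random.choice(REPLICAS)  # otherwise spread read load

pin = RequestPinning()
pin.db_for_read()               # one of the replicas
pin.db_for_write()              # "primary" -- now pinned
pin.db_for_read()               # "primary" again, lag can't bite us
```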

That's great until you realize if your site's write-heavy, you can't actually do that for all the pages because then you're never gonna use all the replicas You're just gonna always be writing and reading from that main database And so this is kind of one of those tricks and this is where my favorite triangle which is the CAP triangle comes into play I'm sure many of you have heard of the idiom cheap, fast and good, pick any two This is the same for databases

You get partition tolerance, availability and consistency You get basically at most two and if you're very lucky you get at least one Many databases give you maybe half of one of these if you look at it right MySQL is somewhere on the one scale I have no bias, I have a bias

And this is the problem: inconsistency is everywhere You can take, say, Postgres Postgres is in theory a partition tolerant and consistent database It's not always available to read if one of the replicas fails, but usually if you can read, you'll get a consistent answer from what you just gave it

But that's not quite true with replication because there is still a little bit of inconsistency there It's only true in the single machine case, and non-consistency really creeps into all aspects of distributed computing in general And there's a very good reason for this and this reason is physics This is a nanosecond of wire The wonderful programmer Grace Hopper was famous for holding up nanoseconds of wire in her lectures; the idea is this is the maximum distance electricity can travel in one nanosecond, because the speed of light is a certain speed, 300,000 kilometers, so 300 million meters, per second I believe

Don't quote me on that In copper it travels at about two thirds of that speed, and even in fiber optics the wire would only be slightly longer than this And if you think about a computer, a computer does what's called clocking If you're not familiar, a processor is basically almost like a mechanical machine where it does an operation and then clocks to the next one

It just does more operations And the clock is kind of what governs how everything synchronizes in a computer It's almost like a mini version of a big computer system, and it turns out if you want to go at more than one gigahertz, which is one cycle per nanosecond, you can't have components further apart than one nanosecond of wire, or they cannot physically clock faster than one gigahertz This is why we don't have big computers that run very fast It's physically impossible

And this is kind of the microcosm of why distributed is hard Think of this at a global scale If I have a server in Australia and a server in West Virginia They are at minimum 100 milliseconds apart at the speed of light I cannot beat that

It's physically, as far as we know, with some caveats, impossible to beat, and so if my goal is to have more than 10 writes a second consistently, I can't do it, because I cannot synchronize more than once every 100 milliseconds between those two zones And this is where the problems of distributed computing come in
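The back-of-the-envelope numbers behind that claim look roughly like this; the distance is a rough great-circle figure and real routes add switching and routing overhead on top.

```python
# Rough physical floor on cross-planet synchronization latency.
C = 299_792_458            # speed of light in vacuum, m/s
FIBER_FACTOR = 2 / 3       # signals in fiber travel at roughly 2/3 of c
distance_m = 16_000_000    # rough Australia -> US east coast distance, m

one_way_s = distance_m / (C * FIBER_FACTOR)
round_trip_ms = 2 * one_way_s * 1000       # about 160 ms, best case
syncs_per_second = 1000 / round_trip_ms    # only a handful of consistent
                                           # round trips per second
```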

If you go back to the triangle, consistency is one of those things that's affected heavily by being physically distributed and it really comes into play with databases And if you think about this model, it works very well I don't wanna dissuade you from it, but there are more advanced things we'll come to in the fourth part of the presentation, about sharding and so on, that we'll kind of move towards A microcosm of databases and code combined is load balancing A lot of people don't think about this at first

I certainly didn't when I started out; being an ops engineer was one of my first jobs and it quickly hit me like a ton of bricks, or rather like a huge number of users clicking refresh at once And the problem is you think websites are simple You don't, you know better People think websites are simple People think oh, I can just have a couple of servers scattered around

They're all equally balanced and then everyone will have consistent load times and all my users are roughly the same They're all that sort of very cookie cutter generic people If you have this, congratulations You have a gem of a website Please keep it

No one else has this What we have is loads of logic that runs at different speeds on different machines, where different pages have different amounts of processor load For example, oh, this page is very easy to render because I can just show a template This page, I have to thumbnail these 25 slides and show them in a grid view And not only that, you have widely varying users

One of my favorite interview questions to give is about building Instagram I sit down with the candidate and go okay, I'm not going to ask you any sort of whiteboarding or technical questions or how do I reverse a list I want you to discuss, at whatever level you feel comfortable, how you would build Instagram from the ground up Usually most engineers, junior or senior, will go a decent way along, have some tables and some good scaling ideas, and then you drop the bomb, figuratively, which is that you say okay, now one of your users has 10 million followers and nobody else does And you have what's almost called the Justin Bieber problem This one individual user is so incredibly expensive

You can't split them up They're a single atomic entity and that really mucks with the way your load works And not just that As I said before, load balancing in theory is like this If you're in one country especially, you have a lovely curve during the daytime when everyone's awake and a lovely relaxing curve when they're asleep in the evening

Perfect Easy to scale for You can draw a nice line about 20% above this curve, say this is our maximum capacity Have a little spare capacity at all times If you're feeling particularly thrifty, you can launch servers in the morning and take them down in the evening

If you are a ticketing website which I have some familiarity with, it looks more like this where you're like oh, it's lovely and relaxing and then an event you didn't realize existed goes on sale and they're very popular And everyone in the world arrives at once to try and get their free beer That literally happened by the way And so suddenly everyone just slams into your servers onto one single page and often into an order flow that's very complicated and scaling for this and load balancing this is very difficult You can't necessarily load balance equally across different servers because it might hit a certain set of servers differently

Like this might hit your payment endpoints much more than your event view endpoints, so it takes a little bit of extra work And this gets more complicated because of me, because I did WebSockets WebSockets are lovely things They're beautiful, they're great They're very good for game programming in the browser in particular, but they have some problems for load balancing and those problems are that they're not like HTTP requests

This is one of the reasons they don't fit into WSGI WSGI is: you have a request, you serve it, you send a response, you're done End of responsibility Sockets aren't like that They're clingy

You open a socket It can be open for hours or days It can just sit around You can't necessarily reroute it because TCP doesn't work like that Not only that, the set of tools that handle balancing sockets is very limited

You have to handle them almost as raw TCP connections, but they're also sort of HTTP, and some tools won't deal with them properly and it becomes a bit of a mess And even worse, they have four different kinds of failure, which you don't realize until you write a service for them They can fail to open in the first place, they can close randomly, and all this kind of stuff And the thing with WebSockets is they're great, but they're an extra feature When you design a site you can consider them a bonus: if you have them, if your client's browser supports them, if you can open them through whatever proxy they have, great, fantastic

But you should treat them as optional and you should close them liberally and freely If you want to re-load-balance some sockets, just close them and let them open somewhere else Don't design them to be open forever Design them so at any minute they might die If you're on the London Underground for example, there is only wifi in stations and not in the tunnels between stations, and so people regularly will appear for a minute, go away for two minutes, reappear for a minute

You gotta design for that case as well Most sites I used to use in London didn't like the whole I have internet now, wait, no, it's gone, wait, no, it's here again now with a three second latency, and just started collapsing Like some of the takeaway sites: I'm trying to order food Did you not plan for this case? No, you didn't
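A common way to design for connections that can die at any minute is jittered exponential backoff on reconnect; a browser client would do this in JavaScript, but the shape of the idea, sketched in Python, is:

```python
import random

def backoff_delays(base=1.0, cap=30.0, attempts=6):
    """Yield jittered, exponentially growing delays for socket reconnects.

    Doubling with a cap avoids hammering a struggling server; the jitter
    stops every dropped client reconnecting at the exact same instant.
    """
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))

# A client loop would sleep for each delay, try to reopen the socket,
# and reset the sequence as soon as a connection succeeds.
```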

So that's one of the big problems And thinking about this problem is when we come to teams As I said, teams are a really important part of designing distributed software and big scale software in general Engineering is a different discipline to programming It certainly encompasses programming, but engineering is a more holistic thing

Engineering is about taking a set of programmers and designers and product people and making the best product you can to serve the needs you have And a big part of this is how you use the people at your disposal The more senior I become and the more I go through the industry, the more I see this I used to be in the boat, eight, nine years ago, of oh, I am the genius kid programmer, I can do anything, I understand everything, and as you get out of the zone of being so incompetent you think you know everything, you get to the zone where you realize you know nothing

And that's kind of where I ended up at this point Teams are very important I don't know enough about managing people That's not my expertise But these are some of the things I've seen as a senior engineer who leads teams and leads projects that I think are useful

So the first thing is your developers are people too Programming is an incredibly draining profession I'm sure many of you know this, mentally and also emotionally sometimes as well Understanding requirements and also trying to adapt them and deal with people having different ideas from you can be very difficult You need to make sure there's time and space to plan for this stuff

In particular, I see lots of plans like oh, we're gonna have these three features developed in parallel, they'll merge instantaneously, then we'll keep going This is often borne out of being a smaller, less distributed company where you had one or two services and everyone knew the whole code base And when everyone's in one room and knows the whole code base you can happily do that

You can work on two different pages and then merge them together because you all know what's going on But when you have 100, 200 people, you don't know everything that's going on elsewhere You have to allow time, both to spin up and understand what the context is, see if someone else has solved this already, and then to spin down and merge, and so it's a very important thing to think about And part of this is that technical debt really can be poisonous Again I'm preaching to the choir here I'm sure, but it's very important that you build up some technical debt when you're

Especially a startup If you're building the space shuttle, don't do this But presuming you're building websites, because you're at DjangoCon, as a website you can have some technical debt

It's fine It's almost healthy Requirements change, people change, user bases change What you should do is be very cognizant, be very aware of where your debt is and when you need to pay it off Because like normal debt it accrues interest

As your code base grows, it drags you down and you need more and more and more And so you need a little bit to compete, but you need to keep ahead of it and keep managing it, and it's very easy to lose track of that technical debt as you go along And then we get to the slightly more controversial question, which most companies have not even solved internally, which is how many Git repositories do you have This is one of the things where, making a big distributed system, you don't think about it until you get there, until you're sitting down at the computer and going oh, okay, we've managed to split out our payments code base and the rest of it Fantastic Where do we put it? If your answer is a single repository, then congratulations, that's great You're gonna have a single giant repository

Everyone's gonna have merge conflicts all the time It's gonna be terrible If you chose multiple repositories, congratulations You're now gonna have 300 repositories and no one knows where they all are, and you're gonna have 300 versions and your release manager's gonna have a hell of a time It's gonna be terrible

There again is no good answer Multiple repos often seems more attractive at first It is, but it means that what you're doing is pushing complexity away from programming and the merge conflicts That problem isn't going away It's just being pushed down to the release phase, and so you're giving your operations and release engineers all of your problems, which is kind of selfish if you ask me, so just be very aware of that

If they agree, that's fine, but don't think you're magically making work go away This then turns into: well, you have these repos, or a single repo; how do people work on that stuff? Do you have your teams structured around individual services or pieces of the code? Do you instead try and stretch people across different services, like oh, this team works on different things, and try and encourage people to have diversity of knowledge and opinion inside the code base? This again is really difficult because often you don't have enough engineers, because no one does Not only that, but this also includes designers and UX researchers and operations engineers too You never have enough people, and you have to work out how to arrange everyone given that, again, no one knows everything

Making sure the right people talk to each other is a very important thing I can't stress enough, and it can get really, really difficult And this gets really problematic with ownership gaps This is a problem I never saw coming until I moved to a big company: it's possible to just have giant pieces of code that no one knows about, because they got written four years ago They still work fine They were written alright

But nobody knows them Even if the person who wrote it is still at the company, they've probably forgotten I've forgotten most of the code I've ever written

It's not uncommon to do so And it's very easy to know what you're working on It's very hard to know what you're not working on At some companies, even working out what the feature set of the site is could take a team of senior engineers weeks There are some sites that are so complicated that just no one knows

Maybe the support team knows best actually One good tip is go and talk to your support people because they get all the experience of all the weird little features But basically nobody knows what's going on and until you run into one of these gaps, it's hard to know they're there and so really think about if you're smaller or medium, just keep a rough spreadsheet even of oh, these are the rough features that we have and here's who knows about them best

And then if somebody leaves and they're in a column by themselves, you go oh, that person has left and nobody knows about this now, and so we should go and fix it This happens in Django too Django is a very big complex project We have specialists in a lot of Django areas

There are areas of Django there are no specialists in For a long time we thought nobody knew about all the weird multi-form forms and stuff, and then one of the team admitted to knowing about it

He went ah-ha What a fool, yes, exactly So that's the kind of thing that happens It's not just companies It's big open source projects too and it's very easy for it to happen

So with all these rough points, I wanna go into some strategies How can you take these ideas I've been throwing at you and apply them? What's my best advice, basically? I can't give you single solutions, but what would I suggest? So first of all, people love microservices, and to say it correctly I have to go microservices like this

Jazz hands are very important Microservices are a big buzzword They're very common and they're very easy, and that's the problem It's so easy to just ignore the other code and start a new service It's the service version of oh, we'll just delete it all and rewrite from scratch

It's easy to start and you know it's gonna be this wonderful thing when you go in Most programmers, a few exceptions I know aside, do not like maintenance programming If you do, congratulations, you should charge more But if you don't, the temptation is to go oh, we can do new services and then we'll just join them all together, and again you're not actually saving any work What you're doing is pushing all of the work later on in the process, when you now have 1000 services and no one knows what they all do

Can you imagine if you have 1000 services? I doubt any one person would understand what they all do or what their ideas are, and so you've just taken the same problem you had with a monolith and made it not only be in services, but now it's on different machines and there's a consistency problem At least with a monolith it's all in one place and you can trace through the code in the debugger easily You try tracing bugs through four different services and four different machines It is not easy

You can't just put a PDB trace in there You have to sort of find logs and put logging in and relaunch them all and then follow the logs through It's just a pain So that's one of the big things to think about What I generally encourage is a moderate size of services

Usually five to 10 is a good start for most companies Try and think of your big business reasons and focus around those That often helps the team composition too If you have experts in certain areas in your team already, it's very sensible to cluster around them and give them some junior engineers to mentor as well Just don't have the thing where every engineer has their own service because at that point they're all writing their own different code bases and you have a giant collaboration problem

Again I'm a big fan of service buses I sort of wrote one in Channels They're a really nice way of doing interservice and inter-machine communication One of the big things, especially in this world of Docker and of containers is that with a service bus, when you have a piece of code, you can just say hey, the bus is over here There's no need to say oh, well service one is here, service two is here, service three is here

Oh, we restarted service two Its IP address has changed so you need to restart everything with the new IP and it just gets to be a massive complex nightmare So that's a really big part of service buses They make deployment and scaling a lot easier They're not listening on any ports because they're just talking out so you can just launch 10, 20 processes very easily
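As a sketch of that idea, here's a toy in-process service bus: services subscribe to named channels and only ever need to know where the bus is, never each other's addresses. This is a minimal illustration, not Channels' actual API, and a real deployment would back it with Redis or similar; all the names here are made up.

```python
import queue
import threading

class ServiceBus:
    """A toy in-process service bus. Services subscribe to named
    channels; senders address channels by name, so no service needs
    another service's host or port -- only the bus itself."""

    def __init__(self):
        self._channels = {}
        self._lock = threading.Lock()

    def subscribe(self, channel):
        # Each named channel is backed by a thread-safe queue.
        with self._lock:
            return self._channels.setdefault(channel, queue.Queue())

    def send(self, channel, message):
        # Creating the channel on demand means senders and receivers
        # can start in any order -- handy when containers restart.
        with self._lock:
            q = self._channels.setdefault(channel, queue.Queue())
        q.put(message)

# Hypothetical usage: a "thumbnailer" service reads from the bus.
bus = ServiceBus()
inbox = bus.subscribe("thumbnailer")
bus.send("thumbnailer", {"image_id": 42})
print(inbox.get(timeout=1))  # {'image_id': 42}
```

The point of the sketch is the addressing model: restarting a consumer changes nothing for the sender, because only the channel name is shared.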

It's not for every design There are totally cases where you shouldn't have a bus, where you should try and keep some of the traffic off of the bus In particular, one good model I've seen is having a service bus for some stuff, in the sense that it is at most once, which means a message on the bus might get there, it might not And a separate thing called a fire hose, which is all the events happening on the site and they may happen once or twice or more, which is at least once And different code problems can happen on different things

Cache invalidation is best done from a fire hose because if you invalidate a cache twice, it's not a problem And so your cache invalidation thing should listen to a fire hose and every time it sees an ID go oh, yeah, this user's changed Invalidate their cache Great way of doing that Shouldn't be on a bus
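A minimal sketch of why cache invalidation suits at-least-once delivery: the handler below is idempotent, so a duplicate delivery from the fire hose does no harm. The cache dict and event shape are invented for illustration.

```python
# Stand-in cache: key -> cached value.
cache = {"user:1": {"name": "stale"}, "user:2": {"name": "fresh"}}

def handle_event(event):
    """Fire-hose consumer. Deleting an already-missing key is a
    no-op, so seeing the same event twice (at-least-once delivery)
    is perfectly safe."""
    if event["type"] == "user_changed":
        cache.pop("user:%d" % event["id"], None)

# The fire hose delivered the same event twice -- still fine.
firehose = [
    {"type": "user_changed", "id": 1},
    {"type": "user_changed", "id": 1},  # duplicate delivery
]
for event in firehose:
    handle_event(event)

print("user:1" in cache)  # False
print("user:2" in cache)  # True
```

Contrast that with something like charging a card, which is not idempotent and so cannot live on an at-least-once stream without extra deduplication.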

And then onto the consistency stuff You are gonna have inconsistent data It's almost impossible to avoid this Even if you think you're a perfect monolith, as soon as you have computers in more than one data center that are physically separate, you would have to deal with something like this Even if it's like 10, 20 milliseconds

And a really powerful way of doing this is to look at the product What are you making? Where can you allow old data? Sit down with your UX researchers and your product people and your designers and go okay, we have to give somewhere, but where can we give in a way that doesn't hurt the experience A very common way for example is to give on content you didn't author If I had say a site where I looked at images and comments, all the things that I didn't own or comment on, we could happily serve you old data It's a very good example of pinning, but this applies everywhere

There's many cases in different pieces of software where you can go along and go yeah, we could just change this What if we remove the pagination? This is very common by the way If you look at GitHub

GitHub does not show you how many pages there are because that's almost certainly what the slow bit is and so by just showing next page, most of the functionality is there, but that expensive pagination query has gone away And so that's one of the compromise examples you might go and do Now sharding
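Before getting to sharding, a minimal sketch of that pagination compromise, assuming keyset-style pagination: fetch one row more than the page size to learn whether a "next page" link is needed, instead of running an expensive COUNT. The function and data here are hypothetical.

```python
def get_page(rows, after_id, limit=3):
    """Next-page-only pagination. In SQL this would be roughly
    WHERE id > %s ORDER BY id LIMIT %s, fetching limit + 1 rows;
    no COUNT(*) ever runs, so total page numbers are unknown."""
    matching = [r for r in rows if r > after_id][:limit + 1]
    page = matching[:limit]
    # The extra row only tells us whether to render "next page".
    has_next = len(matching) > limit
    return page, has_next

rows = list(range(1, 8))  # ids 1..7, standing in for a table
page, has_next = get_page(rows, after_id=0)
print(page, has_next)  # [1, 2, 3] True
page, has_next = get_page(rows, after_id=6)
print(page, has_next)  # [7] False
```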

Sharding is a very complex issue and one that would take an entire talk by itself to cover, but for those who aren't aware, sharding is the next step from basically the vertical partitioning That's called horizontal partitioning And the idea is that you have one table on multiple servers Say like my users table is on 100 different servers and generally you'd say well there's 100 different shards and each user is in a certain shard It's very, very powerful
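A minimal sketch of that shard mapping, assuming 100 user shards as in the example: hash the ID with a stable hash so every process agrees on which shard a user lives on. The shard count and function name are made up for illustration.

```python
import hashlib

NUM_SHARDS = 100  # assumption: 100 user shards, as in the example

def shard_for(user_id):
    """Map a user ID to a shard number. A stable hash (md5 here,
    not Python's per-process-randomized hash()) means every server
    computes the same mapping for the same ID."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same ID always lands on the same shard.
print(shard_for(12345) == shard_for(12345))  # True
print(0 <= shard_for(12345) < NUM_SHARDS)    # True
```

Real systems often use consistent hashing instead of a plain modulo so that adding shards doesn't remap nearly every key, but the modulo version shows the core idea.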

It works incredibly well for most patterns that people understand, but it comes at an incredible technology and person cost Having sharding in a code base makes everything slower to write, everything slower to run Some people come along and design for it upfront They go oh, okay, we're gonna be amazing We're gonna start up with our brand new startup

It's gonna be sharded from the beginning It's gonna be fantastic This is like the technical debt thing Sure, you could do that, but in the same way that you could not take on technical debt and have a perfect code base that passes every single test and has 100% coverage all the time, you're gonna be much slower And almost certainly when you're making the product, be it open source or commercial, you don't understand the full scope of your problem

You should expect things to change So in the same way you should have some technical debt, like I often won't write tests for some parts of my code that I think are gonna change, you probably shouldn't put sharding in there straightaway You should probably go well, we're gonna leave the touchpoints in here In particular, what I suggest is routing all of your queries through a single model in Django or a single function outside of it so that when you do have sharding, you can come along to that one single place and put it in Sharding often involves taking an ID, looking at a hash of the ID, finding a server and doing that stuff
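A sketch of that single split point: every caller goes through one function today, so when sharding arrives only this function has to learn to hash the ID and pick a server. The in-memory store and names are hypothetical stand-ins for the users table.

```python
# Hypothetical stand-in for the users table on one database.
_USERS = {1: {"name": "andrew"}}

def get_user(user_id):
    """The single split point every caller uses. Today it is a
    plain lookup against one store; the future sharding hook
    would live here and nowhere else, e.g.:
        conn = connection_for(shard_for(user_id))
    so callers never change when sharding lands."""
    return _USERS.get(user_id)

print(get_user(1))  # {'name': 'andrew'}
print(get_user(2))  # None
```

In Django terms the same idea is a custom model manager method that all code calls instead of querying the table directly, keeping raw queries out of the rest of the code base.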

I see many sites that don't do this, that directly query tables or do custom raw queries That ruins this so just try and keep the split point there, but don't split on it yet and that generally works out pretty well Again WebSockets and also long polls Anything that's like any kinda connection that's open and you send more than one thing down it, expect it to die Design for failure

If you design for failure, when things don't fail it's a happy surprise You're like oh, this is great It's more efficient If you don't design for failure, your pager will be going off No one has a pager Your mobile phone will be going off all the time and you'll be all it's not working It turns out the internet's not perfect and doesn't stay open all the time Who could have guessed? I used to write game servers in Twisted a few years ago now for Minecraft and that's another case where you expect that it would stay open forever and you can just send things down the socket and that's not true
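A minimal sketch of designing for failure on a long-lived connection: assume the socket will die, reconnect with exponential backoff, and give up loudly after a bounded number of attempts. The `connect` callable and its failure mode are invented for illustration.

```python
import time

def run_with_reconnect(connect, max_attempts=5, base_delay=0.01):
    """Keep calling `connect` (a hypothetical callable that raises
    ConnectionError when the socket drops), backing off exponentially
    between failures, and give up after max_attempts."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Exponential backoff: 1x, 2x, 4x, ... the base delay.
            time.sleep(base_delay * (2 ** attempt))
    raise ConnectionError("gave up after %d attempts" % max_attempts)

# Simulated flaky connection: fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("socket dropped")
    return "connected"

print(run_with_reconnect(flaky))  # connected
```

The happy path costs nothing extra; the failure path is the one you actually planned for.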

You can't do that So design for failure, not just in this, but many other things I have other talks on designing for failure if you want to go and watch them Teams This is very difficult

I am still thinking about this problem The best thing I've seen so far is independent, full stack teams What I mean by this is essentially when your company gets big enough, treat it as lots of smaller startups Each team has operations on it, has design on it, has product on it, has research on it and they all talk together They have that sort of small group feeling and they communicate between each other

One of the common examples, this is quite popular these days is the matrix organization I haven't got a slide for this It's the idea of oh, you have teams that are full stack, but the people who are specialists in each area sort of meet up across teams Like all the ops people have lunch together twice a week and discuss stuff I like this because it encourages you to think and move as a small company

I'm a big small company person This is why I favor this probably It does have some overhead where your teams are gonna have to interact more formally They're gonna have to request stuff of each other like big companies would, almost like contracts and interfaces defined between them But that gets you those system and data model designs for free If your teams are defining how they talk to each other, congratulations, you've got a free data model out of it So this is generally what I prefer

Another thing that's become common, and this is the last strategy, is that I see a lot of companies not having software architects, or a lot of people becoming just software architects Now software architecture is a very ill-defined term I'm probably one maybe, I don't know, but it's a person whose specialty is coordination and putting different pieces together The person who would go and look at all those models in the abstract and look at how they interact

It's very common to not have this and especially if you're coming from a small startup where again everyone understands the whole code base because if everyone understands the whole code base, you don't need this person It's not important No one has to do that hard work of doing the information gathering, but once you get big enough you need that person and then all too often a big enterprise will have a team of specialists, software architects, who sit in the big ivory tower at one end of the campus and think about things all day and go hmm, this is very important, philosophical It's important to be practical I value being practical

I think having people who specialize in architecture but are still involved in writing software is very important, not losing that particular focus or view on the world And this kinda comes back to one thing which is that monoliths aren't that bad really I've seen many big sites, I've run a few big sites that were just a giant monolith and they had problems certainly, but a lot of the problems you think you have in a monolith are just problems of your team and the way you code and going distributed won't necessarily make them better

And you could, and in some cases you should if you have pages that are more expensive than others, you have a big funneling thing, you have video rendering These are good reasons to split up, but Django's app system is very good It's not a bad idea to have 100 Django apps because when you have 100 Django apps on one machine, it's much easier to version them all as one Doing releases, you have one repo You can just release the top version of that repo

There's no well, we need version 0.36 of this and version 0.4 of that and version two of this, but version three of this We'll delete version two of this That's the problem you get with services so maybe, just maybe keep the monolith, but do think about it And with that thank you very much (upbeat music)