Skeleton Replacement Programme

Hi all, and welcome to another of those rare developer updates where I talk shop. But first, for those who are not versed in coding, a simple introduction to what I’m trying to accomplish.

What is SRP?

SRP can be explained like this. Imagine you were building a horse. You start with a skeleton. Then you add muscles, nerves, lymphatic system, glands, organs, and finally skin. Then you press “run” and the horse starts moving. The Dominance server and client are very similar to this – the skeleton is a set of procedures that deal with how messages are passed around: opening and closing sockets, passing requests from the player to the procedures handling them, and replying with results so that a browser can display them. Muscles? The command thread that executes player requests. Lymphatic system? Logging of events. Glands? The chat system. Organs? Various procedures doing this or that.

Now, imagine that once the horse is up and running – as it currently is – you suddenly realise:

  • Holy crap – that’s a horse!
  • Well, yeah, self, what did you expect? I thought we were building a horse, no?
  • No, self, dammit, I wanted a ZEBRA!
  • Are you insane? A horse is not a zebra! Why did you build a horse then?
  • Because I made a mistake, all right? Can we just paint it with stripes?
  • No. You gotta start at the skeleton again. You can probably salvage most of the organs though…
  • Aww crap.

The Mistake

Hello, shared state. Those of you who know what shared state is can probably skip this entire developer diary, as you will have a perfectly clear idea of what I did wrong. To put it simply, shared state – or in my case, shared memory – is memory shared between various parts of the server so that all of them have immediate access to things such as the Topology, the Player Registry, etc.

Then the threads work using those things. Then some other thread changes those things while the others are still using them. Then we get chaos, breakage, and crashes. Naturally, methods exist that help alleviate this – locks, semaphores, etc. – but in my particular case (where I have all of this in place) things still break. Of course they do!
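To make that concrete, here is a minimal sketch of the trap – the names are illustrative, not the actual Dominance code. One thread walks a shared structure while another rewrites it underneath:

    import threading

    # Illustrative sketch, not the actual Dominance code: one thread reads a
    # shared "Topology" dict while another thread rewrites it underneath.
    topology = {f"region_{i}": 0 for i in range(100000)}

    def info_thread():
        # "the threads work using those things" - iterate the shared dict
        total = 0
        for value in topology.values():
            total += value
        return total

    def command_thread():
        # "some other thread changes those things while other threads are
        # still using them" - resizing the dict mid-iteration can raise
        # "RuntimeError: dictionary changed size during iteration" in the
        # reader, or simply hand it a half-updated view of the world.
        topology["new_region"] = 1
        del topology["region_0"]

    reader = threading.Thread(target=info_thread)
    writer = threading.Thread(target=command_thread)
    reader.start()
    writer.start()
    reader.join()
    writer.join()

Locks make this less likely, but only if every single reader and writer takes the right lock at the right time – and that is exactly the discipline that keeps slipping.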

There is a solution, one that is probably clear as day to all the developers working with multi-threaded applications, but only recently revealed to me. I’ll paste a quote from the ZMQ guide:

To make utterly perfect MT programs (and I mean that literally), we don’t need mutexes, locks, or any other form of inter-thread communication except messages sent across ZeroMQ sockets.

By “perfect MT programs”, I mean code that’s easy to write and understand, that works with the same design approach in any programming language, and on any operating system, and that scales across any number of CPUs with zero wait states and no point of diminishing returns.

If you’ve spent years learning tricks to make your MT code work at all, let alone rapidly, with locks and semaphores and critical sections, you will be disgusted when you realize it was all for nothing. If there’s one lesson we’ve learned from 30+ years of concurrent programming, it is: just don’t share state. It’s like two drunkards trying to share a beer. It doesn’t matter if they’re good buddies. Sooner or later, they’re going to get into a fight. And the more drunkards you add to the table, the more they fight each other over the beer. The tragic majority of MT applications look like drunken bar fights.

The list of weird problems that you need to fight as you write classic shared-state MT code would be hilarious if it didn’t translate directly into stress and risk, as code that seems to work suddenly fails under pressure. A large firm with world-beating experience in buggy code released its list of “11 Likely Problems In Your Multithreaded Code”, which covers forgotten synchronization, incorrect granularity, read and write tearing, lock-free reordering, lock convoys, two-step dance, and priority inversion.

Yeah, we counted seven problems, not eleven. That’s not the point though. The point is, do you really want that code running the power grid or stock market to start getting two-step lock convoys at 3 p.m. on a busy Thursday? Who cares what the terms actually mean? This is not what turned us on to programming, fighting ever more complex side effects with ever more complex hacks.

Some widely used models, despite being the basis for entire industries, are fundamentally broken, and shared state concurrency is one of them. Code that wants to scale without limit does it like the Internet does, by sending messages and sharing nothing except a common contempt for broken programming models.
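In case the quote sounds abstract, here is roughly what “sharing nothing except messages” looks like in practice. This is a minimal pyzmq sketch (the socket type and endpoint name are my illustrative choices): the worker thread never touches the main thread’s data, it only receives and returns messages.

    import threading
    import zmq

    ctx = zmq.Context.instance()

    def worker():
        # The worker owns no shared data; its whole world view arrives as a
        # message and its entire output leaves as a message.
        sock = ctx.socket(zmq.PAIR)
        sock.connect("inproc://work")
        request = sock.recv_json()
        sock.send_json({"answer": request["a"] + request["b"]})
        sock.close()

    main = ctx.socket(zmq.PAIR)
    main.bind("inproc://work")          # inproc: bind before the worker connects
    threading.Thread(target=worker).start()

    main.send_json({"a": 2, "b": 3})
    print(main.recv_json())             # {'answer': 5}

No locks, no semaphores: whatever the worker needs, it gets in the message; whatever it produces, it sends back.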

What to do, what to do…

I was reading this guide while looking to rip out the old TCP-based communication layer the server and client used to talk to each other. It’s full of problems: it causes random player disconnects, and it’s slow and inefficient. Obviously, I knew this day would come – as early as week 2 I knew that JSON over TCP, at least in the way I was using it, was fundamentally broken and unsustainable. The conclusion I came to, though – after reading the guide quoted above – was that the server should be using proper message queues not only to talk to the client, but to talk to itself as well.

Well, now is the time to rip it apart, dump the old skeleton, and replace it with a zebra.

So, what’s the plan?

Well, self, the plan is as follows. See the old picture?

We’re going to alter it to look like this:

Major changes are as follows:

  • Client doesn’t open a new TCP connection to the server for every new player. It connects with a single 0MQ socket and uses that for both internal (“Hello, are you alive?”) and external (“Player Jozo wants his spellcasting status.”) requests.
  • Client uses 0MQ sockets internally to shuttle messages between the Flask-SocketIO NameSpace (the thing that responds to players’ websocket requests) and the server. No more shared state there either.
  • Server uses three sockets for internal communication: one for an exclusive async pipeline to the C&C thread, one to task info threads with updates, and one to publish state changes (see the sketch after this list).
  • Info and Command threads no longer communicate with the client directly. They pass their results back to the main server thread, which in turn proxies them to the client.
  • Both Info and Command threads can request other threads to do some work (C&C has just finished morphing units, so it wants some Info thread to pass back the results) by telling the server to order that work on other threads. Previously C&C talked to info threads directly, which led to even more issues with shared state being altered without the main thread knowing it.
  • Any Info/Command thread can request a state change by politely telling the server “player XY has been banished, here is the updated player list, get rid of him”. In practice only the command thread has the authority to effect state changes, which the main thread then publishes to all other threads that require them.
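Here is a rough pyzmq sketch of that layout. The socket types and endpoint names are illustrative choices for this diary, not the final Dominance code, but they show who binds what and why nothing is shared:

    import zmq

    ctx = zmq.Context.instance()

    # The one socket the client talks to - replaces a TCP connection per
    # player. ROUTER remembers which peer sent what, so replies can be
    # proxied back to the right requester.
    frontend = ctx.socket(zmq.ROUTER)
    frontend.bind("tcp://*:5555")

    # Internal socket 1: exclusive async pipeline to the C&C thread.
    # PAIR is exclusive and bidirectional, so C&C can hand its results
    # straight back to the main thread over the same pipe.
    command_pipe = ctx.socket(zmq.PAIR)
    command_pipe.bind("inproc://command")

    # Internal socket 2: task the pool of Info threads with work.
    info_tasks = ctx.socket(zmq.PUSH)
    info_tasks.bind("inproc://info")

    # Internal socket 3: publish state changes (Topology, Player Registry...)
    # so every thread that needs them gets its own copy - no shared memory.
    state_pub = ctx.socket(zmq.PUB)
    state_pub.bind("inproc://state")

Info threads would connect to inproc://info and inproc://state from their own end; exactly how their results travel back to the main thread is a detail I’m still settling, so it’s left out of the sketch.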

Where are we right now?

As of today, a skeleton exists that can accept a connection from a player’s browser, take a websocket request, and pass it along the pipeline to the server, which responds. In practice this is demonstrated with a “ping”:“pong” request (which carries information about the tick status). So – the skeleton is there. Now it’s a matter of transplanting all the organs, and this is not trivial.
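For the curious, the browser-facing end of that round trip looks roughly like this – a sketch assuming Flask-SocketIO and a per-request pyzmq REQ socket pointed at the frontend endpoint from the earlier sketch; the real wiring may differ:

    import zmq
    from flask import Flask
    from flask_socketio import Namespace, SocketIO

    app = Flask(__name__)
    socketio = SocketIO(app)
    ctx = zmq.Context.instance()

    class GameNamespace(Namespace):
        # "the thing that responds to players' websocket requests"
        def on_ping(self, data):
            # Shuttle the websocket request onto the 0MQ pipeline...
            sock = ctx.socket(zmq.REQ)
            sock.connect("tcp://localhost:5555")
            sock.send_json({"cmd": "ping", "player": data.get("player")})
            # ...and hand the server's reply (which carries the tick status)
            # back to the browser over the same websocket.
            self.emit("pong", sock.recv_json())
            sock.close()

    socketio.on_namespace(GameNamespace("/game"))

    if __name__ == "__main__":
        socketio.run(app)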

For example, the entire system that handled user login and elder creation has to be rewritten. Previously, players connected directly with an exclusive connection; this raised an event, the server would send a hello, the player would send an authentication object, and so on. Now there are no “phone ringing” or “they hung up” events.
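To illustrate what that means in practice: with no connection events to hang the handshake on, logging in becomes just another message. The shapes below are placeholders, not the final protocol:

    # Hypothetical shape of the new login exchange - placeholders, not the
    # final protocol. The client announces itself unprompted, because there
    # is no per-player "connection opened" event any more.
    login_request = {
        "cmd": "login",
        "player": "Jozo",
        "auth": {"token": "..."},   # the old authentication object, now the
                                    # payload of an ordinary request
    }

    # The reply travels back like any other response, so "hello" and
    # "they hung up" stop being special cases in the server.
    login_reply = {"status": "ok"}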

Also, pycant is now much closer to the server: where previously it was just a proxy, it now helps with tracking players and their connection status, and is responsible for helping to log them out after timeouts. Technically, it did all of these things before, but in practice it often made a mess. I’m confident the new architecture is going to behave much better.

That’s all for this update – keep in touch and follow the Court Events to see how SRP development goes.