SRP deployed, things looking up!

Week 43 is over. We have successfully deployed the Dominance server V3V2, a.k.a. SRP – which, as described in last week’s development update, is a brand new Dom server architecture designed to remove any chance of memory corruption or crashes caused by multiple server worker threads accessing shared state.

SRP in action

The entire refactoring – or, should I put it colorfully, ripping the horse apart, replacing the skeleton with a zebra, and then altering all the organs to match the new skeleton – took about a week. Most of that week was spent redesigning the client, though.

On the surface, the server behaves more or less the same – it accepts all the usual stimuli and processes them in a mostly unchanged fashion. The major difference is that it no longer opens multiple JSON/TCP inputs but accepts a single ZMQ input pipe. Internal handling of incoming data is obviously reshuffled: the new ZMQ pipes now flush state across all worker threads, instead of locks halting those threads so that the shared memory content can be changed underneath them. But, aside from these architectural changes, most of the server is intact – especially static stuff like the Ticker() or Battle().
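
To illustrate the "push state to the workers instead of locking them" idea, here is a minimal sketch. It is not the actual Dom server code: it uses the standard library's queue.Queue as a stand-in for the internal ZMQ pipes, and the names worker, broadcast and run_demo are made up for the example. Each worker owns a private copy of the state and only changes it in response to messages, so no lock is ever taken.

```python
import queue
import threading


def worker(name, inbox, results):
    """Worker thread: owns a private state dict, updated only via messages."""
    state = {}  # private to this thread, never shared
    while True:
        msg = inbox.get()
        if msg is None:  # None is the (hypothetical) shutdown signal
            break
        key, value = msg
        state[key] = value
    # Report the final private state back through a queue, not shared memory.
    results.put((name, state))


def broadcast(inboxes, msg):
    """Flush one state update to every worker's inbox."""
    for inbox in inboxes:
        inbox.put(msg)


def run_demo(n_workers=3):
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(i, inbox, results))
        for i, inbox in enumerate(inboxes)
    ]
    for t in threads:
        t.start()

    broadcast(inboxes, ("tick", 1))
    broadcast(inboxes, ("tick", 2))
    broadcast(inboxes, None)  # ask every worker to stop

    for t in threads:
        t.join()
    return dict(results.get() for _ in range(n_workers))


if __name__ == "__main__":
    print(run_demo())
```

Because each inbox is a FIFO, every worker sees the updates in the same order and ends up with the same view of the state, without any thread ever touching another thread's memory.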

The client is a whole other story. I redesigned it to use a similar principle – instead of opening multiple TCP connections to the server, it now opened a single ZMQ pipe. Instead of using shared state between its sub-threads (eventlet background tasks), it used internal ZMQ pipes to shuttle messages between the uplink to the server and the player’s websocket handler. And it all broke down. The uplink couldn’t be established (hello ZMQ, how is that even possible), and eventlet exploded with numerous errors – randomly closing websockets, OS errors, what… the… hell?

Drama by Eventlet

Then I had a long discussion with entity.self on a toilet break as to why stuff breaks, and it occurred to me I may be using a wrong approach.

Pycant got redesigned yet again – this time truly from scratch – as 80% of the code got unceremoniously dumped. The whole concept of being “responsible” for anything other than shuttling messages to the server was abandoned. Pycant – as of now – does nothing. It does not process any server or player input – it blindly transmits the data back and forth. The responsibility for handling the authentication responses was dropped onto the JavaScript, while tracking players and the websocket rooms they use is now done by the server itself.

So, when a player authenticates, they send their own room (the “room” is their socket ID), the server processes this and tells pycant “This goes to room XY, that goes to room AB.” – and pycant forwards the messages without question. If the room doesn’t exist, the messages get dumped into the void (the server will eventually log off inactive players).
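
Roughly, the forwarding rule described above looks like the sketch below. This is an illustration under assumptions, not pycant's real code: the Forwarder class, the envelope layout ({"room": ..., "payload": ...}) and the send callables are all hypothetical stand-ins for whatever the websocket layer actually provides.

```python
class Forwarder:
    """Dumb relay: delivers server messages to rooms, knows nothing else."""

    def __init__(self):
        # room id (the player's socket ID) -> callable that sends to that socket
        self.rooms = {}

    def connect(self, room_id, send):
        """Register a room when a websocket opens."""
        self.rooms[room_id] = send

    def disconnect(self, room_id):
        """Forget a room when its websocket closes."""
        self.rooms.pop(room_id, None)

    def forward(self, envelope):
        """Deliver one server message; return False if it went into the void."""
        send = self.rooms.get(envelope["room"])
        if send is None:
            # Unknown room: drop silently. The server is the one that
            # eventually logs off inactive players.
            return False
        send(envelope["payload"])
        return True


if __name__ == "__main__":
    received = []
    relay = Forwarder()
    relay.connect("XY", received.append)
    relay.forward({"room": "XY", "payload": "hello XY"})
    relay.forward({"room": "AB", "payload": "nobody home"})  # dropped
    print(received)
```

The point of the design is that the relay holds no game or authentication logic at all, so there is very little left in it that can break.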

And things still broke down. Less explosively but quite frequently – and especially when pushed to production.

WTF – a common question in the last few days. Then I started to notice a trend. Locking occurred. Socket timeouts, even. And when they happened, the entire pycant would lock up, leaving all the connected players unresponsive until the timeout raised an exception. The exception was apparently related to eventlet failing to recognize that a websocket had closed: it kept trying to shove data down a non-existent pipe, and brought the entire system to its knees when that failed.

Well, eventlet, screw you, I’m leaving you for gevent. And suddenly, using gevent-websocket as the gunicorn worker – everything works fine. Almost perfectly.

In other news…

As far as the current game goes, things appear to be working fine. A couple of cosmetic fixes here and there, but generally speaking I did nothing of importance relating to the actual game – no gameplay fixes were necessary.

There is an ever-growing list of minor tasks and balance issues that should probably get on my plate during the following weeks, but I’m VERY happy to report we’re not crashing or suffering any serious issues.

That’s all for now, please bring me fresh crash-level bugs! 🙂