Once I solve the server structure problem (or at least get something "Good enough") I will start detailing how I think the protocol systems should work.
Of note, the "database components" have been removed from all of the graphs to make things a bit simpler. You can assume they're hanging off every which way in the zone graphs, and off of the central servers in the goblin graphs.
GRAPH #1: The classic zone approach.
This is, of course, at least slightly off, but it illustrates the big point.
The client comes in from the internet and logs into the login server, which passes credentials to the server-selection screen, which connects you to your server. From there everything is hubbed to the servers that handle the individual "zones" - when a player reaches the border of a zone, they're transferred to the next server (which causes a very visible hiccup). Most games with a "seamless" world still do zoning, and goblin will likely do some zoning itself. The goal is to make it as lightweight as possible so no major hiccups (if any) happen.
This design is simple, yet very effective. All gameplay data is always within instant reach of the servers (maps, current players, NPCs, etc. are all right there in memory). To accommodate more ground and players, more zone servers are added. The only *big* downsides to this are player clustering and the zones themselves.
If too many players end up in one zone, that zone's server is going to have problems, and you can't do anything about it except reduce the number of other zones that particular machine handles.
Seamless worlds are just plain niftier. Being able to operate on the world on a global scale allows for a few handy things. A global weather map, a global operator for flora, roaming NPCs, etc.
The zone approach also hurts scalability, since you can't cluster servers for a small but heavily populated area, or easily add specialized servers.
GRAPH #2: Ragnarok Online's lopsided setup.
This is how the servers graph out for the ever-popular RO. For some reason it needs three separate servers to get you into the game, and then all of the zones are accessed directly through the internet. Heh heh heh. It suffers from the same problems as the zone approach, and adds a few new bugs of its own.
Since you're not going through a relay to reach the internal zone servers, you're relying on their ever-buggy network code to connect you to a new server when you zone. This often doesn't work, which produces that "please wait" bug while zoning, and the zone disconnects.
Having everything in the open like that is a decent security risk as well. I won't get into this in detail.
Here comes Goblin. I'll walk through the old design I used to have, then show the new one.
GRAPH #3: Old Goblin style.
This all creates one seamless world. It's very hard to look at.
Login1 and Login2 are both "Connection" servers which handle network operations between the clients and the internal servers. They each hold exactly one connection to every central server, one connection to each other, and one connection for every client.
Each central server holds exactly one connection to every robot server it has been assigned (central1 takes care of robot1 and robot2), every login server in existence, and every map server in existence.
Each map server holds a connection to every central server, and to the map servers it borders. This particular graph above would create a world not unlike a big rectangle.
Each robot server holds one connection to the central server it is associated with. Robot servers handle the player and NPC objects/threads.
This, while more scalable for a single world than the zone approach, fails miserably elsewhere.
The number of internal connections, and thus the network complexity, scales O(n^2) as central, map, or login servers are added. I'd *prefer* it if servers scaled O(1) so I could fling more of them at any given problem :)
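To make the quadratic growth concrete, here's a rough connection count for the old topology under the rules above. The function and the per-link terms are my own illustration; robot links (O(1) per central) and map-to-map border links are ignored since they grow linearly.

```python
def internal_connections(logins, centrals, maps):
    """Rough internal-connection count for the old topology: login
    servers mesh with each other, and every central server links to
    every login server and every map server. Linear-growth links
    (robots, map borders) are left out."""
    login_mesh = logins * (logins - 1) // 2
    central_to_login = centrals * logins
    central_to_map = centrals * maps
    return login_mesh + central_to_login + central_to_map

# Doubling every server count roughly quadruples the wiring:
print(internal_connections(2, 2, 4))   # 1 + 4 + 8  = 13
print(internal_connections(4, 4, 8))   # 6 + 16 + 32 = 54
```

That 13-to-54 jump for a doubling of hardware is the O(n^2) problem in miniature.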
The possible path of a data event is incredibly long, as well:
For the basic approach of a client requesting the current map location:
That's far too much latency on the system. No matter how optimized my code is, that's not going to hold up.
We first try to do it a little differently... When an event happens, the map servers notify all of the affected robots. The robots hold information about their local area, and can immediately report area changes the next time a client wants a full screen.
So for an area request from a client, it now becomes this:
Er, whoops. That still sucks. The whole thing doesn't scale anyway, so we go to my last graph...
GRAPH #4: Goblin's next draft.
This graph would look a whole lot better if I knew what I was doing with graphviz, but this is all I could muster in five minutes. This also *is not* the finished design.
The connections have completely changed:
A login server holds one connection to a secondary login server, one central server, all of the robot servers it's instructed to handle, and one for every client.
A central server holds one connection to each robot it's instructed to handle, one login server, and the map servers it's instructed to handle.
The map servers hold one connection to their central server, and to all bordering map servers.
The robot servers hold one connection to their login server and to their central server.
Now, still working with internal server-push for events, we have:
A login server is just a big network queue, so latency through that hop is nearly non-existent. All internal events are server-push toward the robot server or directly to the login server.
A player on robot1 decides to move north one square. Internally, the robot gets the move request and sends it to the central server; the central server forwards the event to the right map server. The map server figures out all of the robots which can see the movement, and sends the request back to the central server, tailed with the list of affected robots. The central server sends this back to each affected robot server (only one network operation per hop! Everything's done in batch), which notes the update. The robots can then immediately send the client an update, or queue it for the next time the client asks about the screen data.
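The hop sequence above can be sketched as a few cooperating classes. All names and structures here are my own illustration of the batched fan-out, not Goblin's actual code; "who watches which tile" is reduced to a bare dictionary.

```python
class RobotServer:
    def __init__(self, name):
        self.name = name
        self.pending = []          # updates queued for clients

    def notify(self, events):
        # one batched network operation delivers every relevant event
        self.pending.extend(events)

class MapServer:
    def __init__(self):
        self.watchers = {}         # tile -> robot-server names that see it

    def affected_robots(self, event):
        return self.watchers.get(event["to"], set())

class CentralServer:
    def __init__(self, map_server, robot_servers):
        self.map_server = map_server
        self.robot_servers = robot_servers   # name -> RobotServer

    def handle_move(self, event):
        # forward to the map server, get the affected-robot list back,
        # then fan out one batch per affected robot server
        for name in self.map_server.affected_robots(event):
            self.robot_servers[name].notify([event])

r1, r2 = RobotServer("robot1"), RobotServer("robot2")
world = MapServer()
world.watchers[(5, 6)] = {"robot1", "robot2"}
central = CentralServer(world, {"robot1": r1, "robot2": r2})

# a player moves north one square; both robot servers note the update
central.handle_move({"obj": "player", "to": (5, 6)})
```

The point of the shape is that nothing ever loops: each hop either forwards once or fans out once, so the path length stays fixed no matter how many servers exist.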
This setup has a problem: robots can't go to maps not connected directly to their central server. If they could, the movement events would have to travel across map borders, which is latency-happy and unscalable.
I am thinking of either having the robot switch which central server it talks to (which requires each robot server to be able to talk to all central servers), or some solution I haven't come up with yet. Moving a player between robot servers carries too much overhead, and breaks the "can't overload one zone" fix, since robot servers would get overloaded instead.
I'm fairly sure that the next best thing to a perfect solution would be to have all robot servers hold a connection to all central servers. If a robot is in the border area of a map server, and crossing that border means moving to a different central server, the robot process warns the next central server that it might be jumping over in a minute. That central server then prepares memory and structures for the robot's transfer. When the transfer actually happens, it works in parallel: the robot tells the new central server it's officially moved and the old one it's officially left, and the map servers pass the robot object between each other. So the entire "zone" happens in parallel using cheap operations.
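The warn-then-flip handoff can be sketched as a two-phase protocol. This assumes every robot server can reach every central server, as proposed above; the class and method names are my own stand-ins.

```python
class Central:
    def __init__(self):
        self.robots = set()        # robot servers officially here
        self.reserved = set()      # warned us; structures pre-allocated

    def prewarm(self, robot_id):
        # phase 1: the robot entered a border area and might jump
        # over soon, so get memory and structures ready now
        self.reserved.add(robot_id)

    def commit(self, robot_id):
        # phase 2: the robot says it has officially moved over
        self.reserved.discard(robot_id)
        self.robots.add(robot_id)

    def release(self, robot_id):
        # phase 2: the robot says it has officially left
        self.robots.discard(robot_id)

old, new = Central(), Central()
old.robots.add("robot3")

new.prewarm("robot3")      # border entry: expensive prep happens early
new.commit("robot3")       # actual transfer: two cheap flips, done in
old.release("robot3")      # parallel while the map servers swap the object
```

Because all the expensive allocation happened at prewarm time, the actual "zone" moment is just bookkeeping on both sides.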
This does not solve the network complexity problem, though. This has simply moved the problem to a different side of the graph, and shortened the paths significantly.
The map servers' border system is how I plan to avoid costly zoning. There is a buffer area on the map between two map servers. When an object gets a ways into this area, the two map servers start exchanging data about it, and they only do this once. So the "border" where an object can actually get spit across zones is many tiles wide: a player stepping over it would have to backtrack ten steps to go back to the other server. This avoids the object "flapping" between servers.
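The anti-flapping rule is classic hysteresis, and can be sketched in a few lines. Assume a "west" server owns columns below border_x and the buffer is the ten columns starting there; the names and the exact geometry are my own stand-ins.

```python
BUFFER = 10   # tiles an object must backtrack, per the text

def owner_after_move(x, border_x, current_owner):
    """Return which map server ('west' or 'east') owns an object at
    column x after a move. Ownership only flips once the object has
    crossed the entire buffer, so it can't flap between servers."""
    if current_owner == "west" and x >= border_x + BUFFER:
        return "east"     # walked all the way through the buffer
    if current_owner == "east" and x < border_x:
        return "west"     # backtracked the full ten tiles
    return current_owner  # anywhere inside the buffer: no change
```

An object pacing back and forth five tiles into the buffer stays with whichever server already owns it, so no handoff traffic is generated.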
Also of note, I plan to have the map servers use a dynamic border system. If a map server detects that it's overloaded (i.e. it doesn't spend enough time polling the network for new requests), it will try to locate the biggest player cluster, and then slowly shrink its border toward that cluster until it starts spending the required amount of time idling again (to stay as responsive as possible). This way the server systems can handle very large numbers of players hobbling from one side of the map to the other (presumably during large player- or GM-run events).
If you feel like ripping me to shreds, or have a solution for my O(n^2) scaling and/or shortest-path problems, I would really love to hear it.