I'm also thinking of giving up on advancing the high-level design any further. I'll leave it the way it is with the newer positions (with the shorter pathing) and implement the damn thing. Is this good or bad? Input would be nice, even though there likely won't be any. Are there any communities where I might have some luck linking this?
The map servers are a big part of the scaling problem. They have a complex bordering system, their jobs are fairly intensive, they require a lot of network links, and abstracting them slows down the system. So why not just get rid of them? Toss the whole idea of a central map server out the window.
What could replace it is this:
The map servers already get their map data from the central servers, which pull it from the main DB. Add a local disk cache (and memory cache, I guess) to the central servers for map data.
Have each of the robots on the robot servers keep two to four times as much nearby map data as they would otherwise.
Then add a new protocol for the robots. Instead of consulting a map server, each robot would only consult its local data. When a robot moves (either every time, or every 2-4 squares), it would broadcast a message saying "Here I am!" along with its global coordinates. All of the robots on the local server would get this message, and one network event would be sent to all other robot servers.
A robot will receive this message if it is within a certain range of the remote robot's coordinates. It will note where that robot is on its map, then carry on with its duties.
When a robot first starts up, it would send something more like "Here I am, who can see me?", and any robots that can see it would reply with their own coordinates.
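A rough sketch of that protocol in Python, just to make the message flow concrete. Everything named here (`Robot`, `RobotServer`, `view_range`, `wants_reply`) is made up for illustration, the "network" is plain method calls between server objects, and the range check assumes a square neighborhood on a grid. Treat it as one possible shape, not the design.

```python
from dataclasses import dataclass, field


@dataclass
class Robot:
    name: str
    x: int
    y: int
    view_range: int = 10  # assumed visibility radius, in squares
    known: dict = field(default_factory=dict)  # name -> last seen (x, y)

    def hears(self, x, y):
        # Square neighborhood: within view_range on both axes.
        return max(abs(self.x - x), abs(self.y - y)) <= self.view_range

    def on_here_i_am(self, sender, x, y):
        # Ignore our own broadcast and anything out of range.
        if sender == self.name or not self.hears(x, y):
            return False
        self.known[sender] = (x, y)
        return True


class RobotServer:
    def __init__(self):
        self.robots = []
        self.peers = []  # the other robot servers
        self.network_events_sent = 0

    def deliver(self, sender, x, y, wants_reply):
        # Hand the message to every local robot; collect replies if asked.
        replies = []
        for robot in self.robots:
            if robot.on_here_i_am(sender, x, y) and wants_reply:
                replies.append((robot.name, robot.x, robot.y))
        return replies

    def broadcast(self, robot, wants_reply=False):
        # Local robots get the message directly; each peer costs one event.
        replies = self.deliver(robot.name, robot.x, robot.y, wants_reply)
        for peer in self.peers:
            self.network_events_sent += 1
            replies += peer.deliver(robot.name, robot.x, robot.y, wants_reply)
        for name, x, y in replies:
            robot.known[name] = (x, y)
```

A normal move would call `server.broadcast(robot)`, and startup would call `server.broadcast(robot, wants_reply=True)` to get the "who can see me?" answers back.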
This has some good benefits:
A) Removes border control needs
B) The "zones" where overcrowding can happen are too small to actually get overcrowded.
C) No map servers.
D) No reason to have the entire map loaded at once anywhere.
But the bad things:
A) The more robot servers that exist, the more network broadcast events each one gets. I don't think this is O(n^2), but not much less. Too lazy to figure it out.
B) Less flexibility when dealing with the entire map.
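For bad thing A, here is a back-of-the-envelope count. It assumes every broadcast costs one network event per other server, every robot moves once per tick, and every server hosts the same number of robots. None of that is settled, and `broadcast_load` and its parameters are names invented for this sketch.

```python
def broadcast_load(servers, robots_per_server, moves_per_broadcast=1):
    """Rough network events per tick under the assumptions above.

    Returns (events received by one server, events across the whole cluster).
    """
    broadcasts_per_server = robots_per_server / moves_per_broadcast
    # Each broadcast goes out once to every *other* server.
    sent_per_server = broadcasts_per_server * (servers - 1)
    # By symmetry, each server also receives that many.
    received_per_server = sent_per_server
    total = sent_per_server * servers
    return received_per_server, total
```

Under those assumptions, if robots per server stays fixed, the load on any one server grows linearly with the number of servers, while the total across the cluster grows like n * (n - 1), which is O(n^2). Broadcasting every 4 squares instead of every square divides both numbers by 4 but doesn't change the growth.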
My other stupid idea keeps the normal zone system but adds redundancy.
Each set of zones is represented by two machines. All zones are running on both. The connection server sends all events to both machines, but only sends one return event back. It will not send the message if the result messages from both servers do not match. It can do some rudimentary sanity checking to see which server it should believe more, and which one it should raise alarms about.
This is resource heavy, but adds the redundancy that doesn't otherwise exist.
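The compare-and-forward step on the connection server could look something like this. It is a sketch only: `dispatch`, `replicas`, and `is_sane` are hypothetical names, the two machines are stood in for by plain functions, and what counts as a "sane" result is left to the caller.

```python
def dispatch(event, replicas, is_sane):
    """Send one event to every replica and decide what to return.

    replicas: dict of machine name -> function(event) -> result
    is_sane:  function(event, result) -> bool, a rudimentary sanity check

    Returns (reply, alarms). reply is None when the results disagree.
    """
    results = {name: handle(event) for name, handle in replicas.items()}
    values = list(results.values())

    # Results match: send a single reply back, nothing to report.
    if all(value == values[0] for value in values):
        return values[0], []

    # Mismatch: withhold the reply and work out who to raise alarms about.
    suspects = [name for name, result in results.items()
                if not is_sane(event, result)]
    if not suspects:
        # The sanity check can't tell them apart, so flag every replica.
        suspects = list(results)
    return None, suspects
```

This version never sends a reply on a mismatch, which matches the description above. Whether the connection server should instead trust the replica that passes the sanity check is a separate decision.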