Client Server Data Representation

The difference is how identity is seen by third parties. The slave’s identity is granted by the master, and if the master switches slaves, third parties scarcely notice. It is the same identity. The client’s identity is granted by the host, and if the client switches hosts, the client gets a new identity, as for example a new email address.

If we use PAKE and OPAQUE for client login, then all other functionality of the server is unchanged, regardless of whether the server is a host or a slave. It is just that in the client case, changing servers is going to change your public key.

Experience with bitcoin is that a division of responsibilities, as between Wasabi wallet and Bitcoin core, is the way to go - that the peer to peer networking functions belong in another process, possibly running on another machine, possibly running on the cloud.

You want a peer on the blockchain to be well connected with a well known network address. You want a wallet that contains substantial value to be locked away and seldom on the internet. These are contradictory desires, and contradictory functions. Ideally one would be in a basement and generally turned off, the other in the cloud and always on.

Plus, I have come to the conclusion that C and C++ just suck for networking apps. Probably a good idea to go Rust for the slave or host. The wallet is event oriented, but only has a small number of concurrent tasks. A host or slave is event oriented, but has a potentially very large number of concurrent tasks. Rust has no good gui system, there is no wxWidgets framework for Rust. C++ has no good massive concurrency system, there is no Tokio for C++.

3 the select problem

To despatch an io event, the standard is select(). Which standard sucks when you have a lot of sockets to manage.

The recommended method for servers with massive numbers of clients is overlapped IO, of which Wikipedia says:

Which kind of hints that there might be a clean mapping between Windows OVERLAPPED and Linux AIO.

Because generating and reading the select() bit arrays takes time proportional to the largest fd that you provided for select(), select() scales terribly when the number of sockets is high.

Different operating systems have provided different replacement functions for select. These include WSAPoll(), epoll(), kqueue(), and evports(). All of these give better performance than select(), all give O(1) performance for adding a socket, removing a socket, and for noticing that a socket is ready for IO. (Well, epoll() does when used in edge triggered (EPOLLET) mode. It has a poll() compatibility mode which fails to perform when you have a large number of file descriptors.)

Windows has WSAPoll(), which can be a blocking call, but if it blocks indefinitely, the OS will send an alert callback to the paused thread (asynchronous procedure call, APC) when something happens. The callback cannot do another blocking call without crashing, but it can do a nonblocking poll, followed by a nonblocking read or write as appropriate. This is analogous to the Linux epoll(), except that epoll() becomes ungodly slow, rather than crashing. The practical effect is that “wait forever” becomes “wait until something happens that the APC did not handle, or that the APC deliberately provoked”.

Using the APC in Windows gets you behavior somewhat similar in effect to using epoll() with EPOLLET in Linux. Not using the APC gets you behavior somewhat similar in effect to Linux poll() compatibility mode.

Unfortunately, none of the efficient interfaces is a ubiquitous standard. Windows has WSAPoll(), Linux has epoll(), the BSDs (including Darwin) have kqueue(), … and none of these operating systems has any of the others. So if you want to write a portable high-performance asynchronous application, you’ll need an abstraction that wraps all of these interfaces, and provides whichever one of them is the most efficient.

The Libevent API wraps the efficient replacements of various Unix-like operating systems, but unfortunately missing from its list is the Windows efficient replacement.

The way to make them all look alike is to make them look like event handlers that have a pool of threads that fish stuff out of a lock free priority queue of events, create more threads capable of handling this kind of event if there is a lot of stuff in the queue and more threads are needed, and release all threads but one that sleeps on the queue if the queue is empty and stays empty.

Trouble is that windows and linux are just different. Except both support select, but everyone agrees that select really sucks, and sucks worse the more connections.

A Windows gui program with a moderate number of connections should use Windows asynchronous sockets, which are designed to deliver events on the main Windows gui event loop, designed to give you the benefits of a separate networking thread without the need for a separate networking thread. Linux does not have asynchronous sockets. Windows servers should use overlapped io, because they are going to need ten thousand sockets, and they do not have a window.

Linux people recommended a small number of threads, reflecting real hardware threads, and one edge triggered epoll() per thread, which sounds vastly simpler than what windows does.

I pray that wxWidgets takes care of mapping Windows asynchronous sockets to their near equivalent functionality on Linux.

But writing a server/host/slave for Linux is fundamentally different to writing one for Windows. Maybe we can isolate the differences by having pure Windows sockets, startup and shutdown code, pure Linux sockets, startup and shutdown code, having the sockets code stuff data to and from lockless priority queues (which revert to locking when a thread needs to sleep or startup). Or maybe we can use wxWidgets. Perhaps worrying about this stuff is premature optimization. But the samples directory has no service examples, which suggests that writing services in wxWidgets is a bad idea. And it is an impossible idea if we are going to write in Rust.

Tokio, however, is a Rust framework for writing services, which runs on both Windows and Linux. Likely Tokio hides the differences, in a way optimal for servers, as wxWidgets hides them in a way optimal for guis.

4 the equivalent of RAII in event oriented code

This is how a server can have ten thousand tasks dealing with ten thousand clients.

Implemented in C++20 as co_return, co_await, and co_yield, co_yield being the C++ equivalent of Rust’s poll. But C++20 has no standard coroutine libraries, and various people’s half baked ideas for a coroutine library don’t seem to be in actual use solving real problems just yet, while actual people are using the Rust library to solve real world problems.

I have read reviews by people attempting to use C++20 co-routines, and the verdict is that they are useless and unusable.

Boost fibres provide multiple stacks on a single thread of execution. But the consensus is that fibres just massively suck.

But, suppose we don’t use the stack. We just put everything into a struct and disallow recursion (except by creating a new struct). Then we have the functionality of fibres and coroutines, with code continuations.
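A minimal sketch of what "everything in a struct" could look like. The task, its framing (a four digit length header followed by a body), and every name here are hypothetical, chosen only to show the shape: the locals that a coroutine would keep on its stack live in the struct, and each call to step() resumes where the last one left off.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical task: read a 4 digit decimal length header, then that many
// bytes of body. All "local variables" live in the struct, not on a stack.
struct ReadMessage {
    enum class State { Header, Body, Done };
    State state = State::Header;
    std::string header;        // header bytes seen so far
    std::string body;          // body bytes seen so far
    std::size_t body_len = 0;  // known once the header is complete

    // Feed one chunk of input. Returns the body once it is complete,
    // otherwise nullopt, meaning "call me again when more data arrives".
    std::optional<std::string> step(const std::string& chunk) {
        for (char c : chunk) {
            switch (state) {
            case State::Header:
                header.push_back(c);
                if (header.size() == 4) {
                    body_len = std::stoul(header);  // throws on a non-numeric header
                    state = body_len ? State::Body : State::Done;
                }
                break;
            case State::Body:
                assert(body.size() < body_len);
                body.push_back(c);
                if (body.size() == body_len) state = State::Done;
                break;
            case State::Done:
                break;  // this sketch ignores trailing input
            }
        }
        if (state == State::Done) return body;
        return std::nullopt;
    }
};
```

Ten thousand of these cost ten thousand small structs rather than ten thousand stacks. The price is that the control flow is turned inside out into a switch, which is exactly the bookkeeping that await is supposed to write for you.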

Word is that co_return, co_await, and co_yield do stuff that is complicated, difficult to understand, and frequently surprising and not what you want, but with std::future, you can reasonably straightforwardly do massive concurrency, provided you have your own machinery for scheduling tasks. Maybe we do massive concurrency with neither fibres, nor coroutines – code continuations or close equivalent.

C++20 coroutines seem to be designed for the case of two tasks each of which sees the other as a subroutine, while the case that actually matters in practice is a thousand tasks holding a relationship with a thousand clients. (Javascript’s async). It is far from obvious how one might do what Javascript does using C++20 coroutines, while it is absolutely obvious how to do it with Goroutines.

4.1 Massive concurrency in Rust

The way Rust does things is that the input that you are waiting for is itself a future, and that is what drives the cooperative multi tasking engine.

When the event happens, the future gets flagged as fulfilled, so the next time the polling loop is called, co_yield never gets called. And the polling loop in your await should never get called, except the event arrives on the event queue. The Tokio tutorial explains the implementation in full detail.

From the point of view of procedural code, await is a loop that endlessly checks a condition, calls yield if the condition is not fulfilled, and exits the loop when the condition is fulfilled. But you would rather it does not return from yield/poll until the condition is likely to have changed. And you would rather the outermost future pauses the thread if nothing has changed, if the event queue is empty.
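The loop described above can be sketched in C++ without any coroutine machinery. This is an illustration of the idea only, not how Tokio is implemented: Poll, Task, and Executor are hypothetical names, and wake_all() is a crude stand-in for a real waker, which would re-queue only the task whose event arrived.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

enum class Poll { Pending, Ready };

// A task is anything that can be polled. It returns Pending when the
// condition it is waiting on has not yet been fulfilled.
using Task = std::function<Poll()>;

struct Executor {
    std::deque<Task> run_queue;  // tasks that may be able to make progress
    std::deque<Task> parked;     // tasks waiting for an event

    void spawn(Task t) {
        assert(t);  // an empty std::function cannot be polled
        run_queue.push_back(std::move(t));
    }

    // An event arrived, so parked tasks may now be able to progress.
    void wake_all() {
        while (!parked.empty()) {
            run_queue.push_back(std::move(parked.front()));
            parked.pop_front();
        }
    }

    // Poll each runnable task once. Pending tasks are parked and are not
    // polled again until woken, so nothing spins. Returns the number of
    // tasks that finished.
    std::size_t run_ready() {
        std::size_t finished = 0;
        while (!run_queue.empty()) {
            Task t = std::move(run_queue.front());
            run_queue.pop_front();
            if (t() == Poll::Ready) ++finished;
            else parked.push_back(std::move(t));
        }
        return finished;
    }
};
```

When run_ready() returns with nothing left in the run queue, that is the point at which the outermost loop would pause the thread on the event queue.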

The right way to implement this is have the stack as a tree. Not sure if Tokio does that. C++20 definitely does not – but then it does not do anything. It is a pile of matchsticks and glue, and they tell you to build your own boat.

The mini Tokio tutorial shows you how to implement your own efficient futures in Rust, and, because at the bottom you are always awaiting an efficient future, all your futures will be efficient. You have, however, all the tools to implement an inefficient future, and if you do, there will be a lot of spinning. So if everyone is inefficiently waiting on a future that is inefficiently waiting on a future that is waiting on a network event or timeout, and the network events and timeout futures are implemented efficiently, you are done.

If you cheerfully implement an inefficient future, which however calls an efficient future, it stops spinning.

Multithreading, as implemented in C++, Rust and Julia, does not scale to huge numbers of concurrent processes the way Go does.

So, if you need asynch, you need Rust. C++ is build your own boat out of matchsticks and glue.

The await asynch syntax and semantics are, in effect, multithreading on cheap threads that only have cooperative yielding.

So you have four real threads, and ten thousand tasks, the effective equivalent of ten thousand cheap “threads”.

I conjecture that the underlying implementation is that the asynch await keywords turn your stack into a tree, and each time a branch is created and destroyed, it costs a small memory allocation/deallocation.

With real threads, each thread has its own full stack, and stacks can be costly, while with await/asynch, each task is just a small branch of the tree. Instead of having one top of stack, you have a thousand leaves with one root at start of thread, while having ten thousand full stacks would bring your program to a grinding halt.

It works like an event oriented program, except the message pumps do not have to wait for events to complete. Tasks that are waiting around, such as the message pump itself, can get started on the next thing, while the messages it dispatched are waiting around.

As recursing piles more stuff on the stack, asynching branches the stack, while masses of threads give you masses of stacks, which can quickly bring your computer to a grinding halt.

Resource acquisition, disposition, and release depend on network and timer events.

RAII guarantees that the resource is available to any function that may access the object (resource availability is a class invariant, eliminating redundant runtime tests). It also guarantees that all resources are released when the lifetime of their controlling object ends, in reverse order of acquisition.

In a situation where a multitude of things can go wrong, but seldom do, you would, without RAII, wind up with an exponentially large number of seldom tested code paths for backing out of the situation. RAII means that all the possibilities are automagically taken care of in a consistent way, and you don’t have to think about all the possible combinations and permutations.

RAII plus exceptions shuts down a potentially exponential explosion of code paths.

Our analog of the situation that RAII deals with is that we dispatch messages A and B, and create the response handler for B in the response handler for A. But A might fail, and B might get a response before A.

With await asynch, we await A, then await B, and if B has already arrived, our await for B just goes right ahead, never calling yield, but removing itself from the awake notifications.

Go already has technology and idiom for message handling. Maybe the solution for this problem is not to re-invent Go technology in C++ using perfect forwarding and lambda functions, but to provide a message interface to Go in C.

But there is less language mismatch between Rust and C++ than between Go and C++.

And maybe C++20 has arrived in time, have not checked the availability of co_await, co_return, and co_yield.

On the other hand, Go’s substitute for RAII is the defer statement, which presupposes that a resource is going to be released at the same stack level as it was acquired, whereas when I use RAII I seldom know, and it is often impossible to predict, at what stack level a resource should be released, because the resource is owned by a return value, typically created by a constructor.

On checking out Go’s implementation of message handling, it is all things for which C++ provides the primitives, and Go has assembled the primitives into very clean and easy to use facilities. Which facilities are not hard to write in C++.

The clever solution used by Go is typed, potentially bounded channels, with a channel being able to transmit channels, and the select statement. You can also do all the hairy shared memory things you do in C++, with less control and elegance. But you should not.
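To back the claim that these facilities are not hard to write in C++, here is a minimal sketch of a typed, bounded channel using only the standard library. Channel and its member names are my own, not any existing library's API, and this covers only send, receive, and close. Go's select statement, which waits on several channels at once, is the hard part and is not attempted here.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

template <typename T>
class Channel {
public:
    explicit Channel(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);  // this sketch has no unbuffered (rendezvous) mode
    }

    // Blocks while the channel is full. Returns false if the channel is closed.
    bool send(T value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return closed_ || queue_.size() < capacity_; });
        if (closed_) return false;
        queue_.push_back(std::move(value));
        not_empty_.notify_one();
        return true;
    }

    // Blocks while the channel is empty. Returns nullopt once the channel
    // is closed and everything already sent has been drained.
    std::optional<T> receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return closed_ || !queue_.empty(); });
        if (queue_.empty()) return std::nullopt;
        T value = std::move(queue_.front());
        queue_.pop_front();
        not_full_.notify_one();
        return value;
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        not_full_.notify_all();
        not_empty_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
    std::deque<T> queue_;
    const std::size_t capacity_;
    bool closed_ = false;
};
```

A channel that transmits channels would be something like Channel<std::shared_ptr<Channel<int>>>. Note that this blocks a real thread on a full or empty channel, where a goroutine would merely be parked.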

This is an implementation of Communicating Sequential Processes, which is that input, output, and concurrency are useful and effective primitives, that can elegantly and cleanly express algorithms, even if they are running on a computer that physically can only execute a single thread, that concurrency as expressed by channels is not merely a safe way of multithreading, but a clean way of expressing intent and procedure to the computer.

Goroutines are less than a thread, because they are using some multiplexed thread’s stack. They live in an environment where a stable and small pool of threads is despatched to function calls, and when a goroutine is stopped, because it is attempting to communicate, its stack state, which is usually quite small, is stashed somewhere without destroying and creating an entire thread. They need to be lightweight, because they are used to express algorithms, with parallelism not necessarily being an intended side effect.

The relationship between goroutines and node.js continuations is that a continuation is a small packet of state that will receive an event, and a paused goroutine is a small packet of state that will receive an event. Both approaches seem comparably successful in expressing concurrent algorithms, though node.js is single threaded.

Node.js uses async/await, and by and large, more idiots are successfully using node.js than successfully using Go, though Go solutions are far more lightweight than node.js solutions, and in theory Rust should be still lighter.

So maybe I do need to invent a C++ idiom for this problem. Well, a Rust library already has the necessary idiom. Use Tokio in Rust. Score for powerful macro language and sum types. Language is more expandable.

5 Unit Test

It is hard to unit test a client server system, therefore, most people unit test using mocks: Fake classes that do not really interact with the external world replace the real classes, if you perform that part of the unit test that deals with external interaction with clients and servers. Your unit test runs against a dummy client and a dummy server – thus the unit test code necessarily differs from the real code.

But this will not detect bugs in the real class being mocked, which therefore has to be relatively simple and unchanging – not that it is necessarily all that practical to keep it simple and unchanging, consider the messy irregularity of TCP.

Any message is an event, and it is a message between an entity identified by one elliptic point, and an entity identified by another elliptic point.

We intend that a process is identified by many elliptic points, so that it has a stable but separate identity on every server. Which implies that it can send messages to itself, and these will look like messages from outside. The network address will be an opaque object. Which is not going to help us unit test code that has to access real network addresses, though our program undergoing unit test can perform client operations on itself, assuming in process loopback is handled correctly. Or maybe we just have to assume a test network, and our unit test program makes real accesses over the real internet.

But our basic architecture is that we have an opaque object representing a communication node, it has a method that creates a connection, and you can send a message on a connection, and receive a reply event on that connection.

Sending a message on a connection creates the local object that will handle the reply, and this local object’s lifetime is managed by hash code tables – or else this local object is stored in the database, written to disk in the event that sends the message, and read from disk in the event that handles the reply to the message.

Object representing server 🢥 Object representing connection to server 🢥 object representing request-response.

We send messages between entities identified by their elliptic points, we get events on the receiving entity when these messages arrive, generate replies, and get an event on the sending entity when the reply is received.

And one of the things in these messages will be these entities and information about these entities.

So we create our universal class, which may be mocked, whereby a client takes an opaque data structure representing a server, and makes a request, thereby creating a channel, on which channel it can create additional requests. It can then receive a reply on this channel, and make further requests, or replies to replies, sequentially on this channel.

We then layer this class on top of this class – as for example setting up a shared secret, timing out channels and establishing new ones, so we have as much real code as possible, implementing request object classes in terms of request object classes, so that we can mock any one layer in the hierarchy.

At the top layer, we don’t know we are re-using channels, and don’t know we are re-using secrets – we don’t even keep track of the transient secret scalar and transient shared secret point, because that might be discarded and reconstructed. All this stuff lives in an opaque object representing the current state of our communication with the server, which is, at the topmost level, identified by a database record, and/or an object instantiated from a database record and/or a handle to that object and/or a hash code to that handle.

Since we are using an opaque object of an opaque type, we can freely mix fake objects with real ones. Unit test will result in fake communications over fake channels with fake external clients and servers.
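A minimal sketch of how fake and real objects could share one opaque type. The interface names (Node, Connection, ReplyHandler) and the canned reply table are hypothetical, invented for this illustration. A real implementation would derive from the same two abstract classes and talk to the network. This fake replies synchronously, which a real connection would not.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Opaque connection: send a request, get a reply event on that connection.
class Connection {
public:
    using ReplyHandler = std::function<void(const std::string& reply)>;
    virtual ~Connection() = default;
    virtual void send(const std::string& request, ReplyHandler on_reply) = 0;
};

// Opaque communication node: it has a method that creates a connection.
class Node {
public:
    virtual ~Node() = default;
    virtual std::unique_ptr<Connection> connect() = 0;
};

// Fake connection: replies come from a canned table and are delivered at once.
class FakeConnection : public Connection {
public:
    explicit FakeConnection(std::map<std::string, std::string> canned)
        : canned_(std::move(canned)) {}
    void send(const std::string& request, ReplyHandler on_reply) override {
        assert(on_reply);
        auto it = canned_.find(request);
        on_reply(it != canned_.end() ? it->second : "message not acknowledged");
    }
private:
    std::map<std::string, std::string> canned_;
};

class FakeNode : public Node {
public:
    explicit FakeNode(std::map<std::string, std::string> canned)
        : canned_(std::move(canned)) {}
    std::unique_ptr<Connection> connect() override {
        return std::make_unique<FakeConnection>(canned_);
    }
private:
    std::map<std::string, std::string> canned_;
};

// Code under test sees only the abstract types, so it cannot tell whether
// it is talking to a fake server or a real one. It relies on the reply
// having arrived by the time send() returns, which holds only for the fake.
std::string fetch_greeting(Node& server) {
    std::string result;
    auto conn = server.connect();
    conn->send("hello", [&result](const std::string& reply) { result = reply; });
    return result;
}
```

Because each layer is written against the abstract types, a layer that sets up shared secrets or times out channels can itself be a Node wrapping another Node, and any one layer can be swapped for a fake.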

6 Factorizing the problem

On the other hand OTR seems an unreasonably complicated way of adding on what you get for free with perfect forward secrecy. Authentication without signing is just the natural default for perfect forward secrecy, and signing has to be added on top. You get OTR (Off the Record) for free just by leaving stuff out. XMPP is a presence protocol, and presence is just name service, which is integral to any social networking system. Its pile of existing code supports Jitsi’s wonderful video conferencing system, which would be intolerably painful to reinvent.

And OMEMO just does not do the job. It guarantees you have a private room with the people you think you have a private room with, but how did you come to decide you wanted a private room with those people and not others? It leaves the hard part of the problem out of scope.

The problem is factoring a heap of problems that lack obvious boundaries between one problem and the next. You need to find the smallest factors that are factors of all these big problems – find a solution to your problems that is a suitable module of a solution to all these big problems.

But you don’t want to factorize all the way down, otherwise when you want a banana, you will get a banana tree, a monkey, and a jungle. You want the largest factors that are common factors of more than one problem that you have to solve.

And a factor that we identify is that we create a shared secret with a lifetime of around twenty minutes or so, longer than the lifetime of the TCP connections and longer than the query-response interactions, that ensures:

Another factor we identify is binding a group of remote object method calls together into a single one, so that they must fail together, of which problem a reliability layer on top of UDP is a special case. But we do not want to implement our own UDP reliability layer, when QUIC has already been developed and widely deployed. We notice that to handle this case, we need not an event object referenced by an event handle and an event hashcode, but rather an array of event objects referenced by an event handle, an event hashcode, and the sequence number within that array.

6.1 streams, secrets, messages, and authentication

To leak minimal metadata, we should encrypt the packets with XChaCha20-SIV, or use a random nonce. (Random nonces are conveniently the default case in libsodium’s crypto_box_easy.) The port should be random for any one server and any one client, to make it slightly more difficult to sweep up all packets using our encryption. Any time we distribute new IP information for a server, also distribute new open port information.

XChaCha20-SIV is deterministic encryption, and deterministic encryption will leak information unless every message sent with a given key is guaranteed to be unique - in effect, we have the nonce inside the encryption instead of outside. Each packet must contain a forever incrementing packet number, which gets repeated but with a different send time, and perhaps an incremented resend count, on reliable messaging resends. This gets potentially complicated, hard to maintain, and easy to break.

Neither protocol includes authentication. The crypto_box wraps the authentication with the encryption. You need to add the authenticator after encryption and before decryption, as crypto_box does. The principle of cryptographic doom is that if you don’t, someone will find some clever way of using the error messages your higher level protocol generates to turn it into a decryption/encryption oracle.

However the crypto_box_curve25519xchacha20poly1305_*easy* functions in crypto_box_curve25519xchacha20poly1305.h wrap it all together. You just have to call those instead of the crypto_box_*easy* functions. Which is likely to be a whole lot easier and safer than wrapping XChaCha20-SIV.

For each crypto_box function, there is a corresponding crypto_box_curve25519xchacha20poly1305 function, apart from some special cases that you probably should not be using anyway.

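namespace crypto_box{
   // «whatever» is a placeholder: write one alias per function suffix you use
   // (easy, open_easy, keypair, and so on).
   const auto& «whatever» = crypto_box_curve25519xchacha20poly1305_«whatever»;
}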

Nonces are intended to be communicated in the clear, thus sequential nonces inevitably leak metadata. Don’t use sequential nonces. Put the packet number and message number or numbers inside the authenticated encryption.

Each packet of a packetized message will contain the windowed message id of the larger message of which it is part, the id of the thread or thread pool that will ultimately consume it, the size of the larger message of which it is part, the number of packets in the larger message of which it is part, and its packet and byte position within that larger message. The repetition is required to handle out of order messages and messages with lost packets.

Message ids will be windowed sequential, and messages lost in entirety will be reliably resent because their packets will be reliably resent.

If we allocate each packet buffer from the heap, and free it when it is used, this does not make much of a dent in performance until we are processing well over a Gib/s.

So we can worry about efficient allocation after we have released software and it is coming under heavy load.

Another more efficient way would be to have a pool of 16KiB blocks, allocate one of them to a connection whenever that connection needs it, allocate packet buffers sequentially in a 16KiB block, incrementing a count, and free up packet buffers in the block when a packet is dealt with, decrementing the count. When the count returns to zero, the block goes back to the free pool, which is accessed in lifo order. Every few seconds the pool is checked, and if there are a number of blocks that have not been used in the last few minutes, we free them. We organize things so that an inactive connection has no packet buffers associated with it. But this is fine tuning and premature optimization.
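A minimal sketch of that scheme, assuming single threaded use and leaving out the periodic trimming of long idle blocks. Block, BlockPool, alloc_packet, and free_packet are hypothetical names for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

constexpr std::size_t kBlockSize = 16 * 1024;

struct Block {
    std::array<unsigned char, kBlockSize> bytes;
    std::size_t used = 0;  // bytes handed out so far, sequentially
    std::size_t live = 0;  // packet buffers handed out and not yet dealt with
};

class BlockPool {
public:
    // Take the most recently freed block (lifo), or allocate a fresh one.
    std::unique_ptr<Block> acquire() {
        if (free_.empty()) return std::make_unique<Block>();
        auto b = std::move(free_.back());
        free_.pop_back();
        return b;
    }
    // Return a block whose count has dropped to zero.
    void release(std::unique_ptr<Block> b) {
        assert(b && b->live == 0);
        b->used = 0;
        free_.push_back(std::move(b));
    }
    std::size_t idle() const { return free_.size(); }
private:
    std::vector<std::unique_ptr<Block>> free_;
};

// Carve the next packet buffer out of the block, or nullptr if it is full.
inline unsigned char* alloc_packet(Block& b, std::size_t n) {
    if (n > kBlockSize - b.used) return nullptr;
    unsigned char* p = b.bytes.data() + b.used;
    b.used += n;
    ++b.live;
    return p;
}

// A packet has been dealt with. Returns true when the block is empty and
// can go back to the pool.
inline bool free_packet(Block& b) {
    assert(b.live > 0);
    return --b.live == 0;
}
```

The lifo order means a recently used block, still warm in cache, is the first to be handed out again, and blocks at the bottom of the pool are the ones that go stale and can be trimmed.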

The recipient will nack the sender about any missing packets within a multipacket message. The sender will not free up any memory containing packets that have not been acked, and the receiver will not free up any memory that has not been handled by the thread that ultimately receives the data.
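The per packet fields listed earlier, and the nack, might be sketched like this. The field widths and all the names are assumptions made for illustration, not a wire format. The receiver tracks which packets of a multipacket message have arrived, and missing() is the list it would nack.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per packet metadata. Every packet repeats the facts about
// the larger message so that any packet can arrive first.
struct PacketHeader {
    std::uint32_t message_id;     // windowed id of the larger message
    std::uint32_t consumer_id;    // thread or thread pool that will consume it
    std::uint32_t message_bytes;  // size of the larger message
    std::uint32_t packet_count;   // number of packets in the larger message
    std::uint32_t packet_index;   // this packet's position among those packets
    std::uint32_t byte_offset;    // this packet's byte position in the message
};

// Receiver side bookkeeping for one multipacket message.
class Reassembly {
public:
    explicit Reassembly(std::uint32_t packet_count) : seen_(packet_count, false) {}

    // Record a packet. Duplicates are ignored. Returns true once every
    // packet of the message has arrived.
    bool accept(const PacketHeader& h) {
        assert(h.packet_index < seen_.size());
        if (!seen_[h.packet_index]) {
            seen_[h.packet_index] = true;
            ++arrived_;
        }
        return arrived_ == seen_.size();
    }

    // Packet indices to nack to the sender.
    std::vector<std::uint32_t> missing() const {
        std::vector<std::uint32_t> out;
        for (std::uint32_t i = 0; i < seen_.size(); ++i)
            if (!seen_[i]) out.push_back(i);
        return out;
    }

private:
    std::vector<bool> seen_;
    std::size_t arrived_ = 0;
};
```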

Experimenting with memory allocation and deallocation times, it looks like a sweet spot is to allocate in 16KiB blocks, with the initial fifo queue being allocated with two 16KiB blocks as soon as activity starts, and the entire fifo queue deallocated when it is empty. If we allocated, deallocated when activity stops, and re-allocated every millisecond, it would not matter much, and we will be doing it far less often than that, because we will be keeping the buffer around for at least one round trip time. If every active queue has on average sixty four KiB, and we have sixteen thousand simultaneous active connections, it only costs a gigabyte. This rather arbitrary guesstimated value seems good enough that it does not waste too much memory, nor too much time. Memory for input output streams seems cheap, might as well cheerfully spend plenty, perhaps a lot more than necessary, so as to avoid hitting other limits.
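Checking the arithmetic, reading "sixteen thousand" as 2^14 = 16384 connections:

```latex
\[
64\ \text{KiB} \times 16384
  = 2^{16}\ \text{bytes} \times 2^{14}
  = 2^{30}\ \text{bytes}
  = 1\ \text{GiB}
\]
```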

We want connections, the shared secrets, identity data, and connection parameters, hanging around for a very long time of inactivity, because they are something like logins. We don’t want their empty data stream buffers hanging around. Re-establishing a connection takes hundreds of times longer than allocating and deallocating a buffer.

We also want, in a situation of resource starvation, to cut back the connections that are the heaviest users to wait. They should not send, until told space is available, and we just don’t make it available, because their buffer got emptied out, then thrown away, and they just have to wait their turn till the server clears them to get a new one allocated when they send data.

If the server has too much work, a whole lot of connections get idled for longer and longer periods, and while idled, their buffers are discarded.

When we have a real world application facing real world heavy load, then we can fuss about fine tuning the parameters.

The packet stream that is being resolved (the packets, their time of arrival and sending, that they were acked, nacked, ack status, and all that) goes into a first in first out random access queue, composed of fixed size blocks larger than the packet size.

We hope that C++ implements large random access fifo queues with mmap. If it does not, will eventually have to write our own.

Each block starts with metadata that enables the stream of fixed sized blocks to be interpreted as a stream of variable sized packets and the metadata about those packets. The block size in bits, and the size of the block and packet metadata, are negotiated, but initially only 4 KiB (32 Kibit) blocks will be supported. The format of metadata that is referenced or defined within packets is also negotiated, though initially the only format will be format number one. Obviously each side is free to define its own format for the metadata outside of packets, but it has to be the same size at both ends. Each party can therefore demand any metadata size it wants, subject to some limit, for metadata outside the packets.

The packets are aligned within the blocks so that the 512 bit blocks to be encrypted or decrypted are aligned with the blocks of the queue, so the blocks of the queue are always a multiple of 512 bits, 64 bytes, and block size is given as a multiple of 64 bytes. This will result in an average of thirty two bytes of space wasted positioning each packet to a boundary.

The pseudo random streams of encrypting information are applied with an offset that depends on the absolute position in the queue, which is why the queues have to have packets in identical position in both queues. Each block header contains unwindowing values for any windowed values in the packets and packet metadata, which unwindowing data is a mere 64 bits, but, since block and packet metadata size gets negotiated on each connection, this can be expanded without breaking backwards compatibility. The format number for packet references to metadata implies an unwindow size, but we initially assume that any connection only sends less than 2^64 512 bit packets, or rather, that packets plus the metadata required to describe those packets take up less than 2^73 bits, corresponding to a thousand Mbps

The packet position in the queue is the same at both ends, and is unwindowed in the block header.

The fundamental architecture of QUIC is that each packet has its own nonce, which is an integer of potentially sixty two bits, expressed in a form that is short for smaller integers, which is essentially my design, so I expect that I can use a whole lot of QUIC code.

It negotiates the AES session once per connection, and thereafter, it is sequential nonces all the way.

Make a new one time secret from a new one time public key every time you start a stream (pair of one way streams). Keeping one time secrets around for multiple streams, although it can in theory be done safely, gets startlingly complicated really fast, with the result that nine times out of ten it gets done unsafely.

Each two way stream is a pair of one way streams. Each encryption packet within a udp packet will have in the clear its stream number and a window into its stream position, the window size being log base two of the position difference between all packets in play, plus two, rounded up to the nearest multiple of seven. Its stream number is an index into shared secrets and stream states associated with this IP and port number.
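The window size rule can be written down as a small function. This is one reading of the rule, under stated assumptions: I take "log base two of the position difference" to mean the number of bits needed to represent the difference, and the function names are hypothetical. The multiple of seven presumably suits an encoding that carries seven bits per byte, but that is my guess.

```cpp
#include <cassert>
#include <cstdint>

// Number of bits needed to represent v (0 for v == 0). This stands in for
// log base two, rounded up.
constexpr unsigned bits_needed(std::uint64_t v) {
    unsigned n = 0;
    while (v != 0) {
        ++n;
        v >>= 1;
    }
    return n;
}

// Window size in bits for a given spread between the positions of all
// packets in play: log2(spread) plus two, rounded up to a multiple of seven.
constexpr unsigned window_bits(std::uint64_t spread) {
    const unsigned bits = bits_needed(spread) + 2;
    return (bits + 6) / 7 * 7;
}

// The part of a stream position that is sent in the clear: its low bits.
inline std::uint64_t windowed(std::uint64_t position, unsigned bits) {
    assert(bits > 0);
    if (bits >= 64) return position;  // window covers the whole position
    return position & ((std::uint64_t{1} << bits) - 1);
}
```

With a spread of a thousand positions, ten bits plus two makes twelve, which rounds up to a fourteen bit window.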

If initiating a connection in the clear (and thus unauthenticated) Alice sends Bob (in a packet that is not considered part of a stream) a konce (key used once, single use elliptic point A_o). She follows it, in the same packet and in a new encrypted but unauthenticated stream, proving knowledge of the scalar corresponding to the elliptic point by using the shared secret a_oB_d = b_dA_o, where B_d is Bob’s durable public key and b_d his durable secret key. In the encrypted but unauthenticated stream, she sends A_d, her durable public key (which may only be durable until the application is shut down), initiating a stream encrypted with (a_o+a_d)B_d = b_d(A_o+A_d), or more precisely, symmetrically encrypted with the 384 bit hash of that elliptic point and the one way stream number.
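Why both sides arrive at the same point: writing G for the base point of the curve (a symbol introduced here, not used above), each public key is its secret scalar times G, so A_o = a_o G, A_d = a_d G, and B_d = b_d G. Scalar multiplication commutes and distributes over point addition, hence:

```latex
\begin{align*}
(a_o + a_d)\,B_d &= (a_o + a_d)\,b_d\,G \\
                 &= b_d\,(a_o G + a_d G) \\
                 &= b_d\,(A_o + A_d)
\end{align*}
```

Alice computes the left hand side from her two secret scalars and Bob's public key. Bob computes the right hand side from his secret scalar and Alice's two public keys. Setting a_d to zero gives the unauthenticated case a_oB_d = b_dA_o.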

All this stuff happens during the handshake, and when we allocate a receive buffer, we have a shared secret. The sender may only send up to the size of the receive buffer, and has to wait for acks which will announce more receive buffer.

There is no immediate reason to provide the capability to create a new differently authenticated stream from within an authenticated stream, for the use cases for that are better dealt with by sending authorizations for them, signed by the other party, under the existing authentication. Hence a one to one mapping between port number and durable authenticating elliptic point, with each authenticated stream within that port number deriving its shared secret from a konce, covers all the use cases that occur to me. We don’t care about making creating a login relationship efficient.

When the OS gives you a packet, it gives you the handle you associated with that network address and port number, and the protocol layer of the application then has to expand that into the receive stream number and packet position in the stream. After decrypting the streams within a packet, it then maps stream id and message id to the application layer message handler id. It passes the position of data within the message, but not the position within the stream, because you don’t want too many copies of the shared secret floating around, and because the application does not care.

Message data may arrive out of sequence within a message, but the protocol layer always sends the data in sequence to the application, and usually the application only wants complete messages, and does not register a partial message handler anyway.

Each application runs its own instance of the protocol layer, and each application is, as far as it knows or cares, sending messages identified by their receiver message handler and reply message handler to a party identified by its zooko id. A message always receives a reply, even if the reply is only “message acknowledged”, “message abandoned”, “message not acknowledged”, “timeout”, “graceful shutdown of connection”, or “ungraceful shutdown of connection”. The protocol layer maps these into encrypted sequential streams and onto message numbers within the stream when sending them out, and onto application ids, application message handlers and receiving zooko ids when receiving them.

But, if a message always receives a reply, the sender may want to know which message is being replied to. Which implies it always receives a handle to the sent message when it gets the reply. Which implies that the protocol layer has to provide unique reply ids for all messages in play where a substantive reply is expected from the recipient. (“Message received” does not need a reply id, because implicit in the reliable transport layer, but special casing such messages to save a few bytes per message adds substantially to complexity. Easier to have the recipient ack all packets and all messages every round trip time, even though acking messages is redundant, and identifying every message is redundant.)

This implies that the protocol layer gives every message a unique sixty four bit windowed id, with the window size sufficient to cover all messages in play, all messages that have neither been acked nor abandoned.

Suppose we are transferring one dozen eight terabyte disks in tiny fifty byte messages. And suppose that all these messages are in play, which seems unlikely unless we are communicating with someone on Pluto. Well, then we will run out of storage for tracking every message in play, but suppose we did not. Then forty one bits would suffice, so a sixty four bit message id suffices. And, since it is windowed, using the same windowing as we are using for stream packet 384 bit ids, we can always increase it without changing the protocol on the wire when we get around to sending messages between galaxies.

A windowed value represents an indefinitely large unsigned integer, but since we are usually interested in tracking the difference between two such values, we define subtraction and comparison on windowed values to give us ordinary signed integers, the largest precision integer that we can conveniently represent on our machine. Which will always suffice, for by the time we get around to enormous tasks, we will have enormous machines.
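Subtraction and comparison on windowed values might look like the sketch below. It assumes a sixteen bit window purely to keep the example readable, and the function names are invented for illustration. The result is only meaningful while the true difference fits in half the window.

```python
WINDOW_BITS = 16  # assumed width for the example only


def windowed(value: int, bits: int = WINDOW_BITS) -> int:
    """Keep only the low `bits` bits of an indefinitely large unsigned integer."""
    return value & ((1 << bits) - 1)


def windowed_sub(a: int, b: int, bits: int = WINDOW_BITS) -> int:
    """Signed difference a - b of two windowed values.

    Correct whenever the true difference lies in [-2**(bits-1), 2**(bits-1)).
    """
    half = 1 << (bits - 1)
    return ((a - b + half) & ((1 << bits) - 1)) - half


def windowed_less(a: int, b: int, bits: int = WINDOW_BITS) -> bool:
    """True when windowed value a comes before windowed value b."""
    return windowed_sub(a, b, bits) < 0


# Two positions either side of a wrap of the sixteen bit window.
later = windowed(65540)    # wraps round to 4
earlier = windowed(65530)  # still 65530
difference = windowed_sub(later, earlier)
```

Here `difference` comes out as 10 even though the raw windowed values are 4 and 65530, which is the point of defining subtraction this way.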

Because each application runs its own protocol layer, it is simpler, though not essential, for each application to have its own port number on its network address and thus its own streams on that port number. All protocol layers use a single operating system udp layer. All messages coming from a single application in a single session are authenticated with at least that session and application, or with an id durable between sessions of the application, or with an id durable between the user using different applications on the same machine, or with an id durable to the user and used on different machines in different applications, though the latter requires a fair bit of potentially hostile user interface.

If the application wants to use multiple identities during a session, it initiates a new connection on a new port number in the clear. One session, one port number, at most one identity. Multiple port numbers, however, do not need nor routinely have, multiple identities for the same run of the application.

If we implement a QUIC large object layer, (and we really should not do this until we have working code out there that runs without it) it will consist of reliable request responses on top of groups of unreliable request responses, in which case the unreliable request responses will have a group request object that maps from their UDP origin and port numbers, and a sequence number within that group request object that maps to an item in an array in the group request operator.

6.1.1 speed

The fastest authenticated encryption algorithm is OCB - and on high end hardware, AES256OCB.

AES256OCB, despite having a block cipher underneath, has properties that make it possible to have the same API as xchacha20poly1305. (Encrypts and authenticates arbitrary length, rather than block sized, messages.)

One of these days I will produce a fork of libsodium that supports `crypto_box_ristretto25519aes256ocb.*easy.*`, but that is hardly urgent. Just make sure the protocol negotiation allows new ciphers to be dropped in.

7 Getting something up and running

I need to get a minimal system up that operates a database, does encryption, has a gui, runs unit tests, and synchronizes data with other systems.

We aim for a system that has a per user database identifying public keys related to user controlled secrets, and a local machine database relating public keys to IP numbers and port numbers. A port and IP address identifies a process, and a process may know the underlying secrets of many public keys.

The gui, the user interface, will allow you to enter a secret so that it is hot and online, optionally allow you to make a subordinate wallet, a ready wallet.

The system will be able to handle encryption, authentication, signatures, and perfect forward secrecy.

The system will be able to merge and floodfill the data relating public keys to IP addresses.

We will not at first implement capabilities equivalent to ready wallets, subordinate wallets, and Domain Name Service. We will add that in once we have flood fill working.

Floodfill will be implemented on top of a Merkle-patricia tree implemented with, perhaps, grotesque inefficiency, by having nodes in the database where the key of each node consists of the bit length of the address as the primary sort key, then the address. The record, the content of the node identified by this key, is its hash, the type, the addresses of the two children, and the hashes of the two children. The hash of a tree node is the hash of the hashes of its two children, ignoring its address. (The hash of a leaf node takes account of the leaf node’s address, but the hashes of the tree nodes do not.)
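The hashing and sort key rules above can be sketched as follows. SHA-256, the `leaf:` prefix, and the use of strings of `0` and `1` for addresses are assumptions made for the example, not choices stated in the text.

```python
import hashlib


def leaf_hash(address_bits: str, content: bytes) -> bytes:
    """A leaf hash covers the leaf's address as well as its content."""
    return hashlib.sha256(b"leaf:" + address_bits.encode() + b":" + content).digest()


def tree_hash(left: bytes, right: bytes) -> bytes:
    """An interior hash covers only the two child hashes, not the node's address."""
    return hashlib.sha256(left + right).digest()


def sort_key(address_bits: str) -> tuple[int, str]:
    """Primary sort key is the bit length of the address, then the address."""
    return (len(address_bits), address_bits)


# A tiny tree: the root "" has children "0" and "1"; "0" has children "00" and "01".
leaves = {"00": b"alice", "01": b"bob", "1": b"carol"}
hashes = {addr: leaf_hash(addr, data) for addr, data in leaves.items()}
hashes["0"] = tree_hash(hashes["00"], hashes["01"])
hashes[""] = tree_hash(hashes["0"], hashes["1"])

# Database order: shortest addresses first, so the root sorts to the front.
ordered = sorted(hashes, key=sort_key)
```

Two peers holding the same leaves end up with the same root hash, so comparing roots and then descending into differing children is enough to find what needs to be flood filled.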

Initially we will get this working without network communication, merely with copy paste communication.

An event always consists of a bitstream, starting with a schema identifier. The schema identifier might be followed by a shared secret identifier, which identifies the source and destination key, or followed by direct identification of the source and destination key, plus stuff to set up a shared secret.

8 Terminology

9 The problem

Getting a client and a server to communicate is apt to be surprisingly complicated. This is because the basic network architecture for passing data around does not correspond to actual usage.

TCP-IP assumes a small computer with little or no non volatile storage, and infinite streams, but actual usage is request-response, with the requests and responses going into non volatile storage.

When a bitcoin wallet is synchronizing with fourteen other bitcoin wallets, there are a whole lot of requests and replies floating around all at the same time. We need a model based on events and message objects, rather than continuous streams of data.

IP addresses and port numbers act as handles and hashcodes to get data from one process on one computer to another process on another computer, but within the process, in user address space, we need a representation that throws away the IP address, the port number, and the positional information and sequence within the TCP-IP streams, replacing it with information that models the process in ways that are more in line with actual usage.

10 Message objects and events

Any time we fire an event, send a request, we create a local data structure identified by a handle and by the two hundred and fifty six bit hashcode of the request, the pair of entities communicating. The response to the event references either the hashcode, or the handle, or both. Because handles are local, transient, live only in ram, and are not POD, handles never form part of the hash describing the message object, even though the reply to a request will contain the handle.

We don’t store a conversation as between me and the other guy. Rather, we store a conversation as between Ann and Bob, with the parties in lexicographic order. When Ann sees the records on her computer, she knows she is Ann, when Bob sees the conversation on his computer, he knows he is Bob, and when Carol sees the records, because they have been made public as part of a review, she knows that Ann is reviewing Bob, but the records have the same form, and lead to the same Merkle root, on everyone’s computer.
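Storing the parties in lexicographic order is what makes the record identical on every machine. A minimal sketch, assuming fixed length public keys and SHA-256, with `conversation_id` a made up name:

```python
import hashlib


def conversation_id(party_a: bytes, party_b: bytes) -> bytes:
    """Hash of the two public keys in lexicographic order.

    Assumes both keys have the same fixed length, so that plain
    concatenation is unambiguous.
    """
    first, second = sorted((party_a, party_b))
    return hashlib.sha256(first + second).digest()


ann = bytes.fromhex("11" * 32)  # placeholder public keys, for the example only
bob = bytes.fromhex("22" * 32)
```

Whether Ann, Bob, or Carol computes it, and whichever order they pass the keys in, the identifier is the same.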

Associated with each pair of communicating entities is a durable secret elliptic point, formed from the wallet secrets of the parties communicating, and a transient and frequently changing secret elliptic point. These secrets never leave ram, and are erased from ram as soon as they cease to be needed. A hash formed from the durable secret elliptic point is associated with each record, and that hash goes into non volatile storage, where it is unlikely to remain very secret for very long, and is associated with the public keys, in lexicographic order, of the wallets communicating. The encryption secret formed from the transient point hides the public key associated with the durable point from eavesdroppers, but the public key that is used to generate the secret point goes into non volatile storage, where it is unlikely to remain very secret for very long.

This ensures that the guy handing out information gets information about who is interested in his information. It is a privacy leak, but we observe that sites that hand out free information on the internet go to great lengths to get this information, and if the protocol does not provide it, will engage in hacks to get it, such as Google Analytics, which hacks lead to massive privacy violation, and the accumulation of intrusive spying data in vast centralized databases. Most internet sites use Google Analytics, which downloads an enormous pile of JavaScript on your browser, which systematically probes your system for one thousand and one privacy holes and weaknesses and reports back to Google Analytics, which then shares some of their spy data with the site that surreptitiously downloaded their enormous pile of hostile spy attack code onto your computer.

We can preserve some privacy on a client by the wallet initiating the connection deterministically generating a different derived wallet for each host that it wants to initiate connection with, but if we want push, if we want peers that can be contacted by other peers, we have to use the same wallet for all of them.

A peer, or logged in, connection uses one wallet for all peers. A client connection without login uses an unchanging, deterministically generated, probabilistically unique, wallet for each server. If the client has ever logged in, the peer records the association between the deterministically generated wallet and the wallet used for peer or logged in connections, so that if the client has ever logged in, that widely used wallet remains logged in forever, albeit the client can throw away that wallet, which is derived from his master secret, and use a new wallet with a different derivation from his master secret.
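One way to get an unchanging, deterministically generated, probabilistically unique wallet per server is a keyed hash of the server's public key under the client's master secret. The text does not specify the derivation, so HMAC-SHA-256, the `client-wallet:` label, and the function name below are all assumptions made for illustration. A real wallet would still have to turn the output into a valid secret scalar.

```python
import hashlib
import hmac


def derived_wallet_secret(master_secret: bytes, host_public_key: bytes) -> bytes:
    """Deterministic per-host secret: the same inputs always give the same wallet."""
    return hmac.new(master_secret, b"client-wallet:" + host_public_key, hashlib.sha256).digest()


master = b"\x01" * 32   # placeholder master secret, for the example only
host_a = b"\xaa" * 32   # placeholder host public keys
host_b = b"\xbb" * 32
```

The client needs to store nothing per host, because it can regenerate the same wallet the next time it connects, and two hosts comparing notes cannot link the two derived wallets.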

The owner of a wallet has, in non volatile storage, the chain by which each wallet is derived from his master secret, and can regenerate all secrets from any link in that chain. His master secret may well be off line, on paper, while some of the secrets corresponding to links in that chain are in non volatile storage, and therefore not very secure. If he wants to store a large amount of value, or final control of valuable names, he has them controlled by the secret of a cold wallet.

When an encrypted message object enters user memory, it is associated with a handle to a shared transient volatile secret, and its decryption position in the decryption stream, and thus with a pair of communicating entities. How this association is made depends on the details of the network connection, on the messy complexities of IP and of TCP-IP position in the data stream, but once the association is made, we ignore that mess, and treat all encrypted message objects alike, regardless of how they arrived.

Within a single TCP-IP connection, we have a message that says “subsequent encrypted message objects will be associated with this shared secret and thus this pair of communicating entities, with the encryption stream starting at the following multiple of 4096 bytes, and subsequent encryption stream positions for subsequent records are assumed to start at the next block of a power of two bytes where the block is large enough to contain the entire record.”, but on receiving records following that message, we associate each record with the shared secret and the encryption stream position, and pay no further attention to IP numbers and position within the stream. Once the association has been made, we don’t worry which TCP stream or UDP port number the record came in on or its position within the stream. We identify the communicating entities involved by their public keys, not their IP address. When we decrypt the message, if it is a response to a request, it has the handle and/or the hash of the request.
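The rule for where the next record starts in the encryption stream can be sketched like this. It is one reading of the quoted message: each record starts at the next boundary aligned to the smallest power of two that can hold that record. The function names are invented for the example.

```python
def block_size_for(record_len: int) -> int:
    """Smallest power of two that can hold the whole record."""
    if record_len < 1:
        raise ValueError("record_len must be at least 1")
    return 1 << (record_len - 1).bit_length()


def next_record_position(stream_pos: int, record_len: int) -> int:
    """Round stream_pos up to the next multiple of the record's block size."""
    block = block_size_for(record_len)
    return (stream_pos + block - 1) // block * block


# The stream starts at a multiple of 4096; a 100 byte record fits where it stands.
first = next_record_position(4096, 100)
# A 300 byte record after it needs a 512 byte block, so it skips ahead.
second = next_record_position(first + 100, 300)
```

Because both ends can compute the position from the record sizes alone, the position never has to be sent for each record.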

A large record object could take quite a long time downloading. So when the first part arrives, we decrypt the first part, to find the event handler, and call the progress event of the handler, which may do nothing, every time data arrives. This may cause the timeout on the handler to be reset.

If we are sending a message object after long delay, we construct a new shared secret, so the response to a request may come over a new TCP connection, different from the one on which it was sent, with a new shared secret, and a position in the decryption stream, unrelated to the shared secret, the position in the decryption stream, and the IP stream, under which a request was sent. Our message object identity is unrelated to the underlying internet protocol transport. Its destination is a wallet, and its ID on the process of the wallet is its hashtag.

11 Handles

I have above suggested various ad hoc measures for preventing references to reused handles, but a more robust and generic solution is hash codes. You generate fresh hash codes cyclically, checking each fresh hash code to see if it is already in use, so that each communication referencing a new event handle or new shared secret also references a new hash code. The old hash code is de-allocated when the handle is re-used, so a new hashcode will reference the new entity pointed to by the handle, and the old hashcode will fail immediately and explicitly.
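A minimal sketch of such a table follows. The class and method names are hypothetical, and it assumes far fewer live hash codes than the code space holds, otherwise the search for a free code would never end.

```python
class HandleTable:
    """Maps cyclically generated hash codes to reusable local handles."""

    def __init__(self, bits: int = 32):
        self._mask = (1 << bits) - 1
        self._next = 0
        self._by_code = {}   # hash code -> handle
        self._code_of = {}   # handle -> its current hash code
        self._entity = {}    # handle -> entity it currently points to

    def bind(self, handle: int, entity) -> int:
        """Point `handle` at `entity` and retire the handle's previous hash code."""
        old = self._code_of.pop(handle, None)
        if old is not None:
            del self._by_code[old]
        # Step cyclically past any hash code that is still in use.
        while self._next in self._by_code:
            self._next = (self._next + 1) & self._mask
        code = self._next
        self._next = (self._next + 1) & self._mask
        self._by_code[code] = handle
        self._code_of[handle] = code
        self._entity[handle] = entity
        return code

    def lookup(self, code: int):
        """Return the entity for `code`; raises KeyError if the code was retired."""
        return self._entity[self._by_code[code]]


table = HandleTable()
first_code = table.bind(7, "request A")
second_code = table.bind(7, "request B")  # handle 7 is reused
```

After the reuse, `second_code` finds the new request, while a late reply quoting `first_code` raises `KeyError` instead of silently landing on the wrong request.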

Make all hashcodes thirty two bits. That will suffice, and if scaling bites, we are going to have to go to multiple host processes anyway. Our planned protocol already allows you to be redirected to an arbitrary host wallet speaking on behalf of a master wallet that may well be in cold storage. When we have enormous peers on the internet hosting hundreds of millions of clients, they are going to have to run tens of thousands of processes. Our hashtags only have meaning within a single process and our wallet identifier address space is enormous. Further, a single process can have multiple wallets associated with it, and we could differentiate hashes by their target wallet.

Every message object has a destination wallet, which is an online wallet, which should only be online in one host process in one machine, and an encrypted destination event hashcode. The fully general form of a message object has a source public key, a hashcode indicating a shared secret plus a decryption offset, or is prefixed by data to generate a shared secret and decryption offset, and, if a response to a previous event, an event hashcode that has meaning on the destination wallet. However, on the wire, when the object is travelling by IP protocol, some of these values are redundant, because defaults will have already been created associated with the IP connection. On the disk and inside the host process, it is kept in the clear, so does not have the associated encryption data. At the user process level, and in the database, we are not talking to IP addresses, but to wallets. The connection between a wallet and an IP address is only dealt with when we are telling the operating system to put message objects on the wire, or they are being delivered to a user process by the operating system from the wire. On the wire, having found the destination IP and port of the target wallet, the public key of the target wallet is not in the clear, and may be implicit in the port (dry).

Any shared secret is associated with two hash codes, one being its value on the other machine, and two public keys. But under the dry principle, we don’t keep redundant data around, so the redundant data is virtual or implicit.

12 Very long lived events

If the event handler refers to a very long lived event (maybe we are waiting for a client to download waiting message objects from his host, email style, and expect to get his response through our host, email style) it stores its associated pod data in the database, deletes it from the database when the event is completed, and if the program restarts, the program reloads it from the database with the original hashtag, but probably a new handle. Obviously database access would be an intolerable overhead in the normal case, where the event is received or timed out quickly.

13 Practical message size limits

Even a shitty internet connection over a single TCP-IP connection can usually manage 0.3 Mbps, roughly 0.035 MB per second, and we try to avoid message objects larger than one hundred KB. If we want to communicate a very large data structure, we use a lot of one hundred KB objects, and if we are communicating the blockchain, we are probably communicating with a peer who has at least a 10 Mbps connection, so use a lot of two MB message objects.

1 Mbps download, 0.3 Mbps upload: third world cell phone connection, third world roach hotel connection, erratically usable.
2–4 Mbps: basic email and web surfing, video not recommended.
4–6 Mbps: good web surfing experience, low quality video streaming (720p).
6–10 Mbps: excellent web surfing, high quality video streaming (1080p).
10–20 Mbps: high quality video streaming, high speed downloads, business-grade speed.

A transaction involving a single individual and a single recipient will at a minimum have one signature (which identifies one UTXO, rhocoin, making it a TXO), hence 4 * 32 bytes, two utxos, unused rhocoins, hence 2 * 40 bytes, and a hash referencing the underlying contract, hence 32 bytes, say 256 bytes, 2048 bits. Likely to fit in a single datagram, and you can download six thousand of them per second on a 12 Mbps connection.
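The byte count and the download rate above can be checked with a few lines of arithmetic. The variable names are made up for the example, and 12 Mbps is taken as twelve million bits per second.

```python
signature_and_input = 4 * 32  # one signature identifying one UTXO
outputs = 2 * 40              # two new utxos
contract_hash = 32            # hash referencing the underlying contract

raw_bytes = signature_and_input + outputs + contract_hash  # 240 bytes
padded_bytes = 256                                         # "say 256 bytes"
bits_per_transaction = padded_bytes * 8                    # 2048 bits

# Whole transactions per second on a 12 Mbps connection.
per_second = 12_000_000 // bits_per_transaction
```

The raw total is 240 bytes, which rounds up to the 256 bytes quoted, and the rate works out to a little under six thousand transactions per second.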

On a third world cell phone connection, downloading a one hundred kilobyte object has a high risk of failure, and a busy TCP-IP connection has a short life expectancy.

For communication with client wallets, we aim that message objects received from a client should generally be smaller than 20KB, and records sent to a client wallet should generally be smaller than one hundred KB. For peer wallets and server wallets, generally smaller than 2MB. Note that bittorrent relies on 10KB message objects to communicate potentially enormous and complex data structures, and that the git protocol communicates short chunks of a few KB. Even when you are accessing a packed file over git, you access it in relatively small chunks, though when you access a git repository holding packed files over https protocol, you download the whole, potentially enormous, packed file as one potentially enormous object. But even with git over https, you have the alternative of packing it into a moderate number of moderately large packed files, so it looks as if there is a widespread allergy to very large message objects. Ten K is the sweet spot, big enough for context information overheads to be small, small enough for retries to be non disruptive, though with modern high bandwidth long fat pipes, big objects are less of a problem, and streamline communication overheads.

14 How many shared secrets, how often constructed

The overhead to construct a shared secret is 256 bits and 1.25 milliseconds, so, on a ten Megabit per second connection, if the CPU spent half its time establishing shared secrets, it could establish one secret every three hundred microseconds, eg, one secret every three thousand bits.

Since a minimal packet is already a couple of hundred bits, this does not give a huge amount of room for a DDoS attack. But it does give some room. We really should be seriously DDoS resistant, which implies that every single incoming packet needs to be quickly testable for validity, or cheap to respond to. A packet that requires the generation of a shared secret is not terribly expensive, but it is not cheap.

So, we probably want to impose a cost on a client for setting up a shared secret. And since the server could have a lot of clients, we want the cost per server to be small, which means cost per client to be mighty small in the legitimate non DDoS scenario – it only is going to bite in the DDoS scenario. Suppose the server might have a hundred thousand clients, each with sixteen kilobytes of connection data, for a total of about two gigabytes of ram in use managing client connections. Well then, setting up shared secrets for all those clients is going to take twelve and a half seconds, which is quite a bit. So we want a shared secret, once set up, to last for at least ten to twenty minutes or so. We don’t want clients glibly setting up shared secrets at whim, particularly as this could be a relatively high cost on the server for a relatively low cost on the client, since the server has many clients, but the client does not have many servers.

We want shared secrets to be long lived enough that the cost in memory is roughly comparable to the cost in time to set them up. A gigabyte of shared secrets is probably around ten million shared secrets, so would take three hours to set up. Therefore, we don’t need to worry about throwing shared secrets away to save memory – it is far more important to keep them around to save computation time. This implies a system where we keep a pile of shared secrets, and the accompanying network addresses, in memory: a hashtable that maps wallets existing in other processes to handles to shared secrets and network addresses existing in this process. So each process has the ability to speak to a lot of other processes cached, and probably has some durable connections to a few other processes. Which immediately makes us think about flood filling data through the system without being vulnerable to spam.

Setting up tcp connections and tearing them down is also costly, but it looks as though, for some reason, existing code can only handle a small number of tcp connections, so they encourage you to continually tear them down and recreate them. Maybe we should shut down a tcp connection after eighteen seconds of nonuse. Check them every multiple of 8 seconds past epoch, refrain from reuse twenty four seconds past the epoch, and shut them down altogether after thirty two seconds. (The reason for checking them at certain times since the epoch is that shutdown is apt to go more efficiently if initiated at both ends.)

Which means it would be intolerable to have a shared secret generation in every UDP packet, or even very many UDP packets, so to prevent DDoS attack, and just to have efficient communications, have to have a deal where you cheaply for the server, but potentially expensively for the client, establish a connection before you construct a shared secret.

A five hundred and twelve bit hash however takes 1.5 microseconds – which is cheap. We can use hashes to resist DoS attacks, making the client return to us the state cookie unchanged. If we have a ten megabit connection, then every packet is roughly the size of a hash, in which case the hash time is roughly three hundred megabits per second, not that costly to hash everything.
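A state cookie of this kind can be sketched with a keyed hash: the server keeps no per client state, and only does the expensive shared secret work once the client has returned a cookie the server itself could have issued. The key handling, the sixteen byte truncation, and the function names are assumptions made for the example.

```python
import hashlib
import hmac
import os

SERVER_KEY = os.urandom(32)  # hypothetical per-process secret, never sent on the wire


def make_cookie(client_addr: tuple[str, int], client_nonce: bytes, key: bytes = SERVER_KEY) -> bytes:
    """Short keyed hash of the client's data; the server stores nothing."""
    msg = f"{client_addr[0]}:{client_addr[1]}:".encode() + client_nonce
    return hmac.new(key, msg, hashlib.sha512).digest()[:16]


def check_cookie(client_addr: tuple[str, int], client_nonce: bytes, cookie: bytes, key: bytes = SERVER_KEY) -> bool:
    """Recompute the cookie and compare in constant time before any costly work."""
    return hmac.compare_digest(make_cookie(client_addr, client_nonce, key), cookie)


addr = ("203.0.113.5", 49152)   # example address from the documentation range
nonce = b"client-chosen-nonce"
cookie = make_cookie(addr, nonce)
```

A client spoofing its source address never sees the cookie, so it cannot make the server go on to generate a shared secret.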

How big a hash code do we need to identify the shared secret? Suppose we generate one shared secret every millisecond. Then thirty two bit hashcodes are going to roll over in about fifty days. If we have a reasonable timeout on inactive shared secrets, reuse is never going to happen, and if it does happen, the connection fails. Well, connections are always failing for one reason or another, and a connection inappropriately failing is not likely to be very harmful, whereas a connection seemingly succeeding, while both sides make incorrect and different assumptions about it, could be very harmful.

15 Message UDP protocol for messages that fit in a single packet

When I look at the existing TCP state machine, it is hideously complicated. Why am I thinking of reinventing that? Syn cookies turn out to be less tricky than I thought – the server just sends a secret short hash of the client data and the server response, which the client cannot predict, and the client response to the server response has to be consistent with that secret short hash.

Well, maybe it needs to be that complicated, but I feel it does not. If I find that it really does need to be that complicated, well, then I should not consider re-inventing the wheel.

Every packet has the source port and the destination port, and in tcp initiation, the client chooses its source port at random (bind with port zero) in order to avoid session hijacking attacks. The range of source ports runs up to 65535.

Notice that this gives us 2⁶⁴ possible channels, and then on top of that we have the 32 bit sequence number.

IP eats up twenty bytes, and then the source and destination ports eat four more bytes. I am guessing that NAT just looks at the port numbers and address of outgoing packets, and then if a matching incoming packet arrives, just cheerfully lets it through. TCP and UDP ports look rather similar, every packet has a specific server destination port, and a random client port. Random ports are sometimes restricted to 0xC000-0xFFFF, and sometimes mighty random (starting at 0x0800 and working upwards seems popular). But 0xC000-0xFFFF viewed as a hashcode seems ample scope. Bind for port 0 returns a random port that is not in use, use that as a hashcode.

Sequence number is something like your event hashcode – or perhaps event hashcode for grouped events, with the tcp header being the group.

Assume the process somehow has an accessible and somehow known open UDP port. Client low level code somehow can get hold of the process port and IP address associated with the target elliptic point, by some mechanism we are not thinking about yet.

We don’t want the server to be wide open to starting any number of new shared secrets. Shared secrets are costly enough that we want them to last as long as cookies. But at the same time, recommended practice is that ports in use do not last long at all. We also might well want a redirect to another wallet in the same process on the same server, or a nearby process on a nearby server. But if so, let us first set up a shared secret that is associated with the shared secret on this port number, and then we can talk about shared secrets associated with other port numbers. Life is simpler if there is a one to one mapping between access ports and durable public and private keys, even if behind that port are many durable public and private keys.

16 UDP protocol for potentially big objects

The tcp protocol can be thought of as the tcp header, which appears in every packet of the stream, being a hashcode event object, and the sequence number, which is distinct and sequential in every packet of the unidirectional stream, being a std::deque event object, which fifo queue is associated with the hashcode event object.

This suggests that we handle a group of events, where we want to have an event that fires when all the members of the group have successfully fired, or one of them has unrecoverably failed, with the group being handled as one event by a hashcode event object, and the members of the group with event objects associated with a fifo queue for the group.

When a member of the group is triggered, it is added to the queue. When it is fired, it is marked as fired, and if it is the last element of the queue, it is removed from the queue, and if the next element is also marked as fired, that also is removed from the queue, until the last element of the queue is marked as triggered but not yet fired. In the common case where we have a very large number of members, which are fired in the order, or approximately the order, that they are triggered, this is efficient. When the group event is marked as all elements triggered and all elements fired, and the fifo queue empty then that fires the group event.

Well, that would be the efficient way to handle things if we were implementing TCP, a potentially infinite stream, all over again, but we are not.

Rather, we are representing a big object as a stream of objects, and we know the size in advance, so might as well have an array that remains fixed size for the entire lifetime of the group event. The member event identifiers are indexes into this one big fixed size array.

The event identifier is run time detected as a group event identifier, so it expects its event identifier to be followed by an index into the array, much as the sequence number immediately follows the TCP header.
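A group event over a fixed size array might be sketched as below. The class name, the callback arguments, and the choice to ignore duplicate firings are assumptions for illustration, not part of the design described above.

```python
class GroupEvent:
    """Fixed-size group of member events that fires once when all have fired."""

    def __init__(self, member_count: int, on_complete, on_failure):
        self._fired = [False] * member_count  # one slot per member, never resized
        self._remaining = member_count
        self._on_complete = on_complete
        self._on_failure = on_failure
        self._done = False

    def member_fired(self, index: int) -> None:
        """Mark member `index` as fired; duplicates such as resent packets are ignored."""
        if self._done or self._fired[index]:
            return
        self._fired[index] = True
        self._remaining -= 1
        if self._remaining == 0:
            self._done = True
            self._on_complete()

    def member_failed(self, index: int) -> None:
        """An unrecoverable member failure fails the whole group, exactly once."""
        if not self._done:
            self._done = True
            self._on_failure(index)


log = []
group = GroupEvent(3, lambda: log.append("complete"), lambda i: log.append(("failed", i)))
group.member_fired(2)  # members may fire out of order
group.member_fired(0)
group.member_fired(0)  # duplicate, ignored
group.member_fired(1)  # last member, so the group event fires
```

Since the size is known in advance, the member index that follows the group event identifier on the wire is used directly as the array index, with no queue to maintain.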

I would kind of like to have a QUIC protocol eventually, but that can wait. If we have a UDP protocol, the communicating parties will negotiate a UDP port that uniquely identifies the processes on both computers. Associated with this UDP port will be the default public keys and the hash of the shared secret derived from those public keys, and a default decryption shared secret. The connection will have a keep alive heartbeat of small packets, and a data flow of standard sized large packets, each the same size. Each message will have a sequence number identifying the message, and each UDP packet of the message will have the sender sequence number of its message, its position within the message, and, redundantly, the power of two size of the encrypted message object. Each message object, but not each packet containing a fragment of the message object, contains the unencrypted hashtag of the shared secret, the hashtag of the event object of the sender, which may be null if it is the final message, and, if it is a reply, the hashtag of event object of the message to which it is a reply, and the position within the decryption stream as a multiple of the power of two size of the encrypted message. This data gets converted back into standard message format when it is taken off the UDP stream.

Every data packet has a sequence number, and each one gets an ack, though only when the input queue is empty, so several data packets get a group ack. If an ack is not received, the sender sends a nack. If the receiver responds with a nack (huh, what packets?), the sender resends the packets. If the receiver persistently fails to respond, sending the message object has failed, and the connection is shut down. If the receiver can respond to nacks, but not to data packets, maybe our data packet size is too big, so we halve it. If that does not work, sending the message object has failed, and the connection is shut down.

QUIC streams will be created and shut down fairly often, each time with a new shared secret, and a message object reply may well arrive on a new stream distinct from the stream on which the request was sent.

Message objects, other than nacks and acks, intended to manage the UDP stream are treated like any other message object, passed up to the message layer, except that their result gets sent back down to the code managing the UDP stream. A UDP stream is initiated by a regular message object, with its own data to initiate a shared secret, small enough to fit in a single UDP packet, it is just that this message object says “prepare the way for bigger message objects” – the UDP protocol for big message objects is built on top of a UDP protocol for message objects small enough to fit in a single packet.

1 related

2 clients and hosts, masters and slaves