Replacing TCP, SSL, DNS, CAs, and TLS

1 related

Client Server Data Representation

2 Existing work

µTP, the Micro Transport Protocol, has already been written, and it is just a matter of copying it and embedding it where possible, and forking it if unavoidable. DDoS resistance looks like it is going to need forking.

It implements LEDBAT, a protocol designed for applications that download bulk data in the background, pushing the network close to its limits, while still playing nice with TCP.

Implementing consensus over µTP is going to need QUIC style streams, that can slow down or fail without the whole connection slowing down or failing, though it might be easier to implement consensus that just calls µTP for some tasks.

I have not investigated what implementing short fixed length streams over µTP would involve. Bittorrent already necessarily does something mighty like that. Maybe it just sequentializes everything. Which kind of makes sense, a single concurrent process managing each connection is easier to program and comprehend, even if it cannot give optimal performance. Obviously it must have a request response layer, documented only in source code. The question then is how it maps that layer onto a µTP connection. You are going to have to copy, not just µTP, but that layer, which should be part of µTP, but probably is not. You will have to factor out what they probably did not cleanly factor.

Their request response layer is probably somewhat documented in BEP0055. I suspect that what I need is not just µTP, but the largest common factors of BEP0055.

µTP does not itself implement hole punching, but interoperates smoothly with libtorrent’s BEP0055 ut_holepunch extension message, which is only documented in libtorrent source code.

A tokio-rust based µTP system is under development, but very far from complete last time I looked. Rewriting µTP in rust seems pointless. Just call it from a single tokio thread that gives effect to a hundred thousand concurrent processes. There are several projects afoot to rewrite µTP in rust, all of them stalled in a grossly broken and incomplete state.

QUIC has grander design objectives, and is a well thought out, well designed, and well tested implementation of no end of very good and much needed ideas and technologies, but relies heavily on enemy controlled cryptography.

Albeit there are some things I want to do, consensus between a small number of peers, by invitation and each peer directly connected to each of the others, the small set of peers being part of the consensus known to all peers, and all peers always online and responding appropriately, or else they get kicked out. (Practical Byzantine Fault Intolerant consensus) which it really cannot do, though it might be efficient to use a different algorithm to construct consensus, and then use µTP to download the bulk data.

3 Existing documentation

There is a great pile of RFCs on issues that arise with using udp and icmp to communicate, which contain much useful information.

RFC5405, RFC6773, datagram congestion control, RFC5595, UDP Usage Guideline

There is a formalized congestion control system, ECN (Explicit Congestion Notification). Most servers ignore ECN. On a small proportion of routes, 1%, ECN tagged packets are dropped.

Raw sockets provide greater control than UDP sockets, and allow you to do ICMP like things through ICMP.

I also have a discussion on NAT hole punching, peering through nat, that summarizes various people’s experience.

To get an initial estimate of the path MTU, connect a datagram socket to the destination address using connect(2) and retrieve the MTU by calling getsockopt(2) with the IP_MTU option. But this can only give you an upper bound. To find the actual MTU, you have to set the don’t fragment flag (which is these days generally set by default on UDP) and empirically track the largest packet that makes it through on this connection, which is what TCP does.
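A minimal sketch of the connect(2) then getsockopt(2) step, assuming Linux. The function name `initial_path_mtu` is made up for illustration, and the numeric fallback 14 is the Linux value of IP_MTU, used because Python's socket module does not always export that constant. Connecting a datagram socket sends no packet, so the answer is only the kernel's current estimate for the route.

```python
import socket
import sys

# Linux value of IP_MTU from <linux/in.h>. Assumption: Python may not
# export socket.IP_MTU, so fall back to the raw number.
IP_MTU = getattr(socket, "IP_MTU", 14)

def initial_path_mtu(host, port):
    """Return the kernel's path MTU estimate for host, or None if unavailable."""
    if not sys.platform.startswith("linux"):
        return None  # IP_MTU is a Linux socket option
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        try:
            s.connect((host, port))  # no packet is sent, this only picks a route
            return s.getsockopt(socket.IPPROTO_IP, IP_MTU)
        except OSError:
            return None  # no route, or the option is not supported
```

The value returned is the upper bound described above. Finding the real path MTU still needs probing with the don't fragment flag set.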

3.1 first baby steps

To try and puzzle this out, I need to build a client server that can listen on an arbitrary port and tell me about the messages it receives, and that can send messages to an arbitrary hostname:port or network address:port. When it receives a packet that is formatted for it, it will display the information in that packet and obey the command in that packet. That command will typically be a command to send a reply that depicts what is in the packet it received, which probably got transformed by passing through multiple NATs, and/or a command to display what is in the packet, which is typically a depiction of how the packet to which this packet is a reply got transformed.
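A rough sketch of the smallest version of such a test program, over plain UDP only. The JSON reply format and the names `build_reply`, `serve_once`, and `loopback_demo` are assumptions made for illustration, not an existing tool. The receiver replies with the source address and port as it saw them, which is the part that NATs rewrite.

```python
import json
import socket

def build_reply(data, sender):
    """Describe a received datagram as the receiver saw it."""
    addr, port = sender[0], sender[1]
    return json.dumps({"seen_addr": addr, "seen_port": port,
                       "payload_hex": data.hex()}).encode()

def serve_once(sock):
    """Receive one datagram, display it, and send the description back."""
    data, sender = sock.recvfrom(2048)
    reply = build_reply(data, sender)
    print("received:", reply.decode())
    sock.sendto(reply, sender)

def loopback_demo():
    """Run one probe and reply between two sockets on 127.0.0.1."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        server.bind(("127.0.0.1", 0))  # port 0: the kernel picks a free port
        client.bind(("127.0.0.1", 0))
        server.settimeout(2.0)
        client.settimeout(2.0)
        client.sendto(b"probe", server.getsockname())
        serve_once(server)
        reply, _ = client.recvfrom(2048)
        return json.loads(reply), client.getsockname()[1]
```

On loopback the reported port equals the client's own port. Across a NAT the two would differ, and that difference is what the program exists to show.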

This test program sounds an awful lot like ICMP, which is best accessed through raw sockets. Might be a good idea to give it the capability to send ICMP, UDP, and fake TCP.

Raw sockets provide the lowest level access to the network available from userspace. An immense pile of obscure and complicated stuff is in kernel.

4 What the API should look like

It should be a consensus API for consensus among a small number of peers, rather than message API, message response being the special case of consensus between two peers, and broad consensus being constructed out of a large number of small invitation based consensi.

A peer explicitly joins the small group when its request is acked by a majority, and rejected by no one.
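The join rule can be sketched as a single predicate. This assumes votes are identified by member id and that votes from non-members are ignored. Both are my assumptions, as is the name `join_accepted`.

```python
def join_accepted(members, acks, rejects):
    """Return True if a join request passes: majority acked, nobody rejected."""
    members = set(members)
    acks = set(acks) & members        # ignore votes from non-members
    rejects = set(rejects) & members
    return not rejects and 2 * len(acks) > len(members)
```

With four members, two acks are not a majority, so the request stays pending until a third ack arrives or someone rejects it.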

On the other hand this involves re-inventing networking from scratch, as compared to simply copying QUIC (the transport under HTTP/3), or some other reliable UDP system.

Total rewrites, however desirable and necessary, always fail.

So on reflection this is a blue sky proposal - likely to involve immense delay:

I need to think about the way things should be done - but I don’t want to get lost in the weeds. I have repeatedly wasted a great deal of time re-inventing stuff from scratch, only to find that when I was finished, I had something vastly inferior to what already existed, so I wound up tossing my work, and using someone else’s library with minimum adaptation.

Many a time I see something is encrusted with ancient history, backward compatibility means they cannot fix old mistakes, I design something new and fresh, and vastly superior, and discover that there were one hundred and one issues that old history encrusted thing had encountered and dealt with, and I had not foreseen, that not all of that mighty pile of code is crap to work around past mistakes which must continue to be supported, but a lot of it is issues I had not foreseen having to deal with, and had not planned a path to dealing with them.

When implementing stuff from scratch, all too often one discovers there are no end of reasons for all the stuff one thought bad and unnecessary in existing libraries.

But on with the vision. Though it will likely be vastly faster to just fix someone else’s library to have real security.

Although the api represents messages, rather than connections, it will implicitly have a very large number of connections, in that a connection is your current state with a counterparty, expected protocols (message types) and all that.

For an app to poll a very large number of connections over the network, select does not cut the mustard. Network apis have been evolving, each in its own idiosyncratic way, to the app making O(1) additions and deletions to list of counterparties on the network whose messages it is listening to, and getting notifications that are O(number of events) rather than O(number of counterparties).

The way this should be done is a linked list of data structures containing events, which the app can poll locklessly, or wait on (with a timer event guaranteed to appear in the list eventually if it is waiting on it). If the app fails to free anything from the list after an unreasonably long time, suggesting that the app has shut down ungracefully or crashed, and there are rather too many things on the list, the process that is putting things on the list will start by pushing back on the parties sending messages to the app, and end by shutting down their connections and discarding their data. The network events live entirely in memory and are volatile. If they represent long lived relationships, it is up to the app to commit the information that they represent to disk.
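A sketch of the pushback policy only, under stated assumptions: a deque stands in for the lockless linked list, time is passed in explicitly, and the two limits and the names `EventList` and `Pressure` are invented for illustration.

```python
from collections import deque
from enum import Enum

class Pressure(Enum):
    NONE = 0        # accept freely
    PUSH_BACK = 1   # tell senders to slow down
    SHED = 2        # drop the data and close the connection

class EventList:
    def __init__(self, soft_limit, hard_limit, stale_after):
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.stale_after = stale_after  # seconds without a poll before the app looks dead
        self.events = deque()
        self.last_poll = 0.0

    def put(self, now, event):
        """Producer side: queue an event and report what pressure to apply."""
        stalled = now - self.last_poll > self.stale_after
        if stalled and len(self.events) >= self.hard_limit:
            return Pressure.SHED  # event is discarded, nothing is queued
        self.events.append(event)
        if stalled and len(self.events) >= self.soft_limit:
            return Pressure.PUSH_BACK
        return Pressure.NONE

    def take(self, now):
        """App side: pop the oldest event, or None. Any poll counts as a sign of life."""
        self.last_poll = now
        return self.events.popleft() if self.events else None
```

Pressure is applied only when the app has been silent for too long and the list is also long, matching the two conditions in the text. Nothing here touches disk, so the events are as volatile as described.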

Every message has a public key of sender, a public key of recipient, and potentially an in-regards-to hash, a reply-to hash, and an in-reply-to hash. Some or all of these hashes may be null. It seldom makes sense for all of them to be null, and it seldom makes sense for all of them to be non null. Usually reply-to is null, and it does not always make sense for it to be non null.

The reply-to field opens up a very large can of worms, in that its main use is to reference a third party message that came from a third party server, with its own type information and sender public key, and then how does the sender know the recipient has or can obtain that message?
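As a sketch, the top level header could be a record like the following. The field and method names are assumptions for illustration, and the keys and hashes are treated as opaque byte strings.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MessageHeader:
    sender_key: bytes
    recipient_key: bytes
    in_regards_to: Optional[bytes] = None
    reply_to: Optional[bytes] = None      # usually null
    in_reply_to: Optional[bytes] = None

    def hashes_present(self):
        """Names of the hash fields that are non null."""
        names = ("in_regards_to", "reply_to", "in_reply_to")
        return [n for n in names if getattr(self, n) is not None]

    def seldom_sensible(self):
        """True for the two combinations that seldom make sense: all null or all non null."""
        return len(self.hashes_present()) in (0, 3)
```

`seldom_sensible` flags a header rather than rejecting it, since both combinations are described as rare, not forbidden.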

Every hash and every public key represents a potential endpoint, and thus represents an additive type, or rather gives the system potential clues on how to discover a mutually known additive type. (Reflect on the slow and chaotic semi automated complexity of how the many protocols involved in sending and receiving an email message are discovered, every time, for every email message.)

Some of the time, the message type is only known from one of these hashes – they imply the type information, without which the recipient would not know how to parse the message, and the recipient has to be able to recognize them before he can recognize anything else. And some of the time, figuring out the message type from these hashes is non trivial or just flat out fails. No general automatic one size fits all procedure can work on every mysterious second party hash. This is a problem that has to be dealt with ad hoc use case by use case, protocol by protocol, message type by message type.

Not all messages can be sent reliably, but the sender gets a notification event – failed, succeeded, replied to, or unlikely to be known, and the sender can immediately find out either the likely timing of such notification, or that the likely timing of such notification is unknown – and usually that the likely timing of such notification is unknown generates an exception.

The api is potentially multilayered – the message may well get translated to a multitude of similarly structured messages, that set up the connection, find out information about the recipient, all that stuff, and when those messages go on the wire, they do not necessarily have any of this stuff – commonly they just have the network, the port address, and some numbers that uniquely identify the context, which numbers are unique to the connection, but unlike the hashes from which they are derived, not globally unique, are sequential identifiers, not hashes. But at the top level, the network address, the port, and all that stuff is just not represented, except implicitly in that the public key of the recipient may well get looked up in a hash table that may well have the network address and the port.

On the wire, network address and port serves the function of in-regards-to, and will wrap stuff that provides a finer grained function of in-regards-to and in-reply-to – as I said, multilayered, with the hashes being internally mapped to data that serves equivalent functionality. Network address and port being the outermost layer on the wire.

On the wire, once a connection is established, the sender and recipient public keys are implicit in the ip header, and the rest is opaque payload, maximum payload being 1 KiB. Inside the payload, the representation depends on the message type, which was established when the connection was established – the in-reply-to of the contained message is the unique sequential nonce of the message being replied to, rather than the hash of that message.
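A sketch of the mapping between globally unique hashes and connection local sequential numbers. SHA-256 is an assumption, since the text does not name a hash, and the class name `ConnectionContext` is invented.

```python
import hashlib

class ConnectionContext:
    """Per connection table: message hash <-> small sequential number."""

    def __init__(self):
        self._seq_by_hash = {}
        self._hash_by_seq = []

    def seq_for(self, message: bytes) -> int:
        """Return the sequential id for a message, assigning the next one if new."""
        digest = hashlib.sha256(message).digest()
        if digest not in self._seq_by_hash:
            self._seq_by_hash[digest] = len(self._hash_by_seq)
            self._hash_by_seq.append(digest)
        return self._seq_by_hash[digest]

    def hash_for(self, seq: int) -> bytes:
        """Recover the global hash from a number seen on the wire."""
        return self._hash_by_seq[seq]
```

The same message gets different numbers on different connections, which is why the numbers are unique to the connection while the hashes are unique globally.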

In the api, the application and api know the message type, because otherwise the api just would not work. But on the rare occasions when the message is represented globally, outside the api, then it needs a message type header.

5 TCP is broken

TCP was designed in more trusting times, when the name system consisted of a widely shared hosts file, and everyone trusted everyone.

Over the years people have piled warts on top of TCP and warts on top of warts to fix one problem after another, and every fix results in additional round trips

Thus “Cloudflare is checking your browser, you will be redirected shortly”

Every additional round trip before a web page comes up results in a significant loss of viewers. Hence http2. Which fails to fix the DDoS and Cloudflare problem.

TCP is a major problem, which is slowing down the internet. DDoS protection and the certificate mess are warts growing on top of warts.

Any business that resists corporate cancer is going to come under DDoS, and if it employs a DDoS resistance service, that service is likely to place pressure on the business to do political stuff that is counterproductive to pursuing a profit. And even if it does not, the DDoS service slows down people trying to view the business website.

If the TCP replacement fixes those warts, you get more views.

6 Domain name system and SSL is broken

Any organization that has a certificate authority in its pocket can perform a man in the middle attack on an SSL connection, though the CAA domain name record somewhat mitigates this problem.

We also need to replace the TCP/SSL/CA/DNS system because there is money in it. A great deal of money.

The trouble with an ICO (initial coin offering), is that the issuer has no obligation to do anything other than take the money and run. We are moving to an economy where much of the value is “goodwill”, “goodwill” being names with reputations and relationships. The blockchain (or blockdag, since blockdags theoretically have better scaling than blockchains) could be used to render this value liquid in IPOs by having both names and money on the blockchain.

Atomic transactions between blockchains, plus names on the blockchain with money, a replacement for TCP/SSL/CAs/DNS could support sovereign corporations on the blockchain, so that an ICO could be an IPO (Initial Public Offering). If the blockchain is a name service as well as a money service, it could give the investors ownership of the name. The owners of examplecorp shares get to designate the board public key, and the board gets to designate the public key of CEO@examplecorp from time to time, thus rendering the value of a name potentially liquid.

Cryptocurrency exchanges are run by crooks, and are full of crooks each trying to scam all the other crooks.

If you don’t know who the pigeon is, you are the pigeon.

A healthy cryptocurrency market needs to leave the cryptocurrency exchanges behind, replacing them with atomic blockchain transactions between separate blockchains. They are dangerously centralized, and linked to a corruptly regulated finance and accounting system, which corruption we saw with Great Minority Mortgage Meltdown and the Mortgage backed Security market from 2005 November to 2007, and saw with MF Global. Jon Corzine did worse than embezzle client funds. He embezzled client funds legally.

Demand for crypto currencies is driven in substantial part by the fact that recent regulations have cheerfully set aside laws on fiduciary duty that are millennia old. The exchanges cheerfully adhere to such regulations as they find dangerously convenient, while taking advantage of cryptocurrency to avoid those regulations that they find inconvenient.

The banks, the stock exchanges, and the big accounting firms are regulated agencies whose regulators are in their pocket. The crypto currency exchanges are semi regulated, taking advantage of regulations written for those who have regulators in their pocket.

The cryptocurrency market needs to get rid of exchanges, starting with cryptocurrency exchanges, and proceeding to get rid of stock exchanges.

An exchange exists to provide an escrow that faithfully observes its fiduciary duty. And there have been a great many recent examples of such entities getting up to no good, and in the case of the mortgage backed security market, up to no good with enormous amounts of money.

A cryptocurrency with a name system could eat their lunch, greatly enriching its founders in the process.

7 Networking itself is broken

But that is too hard a problem to fix.

I had to sweat hard setting up Wireguard, because it pretends to be just another network adaptor so that it can sweep away a pile of issues as out of scope, and reading up posts and comments referencing these issues, I suspect that almost no one understands these issues, or at least no one who understands these issues is posting about them. They have a magic incomprehensible incantation which works for them in their configuration, and do not understand why it does not work for someone else in a subtly different configuration.

7.1 Internet protocol too many layer of abstraction

I have to talk internet protocol to reach other systems over the internet, but internet protocol is a messy pile of ad hoc bits of software built on top of ad hoc bits of software, and the reason it is hard to understand the nuts and bolts when you actually try to do anything useful is that you do not understand, and indeed almost no one understands, what is actually going on at the level of network adaptors and internet switches. When you send a udp packet, you are already at a high level of abstraction, and the complexity that these abstractions are intended to hide leaks.

And because you do not understand the intentionally hidden complexity that is leaking, it bites you.

7.1.1 Adaptors and switches

A private network consists of a bunch of network adaptors all connected to one ethernet switch and its configuration consists of configuring the software on each particular computer with each particular network adaptor to be consistent with the configuration of each of the others connected to the same ethernet switch, unless you have a DHCP server attached to the network, in which case each of the machines gets a random, and all too often changing, configuration from that DHCP server, but at least it is guaranteed to be consistent with the configuration of each of the other network adaptors attached to that one ethernet switch. Why do DHCP configurations not live forever, why do they not acknowledge the machine human readable name, why does the ethernet switch not have a human readable name, and why does the DHCP server have a network address related to that of the ethernet switch, but not a human readable name related to that of the ethernet switch?

What happens when you have several different network adaptors in one computer?

Obviously an IP address range has to be associated with each network adaptor, so that the computer can dispatch packets to the correct adaptor. And when the network adaptor receives a packet, the computer has to figure out what to do with it. And what it does with it is the result of a pile of undocumented software executing a pile of undocumented scripts.

If you manually configure each particular machine connected to an ethernet switch, the configuration consists of arcane magic formulae interpreted by undocumented software that differs between one system and the next.

As rapidly becomes apparent when you have to deal with more than one adaptor, connected to more than one switch.

Each physical or virtual network adaptor is driven by a device driver, which is different for each physical device and operating system. From the point of view of the software, the device driver api is the network adaptor programmer interface, and it does not care about which device driver it is, so all network adaptors must have the same programmer interface. And what is that interface?

Networking is a wart built on top of warts built on top of warts. IP6 was intended to clean up this mess, but kind of collapsed under rule by committee, developing a multitude of arcane, overly complicated, and overly clever cancers of its own, different from, and in part incompatible with, the vast pile of cruft that has grown on top of IP4.

The committee wanted to throw away the low order sixty four bits of address space to use to post information for the NSA to mop up, and then other people said to themselves, “this seems like a useless way to abuse the low order sixty four bits, so let us abuse it for something else. After all, no one is using it, nor can they use it because it is being abused”. But everyone whose internet facing host has been assigned a single address, which means has actually been assigned 2^64 addresses because he has sixty four bits of useless address space, needs to use it, since he probably wants to connect a private in house network through his single internet facing host, and would like to be free to give some of his in house hosts globally routable addresses.

In which case he has a private network address space, which is a random subnet of fd00::/8, and a 64 bit subnet of the global address space, and what he wants is that he can assign an in house computer a globally routable address, whereupon anything it sends that has a destination that is not on his private network address space, nor his subnet of the globally routable address space, gets sent to the internet facing network interface.

Further, he would like every computer on his network to be automatically assigned a globally routable address if it uses a name in the global system, or a private fd00::/8 address if it is using a name not in the global system, so that the first time his computer tries to access the network with the domain name he just assigned, it gets a unique network address which will never change, and a reverse dns that can only be accessed through an address on his private network. And if he assigns it a globally accessible name, he would like the global dns servers and reverse dns servers to automatically learn that address.
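The forwarding decision described here can be sketched with Python's `ipaddress` module. Both prefixes are placeholders: a made up unique local prefix, and the 2001:db8::/32 documentation range standing in for the /64 handed out by the provider. The interface labels "lan" and "wan" are likewise hypothetical.

```python
import ipaddress

# Placeholder prefixes, assumptions for illustration only.
PRIVATE_NET = ipaddress.ip_network("fd12:3456:789a::/48")
GLOBAL_SUBNET = ipaddress.ip_network("2001:db8:0:1::/64")

def outgoing_interface(dest):
    """Pick the interface for a destination address given as a string."""
    addr = ipaddress.ip_address(dest)
    if addr in PRIVATE_NET or addr in GLOBAL_SUBNET:
        return "lan"   # in house: private space or our own global subnet
    return "wan"       # everything else goes to the internet facing interface
```

An in house machine holding an address in `GLOBAL_SUBNET` is globally routable, yet traffic between it and its neighbours never leaves the local network.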

This is, at present, doable by the DDI, which updates both your DHCP server and your DNS server. Except that hardly anyone has an in house DNS server that serves up his globally routable addresses. The I in DDI stands for IP Address Manager or IPAM. In practice, everyone relies on named entities having extremely durable network addresses which are a pain and a disaster to dynamically update, or they use dynamic DNS, not IPAM.

What would be vastly more useful and usable is that your internet facing peer routed globally routable packets to and from your private network, and machines booting up on your private network automatically received static addresses corresponding to their name.

Globally routable subnets can change, because of physical changes in the global network, but this happens so rarely that a painful changeover is acceptable. The IP6 fix for automatically accommodating this issue is a cumbersome disaster, and everyone winds up embedding their globally routable IP6 subnet address in a multitude of mystery magic incantations, which, in the event of a change, have to be painstakingly hunted down and changed one by one, so the IP6 automatic configuration system is just a great big wart in a dinosaur’s asshole. It throws away half the address space, and seldom accomplishes anything useful.

8 Distributed Denial of Service attack

At present, resistance to Distributed Denial of Service attacks rests on dangerously powerful central authorities, in particular Cloudflare, whose service in addition to being dangerously centralized, is expensive and poor.

The TCP replacement needs an adjustable proof of work (pow) handshake as the first part of the connection handshake, the proof of work request being the first server packet in the four packet handshake.

First packet, client requests connection, second packet, server requests work, and supplies a durable and a short lived public key, third packet, client supplies work and offers transient public key, making communication possible, plus the message it is trying to send the server, or the first part of that message.

The work demanded goes up as the server load increases, thus fixing the horrors of DDoS protection.
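One possible policy, purely as an assumption since the text does not say how work should scale: demand one extra bit of work each time load doubles past capacity. Each extra bit doubles the client's expected effort. The function name, parameter names, and default values are all invented.

```python
import math

def difficulty_bits(load, capacity, base_bits=8, max_bits=32):
    """Return how many bits of work to demand at the current load."""
    if load <= capacity:
        return base_bits  # no overload: minimum work
    extra = math.ceil(math.log2(load / capacity))
    return min(max_bits, base_bits + extra)
```

Under this policy a server at eight times capacity asks for three more bits, so the cost of each connection attempt rises by a factor of eight.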

8.1 Key agreement

Key agreement needs to be part of the TCP replacement handshake, rather than a layer on top, to reduce round tripping.

The name system needs to be integrated with the key system, so that you get the key when you get the network address associated with the name, and the key/name pairing needs to be blockchain secured, so you don’t have one thousand certificate authorities each with the authority to mount a man in the middle attack.

8.2 replacement handshake for publicly identified server

The TCP replacement handshake needs to be a four phase handshake.

  1. Client->Server: Give me a connection, here are my parameters, here is my session key.

  2. Server->Client: Here is a proof of work request, my parameters, and a keyed hash of your and my parameters. Ask again with proof of work, the same parameters, and the keyed hash.

    Server then throws away the request, allocating no memory.

  3. Client->Server: OK, here I am again, with all that stuff you asked for.

    This includes a konce (key used once, single use elliptic point), and assumes that the client reliably knows the server public key in advance. This protocol is inappropriate to signons that are restricted to identified entities, because we probably do not want everyone to know who is identified.

  4. Server checks the poly1305 authentication to ensure that this is a real client reply to a real and recent server reply. Then it checks the proof of work.

    If the proof of work passes, Server allocates memory, generates and stores a session key, and stores connection parameters, the client and server session keys among them.

  5. Server->Client: OK, here is my session key, authenticated but not signed by my permanent key, and stuff, now you can start sending actual data.
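The stateless part of steps 2 to 4 can be sketched as follows. This is not the proposed wire format. HMAC-SHA256 stands in for the keyed hash, the proof of work is a hashcash style search for leading zero bits, and every function name is an assumption. Key exchange and the recency check are left out.

```python
import hashlib
import hmac
import os
import struct

SERVER_SECRET = os.urandom(32)  # never leaves the server

def _tag(client_params, server_params, bits):
    # Length prefix keeps the boundary between the two parameter blobs unambiguous.
    body = struct.pack(">BH", bits, len(client_params)) + client_params + server_params
    return hmac.new(SERVER_SECRET, body, hashlib.sha256).digest()

def _work_ok(tag, nonce, bits):
    digest = hashlib.sha256(tag + struct.pack(">Q", nonce)).digest()
    return int.from_bytes(digest, "big") >> (256 - bits) == 0  # top `bits` bits are zero

def server_challenge(client_params, server_params, bits):
    """Step 2: answer with a keyed hash, then forget the client entirely."""
    return bits, _tag(client_params, server_params, bits)

def client_solve(tag, bits):
    """Client side of step 3: search for a nonce that meets the difficulty."""
    nonce = 0
    while not _work_ok(tag, nonce, bits):
        nonce += 1
    return nonce

def server_accept(client_params, server_params, bits, tag, nonce):
    """Step 4: check the keyed hash first (cheap), then the work."""
    if not hmac.compare_digest(_tag(client_params, server_params, bits), tag):
        return False
    return _work_ok(tag, nonce, bits)
```

The server can recompute the tag from the returned parameters and its own secret, so it stores nothing between step 2 and step 4. Memory is allocated only after `server_accept` returns True.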

Thus we can integrate TCP handshake and encryption hand shake and the innumerable DDoS protection handshakes “Cloudflare is checking your browser, oops, your browser did not pass, here is a captcha” at the cost of one single additional trip, half a round trip.

Instead of the person establishing the connection fuming while round trip after round trip goes through, we get all that stuff at the cost of one additional half round trip.

8.2.1 pow implementation

Each sequential proof of work request contains a 64 bit sequential integer. The integer starts at a random 63 bit value, to ensure that every possible successful proof of work ever used is unique in the universe. The sequential integer is treated as a windowed value into a 512 bit integer, whose high order part is an unshared secret that remains unchanged for the duration.

From that 512 bit value, the server generates a unique XChaCha20 512 bit value, 256 bits of which are used to generate a Poly1305 authenticator for the proof of work request. If it receives a completed proof of work request containing the authentication, it knows it comes from an entity at that network address that was able to receive the proof of work request. Knowing it is talking to real network addresses, it can derank network addresses that create excessive burdens, so that they cannot slow down everyone else, only themselves.
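A sketch of the derivation using only the Python standard library, which has neither XChaCha20 nor Poly1305. BLAKE2b stands in for the XChaCha20 expansion and a truncated HMAC-SHA256 stands in for the Poly1305 authenticator. Both substitutions are mine, as is the class name `PowIssuer` and the choice to authenticate the client's network address.

```python
import hashlib
import hmac
import secrets

class PowIssuer:
    def __init__(self):
        self.secret_high = secrets.token_bytes(56)  # 448 secret high order bits
        self.counter = secrets.randbits(63)         # random 63 bit starting point

    def _expand(self, seq):
        """512 bit state (secret || 64 bit counter) -> 512 pseudo random bits."""
        state = self.secret_high + seq.to_bytes(8, "big")
        return hashlib.blake2b(state, digest_size=64).digest()

    def authenticator(self, seq, client_addr):
        mac_key = self._expand(seq)[:32]            # 256 of the 512 bits key the MAC
        return hmac.new(mac_key, client_addr, hashlib.sha256).digest()[:16]

    def issue(self, client_addr):
        """Return (sequence number, authenticator) for a new proof of work request."""
        self.counter += 1
        return self.counter, self.authenticator(self.counter, client_addr)

    def verify(self, seq, client_addr, tag):
        return hmac.compare_digest(self.authenticator(seq, client_addr), tag)
```

Because the authenticator is recomputed from the secret and the sequence number, the server keeps no record of outstanding requests. A tag that verifies for an address shows that the request was delivered to that address.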

When it receives the completed proof of work, it first checks the sequence number to ensure it is a recently issued request for work, then checks if there is already a channel allocated for that pow, using a table of doubly linked lists of recently allocated channels, indexed by the low order part of the pow sequence number. If it discovers it has already passed that proof of work and allocated a channel, it moves that proof of work to the head of the list, so that the next check will be instant, just in case it is about to receive a million copies of that proof of work. Then it checks for revealed bits from those generated by XChaCha20. Then it checks the work and the Poly1305 authentication.
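A sketch of the first two checks: recency, then the table of recently allocated channels. Python's OrderedDict is used as a stand-in for each doubly linked list. It also gives hashed lookup, so the move to front here only illustrates the ordering and is not needed for speed. The bucket count, window size, and names are assumptions.

```python
from collections import OrderedDict

class ChannelTable:
    def __init__(self, buckets=256, window=1 << 16):
        self.buckets = [OrderedDict() for _ in range(buckets)]
        self.window = window  # how many sequence numbers back still count as recent

    def _bucket(self, seq):
        return self.buckets[seq % len(self.buckets)]  # low order part of the sequence number

    def is_recent(self, seq, newest_issued):
        return 0 <= newest_issued - seq < self.window

    def lookup(self, seq):
        """Return the channel already allocated for seq, moving it to the head."""
        bucket = self._bucket(seq)
        if seq in bucket:
            bucket.move_to_end(seq, last=False)
            return bucket[seq]
        return None

    def allocate(self, seq, channel):
        bucket = self._bucket(seq)
        bucket[seq] = channel
        bucket.move_to_end(seq, last=False)  # newest allocation goes to the head
```

A replayed proof of work is answered from the head of its list without repeating the work or authentication checks, which covers the million copies case.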

Checking if there is already a channel allocated overlaps and intersects with presence notification protocol. We want to have a very large number of inactive presences without secrets or network addresses in the database, a large number of long lived active presences in memory, with secrets that are not paged to disk (sodium_allocarray), and considerably smaller number of considerably shorter lived channels with flow control and buffering. A presence can only exchange short messages that fit in one packet, and only one message can be active in any round trip time. You open a presence, and the presence can then open a channel.

We probably want to do the checks in whatever order is empirically most efficient for type of DDoS attacks that we encounter in practice, the most common probably being garbage random values that bear no particular resemblance to valid connection attempts.

The next problem will be valid connections that then make excessive demands. These get deranked by the next layer, and they will then have to make a new connection, which will face increasing pow and discrimination against their network address.

8.3 replacement handshake for limited circulation server

In this case the server is the gateway for a group, possibly many groups, whose unique id is not widely known. It is analogous to a closely kept email address.

The TCP replacement handshake needs to be a four phase handshake.

  1. Client->Server: Give me a connection, here are my parameters, here is a clue about what private group I want to connect to.

  2. Server->Client: Here is a proof of work request, my parameters, including a use once elliptic point, and a keyed hash of your and my parameters. Ask again with proof of work, the same parameters, and the keyed hash.

    Server then throws away the request, allocating no memory.

  3. Client->Server: OK, here I am again, with all that stuff you asked for.

    At this point, client has given server a clue about which private group it wants to connect to, and server has given client a clue about which private group it expects membership of, and therefore what public key the client should attempt to communicate with.

  4. Server checks the keyed hash to ensure that this is a real client reply to a real and recent server reply. Then it checks the proof of work.

    If the proof of work passes, Server allocates memory

    Then it generates a transient secret from the konces (keys used once, single use elliptic points), and uses it to decrypt the client durable public key, verifying that the client does indeed know the transient scalar. If the client durable key is OK, sign on allowed, it constructs a shared secret from all four keys, the sum of two secrets multiplying the sum of two elliptic points, and we now have an encrypted stream associated with the port number and network addresses.
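Why both sides arrive at the same secret can be written out in a few lines. The symbols are my own labels, not from the text: G is the curve base point, c_d and c_t are the client durable and transient scalars, s_d and s_t are the server's, and each capital letter is the matching public point.

```latex
\begin{align*}
C_d &= c_d G, \quad C_t = c_t G, \quad S_d = s_d G, \quad S_t = s_t G \\
K_{\text{client}} &= (c_d + c_t)\,(S_d + S_t) = (c_d + c_t)(s_d + s_t)\,G \\
K_{\text{server}} &= (s_d + s_t)\,(C_d + C_t) = (s_d + s_t)(c_d + c_t)\,G \\
K_{\text{client}} &= K_{\text{server}}
\end{align*}
```

Each side multiplies the sum of its own two secrets by the sum of the other side's two points, so computing the secret requires knowing a durable scalar and a transient scalar together.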

9 Summary of the replacement

Thus we can integrate TCP handshake and encryption hand shake and the innumerable DDoS protection handshakes “Cloudflare is checking your browser, oops, your browser did not pass, here is a captcha” at the cost of one single additional trip, half a round trip.

Instead of the person establishing the connection fuming while round trip after round trip goes through, we get all that stuff at the cost of one additional half round trip.

10 messages, not streams

TCP sockets are designed for synchronous procedural programming, on machines with very limited memory processing limitless streams. They are now almost always used for message processing from event oriented asynchronous code, with a messaging layer on top of the endless stream layer. The replacement needs to have application layer sending messages and receiving messages in events. The application layer should not have to deal with sockets and streams. Rather, it sends a message to destination identified by its durable public key, and gets a reply, where the reply might be that the socket could not be opened, or that the socket was open but the reply timed out, among other things. When sending a message, there is a time to wait for response before giving up, and a time for the socket that may be created to live idle.
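What the application would see can be sketched with asyncio. The `transport` argument is a hypothetical coroutine that hides sockets entirely, and the three outcomes are only the ones named above. The idle lifetime of the underlying socket is not modelled.

```python
import asyncio
from enum import Enum

class Outcome(Enum):
    REPLIED = "replied"
    NO_CONNECTION = "socket could not be opened"
    TIMED_OUT = "socket was open but the reply timed out"

async def send_message(transport, dest_key, payload, reply_timeout):
    """Send payload to the holder of dest_key and wait for one reply.

    transport(dest_key, payload) is an assumed coroutine that returns the
    reply bytes or raises ConnectionError.
    """
    try:
        reply = await asyncio.wait_for(transport(dest_key, payload), reply_timeout)
    except asyncio.TimeoutError:
        return Outcome.TIMED_OUT, None
    except ConnectionError:
        return Outcome.NO_CONNECTION, None
    return Outcome.REPLIED, reply
```

The caller names a public key and a time to wait, and gets back an event. It never sees a socket or a stream.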

11 Proposed replacement

QUIC is the current TCP replacement. It is the transport underneath HTTP/3.

We have no alternative but to interface to the vast HTTP/2 and HTTP/3 ecosystem. The wallet is going to have to talk as a client to legacy HTTP/3 server devices, and accept their CA certificates, preferably subject to Zooko scrutiny, and legacy HTTP/3 client devices are going to have to talk to our wallet (after their wallet has downloaded a Zooko based certificate from the server wallet).

Talking HTTP/3 means being wide open to DDoS attack, so that you are forced to use Cloudflare. When a device with our version of QUIC talks to another device with our version of QUIC, it has to implement our DDoS resistance, and Zooko in place of CA. But when it talks to a legacy HTTP/3 device, it has to lay itself wide open to DDoS attack and CA interception.

Backwards compatibility with insecure systems always creates a massive security hole. On the one hand, every build from scratch project dies. On the gripping hand, every attempt to do fax over the internet failed and was eventually replaced by pdf attachments to email. Backwards compatibility was simply too crippling, and backwards compatibility with QUIC is going to cripple security.

Instead of putting the secure system transparently as an alternate protocol within the insecure system, you non transparently put the insecure system as a downgrade protocol within the secure system, which means our version of QUIC simply is not going to talk to older versions of QUIC unless you take some special measures to tell it to do so or enable it to do so for that particular communication end point.

The least friction interface would be that every time a new SSL name is encountered, we get a window saying “This authority claims that this is this entity. Trust this authority for this entity?” And if there is a change of authority, complain. Wrap backwards compatibility in Zooko vouched certificates, pinned certificates, and the CAA record indicating who is the right issuer for the SSL certificate.

We have to have downgrade capability, but it has to be an afterthought, slipped in as a special path and special case, as user friendly as possible, but no friendlier.

QUIC’s one way streams are messages.

Its two way streams are backwards compatibility with TCP

It solves the long fat pipe problem with flexible window size.

It puts multiple objects and messages in one stream, so that one message does not have to wait for lost packets in another message to be resolved.

TCP flow control is constructed around pushback - that the sender should not send data faster than the receiver is able and willing to handle it. Normally there is one thread, or pool of threads, handling the data received. To prevent DDoS, we should probably only have one unit of pushback per pair of network addresses. If someone has a slow receiver thread pool, and a fast receiver thread pool communicating with the same machine, he needs to break the slow receiver communication into lots of small requests and replies, hence one channel per pair of network addresses.

QUIC implements everything you need to have one channel per pair of network addresses, multiplexing many request-replies into a single stream, many channels in one channel, but does not in fact implement one channel per pair of network addresses in the sense of one unit of packet flow control and one unit of DDoS monitoring, per pair of network addresses.

Finer grained flow control should be implemented as request reply on messages that may well be much larger than a packet, but much smaller than memory.

In the request reply model, if the requests and replies are reasonably short, pushback does not matter, and becomes a representation of flow control. It is seldom sane to download enormous blocks of data as a single message, and we probably just should not do it - restrict replies to what can reasonably fit into memory, so that a very large message that the receiver is processing one chunk at a time has to get acks of its submessages, separate from the flow control system.

What the LEMP stack does with request headers is dynamically allocate 8KiB buffers, stuff headers into a part or whole of an 8KiB buffer, and if a header is bigger than 8KiB, arbitrarily truncate it, which suggests that this is a tactic to minimize the overheads of dynamically allocating many moderate sized buffers of variable size. Experimenting, I find that dynamic allocation tends to be the major cost in many programs, but if you do it LEMP style, dynamic allocation is unlikely to be a significant cost.

QUIC has a pile of feature bloat:

It suffers from the SSL/TLS problem of a thousand CA authorities, NSA friendly encryption, and, being funded in large part by Cloudflare, has no substantial defense against DDoS.

It fails to support rendezvous routing.

But, it has already struggled with and solved a thousand problems whose solutions I have been confusedly struggling with. So the obvious solution is to adopt QUIC, rip out the domain name system, add DDoS resistance, and rip out NSA friendly encryption in favour of the standard and recommended Libsodium packet encryption (XChaCha20-Poly1305). For immortality, rip out the 62 bit compressed integers in favour of unlimited precision windowed integers (with a negotiated limit on precision that will in practice always be 64 bits for the next several centuries).

XChaCha20 is not the fastest on a long stream, but it has key agility, can encrypt arbitrary length values, including a single bit, and is as fast as ChaCha20 without any limits on the nonce.

QUIC’s messaging is excessively married to HTTP. We need a generic messaging system where every message has a short number indicating the destination handler, and you can generate a handler, a code continuation, and get a number assigned to it on the fly, so that you can send a message, and the reply goes to your code continuation.
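
A minimal sketch of such a handler table, assuming one shot handlers and sequentially assigned numbers. The class and method names are hypothetical, and a real table would also need handler lifetimes and the windowed numbering described later.

```python
import itertools

class HandlerTable:
    """Maps short handler numbers to code continuations (hypothetical sketch)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._handlers = {}

    def register(self, continuation):
        """Assign a fresh short number to a continuation and return it."""
        handler_id = next(self._next_id)
        self._handlers[handler_id] = continuation
        return handler_id

    def dispatch(self, handler_id, payload):
        """Deliver a reply to its continuation. Handlers are one shot here."""
        continuation = self._handlers.pop(handler_id, None)
        if continuation is None:
            return False          # unknown or already used handler number
        continuation(payload)
        return True

table = HandlerTable()
replies = []
reply_to = table.register(replies.append)       # number sent along with the request
delivered = table.dispatch(reply_to, b"pong")   # the reply arrives carrying that number
```

The request carries `reply_to`, the peer echoes it in the reply, and the reply lands in the continuation without the application ever seeing a socket.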

We need to lift as much of the QUIC design as possible, and also make things act much like TCP, so that existing NATs will not notice anything has changed. Thus packets will continue to be sent to and from a widely known port that is usually below 1024 on the server, from a random port on the client in the range 49152–65535. A connection will continue to require a three phase handshake which creates a socket, albeit our sockets will be very different.

With a rendezvous, both peers will use the same port, in the range 1024-49151.

The rendezvous handshake will look like the TCP handshake Syn Syn-Ack Ack, but they will both send syn packets, both send syn-ack packets, and both send ack packets. Their syn packets will be timed so that, if the timing is done right, both are sent just before the other peer’s packet is expected to be received.

Our sockets will always have a shared secret associated, which proves identity and enables encrypted communication, but which cannot be used to prove identity to a third party. The initial handshake will exchange transient secret keys, which will generate a transient durable secret, which is used to encrypt the exchange of durable secret keys, which establish a shared secret based on both the durable and the transient keys, establishing forward secrecy, and failing to establish identity to third parties.

Since setting up a shared secret is costly, this creates the opportunity for syn flood attacks. Therefore the syn-ack will always be a syn cookie, structured rather like existing syn cookies, a cryptographic hash of the syn based on an unshared secret known only to the server. It will always have a proof of work request, which may be zero, and it will have a list of supported protocols if the protocol proposed in the initial syn is unacceptable. The proof of work will be that the hash of the client ack must have a certain number of zeros, and the ack must contain the cryptographic cookie, and the data that the server checks the cookie against.
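
A minimal sketch of a stateless cookie of this kind, assuming HMAC-SHA256 as the keyed hash and a hypothetical set of syn fields (address, port, protocol identifier, server time). The field layout, the maximum age, and the function names are assumptions for illustration. The point is only that the server stores nothing between the syn and the ack.

```python
import hashlib
import hmac
import secrets

SERVER_SECRET = secrets.token_bytes(32)   # unshared secret, known only to the server
MAX_AGE = 30                              # seconds a cookie stays valid (assumed value)

def make_cookie(client_addr, client_port, protocol_id, server_time):
    """Keyed hash over what the server saw in the syn. No state is kept."""
    data = f"{client_addr}|{client_port}|{protocol_id}|{server_time}".encode()
    return hmac.new(SERVER_SECRET, data, hashlib.sha256).digest()

def check_cookie(cookie, client_addr, client_port, protocol_id, server_time, now):
    """Recompute the hash from the data the client echoed back in the clear."""
    if not 0 <= now - server_time <= MAX_AGE:
        return False                      # too old, or claims to be from the future
    expected = make_cookie(client_addr, client_port, protocol_id, server_time)
    return hmac.compare_digest(cookie, expected)

cookie = make_cookie("192.0.2.7", 50000, 1, 1000)
```

Only after `check_cookie` passes, and the proof of work passes, would the server allocate memory for the connection.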

TCP was designed around the case of the client sending an endless stream of characters, typed with one finger, to a program on the server. We are going to design around message response, with responses not necessarily returning in order.

The client sends a message from a durable public key to a durable public key. The creation and destruction of such connections is not tightly linked to messaging. If a connection exists, it is used. If it does not exist, it is created. It may be torn down after a while of being unused, but the tear down is not tightly linked to message completion.

In TCP a count is kept of bytes sent and bytes received, with a syn or a fin counting as one byte.

We need a count for each packet, since packets can arrive out of order, repeated, or missing. The count values will be sequential nonces for the encryption, and will start at one. As the count can potentially grow quite large, the count value will be windowed, but, unlike TCP, the windowed count represents a potentially much larger absolute count known by both ends.

Negotiating a window size is hard, since you do not really know in advance what window size will be needed. The thirty two bit window is adequate for all normal uses, but fails in special and important uses.

We will specify the window size in each packet, with the high order bit of each byte in the nonce indicating whether there is another seven bits in the nonce window, so that we can dynamically adjust the window size. We dynamically adjust the window size to big enough to exclude ambiguity. Which for the first 128 packets, and on a connection that is not very busy, all packets, will be seven windowed count bits and one window size bit.
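
A minimal sketch of that encoding, assuming the least significant seven bits go first (the byte order is my assumption, the text does not fix it). The high bit of each byte says whether another seven bits follow, so the number of bytes is the window size in unary.

```python
def encode_windowed(count, groups):
    """Encode the low 7*groups bits of count, least significant group first."""
    if groups < 1:
        raise ValueError("need at least one group of seven bits")
    out = bytearray()
    for i in range(groups):
        seven = (count >> (7 * i)) & 0x7F
        more = 0x80 if i < groups - 1 else 0x00   # high bit: another group follows
        out.append(more | seven)
    return bytes(out)

def decode_windowed(data):
    """Return (windowed value, window mask, bytes consumed)."""
    value = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:                       # last group reached
            mask = (1 << (7 * (i + 1))) - 1
            return value, mask, i + 1
    raise ValueError("truncated nonce")

one_byte = encode_windowed(5, 1)
two_bytes = encode_windowed(300, 2)
```

On a quiet connection every packet carries the one byte form. The sender widens the window by adding groups only when the count in flight could otherwise become ambiguous.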

The window needs to be large enough to exclude the ambiguity of delayed and duplicated packets wandering in late, so has to be several times larger than the difference between the most recently acked value, and the value that will fill the reception window. Thirty two times larger should be ample. At the start, there are no early packets capable of wandering in late, so big enough to hold the full count always suffices.

If a represents a recent nonce, n represents the nonce, w represents the windowed nonce, and M represents the window mask, communicated in each packet in unary, then:

w = n&M

n = (w − a)&M + a

We use a window large enough to give the same answer on both the most recently acked nonce, and the most recently sent nonce.
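
The two formulas above can be checked directly. A minimal sketch, with hypothetical function names. It relies on Python's `&` treating negative integers as two's complement, which is what makes `(w - a) & M` come out as the distance from a to n modulo the window size.

```python
def window(n, mask):
    """w = n & M: the part of the nonce actually sent on the wire."""
    return n & mask

def reconstruct(w, a, mask):
    """n = ((w - a) & M) + a: recover the full nonce from a recent nonce a."""
    return ((w - a) & mask) + a

a = 1000                     # a recent nonce known to both ends
mask = 0x7F                  # seven bit window
recovered = reconstruct(window(1005, mask), a, mask)

# Outside the window the answer is wrong, which is why the window has to be
# large enough to cover everything that could still be in flight.
ambiguous = reconstruct(window(a + 128, mask), a, mask)
```

Here `recovered` is 1005, while `ambiguous` comes back as 1000 rather than 1128, because 1128 and 1000 have the same low seven bits.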

The nonce will serve the dual purpose of enabling the decryption of each packet, and flow control. Each packet has a sequential nonce, and we make sure all packets are acked. Nonces on packets coming from the client refer to a different shared secret than nonces on packets coming from the server.

11.1 API

To send a message, you will construct a response handler if you are expecting a response, and then call the api with a network address, a public key of the recipient, an identifying secret key and public key of the sender, a timeout for attempting to connect, and flags permitting direct connection, rendezvous connection, retransmit, and store and forward. If a response is expected for the message, give the expected lifetime for the response handler, a nonce for the response handler, and a class identifier for the nonce. (The nonce only has to be unique within the class.) You will probably use a different nonce population for messages that have to be handled promptly, messages that have to be handled within a session, and non volatile nonces that survive between sessions. Nonce populations can be windowed per class identifier, with a window large enough to accommodate the timeout, and a different class identifier for volatile and non volatile nonces. The nonce is used once within a window and within a class, but can be re-used in another class and another window.

The application code is event oriented, like gui code. It is driven by a message pump, with constructors creating event handlers, and the events driving the event handlers through the message pump. An event handler, on being fired, creates new event handlers and fires old event handlers.

When the application needs to perform a task that spans many events, it does not call yield or await, but instead the event handler for each event constructs or enables the next event handler. If it needs to push information onto a stack between events, it has its own explicit stack for its own multi event task, or creates a linked list of event handlers. Non volatile event handlers must be trivial C++ classes, and therefore cannot contain an std::stack.

State that would be on the stack in synchronous code is in the event handler in asynchronous code. This potentially gets messy if you are processing an endless stream of structured data whose structure is orthogonal to message boundaries. Since we allow arbitrary length messages, don’t do that.
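
A minimal sketch of a task that spans two events without yield or await. The message pump, the event names, and the class names are all hypothetical. The state that synchronous code would keep on the call stack (the parts collected so far) lives in the handler object, and each handler enables the next one.

```python
from collections import deque

class MessagePump:
    """Tiny one shot event dispatcher (hypothetical, for illustration only)."""

    def __init__(self):
        self._queue = deque()
        self._handlers = {}

    def on(self, event_name, handler):
        self._handlers[event_name] = handler

    def post(self, event_name, data=None):
        self._queue.append((event_name, data))

    def run(self):
        while self._queue:
            name, data = self._queue.popleft()
            handler = self._handlers.pop(name, None)   # handlers fire once
            if handler is not None:
                handler(data)

class FetchTwoParts:
    """A multi event task: its state is in the object, not on the call stack."""

    def __init__(self, pump, done):
        self.pump = pump
        self.done = done
        self.parts = []
        pump.on("part1", self.got_part1)

    def got_part1(self, data):
        self.parts.append(data)
        self.pump.on("part2", self.got_part2)          # enable the next handler

    def got_part2(self, data):
        self.parts.append(data)
        self.done(b"".join(self.parts))

results = []
pump = MessagePump()
FetchTwoParts(pump, results.append)
pump.post("part1", b"hello ")
pump.post("part2", b"world")
pump.run()
```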

Notification of message failure may occur any time within the lifetime of the response handler, but will mostly happen within the timeout for attempting to connect.

The usual flow of control will be create an event handler, assign a nonce to it (fire it) and then it gets triggered when the event actually happens, and is then usually destroyed. Events will usually create and fire new events and trigger events that existed before they were created, rather than changing their state.

Below the api, additional messages, using low numbered message response classes, may be constructed for encryption and flow control. If an encrypted connection exists, it will use that without constructing additional messages. If it does not exist, it will construct it.

Constructing an encrypted connection provides perfect forward secrecy between one connection and the next by generating new random session keys each time.

11.2 Reliability and flow control

TCP achieves reliable transmission with acks and nacks.

The original design simply acked that all bytes (not exactly bytes, because acks and nacks are counted) had been received up to a certain byte. If the transmitter has transmitted stuff, and not received an ack for what it transmitted, it sends a nack after a timeout. The receiver may resend acks.

This mechanism worked fine on short thin pipes, but if you have a million packets in flight, and packet three hundred thousand gets lost, you then have to resend seven hundred thousand to replace one packet. So the duplicate ack possibility was tortured to create a half assed version of selective acknowledgment. If the receiver receives packet 100, and 101, but not packet 99, it sends duplicate acks for packet 98. If the sender receives three duplicate acks for packet 98, it retransmits packet 99. (Two duplicate acks could be just the normal randomness.)

QUIC, however, has a fix for this built in.

Obviously true selective acknowledgment is better. The receiver acks the most recent received packet, and sends a list of missing packets prior to this (it acks a windowed value for the most recent packet, and the difference between packet nonces for missing packets). The sender resends the missing packets, except for the most recent missing packets. If they are still missing, they will be caught on the next ack.
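
A minimal sketch of both sides of that exchange, using full packet numbers rather than windowed values and differences to keep it short. The function names and the `reorder_slack` threshold (how recent a gap has to be before the sender leaves it for the next ack) are my assumptions.

```python
def build_ack(received, lowest_unacked):
    """Receiver side: newest packet seen, plus every gap before it."""
    newest = max(received)
    missing = [n for n in range(lowest_unacked, newest) if n not in received]
    return newest, missing

def to_resend(newest, missing, reorder_slack=3):
    """Sender side: skip gaps close to newest, they may merely be late."""
    return [n for n in missing if newest - n > reorder_slack]

received = {1, 2, 4, 5, 7, 10}
newest, missing = build_ack(received, lowest_unacked=1)
resend_now = to_resend(newest, missing)
```

With these packets the ack reports 10 as newest and 3, 6, 8, 9 as missing. The sender resends 3 and 6 at once, and leaves 8 and 9 to be caught by the next ack if they are still missing.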

In each ack, the receiver tells the sender how much more data it can receive before it sends the next ack. This prevents the receiver from being flooded, but a more common problem is the pipe being flooded.

To handle pipe flooding, the sender has a timer. If it sends stuff, and does not get an ack, it backs off, sets the timer to a slower rate, and retransmits with a nack. The initial value of the timer is smoothed RTT + max(G, 4*RTTvariance), where G is the clock granularity.
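
A minimal sketch of that timer, following the usual TCP retransmission timer calculation (RFC 6298): smoothing gains of 1/8 and 1/4, an initial timeout of one second, and doubling on each timeout. The class name and the default granularity are my choices, and the one second lower bound that the RFC recommends is left out to keep the arithmetic visible.

```python
class RetransmitTimer:
    ALPHA = 1 / 8     # gain for the smoothed round trip time
    BETA = 1 / 4      # gain for the round trip time variance

    def __init__(self, granularity=0.001):
        self.g = granularity      # clock granularity G, in seconds
        self.srtt = None          # smoothed RTT
        self.rttvar = None        # RTT variance estimate
        self.rto = 1.0            # timeout before any measurement

    def on_rtt_sample(self, r):
        """Fold a new round trip measurement into the estimate."""
        if self.srtt is None:
            self.srtt = r
            self.rttvar = r / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        self.rto = self.srtt + max(self.g, 4 * self.rttvar)

    def on_timeout(self):
        """No ack arrived in time: back off by doubling the timer."""
        self.rto *= 2

timer = RetransmitTimer()
timer.on_rtt_sample(0.1)      # first sample: srtt 0.1, rttvar 0.05
first_rto = timer.rto
timer.on_timeout()
backed_off_rto = timer.rto
```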

TCP flow control focuses on getting a segment complete and acknowledged, so it can move on to the next segments. It may have a great many packets in flight, but does not have too many segments in flight. The backoff algorithm is linked with the push segments algorithm. You only push the segment the receiver has asked for in his previous acknowledgment. So you typically have the segment you are finalizing, the segment that is in flight, and the segment that the receiver asked for.

The algorithm is that the sender gets an ack that acknowledges what the receiver has received, and tells the sender how much more the receiver can receive. Whereupon the sender resends anything missing, and resumes pushing new stuff up to the limit that the receiver has specified, spread out roughly evenly over the timer period. Which implies that the receiver should ask wisely, as well as the sender send wisely.

Implementing our own flow control sounds like a lot of work. Need to lift QUIC’s flow control, and drop our own encryption and attack resistance into it, while letting it worry about flow control. I can hack into its library, while I cannot hack into the TCP library.

I have been analysing how TCP works, with a view to what needs fixing. Time to analyse how something works for which I have a library and example code.

Best (because smallest and least married to HTTP/3) is picoquic.

The TCP state machine assumes that the server opens a connection on receiving a syn, sends a syn-ack to the client, whereupon the client acks the connection. But if we are using syn cookies, we are using a different state machine, where the connection is in fact only opened on receiving the server syn-ack cookie in the client ack. So the server has to acknowledge the connection, which would make it a four step handshake instead of a three step handshake. To avoid this, we have a rule that the client only opens a connection when it has data ready to send. It then gets a server cookie, and sends the cookie-ack with some data, which data the server acks.

With the cookie ack, we get a round trip time and offset between server steady time and client steady time. If we see unstable round trip times, we suspect the pipe is overloaded, and back off our estimate of max bandwidth. For flow control, we maintain an estimate of pipe length and width. Sudden pipe widenings indicate an overflow condition, because pipes may respond to overflow by massively discarding packets, or massively backing up packets, or quite possibly both. We maintain a probability estimate of the pipe behaviour.

11.3 Outline protocol

A packet protocol that establishes an encrypted connection on top of unreliable packets with minimal round trips without increasing fragility to DoS.

For servers, public keys, globally human readable names, the key owning the name, and the temporary key signed by the key owning the name, will usually be public and widely known, but this also supports the case of communication where this information is only known to the parties, and the server does not want to make the connection between a network address and a public key widely known.

To establish a connection, we need to set a bunch of values specific to this particular channel, and also create a shared secret that eavesdroppers and active attackers cannot discover.

The client is the party that initiates the communication, the server is the party that responds.

I assume a mode that provides both authentication and encryption – if a packet decrypts into a valid message, this shows it originated from an entity possessing the shared secret. This does not provide signing – the recipient cannot prove to a third party that he received it, rather than making it up.

For the moment I ignore the hard question of server key distribution, glibly invoking Zooko’s triangle without proposing an implementation of the other two points and three sides of the triangle or a solution to the problem of managing distributed reputations in Zooko’s triangle.  (Be warned that whenever people charge ahead without solving the key distribution problem, the result is a disaster.)

Client 🠆 Server: Equivalent to the syn of the three phase TCP handshake.

Client’s network address and port on which client will receive packets, protocol identifier, and client steady time that the message was sent.

If the requested protocol is not OK, we go into protocol negotiation, server responds with a list of protocols and protocol versions that it will accept, in the form of a list of lists of numbers.

Assuming it is OK, which it probably will be, server allocates nothing, prepares nothing, but sends the equivalent of a TCP syn-ack cookie, containing, among other things, a cryptographic hash of the information that was received and sent, based on a private secret known only to the server. It sends a transient public key, which changes every few minutes or so, plus a short windowed id for that transient public key, and a demand for proof of work, which may be zero. The proof of work is that the client’s ack, equivalent of the third phase of the TCP handshake, has to hash to a value ending in n zero bits, where n may be zero.

This cryptographic hash based on an unshared secret will be sent to client, and then back to server, unchanged. Its function is to avoid the necessity for the server to allocate memory or perform asymmetric cryptographic operations for a client that has not yet validated. Instead the state information is sent back and forth.

  1. Server 🠆 Client: Equivalent to the syn-ack of the three phase TCP handshake.

    Cryptographic hash based on unshared secret, server steady time, transient public key, server windowed identifier of server transient public key, proof of work demand, and any channel parameters.

    The proof of work is trivial if the server is not under load, but is increased as the server load approaches the maximum the server is capable of, in order to throttle demand.

    Client computes transient handshake shared secret as its transient private key times the server shared transient public key. It returns in the clear a copy of the cryptographic hash that the server sent to it, the data in the clear needed to validate the hash, performs the proof of work, and sends its public key, which may be a per server durable public key, always used when accessing this server on this identity, encrypted using the transient key, and the public key it wants to talk to on the server.

    Subsequent information is not encrypted using the transient keys, but using the sum of transient plus secret keys.

    This implies that the client has to know the public key that the server is using, which may be a key signed by the master public key that owns the name authorizing that new key, which key changes about as often as the server IP changes, and is therefore distributed in the same channel as the network address associated with global human names is distributed. If the client gets it wrong, then the server ignores the information encrypted to the wrong public key, and responds with the authentication of its new public key, signed by the master public key of its globally unique name, encrypted using the transient secret – this is usually public information, but since by this point we have established a shared secret and allocated memory, might as well send it securely, for sometimes it is going to be private information.

  2. Client 🠆 Server: Equivalent to the final ack of the three phase TCP handshake.

    Sends in the clear the server hash as received, any data needed to reconstruct the hash, and its transient public key. Then, encrypted to transient keys, the hash of the identifier of the public key it wants to talk to, its durable public key, and client steady time at which this was sent, so that both sides have an estimate of the round trip time and the offset between server steady time and client steady time.

    Server checks the proof of work, checks the cryptographic hash against the data in the clear, then creates an entry in its hash table for this connection, with the shared secret being the transient keys plus the public keys.

We have two protocols, one for the authenticated phase, and one for unauthenticated phase. The client has to know one of the unauthenticated protocols offered by the server, or else protocol negotiation will fail in the abnormal case that protocol negotiation is needed. Normally there will only be one protocol for secured but unauthenticated communication during setup, but we make provision by having two protocols, trivially different, and three protocols, trivially different for the authenticated phase.

You will notice that the server only allocates memory and performs asymmetric encryption computation after the client has successfully performed proof of work and shown that it is indeed capable of receiving data sent to the advertised network address.
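
A minimal sketch of the proof of work described above, where the client’s ack has to hash to a value ending in n zero bits. SHA-256, the eight byte counter appended to the ack, and the function names are my assumptions. The asymmetry is the point: the client searches, the server does one hash.

```python
import hashlib
from itertools import count

def trailing_zero_bits(digest):
    """Number of zero bits at the low end of the hash value."""
    value = int.from_bytes(digest, "big")
    if value == 0:
        return len(digest) * 8
    return (value & -value).bit_length() - 1   # index of the lowest set bit

def solve(ack_body, n):
    """Client side: append a counter until the hash ends in n zero bits."""
    for counter in count():
        candidate = ack_body + counter.to_bytes(8, "big")
        if trailing_zero_bits(hashlib.sha256(candidate).digest()) >= n:
            return candidate

def verify(ack, n):
    """Server side: a single hash, done before any memory is allocated."""
    return trailing_zero_bits(hashlib.sha256(ack).digest()) >= n

ack = solve(b"cookie|data in the clear", 8)    # about 256 tries on average
```

With n set to zero every ack passes, which is the normal case when the server is not under load.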

In the normal case, the client requests one way authenticated encryption in the syn, where the server authenticates but the client does not, and the server may, and usually will, offer in the syn-ack only two way authenticated encryption, where the client provides an identity unique to that server and user’s current default name, but which cannot be used to identify the default name, nor the same user accessing a different website. This allows the server to see that the same user is accessing different resources, how many uniques the server has, and what each unique is doing, but does not enable the servers to put their heads together and see that the same user is doing things on one server, and also on another server.

Now we have a shared secret, protocol negotiated, client logged in, in one round trip plus the third one way trip carrying the actual data – the same number of round trips as when setting up an unencrypted unauthenticated TCP connection.

You will notice there is no explicit step checking that both have the same shared secret – This is because we assume that each packet sent is also authenticated by the shared secret, so if they do not have the same secret, nothing will authenticate.

12 Critiques of TCP/SSL

Does the job so badly that using a different method is just as plausible. People fight to avoid TLS already, they’d rather send stuff in the clear if they could.  So just solve the problems they have.

In Web Services we frequently require message layer security in addition to transport layer security because a Web Service transaction might involve more than two endpoints and messages that are stored and forwarded etc. This is why WS-* is not TLS. (It is unfortunately horribly baroque but that was not my doing).

Problem that occurred with TLS was that there was an assumption that the job was to secure the reliable stream connection mechanics of TCP.  False assumption.

Pretty much nobody uses streams by design, they use datagrams.  And they use them in a particular fashion: request-response.  Where we went wrong with TCP was that this was the easiest way to handle the mechanics of getting the response back to the agent that sent the request. Without TCP, one had to deal with the raw incoming datagrams and allocate them to the different sending agents.

A second problem was that the design was too intertwined with commercial PKI so certs were hung on the side as a millstone for server authentication and discarded as client side, leaving passwords to fill that gap.  A mess, which is an opportunity for redesign, frequently exploited by many designs already.

SSL came at this and built a message (record) interface on top of TCP (because that was convenient for defining a crypto layer), and then a (mainly) stream interface on top of its message interface – because programmers were by now familiar with streams, not records.

And so … here we are.  Living in a city built on top of generations of older cities.  Dig down and see the accreted layers.

What is the “right” (easiest to use correctly, hardest to use incorrectly, with good performance, across a large number of distinct application APIs) underlying interface for a secure network link? The fact that the first thing pretty much all APIs do is create a message structure on top of TCP makes it clear that “pure stream” isn’t it.  Record-oriented designs derived from 80-column punch cards are unlikely to be the answer either.  What a “clean slate” interface would look like is an interesting question, and perhaps it’s finally time to explore it.

13 General and unorganized comments

µTP, Micro Transport Protocol is a Bittorrent near drop in replacement for TCP that provides lower priority bulk downloads in the background. The library is not well documented, (header file plus examples) but as far as I can see, provides a reasonably clean separation between Bittorrent and the transport mechanism.

Google has a TCP/SSL replacement, QUIC, which avoids round tripping and renegotiation by integrating the security layer with the reliability layer, and by supporting multiple asynchronous streams within a stream.

Layering a new peer-to-peer packet network over the Internet is simply what the Internet is designed for. UDP is broken in a few ways, but not in ways that can’t be fixed. It’s simply a matter of time before a new virtual packet layer is deployed – probably one in which authentication and encryption are inherent.

For authentication and encryption to be inherent, needs to connect between public keys, needs to be based on Zooko’s triangle.  Also needs to penetrate firewalls, and do protocol negotiation with an unlimited number of possible protocols – avoiding that internet names and numbers authority.

Ian Grigg “Good protocols divide into two parts, the first of which says to the second, trust this key completely!”.

This might well be the basis of a better problem factorization than the layer factorization – divide the task by the way trust is embodied, rather than the basis of layered communication.

Trust is an application level issue, not a communication layer issue, but neither do we want each application to roll its own trust cryptography – which at present web servers are forced to do. (Insert my standard rant against SSL/TLS).

Most web servers are vulnerable to attacks akin to session cookie fixation attack, because each web page reinvents session cookie handling, and even experts in cryptography are apt to get it wrong.

The correct procedure is to generate and issue a strongly unguessable random https only cookie on successful login, representing the fact that the possessor of this cookie has proven his association with a particular database record, but very few people, including very few experts in cryptography, actually do it this way. Association between a client request and a database record needs to be part of the security system. It should not be something each web page developer is expected to build on top of the security system.

TCP constructs a reliable pipeline stream connection out of unreliable packet connections.

There are a bunch of problems with TCP.  No provision was made for protocol negotiation and so any upgrade has to be fully backwards compatible.  A number of fixes have been made, for example the long fat pipe problem has been fixed by window size negotiation, which is semi incompatible and leads to flaky behaviour with old style routers, but the transaction problem remains intolerable.  The transaction problem has been reduced by protocol level workarounds, such as “Keep alive” for HTTP, but these are not entirely satisfactory.  The fix for syn flooding works, but causes some minor unnecessary degradation of performance under syn flood attacks, because the syn cookie is limited to 48 bits – needs to be 128 bits both to deal with the syn flood attack, and to prevent TCP hijacking.

TCP is inefficient over wireless, because interference problems are rather different to those provided for in the TCP model.  This problem is pretty much insoluble because of the lack of protocol negotiation.

There are cases intermediate between TCP and UDP, which require different balances of timeliness, reliability, streaming, and record boundary distinction. DCCP and SCTP have been introduced to deal with these intermediate cases, SCTP for when one has many independent transactions running over a single connection, and DCCP for data where time sensitivity matters more than reliability such as voice over IP.  SCTP would have been better for HTML and HTTP than TCP is, though it is a bit difficult to change now.  Problems such as password-authenticated key agreement transaction to a banking site require something that resembles encrypted SCTP, analogous to the way that TLS is encrypted TCP, but nothing like that exists as yet. Standards exist for encrypted DCCP, though I think the standards are unsatisfactory and suspect that each vendor will implement his own incompatible version, each of which will claim to conform to the standard.

But a new threat has arrived:  TCP man in the middle forgery.

Connection providers, such as Comcast, frequently sell more bandwidth than they can deliver.  To curtail customer demands, they forge connection shutdown packets (reset packets), to make it appear that the nodes are misbehaving, when in fact it is the connection between nodes, the connection that Comcast provides, that is misbehaving. Similarly, the Great Firewall of China forges reset packets when Chinese users connect to web sites that contain information that the Chinese government does not approve of. Not only does the Chinese government censor, but it is able to use a mechanism that conceals the fact of censorship.

The solution to all these problems is to have protocol negotiation, standard encryption, and flow control inside the encryption.
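A minimal sketch of why putting connection control inside the cryptographic envelope defeats forged resets, assuming both ends already share a session key. The frame layout, `seal`, `open_frame` and the frame type constants are hypothetical, invented for illustration. For brevity this sketch only authenticates frames with an HMAC. A real design would use an authenticated encryption cipher so that control frames are hidden as well as unforgeable.

```python
import hashlib
import hmac
import struct

TAG_BYTES = 16
HEADER_BYTES = 9                # 8 byte sequence number + 1 byte frame type
FRAME_DATA, FRAME_CLOSE = 0, 1  # hypothetical frame types


def seal(session_key: bytes, seq: int, frame_type: int,
         payload: bytes = b"") -> bytes:
    """Build a frame whose header and payload are covered by a MAC."""
    header = struct.pack("!QB", seq, frame_type)
    tag = hmac.new(session_key, header + payload,
                   hashlib.sha256).digest()[:TAG_BYTES]
    return header + payload + tag


def open_frame(session_key: bytes, frame: bytes):
    """Return (seq, frame_type, payload), or None if the frame is not authentic."""
    if len(frame) < HEADER_BYTES + TAG_BYTES:
        return None
    body, tag = frame[:-TAG_BYTES], frame[-TAG_BYTES:]
    expected = hmac.new(session_key, body, hashlib.sha256).digest()[:TAG_BYTES]
    if not hmac.compare_digest(tag, expected):
        return None  # forged or corrupted: drop it, do not tear down the connection
    seq, frame_type = struct.unpack("!QB", body[:HEADER_BYTES])
    return seq, frame_type, body[HEADER_BYTES:]
```

A middlebox that does not hold the session key cannot produce a `FRAME_CLOSE` that passes `open_frame`, so the receiver silently ignores its forgeries. Only the real peer can shut the connection down.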

A problem with the OSI Layer model is that as one piles one layer on top of another, one is apt to get redundant round trips.

According to Google research, 400 milliseconds of delay reduces usage by 0.76%, or roughly two percent per second of delay.

Redundant round trips become an ever more serious problem as bandwidths and processor speeds increase, but round trip times remain constant, and indeed increase as we become increasingly global and increasingly rely on space based communications.
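A back of the envelope sketch of what stacked round trips cost. The round trip counts per layer below are illustrative assumptions (one for a name lookup, one for the TCP handshake, two for a full TLS 1.2 style handshake), the 150 ms round trip time is made up, and the usage figure is a straight line extrapolation of the number quoted above, not a measurement.

```python
# Illustrative round trips spent before the first byte of useful data.
LAYERED_SETUP = {
    "name lookup": 1,
    "TCP handshake": 1,
    "TLS handshake": 2,
}


def setup_delay_ms(round_trips: int, rtt_ms: float) -> float:
    """Delay before useful data flows: each round trip costs one RTT."""
    return round_trips * rtt_ms


def usage_loss_percent(delay_ms: float, loss_per_400ms: float = 0.76) -> float:
    """Linear extrapolation of 0.76% usage lost per 400 ms of delay."""
    return delay_ms / 400.0 * loss_per_400ms


if __name__ == "__main__":
    rtt = 150.0  # ms, assumed long haul round trip time
    layered = setup_delay_ms(sum(LAYERED_SETUP.values()), rtt)
    merged = setup_delay_ms(1, rtt)  # everything negotiated in one round trip
    print(f"layered: {layered:.0f} ms, about {usage_loss_percent(layered):.2f}% usage lost")
    print(f"merged:  {merged:.0f} ms, about {usage_loss_percent(merged):.2f}% usage lost")
```

Faster links and faster processors change none of these numbers. Only removing round trips does, which is why the layering itself is the thing to attack.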

It used to be that the biggest problem with encryption was the asymmetric encryption calculations. The PKI model has lots and lots of redundant and excessive asymmetric encryptions. It also has lots and lots of redundant round trips. Now that we can use an NVIDIA GPU with CUDA as a very high speed, cheap, massively parallel cryptographic coprocessor, excessive PKI calculations should become less of a problem, but excess round trips are an ever increasing problem.

Any significant authentication and encryption overhead will result in people being too clever by half, using encryption and authentication only where they think it is needed, with the result that they invariably screw up and fail to use it somewhere it is needed, for example the login on an unencrypted HTTP page. So we have to lower the cost of encrypted authenticated communications, so that people can simply encrypt and authenticate everything without needing to think about it.

To get stuff right, we have to ditch the OSI layer model, but simply ditching it without replacement will result in problems. It exists for a reason, and we have to replace it with something else.

Creative Commons License reaction.la gpg key 154588427F2709CD9D7146B01C99BB982002C39F
This work is licensed under the Creative Commons Attribution 4.0 International License.