A reliable, high-frequency exchange designed for 24/7, 365-day operation
Back in 2017, as my partners and I looked into other crypto exchanges, we were shocked by the embarrassingly poor performance of existing platforms. Their performance and reliability issues became more severe as the number of users grew exponentially. A large number of exchanges are struggling to update their core systems without temporarily discontinuing or downgrading their services. Even after two years, the situation has not changed much.
I worked for Morgan Stanley’s global brokerage business for 11 years. Together with my colleagues from Morgan Stanley’s Electronic Trading (MSET) and Benchmark Execution Strategies (BXS) teams, I specialized in building high-frequency trading systems for our clients. These trading systems connect to almost every major exchange in the world. Our products covered equities, futures, options, FX, and much more. My partners and I are excited about opening our own shop in the cryptocurrency world, as we know exactly what is missing and what we can do for the industry.
Most traditional exchanges are open 5 days a week, 6 hours a day. Even FX exchanges close on Sundays. However, this is not the case for the cryptocurrency industry. Our users want to trade anytime and anywhere. Naturally, non-stop exchanges like BitMEX or Binance have become a standard service among cryptocurrency users.
In this article, we take a deep dive into the Phemex core system architecture and discuss our design principles and practices. We examine the technology that enables the Phemex CrossEngine to handle more than 300K orders per second while our trading engines can be easily scaled up. The response time for a market order entry and fill is less than 0.2ms. The entire Phemex system does not have any downtime, even during maintenance or upgrades.
Because giving back is one of Phemex’s core values, we are honored to be the first exchange to share our core architecture with the industry and our community. A live demo can also be found in the last section. It is open for applications in our testnet.
The Phemex core engines can be divided into three logical parts: CrossEngines, TradingEngines, and LiqEngines. The CrossEngine maintains an order book per symbol. It matches and executes orders strictly based on price and time priorities. The TradingEngine is responsible for accepting client order placement requests. It manages a full set of real-time risk checks for client orders and trading accounts, including cost, fee, PnL (profit and loss) calculation, and more. Client order requests are sent to the CrossEngine if the client has enough available balance. Phemex provides leverage trading. Therefore, another important role of the TradingEngine is to monitor mark prices and initiate liquidations in time. The LiqEngine handles liquidations and auto-deleverage events.
Another key component we employ is the Phemex Message Queue (MQ). Phemex MQ determines the sequence of messages and persists them for recovery. It is a light-weight yet powerful component in our system, ensuring high availability.
Phemex follows a few practices to maximize performance:
- We use multiple processes instead of multiple threads. Most processes are single-threaded. Each can be pinned to a dedicated CPU core to boost performance without having to worry about process or thread scheduling costs.
- The core engines are lock-free on the critical path. Increasing the number of TradingEngines can improve the overall performance almost linearly.
- Each Phemex executable binary file is extremely small. We have minimized third-party library dependencies. The simpler the binaries are, the more stable and easier to maintain they become. Since each process is so light-weight, our system uses fewer resources. Compared to an all-in-one process, testing is also easier because each process has a smaller set of functionalities.
The Phemex core is written in C++17. C++ gives us true power and flexibility to build our system without any restrictions.
The Phemex message queue plays a critical role during system recovery. It has four main functions:
- Determining message sequences
- Detecting duplications
- Persisting messages
- Replaying historical messages
Here is how a typical sequence works:
The master MQ gives each arriving message a sequence number. It then broadcasts the message to the replicators (or the slaves). Once a replicator receives the message, it persists it and sends an ack message back to the master MQ. As soon as the first ack arrives, the master MQ forwards the message downstream. The master MQ ensures messages are forwarded in the order of the sequence numbers it stamps. The sequence number increases continuously and monotonically.
The replicators can persist messages in different ways: shared memory, databases, or plain files. It is important to understand that a client message must be persisted by at least one replicator before the master MQ sends it downstream. During disaster recovery, the master MQ will be recovered from the latest sequence number and will then continue with new messages.
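The sequencing and ack flow described above can be sketched as follows. This is only an illustration under stated assumptions, not Phemex's implementation: the `MasterMq` class, its `accept` and `on_ack` methods, and the `Message` type are hypothetical names, and the network broadcast to replicators is left out.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical message type: a payload plus the stamped sequence number.
struct Message {
    std::uint64_t seq;
    std::string payload;
};

// Sketch of a master MQ. It stamps sequence numbers and releases a message
// downstream only after at least one replicator has acknowledged it,
// always in sequence order.
class MasterMq {
public:
    // Stamp an arriving message. The caller would broadcast it to replicators.
    std::uint64_t accept(std::string payload) {
        const std::uint64_t seq = next_seq_++;
        pending_.emplace(seq, Pending{std::move(payload), false});
        return seq;
    }

    // Record a replicator ack. Returns the messages that became forwardable,
    // in strictly increasing sequence order.
    std::vector<Message> on_ack(std::uint64_t seq) {
        std::vector<Message> out;
        auto it = pending_.find(seq);
        if (it == pending_.end()) return out;  // late or repeated ack
        it->second.acked = true;
        // Forward the longest acked prefix so ordering is never violated.
        while (!pending_.empty() && pending_.begin()->second.acked) {
            auto head = pending_.begin();
            out.push_back(Message{head->first, std::move(head->second.payload)});
            pending_.erase(head);
        }
        return out;
    }

private:
    struct Pending {
        std::string payload;
        bool acked;
    };
    std::uint64_t next_seq_ = 1;                // assumed to start at 1
    std::map<std::uint64_t, Pending> pending_;  // ordered by sequence number
};
```

In this sketch, an ack for message 2 that arrives before the ack for message 1 releases nothing; both messages are forwarded, in order, once message 1 is acknowledged.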
When downstream applications crash and re-connect, the MQ Replayer will collect the message data and send it in the same sequence. Downstream engines are designed to replay message sequences in order, so that they generate consistent results (see the CrossEngine section for more details).
As the master MQ and the downstream engines both depend on the Replicators to recover the latest sequence number and missing historical data, it is important to deploy multiple Replicators at different physical locations. Theoretically, the more there are, the safer it is. A typical deployment structure is illustrated below. Multiple Replicators significantly reduce the risk of data loss. All of these Replicators use a simple protocol to communicate with each other and synchronize data periodically. If a Replicator is restarted, it will ask the MQ Replayer to obtain the missing data from one of the other Replicators. For instance, suppose Replicator Number 5 crashes after receiving message #42. When it restarts, the master MQ tells it that the current sequence number is #50. Replicator Number 5 then asks the MQ Replayer to collect and resend the missing messages from #43 to #49.
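The gap calculation in the example above can be written down in a few lines. This is a minimal sketch: the function name `missing_range` is hypothetical, and it assumes that the "current sequence number" reported by the master MQ is delivered live and therefore excluded from the replay request, which is what the #43 to #49 example suggests.

```cpp
#include <cstdint>
#include <optional>
#include <utility>

// Inclusive range of sequence numbers [first, last].
using SeqRange = std::pair<std::uint64_t, std::uint64_t>;

// Returns the range a restarted replicator should request from the replayer,
// or std::nullopt if nothing is missing. Assumption: `current_seq` itself
// arrives through the live feed, so it is not part of the request.
std::optional<SeqRange> missing_range(std::uint64_t last_persisted,
                                      std::uint64_t current_seq) {
    if (current_seq <= last_persisted + 1) return std::nullopt;
    return SeqRange{last_persisted + 1, current_seq - 1};
}
```

With `last_persisted = 42` and `current_seq = 50`, the function returns the range 43 to 49.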
The Phemex MQ also discards duplicate messages that carry the same sequence number. When the MQ is used to persist the results of an engine, it can recognize a repeated result and discard it. Therefore, the engine can safely replay the computation without concern about duplicate entries in the MQ.
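One simple way to implement this kind of duplicate detection is a high-water mark on the sequence number. The sketch below is an assumption about how it could work, not a description of the Phemex MQ internals; `DedupFilter` is a hypothetical name, and it assumes that each producer emits results in increasing sequence order starting at 1.

```cpp
#include <cstdint>

// Accepts each sequence number at most once by remembering the highest
// sequence number seen so far.
class DedupFilter {
public:
    // Returns true if the message is new and should be persisted,
    // false if it is a duplicate and should be discarded.
    bool accept(std::uint64_t seq) {
        if (seq <= last_seq_) return false;  // already seen
        last_seq_ = seq;
        return true;
    }

    std::uint64_t last_seq() const { return last_seq_; }

private:
    std::uint64_t last_seq_ = 0;  // 0 means nothing accepted yet
};
```

Because the filter only compares against the last accepted number, an engine that replays results 1 to N after a restart has all of them discarded and resumes cleanly at N + 1.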
The most important task for the CrossEngine is to maintain an order book for each trading pair. When an aggressive order arrives, it usually generates fills against other orders in the book. The CrossEngine produces a correct execution sequence based on price and time priorities. Our CrossEngine is lightning fast thanks to multiple levels of optimization. It is capable of processing over 300K messages per second with a very small memory footprint. It is critically important to have a fast CrossEngine. Unfortunately, most existing tier-one platforms have failed to solve their overload issues. Their performance is surprisingly poor due to poor choices of technology and architecture. During large-scale market movements, users bear huge risks as they are forced to stop trading and face the losses caused by liquidations.
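To make price and time priority concrete, here is a minimal limit order book sketch. It is illustrative only and says nothing about how the Phemex CrossEngine is actually built or optimized: the `OrderBook`, `Order`, and `Fill` types are hypothetical, prices are integer ticks, and cancels, market orders, and self-trade rules are omitted.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <functional>
#include <map>
#include <vector>

enum class Side { Buy, Sell };

struct Order {
    std::uint64_t id;
    Side side;
    std::int64_t price;  // integer ticks
    std::int64_t qty;
};

struct Fill {
    std::uint64_t resting_id;
    std::uint64_t aggressor_id;
    std::int64_t price;  // fills happen at the resting order's price
    std::int64_t qty;
};

class OrderBook {
public:
    // Match an incoming order against the opposite side, then rest any remainder.
    std::vector<Fill> submit(Order o) {
        std::vector<Fill> fills;
        if (o.side == Side::Buy) {
            match(o, asks_, [&](std::int64_t p) { return p <= o.price; }, fills);
            if (o.qty > 0) bids_[o.price].push_back(o);
        } else {
            match(o, bids_, [&](std::int64_t p) { return p >= o.price; }, fills);
            if (o.qty > 0) asks_[o.price].push_back(o);
        }
        return fills;
    }

private:
    template <class Book, class Crosses>
    static void match(Order& o, Book& book, Crosses crosses, std::vector<Fill>& fills) {
        while (o.qty > 0 && !book.empty()) {
            auto level = book.begin();    // best price first: price priority
            if (!crosses(level->first)) break;
            auto& queue = level->second;  // FIFO within a level: time priority
            Order& resting = queue.front();
            const std::int64_t traded = std::min(o.qty, resting.qty);
            fills.push_back({resting.id, o.id, level->first, traded});
            o.qty -= traded;
            resting.qty -= traded;
            if (resting.qty == 0) queue.pop_front();
            if (queue.empty()) book.erase(level);
        }
    }

    std::map<std::int64_t, std::deque<Order>, std::greater<>> bids_;  // highest first
    std::map<std::int64_t, std::deque<Order>> asks_;                  // lowest first
};
```

Matching is a pure function of the book state and the incoming order, so two engines fed the same request sequence produce the same fills.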
The CrossEngine must not be a single point of failure, given its vital role in an exchange. The Phemex CrossEngine receives client order requests from a Request MQ (ReqMQ). As explained in the previous section, the ReqMQ produces a reliable and sequential client request message flow. For each trade symbol, we deploy multiple CrossEngines that connect to one ReqMQ. They generate exactly the same results. As shown in Figure I, a typical layout of CrossEngines will have multiple hot and warm engines. Hot engines process requests and generate matching results for the Response MQ (RespMQ). If a hot engine crashes, it has no impact on trading activities, as the other hot CrossEngines will seamlessly continue to process requests and generate the exact same results. Since one request message generates one and only one result, the result message has the same sequence number as the request message. The RespMQ is therefore able to discard duplicate results.
Warm engines process request messages from the ReqMQ, just as hot engines do. However, they do not send any results out to the RespMQ. Warm CrossEngines periodically persist the order book from memory to disk. If all hot CrossEngines shut down due to unexpected circumstances, they can recover the latest order book from the warm engines’ persisted files and then ask the ReqMQ to replay the missing data.
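The recovery path described above, loading a snapshot and then replaying the requests that followed it, can be sketched as below. The names are hypothetical and the engine state is deliberately reduced to a running total instead of a real order book. The only point is that replay is deterministic and that requests already covered by the snapshot are skipped.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical request: a sequence number plus an effect on engine state.
struct Request {
    std::uint64_t seq;
    std::int64_t delta;
};

// Stand-in for an engine. `state` plays the role of the order book.
struct Engine {
    std::uint64_t last_seq = 0;
    std::int64_t state = 0;

    // Apply a request exactly once, in sequence order.
    void apply(const Request& r) {
        if (r.seq <= last_seq) return;  // already reflected in the state
        state += r.delta;
        last_seq = r.seq;
    }
};

// Rebuild an engine from a persisted snapshot plus the request log.
// Requests at or below snapshot.last_seq are ignored by apply().
Engine recover(const Engine& snapshot, const std::vector<Request>& req_log) {
    Engine e = snapshot;
    for (const Request& r : req_log) e.apply(r);
    return e;
}
```

A recovered engine ends up in the same state as one that never crashed, as long as both consume the same request sequence.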
The best practice is to deploy multiple hot and warm CrossEngines at the same time. The extra engines have virtually no performance impact, but they significantly improve the system’s availability.
The Phemex TradingEngine computes order costs, position costs, margin requirements, liquidation prices, bankruptcy prices, and other data for each trading account. It implements Phemex’s core business trading logic. The TradingEngine contains all clients’ trading accounts and order requests. Thousands of user accounts can be distributed into multiple TradingEngines. Increasing the number of TradingEngines can significantly expand the throughput of trading requests.
Once an order request passes the TradingEngine’s cost/risk checks, it is sent to the CrossEngine through the ReqMQ. When an execution comes back from the CrossEngine via the RespMQ, the TradingEngine updates the position and occupied margin. A new liquidation price is then computed. Users’ account, position, and order status (APO status) data are all reported to the ApoMQ. The recovery process of the TradingEngine is similar to that of the CrossEngine. It first recovers the latest status from the ApoMQ, then replays the missing messages from the RespMQ. To make the recovery seamless, we run a redundant warm TradingEngine as well. The warm engine continuously reads the latest APO statuses from the ApoMQ. If the main TradingEngine is detected as unavailable, the warm TradingEngine will be activated and begin working.
The LiqEngine is a special TradingEngine. Its dedicated purpose is to handle liquidations. The TradingEngine computes the liquidation price of each account and, once it is triggered, delegates the liquidation process to the LiqEngine. The LiqEngine holds the balance of Phemex’s liquidation insurance fund. If liquidating a position needs additional funds, it draws from the insurance fund to cover the losses. In the extreme case that the insurance fund has been fully drained, an auto-deleverage (ADL) process is triggered. LiqEngines read all APO status information from the ApoMQs and rank each position for each trading symbol. The ADL process then selects the opposite positions with the highest ranks. For more details, please go to https://phemex.com/references.
Just like the TradingEngine, the LiqEngine also uses MQ for recovery and has a warm engine as a backup.
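The liquidation flow described above, drawing on the insurance fund first and falling back to ADL, can be sketched as follows. This is a rough illustration under several assumptions: the function and type names are hypothetical, the `rank` field is an opaque score supplied by the caller (the article does not give the ranking formula), and amounts are plain integers.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Draw on the insurance fund. Returns the loss that remains uncovered,
// which is non-zero only when the fund has been drained.
std::int64_t cover_loss(std::int64_t& fund, std::int64_t loss) {
    const std::int64_t paid = std::min(fund, loss);
    fund -= paid;
    return loss - paid;
}

struct Position {
    std::uint64_t account;
    std::int64_t qty;  // signed: positive = long, negative = short
    double rank;       // hypothetical ADL score, higher is deleveraged first
};

struct Deleverage {
    std::uint64_t account;
    std::int64_t qty;  // unsigned amount taken from this account
};

// Pick counterparties on the opposite side of the liquidated position,
// highest rank first, until the liquidated quantity is covered.
std::vector<Deleverage> select_adl(std::vector<Position> positions,
                                   std::int64_t liquidated_qty) {
    const bool need_shorts = liquidated_qty > 0;
    std::int64_t remaining = std::abs(liquidated_qty);
    // Drop positions that are on the same side as the liquidated one.
    positions.erase(
        std::remove_if(positions.begin(), positions.end(),
                       [&](const Position& p) {
                           return need_shorts ? p.qty >= 0 : p.qty <= 0;
                       }),
        positions.end());
    std::stable_sort(positions.begin(), positions.end(),
                     [](const Position& a, const Position& b) {
                         return a.rank > b.rank;
                     });
    std::vector<Deleverage> out;
    for (const Position& p : positions) {
        if (remaining == 0) break;
        const std::int64_t take = std::min(remaining, std::abs(p.qty));
        out.push_back({p.account, take});
        remaining -= take;
    }
    return out;
}
```

In this sketch, `select_adl` would be called only when `cover_loss` returns a non-zero remainder.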
Phemex Performance Live Demo
The peak TPS (trades per second) rate of top-tier exchanges is a few hundred trades per second. Phemex’s testnet can simulate a high trading volume environment by instructing an automated robot service to send up to 20,000 trades per second to our BTCUSD CrossEngine. The volume is visible through the rapid changes in the OrderBook, Charts, and Recent Trades panels. Users can still operate their trading accounts normally, without any lag.
To view a demonstration of this feature, visit Phemex Performance Demonstration. We also offer open applications for the use of this feature.
We are proud to be the world’s first cryptocurrency exchange to offer a demo of this kind. It is intended to demonstrate that Phemex has the capacity to handle at least 10x the trading volume of any other exchange, or perhaps even all of them combined.