What are we talking about when we are talking about high concurrency?
What is high concurrency?
High concurrency is one of the key performance indicators of a distributed Internet system architecture. It usually refers to the number of requests a system can handle simultaneously per unit of time; put simply, it is QPS (queries per second).
Here is the conclusion up front:
The basic manifestation of high concurrency is the number of requests a system can process simultaneously per unit of time.
The core of high concurrency is squeezing effective work out of the CPU.
For example, suppose we build an application called "MD5 exhaustion": each request carries an MD5 digest, and the system enumerates candidate strings until it finds the original. This workload is CPU-intensive rather than IO-intensive: the CPU spends all its time doing useful computation and can be driven to full utilization, so discussing high concurrency here is beside the point. Of course, we could raise throughput by adding machines, that is, adding CPUs. This is the obvious, trivially true answer everyone knows: there is no high-concurrency problem that adding machines cannot solve, and if there is, you simply haven't added enough machines! 🐶
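As a hypothetical sketch of such a CPU-bound handler (the function name and the 4-character lowercase search space are made up for illustration):

```php
<?php
// Hypothetical sketch of the CPU-bound "MD5 exhaustion" workload described
// above: brute-force a 4-character lowercase string from its MD5 digest.
// Every cycle goes into hashing, so a single worker already saturates one
// core; the only way to serve more such requests is to add CPUs.
function crackMd5(string $target): ?string
{
    $alphabet = str_split('abcdefghijklmnopqrstuvwxyz');
    foreach ($alphabet as $a) {
        foreach ($alphabet as $b) {
            foreach ($alphabet as $c) {
                foreach ($alphabet as $d) {
                    $candidate = $a . $b . $c . $d;
                    if (md5($candidate) === $target) {
                        return $candidate;
                    }
                }
            }
        }
    }
    return null; // not in the search space
}

var_dump(crackMd5(md5('demo'))); // string(4) "demo"
```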
For most internet applications, the CPU is not and should not be the bottleneck of the system. Most of the system's time is spent waiting for I/O (hard disk/memory/network) read/write operations to complete.
At this point someone may say: "When I look at the system monitoring, memory and network are both fine, but CPU utilization is maxed out. Why is that?"
That's a good question, and I'll come back to it with a practical example later. For now, underline the word "effective" used above: squeezing effective work out of the CPU is the thread that runs through this entire article!
Control variable method
Everything is connected to everything else: when we talk about high concurrency, every layer of the system has to keep up. Let's first review a classic C/S HTTP request flow.
(Figure: classic client/server HTTP request flow, with numbered steps)
As the numbered steps in the figure show:
1. The request is resolved by the DNS server and reaches the load-balancing cluster.
2. The load balancer distributes the request to the service layer according to its configured rules; the service layer is our core business layer, and it may in turn make RPC calls, publish to MQ, and so on.
3. The request then passes through the cache layer.
4. The data is persisted.
5. The data is returned to the client.
To achieve high concurrency, the load-balancing, service, cache, and persistence layers all need to be highly available and high-performance. Even step 5 can be optimized: compressing static files, pushing them with HTTP/2, serving them from a CDN. Several books could be written about optimizing each of these layers.
This article focuses on the service layer, the part circled in red in the figure; we will set aside the impact of databases and caches.
High-school science class calls this the control-variable method.
Further Discussion on Concurrency
The Evolution of Network Programming Models
(Figure: evolution of network programming models)
Concurrency has always been a key and difficult problem in server-side programming. To improve system concurrency, server designs evolved from forking a process per connection, to process/thread pools, to epoll-based event-driven models (Nginx; Node.js with its infamous callback hell), and finally to coroutines.
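To make the event-driven stage concrete, here is a minimal sketch (a toy echo server, not production code) using PHP's stream_select(), the portable cousin of epoll: one process watches many sockets and only touches the ones that are ready, instead of parking one blocked process or thread per connection.

```php
<?php
// Toy event-driven echo server: one process, many connections.
$server = stream_socket_server('tcp://0.0.0.0:8080', $errno, $errstr);
$clients = [];

while (true) {
    // Watch the listening socket plus every connected client.
    $read = array_merge([$server], $clients);
    $write = $except = null;
    if (stream_select($read, $write, $except, null) === false) {
        break; // select failed
    }
    foreach ($read as $sock) {
        if ($sock === $server) {
            // A new connection is ready to be accepted.
            $clients[] = stream_socket_accept($server);
        } elseif (($data = fread($sock, 8192)) === '' || $data === false) {
            // Peer closed: drop it from the watch list.
            unset($clients[array_search($sock, $clients, true)]);
            fclose($sock);
        } else {
            fwrite($sock, $data); // Echo back; other clients are never blocked.
        }
    }
}
```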
Seen this way, the entire evolution is one long process of squeezing more effective performance out of the CPU.
What? Not obvious?
Then let's talk about context switching.
Before we do, let's clarify two terms:
- Parallelism: two events execute at the same instant.
- Concurrency: two events alternate within the same time period; viewed macroscopically, both appear to happen simultaneously.
A thread is the smallest unit of operating-system scheduling, while a process is the smallest unit of resource allocation. Because a CPU core executes serially, only one thread can occupy a single core at any given moment, so Linux, as a multitasking system, switches between processes/threads very frequently.
Before each task runs, the CPU must know where to load it from and where to start executing. That information lives in the CPU's registers and program counter, and it is called the CPU context.
Processes are managed and scheduled by the kernel, so a process switch can only happen in kernel mode. A process's user-space resources, such as virtual memory, stack, and global variables, together with its kernel-space state, such as the kernel stack and registers, are called the process context.
As mentioned, a thread is the smallest unit of scheduling, and threads of one process share resources such as its virtual memory and global variables. Those shared resources plus the thread's own private data make up the thread context.
For a context switch between threads of the same process, the shared resources stay put, so it costs less than switching between processes.
It is now easy to state the point: switching between processes or threads triggers a CPU context switch plus a process/thread context switch, and all of this switching consumes extra CPU resources.
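You can watch these switches pile up yourself. A Linux-only sketch (it reads the counters the kernel exposes under /proc; nothing here is specific to any framework):

```php
<?php
// Linux-only: a process can read its own context-switch counters from /proc.
// "voluntary" switches happen when the task blocks (e.g. waiting on I/O);
// "nonvoluntary" switches are forced by the scheduler. Every one of them
// burns CPU cycles that do no useful work for your request.
$status = file_get_contents('/proc/self/status');
preg_match_all('/^(?:non)?voluntary_ctxt_switches:\s+(\d+)/m', $status, $m);
[$voluntary, $nonvoluntary] = $m[1]; // file order: voluntary first
echo "voluntary: $voluntary, nonvoluntary: $nonvoluntary\n";
```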
What About Context Switching in Coroutines?
So do coroutines get rid of context switching? They still switch, but there is no CPU context switch and no process/thread context switch, because all the switching happens inside a single thread, in user mode. You can even, loosely, picture a coroutine switch as your program moving a pointer; the CPU resources stay with the current thread throughout.
If you want a deep understanding, dig into Go's GMP scheduling model.
The net effect is that coroutines squeeze even more effective utilization out of the CPU.
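One self-contained snippet shows the payoff (a sketch assuming Swoole 4.x, where Swoole\Coroutine\run() and the go() short name are available): ten thousand concurrent "sleeps" inside a single process and thread.

```php
<?php
// 10,000 concurrent sleeps in one process/thread. Each Coroutine::sleep()
// yields to Swoole's user-mode scheduler instead of blocking the OS thread,
// so no kernel-level context switch is paid per task.
$start = microtime(true);

Swoole\Coroutine\run(function () {
    for ($i = 0; $i < 10000; $i++) {
        go(function () {
            Swoole\Coroutine::sleep(1); // yields; does not block the thread
        });
    }
});

printf("elapsed: %.2fs\n", microtime(true) - $start); // ~1s, not ~10000s
```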
Back to the question at the beginning
At this point someone may say: "When I look at the system monitoring, memory and network are both fine, but CPU utilization is maxed out. Why is that?"
Note that whenever this article discusses CPU utilization, the qualifier "effective" comes attached. A CPU running at full utilization is often doing a lot of ineffective computation.
Taking "the best language in the world" as an example, the typical CGI mode of PHP-FPM involves every HTTP request:
I will read hundreds of PHP files from the framework,
Will re establish/release the MYSQL/REIDS/MQ connection,
Will dynamically interpret, compile, and execute PHP files again,
They will continuously switch between different PHP FPM processes before switching.
The CGI running mode of PHP fundamentally determines its catastrophic performance on high concurrency.
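Contrast that with a resident Swoole worker, sketched below (the DB host and credentials are hypothetical): the code stays loaded in memory, and the connection opened in the WorkerStart callback outlives every request instead of being rebuilt per request.

```php
<?php
use Swoole\Http\Server;

$http = new Server('0.0.0.0', 8080);

$db = null;
$http->on('WorkerStart', function () use (&$db) {
    // Opened once per worker process, not once per HTTP request.
    $db = new PDO('mysql:host=127.0.0.1;dbname=test', 'user', 'pass');
});

$http->on('request', function ($request, $response) use (&$db) {
    // Reuses the long-lived connection: no connect/teardown per request,
    // no re-reading framework files from disk.
    $row = $db->query('SELECT 1')->fetch();
    $response->end(json_encode($row));
});

$http->start();
```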
Finding the problem is often harder than solving it. Once we understand what we are talking about when we talk about high concurrency, we discover that high concurrency and high performance are not limited by your programming language; they are limited only by your thinking.
Find the problem, solve the problem! So what can we achieve once the CPU's performance is squeezed effectively?
Below, let's compare the HTTP service of PHP + Swoole against one built on Netty, Java's high-performance asynchronous framework.
Preparation before the performance comparison
What is Swoole?
Swoole is an event-driven, high-performance, asynchronous, parallel network communication engine for PHP, written in C and C++.
What is Netty?
Netty is an open-source Java framework from JBoss. It provides an asynchronous, event-driven network application framework and tools for rapidly developing high-performance, highly reliable network servers and clients.
What is the maximum number of HTTP connections a single machine can hold?
Recalling some computer networking: HTTP is an application-layer protocol, and at the transport layer every TCP connection is established with a three-way handshake.
Each TCP connection is uniquely identified by a four-tuple: local IP, local port, remote IP, remote port.
The TCP protocol header is as follows (image from Wikipedia):
(Figure: TCP header layout, from Wikipedia)
The local port field is 16 bits wide, so there are 2^16 = 65536 possible local ports (0 to 65535).
The remote port field is likewise 16 bits, giving 65536 possible remote ports.
Meanwhile, in Linux's network programming model, the operating system maintains a file descriptor (fd) for every TCP connection, and the fd limit can be viewed and changed with the `ulimit -n` command. Before testing, we can run `ulimit -n 65536` to raise that limit.
Therefore, setting aside hardware resource limits:
- As a client, one local IP can open at most about 65535 outbound HTTP connections (65535 usable local ports × 1 local IP).
- As a server, the number of inbound HTTP connections is 65535 × the number of remote (client) IPs, which is practically unlimited.
PS: in reality the operating system reserves some ports, so the local count never quite reaches the theoretical value.
Performance Comparison
Testing Resources
Each Docker container has 1GB of memory and 2 cores of CPU, as shown in the figure:
(Figure: container resource allocation)
The Docker Compose files are as follows:

java8:

```yaml
version: "2.2"
services:
  java8:
    container_name: "java8"
    hostname: "java8"
    image: "java:8"
    volumes:
      - /home/cg/MyApp:/MyApp
    ports:
      - "5555:8080"
    environment:
      - TZ=Asia/Shanghai
    working_dir: /MyApp
    cpus: 2
    cpuset: 0,1
    mem_limit: 1024m
    memswap_limit: 1024m
    mem_reservation: 1024m
    tty: true
```
php7-sw:

```yaml
version: "2.2"
services:
  php7-sw:
    container_name: "php7-sw"
    hostname: "php7-sw"
    image: "mileschou/swoole:7.1"
    volumes:
      - /home/cg/MyApp:/MyApp
    ports:
      - "5551:8080"
    environment:
      - TZ=Asia/Shanghai
    working_dir: /MyApp
    cpus: 2
    cpuset: 0,1
    mem_limit: 1024m
    memswap_limit: 1024m
    mem_reservation: 1024m
    tty: true
```
PHP code:

```php
<?php
use Swoole\Server;
use Swoole\Http\Response;

$http = new Swoole\Http\Server("0.0.0.0", 8080);

$http->set([
    'worker_num' => 2,
]);

$http->on("request", function ($request, Response $response) {
    // The commented lines below are enabled later to simulate blocking I/O.
    //go(function () use ($response) {
    //    Swoole\Coroutine::sleep(0.01);
    $response->end('Hello World');
    //});
});

$http->on("start", function (Server $server) {
    go(function () use ($server) {
        echo "server listen on 0.0.0.0:8080 \n";
    });
});

$http->start();
```
Key Java code
The source comes from the official Netty examples: https://github.com/netty/netty
```java
public static void main(String[] args) throws Exception {
    // Configure SSL.
    final SslContext sslCtx;
    if (SSL) {
        SelfSignedCertificate ssc = new SelfSignedCertificate();
        sslCtx = SslContextBuilder.forServer(ssc.certificate(), ssc.privateKey()).build();
    } else {
        sslCtx = null;
    }

    // Configure the server.
    EventLoopGroup bossGroup = new NioEventLoopGroup(2);
    EventLoopGroup workerGroup = new NioEventLoopGroup();
    try {
        ServerBootstrap b = new ServerBootstrap();
        b.option(ChannelOption.SO_BACKLOG, 1024);
        b.group(bossGroup, workerGroup)
         .channel(NioServerSocketChannel.class)
         .handler(new LoggingHandler(LogLevel.INFO))
         .childHandler(new HttpHelloWorldServerInitializer(sslCtx));

        Channel ch = b.bind(PORT).sync().channel();

        System.err.println("Open your web browser and navigate to " +
                (SSL ? "https" : "http") + "://127.0.0.1:" + PORT + '/');

        ch.closeFuture().sync();
    } finally {
        bossGroup.shutdownGracefully();
        workerGroup.shutdownGracefully();
    }
}
```
Because I gave each service only two CPU cores, both are limited to a correspondingly small number of worker processes/threads.
Port 5551 serves the PHP service; port 5555 serves the Java service.
Load-testing tool: Apache Bench (ab)
ab command: `docker run --rm jordi/ab -k -c 1000 -n 1000000 http://10.234.3.32:5555/`
That is, a benchmark of 1,000,000 HTTP requests at a concurrency level of 1000.
Java + Netty load-test results:
(Figures: ab benchmark output for the Java + Netty service)
PHP + Swoole load-test results:
(Figures: ab benchmark output for the PHP + Swoole service)
| Service | QPS | Response time ms (max, min) | Memory (MB) |
| --- | --- | --- | --- |
| Java + Netty | 84042.11 | (11, 25) | 600+ |
| PHP + Swoole | 87222.98 | (9, 25) | 30+ |

PS: the table shows the best result of three runs.
Overall, the performance difference is small, and the PHP + Swoole service even edges out the Java + Netty one, especially on memory usage: Java used 600 MB, PHP only 30 MB.
What does this mean?
With no blocking I/O operations, no coroutine switching occurs either.
This only shows that under a multi-threading + epoll model the CPU's performance is already being squeezed effectively, and that you can write high-concurrency, high-performance services even in PHP.
Performance Comparison: Witness the Miracle
The code above doesn't actually showcase what coroutines are good at, because the request contains no blocking operation at all. Real applications, however, are full of blocking operations: file reads, DB connections and queries, and so on. So let's look at the load-test results after adding a blocking operation.
In both the Java and PHP code, I added a sleep(0.01) call (in seconds) to simulate a 0.01 s blocking system call.
The full code won't be pasted again.
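For reference, on the PHP side this just means enabling the lines that were commented out in the handler above:

```php
$http->on("request", function ($request, Response $response) {
    go(function () use ($response) {
        Swoole\Coroutine::sleep(0.01); // simulate 0.01s of blocking I/O
        $response->end('Hello World');
    });
});
```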
Java + Netty results with the blocking I/O operation:
(Figure: ab benchmark output for Java + Netty with blocking)
It took about 10 minutes to finish the whole run...
PHP + Swoole results with the blocking I/O operation:
(Figure: ab benchmark output for PHP + Swoole with blocking)
| Service | QPS | Response time ms (max, min) | Memory (MB) |
| --- | --- | --- | --- |
| Java + Netty | 1562.69 | (52, 160) | 100+ |
| PHP + Swoole | 9745.20 | (9, 25) | 30+ |
The results show that the coroutine-based PHP + Swoole service delivers roughly six times the QPS of the Java + Netty service.
Of course, both test programs are official demo code, and plenty of configuration could surely be tuned; tuned properly, the results would certainly improve.
But think about it: why don't the official defaults set the thread/process count higher in the first place?
Because more processes/threads is not automatically better. As we discussed earlier, process/thread switching costs extra CPU, especially switches between user mode and kernel mode!
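A reasonable baseline, sketched here with Swoole's swoole_cpu_num() helper, is to match workers to cores rather than oversubscribe:

```php
<?php
$http = new Swoole\Http\Server('0.0.0.0', 8080);

$http->set([
    // One worker per core as a baseline: beyond this, extra workers
    // mostly buy more context switching, not more throughput.
    'worker_num' => swoole_cpu_num(),
]);

$http->on('request', function ($request, $response) {
    $response->end('ok');
});

$http->start();
```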
To be clear, these benchmark results are not a dig at Java. The point is: once you understand the core of high concurrency and aim at that target, then whatever language you use, as long as you optimize effective CPU utilization (connection pools, resident daemons, multi-threading, coroutines, select polling, epoll event-driven I/O), you too can build a high-concurrency, high-performance system.
So, do you now know what we're talking about when we're talking about high performance?
Ideas are always more important than results!
Reprints are welcome; please credit the author and source. Thank you!