IT博客汇 | 十亿行的挑战

十亿行的挑战

smallnest发表于 2024-01-30 16:05:38

国外的程序员休完他们的假期之后在玩什么？他们在玩十亿行的代码挑战。

工程师贡纳尔·莫林在元旦发起一个挑战(1BRC),挑战从 1 月 1 日持续到 1 月 31 日。

如果你决定接受它，你的任务看似简单： 编写一个 Java 程序，用于从文本文件中检索温度测量值并计算每个气象站的最小、平均值和最高温度。
只有一点需要注意：文件有 1,000,000,000 行！（1 billion， 10亿行)。

这个文本文件结构简单，每行一个测量值, 气象站和它的测量温度：

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
...

程序应打印出每个站的最小值、平均值和最大值，按字母顺序排列，如下所示：

1	{Abha=5.0/18.0/27.4, Abidjan=15.7/26.0/34.1, Abéché=12.1/29.4/35.6, Accra=14.7/26.4/33.1, Addis Ababa=2.1/16.0/24.3, Adelaide=4.1/17.3/29.7, ...}

1BRC挑战的目标是为这项任务创建最快的实现，在这样做的同时，探索现代 Java 的好处，并找出你可以将这个平台推向多远。因此，使用所有的（虚拟）线程，利用 Vector API 和 SIMD，优化你的 GC，利用 AOT 编译，或者使用你能想到的任何其他技巧。

没想到莫林提出这个挑战后，一下子火了，在多个平台上都进行了热烈的讨论：Hacker News、lobste.rs、Reddit。

而且，实现也不再仅限于Java,其他编程语言甚至数据库都有实现。

在32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM的服务器上，使用8个核，优化的Java程序在使用GraalVM native binary情况下已经跑到了1秒多。在我的Mac M2 mini上可以跑到16.42秒。

我们关注一下Go语言的实现。

一个简单的中规中矩的Go语言实现是mr-karan/1brc-go, 在我的Mac M2 mini机器可以跑到26.66秒。
作者专门写了一篇文章介绍优化的步骤:

使用生产者和消费者模式
批处理文本行
减少内存分配
成块读取文本

另一个Go语言的实现是shraddhaag/1brc, 他把他的优化步骤都在README中列出来了，在我的Mac M2 mini机器可以跑到23.73秒。

Attempt Number	Approach	Execution Time	Diff	Commit
0	Naive Implementation: Read temperatures into a map of cities. Iterate serially over each key (city) in map to find min, max and average temperatures.	6:13.15
1	Evaluate each city in map concurrently using goroutines.	4:32.80	-100.35	8bd5f43
2	Remove sorting float64 slices. Calculate min, max and average by iterating.	4:25.59	-7.21	830e5df
3	Decouple reading and processing of file content. A buffered goroutine is used to communicate between the two processes.	5:22.83	+57.24	2babf7d
4	Instead of sending each line to the channel, now sending 100 lines chunked together. Also, to minimise garbage collection, not freeing up memory when resetting a slice.	3:41.76	-161.07	b7b1781
5	Read file in chunks of 100 MB instead of reading line by line.	3:32.62	-9.14	c26fea4
6	Convert temperature from `string` to `int64`, process in `int64` and convert to `float64` at the end.	2:51.50	-41.14	7812da4
7	In the city <> temperatures map, replaced the value for each key (city) to preprocessed min, max, count and sum of all temperatures instead of storing all recorded temperatures for the city.	1:39.81	-71.79	e5213a8
8	Use producer consumer pattern to read file in chunks and process the chunks in parallel.	1:43.82	+14.01	067f2a4
9	Reduce memory allocation by processing each read chunk into a map. Result channel now can collate the smaller processed chunk maps.	0:28.544	-75.286	d4153ac
10	Avoid string concatenation overhead by not reading the decimal point when processing city temperature.	0:24.571	-3.973	90f2fe1
11	Convert byte slice to string directly instead of using a `strings.Builder`.	0:18.910	-5.761

然后我们再看一个已经merge到官方库的Go语言实现，在我的Mac M2 mini机器可以跑到17.79秒，相当不错了。
他的代码在这里,他也做了很多的优化，比如使用mmap。

这几个代码都是值得学习和研究的，我们不能说他们做了最好的优化，但是确实都是花了不少功夫的。

几个rust的实现也是值得学习的。在我的Mac M2 mini机器的跑分：

tumdum/1brc: 18.72秒
mtb0x1/1brc: 18.93秒
coriolinus/1brc：18.87秒
thebracket/one_billion_rows: 16.04秒，使用了mmap
arthurlm/one-brc-rs: 17.33秒

一个C语言的实现：

dannyvankooten/1brc: 17.72秒