Contents

[Note] Go Optimizations101 : Value Copy Costs and Samll-size Type Values

[Note] Go Optimizations101 : Value Copy Costs and Samll-size Type Values

Go Optimizations101 筆記. 複製值的成本大致與值的大小成正比。 實際上,CPU 緩存、CPU 指令和編譯器優化也可能影響價值複製成本。 為了獲得較高的代碼執行性能,如果可能的話,我們應該盡量避免在

  1. copying a large quantity of large-size values in a loop.
  2. copying very-large-size arrays and structs.

Copy 9 elements vs 10 in Array/Struct

Some types in Go belong to small-size types. Copying their values is specially optimized by the official standard Go compiler so that the copy cost is always small.

But what are small-size types? There is not a formal definition. In fact, the definition depends on specific CPU architectures and compiler implementations. In the official standard Go compiler implementation, except large-size struct and array types, all other types in Go could be viewed as small-size types.

What are small-size struct and array values? There is also not a formal definition. The official standard Go compiler tweaks some implementation details from version to version. However, in practice, we can view struct types with no more than 4 native-word-size fields and array types with no more than 4 native-word-size elements as small-size values, such as struct{a, b, c, d int}, struct{element *T; len int; cap int} and [4]uint.

For official standard Go compiler v1.19, a copy cost leap happens between copying 9-element arrays and copying 10-element arrays (the element size is one native word). The similar is for copying 9-field structs and copying 10-field structs (each filed size is one native word). from : Go Optimizations 101

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
package copycost

import "testing"

const N = 1024

type Element = uint64

var a9r [N][9]Element

func Benchmark_CopyArray_9_elements(b *testing.B) {
	var a9s [N][9]Element
	for i := 0; i < b.N; i++ {
		for k := range a9s {
			a9r[k] = a9s[k]
		}
	}

}

var a10r [N][10]Element

func Benchmark_CopyArray_10_elements(b *testing.B) {
	var a10s [N][10]Element
	for i := 0; i < b.N; i++ {
		for k := range a10s {
			a10r[k] = a10s[k]
		}
	}
}

type S9 struct{ a, b, c, d, e, f, g, h, i Element }

var s9r [N]S9

func Benchmark_CopyStruct_9_fields(b *testing.B) {
	var s9s [N]S9
	for i := 0; i < b.N; i++ {
		for k := range s9s {
			s9r[k] = s9s[k]
		}
	}
}

type S10 struct{ a, b, c, d, e, f, g, h, i, j Element }

var s10r [N]S10

func Benchmark_CopyStruct_10_fields(b *testing.B) {
	var s10s [N]S10
	for i := 0; i < b.N; i++ {
		for k := range s10s {
			s10r[k] = s10s[k]
		}
	}
}

/*
go test -benchmem -run=none   -bench=. MyGoNote/Golang/Optimizations101/Value_Copy_Cossts_and_Samll-size_Type_Values/example/copy_9_vs_10_element_arrays -v -count=1
goos: darwin
goarch: amd64
pkg: MyGoNote/Golang/Optimizations101/Value_Copy_Cossts_and_Samll-size_Type_Values/example/copy_9_vs_10_element_arrays
cpu: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
Benchmark_CopyArray_9_elements
Benchmark_CopyArray_9_elements-8          197197              5865 ns/op               0 B/op          0 allocs/op
Benchmark_CopyArray_10_elements
Benchmark_CopyArray_10_elements-8         218689              5667 ns/op               0 B/op          0 allocs/op
Benchmark_CopyStruct_9_fields
Benchmark_CopyStruct_9_fields-8           476365              2594 ns/op               0 B/op          0 allocs/op
Benchmark_CopyStruct_10_fields
Benchmark_CopyStruct_10_fields-8          252937              5036 ns/op               0 B/op          0 allocs/op
PASS
ok      MyGoNote/Golang/Optimizations101/Value_Copy_Cossts_and_Samll-size_Type_Values/example/copy_9_vs_10_element_arrays       6.039s
*/

/value_copy_costs_and_samll-size_type_values/images/copy_9_vs_10_element_arrays.png

從此結果可看出, struct 元素少於10 個 fields 可能有特別被優化過

Add 4 vs Add 5

Add4 函數消耗的 CPU 資源比 Add5 少得多

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
package copycost

import "testing"

type T4 struct{ a, b, c, d float32 }
type T5 struct{ a, b, c, d, e float32 }

var t4 T4
var t5 T5

//go:noinline
func Add4(x, y T4) (z T4) {
	z.a = x.a + y.a
	z.b = x.b + y.b
	z.c = x.c + y.c
	z.d = x.d + y.d
	return

}

//go:noinline
func Add5(x, y T5) (z T5) {
	z.a = x.a + y.a
	z.b = x.b + y.b
	z.c = x.c + y.c
	z.d = x.d + y.d
	z.e = x.e + y.e
	return
}

func Benchmark_Add4(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var x, y T4
		t4 = Add4(x, y)
	}
}

func Benchmark_Add5(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var x, y T5
		t5 = Add5(x, y)
	}
}

/*
go test -benchmem -run=none   -bench=. MyGoNote/Golang/Optimizations101/Value_Copy_Costs_and_Samll-size_Type_Values/example/add4_vs_5/ -v -count=1
goos: darwin
goarch: amd64
pkg: MyGoNote/Golang/Optimizations101/Value_Copy_Costs_and_Samll-size_Type_Values/example/add4_vs_5
cpu: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
Benchmark_Add4
Benchmark_Add4-8        644357623                1.851 ns/op           0 B/op          0 allocs/op
Benchmark_Add5
Benchmark_Add5-8        90365827                13.01 ns/op            0 B/op          0 allocs/op
PASS
ok      MyGoNote/Golang/Optimizations101/Value_Copy_Costs_and_Samll-size_Type_Values/example/add4_vs_5  2.582s
*/

/value_copy_costs_and_samll-size_type_values/images/add4_vs_5.png

1
2
The `//go:noinline` compiler directives used here are to prevent the calls to the two function from being inlined.
If the directives are removed, the Add4 function will become even more performant.

如果將//go:noinline 移除, 將會內聯優化, Add4 function 效能會更好.

//go: 指令

from : 1.8 简单围观一下有趣的 //go: 指令

go:noinline

該指令表示該函數禁止進行內聯

1
2
3
4
//go:noinline
func unexportedPanicForTesting(b []byte, i int) byte {
    return b[i]
}

我們觀察一下這個案例,是直接通過索引取值,邏輯比較簡單。 如果不加上 go:noinline 的話,就會出現編譯器對其進行內聯優化 顯然,內聯有好有壞。該指令就是提供這一特殊處理

內聯展開

wiki

1
2
3
4
5
內聯展開(或稱內聯,下文或交替使用)是一種將函數體直接展開到調用處的一種優化技術。它可以由手工指定(如inline關鍵字),或者經由編譯優化自動完成。內聯展開類似於宏展開,區別在於內聯展開在編譯時完成,而宏展開則可能在預編譯(如C/C++)、編譯時(如Scheme)、運行時(如Scheme)時完成。

內聯是一種重要的優化技術。內聯的好處主要在於消除函數的調用開銷(壓棧,保護/恢復現場),但內聯展開對於性能的提升不能一概而論,它可能導致生成的代碼體積膨脹,並且影響指令緩存的命中率。有研究表明函數內聯展開在緩存小的時候能提升性能,緩存較大的時候性能有可能下降[1]。

除此之外,內聯展開會引入大量冗餘代碼,需要通過一系列編譯優化步驟進行縮減。比如一個記錄(或理解為結構體)中的值是不變的,那麼可以將其值直接替換到引用處;邏輯上不被使用到的分支代碼或者變量,會被自動消除掉;邏輯上不可能進入的分支也可以消除掉。通過這些優化可以極大縮減冗餘的代碼,使得程序編譯後獲得較為緊湊的體量。

Value Copy

Example 1 : Sum Range (array)

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
package copycost

import "testing"

const N = 1024

//go:noinline
func Sum_RangeArray(a [N]int) (r int) {
	for _, v := range a {
		r += v

	}
	return
}

//go:noinline
func Sum_RangeArrayPtr1(a *[N]int) (r int) {
	for _, v := range *a {
		r += v
	}
	return
}

//go:noinline
func Sum_RangeArrayPtr2(a *[N]int) (r int) {
	for _, v := range a {
		r += v
	}
	return
}

//go:noinline
func Sum_RangeSlice(a []int) (r int) {
	for _, v := range a {
		r += v
	}
	return
}

//===========

var r [128]int

func buildArray() [N]int {
	var a [N]int
	for i := 0; i < N; i++ {
		a[i] = (N - i) & i
	}
	return a
}

func Benchmark_Sum_RangeArray(b *testing.B) {
	var a = buildArray()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		r[i&127] = Sum_RangeArray(a)
	}
}

func Benchmark_Sum_RangeArrayPtr1(b *testing.B) {
	var a = buildArray()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		r[i&127] = Sum_RangeArrayPtr1(&a)
	}
}

func Benchmark_Sum_RangeArrayPtr2(b *testing.B) {
	var a = buildArray()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		r[i&127] = Sum_RangeArrayPtr2(&a)
	}

}

func Benchmark_Sum_RangeSlice(b *testing.B) {
	var a = buildArray()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		r[i&127] = Sum_RangeSlice(a[:])
	}
}

/*
go test -benchmem -run=none   -bench=. MyGoNote/Golang/Optimizations101/Value_Copy_Costs_and_Samll-size_Type_Values/example/sum_range/ -v -count=1
goos: darwin
goarch: amd64
pkg: MyGoNote/Golang/Optimizations101/Value_Copy_Costs_and_Samll-size_Type_Values/example/sum_range
cpu: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
Benchmark_Sum_RangeArray
Benchmark_Sum_RangeArray-8               1839027               652.0 ns/op             0 B/op          0 allocs/op
Benchmark_Sum_RangeArrayPtr1
Benchmark_Sum_RangeArrayPtr1-8           2566839               468.8 ns/op             0 B/op          0 allocs/op
Benchmark_Sum_RangeArrayPtr2
Benchmark_Sum_RangeArrayPtr2-8           3054258               390.2 ns/op             0 B/op          0 allocs/op
Benchmark_Sum_RangeSlice
Benchmark_Sum_RangeSlice-8               2839160               434.5 ns/op             0 B/op          0 allocs/op
PASS
ok      MyGoNote/Golang/Optimizations101/Value_Copy_Costs_and_Samll-size_Type_Values/example/sum_range  6.802s
*/

/value_copy_costs_and_samll-size_type_values/images/sum_range.png

從此可看出 Sum_RangeArray 效能是最差的

1
2
3
4
5
6
func Sum_RangeArray(a [N]int) (r int) {
	for _, v := range a {
		r += v
	}
	return
}

因為 array 的 value 被 copy 了兩次

  1. 第一次, 當 array 被當參數傳進 function 時
  2. 第二次, 當 ragne 在讀 array時 複製到 v 時 . (the direct part of the container following the range keyword will be copied if the second iteration variable is used).

Sum_RangeArrayPtr1 function 比 Sum_RangeArray 還快, 因為 array 的 value 只被 copy 一次, 發生在 當 ragne 在讀 array時 複製到 v

1
2
3
4
5
6
func Sum_RangeArrayPtr1(a *[N]int) (r int) {
	for _, v := range *a {
		r += v
	}
	return
}

Sum_RangeArrayPtr2Sum_RangeSlice 皆無 copy 發生, 所以速度都比較快

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
//go:noinline
func Sum_RangeArrayPtr2(a *[N]int) (r int) {
	for _, v := range a {
		r += v
	}
	return
}

//go:noinline
func Sum_RangeSlice(a []int) (r int) {
	for _, v := range a {
		r += v
	}
	return
}

Example 2 : Sum Slice

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
package copycost

import "testing"

type Element [10]int64

//go:noinline
func Sum_PlainForLoop(s []Element) (r int64) {
	for i := 0; i < len(s); i++ {
		r += s[i][0]
	}
	return
}

//go:noinline
func Sum_OneIterationVar(s []Element) (r int64) {
	for i := range s {
		r += s[i][0]
	}
	return
}

//go:noinline
func Sum_UseSecondIterationVar(s []Element) (r int64) {
	for _, v := range s {
		r += v[0]
	}
	return
}

//===================

func buildSlice() []Element {
	var s = make([]Element, 1000)
	for i := 0; i < len(s); i++ {
		s[i] = Element{0: int64(i)}
	}
	return s
}

var r [128]int64

func Benchmark_PlainForLoop(b *testing.B) {
	var s = buildSlice()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		r[i&127] = Sum_PlainForLoop(s)
	}
}

func Benchmark_OneIterationVar(b *testing.B) {
	var s = buildSlice()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		r[i&127] = Sum_OneIterationVar(s)
	}
}

func Benchmark_UseSecondIterationVar(b *testing.B) {

	var s = buildSlice()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		r[i&127] = Sum_UseSecondIterationVar(s)
	}
}

/*
go test -benchmem -run=none   -bench=. MyGoNote/Golang/Optimizations101/Value_Copy_Costs_and_Samll-size_Type_Values/example/sum_slice/ -v -count=1
goos: darwin
goarch: amd64
pkg: MyGoNote/Golang/Optimizations101/Value_Copy_Costs_and_Samll-size_Type_Values/example/sum_slice
cpu: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
Benchmark_PlainForLoop
Benchmark_PlainForLoop-8                 1671280               687.3 ns/op             0 B/op          0 allocs/op
Benchmark_OneIterationVar
Benchmark_OneIterationVar-8              1765395               854.2 ns/op             0 B/op          0 allocs/op
Benchmark_UseSecondIterationVar
Benchmark_UseSecondIterationVar-8         167341              6415 ns/op               0 B/op          0 allocs/op
PASS
ok      MyGoNote/Golang/Optimizations101/Value_Copy_Costs_and_Samll-size_Type_Values/example/sum_slice  5.231s
*/

從結果來看, Sum_UseSecondIterationVar的速度是最慢的, 是因為每一個 element 被複製到 iteration 的變數 v, 且 copy cost 不小 (the type Element is not a small-size type). 其他兩個 function 沒有 copy

1
2
3
4
5
6
7
//go:noinline
func Sum_UseSecondIterationVar(s []Element) (r int64) {
	for _, v := range s {
		r += v[0]
	}
	return
}

Example 3 : Sum struct

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
package copycost

import "testing"

type S struct{ a, b, c, d, e int }

//go:noinline
func sum_UseSecondIterationVar(s []S) int {
	var sum int

	for _, v := range s {
		sum += v.c
		sum += v.d
		sum += v.e
	}
	return sum
}

//go:noinline
func sum_OneIterationVar_Index(s []S) int {
	var sum int
	for i := range s {
		sum += s[i].c
		sum += s[i].d
		sum += s[i].e
	}
	return sum
}

//go:noinline
func sum_OneIterationVar_Ptr(s []S) int {
	var sum int
	for i := range s {
		v := &s[i]
		sum += v.c
		sum += v.d
		sum += v.e
	}
	return sum
}

var s = make([]S, 1000)
var r [128]int

func init() {
	for i := range s {
		s[i] = S{i, i, i, i, i}
	}
}

func Benchmark_UseSecondIterationVar(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r[i&127] = sum_UseSecondIterationVar(s)
	}
}

func Benchmark_OneIterationVar_Index(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r[i&127] = sum_OneIterationVar_Index(s)
	}
}

func Benchmark_OneIterationVar_Ptr(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r[i&127] = sum_OneIterationVar_Ptr(s)
	}
}

/*
go test -benchmem -run=none   -bench=. MyGoNote/Golang/Optimizations101/Value_Copy_Costs_and_Samll-size_Type_Values/example/sum_struct/ -v -count=1
goos: darwin
goarch: amd64
pkg: MyGoNote/Golang/Optimizations101/Value_Copy_Costs_and_Samll-size_Type_Values/example/sum_struct
cpu: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
Benchmark_UseSecondIterationVar
Benchmark_UseSecondIterationVar-8         653397              2781 ns/op               0 B/op          0 allocs/op
Benchmark_OneIterationVar_Index
Benchmark_OneIterationVar_Index-8        1000000              1100 ns/op               0 B/op          0 allocs/op
Benchmark_OneIterationVar_Ptr
Benchmark_OneIterationVar_Ptr-8          1000000              1077 ns/op               0 B/op          0 allocs/op
PASS
*/

/value_copy_costs_and_samll-size_type_values/images/sum_struct.png

如同 Example 2 sum_UseSecondIterationVar的速度是最慢的, 是因為每一個 element 被複製到 iteration 的變數 v

Reference