Performance Analysis of Golang Large Killer PProf


Original address:Performance Analysis of Golang Large Killer PProf


You write tons of code, implement hundreds of interfaces, and pass all the functional tests; finally, the deployment goes live successfully.

Then it turns out the performance is terrible. What went wrong?

Time to do some performance analysis.


To optimize performance, first look at the tool chain Go itself provides as the basis for analysis. This article walks you through Go's own profiling tools, covering the following packages:

  • runtime/pprof: collects runtime data of a program (not a server) for analysis
  • net/http/pprof: collects runtime data of an HTTP server for analysis

What is it?

pprof is a tool for visualizing and analyzing profiling data.

pprof reads a collection of profiling samples in profile.proto format and generates reports that visualize the data and help analyze it (both text and graphical reports are supported).

profile.proto is a Protocol Buffer v3 description file that describes a set of call stacks and symbolization information. It is used to represent a set of sampled call stacks for statistical analysis, and it is a very common stack-trace data format.

What usage modes are supported

  • Report generation: generates analysis reports
  • Interactive terminal use: explores profiles from an interactive command line
  • Web interface: browses profiles in a web UI

What can be done

  • CPU Profiling: CPU analysis. Samples the CPU usage of the monitored application (including registers) at a certain frequency, to determine where the application spends its time while actively consuming CPU cycles
  • Memory Profiling: memory analysis. Records stack traces when the application makes heap allocations; used to monitor current and historical memory usage and to check for memory leaks
  • Block Profiling: blocking analysis. Records where goroutines block waiting on synchronization primitives (including timer channels)
  • Mutex Profiling: mutex analysis. Reports contention on mutexes

A simple example

We will write a simple example with a deliberate problem in it, as a first pass at analyzing a program.

Writing demo files

(1) demo.go, file content (the import block, the loop body, and the listen address were truncated in the original listing; the reconstruction below fills them in as an assumption based on the rest of the article, which calls data.Add in a loop and serves on localhost:6060):

package main

import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof handlers on the default mux

    "example.com/demo/data" // hypothetical module path; point it at the data package below
)

func main() {
    go func() {
        for {
            // Busy loop calling the (deliberately problematic) data.Add.
            data.Add("hello pprof")
        }
    }()

    // Later examples in the article use localhost:6060.
    http.ListenAndServe(":6060", nil)
}

(2) data/d.go, file content:

package data

var datas []string

// Add stores a copy of str in a package-level slice and returns the copy.
func Add(str string) string {
    data := []byte(str)
    sData := string(data)
    datas = append(datas, sData)

    return sData
}
Run this program, and your HTTP service will expose a /debug/pprof endpoint that can be used to observe the application.


First, through the Web interface

View the current overview: access http:// (the /debug/pprof index page of the running service):


0    block
5    goroutine
3    heap
0    mutex
9    threadcreate

full goroutine stack dump

There are many sub-pages in this page. Let’s continue to dig deeper and see what we can get.

  • cpu (CPU Profiling): $HOST/debug/pprof/profile. Does CPU profiling for 30s by default and returns a profile file for analysis
  • block (Block Profiling): $HOST/debug/pprof/block. Views stack traces that led to blocking on synchronization primitives
  • goroutine: $HOST/debug/pprof/goroutine. Views the stack traces of all current goroutines
  • heap (Memory Profiling): $HOST/debug/pprof/heap. Views the memory allocations of live objects
  • mutex (Mutex Profiling): $HOST/debug/pprof/mutex. Views stack traces of holders of contended mutexes
  • threadcreate: $HOST/debug/pprof/threadcreate. Views stack traces that led to the creation of new OS threads

Second, through interactive terminal use

(1) go tool pprof http://localhost:6060/debug/pprof/profile?seconds=60

$ go tool pprof http://localhost:6060/debug/pprof/profile\?seconds\=60

Fetching profile over HTTP from http://localhost:6060/debug/pprof/profile?seconds=60
Saved profile in /Users/eddycjy/pprof/pprof.samples.cpu.007.pb.gz
Type: cpu
Duration: 1mins, Total samples = 26.55s (44.15%)
Entering interactive mode (type "help" for commands, "o" for options)

After executing the command, wait 60 seconds (you can adjust the value of seconds) while pprof performs CPU profiling. When it finishes, pprof enters its interactive command mode by default, from which the analysis results can be viewed or exported. Run pprof help to see the command descriptions.

(pprof) top10
Showing nodes accounting for 25.92s, 97.63% of 26.55s total
Dropped 85 nodes (cum <= 0.13s)
Showing top 10 nodes out of 21
      flat  flat%   sum%        cum   cum%
    23.28s 87.68% 87.68%     23.29s 87.72%  syscall.Syscall
     0.77s  2.90% 90.58%      0.77s  2.90%  runtime.memmove
     0.58s  2.18% 92.77%      0.58s  2.18%  runtime.freedefer
     0.53s  2.00% 94.76%      1.42s  5.35%  runtime.scanobject
     0.36s  1.36% 96.12%      0.39s  1.47%  runtime.heapBitsForObject
     0.35s  1.32% 97.44%      0.45s  1.69%  runtime.greyobject
     0.02s 0.075% 97.51%     24.96s 94.01%  main.main.func1
     0.01s 0.038% 97.55%     23.91s 90.06%  os.(*File).Write
     0.01s 0.038% 97.59%      0.19s  0.72%  runtime.mallocgc
     0.01s 0.038% 97.63%     23.30s 87.76%  syscall.Write
  • flat: time spent in the given function itself
  • flat%: flat as a percentage of total CPU time
  • sum%: the running total of flat% over the rows so far
  • cum: time spent in the given function plus everything it called
  • cum%: cum as a percentage of total CPU time

The last column is the function name. In most cases, these five columns are enough to understand how an application is running and to optimize it.

(2) go tool pprof http://localhost:6060/debug/pprof/heap

$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /Users/eddycjy/pprof/pprof.alloc_objects.alloc_space.inuse_objects.inuse_space.008.pb.gz
Type: inuse_space
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 837.48MB, 100% of 837.48MB total
      flat  flat%   sum%        cum   cum%
  837.48MB   100%   100%   837.48MB   100%  main.main.func1
  • -inuse_space: analyze the application's resident (currently in-use) memory
  • -alloc_objects: analyze the application's temporary allocations (everything allocated since start)

(3) go tool pprof http://localhost:6060/debug/pprof/block

(4) go tool pprof http://localhost:6060/debug/pprof/mutex

Third, the PProf visual interface

This is the section to look forward to. Before we get to it, we need to write a simple test case to run:


(1) Create data/d_test.go, file content (the closing braces and the benchmark body were truncated in the original listing and are restored here):

package data

import "testing"

const url = "" // left empty in the source; use any non-empty string so TestAdd passes

func TestAdd(t *testing.T) {
    s := Add(url)
    if s == "" {
        t.Errorf("Test.Add error!")
    }
}

func BenchmarkAdd(b *testing.B) {
    for i := 0; i < b.N; i++ {
        Add(url)
    }
}

(2) execute test cases

$ go test -bench=.
BenchmarkAdd-4       10000000           187 ns/op
ok    2.300s

The go test flags -cpuprofile and -memprofile, which write profiles while the benchmark runs, are also worth knowing.

Start the PProf visual interface

Method 1:

$ go tool pprof -http=:8080

Method 2:

$ go tool pprof
(pprof) web

Both methods take, as their final argument, the profile file written by the benchmark step.

If it reports Could not execute dot; may need to install graphviz, it is prompting you to install graphviz (search for installation instructions for your platform).

View the PProf visual interface





The bigger the box and the thicker the line, the more time is being consumed there.





Through PProf's visual interface, we can see a Go application's call chains, resource usage, and so on far more conveniently and intuitively. The View menu also supports switching between the modes described above.

When you are worried and do not know where the problem is, these auxiliary tools can pinpoint it. Does that not multiply your efficiency in an instant?

Fourth, PProf flame graphs

Another way of visualizing the data is the flame graph, which requires manually installing the native pprof tool:

(1) installation of PProf

$ go get -u

(2) Start PProf visual interface:

$ pprof -http=:8080

(3) View the PProf visual interface

When you open this pprof's visual interface, you will notice that it is more polished than the pprof in the official tool chain, and that it adds a Flame Graph view.

That view is one of the goals of this exercise, and its greatest advantage is that it is interactive. The call order reads top to bottom (A -> B -> C -> D); each block represents a function, and the wider the block, the more CPU time it occupies. It also supports clicking on blocks for deeper analysis!



This chapter briefly introduced Go's performance weapon, PProf. In the right scenario, PProf is a great help for locating and analyzing problems.

I hope this article is helpful to you. I also suggest working through it yourself, and ideally thinking deeply about it, because it contains a lot of usage details and knowledge points.

Thinking questions

Since you have read all the way to the end, here are two simple questions to expand your thinking.

(1) Must flat be greater than cum? Why? In what scenario will cum be larger than flat?

(2) What performance problems does this chapter's demo code have? How would you solve them?

Come on, share your thoughts!
