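To improve performance and reduce database load, hot data is usually cached, for example in Redis. An incoming request first queries Redis: on a hit the value is returned directly; on a miss the request falls through to the database, and the result is then written back to the cache. In distributed, high-concurrency scenarios, cache systems commonly run into the following two problems:

Cache breakdown
Cache avalanche

The sections below cover the definition, causes, and differences of each, along with common mitigation approaches and examples in Go.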

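Cache Breakdown

Definition

Some requests bypass the cache (for example, the key is not in the cache, or a malicious client requests large numbers of nonexistent keys) and go straight to the backend store, causing a surge in backend load. Strictly speaking, requests for keys that do not exist at all are usually called cache penetration, while a hot key expiring under heavy traffic is cache breakdown; this article treats both under one heading.

For example: the cache entry for a hot key expires just as a large number of requests arrive for that key. None of them finds the key in the cache, so they all hit the database, overloading it or even taking it down.

Problem Analysis

Different causes call for different solutions:

Large numbers of requests for nonexistent keys: use a Bloom filter to check whether the requested key may exist in the database. If the filter says it does not, return an empty value immediately so the request never reaches the database.
Hot-key cache expiry: for hot keys, set no expiration so that the cache entry never becomes invalid.
Large numbers of requests reaching the database: use local and distributed rate limiting to control concurrency and keep the database load manageable.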
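
The Bloom filter check from the list above can be sketched as follows. This is a minimal illustration, not a production implementation: the type BloomFilter, the helper baseHashes, and the sizes (1024 bits, 3 probes) are assumptions chosen for the example, and the filter is not safe for concurrent use. A Bloom filter can return false positives but never false negatives, so a negative answer is enough to skip the database.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// BloomFilter is a hypothetical, minimal Bloom filter used only for illustration.
// It is not safe for concurrent use and stores one bool per bit for simplicity.
type BloomFilter struct {
	bits []bool
	k    uint64 // number of probe positions per key
}

// NewBloomFilter creates a filter with m bits and k probes per key.
func NewBloomFilter(m, k int) *BloomFilter {
	return &BloomFilter{bits: make([]bool, m), k: uint64(k)}
}

// baseHashes derives two hashes from key; probe i uses h1 + i*h2 (double hashing).
func baseHashes(key string) (uint64, uint64) {
	a := fnv.New64a()
	a.Write([]byte(key))
	b := fnv.New64()
	b.Write([]byte(key))
	return a.Sum64(), b.Sum64() | 1 // keep h2 non-zero so the probes differ
}

// Add marks all probe positions for key.
func (f *BloomFilter) Add(key string) {
	h1, h2 := baseHashes(key)
	m := uint64(len(f.bits))
	for i := uint64(0); i < f.k; i++ {
		f.bits[(h1+i*h2)%m] = true
	}
}

// MightContain reports false only if key was definitely never added.
func (f *BloomFilter) MightContain(key string) bool {
	h1, h2 := baseHashes(key)
	m := uint64(len(f.bits))
	for i := uint64(0); i < f.k; i++ {
		if !f.bits[(h1+i*h2)%m] {
			return false
		}
	}
	return true
}

func main() {
	bf := NewBloomFilter(1024, 3)
	// In a real system the filter would be populated from the keys in the database.
	bf.Add("user:1")
	bf.Add("user:2")

	for _, key := range []string{"user:1", "user:999"} {
		if !bf.MightContain(key) {
			fmt.Println(key, "is definitely absent: skip the database")
			continue
		}
		fmt.Println(key, "may exist: fall through to the cache and database")
	}
}
```

Because false positives are possible, a positive answer still has to go through the normal cache and database lookup.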

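Local and distributed rate limiting can each be analyzed further:

Local rate limiting: within a single process, a mutex (sync.Mutex or sync.RWMutex) can ensure that only one request at a time accesses the database. When the cache entry expires, only the first request queries the database; later requests wait for it to finish, then re-check the cache and return its result. This reduces database load, but it increases request latency and lowers overall throughput.
Distributed rate limiting: in a distributed application, the Redis SETNX command can implement a distributed lock so that only one request at a time accesses the database. The drawbacks are that SETNX adds its own overhead and that all requests become serialized, which also increases latency.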
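
The local variant described above can be sketched with a single mutex that guards both the cache lookup and the database load. The names lockedLoader, Get, and runConcurrent are hypothetical, and an in-memory map stands in for the cache and a counter for the database. Because every caller takes the same lock, requests for unrelated keys are serialized too, which is the latency cost mentioned above.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedLoader is a hypothetical loader that serializes cache misses with one mutex.
type lockedLoader struct {
	mu     sync.Mutex
	cache  map[string]string // stand-in for the real cache
	dbHits int               // counts simulated database queries; guarded by mu
}

func newLockedLoader() *lockedLoader {
	return &lockedLoader{cache: make(map[string]string)}
}

// Get returns the cached value, loading it from the simulated database on a miss.
// The cache check happens under the lock, so a waiter that acquires the lock
// after the first loader finds the value already cached.
func (l *lockedLoader) Get(key string) string {
	l.mu.Lock()
	defer l.mu.Unlock()

	if v, ok := l.cache[key]; ok {
		return v
	}
	l.dbHits++ // simulated database query
	v := "value-for-" + key
	l.cache[key] = v
	return v
}

// runConcurrent issues n concurrent Get calls for the same key and waits for them.
func runConcurrent(l *lockedLoader, key string, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.Get(key)
		}()
	}
	wg.Wait()
}

func main() {
	l := newLockedLoader()
	runConcurrent(l, "key", 10)
	fmt.Println("database hits:", l.dbHits) // prints "database hits: 1"
}
```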

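Solution

To address these drawbacks, the golang.org/x/sync/singleflight package, maintained by the Go team, can be used to solve cache breakdown. It is a lightweight package for deduplicating concurrent calls and sharing their results: concurrent requests are grouped by key, only one "leader" request performs the actual work, and the others wait and share its result. This effectively prevents cache breakdown and the "avalanche" of excessive calls to backend services. In essence, it merges identical requests into one to avoid duplicate work.

It was originally written by Brad Fitzpatrick to address the thundering herd problem under high concurrency; callers specify a key for each call, and calls with the same key are merged.

In the Go community, singleflight is often described as short-lived memoization. Unlike a persistent cache, the result can be reused only by callers that overlap with the in-flight call; once the call returns, the result is discarded.

Example

Below is a simple example. First, simulate a cache breakdown scenario: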

package main

import (
	"errors"
	"log"
	"strconv"
	"sync"
	"time"
)

var ErrCacheMiss = errors.New("cache miss") // Error indicating cache miss

func main() {
	var wg sync.WaitGroup
	concurrentRequests := 10 // Number of concurrent requests

	// Simulate 10 concurrent requests (ranging over an int requires Go 1.22+)
	for range concurrentRequests {
		// Register before starting the goroutine; a Done that runs before Add would panic
		wg.Add(1)
		go func() {
			defer wg.Done()
			data, err := fetchData("key")
			if err != nil {
				log.Print(err)
				return
			}
			log.Println(data)
		}()
	}
	wg.Wait()
}

// fetchData retrieves data from cache or database
func fetchData(key string) (string, error) {
	// Try to load data from cache
	data, err := getFromCache(key)
	if err == nil {
		return data, nil
	}
	if !errors.Is(err, ErrCacheMiss) {
		// Unexpected cache error: surface it instead of returning empty data
		return "", err
	}
	// Load data from database if cache misses
	data, err = getFromDatabase(key)
	if err != nil {
		return "", err
	}
	// Store the data in cache
	storeInCache(key, data)
	return data, nil
}

// getFromCache simulates retrieving data from cache
func getFromCache(key string) (string, error) {
	return "", ErrCacheMiss // Simulate a cache miss
}

// storeInCache simulates storing data in cache
func storeInCache(key, data string) {}

// getFromDatabase simulates retrieving data from the database
func getFromDatabase(key string) (string, error) {
	log.Println("Querying database...")
	timestamp := strconv.Itoa(int(time.Now().UnixNano()))
	return timestamp, nil
}

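The fetchData function here simulates a cache breakdown scenario: it first tries to read from the cache, and on a miss it queries the database and stores the result in the cache. Running the code above produces the following output: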

2025/05/09 11:02:55 Querying database...
2025/05/09 11:02:55 Querying database...
2025/05/09 11:02:55 1746759775630807000
2025/05/09 11:02:55 1746759775630798000
2025/05/09 11:02:55 Querying database...
2025/05/09 11:02:55 1746759775630857000
2025/05/09 11:02:55 Querying database...
2025/05/09 11:02:55 1746759775630887000
2025/05/09 11:02:55 Querying database...
2025/05/09 11:02:55 1746759775630899000
2025/05/09 11:02:55 Querying database...
2025/05/09 11:02:55 1746759775630906000
2025/05/09 11:02:55 Querying database...
2025/05/09 11:02:55 Querying database...
2025/05/09 11:02:55 1746759775630949000
2025/05/09 11:02:55 Querying database...
2025/05/09 11:02:55 1746759775630962000
2025/05/09 11:02:55 1746759775630922000
2025/05/09 11:02:55 Querying database...
2025/05/09 11:02:55 1746759775630978000

Every goroutine prints Querying database..., which shows that each request went straight to the database, resulting in cache breakdown.
Next, use singleflight to fix this:

// Requires: import "golang.org/x/sync/singleflight"
var requestGroup singleflight.Group // Group for merging duplicate requests

// fetchData retrieves data from cache or database
func fetchData(key string) (string, error) {
	// Try to load data from cache
	data, err := getFromCache(key)
	if err == nil {
		return data, nil
	}
	if !errors.Is(err, ErrCacheMiss) {
		// Unexpected cache error: surface it instead of returning empty data
		return "", err
	}
	// Use singleflight to merge duplicate requests
	result, err, _ := requestGroup.Do(key, func() (interface{}, error) {
		// Load data from database if cache misses
		data, err := getFromDatabase(key)
		if err != nil {
			return nil, err
		}
		// Store the data in cache
		storeInCache(key, data)
		return data, nil
	})
	if err != nil {
		log.Println(err)
		return "", err
	}
	return result.(string), nil
}

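In the code above, the Do method of singleflight merges duplicate requests. Only the first request performs the actual database query; the requests that arrive while it is in flight wait for it to finish and share its result. This avoids cache breakdown.
Running the code above produces the following output: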

2025/05/09 11:07:18 Querying database...
2025/05/09 11:07:18 1746760038153869000
2025/05/09 11:07:18 1746760038153869000
2025/05/09 11:07:18 1746760038153869000
2025/05/09 11:07:18 1746760038153869000
2025/05/09 11:07:18 1746760038153869000
2025/05/09 11:07:18 1746760038153869000
2025/05/09 11:07:18 1746760038153869000
2025/05/09 11:07:18 1746760038153869000
2025/05/09 11:07:18 1746760038153869000
2025/05/09 11:07:18 1746760038153869000

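As shown, Querying database... is printed only once: a single request ran the query, and the other requests received the shared result of that in-flight call rather than querying the database themselves, which avoids cache breakdown.

How singleflight Works

The singleflight source code is very compact and worth studying. It merges requests using sync.Mutex and sync.WaitGroup, and maintains a map of the requests currently in flight.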

// Group represents a class of work and forms a namespace in
// which units of work can be executed with duplicate suppression.
type Group struct {
	mu sync.Mutex       // protects m
	m  map[string]*call // lazily initialized
}

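In the Group struct, m is a map[string]*call that stores the in-flight requests. The call struct holds the result and error of the request, plus a sync.WaitGroup used to wait for the request to complete.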

// call is an in-flight or completed singleflight.Do call
type call struct {
	wg sync.WaitGroup

	// These fields are written once before the WaitGroup is done
	// and are only read after the WaitGroup is done.
	val interface{}
	err error

	// These fields are read and written with the singleflight
	// mutex held before the WaitGroup is done, and are read but
	// not written after the WaitGroup is done.
	dups  int
	chans []chan<- Result
}

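When a request arrives, it first checks whether the map already contains an in-flight request for the same key. If so, it waits for that request to finish and returns its result; if not, it creates a new call, adds it to the map, and performs the actual work. The implementation lives in the Do method: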

// Do executes and returns the results of the given function, making
// sure that only one execution is in-flight for a given key at a
// time. If a duplicate comes in, the duplicate caller waits for the
// original to complete and receives the same results.
// The return value shared indicates whether v was given to multiple callers.
func (g *Group) Do(key string, fn func() (interface{}, error)) (v interface{}, err error, shared bool) {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok {
		c.dups++
		g.mu.Unlock()
		c.wg.Wait()

		if e, ok := c.err.(*panicError); ok {
			panic(e)
		} else if c.err == errGoexit {
			runtime.Goexit()
		}
		return c.val, c.err, true
	}
	c := new(call)
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	g.doCall(c, key, fn)
	return c.val, c.err, c.dups > 0
}

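The built-in map[string]*call does not keep results "permanently" the way a cache does; it only tracks calls that are currently in flight. The reasons are:

In the Do method, m is a lazily initialized map, created when the first request arrives.
In the doCall method, once the request completes, its entry is deleted from the map.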

// doCall handles the single call for a key.
func (g *Group) doCall(c *call, key string, fn func() (interface{}, error)) {
	normalReturn := false
	recovered := false

	// use double-defer to distinguish panic from runtime.Goexit,
	// more details see https://golang.org/cl/134395
	defer func() {
		// the given function invoked runtime.Goexit
		if !normalReturn && !recovered {
			c.err = errGoexit
		}

		g.mu.Lock()
		defer g.mu.Unlock()
		c.wg.Done()
		if g.m[key] == c {
			delete(g.m, key)
		}

		if e, ok := c.err.(*panicError); ok {
			// In order to prevent the waiting channels from being blocked forever,
			// needs to ensure that this panic cannot be recovered.
			if len(c.chans) > 0 {
				go panic(e)
				select {} // Keep this goroutine around so that it will appear in the crash dump.
			} else {
				panic(e)
			}
		} else if c.err == errGoexit {
			// Already in the process of goexit, no need to call again
		} else {
			// Normal return
			for _, ch := range c.chans {
				ch <- Result{c.val, c.err, c.dups > 0}
			}
		}
	}()

	func() {
		defer func() {
			if !normalReturn {
				// Ideally, we would wait to take a stack trace until we've determined
				// whether this is a panic or a runtime.Goexit.
				//
				// Unfortunately, the only way we can distinguish the two is to see
				// whether the recover stopped the goroutine from terminating, and by
				// the time we know that, the part of the stack trace relevant to the
				// panic has been discarded.
				if r := recover(); r != nil {
					c.err = newPanicError(r)
				}
			}
		}()

		c.val, c.err = fn()
		normalReturn = true
	}()

	if !normalReturn {
		recovered = true
	}
}

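One more detail: why does the Group struct use sync.Mutex plus a plain map rather than sync.Map? A Reddit discussion titled "Why does singleflight use mutex + map instead of sync.Map?" addresses this. According to the sync package documentation, sync.Map is optimized for keys that are written once and read many times, or for goroutines that work on disjoint sets of keys. singleflight does the opposite: it inserts and deletes a key for every call, and Do must look up the key, then either increment dups or insert a new call, as one atomic step. A plain map guarded by a single sync.Mutex handles that simply and efficiently.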

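Summary

The key to handling cache breakdown is "penetration protection" (Bloom filter, null-value caching) plus "concurrency protection" (singleflight).

Bloom filter: checks whether the requested key may exist in the database. If it does not, return an empty value immediately so the request never reaches the database.
Null-value caching: when the requested key does not exist, cache an empty value in Redis with a short expiration time, so that requests in the near term get the empty value from the cache instead of repeatedly hitting the database.
singleflight: merges duplicate requests so that multiple requests do not hit the database at the same time.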

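Cache Avalanche

Definition

A cache avalanche occurs when a large amount of cached data expires at the same point in time, so that a large number of requests go straight to the backend store and backend load surges.

For example: a batch of cache entries expires at the same moment while many requests are accessing that data, so all of them hit the database, overloading it or even taking it down.

Problem Analysis

The Bloom filter, null-value caching, and singleflight used above for cache breakdown can also mitigate a cache avalanche to some extent. When many entries expire at once, a Bloom filter rejects requests for keys that do not exist in the database; null-value caching stores an empty value in Redis with a short expiration time so that repeated requests for a missing key are answered from the cache; and singleflight merges duplicate requests for the same key, so that within a process each expired key triggers only one database query at a time.

Beyond that, looking at how an avalanche arises, there are two main causes:

Poorly chosen expiration times: if all cached data expires at the same point in time, a large number of requests hit the database at once and overload it.
Cache server outage: if the cache server goes down, all requests go to the backend store and backend load surges.

So the fixes can target the causes:

Poorly chosen expiration times: when setting expiration times, add a random offset to spread the expirations out and avoid many entries expiring together.
Cache server outage: use Redis Sentinel or Redis Cluster for a highly available cache, avoiding a single point of failure.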

震朕的小宇宙
