Payment Systems: Caching

2023-08-27

System Design

Cache use cases

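Caching user data - e.g. wallet balance
Acting as a message queue
Distributed locks - cache lock
Preventing duplicate consumption of asynchronous messages
Rate limiting
Geohash - geographic range queries
Bloom filter - checking whether a user has already joined a campaign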

Cache eviction policies

Following are some of the most common cache eviction policies:

First In First Out (FIFO): The cache evicts the block that entered the cache first, regardless of how often or how many times it was accessed before.
Last In First Out (LIFO): The cache evicts the block that entered the cache most recently, regardless of how often or how many times it was accessed before.
Least Recently Used (LRU): Discards the least recently used items first.
Most Recently Used (MRU): Discards, in contrast to LRU, the most recently used items first.
Least Frequently Used (LFU): Counts how often an item is needed. Those that are used least often are discarded first.
Random Replacement (RR): Randomly selects a candidate item and discards it to make space when necessary.
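
As an illustration of one policy from the list, here is a minimal LRU sketch built on Go's `container/list`. The names `LRUCache`, `NewLRUCache`, `Get`, and `Put` are chosen for this example and do not come from the original system; the sketch is not safe for concurrent use.

```go
package main

import "container/list"

// entry is the payload stored in each list element.
type entry struct {
	key   string
	value string
}

// LRUCache is a fixed-capacity cache that evicts the least recently used key.
// It is not safe for concurrent use; guard it with a mutex if needed.
type LRUCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in order
}

// NewLRUCache creates a cache. A non-positive capacity disables eviction.
func NewLRUCache(capacity int) *LRUCache {
	return &LRUCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns the value and marks the key as most recently used.
func (c *LRUCache) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put inserts or updates a key, evicting the least recently used key when full.
func (c *LRUCache) Put(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	if c.capacity > 0 && c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
}
```

With capacity 2, putting `a` and `b`, reading `a`, and then putting `c` evicts `b`, because `b` is the entry that was used least recently.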

Cache update strategies

Cache Aside Pattern

This is by far the most commonly used pattern. Its logic is as follows:

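Miss: the application first tries the cache; if the data is not there, it reads from the database and, on success, puts the result into the cache.
Hit: the application reads the data from the cache and returns it.
Update: write the data to the database first and, on success, invalidate the cache entry.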
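
The three rules above can be sketched in Go as follows. This is a minimal sketch under stated assumptions: `Store` is a hypothetical in-memory map standing in for both the cache and the database, and `Service`, `Read`, and `Update` are names invented for this example.

```go
package main

import (
	"errors"
	"sync"
)

// ErrNotFound is returned when the key exists in neither cache nor database.
var ErrNotFound = errors.New("not found")

// Store is a tiny in-memory key-value store. In this sketch it stands in for
// both the cache and the database (an assumption made for illustration).
type Store struct {
	mu   sync.Mutex
	data map[string]string
}

func NewStore() *Store { return &Store{data: make(map[string]string)} }

func (s *Store) Get(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	return v, ok
}

func (s *Store) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

func (s *Store) Delete(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.data, key)
}

// Service applies the cache-aside pattern on top of a cache and a database.
type Service struct {
	Cache *Store
	DB    *Store
}

// Read returns cached data on a hit; on a miss it falls back to the database
// and saves the result into the cache.
func (s *Service) Read(key string) (string, error) {
	if v, ok := s.Cache.Get(key); ok {
		return v, nil // hit
	}
	v, ok := s.DB.Get(key) // miss: fall back to the database
	if !ok {
		return "", ErrNotFound
	}
	s.Cache.Set(key, v)
	return v, nil
}

// Update writes the database first, then invalidates the cache entry.
func (s *Service) Update(key, value string) {
	s.DB.Set(key, value)
	s.Cache.Delete(key)
}
```

The application, not the cache, owns both the load-on-miss and the invalidate-on-write steps, which is what distinguishes this pattern from read-through and write-through.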

Read Flow

sequenceDiagram
    actor User
    participant App
    participant Cache
    participant DB
    User->>App: Read API
    App->>Cache: item exist?
    alt hit cache
        Cache->>App:
        App->>User: return data
    else fallback to db
        App->>DB: Read DB
        DB->>App: return data
        App-->>Cache: save data into cache
        App->>User: return data
    end

Update Flow

sequenceDiagram
    actor User
    participant App
    participant Cache
    participant DB
    User->>App: Update API
    App->>DB: Write DB
    DB->>App: OK
    App-->>Cache: Invalidate item in cache
    App->>User: OK

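Q1: Why not update the cache after writing to the database?

Two concurrent writes can leave dirty data in the cache. Example:

P1 writes the DB
P2 writes the DB, and P2 finishes its cache update before P1
P1 updates the cache, so the cache no longer holds the latest data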

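Q2: What about deleting the cache first, then updating the database, and letting later reads load the data back into the cache?

Consider two concurrent operations, an update and a query. The update deletes the cache entry; the query misses the cache, reads the old value from the database, and puts it into the cache; then the update writes the database. The cache now holds the old value, so its data is dirty.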

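Q3: What about updating the database first and, on success, invalidating the cache?

Again consider a query running concurrently with an update. There is no delete-cache step up front; the update writes the database first. The cache entry is still valid at that moment, so the concurrent query returns the pre-update value. The update then invalidates the cache entry right away, and subsequent queries pull the new data from the database.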

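Q4: Does Cache Aside have concurrency problems? Yes.

A concurrent read and write can leave dirty data in the cache. Example:

P1 reads, misses the cache, then reads the DB
P2 updates: after writing the DB, it invalidates the cache
P1 saves the value it read earlier into the cache; at this point the cache holds dirty data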
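
To make the interleaving concrete, the sketch below replays the three steps sequentially on plain maps instead of racing real goroutines. The function name `replayStaleFill` and the `balance` key are invented for this illustration.

```go
package main

// replayStaleFill replays the P1/P2 interleaving step by step on plain maps
// and returns what the database and the cache hold at the end.
func replayStaleFill() (dbValue, cacheValue string) {
	db := map[string]string{"balance": "100"}
	cache := map[string]string{}

	// Step 1 (P1): the read misses the cache, then reads the database.
	var p1Read string
	if _, hit := cache["balance"]; !hit {
		p1Read = db["balance"] // P1 now holds "100"
	}

	// Step 2 (P2): the update writes the database, then invalidates the cache.
	db["balance"] = "80"
	delete(cache, "balance")

	// Step 3 (P1): saves the value it read in step 1, which is now stale.
	cache["balance"] = p1Read

	return db["balance"], cache["balance"]
}
```

After the replay the database holds `80` while the cache holds `100`, and the cache stays wrong until the entry expires or the next update invalidates it.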

The Q4 case can occur in theory, though ~~in practice the probability is probably very low~~.

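The argument for a low probability is that the case requires a read to hit an expired cache entry while a write runs concurrently. Database writes are usually much slower than reads and also take locks, yet here the read must enter the database before the write and update the cache after the write. The chance of all these conditions holding at once is small.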

Read-Through Cache

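Read Through updates the cache during queries. When a cache entry is missing (expired or evicted by LRU), Cache Aside makes the caller responsible for loading the data into the cache, whereas Read Through has the cache service load it itself, which makes the process transparent to the application.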

If the cache service loads the data itself, how does it do so?
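
One common answer, shown here only as a minimal sketch, is to hand the cache a loader callback when it is constructed. `Loader`, `ReadThroughCache`, and `NewReadThroughCache` are hypothetical names, and the sketch holds a single lock during the load, which keeps it simple but serializes all misses.

```go
package main

import "sync"

// Loader fetches a value from the backing store on a cache miss.
type Loader func(key string) (string, error)

// ReadThroughCache loads missing entries itself, so callers only call Get.
type ReadThroughCache struct {
	mu   sync.Mutex
	data map[string]string
	load Loader
}

func NewReadThroughCache(load Loader) *ReadThroughCache {
	return &ReadThroughCache{data: make(map[string]string), load: load}
}

// Get returns the cached value, or invokes the loader and caches its result.
// Holding the lock during the load keeps the sketch simple but serializes
// all misses; a real cache would likely load per key.
func (c *ReadThroughCache) Get(key string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.data[key]; ok {
		return v, nil
	}
	v, err := c.load(key)
	if err != nil {
		return "", err // failed loads are not cached
	}
	c.data[key] = v
	return v, nil
}
```

The caller never touches the database directly: a second `Get` for the same key is served from the cache without invoking the loader again.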

Write-Through Cache

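Write Through is similar to Read Through, except that it happens when data is updated.

On an update, if the cache misses, update the DB directly and return. If the cache hits, update the cache, and the cache itself then updates the database synchronously, so the cache and the DB are updated together.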

Under this scheme, data is written into the cache and the corresponding database at the same time. The cached data allows for fast retrieval and, since the same data gets written to permanent storage, the cache and the storage stay consistent. This scheme also ensures that nothing is lost in case of a crash, power failure, or other system disruption. Although write-through minimizes the risk of data loss, every write must be done twice before success is returned to the client, so this scheme has higher latency for write operations.
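
A minimal write-through sketch follows. `Writer`, `WriteThroughCache`, and `ErrStoreDown` are hypothetical names. The sketch writes the backing store first and the cache second, so a failed store write leaves the cache untouched; that ordering is a choice made for this example, not something the text above prescribes.

```go
package main

import "errors"

// ErrStoreDown is a sample error a Writer may return (hypothetical).
var ErrStoreDown = errors.New("store unavailable")

// Writer persists a single key synchronously to the backing store.
type Writer func(key, value string) error

// WriteThroughCache writes to the backing store and the cache in one call.
// It is not safe for concurrent use.
type WriteThroughCache struct {
	data  map[string]string
	write Writer
}

func NewWriteThroughCache(write Writer) *WriteThroughCache {
	return &WriteThroughCache{data: make(map[string]string), write: write}
}

// Set returns only after the store write has succeeded. If the store write
// fails, the cache is left unchanged so it never gets ahead of the store.
func (c *WriteThroughCache) Set(key, value string) error {
	if err := c.write(key, value); err != nil {
		return err
	}
	c.data[key] = value
	return nil
}

// Get reads from the cache only.
func (c *WriteThroughCache) Get(key string) (string, bool) {
	v, ok := c.data[key]
	return v, ok
}
```

Every `Set` pays for the store round trip before returning, which is the write-latency cost described above.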

Write-Around cache

Data is written directly to the DB, skipping the cache.

This technique is similar to write through cache, but data is written directly to permanent storage, bypassing the cache. This can reduce the cache being flooded with write operations that will not subsequently be re-read, but has the disadvantage that a read request for recently written data will create a “cache miss” and must be read from slower back-end storage and experience higher latency.

Write-Back Cache

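On an update, only the cache is updated, not the database; the cache then flushes the updates to the database asynchronously in batches.

Because the flush is asynchronous, write back can also merge multiple operations on the same data, so the performance gain is considerable. The cost is that the data is not strongly consistent and may be lost. This is why an unclean shutdown of Unix/Linux can lose data.

Write Back is more complex to implement, because it has to track which data has been updated and still needs to be flushed to the persistence layer.

An operating system's write back persists the data only when the cache entry has to be invalidated, for example when memory runs low or the process exits. This is also called lazy write.

If the DB update fails, how do we keep the cache and the DB eventually consistent?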

Under this scheme, data is written to cache alone and completion is immediately confirmed to the client. The write to the permanent storage is done after specified intervals or under certain conditions. This results in low latency and high throughput for write-intensive applications, however, this speed comes with the risk of data loss in case of a crash or other adverse event because the only copy of the written data is in the cache.
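
The dirty-tracking and batching described in this section can be sketched as below. `Flusher`, `WriteBackCache`, and `Flush` are hypothetical names, and the flush is triggered manually here rather than by a timer. Keeping entries dirty when a flush fails is one simple way to allow a later retry; it is an assumption of this sketch, not a complete answer to the consistency question.

```go
package main

// Flusher persists a batch of dirty entries to the backing store.
type Flusher func(batch map[string]string) error

// WriteBackCache acknowledges writes from memory and persists them later.
// It is not safe for concurrent use.
type WriteBackCache struct {
	data  map[string]string
	dirty map[string]struct{} // keys changed since the last successful flush
	flush Flusher
}

func NewWriteBackCache(flush Flusher) *WriteBackCache {
	return &WriteBackCache{
		data:  make(map[string]string),
		dirty: make(map[string]struct{}),
		flush: flush,
	}
}

// Set updates the cache only and marks the key as dirty.
func (c *WriteBackCache) Set(key, value string) {
	c.data[key] = value
	c.dirty[key] = struct{}{}
}

// Flush writes the latest value of every dirty key in one batch, so repeated
// writes to the same key are merged. On failure the keys stay dirty and a
// later Flush retries them.
func (c *WriteBackCache) Flush() error {
	if len(c.dirty) == 0 {
		return nil
	}
	batch := make(map[string]string, len(c.dirty))
	for key := range c.dirty {
		batch[key] = c.data[key]
	}
	if err := c.flush(batch); err != nil {
		return err
	}
	c.dirty = make(map[string]struct{})
	return nil
}
```

Three consecutive `Set` calls on the same key produce a single store write at flush time. Anything written between the last flush and a crash is lost, which is the risk this section describes.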

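Production incidents caused by caching

Updating the cache inside a DB transaction degraded DB performance

While refactoring the system, some newer team members introduced the following way of updating the cache: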

invalidate cache
start DB txn
update DB data
save data into cache
commit DB txn
invalidate cache again if DB error

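Resulting problems

Load testing showed that updating the cache inside a DB transaction slowed down DB operations: each transaction took longer, and the TPS the system could sustain dropped noticeably.
When the cache slows down, DB transactions slow down with it; in extreme cases the database connections are exhausted.

Solution: move the cache update logic outside the DB transaction, and delete the old cache entry after the database write completes. A DB transaction should not contain external calls such as RPC calls, HTTP calls, or Redis access.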

Problem with deleting the old cache entry after writing the database

start DB txn
update DB data
commit DB txn
delete old Cache

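Problem: two concurrent operations, an update and a query. After the update deletes the cache entry, the query misses the cache, reads the old data, and puts it into the cache. The cache therefore still holds the old data and is dirty.

Why did the query read old data?

Because the read API reads from the slave DB by default. When replication lag occurs, data read from the slave is stale, and under high QPS replication lag cannot be avoided.

Solution: after writing the database, read the master and refresh the cache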

start DB txn
update DB data
commit DB txn
read master DB and reset Cache

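Reading from the master always returns the latest data and sidesteps replication lag, at the cost of more load on the master. Two concurrent operations, P1 and P2, can still leave dirty data in the cache, but the probability is fairly low.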

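Hash tags overloaded Redis CPU

Problem: during load testing, one codis node showed high CPU load. Inspecting the keys revealed a key prefix of the form "{%d}:xxx:xxx", user_id

Hash tag key: a string wrapped in curly brackets ({}) and used as the cache key prefix. Format: "{%s}-%s" % (str1literal, str2userid), e.g. "{abc}-123", "{abc}-456"

Keys that contain the same hash tag are stored in the same codis slot, which can skew the data distribution and overload a single codis node.

Solution: change how cache keys are generated and remove the {}, so that keys are not all assigned to a single codis node.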
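
The effect of a shared hash tag can be sketched as follows. This is an illustration only: it assumes a CRC32-modulo-1024 slot function and a first-`{...}`-wins tag rule, so check the documentation of the proxy actually in use. `hashTag`, `slotOf`, and `numSlots` are names invented for this sketch.

```go
package main

import (
	"hash/crc32"
	"strings"
)

// numSlots is the slot count assumed for this illustration.
const numSlots = 1024

// hashTag returns the text between the first '{' and the next '}' when that
// text is non-empty; otherwise it returns the whole key.
func hashTag(key string) string {
	start := strings.IndexByte(key, '{')
	if start < 0 {
		return key
	}
	end := strings.IndexByte(key[start+1:], '}')
	if end <= 0 {
		return key // no closing brace, or an empty tag
	}
	return key[start+1 : start+1+end]
}

// slotOf maps a key to a slot using only its hash tag when one is present.
func slotOf(key string) uint32 {
	return crc32.ChecksumIEEE([]byte(hashTag(key))) % numSlots
}
```

Under these assumptions `{abc}-123` and `{abc}-456` hash only the tag `abc`, so every key with that prefix lands in the same slot no matter how many users there are. Without the braces the whole key is hashed and the keys can spread across slots.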

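Using Redis as a message queue caused a hot key

Problem: codis was used as a message queue, and the QPS on a single queue key was too high. This overloaded the CPU of one codis node and affected production traffic.

Solution:

Short term - a scheduled job that cleans up big keys;
Long term - use a dedicated codis cluster, or use Kafka as the message queue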

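Cache lock not released in time

Problem: after the business logic returned an error, the cache lock was not released in time, so upstream retries kept getting `fail to get cache lock`.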

func DoBusiness(p Param) error {
  lock, err := getCacheLock()
  if err != nil {
    return err // `fail to get cache lock` err
  }

  err = DoSomething()
  if err != nil {
    return err // BUG: returning here skips lock.release() below
  }
  ...
  lock.release() // release lock
  ...
  return nil
}

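Solution: release the lock in a defer, registered only after the lock has been acquired:

func DoBusiness(p Param) error {
  lock, err := getCacheLock()
  if err != nil {
    return err // lock was not acquired, so there is nothing to release
  }
  // Release the lock on every return path, including error returns.
  defer lock.release()

  err = DoSomething()
  if err != nil {
    return err
  }
  ...
  return nil
}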

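Load-test traffic hit the production codis cluster

Solution: load-test scripts need testing too; verify their correctness with a small amount of traffic first.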

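DR datacenter switchover invalidated the cache

Problem: some business flows relied on keys in the cache to guarantee idempotency. During the DR switchover the service moved to a new Redis, but these keys were not synced over, so idempotency could not be guaranteed and duplicate orders were placed.

Solution: idempotency for critical flows must not rely entirely on the cache; a database unique key (UK) is the safer choice.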
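
The unique-key approach can be sketched as follows. `OrderTable` is an in-memory stand-in for a table with a unique key on an idempotency column, and `PlaceOrder`, `ErrDuplicateKey`, and the key format are hypothetical. With a real database, the insert would fail with the driver's duplicate-key error instead.

```go
package main

import (
	"errors"
	"sync"
)

// ErrDuplicateKey mimics a unique-key violation returned by a database.
var ErrDuplicateKey = errors.New("duplicate idempotency key")

// OrderTable is an in-memory stand-in for an orders table with a unique key
// on the idempotency key column (hypothetical schema).
type OrderTable struct {
	mu   sync.Mutex
	rows map[string]string // idempotency key -> order ID
}

func NewOrderTable() *OrderTable {
	return &OrderTable{rows: make(map[string]string)}
}

// Insert stores the order, or fails with ErrDuplicateKey if the key exists.
func (t *OrderTable) Insert(idemKey, orderID string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, exists := t.rows[idemKey]; exists {
		return ErrDuplicateKey
	}
	t.rows[idemKey] = orderID
	return nil
}

// Lookup returns the order ID stored under the idempotency key.
func (t *OrderTable) Lookup(idemKey string) (string, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	id, ok := t.rows[idemKey]
	return id, ok
}

// PlaceOrder creates the order at most once per idempotency key. A retry
// with the same key returns the original order ID with created == false.
func PlaceOrder(t *OrderTable, idemKey, newOrderID string) (orderID string, created bool, err error) {
	err = t.Insert(idemKey, newOrderID)
	if err == nil {
		return newOrderID, true, nil
	}
	if errors.Is(err, ErrDuplicateKey) {
		existing, _ := t.Lookup(idemKey)
		return existing, false, nil
	}
	return "", false, err
}
```

Because the uniqueness check lives in the durable store, it survives a cache switchover: a retried request finds the existing row and gets the original order back instead of creating a second one.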

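DR datacenter switchover left the cache unreachable

Problem: after the DR switchover, the IP that the Cache Proxy DNS name resolved to had not been added to the new datacenter's allowlist, so services could not connect to the cache and could not restart.

Solution: SRE adds a cache IP allowlist check to the switchover checklist.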