Single consumer vs multiple consumers
Problems with multiple consumers (order of message processing, double processing)
Scale horizontally (more machines) or vertically (a better machine).
Single consumer: message order is preserved, but the setup is not reliable (restarts, network issues, ...).
Multiple consumers: high availability, but messages can be processed out of order, and there is a higher chance of processing the same message multiple times.
Out-of-order issue demonstration (push model):
We have a queue and two consumers, 1 and 2; messages are dispatched in a round-robin manner.
We have three messages A, B, and C: A goes to consumer 1, B to consumer 2, and so on.
One day message A is processed by consumer 1, but more slowly than B is processed by consumer 2, while A is "create booking" and B is "cancel booking".
Here the cancel is applied first and the create afterwards, so the booking ends up not cancelled.
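A minimal sketch of this race, using threads with artificial delays to stand in for two consumer machines (names and timings are illustrative):

```python
import threading
import time

bookings = set()        # stands in for the "bookings table"
lock = threading.Lock()

def consumer(name, action, booking_id, delay):
    time.sleep(delay)   # simulates one consumer being slower than the other
    with lock:
        if action == "create":
            bookings.add(booking_id)
        elif action == "cancel":
            bookings.discard(booking_id)  # no-op: the booking doesn't exist yet
    print(f"{name} processed {action} booking {booking_id}")

# Round robin: A (create) goes to consumer 1, B (cancel) goes to consumer 2.
t1 = threading.Thread(target=consumer, args=("consumer-1", "create", 42, 0.2))
t2 = threading.Thread(target=consumer, args=("consumer-2", "cancel", 42, 0.0))
t1.start(); t2.start(); t1.join(); t2.join()

print(bookings)  # {42}: the cancel ran before the create, so the booking survives
```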
Double-processing demonstration (pull model):
We have a queue and two consumers, 1 and 2, and three messages A, B, and C.
Consumer 1 pulls A and consumer 2 pulls B; pulled messages are marked invisible.
However, consumer 1 processes message A but fails to send the ack back to the queue,
so after the visibility timeout A becomes visible again and is pulled by consumer 2.
With the same example (A is "create booking", B is "cancel booking"): consumer 1 creates the booking and consumer 2 cancels it,
but then consumer 2 processes A again, so the booking is created again.
One solution is to use a deduplication cache shared by all consumers; see the sketch below.
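A minimal sketch of such a cache, with an in-memory dict standing in for a shared store such as Redis (names and the TTL are illustrative):

```python
import time

dedup_cache = {}          # message_id -> time it was processed
DEDUP_TTL_SECONDS = 3600  # keep entries at least as long as redelivery can occur

def process_once(message_id, handler):
    now = time.time()
    # Evict expired entries so the cache does not grow without bound.
    for mid, ts in list(dedup_cache.items()):
        if now - ts > DEDUP_TTL_SECONDS:
            del dedup_cache[mid]
    if message_id in dedup_cache:
        return "skipped duplicate"
    handler()                      # do the actual work (create/cancel booking)
    dedup_cache[message_id] = now  # mark the message as processed
    return "processed"

print(process_once("msg-A", lambda: print("create booking")))
print(process_once("msg-A", lambda: print("create booking")))  # redelivered: skipped
```

In a real deployment the cache must actually be shared by all consumers and the check-and-mark step must be atomic (e.g., a Redis SET with the NX and EX options), otherwise two consumers can pass the check at the same time.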
One more example with multiple consumers: log-based messaging.
Solution: data partitioning (sharding): multiple queues, each with its own single consumer.
Pros and cons
Applications
Messaging system (RabbitMQ): sharded queues.
The producer sends messages to a single queue, which is then partitioned into multiple regular queues (shards). Each shard has a single consumer. Total order is lost, but order is guaranteed at the shard level.
Log-based messaging systems (Kafka, Kinesis):
Divide incoming messages into partitions; each partition is processed by its own consumer -> no competing consumers -> so when we want to parallelize message processing, we create more partitions; see the sketch below.
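A minimal sketch of key-based partitioning; Kafka's default partitioner hashes the message key (it uses murmur2; crc32 stands in here), so all events for one booking land in the same partition and keep their relative order:

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Deterministic hash of the message key, modulo the partition count.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# "create" and "cancel" for booking 42 hash to the same partition, in order.
for key, action in [("booking-42", "create"), ("booking-42", "cancel")]:
    print(f"{action} -> partition {partition_for(key)}")
```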
Database
Each shard (also called a node) holds a subset of the data -> new issues arise:
How to partition data
Configuration service: knows how many shards there are and which data lives on which shard -> a client request goes to the request router, and the router consults the configuration service to find which shard to use -> used by SQL and NoSQL databases.
Alternatively, avoid additional components and let the nodes communicate with each other about how to split the data and route requests -> Dynamo and Cassandra.
Sharding can also be used to build a distributed cache system.
Object storage -> S3, Google Cloud Storage, Azure Blob Storage, ...
When we upload a file, one node is selected and the file is stored there;
the metadata (which node holds which file) is stored in the configuration service.
When we get a file, the request router checks the metadata service and forwards the request to the right node.
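A minimal sketch of this flow, with dicts standing in for the storage nodes and the metadata/configuration service (all names are illustrative):

```python
import random

nodes = ["node-A", "node-B", "node-C"]
storage = {node: {} for node in nodes}  # each node's local object store
metadata = {}                           # object key -> node (the metadata service)

def upload(key: str, blob: bytes) -> None:
    node = random.choice(nodes)  # select a node for the new object
    storage[node][key] = blob
    metadata[key] = node         # record the placement in the metadata

def get(key: str) -> bytes:
    node = metadata[key]         # router consults the metadata service...
    return storage[node][key]    # ...and forwards the request to that node

upload("photos/cat.jpg", b"\x89...")
print(metadata["photos/cat.jpg"], get("photos/cat.jpg"))
```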
Lookup Strategy
Range strategy
Hash strategy
Lookup strategy: we create the key-to-shard mapping ourselves, e.g., assign a shard (even randomly) to each user and record the assignment.
Range strategy: each shard is responsible for a range of contiguous keys.
Pros: easy to implement, works well with range queries (e.g., all orders in a month).
Cons: provides suboptimal balancing between shards, which may lead to the hot-shard problem (a single shard gets a much higher load than the others).
Hash strategy: e.g., take the user's last name as a string and compute a hash, then map the hash to a shard; see the sketches below.
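Minimal sketches of the three strategies (shard names and boundaries are made up):

```python
import bisect
import zlib

# Lookup strategy: an explicit key -> shard table that we maintain ourselves.
lookup_table = {"user-1": "shard-2", "user-2": "shard-0"}

# Range strategy: each shard owns a contiguous key range (here, by first letter).
range_bounds = ["g", "n", "t"]  # upper bounds for shards 0..2; the rest -> shard 3
range_shards = ["shard-0", "shard-1", "shard-2", "shard-3"]

def range_lookup(key: str) -> str:
    return range_shards[bisect.bisect_right(range_bounds, key[0].lower())]

# Hash strategy: hash the key and take it modulo the shard count.
def hash_lookup(key: str, num_shards: int = 4) -> str:
    return f"shard-{zlib.crc32(key.encode()) % num_shards}"

print(lookup_table["user-1"])  # lookup: flexible, but the table must be managed
print(range_lookup("Smith"))   # range: supports scans such as "all names M..S"
print(hash_lookup("Smith"))    # hash: even spread, but range queries hit every shard
```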
Physical and virtual shards
Request routing options
Physical machines vs virtual shards: a physical machine can host multiple virtual shards.
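A minimal sketch of the two-level mapping, with made-up counts: a key hashes to one of many virtual shards, and each virtual shard lives on one of a few physical machines:

```python
import zlib

NUM_VIRTUAL_SHARDS = 16
# Each physical machine hosts several virtual shards.
virtual_to_physical = {v: f"machine-{v % 4}" for v in range(NUM_VIRTUAL_SHARDS)}

def machine_for(key: str) -> str:
    v = zlib.crc32(key.encode()) % NUM_VIRTUAL_SHARDS  # key -> virtual shard
    return virtual_to_physical[v]                      # virtual shard -> machine

print(machine_for("user-1234"))
# Load can be rebalanced by moving a virtual shard to another machine,
# without rehashing any keys:
virtual_to_physical[5] = "machine-0"
```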
Request routing service: identifies the shard that stores the data for the specified key and forwards the request to the physical machine that hosts that shard -> this requires a mapping.
The mapping holds each shard's name and the IP of the machine the shard lives on -> where do we put this info?
Option 1: a static file deployed to every machine.
Lacks flexibility: any update (number of shards, IPs, shard-to-IP assignment) requires a redeployment.
Option 2: shared storage: keep the mapping in S3, and a daemon process pulls it from S3 periodically.
More flexible compared to option 1,
but it increases the complexity of the client,
and the daemon process is vital: otherwise the mapping may become outdated and requests get forwarded to the wrong shard.
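A minimal sketch of option 2, with a local JSON file standing in for S3 and a background thread standing in for the daemon (path and interval are illustrative):

```python
import json
import threading
import time

MAPPING_PATH = "shard_mapping.json"  # e.g. {"shard-0": "10.0.0.1", ...}
_mapping = {}
_lock = threading.Lock()

def _refresh_loop(interval_seconds: float = 30.0):
    """Periodically pull the shard -> IP mapping from shared storage."""
    global _mapping
    while True:
        try:
            with open(MAPPING_PATH) as f:
                fresh = json.load(f)
            with _lock:
                _mapping = fresh
        except (OSError, ValueError):
            pass  # keep the last known mapping if the pull fails
        time.sleep(interval_seconds)

def address_of(shard: str) -> str:
    with _lock:
        return _mapping[shard]

threading.Thread(target=_refresh_loop, daemon=True).start()
```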
To avoid having to update clients -> the proxy option and the peer-to-peer option.
This mapping doesn't change frequently -> if we have a small number of servers, we can share the list via DNS; if we have many servers, we can put a load balancer between clients and servers.
How do proxy machines find shard servers on the network, and how do shard servers find each other? -> service discovery.
Why and how to rebalance
Why rebalance: uneven load (number of requests) on the shard servers -> caused by uneven data distribution, hot keys, or a shard-server failure (handover) -> rebalancing.
Option 1: split a shard when its size exceeds a limit -> triggered automatically or manually -> clone the shard (while it keeps handling all requests) and update the metadata (see the sketch after the options below).
Option 2: split shards when adding a new server.
Option 3: each shard owns a small range of hash keys and the number of shards is fixed -> the fixed-number-of-partitions strategy.
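A minimal sketch of option 1 (the size-triggered split) for a range-based shard; sizes, the limit, and the midpoint rule are all illustrative:

```python
# Shard metadata: key range plus current size.
shards = {"shard-0": {"range": ("a", "z"), "size_gb": 120}}
SPLIT_LIMIT_GB = 100

def maybe_split(name: str) -> None:
    shard = shards[name]
    if shard["size_gb"] <= SPLIT_LIMIT_GB:
        return
    lo, hi = shard["range"]
    mid = chr((ord(lo) + ord(hi)) // 2)  # midpoint of the key range
    # Clone half of the data to a new shard, then update the metadata.
    shards[name] = {"range": (lo, mid), "size_gb": shard["size_gb"] / 2}
    shards[name + "-split"] = {"range": (mid, hi), "size_gb": shard["size_gb"] / 2}

maybe_split("shard-0")
print(shards)  # shard-0 now owns a..m, shard-0-split owns m..z
```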
How to implement
Pros and cons
Virtual nodes
Applications