For testing purposes, I need to simulate a client that generates 100,000 messages per second and sends them to a Kafka topic. Is there any tool or way that can help me generate these random messages?
There's a built-in tool for generating a dummy load, located at bin/kafka-producer-perf-test.sh
(https://github.com/apache/kafka/blob/trunk/bin/kafka-producer-perf-test.sh). You may refer to https://github.com/apache/kafka/blob/trunk/tools/src/main/java/org/apache/kafka/tools/ProducerPerformance.java#L106 to figure out how to use it.
One usage example looks like this:
bin/kafka-producer-perf-test.sh --broker-list localhost:9092 --messages 10000000 --topic test --threads 10 --message-size 100 --batch-size 10000 --throughput 100000
The key here is the --throughput 100000 flag, which throttles the maximum message rate to approximately 100,000 messages per second.
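Note that newer Kafka versions ship a rewritten ProducerPerformance tool whose flags differ from the example above. Assuming a recent Kafka distribution (verify the flag names with bin/kafka-producer-perf-test.sh --help), a roughly equivalent invocation would be:
bin/kafka-producer-perf-test.sh --topic test --num-records 10000000 --record-size 100 --throughput 100000 --producer-props bootstrap.servers=localhost:9092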
The existing answers (e.g., kafka-producer-perf-test.sh) are useful for performance testing, but much less so when you need to generate more than just a single stream of raw bytes. If you need, for example, to simulate more realistic data with nested structures, or to generate data across multiple topics that are related to each other, those tools are not sufficient. So if you need more than a bunch of raw bytes, I'd look at the alternatives below.
Update Dec 2020: As of today, I recommend the use of https://github.com/MichaelDrogalis/voluble. Some background info: The author is the product manager at Confluent for Kafka Streams and ksqlDB, and the author/developer of http://www.onyxplatform.org/.
From the Voluble README:
- Creating realistic data by integrating with Java Faker.
- Cross-topic relationships
- Populating both keys and values of records
- Making both primitive and complex/nested values
- Bounded or unbounded streams of data
- Tombstoning
Voluble ships as a Kafka connector to make it easy to scale and to change serialization formats. You can use Kafka Connect through its REST API or integrate it with ksqlDB. Below, I demonstrate the latter, but the configuration is the same for both. I leave out Connect-specific configuration, like serializers and tasks, that needs to be configured for any connector.
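To give a flavor of the configuration, here is a sketch of a ksqlDB statement based on the examples in the Voluble README; the connector class, the genkp/genv property names, and the Java Faker expressions are taken from that README, so double-check them against the version you install:
CREATE SOURCE CONNECTOR voluble_demo WITH (
  'connector.class' = 'io.mdrogalis.voluble.VolubleSourceConnector',
  -- primitive keys and Java Faker-generated values for the "owners" topic
  'genkp.owners.with' = '#{Internet.uuid}',
  'genv.owners.name.with' = '#{Name.full_name}',
  -- "cats" records whose "owner" field references a key from the "owners" topic
  'genkp.cats.with' = '#{Internet.uuid}',
  'genv.cats.owner.matching' = 'owners.key'
);
This illustrates the cross-topic relationship feature from the list above: the matching directive makes every generated cat point at an owner key that actually exists.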
Old answer (2016): I'd suggest taking a look at https://github.com/josephadler/eventsim, which will produce more "realistic" synthetic data (yeah, I am aware of the irony of what I just said :-P):
Eventsim is a program that generates event data for testing and demos. It's written in Scala, because we are big data hipsters (at least sometimes). It's designed to replicate page requests for a fake music web site (picture something like Spotify); the results look like real use data, but are totally fake. You can configure the program to create as much data as you want: data for just a few users for a few hours, or data for a huge number of users over many years. You can write the data to files, or pipe it out to Apache Kafka.
You can use the fake data for product development, correctness testing, demos, performance testing, training, or in any other place where a stream of real-looking data is useful. You probably shouldn't use this data to research machine learning algorithms, and definitely shouldn't use it to understand how real people behave.
You can make use of Kafka Connect to generate random test data. Check out this custom source connector: https://github.com/xushiyan/kafka-connect-datagen
It allows you to define settings like a message template and randomizable fields to generate test data. Also check out this post for a detailed demonstration.
There is also a kafka-connect-datagen connector from Confluent that supports Avro data: https://github.com/confluentinc/kafka-connect-datagen
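As a sketch of how that Confluent connector can be started through the Kafka Connect REST API (the property names follow the confluentinc/kafka-connect-datagen documentation; the connector name, topic, quickstart schema, and host/port are placeholder assumptions):
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "datagen-pageviews",
  "config": {
    "connector.class": "io.confluent.kafka.connect.datagen.DatagenConnector",
    "kafka.topic": "pageviews",
    "quickstart": "pageviews",
    "max.interval": 100,
    "iterations": 10000000,
    "tasks.max": "1"
  }
}'
Here max.interval is the maximum wait in milliseconds between generated records, so lowering it (and raising tasks.max) is how you push the generated message rate up.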