apache-kafka apache-storm

KafkaSpout very high latency


I am new to Apache Storm and Kafka. As part of a POC I am trying to process a message stream using Kafka and Apache Storm. I am using the storm-kafka source from https://github.com/apache/storm/tree/master/external/storm-kafka, and I was able to create a sample program that reads messages from a Kafka topic using KafkaSpout and writes them to another Kafka topic.

I have a 3-node Kafka cluster (all three brokers running on the same server) and created the topics with 8 partitions. I set the KafkaSpout parallelism to 8 and the bolt's parallelism to 8 as well, and tried this with 8 executors as well as with 8 tasks. I have tried a lot of tuning parameters at the Kafka level, the SpoutConfig level, and the Storm level, but I am still seeing very high overall latency. I need a message processing guarantee, so acking is really required.

The Storm cluster has one supervisor, and the 3-node ZooKeeper ensemble is shared between Kafka and Storm. Everything runs on a Red Hat Linux machine with 144MB RAM and 16 CPUs. With the parameters below I am getting a very high spout process latency of about 40 seconds, and I need to reach around 50K msg/sec. Can you please help me with a configuration to achieve that? I have gone through a lot of posts on various sites and tried many tuning options with no results.

Storm config
topology.receiver.buffer.size=16
topology.transfer.buffer.size=4096
topology.executor.receive.buffer.size=16384
topology.executor.send.buffer.size=16384
topology.spout.max.batch.size=65536
topology.max.spout.pending=10000
topology.acker.executors=20
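
A minimal sketch of how these topology settings would typically be applied in code (assuming Storm 1.x package names; the buffer sizes are set through plain string keys, and the values simply mirror the list above):

    import org.apache.storm.Config;

    // Sketch only: applies the settings listed above programmatically.
    Config conf = new Config();
    conf.setMaxSpoutPending(10000);                               // topology.max.spout.pending
    conf.setNumAckers(20);                                        // topology.acker.executors
    conf.put("topology.receiver.buffer.size", 16);
    conf.put("topology.transfer.buffer.size", 4096);
    conf.put("topology.executor.receive.buffer.size", 16384);
    conf.put("topology.executor.send.buffer.size", 16384);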

Kafka config
fetch.size.bytes=1048576
socket.timeout.ms=10000
fetch.max.wait=10000
buffer.size.bytes=1048576
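
A sketch of how these fetch settings map onto the storm-kafka SpoutConfig fields (assuming Storm 1.x package names; the ZooKeeper connect string, topic name, zkRoot, and consumer id below are placeholders):

    import org.apache.storm.kafka.BrokerHosts;
    import org.apache.storm.kafka.KafkaSpout;
    import org.apache.storm.kafka.SpoutConfig;
    import org.apache.storm.kafka.StringScheme;
    import org.apache.storm.kafka.ZkHosts;
    import org.apache.storm.spout.SchemeAsMultiScheme;

    // Sketch only: placeholder broker ZK quorum, topic, zkRoot, and consumer id.
    BrokerHosts hosts = new ZkHosts("zk1:2181,zk2:2181,zk3:2181");
    SpoutConfig spoutConfig = new SpoutConfig(hosts, "input-topic", "/kafka-spout", "poc-consumer");
    spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
    spoutConfig.fetchSizeBytes  = 1048576;   // fetch.size.bytes
    spoutConfig.socketTimeoutMs = 10000;     // socket.timeout.ms
    spoutConfig.fetchMaxWait    = 10000;     // fetch.max.wait
    spoutConfig.bufferSizeBytes = 1048576;   // buffer.size.bytes
    KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);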

Thanks in advance.

Storm UI screenshot



Solution

  • Your topology has several problems:

    1. You should have the same number of spout executors as Kafka partitions.
    2. Your topology can't handle tuples fast enough; I am surprised the tuples haven't already started failing by timeout. Use a reasonable value for topology.max.spout.pending; I recommend 150 or 250. This will only prevent timeouts, though: your spouts will still consume tuples slowly because the remainder of the topology can't keep up.
    3. You need to add more executors to your bolts; the only way your topology gets faster is by bringing more execution units into play. Executors and tasks are not the same thing: it is executors you need to add to the topology. Your single-executor latency is 0.097 ms, which means one executor can process around 10,309 tuples per second; so to reach your goal of 50K per second you need at least 5 executors. I am sure that with your 16-CPU machine you can spare more than 1 CPU to work on the bolt (see the sketch after this list).
    4. The main purpose of tasks is that they can be promoted to executors during a rebalance; therefore num tasks >= num executors.
    5. If you are using global grouping, you will need to redesign your topology to use something like fields grouping instead.
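
A minimal sketch of how these points come together when wiring the topology (Storm 1.x APIs; the component names are placeholders, ProcessBolt stands in for your existing bolt, and kafkaSpout is the KafkaSpout built from your SpoutConfig):

    import org.apache.storm.Config;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    TopologyBuilder builder = new TopologyBuilder();

    // One spout executor per Kafka partition: 8 partitions -> parallelism hint 8.
    builder.setSpout("kafka-spout", kafkaSpout, 8);

    // More bolt executors so the topology can keep up, with num tasks >= num
    // executors so a later `storm rebalance` can scale the bolt further.
    builder.setBolt("process-bolt", new ProcessBolt(), 16)        // ProcessBolt = your bolt
           .setNumTasks(32)
           .fieldsGrouping("kafka-spout", new Fields("str"));     // "str" is the field StringScheme emits; use fields grouping instead of global grouping

    Config conf = new Config();
    conf.setMaxSpoutPending(250);   // bound in-flight tuples to avoid timeouts
    conf.setNumAckers(8);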