I am planning to develop a music app which includes a function to find the similar song just like what KKBOX and Shazam are doing, but I'm not familiar in this area. I've found that they applied FFT to proceed the comparison of the songs so that the user can search the similar song.
However, i am thinking that what if I generate the waveform of the song, and then directly compare the waveform of the songs. I would like to ask is it possible for my idea?
As your objective is to find "similar" songs, comparing a 2d waveform is highly unlikely to work. However, it's a good idea to first explore the feasibility of your approach, before rejecting it out of hand.
I would suggest picking a set of 5 songs
Run through the librosa tutorials (https://librosa.org/doc/main/tutorial.html) and/or some of the walkthroughs on Medium (e.g. https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d), but stopping before you get to the part that uses MFCC. Just focus on the waveform images.
Looking at the visualizations for your songs and thinking through this problem, reason about a)why the waveform-comparison ought to work, and b)why the waveform-comparison won't work.
So think about things like tempo, timbre and timing - what would be the effect on the waveform of playing the same song on different instruments, with a different effects treatment, at a different tempo, or in a different order (same song, but changing order of verses and chorus).
Setting aside the non-trivial quetion of which waveform you'd be using (amplitude? of what frequency/frequencies?), at this point, you should see how many problems there are with just looking at the waveform, and why MFCC (or similar) is better. Additionally, you'll be better prepared to think about how MFCC parameters might be selected - how much of the song do you need to sample, when should you start the sampling.
Is your idea possible? Probably not in the way you are thinking - maybe you could experiment with something like transforming the data of the song in some way and then comparing that representation (e.g. looking at changes in amplitude or tempo) The problem with audio is that it encapsulates a lot of features in its signal:
Watch a tutorial on audio mixing and you'll see/hear just how much the output signal of the exact same song can be changed without actually changing the song being played.
Innovation sometimes emerges when curious people try things that 'probably won't work', so anything is worth a shot, but once you've figured out for yourself why something won't work, it's useful to accept commonly used techniques, and look for opportunities for innovation in other ways.