I am new to searching code and data on GitHub. My goal is to scan all public repositories on GitHub (and other Git hosts) to make sure that no one has copied my company's source code or sensitive data.
I am thinking about the below challenges;
Please give me a guide for this.
Thanks in advance for your help!
Abhishek
Welcome to StackOverflow!
Your best bet is to use GitHub's search API to find the code you are interested in. For example, using GitHub's regular search (not the API) for my domain name, I was able to find code that I had committed.
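As a rough sketch of what an API-based search could look like, the snippet below queries GitHub's code search endpoint for an exact phrase and lists the matching files. The helper names (`build_search_url`, `extract_hits`, `search_code`) are made up for illustration, and the example assumes you have a personal access token, since code search requires authentication.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.github.com/search/code"


def build_search_url(term: str) -> str:
    """Build a code-search URL that matches `term` as an exact phrase."""
    # Wrapping the term in double quotes asks for an exact phrase match.
    query = urllib.parse.urlencode({"q": f'"{term}"', "per_page": 50})
    return f"{API_URL}?{query}"


def extract_hits(payload: dict) -> list:
    """Reduce a search response to (repository, path, url) tuples."""
    return [
        (item["repository"]["full_name"], item["path"], item["html_url"])
        for item in payload.get("items", [])
    ]


def search_code(term: str, token: str) -> list:
    """Run one search request; `token` is your own access token (assumed)."""
    request = urllib.request.Request(
        build_search_url(term),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_hits(json.load(response))
```

You would call something like `search_code("internal.example.com", token)` with a string that only your code should contain (the domain here is a placeholder). This fetches only the first page of results, and the search API is rate limited, so a real bot would need paging and back-off.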
However, keep in mind that this won't fully solve your problem of making sure no one has copied your source code. There are countless Git hosting services: GitHub, GitLab, and Bitbucket, just to name a few. Besides that, you also have to contend with private repositories, which a search cannot reach. It is impossible to search everything. Your best bet is to put safeguards in place to prevent leaks from happening, such as strict access controls and making sure your employees and any vendors you work with understand and agree to company policy regarding data.
Finally, having a good responsible disclosure program will encourage white-hat hackers to inform you of any breaches.
Now, with all that in mind, I still think creating a small bot to search popular places like GitHub is not a bad idea. Another thing you could do is create a canary: an object whose sole job is to be uniquely identifiable, so that if there is a breach, your search can find it easily.
A canary can be a unique row in a database, a specific file with unique text in it, and so on. You search for that text regularly, and if it comes up, you know that there was a breach.
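To make the canary idea concrete, here is a minimal sketch that generates unguessable canary strings and checks a piece of text for them. The function names and the `canary-` prefix are assumptions for illustration; the only real requirement is that each string is unique enough to never appear by accident.

```python
import secrets


def make_canary(prefix: str = "canary") -> str:
    """Return a random marker string that should never occur by chance."""
    # 16 random bytes rendered as 32 hex characters.
    return f"{prefix}-{secrets.token_hex(16)}"


def find_leaked_canaries(canaries: dict, text: str) -> list:
    """Return the labels of all canaries whose marker appears in `text`.

    `canaries` maps a label describing where the marker was planted
    (for example "customers-table") to the marker string itself.
    """
    return [label for label, marker in canaries.items() if marker in text]
```

Keeping one marker per location (one for the database row, one for the file, and so on) means that a hit also tells you which system the data leaked from. Store the markers somewhere safe and run your regular search for each one.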