Compared with other orchestration tools such as Dokku, DC/OS, Deis, Flynn, and Docker Swarm, Kubernetes is nowhere near them in terms of lines of code: on average those tools are around 100k-200k lines of code.
Intuitively, it feels strange that managing containers (checking health, scaling containers up and down, killing them, restarting them, etc.) should require 2.4M+ lines of code, which is the scale of an entire operating system code base. I feel like there is something more to it.
What is different in Kubernetes compared to other orchestration solutions that makes it so big?
I don't have any experience maintaining more than 5-6 servers. Please explain why it is so big and which functionality accounts for most of it.
First and foremost: don't be misled by the number of lines of code. Most of them are dependencies in the vendor folder, which do not count toward the core logic (utilities, client libraries, gRPC, etcd, etc.).
To put things into perspective, for Kubernetes:
$ cloc kubernetes --exclude-dir=vendor,_vendor,build,examples,docs,Godeps,translations
7072 text files.
6728 unique files.
1710 files ignored.
github.com/AlDanial/cloc v 1.70 T=38.72 s (138.7 files/s, 39904.3 lines/s)
--------------------------------------------------------------------------------
Language                      files          blank        comment           code
--------------------------------------------------------------------------------
Go                             4485         115492         139041        1043546
JSON                             94              5              0         118729
HTML                              7            509              1          29358
Bourne Shell                    322           5887          10884          27492
YAML                            244            374            508          10434
JavaScript                       17           1550           2271           9910
Markdown                         75           1468              0           5111
Protocol Buffers                 43           2715           8933           4346
CSS                               3              0              5           1402
make                             45            346            868            976
Python                           11            202            305            958
Bourne Again Shell               13            127            213            655
sed                               6              5             41            152
XML                               3              0              0             88
Groovy                            1              2              0             16
--------------------------------------------------------------------------------
SUM:                           5369         128682         163070        1253173
--------------------------------------------------------------------------------
For Docker (I counted the docker repository rather than Swarm or Swarm mode, because it includes features such as volumes, networking, and plugins that those repositories do not). Projects such as Machine, Compose, and libnetwork are not counted, so in reality the whole Docker platform might contain many more LoC:
$ cloc docker --exclude-dir=vendor,_vendor,build,docs
2165 text files.
2144 unique files.
255 files ignored.
github.com/AlDanial/cloc v 1.70 T=8.96 s (213.8 files/s, 30254.0 lines/s)
-----------------------------------------------------------------------------------
Language                         files          blank        comment           code
-----------------------------------------------------------------------------------
Go                                1618          33538          21691         178383
Markdown                           148           3167              0          11265
YAML                                 6            216            117           7851
Bourne Again Shell                  66            838            611           5702
Bourne Shell                        46            768            612           3795
JSON                                10             24              0           1347
PowerShell                           2             87            120            292
make                                 4             60             22            183
C                                    8             27             12            179
Windows Resource File                3             10              3             32
Windows Message File                 1              7              0             32
vim script                           2              9              5             18
Assembly                             1              0              0              7
-----------------------------------------------------------------------------------
SUM:                              1915          38751          23193         209086
-----------------------------------------------------------------------------------
Please note that these are very rough estimates made with cloc. This might be worth a deeper analysis.
Roughly, the project itself seems to account for about half of the LoC mentioned in the question (~1250K of the 2.4M+). Whether the dependencies should count as well is subjective.
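To make the "about half" estimate concrete, here is a small sketch of the arithmetic, using only the SUM and Go rows from the two cloc reports and the 2.4M figure quoted in the question. The constant names and the `pct` helper are made up for illustration.

```python
# Code-line counts copied from the "code" column of the cloc reports.
K8S_TOTAL = 1_253_173      # Kubernetes, SUM row
K8S_GO = 1_043_546         # Kubernetes, Go row
DOCKER_TOTAL = 209_086     # Docker, SUM row
DOCKER_GO = 178_383        # Docker, Go row
QUESTION_FIGURE = 2_400_000  # the "2.4M+" quoted in the question (a lower bound)


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    print("Kubernetes share of the 2.4M figure:", pct(K8S_TOTAL, QUESTION_FIGURE), "%")
    print("Go share of Kubernetes:", pct(K8S_GO, K8S_TOTAL), "%")
    print("Go share of Docker:", pct(DOCKER_GO, DOCKER_TOTAL), "%")
    print("Kubernetes / Docker ratio:", round(K8S_TOTAL / DOCKER_TOTAL, 1))
```

With these inputs the non-vendored Kubernetes tree comes out at roughly 52% of the figure from the question, and at about six times the size of the Docker repository as counted here.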
Most of the bloat comes from libraries that support various cloud providers, either to ease bootstrapping on their platforms or to support specific features (volumes, etc.) through plugins. The repository also contains a lot of examples that should be left out of the line count. A fair LoC estimate needs to exclude many documentation and example directories.
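If you want to see how much the excluded directories matter without cloc, the idea can be sketched in a few lines of Python. This is only a rough approximation, not a replacement for cloc: it counts every non-blank line in `.go` files (comments included), and the function name `count_go_lines` is made up for this example. The default directory names simply mirror the `--exclude-dir` list used for the Kubernetes run.

```python
import os

# Directory names mirroring the --exclude-dir list passed to cloc for Kubernetes.
EXCLUDED_DIRS = frozenset(
    {"vendor", "_vendor", "build", "examples", "docs", "Godeps", "translations"}
)


def count_go_lines(root, excluded=EXCLUDED_DIRS):
    """Count non-blank lines in .go files under root, skipping excluded dirs.

    Unlike cloc's "code" column, comment lines are counted too.
    """
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in excluded]
        for name in filenames:
            if not name.endswith(".go"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                total += sum(1 for line in fh if line.strip())
    return total
```

Calling `count_go_lines("kubernetes")` and then `count_go_lines("kubernetes", excluded=frozenset())` on a local checkout gives a feel for how much of the tree sits in vendored and example directories.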
It is also much more feature-rich than Docker Swarm, Nomad, or Dokku, to name a few. It supports advanced networking scenarios, has load balancing built in, and includes PetSets, Cluster Federation, volume plugins, and other features that other projects do not support yet.
It supports multiple container engines, so it is not limited to running Docker containers and can run other engines (such as rkt).
A lot of the core logic involves interaction with other components (key-value stores, client libraries, plugins, etc.), which extends far beyond simple scenarios.
Distributed systems are notoriously hard, and Kubernetes seems to support most of the tooling from key players in the container industry without compromise (where other solutions do make such compromises). As a result, the project can look artificially bloated and too big for its core mission (deploying containers at scale). In reality, these statistics are not that surprising.
Comparing Kubernetes to Docker or Dokku is not really appropriate. The scope of the project is far bigger, and it includes many more features because it is not limited to the Docker family of tooling.
While Docker has a lot of its features scattered across multiple libraries, Kubernetes tends to have everything under its core repository (which inflates the line count substantially but also explains the popularity of the project).
Considering this, the LoC statistic is not that surprising.