We updated the scalability benchmark with the 2 Million Container Challenge.

Intro

Cluster schedulers promise us ease of deployment with ultimate scalability. We designed an ambitious challenge to test these promises: schedule one million containers. We call this the Million Container Challenge (C1M). HashiCorp prides itself on creating technically excellent software, and the C1M is a test to showcase this. We tested Nomad against the C1M to ensure that we meet the needs of our users at any scale.

A cluster of five Nomad servers scheduled one million containers in less than five minutes, a rate of 3,750 containers per second. Details and observations of this benchmark are explained below. Thank you to Google for providing the credits and support necessary to run the infrastructure for the C1M on Google Cloud.

We ran both a C100K (100,000 containers) and a C1M with the following cluster configurations. For each cluster configuration, we ran five Nomad servers. The cluster size below is the number of Nomad clients and does not include the additional servers.

Our partners at Google generously provided the credits to run this amount of compute on Google Compute Engine. The strong and consistent performance of Google Cloud made the entire testing process efficient. We used Terraform to spin up thousands of resources in minutes.

A job is a declaration of work submitted to the scheduler. A task is an application to run, which in this test is a Docker container running a simple Go service. Nomad can also schedule other tasks such as VMs, binaries, etc.

We could have submitted one job with 1,000,000 tasks for the C1M. Instead, we broke the tasks down into many jobs in an even split, both to put more strain on the scheduler and to better represent real-world scenarios where many jobs would be running. To make the benchmark even more strenuous and realistic, we designed the jobs to have constraints on which nodes can run the tasks. This forces the scheduler to evaluate and check constraints in addition to pure binpacking.

For the full technical details of the C1M setup, including the Docker images used, the Nomad job specification, Terraform scripts, and more, please see the full technical README of the C1M. The linked repository can also be used to reproduce any of these results.
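To make the shape of these jobs concrete, here is a minimal, hypothetical Nomad job specification with a node constraint. The job name, node class, task count, Docker image, and resource figures are illustrative assumptions, not the values from the actual C1M job spec (which is in the README linked above).

```hcl
# Hypothetical example only -- the real C1M job spec is in the linked README.
job "c1m-example" {
  datacenters = ["dc1"]
  type        = "service"

  # Constrain which nodes may run these tasks, so the scheduler must
  # evaluate constraints in addition to pure binpacking.
  constraint {
    attribute = "${node.class}"
    value     = "class-1"
  }

  group "service" {
    # Each of the many small jobs carries a slice of the total task count.
    count = 100

    task "web" {
      driver = "docker"

      config {
        # Stand-in image; the benchmark ran a simple Go HTTP service.
        image = "hashicorp/http-echo"
        args  = ["-text", "ok"]
      }

      resources {
        cpu    = 20 # MHz
        memory = 15 # MB
      }
    }
  }
}
```

Splitting the million tasks across many small jobs like this, each carrying its own constraints, is what forces the scheduler to do constraint checking on every placement rather than pure binpacking.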
We begin by looking at the results for the C100K, since interesting observations can be drawn by comparing these results to the C1M results shown later.

- Scheduled (Orange) - The scheduler chose where the container should be running. In other words, a container has been binpacked and scheduled. At this stage, it is now ready for the client to retrieve it.
- Received (Grey) - The client has acknowledged that it received a container task and will begin starting it.
- Running (Blue) - The container completed launching and is now running. In other words, if you went to this machine and ran `docker ps`, you would see this container in the "running" state.

These results lead to interesting observations:

First, the performance of Nomad in scheduling is nearly linear. Nomad completes scheduling and placement of all the containers in 18.1 seconds, exceeding 5,500 placements per second. By 19.5 seconds (less than 2 seconds after placement is complete), all clients have acknowledged that they have received their placements. In less than 20 seconds, Nomad is now just waiting for all the containers to start. At 58.2 seconds Nomad has started 99% of the containers. In our investigation we found that the clients were simply saturated as they started hundreds of containers within a few seconds.

The graph axes, line definitions, and colors are the same as in the prior section. Even at 1,000,000 containers, Nomad provides near-linear performance. Nomad completes all scheduling in 266.7 seconds (less than five minutes). This is a rate of nearly 3,750 placements per second. The time it takes for clients to receive tasks is nearly realtime despite the vast scale. By the time a million containers have been scheduled, 99.4% of them have already been acknowledged by the clients. 99% of all containers are running in 370.5 seconds. As before, the time between client acknowledgement of placement and container start time is Nomad waiting for Docker to start the container. Still, scheduling and starting 1 million containers in just over six minutes is an impressive feat.

There is one curiosity with the C1M: the graph shows that we scheduled and ran more than 1M containers. Nomad actually scheduled and ran nearly 1.003M containers. There are two reasons for this. First, we found a bug in the Docker engine that appears to be a race condition triggered by starting so many containers in a very short period of time. We have filed an issue, and there is a potential fix in master. Because Nomad is designed to self-heal and recover from these types of failures, Nomad rescheduled and restarted the failed jobs elsewhere.
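For readers who want to reproduce a smaller-scale version of this benchmark, the repository's Terraform configuration is the authoritative starting point. Purely as an illustration, a minimal sketch of provisioning a pool of Nomad client instances on Google Compute Engine could look like the following; the project ID, machine type, image, instance count, and startup script are all assumptions, not the C1M configuration.

```hcl
# Illustrative sketch only -- see the C1M repository for the real Terraform.
provider "google" {
  project = "my-c1m-project" # assumed project ID
  region  = "us-central1"
}

resource "google_compute_instance" "nomad_client" {
  count        = 100                            # scaled well below the benchmark
  name         = "nomad-client-${count.index}"
  machine_type = "n1-standard-8"                # assumed machine type
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-1604-lts"
    }
  }

  network_interface {
    network = "default"
    access_config {}
  }

  # Assumed helper script that installs Docker and the Nomad client agent.
  metadata_startup_script = file("${path.module}/install-nomad-client.sh")
}
```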