Mesos 1.2.1 & 1.3.1 slave: Docker job exits normally but is reported as a failure
My Mesos installation was recently upgraded from 0.28 to 1.2.1.

Jobs are scheduled using Chronos. My Docker job is invoked properly, but it is still reported as TASK_FAILED even though it completes with exit status zero.

This is always reproducible. I have tried multiple Mesos versions with no luck. I am wondering if I am missing anything.

OS details:

Kernel   - 3.8.13-98.7.1.el7uek
OS       - OL 7.3
Snapshot - 7-2017.6.4
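
For reference, details like these can be collected with standard commands (a small sketch; the release-file path is an assumption for Oracle Linux):

uname -r                  # kernel release
cat /etc/oracle-release   # OS release (path assumed for OL)
docker version            # Docker client/server versions in use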

Since this is not the latest kernel and I am not able to install the latest Docker version, I have added the following to my Chronos job's environment variables:

{
    "name": "DOCKER_API_VERSION",
    "value": "1.22"
}
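
Since the Docker CLI honors the DOCKER_API_VERSION environment variable, the same pin can be sanity-checked outside of Mesos on the agent host (a quick sketch, nothing Chronos-specific):

DOCKER_API_VERSION=1.22 docker version   # should succeed if the daemon accepts API 1.22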

Launching 'mesos-docker-executor' with flags:

'--container="mesos-81cb9c2a-d18b-4127-872b-2a5676dfb314-S0.97dc2c67-5d69-4a8c-b4e1-ba15807697cf" 
--docker="docker" 
--docker_socket="/var/run/docker.sock" 
--help="false" 
--initialize_driver_logging="true" 
--launcher_dir="/usr/libexec/mesos" --logbufsecs="0" 
--logging_level="INFO" 
--mapped_directory="/mnt/mesos/sandbox" 
--quiet="false" 
--sandbox_directory="/mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/docker/links/97dc2c67-5d69-4a8c-b4e1-ba15807697cf" 
--stop_timeout="0ns"'

Mesos slave logs:

I0906 14:05:00.958442     9 slave.cpp:1625] Got assigned task 'ct:1504706700007:0:Job_Task_Test:' for framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:00.958544     9 slave.cpp:6386] Checkpointing FrameworkInfo to '/mesos-data/slave-1/meta/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/framework.info'
I0906 14:05:00.958868     9 slave.cpp:6397] Checkpointing framework pid '[email protected]:43144' to '/mesos-data/slave-1/meta/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/framework.pid'
I0906 14:05:00.959430     9 slave.cpp:1785] Launching task 'ct:1504706700007:0:Job_Task_Test:' for framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:00.966035     9 paths.cpp:547] Trying to chown '/mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf' to user 'root'
I0906 14:05:00.966223     9 slave.cpp:6903] Checkpointing ExecutorInfo to '/mesos-data/slave-1/meta/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/executor.info'
I0906 14:05:00.966648     9 slave.cpp:6479] Launching executor 'ct:1504706700007:0:Job_Task_Test:' of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 with resources cpus(*)(allocated: *):0.1; mem(*)(allocated: *):32 in work directory '/mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf'
I0906 14:05:00.967015     9 slave.cpp:6926] Checkpointing TaskInfo to '/mesos-data/slave-1/meta/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf/tasks/ct:1504706700007:0:Job_Task_Test:/task.info'
I0906 14:05:00.967300     9 slave.cpp:2118] Queued task 'ct:1504706700007:0:Job_Task_Test:' for executor 'ct:1504706700007:0:Job_Task_Test:' of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:00.967337     9 slave.cpp:884] Successfully attached file '/mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf'
I0906 14:05:00.967501    12 docker.cpp:1165] Starting container '97dc2c67-5d69-4a8c-b4e1-ba15807697cf' for task 'ct:1504706700007:0:Job_Task_Test:' (and executor 'ct:1504706700007:0:Job_Task_Test:') of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:00.968152     9 fetcher.cpp:353] Starting to fetch URIs for container: 97dc2c67-5d69-4a8c-b4e1-ba15807697cf, directory: /mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/docker/links/97dc2c67-5d69-4a8c-b4e1-ba15807697cf
I0906 14:05:00.969274    16 docker.cpp:1348] Running docker -H unix:///var/run/docker.sock pull docker.xyz.com/test/job-action:latest
I0906 14:05:01.165390     9 docker.cpp:1274] Running docker -H unix:///var/run/docker.sock inspect docker.xyz.com/test/job-action:latest
I0906 14:05:01.267083    13 docker.cpp:465] Docker pull docker.xyz.com/test/job-action:latest completed
I0906 14:05:01.267663    13 docker.cpp:1514] Launching 'mesos-docker-executor' with flags '--container="mesos-81cb9c2a-d18b-4127-872b-2a5676dfb314-S0.97dc2c67-5d69-4a8c-b4e1-ba15807697cf" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --launcher_dir="/usr/libexec/mesos" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/docker/links/97dc2c67-5d69-4a8c-b4e1-ba15807697cf" --stop_timeout="0ns"'
I0906 14:05:01.269182    13 docker.cpp:803] Checkpointing pid 147 to '/mesos-data/slave-1/meta/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf/pids/forked.pid'
I0906 14:05:01.308346    15 slave.cpp:3385] Got registration for executor 'ct:1504706700007:0:Job_Task_Test:' of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 from executor(1)@20.426.45.305:48379
I0906 14:05:01.308658    15 slave.cpp:3471] Checkpointing executor pid 'executor(1)@20.426.45.305:48379' to '/mesos-data/slave-1/meta/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf/pids/libprocess.pid'
I0906 14:05:01.309453    15 docker.cpp:1608] Ignoring updating container 97dc2c67-5d69-4a8c-b4e1-ba15807697cf because resources passed to update are identical to existing resources
I0906 14:05:01.309587    15 slave.cpp:2331] Sending queued task 'ct:1504706700007:0:Job_Task_Test:' to executor 'ct:1504706700007:0:Job_Task_Test:' of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 at executor(1)@20.426.45.305:48379
I0906 14:05:02.914752    11 slave.cpp:3816] Handling status update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 from executor(1)@20.426.45.305:48379
I0906 14:05:02.915037    11 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:02.915063    11 status_update_manager.cpp:500] Creating StatusUpdate stream for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:02.915396    11 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:02.915513    11 status_update_manager.cpp:377] Forwarding update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 to the agent
I0906 14:05:02.915587    11 slave.cpp:4256] Forwarding the update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 to [email protected]:5050
I0906 14:05:02.915688    11 slave.cpp:4150] Status update manager successfully handled status update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:02.915700    11 slave.cpp:4166] Sending acknowledgement for status update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 to executor(1)@20.426.45.305:48379
I0906 14:05:03.628334    10 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:03.628398    10 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:03.628463    10 slave.cpp:3105] Status update manager successfully handled status update acknowledgement (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:04.256636    14 slave.cpp:5731] Querying resource estimator for oversubscribable resources
I0906 14:05:04.256691    14 slave.cpp:5745] Received oversubscribable resources {} from the resource estimator
I0906 14:05:04.944047    10 slave.cpp:4346] Received ping from slave-observer(1)@20.426.45.305:5050
I0906 14:05:11.049429    12 slave.cpp:3816] Handling status update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 from executor(1)@20.426.45.305:48379
I0906 14:05:11.049636    12 docker.cpp:987] Running docker -H unix:///var/run/docker.sock inspect mesos-81cb9c2a-d18b-4127-872b-2a5676dfb314-S0.97dc2c67-5d69-4a8c-b4e1-ba15807697cf
I0906 14:05:11.136464    13 status_update_manager.cpp:323] Received status update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.136502    13 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.136572    13 status_update_manager.cpp:377] Forwarding update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 to the agent
I0906 14:05:11.136642    12 slave.cpp:4256] Forwarding the update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 to [email protected]:5050
I0906 14:05:11.136739    12 slave.cpp:4150] Status update manager successfully handled status update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.136752    12 slave.cpp:4166] Sending acknowledgement for status update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 to executor(1)@20.426.45.305:48379
I0906 14:05:11.152230    13 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.152271    13 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.152307    13 status_update_manager.cpp:531] Cleaning up status update stream for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.152454    13 slave.cpp:3105] Status update manager successfully handled status update acknowledgement (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.152475    13 slave.cpp:6882] Completing task ct:1504706700007:0:Job_Task_Test:
I0906 14:05:12.059929    13 slave.cpp:4388] Got exited event for executor(1)@20.426.45.305:48379
I0906 14:05:12.140215    14 docker.cpp:2397] Executor for container 97dc2c67-5d69-4a8c-b4e1-ba15807697cf has exited
I0906 14:05:12.140244    14 docker.cpp:2091] Destroying container 97dc2c67-5d69-4a8c-b4e1-ba15807697cf
I0906 14:05:12.140283    14 docker.cpp:2218] Running docker stop on container 97dc2c67-5d69-4a8c-b4e1-ba15807697cf
I0906 14:05:12.140717    11 slave.cpp:4768] Executor 'ct:1504706700007:0:Job_Task_Test:' of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 exited with status 0
I0906 14:05:12.140748    11 slave.cpp:4868] Cleaning up executor 'ct:1504706700007:0:Job_Task_Test:' of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 at executor(1)@20.426.45.305:48379
I0906 14:05:12.141062    11 slave.cpp:4956] Cleaning up framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:12.141129    11 gc.cpp:55] Scheduling '/mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf' for gc 6.99999836858667days in the future

Executor stderr:

I0906 13:25:02.165338  3399 exec.cpp:162] Version: 1.2.1
I0906 13:25:02.169983  3405 exec.cpp:237] Executor registered on agent f20ab78e-acd3-407a-b1b6-47d67a947eff-S1

Executor stderr with Mesos version 1.3.1:

I0906 16:15:02.975457   580 exec.cpp:162] Version: 1.3.1
I0906 16:15:02.979501   586 exec.cpp:237] Executor registered on agent 920fa575-b534-48a8-8305-61f6097f2f49-S1
E0906 16:15:13.523074   589 process.cpp:956] Failed to accept socket: future discarded

I am scheduling the Docker job through Chronos. The job runs fine and I can see the following in the Mesos stdout logs (partial):

Executor stdout:

Registered docker executor on <hostname>
Starting task ct:1504704300002:0:Job_Task_Test:
.....
.....
.....
: success: true
: Stopping HTTP executor
: Started Application in 6.734 seconds (JVM running for 7.642)
s.c.a.AnnotationConfigApplicationContext : Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@bebdb06: startup date [Wed Sep 06 17:25:07 UTC 2017]; root of context hierarchy
2017-09-06 17:25:14.050 INFO 1 — [ Thread-1] o.s.c.support.DefaultLifecycleProcessor : Stopping beans in phase 0
2017-09-06 17:25:14.051 INFO 1 — [ Thread-1] o.s.j.e.a.AnnotationMBeanExporter : Unregistering JMX-exposed beans on shutdown
2017-09-06 17:25:14.051 DEBUG 1 — [ Thread-1] h.i.c.PoolingHttpClientConnectionManager : Connection manager is shutting down
2017-09-06 17:25:14.051 DEBUG 1 — [ Thread-1] h.i.c.DefaultManagedHttpClientConnection : http-outgoing-0: Close connection
2017-09-06 17:25:14.052 DEBUG 1 — [ Thread-1] h.i.c.PoolingHttpClientConnectionManager : Connection manager shut down

The Docker job is a Java process that invokes a remote service and finally exits by calling System.exit(0). We print the following in the Docker job:

Expected status: 200, Actual status: 200, success: true

We see no issues with the task execution.

The container status shows as "Exited (0) 2 minutes ago" when executing the following command:

docker ps -a

CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS                     PORTS               NAMES
789ec59e32442       docker-test:latest   "/bin/sh -c 'java -ja"   2 minutes ago       Exited (0) 2 minutes ago                       mesos-05fb536b-asdff-3d634c4ed860-S1.be003d0e-7701-4020-84be-234643565244
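
The recorded exit code can also be read directly from the container state (a small check, using the container ID shown above):

docker inspect -f '{{.State.ExitCode}}' 789ec59e32442   # prints 0 for this container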

Thanks,

Olden answered 6/9, 2017 at 4:45

Comments (5):
Leroylerwick: Can you share the executor logs as well? If it is easy to reproduce, it would be even better if you could share verbose logs, i.e., with GLOG_v=1.
Leroylerwick: I suspect your task spec is not accepted. I am not very familiar with Chronos, but maybe you can find more information about the failed update there; in particular, the TaskStatus.message field is interesting. You can also paste your task spec here for me to have a look.
Olden: The message is "Container exited with status 125". I got it from the Chronos status message. My task did run, and I am able to see the logs in Chronos.
Gaygaya: According to r/59817/, "Failed to accept socket: future discarded" might not be a problem in this case. "Container exited with status 125" means there were some problems with docker run (see). A similar problem was reported in MESOS-6132, and it turned out the problem was in the Mesos configuration. Can you run docker manually? I suspect there is a problem with access rights to the Mesos metadata directory; in other words, Mesos has insufficient permissions.
Olden: I was able to run docker manually and it ran fine, with exit status zero. The docker command ran fine, but the task is still reported with status 125. It could be a problem with some access rights. I will debug this more.
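
A minimal manual check along the lines suggested above (a sketch; the image name and paths are taken from the logs earlier in this question):

docker run --rm docker.xyz.com/test/job-action:latest
echo "manual run exit code: $?"    # expected: 0, matching 'docker ps -a' above

ls -ld /mesos-data/slave-1/meta    # metadata directory the agent checkpoints to
ls -ld /var/run/docker.sock        # socket passed via --docker_socket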
