I know, too many answers have been published already, however the truth is - startForegroundService can not be fixed at an app level and you should stop using it. That Google recommendation to use Service#startForeground() API within 5 seconds after Context#startForegroundService() was called is not something that an app can always do.
Android runs a lot of processes simultaneously and there is no any guarantee that Looper will call your target service that is supposed to call startForeground() within 5 seconds. If your target service didn't receive the call within 5 seconds, you're out of luck and your users will experience ANR situation. In your stack trace you'll see something like this:
Context.startForegroundService() did not then call Service.startForeground(): ServiceRecord{1946947 u0 ...MessageService}
main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x763e01d8 self=0x7d77814c00
| sysTid=11171 nice=-10 cgrp=default sched=0/0 handle=0x7dfe411560
| state=S schedstat=( 1337466614 103021380 2047 ) utm=106 stm=27 core=0 HZ=100
| stack=0x7fd522f000-0x7fd5231000 stackSize=8MB
| held mutexes=
#00 pc 00000000000712e0 /system/lib64/libc.so (__epoll_pwait+8)
#01 pc 00000000000141c0 /system/lib64/libutils.so (android::Looper::pollInner(int)+144)
#02 pc 000000000001408c /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+60)
#03 pc 000000000012c0d4 /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44)
at android.os.MessageQueue.nativePollOnce (MessageQueue.java)
at android.os.MessageQueue.next (MessageQueue.java:326)
at android.os.Looper.loop (Looper.java:181)
at android.app.ActivityThread.main (ActivityThread.java:6981)
at java.lang.reflect.Method.invoke (Method.java)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run (RuntimeInit.java:493)
at com.android.internal.os.ZygoteInit.main (ZygoteInit.java:1445)
As I understand, Looper has analyzed the queue here, found an "abuser" and simply killed it. The system is happy and healthy now, while developers and users are not, but since Google limits their responsibilities to the system, why should they care about the latter two? Apparently they don't. Could they make it better? Of course, e.g. they could've served "Application is busy" dialog, asking a user to make a decision about waiting or killing the app, but why bother, it's not their responsibility. The main thing is that the system is healthy now.
From my observations, this happens relatively rarely, in my case approximately 1 crash in a month for 1K users. Reproducing it is impossible, and even if it's reproduced, there is nothing you can do to fix it permanently.
There was a good suggestion in this thread to use "bind" instead of "start" and then when service is ready, process onServiceConnected, but again, it means not using startForegroundService calls at all.
I think, the right and honest action from Google side would be to tell everyone that startForegourndServcie has a deficiency and should not be used.
The question still remains: what to use instead? Fortunately for us, there are JobScheduler and JobService now, which are a better alternative for foreground services. It's a better option, because of that:
While a job is running, the system holds a wakelock on behalf of your
app. For this reason, you do not need to take any action to guarantee
that the device stays awake for the duration of the job.
It means that you don't need to care about handling wakelocks anymore and that's why it's not different from foreground services. From implementation point of view JobScheduler is not your service, it's a system's one, presumably it will handle the queue right, and Google will never terminate its own child :)
Samsung has switched from startForegroundService to JobScheduler and JobService in their Samsung Accessory Protocol (SAP). It's very helpful when devices like smartwatches need to talk to hosts like phones, where the job does need to interact with a user through an app's main thread. Since the jobs are posted by the scheduler to the main thread, it becomes possible. You should remember though that the job is running on the main thread and offload all heavy stuff to other threads and async tasks.
This service executes each incoming job on a Handler running on your
application's main thread. This means that you must offload your
execution logic to another thread/handler/AsyncTask of your choosing
The only pitfall of switching to JobScheduler/JobService is that you'll need to refactor old code, and it's not fun. I've spent last two days doing just that to use the new Samsung's SAP implementation. I'll watch my crash reports and let you know if see the crashes again. Theoretically it should not happen, but there are always details that we might not be aware of.
UPDATE
No more crashes reported by Play Store. It means that JobScheduler/JobService do not have such a problem and switching to this model is the right approach to get rid of startForegroundService issue once and forever. I hope, Google/Android reads it and will eventually comment/advise/provide an official guidance for everyone.
UPDATE 2
For those who use SAP and asking how SAP V2 utilizes JobService explanation is below.
In your custom code you'll need to initialize SAP (it's Kotlin) :
SAAgentV2.requestAgent(App.app?.applicationContext,
MessageJobs::class.java!!.getName(), mAgentCallback)
Now you need to decompile Samsung's code to see what's going on inside. In SAAgentV2 take a look at the requestAgent implementation and the following line:
SAAgentV2.d var3 = new SAAgentV2.d(var0, var1, var2);
where d defined as below
private SAAdapter d;
Go to SAAdapter class now and find onServiceConnectionRequested function that schedules a job using the following call:
SAJobService.scheduleSCJob(SAAdapter.this.d, var11, var14, var3, var12);
SAJobService is just an implementation of Android'd JobService and this is the one that does a job scheduling:
private static void a(Context var0, String var1, String var2, long var3, String var5, SAPeerAgent var6) {
ComponentName var7 = new ComponentName(var0, SAJobService.class);
Builder var10;
(var10 = new Builder(a++, var7)).setOverrideDeadline(3000L);
PersistableBundle var8;
(var8 = new PersistableBundle()).putString("action", var1);
var8.putString("agentImplclass", var2);
var8.putLong("transactionId", var3);
var8.putString("agentId", var5);
if (var6 == null) {
var8.putStringArray("peerAgent", (String[])null);
} else {
List var9;
String[] var11 = new String[(var9 = var6.d()).size()];
var11 = (String[])var9.toArray(var11);
var8.putStringArray("peerAgent", var11);
}
var10.setExtras(var8);
((JobScheduler)var0.getSystemService("jobscheduler")).schedule(var10.build());
}
As you see, the last line here uses Android'd JobScheduler to get this system service and to schedule a job.
In the requestAgent call we've passed mAgentCallback, which is a callback function that will receive control when an important event happens. This is how the callback is defined in my app:
private val mAgentCallback = object : SAAgentV2.RequestAgentCallback {
override fun onAgentAvailable(agent: SAAgentV2) {
mMessageService = agent as? MessageJobs
App.d(Accounts.TAG, "Agent " + agent)
}
override fun onError(errorCode: Int, message: String) {
App.d(Accounts.TAG, "Agent initialization error: $errorCode. ErrorMsg: $message")
}
}
MessageJobs here is a class that I've implemented to process all requests coming from a Samsung smartwatch. It's not the full code, only a skeleton:
class MessageJobs (context:Context) : SAAgentV2(SERVICETAG, context, MessageSocket::class.java) {
public fun release () {
}
override fun onServiceConnectionResponse(p0: SAPeerAgent?, p1: SASocket?, p2: Int) {
super.onServiceConnectionResponse(p0, p1, p2)
App.d(TAG, "conn resp " + p1?.javaClass?.name + p2)
}
override fun onAuthenticationResponse(p0: SAPeerAgent?, p1: SAAuthenticationToken?, p2: Int) {
super.onAuthenticationResponse(p0, p1, p2)
App.d(TAG, "Auth " + p1.toString())
}
override protected fun onServiceConnectionRequested(agent: SAPeerAgent) {
}
}
override fun onFindPeerAgentsResponse(peerAgents: Array<SAPeerAgent>?, result: Int) {
}
override fun onError(peerAgent: SAPeerAgent?, errorMessage: String?, errorCode: Int) {
super.onError(peerAgent, errorMessage, errorCode)
}
override fun onPeerAgentsUpdated(peerAgents: Array<SAPeerAgent>?, result: Int) {
}
}
As you see, MessageJobs requires MessageSocket class as well that you would need to implement and that processes all messages coming from your device.
Bottom line, it's not that simple and it requires some digging to internals and coding, but it works, and most importantly - it doesn't crash.