I'm currently working on an embedded project using an ARM Cortex-M3 microcontroller with FreeRTOS as the operating system. The code was written by a former colleague, and sadly the project has some weird bugs which I have to find and fix as soon as possible.
Short description: The device is integrated into vehicles and sends some "special" data to a remote server using an integrated modem.
The main problem: Since the device is integrated into a vehicle, its power supply can be lost at any time. Therefore the device stores some parts of the "special" data in two reserved flash pages. This code module is laid out as an EEPROM emulation on two flash pages (for wear leveling and for transferring data from one flash page to the other). The EEPROM emulation works with so-called "virtual addresses": you can write data blocks of any size to the currently active/valid flash page and read them back using those virtual addresses. The former colleague implemented the EEPROM emulation as a multitasking module, so you can read/write the flash pages from every task in the application. At first sight everything seems fine.
But my project manager told me that the device always loses some of the "special" data at moments when the supply voltage in the vehicle drops to a few volts and the device tries to save the data to flash.
Normally the power supply is about 10-18 volts, but if it drops below 7 volts, the device receives an interrupt called powerwarn, which triggers a task called the powerfail task. The powerfail task has the highest priority of all tasks and executes some callbacks in which, e.g., the modem is turned off and the "special" data is stored in the flash page.
I tried to understand the code, debugged for days/weeks, and now I'm quite sure that I found the problem:
Within the callbacks that the powerfail task executes (called powerfail callbacks), there are RTOS calls that suspend other tasks. Unfortunately, one of those suspended tasks could be in the middle of an unfinished EEPROM_WriteBlock() call at the moment the powerwarn interrupt is received. The powerfail task then executes the callbacks, and one of them calls EE_WriteBlock(), where the task can't take the mutex because another task (now suspended) already holds it --> Deadlock!
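To make the failure sequence concrete, here is a minimal host-side model in plain C. It is not FreeRTOS code and it does not run on the target: ModelMutex, model_try_take(), model_give() and powerfail_would_deadlock() are hypothetical names made up for illustration, and the task ids are arbitrary. The model only mirrors the holder/count bookkeeping of a recursive mutex and reports whether the powerfail task would be left waiting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task ids for the model; not real FreeRTOS handles. */
enum { NO_TASK = -1, WRITER_TASK = 1, POWERFAIL_TASK = 2 };

/* Simplified stand-in for a recursive mutex: an owner plus a nesting count. */
typedef struct {
    int holder;      /* id of the owning task, or NO_TASK */
    unsigned count;  /* how many times the owner has taken it */
} ModelMutex;

/* Non-blocking take. Returns true if `task` owns the mutex afterwards. */
bool model_try_take(ModelMutex *m, int task)
{
    if (m->holder == task) {        /* recursive take by the owner */
        m->count++;
        return true;
    }
    if (m->holder == NO_TASK) {     /* mutex is free */
        m->holder = task;
        m->count = 1;
        return true;
    }
    return false;                   /* owned by another task: a real take would block */
}

/* Give by the owner; the mutex becomes free when the count reaches zero. */
void model_give(ModelMutex *m, int task)
{
    assert(m->holder == task && m->count > 0);
    if (--m->count == 0) {
        m->holder = NO_TASK;
    }
}

/* Replays the sequence described above. Returns true if the powerfail
 * task would wait forever for the mutex. */
bool powerfail_would_deadlock(bool writer_mid_write)
{
    ModelMutex m = { NO_TASK, 0 };

    if (writer_mid_write) {
        /* A normal task is inside its write and holds the mutex. */
        model_try_take(&m, WRITER_TASK);
    }

    /* powerwarn: the writer is suspended here, so it never reaches its give. */

    /* The powerfail callback now tries to write as well. */
    if (model_try_take(&m, POWERFAIL_TASK)) {
        model_give(&m, POWERFAIL_TASK);
        return false;
    }
    /* With portMAX_DELAY and a suspended holder, nothing ever wakes us up. */
    return true;
}
```

Calling powerfail_would_deadlock(true) reports the deadlock, while powerfail_would_deadlock(false) does not, which matches the observation that data is only lost when the power drop coincides with a write in progress.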
This is the routine to write data to flash:
uint16_t
EE_WriteBlock (EE_TypeDef *EE, uint16_t VirtAddress, const void *Data, uint16_t Size)
{
    .
    .
    xSemaphoreTakeRecursive(EE->rw_mutex, portMAX_DELAY);
    /* Write the variable virtual address and value in the EEPROM */
    .
    .
    .
    xSemaphoreGiveRecursive(EE->rw_mutex);
    return Status;
}
This is the RTOS-specific code that runs when xSemaphoreTakeRecursive() is called:
portBASE_TYPE xQueueTakeMutexRecursive( xQueueHandle pxMutex, portTickType xBlockTime )
{
    portBASE_TYPE xReturn;

    /* Comments regarding mutual exclusion as per those within
    xQueueGiveMutexRecursive(). */

    traceTAKE_MUTEX_RECURSIVE( pxMutex );

    if( pxMutex->pxMutexHolder == xTaskGetCurrentTaskHandle() )
    {
        ( pxMutex->uxRecursiveCallCount )++;
        xReturn = pdPASS;
    }
    else
    {
        xReturn = xQueueGenericReceive( pxMutex, NULL, xBlockTime, pdFALSE );

        /* pdPASS will only be returned if we successfully obtained the mutex,
        we may have blocked to reach here. */
        if( xReturn == pdPASS )
        {
            ( pxMutex->uxRecursiveCallCount )++;
        }
        else
        {
            traceTAKE_MUTEX_RECURSIVE_FAILED( pxMutex );
        }
    }

    return xReturn;
}
My project manager is happy that I've found the bug, but he is also pushing me to create a fix as quickly as possible, whereas what I really want is to rewrite the code.
Some of you might think: just avoid suspending the other tasks and you are done. But that is not a possible solution, since it could trigger another bug.
Does anybody have a quick solution/idea for how I could fix this deadlock problem?
Maybe I could use xTaskGetCurrentTaskHandle() in EE_WriteBlock() to determine who owns the mutex and then give it if that task is not running anymore.
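For comparison, here is a sketch of a different ordering: the powerfail task takes the mutex first and only suspends the other tasks afterwards. It rests on an untested assumption, namely that the powerfail callbacks can afford to wait for an in-flight write to finish. As far as I can tell from the FreeRTOS source, the recursive give only succeeds for the task that holds the mutex, so giving it on behalf of a suspended task may not work as I hoped. The code is a plain C model for a host PC, not FreeRTOS code; ModelMutex, model_try_take(), model_give(), model_write_block() and powerfail_take_then_suspend() are hypothetical names. The comments say which real calls (xSemaphoreTakeRecursive(), vTaskSuspend(), xSemaphoreGiveRecursive()) each step stands for.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task ids for the model; not real FreeRTOS handles. */
enum { NO_TASK = -1, WRITER_TASK = 1, POWERFAIL_TASK = 2 };

/* Simplified stand-in for a recursive mutex: an owner plus a nesting count. */
typedef struct {
    int holder;
    unsigned count;
} ModelMutex;

/* Non-blocking take. Returns true if `task` owns the mutex afterwards. */
bool model_try_take(ModelMutex *m, int task)
{
    if (m->holder == task) { m->count++; return true; }
    if (m->holder == NO_TASK) { m->holder = task; m->count = 1; return true; }
    return false;
}

/* Give by the owner; the mutex becomes free when the count reaches zero. */
void model_give(ModelMutex *m, int task)
{
    assert(m->holder == task && m->count > 0);
    if (--m->count == 0) {
        m->holder = NO_TASK;
    }
}

/* Stand-in for EE_WriteBlock(): take, "write", give.
 * Returns false where the real function would block. */
bool model_write_block(ModelMutex *m, int task)
{
    if (!model_try_take(m, task)) {
        return false;
    }
    /* ... the flash write would happen here ... */
    model_give(m, task);
    return true;
}

/* Powerfail sequence with the mutex taken BEFORE the other tasks are
 * suspended. Returns true if the final save went through. */
bool powerfail_take_then_suspend(bool writer_mid_write)
{
    ModelMutex m = { NO_TASK, 0 };
    bool writer_suspended = false;

    if (writer_mid_write) {
        model_try_take(&m, WRITER_TASK);   /* writer is inside its write */
    }

    /* Step 1: stands for xSemaphoreTakeRecursive() in the powerfail task.
     * The writer is not suspended yet, so while we wait it keeps running,
     * finishes its write and releases the mutex. */
    while (!model_try_take(&m, POWERFAIL_TASK)) {
        assert(!writer_suspended);
        model_give(&m, WRITER_TASK);
    }

    /* Step 2: only now suspend the other tasks (stands for vTaskSuspend()).
     * No suspended task can be holding the mutex at this point. */
    writer_suspended = true;

    /* Step 3: the take inside the write is a recursive take by the owner. */
    bool saved = model_write_block(&m, POWERFAIL_TASK);

    /* Step 4: release the outer take (stands for xSemaphoreGiveRecursive()). */
    model_give(&m, POWERFAIL_TASK);

    return saved && writer_suspended && m.holder == NO_TASK;
}
```

In this model the save succeeds whether or not a write was in progress when powerwarn arrived. The cost is that the powerfail task has to wait for one flash write to complete before it can start shutting things down, which may or may not fit into the time left before the supply collapses.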
Thx