Regarding the SD Card (SDC):
The SDC/eMMC communication protocol, the embedded Flash Translation Layer (FTL), and the underlying FLASH memory...
My understanding is that the main cause of SDC filesystem corruption is power failure during critical FTL operations on the FLASH memory, e.g. writing block data, erasing blocks, or running the wear-levelling algorithm.
There is excellent documentation available at the SD Association website:
https://www.sdcard.org/downloads/pls/
One document that drew my attention:
Part1_Physical_Layer_Simplified_Specification_Ver6.00.pdf
On page 96
4.6.2 Read, Write and Erase Timeout Conditions
A card shall complete the command within the time period defined as follows or give up and return an error message.
If the host does not get any response with the given timeout it should assume that the card is not going to respond and try to recover (e.g. reset the card, power cycle, reject, etc.).
Read Timeout (max) = 100ms
Write Timeout (max) = 250ms
Erase Timeout (max) = 250ms
This therefore sets an upper bound of 250ms on the time the FTL takes to complete a potentially critical set of tasks.
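As a rough sketch of how a hold-up window could be budgeted from those figures (Python used purely for the arithmetic): the three timeouts are the values quoted above, while the function names and the margin factor of 2 are my own assumptions, not anything taken from the specification.

```python
# Maximum timeouts quoted above from section 4.6.2, in milliseconds.
TIMEOUTS_MS = {"read": 100, "write": 250, "erase": 250}


def worst_case_ms(timeouts=TIMEOUTS_MS):
    """Longest time the card may take before completing or giving up."""
    return max(timeouts.values())


def hold_up_ms(margin=2.0, timeouts=TIMEOUTS_MS):
    """Hold-up window: worst-case timeout times a safety margin.

    The margin is a hypothetical design choice, not a value from the spec.
    """
    return worst_case_ms(timeouts) * margin


if __name__ == "__main__":
    print(worst_case_ms())  # 250
    print(hold_up_ms())     # 500.0
```

With a margin of 2 this gives a 500ms window. Whether that margin is sufficient for a given card is something I have not verified.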
>> My proposed idea for handling a system power failure event is:
Rather than maintaining power for the entire board: Raspberry Pi BCM chip, peripherals, memory etc, which requires a large battery or super-capacitor...
[1a] Simply keep a separate 3.3V power supply alive that only drives the SDC, and thus maintain only a modest voltage and current to keep the SDC comms and FLASH activity alive for a short amount of time.
i.e. Tens of milliamps for a few hundred milliseconds.
[1b] The intent is to 'wait out' the comms dropout for long enough that the SDC will complete its current operation and then enter the IDLE state.
[1c] Then gracefully ramp down the SDC 3.3V power supply.
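To put a rough number on steps [1a] to [1c], here is a minimal sizing sketch for a hold-up capacitor sitting directly on the SDC 3.3V rail. It assumes a constant current draw, so C = I * t / dV. The 50mA load, the 0.5s hold time and the 2.7V minimum (the lower end of the usual SD operating voltage range) are illustrative assumptions to be checked against the actual card's datasheet, and the function name is hypothetical.

```python
def holdup_capacitance_f(current_a, hold_time_s, v_start, v_min):
    """Capacitance (farads) needed to supply a constant current for
    hold_time_s while the rail droops from v_start to v_min.

    Simplified model: constant-current discharge, no regulator, no ESR.
    """
    if v_start <= v_min:
        raise ValueError("v_start must be greater than v_min")
    return current_a * hold_time_s / (v_start - v_min)


if __name__ == "__main__":
    # Assumed figures: 50 mA for 0.5 s, rail allowed to fall 3.3 V -> 2.7 V.
    c = holdup_capacitance_f(0.050, 0.5, 3.3, 2.7)
    print(f"{c * 1000:.1f} mF")  # 41.7 mF
```

Under these assumptions the result is in the tens of millifarads, i.e. small super-capacitor territory. A card that draws more current during writes, or a tighter allowable voltage droop, pushes the figure up proportionally.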
My assumption here is that only the SDC communication pipeline is broken, i.e. CLK, CMD, DAT[3:0]; the SD Card device itself should be kept active for ~500ms.
This should result in a few possible scenarios:
[2a] Comms is broken part way through BCM sending a command to the SDC:
SDC should reject the malformed command and go into the Stand-by state.
[2b] Comms is broken immediately after BCM has sent a command + data payload to the SDC:
SDC would commence FLASH interaction, then attempt to return its response.
[2c] Comms is broken after BCM sends command, but only partially sends the data payload:
Would the SDC timeout and go into the Stand-by state, or hang?
The question remains:
When the BCM shuts down and several files are still 'open' in the SDC, or a file read/write operation was 'ragged' (non-atomic), would the EXT4 journaling file system be able to recover from this fault?
Another concern is:
The BCM chip supplies the master clock for the SDC comms protocol (similar to SPI comms). Would the SDC still be able to operate without the CLK input or is that only required for the CMD buffer?
I am aware of several excellent solutions for unexpected Raspberry Pi power loss, e.g. a read-only root filesystem, as well as some great UPS-style circuits used with "dtoverlay=gpio-shutdown" to hold up the main power supply long enough to cleanly shut down the system.
I feel there has to be a fine-tuned hardware solution to this, possibly offered as an add-on extra to the basic board.
I'm sure most users would accept a small increase in cost if some basic data resilience features were added to mitigate irreparable SD Card filesystem damage following power failures or accidental power-offs, particularly in fully embedded environments where the user has no way of formally shutting down the Raspberry Pi.