I'm building a Spring Boot-powered service that writes data to Hadoop using the `FileSystem` API. Some data is written to Parquet files, and large blocks are cached in memory, so when the service is shut down, potentially several hundred MB of data have to be written to Hadoop.
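For context, the write path looks roughly like this (a simplified sketch; the Avro-based builder, the schema, and the `hdfs:///data/events.parquet` path are placeholders, not the service's actual code):

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

class EventWriter {
    private final ParquetWriter<GenericRecord> writer;

    EventWriter(Configuration conf, Schema schema) throws IOException {
        // ParquetWriter buffers row groups in memory and flushes the
        // remainder on close(), which is why shutdown may still have
        // hundreds of MB left to write.
        this.writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("hdfs:///data/events.parquet"))
                .withSchema(schema)
                .withConf(conf)
                .build();
    }

    void write(GenericRecord record) throws IOException {
        writer.write(record);
    }

    void close() throws IOException {
        writer.close(); // must finish before the FileSystem is closed
    }
}
```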
`FileSystem` closes automatically by default, so when the service is shut down, the `FileSystem` sometimes gets closed before all the writers are closed, resulting in corrupted Parquet files.
There is an `fs.automatic.close` flag in the filesystem `Configuration`, but the `FileSystem` instance is used from multiple threads, and I don't know of a clean way to wait for them all to finish before closing the `FileSystem` manually.
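Setting the flag itself is simple enough (a minimal sketch; it only removes Hadoop's shutdown hook and does nothing about the ordering problem):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

class HadoopFileSystemFactory {
    static FileSystem open() throws IOException {
        Configuration conf = new Configuration();
        // Disable the JVM shutdown hook that closes every cached
        // FileSystem instance; closing becomes our responsibility.
        conf.setBoolean("fs.automatic.close", false);
        return FileSystem.get(conf); // still the shared, cached instance
    }
}
```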
I tried using a dedicated filesystem-closing bean implementing Spring's `SmartLifecycle` with the maximum `phase` so that it would be destroyed last, but in fact it is not destroyed last: it is merely notified of shutdown last, while other beans are still in the process of shutting down.
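The closing bean looked roughly like this (a sketch; constructor injection of the shared `FileSystem` is assumed):

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.springframework.context.SmartLifecycle;
import org.springframework.stereotype.Component;

@Component
class FileSystemClosingBean implements SmartLifecycle {

    private final FileSystem fileSystem;
    private volatile boolean running;

    FileSystemClosingBean(FileSystem fileSystem) {
        this.fileSystem = fileSystem;
    }

    @Override
    public void start() {
        running = true;
    }

    @Override
    public void stop() {
        // Called during context close, but this is only a notification:
        // other beans (the writers) may still be mid-shutdown here.
        try {
            fileSystem.close();
        } catch (IOException e) {
            // nothing useful to do; the context is going down anyway
        }
        running = false;
    }

    @Override
    public boolean isRunning() {
        return running;
    }

    @Override
    public int getPhase() {
        return Integer.MAX_VALUE; // the "max phase" mentioned above
    }
}
```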
Ideally, every object that needs a `FileSystem` would get its own instance and be responsible for closing it. The problem is that `FileSystem.get(conf)` returns a cached instance. There is `FileSystem.newInstance(conf)`, but it is not clear what the performance consequences of using multiple `FileSystem` instances are.

There is another issue with that approach: there is no way to pass a `FileSystem` instance to `ParquetWriter`; it gets one using `path.getFileSystem(conf)`. One would think that line returns a `FileSystem` instance assigned to that file only, but one would be wrong: most likely the same cached instance is returned, so closing it would be wrong.
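A small sketch that demonstrates the caching behaviour described above (identity checks only; it assumes the same scheme, authority, and user, which is what the cache keys on):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class FileSystemCacheDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        FileSystem a = FileSystem.get(conf);
        FileSystem b = FileSystem.get(conf);
        System.out.println(a == b); // true: same cached instance

        // Path.getFileSystem(conf) delegates to FileSystem.get(uri, conf),
        // so ParquetWriter ends up holding the same cached instance too.
        FileSystem c = new Path("/data/out.parquet").getFileSystem(conf);
        System.out.println(c == a); // true

        // newInstance() bypasses the cache lookup and hands back a fresh
        // instance that the caller owns and must close.
        FileSystem d = FileSystem.newInstance(conf);
        System.out.println(d == a); // false
        d.close(); // safe: does not affect the cached instance
    }
}
```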
Is there a recommended way of managing the lifecycle of a `FileSystem`? What would happen if a `FileSystem` is created with `fs.automatic.close` set to `false` and never closed manually? Or does Spring Boot support a clean way to close the `FileSystem` after all other beans have actually been destroyed (not merely while they are being destroyed)?
Thanks!