In general, fusion can be performed at the input level (early fusion), at the decision level (late fusion), or at an intermediate level [8]. Although studies in neuroscience [9, 10] and machine learning [1, 3] suggest that mid-level feature fusion could benefit learning, late fusion remains the predominant method for multimodal learning [11–13]. This is mostly due to practical reasons.

Another reason for the popularity of late fusion is that the architecture of each unimodal stream has been carefully designed over years to achieve state-of-the-art performance for its modality.
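
To make the three fusion levels concrete, the following is a minimal NumPy sketch for two modalities. It is an illustration under stated assumptions, not the method of any cited work: the single linear layers, the `tanh` nonlinearity, the averaging rule for late fusion, and all function and parameter names (`early_fusion`, `late_fusion`, `intermediate_fusion`, `wa`, `wb`, and so on) are hypothetical choices made for brevity.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def early_fusion(xa, xb, w):
    # Input-level fusion: concatenate the raw inputs of both modalities
    # and feed them to a single (here: linear) classifier.
    x = np.concatenate([xa, xb], axis=-1)
    return softmax(x @ w)


def late_fusion(xa, xb, wa, wb):
    # Decision-level fusion: each unimodal stream predicts on its own;
    # the class probabilities are then averaged (an assumed fusion rule).
    pa = softmax(xa @ wa)
    pb = softmax(xb @ wb)
    return 0.5 * (pa + pb)


def intermediate_fusion(xa, xb, wa1, wb1, w2):
    # Mid-level fusion: each stream first computes its own hidden features,
    # which are concatenated and classified jointly.
    ha = np.tanh(xa @ wa1)
    hb = np.tanh(xb @ wb1)
    h = np.concatenate([ha, hb], axis=-1)
    return softmax(h @ w2)
```

In this sketch, `late_fusion` leaves each unimodal stream untouched and only combines their outputs, which is why it allows existing, separately tuned unimodal architectures to be reused as they are.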