In this paper, we present a number of robust methodologies for an underwater robot to visually detect, follow, and interact with a diver for collaborative task execution. We design and develop two autonomous diver-following algorithms, the first of which utilizes both spatial- and frequency-domain features pertaining to human swimming patterns to visually track a diver. The second algorithm uses a convolutional neural network-based model for robust tracking-by-detection. In addition, we propose a hand gesture-based human–robot communication framework that is syntactically simpler and computationally more efficient than the existing grammar-based frameworks. In the proposed interaction framework, deep visual detectors are used to provide accurate hand gesture recognition; subsequently, a finite-state machine performs robust and efficient gesture-to-instruction mapping. The distinguishing feature of this framework is that it can be easily adopted by divers for communicating with underwater robots without using artificial markers or requiring memorization of complex language rules. Furthermore, we validate the performance and effectiveness of the proposed methodologies through a number of field experiments in closed- and open-water environments. Finally, we perform a user interaction study to demonstrate the usability benefits of our proposed interaction framework compared to the existing methods.