When looking at an image, humans shift their attention towards interesting regions, producing sequences of eye fixations. When describing an image, they likewise produce simple sentences that highlight the key elements in the scene. What is the correlation between where people look and what they describe in an image? To investigate this question, we examine eye fixations and image captions, two types of subjective annotations that are relatively task-free and natural. From these annotations, we extract visual and verbal saliency ranks and compare them against each other. We then propose a number of low-level and semantic-level features relevant to visual-verbal consistency. Integrated into a computational model, the proposed features effectively predict the consistency between the two modalities on a large dataset with both types of annotations, namely SALICON.
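The rank comparison described above can be sketched as follows. This is a minimal illustration with hypothetical per-object saliency scores, not the paper's implementation; a tie-free Spearman's rank correlation is used here as one possible measure of visual-verbal consistency:

```python
def to_ranks(scores):
    """Convert raw saliency scores to ranks (1 = most salient).
    Assumes all scores are distinct, so no tie handling is needed."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation for two tie-free rank lists."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores for the objects in one image:
visual = {"person": 0.9, "dog": 0.7, "bench": 0.2, "tree": 0.1}  # from fixations
verbal = {"person": 0.8, "dog": 0.5, "tree": 0.3, "bench": 0.1}  # from captions
objects = sorted(visual)
rho = spearman_rho(to_ranks([visual[o] for o in objects]),
                   to_ranks([verbal[o] for o in objects]))
print(rho)  # high rho = the two modalities rank objects consistently
```

A value near 1 indicates that fixated objects are also the ones mentioned early or prominently in captions, while a value near 0 or below indicates divergence between the two modalities.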