In this paper, modular multiplication, the fundamental operation underlying modular exponentiation, is internally parallelized for the first time at the digit level. Modular exponentiation is the core computation of numerous public-key cryptography (PKC) systems, including RSA. As a performance criterion, overall latency is often more significant than throughput in the principal PKC applications of key exchange and authentication. Efforts to reduce total latency architecturally by pipelining traditional modular multiplication techniques are hindered by the inherently recursive nature of practical modular exponentiation methods, in which each multiplication depends on the result of the previous one. As a result, performance scalability relative to implementation area has been limited. The fine-grain parallelization methods presented in this paper are compelling because they permit overall latency reduction in addition to increased throughput. First, a hybrid bi-directional method is introduced for two-parallel implementations. Second, a uni-directional p-parallel technique is introduced that attains general levels of parallelism, thereby enabling performance scalability. These new techniques create a foundation for ultra-high-performance implementations.
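The data dependency that limits pipelining can be seen in a minimal sketch of left-to-right binary (square-and-multiply) exponentiation. This is a standard textbook method used purely for illustration, not the digit-level technique introduced in this paper, and the function name `modexp_binary` is hypothetical.

```python
def modexp_binary(base: int, exponent: int, modulus: int) -> int:
    """Left-to-right binary modular exponentiation (illustrative sketch).

    Hypothetical helper, not the paper's method: it only shows that
    every modular multiplication consumes the previous one's output.
    """
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")

    result = 1 % modulus
    base %= modulus
    # Scan the exponent bits from most significant to least significant.
    for bit in bin(exponent)[2:]:
        # The squaring needs the finished result of the previous iteration,
        # so it cannot start until that multiplication has completed.
        result = (result * result) % modulus
        if bit == "1":
            # The conditional multiply in turn depends on the squaring above.
            result = (result * base) % modulus
    return result
```

Because each iteration waits on the one before it, a pipelined multiplier cannot be kept full by a single exponentiation. This is why shortening the latency of each individual modular multiplication, rather than overlapping several of them, is what reduces the total exponentiation time.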