This paper details the parallel implementation of a three-dimensional direct simulation Monte Carlo (DSMC) code on shared-memory systems using Open Multi-Processing (OpenMP). Several techniques for optimizing the serial implementation of the DSMC method are also discussed. Synchronization in OpenMP, together with the associated critical sections, is identified as a major factor limiting OpenMP parallel performance, and methods to remove these bottlenecks from the OpenMP implementation of the DSMC method are presented. The OpenMP implementation achieves speedups of 1.99 on a dual-core system and 3.74 on a quad-core system. It is also reported that memory fetching and data communication within a node but across sockets need further improvement before acceptable scalability can be achieved on clusters of multi-socket, shared-memory architectures.
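As a rough illustration of why critical sections can limit OpenMP performance and of one common way to avoid them, the sketch below counts particles per cell in two ways: with a critical section around every update, and with thread-private accumulators that are merged afterwards. It is a minimal sketch under stated assumptions, not the implementation described in this work: the function names, the `cell_of` array (the cell index of each particle), and the per-cell counting task itself are hypothetical stand-ins for a DSMC sampling step.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Small wrappers so the sketch also compiles without OpenMP support. */
static int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

static int thread_id(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Hypothetical baseline: every update to the shared array is serialized
 * by a critical section, so threads spend most of their time waiting. */
void count_with_critical(const int *cell_of, long n_particles,
                         long *counts, int n_cells)
{
    assert(n_cells > 0);
    memset(counts, 0, (size_t)n_cells * sizeof(long));

    #pragma omp parallel for
    for (long i = 0; i < n_particles; ++i) {
        #pragma omp critical
        counts[cell_of[i]] += 1;
    }
}

/* Hypothetical alternative: each thread accumulates into its own slice of
 * a scratch buffer, so the particle loop needs no critical section.
 * Assumes it is called from serial code. Returns 0 on success and -1 if
 * the scratch buffer cannot be allocated. */
int count_with_private(const int *cell_of, long n_particles,
                       long *counts, int n_cells)
{
    assert(n_cells > 0);
    const int n_threads = max_threads();
    long *scratch = calloc((size_t)n_threads * (size_t)n_cells, sizeof(long));
    if (scratch == NULL)
        return -1;

    #pragma omp parallel
    {
        /* Slice of the scratch buffer owned by this thread only. */
        long *local = scratch + (size_t)thread_id() * (size_t)n_cells;

        #pragma omp for
        for (long i = 0; i < n_particles; ++i)
            local[cell_of[i]] += 1;
    }

    /* Merge step: each cell is summed by exactly one thread, so no
     * synchronization beyond the implicit barriers is required. */
    #pragma omp parallel for
    for (int c = 0; c < n_cells; ++c) {
        long total = 0;
        for (int t = 0; t < n_threads; ++t)
            total += scratch[(size_t)t * (size_t)n_cells + (size_t)c];
        counts[c] = total;
    }

    free(scratch);
    return 0;
}
```

Both functions produce the same counts; the second trades extra memory (one accumulator array per thread) for the removal of per-update synchronization. An OpenMP array reduction would be another option where the compiler supports it.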