Good speech recognition engines have been around for decades. Assuming you're on Windows, Microsoft's speech API dates back to the Windows 95 era and has been evolving ever since; the built-in speech recognition is quite good out of the box. You get the best results by giving the engine a limited vocabulary of recognized words, since it only has to score a handful of candidates for a high-probability match. You get worse results by allowing natural language and comprehensive dictionaries, but as described, that can still be done.
Recognition is the more tedious step, but it isn't difficult. You build a speech recognition grammar that basically says "these words mean this token" and "these tokens are valid", and then you process the tokens as commands.
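As a rough, engine-agnostic sketch of that words-to-tokens-to-commands flow: the phrases, token names, and handler functions below are all made up for illustration, and the recognized text is assumed to arrive as a plain string from whatever speech engine you use. A real engine's grammar format will look different, but the shape of the logic is similar.

```python
# Hypothetical grammar: "these words mean this token".
PHRASE_TO_TOKEN = {
    "fire": "ATTACK",
    "attack": "ATTACK",
    "shoot": "ATTACK",
    "fall back": "RETREAT",
    "retreat": "RETREAT",
    "hold position": "HOLD",
}


# Hypothetical command handlers; a real game would change game state here.
def attack():
    return "attacking"


def retreat():
    return "retreating"


def hold():
    return "holding"


# Each token maps to the command that handles it.
COMMANDS = {"ATTACK": attack, "RETREAT": retreat, "HOLD": hold}


def tokenize(recognized_text):
    """Map a recognized phrase to its token, or None if it is not in the grammar."""
    return PHRASE_TO_TOKEN.get(recognized_text.strip().lower())


def handle_phrase(recognized_text, valid_tokens):
    """Run the command for a phrase if its token is valid right now.

    valid_tokens is the set of tokens allowed in the current game state
    ("these tokens are valid"). Returns the handler's result, or None if
    the phrase is unknown or its token is not currently allowed.
    """
    token = tokenize(recognized_text)
    if token is None or token not in valid_tokens:
        return None
    return COMMANDS[token]()
```

For example, handle_phrase("Fire", {"ATTACK", "HOLD"}) runs the attack handler, while an unknown phrase or a token outside the valid set is simply ignored. Keeping the phrase table small is also what lets the recognizer stay accurate.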
Text-to-speech is quite easy. It can be literally as easy as calling a fire-and-forget function like SpeakAsync(myTextString). The default voices provided by the system are somewhat bland and computerized, though, so most games prefer to use voice actors and pre-recorded lines.
A bigger problem is that voice command of games usually isn't fun. Also, many people cannot play such games at all, for reasons like not having a microphone, being in places where calling out game commands is inappropriate, or being in environments where background noise interferes with recognition.