Ever been really hacked-off by those machines that put sentences together by splicing words? Telephone recordings, elevators, that sort of thing? Even if you've got a good ear, they're generally really annoying to listen to.
The fix is also very simple, yet nobody ever seems to do it.
There are two major ways of stressing a word in normal speech. Say "nine seven nine".
No, really. Give it a try.
See how the first nine sounds different from the last nine? The first nine and the seven use an ongoing emphasis. That's the way we say a word when another word is going to follow it. The last nine uses a different emphasis, because you're going to stop speaking. It's how we sound words that occur at the end of sentences, or when we're otherwise done speaking. So there are two ways to say a word: regular and final.
Those devices that speak by splicing together words always use only finals. Essentially, the person they recorded spoke each word as a standalone (final) word. Those were recorded, and chopped up and stored for the software to reproduce.
To get it to sound right, and to sound more natural, you record the speaker using regular emphasis.
Get them to repeat the word several times in a row, as if it were a sentence, and chop out one from the middle that you like. That's your regular. Then you take the last repetition, and keep that as the final. That gives you two sound-banks of recorded words: one set of regulars, and one set of finals. Then it's just a matter of setting up your data tables so that each sentence picks a regular for every word except the last one, and picks the last word out of the list of finals.
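If it helps to see it spelled out, here's a rough sketch of that selection step in Python. The function name, the two dictionaries and the file names are all made up for illustration; a real system would have its own tables and its own way of playing the clips back.

```python
from typing import Dict, List


def select_clips(words: List[str],
                 regulars: Dict[str, str],
                 finals: Dict[str, str]) -> List[str]:
    """Pick a regular clip for every word except the last, which gets a final."""
    if not words:
        return []
    # Every word that has another word following it uses the regular bank.
    clips = [regulars[w] for w in words[:-1]]
    # The word that ends the sentence uses the final bank.
    clips.append(finals[words[-1]])
    return clips


# Hypothetical sound-banks: word -> clip file name (assumed layout).
REGULARS = {"seven": "regular/seven.wav", "nine": "regular/nine.wav"}
FINALS = {"seven": "final/seven.wav", "nine": "final/nine.wav"}

print(select_clips(["nine", "seven", "nine"], REGULARS, FINALS))
```

For "nine seven nine" that picks the regular nine, the regular seven, and then the final nine, which is exactly the difference you heard when you said it out loud.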
It sounds so much more natural, is easier on the ear and requires less concentration to understand. It's also quite simple, doesn't add much to the time with your voice-actor, and only requires double the storage (and in many of these systems, the storage is vastly underutilized).
So why does nobody ever actually do this?