Best way to get stream of speech for text display

I am designing an installation in which I would like to have the speech heard in the room rendered as a stream of text on a screen. It will be like an audio vortex, with text spiralling outwards in a variety of lines.

This will be running on a Mac mini.

I am currently trying out ofxSpeech, which seems to work only with recognized words, and in the Apple Speech Recognition Reference I cannot find any way to access the ‘pre-recognized’ information. ( https://developer.apple.com/library/mac/documentation/Carbon/Reference/Speech_Recognition_Manager/ )

What I am hoping for is something similar to what is being used here: https://speechlogger.appspot.com/en/ (only for google chrome right now)

Think of it as stream-of-consciousness text for all audio heard in a room (it doesn’t necessarily have to make sense, but it should be fairly responsive).

I am about to try Google Speech ( https://github.com/gillesdemey/google-speech-v2/ ), using their suggestion for linking it to Google Chrome to get an unlimited number of requests for a show.
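For reference, here is a rough sketch of what a one-shot request to that API looks like from C++ with libcurl, going by the repo’s description: FLAC audio POSTed to the v2 recognize endpoint. The file name, sample rate, and API key below are placeholders, not values from this project.

```cpp
// Sketch: one-shot request to the Google Speech v2 endpoint described in
// the google-speech-v2 repo. Assumes a 16 kHz mono FLAC file on disk and
// an API key obtained the way the repo suggests.
#include <curl/curl.h>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// libcurl write callback: append the JSON response to a std::string.
static size_t onData(char* ptr, size_t size, size_t nmemb, void* userdata) {
    static_cast<std::string*>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

int main() {
    // Read the FLAC-encoded audio (placeholder file name).
    std::ifstream in("utterance.flac", std::ios::binary);
    std::vector<char> audio((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());

    std::string url =
        "https://www.google.com/speech-api/v2/recognize"
        "?output=json&lang=en-US&key=YOUR_API_KEY"; // placeholder key

    std::string response;
    CURL* curl = curl_easy_init();
    curl_slist* headers =
        curl_slist_append(nullptr, "Content-Type: audio/x-flac; rate=16000");

    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, audio.data());
    curl_easy_setopt(curl, CURLOPT_POSTFIELDSIZE, (long)audio.size());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, onData);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
    curl_easy_perform(curl);

    // The response is one JSON object per line, each listing candidate
    // transcripts with confidence values.
    std::cout << response << std::endl;

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
}
```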

Is there a different technique I should be considering, though? Is there some part of the Apple speech developer tools that will give me a ‘pre-recognized’ stream of speech-to-text elements I can use as a visual representation of the audio spoken in a gallery setting?

Thank you very much for your help in this.

Is there any reason it has to be on a Mac vs a PC? The Speech API on Windows absolutely lets you access the guessed words and set a threshold for accuracy. You should be able to take the guessed speech, no matter how much gibberish it is, and output it to a string. There is actually a pretty easy example in the Speech Basics sample of the Kinect Developer Toolkit. It obviously shows using a Kinect, but there is no reason you have to use one, since it just uses the Speech API functionality. Plus it works with C++, so it integrates really easily with OF rather than having to translate from Objective-C :thumbsup:
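To make that concrete, here is a rough, untested sketch of the SAPI dictation setup that sample builds on, with hypothesis events enabled so the in-progress guesses come through as well as the final results (error handling trimmed for brevity):

```cpp
// Sketch: SAPI dictation on Windows, printing both hypothesis events
// (the engine's in-progress guesses) and final recognitions.
#include <atlbase.h>
#include <sapi.h>
#include <sphelper.h>
#include <iostream>

int main() {
    ::CoInitialize(nullptr);

    CComPtr<ISpRecognizer> recognizer;
    recognizer.CoCreateInstance(CLSID_SpSharedRecognizer);

    CComPtr<ISpRecoContext> context;
    recognizer->CreateRecoContext(&context);

    // Deliver events through a Win32 event we can wait on, and ask for
    // hypothesis events (in-progress guesses) as well as final results.
    context->SetNotifyWin32Event();
    context->SetInterest(SPFEI(SPEI_HYPOTHESIS) | SPFEI(SPEI_RECOGNITION),
                         SPFEI(SPEI_HYPOTHESIS) | SPFEI(SPEI_RECOGNITION));

    // A free dictation grammar: no fixed vocabulary, so everything heard
    // comes back as the engine's best guess, gibberish included.
    CComPtr<ISpRecoGrammar> grammar;
    context->CreateGrammar(0, &grammar);
    grammar->LoadDictation(nullptr, SPLO_STATIC);
    grammar->SetDictationState(SPRS_ACTIVE);

    while (true) {
        context->WaitForNotifyEvent(INFINITE);
        CSpEvent evt;
        while (evt.GetFrom(context) == S_OK) {
            ISpRecoResult* result = evt.RecoResult();
            if (!result) continue;
            wchar_t* text = nullptr;
            result->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE,
                            TRUE, &text, nullptr);
            if (text) {
                std::wcout << (evt.eEventId == SPEI_HYPOTHESIS
                                   ? L"guess: " : L"final: ")
                           << text << std::endl;
                ::CoTaskMemFree(text);
            }
        }
    }
}
```

In an OF app you would push those strings into your draw loop instead of printing them, but the event plumbing is the same.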

If you are asking for something that will give you the raw components of the audio, like phonetics, I am not so sure about that, since that would be pretty low-level engine functionality. You would probably need an open source library and to add in a function to grab that information; luckily there are plenty of those libraries floating around on multiple systems.
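For example, PocketSphinx (one of the open source CMU Sphinx engines) will hand you a partial hypothesis while audio is still streaming in. A rough sketch, written against the 5prealpha API (older versions use slightly different ps_start_utt/ps_get_hyp signatures) with placeholder model and audio paths:

```cpp
// Sketch: pulling a rolling "best guess" out of PocketSphinx while audio
// streams in. Model paths and the raw audio file are placeholders.
#include <pocketsphinx.h>
#include <cstdio>

int main() {
    cmd_ln_t* config = cmd_ln_init(nullptr, ps_args(), TRUE,
        "-hmm",  "model/en-us",         // acoustic model (placeholder path)
        "-lm",   "model/en-us.lm.bin",  // language model (placeholder path)
        "-dict", "model/cmudict.dict",  // pronunciation dictionary
        nullptr);
    ps_decoder_t* ps = ps_init(config);

    FILE* raw = fopen("mic.raw", "rb"); // 16 kHz, 16-bit mono PCM
    if (!raw) return 1;

    ps_start_utt(ps);
    int16 buf[512];
    size_t n;
    while ((n = fread(buf, sizeof(int16), 512, raw)) > 0) {
        ps_process_raw(ps, buf, n, FALSE, FALSE);
        // Partial hypothesis: the engine's current guess, mid-utterance.
        const char* hyp = ps_get_hyp(ps, nullptr);
        if (hyp) printf("guess so far: %s\n", hyp);
    }
    ps_end_utt(ps);

    fclose(raw);
    ps_free(ps);
    cmd_ln_free_r(config);
}
```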

Thank you, DomAmato,

The only reason I am using a Mac is that I have a few spare laptops which can be devoted to an art installation for a month. I can test whether they can run Windows if need be.

I shall check out the Kinect functionality. There is a component of this installation that may use them (it’s an art show about surveillance, where I want to track people within the space), but I have not tried to do so for a while.

I will continue with the example from https://github.com/fx-lange/ofxGSTT to see how its use of Chrome to bypass the Google Speech API limit works. I am concerned about the method required for starting and stopping the microphone.

I am hopeful that someone knows how to get the guessed words (thanks for a new phrase I can search for) within the Apple speech mechanism. It’s beginning to feel like a kludgy system will be needed…


Been in that situation before haha, you use what you have laying around :smile:

The Kinect is a funny thing with surveillance, since people are legitimately afraid of it; a friend of mine turns his around to face the wall when he’s not using it… though really the most unsafe webcams around tend to be baby monitors. The Speech API doesn’t require a Kinect, so don’t feel like you have to resort to using one. I just recommended the example because it shows how to set up the Speech API and I knew it existed; I am sure there are examples on MSDN too.

ofxGSTT looks pretty cool. My only caution about using it over an embedded speech library is that it requires an internet connection. Having worked in the museum industry, it’s amazing how few museums have reliable internet access, and if the web goes down for any reason your installation essentially becomes broken, which results in phone calls and email chains. Not that I am trying to discourage you from pursuing that path; just lessons learned from my own past experiences.


I think people get a feeling of ‘woo-overload’ about things they heard about near a water cooler but don’t know the specifics of well enough to fully understand: a lot of folklore-based worries that are not necessarily true, but they heard them, so they are on their radar.

I tend to hide my Kinects for that reason by taking them apart and putting the elements in non-Kinect-shaped things.

For the record, I did track down some information on the Apple developer page I linked to earlier, with some potential help, but it didn’t solve my problem. The properties below ‘imply’ that you can query them for the proper information, but I only get their four-char values (TEXT in the case of the first one) and not any unrecognized-word information (see the sketch after the list).

It seems this is something no one else has tried to work with, and my google-fu isn’t high enough to get the answers.

  • kSRTEXTFormat
    The text format.
    The value of this property is a variable-length string of characters that is the text of the recognized utterance. If the utterance was rejected, this text is the spelling of the rejected word. The string value does not include either a length byte (as in Pascal strings) or a null terminating character (as in C strings).
    Available in Mac OS X v10.0 and later.
    Declared in SpeechRecognition.h

  • kSRPhraseFormat
    The phrase format.
    The value of this property is a phrase that contains one word (of type SRWord) for each word in the recognized utterance. If the utterance was rejected, this path or phrase contains one object, the rejected word. The reference constant value of the phrase is always 0, but each word in the phrase retains its own reference constant property value.
    Available in Mac OS X v10.0 and later.
    Declared in SpeechRecognition.h

  • kSRPathFormat
    The path format.
    The value of this property is a path that contains a sequence of words (of type SRWord) and phrases (of type SRPhrase) representing the text of the recognized utterance. If the utterance was rejected, this path or phrase contains one object, the rejected word. The reference constant value of the path is always 0, but each word or phrase in the path retains its own reference constant property value.
    Available in Mac OS X v10.0 and later.
    Declared in SpeechRecognition.h
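For reference, here is a rough sketch of how those properties are meant to be queried, assuming a recognizer already created and listening via the Carbon Speech Recognition Manager. The result arrives in a kAESpeechDone Apple event, and as far as I can tell it only ever contains recognized (or a single rejected) words, never a pre-recognition stream:

```cpp
// Sketch: pulling the recognized text out of a Carbon Speech Recognition
// Manager result. Assumes a recognizer was created with SRNewRecognizer
// and is already listening; results arrive via kAESpeechDone Apple events.
#include <Carbon/Carbon.h>
#include <cstdio>

static pascal OSErr handleSpeechDone(const AppleEvent* event,
                                     AppleEvent* /*reply*/,
                                     SRefCon /*refCon*/) {
    // Extract the recognition result object from the Apple event.
    SRRecognitionResult result;
    Size size = sizeof(result);
    OSErr err = AEGetParamPtr(event, keySRSpeechResult, typeSRSpeechResult,
                              nullptr, &result, sizeof(result), &size);
    if (err != noErr) return err;

    // Ask for the plain-text form of the utterance (kSRTEXTFormat).
    // Note the returned string has no length byte or null terminator.
    char text[256];
    Size len = sizeof(text) - 1;
    err = SRGetProperty(result, kSRTEXTFormat, text, &len);
    if (err == noErr) {
        text[len] = '\0';
        printf("heard: %s\n", text);
    }
    SRReleaseObject(result);
    return err;
}

// Somewhere in setup, register for the speech-done event:
//   AEInstallEventHandler(kAESpeechSuite, kAESpeechDone,
//                         NewAEEventHandlerUPP(handleSpeechDone), 0, false);
```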

BTW, I have just checked the ofxASR app to see what it can generate, and it is throwing errors (error code -2, engine initialization failed), so I can’t test its usefulness.

It looks like reinstalling Parallels and Windows (and avoiding the next OS release, which kills my version of Parallels) is my next attempt.

Thanks again.

For anyone concerned, I have spent several days trying to get this to work with different software systems.

I tried ofxGSTT, which required building against the openFrameworks nightly build, as it uses the new sound buffer class (not available in 0.8.4 yet).

I tried ofxASR, but it has some issues with finding the Sphinx libraries and appears to have been abandoned as a project (I get error code -2 for the first speech engine).

I tried ofxSpeech, which looked promising, but I had trouble setting up a localhost server that was compatible with it (Chrome does not allow use of the microphone from an HTML file opened from disk, which means the page has to be hosted on a server; a localhost server should do, theoretically, but it didn’t appear to work for me).

I ‘did’ manage to get ‘most’ of ofxKinectOPenBridge working, but sadly ran into issues because the Kinect I was using was an Xbox Kinect, not a Kinect for Windows. I’m waiting on one of those now.

So… for this project I will be using Processing and a speech detection system that (like ofxSpeech) uses Chrome. The documentation made it much easier to implement (I had it working on the first try). Sadly, it still waits for pauses instead of sending everything as it goes.

I honestly did not think it would be this hard. I see the UITextInput class for iOS includes dictation, but sadly it will not work on a desktop system.

Once again, I am just adding this for a sense of closure on this project, should anyone else find their way to this question.

If anyone knows how to solve any of these problems (error -2 in ofxASR, linking Chrome and ofxGSTT, or using an iOS class on a desktop system), please feel free to chime in.

(I honestly thought that this would be much easier for the task I was working on… turning ambient discussions into on-screen text for artwork)