ofxTensorFlow2 chatbot [(almost...) solved]

I trained a model with this code: tf2-transformer-chatbot/tf2_tpu_transformer_chatbot.ipynb at master · bryanlimy/tf2-transformer-chatbot · GitHub

And I can load it with ofxTensorFlow2; these are the model's operations:

[notice ] ofxTensorFlow2: ====== Model Operations with Shapes ======
[notice ] ofxTensorFlow2: count with shape: []
[notice ] ofxTensorFlow2: count/Read/ReadVariableOp with shape: []
[notice ] ofxTensorFlow2: total with shape: []
[notice ] ofxTensorFlow2: total/Read/ReadVariableOp with shape: []
[notice ] ofxTensorFlow2: count_1 with shape: []
[notice ] ofxTensorFlow2: count_1/Read/ReadVariableOp with shape: []
[notice ] ofxTensorFlow2: total_1 with shape: []
[notice ] ofxTensorFlow2: total_1/Read/ReadVariableOp with shape: []
[notice ] ofxTensorFlow2: layer_normalization_9/beta with shape: []
[notice ] ofxTensorFlow2: layer_normalization_9/beta/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: layer_normalization_9/gamma with shape: []
[notice ] ofxTensorFlow2: layer_normalization_9/gamma/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: dense_31/bias with shape: []
[notice ] ofxTensorFlow2: dense_31/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: dense_31/kernel with shape: []
[notice ] ofxTensorFlow2: dense_31/kernel/Read/ReadVariableOp with shape: [512, 256]
[notice ] ofxTensorFlow2: dense_30/bias with shape: []
[notice ] ofxTensorFlow2: dense_30/bias/Read/ReadVariableOp with shape: [512]
[notice ] ofxTensorFlow2: dense_30/kernel with shape: []
[notice ] ofxTensorFlow2: dense_30/kernel/Read/ReadVariableOp with shape: [256, 512]
[notice ] ofxTensorFlow2: layer_normalization_8/beta with shape: []
[notice ] ofxTensorFlow2: layer_normalization_8/beta/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: layer_normalization_8/gamma with shape: []
[notice ] ofxTensorFlow2: layer_normalization_8/gamma/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_2/dense_29/bias with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_29/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_2/dense_29/kernel with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_29/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention_2/dense_28/bias with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_28/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_2/dense_28/kernel with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_28/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention_2/dense_27/bias with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_27/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_2/dense_27/kernel with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_27/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention_2/dense_26/bias with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_26/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_2/dense_26/kernel with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_26/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: layer_normalization_7/beta with shape: []
[notice ] ofxTensorFlow2: layer_normalization_7/beta/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: layer_normalization_7/gamma with shape: []
[notice ] ofxTensorFlow2: layer_normalization_7/gamma/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_1/dense_25/bias with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_25/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_1/dense_25/kernel with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_25/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention_1/dense_24/bias with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_24/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_1/dense_24/kernel with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_24/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention_1/dense_23/bias with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_23/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_1/dense_23/kernel with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_23/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention_1/dense_22/bias with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_22/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_1/dense_22/kernel with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_22/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: layer_normalization_6/beta with shape: []
[notice ] ofxTensorFlow2: layer_normalization_6/beta/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: layer_normalization_6/gamma with shape: []
[notice ] ofxTensorFlow2: layer_normalization_6/gamma/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: dense_21/bias with shape: []
[notice ] ofxTensorFlow2: dense_21/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: dense_21/kernel with shape: []
[notice ] ofxTensorFlow2: dense_21/kernel/Read/ReadVariableOp with shape: [512, 256]
[notice ] ofxTensorFlow2: dense_20/bias with shape: []
[notice ] ofxTensorFlow2: dense_20/bias/Read/ReadVariableOp with shape: [512]
[notice ] ofxTensorFlow2: dense_20/kernel with shape: []
[notice ] ofxTensorFlow2: dense_20/kernel/Read/ReadVariableOp with shape: [256, 512]
[notice ] ofxTensorFlow2: layer_normalization_5/beta with shape: []
[notice ] ofxTensorFlow2: layer_normalization_5/beta/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: layer_normalization_5/gamma with shape: []
[notice ] ofxTensorFlow2: layer_normalization_5/gamma/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_2/dense_19/bias with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_19/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_2/dense_19/kernel with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_19/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention_2/dense_18/bias with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_18/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_2/dense_18/kernel with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_18/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention_2/dense_17/bias with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_17/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_2/dense_17/kernel with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_17/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention_2/dense_16/bias with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_16/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_2/dense_16/kernel with shape: []
[notice ] ofxTensorFlow2: attention_2/dense_16/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: layer_normalization_4/beta with shape: []
[notice ] ofxTensorFlow2: layer_normalization_4/beta/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: layer_normalization_4/gamma with shape: []
[notice ] ofxTensorFlow2: layer_normalization_4/gamma/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_1/dense_15/bias with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_15/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_1/dense_15/kernel with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_15/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention_1/dense_14/bias with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_14/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_1/dense_14/kernel with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_14/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention_1/dense_13/bias with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_13/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_1/dense_13/kernel with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_13/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention_1/dense_12/bias with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_12/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention_1/dense_12/kernel with shape: []
[notice ] ofxTensorFlow2: attention_1/dense_12/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: embedding_1/embeddings with shape: []
[notice ] ofxTensorFlow2: embedding_1/embeddings/Read/ReadVariableOp with shape: [8278, 256]
[notice ] ofxTensorFlow2: layer_normalization_3/beta with shape: []
[notice ] ofxTensorFlow2: layer_normalization_3/beta/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: layer_normalization_3/gamma with shape: []
[notice ] ofxTensorFlow2: layer_normalization_3/gamma/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: dense_11/bias with shape: []
[notice ] ofxTensorFlow2: dense_11/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: dense_11/kernel with shape: []
[notice ] ofxTensorFlow2: dense_11/kernel/Read/ReadVariableOp with shape: [512, 256]
[notice ] ofxTensorFlow2: dense_10/bias with shape: []
[notice ] ofxTensorFlow2: dense_10/bias/Read/ReadVariableOp with shape: [512]
[notice ] ofxTensorFlow2: dense_10/kernel with shape: []
[notice ] ofxTensorFlow2: dense_10/kernel/Read/ReadVariableOp with shape: [256, 512]
[notice ] ofxTensorFlow2: layer_normalization_2/beta with shape: []
[notice ] ofxTensorFlow2: layer_normalization_2/beta/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: layer_normalization_2/gamma with shape: []
[notice ] ofxTensorFlow2: layer_normalization_2/gamma/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention/dense_9/bias with shape: []
[notice ] ofxTensorFlow2: attention/dense_9/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention/dense_9/kernel with shape: []
[notice ] ofxTensorFlow2: attention/dense_9/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention/dense_8/bias with shape: []
[notice ] ofxTensorFlow2: attention/dense_8/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention/dense_8/kernel with shape: []
[notice ] ofxTensorFlow2: attention/dense_8/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention/dense_7/bias with shape: []
[notice ] ofxTensorFlow2: attention/dense_7/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention/dense_7/kernel with shape: []
[notice ] ofxTensorFlow2: attention/dense_7/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention/dense_6/bias with shape: []
[notice ] ofxTensorFlow2: attention/dense_6/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention/dense_6/kernel with shape: []
[notice ] ofxTensorFlow2: attention/dense_6/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: layer_normalization_1/beta with shape: []
[notice ] ofxTensorFlow2: layer_normalization_1/beta/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: layer_normalization_1/gamma with shape: []
[notice ] ofxTensorFlow2: layer_normalization_1/gamma/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: dense_5/bias with shape: []
[notice ] ofxTensorFlow2: dense_5/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: dense_5/kernel with shape: []
[notice ] ofxTensorFlow2: dense_5/kernel/Read/ReadVariableOp with shape: [512, 256]
[notice ] ofxTensorFlow2: dense_4/bias with shape: []
[notice ] ofxTensorFlow2: dense_4/bias/Read/ReadVariableOp with shape: [512]
[notice ] ofxTensorFlow2: dense_4/kernel with shape: []
[notice ] ofxTensorFlow2: dense_4/kernel/Read/ReadVariableOp with shape: [256, 512]
[notice ] ofxTensorFlow2: layer_normalization/beta with shape: []
[notice ] ofxTensorFlow2: layer_normalization/beta/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: layer_normalization/gamma with shape: []
[notice ] ofxTensorFlow2: layer_normalization/gamma/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention/dense_3/bias with shape: []
[notice ] ofxTensorFlow2: attention/dense_3/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention/dense_3/kernel with shape: []
[notice ] ofxTensorFlow2: attention/dense_3/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention/dense_2/bias with shape: []
[notice ] ofxTensorFlow2: attention/dense_2/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention/dense_2/kernel with shape: []
[notice ] ofxTensorFlow2: attention/dense_2/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention/dense_1/bias with shape: []
[notice ] ofxTensorFlow2: attention/dense_1/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention/dense_1/kernel with shape: []
[notice ] ofxTensorFlow2: attention/dense_1/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: attention/dense/bias with shape: []
[notice ] ofxTensorFlow2: attention/dense/bias/Read/ReadVariableOp with shape: [256]
[notice ] ofxTensorFlow2: attention/dense/kernel with shape: []
[notice ] ofxTensorFlow2: attention/dense/kernel/Read/ReadVariableOp with shape: [256, 256]
[notice ] ofxTensorFlow2: embedding/embeddings with shape: []
[notice ] ofxTensorFlow2: embedding/embeddings/Read/ReadVariableOp with shape: [8278, 256]
[notice ] ofxTensorFlow2: outputs/bias with shape: []
[notice ] ofxTensorFlow2: outputs/bias/Read/ReadVariableOp with shape: [8278]
[notice ] ofxTensorFlow2: outputs/kernel with shape: []
[notice ] ofxTensorFlow2: outputs/kernel/Read/ReadVariableOp with shape: [256, 8278]
[notice ] ofxTensorFlow2: Const with shape: []
[notice ] ofxTensorFlow2: Const_1 with shape: [1, 8278, 256]
[notice ] ofxTensorFlow2: Const_2 with shape: []
[notice ] ofxTensorFlow2: Const_3 with shape: [1, 8278, 256]
[notice ] ofxTensorFlow2: Const_4 with shape: []
[notice ] ofxTensorFlow2: serving_default_dec_inputs with shape: [-1, -1]
[notice ] ofxTensorFlow2: serving_default_inputs with shape: [-1, -1]
[notice ] ofxTensorFlow2: StatefulPartitionedCall with shape: [-1, -1, 8278]
[notice ] ofxTensorFlow2: saver_filename with shape: []
[notice ] ofxTensorFlow2: StatefulPartitionedCall_1 with shape: []
[notice ] ofxTensorFlow2: StatefulPartitionedCall_2 with shape: []
[notice ] ofxTensorFlow2: ============ End ==============
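
If I read the operation list correctly, the interesting entries are the two serving inputs serving_default_inputs and serving_default_dec_inputs (both shape [-1, -1]) and the output StatefulPartitionedCall (shape [-1, -1, 8278], i.e. logits over the vocabulary). A minimal sketch of how I would wire that up with ofxTF2::Model (untested; the ":0" suffixes follow the cppflow naming convention and may need adjusting):

// ofApp.h (member): ofxTF2::Model model;
// ofApp::setup():
model.load("model"); // path to the SavedModel folder in bin/data
model.setup({"serving_default_inputs:0", "serving_default_dec_inputs:0"},
            {"StatefulPartitionedCall:0"});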

And this is how inference is done in the notebook (it works with the generated model):

def evaluate(sentence):
  sentence = preprocess_sentence(sentence)

  sentence = tf.expand_dims(
      START_TOKEN + tokenizer.encode(sentence) + END_TOKEN, axis=0)

  output = tf.expand_dims(START_TOKEN, 0)

  for i in range(MAX_LENGTH):
    predictions = model(inputs=[sentence, output], training=False)

    # select the last word from the seq_len dimension
    predictions = predictions[:, -1:, :]
    predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

    # return the result if the predicted_id is equal to the end token
    if tf.equal(predicted_id, END_TOKEN[0]):
      break

    # concatenated the predicted_id to the output which is given to the decoder
    # as its input.
    output = tf.concat([output, predicted_id], axis=-1)

  return tf.squeeze(output, axis=0)


def predict(sentence):
  prediction = evaluate(sentence)

  predicted_sentence = tokenizer.decode(
      [i for i in prediction if i < tokenizer.vocab_size])

  print('Input: {}'.format(sentence))
  print('Output: {}'.format(predicted_sentence))

  return predicted_sentence
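
I guess the greedy decoding loop itself would roughly translate to something like this with ofxTensorFlow2 (a rough, untested sketch; I assume runMultiModel() is the right call for two inputs, the start/end token ids follow the notebook's START_TOKEN = vocab_size with the embedding size 8278, the input dtype may need to be float32 instead of int32, and the tokenizer/encoding part is exactly what is still missing):

// rough sketch of evaluate(): greedy decoding with the model set up as above,
// assuming "encoded" already holds the subword ids of the input sentence
std::vector<int32_t> encoded;    // = tokenizer.encode(sentence), still missing
const int32_t startToken = 8276; // assumption: tokenizer.vocab_size
const int32_t endToken = 8277;   // assumption: tokenizer.vocab_size + 1
const int vocabSize = 8278;      // from the output shape [-1, -1, 8278]
const int maxLength = 40;        // MAX_LENGTH in the notebook

std::vector<int32_t> input;
input.push_back(startToken);
input.insert(input.end(), encoded.begin(), encoded.end());
input.push_back(endToken);

std::vector<int32_t> output = {startToken};

for(int i = 0; i < maxLength; i++){
	cppflow::tensor inputTensor(input, {1, (int64_t)input.size()});
	cppflow::tensor decInputTensor(output, {1, (int64_t)output.size()});
	auto result = model.runMultiModel({inputTensor, decInputTensor});

	// logits shape [1, seq_len, vocabSize]: take the argmax of the last step
	std::vector<float> logits;
	ofxTF2::tensorToVector<float>(result[0], logits);
	auto lastStep = logits.begin() + (logits.size() - vocabSize);
	int32_t predictedId = (int32_t)std::distance(lastStep,
		std::max_element(lastStep, logits.end())); // needs <algorithm>

	if(predictedId == endToken){
		break; // stop when the end token is predicted
	}
	output.push_back(predictedId); // feed the prediction back into the decoder
}
// output (without the leading start token) now holds the predicted ids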

I would be happy if anybody could give me a hint on how to preprocess/tokenize the input and decode the output ids back to text with OF / ofxTensorFlow2.

Maybe this could be helpful for tokenizing the input? GitHub - Gavin-Development/CPP-SubwordTextEncoder: Based on the SubwordTextEncoder From Tensorflow Datasets, but I implemented it in CPP. https://www.tensorflow.org/datasets/api_docs/python/tfds/deprecated/text/SubwordTextEncoder

Of course I can try to upload the saved model somewhere, if someone wants to try it (it is around 50MB)…

Hello Jona,

for your model to work, the input data will need the same preprocessing (encoding) as your training data.
You could try to export the mapping from word to integer (and back) from your Python notebook and use that as a look-up table in OF, but you would still need to "tokenize" your input so it can be found in the look-up table. The SubwordTextEncoder you linked seems like the right source for that part.

I would probably start with the look-up table (called vocabulary in the cpp file) and load it from a file when you start your OF application.
Then I would take an input pair (a sentence and its integer sequence as encoded in Python) to implement and test the encoding and decoding of your SubwordTextEncoder.
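
For illustration, a minimal sketch of what I mean by the look-up table (assuming you export the vocabulary as a plain text file with one subword per line, so that the line number is the id; the exact id offsets depend on how you export it):

// minimal look-up table sketch: one subword per line, line number = id
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::string> idToSubword;              // id -> subword
std::unordered_map<std::string, int> subwordToId;  // subword -> id

void loadVocabulary(const std::string & path){
	std::ifstream file(path);
	std::string subword;
	while(std::getline(file, subword)){
		subwordToId[subword] = (int)idToSubword.size();
		idToSubword.push_back(subword);
	}
}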

Hope that helps a little.

Good luck!


@felixd thanks, that helps a lot.

So if I am right, I need to port this code (from the notebook) from Python to C++ to create the dictionary?

def preprocess_sentence(sentence):
  sentence = sentence.lower().strip()
  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
  sentence = re.sub(r'[" "]+', " ", sentence)
  # expanding contractions
  sentence = re.sub(r"i'm", "i am", sentence)
  sentence = re.sub(r"he's", "he is", sentence)
  sentence = re.sub(r"she's", "she is", sentence)
  sentence = re.sub(r"it's", "it is", sentence)
  sentence = re.sub(r"that's", "that is", sentence)
  sentence = re.sub(r"what's", "what is", sentence)
  sentence = re.sub(r"where's", "where is", sentence)
  sentence = re.sub(r"how's", "how is", sentence)
  sentence = re.sub(r"\'ll", " will", sentence)
  sentence = re.sub(r"\'ve", " have", sentence)
  sentence = re.sub(r"\'re", " are", sentence)
  sentence = re.sub(r"\'d", " would", sentence)
  sentence = re.sub(r"won't", "will not", sentence)
  sentence = re.sub(r"can't", "cannot", sentence)
  sentence = re.sub(r"n't", " not", sentence)
  sentence = re.sub(r"n'", "ng", sentence)
  sentence = re.sub(r"'bout", "about", sentence)
  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
  sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)
  sentence = sentence.strip()
  return sentence


def load_conversations():
  # dictionary of line id to text
  id2line = {}
  with open(path_to_movie_lines, errors='ignore') as file:
    lines = file.readlines()
  for line in lines:
    parts = line.replace('\n', '').split(' +++$+++ ')
    id2line[parts[0]] = parts[4]

  inputs, outputs = [], []
  with open(path_to_movie_conversations, 'r') as file:
    lines = file.readlines()
  for line in lines:
    parts = line.replace('\n', '').split(' +++$+++ ')
    # get conversation in a list of line ID
    conversation = [line[1:-1] for line in parts[3][1:-1].split(', ')]
    for i in range(len(conversation) - 1):
      inputs.append(preprocess_sentence(id2line[conversation[i]]))
      outputs.append(preprocess_sentence(id2line[conversation[i + 1]]))
      if len(inputs) >= MAX_SAMPLES:
        return inputs, outputs
  return inputs, outputs


questions, answers = load_conversations()

And use this file ‘movie_lines.txt’ from http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip as the input?

Actually I am stuck with this part of the Python code (not sure how to convert it to C++):

def load_conversations():
  # dictionary of line id to text
  id2line = {}
  with open(path_to_movie_lines, errors='ignore') as file:
    lines = file.readlines()
  for line in lines:
    parts = line.replace('\n', '').split(' +++$+++ ')
    id2line[parts[0]] = parts[4]

  inputs, outputs = [], []
  with open(path_to_movie_conversations, 'r') as file:
    lines = file.readlines()
  for line in lines:
    parts = line.replace('\n', '').split(' +++$+++ ')
    # get conversation in a list of line ID
    conversation = [line[1:-1] for line in parts[3][1:-1].split(', ')]
    for i in range(len(conversation) - 1):
      inputs.append(preprocess_sentence(id2line[conversation[i]]))
      outputs.append(preprocess_sentence(id2line[conversation[i + 1]]))
      if len(inputs) >= MAX_SAMPLES:
        return inputs, outputs
  return inputs, outputs


questions, answers = load_conversations()

The preprocess_sentence method should be quite easy to write with OF, I guess…
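
I guess the regex parts need std::regex rather than plain ofStringReplace; a quick, untested sketch of the whole preprocess_sentence():

// untested sketch of preprocess_sentence() with std::regex (needs <regex>)
std::string preprocessSentence(std::string sentence){
	sentence = ofToLower(ofTrim(sentence));
	// put spaces around ? . ! ,  ("he is a boy." -> "he is a boy .")
	sentence = std::regex_replace(sentence, std::regex("([?.!,])"), " $1 ");
	// collapse multiple spaces into one
	sentence = std::regex_replace(sentence, std::regex(" +"), " ");
	// expand contractions (i'm -> i am, etc.), same list as in the notebook
	ofStringReplace(sentence, "i'm", "i am");
	ofStringReplace(sentence, "he's", "he is");
	// ... the remaining ofStringReplace calls like in the code below go here ...
	// replace everything except a-z A-Z ? . ! , with a space
	sentence = std::regex_replace(sentence, std::regex("[^a-zA-Z?.!,]+"), " ");
	return ofTrim(sentence);
}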

I tried to create the questions / answers for the tokenizer vocabulary:

	vector < string > linesOfTheFile;
	ofBuffer buffer = ofBufferFromFile("movie_lines.txt");
	for (auto line : buffer.getLines()) {
		linesOfTheFile.push_back(line);
	}
	vector < string > questions;
	vector < string > answers;
	for (int i = 0; i < 5000; i++) {
		vector < string > q_result = ofSplitString(linesOfTheFile[i], " +++$+++ ");
		std::string line = ofToLower(q_result[4]);
		ofStringReplace(line, "?", " ?");
		ofStringReplace(line, ".", " .");
		ofStringReplace(line, "!", " !");
		ofStringReplace(line, ",", " ,");
		ofStringReplace(line, "i'm", "i am");
		ofStringReplace(line, "he's", "he is");
		ofStringReplace(line, "she's", "she is");
		ofStringReplace(line, "it's", "it is");
		questions.push_back(line);
		vector < string > a_result = ofSplitString(linesOfTheFile[i + 1], " +++$+++ ");
		std::string line1 = ofToLower(a_result[4]);
		ofStringReplace(line1, "?", " ?");
		ofStringReplace(line1, ".", " .");
		ofStringReplace(line1, "!", " !");
		ofStringReplace(line1, ",", " ,");
		ofStringReplace(line1, "i'm", "i am");
		ofStringReplace(line1, "he's", "he is");
		ofStringReplace(line1, "she's", "she is");
		ofStringReplace(line1, "it's", "it is");
		answers.push_back(line1);
	}
	
	cout << questions[20] << endl;
	cout << answers[20] << endl;

But somehow the result is different from the Python result.

This is what the unformatted text looks like:

L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow
L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.
L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No
L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I'm kidding. You know how sometimes you just become this "persona"? And you don't know how to quit?
L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?
L868 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ The "real you".
L867 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ What good stuff?
L866 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ I figured you'd get to the good stuff eventually.
L865 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Thank God! If I had to hear one more story about your coiffure...
L864 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Me. This endless ...blonde babble. I'm like, boring myself.
L863 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ What crap?
L862 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ do you listen to this crap?
L861 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No...
L860 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Then Guillermo says, "If you go any lighter, you're gonna look like an extra on 90210."
L699 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ You always been this selfish?
L698 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ But
L697 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Then that's all you had to say.
L696 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Well, no...
L695 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ You never wanted to go out with 'me, did you?
L694 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I was?
L693 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ I looked for you back at the party, but you always seemed to be "occupied".
L663 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Tons

Edit: I think I need to sort the lines of the .txt file (I guess natural sort is what I need).
Something like:

...
L0
L1
L2
L3
...
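
Something like this should do for the natural sort (untested sketch, assuming every line looks like "L<number> +++$+++ ..."; needs <algorithm>):

// untested sketch: sort linesOfTheFile by the numeric line id after the "L"
std::sort(linesOfTheFile.begin(), linesOfTheFile.end(),
	[](const std::string & a, const std::string & b){
		long idA = std::stol(a.substr(1, a.find(' ') - 1));
		long idB = std::stol(b.substr(1, b.find(' ') - 1));
		return idA < idB;
	});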

I am one step closer with this:

	vector<string> questions;
	vector<string> answers;
	vector<string> linesOfTheFile;
	ofBuffer buffer = ofBufferFromFile("movie_lines.txt");
	for (auto line : buffer.getLines()) {
		linesOfTheFile.push_back(line);
	}
	// same replacements for questions and answers
	auto preprocess = [](std::string line) {
		line = ofToLower(line);
		ofStringReplace(line, "?", " ?");
		ofStringReplace(line, ".", " .");
		ofStringReplace(line, "!", " !");
		ofStringReplace(line, ",", " ,");
		ofStringReplace(line, "i'm", "i am");
		ofStringReplace(line, "he's", "he is");
		ofStringReplace(line, "she's", "she is");
		ofStringReplace(line, "it's", "it is");
		ofStringReplace(line, "that's", "that is");
		ofStringReplace(line, "what's", "what is");
		ofStringReplace(line, "where's", "where is");
		ofStringReplace(line, "how's", "how is");
		ofStringReplace(line, "'ll", " will");
		ofStringReplace(line, "'ve", " have");
		ofStringReplace(line, "'re", " are");
		ofStringReplace(line, "'d", " would");
		ofStringReplace(line, "won't", "will not");
		ofStringReplace(line, "can't", "cannot");
		ofStringReplace(line, "n't", " not");
		ofStringReplace(line, "n'g", "ng");
		ofStringReplace(line, "'bout", "about");
		return line;
	};
	for (int i = 0; i < 5000; i++) {
		vector<string> q_result = ofSplitString(linesOfTheFile[i * 2 + 1], " +++$+++ ");
		questions.push_back(preprocess(q_result[4]));

		vector<string> a_result = ofSplitString(linesOfTheFile[i * 2], " +++$+++ ");
		answers.push_back(preprocess(a_result[4]));
	}

	cout << questions[18] << endl;
	cout << answers[18] << endl;

I get the same print result as the Python example's questions[20] / answers[20].
So there is still something wrong and/or missing, MAX_SAMPLES for example…
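
One thing that is probably still wrong: load_conversations() in the notebook does not pair consecutive lines of movie_lines.txt, it pairs the line ids listed in movie_conversations.txt (parts[3]). A rough, untested sketch of that pairing in OF, reusing the preprocessSentence() sketch from further up (needs <unordered_map>):

// untested sketch: build question / answer pairs via movie_conversations.txt,
// like load_conversations() in the notebook
std::unordered_map<std::string, std::string> id2line; // line id -> text
ofBuffer lineBuffer = ofBufferFromFile("movie_lines.txt");
for (auto line : lineBuffer.getLines()) {
	std::vector<std::string> parts = ofSplitString(line, " +++$+++ ");
	if (parts.size() >= 5) {
		id2line[parts[0]] = parts[4];
	}
}

std::vector<std::string> questions, answers;
const std::size_t maxSamples = 5000; // MAX_SAMPLES
ofBuffer convBuffer = ofBufferFromFile("movie_conversations.txt");
for (auto line : convBuffer.getLines()) {
	std::vector<std::string> parts = ofSplitString(line, " +++$+++ ");
	if (parts.size() < 4) { continue; }
	// parts[3] looks like ['L194', 'L195', 'L196']: strip brackets and quotes
	std::string list = parts[3].substr(1, parts[3].size() - 2);
	std::vector<std::string> ids;
	for (auto & id : ofSplitString(list, ", ")) {
		ids.push_back(id.substr(1, id.size() - 2));
	}
	for (std::size_t i = 0; i + 1 < ids.size(); i++) {
		questions.push_back(preprocessSentence(id2line[ids[i]]));
		answers.push_back(preprocessSentence(id2line[ids[i + 1]]));
		if (questions.size() >= maxSamples) { break; }
	}
	if (questions.size() >= maxSamples) { break; }
}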

What I actually want to do: play a video with subtitles, generate the "answer" with the chatbot from the current subtitle, find the most similar subtitle to the chatbot's "answer" and choose it as the next subtitle (and jump to the corresponding video position). The subtitle similarity and video playback are already working: Universal sentence encoder example made with ofxTensorFlow2 - #2 by Jona

The chatbot does work now (not perfect, but anyhow…): ofxTensorFlow2/example_chatbot at example_chatbot · Jonathhhan/ofxTensorFlow2 · GitHub

For training, I use this branch, with small changes in dataset.py for a bigger dictionary size and for saving the dictionary (the dictionary is needed in the example): GitHub - Jonathhhan/tf2-transformer-chatbot at ofxTensorFlow2
I trained the model with:

python train.py --output_dir runs/save_model --batch_size 8192 --epochs 500 --max_samples 221617 --max_length 15

And I tried to simulate the behaviour of the TensorFlow SubwordTextEncoder (tfds.deprecated.text.SubwordTextEncoder | TensorFlow Datasets) with some additional preprocessing on top of GitHub - Gavin-Development/CPP-SubwordTextEncoder: Based on the SubwordTextEncoder From Tensorflow Datasets, but I implemented it in CPP. https://www.tensorflow.org/datasets/api_docs/python/tfds/deprecated/text/SubwordTextEncoder I use the vocabulary created with TensorFlow, otherwise the result is wrong.

Edit: I wonder how to improve the training result, and maybe how to fix some issues with the subword encoder.