{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fccf4b1c-8e01-4edf-9c31-8d0917dbe2a4",
   "metadata": {},
   "source": [
    "# RNNs\n",
    "\n",
    "Before we start taking a look at the inner mechanics of recurrent neural networks (RNNs) we will take a look at our data and see where apporaches like CNN or Autoencoders lag and why we therefore introduce yet another type.\n",
    "\n",
    "## Motivation\n",
    "\n",
    "It is worth comparing the entities of digital data.\n",
    "For this example we want to compare text and image data.\n",
    "\n",
    "Image | Text | Note\n",
    "--- | --- | ---\n",
    "Continous values: Colors expressed via RGB are continous, we can morph from green to blue | The meaning of car and bicycle are not continous but discrete | This is important for our loss function which helps our neural net to move into the proper direction via gradients. \n",
    "Two dimensional data (plus color) - we can often mirror the image w/o loosing semantics | Single dimension in *time* - the sequence of words is really important | The *temporal* dependency of text is vastly important - this can also lead way back into the past, e.g. the name of a protagonist of a story\n",
    "A single pixel rarely changes the whole semantics of a picture | Changing a single word (e.g. negation) can vastly change the meaning of a sentence | Our neural network needs to be robust and needs to avoid producing noise\n",
    "The structure of the data is given by the contrast of an image | The structure of a sentence is also given by its grammar which often relies on abstract notions such as past and future | \n",
    "A digital image has a fixed image size | The length of a text or sentence is not fixed and depends on its content |\n",
    "\n",
    "Although both kind of data are represented by bytes their formal structure differ a lot and therefore we need a new way to tackle data that has not a fixed size, like a text or a musical composition.\n",
    "\n",
    "Representing semeantics of text is a very tedious and complex task in computer science whose research is called [natural language processing (NLP)](https://en.wikipedia.org/wiki/Computational_linguistics).\n",
    "Neural networks also accelerated the progression in this domain and are used for translation of languages or sentiment analysis (is the text written in a positive or negative tone).\n",
    "\n",
    "To tackle this stream of text data we will use recurrent neural networks (RNNs) which have a feedback loop inside itself which allows for recursive calls on data and hencefore for a variable length of input and output.\n",
    "\n",
    "![RNN](rnn.svg)\n",
    "\n",
    "We often also illustrtate the layers in a more verbose, *unrolled* way.\n",
    "In this example our input data $X$ is arranged in such a way that for each 5 samples in a row and the RNN will output a single sample.\n",
    "Think of this as we show the RNN 5 words and want to know which one should be the next one.\n",
    "\n",
    "![RNN unrolled](rnn-unrolled.svg)\n",
    "\n",
    "There are different architectures of how this feedback loop is achived, one of the most used ones is [ Long short-term memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory) which we will also use in this notebook.\n",
    "\n",
    "As training data we will use *Der Proceß* by Franz Kafka which is available in txt format at [wikisource](https://de.wikisource.org/wiki/Der_Prozess)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e46404c-1974-41d1-88e2-d9859f896f19",
   "metadata": {},
   "source": [
    "## Pre-processing\n",
    "\n",
    "As always we need to talk about the pre-processing of our data.\n",
    "We want to train the RNN to write some Kafka-like texts but neural networks rely on mathematical notations and deviations which do not work on texts directly.\n",
    "Therefore we need to make a transition of words to numbers and there are 2 trivial approaches for this:\n",
    "\n",
    "* Word tokenization: Each word is assigned a token (number) is henceforth representated by this token\n",
    "* Character tokenization: Same as word tokenization but instead of words it uses chars\n",
    "\n",
    "This allows us to transform the text into a line of numbers (vector) on which we can work on as before.\n",
    "\n",
    "Both ways have advantages and disadvantages and we will try out both ways, although *Der Proceß* is not a really long book which may be not enough for \n",
    "Also is there a problem that german language conjugates words and makes it more difficult for the neural network to train on these as the token is different or we have to add a lot of work beforehand so our tokenization works well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "534c34ea-5b89-4905-b594-0b3fca30d465",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import tensorflow as tf\n",
    "from tensorflow import keras\n",
    "from tensorflow.keras.preprocessing.text import Tokenizer\n",
    "from tensorflow.keras.utils import to_categorical"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f9ac2af",
   "metadata": {},
   "source": [
    "Lets start by taking a look at the text file if e.g. the encoding is properly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e8b35e4-9685-4d7d-b261-9bf9d146c032",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Der Prozess\n",
      "\n",
      "\n",
      "\t\t\t\t\tFranz Kafka\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Die Schmiede, Berlin, 1925\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Exportiert aus Wikisource am 29. November 2021\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "[I] FRANZ KAFKA\n",
      "\n",
      "\n",
      "\n",
      "* * *\n",
      "\n",
      "\n",
      "\n",
      "Der Prozess\n",
      "\n",
      "ROMAN\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "VERLAG DIE SCHMIEDE\n",
      "\n",
      "BERLIN\n",
      "\n",
      "1925\n",
      "\n",
      "[II] EINBANDENTWURF GEORG SALTER · BERLIN\n",
      "\n",
      "COPYRIGHT 1925 BY VERLAG DIE SCHMIEDE · BERLIN\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Inhalt\n",
      "\n",
      "\n",
      "Erstes Kapitel\n",
      "\n",
      "Zweites Kapitel\n",
      "\n",
      "Drittes Kapitel\n",
      "\n",
      "Viertes Kapitel\n",
      "\n",
      "Fünftes Kapitel\n",
      "\n",
      "Sechstes Kapitel\n",
      "\n",
      "Siebentes Kapitel\n",
      "\n",
      "Achtes Kapitel\n",
      "\n",
      "Neuntes Kapitel\n",
      "\n",
      "Zehntes Kapitel\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Sekundärliteratur\n",
      "\n",
      "\n",
      "Kurt Tucholsky: Der Prozeß. In. Die Weltbühne. Jahrgang 22, Nummer 10, Seite 383–386\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "[1] ERSTES KAPITEL\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "VERHAFTUNG · GESPRÄCH MIT FRAU GRUBACH · DANN FRÄULEIN BÜRSTNER\n",
      "\n",
      "\n",
      "\n",
      "Jemand mußte Josef K. verleumdet haben, denn ohne daß er etwas Böses getan hätte, wurde er eines Morgens verhaftet. Die Köchin der Frau Grubach, seiner Zimmervermieterin, die ihm jeden Tag gegen acht Uhr früh das Frühstück brachte, kam diesmal nicht. Das war noch niemals geschehen. K. wartete noch ein Weilchen, \n"
     ]
    }
   ],
   "source": [
    "text_file_path = \"./Der_Prozess.txt\"\n",
    "\n",
    "with open(text_file_path, 'r') as f:\n",
    "    text = f.read()\n",
    "    \n",
    "print(text[0:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6892ad8-5155-474f-8689-2a1f1f15fea6",
   "metadata": {},
   "source": [
    "We will not get into too much of pre-processing here but will mention [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) are a really great tool to verify or sanitize textstructures.\n",
    "\n",
    "We only want to perform a small analysis on the text what the most used words are - to split these properly we will use regular expressions as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fbc5cc7-a6c8-46a3-8a4b-4096a616a509",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "20 most used words in Proceß\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "der      1685\n",
       "und      1606\n",
       "die      1557\n",
       "er       1393\n",
       "zu       1052\n",
       "nicht     980\n",
       "den       853\n",
       "K.        850\n",
       "sich      839\n",
       "es        797\n",
       "in        756\n",
       "das       650\n",
       "sagte     643\n",
       "ich       640\n",
       "sie       578\n",
       "aber      575\n",
       "Sie       564\n",
       "daß       553\n",
       "mit       544\n",
       "dem       512\n",
       "dtype: int64"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(\"20 most used words in Proceß\")\n",
    "pd.Series(re.sub(r\"s{2,}\", \" \", text).replace(\"\\n\", \" \").replace(\"\\t\", \" \").split(\" \")).value_counts().head(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "850c14ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "# text = text.lower()\n",
    "text = text.replace('\\n', ' ')\n",
    "text = re.sub('  +', '. ', text).strip()\n",
    "text = text.replace('..', '.')\n",
    "\n",
    "text = re.sub('([!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~])', r' \\1 ', text)\n",
    "text = re.sub('\\s{2,}', ' ', text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d306936-e7ce-4919-8faf-e18c91a4af9b",
   "metadata": {},
   "source": [
    "### Create tokens\n",
    "\n",
    "For now we will start with the char level approach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2792d11c-57d7-494f-9139-c359f1edbc63",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of found chars: 92\n"
     ]
    }
   ],
   "source": [
    "char_tokenizer = Tokenizer(lower=False, filters='', char_level=True)\n",
    "char_tokenizer.fit_on_texts([text])\n",
    "num_chars = len(char_tokenizer.word_index)+1\n",
    "print(f\"Number of found chars: {num_chars}\")\n",
    "char_token_list = char_tokenizer.texts_to_sequences([text])[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a65a7b02",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "469649"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(char_token_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "caa9d6e0-55cf-4af7-b3d6-98a0ffda508b",
   "metadata": {},
   "source": [
    "We will now need to transform this list of tokens into two vectors $X, y$ where $X$  is a series of $n$ tokens and $y$ is the vector with the perceeding tokens for these $n$ tokens."
   ]
  },
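  {
   "cell_type": "markdown",
   "id": "b7e1d2a4-windowing-example",
   "metadata": {},
   "source": [
    "As a small worked sketch of this windowing, consider the tokens $(1, 2, 3, 4, 5, 6)$ with $n = 3$ and a step of 1; the toy token list and the value of $n$ are made up for illustration and are not taken from the book.\n",
    "Each window of $n$ tokens becomes a row of $X$ and the token right after the window becomes the matching entry of $y$:\n",
    "\n",
    "$$X = \\begin{pmatrix} 1 & 2 & 3 \\\\ 2 & 3 & 4 \\\\ 3 & 4 & 5 \\end{pmatrix}, \\qquad y = \\begin{pmatrix} 4 \\\\ 5 \\\\ 6 \\end{pmatrix}$$\n",
    "\n",
    "With a step of 1, a list of $N$ tokens yields $N - n$ such pairs, here $6 - 3 = 3$."
   ]
  },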
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3461de91-3e1e-4276-8f35-2a17fd359d08",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "\n",
    "def generate_sequences(token_list: List[int], step: int, seq_length: int, num_classes: int):\n",
    "    X = []\n",
    "    y = []\n",
    "    \n",
    "    for i in range(0, len(token_list) - seq_length, step):\n",
    "        X.append(token_list[i:i+seq_length])\n",
    "        y.append(token_list[i+seq_length])\n",
    "    \n",
    "    y = to_categorical(y, num_classes=num_classes, dtype=np.int16)\n",
    "    return np.array(X, dtype=np.int16), np.array(y, dtype=np.int16)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15e2a5cc-fae9-4e05-88d9-9522ed73784a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X:  (469549, 100) \ty:  (469549, 92)\n"
     ]
    }
   ],
   "source": [
    "X, y = generate_sequences(char_token_list, step=1, seq_length=100, num_classes=num_chars)\n",
    "\n",
    "print(\"X: \", X.shape, \"\\ty: \", y.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18890fd0",
   "metadata": {},
   "source": [
    "Note that $y$ does not output just a single number but the number of tokens that are available which is called [one hot encoding](https://en.wikipedia.org/wiki/One-hot).\n",
    "This has the advantage that we get the probability of each token instead of just the next token so it allows us to not always take the most probable next token but to deviate from it.\n",
    "\n",
    "We can take a look at the first sample from $X$ and $y$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a879ebdf-4380-47d9-841a-34544b0683bd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([38,  2,  5,  1, 48,  5, 17, 22,  2,  7,  7,  1, 21,  1, 40,  5,  8,\n",
       "        3, 22,  1, 29,  8, 20, 23,  8,  1, 21,  1, 38,  4,  2,  1, 28, 11,\n",
       "        9, 15,  4,  2, 10,  2,  1, 16,  1, 36,  2,  5, 13,  4,  3,  1, 16,\n",
       "        1, 57, 64, 59, 65,  1, 21,  1, 34, 74, 33, 17,  5,  6,  4,  2,  5,\n",
       "        6,  1,  8, 12,  7,  1, 39,  4, 23,  4,  7, 17, 12,  5, 11,  2,  1,\n",
       "        8, 15,  1, 59, 64,  1, 21,  1, 55, 17, 25,  2, 15, 18,  2])"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X[0, :]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d070539",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
       "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
       "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
       "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
       "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
       "       0., 0., 0., 0., 0., 0., 0.], dtype=float32)"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y[0, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "676622d6-44ae-47c6-8622-9704b98a6301",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Building the LSTM network\n",
    "\n",
    "Now we can start building the network.\n",
    "Before we feed the tokens into the LSTM cell we will use an embedding.\n",
    "This allows the network to self-interpret the meaning of each token in its own learnable space."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd79b296-c6c4-47aa-8d5f-671fce76ad8c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"char_rnn\"\n",
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "input_2 (InputLayer)         [(None, None)]            0         \n",
      "_________________________________________________________________\n",
      "embedding_1 (Embedding)      (None, None, 100)         9200      \n",
      "_________________________________________________________________\n",
      "lstm_1 (LSTM)                (None, 256)               365568    \n",
      "_________________________________________________________________\n",
      "dense_1 (Dense)              (None, 92)                23644     \n",
      "=================================================================\n",
      "Total params: 398,412\n",
      "Trainable params: 398,412\n",
      "Non-trainable params: 0\n",
      "_________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "from keras import layers\n",
    "from keras.models import Model\n",
    "from keras.optimizer_v2.rmsprop import RMSprop\n",
    "\n",
    "n_units = 256\n",
    "embedding_size = 100\n",
    "\n",
    "text_in = layers.Input(shape=(None,))\n",
    "x = layers.Embedding(num_chars, embedding_size,)(text_in)\n",
    "x = layers.LSTM(n_units)(x)\n",
    "# x = layers.Dropout(0.2)(x)\n",
    "text_out = layers.Dense(num_chars, activation='softmax')(x)\n",
    "\n",
    "char_model = Model(text_in, text_out, name=\"char_rnn\")\n",
    "\n",
    "char_model.compile(\n",
    "    # note that we use the same loss as with MNIST\n",
    "    # which is used when we want to learn a\n",
    "    # probability distribution\n",
    "    loss=keras.losses.CategoricalCrossentropy(),\n",
    "    optimizer=RMSprop(learning_rate=0.001)\n",
    ")\n",
    "\n",
    "char_model.summary()"
   ]
  },
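  {
   "cell_type": "markdown",
   "id": "c4d9e8f1-parameter-count",
   "metadata": {},
   "source": [
    "The parameter counts in the summary above can be checked by hand.\n",
    "The following is a sketch that assumes the usual Keras layer layouts: an embedding table with one vector per token, an LSTM with four gates that each have input weights, recurrent weights and a bias, and a dense layer with a bias.\n",
    "The symbols are shorthand introduced here: $V = 92$ is the number of tokens, $e = 100$ the embedding size and $u = 256$ the number of LSTM units.\n",
    "\n",
    "$$\\begin{aligned} \\text{Embedding:} \\quad & V \\cdot e = 92 \\cdot 100 = 9200 \\\\ \\text{LSTM:} \\quad & 4 \\cdot (u \\cdot (e + u) + u) = 4 \\cdot (256 \\cdot 356 + 256) = 365568 \\\\ \\text{Dense:} \\quad & u \\cdot V + V = 256 \\cdot 92 + 92 = 23644 \\end{aligned}$$\n",
    "\n",
    "Together this gives $9200 + 365568 + 23644 = 398412$ trainable parameters, which matches the total reported by the summary."
   ]
  },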
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6b1196d-44fc-4471-92a2-705e0e843743",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 1/15\n",
      "3669/3669 [==============================] - 65s 17ms/step - loss: 1.7804\n",
      "Epoch 2/15\n",
      "3669/3669 [==============================] - 64s 18ms/step - loss: 1.3451\n",
      "Epoch 3/15\n",
      "3669/3669 [==============================] - 66s 18ms/step - loss: 1.2244\n",
      "Epoch 4/15\n",
      "3669/3669 [==============================] - 67s 18ms/step - loss: 1.1601\n",
      "Epoch 5/15\n",
      "3669/3669 [==============================] - 68s 18ms/step - loss: 1.1179\n",
      "Epoch 6/15\n",
      "3669/3669 [==============================] - 68s 19ms/step - loss: 1.0868\n",
      "Epoch 7/15\n",
      "3669/3669 [==============================] - 68s 19ms/step - loss: 1.0620\n",
      "Epoch 8/15\n",
      "3669/3669 [==============================] - 68s 19ms/step - loss: 1.0407\n",
      "Epoch 9/15\n",
      "3669/3669 [==============================] - 68s 19ms/step - loss: 1.0231\n",
      "Epoch 10/15\n",
      "3669/3669 [==============================] - 68s 19ms/step - loss: 1.0078\n",
      "Epoch 11/15\n",
      "3669/3669 [==============================] - 69s 19ms/step - loss: 0.9941\n",
      "Epoch 12/15\n",
      "3669/3669 [==============================] - 69s 19ms/step - loss: 0.9818\n",
      "Epoch 13/15\n",
      "3669/3669 [==============================] - 69s 19ms/step - loss: 0.9703\n",
      "Epoch 14/15\n",
      "3669/3669 [==============================] - 68s 19ms/step - loss: 0.9602\n",
      "Epoch 15/15\n",
      "3669/3669 [==============================] - 68s 19ms/step - loss: 0.9501\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x7f88011bc410>"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "char_model.fit(X, y, epochs=15, batch_size=128, shuffle=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0467e09d",
   "metadata": {},
   "source": [
    "Now we can write a function which continues to write a text on a given input.\n",
    "We will use *temperature* to determine how much we will obey the probability distribution returned by the neural network.\n",
    "A low temperature will only allow the most likely candidates and a high temperature will also consider more unlikely candidates."
   ]
  },
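  {
   "cell_type": "markdown",
   "id": "d2f6a0b3-temperature-formula",
   "metadata": {},
   "source": [
    "One way to make the effect of the temperature precise is the following sketch. The symbols $p_i$ (the probability the network assigns to token $i$) and $T$ (the temperature) are notation assumed here for illustration.\n",
    "Dividing the log-probabilities by $T$ and normalizing again gives the rescaled distribution\n",
    "\n",
    "$$p_i^{(T)} = \\frac{\\exp(\\log(p_i) / T)}{\\sum_j \\exp(\\log(p_j) / T)} = \\frac{p_i^{1/T}}{\\sum_j p_j^{1/T}}$$\n",
    "\n",
    "For example, with $p = (0.5, 0.3, 0.2)$ and $T = 0.5$ the entries are squared and renormalized, which gives approximately $(0.66, 0.24, 0.11)$, so the most likely token becomes even more dominant.\n",
    "For $T = 1$ the distribution is unchanged, and for large $T$ it approaches the uniform distribution."
   ]
  },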
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "638f6737-77e1-40bd-96fb-3c3fbfb463f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def sample_with_temp(preds, temp:float=1.0):\n",
    "    preds = np.asarray(preds).astype('float64')\n",
    "    preds = np.log(preds)/temp\n",
    "    exp_preds = np.exp(preds)\n",
    "    preds = exp_preds/np.sum(exp_preds)\n",
    "    probs = np.random.multinomial(1, preds, 1)\n",
    "    return np.argmax(probs)\n",
    "\n",
    "def generate_text(seed_text: str, next_tokens: int, model: keras.Model, tokenizer: Tokenizer, max_sequence_len: int, temp: float, char_mode: bool = False):\n",
    "    output_text = seed_text\n",
    "    \n",
    "    for _ in range(next_tokens):\n",
    "        token_list = tokenizer.texts_to_sequences([seed_text])[0]\n",
    "        token_list = token_list[-max_sequence_len:]\n",
    "        token_list = np.reshape(token_list, (1, max_sequence_len))\n",
    "        \n",
    "        probs = model.predict(token_list, verbose=0)[0]\n",
    "        y_class = sample_with_temp(probs, temp)\n",
    "        \n",
    "        output_token = tokenizer.index_word[y_class] if y_class > 0 else ''\n",
    "        \n",
    "        if char_mode:\n",
    "            seed_text += output_token\n",
    "            output_text += output_token\n",
    "        else:\n",
    "            seed_text += output_token + \" \"\n",
    "            output_text += output_token + \" \"\n",
    "        \n",
    "    return output_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e238f01-9456-4e51-beb8-6ac20f4545fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "temp 0.1\n",
      "K. fragte sich , „daß es mir nicht verstehn . Da sie sind , daß er das Gesetz , der ihm das Gesetzes , der sich an der Schreibweise fol\n",
      "\n",
      "temp 0.2\n",
      "K. fragte sich , „ich habe das Gesicht , die er sich nicht , daß er ein wenig bereitete . Der Mann kann , daß es sich daran , daß er da\n",
      "\n",
      "temp 0.30000000000000004\n",
      "K. fragte sich , während des Advokat , der schon den Kopf , denn es war nicht , daß er sich so genug , so sehr sich ganz gerade das Frä\n",
      "\n",
      "temp 0.4\n",
      "K. fragte sich , „ich habe dir die Schultern ausgeschlossen , daß es nicht mehr sehr gut , daß dieser Teller , der seinen Augen . „Was \n",
      "\n",
      "temp 0.5\n",
      "K. fragte sich selbst erwarten wird . Ich werde Ihnen . Gewiß , daß er sich in der getragendes Betwenden , so war es den Angeklagten ha\n",
      "\n",
      "temp 0.6\n",
      "K. fragte sich . „Auf den Willer . Da kommt er machte , als er es wieder vorhandene Gedanken geben , als sie geschehen . Da ich nicht s\n",
      "\n",
      "temp 0.7000000000000001\n",
      "K. fragte sich . „Es ist nicht unwichtig übern mit dem Kaufmann , als sie zum Kreis werden , es war ihm immer weiter . Glaubst dunkel ,\n",
      "\n",
      "temp 0.8\n",
      "K. fragte sich , daß er sogar noch darin offen , er brauchte vor allem zur Ersache , weil er sich nicht verstand , aber K . größer . Si\n",
      "\n",
      "temp 0.9\n",
      "K. fragte sich immer war . Zu damit den Zustand laut gekrämtlte . K . gerade zu bemerkte . K . bin die rüde , bei dieser leicht weiter \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for temp in np.arange(0.1, 1.0, 0.1):\n",
    "    print(f\"temp {temp}\")\n",
    "    print(generate_text(\n",
    "        seed_text=\"K. fragte sich \",\n",
    "        next_tokens=120,\n",
    "        model=char_model,\n",
    "        max_sequence_len = 14,\n",
    "        tokenizer=char_tokenizer,\n",
    "        temp=temp,\n",
    "        char_mode=True,\n",
    "    ))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c64cdc8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ich gab ihm einen bissen brot und ließ es ihn essen. darüber ließe sich viel sagen. [ hein nicht , daß er das Gesetzes , der sich an dem Fensterkanz schon verstand . Sie sind mir an , daß es nicht aufges\n"
     ]
    }
   ],
   "source": [
    "print(generate_text(\n",
    "    seed_text=\"Ich gab ihm einen Bissen Brot und ließ es ihn essen. Darüber ließe sich viel sagen.\".lower(),\n",
    "    next_tokens=120,\n",
    "    model=char_model,\n",
    "    max_sequence_len=16,\n",
    "    tokenizer=char_tokenizer,\n",
    "    temp=0.2,\n",
    "    char_mode=True,\n",
    "))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49375556-37af-4639-9cbe-7327e230205d",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Model on word basis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5001130f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of found words: 8675\n"
     ]
    }
   ],
   "source": [
    "word_tokenizer = Tokenizer(lower=True, char_level=False,)\n",
    "word_tokenizer.fit_on_texts([text])\n",
    "num_words = len(word_tokenizer.word_index)+1\n",
    "print(f\"Number of found words: {num_words}\")\n",
    "word_token_list = word_tokenizer.texts_to_sequences([text])[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "557cb075",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X:  (469629, 20) \ty:  (469629, 8675)\n"
     ]
    }
   ],
   "source": [
    "X_word, y_word = generate_sequences(char_token_list, step=1, seq_length=20, num_classes=num_words)\n",
    "\n",
    "print(\"X: \", X_word.shape, \"\\ty: \", y_word.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec143993-3e13-43ce-a0a4-4d9dcd9564e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from keras import layers\n",
    "from keras.models import Model\n",
    "from keras.optimizer_v2.rmsprop import RMSprop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28a8d778",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"word_rnn\"\n",
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "input_1 (InputLayer)         [(None, None)]            0         \n",
      "_________________________________________________________________\n",
      "embedding (Embedding)        (None, None, 100)         867500    \n",
      "_________________________________________________________________\n",
      "lstm (LSTM)                  (None, 256)               365568    \n",
      "_________________________________________________________________\n",
      "dropout (Dropout)            (None, 256)               0         \n",
      "_________________________________________________________________\n",
      "dense (Dense)                (None, 8675)              2229475   \n",
      "=================================================================\n",
      "Total params: 3,462,543\n",
      "Trainable params: 3,462,543\n",
      "Non-trainable params: 0\n",
      "_________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "n_units = 256\n",
    "embedding_size = 100\n",
    "\n",
    "text_in = layers.Input(shape=(None,))\n",
    "x = layers.Embedding(num_words, embedding_size,)(text_in)\n",
    "x = layers.LSTM(n_units)(x)\n",
    "x = layers.Dropout(0.2)(x)\n",
    "text_out = layers.Dense(num_words, activation='softmax')(x)\n",
    "\n",
    "word_model = Model(text_in, text_out, name=\"word_rnn\")\n",
    "\n",
    "word_model.compile(\n",
    "    # note that we use the same loss as with MNIST\n",
    "    # which is used when we want to learn a\n",
    "    # probability distribution\n",
    "    loss=keras.losses.CategoricalCrossentropy(),\n",
    "    optimizer=RMSprop(learning_rate=0.001)\n",
    ")\n",
    "\n",
    "word_model.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a54ad161",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 1/15\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3669/3669 [==============================] - 33s 8ms/step - loss: 2.0207\n",
      "Epoch 2/15\n",
      "3669/3669 [==============================] - 28s 8ms/step - loss: 1.5336\n",
      "Epoch 3/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.3965\n",
      "Epoch 4/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.3252\n",
      "Epoch 5/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.2803\n",
      "Epoch 6/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.2490\n",
      "Epoch 7/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.2247\n",
      "Epoch 8/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.2038\n",
      "Epoch 9/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.1880\n",
      "Epoch 10/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.1734\n",
      "Epoch 11/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.1612\n",
      "Epoch 12/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.1523\n",
      "Epoch 13/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.1433\n",
      "Epoch 14/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.1360\n",
      "Epoch 15/15\n",
      "3669/3669 [==============================] - 29s 8ms/step - loss: 1.1301\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x7fbb810ff310>"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_model.fit(X_word, y_word, epochs=15, batch_size=128, shuffle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bae073d0-8f87-40bd-98d6-89194190c13b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "temp 0.1\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:3: RuntimeWarning: divide by zero encountered in log\n",
      "  This is separate from the ipykernel package so we can avoid doing imports until\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Nachdem K. aufgestanden war fragte er sich zu der sagte der von der nicht k in sie und der es und “ der so den daß nicht \n",
      "\n",
      "temp 0.2\n",
      "Nachdem K. aufgestanden war fragte er sich zu der sagte der von der nicht k in sie und der es und “ der eine und es und \n",
      "\n",
      "temp 0.30000000000000004\n",
      "Nachdem K. aufgestanden war fragte er sich zu der sagte der von der nicht k in sie und der es und “ der eine und daß k \n",
      "\n",
      "temp 0.4\n",
      "Nachdem K. aufgestanden war fragte er sich zu der sagte der von der nicht k in sie und der sagte der es und “ der nicht er \n",
      "\n",
      "temp 0.5\n",
      "Nachdem K. aufgestanden war fragte er sich zu der sagte der von der nicht k in sie und der nicht sie und “ nicht sie und zu \n",
      "\n",
      "temp 0.6\n",
      "Nachdem K. aufgestanden war fragte er sich zu der sagte der die k zu das der und “ der nicht er sich zu der er zu “ \n",
      "\n",
      "temp 0.7000000000000001\n",
      "Nachdem K. aufgestanden war fragte er sich zu und “ der war und “ und er sie er in und der an sie und ich ich den \n",
      "\n",
      "temp 0.8\n",
      "Nachdem K. aufgestanden war fragte er sich zu der zu und “ k die ist den nicht sie den daß “ und “ war und “ der \n",
      "\n",
      "temp 0.9\n",
      "Nachdem K. aufgestanden war fragte er sich ein der sagte der mit als zu “ und die der dem war und “ ich und nicht sich zu \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for temp in np.arange(0.1, 1.0, 0.1):\n",
    "    print(f\"temp {temp}\")\n",
    "    print(generate_text(\n",
    "        seed_text=\"Nachdem K. aufgestanden war fragte er sich \",\n",
    "        next_tokens=20,\n",
    "        model=word_model,\n",
    "        max_sequence_len = 7,\n",
    "        tokenizer=word_tokenizer,\n",
    "        temp=temp,\n",
    "    ))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75480084",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "environment": {
   "kernel": "python3",
   "name": "tf2-gpu.2-6.m87",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-6:m87"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}