{"cells": [{"cell_type": "markdown", "id": "322e7875", "metadata": {}, "source": ["# Segmenting a text file: exercises\n", "\n", "Prerequisites:\n", "* {doc}`e_r2b_AUTO`\n", "\n", "Read the file {download}`www.w3schools.com_python_python_strings_methods.txt`:"]}, {"cell_type": "markdown", "id": "300cdea3", "metadata": {}, "source": ["## Load and prepare the dataset"]}, {"cell_type": "code", "execution_count": 2, "id": "149f54d4", "metadata": {}, "outputs": [], "source": ["# psm: python_strings_methods\n", "with open(\"www.w3schools.com_python_python_strings_methods.txt\") as my_file:\n", "    psm_text = my_file.read()\n", "psm_text_zeilen = psm_text.splitlines()\n", "#psm_text_zeilen"]}, {"cell_type": "code", "execution_count": 3, "id": "e2848215", "metadata": {}, "outputs": [{"data": {"text/plain": ["[['capitalize()', 'Converts the first character to upper case'],\n", " ['casefold()', 'Converts string into lower case'],\n", " ['center()', 'Returns a centered string'],\n", " ['count()',\n", " 'Returns the number of times a specified value occurs in a string'],\n", " ['encode()', 'Returns an encoded version of the string'],\n", " ['endswith()', 'Returns true if the string ends with the specified value'],\n", " ['expandtabs()', 'Sets the tab size of the string'],\n", " ['find()',\n", " 'Searches the string for a specified value and returns the position of where it was found'],\n", " ['format()', 'Formats specified values in a string'],\n", " ['format_map()', 'Formats specified values in a string'],\n", " ['index()',\n", " 'Searches the string for a specified value and returns the position of where it was found'],\n", " ['isalnum()',\n", " 'Returns True if all characters in the string are alphanumeric'],\n", " ['isalpha()',\n", " 'Returns True if all characters in the string are in the alphabet'],\n", " ['isascii()',\n", " 'Returns True if all characters in the string are ascii characters'],\n", " ['isdecimal()', 'Returns True if all characters in the string are 
decimals'],\n", " ['isdigit()', 'Returns True if all characters in the string are digits'],\n", " ['isidentifier()', 'Returns True if the string is an identifier'],\n", " ['islower()', 'Returns True if all characters in the string are lower case'],\n", " ['isnumeric()', 'Returns True if all characters in the string are numeric'],\n", " ['isprintable()',\n", " 'Returns True if all characters in the string are printable'],\n", " ['isspace()', 'Returns True if all characters in the string are whitespaces'],\n", " ['istitle() ', 'Returns True if the string follows the rules of a title'],\n", " ['isupper()', 'Returns True if all characters in the string are upper case'],\n", " ['join()', 'Joins the elements of an iterable to the end of the string'],\n", " ['ljust()', 'Returns a left justified version of the string'],\n", " ['lower()', 'Converts a string into lower case'],\n", " ['lstrip()', 'Returns a left trim version of the string'],\n", " ['maketrans()', 'Returns a translation table to be used in translations'],\n", " ['partition()',\n", " 'Returns a tuple where the string is parted into three parts'],\n", " ['replace()',\n", " 'Returns a string where a specified value is replaced with a specified value'],\n", " ['rfind()',\n", " 'Searches the string for a specified value and returns the last position of where it was found'],\n", " ['rindex()',\n", " 'Searches the string for a specified value and returns the last position of where it was found'],\n", " ['rjust()', 'Returns a right justified version of the string'],\n", " ['rpartition()',\n", " 'Returns a tuple where the string is parted into three parts'],\n", " ['rsplit()',\n", " 'Splits the string at the specified separator, and returns a list'],\n", " ['rstrip()', 'Returns a right trim version of the string'],\n", " ['split()',\n", " 'Splits the string at the specified separator, and returns a list'],\n", " ['splitlines()', 'Splits the string at line breaks and returns a list'],\n", " ['startswith()',\n", " 
'Returns true if the string starts with the specified value'],\n", " ['strip()', 'Returns a trimmed version of the string'],\n", " ['swapcase()', 'Swaps cases, lower case becomes upper case and vice versa'],\n", " ['title()', 'Converts the first character of each word to upper case'],\n", " ['translate()', 'Returns a translated string'],\n", " ['upper()', 'Converts a string into upper case'],\n", " ['zfill()',\n", " 'Fills the string with a specified number of 0 values at the beginning']]"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["psm_text_tokens = []\n", "\n", "for zeile in psm_text_zeilen[1:] : # the first line contains something else, skip it\n", "    if len(zeile) >= 1:\n", "        token_list = zeile.split(\"\\t\")\n", "        psm_text_tokens.append( token_list )\n", "psm_text_tokens"]}, {"cell_type": "markdown", "id": "0b0316d8", "metadata": {}, "source": ["## Exercise 1: List of all functions"]}, {"cell_type": "markdown", "id": "62b9bd0e", "metadata": {}, "source": ["Given:\n", "* Our text in the variable `psm_text_tokens`.\n", "\n", "Wanted:\n", "* A list of all functions\n", "\n", "Example:"]}, {"cell_type": "code", "execution_count": 4, "id": "4e999ca1", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["['capitalize()', 'casefold()', 'center()', 'count()', 'encode()', 'viele_sonstige_funktionen', 'swapcase()', 'title()', 'translate()', 'upper()', 'zfill()']\n"]}], "source": ["psm_fn_list = ['capitalize()', 'casefold()', 'center()', 'count()', 'encode()',\n", "    \"viele_sonstige_funktionen\",\n", "    'swapcase()', 'title()', 'translate()', 'upper()', 'zfill()']\n", "print(psm_fn_list)"]}, {"cell_type": "markdown", "id": "440d0826", "metadata": {}, "source": ["Write your own code here:"]}, {"cell_type": "code", "execution_count": 5, "id": "4e929817", "metadata": {}, "outputs": [], "source": ["# compute psm_fn_list from psm_text_tokens:\n", "psm_fn_list = [ token[0] for token in 
psm_text_tokens if len(token[0]) >= 1 ]\n", "# print(psm_fn_list): ['capitalize()', 'casefold()', 'center()', 'count()', ..."]}, {"cell_type": "code", "execution_count": 44, "id": "24c6d2fd", "metadata": {}, "outputs": [], "source": ["assert 'lower()' in psm_fn_list"]}, {"cell_type": "markdown", "id": "1acce6d9", "metadata": {}, "source": ["## Functions that start with \"is\""]}, {"cell_type": "markdown", "id": "91780977", "metadata": {}, "source": ["Given: our text in the variables\n", "* `psm_text_tokens`\n", "* `psm_fn_list`\n", "\n", "Wanted:\n", "* a list of all functions that start with \"is\":"]}, {"cell_type": "code", "execution_count": 20, "id": "a7bda494", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["['isalnum()', 'isalpha()', 'isascii()', 'isdecimal()', 'isdigit()', 'isidentifier()', 'islower()', 'isnumeric()', 'isprintable()', 'isspace()', 'istitle() ', 'isupper()']\n"]}], "source": ["psm_fn_is = [ f for f in psm_fn_list if f.startswith(\"is\") ]\n", "print(psm_fn_is)"]}, {"cell_type": "code", "execution_count": 7, "id": "609542b5", "metadata": {}, "outputs": [], "source": ["assert 'isdecimal()' in psm_fn_is"]}, {"cell_type": "markdown", "id": "d0180219", "metadata": {}, "source": ["## Representation as a dict\n", "\n", "Given:\n", "* `psm_text_zeilen`\n", "\n", "Wanted:\n", "* a representation as a dict `psm_text_dict`\n", "\n", "e.g. `psm_text_dict == {'capitalize()': 'Converts the first character to upper case',\n", " 'casefold()': 'Converts string into lower case',\n", " 'center()': 'Returns a centered string', ... 
}`\n"]}, {"cell_type": "code", "execution_count": 21, "id": "ec3ac2f6", "metadata": {}, "outputs": [{"data": {"text/plain": ["{'capitalize()': 'Converts the first character to upper case',\n", " 'casefold()': 'Converts string into lower case',\n", " 'center()': 'Returns a centered string',\n", " 'count()': 'Returns the number of times a specified value occurs in a string',\n", " 'encode()': 'Returns an encoded version of the string',\n", " 'endswith()': 'Returns true if the string ends with the specified value',\n", " 'expandtabs()': 'Sets the tab size of the string',\n", " 'find()': 'Searches the string for a specified value and returns the position of where it was found',\n", " 'format()': 'Formats specified values in a string',\n", " 'format_map()': 'Formats specified values in a string',\n", " 'index()': 'Searches the string for a specified value and returns the position of where it was found',\n", " 'isalnum()': 'Returns True if all characters in the string are alphanumeric',\n", " 'isalpha()': 'Returns True if all characters in the string are in the alphabet',\n", " 'isascii()': 'Returns True if all characters in the string are ascii characters',\n", " 'isdecimal()': 'Returns True if all characters in the string are decimals',\n", " 'isdigit()': 'Returns True if all characters in the string are digits',\n", " 'isidentifier()': 'Returns True if the string is an identifier',\n", " 'islower()': 'Returns True if all characters in the string are lower case',\n", " 'isnumeric()': 'Returns True if all characters in the string are numeric',\n", " 'isprintable()': 'Returns True if all characters in the string are printable',\n", " 'isspace()': 'Returns True if all characters in the string are whitespaces',\n", " 'istitle() ': 'Returns True if the string follows the rules of a title',\n", " 'isupper()': 'Returns True if all characters in the string are upper case',\n", " 'join()': 'Joins the elements of an iterable to the end of the string',\n", " 'ljust()': 'Returns a 
left justified version of the string',\n", " 'lower()': 'Converts a string into lower case',\n", " 'lstrip()': 'Returns a left trim version of the string',\n", " 'maketrans()': 'Returns a translation table to be used in translations',\n", " 'partition()': 'Returns a tuple where the string is parted into three parts',\n", " 'replace()': 'Returns a string where a specified value is replaced with a specified value',\n", " 'rfind()': 'Searches the string for a specified value and returns the last position of where it was found',\n", " 'rindex()': 'Searches the string for a specified value and returns the last position of where it was found',\n", " 'rjust()': 'Returns a right justified version of the string',\n", " 'rpartition()': 'Returns a tuple where the string is parted into three parts',\n", " 'rsplit()': 'Splits the string at the specified separator, and returns a list',\n", " 'rstrip()': 'Returns a right trim version of the string',\n", " 'split()': 'Splits the string at the specified separator, and returns a list',\n", " 'splitlines()': 'Splits the string at line breaks and returns a list',\n", " 'startswith()': 'Returns true if the string starts with the specified value',\n", " 'strip()': 'Returns a trimmed version of the string',\n", " 'swapcase()': 'Swaps cases, lower case becomes upper case and vice versa',\n", " 'title()': 'Converts the first character of each word to upper case',\n", " 'translate()': 'Returns a translated string',\n", " 'upper()': 'Converts a string into upper case',\n", " 'zfill()': 'Fills the string with a specified number of 0 values at the beginning'}"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["psm_text_dict = {}\n", "\n", "for zeile in psm_text_zeilen[1:] :\n", "    if len(zeile) >= 1:\n", "        token_list = zeile.split(\"\\t\")\n", "        psm_text_dict[token_list[0]] = token_list[1]\n", "# psm_text_dict: \n", "# {'capitalize()': 'Converts the first character to upper case',\n", "# 'casefold()': 
'Converts string into lower case', ...\n", "psm_text_dict"]}, {"cell_type": "code", "execution_count": 22, "id": "2cd5ea71", "metadata": {}, "outputs": [{"data": {"text/plain": ["{'capitalize()': 'Converts the first character to upper case',\n", " 'casefold()': 'Converts string into lower case',\n", " 'center()': 'Returns a centered string',\n", " 'count()': 'Returns the number of times a specified value occurs in a string',\n", " 'encode()': 'Returns an encoded version of the string'}"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["# For interested readers: what happens in the following line?\n", "{ key: psm_text_dict[key] for key in list(psm_text_dict.keys())[0:5] }"]}, {"cell_type": "code", "execution_count": 23, "id": "8f707fa8", "metadata": {}, "outputs": [], "source": ["assert psm_text_dict['join()'] == 'Joins the elements of an iterable to the end of the string'"]}, {"cell_type": "markdown", "id": "8a1df299", "metadata": {}, "source": ["## Representation similar to orient = \"index\"\n", "\n", "Given:\n", "* `psm_text_zeilen`, `psm_text_tokens`\n", "\n", "Wanted:\n", "* `psm_orient_index`: a representation in the same data structure that `pd.DataFrame.to_dict(orient='index')` returns, see {doc}`e_r1b`.\n", "  * Column 1: \"Funktion\"\n", "  * Column 2: \"Beschreibung\""]}, {"cell_type": "code", "execution_count": 24, "id": "bb6e714d", "metadata": {}, "outputs": [], "source": ["psm_orient_index = {} # a dict, not a set\n", "\n", "for zeilennummer in range(len(psm_text_tokens)):\n", "    psm_orient_index[zeilennummer] = \\\n", "        { \"Funktion\": psm_text_tokens[zeilennummer][0], \n", "          \"Beschreibung\": psm_text_tokens[zeilennummer][1] }"]}, {"cell_type": "code", "execution_count": 25, "id": "7caa71a6", "metadata": {}, "outputs": [{"data": {"text/plain": ["{0: {'Funktion': 'capitalize()',\n", " 'Beschreibung': 'Converts the first character to upper case'},\n", " 1: {'Funktion': 'casefold()',\n", " 'Beschreibung': 'Converts 
string into lower case'},\n", " 2: {'Funktion': 'center()', 'Beschreibung': 'Returns a centered string'},\n", " 3: {'Funktion': 'count()',\n", " 'Beschreibung': 'Returns the number of times a specified value occurs in a string'},\n", " 4: {'Funktion': 'encode()',\n", " 'Beschreibung': 'Returns an encoded version of the string'}}"]}, "execution_count": 25, "metadata": {}, "output_type": "execute_result"}], "source": ["# For interested readers: what happens in the following line?\n", "{ key: psm_orient_index[key] for key in range(5) }"]}, {"cell_type": "markdown", "id": "6e682890", "metadata": {}, "source": ["**Exercise:** Rewrite this solution as a comprehension!"]}, {"cell_type": "code", "execution_count": 27, "id": "f7669b62", "metadata": {}, "outputs": [], "source": ["psm_orient_index = { zeilennummer : \n", "    {\"Funktion\": psm_text_tokens[zeilennummer][0],\n", "     \"Beschreibung\": psm_text_tokens[zeilennummer][1] }\n", "    for zeilennummer in range(len(psm_text_tokens))\n", "    }"]}, {"cell_type": "code", "execution_count": 28, "id": "63dce494", "metadata": {}, "outputs": [], "source": ["assert psm_orient_index[0] == {'Funktion': 'capitalize()',\n", " 'Beschreibung': 'Converts the first character to upper case'}"]}, {"cell_type": "markdown", "id": "8e1de449", "metadata": {}, "source": ["## Create a \"normalized\" description\n", "\n", "We want to analyze the text of the function descriptions. 
To do so, we \"normalize\" the texts:\n", "* lowercase only\n", "* no punctuation characters (\".,:;?!\")\n", "\n", "**Step 1:**\n", "* define a function `normalisiere()` that normalizes a string"]}, {"cell_type": "code", "execution_count": 29, "id": "31532790", "metadata": {}, "outputs": [{"data": {"text/plain": ["'h\u00e4 ach so'"]}, "execution_count": 29, "metadata": {}, "output_type": "execute_result"}], "source": ["def normalisiere(s):\n", "    s_clean = [ Buchstabe for Buchstabe in s.lower() if Buchstabe not in \".,:;?!\" ]\n", "    ergebnis = \"\".join(s_clean)  # join the remaining characters back into one string\n", "    return ergebnis\n", "\n", "normalisiere(\"H\u00e4? Ach so!\")"]}, {"cell_type": "markdown", "id": "1304cb37", "metadata": {}, "source": ["**Step 2:**\n", "\n", "Given:\n", "* `psm_orient_index`\n", "* the function `normalisiere()` from above\n", "\n", "Wanted:\n", "* `psm_orient_index` with a new key \"normalisiert\"\n"]}, {"cell_type": "code", "execution_count": 30, "id": "b4cc1a2a", "metadata": {}, "outputs": [], "source": ["for index, Zeile in psm_orient_index.items():\n", "    Zeile[\"normalisiert\"] = normalisiere(Zeile[\"Beschreibung\"])  # add the normalized description"]}, {"cell_type": "code", "execution_count": 31, "id": "52364cfe", "metadata": {}, "outputs": [{"data": {"text/plain": ["{'Funktion': 'capitalize()',\n", " 'Beschreibung': 'Converts the first character to upper case',\n", " 'normalisiert': 'converts the first character to upper case'}"]}, "execution_count": 31, "metadata": {}, "output_type": "execute_result"}], "source": ["psm_orient_index[0]"]}, {"cell_type": "markdown", "id": "e47b0686", "metadata": {}, "source": ["As a check, view it in a nicely formatted layout:"]}, {"cell_type": "code", "execution_count": 32, "id": "87eed014", "metadata": {}, "outputs": [{"data": {"text/html": ["
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Funktion</th>\n", "      <th>Beschreibung</th>\n", "      <th>normalisiert</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>capitalize()</td>\n", "      <td>Converts the first character to upper case</td>\n", "      <td>converts the first character to upper case</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>casefold()</td>\n", "      <td>Converts string into lower case</td>\n", "      <td>converts string into lower case</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>center()</td>\n", "      <td>Returns a centered string</td>\n", "      <td>returns a centered string</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>count()</td>\n", "      <td>Returns the number of times a specified value ...</td>\n", "      <td>returns the number of times a specified value ...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>encode()</td>\n", "      <td>Returns an encoded version of the string</td>\n", "      <td>returns an encoded version of the string</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>"], "text/plain": [" Funktion Beschreibung \\\n", "0 capitalize() Converts the first character to upper case \n", "1 casefold() Converts string into lower case \n", "2 center() Returns a centered string \n", "3 count() Returns the number of times a specified value ... \n", "4 encode() Returns an encoded version of the string \n", "\n", " normalisiert \n", "0 converts the first character to upper case \n", "1 converts string into lower case \n", "2 returns a centered string \n", "3 returns the number of times a specified value ... \n", "4 returns an encoded version of the string "]}, "execution_count": 32, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas as pd\n", "psm_df = pd.DataFrame.from_dict(psm_orient_index, orient = 'index')\n", "psm_df.head()"]}, {"cell_type": "markdown", "id": "c8bd6a23", "metadata": {}, "source": ["## Functions that return True/False\n", "\n", "Given:\n", "* Our text in the various variables above\n", "\n", "Wanted:\n", "* a list of all functions that return `True` or `False`\n", "\n", "Approach: search the (ideally normalized) description of each function for suitable hints, in particular for the string \"True\" ;-)"]}, {"cell_type": "code", "execution_count": 33, "id": "58838e1e", "metadata": {}, "outputs": [], "source": ["# TBD in the original; one possible sketch (psm_fn_bool is an assumed name, not from the source):\n", "psm_fn_bool = [ z[\"Funktion\"] for z in psm_orient_index.values() if \"returns true\" in z[\"Beschreibung\"].lower() ]"]}, {"cell_type": "code", "execution_count": null, "id": "da7273d4", "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13"}}, "nbformat": 4, "nbformat_minor": 5}