---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.1
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Creating a bag of character n-grams

* J.Busse, www.jbusse.de, 2023-05-30
* Version for n-grams, based on {doc}`bag-of-words-erstellen`

License: public domain / [CC 0](https://creativecommons.org/publicdomain/zero/1.0/deed.de)

Idea:

> one might alternatively consider a collection of character n-grams, a representation resilient against misspellings and derivations.
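To illustrate this claim up front (a minimal sketch with a hypothetical helper `char_trigrams`, independent of the corpus used below): most character trigrams of a word survive a single typo, whereas at the word level the misspelled token would not match at all.

```{code-cell} ipython3
# Sketch: the character trigrams of a word and of a typo variant still
# overlap substantially, while as whole words "misspelling" and
# "mispelling" would count as two completely different tokens.
def char_trigrams(word, n=3):
    return {word[i:i+n] for i in range(len(word) - n + 1)}

a = char_trigrams("misspelling")
b = char_trigrams("mispelling")
print(sorted(a & b))            # 7 of 10 distinct trigrams are shared
print(len(a & b) / len(a | b))  # Jaccard similarity: 0.7
```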
Files:

* {download}`regex_XX.zip`
* unzip it; ideally, the directory `regex_XX/` should be created as a sibling of the current working directory `md/` -- or adjust the variable `path_to_md` accordingly.

### Global Parameters

```{code-cell} ipython3
import numpy as np
import pandas as pd

import random
random.seed(42)
```

```{code-cell} ipython3
# show intermediary results
# 0 none, 1 informative, 2 didactical, 3 debug
verbosity = 2

def verbose(level, item):
    if level <= verbosity:
        display(item)
```

```{code-cell} ipython3
# probability that a character is replaced by a keyboard neighbor
fehler = 0.3
```

## Typos

Idea: we programmatically inject a large number of typos into our texts. A conventional word-level BOW approach should no longer detect much similarity here.

```{code-cell} ipython3
# adjacent key pairs; note the trailing spaces, which keep the
# concatenated string literals from fusing pairs across rows
tastatur = (
    "qw we er rt tz ty zu ui io op pü "  # top row
    "as sd df fg gh hj jk kl lö öä "     # home row
    "yx yz xc cv vb bn nm "              # bottom row
    "12 23 34 45 56 67 78 89 90"         # digit row
)

# all characters for which we know typo neighbors
tastatur_set = { c for word in tastatur.split() for c in word }
print(tastatur_set)
```

```{code-cell} ipython3
# note: each character also counts as its own "neighbor"
nachbarn_set = { c: set() for c in tastatur_set }
for word in tastatur.split():
    for c in word:
        nachbarn_set[c].update(word)
#print(nachbarn_set)
```

```{code-cell} ipython3
nachbarn_dict = { k: list(v) for k, v in nachbarn_set.items() }
print(nachbarn_dict)
```

```{code-cell} ipython3
def dreher(zeichen, fehler):
    """With probability `fehler`, replace `zeichen` by one of its keyboard neighbors."""
    if zeichen not in tastatur_set:
        return zeichen
    t = random.random()
    if t < fehler:
        ret = random.choice(nachbarn_dict[zeichen])
    else:
        ret = zeichen  # no typo introduced
    return ret
```

```{code-cell} ipython3
for c in "Hallo Hugo":
    print(dreher(c, 0.3))
```

## Read Files

```{code-cell} ipython3
# path to files, incl. glob mask
#path_to_md = "md/*.md"
path_to_md = "../regex_XX/*.md"
```

```{code-cell} ipython3
# https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory
import glob
files = glob.glob(path_to_md)
verbose(1, f"{len(files)} files found")
files
```

We read the data into the dictionary `corpus_dict_of_strings`:

```{code-cell} ipython3
corpus_dict_of_strings = {}
for file in files:
    with open(file, 'r') as f:
        corpus_dict_of_strings[file] = f.read()
verbose(2, corpus_dict_of_strings)
```

For each text we add a typo-ridden twin, marked by the suffix `_dreher`:

```{code-cell} ipython3
corpus2 = {}
for name, text in corpus_dict_of_strings.items():
    corpus2[name + "_dreher"] = "".join([dreher(c, fehler) for c in text])
corpus_dict_of_strings.update(corpus2)
corpus_dict_of_strings
```

```{code-cell} ipython3
def clean_string(string):
    """Return `string` "cleaned": punctuation, special characters etc.
    are replaced with spaces."""
    alnum = lambda x: x if x.isalnum() else " "
    return "".join(alnum(c) for c in string)
```

```{code-cell} ipython3
corpus_list_of_lists = [ clean_string(text).split()
    for text in corpus_dict_of_strings.values() ]
verbose(2, corpus_list_of_lists[-1])
```

## Character n-grams

Idea: we represent individual words not by themselves, but by n-grams on the character level, typically with n = 3.

```{code-cell} ipython3
corpus_as_words = [ " ".join(word_list)
    for word_list in corpus_list_of_lists ]
verbose(2, corpus_as_words[-1])
```

```{code-cell} ipython3
def n_char_substrings(string, n=3, low=True):
    """Return all character n-grams of `string`, lowercased if `low`."""
    if len(string) < n:
        return []
    elif low:
        return [string[i:i+n].lower() for i in range(0, len(string)-n+1)]
    else:
        return [string[i:i+n] for i in range(0, len(string)-n+1)]
```
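A quick check of this helper on a single word (a demo cell added here in the spirit of the `dreher` demo above; the word `Beispiel` is an arbitrary example):

```{code-cell} ipython3
# quick check: the character trigrams of a single word
n_char_substrings("Beispiel")
# -> ['bei', 'eis', 'isp', 'spi', 'pie', 'iel']
```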
```{code-cell} ipython3
corpus_list_of_ngrams = []
for text in corpus_list_of_lists:
    ngram_list = []
    for word in text:
        ngram_list.extend(n_char_substrings(word))
    corpus_list_of_ngrams.append(ngram_list)
verbose(2, corpus_list_of_ngrams[1][0:20])
```

```{code-cell} ipython3
corpus_as_ngrams = [ " ".join(ngram_list)
    for ngram_list in corpus_list_of_ngrams ]
verbose(2, corpus_as_ngrams[0])
```

## Library

Minimal example from https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

```{code-cell} ipython3
import sklearn
#sklearn.show_versions()
from sklearn.feature_extraction.text import TfidfVectorizer
```

Note that with `analyzer="char"` the vectorizer builds the character 3-grams itself, directly from the word corpus; unlike our manual per-word n-grams, these may also span the spaces between words.

```{code-cell} ipython3
# tv ... the *T*fidf*V*ectorizer extracts the char 3-grams itself
vectorizer_tv = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
X_tv = vectorizer_tv.fit_transform(corpus_as_words)
```

```{code-cell} ipython3
vectorizer_tv.get_feature_names_out()
```

## TfidfVectorizer

Docs:

* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
* https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

```{code-cell} ipython3
vectorizer_words = TfidfVectorizer()
X_words = vectorizer_words.fit_transform(corpus_as_words)
verbose(2, vectorizer_words.get_feature_names_out()[0:100])
```

```{code-cell} ipython3
vectorizer_ngrams = TfidfVectorizer()
X_ngrams = vectorizer_ngrams.fit_transform(corpus_as_ngrams)
verbose(2, vectorizer_ngrams.get_feature_names_out()[0:100])
```

```{code-cell} ipython3
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns
```

```{code-cell} ipython3
def similarity_df(X):
    """Cosine-similarity matrix as a DataFrame labeled with the corpus keys."""
    df = pd.DataFrame(cosine_similarity(X))
    df.columns = corpus_dict_of_strings.keys()
    df.index = corpus_dict_of_strings.keys()
    return df

similarity_df_words = similarity_df(X_words)
similarity_df_ngrams = similarity_df(X_ngrams)
similarity_df_tv = similarity_df(X_tv)
```

```{code-cell} ipython3
ax_words = sns.clustermap(similarity_df_words)
# ax_words.savefig("clustermap_words.png")
```

```{code-cell} ipython3
ax_ngrams = sns.clustermap(similarity_df_ngrams)
```

```{code-cell} ipython3
ax_tv = sns.clustermap(similarity_df_tv)
```
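As a closing check (a sketch added here, not part of the original pipeline), the effect can also be read off the similarity matrices directly: compare each file with its `_dreher` twin under both representations. The word-level similarity should collapse, while the char-n-gram similarity stays comparatively high.

```{code-cell} ipython3
# Sketch: similarity between each original file and its typo twin,
# word-level BOW vs. char n-grams; the n-gram scores should stay
# noticeably higher.
for name in files:
    twin = name + "_dreher"
    print(f"{name}: "
          f"words={similarity_df_words.loc[name, twin]:.2f}  "
          f"ngrams={similarity_df_ngrams.loc[name, twin]:.2f}")
```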