20 Newsgroups: Feature Engineering#

User Guide:

1.1.5. Elastic-Net: “Elastic-net is useful when there are multiple features that are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.” (scikit-learn 1.1.0)
- Learning curves, only lasso_path() and enet_path(): Lasso and Elastic Net (scikit-learn 1.1.0)
1.1.6. Multi-task Elastic-Net
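
The quoted behaviour on correlated features can be tried out with a minimal sketch on synthetic data. The data, the noise levels and the `alpha` / `l1_ratio` values below are assumptions chosen for illustration; they are not taken from the user guide.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n_samples = 200

# Two almost identical (strongly correlated) features built from one signal z
z = rng.normal(size=n_samples)
X = np.column_stack([
    z + 0.01 * rng.normal(size=n_samples),
    z + 0.01 * rng.normal(size=n_samples),
])
y = 3.0 * z + 0.1 * rng.normal(size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso coefficients:      ", lasso.coef_)
print("Elastic-net coefficients:", enet.coef_)
```

On data like this, the elastic-net fit typically spreads the weight over both features, whereas the Lasso fit tends to concentrate it on one of them.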

Reference:

sklearn.linear_model.ElasticNet
sklearn.linear_model.ElasticNetCV
sklearn.linear_model.SGDClassifier: “Implements logistic regression with elastic net penalty (SGDClassifier(loss="log_loss", penalty="elasticnet")).”

from time import time

from sklearn.datasets import fetch_20newsgroups

categories = [
    "rec.autos",
    "rec.motorcycles",
]

# Alternative, larger selection (not used below)
categories1 = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
    "soc.religion.christian",
    "talk.politics.guns",
    "talk.politics.mideast",
    "talk.politics.misc",
]

data_train = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42,
    remove=("headers", "footers", "quotes"),  # added by JB
)

data_test = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=42,
    remove=("headers", "footers", "quotes"),  # added by JB
)


def size_mb(docs):
    # Size of all documents in megabytes, counted as UTF-8 bytes
    return sum(len(s.encode("utf-8")) for s in docs) / 1e6


data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (len(data_test.data), data_test_size_mb))

# order of labels in `target_names` can be different from `categories`
target_names = data_train.target_names
print("%d categories:" % len(target_names), target_names)

documents - 0.776MB (training set)
documents - 0.443MB (test set)
categories: ['rec.autos', 'rec.motorcycles']

data_train is a dictionary-like object (a scikit-learn Bunch); data_train['data'] is a list of strings:

data_train['data'][:3]

['Stuff deleted...',
 "This morning a truck that had been within my sight (and I within\nhis) for about 3 miles suddenly forgot that I existed and pulled\nover right on me -- my front wheel was about even with the back\nedge of his front passenger door as I was accelerating past him.\n\nIt was trivial enough for me to tap the brakes and slide behind him\nas he slewed over (with no signal, of course) on top of me, with\nmy little horn blaring (damn, I need Fiamms!), but the satisfaction\nof being aware of my surroundings and thus surviving was not enough,\nespecially when I later pulled up alongside the bastard and he made\nno apologetic wave or anything.\n\nIs there some way that I can memorize the license plate of an\noffending vehicle and get the name and address of the owner?\nI'm not going to firebomb houses or anything, I'd just like to\nwrite a consciousness-raising letter or two. I think that it would\nbe good for BDI cagers to know that We Know Where They Live.\nMaybe they'd use 4 or 5 brain cells while driving instead of the\nusual 3.",
 "So how do I steer when my hands aren't on the bars? (Open Budweiser in left \nhand, Camel cigarette in the right, no feet allowed.) If I lean, and the \nbike turns, am I countersteering? Is countersteering like benchracing only \nwith a taller seat, so your feet aren't on the floor?"]

data_train['target'] is a NumPy array of integers that represent the categories:

data_train['target'][:3]

array([0, 1, 1])

Mapping from the integer-encoded categories in the target to the category names as strings:

num_to_category = dict(enumerate(target_names))
num_to_category

{0: 'rec.autos', 1: 'rec.motorcycles'}
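
Because each integer in the target is an index into `target_names`, the mapping can also be used to decode labels back into readable names. A minimal sketch, with short hypothetical stand-ins for `data_train.target_names` and `data_train.target`:

```python
# Hypothetical stand-ins for data_train.target_names and data_train.target
target_names = ["rec.autos", "rec.motorcycles"]
target = [0, 1, 1]

# target_names[i] is the name of class i, so enumerate() yields the mapping
num_to_category = dict(enumerate(target_names))

# Decode the integer labels back into category names
labels = [num_to_category[i] for i in target]
print(labels)
```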

Common nomenclature:

X … features as a matrix
y … target as a vector

We want to compare the effect of different feature-engineering variants, so we create a dict that holds the different variants of the X matrix.

# maximum number of docs to be considered
# use len(data_train.data) to process all available docs

n_max_docs = 100  # all: len(data_train['target'])

y = data_train['target'][:n_max_docs]

X_train = { 'unmodified': data_train['data'][:n_max_docs] }
X_train['unmodified'][:2]

['Stuff deleted...',
 "This morning a truck that had been within my sight (and I within\nhis) for about 3 miles suddenly forgot that I existed and pulled\nover right on me -- my front wheel was about even with the back\nedge of his front passenger door as I was accelerating past him.\n\nIt was trivial enough for me to tap the brakes and slide behind him\nas he slewed over (with no signal, of course) on top of me, with\nmy little horn blaring (damn, I need Fiamms!), but the satisfaction\nof being aware of my surroundings and thus surviving was not enough,\nespecially when I later pulled up alongside the bastard and he made\nno apologetic wave or anything.\n\nIs there some way that I can memorize the license plate of an\noffending vehicle and get the name and address of the owner?\nI'm not going to firebomb houses or anything, I'd just like to\nwrite a consciousness-raising letter or two. I think that it would\nbe good for BDI cagers to know that We Know Where They Live.\nMaybe they'd use 4 or 5 brain cells while driving instead of the\nusual 3."]

Feature Engineering#

spaCy#

Background reading: https://spacy.io/usage/spacy-101

import spacy
nlp = spacy.load("en_core_web_sm")

def token_filter(token):
    # Keep a token only if it is not punctuation, whitespace or a bracket
    # and is at least 3 characters long. `or` is used rather than `|`,
    # because `|` binds more tightly than `<` and would turn the length
    # test into `(flags | len(token.text)) < 3`.
    return not (token.is_punct
                or token.is_space
                # or token.is_stop
                or token.is_bracket
                or len(token.text) < 3
               )

def nlp_with_token_filter(doc):
    # Represent every kept token as lemma_POS
    d = nlp(doc)
    return " ".join( f"{token.lemma_}_{token.pos_}"  # f"{token.pos_}_{token.lemma_}"
                    for token in d if token_filter(token) )

t0 = time()
X_train['nlp_with_token_filter'] = [ nlp_with_token_filter(doc) 
                                    for doc in X_train['unmodified'][0:n_max_docs] ]
duration = time() - t0
print(f"{n_max_docs} docs in {duration:.2f} seconds")

100 docs in 1.67 seconds

X_train['nlp_with_token_filter'][:2]

['stuff_NOUN delete_VERB',
 'this_DET morning_NOUN truck_NOUN that_PRON have_AUX be_AUX within_ADP sight_NOUN and_CCONJ within_ADP his_NOUN for_ADP about_ADV mile_NOUN suddenly_ADV forget_VERB that_SCONJ exist_VERB and_CCONJ pull_VERB over_ADP right_ADV front_ADJ wheel_NOUN be_AUX about_ADV even_ADV with_ADP the_DET back_ADJ edge_NOUN his_PRON front_ADJ passenger_NOUN door_NOUN be_AUX accelerate_VERB past_ADP he_PRON be_AUX trivial_ADJ enough_ADV for_SCONJ tap_VERB the_DET brake_NOUN and_CCONJ slide_VERB behind_ADP he_PRON slew_VERB over_ADP with_ADP signal_NOUN course_NOUN top_NOUN with_ADP little_ADJ horn_NOUN blare_VERB damn_INTJ need_VERB Fiamms_PROPN but_CCONJ the_DET satisfaction_NOUN be_AUX aware_ADJ surrounding_NOUN and_CCONJ thus_ADV survive_VERB be_AUX not_PART enough_ADJ especially_ADV when_SCONJ later_ADV pull_VERB alongside_ADP the_DET bastard_NOUN and_CCONJ make_VERB apologetic_ADJ wave_NOUN anything_PRON there_PRON some_DET way_NOUN that_PRON can_AUX memorize_VERB the_DET license_NOUN plate_NOUN offend_VERB vehicle_NOUN and_CCONJ get_VERB the_DET name_NOUN and_CCONJ address_NOUN the_DET owner_NOUN not_PART go_VERB firebomb_VERB house_NOUN anything_PRON just_ADV like_VERB write_VERB consciousness_NOUN raise_VERB letter_NOUN two_NUM think_VERB that_SCONJ would_AUX good_ADJ for_SCONJ BDI_PROPN cager_NOUN know_VERB that_SCONJ know_VERB where_SCONJ they_PRON live_VERB maybe_ADV they_PRON use_VERB brain_NOUN cell_NOUN while_SCONJ drive_VERB instead_ADV the_DET usual_ADJ']
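
When boolean token flags are combined with a length test, operator precedence matters: in Python, `|` binds more tightly than `<`. A minimal sketch with plain values; the flag and the length are hypothetical stand-ins for a token such as `...`, which is punctuation and three characters long:

```python
is_punct = True   # stand-in for token.is_punct
text_len = 3      # stand-in for len(token.text)

# Parsed as (is_punct | text_len) < 3, i.e. (1 | 3) < 3
bitwise = is_punct | text_len < 3

# Parsed as is_punct or (text_len < 3)
logical = is_punct or text_len < 3

print(bitwise, logical)
```

A filter of the form `not (flag | len(...) < 3)` therefore keeps such a token, while the same filter written with `or` drops it.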

spaCy WordNet#

https://spacy.io/universe/project/spacy-wordnet

#!pip install spacy_wordnet
from spacy_wordnet.wordnet_annotator import WordnetAnnotator 

# Assumption: the installed spacy-wordnet release takes no `lang` option,
# so the component is added without a config.
nlp.add_pipe("spacy_wordnet", after='tagger')

Passing config={'lang': nlp.lang} to add_pipe fails in this environment (traceback shortened):

---------------------------------------------------------------------------
ConfigValidationError                     Traceback (most recent call last)
Cell In [14], line 1
----> 1 nlp.add_pipe("spacy_wordnet", after='tagger', config={'lang': nlp.lang})

ConfigValidationError: 

Config validation error

spacy_wordnet -> lang   extra fields not permitted

{'nlp': <spacy.lang.en.English object at 0x7fe74cf8b1c0>, 'name': 'spacy_wordnet', 'lang': 'en', '@factories': 'spacy_wordnet'}

import nltk
#nltk.download('wordnet')

token = nlp('prices')[0]

# the wordnet extension links a spaCy token to the NLTK WordNet interface
# by giving access to synsets and lemmas
token._.wordnet.synsets()
token._.wordnet.lemmas()

# and automatically tags tokens with WordNet domains
print(token._.wordnet.wordnet_domains())

synonyms by domain of interest (example from recognai)#

“Recognai is an artificial intelligence software company who help teams and data scientists to cope with information overload.”

https://www.recogn.ai/en/opensource/spacy-wordnet/

# spaCy WordNet lets you find synonyms by domain of interest for example economy
sentence = nlp('I want to withdraw 5,000 euros')
economy_domains = ['finance', 'banking']

token_with_synsets = [(token, token._.wordnet.wordnet_synsets_for_domain(economy_domains)) 
                      for token in sentence]

enriched_sentence = []
for token, synsets in token_with_synsets:
    if not synsets:
        enriched_sentence.append(token.text)
    else:
        lemmas_for_synset = {lemma for s in synsets for lemma in s.lemma_names()}
        enriched_sentence.append('({})'.format('|'.join(lemmas_for_synset)))

# >> I (need|want|require) to (draw|withdraw|draw_off|take_out) 5,000 euros
    
print(' '.join(enriched_sentence))

CAUTION: Our result is different! Here, want is incorrectly interpreted as a noun rather than as a verb.

Bag of Words#

Background theory and much more: https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html (https://opac.haw-landshut.de/TouchPoint/perma.do?q=1035%3D%22BV045113643%22+IN+%5B3%5D&v=fla&l=de)

sklearn.feature_extraction.text.TfidfVectorizer: “Equivalent to CountVectorizer followed by TfidfTransformer”: Because we want to examine the two steps independently, we apply them separately.
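
The quoted equivalence can be checked on a small example. This is a minimal sketch with a hypothetical toy corpus and default parameters for all three classes:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus
docs = [
    "the car has a new engine",
    "the bike has a new chain",
    "car and bike share the road",
]

# Two separate steps: raw counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# The combined vectorizer in one step
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```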

count vectorizer#

sklearn.feature_extraction.text.CountVectorizer

flavours = [ 'unmodified', 'nlp_with_token_filter' ]

from sklearn.feature_extraction.text import CountVectorizer
X_vectorizer = {}
X_bag_of_words = {}

for flavour in flavours: 
    X_vectorizer[flavour] = CountVectorizer(strip_accents='unicode')
    X_bag_of_words[flavour] = X_vectorizer[flavour].fit_transform(X_train[flavour])
    print(f"{flavour}: {X_bag_of_words[flavour].shape}")

def inspect_feature_names(start_filter):
    for flavour in flavours:
        feature_names = [ f for f in X_vectorizer[flavour].get_feature_names_out() 
                              if f.startswith(start_filter)]
        print(f"\n{flavour}, {len(feature_names)} items: {feature_names}")

inspect_feature_names("ea")

for flavour in flavours:
    print(f"{flavour}: {X_bag_of_words[flavour].shape}")

transformer#

sklearn.feature_extraction.text.TfidfTransformer

from sklearn.feature_extraction.text import TfidfTransformer
X_transformer = {}
X_tfidf = {}

for flavour in flavours: 
    # tf-idf weighting is applied to the bag-of-words counts from the previous step
    X_transformer[flavour] = TfidfTransformer()
    X_tfidf[flavour] = X_transformer[flavour].fit_transform(X_bag_of_words[flavour])
    print(f"{flavour}: {X_tfidf[flavour].shape}")
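
To make the weighting concrete, the sketch below recomputes the tf-idf values by hand and compares them with TfidfTransformer. It assumes the transformer's default settings (smoothed idf, L2 row normalisation); the count matrix is a hypothetical example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term-count matrix: 3 documents x 3 terms
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [3, 2, 0],
])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf (smooth_idf=True)
tfidf = counts * idf                        # term frequency times idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2 norm per row

reference = TfidfTransformer().fit_transform(counts).toarray()
matches = np.allclose(tfidf, reference)
print(matches)
```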

SGDClassifier als Elastic Net#

What we are really interested in is classification (aka logistic regression) with an elastic-net penalty. Because ElasticNet itself is only a regression model (and is usable for classification only after a threshold is applied), we use the logistic-regression classifier built into SGDClassifier:

SGDClassifier: “Implements logistic regression with elastic net penalty (SGDClassifier(loss="log_loss", penalty="elasticnet")).” https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html

SGDClassifier:

from sklearn.linear_model import SGDClassifier
X_SGDClassifier_clf = {}

for flavour in flavours: 
    X_SGDClassifier_clf[flavour] = SGDClassifier(
        random_state=0, 
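        # loss="log_loss", # ValueError: The loss log_loss is not supported.
        # The name "log_loss" needs scikit-learn >= 1.1 (older releases call it "log").
        # Without it the default loss="hinge" is used, i.e. a linear SVM
        # rather than logistic regression.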
        penalty="elasticnet")
    X_SGDClassifier_clf[flavour].fit(X_tfidf[flavour], y)
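
The name of the logistic loss depends on the scikit-learn version: it is "log_loss" from 1.1 on and "log" in older releases. The sketch below tries both names in turn. The helper `fit_logistic_elasticnet` and the synthetic data are hypothetical and only stand in for the tf-idf features above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Hypothetical synthetic data standing in for the tf-idf features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)


def fit_logistic_elasticnet(X, y):
    """Fit SGD logistic regression, trying the newer loss name first."""
    for loss in ("log_loss", "log"):
        try:
            clf = SGDClassifier(loss=loss, penalty="elasticnet",
                                l1_ratio=0.15, random_state=0)
            return clf.fit(X, y)
        except ValueError:
            continue  # this loss name is unknown to the installed version
    raise RuntimeError("no logistic loss name was accepted")


clf = fit_logistic_elasticnet(X, y)
proba = clf.predict_proba(X[:5])  # available because the loss is logistic
print(clf.loss, proba.shape)
```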
    

from sklearn.metrics import ConfusionMatrixDisplay

for flavour in flavours:
    # from_predictions expects the true labels first, then the predictions
    ConfusionMatrixDisplay.from_predictions(
        y, X_SGDClassifier_clf[flavour].predict(X_tfidf[flavour]))

CAUTION: These confusion matrices look good, BUT they are evaluated on the training data and therefore reflect overfitting; they are not tests against unseen data.

To do: process the test data with the same pipeline, then compute a meaningful confusion matrix.
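
One possible shape for this to-do, as a minimal sketch: it assumes that a scikit-learn Pipeline is an acceptable way to bundle the steps, and it uses hypothetical toy documents and labels in place of the newsgroup data. In the notebook, the spaCy preprocessing would have to be applied to the test documents as well.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the train and test documents
train_docs = ["car engine oil", "car tyres brake", "bike helmet ride",
              "bike chain ride", "car engine brake", "bike helmet chain"]
train_y = [0, 0, 1, 1, 0, 1]
test_docs = ["engine oil brake", "helmet chain ride"]
test_y = [0, 1]

pipeline = Pipeline([
    ("counts", CountVectorizer(strip_accents="unicode")),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(penalty="elasticnet", random_state=0)),
])

# Vocabulary, idf weights and classifier are fitted on the training data only
pipeline.fit(train_docs, train_y)

# The test documents are only transformed and predicted, never fitted
pred_y = pipeline.predict(test_docs)

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(test_y, pred_y, labels=[0, 1])
print(cm)
```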

dsci-txt SS 2023
