20 Newsgroups: Feature Engineering
User Guide:
1.1.5. Elastic-Net: “Elastic-net is useful when there are multiple features that are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.” (scikit-learn 1.1.0)
Learning curves, only lasso_path() and enet_path(): Lasso and Elastic Net (scikit-learn 1.1.0)
Reference:
sklearn.linear_model.SGDClassifier: “Implements logistic regression with elastic net penalty (SGDClassifier(loss="log_loss", penalty="elasticnet")).”
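For orientation, the elastic-net objective as given in the scikit-learn User Guide for ElasticNet combines an L1 and an L2 penalty. Here α is the overall penalty strength and ρ the mixing weight (the l1_ratio parameter). SGDClassifier applies the same kind of mixed penalty to its own loss, possibly with a different scaling, so this is a sketch of the idea rather than the classifier's exact objective:

```latex
\min_{w} \; \frac{1}{2 n_{\text{samples}}} \lVert X w - y \rVert_2^2
  + \alpha \rho \lVert w \rVert_1
  + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
```

With ρ = 1 the penalty reduces to the Lasso, with ρ = 0 to a ridge-type penalty.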
from time import time
from sklearn.datasets import fetch_20newsgroups
categories = [ "rec.autos",
"rec.motorcycles" ]
# larger alternative selection (not used below)
categories1 = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
    "soc.religion.christian",
    "talk.politics.guns",
    "talk.politics.mideast",
    "talk.politics.misc",
]
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html
https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset
data_train = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42,
    remove=('headers', 'footers', 'quotes')  # added by JB
)
data_test = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=42,
    remove=('headers', 'footers', 'quotes')  # added by JB
)
def size_mb(docs):
    return sum(len(s.encode("utf-8")) for s in docs) / 1e6
data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)
print("%d documents - %0.3fMB (training set)" % (len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (len(data_test.data), data_test_size_mb))
# order of labels in `target_names` can be different from `categories`
target_names = data_train.target_names
print("%d categories:" % len(target_names), target_names)
1192 documents - 0.776MB (training set)
794 documents - 0.443MB (test set)
2 categories: ['rec.autos', 'rec.motorcycles']
data_train is a dictionary-like object (a scikit-learn Bunch); data_train['data'] is a list of strings:
data_train['data'][:3]
['Stuff deleted...',
"This morning a truck that had been within my sight (and I within\nhis) for about 3 miles suddenly forgot that I existed and pulled\nover right on me -- my front wheel was about even with the back\nedge of his front passenger door as I was accelerating past him.\n\nIt was trivial enough for me to tap the brakes and slide behind him\nas he slewed over (with no signal, of course) on top of me, with\nmy little horn blaring (damn, I need Fiamms!), but the satisfaction\nof being aware of my surroundings and thus surviving was not enough,\nespecially when I later pulled up alongside the bastard and he made\nno apologetic wave or anything.\n\nIs there some way that I can memorize the license plate of an\noffending vehicle and get the name and address of the owner?\nI'm not going to firebomb houses or anything, I'd just like to\nwrite a consciousness-raising letter or two. I think that it would\nbe good for BDI cagers to know that We Know Where They Live.\nMaybe they'd use 4 or 5 brain cells while driving instead of the\nusual 3.",
"So how do I steer when my hands aren't on the bars? (Open Budweiser in left \nhand, Camel cigarette in the right, no feet allowed.) If I lean, and the \nbike turns, am I countersteering? Is countersteering like benchracing only \nwith a taller seat, so your feet aren't on the floor?"]
data_train['target'] is an array of integers that represent the categories:
data_train['target'][:3]
array([0, 1, 1])
Mapping from the integer-encoded categories in the target to the category names:
# label i in the target corresponds to target_names[i]
num_to_category = dict(enumerate(target_names))
num_to_category
{0: 'rec.autos', 1: 'rec.motorcycles'}
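The mapping can be used to decode a whole target vector back into category names. The following is a minimal sketch; the two variables at the top are hypothetical stand-ins for data_train.target_names and data_train['target'], so it runs without downloading the dataset:

```python
import numpy as np

# Hypothetical stand-ins for data_train.target_names and data_train['target']
target_names = ["rec.autos", "rec.motorcycles"]
target = np.array([0, 1, 1])

# Integer label i corresponds to target_names[i]
num_to_category = dict(enumerate(target_names))

# Decode every label; int() turns the NumPy integer into a plain dict key
decoded = [num_to_category[int(i)] for i in target]
print(decoded)
```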
Common nomenclature:
X … features as a matrix
y … target as a vector
We want to compare the effect of different feature-engineering variants, so we create a dict that holds the different variants of the X matrix.
# maximum of docs to be considered
# use len(data_train.data) to process all available docs
n_max_docs = 100 # all: len(data_train['target'])
y = data_train['target'][:n_max_docs]
X_train = { 'unmodified': data_train['data'][:n_max_docs] }
X_train['unmodified'][:2]
['Stuff deleted...',
"This morning a truck that had been within my sight (and I within\nhis) for about 3 miles suddenly forgot that I existed and pulled\nover right on me -- my front wheel was about even with the back\nedge of his front passenger door as I was accelerating past him.\n\nIt was trivial enough for me to tap the brakes and slide behind him\nas he slewed over (with no signal, of course) on top of me, with\nmy little horn blaring (damn, I need Fiamms!), but the satisfaction\nof being aware of my surroundings and thus surviving was not enough,\nespecially when I later pulled up alongside the bastard and he made\nno apologetic wave or anything.\n\nIs there some way that I can memorize the license plate of an\noffending vehicle and get the name and address of the owner?\nI'm not going to firebomb houses or anything, I'd just like to\nwrite a consciousness-raising letter or two. I think that it would\nbe good for BDI cagers to know that We Know Where They Live.\nMaybe they'd use 4 or 5 brain cells while driving instead of the\nusual 3."]
Feature Engineering#
spaCy#
Background reading: https://spacy.io/usage/spacy-101
import spacy
nlp = spacy.load("en_core_web_sm")
def token_filter(token):
    # Keep a token only if it is not punctuation, whitespace or a bracket
    # and has at least 3 characters. Boolean `or` is needed here: with
    # bitwise `|`, `a | len(x) < 3` would parse as `(a | len(x)) < 3`.
    return not (token.is_punct
                or token.is_space
                #or token.is_stop
                or token.is_bracket
                or len(token.text) < 3
                )
def nlp_with_token_filter(doc):
    # Lemmatize and POS-tag the document, then join the kept tokens as "lemma_POS"
    d = nlp(doc)
    return " ".join( f"{token.lemma_}_{token.pos_}" # f"{token.pos_}_{token.lemma_}"
                     for token in d if token_filter(token) )
t0 = time()
X_train['nlp_with_token_filter'] = [ nlp_with_token_filter(doc)
                                     for doc in X_train['unmodified'][0:n_max_docs] ]
duration = time() - t0
print(f"{n_max_docs} docs in {duration:.2f} seconds")
100 docs in 1.67 seconds
X_train['nlp_with_token_filter'][:2]
['stuff_NOUN delete_VERB',
'this_DET morning_NOUN truck_NOUN that_PRON have_AUX be_AUX within_ADP sight_NOUN and_CCONJ within_ADP his_NOUN for_ADP about_ADV mile_NOUN suddenly_ADV forget_VERB that_SCONJ exist_VERB and_CCONJ pull_VERB over_ADP right_ADV front_ADJ wheel_NOUN be_AUX about_ADV even_ADV with_ADP the_DET back_ADJ edge_NOUN his_PRON front_ADJ passenger_NOUN door_NOUN be_AUX accelerate_VERB past_ADP he_PRON be_AUX trivial_ADJ enough_ADV for_SCONJ tap_VERB the_DET brake_NOUN and_CCONJ slide_VERB behind_ADP he_PRON slew_VERB over_ADP with_ADP signal_NOUN course_NOUN top_NOUN with_ADP little_ADJ horn_NOUN blare_VERB damn_INTJ need_VERB Fiamms_PROPN but_CCONJ the_DET satisfaction_NOUN be_AUX aware_ADJ surrounding_NOUN and_CCONJ thus_ADV survive_VERB be_AUX not_PART enough_ADJ especially_ADV when_SCONJ later_ADV pull_VERB alongside_ADP the_DET bastard_NOUN and_CCONJ make_VERB apologetic_ADJ wave_NOUN anything_PRON there_PRON some_DET way_NOUN that_PRON can_AUX memorize_VERB the_DET license_NOUN plate_NOUN offend_VERB vehicle_NOUN and_CCONJ get_VERB the_DET name_NOUN and_CCONJ address_NOUN the_DET owner_NOUN not_PART go_VERB firebomb_VERB house_NOUN anything_PRON just_ADV like_VERB write_VERB consciousness_NOUN raise_VERB letter_NOUN two_NUM think_VERB that_SCONJ would_AUX good_ADJ for_SCONJ BDI_PROPN cager_NOUN know_VERB that_SCONJ know_VERB where_SCONJ they_PRON live_VERB maybe_ADV they_PRON use_VERB brain_NOUN cell_NOUN while_SCONJ drive_VERB instead_ADV the_DET usual_ADJ']
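A general Python pitfall is worth keeping in mind for filters like token_filter: bitwise | binds more tightly than comparison operators, whereas boolean `or` binds more loosely. The sketch below illustrates the difference with a hypothetical FakeToken class that only mimics the few spaCy token attributes needed here, so it runs without a language model:

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token (only the fields used here)."""
    text: str
    is_punct: bool = False
    is_space: bool = False


def keep_bitwise(token):
    # Parsed as: not ((is_punct | is_space | len(text)) < 3)
    return not (token.is_punct | token.is_space | len(token.text) < 3)


def keep_boolean(token):
    # Parsed as: not (is_punct or is_space or (len(text) < 3))
    return not (token.is_punct or token.is_space or len(token.text) < 3)


ellipsis_token = FakeToken("...", is_punct=True)
word_token = FakeToken("truck")
short_token = FakeToken("my")

for tok in (ellipsis_token, word_token, short_token):
    print(repr(tok.text), keep_bitwise(tok), keep_boolean(tok))
```

Only the boolean variant drops the multi-character punctuation token; both variants agree on ordinary words and on tokens shorter than three characters.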
spaCy WordNet#
#!pip install spacy_wordnet
from spacy_wordnet.wordnet_annotator import WordnetAnnotator
nlp.add_pipe("spacy_wordnet", after='tagger', config={'lang': nlp.lang})
---------------------------------------------------------------------------
ConfigValidationError Traceback (most recent call last)
Cell In [14], line 1
----> 1 nlp.add_pipe("spacy_wordnet", after='tagger', config={'lang': nlp.lang})
File ~/miniconda3/lib/python3.9/site-packages/spacy/language.py:795, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
787 if not self.has_factory(factory_name):
788 err = Errors.E002.format(
789 name=factory_name,
790 opts=", ".join(self.factory_names),
(...)
793 lang_code=self.lang,
794 )
--> 795 pipe_component = self.create_pipe(
796 factory_name,
797 name=name,
798 config=config,
799 raw_config=raw_config,
800 validate=validate,
801 )
802 pipe_index = self._get_pipe_index(before, after, first, last)
803 self._pipe_meta[name] = self.get_factory_meta(factory_name)
File ~/miniconda3/lib/python3.9/site-packages/spacy/language.py:674, in Language.create_pipe(self, factory_name, name, config, raw_config, validate)
671 cfg = {factory_name: config}
672 # We're calling the internal _fill here to avoid constructing the
673 # registered functions twice
--> 674 resolved = registry.resolve(cfg, validate=validate)
675 filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"]
676 filled = Config(filled)
File ~/miniconda3/lib/python3.9/site-packages/thinc/config.py:746, in registry.resolve(cls, config, schema, overrides, validate)
737 @classmethod
738 def resolve(
739 cls,
(...)
744 validate: bool = True,
745 ) -> Dict[str, Any]:
--> 746 resolved, _ = cls._make(
747 config, schema=schema, overrides=overrides, validate=validate, resolve=True
748 )
749 return resolved
File ~/miniconda3/lib/python3.9/site-packages/thinc/config.py:795, in registry._make(cls, config, schema, overrides, resolve, validate)
793 if not is_interpolated:
794 config = Config(orig_config).interpolate()
--> 795 filled, _, resolved = cls._fill(
796 config, schema, validate=validate, overrides=overrides, resolve=resolve
797 )
798 filled = Config(filled, section_order=section_order)
799 # Check that overrides didn't include invalid properties not in config
File ~/miniconda3/lib/python3.9/site-packages/thinc/config.py:850, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
848 schema.__fields__[key] = copy_model_field(field, Any)
849 promise_schema = cls.make_promise_schema(value, resolve=resolve)
--> 850 filled[key], validation[v_key], final[key] = cls._fill(
851 value,
852 promise_schema,
853 validate=validate,
854 resolve=resolve,
855 parent=key_parent,
856 overrides=overrides,
857 )
858 reg_name, func_name = cls.get_constructor(final[key])
859 args, kwargs = cls.parse_args(final[key])
File ~/miniconda3/lib/python3.9/site-packages/thinc/config.py:916, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
914 result = schema.parse_obj(validation)
915 except ValidationError as e:
--> 916 raise ConfigValidationError(
917 config=config, errors=e.errors(), parent=parent
918 ) from None
919 else:
920 # Same as parse_obj, but without validation
921 result = schema.construct(**validation)
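ConfigValidationError:
Config validation error
spacy_wordnet -> lang extra fields not permitted
{'nlp': <spacy.lang.en.English object at 0x7fe74cf8b1c0>, 'name': 'spacy_wordnet', 'lang': 'en', '@factories': 'spacy_wordnet'}
The installed spacy_wordnet factory does not accept a lang setting. Presumably this release takes the language from the pipeline itself, in which case the component is added without the config argument:
nlp.add_pipe("spacy_wordnet", after='tagger')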
import nltk
#nltk.download('wordnet')
token = nlp('prices')[0]
# The wordnet extension links the spaCy token to the NLTK WordNet interface
# by giving access to synsets and lemmas
token._.wordnet.synsets()
token._.wordnet.lemmas()
# It also tags the token automatically with WordNet domains
print(token._.wordnet.wordnet_domains())
synonyms by domain of interest (example from recognai)#
“Recognai is an artificial intelligence software company who help teams and data scientists to cope with information overload.”
# spaCy WordNet lets you find synonyms by domain of interest for example economy
sentence = nlp('I want to withdraw 5,000 euros')
economy_domains = ['finance', 'banking']
token_with_synsets = [(token, token._.wordnet.wordnet_synsets_for_domain(economy_domains))
                      for token in sentence]
enriched_sentence = []
for token, synsets in token_with_synsets:
    if not synsets:
        enriched_sentence.append(token.text)
    else:
        lemmas_for_synset = {lemma for s in synsets for lemma in s.lemma_names()}
        enriched_sentence.append('({})'.format('|'.join(lemmas_for_synset)))
# >> I (need|want|require) to (draw|withdraw|draw_off|take_out) 5,000 euros
print(' '.join(enriched_sentence))
CAUTION: Our result differs, because here want is interpreted as a noun instead of, correctly, as a verb.
Bag of Words#
Theory for background reading and much more: https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html (https://opac.haw-landshut.de/TouchPoint/perma.do?q=1035%3D%22BV045113643%22+IN+%5B3%5D&v=fla&l=de)
sklearn.feature_extraction.text.TfidfVectorizer: “Equivalent to CountVectorizer followed by TfidfTransformer”. Because we want to examine the two steps independently of each other, we apply them separately.
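The quoted equivalence can be checked on a small example. This is a minimal sketch with a hypothetical toy corpus and default settings for all three classes; it compares the two-step result with the one-step result:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy corpus
docs = ["the car has four wheels",
        "the motorcycle has two wheels",
        "a car is not a motorcycle"]

# Two separate steps: raw counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Single combined step
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```

Keeping the steps separate makes the intermediate count matrix available for inspection, which is the reason for the split here.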
count vectorizer#
flavours = [ 'unmodified', 'nlp_with_token_filter' ]
from sklearn.feature_extraction.text import CountVectorizer
X_vectorizer = {}
X_bag_of_words = {}
for flavour in flavours:
    X_vectorizer[flavour] = CountVectorizer(strip_accents='unicode')
    X_bag_of_words[flavour] = X_vectorizer[flavour].fit_transform(X_train[flavour])
    print(f"{flavour}: {X_bag_of_words[flavour].shape}")
def inspect_feature_names(start_filter):
    for flavour in flavours:
        feature_names = [ f for f in X_vectorizer[flavour].get_feature_names_out()
                          if f.startswith(start_filter)]
        print(f"\n{flavour}, {len(feature_names)} items: {feature_names}")
inspect_feature_names("ea")
for flavour in flavours:
    print(f"{flavour}: {X_bag_of_words[flavour].shape}")
transformer#
from sklearn.feature_extraction.text import TfidfTransformer
X_transformer = {}
X_tfidf = {}
for flavour in flavours:
    # Re-weight the raw counts from the CountVectorizer step with tf-idf
    X_transformer[flavour] = TfidfTransformer()
    X_tfidf[flavour] = X_transformer[flavour].fit_transform(X_bag_of_words[flavour])
    print(f"{flavour}: {X_tfidf[flavour].shape}")
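With scikit-learn's documented defaults for TfidfTransformer (smooth_idf=True, norm='l2'), the weighting amounts to the following, where n is the number of documents, df(t) the number of documents containing term t, and v_d the weight vector of document d:

```latex
\operatorname{idf}(t) = \ln\frac{1 + n}{1 + \operatorname{df}(t)} + 1,
\qquad
\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d)\,\operatorname{idf}(t),
\qquad
v_d \leftarrow \frac{v_d}{\lVert v_d \rVert_2}
```

The shape of the matrix is unchanged by this step; only the values are re-weighted and each row is scaled to unit length.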
SGDClassifier as Elastic Net#
What we are actually interested in is classification (aka logistic regression) with an elastic-net penalty. Because ElasticNet itself is only a regressor (and only becomes usable as a classifier after a threshold is applied), we use the logistic-regression classification built into SGDClassifier:
SGDClassifier: Implements logistic regression with elastic net penalty (SGDClassifier(loss="log_loss", penalty="elasticnet")). https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
SGDClassifier:
from sklearn.linear_model import SGDClassifier
X_SGDClassifier_clf = {}
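for flavour in flavours:
    X_SGDClassifier_clf[flavour] = SGDClassifier(
        random_state=0,
        # loss="log_loss" requires scikit-learn >= 1.1; older releases call it
        # "log" and raise "ValueError: The loss log_loss is not supported."
        # Without a loss argument the default "hinge" (a linear SVM) is used.
        penalty="elasticnet")
    X_SGDClassifier_clf[flavour].fit(X_tfidf[flavour], y)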
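To actually get logistic regression regardless of the installed scikit-learn release, the loss name can be chosen by version. The following is a sketch under two assumptions: the toy data is hypothetical, and the loss is named "log" before release 1.1 and "log_loss" from 1.1 on:

```python
import re

import numpy as np
import sklearn
from sklearn.linear_model import SGDClassifier

# Pick the loss name by installed version: "log_loss" from 1.1 on, "log" before
major, minor = map(int, re.match(r"(\d+)\.(\d+)", sklearn.__version__).groups())
log_loss_name = "log_loss" if (major, minor) >= (1, 1) else "log"

# Hypothetical toy data: two well-separated classes
X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

clf = SGDClassifier(loss=log_loss_name, penalty="elasticnet",
                    l1_ratio=0.15, random_state=0)
clf.fit(X, y)

# With a logistic loss the classifier can return class probabilities
proba = clf.predict_proba(X)
print(proba.shape)
```

The l1_ratio value shown is the library default and is only spelled out to make the elastic-net mixing explicit.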
from sklearn.metrics import ConfusionMatrixDisplay
for flavour in flavours:
    # from_predictions expects the arguments in the order (y_true, y_pred)
    ConfusionMatrixDisplay.from_predictions(
        y, X_SGDClassifier_clf[flavour].predict(X_tfidf[flavour]))
CAUTION: These confusion matrices look good, BUT they are in-sample evaluations on the training data and therefore reflect overfitting; they are not tests against unseen data.
To do: run the test data through the same pipeline, then compute a meaningful confusion matrix.
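One possible way to approach this open item is a scikit-learn Pipeline, which fits all steps on the training data and only transforms the test data. The sketch below uses hypothetical toy documents in place of data_train and data_test and leaves out the spaCy preprocessing, which would have to be applied to the test documents in the same way beforehand:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for data_train / data_test
train_docs = ["my car needs new tires", "the engine of the car is loud",
              "my bike leans in corners", "helmet and bike gloves"]
train_y = [0, 0, 1, 1]
test_docs = ["the car has a flat tire", "a new helmet for the bike"]
test_y = [0, 1]

pipeline = make_pipeline(
    CountVectorizer(strip_accents='unicode'),
    TfidfTransformer(),
    SGDClassifier(penalty="elasticnet", random_state=0),
)

# Vocabulary, idf weights and classifier are fitted on the training data only
pipeline.fit(train_docs, train_y)

# The test documents are transformed with the fitted steps, never re-fitted
test_pred = pipeline.predict(test_docs)

cm = confusion_matrix(test_y, test_pred, labels=[0, 1])
print(cm)
```

Words that occur only in the test documents are ignored by the fitted vocabulary, which is the behaviour one wants when evaluating against unseen data.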