20 Newsgroups: Intro

Tasks specific to dsci-txt:

Current as of 2023: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups

from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(
    subset='train',
    categories=cats,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers', 'quotes'),  # added by JB
)
newsgroups_train.target_names
['alt.atheism', 'sci.space']
type(newsgroups_train)
sklearn.utils._bunch.Bunch

Bunch: https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html#sklearn.utils.Bunch

A Bunch is like a dictionary, but it also supports attribute-style access. This is the main difference between Bunch and dict: in a Bunch you can read the entries as attributes using dot notation, which a plain dict does not allow. Source: https://stackoverflow.com/questions/56286221/what-is-the-difference-between-bunch-and-dictionary-type-in-python
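The difference between key access and attribute access can be sketched without scikit-learn. The class `MiniBunch` below is a hypothetical, simplified stand-in written only for illustration; it is not the scikit-learn implementation, it merely assumes that a Bunch behaves like a dict whose keys can also be read as attributes.

```python
class MiniBunch(dict):
    """Hypothetical, simplified stand-in for sklearn.utils.Bunch (illustration only)."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails,
        # so dict methods such as keys() keep working.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __setattr__(self, key, value):
        # Attribute assignment is stored as a dict entry.
        self[key] = value


bunch = MiniBunch(target_names=["alt.atheism", "sci.space"])
plain = {"target_names": ["alt.atheism", "sci.space"]}

by_key = bunch["target_names"]   # dictionary-style access
by_attr = bunch.target_names     # dot notation, same object
print(by_key == by_attr)

# A plain dict has no such attribute, so dot notation fails.
try:
    plain.target_names
    dict_has_attr = True
except AttributeError:
    dict_has_attr = False
print(dict_has_attr)
```

With the real object, `newsgroups_train["target"]` and `newsgroups_train.target` are two spellings of the same lookup.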

newsgroups_train.keys()
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
newsgroups_train["target"]
array([0, 1, 1, ..., 1, 1, 1])
newsgroups_train.target_names  # dot notation instead of newsgroups_train['target_names']
['alt.atheism', 'sci.space']
type(newsgroups_train.filenames)
numpy.ndarray
newsgroups_train.filenames.shape
(1073,)
len(newsgroups_train.filenames) 
1073

newsgroups_train.data is a list of strings:

type(newsgroups_train.data)
list
# newsgroups_train.data
newsgroups_train.data[:2]
[': \n: >> Please enlighten me.  How is omnipotence contradictory?\n: \n: >By definition, all that can occur in the universe is governed by the rules\n: >of nature. Thus god cannot break them. Anything that god does must be allowed\n: >in the rules somewhere. Therefore, omnipotence CANNOT exist! It contradicts\n: >the rules of nature.\n: \n: Obviously, an omnipotent god can change the rules.\n\nWhen you say, "By definition", what exactly is being defined;\ncertainly not omnipotence. You seem to be saying that the "rules of\nnature" are pre-existant somehow, that they not only define nature but\nactually cause it. If that\'s what you mean I\'d like to hear your\nfurther thoughts on the question.',
 "In <19APR199320262420@kelvin.jpl.nasa.gov> baalke@kelvin.jpl.nasa.gov \n\nSorry I think I missed a bit of info on this Transition Experiment. What is it?\n\nWill this mean a loss of data or will the Magellan transmit data later on ??\n\nBTW: When will NASA cut off the connection with Magellan?? Not that I am\nlooking forward to that day but I am just curious. I believe it had something\nto do with the funding from the goverment (or rather _NO_ funding :-)\n\nok that's it for now. See you guys around,\nJurriaan.\n "]

newsgroups_train.target is a NumPy array of integers. For each string in the list newsgroups_train.data, it gives the target, i.e. the index of that document's category in newsgroups_train.target_names:

newsgroups_train.target
array([0, 1, 1, ..., 1, 1, 1])
print(newsgroups_train.target.shape)
(1073,)
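How the integer targets line up with the texts and the category names can be sketched with a few toy values. The three strings and the three targets below are made up for illustration and are not taken from the dataset; only the two category names come from the output above. The same indexing should work with `newsgroups_train.target`, `newsgroups_train.data` and `newsgroups_train.target_names`.

```python
from collections import Counter

# Category names as shown above; position in this list is the class index.
target_names = ["alt.atheism", "sci.space"]

# Toy stand-ins (hypothetical) for newsgroups_train.data and newsgroups_train.target.
data = ["first post ...", "second post ...", "third post ..."]
target = [0, 1, 1]

# Translate each integer target into its category name.
labels = [target_names[i] for i in target]

# data[i] and target[i] belong together, so zip pairs each text with its label.
for text, label in zip(data, labels):
    print(f"{label}: {text}")

# Number of documents per category.
counts = Counter(labels)
print(counts)
```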