dagstuhl2019

This page: http://jbusse.de/MA/dagstuhl2019.html

Impulse talk, Dagstuhl 2019 ... Purpose: start a technical discussion with colleagues by (a) presenting a small research idea and (b) discussing first preliminary results ... Goal: gather the colleagues' opinions: is deeper research worthwhile here?

Overarching research goal: improving machine learning models, in particular with regard to transparency, understandability, robustness etc. ... the explanation component provides important hints about the applicability of a model: complexity, number of variables involved etc.

the general Q is: How can we improve ML (here: ElasticNet model generation) by adding semantic background knowledge?

ML context

domain of application: high-dimensional, broad datasets (p columns >> n rows)

bag of words, tf-idf, classification
document retrieval, search query expansion
Internet of Things (IOT), sensor data

interesting ML algorithms: https://scikit-learn.org/stable/modules/linear_model.html:

1.1.2. Ridge Regression
1.1.3. Lasso | 1.1.8. LARS Lasso
1.1.5. Elastic Net
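- the ElasticNet estimator itself is regression only, there is NO logit counterpart of this class for classification in scikit-learn :-(
- but see below (SGD): SGDClassifier accepts penalty="elasticnet", e.g. for a linear SVM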
1.1.11. Logistic regression
- The solver "liblinear" uses a coordinate descent (CD) algorithm, and relies on the excellent C++ LIBLINEAR library ...
- The solvers implemented in the class LogisticRegression are "liblinear", "newton-cg", "lbfgs", "sag" and "saga"
1.1.12. Stochastic Gradient Descent - SGD

my personal favourite: ElasticNet

explicitly suitable for bag-of-words models with p >> n and highly correlated attributes
fast fit: glmnet in R is a highly optimized, very fast implementation
uses a sequence of warm starts, thus suitable for online learning
small whitebox models
very fast prediction
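
The points above can be tried out with a minimal sketch in scikit-learn (not glmnet). Everything here is an assumption for illustration: the synthetic data, the sizes, the alpha grid and `l1_ratio=0.5`. It fits `ElasticNet` on a p >> n matrix with three highly correlated columns and reuses the previous solution via `warm_start=True` while alpha decreases.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n_rows, n_cols = 50, 1000                      # p >> n
X = rng.normal(size=(n_rows, n_cols))
# Columns 1 and 2 are noisy copies of column 0 (highly correlated attributes).
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n_rows)
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=n_rows)
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=n_rows)

# warm_start=True: each fit starts from the coefficients of the previous fit.
model = ElasticNet(l1_ratio=0.5, warm_start=True, max_iter=10_000)
counts = {}
for alpha in [1.0, 0.3, 0.1]:                  # from strong to weak regularization
    model.set_params(alpha=alpha)
    model.fit(X, y)
    counts[alpha] = int(np.count_nonzero(model.coef_))

for alpha, k in counts.items():
    print(f"alpha={alpha}: {k} of {n_cols} coefficients are non-zero")
```

The non-zero counts show how small the resulting whitebox model stays; the exact numbers depend on the random data.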

good old-fashioned idea

the general Q is: How can we improve ML (here: ElasticNet model generation) by adding semantic background knowledge?

Titanic dataset

facts:

survival rate among females much higher than among males
age also an important predictor? possibly at least for males?

thus:

assume that the attribute "sex" is not given
calculate (probability of) gender based on the feature "salutation" (title in the name)
age has many nulls (missing values): use Master / Mr / Miss / Mrs to predict age
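
A minimal sketch of these two steps, under assumptions: names follow the pattern "Surname, Title. Given names", the title-to-gender mapping is crisp rather than probabilistic, and missing ages are filled with the median age per title. The rows are hypothetical toy values, not taken from the Titanic dataset; `extract_title`, `TITLE_TO_SEX` and `rows` are illustrative names.

```python
import re
from statistics import median

# Title = text between the first comma and the following period.
TITLE_RE = re.compile(r",\s*([^.,]+)\.")

# Crisp mapping for the four frequent titles (an assumption for this sketch).
TITLE_TO_SEX = {"Mr": "male", "Master": "male", "Miss": "female", "Mrs": "female"}

def extract_title(name):
    """Return the salutation in a name like 'Doe, Mr. John', or None."""
    match = TITLE_RE.search(name)
    return match.group(1).strip() if match else None

# Hypothetical toy rows; None marks a missing age.
rows = [
    {"name": "Doe, Mr. John", "age": 40.0},
    {"name": "Poe, Mr. Adam", "age": None},
    {"name": "Lee, Mr. Carl", "age": 30.0},
    {"name": "Doe, Mrs. Jane", "age": 38.0},
    {"name": "Roe, Miss. Anna", "age": 20.0},
    {"name": "Roe, Master. Tom", "age": 4.0},
    {"name": "Fox, Master. Ben", "age": None},
]

for row in rows:
    row["title"] = extract_title(row["name"])
    row["sex"] = TITLE_TO_SEX.get(row["title"])

# Median of the known ages per title.
known = {}
for row in rows:
    if row["age"] is not None:
        known.setdefault(row["title"], []).append(row["age"])
median_age = {title: median(ages) for title, ages in known.items()}

# Fill missing ages from the title; stays None if the title has no known age.
for row in rows:
    if row["age"] is None:
        row["age"] = median_age.get(row["title"])
```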

The ~~curse~~ praise of dimensionality: Use inferencing to enrich the feature set and to propagate values. Also allow for adding new features (single, married etc.).

"ontology"

"inferencing"

logical, crisp (boolean)

today: SKOS-inferencing: "female OR young"
tomorrow: "female AND young"
at maximum: generic RDFS inferencing
NOT: owl, f-logic
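
The OR case can be sketched as a closure over a skos:broader-like relation: a broader concept holds for a row if any of its narrower concepts holds. The hierarchy in `BROADER` is an assumption for illustration, loosely modelled on feature names such as female_married or serious_church; `broader_closure` and `enrich` are hypothetical helper names.

```python
# Hypothetical concept tree: concept -> directly broader concepts.
BROADER = {
    "Mr": ["male"],
    "Mrs": ["female_married"],
    "female_married": ["female"],
    "Miss": ["female"],
    "Rev": ["serious_church", "male"],
    "serious_church": ["serious"],
}

def broader_closure(concept):
    """Return all transitively broader concepts of `concept`."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in BROADER.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def enrich(features):
    """OR-inferencing: add every broader concept of every given feature."""
    enriched = set(features)
    for feature in features:
        enriched |= broader_closure(feature)
    return enriched

print(sorted(enrich({"Mrs"})))
print(sorted(enrich({"Rev"})))
```

An AND rule ("female AND young") would need a different mechanism, since it fires only if all of its premises hold; this sketch does not cover it.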

value propagation (sort of "numeric inferencing")

tf-idf?
probabilistic, possibilistic?

side conditions

sparse input? preserve sparsity!
"regularized" inferencing? impose cost function?
very, very fast
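
One way to meet these side conditions, as a sketch and not as the method used for the results below: precompute the reflexive-transitive closure of the concept graph as a sparse matrix once, then enrich the whole feature matrix with a single sparse product. The concepts, edges and the three toy rows are assumptions; the closure loop terminates only for acyclic graphs.

```python
import numpy as np
from scipy import sparse

# Hypothetical concepts and narrower -> broader edges.
concepts = ["Mr", "Mrs", "Miss", "male", "female", "female_married"]
idx = {c: i for i, c in enumerate(concepts)}
edges = [("Mr", "male"), ("Mrs", "female_married"),
         ("female_married", "female"), ("Miss", "female")]

p = len(concepts)
src = [idx[a] for a, _ in edges]
dst = [idx[b] for _, b in edges]
B = sparse.csr_matrix((np.ones(len(edges)), (src, dst)), shape=(p, p))

# Closure = I + B + B^2 + ... ; B^k becomes empty because the graph is acyclic.
closure = sparse.identity(p, format="csr")
step = B
while step.nnz:
    closure = closure + step
    step = step @ B
closure.data[:] = 1.0                     # keep it boolean

# Toy input: three rows, one title each (sparse one-hot).
titles = ["Mr", "Mrs", "Miss"]
X = sparse.csr_matrix(
    (np.ones(len(titles)), (range(len(titles)), [idx[t] for t in titles])),
    shape=(len(titles), p),
)

# One sparse product propagates every feature to all of its broader concepts.
X_enriched = X @ closure
X_enriched.data[:] = 1.0                  # boolean OR instead of path counts

print(X_enriched.toarray())
```

Keeping the summed values instead of resetting them to 1 would give a simple form of value propagation; a decay factor per step would be one possible "cost" on long inference chains.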

(re)search space: attributes of knowledge representation (KR)

concepts or words (terms)?

SKOS: (tree of) concepts; terms are modeled as labels
WordNet: (graph of) synsets of terms (fig.)

graph structure

tree: T-Box: skos, class tree, taxonomy
- OR
- AND
- complex rules
directed graph: FCA
not strongly directed (and thus presumably cyclic) graph: http://webisa.webdatacommons.org/

KR size

user-tailored, hand-made based on the headers of the specific dataset: very, very few concepts
public thesaurus
- https://www.openthesaurus.de/statistics/index: DE: 40k concepts, 120k terms
- https://iate.europa.eu/download-iate TermBase eXchange format (TBX), DE: >700k terms ... relationship to eurovoc.europa.eu (SKOS), http://www.eurotermbank.com/ ?

evaluation criteria

increase in prediction score (accuracy, f1, ROC etc.)
complexity of (white box) model
robustness: graceful degradation, online learning
ethical values w.r.t. the High-Level Expert Group on Artificial Intelligence: https://ec.europa.eu/digital-single-market/en/high-level-expert-group-artificial-intelligence

first results

original data

-0.4258 Mr
0.1221 Mrs
-0.0809 Rev
0.0693 Miss
-0.0235 Capt
-0.0235 Don
-0.0235 Jonkheer
0.0219 Mlle
-0.0133 Dr
0.0108 Countess
0.0108 Lady
0.0108 Mme
0.0108 Ms
0.0108 Sir

with feature engineering

-0.3313 Mr
-0.175 male
0.0 female
0.0377 female_married
-0.0192 Rev
-0.0159 serious_church
-0.0195 serious
-0.0044 serious_highness