dagstuhl2019
This page: http://jbusse.de/MA/dagstuhl2019.html
Impulse talk, Dagstuhl 2019 ... Purpose: start a technical discussion with colleagues by (a) presenting a small research idea and (b) discussing first preliminary results ... Goal: gather colleagues' opinions: is more in-depth research worthwhile here?
Overarching research goal: improving machine learning models, especially with regard to transparency, understandability, robustness etc. ... the explanation component provides important hints about the applicability of a model: complexity, number of variables involved, etc.
the general Q is: How can we improve ML (here: ElasticNet model generation) by adding semantic background knowledge?
ML context
domain of application: high-dimensional, broad datasets (p columns >> n rows)
- bag of words, tf-idf, classification
- document retrieval, search query expansion
- Internet of Things (IoT), sensor data
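To make the "p columns >> n rows" setting concrete, here is a minimal sketch of a tf-idf bag-of-words matrix in plain Python. The three documents are made up for illustration, and the weighting (relative term frequency times log(n/df)) is only one of several common tf-idf variants.

```python
import math
from collections import Counter

# Hypothetical toy corpus: n = 3 documents (rows).
docs = [
    "the captain and the reverend boarded",
    "the countess boarded with the lady",
    "a sensor reported the temperature",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})  # p columns
n, p = len(docs), len(vocab)

# Document frequency: in how many documents does each term occur?
df = Counter(t for doc in tokenized for t in set(doc))


def tfidf(doc):
    """Sparse tf-idf row as {term: weight}; absent terms are implicit zeros."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}


rows = [tfidf(doc) for doc in tokenized]
print(f"n = {n} rows, p = {p} columns")
```

Even this tiny corpus already has more columns than rows, and each row stores only the terms that actually occur.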
interesting ML algorithms: https://scikit-learn.org/stable/modules/linear_model.html:
- 1.1.2. Ridge Regression
- 1.1.3. Lasso | 1.1.8. LARS Lasso
- 1.1.5. Elastic Net
  - the ElasticNet estimator is regression only, NO logit version for classification :-(
  - but see below, SVM version; since scikit-learn 0.21, LogisticRegression also accepts an elastic-net penalty with the "saga" solver
- 1.1.11. Logistic regression
  - The solver "liblinear" uses a coordinate descent (CD) algorithm, and relies on the excellent C++ LIBLINEAR library ...
  - The solvers implemented in the class LogisticRegression are "liblinear", "newton-cg", "lbfgs", "sag" and "saga"
- 1.1.12. Stochastic Gradient Descent - SGD
my personal favourite: ElasticNet
- explicitly suitable for bag-of-words models with p >> n and highly correlated attributes
- fast fit: glmnet (R) is a highly optimized, very fast algorithm
- uses a sequence of warm starts, thus suitable for online learning
- small whitebox models
- very fast prediction
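The warm-start idea can be sketched in a few lines. The following is a plain-Python toy version of cyclic coordinate descent for the elastic net objective (1/2n)*||y - X b||^2 + lambda*(alpha*||b||_1 + (1-alpha)/2*||b||^2), without intercept or standardization. It is not glmnet itself; the data, the lambda grid and all function names are assumptions made for illustration.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) used by the L1 part."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def objective(X, y, beta, lam, alpha):
    """(1/2n)*||y - X b||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||^2)."""
    n = len(X)
    rss = sum((y[i] - sum(x * b for x, b in zip(X[i], beta))) ** 2 for i in range(n))
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return rss / (2 * n) + lam * (alpha * l1 + 0.5 * (1 - alpha) * l2)


def enet_cd(X, y, lam, alpha, beta=None, n_sweeps=2000):
    """Cyclic coordinate descent; `beta` is the warm start (zeros if None)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p if beta is None else list(beta)
    resid = [y[i] - sum(x * b for x, b in zip(X[i], beta)) for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column never enters the model
            # Covariance of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            delta = new - beta[j]
            if delta:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Toy data with p = 6 > n = 4; columns 0 and 1 are identical (fully correlated).
X = [
    [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
]
y = [1.0, 1.0, -1.0, -1.0]
alpha = 0.5
n = len(X)

# Smallest lambda for which all coefficients stay at zero.
lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n)) / n) for j in range(len(X[0]))) / alpha

# Decreasing lambda path; each fit starts from the previous solution.
path, beta = [], None
for lam in [lam_max * f for f in (1.0, 0.5, 0.25, 0.1)]:
    beta = enet_cd(X, y, lam, alpha, beta)
    path.append((lam, beta))

for lam, b in path:
    print(f"lambda={lam:.3f}", [round(v, 4) for v in b])
```

Because the ridge part makes the objective strictly convex, the two identical columns end up sharing the weight instead of one being picked arbitrarily, which is the property that makes the method attractive for correlated bag-of-words features.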
good old-fashioned idea
the general Q is: How can we improve ML (here: ElasticNet model generation) by adding semantic background knowledge?
facts:
- survival rate among females much higher than among males
- age also an important predictor? possibly at least for males?
thus:
- assume that the attribute "sex" is not given
- estimate the (probability of) gender from the feature "Anrede" (salutation)
- age has many Nulls: use Master / Mr / Miss / Mrs to predict age
The praise (rather than the curse) of dimensionality: use inferencing to enrich the feature set and to propagate values.
Also allow for adding new features (single, married etc.).
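The steps above could look roughly as follows. This is a sketch on made-up rows: the mapping from salutation to P(female), the `female_married` rule and the use of the per-salutation median as age estimate are all assumptions for illustration, not the method actually used for the results below.

```python
from statistics import median

# Assumed mapping from salutation to P(female); "Dr" is deliberately uncertain.
P_FEMALE = {"Mr": 0.0, "Master": 0.0, "Miss": 1.0, "Mrs": 1.0, "Dr": 0.5}

# Made-up toy rows; None marks a missing age.
passengers = [
    {"title": "Mr", "age": 40},
    {"title": "Mr", "age": None},
    {"title": "Mr", "age": 30},
    {"title": "Master", "age": 4},
    {"title": "Master", "age": None},
    {"title": "Miss", "age": 20},
    {"title": "Mrs", "age": 35},
    {"title": "Mrs", "age": None},
]


def median_age_by_title(rows):
    """Median of the known ages, grouped by salutation."""
    ages = {}
    for r in rows:
        if r["age"] is not None:
            ages.setdefault(r["title"], []).append(r["age"])
    return {title: median(values) for title, values in ages.items()}


def enrich(rows):
    """Add inferred features and fill missing ages from the salutation."""
    med = median_age_by_title(rows)
    out = []
    for r in rows:
        missing = r["age"] is None
        out.append({
            **r,
            "age": med.get(r["title"]) if missing else r["age"],
            "age_imputed": missing,
            "p_female": P_FEMALE.get(r["title"]),   # None if salutation unknown
            "female_married": r["title"] == "Mrs",  # assumed new feature
        })
    return out


enriched = enrich(passengers)
for row in enriched:
    print(row)
```

Keeping an `age_imputed` flag lets the model distinguish observed from inferred values.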
"inferencing"
logical, crisp (boolean)
- today: SKOS-inferencing: "female OR young"
- tomorrow: "female AND young"
- at maximum: generic RDFS inferencing
- NOT: OWL, F-logic
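A minimal sketch of the crisp variants, assuming a hand-made mini thesaurus: the OR case is read here as the transitive closure over skos:broader (a broader concept fires if any narrower concept fires), the AND case as a simple conjunctive rule. The concept names `female_or_young` and `female_and_young` are hypothetical.

```python
# Hypothetical thesaurus: concept -> directly broader concepts (skos:broader).
BROADER = {
    "Miss": {"female"},
    "Mrs": {"female"},
    "Master": {"young"},
    "female": {"female_or_young"},
    "young": {"female_or_young"},
}

# Hypothetical conjunctive rules: head concept -> all required concepts.
AND_RULES = {"female_and_young": {"female", "young"}}


def broader_closure(concept, broader=BROADER):
    """All concepts reachable via skos:broader; safe on cyclic input."""
    seen, stack = set(), [concept]
    while stack:
        for b in broader.get(stack.pop(), ()):
            if b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def infer_or(features, broader=BROADER):
    """OR semantics: a concept is active if any narrower concept is active."""
    out = set(features)
    for f in features:
        out |= broader_closure(f, broader)
    return out


def infer_and(features, rules=AND_RULES):
    """AND semantics: a head is active only if all body concepts are active."""
    out = set(features)
    out |= {head for head, body in rules.items() if body <= out}
    return out


print(sorted(infer_or({"Mrs"})))
print(sorted(infer_and(infer_or({"Miss", "Master"}))))
```

Both functions only ever add binary features, so an originally sparse row stays sparse as long as the thesaurus is shallow.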
value propagation (sort of "numeric inferencing")
- tf-idf?
- probabilistic, possibilistic?
side conditions
- sparse input? preserve sparsity!
- "regularized" inferencing? impose cost function?
- very, very fast
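One way to read "numeric inferencing" together with the side conditions is sketched below: weights travel upwards along the broader relation, are damped by a decay factor per hop, and are dropped below a cutoff so that rows stay sparse. Combining by `max` corresponds to a possibilistic reading. The hierarchy, the decay factor and the cutoff are assumptions for illustration only.

```python
# Hypothetical hierarchy: concept -> directly broader concepts.
BROADER = {
    "Rev": ["serious_church"],
    "serious_church": ["serious"],
    "Capt": ["serious"],
}


def propagate(row, broader, decay=0.5, min_weight=0.05):
    """Propagate weights of a sparse row {feature: weight} to broader concepts.

    Each hop multiplies the weight by `decay`; contributions below
    `min_weight` are discarded, which acts as a crude cost on inferred
    features and bounds the number of non-zeros that are added.
    """
    out = dict(row)
    frontier = list(row.items())
    while frontier:
        concept, weight = frontier.pop()
        for parent in broader.get(concept, ()):
            candidate = weight * decay
            if candidate < min_weight:
                continue  # too weak: keep the row sparse
            if candidate > out.get(parent, 0.0):  # max-combination
                out[parent] = candidate
                frontier.append((parent, candidate))
    return out


print(propagate({"Rev": 0.8}, BROADER))
print(propagate({"Rev": 0.8}, BROADER, min_weight=0.25))
print(propagate({"Rev": 0.8, "Capt": 0.6}, BROADER))
```

Since the weight shrinks on every hop and a concept is only revisited when its value improves, the loop terminates even on cyclic graphs.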
(re)search space: attributes of knowledge representation (KR)
concepts or words (terms)?
- SKOS: (tree of) concepts; terms are modeled as labels
- WordNet: (graph of) synsets of terms (fig.)
graph structure
- tree: T-Box: SKOS, class tree, taxonomy
- OR
- AND
- complex rules
- directed graph: FCA
- not strongly directed (and thus presumably cyclic) graph: http://webisa.webdatacommons.org/
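Which of these graph structures a given knowledge source has can be checked mechanically. The sketch below classifies a broader-relation as tree (strictly: forest), directed acyclic graph, or cyclic graph; the three example graphs are made up and only mimic a taxonomy, an FCA-style lattice and a noisy is-a graph.

```python
def has_cycle(edges):
    """Depth-first search with three colours; GREY marks the current path."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in edges.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY or (state == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color.get(node, WHITE) == WHITE and visit(node) for node in list(edges))


def classify(edges):
    """edges: concept -> list of directly broader concepts."""
    if has_cycle(edges):
        return "cyclic graph"
    if all(len(parents) <= 1 for parents in edges.values()):
        return "tree"  # at most one broader concept per node (a forest)
    return "directed acyclic graph"


# Made-up examples.
taxonomy = {"Mrs": ["female"], "Miss": ["female"], "female": ["person"],
            "Mr": ["male"], "male": ["person"]}
lattice = {"girl": ["female", "young"], "female": ["person"], "young": ["person"]}
noisy_isa = {"car": ["vehicle"], "vehicle": ["machine"], "machine": ["car"]}

for name, graph in [("taxonomy", taxonomy), ("lattice", lattice), ("noisy_isa", noisy_isa)]:
    print(name, "->", classify(graph))
```

The recursive search is fine for small hand-made structures; a large public thesaurus would call for an iterative version.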
KR size
- user-tailored, hand-made based on the headers of the specific dataset: very, very few concepts
- public thesaurus
- https://www.openthesaurus.de/statistics/index: DE: 40k concepts, 120k terms
- https://iate.europa.eu/download-iate TermBase eXchange format (TBX), DE: >700k terms .. relationship to eurovoc.europa.eu (SKOS), http://www.eurotermbank.com/ ?
evaluation criteria
- increase in prediction score (accuracy, F1, ROC etc.)
- complexity of (white box) model
- robustness: graceful degradation, online learning
- ethical values w.r.t. the High-Level Expert Group on Artificial Intelligence: https://ec.europa.eu/digital-single-market/en/high-level-expert-group-artificial-intelligence
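For the first criterion, the score of a baseline model and of a model with enriched features would be compared on the same held-out labels. A minimal sketch of accuracy and F1 for binary labels, on made-up predictions:

```python
def confusion(y_true, y_pred):
    """Counts (tp, fp, fn, tn) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall; 0.0 if there are no positives."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy:", accuracy(y_true, y_pred))
print("f1:", f1(y_true, y_pred))
```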
first results
original data
- -0.4258 Mr
- 0.1221 Mrs
- -0.0809 Rev
- 0.0693 Miss
- -0.0235 Capt
- -0.0235 Don
- -0.0235 Jonkheer
- 0.0219 Mlle
- -0.0133 Dr
- 0.0108 Countess
- 0.0108 Lady
- 0.0108 Mme
- 0.0108 Ms
- 0.0108 Sir
with feature engineering
- -0.3313 Mr
- -0.175 male
- 0.0 female
- 0.0377 female_married
- -0.0192 Rev
- -0.0159 serious_church
- -0.0195 serious
- -0.0044 serious_highness
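The "complexity of (white box) model" criterion can be made tangible on the two coefficient lists above. The sketch counts non-zero coefficients and sums their absolute values; it covers only the coefficients listed here, which may be an excerpt of the full models, so the numbers should be read with that caveat.

```python
# Coefficients copied from the two lists above.
original = {
    "Mr": -0.4258, "Mrs": 0.1221, "Rev": -0.0809, "Miss": 0.0693,
    "Capt": -0.0235, "Don": -0.0235, "Jonkheer": -0.0235, "Mlle": 0.0219,
    "Dr": -0.0133, "Countess": 0.0108, "Lady": 0.0108, "Mme": 0.0108,
    "Ms": 0.0108, "Sir": 0.0108,
}
engineered = {
    "Mr": -0.3313, "male": -0.175, "female": 0.0, "female_married": 0.0377,
    "Rev": -0.0192, "serious_church": -0.0159, "serious": -0.0195,
    "serious_highness": -0.0044,
}


def complexity(coefs, eps=1e-12):
    """Return (number of non-zero coefficients, sum of absolute values)."""
    nonzero = sum(abs(v) > eps for v in coefs.values())
    l1 = sum(abs(v) for v in coefs.values())
    return nonzero, l1


for name, coefs in [("original", original), ("engineered", engineered)]:
    nonzero, l1 = complexity(coefs)
    print(f"{name}: {nonzero} non-zero coefficients, L1 norm {l1:.4f}")

# How much weight moved away from the single feature "Mr".
print("shift on Mr:", round(engineered["Mr"] - original["Mr"], 4))
```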