dagstuhl2019

This page: http://jbusse.de/MA/dagstuhl2019.html

Impulse talk, Dagstuhl 2019 ... Purpose: start a technical discussion with colleagues by (a) presenting a small research idea and (b) discussing first preliminary results ... Goal: gather the colleagues' opinions: is more in-depth research worthwhile here?

Overarching research goal: improving machine learning models, in particular with regard to transparency, understandability, robustness etc ... the explanation component provides important hints about the applicability of a model (complexity, number of variables involved etc.).

the general Q is: How can we improve ML (here: ElasticNet model generation) by adding semantic background knowledge?

ML context

domain of application: high-dimensional, broad datasets (p columns >> n rows)

  • bag of words, tf-idf, classification
  • document retrieval, search query expansion
  • Internet of Things (IoT), sensor data

interesting ML algorithms: https://scikit-learn.org/stable/modules/linear_model.html

my personal favourite: ElasticNet

  • explicitly suitable for bag-of-words models with p >> n and highly correlated attributes
  • fast fit: glmnet in R is a highly optimized, very fast implementation
  • uses a sequence of warm starts, thus suitable for online learning
  • small whitebox models
  • very fast prediction
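
To make the "sequence of warm starts" concrete, here is a minimal pure-Python sketch of coordinate descent for the elastic net objective, fitted along a decreasing penalty path. It is not glmnet itself; the function names, the toy matrix (p = 5 columns, n = 3 rows) and the penalty values are assumptions made for illustration. The parametrisation follows glmnet: `lam` is the overall penalty strength and `alpha` mixes the L1 and L2 parts (scikit-learn's `ElasticNet` calls these `alpha` and `l1_ratio`).

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator; this is what sets coefficients exactly to 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha, beta=None, n_sweeps=200):
    """Coordinate descent for
         1/(2n) * ||y - X b||^2 + lam * (alpha * |b|_1 + (1 - alpha)/2 * |b|_2^2).
    X is a list of rows. No intercept: the data are assumed to be centred.
    `beta` is an optional start vector (the warm start)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p if beta is None else list(beta)
    # residual r = y - X beta for the start vector
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column can never enter the model
            # covariance of column j with the partial residual (without b_j)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def elastic_net_path(X, y, lambdas, alpha):
    """Fit from the strongest to the weakest penalty, reusing the previous
    solution as start vector for the next fit (warm start)."""
    beta, path = None, []
    for lam in sorted(lambdas, reverse=True):
        beta = elastic_net_cd(X, y, lam, alpha, beta=beta)
        path.append((lam, beta))
    return path


# toy data, invented for illustration: column 2 = column 0 + column 1
X = [[1.0, 0.0, 1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0, 0.0],
     [-1.0, -1.0, -2.0, 0.0, 0.0]]
y = [1.0, 0.5, -1.5]

path = elastic_net_path(X, y, lambdas=[10.0, 1.0, 0.1], alpha=0.5)
for lam, beta in path:
    print(lam, [round(b, 3) for b in beta])
```

With the strongest penalty all coefficients stay at zero; as the penalty is relaxed, coefficients enter the model one after another, which is what keeps the resulting whitebox models small.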

good old-fashioned idea

the general Q is: How can we improve ML (here: ElasticNet model generation) by adding semantic background knowledge?

Titanic dataset

facts:

  • survival rate among females much higher than among males
  • age also an important predictor? possibly at least for males?

thus:

  • assume that the attribute "sex" is not given
  • calculate (the probability of) gender based on the feature "Anrede" (salutation)
  • age has many Nulls: use Master / Mr / Miss / Mrs to predict age
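
The three steps above can be sketched as follows. This is only an illustration: the passenger rows are invented (not taken from the Titanic data), the regular expression assumes names of the form "Surname, Title. Given names", the crisp title-to-gender table is an assumption (a probabilistic version would store a probability per title instead), and the median per salutation is just one possible imputation rule.

```python
import re
from statistics import median

# Assumes names of the form "Surname, Title. Given names"
TITLE_RE = re.compile(r",\s*([^.,]+)\.")

# Assumed crisp mapping; titles such as "Dr" are deliberately left out
TITLE_TO_SEX = {"Mr": "male", "Master": "male", "Miss": "female", "Mrs": "female"}


def extract_title(name):
    """Return the salutation ("Anrede") contained in a name, or None."""
    match = TITLE_RE.search(name)
    return match.group(1).strip() if match else None


def impute_ages(rows):
    """Replace a missing age by the median age of passengers with the same title."""
    ages_by_title = {}
    for name, age in rows:
        if age is not None:
            ages_by_title.setdefault(extract_title(name), []).append(age)
    medians = {title: median(ages) for title, ages in ages_by_title.items()}
    return [
        (name, age if age is not None else medians.get(extract_title(name)))
        for name, age in rows
    ]


# invented example rows: (name, age or None)
passengers = [
    ("Doe, Mr. John", 30.0),
    ("Doe, Mr. James", 40.0),
    ("Doe, Mr. Jack", None),
    ("Roe, Master. Tim", 4.0),
    ("Roe, Master. Tom", None),
    ("Roe, Miss. Ann", 20.0),
    ("Roe, Mrs. Jane (Mary Poe)", 35.0),
]

imputed = impute_ages(passengers)
for name, age in imputed:
    title = extract_title(name)
    print(title, TITLE_TO_SEX.get(title), age)
```

A title for which no age is known at all keeps its missing value (`None`), so the sketch never invents an age without evidence.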

The curse (or rather: praise) of dimensionality: Use inferencing to enrich the feature set and to propagate values. Also allow for adding new features (single, married etc.).

"ontology"

"inferencing"

logical, crisp (boolean)

  • today: SKOS-inferencing: "female OR young"
  • tomorrow: "female AND young"
  • at maximum: generic RDFS inferencing
  • NOT: OWL, F-logic
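
A minimal sketch of the crisp, SKOS-style "OR" case: a concept is switched on as soon as any of its narrower concepts is on, which amounts to taking the transitive closure over a broader relation. The concept names are the feature names from the results further down; the hierarchy itself (which concept is broader than which) is an assumption made for this example, as is the name `BROADER`.

```python
# Assumed toy hierarchy: concept -> list of directly broader concepts
BROADER = {
    "Mr": ["male"],
    "Master": ["male"],
    "Miss": ["female"],
    "Mrs": ["female_married"],
    "female_married": ["female"],
    "Rev": ["serious_church"],
    "serious_church": ["serious"],
    "Countess": ["serious_highness"],
    "serious_highness": ["serious"],
}


def enrich(features, broader=BROADER):
    """Return the given features plus all transitively broader concepts.

    A broader concept is set if at least one narrower concept is set (OR)."""
    closed = set(features)
    stack = list(features)
    while stack:
        concept = stack.pop()
        for parent in broader.get(concept, ()):
            if parent not in closed:
                closed.add(parent)
                stack.append(parent)
    return closed


print(sorted(enrich({"Mrs"})))
print(sorted(enrich({"Rev"})))
```

Features that do not occur in the hierarchy pass through unchanged. The "AND" case would need rules with more than one premise and is not covered by this sketch.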

value propagation (sort of "numeric inferencing")

  • tf-idf?
  • probabilistic, possibilistic?

side conditions

  • sparse input? preserve sparsity!
  • "regularized" inferencing? impose cost function?
  • very, very fast
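
One way to read value propagation together with the side conditions above, as a sketch only: feature vectors are sparse dicts, a value is passed to broader concepts with a decay factor, values from several sources are combined by maximum (a possibilistic flavour), and anything below a threshold is dropped so that the output stays sparse. The hierarchy, the decay of 0.5 and the threshold of 0.05 are assumptions, not results.

```python
# Assumed toy hierarchy: concept -> list of directly broader concepts
BROADER = {
    "Mrs": ["female_married"],
    "female_married": ["female"],
    "Miss": ["female"],
    "Rev": ["serious_church"],
    "serious_church": ["serious"],
}


def propagate(values, broader=BROADER, decay=0.5, min_weight=0.05):
    """Propagate sparse feature values (dict: concept -> weight) upwards.

    Each step multiplies the weight by `decay`; a broader concept keeps the
    maximum of all values that reach it. Values below `min_weight` are not
    passed on, which acts as a crude cost on long inference chains and
    preserves sparsity."""
    out = dict(values)
    stack = list(values.items())
    while stack:
        concept, weight = stack.pop()
        passed = weight * decay
        if passed < min_weight:
            continue
        for parent in broader.get(concept, ()):
            if passed > out.get(parent, 0.0):
                out[parent] = passed
                stack.append((parent, passed))
    return out


print(propagate({"Rev": 0.8}))
print(propagate({"Mrs": 0.15}))
```

Because a value is only stored when it strictly increases the current one, the loop also terminates on graphs that contain cycles.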

(re)search space: attributes of knowledge representation (KR)

concepts or words (terms)?

  • SKOS: (tree of) concepts; terms are modeled as labels
  • WordNet: (graph of) synsets of terms (fig.)

graph structure

  • tree: T-Box: SKOS, class tree, taxonomy
    • OR
    • AND
    • complex rules
  • directed graph: FCA
  • not strongly directed (and thus presumably cyclic) graph: http://webisa.webdatacommons.org/

KR size

evaluation criteria

first results

original data

  • -0.4258 Mr
  • 0.1221 Mrs
  • -0.0809 Rev
  • 0.0693 Miss
  • -0.0235 Capt
  • -0.0235 Don
  • -0.0235 Jonkheer
  • 0.0219 Mlle
  • -0.0133 Dr
  • 0.0108 Countess
  • 0.0108 Lady
  • 0.0108 Mme
  • 0.0108 Ms
  • 0.0108 Sir

with feature engineering

  • -0.3313 Mr
  • -0.175 male
  • 0.0 female
  • 0.0377 female_married
  • -0.0192 Rev
  • -0.0159 serious_church
  • -0.0195 serious
  • -0.0044 serious_highness