Data Science: Selected Resources

This page comments in detail on a selection of books, tutorials, and other online resources that stood out to me in the course of my research.

Abbreviations such as "E1" and "L2" are explained elsewhere.

Detailed, good explanations of the big picture

Bowles

Michael Bowles: Machine learning in Python: essential techniques for predictive analysis. Wiley 2015

Library: https://opac.haw-landshut.de/search?bvnr=BV043397686 | PDF: https://bibaccess.fh-landshut.de:3159/doi/book/10.1002/9781119183600

Assessment: Dozens of books on data science have been published in the last 2-3 years. I like Bowles's book very much because it uses two important models to show in detail the path from the data to a quality-assured model.

That the code examples are still written in Python 2.7 is immaterial, since the code is only meant to be illustrative anyway. In practice one uses libraries rather than Bowles's code. Part of the course is to re-implement the code from Bowles in a few lines by drawing on MLPC.
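To give an impression of what "a few lines" with a library can look like, here is a minimal sketch using scikit-learn. The choice of a penalized linear model (ElasticNetCV), the synthetic data, and all variable names are assumptions made for illustration; none of this is taken from Bowles or from MLPC.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical): 200 samples, 5 features,
# only the first two features influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out a test set so that model quality is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# ElasticNetCV chooses the penalty strength by internal cross-validation.
model = ElasticNetCV(cv=5, random_state=0)
model.fit(X_train, y_train)

r2_test = model.score(X_test, y_test)  # coefficient of determination R^2
print(f"chosen alpha: {model.alpha_:.4f}, R^2 on test data: {r2_test:.3f}")
```

The held-out test score stands in for the "quality-assured" part: the model is judged on data it has not seen during fitting.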

MLPC

Chris Albon: Machine Learning with Python Cookbook. O'Reilly 2018

https://chrisalbon.com/

Also available in German, O'Reilly 2019

We had several copies acquired for the HAW Landshut library specifically for this course: EN, DE

python-data-science-handbook

Jake VanderPlas: Python Data Science Handbook.

https://jakevdp.github.io/PythonDataScienceHandbook/

The content is available on GitHub in the form of Jupyter notebooks: https://github.com/jakevdp/PythonDataScienceHandbook

German edition from the library: Data Science mit Python: Das Handbuch für den Einsatz von IPython, Jupyter, NumPy, Pandas, Matplotlib, Scikit-Learn.

Process model: CRISP-DM

CRISP-DM

CRISP-DM: Cross-industry standard process for data mining

CRISP-DM 1.0: Step-by-step data mining guide. Pete Chapman (NCR), Julian Clinton (SPSS), Randy Kerber (NCR), Thomas Khabaza (SPSS), Thomas Reinartz (DaimlerChrysler), Colin Shearer (SPSS), and Rüdiger Wirth (DaimlerChrysler). © 2000 SPSS Inc. CRISPMWP-1104

https://www.the-modeling-agency.com/crisp-dm.pdf (Moodle)

Python

Python is a very elegant, modern language that one wants to learn anyway.

(There is no way around Python. R used to be important at times, but it is increasingly receding in the day-to-day practice of computer scientists. Please form your own opinion, though: search for "R versus Python".)

python-whirlwind

Jake VanderPlas: A Whirlwind Tour of Python, O’Reilly 2016. ISBN 978-1-491-96465-1

As PDF: https://jakevdp.github.io/WhirlwindTourOfPython/ > "The content is also available [...] from O'Reilly site as a free e-book or free pdf": http://www.oreilly.com/programming/free/files/a-whirlwind-tour-of-python.pdf

Code on GitHub: https://github.com/jakevdp/WhirlwindTourOfPython | https://jakevdp.github.io/WhirlwindTourOfPython/

License: CC0, i.e. almost any (!) reuse is permitted

Python for those switching over who have already learned Java or C:

python-learnxinyminutes

Learn X in Y minutes, where X=python3. https://learnxinyminutes.com/docs/python3/

JB: Annotated Python code, short and to the point.

python-for-java-developers

Python Primer for Java Developers. https://lobster1234.github.io/2017/05/25/python-java-primer/

JB: good for quickly learning to switch over.
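For readers coming from Java or C, the following short sketch shows a few idioms that tend to look unfamiliar at first. The data and all names are hypothetical and only serve as illustration; the example is not taken from either of the two resources above.

```python
# Hypothetical example data: course names mapped to credit points.
credits = {"Data Science": 5, "Databases": 5, "Statistics": 7}


# No type declarations, no class wrapper, no semicolons:
# a function is defined at module level with `def`.
def total_credits(courses):
    """Return the sum of all credit points in the mapping."""
    return sum(courses.values())


# A list comprehension replaces the typical Java for-loop with an if inside.
large_courses = [name for name, points in credits.items() if points > 5]

# enumerate() yields index and element together, so no manual counter is needed.
numbered = [f"{i}: {name}" for i, name in enumerate(sorted(credits), start=1)]

print(total_credits(credits))
print(large_courses)
print(numbered)
```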

The libraries: NumPy, pandas, scikit-learn

Data science in Python consists above all in detailed knowledge of the libraries pandas and scikit-learn: (a) knowing selected details by example, and (b) knowing how the documentation itself is organized, so that things can be looked up quickly.

pandas

http://pandas.pydata.org/pandas-docs/stable/.

scikit-learn

https://scikit-learn.org/stable/.

pandas-cheat-sheet

https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf.

Search for "pandas cheat sheet" to find, for example, the overview at https://pbpython.com/pages/resources.html#cheat-sheets.

In principle NumPy is important as well, above all its ndarray data structure. However, we learn NumPy more as a supplement, by looking things up as needed.

numpy: https://www.numpy.org/.
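As a first orientation, the following minimal sketch shows what working with an ndarray looks like. The values and variable names are made up for illustration.

```python
import numpy as np

# A 2-D ndarray: 3 rows (samples) x 2 columns (features); values are made up.
a = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

print(a.shape)   # (rows, columns)
print(a.dtype)   # one common element type for the whole array

# Vectorized arithmetic: the operation is applied element by element.
scaled = a * 10

# Aggregation along an axis: axis=0 collapses the rows, one mean per column.
col_means = a.mean(axis=0)

# Boolean masking selects the elements that satisfy a condition.
large = a[a > 3.0]

print(scaled)
print(col_means)
print(large)
```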

The communities behind the major libraries themselves provide many practice-oriented, didactically prepared introductions to their libraries.

scipy-lectures-scikit-learn

http://www.scipy-lectures.org/packages/scikit-learn/index.html#hyperparameter-optimization-with-cross-validation.
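The link above points to a section on hyperparameter optimization with cross-validation. A minimal sketch of that idea with scikit-learn's GridSearchCV might look as follows. The classifier, the Iris dataset, and the candidate values are assumptions chosen for illustration and are not taken from the linked lecture.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for the hyperparameter n_neighbors (chosen arbitrarily).
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# Each candidate is evaluated with 5-fold cross-validation on the data.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refitted on all the data with the best parameter value found.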

tom-augspurger

Tom Augspurger: Effective Pandas.

https://github.com/TomAugspurger/effective-pandas.

pandas is certainly the most important foundation. One interesting resource is, for example, the learning-path recommendation "How to Learn Pandas" by Ted Petrou.
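To illustrate the kind of pandas operations meant here, the following is a minimal sketch with toy data. The column names and values are invented and do not come from any of the resources listed above.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "length": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Boolean indexing: keep only the rows that satisfy a condition.
long_rows = df[df["length"] > 2.5]

# Split-apply-combine: group by a column, then aggregate each group.
mean_by_species = df.groupby("species")["length"].mean()

print(long_rows)
print(mean_by_species)
```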

Software

In my courses we work with a preinstalled Ubuntu virtual machine running under Oracle VirtualBox, see http://jbusse.de/2019_ws_dsci/dsci-lab.html, in which the following software is already installed and configured. There is therefore no need to install the following software on your own Windows computer!

anaconda

Download: https://www.anaconda.com/download/

Anaconda ships with Jupyter Notebook and JupyterLab.

miniconda

Download: https://docs.conda.io/en/latest/miniconda.html

Installation: https://conda.io/projects/conda/en/latest/user-guide/install/linux.html

We work with Python 3.x (currently 3.7).

jupytext

https://github.com/mwouts/jupytext

Overview: https://towardsdatascience.com/introducing-jupytext-9234fdff6c57

pandoc

https://pandoc.org/

/misc

Many authors, public organizations (universities), and private education providers offer tutorials, some of them very extensive, on all aspects of data science.

Tutorials for people who like to learn with tutorials:

L1: python-datacamp

https://www.datacamp.com/courses/intro-to-python-for-data-science

4 hours | 11 videos | 57 exercises | (registration required)

JB: 4 hours = one afternoon. Why not give it a try?

A nice beginner-level Python tutorial that also introduces NumPy.

machine-learning-tutorial-python-introduction

https://pythonprogramming.net/machine-learning-tutorial-python-introduction/ | 66 sections | table TRT on Moodle

data-analysis-python-pandas-tutorial-introduction

https://pythonprogramming.net/data-analysis-python-pandas-tutorial-introduction/ | 16 sections, 15 min per section

w3resource-python-exercises

Interactive self-assessment exercises at w3resource

https://www.w3resource.com/python-exercises/numpy/index.php, in particular: NumPy Basic [40 exercises with solution], NumPy arrays [100 exercises with solution]

https://www.w3resource.com/python-exercises/pandas/index.php, in particular: Python Pandas Data Series [15 exercises with solution], Python Pandas DataFrame [63 exercises with solution]

Platforms and communities

kaggle

https://www.kaggle.com/

(Kaggle is a Google company.)

google-datalab

https://cloud.google.com/datalab/

kdnuggets

https://www.kdnuggets.com/

analyticsvidhya

https://www.analyticsvidhya.com/

arxiv-sanity

http://www.arxiv-sanity.com/

github

Standard platform that every computer scientist needs to know.

Other online tutorials

python-codeacademy

https://www.codecademy.com/learn/learn-python

Recommended by Kaggle, but ranked lower by us because the learning path is too rigidly prescribed.

git-book

Scott Chacon, Ben Straub: Pro Git book. Apress 2014.

https://git-scm.com/book/de/v2

Glossaries

analyticsvidhya-machine-learning-glossary: https://www.analyticsvidhya.com/glossary-of-common-statistics-and-machine-learning-terms/
google-machine-learning-glossary: https://developers.google.com/machine-learning/glossary/
datascienceglossary.org: http://www.datascienceglossary.org/

Curricula

edison

Edison Curriculum Data Science

http://edison-project.eu/data-science-model-curriculum-mc-ds

CRISP-DM Model

Phases, Tasks, Outputs

For better readability, the following table reproduces in digital form the text of Figure 3: Generic tasks (bold) and outputs (italic) of the CRISP-DM reference model.

Phase	Task	Output
Business Understanding	Determine Business Objectives	Background
		Business Objectives
		Business Success Criteria
	Assess Situation	Inventory of Resources
		Requirements, Assumptions, and Constraints
		Risks and Contingencies
		Terminology
		Costs and Benefits
	Determine Data Mining Goals	Data Mining Goals
		Data Mining Success Criteria
Data Understanding	Collect Initial Data	Initial Data Collection Report
	Describe Data	Data Description Report
	Explore Data	Data Exploration Report
	Verify Data Quality	Data Quality Report
Data Preparation	Select Data	Rationale for Inclusion/Exclusion
	Clean Data	Data Cleaning Report
	Construct Data	Derived Attributes
		Generated Records
	Integrate Data	Merged Data
	Format Data	Reformatted Data
	Dataset	Dataset Description
Modeling	Select Modeling Techniques	Modeling Technique
		Modeling Assumptions
	Generate Test Design	Test Design
	Build Model	Parameter Settings
		Models
		Model Descriptions
	Assess Model	Model Assessment
		Revised Parameter Settings
Evaluation	Evaluate Results	Assessment of Data Mining Results w.r.t. Business Success Criteria
		Approved Models
	Review Process	Review of Process
	Determine Next Steps	List of Possible Actions
		Decision
Deployment	Plan Deployment	Deployment Plan
	Plan Monitoring and Maintenance	Monitoring and Maintenance Plan
	Produce Final Report	Final Report
		Final Presentation
	Review Project	Experience Documentation