{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GdW KW 46\n", "\n", "Beispiele für einfache Python-Konstrukte.\n", "* tips.csv aus \n", "\n", "Wir lesen den Datensatz tips.csv lediglich ein, um für Fingerübungen eine Quelle für ein paar Datenstrukturen zu haben." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bibliotheken und Daten laden" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Falls in der folgenden Zelle Fehler auftreten, ggf. vorher im Terminal ausführen:\n", " \n", "```sh\n", "conda install pandas numpy matplotlib\n", "conda install seaborn\n", "```\n", "\n", "für seaborn siehe auch :\n", "\n", "```sh\n", "pip install seaborn\n", "```\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from pandas import Series, DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wir laden den Datensatz explizit aus der (hier von Kaggle bezogenen) csv-Datei:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('tips.csv')\n", "type(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Der Tips-Datensatz wäre auch Teil von Seaborn:)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = sns.load_dataset(\"tips\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explorative Datenanalyse (EDA)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "1 10.34 1.66 Male No Sun Dinner 3\n", "2 21.01 3.50 Male No Sun Dinner 3" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(3)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesize
count244.000000244.000000244244244244244.000000
uniqueNaNNaN2242NaN
topNaNNaNMaleNoSatDinnerNaN
freqNaNNaN15715187176NaN
mean19.7859432.998279NaNNaNNaNNaN2.569672
std8.9024121.383638NaNNaNNaNNaN0.951100
min3.0700001.000000NaNNaNNaNNaN1.000000
25%13.3475002.000000NaNNaNNaNNaN2.000000
50%17.7950002.900000NaNNaNNaNNaN2.000000
75%24.1275003.562500NaNNaNNaNNaN3.000000
max50.81000010.000000NaNNaNNaNNaN6.000000
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "count 244.000000 244.000000 244 244 244 244 244.000000\n", "unique NaN NaN 2 2 4 2 NaN\n", "top NaN NaN Male No Sat Dinner NaN\n", "freq NaN NaN 157 151 87 176 NaN\n", "mean 19.785943 2.998279 NaN NaN NaN NaN 2.569672\n", "std 8.902412 1.383638 NaN NaN NaN NaN 0.951100\n", "min 3.070000 1.000000 NaN NaN NaN NaN 1.000000\n", "25% 13.347500 2.000000 NaN NaN NaN NaN 2.000000\n", "50% 17.795000 2.900000 NaN NaN NaN NaN 2.000000\n", "75% 24.127500 3.562500 NaN NaN NaN NaN 3.000000\n", "max 50.810000 10.000000 NaN NaN NaN NaN 6.000000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "EDA und so weiter: in höheren Semestern, in Mathematik etc.: Siehe z.B. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Übung KW 45 GdW\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Spaltennamen = df.columns.tolist()\n", "Spaltennamen" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "list" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Rechnung = df['total_bill'].tolist()\n", "display(type(Rechnung))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88, 15.04, 14.78, 10.27, 35.26, 15.42, 18.43, 14.83, 21.58, 10.33, 16.29, 16.97, 20.65, 17.92, 20.29, 15.77, 39.42, 19.82, 17.81, 13.37, 12.69, 21.7, 19.65, 9.55, 18.35, 15.06, 20.69, 17.78, 24.06, 16.31, 16.93, 18.69, 31.27, 16.04, 17.46, 13.94, 9.68, 30.4, 18.29, 22.23, 32.4, 28.55, 18.04, 12.54, 10.29, 34.81, 9.94, 25.56, 19.49, 38.01, 26.41, 11.24, 48.27, 20.29, 13.81, 11.02, 18.29, 17.59, 20.08, 16.45, 3.07, 20.23, 15.01, 12.02, 17.07, 26.86, 25.28, 14.73, 10.51, 17.92, 27.2, 22.76, 17.29, 19.44, 16.66, 10.07, 32.68, 15.98, 34.83, 13.03, 18.28, 24.71, 21.16, 28.97, 22.49, 5.75, 16.32, 22.75, 40.17, 27.28, 12.03, 21.01, 12.46, 11.35, 15.38, 44.3, 22.42, 20.92, 15.36, 20.49, 25.21, 18.24, 14.31, 14.0, 7.25, 38.07, 23.95, 25.71, 17.31, 29.93, 10.65, 12.43, 24.08, 11.69, 13.42, 14.26, 15.95, 12.48, 29.8, 8.52, 14.52, 11.38, 22.82, 19.08, 20.27, 11.17, 12.26, 18.26, 8.51, 10.33, 14.15, 16.0, 13.16, 17.47, 34.3, 41.19, 27.05, 16.43, 8.35, 18.64, 11.87, 9.78, 7.51, 14.07, 13.13, 17.26, 24.55, 19.77, 29.85, 48.17, 25.0, 13.39, 16.49, 21.5, 12.66, 16.21, 13.81, 17.51, 24.52, 20.76, 31.71, 10.59, 10.63, 50.81, 15.81, 7.25, 31.85, 16.82, 32.9, 17.89, 14.48, 9.6, 34.63, 34.65, 23.33, 45.35, 23.17, 40.55, 20.69, 20.9, 30.46, 18.15, 23.1, 15.69, 19.81, 28.44, 15.48, 16.58, 7.56, 10.34, 43.11, 13.0, 13.51, 18.71, 12.74, 13.0, 16.4, 20.53, 16.47, 26.59, 38.73, 24.27, 12.76, 30.06, 25.89, 48.33, 13.27, 28.17, 12.9, 28.15, 11.59, 7.74, 30.14, 12.16, 13.42, 8.58, 15.98, 13.42, 16.27, 10.09, 20.45, 13.28, 22.12, 24.01, 15.69, 11.61, 10.77, 15.53, 10.07, 12.6, 32.83, 35.83, 29.03, 27.18, 22.67, 17.82, 18.78]\n" ] } ], "source": [ "print(Rechnung)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Aufgabe: Wie viele Rechnungen haben wir?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "244" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "XXXXXX" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Aufgabe: Die Rechnungswerte sortiert" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[3.07, 5.75, 7.25, 7.25, 7.51, 7.56, 7.74, 8.35, 8.51, 8.52, 8.58, 8.77, 9.55, 9.6, 9.68, 9.78, 9.94, 10.07, 10.07, 10.09, 10.27, 10.29, 10.33, 10.33, 10.34, 10.34, 10.51, 10.59, 10.63, 10.65, 10.77, 11.02, 11.17, 11.24, 11.35, 11.38, 11.59, 11.61, 11.69, 11.87, 12.02, 12.03, 12.16, 12.26, 12.43, 12.46, 12.48, 12.54, 12.6, 12.66, 12.69, 12.74, 12.76, 12.9, 13.0, 13.0, 13.03, 13.13, 13.16, 13.27, 13.28, 13.37, 13.39, 13.42, 13.42, 13.42, 13.51, 13.81, 13.81, 13.94, 14.0, 14.07, 14.15, 14.26, 14.31, 14.48, 14.52, 14.73, 14.78, 14.83, 15.01, 15.04, 15.06, 15.36, 15.38, 15.42, 15.48, 15.53, 15.69, 15.69, 15.77, 15.81, 15.95, 15.98, 15.98, 16.0, 16.04, 16.21, 16.27, 16.29, 16.31, 16.32, 16.4, 16.43, 16.45, 16.47, 16.49, 16.58, 16.66, 16.82, 16.93, 16.97, 16.99, 17.07, 17.26, 17.29, 17.31, 17.46, 17.47, 17.51, 17.59, 17.78, 17.81, 17.82, 17.89, 17.92, 17.92, 18.04, 18.15, 18.24, 18.26, 18.28, 18.29, 18.29, 18.35, 18.43, 18.64, 18.69, 18.71, 18.78, 19.08, 19.44, 19.49, 19.65, 19.77, 19.81, 19.82, 20.08, 20.23, 20.27, 20.29, 20.29, 20.45, 20.49, 20.53, 20.65, 20.69, 20.69, 20.76, 20.9, 20.92, 21.01, 21.01, 21.16, 21.5, 21.58, 21.7, 22.12, 22.23, 22.42, 22.49, 22.67, 22.75, 22.76, 22.82, 23.1, 23.17, 23.33, 23.68, 23.95, 24.01, 24.06, 24.08, 24.27, 24.52, 24.55, 24.59, 24.71, 25.0, 25.21, 25.28, 25.29, 25.56, 25.71, 25.89, 26.41, 26.59, 26.86, 26.88, 27.05, 27.18, 27.2, 27.28, 28.15, 28.17, 28.44, 28.55, 28.97, 29.03, 29.8, 29.85, 29.93, 30.06, 30.14, 30.4, 30.46, 31.27, 31.71, 31.85, 32.4, 32.68, 32.83, 32.9, 34.3, 34.63, 34.65, 34.81, 34.83, 35.26, 35.83, 38.01, 38.07, 38.73, 39.42, 40.17, 40.55, 41.19, 43.11, 44.3, 45.35, 48.17, 48.27, 48.33, 50.81]\n" ] } ], "source": [ "RechnungSortiert = XXXXXXXXXX\n", "print(RechnungSortiert)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Aufgabe: Drucke die kleinsten (resp. größten) 5 Beträge aus RechnungSortiert" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[3.07, 5.75, 7.25, 7.25, 7.51]\n", "[45.35, 48.17, 48.27, 48.33, 50.81]\n" ] } ], "source": [ "print(XXXXXXX)\n", "print(XXXXXXX)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basis-Statistiken\n", "\n", "Berechne für die Liste `Rechnung`:\n", "\n", "* Mittelwert (auch Erwartungswert, Schwerpunkt der Verteilung): Summe aller Rechnungen geteilt durch Anzahl\n", "* Median: nicht der Mittelwert der Liste, sondern der Wert des mittleren Elements der sortierten Liste\n", "\n", "* durchschnittliches Abweichungsquadrat (auch Stichprobenvarianz, empirische Varianz): Durchschnitt aller quadrierten Abweichungen vom Mittelwert, \n", "* Standardabweichung: Quadratwurzel der Varianz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Aufgabe: Mittelwert\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "19.785942622950824" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Mittelwert = XXXXXXXXXXX\n", "Mittelwert" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Aufgabe: Median" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17.81" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Median = XXXXXXXXX\n", "Median" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Aufgabe: Varianz" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "78.92813148851113" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "SummeQuadrierteAbweichungen = 0\n", "for XXXXXXXX :\n", " XXXXXX \n", " \n", "Varianz = XXXXXX\n", "display(Varianz)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Aufgabe: Standardabweichung" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8.88415057777113" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Standardabweichung = XXXXXXX\n", "display(Standardabweichung)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vergleichen Sie ihre Ergebnisse mit der Ausgabe von `df.describe()` oben: Was fällt Ihnen auf?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Aufgabe: Funktion `Basisstatistiken`\n", "\n", "Definiere eine Funktion `Basisstatistiken`, die den Mittelwert und die Standardabweichung einer Liste von Zahlen zurückgibt.\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def Basisstatistiken(Zahlenliste):\n", " ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Umsätze nach Wochentagen" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Fri', 'Sat', 'Sun', 'Thur'}" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SpalteWochentage = df.day.tolist()\n", "Wochentage = set(SpalteWochentage)\n", "Wochentage" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'Sat': {'Umsaetze': [20.65, 17.92, 20.29, 15.77, 39.42, 19.82, 17.81, 13.37, 12.69, 21.7, 19.65, 9.55, 18.35, 15.06, 20.69, 17.78, 24.06, 16.31, 16.93, 18.69, 31.27, 16.04, 38.01, 26.41, 11.24, 48.27, 20.29, 13.81, 11.02, 18.29, 17.59, 20.08, 16.45, 3.07, 20.23, 15.01, 12.02, 17.07, 26.86, 25.28, 14.73, 10.51, 17.92, 44.3, 22.42, 20.92, 15.36, 20.49, 25.21, 18.24, 14.31, 14.0, 7.25, 10.59, 10.63, 50.81, 15.81, 26.59, 38.73, 24.27, 12.76, 30.06, 25.89, 48.33, 13.27, 28.17, 12.9, 28.15, 11.59, 7.74, 30.14, 20.45, 13.28, 22.12, 24.01, 15.69, 11.61, 10.77, 15.53, 10.07, 12.6, 32.83, 35.83, 29.03, 27.18, 22.67, 17.82]}, 'Fri': {'Umsaetze': [28.97, 22.49, 5.75, 16.32, 22.75, 40.17, 27.28, 12.03, 21.01, 12.46, 11.35, 15.38, 12.16, 13.42, 8.58, 15.98, 13.42, 16.27, 10.09]}, 'Thur': {'Umsaetze': [27.2, 22.76, 17.29, 19.44, 16.66, 10.07, 32.68, 15.98, 34.83, 13.03, 18.28, 24.71, 21.16, 10.65, 12.43, 24.08, 11.69, 13.42, 14.26, 15.95, 12.48, 29.8, 8.52, 14.52, 11.38, 22.82, 19.08, 20.27, 11.17, 12.26, 18.26, 8.51, 10.33, 14.15, 16.0, 13.16, 17.47, 34.3, 41.19, 27.05, 16.43, 8.35, 18.64, 11.87, 9.78, 7.51, 19.81, 28.44, 15.48, 16.58, 7.56, 10.34, 43.11, 13.0, 13.51, 18.71, 12.74, 13.0, 16.4, 20.53, 16.47, 18.78]}, 'Sun': {'Umsaetze': [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88, 15.04, 14.78, 10.27, 35.26, 15.42, 18.43, 14.83, 21.58, 10.33, 16.29, 16.97, 17.46, 13.94, 9.68, 30.4, 18.29, 22.23, 32.4, 28.55, 18.04, 12.54, 10.29, 34.81, 9.94, 25.56, 19.49, 38.07, 23.95, 25.71, 17.31, 29.93, 14.07, 13.13, 17.26, 24.55, 19.77, 29.85, 48.17, 25.0, 13.39, 16.49, 21.5, 12.66, 16.21, 13.81, 17.51, 24.52, 20.76, 31.71, 7.25, 31.85, 16.82, 32.9, 17.89, 14.48, 9.6, 34.63, 34.65, 23.33, 45.35, 23.17, 40.55, 20.69, 20.9, 30.46, 18.15, 23.1, 15.69]}}\n" ] } ], "source": [ "UmsaetzeWochentage = { day: {'Umsaetze': df.loc[df.day==day].total_bill.tolist() }\n", " for day in Wochentage }\n", "print(UmsaetzeWochentage)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`UmsaetzeWochentage` ist ein Dictionary, dessen Werte selbst wiederum aus einem Dictionary und Listen bestehen.\n", "\n", "Gesucht: Ergänzung dieser Datenstruktur um `durchschnittlicherTagesumsatz` für jeden Tag.\n", "\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "for Tag in UmsaetzeWochentage:\n", " TagUmsaetze = UmsaetzeWochentage[Tag]['Umsaetze']\n", " UmsaetzeWochentage[Tag]['durchschnittlicherTagesumsatz'] = round(sum(TagUmsaetze) / len(TagUmsaetze), 2)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'Umsaetze': [28.97, 22.49, 5.75, 16.32, 22.75, 40.17, 27.28, 12.03, 21.01, 12.46, 11.35, 15.38, 12.16, 13.42, 8.58, 15.98, 13.42, 16.27, 10.09], 'durchschnittlicherTagesumsatz': 17.15}\n" ] } ], "source": [ "print(UmsaetzeWochentage['Fri'])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17.15" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "UmsaetzeWochentage['Fri']['durchschnittlicherTagesumsatz']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }