{ "cells": [ { "cell_type": "markdown", "id": "89a88ad1", "metadata": {}, "source": [ "# Runde 1b: Verschachtelte Datenstrukturen mit Tiefe 2\n", "\n", "Darum geht es: In {doc}`e_r1` haben wir die wichtigsten der grundlegenden Python-Konstrukte kennengelernt.\n", "\n", "Wir bleiben mit den Sprachmitteln in Runde 1 und wenden sie auf etwas komplexere Datenstrukturen an: Listen (oder Dicts etc.), die Listen oder Dicts etc. enthalten. Mit solch einer „verschachtelten“ Datenstruktur können wir z.B. Excel-Tabellen darstellen (und natürlich auch importieren und exportieren).\n", "\n", "Ziel ist es, ein erstes einfaches Programm zu schreiben, in dem algorithmisch mehrere Schritte aufeinander folgen.\n", "\n", "Dieses Notebook präsentiert zuerst das Problem, und zeigt dann akribisch Schritt für Schritt, wie man sich als Programmier-Anfänger in Runde 1 der Problemlösung nähert." ] }, { "cell_type": "markdown", "id": "1218b505", "metadata": {}, "source": [ "## Melbourne Housing\n", "\n", "Wir lesen die ersten `n` Zeilen eines gut bekannten Original-Datensatzes aus der Wirtschaftsinformatik als Pandas Dataframe ein.\n", "\n", "Download:\n", "* {download}`melb_data.csv`\n", "* Wohin speichern? Am einfachsten in das selbe Verzeichnis legen, in dem dieses Notebook liegt.\n", "\n", "Über diesen Datensatz:\n", "* mögliche Primärquelle (?): https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market . \n", "* Beschreibung z.B. auch hier \n", "* Ein Snapshot des Datensatzes wird auch verwendet in dem Schulungsmodul . \n", "\n", "Doku Pandas: \n", "* " ] }, { "cell_type": "code", "execution_count": 1, "id": "5f23d280", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3, 6)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "nrows = 3\n", "usecols = [ 'Landsize', 'Rooms', 'Price', 'Date', 'Address', 'Type']\n", "melb_df = pd.read_csv( \"./melb_data.csv\", \n", " usecols = usecols,\n", " nrows = nrows # eigentlich sind es 13580 Datensätze ;-)\n", " )\n", "\n", "# Was haben wir eingelesen? # Anzahl Zeilen, Spalten\n", "melb_df.shape" ] }, { "cell_type": "code", "execution_count": 2, "id": "d791ff33", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AddressRoomsTypePriceDateLandsize
085 Turner St2h1480000.03/12/2016202.0
125 Bloomburg St2h1035000.04/02/2016156.0
25 Charles St3h1465000.04/03/2017134.0
\n", "
" ], "text/plain": [ " Address Rooms Type Price Date Landsize\n", "0 85 Turner St 2 h 1480000.0 3/12/2016 202.0\n", "1 25 Bloomburg St 2 h 1035000.0 4/02/2016 156.0\n", "2 5 Charles St 3 h 1465000.0 4/03/2017 134.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "melb_df" ] }, { "cell_type": "markdown", "id": "4ecb81d9", "metadata": {}, "source": [ "Mit Pandas Dataframes zu arbeiten kann in der ersten Runde Python-101 nicht Inhalt sein. Aber man kann solch ein DataFrame in Datenstrukturen überführen, die für eine Einführung in Python 101 sehr instruktuv sind. \n", "\n", "* Doku: \n", "\n", "Die einfachste alle Darstellungen verwirft die Spaltenüberschriften völlig:" ] }, { "cell_type": "code", "execution_count": 3, "id": "e985f35b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['85 Turner St', 2, 'h', 1480000.0, '3/12/2016', 202.0],\n", " ['25 Bloomburg St', 2, 'h', 1035000.0, '4/02/2016', 156.0],\n", " ['5 Charles St', 3, 'h', 1465000.0, '4/03/2017', 134.0]]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# jede Zeile eine Liste\n", "melb_df.values.tolist()" ] }, { "cell_type": "markdown", "id": "714e2e7a", "metadata": {}, "source": [ "Informationserhaltende Darstellungen nutzen Mischungen aus Listen und dicts.\n", "\n", "Aufgabe: Vergegenwärtigen Sie sich für `list`, `dict` und `index`, ins welche konkrete Datenstruktur die abstrakte Datenstruktur „Tabelle“ jeweils überführt wird!" ] }, { "cell_type": "code", "execution_count": 4, "id": "65bc2860", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Address': ['85 Turner St', '25 Bloomburg St', '5 Charles St'],\n", " 'Rooms': [2, 2, 3],\n", " 'Type': ['h', 'h', 'h'],\n", " 'Price': [1480000.0, 1035000.0, 1465000.0],\n", " 'Date': ['3/12/2016', '4/02/2016', '4/03/2017'],\n", " 'Landsize': [202.0, 156.0, 134.0]}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# jede Spalte eine Liste\n", "melb_orient_list = melb_df.to_dict(orient='list')\n", "melb_orient_list" ] }, { "cell_type": "code", "execution_count": 5, "id": "fd1ae2c9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Address': {0: '85 Turner St', 1: '25 Bloomburg St', 2: '5 Charles St'},\n", " 'Rooms': {0: 2, 1: 2, 2: 3},\n", " 'Type': {0: 'h', 1: 'h', 2: 'h'},\n", " 'Price': {0: 1480000.0, 1: 1035000.0, 2: 1465000.0},\n", " 'Date': {0: '3/12/2016', 1: '4/02/2016', 2: '4/03/2017'},\n", " 'Landsize': {0: 202.0, 1: 156.0, 2: 134.0}}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# jede Spalte ein Dict\n", "melb_orient_dict = melb_df.to_dict(orient='dict') # orient='dict': default\n", "melb_orient_dict" ] }, { "cell_type": "code", "execution_count": 6, "id": "47f51b03", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{0: {'Address': '85 Turner St',\n", " 'Rooms': 2,\n", " 'Type': 'h',\n", " 'Price': 1480000.0,\n", " 'Date': '3/12/2016',\n", " 'Landsize': 202.0},\n", " 1: {'Address': '25 Bloomburg St',\n", " 'Rooms': 2,\n", " 'Type': 'h',\n", " 'Price': 1035000.0,\n", " 'Date': '4/02/2016',\n", " 'Landsize': 156.0},\n", " 2: {'Address': '5 Charles St',\n", " 'Rooms': 3,\n", " 'Type': 'h',\n", " 'Price': 1465000.0,\n", " 'Date': '4/03/2017',\n", " 'Landsize': 134.0}}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# jede Zeile ein Dict\n", "melb_orient_index = melb_df.to_dict(orient='index')\n", "melb_orient_index" ] }, { "cell_type": "markdown", "id": "4e5e92e7", "metadata": {}, "source": [ "### Aufgabenstellung\n", "\n", "Gegeben: Der Melbourne Housing Datensatz.\n", "\n", "Gesucht:\n", "* `Price_avg`: berechne das arithmetisches Mittel („Durchschnitt“) des Preises aller Häuser.\n", "* Lege neue Spalte `Price_centered` an: Subtrahiere vom jeweiligen Preis den Durchschnittspreis.\n", "* Bestimme zur neuen Spalte `Price_centered` die .\n", "* neue Spalte `Price_standardized`: Dividiere jeden Wert aus `Price_centered` durch die Standardabweichung (Wurzel der Varianz).\n", "\n", "Lösen Sie diese Aufgabe zunächst auf Basis der Variablen `melb_orient_list`.\n", "\n", "Lösen Sie diese Aufgabe nur mit den Mittel aus Runde 1. Insbesondere stehen Ihnen Funktionen wie `min()` oder `max()` in Runde 1 ja nicht zur Verfügung. Deshalb programmieren wir diese zur Übung hier einfach selbst." ] }, { "cell_type": "code", "execution_count": 7, "id": "94638d49", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1480000.0, 1035000.0, 1465000.0]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# meine Spalte: p\n", "\n", "p = melb_orient_list[\"Price\"]\n", "p" ] }, { "cell_type": "code", "execution_count": 8, "id": "a0e52460", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3980000.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p_sum = 0\n", "for x in p:\n", " p_sum += x\n", "p_sum" ] }, { "cell_type": "code", "execution_count": 9, "id": "10a2cdb2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1326666.6666666667" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p_avg = p_sum / len(p)\n", "p_avg" ] }, { "cell_type": "code", "execution_count": 10, "id": "fb824744", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[153333.33333333326, -291666.66666666674, 138333.33333333326]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p_zentriert = [ x - p_avg for x in p ]\n", "p_zentriert" ] }, { "cell_type": "code", "execution_count": 11, "id": "cf159c4a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "206330.3715457863" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p_standardabweichung = ( sum( [ x**2 for x in p_zentriert ] ) / len(p) ) ** 0.5\n", "p_standardabweichung" ] }, { "cell_type": "code", "execution_count": 12, "id": "0ba491f3", "metadata": {}, "outputs": [], "source": [ "p_standardisiert = [ x / p_standardabweichung for x in p_zentriert]" ] }, { "cell_type": "code", "execution_count": 13, "id": "474810f6", "metadata": {}, "outputs": [], "source": [ "melb_orient_list[\"Price_centered\"] = p_zentriert" ] }, { "cell_type": "code", "execution_count": 14, "id": "288b729a", "metadata": {}, "outputs": [], "source": [ "melb_orient_list[\"Price_standardized\"] = p_standardisiert" ] }, { "cell_type": "markdown", "id": "b3ccdbf8", "metadata": {}, "source": [ "## Erweiterung\n", "\n", "Wir wollen *alle* Spalten mit Zahlen standardisieren." ] }, { "cell_type": "code", "execution_count": 15, "id": "2915f3e8", "metadata": {}, "outputs": [], "source": [ "def standard(p):\n", " \"\"\"Standardisiert eine Liste: avg abziehen und durch Standardabweichung teilen.\"\"\"\n", " \n", " # Summe\n", " p_sum = 0\n", " for x in p:\n", " p_sum += x\n", " \n", " # arithmetisches Mittel\n", " p_avg = p_sum / len(p)\n", " print(\"p_avg:\", p_avg)\n", " \n", " # zentrieren\n", " p_zentriert = [ x - p_avg for x in p ]\n", " \n", " # standardisieren\n", " p_standardabweichung = ( sum( [ x**2 for x in p_zentriert ] ) / len(p) ) ** 0.5\n", " p_standardisiert = [ x / p_standardabweichung for x in p_zentriert]\n", " \n", " return p_standardisiert" ] }, { "cell_type": "code", "execution_count": 16, "id": "fa3a7361", "metadata": {}, "outputs": [], "source": [ "#mein_test = [1,2,3,4,5]\n", "#mein_test_standardisiert = standard(mein_test)\n", "#mein_test_standardisiert" ] }, { "cell_type": "code", "execution_count": 17, "id": "886931b8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p_avg: 1326666.6666666667\n", "p_avg: 164.0\n" ] } ], "source": [ "for Spalte in [\"Price\", \"Landsize\"]:\n", " Spaltenname_neu = Spalte + \"_standardisiert\"\n", " melb_orient_list[Spaltenname_neu] = standard( melb_orient_list[Spalte] )" ] }, { "cell_type": "markdown", "id": "9e1c87e9", "metadata": {}, "source": [ "## Kontrolle\n", "\n", "\n", "Natürlich kann man das so entstandene dict auch wieder in ein DataFrame umwandeln, Doku:\n", "* " ] }, { "cell_type": "code", "execution_count": 18, "id": "3101c73e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AddressRoomsTypePriceDateLandsizePrice_centeredPrice_standardizedPrice_standardisiertLandsize_standardisiert
085 Turner St2h1480000.03/12/2016202.0153333.3333330.7431450.7431451.341269
125 Bloomburg St2h1035000.04/02/2016156.0-291666.666667-1.413591-1.413591-0.282372
25 Charles St3h1465000.04/03/2017134.0138333.3333330.6704460.670446-1.058897
\n", "
" ], "text/plain": [ " Address Rooms Type Price Date Landsize \\\n", "0 85 Turner St 2 h 1480000.0 3/12/2016 202.0 \n", "1 25 Bloomburg St 2 h 1035000.0 4/02/2016 156.0 \n", "2 5 Charles St 3 h 1465000.0 4/03/2017 134.0 \n", "\n", " Price_centered Price_standardized Price_standardisiert \\\n", "0 153333.333333 0.743145 0.743145 \n", "1 -291666.666667 -1.413591 -1.413591 \n", "2 138333.333333 0.670446 0.670446 \n", "\n", " Landsize_standardisiert \n", "0 1.341269 \n", "1 -0.282372 \n", "2 -1.058897 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "melb2_list_df = pd.DataFrame.from_dict(melb_orient_list, orient = 'columns')\n", "melb2_list_df" ] }, { "cell_type": "markdown", "id": "5bbdac15", "metadata": {}, "source": [ "Für Lernwillige: Wenn Sie mit Ihrer Lösung zufrieden sind, lösen Sie zur Übung diese Aufgabe auch in den anderen zwei Orientierungen `dict` und `index` (kniffeliger, weil der Datentyp nicht so gut passt, aber gut möglich.)" ] } ], "metadata": { "jupytext": { "formats": "ipynb,md:myst", "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.13, "jupytext_version": "1.14.1" } }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "source_map": [ 13, 25, 43, 57, 59, 67, 70, 76, 82, 88, 92, 108, 115, 122, 127, 132, 137, 141, 145, 147, 153, 176, 182, 186, 194, 197 ] }, "nbformat": 4, "nbformat_minor": 5 }