{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "89a88ad1",
   "metadata": {},
   "source": [
    "# Runde 1b: Verschachtelte Datenstrukturen mit Tiefe 2\n",
    "\n",
    "Darum geht es: In {doc}`e_r1` haben wir die wichtigsten der grundlegenden Python-Konstrukte kennengelernt.\n",
    "\n",
    "Wir bleiben mit den Sprachmitteln in Runde 1 und wenden sie auf etwas komplexere Datenstrukturen an: Listen (oder Dicts etc.), die Listen oder Dicts etc. enthalten. Mit solch einer „verschachtelten“ Datenstruktur können wir z.B. Excel-Tabellen darstellen (und natürlich auch importieren und exportieren).\n",
    "\n",
    "Ziel ist es, ein erstes einfaches Programm zu schreiben, in dem algorithmisch mehrere Schritte aufeinander folgen.\n",
    "\n",
    "Dieses Notebook präsentiert zuerst das Problem, und zeigt dann akribisch Schritt für Schritt, wie man sich als Programmier-Anfänger in Runde 1 der Problemlösung nähert."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1218b505",
   "metadata": {},
   "source": [
    "## Melbourne Housing\n",
    "\n",
    "Wir lesen die ersten `n` Zeilen eines gut bekannten Original-Datensatzes aus der Wirtschaftsinformatik als Pandas Dataframe ein.\n",
    "\n",
    "Download:\n",
    "* {download}`melb_data.csv`\n",
    "* Wohin speichern? Am einfachsten in das selbe Verzeichnis legen, in dem dieses Notebook liegt.\n",
    "\n",
    "Über diesen Datensatz:\n",
    "* mögliche Primärquelle (?): https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market . \n",
    "* Beschreibung z.B. auch hier <https://rpubs.com/mchtaufiqq/house_market>\n",
    "* Ein Snapshot des Datensatzes wird auch verwendet in dem Schulungsmodul <https://www.kaggle.com/learn/intro-to-machine-learning> . \n",
    "\n",
    "Doku Pandas: \n",
    "* <https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "5f23d280",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3, 6)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "nrows = 3\n",
    "usecols = [ 'Landsize', 'Rooms', 'Price', 'Date', 'Address', 'Type']\n",
    "melb_df = pd.read_csv( \"./melb_data.csv\", \n",
    "            usecols = usecols,\n",
    "            nrows = nrows # eigentlich sind es 13580 Datensätze ;-)\n",
    "         )\n",
    "\n",
    "# Was haben wir eingelesen? # Anzahl Zeilen, Spalten\n",
    "melb_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d791ff33",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Address</th>\n",
       "      <th>Rooms</th>\n",
       "      <th>Type</th>\n",
       "      <th>Price</th>\n",
       "      <th>Date</th>\n",
       "      <th>Landsize</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>85 Turner St</td>\n",
       "      <td>2</td>\n",
       "      <td>h</td>\n",
       "      <td>1480000.0</td>\n",
       "      <td>3/12/2016</td>\n",
       "      <td>202.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>25 Bloomburg St</td>\n",
       "      <td>2</td>\n",
       "      <td>h</td>\n",
       "      <td>1035000.0</td>\n",
       "      <td>4/02/2016</td>\n",
       "      <td>156.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5 Charles St</td>\n",
       "      <td>3</td>\n",
       "      <td>h</td>\n",
       "      <td>1465000.0</td>\n",
       "      <td>4/03/2017</td>\n",
       "      <td>134.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           Address  Rooms Type      Price       Date  Landsize\n",
       "0     85 Turner St      2    h  1480000.0  3/12/2016     202.0\n",
       "1  25 Bloomburg St      2    h  1035000.0  4/02/2016     156.0\n",
       "2     5 Charles St      3    h  1465000.0  4/03/2017     134.0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "melb_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ecb81d9",
   "metadata": {},
   "source": [
    "Mit Pandas Dataframes zu arbeiten kann in der ersten Runde Python-101 nicht Inhalt sein. Aber man kann solch ein DataFrame in Datenstrukturen überführen, die für eine Einführung in Python 101 sehr instruktuv sind. \n",
    "\n",
    "* Doku: <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html>\n",
    "\n",
    "Die einfachste alle Darstellungen verwirft die Spaltenüberschriften völlig:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "e985f35b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['85 Turner St', 2, 'h', 1480000.0, '3/12/2016', 202.0],\n",
       " ['25 Bloomburg St', 2, 'h', 1035000.0, '4/02/2016', 156.0],\n",
       " ['5 Charles St', 3, 'h', 1465000.0, '4/03/2017', 134.0]]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# jede Zeile eine Liste\n",
    "melb_df.values.tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "714e2e7a",
   "metadata": {},
   "source": [
    "Informationserhaltende Darstellungen nutzen Mischungen aus Listen und dicts.\n",
    "\n",
    "Aufgabe: Vergegenwärtigen Sie sich für `list`, `dict` und `index`, ins welche konkrete Datenstruktur die abstrakte Datenstruktur „Tabelle“ jeweils überführt wird!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "65bc2860",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Address': ['85 Turner St', '25 Bloomburg St', '5 Charles St'],\n",
       " 'Rooms': [2, 2, 3],\n",
       " 'Type': ['h', 'h', 'h'],\n",
       " 'Price': [1480000.0, 1035000.0, 1465000.0],\n",
       " 'Date': ['3/12/2016', '4/02/2016', '4/03/2017'],\n",
       " 'Landsize': [202.0, 156.0, 134.0]}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# jede Spalte eine Liste\n",
    "melb_orient_list = melb_df.to_dict(orient='list')\n",
    "melb_orient_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "fd1ae2c9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Address': {0: '85 Turner St', 1: '25 Bloomburg St', 2: '5 Charles St'},\n",
       " 'Rooms': {0: 2, 1: 2, 2: 3},\n",
       " 'Type': {0: 'h', 1: 'h', 2: 'h'},\n",
       " 'Price': {0: 1480000.0, 1: 1035000.0, 2: 1465000.0},\n",
       " 'Date': {0: '3/12/2016', 1: '4/02/2016', 2: '4/03/2017'},\n",
       " 'Landsize': {0: 202.0, 1: 156.0, 2: 134.0}}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# jede Spalte ein Dict\n",
    "melb_orient_dict = melb_df.to_dict(orient='dict') # orient='dict': default\n",
    "melb_orient_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "47f51b03",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{0: {'Address': '85 Turner St',\n",
       "  'Rooms': 2,\n",
       "  'Type': 'h',\n",
       "  'Price': 1480000.0,\n",
       "  'Date': '3/12/2016',\n",
       "  'Landsize': 202.0},\n",
       " 1: {'Address': '25 Bloomburg St',\n",
       "  'Rooms': 2,\n",
       "  'Type': 'h',\n",
       "  'Price': 1035000.0,\n",
       "  'Date': '4/02/2016',\n",
       "  'Landsize': 156.0},\n",
       " 2: {'Address': '5 Charles St',\n",
       "  'Rooms': 3,\n",
       "  'Type': 'h',\n",
       "  'Price': 1465000.0,\n",
       "  'Date': '4/03/2017',\n",
       "  'Landsize': 134.0}}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# jede Zeile ein Dict\n",
    "melb_orient_index = melb_df.to_dict(orient='index')\n",
    "melb_orient_index"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e5e92e7",
   "metadata": {},
   "source": [
    "### Aufgabenstellung\n",
    "\n",
    "Gegeben: Der Melbourne Housing Datensatz.\n",
    "\n",
    "Gesucht:\n",
    "* `Price_avg`: berechne das arithmetisches Mittel („Durchschnitt“) des Preises aller Häuser.\n",
    "* Lege neue Spalte `Price_centered` an: Subtrahiere vom jeweiligen Preis den Durchschnittspreis.\n",
    "* Bestimme zur neuen Spalte `Price_centered` die <https://de.wikipedia.org/wiki/Empirische_Varianz>.\n",
    "* neue Spalte `Price_standardized`: Dividiere jeden Wert aus `Price_centered` durch die Standardabweichung (Wurzel der Varianz).\n",
    "\n",
    "Lösen Sie diese Aufgabe zunächst auf Basis der Variablen `melb_orient_list`.\n",
    "\n",
    "Lösen Sie diese Aufgabe nur mit den Mittel aus Runde 1. Insbesondere stehen Ihnen Funktionen wie `min()` oder `max()` in Runde 1 ja nicht zur Verfügung. Deshalb programmieren wir diese zur Übung hier einfach selbst."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "94638d49",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1480000.0, 1035000.0, 1465000.0]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# meine Spalte: p\n",
    "\n",
    "p = melb_orient_list[\"Price\"]\n",
    "p"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a0e52460",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3980000.0"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p_sum = 0\n",
    "for x in p:\n",
    "    p_sum += x\n",
    "p_sum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "10a2cdb2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1326666.6666666667"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p_avg = p_sum / len(p)\n",
    "p_avg"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "fb824744",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[153333.33333333326, -291666.66666666674, 138333.33333333326]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p_zentriert = [ x - p_avg for x in p ]\n",
    "p_zentriert"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "cf159c4a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "206330.3715457863"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p_standardabweichung = ( sum( [ x**2  for x in p_zentriert ] ) / len(p) ) ** 0.5\n",
    "p_standardabweichung"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "0ba491f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "p_standardisiert = [  x / p_standardabweichung  for x in p_zentriert]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "474810f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "melb_orient_list[\"Price_centered\"] = p_zentriert"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "288b729a",
   "metadata": {},
   "outputs": [],
   "source": [
    "melb_orient_list[\"Price_standardized\"] = p_standardisiert"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3ccdbf8",
   "metadata": {},
   "source": [
    "## Erweiterung\n",
    "\n",
    "Wir wollen *alle* Spalten mit Zahlen standardisieren."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "2915f3e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "def standard(p):\n",
    "    \"\"\"Standardisiert eine Liste: avg abziehen und durch Standardabweichung teilen.\"\"\"\n",
    "    \n",
    "    # Summe\n",
    "    p_sum = 0\n",
    "    for x in p:\n",
    "        p_sum += x\n",
    "    \n",
    "    # arithmetisches Mittel\n",
    "    p_avg = p_sum / len(p)\n",
    "    print(\"p_avg:\", p_avg)\n",
    "    \n",
    "    # zentrieren\n",
    "    p_zentriert = [ x - p_avg for x in p ]\n",
    "    \n",
    "    # standardisieren\n",
    "    p_standardabweichung = ( sum( [ x**2  for x in p_zentriert ] ) / len(p) ) ** 0.5\n",
    "    p_standardisiert = [  x / p_standardabweichung  for x in p_zentriert]\n",
    "    \n",
    "    return p_standardisiert"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "fa3a7361",
   "metadata": {},
   "outputs": [],
   "source": [
    "#mein_test = [1,2,3,4,5]\n",
    "#mein_test_standardisiert = standard(mein_test)\n",
    "#mein_test_standardisiert"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "886931b8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p_avg: 1326666.6666666667\n",
      "p_avg: 164.0\n"
     ]
    }
   ],
   "source": [
    "for Spalte in [\"Price\", \"Landsize\"]:\n",
    "    Spaltenname_neu = Spalte + \"_standardisiert\"\n",
    "    melb_orient_list[Spaltenname_neu] = standard( melb_orient_list[Spalte] )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e1c87e9",
   "metadata": {},
   "source": [
    "## Kontrolle\n",
    "\n",
    "\n",
    "Natürlich kann man das so entstandene dict auch wieder in ein DataFrame umwandeln, Doku:\n",
    "* <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "3101c73e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Address</th>\n",
       "      <th>Rooms</th>\n",
       "      <th>Type</th>\n",
       "      <th>Price</th>\n",
       "      <th>Date</th>\n",
       "      <th>Landsize</th>\n",
       "      <th>Price_centered</th>\n",
       "      <th>Price_standardized</th>\n",
       "      <th>Price_standardisiert</th>\n",
       "      <th>Landsize_standardisiert</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>85 Turner St</td>\n",
       "      <td>2</td>\n",
       "      <td>h</td>\n",
       "      <td>1480000.0</td>\n",
       "      <td>3/12/2016</td>\n",
       "      <td>202.0</td>\n",
       "      <td>153333.333333</td>\n",
       "      <td>0.743145</td>\n",
       "      <td>0.743145</td>\n",
       "      <td>1.341269</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>25 Bloomburg St</td>\n",
       "      <td>2</td>\n",
       "      <td>h</td>\n",
       "      <td>1035000.0</td>\n",
       "      <td>4/02/2016</td>\n",
       "      <td>156.0</td>\n",
       "      <td>-291666.666667</td>\n",
       "      <td>-1.413591</td>\n",
       "      <td>-1.413591</td>\n",
       "      <td>-0.282372</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5 Charles St</td>\n",
       "      <td>3</td>\n",
       "      <td>h</td>\n",
       "      <td>1465000.0</td>\n",
       "      <td>4/03/2017</td>\n",
       "      <td>134.0</td>\n",
       "      <td>138333.333333</td>\n",
       "      <td>0.670446</td>\n",
       "      <td>0.670446</td>\n",
       "      <td>-1.058897</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           Address  Rooms Type      Price       Date  Landsize  \\\n",
       "0     85 Turner St      2    h  1480000.0  3/12/2016     202.0   \n",
       "1  25 Bloomburg St      2    h  1035000.0  4/02/2016     156.0   \n",
       "2     5 Charles St      3    h  1465000.0  4/03/2017     134.0   \n",
       "\n",
       "   Price_centered  Price_standardized  Price_standardisiert  \\\n",
       "0   153333.333333            0.743145              0.743145   \n",
       "1  -291666.666667           -1.413591             -1.413591   \n",
       "2   138333.333333            0.670446              0.670446   \n",
       "\n",
       "   Landsize_standardisiert  \n",
       "0                 1.341269  \n",
       "1                -0.282372  \n",
       "2                -1.058897  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "melb2_list_df = pd.DataFrame.from_dict(melb_orient_list, orient = 'columns')\n",
    "melb2_list_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5bbdac15",
   "metadata": {},
   "source": [
    "Für Lernwillige: Wenn Sie mit Ihrer Lösung zufrieden sind, lösen Sie zur Übung diese Aufgabe auch in den anderen zwei Orientierungen `dict` und `index` (kniffeliger, weil der Datentyp nicht so gut passt, aber gut möglich.)"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "ipynb,md:myst",
   "text_representation": {
    "extension": ".md",
    "format_name": "myst",
    "format_version": 0.13,
    "jupytext_version": "1.14.1"
   }
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  },
  "source_map": [
   13,
   25,
   43,
   57,
   59,
   67,
   70,
   76,
   82,
   88,
   92,
   108,
   115,
   122,
   127,
   132,
   137,
   141,
   145,
   147,
   153,
   176,
   182,
   186,
   194,
   197
  ]
 },
 "nbformat": 4,
 "nbformat_minor": 5
}