Moving Average Analysis of Time Series Data


Excel For Statistical Data Analysis. This is a webtext companion page of the Business Statistics USA site. Excel is a widely used statistical package, serving as a tool for understanding statistical concepts and computations and for checking your hand-worked calculations when solving your homework problems. The site provides an introduction to understanding the basics of and working with Excel. Redoing the illustrated numerical examples on this site will help improve your familiarity and, as a result, increase the effectiveness and efficiency of your work in statistics. To search the site, try Find in Page (Ctrl+F). Enter a word or phrase in the dialog box; if the first appearance of the word or phrase is not what you are looking for, try Find Next.

Introduction: This site provides illustrative experience in the use of Excel for data summary, presentation, and other basic statistical analysis. I believe the popular use of Excel is in the areas where Excel really can excel. This includes organizing data, i.e. basic data management, tabulation, and graphics. For real statistical analysis you must learn to use the professional commercial statistical packages such as SAS and SPSS. Microsoft Excel 2000 (version 9) provides a set of data analysis tools called the Analysis ToolPak, which you can use to save steps when you develop complex statistical analyses. You provide the data and parameters for each analysis; the tool uses the appropriate statistical macro functions and displays the results in an output table. Some tools generate charts in addition to output tables. If the Data Analysis command is on the Tools menu, the Analysis ToolPak is installed on your system. If the Data Analysis command is not on the Tools menu, however, you need to install the Analysis ToolPak as follows. Step 1: On the Tools menu, click Add-Ins. If Analysis ToolPak is not listed in the Add-Ins dialog box, click Browse and locate the drive, folder name, and file name for the Analysis ToolPak add-in, Analys32.xll, which is usually located in the Program Files\Microsoft Office\Office\Library\Analysis folder. Once you find the file, select it and click OK. Step 2: If you cannot find the file Analys32.xll, you must install it. Insert your Microsoft Office 2000 Disk 1 into the CD-ROM drive. From the Windows Start menu, select Run. Browse and select the drive for your CD. Select Setup.exe, click Open, and click OK. Click the Add or Remove Features button. Click the + next to Microsoft Excel for Windows. Click the + next to Add-ins. Click the down arrow next to Analysis ToolPak. Select Run from My Computer. Select the Update Now button. Excel will now update your system to include the Analysis ToolPak. Launch Excel. On the Tools menu, click Add-Ins and check the Analysis ToolPak box. Step 3: The Analysis ToolPak add-in is now installed, and Data Analysis can now be selected from the Tools menu.
Microsoft Excel is a powerful spreadsheet package for Microsoft Windows and the Apple Macintosh. Spreadsheet software is used to store information in columns and rows, which can then be organized and processed. Spreadsheets are designed to work well with numbers but often include text. Excel organizes your work into workbooks; each workbook can contain many worksheets; worksheets are used to list and analyze data. Excel is available on all public-access PCs (i.e., those in the Library and in the PC Labs). It can be opened either by selecting Start - Programs - Microsoft Excel or by clicking on the Excel shortcut, which is either on your desktop, or on any PC, or on the Office toolbar.

Opening a Document: Click on File-Open (Ctrl+O) to open an existing workbook; change the directory area or drive to look for files in other locations. To create a new workbook, click on File-New-Blank Document. Saving and Closing a Document: To save your document with its current filename, location, and file format, click on File-Save. If you are saving for the first time, click File-Save, choose a name for your document, and then click OK. Also use File-Save if you want to save to a different filename. When you have finished working on a document you should close it. Go to the File menu and click on Close. If you have made any changes since the file was last saved, you will be asked whether you wish to save them.

The Excel Screen, Workbooks and Worksheets: When Excel starts, a blank worksheet is displayed, which consists of a multiple-cell grid with numbered rows down the page and alphabetically titled columns across the page. Each cell is referenced by its coordinates (e.g., A3 is used to refer to the cell in column A and row 3; B10:B20 is used to refer to the range of cells in column B and rows 10 through 20). Your work is stored in an Excel file called a workbook. Each workbook may contain several worksheets and charts; the current worksheet is called the active sheet. To view a different worksheet in a workbook, click the appropriate Sheet tab. You can access and execute commands directly from the main menu, or you can point to one of the toolbar buttons (the display box that appears below the button when you position the cursor over it indicates the name and action of the button) and click once.

Moving Around the Worksheet: It is important to be able to move around the worksheet, because you can only enter or change data at the position of the cursor. You can move the cursor by using the arrow keys or by moving the mouse to the required cell and clicking. Once selected, the cell becomes the active cell and is identified by a thick border; only one cell can be active at a time. To move from one worksheet to another, click the sheet tabs. (If your workbook contains many sheets, right-click the tab scrolling buttons and then click the sheet you want.)
The name of the active sheet is shown in bold. Moving Between Cells: Here are keyboard shortcuts for moving the active cell: Home - moves to the first column in the current row; Ctrl+Home - moves to the top left corner of the document; End then Home - moves to the last cell in the document. To move between cells on a worksheet, click any cell or use the arrow keys. To see a different area of the sheet, use the scroll bars and click the arrows or the area above or below the scroll box in the vertical or horizontal scroll bars. Note that the size of a scroll box indicates the proportional amount of the used area of the sheet that is visible in the window. The position of a scroll box indicates the relative location of the visible area within the worksheet.

Entering Data: A new worksheet is a grid of rows and columns. The rows are labeled with numbers and the columns are labeled with letters. Each intersection of a row and a column is a cell. Each cell has an address, which is the column letter followed by the row number. The arrow on the worksheet to the right points to cell A1, which is currently highlighted, indicating that it is an active cell. A cell must be active for you to enter information into it. To highlight (select) a cell, click on it. To select more than one cell: click on a cell (e.g., A1), then hold down the Shift key while you click on another (e.g., D4) to select all cells between and including A1 and D4; or click on a cell (e.g., A1) and drag the mouse across the desired range, releasing on another cell (e.g., D4), to select all cells between A1 and D4. To select several non-adjacent cells, hold down Control and click on the cells you want to select. Click a number or letter labeling a row or column to select that entire row or column. A worksheet can have up to 256 columns and 65,536 rows, so it will be a while before you run out of space.

Each cell may contain a label, a value, a logical value, or a formula. Labels can contain any combination of letters, numbers, or symbols. Values are numbers; only values (numbers) can be used in calculations. A value can also be a date or a time. Logical values are true or false. Formulas automatically do calculations on the values in other specified cells and display the result in the cell in which the formula is entered (for example, you can specify that cell D3 is to contain the sum of the numbers in B3 and C3; the number displayed in D3 is then a function of the numbers entered into B3 and C3). To enter information into a cell, select the cell and begin typing. Note that as you enter information into a cell, it also appears in the formula bar. You can also enter information into the formula bar, and the information will appear in the selected cell. When you have finished entering the label or value: press Enter to move to the next cell below (in this case A2), or
press Tab to move to the next cell to the right (in this case B1).

Entering Labels: Unless the information you enter is formatted as a value or a formula, Excel will interpret it as a label, and by default the text is aligned on the left side of the cell. If you are creating a long worksheet and you will be repeating the same label information in many different cells, you can use the AutoComplete function. This function looks at other entries in the same column and attempts to match a previous entry with your current entry. For example, if you have already typed Wesleyan in another cell and you type W in a new cell, Excel will automatically enter Wesleyan. If you intended to type Wesleyan into the cell, your task is done and you can move on to the next cell. If you intended to type something different, e.g. Williams, into the cell, just continue typing to enter the term. To activate the AutoComplete feature, click Tools on the menu bar, select Options, click Edit, and click to put a check in the box beside Enable AutoComplete for cell values. Another way to quickly enter repeated labels is to use the Pick From List feature: right-click a cell and select Pick From List. This gives you a menu of all other entries in cells in that column; click an item in the menu to enter it into the currently selected cell.

A value is a number, date, or time, plus a few symbols if necessary to further define the numbers. Numbers are assumed to be positive; to enter a negative number, use a minus sign or enclose the number in parentheses (). Dates are stored as MM/DD/YYYY, but do not have to be entered in exactly that format. If you enter jan 9 or jan-9, Excel recognizes it as January 9 of the current year and stores it as 1/9/2002. Enter the four-digit year for a year other than the current year (e.g., jan 9, 1999). To enter today's date, press Ctrl and ; (semicolon) at the same time. Times are assumed to be on a 24-hour clock; use a or p to indicate am or pm if you use a 12-hour clock (e.g., 8:30 p is interpreted as 8:30 PM). To enter the current time, press Ctrl and : (Shift-semicolon) at the same time. An entry interpreted as a value (number, date, or time) is aligned on the right side of the cell.

Applying colors to maximum and minimum values: Select a cell in the region and press Ctrl+Shift+* (in Excel 2003, press this or Ctrl+A) to select the current region. From the Format menu, select Conditional Formatting. In Condition 1, select Formula Is and type =MAX($F:$F)=F1. Click Format, select the Font tab, choose a color, and then click OK. In Condition 2, select Formula Is and type =MIN($F:$F)=F1. Repeat step 4, choose a different color than you chose for Condition 1, and then click OK.
Note: Be sure to distinguish between absolute reference and relative reference when entering the formulas. Rounding numbers that meet specified criteria: Problem: rounding all the numbers in column A to zero decimal places, except for those that have 5 in the first decimal place. Solution: use the IF, MOD, and ROUND functions in the following formula: =IF(MOD(A2,1)=0.5,A2,ROUND(A2,0)).

To copy and paste all cells in a sheet: Select the cells in the sheet by pressing Ctrl+A (in Excel 2003, select a cell in a blank area before pressing Ctrl+A, or, from a selected cell in a current region/list range, press Ctrl+A+A). OR click Select All at the top left intersection of rows and columns. Press Ctrl+C. Press Ctrl+Page Down to select another sheet, then select cell A1. Press Enter. Copying the entire sheet: Copying the entire sheet means copying the cells, the page setup parameters, and the defined range names. Option 1: Move the mouse pointer to a sheet tab. Press Ctrl, and hold down the mouse button to drag the sheet to a different location. Release the mouse button and the Ctrl key. Option 2: Right-click the appropriate sheet tab. From the shortcut menu, select Move or Copy. The Move or Copy dialog box lets you copy the sheet either to a different location in the current workbook or to a different workbook. Be sure to mark the Create a copy checkbox. Option 3: From the Window menu, select Arrange. Select Tiled to tile all open workbooks in the window. Use Option 1 (dragging the sheet while pressing Ctrl) to copy or move a sheet.

Sorting by Columns: The default setting for sorting in ascending or descending order is by row. To sort by columns: From the Data menu select Sort, and then Options. Select the Sort left to right option button and click OK. In the Sort by option of the Sort dialog box, select the row number by which the columns will be sorted and click OK.

Descriptive Statistics: The Data Analysis ToolPak has a Descriptive Statistics tool that provides you with an easy way to calculate summary statistics for a set of sample data. Summary statistics include the mean, standard error, median, mode, standard deviation, variance, kurtosis, skewness, range, minimum, maximum, sum, and count. This tool eliminates the need to type individual functions to find each of these results. Excel includes elaborate and customizable toolbars, for example the Standard toolbar shown here; some of the icons are useful mathematical shortcuts: the AutoSum icon, which enters the formula =SUM() to add up a range of cells; the FunctionWizard icon, which gives you access to all the functions available; and the ChartWizard icon, which gives you access to all the available graph types, as shown in this display. Excel can be used to generate measures of location and variability for a variable. Suppose we wish to find descriptive statistics for the sample data 2, 4, 6, and 8. Step 1.
From the Tools pull-down menu, if you see Data Analysis, click on this option; otherwise click on the Add-In option to install the Analysis ToolPak. Step 2. Click on the Data Analysis option. Step 3. Choose Descriptive Statistics from the Analysis Tools list. Step 4. When the dialog box appears: enter A1:A4 in the Input Range box. A1 is a value in column A and row 1; in this case this value is 2. Using the same technique, enter the other VALUES until you reach the last one. If a sample consists of 20 numbers, you can select, for example, A1:A20 as the input range. Step 5. Select an output range, in this case B1. Click on Summary Statistics to see the results. When you click OK, you will see the result in the selected range. As you will see, the mean of the sample is 5, the median is 5, the standard deviation is 2.581989, the sample variance is 6.666667, the range is 6, and so on. Each of these factors might be important in the calculation of different statistical procedures.

Normal Distribution: Consider the problem of finding the probability of getting less than a certain value under any normal probability distribution. As an illustrative example, let us suppose the SAT scores nationwide are normally distributed with a mean and standard deviation of 500 and 100, respectively. Answer the following questions based on the given information: A: What is the probability that a randomly selected student's score will be less than 600 points? B: What is the probability that a randomly selected student's score will exceed 600 points? C: What is the probability that a randomly selected student's score will be between 400 and 600? Hint: Using Excel you can find the probability of getting a value approximately less than or equal to a given value. In a problem, when the mean and the standard deviation of the population are given, you have to use common sense to find different probabilities based on the question, since you know the area under a normal curve is 1. Step 1. Select a cell in the worksheet where you would like the answer to appear; suppose you chose cell number one, A1. Steps 2-3. From the menus select Insert, then click on the Function option. Step 4. After clicking on the Function option, the Paste Function dialog box appears; from the Function Category box select Statistical and from the Function Name box select NORMDIST, then click OK. Step 5. After clicking OK, the NORMDIST distribution box appears: i. Enter 600 in X (the value box); ii. Enter 500 in the Mean box; iii. Enter 100 in the Standard deviation box; iv. Type "true" in the cumulative box, then click OK. As you see, the value 0.84134474 appears in A1, indicating the probability that a randomly selected student's score is below 600 points. Using common sense we can answer part "b" by subtracting 0.84134474 from 1. So the part "b" answer is 1 - 0.84134474, or 0.15865526.
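As a cross-check outside Excel, the same summary statistics for the sample 2, 4, 6 and 8 can be reproduced with a few lines of Python. This is only a minimal sketch using the standard-library statistics module, independent of the Excel walkthrough above:

    import statistics

    data = [2, 4, 6, 8]
    print(statistics.mean(data))      # 5
    print(statistics.median(data))    # 5
    print(statistics.stdev(data))     # 2.58198...  (sample standard deviation)
    print(statistics.variance(data))  # 6.66666...  (sample variance)
    print(max(data) - min(data))      # 6 (range)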
The value 0.15865526 is the probability that a randomly selected student's score is greater than 600 points. To answer part "c", use the same techniques to find the probabilities (the areas) to the left of the values 600 and 400. Since these areas (probabilities) overlap, to answer the question you should subtract the smaller probability from the larger probability. The answer equals 0.84134474 - 0.15865526, that is, 0.68269. The screen shot should look like the following.

Calculating the value of a random variable, often called the "x" value: You can use NORMINV from the function box to calculate a value for the random variable when the probability to the left of this variable is given. Actually, you should use this function to calculate different percentiles. In this problem one could ask: what is the score of a student whose percentile is 90? This means that approximately 90% of students' scores are less than this number. On the other hand, if we were asked to do this problem by hand, we would have had to calculate the x value using the normal distribution formula x = m + z*s. Now let's use Excel to calculate P90. In the Paste Function box, click on Statistical, then click on NORMINV. The screen shot would look like the following. When you see NORMINV, the dialog box appears. i. Enter 0.90 for the probability (this means that approximately 90% of students' scores are less than the value we are looking for); ii. Enter 500 for the mean (this is the mean of the normal distribution in our case); iii. Enter 100 for the standard deviation (this is the standard deviation of the normal distribution in our case). At the end of this screen you will see the formula result, which is approximately 628 points. This means the top 10% of the students scored better than 628.

Confidence Interval for the Mean: Suppose we wish to estimate a confidence interval for the mean of a population. Depending on the size of your sample, you may use one of the following cases. Large sample size (n is larger than, say, 30): The general formula for developing a confidence interval for a population mean is xbar +/- z*s/sqrt(n). In this formula xbar is the mean of the sample; z is the interval coefficient, which can be found from the normal distribution table (for example, the interval coefficient for a 95% confidence level is 1.96); s is the standard deviation of the sample and n is the sample size. Now we would like to show how Excel is used to develop a certain confidence interval of a population mean based on sample information. As you see, to evaluate this formula you need "the mean of the sample" and the margin of error; Excel will automatically calculate these quantities for you. The only things you have to do are: add the margin of error to the mean of the sample to find the upper limit of the interval, and subtract the margin of error from the mean to find the lower limit of the interval. To demonstrate how Excel finds these quantities we will use the data set which contains the hourly incomes of 36 work-study students here at the University of Baltimore. These numbers appear in cells A1 to A36 on an Excel worksheet.
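The normal-distribution answers above (parts a through c, and the 90th percentile of about 628) can also be checked outside Excel. A minimal Python sketch, assuming SciPy is available (scipy.stats.norm plays the role of NORMDIST and NORMINV):

    from scipy.stats import norm

    p_below_600 = norm.cdf(600, loc=500, scale=100)                  # part a: about 0.8413
    p_above_600 = 1 - p_below_600                                    # part b: about 0.1587
    p_between = norm.cdf(600, 500, 100) - norm.cdf(400, 500, 100)    # part c: about 0.6827
    score_90th = norm.ppf(0.90, loc=500, scale=100)                  # 90th percentile: about 628
    print(p_below_600, p_above_600, p_between, score_90th)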
After entering the data, we followed the descriptive statistics procedure to calculate the unknown quantities. The only additional step is to click on the Confidence Level box in the Descriptive Statistics dialog box and enter the given confidence level, in this case 95%. The following steps are involved: Step 1. Enter the data in cells A1 to A36 (in the spreadsheet). Step 2. From the menus select Tools. Step 3. Click on Data Analysis, select the Descriptive Statistics option, then click OK. On the Descriptive Statistics dialog box, click on Summary Statistics. After you have done that, click on the Confidence Level box and type 95% - or, in other problems, whatever confidence level you desire. In the Output Range box enter B1 or whatever location you desire. Now click OK. The screen shot would look like the following. As you see, the table shows that the mean of the sample is 6.902777778 and the absolute value of the margin of error is 0.231678109. This mean is based on this sample information. A 95% confidence interval for the hourly income of the UB work-study students has an upper limit of 6.902777778 + 0.231678109 and a lower limit of 6.902777778 - 0.231678109. On the other hand, we can say that of all the intervals formed this way, 95% contain the mean of the population. Or, for practical purposes, we can be confident that the mean of the population is between 6.902777778 - 0.231678109 and 6.902777778 + 0.231678109. We can be at least 95% confident that the interval [6.68, 7.13] contains the average hourly income of a work-study student.

Small sample size (say, less than 30): If the sample n is less than 30, we have to use the small-sample procedure to develop a confidence interval for the mean of a population. The general formula for developing confidence intervals for the population mean based on a small sample is xbar +/- t*s/sqrt(n). In this formula xbar is the mean of the sample; t is the interval coefficient providing an area in the upper tail of a t distribution with n-1 degrees of freedom, which can be found from a t distribution table (for example, the interval coefficient for a 90% confidence level is 1.833 if the sample size is 10); s is the standard deviation of the sample and n is the sample size. Now you would like to see how Excel is used to develop a certain confidence interval of a population mean based on this small sample information. As you see, to evaluate this formula you need "the mean of the sample" and the margin of error; Excel will automatically calculate these quantities the way it did for large samples. Again, the only things you have to do are: add the margin of error to the mean of the sample to find the upper limit of the interval, and subtract the margin of error from the mean to find the lower limit of the interval. To demonstrate how Excel finds these quantities we will use the data set which contains the hourly incomes of 10 work-study students here at the University of Baltimore. These numbers appear in cells A1 to A10 on an Excel worksheet. After entering the data, we follow the descriptive statistics procedure to calculate the unknown quantities (exactly the way we found quantities for the large sample).
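The confidence-interval arithmetic described above (sample mean plus or minus the margin of error) can also be sketched in Python. The wage values below are hypothetical stand-ins, since the actual worksheet data are not reproduced here; the function mirrors the quantity Excel reports as the Confidence Level in the Descriptive Statistics output:

    import math, statistics
    from scipy.stats import t

    def mean_confidence_interval(sample, level=0.95):
        # xbar +/- t * s / sqrt(n); for large n the t coefficient is close to z (1.96 at 95%)
        n = len(sample)
        xbar = statistics.mean(sample)
        s = statistics.stdev(sample)
        margin = t.ppf((1 + level) / 2, df=n - 1) * s / math.sqrt(n)
        return xbar - margin, xbar + margin

    wages = [6.0, 6.5, 7.0, 7.5, 8.0, 6.5, 7.0, 7.0, 6.0, 7.5]   # hypothetical hourly incomes
    print(mean_confidence_interval(wages, 0.95))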
Step 1: Enter the data in cells A1 to A10 in the worksheet. Step 2. From the menus select Tools. Step 3. Click on Data Analysis and choose the Descriptive Statistics option, then click OK. On the Descriptive Statistics dialog box, click on Summary Statistics, click on the Confidence Level box and type 90% - or, in other problems, whatever confidence interval you desire. In the Output Range box enter B1 or whatever location you desire. Now click OK. The screen looks like the following. Now, as with the calculation of the confidence interval for the large sample, calculate the confidence interval of the population based on this small sample information. The confidence interval is 6.8 +/- 0.414426102, or 6.39 to 7.21. We can be at least 90% confident that the interval [6.39, 7.21] contains the true mean of the population.

Test of Hypothesis Concerning the Population Mean: Again, we must distinguish two cases with respect to the size of your sample. Large sample size (say, over 30): In this section you wish to know how Excel can be used to conduct a hypothesis test about a population mean. We will use the hourly incomes of the work-study students that were introduced earlier in the confidence interval section. The data are entered in cells A1 to A36. The objective is to test the following null and alternative hypotheses: the null hypothesis indicates that the average hourly income of a work-study student is equal to $7 per hour, while the alternative hypothesis indicates that the average hourly income is not equal to $7 per hour. I will repeat the steps taken in descriptive statistics, and at the end I will show how to find the value of the test statistic, in this case z, using a cell formula. Step 1. Enter the data in cells A1 to A36 (in the spreadsheet). Step 2. From the menus select Tools. Step 3. Click on Data Analysis, select the Descriptive Statistics option, and click OK. On the Descriptive Statistics dialog box, click on Summary Statistics. Select the Output Range box and enter B1 or whichever location you desire. Now click OK. (To calculate the value of the test statistic, search for the mean of the sample and then the standard error; in this output these values are in cells C3 and C4.) Step 4. Select cell D1 and enter the cell formula =(C3-7)/C4. The screen shot should look like the following. The value in cell D1 is the value of the test statistic. Since this value falls in the acceptance range of -1.96 to +1.96 (from the normal distribution table), we fail to reject the null hypothesis.

Small sample size (say, less than 30): Using the steps taken in the large-sample case, Excel can be used to conduct a hypothesis test for the small-sample case as well. Let us use the hourly incomes of 10 work-study students at UB to conduct the following hypothesis test. The null hypothesis indicates that the average hourly income of a work-study student is equal to $7 per hour; the alternative hypothesis indicates that the average hourly income is not equal to $7 per hour. I will repeat the steps taken in descriptive statistics, and at the end I will show how to find the value of the test statistic, in this case "t", using a cell formula. Step 1.
Enter the data in cells A1 to A10 (in the spreadsheet). Step 2. From the menus select Tools. Step 3. Click on Data Analysis and choose the Descriptive Statistics option. Click OK. On the Descriptive Statistics dialog box, click on Summary Statistics. Select the Output Range box, enter B1 or whichever location you desire. Again click OK. (To calculate the value of the test statistic, search for the mean of the sample and then the standard error; in this output these values are in cells C3 and C4.) Step 4. Select cell D1 and enter the cell formula =(C3-7)/C4. The screen shot would look like the following. Since the value of the test statistic t = -0.66896 falls in the acceptance range of -2.262 to +2.262 (from the t table, where alpha = 0.025 and the degrees of freedom are 9), we fail to reject the null hypothesis.

Difference Between the Means of Two Populations: In this section we will show how Excel is used to conduct a hypothesis test about the difference between two population means, given that the populations have equal variances. The data in this case are taken from various offices here at the University of Baltimore. I collected the hourly income data of 36 randomly selected work-study students and 36 student assistants. The hourly income range for work-study students was $6 - $8, while the hourly income range for student assistants was $6 - $9. The main objective in this hypothesis testing is to see whether there is a significant difference between the means of the two populations. The NULL and the ALTERNATIVE hypotheses are that the means are equal and that the means are not equal, respectively. Referring to the spreadsheet, I chose A1 and B1 as label cells. The work-study students' hourly incomes for a sample size of 36 are shown in cells A2:A37, and the student assistants' hourly incomes for a sample size of 36 are shown in cells B2:B37. Data for the work-study students: 6, 6, 6, 6, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 6.5, 6.5, 7, 7, 7, 7, 7, 7, 7, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 8, 8, 8, 8, 8, 8, 8, 8, 8. Data for the student assistants: ..., 6.5, 6.5, 6.5, 6.5, 6.5, 6.5, 7, 7, 7, 7, 7, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 8, 8, 8, 8, 8, 8, 8, 8.5, 8.5, 8.5, 8.5, 8.5, 9, 9, 9, 9. Use the Descriptive Statistics procedure to calculate the variances of the two samples. The Excel procedure for testing the difference between the two population means requires information on the variances of the two populations. Since the variances of the two populations are unknown, they should be replaced with the sample variances. The descriptive statistics for both samples show that the variance of the first sample is s1^2 = 0.55546218, while the variance of the second sample is s2^2 = 0.969748. To conduct the desired test hypothesis with Excel, the following steps can be taken: Step 1. From the menus select Tools, then click on the Data Analysis option. Step 2. When the Data Analysis dialog box appears: choose z-Test: Two Sample for Means, then click OK. Step 3.
When the z-Test: Two Sample for Means dialog box appears: enter A1:A36 in the Variable 1 Range box (work-study student hourly incomes); enter B1:B36 in the Variable 2 Range box (student assistant hourly incomes); enter 0 in the Hypothesized Mean Difference box (if you desire to test a mean difference other than 0, enter that value); enter the variance of the first sample in the Variable 1 Variance box; enter the variance of the second sample in the Variable 2 Variance box and select Labels; enter 0.05, or whatever level of significance you desire, in the Alpha box; select a suitable Output Range for the results - I chose C19 - and then click OK. The value of the test statistic, z = -1.9845824, appears in our case in cell D24. The rejection rule for this test is z < -1.96 or z > 1.96, from the normal distribution table. In the Excel output the critical values for a two-tail test are z = +/-1.959961082. Since the value of the test statistic, z = -1.9845824, is less than -1.959961082, we reject the null hypothesis. We can also draw this conclusion by comparing the p-value for a two-tail test and the alpha value. Since the p-value 0.047190813 is less than alpha = 0.05, we reject the null hypothesis. Overall we can say, based on the sample results, that the two population means are different.

Small Samples: n1 or n2 is less than 30. In this section we will show how Excel is used to conduct a hypothesis test about the difference between two population means - given that the populations have equal variances - when two small independent samples are taken from both populations. Similar to the case above, the data are taken from various offices here at the University of Baltimore. I collected hourly income data of 11 randomly selected work-study students and 11 randomly selected student assistants. The hourly income ranges for both groups were similar, $6 - $8 and $6 - $9. The main objective in this hypothesis testing is similar too: to see whether there is a significant difference between the means of the two populations. The NULL and the ALTERNATIVE hypotheses are that the means are equal and that they are not equal, respectively. Referring to the spreadsheet, we chose A1 and B1 as label cells. The work-study students' hourly incomes for a sample size of 11 are shown in cells A2:A12, and the student assistants' hourly incomes for a sample size of 11 are shown in cells B2:B12. Unlike the previous case, you do not have to calculate the variances of the two samples; Excel will automatically calculate these quantities and use them in the calculation of the value of the test statistic. Similar to the previous case, but a bit different in step 2, to conduct the desired test hypothesis with Excel the following steps can be taken: Step 1. From the menus select Tools, then click on the Data Analysis option. Step 2. When the Data Analysis dialog box appears: choose t-Test: Two-Sample Assuming Equal Variances, then click OK. Step 3. When the t-Test: Two-Sample Assuming Equal Variances dialog box appears: enter A1:A12 in the Variable 1 Range box (work-study student hourly income); enter B1:B12 in the Variable 2 Range box (student assistant hourly income); enter 0 in the Hypothesized Mean Difference box (if you desire to test a mean difference other than zero, enter that value); then select Labels; enter 0.05, or whatever level of significance you desire, in the Alpha box; select a suitable Output Range for the results - I chose C1 - then click OK. The value of the test statistic, t = -1.362229828, appears, in our case, in cell D10.
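For reference, the pooled-variance t test that Excel's t-Test: Two-Sample Assuming Equal Variances performs can be reproduced with SciPy. The two samples below are hypothetical hourly-wage lists standing in for the work-study and student-assistant data, so the numbers will not match the worksheet output; the call itself is scipy.stats.ttest_ind:

    from scipy.stats import ttest_ind

    work_study = [6.0, 6.5, 6.5, 7.0, 7.0, 7.5, 7.5, 8.0, 8.0, 6.0, 7.0]   # hypothetical
    assistants = [6.5, 7.0, 7.5, 7.5, 8.0, 8.0, 8.5, 8.5, 9.0, 7.0, 8.0]   # hypothetical

    t_stat, p_value = ttest_ind(work_study, assistants, equal_var=True)    # pooled-variance t test
    print(t_stat, p_value)   # reject the null hypothesis of equal means only if p_value < alpha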
The rejection rule for this test is t < -2.086 or t > 2.086, from the t distribution table, where the t value is based on a t distribution with n1 + n2 - 2 degrees of freedom and where the area of the upper one tail is 0.025 (that is, equal to alpha/2). In the Excel output the critical values for a two-tail test are t = +/-2.085962478. Since the value of the test statistic, t = -1.362229828, is in the acceptance range of -2.085962478 to +2.085962478, we fail to reject the null hypothesis. We can also draw this conclusion by comparing the p-value for a two-tail test and the alpha value. Since the p-value 0.188271278 is greater than alpha = 0.05, again we fail to reject the null hypothesis. Overall we can say, based on sample results, that the two population means are equal.

The One-Way ANOVA: Enter the data in an Excel worksheet starting with cell A2 and ending with cell C8. The following steps should be taken to find the proper output for interpretation. Step 1. From the menus select Tools and click on the Data Analysis option. Step 2. When the Data Analysis dialog appears, choose the Anova: Single Factor option and enter A2:C8 in the Input Range box. Select Labels in First Row. Step 3. Select any cell as output (here we selected A11). Click OK. The general form of the ANOVA table lists each Source of Variation together with its degrees of freedom, sum of squares, mean square, and F statistic. Suppose the test is done at a level of significance of alpha = 0.05; we reject the null hypothesis. This means there is a significant difference between the means of the hourly incomes of the student assistants in these departments.

The Two-Way ANOVA Without Replication: In this section, the study involves six students who were offered different hourly wages in three different department services here at the University of Baltimore. The objective is to see whether the hourly incomes are the same. Therefore, we can consider the following: Treatment: hourly payments in the three departments. Blocks: each student is a block, since each student has worked in the three different departments. The general form of the ANOVA table again lists each Source of Variation with its degrees of freedom. To find the Excel output for the above data, the following steps can be taken: Step 1. From the menus select Tools and click on the Data Analysis option. Step 2. When the Data Analysis box appears: select Anova: Two-Factor Without Replication, then enter A2:D8 in the Input Range. Select Labels in First Row. Step 3. Select an output range (here we selected A11), then click OK. NOTE: F = MST/MSE = 0.980556/0.497222 = 1.972067, while F = 3.33 from the table (5 numerator DF and 10 denominator DF). Since 1.972067 < 3.33, we fail to reject the null hypothesis.

Goodness-of-Fit Test for Discrete Random Variables: The CHI-SQUARE distribution can be used in a hypothesis test involving a population variance. However, in this section we would like to test and see how close sample results are to the expected results. Example: The Multinomial Random Variable. In this example the objective is to see whether or not, based on randomly selected sample information, the standards set for a population are met. There are many practical examples that can be used in this situation. For example, it is assumed that the guidelines for hiring people with different ethnic backgrounds for the US government are set at 70% (White), 20% (African American) and 10% (others), respectively. A randomly selected sample of 1000 US employees shows the results summarized in a table of expected numbers of employees versus the numbers observed from the sample. As you see, the observed sample numbers for groups two and three are lower than their expected values, unlike group one, whose observed number is higher than its expected value.
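The single-factor ANOVA described above can likewise be checked outside Excel; a minimal sketch, with hypothetical hourly-income samples for three departments, using scipy.stats.f_oneway:

    from scipy.stats import f_oneway

    dept_a = [6.0, 6.5, 7.0, 7.5, 6.5, 7.0]   # hypothetical hourly incomes, department A
    dept_b = [7.0, 7.5, 8.0, 7.5, 8.0, 7.0]   # hypothetical, department B
    dept_c = [6.0, 6.0, 6.5, 7.0, 6.5, 6.0]   # hypothetical, department C

    f_stat, p_value = f_oneway(dept_a, dept_b, dept_c)   # one-way (single-factor) ANOVA
    print(f_stat, p_value)   # reject the null hypothesis of equal means if p_value < 0.05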
Are these lower observed counts a clear sign of discrimination with respect to ethnic background? Well, it depends on how much lower the observed values are; the difference might not be statistically significant. To see whether these differences are significant we can use Excel and find the value of the CHI-SQUARE statistic. If this value falls within the acceptance region we can assume that the guidelines are met; otherwise they are not. Now let's enter these numbers into an Excel spreadsheet. We used cells B7:B9 for the expected proportions, C7:C9 for the observed values and D7:D9 for the expected frequencies. To calculate the expected frequency for a category, you can multiply the proportion of that category by the sample size (here 1000). The formula for the first cell of the expected value column, D7, is =1000*B7. To find the other entries in the expected value column, use the copy and paste menu as shown in the following picture. These are important values for the chi-square test. The observed range in this case is C7:C9, while the expected range is D7:D9. The null and alternative hypotheses for this test are as follows: H0: the population proportions are PW = 0.70, PA = 0.20 and PO = 0.10; HA: the population proportions are not PW = 0.70, PA = 0.20 and PO = 0.10. Now let's use Excel to calculate the p-value in a CHI-SQUARE test. Step 1. Select a cell in the worksheet, the location in which you would like the p-value of the CHI-SQUARE to appear. We chose cell D12. Step 2. From the menus, select Insert, then click on the Function option; the Paste Function dialog box appears. Step 3. Refer to the Function Category box and choose Statistical; from the Function Name box select CHITEST and click OK. Step 4. When the CHITEST dialog appears: enter C7:C9 in the actual-range box, then enter D7:D9 in the expected-range box, and finally click OK. The p-value will appear in the selected cell, D12. As you see, the p-value is 0.002392, which is less than the value of the level of significance (in this case the level of significance is alpha = 0.10). Hence the null hypothesis should be rejected. This means that, based on the sample information, the guidelines are not met. Notice that if you type =CHITEST(C7:C9,D7:D9) in the formula bar, the p-value will show up in the designated cell. NOTE: Excel can also find the value of the CHI-SQUARE statistic itself. To find this value, first select an empty cell on the spreadsheet, then in the formula bar type =CHIINV(D12,2). D12 designates the p-value found previously and 2 is the degrees of freedom (number of rows minus one). The CHI-SQUARE value in this case is 12.07121. If we refer to the CHI-SQUARE table we will see that the cut-off is 4.60517; since 12.07121 > 4.60517 we reject the null hypothesis. The following screen shot shows how to find the CHI-SQUARE value.

Test of Independence: Contingency Tables. The CHI-SQUARE distribution is also used to test whether two variables are independent or not. For example, based on sample data you might want to see whether smoking and gender are independent events for a certain population. The variables of interest in this case are smoking and the gender of an individual. Another example in this situation could involve the age range of an individual and his or her smoking habit. Similar to case one, the data may appear in a table, but unlike case one this table may contain several columns in addition to rows. The initial table contains the observed values. To find the expected values for this table we set up another table similar to this one.
To find the value of each cell in the new table we should multiply the sum of the cell's column by the sum of the cell's row and divide the result by the grand total. The grand total is the total number of observations in the study. Now, based on the following table, test whether or not the smoking habit and the gender of the population that the following sample was taken from are independent. In other words, is it true that males in this population smoke more than females? You can use the formula bar to calculate the expected values for the expected range. For example, to find the expected value for cell C5, which is placed in cell C11, you can click on the formula bar, enter =C6*D5/D6, and then press Enter in cell C11. Step 1. The observed range is B4:C5 (smoking and gender), so the observed range is B4:C5 and the expected range is B10:C11. Step 3. Click on fx (Paste Function). Step 4. When the Paste Function dialog box appears, click on Statistical in the function category and CHITEST in the function name, then click OK. When the CHITEST box appears, enter B4:C5 for the actual range, then B10:C11 for the expected range. Step 5. Click OK (the p-value appears): 0.477395. Conclusion: Since the p-value is greater than the level of significance (0.05), we fail to reject the null hypothesis. This means smoking and gender are independent events. Based on the sample information one cannot assert that females smoke more than males or the other way around. Step 6. To find the chi-square value, use the CHIINV function; when the CHIINV box appears, enter 0.477395 for the probability part, then 1 for the degrees of freedom. Degrees of freedom = (number of columns - 1) x (number of rows - 1).

Test of Hypotheses Concerning the Variances of Two Populations: In this section we would like to examine whether or not the variances of two populations are equal. Whenever independent simple random samples of equal or different sizes, such as n1 and n2, are taken from two normal distributions with equal variances, the sampling distribution of s1^2/s2^2 has an F distribution with n1 - 1 degrees of freedom for the numerator and n2 - 1 degrees of freedom for the denominator. In the ratio s1^2/s2^2 the numerator s1^2 and the denominator s2^2 are the variances of the first and the second sample, respectively. The following figure shows the graph of an F distribution with 10 degrees of freedom for both the numerator and the denominator. Unlike the normal distribution, as you see, the F distribution is not symmetric. The shape of an F distribution is positively skewed and depends on the degrees of freedom for the numerator and the denominator. The value of F is always positive. Now let us see whether or not the variances of the hourly incomes of student assistants and work-study students, based on the samples taken from the populations previously, are equal. Assume that the hypothesis test in this case is conducted at alpha = 0.10. The null hypothesis is that the two population variances are equal; the alternative is that they are not. Rejection Rule: Reject the null hypothesis if F < F0.95 or F > F0.05, where F, the value of the test statistic, is equal to s1^2/s2^2, with 10 degrees of freedom for both the numerator and the denominator. We can find the value of F0.05 from the F distribution table. If s1^2 >= s2^2, we do not need to know the value of F0.95; otherwise, F0.95 = 1/F0.05 for equal sample sizes. A survey of eleven student assistants and eleven work-study students shows the following descriptive statistics. Our objective is to find the value of s1^2/s2^2, where s1^2 is the value of the variance of the student assistant sample and s2^2 is the value of the variance of the work-study student sample.
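Both chi-square applications above (the multinomial goodness-of-fit test and the test of independence) have direct SciPy equivalents. The counts below are hypothetical, chosen only to illustrate the calls; scipy.stats.chisquare takes observed and expected frequencies, while chi2_contingency takes the observed table and computes the expected one itself:

    from scipy.stats import chisquare, chi2_contingency

    # goodness of fit: hypothetical observed counts vs. counts expected under 70% / 20% / 10%
    observed = [725, 160, 115]
    expected = [0.70 * 1000, 0.20 * 1000, 0.10 * 1000]
    chi2, p = chisquare(f_obs=observed, f_exp=expected)
    print(chi2, p)

    # test of independence: a hypothetical 2 x 2 smoking-by-gender table of observed counts
    table = [[30, 20],
             [25, 25]]
    chi2, p, dof, expected_table = chi2_contingency(table, correction=False)
    print(chi2, p, dof)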
Returning to the F test: as you see, these variance values are in cells F8 and D8 of the descriptive statistics output. To calculate the value of s1^2/s2^2, select a cell such as A16, enter the cell formula =F8/D8, and press Enter. This is the value of F in our problem. Since this value, F = 1.984615385, falls in the acceptance area, we fail to reject the null hypothesis. Hence, the sample results do support the conclusion that the student assistants' hourly income variance is equal to the work-study students' hourly income variance. The following screen shot shows how to find the F value. We can follow the same format for one-tail test(s).

Linear Correlation and Regression Analysis: In this section the objective is to see whether there is a correlation between two variables and to find a model that predicts one variable in terms of the other variable. There are many examples that we could mention, but we will mention a popular one from the world of business. Usually the independent variable is denoted by the letter x and the dependent variable is denoted by the letter y. A businessman would like to see whether there is a relationship between the number of cases of soda sold and the temperature on a hot summer day, based on information taken from the past. He also would like to estimate the number of cases of soda which will be sold on a particular hot summer day at a ball game. He carefully recorded the temperatures and the number of cases of soda sold on those particular days. The following table shows the recorded data from June 1 through June 13. The weatherman predicts a 94 F temperature for June 14, and the businessman would like to meet all the demand for the cases of soda ordered by customers on June 14. Now let's use Excel to find the linear correlation coefficient and the regression line equation. The linear correlation coefficient is a quantity between -1 and +1. This quantity is denoted by R. The closer R is to +1 the stronger the positive (direct) correlation, and similarly the closer R is to -1 the stronger the negative (inverse) correlation existing between the two variables. The general form of the regression line is y = mx + b. In this formula, m is the slope of the line and b is the y-intercept. You can find these quantities from the Excel output. In this situation the variable y (the dependent variable) is the number of cases of soda and x (the independent variable) is the temperature. To find the Excel output the following steps can be taken: Step 1. From the menus choose Tools and click on Data Analysis. Step 2. When the Data Analysis dialog box appears, click on Correlation. Step 3. When the Correlation dialog box appears, enter B1:C14 in the Input Range box. Click on Labels in First Row and enter A16 in the Output Range box. Click OK. As you see, the correlation between the number of cases of soda demanded and the temperature is a very strong positive correlation. This means that as the temperature increases the demand for cases of soda also increases. The linear correlation coefficient is 0.966598577, which is very close to 1. Now let's follow the same steps, but slightly different, to find the regression equation. Step 1. From the menus choose Tools and click on Data Analysis. Step 2. When the Data Analysis dialog box appears, click on Regression. Step 3. When the Regression dialog box appears, enter B1:B14 in the Y Range box and C1:C14 in the X Range box. Click on Labels. Step 4. Enter A19 in the Output Range box. Note: The regression equation in general should look like Y = m X + b. In this equation m is the slope of the regression line and b is its y-intercept.
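The correlation and regression computations above can be sketched in Python as well. The temperature and sales numbers below are hypothetical (the June 1 through June 13 table is not reproduced here), so the fitted coefficients will not match the Excel output exactly; scipy.stats.linregress returns the slope, intercept, and correlation coefficient R in one call:

    from scipy.stats import linregress

    temperature = [75, 78, 80, 83, 85, 86, 88, 90, 91, 92, 93, 94, 95]   # hypothetical x values
    cases_sold  = [65, 70, 72, 78, 82, 84, 88, 92, 94, 95, 97, 99, 101]  # hypothetical y values

    fit = linregress(temperature, cases_sold)
    print(fit.rvalue)                      # linear correlation coefficient R
    print(fit.slope, fit.intercept)        # m and b in y = m x + b
    print(fit.slope * 94 + fit.intercept)  # predicted cases for a 94 degree day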
Adjusted R Square: The relationship between the number of cases of soda and the temperature is Y = 0.879202711 X + 9.17800767, i.e., the number of cases of soda = 0.879202711 (temperature) + 9.17800767. Referring to this expression we can approximately predict the number of cases of soda needed on June 14. The weather forecast for that day is 94 degrees, hence the number of cases of soda needed is equal to 0.879202711 (94) + 9.17800767 = 91.82, or about 92 cases.

Moving Average and Exponential Smoothing. Moving Average Models: Use the Add Trendline option to analyze a moving average forecasting model in Excel. You must first create a graph of the time series you want to analyze. Select the range that contains your data and make a scatter plot of the data. Once the chart is created, follow these steps: Click on the chart to select it, and click on any point on the line to select the data series. When you click on the chart to select it, a new option, Chart, is added to the menu bar. From the Chart menu, select Add Trendline. The following is the moving average of order 4 for weekly sales. Exponential Smoothing Models: The simplest way to analyze a time series using an exponential smoothing model in Excel is to use the data analysis tool. This tool works almost exactly like the one for Moving Average, except that you will need to input the value of alpha instead of the number of periods, k. Once you have entered the data range and the damping factor, 1 - alpha, and indicated what output you want and a location, the analysis is the same as the one for the Moving Average model.

Applications and Numerical Examples. Descriptive Statistics: Suppose you have the following n = 10 data: 1.2, 1.5, 2.6, 3.8, 2.4, 1.9, 3.5, 2.5, 2.4, 3.0. Type your n data points into cells A1 through An. Click on the Tools menu. (At the bottom of the Tools menu will be a submenu Data Analysis, if the Analysis ToolPak has been properly installed.) Clicking on Data Analysis will lead to a menu from which Descriptive Statistics is to be selected. Select Descriptive Statistics by pointing at it and clicking twice, or by highlighting it and clicking on the OK button. Within the Descriptive Statistics submenu: a. for the input range enter A1:An, assuming you typed the data into cells A1 to An; b. click on the output range button and enter the output range C1:C16; c. click on the Summary Statistics box; d. finally, click OK.

The Central Tendency: The data can be sorted in ascending order: 1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8. The mean, median and mode are computed as follows: mean = (1.2 + 1.5 + 2.6 + 3.8 + 2.4 + 1.9 + 3.5 + 2.5 + 2.4 + 3.0) / 10 = 2.48. The median is the average of the fifth and sixth sorted values, (2.4 + 2.5) / 2 = 2.45. The mode is 2.4, since it is the only value that occurs twice. The midrange is (1.2 + 3.8) / 2 = 2.5. Note that the mean, median and mode of this set of data are very close to each other. This suggests that the data are very symmetrically distributed.

Variance: The variance of a set of data is the average of the cumulative measure of the squares of the difference of all the data values from the mean. The sample variance and the sample-based estimate of the population variance are computed differently. The sample variance is simply the arithmetic mean of the squares of the differences between each data value in the sample and the mean of the sample. On the other hand, the formula for an estimate of the variance in the population is similar to the formula for the sample variance, except that the denominator in the fraction is (n-1) instead of n.
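The two forecasting tools just described are simple enough to write out directly. This minimal sketch (plain Python, with hypothetical weekly sales figures) mirrors a moving average of order k and the exponential-smoothing recursion for which Excel asks for the damping factor 1 - alpha:

    def moving_average(series, k=4):
        # each output value averages the k most recent observations
        return [sum(series[i - k + 1:i + 1]) / k for i in range(k - 1, len(series))]

    def exponential_smoothing(series, alpha=0.3):
        # S(t) = alpha * x(t) + (1 - alpha) * S(t-1), started at the first observation
        smoothed = [series[0]]
        for x in series[1:]:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
        return smoothed

    weekly_sales = [39, 44, 40, 45, 38, 43, 39, 44]   # hypothetical weekly sales
    print(moving_average(weekly_sales, 4))
    print(exponential_smoothing(weekly_sales, 0.3))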
You should not worry about the difference between these two variance formulas if the sample size is large, say over 30. Compute an estimate for the variance of the population, given the following sorted data: 1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8; mean = 2.48 as computed earlier. An estimate for the population variance is: s^2 = [1/(10-1)] [(1.2 - 2.48)^2 + (1.5 - 2.48)^2 + (1.9 - 2.48)^2 + (2.4 - 2.48)^2 + (2.4 - 2.48)^2 + (2.5 - 2.48)^2 + (2.6 - 2.48)^2 + (3.0 - 2.48)^2 + (3.5 - 2.48)^2 + (3.8 - 2.48)^2] = (1/9)(1.6384 + 0.9604 + 0.3364 + 0.0064 + 0.0064 + 0.0004 + 0.0144 + 0.2704 + 1.0404 + 1.7424) = 0.6684. Therefore, the standard deviation is s = (0.6684)^(1/2) = 0.8176.

Probability and Expected Values: Newsweek reported that the average take for bank robberies was $3,244, but 85 percent of the robbers were caught. Assuming 60 percent of those caught lose their entire take and 40 percent lose half, graph the probability mass function using Excel. Calculate the expected take from a bank robbery. Does it pay to be a bank robber? To construct the probability function for bank robberies, first define the random variable x, the bank robbery take. If the robber is not caught, x = 3,244. If the robber is caught and manages to keep half, x = 1,622. If the robber is caught and loses it all, then x = 0. The associated probabilities for these x values are 0.15 = (1 - 0.85), 0.34 = (0.85)(0.4), and 0.51 = (0.85)(0.6). After entering the x values in cells A1, A2 and A3 and after entering the associated probabilities in B1, B2, and B3, the following steps lead to the probability mass function: Click on ChartWizard. The ChartWizard Step 1 of 4 screen will appear. Highlight Column at ChartWizard Step 1 of 4 and click Next. At ChartWizard Step 2 of 4, Chart Source Data, enter B1:B3 for the Data range, and click the Columns button for Series in. A graph will appear. Click on Series toward the top of the screen to get a new page. At the bottom of the Series page is a rectangle for Category (X) axis labels; click on this rectangle and then highlight A1:A3. At Step 3 of 4 move on by clicking Next, and at Step 4 of 4, click Finish. The expected value of a robbery is $1,038.08: E(X) = (0)(0.51) + (1622)(0.34) + (3244)(0.15) = 0 + 551.48 + 486.60 = 1,038.08. The expected return on a bank robbery is positive. On average, bank robbers get $1,038.08 per heist. If criminals make their decisions strictly on this expected value, then it pays to rob banks. A decision rule based only on an expected value, however, ignores the risks or variability in the returns. In addition, our expected value calculations do not include the cost of jail time, which could be viewed by criminals as substantial.

Discrete and Continuous Random Variables - Binomial Distribution Application: A multiple-choice test has four unrelated questions. Each question has five possible choices but only one is correct. Thus, a person who guesses randomly has a probability of 0.2 of guessing correctly. Draw a tree diagram showing the different ways in which a test taker could get 0, 1, 2, 3 and 4 correct answers. Sketch the probability mass function for this test. What is the probability that a person who guesses will get two or more correct? Solution: Letting Y stand for a correct answer and N for a wrong answer, where the probability of Y is 0.2 and the probability of N is 0.8 for each of the four questions, the probability tree diagram is shown in the textbook on page 182. This probability tree diagram shows the branches that must be followed to show the calculations captured in the binomial mass function for n = 4 and p = 0.2.
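Stepping back to the bank-robbery example for a moment, the expected-value arithmetic there is a one-line sum over the three outcomes; a minimal Python check of the $1,038.08 figure:

    # outcome amounts and their probabilities, as defined in the robbery example above
    outcomes = [(3244, 0.15), (1622, 0.34), (0, 0.51)]
    expected_take = sum(x * p for x, p in outcomes)
    print(expected_take)   # 1038.08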
For example, the tree diagram shows the six different branch systems that yield two correct and two wrong answers (which corresponds to 4!/(2!2!) = 6). The binomial mass function shows the probability of two correct answers as P(x = 2 | n = 4, p = 0.2) = 6(0.2)²(0.8)² = 6(0.0256) = 0.1536 = P(2), which is obtained from Excel by using the BINOMDIST command, where the first entry is x, the second is n, the third is p, and the fourth is mass (0) or cumulative (1). That is, entering BINOMDIST(2,4,0.2,0) in any Excel cell yields 0.1536. Similarly, BINOMDIST(3,4,0.2,0) yields P(x = 3 | n = 4, p = 0.2) = 0.0256; BINOMDIST(4,4,0.2,0) yields P(x = 4 | n = 4, p = 0.2) = 0.0016; and 1-BINOMDIST(1,4,0.2,1) yields P(x ≥ 2 | n = 4, p = 0.2) = 0.1808.

Normal Example: If the time required to complete an examination by those with a certain learning disability is believed to be distributed normally, with a mean of 65 minutes and a standard deviation of 15 minutes, then when can the exam be terminated so that 99 percent of those with the disability can finish? Solution: Because the average and standard deviation are known, what needs to be established is the amount of time, above the mean time, such that 99 percent of the distribution is lower. This is a distance that is measured in standard deviations, as given by the Z value corresponding to the 0.99 probability found in the body of Appendix B, Table 5, as shown in the textbook; or the command entered into any cell of Excel to find this Z value is NORMINV(0.99,0,1), which gives 2.326342. The closest cumulative probability that can be found in the table is 0.9901, in the row labeled 2.3 and the column headed by .03, so Z = 2.33, which is only an approximation for the more exact 2.326342 found in Excel. Using this more exact value, the calculation with mean m and standard deviation s uses the formula Z = (X - m)/s; that is, Z = (x - 65)/15. Thus, x = 65 + 15(2.32634) = 99.9 minutes. Alternatively, instead of standardizing with the Z distribution, using Excel we can simply work directly with the normal distribution with a mean of 65 and standard deviation of 15 and enter NORMINV(0.99,65,15). In general, to obtain the x value for which alpha percent of a normal random variable's values are lower, the NORMINV command may be used, where the first entry is alpha, the second is m, and the third is s.

Another Example: In the early 1980s, the Toro Company of Minneapolis, Minnesota, advertised that it would refund the purchase price of a snow blower if the following winter's snowfall was less than 21 percent of the local average. If the average snowfall is 45.25 inches, with a standard deviation of 12.2 inches, what is the likelihood that Toro will have to make refunds? Solution: Within limits, snowfall is a continuous random variable that can be expected to vary symmetrically around its mean, with values closer to the mean occurring most often. Thus, it seems reasonable to assume that snowfall (x) is approximately normally distributed with a mean of 45.25 inches and standard deviation of 12.2 inches. Nine and one half inches is 21 percent of the mean snowfall of 45.25 inches and, with a standard deviation of 12.2 inches, the number of standard deviations between 45.25 inches and 9.5 inches is Z = (x - m)/s = (9.50 - 45.25)/12.2 = -2.93. Using Appendix B, Table 5, the textbook demonstrates the determination of P(x ≤ 9.50) = P(z ≤ -2.93) = 0.0017, the probability of snowfall less than 9.5 inches. Using Excel, this normal probability is obtained with the NORMDIST command, where the first entry is x, the second is the mean m, the third is the standard deviation s, and the fourth is CUMULATIVE (1).
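For readers who want to verify these Excel results with other software, the following R sketch reproduces the same quantities; it is an illustrative aside, with dbinom/pbinom/qnorm/pnorm serving as the standard R counterparts of the Excel BINOMDIST, NORMINV and NORMDIST commands used above:

x <- c(0, 1622, 3244)
p <- c(0.51, 0.34, 0.15)
sum(x * p)                            # expected take from a bank robbery: 1038.08

dbinom(2, size = 4, prob = 0.2)       # 0.1536, like BINOMDIST(2,4,0.2,0)
dbinom(3, size = 4, prob = 0.2)       # 0.0256
dbinom(4, size = 4, prob = 0.2)       # 0.0016
1 - pbinom(1, size = 4, prob = 0.2)   # 0.1808, the probability of two or more correct

qnorm(0.99, mean = 65, sd = 15)       # about 99.9 minutes, like NORMINV(0.99,65,15)
pnorm(9.5, mean = 45.25, sd = 12.2)   # about 0.0017, like NORMDIST(9.5,45.25,12.2,1)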
Entering NORMDIST(9.5,45.25,12.2,1) gives P(x ≤ 9.50) = 0.001693.

Sampling Distribution and the Central Limit Theorem: A bakery sells an average of 24 loaves of bread per day. Sales (x) are normally distributed with a standard deviation of 4. If a random sample of size n = 1 (day) is selected, what is the probability this x value will exceed 28? If a random sample of size n = 4 (days) is selected, what is the probability that xbar ≥ 28? Why does the answer in part 1 differ from that in part 2? 1. The sampling distribution of the sample mean xbar is normal with a mean of 24 and a standard error of the mean of 4. Thus, using Excel, 0.15866 = 1-NORMDIST(28,24,4,1). 2. The sampling distribution of the sample mean xbar is normal with a mean of 24 and a standard error of the mean of 2 (the standard deviation 4 divided by the square root of the sample size 4); using Excel, 0.02275 = 1-NORMDIST(28,24,2,1). The answers differ because the standard error of the mean shrinks as the sample size grows, so a sample mean of 28 is much less likely for n = 4 than for n = 1.

Regression Analysis: The highway deaths per 100 million vehicle miles and the highway speed limits for 10 countries are given below: (Death, Speed) = (3.0, 55), (3.3, 55), (3.4, 55), (3.5, 70), (4.1, 55), (4.3, 60), (4.7, 55), (4.9, 60), (5.1, 60), and (6.1, 75). From this we can see that five countries with the same speed limit have very different positions on the safety list. For example, Britain, with a speed limit of 70, is demonstrably safer than Japan, at 55. Can we argue that speed has little to do with safety? Use regression analysis to answer this question. Solution: Enter the ten paired y and x data into cells A2 to A11 and B2 to B11, with the death rate label in A1 and the speed limit label in B1; the following steps then produce the regression output. Choose Regression from Data Analysis in the Tools menu. The Regression dialog box will appear. Note: Use the mouse to move between the boxes and buttons. Click on the desired box or button. The large rectangular boxes require a range from the worksheet. A range may be typed in or selected by highlighting the cells with the mouse after clicking on the box. If the dialog box blocks the data, it can be moved on the screen by clicking on the title bar and dragging. For the Input Y Range, enter A1 to A11, and for the Input X Range enter B1 to B11. Because the Y and X ranges include the Death and Speed labels in A1 and B1, select the Labels box with a click. Click the Output Range button and type a reference cell, which in this demonstration is A13. To get the predicted values of Y (death rates) and the residuals, select the Residuals box with a click. Your screen display should show a table; clicking OK will give the SUMMARY OUTPUT, ANOVA and RESIDUAL OUTPUT.

The first section of the Excel printout gives the SUMMARY OUTPUT. The Multiple R is the square root of the R Square, the computation and interpretation of which we have already discussed. The Standard Error of estimate (which will be discussed in the next chapter) is s = 0.86423, which is the square root of the Residual SS = 5.97511 divided by its degrees of freedom, df = 8, as given in the ANOVA section. We will also discuss the adjusted R-square of 0.21325 in the following chapters. Under the ANOVA section are the estimated regression coefficients and related statistics that will be discussed in detail in the next chapter. For now it is sufficient to recognize that the calculated coefficient values for the slope and y intercept are provided (b = 0.07556 and a = -0.29333). Next to these coefficient estimates is information on the variability in the distribution of the least-squares estimators from which these specific estimates were drawn.
The column titled Std. Error contains the standard deviations (standard errors) of the intercept and slope distributions; the t Stat and P-value columns give the calculated values of the t statistics and the associated p-values. As shown in Chapter 13, the t statistic of 1.85458 and p-value of 0.10077, for example, indicate that the sample slope (0.07556) is sufficiently different from zero, at roughly the 0.10 two-tail Type I error level, to conclude that there is a significant relationship between deaths and speed limits in the population. This conclusion is contrary to the assertion that speed has little to do with safety.

SUMMARY OUTPUT:
Multiple R         0.54833
R Square           0.30067
Adjusted R Square  0.21325
Standard Error     0.86423
Observations       10

ANOVA
            df   SS        MS        F         P-value
Regression   1   2.56889   2.56889   3.43945   0.10077
Residual     8   5.97511   0.74689
Total        9   8.54400

Coeffs.     Estimate   Std. Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept   -0.29333   2.45963      -0.11926   0.90801   -5.96526    5.37860
Speed        0.07556   0.04074       1.85458   0.10077   -0.01839    0.16950

Predicted   Residuals
3.86222     -0.86222
3.86222     -0.56222
3.86222     -0.46222
4.99556     -1.49556
3.86222      0.23778
4.24000      0.06000
3.86222      0.83778
4.24000      0.66000
4.24000      0.86000
5.37333      0.72667

Microsoft Excel Add-Ins: Forecasting with regression requires the Excel add-in called Analysis ToolPak, and linear programming requires the Excel add-in called Solver. How you check to see if these are activated on your computer, and how to activate them if they are not, varies with the Excel version. Here are instructions for the most common versions. If Excel will not let you activate Data Analysis and Solver, you must use a different computer.

Excel 2002/2003: Start Excel, then click Tools and look for Data Analysis and for Solver. If both are there, press Esc (escape) and continue with the respective assignment. Otherwise click Tools, Add-Ins, and check the boxes for Analysis ToolPak and for Solver, then click OK. Click Tools again, and both tools should be there.

Excel 2007: Start Excel 2007 and click the Data tab at the top. Look to see if Data Analysis and Solver show in the Analysis section at the far right. If both are there, continue with the respective assignment. Otherwise, do the following steps exactly as indicated:
- click the Office Button at top left
- click the Excel Options button near the bottom of the resulting window
- click the Add-ins button on the left of the next screen
- near the bottom, at Manage Excel Add-ins, click Go
- check the boxes for Analysis ToolPak and Solver Add-in if they are not already checked, then click OK
- click the Data tab as above and verify that the add-ins show.

Excel 2010: Start Excel 2010 and click the Data tab at the top. Look to see if Data Analysis and Solver show in the Analysis section at the far right. If both are there, continue with the respective assignment. Otherwise, do the following steps exactly as indicated:
- click the File tab at top left
- click the Options button near the bottom of the left side
- click the Add-ins button near the bottom left of the next screen
- near the bottom, at Manage Excel Add-ins, click Go
- check the boxes for Analysis ToolPak and Solver Add-in if they are not already checked, then click OK
- click the Data tab as above and verify that the add-ins show.

Solving Linear Programs by Excel: Some of these examples can be modified for other types of problems.

Computer-assisted Learning: E-Labs and Computational Tools. My teaching style deprecates the "plug the numbers into the software and let the magic box work it out" approach.
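As an aside, the regression output above can be checked outside of Excel as well; this short R sketch (illustrative only, with made-up object names) reproduces the same slope, intercept, predicted values and residuals from the ten (Death, Speed) pairs:

death <- c(3.0, 3.3, 3.4, 3.5, 4.1, 4.3, 4.7, 4.9, 5.1, 6.1)
speed <- c(55, 55, 55, 70, 55, 60, 55, 60, 60, 75)
fit <- lm(death ~ speed)
summary(fit)     # slope about 0.0756, intercept about -0.2933, R Square about 0.30
fitted(fit)      # the Predicted column above
residuals(fit)   # the Residuals column above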
Personal computers, spreadsheets (e.g., Excel), professional statistical packages (e.g., SPSS), and other information technologies are now ubiquitous in statistical data analysis. Without using these tools, one cannot perform any realistic statistical data analysis on large data sets. The appearance of other computer software, such as JavaScript applets, statistical demonstration applets, and online computation, is among the most important developments in the process of teaching and learning concepts in model-based statistical decision making courses. These tools allow you to construct numerical examples to understand the concepts, and to find their significance for yourself. Using the interactive tools available on the WWW to perform statistical experiments (with the same purpose as the experiments you did in physics labs to learn physics) and to understand statistical concepts such as the Central Limit Theorem is both entertaining and educational. Computer-assisted learning is similar to the experiential model of learning. The adherents of experiential learning are fairly adamant about how we learn. Learning seldom takes place by rote. Learning occurs because we immerse ourselves in a situation in which we are forced to perform and think. You get feedback from the computer output and then adjust your thinking process if needed.

SPSS (Statistical Package for the Social Sciences) is a data management and analysis product. It can perform a variety of data analysis and presentation functions, including statistical analyses and graphical presentation of data. SAS (Statistical Analysis System) is a system of software packages; some of its basic functions and uses are: database management; inputting, cleaning and manipulating data; and statistical analysis, such as calculating simple statistics (means, variances, correlations) and running standard routines such as regressions. Both are available via the SPSS/SAS packages on Citrix (Installing and Accessing); use your email ID and password. For technical difficulties, contact the OTS Call Center at (401) 837-6262. Excel is excellent for descriptive statistics, and its acceptance as a computational tool for inferential statistics is improving.

The Value of Performing Experiments: If the learning environment is focused on background information, knowledge of terms and new concepts, the learner is likely to learn that basic information successfully. However, this basic knowledge may not be sufficient to enable the learner to carry out successfully the on-the-job tasks that require more than basic knowledge. Thus, the probability of making real errors in the business environment is high. On the other hand, if the learning environment allows the learner to experience and learn from failures within a variety of situations similar to what they would experience in the real world of their job, the probability of having similar failures in their business environment is low. This is the realm of simulations: a safe place to fail.

The appearance of statistical software is one of the most important events in the process of decision making under uncertainty. Statistical software systems are used to construct examples, to understand the existing concepts, and to find new statistical properties. On the other hand, new developments in the process of decision making under uncertainty often motivate developments of new approaches and revision of the existing software systems.
Statistical software systems rely on the cooperation of statisticians and software developers.

Besides the professional statistical software and online statistical computation, the use of a scientific calculator is required for the course. A scientific calculator is one that can give you, say, the square root of 5. Any calculator that goes beyond the four basic operations is fine for this course. These calculators allow you to perform the simple calculations you need in this course, for example taking a square root or raising e to a power such as 0.36, and so on. These types of calculators are called general scientific calculators. There are also more specific and advanced calculators for mathematical computations in other areas such as finance, accounting, and even statistics. The last kind, for example, computes the mean, variance, skewness, and kurtosis of a sample by simply entering all the data one by one and then pressing any of the mean, variance, skewness, and kurtosis keys. Without a computer one cannot perform any realistic statistical data analysis. Students who are signing up for the course are expected to know the basics of Excel. As a starting point, you should visit the Excel Web site created for this course. If you are challenged by or unfamiliar with Excel, you may seek tutorial help from the Academic Resource Center at 410-837-5385 or by e-mail.

What and How to Hand In My Computer Assignment: For the computer assignment I recommend checking your hand-computed homework and checking some of the numerical examples from your textbook. As part of your homework assignment you do not have to hand in the printout of the computer-assisted learning; however, you must include within your homework a paragraph entitled Computer Implementation describing your (positive or negative) experience.

Using R for Time Series Analysis

Time Series Analysis: This booklet tells you how to use the R statistical software to carry out some simple analyses that are common in analysing time series data. This booklet assumes that the reader has some basic knowledge of time series analysis, and the principal focus of the booklet is not to explain time series analysis, but rather to explain how to carry out these analyses using R. If you are new to time series analysis, and want to learn more about any of the concepts presented here, I would highly recommend the Open University book "Time series" (product code M249/02), available from the Open University Shop. In this booklet, I will be using time series data sets that have been kindly made available by Rob Hyndman in his Time Series Data Library at robjhyndman.com/TSDL. If you like this booklet, you may also like to check out my booklet on using R for biomedical statistics, a-little-book-of-r-for-biomedical-statistics.readthedocs.org, and my booklet on using R for multivariate analysis, little-book-of-r-for-multivariate-analysis.readthedocs.org.
Reading Time Series Data: The first thing that you will want to do to analyse your time series data will be to read it into R, and to plot the time series. You can read data into R using the scan() function, which assumes that your data for successive time points is in a simple text file with one column. For example, the file robjhyndman.com/tsdldata/misc/kings.dat contains data on the age of death of successive kings of England, starting with William the Conqueror (original source: Hipel and McLeod, 1994). The first three lines of the file contain some comments on the data, and we want to ignore these when we read the data into R. We can do this by using the "skip" parameter of the scan() function, which specifies how many lines at the top of the file to ignore. To read the file into R, ignoring the first three lines, we type: In this case the age of death of 42 successive kings of England has been read into the variable 'kings'. Once you have read the time series data into R, the next step is to store the data in a time series object in R, so that you can use R's many functions for analysing time series data. To store the data in a time series object, we use the ts() function in R. For example, to store the data in the variable 'kings' as a time series object in R, we type: Sometimes the time series data set that you have may have been collected at regular intervals that were less than one year, for example, monthly or quarterly. In this case, you can specify the number of times that data was collected per year by using the 'frequency' parameter in the ts() function. For monthly time series data, you set frequency=12, while for quarterly time series data, you set frequency=4. You can also specify the first year that the data was collected, and the first interval in that year, by using the 'start' parameter in the ts() function. For example, if the first data point corresponds to the second quarter of 1986, you would set start=c(1986,2). An example is a data set of the number of births per month in New York city, from January 1946 to December 1959 (originally collected by Newton). This data is available in the file robjhyndman.com/tsdldata/data/nybirths.dat. We can read the data into R, and store it as a time series object, by typing: Similarly, the file robjhyndman.com/tsdldata/data/fancy.dat contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, for January 1987 to December 1993 (original data from Wheelwright and Hyndman, 1998). We can read the data into R by typing:

Plotting Time Series: Once you have read a time series into R, the next step is usually to make a plot of the time series data, which you can do with the plot.ts() function in R. For example, to plot the time series of the age of death of 42 successive kings of England, we type: We can see from the time plot that this time series could probably be described using an additive model, since the random fluctuations in the data are roughly constant in size over time. Likewise, to plot the time series of the number of births per month in New York city, we type: We can see from this time series that there seems to be seasonal variation in the number of births per month: there is a peak every summer, and a trough every winter.
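The R commands that belong with the "we type:" steps above did not survive in this copy of the text, so here is a sketch of what they look like; the data file URLs are reconstructed from the garbled references and should be treated as assumptions:

kings <- scan("http://robjhyndman.com/tsdldata/misc/kings.dat", skip = 3)  # skip the 3 comment lines
kingstimeseries <- ts(kings)                                               # 42 ages at death, one per king
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
birthstimeseries <- ts(births, frequency = 12, start = c(1946, 1))         # monthly data from January 1946
souvenir <- scan("http://robjhyndman.com/tsdldata/data/fancy.dat")
souvenirtimeseries <- ts(souvenir, frequency = 12, start = c(1987, 1))     # monthly data from January 1987
plot.ts(kingstimeseries)       # time plot of the kings data
plot.ts(birthstimeseries)      # time plot of the New York births data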
Again, it seems that this time series could probably be described using an additive model, as the seasonal fluctuations are roughly constant in size over time and do not seem to depend on the level of the time series, and the random fluctuations also seem to be roughly constant in size over time. Similarly, to plot the time series of the monthly sales for the souvenir shop at a beach resort town in Queensland, Australia, we type: In this case, it appears that an additive model is not appropriate for describing this time series, since the size of the seasonal fluctuations and random fluctuations seems to increase with the level of the time series. Thus, we may need to transform the time series in order to get a transformed time series that can be described using an additive model. For example, we can transform the time series by calculating the natural log of the original data: Here we can see that the size of the seasonal fluctuations and random fluctuations in the log-transformed time series seems to be roughly constant over time, and does not depend on the level of the time series. Thus, the log-transformed time series can probably be described using an additive model.

Decomposing Time Series: Decomposing a time series means separating it into its constituent components, which are usually a trend component and an irregular component, and, if it is a seasonal time series, a seasonal component.

Decomposing Non-Seasonal Data: A non-seasonal time series consists of a trend component and an irregular component. Decomposing the time series involves trying to separate the time series into these components, that is, estimating the trend component and the irregular component. To estimate the trend component of a non-seasonal time series that can be described using an additive model, it is common to use a smoothing method, such as calculating the simple moving average of the time series. The SMA() function in the "TTR" R package can be used to smooth time series data using a simple moving average. To use this function, we first need to install the "TTR" R package (for instructions on how to install an R package, see How to install an R package). Once you have installed the "TTR" R package, you can load it by typing: You can then use the SMA() function to smooth time series data. To use the SMA() function, you need to specify the order (span) of the simple moving average, using the parameter "n". For example, to calculate a simple moving average of order 5, we set n=5 in the SMA() function. For example, as discussed above, the time series of the age of death of 42 successive kings of England appears to be non-seasonal, and can probably be described using an additive model, since the random fluctuations in the data are roughly constant in size over time. Thus, we can try to estimate the trend component of this time series by smoothing using a simple moving average. To smooth the time series using a simple moving average of order 3, and plot the smoothed time series data, we type: There still appears to be quite a lot of random fluctuation in the time series smoothed using a simple moving average of order 3. Thus, to estimate the trend component more accurately, we might want to try smoothing the data with a simple moving average of a higher order. This takes a little bit of trial and error, to find the right amount of smoothing.
For example, we can try using a simple moving average of order 8: The data smoothed with a simple moving average of order 8 gives a clearer picture of the trend component, and we can see that the age of death of the English kings seems to have decreased from about 55 years old to about 38 years old during the reign of the first 20 kings, and then increased after that to about 73 years old by the end of the reign of the 40th king in the time series.

Decomposing Seasonal Data: A seasonal time series consists of a trend component, a seasonal component and an irregular component. Decomposing the time series means separating the time series into these three components: that is, estimating these three components. To estimate the trend component and seasonal component of a seasonal time series that can be described using an additive model, we can use the decompose() function in R. This function estimates the trend, seasonal, and irregular components of a time series that can be described using an additive model. The decompose() function returns a list object as its result, where the estimates of the seasonal component, trend component and irregular component are stored in named elements of that list object, called "seasonal", "trend", and "random" respectively. For example, as discussed above, the time series of the number of births per month in New York city is seasonal with a peak every summer and a trough every winter, and can probably be described using an additive model since the seasonal and random fluctuations seem to be roughly constant in size over time. To estimate the trend, seasonal and irregular components of this time series, we type: The estimated values of the seasonal, trend and irregular components are now stored in the variables birthstimeseriescomponents$seasonal, birthstimeseriescomponents$trend and birthstimeseriescomponents$random. For example, we can print out the estimated values of the seasonal component by typing: The estimated seasonal factors are given for the months January to December, and are the same for each year. The largest seasonal factor is for July (about 1.46), and the lowest is for February (about -2.08), indicating that there seems to be a peak in births in July and a trough in births in February each year. We can plot the estimated trend, seasonal, and irregular components of the time series by using the plot() function. The plot shows the original time series (top), the estimated trend component (second from top), the estimated seasonal component (third from top), and the estimated irregular component (bottom). We see that the estimated trend component shows a small decrease from about 24 in 1947 to about 22 in 1948, followed by a steady increase from then on to about 27 in 1959.

Seasonally Adjusting: If you have a seasonal time series that can be described using an additive model, you can seasonally adjust the time series by estimating the seasonal component, and subtracting the estimated seasonal component from the original time series. We can do this using the estimate of the seasonal component calculated by the decompose() function.
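A sketch of the smoothing and decomposition commands described above (variable names follow the text):

library("TTR")                                     # provides the SMA() function
kingstimeseriesSMA3 <- SMA(kingstimeseries, n = 3)
plot.ts(kingstimeseriesSMA3)
kingstimeseriesSMA8 <- SMA(kingstimeseries, n = 8) # a higher order gives a clearer trend
plot.ts(kingstimeseriesSMA8)
birthstimeseriescomponents <- decompose(birthstimeseries)
birthstimeseriescomponents$seasonal                # estimated seasonal factors, January to December
plot(birthstimeseriescomponents)                   # observed, trend, seasonal and random components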
For example, to seasonally adjust the time series of the number of births per month in New York city, we can estimate the seasonal component using decompose(), and then subtract the seasonal component from the original time series: We can then plot the seasonally adjusted time series using the plot() function, by typing: You can see that the seasonal variation has been removed from the seasonally adjusted time series. The seasonally adjusted time series now just contains the trend component and an irregular component.

Forecasts using Exponential Smoothing: Exponential smoothing can be used to make short-term forecasts for time series data.

Simple Exponential Smoothing: If you have a time series that can be described using an additive model with constant level and no seasonality, you can use simple exponential smoothing to make short-term forecasts. The simple exponential smoothing method provides a way of estimating the level at the current time point. Smoothing is controlled by the parameter alpha for the estimate of the level at the current time point. The value of alpha lies between 0 and 1. Values of alpha that are close to 0 mean that little weight is placed on the most recent observations when making forecasts of future values. For example, the file robjhyndman.com/tsdldata/hurst/precip1.dat contains total annual rainfall in inches for London, from 1813 to 1912 (original data from Hipel and McLeod, 1994). We can read the data into R and plot it by typing: You can see from the plot that there is a roughly constant level (the mean stays constant at about 25 inches). The random fluctuations in the time series seem to be roughly constant in size over time, so it is probably appropriate to describe the data using an additive model. Thus, we can make forecasts using simple exponential smoothing. To make forecasts using simple exponential smoothing in R, we can fit a simple exponential smoothing predictive model using the HoltWinters() function in R. To use HoltWinters() for simple exponential smoothing, we need to set the parameters beta=FALSE and gamma=FALSE in the HoltWinters() function (the beta and gamma parameters are used for Holt's exponential smoothing, or Holt-Winters exponential smoothing, as described below). The HoltWinters() function returns a list variable that contains several named elements. For example, to use simple exponential smoothing to make forecasts for the time series of annual rainfall in London, we type: The output of HoltWinters() tells us that the estimated value of the alpha parameter is about 0.024. This is very close to zero, telling us that the forecasts are based on both recent and less recent observations (although somewhat more weight is placed on recent observations). By default, HoltWinters() just makes forecasts for the same time period covered by our original time series. In this case, our original time series included rainfall for London from 1813 to 1912, so the forecasts are also for 1813 to 1912. In the example above, we have stored the output of the HoltWinters() function in the list variable "rainseriesforecasts". The forecasts made by HoltWinters() are stored in a named element of this list variable called "fitted", so we can get their values by typing: We can plot the original time series against the forecasts by typing: The plot shows the original time series in black, and the forecasts as a red line. The time series of forecasts is much smoother than the time series of the original data here.
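A sketch of the seasonal adjustment and simple exponential smoothing commands described above (the rainfall file URL and its single header line are reconstructed, so treat them as assumptions):

birthstimeseriesseasonallyadjusted <- birthstimeseries - birthstimeseriescomponents$seasonal
plot(birthstimeseriesseasonallyadjusted)       # seasonal variation removed

rain <- scan("http://robjhyndman.com/tsdldata/hurst/precip1.dat", skip = 1)
rainseries <- ts(rain, start = c(1813))        # annual rainfall for London, 1813-1912
plot.ts(rainseries)
rainseriesforecasts <- HoltWinters(rainseries, beta = FALSE, gamma = FALSE)
rainseriesforecasts$fitted                     # the in-sample forecasts
plot(rainseriesforecasts)                      # original series in black, forecasts in red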
As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-sample forecast errors, that is, the forecast errors for the time period covered by our original time series. The sum of squared errors is stored in a named element of the list variable "rainseriesforecasts" called "SSE", so we can get its value by typing: That is, here the sum of squared errors is 1828.855. It is common in simple exponential smoothing to use the first value in the time series as the initial value for the level. For example, in the time series for rainfall in London, the first value is 23.56 (inches) for rainfall in 1813. You can specify the initial value for the level in the HoltWinters() function by using the "l.start" parameter. For example, to make forecasts with the initial value of the level set to 23.56, we type: As explained above, by default HoltWinters() just makes forecasts for the time period covered by the original data, which is 1813 to 1912 for the rainfall time series. We can make forecasts for further time points by using the forecast.HoltWinters() function in the R "forecast" package. To use the forecast.HoltWinters() function, we first need to install the "forecast" R package (for instructions on how to install an R package, see How to install an R package). Once you have installed the "forecast" R package, you can load it by typing: When using the forecast.HoltWinters() function, as its first argument (input), you pass it the predictive model that you have already fitted using the HoltWinters() function. For example, in the case of the rainfall time series, we stored the predictive model made using HoltWinters() in the variable "rainseriesforecasts". You specify how many further time points you want to make forecasts for by using the "h" parameter in forecast.HoltWinters(). For example, to make a forecast of rainfall for the years 1913-1920 (8 more years) using forecast.HoltWinters(), we type: The forecast.HoltWinters() function gives you the forecast for a year, an 80% prediction interval for the forecast, and a 95% prediction interval for the forecast. For example, the forecasted rainfall for 1920 is about 24.68 inches, with a 95% prediction interval of (16.24, 33.11). To plot the predictions made by forecast.HoltWinters(), we can use the plot.forecast() function: Here the forecasts for 1913-1920 are plotted as a blue line, the 80% prediction interval as an orange shaded area, and the 95% prediction interval as a yellow shaded area. The 'forecast errors' are calculated as the observed values minus the predicted values, for each time point. We can only calculate the forecast errors for the time period covered by our original time series, which is 1813 to 1912 for the rainfall data. As mentioned above, one measure of the accuracy of the predictive model is the sum of squared errors (SSE) for the in-sample forecast errors. The in-sample forecast errors are stored in the named element "residuals" of the list variable returned by forecast.HoltWinters(). If the predictive model cannot be improved upon, there should be no correlations between forecast errors for successive predictions. In other words, if there are correlations between forecast errors for successive predictions, it is likely that the simple exponential smoothing forecasts could be improved upon by another forecasting technique.
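The corresponding commands, as a sketch (the text's forecast.HoltWinters() is the older name for this functionality in the forecast package; in current versions the generic forecast() does the same job):

rainseriesforecasts$SSE                                               # sum of squared in-sample errors
HoltWinters(rainseries, beta = FALSE, gamma = FALSE, l.start = 23.56)
library("forecast")
rainseriesforecasts2 <- forecast(rainseriesforecasts, h = 8)          # forecasts for 1913-1920
rainseriesforecasts2
plot(rainseriesforecasts2)                                            # what the text calls plot.forecast()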
To figure out whether this is the case, we can obtain a correlogram of the in-sample forecast errors for lags 1-20. We can calculate a correlogram of the forecast errors using the acf() function in R. To specify the maximum lag that we want to look at, we use the "lag.max" parameter in acf(). For example, to calculate a correlogram of the in-sample forecast errors for the London rainfall data for lags 1-20, we type: You can see from the sample correlogram that the autocorrelation at lag 3 is just touching the significance bounds. To test whether there is significant evidence for non-zero correlations at lags 1-20, we can carry out a Ljung-Box test. This can be done in R using the Box.test() function. The maximum lag that we want to look at is specified using the "lag" parameter in the Box.test() function. For example, to test whether there are non-zero autocorrelations at lags 1-20, for the in-sample forecast errors for the London rainfall data, we type: Here the Ljung-Box test statistic is 17.4, and the p-value is 0.6, so there is little evidence of non-zero autocorrelations in the in-sample forecast errors at lags 1-20. To be sure that the predictive model cannot be improved upon, it is also a good idea to check whether the forecast errors are normally distributed with mean zero and constant variance. To check whether the forecast errors have constant variance, we can make a time plot of the in-sample forecast errors: The plot shows that the in-sample forecast errors seem to have roughly constant variance over time, although the size of the fluctuations at the start of the time series (1820-1830) may be slightly less than that at later dates (e.g. 1840-1850). To check whether the forecast errors are normally distributed with mean zero, we can plot a histogram of the forecast errors, with an overlaid normal curve that has mean zero and the same standard deviation as the distribution of forecast errors. To do this, we can define an R function plotForecastErrors(); you will have to copy the function into R in order to use it. You can then use plotForecastErrors() to plot a histogram (with overlaid normal curve) of the forecast errors for the rainfall predictions: The plot shows that the distribution of forecast errors is roughly centred on zero, and is more or less normally distributed, although it seems to be slightly skewed to the right compared to a normal curve. However, the right skew is relatively small, and so it is plausible that the forecast errors are normally distributed with mean zero. The Ljung-Box test showed that there is little evidence of non-zero autocorrelations in the in-sample forecast errors, and the distribution of forecast errors seems to be normally distributed with mean zero. This suggests that the simple exponential smoothing method provides an adequate predictive model for London rainfall, which probably cannot be improved upon. Furthermore, the assumptions that the 80% and 95% prediction intervals were based upon (that there are no autocorrelations in the forecast errors, and that the forecast errors are normally distributed with mean zero and constant variance) are probably valid.

Holt's Exponential Smoothing: If you have a time series that can be described using an additive model with increasing or decreasing trend and no seasonality, you can use Holt's exponential smoothing to make short-term forecasts. Holt's exponential smoothing estimates the level and slope at the current time point.
Smoothing is controlled by two parameters: alpha, for the estimate of the level at the current time point, and beta, for the estimate of the slope b of the trend component at the current time point. As with simple exponential smoothing, the parameters alpha and beta have values between 0 and 1, and values that are close to 0 mean that little weight is placed on the most recent observations when making forecasts of future values. An example of a time series that can probably be described using an additive model with a trend and no seasonality is the time series of the annual diameter of women's skirts at the hem, from 1866 to 1911. The data is available in the file robjhyndman.com/tsdldata/roberts/skirts.dat (original data from Hipel and McLeod, 1994). We can read in and plot the data in R by typing: We can see from the plot that there was an increase in hem diameter from about 600 in 1866 to about 1050 in 1880, and that afterwards the hem diameter decreased to about 520 in 1911. To make forecasts, we can fit a predictive model using the HoltWinters() function in R. To use HoltWinters() for Holt's exponential smoothing, we need to set the parameter gamma=FALSE (the gamma parameter is used for Holt-Winters exponential smoothing, as described below). For example, to use Holt's exponential smoothing to fit a predictive model for skirt hem diameter, we type: The estimated value of alpha is 0.84, and of beta is 1.00. These are both high, telling us that both the estimate of the current value of the level, and of the slope b of the trend component, are based mostly upon very recent observations in the time series. This makes good intuitive sense, since the level and the slope of the time series both change quite a lot over time. The value of the sum of squared errors for the in-sample forecast errors is 16954. We can plot the original time series as a black line, with the forecasted values as a red line on top of that, by typing: We can see from the picture that the in-sample forecasts agree pretty well with the observed values, although they tend to lag behind the observed values a little bit. If you wish, you can specify the initial values of the level and the slope b of the trend component by using the "l.start" and "b.start" arguments of the HoltWinters() function. It is common to set the initial value of the level to the first value in the time series (608 for the skirts data), and the initial value of the slope to the second value minus the first value (9 for the skirts data). For example, to fit a predictive model to the skirt hem data using Holt's exponential smoothing, with initial values of 608 for the level and 9 for the slope b of the trend component, we type: As for simple exponential smoothing, we can make forecasts for future times not covered by the original time series by using the forecast.HoltWinters() function in the "forecast" package. For example, our time series data for skirt hems was for 1866 to 1911, so we can make predictions for 1912 to 1930 (19 more data points), and plot them, by typing: The forecasts are shown as a blue line, with the 80% prediction intervals as an orange shaded area, and the 95% prediction intervals as a yellow shaded area. As for simple exponential smoothing, we can check whether the predictive model could be improved upon by checking whether the in-sample forecast errors show non-zero autocorrelations at lags 1-20.
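Sketch of the skirt hem commands described above (the file URL and the number of comment lines to skip are reconstructed, so treat them as assumptions):

skirts <- scan("http://robjhyndman.com/tsdldata/roberts/skirts.dat", skip = 5)
skirtsseries <- ts(skirts, start = c(1866))
plot.ts(skirtsseries)
skirtsseriesforecasts <- HoltWinters(skirtsseries, gamma = FALSE)      # Holt's exponential smoothing
skirtsseriesforecasts                                                  # alpha about 0.84, beta about 1.00
skirtsseriesforecasts$SSE
plot(skirtsseriesforecasts)
HoltWinters(skirtsseries, gamma = FALSE, l.start = 608, b.start = 9)   # optional initial level and slope
library("forecast")
skirtsseriesforecasts2 <- forecast(skirtsseriesforecasts, h = 19)      # predictions for 1912-1930
plot(skirtsseriesforecasts2)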
For example, for the skirt hem data, we can make a correlogram, and carry out the Ljung-Box test, by typing: Here the correlogram shows that the sample autocorrelation for the in-sample forecast errors at lag 5 exceeds the significance bounds. However, we would expect one in 20 of the autocorrelations for the first twenty lags to exceed the 95% significance bounds by chance alone. Indeed, when we carry out the Ljung-Box test, the p-value is 0.47, indicating that there is little evidence of non-zero autocorrelations in the in-sample forecast errors at lags 1-20. As for simple exponential smoothing, we should also check that the forecast errors have constant variance over time, and are normally distributed with mean zero. We can do this by making a time plot of the forecast errors, and a histogram of the distribution of forecast errors with an overlaid normal curve: The time plot of forecast errors shows that the forecast errors have roughly constant variance over time. The histogram of forecast errors shows that it is plausible that the forecast errors are normally distributed with mean zero and constant variance. Thus, the Ljung-Box test shows that there is little evidence of autocorrelations in the forecast errors, while the time plot and histogram of forecast errors show that it is plausible that the forecast errors are normally distributed with mean zero and constant variance. Therefore, we can conclude that Holt's exponential smoothing provides an adequate predictive model for skirt hem diameters, which probably cannot be improved upon. In addition, it means that the assumptions that the 80% and 95% prediction intervals were based upon are probably valid.

Holt-Winters Exponential Smoothing: If you have a time series that can be described using an additive model with increasing or decreasing trend and seasonality, you can use Holt-Winters exponential smoothing to make short-term forecasts. Holt-Winters exponential smoothing estimates the level, slope and seasonal component at the current time point. Smoothing is controlled by three parameters: alpha, beta, and gamma, for the estimates of the level, the slope b of the trend component, and the seasonal component, respectively, at the current time point. The parameters alpha, beta and gamma all have values between 0 and 1, and values that are close to 0 mean that relatively little weight is placed on the most recent observations when making forecasts of future values. An example of a time series that can probably be described using an additive model with a trend and seasonality is the time series of the log of monthly sales for the souvenir shop at a beach resort town in Queensland, Australia (discussed above). To make forecasts, we can fit a predictive model using the HoltWinters() function. For example, to fit a predictive model for the log of the monthly sales in the souvenir shop, we type: The estimated values of alpha, beta and gamma are 0.41, 0.00, and 0.96, respectively. The value of alpha (0.41) is relatively low, indicating that the estimate of the level at the current time point is based upon both recent observations and some observations in the more distant past. The value of beta is 0.00, indicating that the estimate of the slope b of the trend component is not updated over the time series, and instead is set equal to its initial value. This makes good intuitive sense, as the level changes quite a bit over the time series, but the slope b of the trend component remains roughly the same.
In contrast, the value of gamma (0.96) is high, indicating that the estimate of the seasonal component at the current time point is just based upon very recent observations. As for simple exponential smoothing and Holt's exponential smoothing, we can plot the original time series as a black line, with the forecasted values as a red line on top of that: We see from the plot that the Holt-Winters exponential method is very successful in predicting the seasonal peaks, which occur roughly in November every year. To make forecasts for future times not included in the original time series, we use the forecast.HoltWinters() function in the "forecast" package. For example, the original data for the souvenir sales is from January 1987 to December 1993. If we wanted to make forecasts for January 1994 to December 1998 (48 more months), and plot the forecasts, we would type: The forecasts are shown as a blue line, and the orange and yellow shaded areas show the 80% and 95% prediction intervals, respectively. We can investigate whether the predictive model can be improved upon by checking whether the in-sample forecast errors show non-zero autocorrelations at lags 1-20, by making a correlogram and carrying out the Ljung-Box test: The correlogram shows that the autocorrelations for the in-sample forecast errors do not exceed the significance bounds for lags 1-20. Furthermore, the p-value for the Ljung-Box test is 0.6, indicating that there is little evidence of non-zero autocorrelations at lags 1-20. We can check whether the forecast errors have constant variance over time, and are normally distributed with mean zero, by making a time plot of the forecast errors and a histogram (with overlaid normal curve): From the time plot, it appears plausible that the forecast errors have constant variance over time. From the histogram of forecast errors, it seems plausible that the forecast errors are normally distributed with mean zero. Thus, there is little evidence of autocorrelation at lags 1-20 for the forecast errors, and the forecast errors appear to be normally distributed with mean zero and constant variance over time. This suggests that Holt-Winters exponential smoothing provides an adequate predictive model of the log of sales at the souvenir shop, which probably cannot be improved upon. Furthermore, the assumptions upon which the prediction intervals were based are probably valid.

ARIMA Models: Exponential smoothing methods are useful for making forecasts, and make no assumptions about the correlations between successive values of the time series. However, if you want to make prediction intervals for forecasts made using exponential smoothing methods, the prediction intervals require that the forecast errors are uncorrelated and are normally distributed with mean zero and constant variance. While exponential smoothing methods do not make any assumptions about correlations between successive values of the time series, in some cases you can make a better predictive model by taking correlations in the data into account. Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the irregular component of a time series, one that allows for non-zero autocorrelations in the irregular component.

Differencing a Time Series: ARIMA models are defined for stationary time series. Therefore, if you start off with a non-stationary time series, you will first need to 'difference' the time series until you obtain a stationary time series.
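Before looking at differencing in detail, here is a sketch collecting the Holt-Winters commands for the souvenir shop example described above (variable names are illustrative; forecast() stands in for the text's forecast.HoltWinters()):

logsouvenirtimeseries <- log(souvenirtimeseries)
souvenirtimeseriesforecasts <- HoltWinters(logsouvenirtimeseries)
souvenirtimeseriesforecasts                # alpha about 0.41, beta about 0.00, gamma about 0.96
plot(souvenirtimeseriesforecasts)
library("forecast")
souvenirtimeseriesforecasts2 <- forecast(souvenirtimeseriesforecasts, h = 48)  # January 1994 to December 1998
plot(souvenirtimeseriesforecasts2)
acf(souvenirtimeseriesforecasts2$residuals, lag.max = 20)
Box.test(souvenirtimeseriesforecasts2$residuals, lag = 20, type = "Ljung-Box")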
If you have to difference the time series d times to obtain a stationary series, then you have an ARIMA(p,d,q) model, where d is the order of differencing used. You can difference a time series using the diff() function in R. For example, the time series of the annual diameter of women's skirts at the hem, from 1866 to 1911, is not stationary in mean, as the level changes a lot over time. We can difference the time series (which we stored in "skirtsseries", see above) once, and plot the differenced series, by typing: The resulting time series of first differences does not appear to be stationary in mean. Therefore, we can difference the time series twice, to see if that gives us a stationary time series: (Formal tests for stationarity, called "unit root tests", are available in the fUnitRoots package, available on CRAN, but will not be discussed here.) The time series of second differences does appear to be stationary in mean and variance, as the level of the series stays roughly constant over time, and the variance of the series appears roughly constant over time. Thus, it appears that we need to difference the time series of the diameter of skirts twice in order to achieve a stationary series. If you need to difference your original time series data d times in order to obtain a stationary time series, this means that you can use an ARIMA(p,d,q) model for your time series, where d is the order of differencing used. For example, for the time series of the diameter of women's skirts, we had to difference the time series twice, and so the order of differencing (d) is 2. This means that you can use an ARIMA(p,2,q) model for your time series. The next step is to figure out the values of p and q for the ARIMA model. Another example is the time series of the age of death of the successive kings of England (see above). From the time plot, we can see that the time series is not stationary in mean. To calculate the time series of first differences, and plot it, we type: The time series of first differences appears to be stationary in mean and variance, and so an ARIMA(p,1,q) model is probably appropriate for the time series of the age of death of the kings of England. By taking the time series of first differences, we have removed the trend component of the time series of the ages at death of the kings, and are left with an irregular component. We can now examine whether there are correlations between successive terms of this irregular component; if so, this could help us to make a predictive model for the ages at death of the kings.

Selecting a Candidate ARIMA Model: If your time series is stationary, or if you have transformed it to a stationary time series by differencing d times, the next step is to select the appropriate ARIMA model, which means finding the most appropriate values of p and q for an ARIMA(p,d,q) model. To do this, you usually need to examine the correlogram and partial correlogram of the stationary time series. To plot a correlogram and partial correlogram, we can use the acf() and pacf() functions in R, respectively. To get the actual values of the autocorrelations and partial autocorrelations, we set plot=FALSE in the acf() and pacf() functions.
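Before examining correlograms, here is a sketch of the differencing commands used above:

skirtsseriesdiff1 <- diff(skirtsseries, differences = 1)
plot.ts(skirtsseriesdiff1)        # first differences: still not stationary in mean
skirtsseriesdiff2 <- diff(skirtsseries, differences = 2)
plot.ts(skirtsseriesdiff2)        # second differences: roughly constant level and variance
kingtimeseriesdiff1 <- diff(kingstimeseries, differences = 1)
plot.ts(kingtimeseriesdiff1)      # first differences of the ages at death of the kings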
Example of the Ages at Death of the Kings of England: For example, to plot the correlogram for lags 1-20 of the once differenced time series of the ages at death of the kings of England, and to get the values of the autocorrelations, we type: We see from the correlogram that the autocorrelation at lag 1 (-0.360) exceeds the significance bounds, but all other autocorrelations between lags 1-20 do not exceed the significance bounds. To plot the partial correlogram for lags 1-20 for the once differenced time series of the ages at death of the English kings, and to get the values of the partial autocorrelations, we use the pacf() function, by typing: The partial correlogram shows that the partial autocorrelations at lags 1, 2 and 3 exceed the significance bounds, are negative, and are slowly decreasing in magnitude with increasing lag (lag 1: -0.360, lag 2: -0.335, lag 3: -0.321). The partial autocorrelations tail off to zero after lag 3. Since the correlogram is zero after lag 1, and the partial correlogram tails off to zero after lag 3, the following ARMA (autoregressive moving average) models are possible for the time series of first differences:
- an ARMA(3,0) model, that is, an autoregressive model of order p=3, since the partial autocorrelogram is zero after lag 3, and the autocorrelogram tails off to zero (although perhaps too abruptly for this model to be appropriate);
- an ARMA(0,1) model, that is, a moving average model of order q=1, since the autocorrelogram is zero after lag 1 and the partial autocorrelogram tails off to zero;
- an ARMA(p,q) model, that is, a mixed model with p and q greater than 0, since the autocorrelogram and partial correlogram tail off to zero (although the correlogram probably tails off to zero too abruptly for this model to be appropriate).
We use the principle of parsimony to decide which model is best: that is, we assume that the model with the fewest parameters is best. The ARMA(3,0) model has 3 parameters, the ARMA(0,1) model has 1 parameter, and the ARMA(p,q) model has at least 2 parameters. Therefore, the ARMA(0,1) model is taken as the best model. An ARMA(0,1) model is a moving average model of order 1, or MA(1) model. This model can be written as Xt - mu = Zt - (theta * Zt-1), where Xt is the stationary time series we are studying (the first differenced series of ages at death of English kings), mu is the mean of the time series Xt, Zt is white noise with mean zero and constant variance, and theta is a parameter that can be estimated. An MA (moving average) model is usually used to model a time series that shows short-term dependencies between successive observations. Intuitively, it makes good sense that an MA model can be used to describe the irregular component in the time series of ages at death of English kings, as we might expect the age at death of a particular English king to have some effect on the ages at death of the next king or two, but not much effect on the ages at death of kings who reign much later. Shortcut: the auto.arima() function. The auto.arima() function can be used to find an appropriate ARIMA model, e.g. type library(forecast), then auto.arima(kings). The output says an appropriate model is ARIMA(0,1,1).
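Sketch of the commands for this example:

acf(kingtimeseriesdiff1, lag.max = 20)                # correlogram of the first differences
acf(kingtimeseriesdiff1, lag.max = 20, plot = FALSE)  # the autocorrelation values
pacf(kingtimeseriesdiff1, lag.max = 20)               # partial correlogram
pacf(kingtimeseriesdiff1, lag.max = 20, plot = FALSE) # the partial autocorrelation values
library("forecast")
auto.arima(kings)                                     # suggests ARIMA(0,1,1), as noted above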
Since an ARMA(0,1) model (with p=0, q=1) is taken to be the best candidate model for the time series of first differences of the ages at death of English kings, the original time series of the ages of death can be modelled using an ARIMA(0,1,1) model (with p=0, d=1, q=1, where d is the order of differencing required).

Example of the Volcanic Dust Veil in the Northern Hemisphere: Let's take another example of selecting an appropriate ARIMA model. The file robjhyndman.com/tsdldata/annual/dvi.dat contains data on the volcanic dust veil index in the northern hemisphere, from 1500 to 1969 (original data from Hipel and McLeod, 1994). This is a measure of the impact of volcanic eruptions' release of dust and aerosols into the environment. We can read it into R and make a time plot by typing: From the time plot, it appears that the random fluctuations in the time series are roughly constant in size over time, so an additive model is probably appropriate for describing this time series. Furthermore, the time series appears to be stationary in mean and variance, as its level and variance appear to be roughly constant over time. Therefore, we do not need to difference this series in order to fit an ARIMA model, but can fit an ARIMA model to the original series (the order of differencing required, d, is zero here). We can now plot a correlogram and partial correlogram for lags 1-20 to investigate what ARIMA model to use: We see from the correlogram that the autocorrelations for lags 1, 2 and 3 exceed the significance bounds, and that the autocorrelations tail off to zero after lag 3. The autocorrelations for lags 1, 2 and 3 are positive, and decrease in magnitude with increasing lag (lag 1: 0.666, lag 2: 0.374, lag 3: 0.162). The autocorrelations for lags 19 and 20 exceed the significance bounds too, but it is likely that this is due to chance, since they only just exceed the significance bounds (especially for lag 19), the autocorrelations for lags 4-18 do not exceed the significance bounds, and we would expect 1 in 20 lags to exceed the 95% significance bounds by chance alone. From the partial autocorrelogram, we see that the partial autocorrelation at lag 1 is positive and exceeds the significance bounds (0.666), while the partial autocorrelation at lag 2 is negative and also exceeds the significance bounds (-0.126). The partial autocorrelations tail off to zero after lag 2. Since the correlogram tails off to zero after lag 3, and the partial correlogram is zero after lag 2, the following ARMA models are possible for the time series:
- an ARMA(2,0) model, since the partial autocorrelogram is zero after lag 2, and the correlogram tails off to zero after lag 3;
- an ARMA(0,3) model, since the autocorrelogram is zero after lag 3, and the partial correlogram tails off to zero (although perhaps too abruptly for this model to be appropriate);
- an ARMA(p,q) mixed model, since the correlogram and partial correlogram tail off to zero (although the partial correlogram perhaps tails off too abruptly for this model to be appropriate).
Shortcut: the auto.arima() function. Again, we can use auto.arima() to find an appropriate model, by typing auto.arima(volcanodust), which gives us ARIMA(1,0,2), which has 3 parameters. However, different criteria can be used to select a model (see the auto.arima() help page). If we use the "bic" criterion, which penalises the number of parameters, we get ARIMA(2,0,0), which is ARMA(2,0): auto.arima(volcanodust, ic="bic").
The ARMA(2,0) model has 2 parameters, the ARMA(0,3) model has 3 parameters, and the ARMA(p,q) model has at least 2 parameters. Therefore, using the principle of parsimony, the ARMA(2,0) model and the ARMA(p,q) model are equally good candidate models.

An ARMA(2,0) model is an autoregressive model of order 2, or AR(2) model. This model can be written as: Xt - mu = (Beta1 * (Xt-1 - mu)) + (Beta2 * (Xt-2 - mu)) + Zt, where Xt is the stationary time series we are studying (the time series of the volcanic dust veil index), mu is the mean of the time series Xt, Beta1 and Beta2 are parameters to be estimated, and Zt is white noise with mean zero and constant variance.

An AR (autoregressive) model is usually used to model a time series which shows longer-term dependencies between successive observations. Intuitively, it makes sense that an AR model could be used to describe the time series of the volcanic dust veil index, as we would expect volcanic dust and aerosol levels in one year to affect those in much later years, since the dust and aerosols are unlikely to disappear quickly.

If an ARMA(2,0) model (with p=2, q=0) is used to model the time series of the volcanic dust veil index, it would mean that an ARIMA(2,0,0) model can be used (with p=2, d=0, q=0, where d is the order of differencing required). Similarly, if an ARMA(p,q) mixed model is used, where p and q are both greater than zero, then an ARIMA(p,0,q) model can be used.

Forecasting Using an ARIMA Model

Once you have selected the best candidate ARIMA(p,d,q) model for your time series data, you can estimate the parameters of that ARIMA model, and use it as a predictive model for making forecasts for future values of your time series. You can estimate the parameters of an ARIMA(p,d,q) model using the "arima()" function in R.

Example of the Ages at Death of the Kings of England

For example, we discussed above that an ARIMA(0,1,1) model seems a plausible model for the ages at death of the kings of England. You can specify the values of p, d and q in the ARIMA model by using the "order" argument of the "arima()" function in R. To fit an ARIMA(p,d,q) model to this time series (which we stored in the variable "kingstimeseries", see above), we type:

As mentioned above, if we are fitting an ARIMA(0,1,1) model to our time series, it means we are fitting an ARMA(0,1) model to the time series of first differences. An ARMA(0,1) model can be written as Xt - mu = Zt - (theta * Zt-1), where theta is a parameter to be estimated. From the output of the "arima()" R function (above), the estimated value of theta (given as 'ma1' in the R output) is -0.7218 in the case of the ARIMA(0,1,1) model fitted to the time series of ages at death of kings.

We can then use the ARIMA model to make forecasts for future values of the time series, using the "forecast.Arima()" function in the "forecast" R package.

Specifying the confidence level for prediction intervals

You can specify the confidence level for prediction intervals in forecast.Arima() by using the "level" argument. For example, to get a 99.5% prediction interval, we would type "forecast.Arima(kingstimeseriesarima, h=5, level=c(99.5))".

For example, to forecast the ages at death of the next five English kings, we type:
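Again, the original commands were stripped from this copy, so here is a minimal sketch of fitting the model and making the forecasts. The object name "kingstimeseriesarima" follows the text; "kingstimeseriesforecasts" is an assumed name. Note that forecast.Arima() is the interface of older versions of the forecast package; in recent versions the equivalent call is simply forecast().

kingstimeseriesarima <- arima(kingstimeseries, order=c(0,1,1))   # fit an ARIMA(0,1,1) model
kingstimeseriesarima                                             # print the fit; 'ma1' is the estimate of theta
library(forecast)
kingstimeseriesforecasts <- forecast.Arima(kingstimeseriesarima, h=5)  # forecast the next 5 kings
kingstimeseriesforecasts                                         # point forecasts with 80% and 95% intervals
plot(kingstimeseriesforecasts)                                   # plot the observed series and the forecasts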
The original time series for the English kings includes the ages at death of 42 English kings. The forecast.Arima() function gives us a forecast of the age at death of the next five English kings (kings 43-47), as well as 80% and 95% prediction intervals for those predictions. The age at death of the 42nd English king was 56 years (the last observed value in our time series), and the ARIMA model gives the forecasted age at death of the next five kings as 67.8 years.

We can plot the observed ages at death for the first 42 kings, as well as the ages that would be predicted for these 42 kings and for the next 5 kings using our ARIMA(0,1,1) model, by typing:

As in the case of exponential smoothing models, it is a good idea to investigate whether the forecast errors of an ARIMA model are normally distributed with mean zero and constant variance, and whether there are correlations between successive forecast errors. For example, we can make a correlogram of the forecast errors for our ARIMA(0,1,1) model for the ages at death of kings, and perform the Ljung-Box test for lags 1-20, by typing:

Since the correlogram shows that none of the sample autocorrelations for lags 1-20 exceed the significance bounds, and the p-value for the Ljung-Box test is 0.9, we can conclude that there is very little evidence for non-zero autocorrelations in the forecast errors at lags 1-20.

To investigate whether the forecast errors are normally distributed with mean zero and constant variance, we can make a time plot and a histogram (with overlaid normal curve) of the forecast errors:

The time plot of the in-sample forecast errors shows that the variance of the forecast errors seems to be roughly constant over time (though perhaps there is slightly higher variance for the second half of the time series). The histogram of the forecast errors shows that they are roughly normally distributed and that the mean seems to be close to zero. Therefore, it is plausible that the forecast errors are normally distributed with mean zero and constant variance.

Since successive forecast errors do not seem to be correlated, and the forecast errors seem to be normally distributed with mean zero and constant variance, the ARIMA(0,1,1) model does seem to provide an adequate predictive model for the ages at death of English kings.

Example of the Volcanic Dust Veil in the Northern Hemisphere

We discussed above that an appropriate ARIMA model for the time series of the volcanic dust veil index may be an ARIMA(2,0,0) model. To fit an ARIMA(2,0,0) model to this time series, we can type:

As mentioned above, an ARIMA(2,0,0) model can be written as: Xt - mu = (Beta1 * (Xt-1 - mu)) + (Beta2 * (Xt-2 - mu)) + Zt, where Beta1 and Beta2 are parameters to be estimated. The output of the arima() function tells us that Beta1 and Beta2 are estimated as 0.7533 and -0.1268 here (given as ar1 and ar2 in the output of arima()).

Now that we have fitted the ARIMA(2,0,0) model, we can use the forecast.Arima() function to predict future values of the volcanic dust veil index. The original data includes the years 1500-1969. To make predictions for the years 1970-2000 (31 more years), we type:

We can plot the original time series and the forecasted values by typing:

One worrying thing is that the model has predicted negative values for the volcanic dust veil index, but this variable can only take positive values! The reason is that the arima() and forecast.Arima() functions don't know that the variable can only take positive values. Clearly, this is not a very desirable feature of our current predictive model.
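The commands for fitting and forecasting the volcanic dust series were likewise stripped, so here is a minimal sketch of the steps described in this example. The object names "volcanodustarima" and "volcanodustforecasts" are assumptions; as above, forecast.Arima() is the older forecast package interface (forecast() in recent versions).

volcanodustarima <- arima(volcanodust, order=c(2,0,0))          # fit an ARIMA(2,0,0) model; ar1 and ar2 estimate Beta1 and Beta2
volcanodustarima
library(forecast)
volcanodustforecasts <- forecast.Arima(volcanodustarima, h=31)  # forecast the years 1970-2000
plot(volcanodustforecasts)                                      # plot the observed series and the forecasts (note the negative values)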
Again, we should investigate whether the forecast errors seem to be correlated, and whether they are normally distributed with mean zero and constant variance. To check for correlations between successive forecast errors, we can make a correlogram and use the Ljung-Box test:

The correlogram shows that the sample autocorrelation at lag 20 exceeds the significance bounds. However, this is probably due to chance, since we would expect one out of 20 sample autocorrelations to exceed the 95% significance bounds. Furthermore, the p-value for the Ljung-Box test is 0.2, indicating that there is little evidence for non-zero autocorrelations in the forecast errors for lags 1-20.

To check whether the forecast errors are normally distributed with mean zero and constant variance, we make a time plot of the forecast errors, and a histogram:

The time plot of the forecast errors shows that they seem to have roughly constant variance over time. However, the time series of forecast errors seems to have a negative mean, rather than a zero mean. We can confirm this by calculating the mean forecast error, which turns out to be about -0.22.

The histogram of the forecast errors (above) shows that although the mean value of the forecast errors is negative, the distribution of the forecast errors is skewed to the right compared to a normal curve. Therefore, it seems that we cannot comfortably conclude that the forecast errors are normally distributed with mean zero and constant variance. Thus, it is likely that our ARIMA(2,0,0) model for the time series of the volcanic dust veil index is not the best model that we could make, and could almost certainly be improved upon.

Links and Further Reading

Here are some links for further reading. For a more in-depth introduction to R, a good online tutorial is available on the "Kickstarting R" website, cran.r-project.org/doc/contrib/Lemon-kickstart. There is another nice (slightly more in-depth) tutorial to R available on the "Introduction to R" website, cran.r-project.org/doc/manuals/R-intro.html. You can find a list of R packages for analysing time series data on the CRAN Time Series Task View webpage.

To learn about time series analysis, I would highly recommend the book "Time series" (product code M249/02) by the Open University, available from the Open University Shop. There are two books available in the "Use R" series on using R for time series analyses: the first is Introductory Time Series with R by Cowpertwait and Metcalfe, and the second is Analysis of Integrated and Cointegrated Time Series with R by Pfaff.

Acknowledgements

I am grateful to Professor Rob Hyndman for kindly allowing me to use the time series data sets from his Time Series Data Library (TSDL) in the examples in this booklet. Many of the examples in this booklet are inspired by examples in the excellent Open University book, "Time series" (product code M249/02), available from the Open University Shop.

Thank you to Ravi Aranke for bringing auto.arima() to my attention, to Maurice Omane-Adjepong for bringing unit root tests to my attention, and to Christian Seubert for noticing a small bug in plotForecastErrors(). Thank you also to Antoine Binard and Bill Johnston for other comments.

I will be grateful if you will send me (Avril Coghlan) corrections or suggestions for improvements to my email address alc 64 sanger 46 ac 46 uk.
