NMST539 | Lab Test No.2

LS 2017 | Tuesday 23/05/2017

Data description

The data file contains the results of a geography questionnaire administered in primary schools in the Czech Republic and Slovakia in 2015. The questionnaire was designed to test the students’ geographical knowledge and their ability to read geographical maps. More specifically, the test compares the skills of students from three different grades (nominally 11-year-old students, denoted as grade11; 15-year-old students, denoted as grade15; and 18-year-old students, denoted as grade18).

Each student was asked to answer 21 geographical questions. Each question was a multiple-choice question with a set of possible answers provided in the test, and the student was supposed to select the right answer(s) (more than one answer could be correct). The questions were carefully designed as follows. The first seven questions should be easily answered by students of all grades (in general, only the knowledge of 11-year-old students was required). The next set of questions, questions 8–14, were more difficult and were assumed to be correctly answered by the 15-year-old (grade15) students, who should also be able to answer the easier set correctly. Finally, the last set of questions, questions 15–21, was meant to test the skills of the oldest students, the 18-year-old (grade18) students. However, every student had to answer all 21 questions: the grade11 students were only assumed to have enough knowledge to correctly answer questions 1–7, the grade15 students questions 1–14, and the grade18 students should be educated enough to answer all 21 questions properly.

The dataset provides only partial information on each student’s performance: the overall percentage gain from the first set of questions (covariate tot1), from the second set of questions (covariate tot2), and from the last set of questions (covariate tot3). The covariate total denotes the overall percentage gain, calculated as (tot1 + tot2 + tot3)/3.

The dataset contains 2048 observations (individual students) and 14 different covariates. A detailed description of all covariates is provided below.

  • country - two-level factor indicating the country (‘cz’ or ‘sk’);
  • gender - student’s gender (‘male’ or ‘female’);
  • age - student’s age (given in years);
  • class - student’s grade (‘grade11’ for 11-year-old students, ‘grade15’ for 15-year-old students, and ‘grade18’ for 18-year-old students). Ideally the grade reflects the student’s real age, but this is not always the case (for instance, some students had to repeat a grade and some were moved ahead to proceed faster);
  • Nsibs - the number of siblings;
  • sibOrder - the order among siblings (1 for the oldest; since Nsibs does not count the student, the youngest has the value Nsibs + 1);
  • studTime - the number of hours per week spent studying geography (the student’s own estimate);
  • Ftravel - traveling frequency (0 = student does not travel, 5 = student travels a lot);
  • grade - the overall geography mark on the last certificate (1 = best, 5 = worst);
  • popularity - popularity of the geography class (1 = I like it a lot, 3 = I do not like it);
  • total - the overall percentage gain from the geography test (all 21 questions together);
  • tot1 - percentage gain from the first set of questions (questions 1–7);
  • tot2 - percentage gain from the second set of questions (questions 8–14);
  • tot3 - percentage gain from the last set of questions (questions 15–21).

The corresponding data file can be loaded into the R working environment by running the following commands:

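# Start from a clean workspace (removes all objects from the global environment)
rm(list = ls())
# Read the data directly from the course web page
data <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/geoDataExt.csv", header = TRUE)
# Make the columns accessible by name (e.g. tot1 instead of data$tot1)
attach(data)
head(data)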
##   country gender age   class Nsibs sibOrder studTime Ftravel grade
## 1      sk female  12 grade11     2        1       11       1     1
## 2      sk   male  12 grade11     3        3        2       1     3
## 3      sk   male  12 grade11     1        1        2       1     2
## 4      sk   male  12 grade11     1        2        3       0     3
## 5      sk   male  12 grade11     0        1        3       0     2
## 6      sk female  12 grade11     0        1        4       2     1
##   popularity total   tot1   tot2  tot3
## 1          1 84.52 100.00 100.00 53.57
## 2          3 34.52  46.43  42.86 14.29
## 3          1 58.16  75.00  45.92 53.57
## 4          3 29.37  23.81  50.00 14.29
## 5          2 37.30  45.24  59.52  7.14
## 6          1 53.57  64.29  82.14 14.29
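The relation total = (tot1 + tot2 + tot3)/3 stated in the data description can be checked against the six rows printed above. The following is only a small sanity check on those six rows; the object names first6, recomputed and max_diff are illustrative and not part of the assignment. Since the printed values are rounded to two decimals, a discrepancy of up to roughly 0.01 is to be expected.

```r
# The score columns of the first six rows, copied from the head(data) output
first6 <- data.frame(
  total = c(84.52, 34.52, 58.16, 29.37, 37.30, 53.57),
  tot1  = c(100.00, 46.43, 75.00, 23.81, 45.24, 64.29),
  tot2  = c(100.00, 42.86, 45.92, 50.00, 59.52, 82.14),
  tot3  = c(53.57, 14.29, 53.57, 14.29, 7.14, 14.29)
)

# Recompute the overall gain as the mean of the three partial gains
recomputed <- with(first6, (tot1 + tot2 + tot3) / 3)

# Largest absolute deviation from the reported total (rounding error only)
max_diff <- max(abs(recomputed - first6$total))
max_diff < 0.01
```

The same comparison can be run on the full data frame once it is loaded, by replacing first6 with data.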




Questions and tasks for individual work


  • Consider the given dataset and use graphical tools available in R to find interesting structures hidden in the data. Interpret and carefully explain the figures you provide. Try to produce some “interesting” outputs.
  • The dataset can formally be regarded as two datasets: the first contains the student-specific covariates (columns 1–10), while the second (columns 11–14) describes each student’s performance in the geography test.
    Consider these two datasets and apply the canonical correlation approach. Interpret the results and provide some discussion.
  • Use cluster analysis and try to find some well-interpretable clusters. Are there natural covariates in the data that could be used to classify students into the resulting clusters (e.g. by some linear discrimination rule)?
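As a rough orientation for the canonical correlation and clustering tasks, the mechanics might look like the sketch below. It is not a model solution. To keep it self-contained it runs on a randomly generated stand-in (the data frame toy, with column names borrowed from the real file); all values in toy are synthetic assumptions and carry no information about the real data. The choices of covariates, of k-means with three clusters, and of leaving total out of the performance block (it is the mean of tot1, tot2 and tot3) are likewise assumptions made for illustration.

```r
# ASSUMPTION: synthetic stand-in data, generated only so the sketch runs.
# Replace `toy` with the data frame loaded from geoDataExt.csv.
set.seed(539)
n    <- 300
cls  <- factor(sample(c("grade11", "grade15", "grade18"), n, replace = TRUE))
lvl  <- as.integer(cls)                       # 1, 2, 3
clip <- function(x) pmin(100, pmax(0, x))     # keep scores within 0-100
toy <- data.frame(
  class    = cls,
  gender   = factor(sample(c("male", "female"), n, replace = TRUE)),
  studTime = sample(0:12, n, replace = TRUE),
  Ftravel  = sample(0:5, n, replace = TRUE),
  tot1     = clip(rnorm(n, 55 + 10 * lvl, 15)),
  tot2     = clip(rnorm(n, 35 + 15 * lvl, 15)),
  tot3     = clip(rnorm(n, 15 + 15 * lvl, 15))
)

# Student-specific block: factors are expanded into dummy columns,
# the intercept column is dropped.
X <- model.matrix(~ class + gender + studTime + Ftravel, data = toy)[, -1]

# Performance block (without `total`, to avoid a nearly collinear column)
Y <- as.matrix(toy[, c("tot1", "tot2", "tot3")])

# Canonical correlation analysis with stats::cancor
cc <- cancor(X, Y)
cc$cor                                        # canonical correlations

# Scores of the first pair of canonical variates
U1 <- scale(X, center = cc$xcenter, scale = FALSE) %*% cc$xcoef[, 1]
V1 <- scale(Y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef[, 1]
cor(U1, V1)                                   # should match cc$cor[1]

# k-means on the standardised performance block
km <- kmeans(scale(Y), centers = 3, nstart = 25)

# How do the clusters relate to a natural covariate such as the grade?
tab <- table(cluster = km$cluster, class = toy$class)
tab
aggregate(as.data.frame(Y), by = list(cluster = km$cluster), FUN = mean)
```

The cross-tabulation only gives a first impression of whether a covariate separates the clusters. For an actual linear discrimination rule, one option would be lda() from the MASS package, with the cluster label as the response.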

End of the individual work




Technical details and final instructions

  • Each student is supposed to work individually, and the work is evaluated for each student separately.
  • It is necessary to submit either a PDF file or an HTML file (both can be created in RStudio using Sweave or the knitr package). Do not include the R source code in your report; instead, state the important results and clearly discuss your main findings. Interpret the results and provide graphical outputs for a better visualization of the results.
  • If you choose the first option (PDF submission), you need to create an .Rnw file (the same approach you used for the homework assignments, for instance), which you can then compile in RStudio to create the final PDF file. In the second case (HTML submission), you need to create an .Rmd file, as we did for the lab sessions, which you can again compile in RStudio to create the HTML file.
  • If you decide to create the HTML file, you can directly use the source file of this lab test assignment (see the download link in the header of this page).

  • Rename the submission file as surname_givenName.pdf or surname_givenName.html.
  • The final PDF (or HTML) file should be sent via email to one of the following addresses:
    • maciak AT ualberta.ca
    • maciak AT karlin.mff.cuni.cz

    not later than three minutes after the lab session ends (not later than 12:13).