Computational Environment for Statistical Data Analysis (NMST440)

Arnošt Komárek



Summer semester 2017–18

SIS pages of the course:    ENG    CZE


Tutorial: Thursday 17:20 in K11   


  • There will be no tutorial on 15/03/2018. Students are advised to work on the assignments given by that time.


Tutorial 1 (22/02/2018)

Topic: HTML and bibliographic information sources on the Internet.

    HTML tags:     Page at w3schools     Page at htmldog
    CSS Templates:     CSS Templates For Free     Andreas Viklund     Example from CSS Templates For Free
    Classification systems:     MSC     JEL
    Bibliographic databases:     Web of Science (WOS)     Scopus     MathSciNet     ZentralBlatt MATH     Google Scholar
    DOI number:     DOI at Wiki
    Articles databases:     JSTOR     JSTOR (Statistics)     Wiley Online Library     ScienceDirect     SpringerLink
    htpasswd:     .htaccess Example     .htaccess Example 2     On-line htpasswd generator

  1. Create your homepage on the Artax server and then send a link to this page to the lecturer via e-mail.
  2. Add information about your Bachelor or Master thesis to this webpage, including its MSC and/or JEL classification and keywords in both Czech/Slovak (if this is your native language) and English. Further, provide three references from your thesis with the following information: DOI number (as an active link), number of citations according to WOS and Scopus, and whether the full text of the publication is available from IP addresses of MFF UK. If it is, include a link to the full text.

Tutorial 2 (01/03/2018)

Topic: Data management in R.

Data.R  Data.xls  Data.csv  


The data in the LibreOffice sheet Consum.ods contain information on the spending of participants in a certain scientific event during their stay at the conference site. Personal information comprises gender (m/f), category (professor (prof), associate professor (doc), assistant (asist), researcher (res), Ph.D. student (phd), guest (host)), and institution. Additionally, the length of a talk (if given) is included. The remaining columns provide the numbers of consumed drinks of different types, total spending, and spending on other services (column other). Missing values are indicated by an empty cell or the string "na".

  1. Prepare the data for a statistical analysis whose main aim would be to explore relationships between personal and consumption variables, or among the consumption variables themselves. Use your subject-matter knowledge to clean the data, especially the information on institution.
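
The cleaning steps could be sketched as follows. This is only an illustration under stated assumptions: the toy data and the column names (gender, category, institution, beer) are hypothetical stand-ins for a CSV export of Consum.ods, and the rule used to merge institution names is an example, not a prescribed solution.

```r
# Toy stand-in for a CSV export of the sheet; the column names and values
# below are hypothetical, not the actual content of Consum.ods.
csv <- 'gender,category,institution,beer
m,prof,MFF UK,3
f,phd,mff uk ,na
m,doc,"MFF UK, Praha",
f,res,KU Leuven,1'

# Treat both empty cells and the string "na" as missing values
consum <- read.csv(text = csv, na.strings = c("", "na"),
                   strip.white = TRUE, stringsAsFactors = FALSE)

# Categorical variables as factors with explicitly ordered levels
consum$gender   <- factor(consum$gender, levels = c("m", "f"))
consum$category <- factor(consum$category,
                          levels = c("prof", "doc", "asist", "res", "phd", "host"))

# Harmonize free-text institution names: normalize case and whitespace,
# then map known variants of the same institution to one label
inst <- toupper(trimws(consum$institution))
inst[grepl("^MFF UK", inst)] <- "MFF UK"
consum$institution <- factor(inst)

str(consum)
summary(consum)
```

In real data the mapping of institution variants usually has to be built by inspecting table(inst) and deciding case by case which labels refer to the same place.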

Tutorial 3 (08/03/2018)

Topic: R: Functions and programming, classes and methods.

R script (functions and programming)    R script (classes and methods)    CovMat.R


Write an R function which takes an object of class glm and creates two tables (each being returned as a data.frame).

Table 1 will contain, for each non-intercept coefficient, (i) the exponential of the MLE of the coefficient (which has a useful interpretation for many GLMs), (ii) the related standard error calculated by means of the delta method, (iii) the p-value from the Wald test, (iv) the p-value from the likelihood-ratio (deviance) test, (v) the confidence interval for the exponential of the coefficient dual to the Wald test, (vi) the confidence interval for the exponential of the coefficient dual to the likelihood-ratio test. The user should be able to specify the coverage of the confidence intervals.
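
For items (ii) and (v), the standard first-order delta method applied to the transformation exp(β) gives the expressions below. The notation is an assumption introduced here for illustration: β̂_j denotes the MLE of the j-th coefficient, SE its estimated standard error, and z the standard normal quantile for coverage 1 − α.

```latex
\[
  \widehat{\mathrm{SE}}\bigl(e^{\hat\beta_j}\bigr)
  \approx e^{\hat\beta_j}\,\widehat{\mathrm{SE}}\bigl(\hat\beta_j\bigr),
  \qquad
  \Bigl(e^{\hat\beta_j - z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\beta_j)},\;
        e^{\hat\beta_j + z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\beta_j)}\Bigr)
\]
```

The interval is obtained by transforming the Wald interval for β_j, so it is in general not symmetric around the exponential of the estimate.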

Table 2 will contain for each term (effect) included in the model (i) related degrees of freedom, (ii) Wald test statistic and a p-value, (iii) likelihood-ratio test statistic and a p-value.

Additionally, write a function which prints the results in a nice form. Minimal niceness consists of (a) providing explanatory titles for the two tables, (b) formatting printed p-values such that those lower than 0.001 are printed as <0.001, (c) rounding printed numbers (other than p-values) to a number of digits specified by the user (take 2 as the default number of digits after the decimal point).
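
Requirements (b) and (c) could be handled by two small helpers such as the ones below. The function names formatPval and roundTable are hypothetical; this is a minimal sketch, not the required solution.

```r
# Format p-values: values below the threshold are printed as "<0.001"
formatPval <- function(p, eps = 0.001, digits = 3) {
  out <- formatC(p, format = "f", digits = digits)
  out[!is.na(p) & p < eps] <- paste0("<", format(eps, scientific = FALSE))
  out[is.na(p)] <- NA_character_
  out
}

# Round all numeric columns of a table to a user-specified number of digits
roundTable <- function(tab, digits = 2) {
  isNum <- vapply(tab, is.numeric, logical(1))
  tab[isNum] <- lapply(tab[isNum], round, digits = digits)
  tab
}

formatPval(c(0.20456, 0.0004, 0.001))
```

Keeping the formatting separate from the computation means the tables returned by the main function can still hold unrounded numbers for further use.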

Test your function on a logistic model based on the Consum data, with the response being an indicator of whether more was spent on alcoholic than on non-alcoholic drinks (count 25 CZK for Radler/nealko and 30 CZK for liqueur, do not count other spending) and covariates (all included in an additive way) (i) gender, (ii) category, (iii) talk categorized as none/at most 30 min/more than 30 min. Missing values for talk should be considered as no talk. Disregard subjects with category guest.
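
As a starting point, the Wald-based parts of Table 1 and the likelihood-ratio parts of Table 2 can be obtained from a fitted glm object roughly as follows. The data are simulated (they are not the Consum data), the variable names are hypothetical, and the sketch deliberately leaves out the per-coefficient likelihood-ratio tests for multi-level factors, the likelihood-ratio confidence intervals and the term-wise Wald tests.

```r
set.seed(1)
# Simulated stand-in data (not the Consum data); variable names are hypothetical
n <- 200
d <- data.frame(gender = factor(sample(c("m", "f"), n, replace = TRUE)),
                x      = rnorm(n))
d$y <- rbinom(n, size = 1,
              prob = plogis(-0.5 + 0.8 * (d$gender == "m") + 0.5 * d$x))

fit   <- glm(y ~ gender + x, family = binomial, data = d)
level <- 0.95                                   # user-specified coverage

cf  <- coef(summary(fit))[-1, , drop = FALSE]   # drop the intercept row
est <- cf[, "Estimate"]
se  <- cf[, "Std. Error"]
z   <- qnorm(1 - (1 - level) / 2)

tab1 <- data.frame(
  expCoef = exp(est),
  seDelta = exp(est) * se,        # delta-method standard error of exp(beta)
  pWald   = cf[, "Pr(>|z|)"],
  waldLow = exp(est - z * se),    # Wald interval transformed to the exp scale
  waldUpp = exp(est + z * se)
)

# Likelihood-ratio (deviance) test for each term of the model
lr <- drop1(fit, test = "Chisq")
tab2 <- data.frame(df = lr$Df[-1], LRstat = lr$LRT[-1],
                   pLR = lr[["Pr(>Chi)"]][-1],
                   row.names = rownames(lr)[-1])
tab1
tab2
```

The Wald limits agree with exp(confint.default(fit, level = level)), which can serve as a quick check of a hand-written implementation.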

Tutorial 4 (15/03/2018)

Individual work at home.

Tutorial 5 (22/03/2018)

Topic: R: Big data, vectorized calculation.

R script (big data and apply)    Data (Kojeni)


No new assignment for this tutorial. Work on previous assignments.

Tutorial 6 (29/03/2018)

Topic: R: Hundreds of tables in one second.

Motto: Use your time to do creative tasks (or to rest). Computers are here for routine (if one is willing to use his/her brain first).

R script (routine analysis)    formatOut.R    funTabDescr.R
Data (nelsNE)    report (LaTeX)    report (pdf)


Write an R function to convert tables from Assignment 3 into LaTeX tables. Use LaTeX to prepare a toy report in pdf containing results of the test analysis from Assignment 3.
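
One possible shape of such a conversion function, using base R only, is sketched below. The name dfToLatex and its arguments are hypothetical, and the function makes no attempt to escape special LaTeX characters in the cells.

```r
# Convert a data.frame into the lines of a LaTeX table.
# Base R only; special LaTeX characters in the cells are NOT escaped.
dfToLatex <- function(tab, digits = 2, caption = NULL) {
  isNum <- vapply(tab, is.numeric, logical(1))
  body <- tab
  body[isNum] <- lapply(tab[isNum], formatC, format = "f", digits = digits)
  body[] <- lapply(body, as.character)
  # Row names left-aligned, numeric columns right-aligned, others left-aligned
  align <- paste0("l", paste(ifelse(isNum, "r", "l"), collapse = ""))
  rows <- apply(cbind(rownames(tab), as.matrix(body)), 1, paste, collapse = " & ")
  c("\\begin{table}[ht]", "\\centering",
    if (!is.null(caption)) paste0("\\caption{", caption, "}"),
    paste0("\\begin{tabular}{", align, "}"), "\\hline",
    paste0(paste(c("", colnames(tab)), collapse = " & "), " \\\\"), "\\hline",
    paste0(rows, " \\\\"), "\\hline",
    "\\end{tabular}", "\\end{table}")
}

tab <- data.frame(OR = c(1.5234, 0.8), pval = c("0.012", "$<$0.001"),
                  row.names = c("genderm", "x"), stringsAsFactors = FALSE)
writeLines(dfToLatex(tab, caption = "Odds ratios"))
```

The returned character vector can be written to a file with writeLines() and pulled into the report with \input.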

Tutorial 7 (05/04/2018)

Topic: R: graphics

Data (nelsNE, processed)    R script (graphics)    


No new assignment for this tutorial. Work on previous assignments.

Tutorial 8 (12/04/2018)

Topic: Sweave.

R script (routine analysis)    report (Sweave)    report (pdf)
   report 2 (Sweave)    report 2 (pdf)
R script (process Sweave)    bib file    
TeX style    bib style    


Convert the LaTeX document from Assignment 6 into a Sweave document.

Tutorial 9 (19/04/2018)

Topic: R: 3D graphics, shape maps

R script (palettes)      data (PS PČR 2017, txt)
R script (3D plots)      R function (dmixn)
R script (shape maps)      shape files (tar.gz)



Take the results of the PS PČR 2017 elections and calculate the conditional distributions of votes by region. For at least one party, visualize the respective regional proportions in a map. Include the map in a separate section of the document prepared in the previous assignments.
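
The conditional distributions can be computed with prop.table(). The counts below are made up for illustration and are not the actual election results; the region and party labels are hypothetical.

```r
# Hypothetical vote counts (regions in rows, parties in columns);
# not the actual PS PČR 2017 results.
votes <- matrix(c(120, 80, 50,
                   60, 90, 30),
                nrow = 2, byrow = TRUE,
                dimnames = list(region = c("Region A", "Region B"),
                                party  = c("Party X", "Party Y", "Party Z")))

# Conditional distribution of votes given region: each row sums to 1
byRegion <- prop.table(votes, margin = 1)
round(100 * byRegion, 1)

# Regional proportions of one party, e.g. as input for a map colour scale
byRegion[, "Party X"]
```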

Tutorial 10 (26/04/2018)

Topic: Monte Carlo studies in statistics

R script (simulation 1)    R script (simulation 2)    


As you all (hopefully) know, the χ² distribution of the test statistic of the Pearson χ² test of independence in a contingency table is only asymptotic. It is traditionally claimed that the asymptotic χ² approximation works reasonably well when all expected counts (under independence) are higher than the magical number 5.

Perform a simulation study to explore the true significance level and the true distribution of the test statistic of the χ² test of independence in a 2×2 table corresponding to a comparison of two independent binomial distributions. This is in fact a test comparing the proportions of a certain property ("success") in two independent populations.

In the following, let p1 and p2 be the proportions of "success" in populations 1 and 2, respectively, and let n1 and n2 be the sample sizes in populations 1 and 2, respectively. Consider a χ² test of independence with a nominal significance level of 5% and use the continuity correction when calculating the value of the test statistic. Further, assume equal sample sizes in the two groups, i.e., n1 = n2 = n, and consider three scenarios (of independence):

  • p1 = p2 = p = 0.01;
  • p1 = p2 = p = 0.1;
  • p1 = p2 = p = 0.5.

For each scenario, consider values of n (sample size in each group) that gradually correspond to the lowest value of the expected count (under the respective scenario) being 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 100. That is, you have in total 3 x 12 = 36 scenarios. Use a simulation length of at least M = 10000.
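
The sample sizes follow from a short calculation: with equal group sizes n and a common success probability p, the smallest expected count in the 2×2 table is n · min(p, 1 − p), so n is the required count divided by min(p, 1 − p). A sketch of the resulting scenario grid (the column names are hypothetical):

```r
# Smallest expected count in a 2x2 table with equal group sizes n and a common
# success probability p is n * min(p, 1 - p); solve for n.
scen <- expand.grid(minExpected = c(1:10, 50, 100), p = c(0.01, 0.1, 0.5))
scen$n <- round(scen$minExpected / pmin(scen$p, 1 - scen$p))
nrow(scen)
head(scen)
```

For example, p = 0.01 with a smallest expected count of 1 requires n = 100 per group, while p = 0.5 with the same count requires only n = 2.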

Report the results (empirical significance levels) in the form of well-formatted table(s) included in the LaTeX (or Sweave) document that you prepared for the earlier assignments.

Additionally, use suitable graphical tools to compare the empirical distributions of the test statistics (under the considered scenarios) to the assumed χ² distributions. Include the relevant plots in the document.

Remark: Before you start the simulation, consider whether some scenarios might pose computational or theoretical problems.
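
A vectorized sketch of the simulation for a single scenario is given below. The function names are hypothetical, and the statistic is coded by hand (following the continuity correction as applied by chisq.test() to 2×2 tables) instead of calling chisq.test() M times. How to treat tables in which the statistic is undefined is a decision the sketch makes only provisionally, by excluding them from the empirical level and reporting their share separately.

```r
# Pearson chi-square statistic with continuity correction for a 2x2 table
# comparing x1 successes out of n with x2 successes out of n (vectorized).
chisqYates2x2 <- function(x1, x2, n) {
  n <- as.numeric(n)
  N <- 2 * n
  c1 <- x1 + x2              # total number of successes
  c2 <- N - c1               # total number of failures
  adbc <- n * abs(x1 - x2)   # |ad - bc| for the table (x1, n - x1; x2, n - x2)
  N * pmax(adbc - N / 2, 0)^2 / (n * n * c1 * c2)   # NaN (0/0) if c1 or c2 is 0
}

# Empirical significance level for one scenario (H0 holds: p1 = p2 = p)
simLevel <- function(n, p, M = 10000, alpha = 0.05) {
  x1 <- rbinom(M, size = n, prob = p)
  x2 <- rbinom(M, size = n, prob = p)
  stat <- chisqYates2x2(x1, x2, n)
  c(level     = mean(stat > qchisq(1 - alpha, df = 1), na.rm = TRUE),
    undefined = mean(is.nan(stat)))   # share of tables with no success/failure
}

set.seed(440)
simLevel(n = 20, p = 0.5)
simLevel(n = 100, p = 0.01)
```

In the second call, both samples contain no success with probability 0.99^200 ≈ 0.13, in which case the statistic is 0/0; this is one instance of the kind of problem the remark refers to.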

Tutorial 11 (03/05/2018)

Topic: Computationally efficient simulations, cluster computation

R script (batch simulation)    Shell script    R script (process results)
R script (parallel simulation)        
R script (cluster simulation)    Shell script (sbatch it)    
R script (prepare cluster scripts)    R script (process results)    

Sněhurka (Karlín)    Snow White (Karlín)    IT4Innovations (Ostrava)

Assignment (optional):

Try to implement the simulation study from Assignment 10 in an efficient way.

Tutorial 12 (10/05/2018)

Topic: Use of a compiled code in R

       R script (some test of independence)    
indTest.c (C file)    indTest.R (R function)    rMVN2.R (R function)
   R script (simulation)    R script (process results)    Shell script (start simulations)

On Windows machines, Rtools is needed (direct download here), along with the R package devtools.

Selected topics from "Writing R Extensions" manual:
   Interface function .C
   dyn.load and dyn.unload
   Creating shared objects
   Random number generation
   Numerical analysis subroutines
      (distribution and mathematical functions, mathematical constants)


Take the Consum data and use the test of independence implemented in indTest.R (with a = 1) to evaluate the dependence between spending on beer (total for beer and Plzeň) and spending on non-alcoholic drinks (total for Radler/nealko, cola/kofola, čaj). Count 25 CZK for Radler/nealko. Perform the analysis for (a) the whole dataset, (b) "senior" people only (categories prof, doc, asist, res, host), (c) "junior" people only (category phd). Use the bootstrap to calculate the P-values of the tests. Include the results in the document created in the previous assignments.

Remark: Explanation on how to use bootstrap to calculate the P-value of the considered test of independence will be/was provided during the lecture.
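
For orientation only, one common way to obtain a bootstrap P-value for a test of independence is sketched below. Two things are assumptions: the statistic (absolute Pearson correlation) is merely a stand-in for the one implemented in indTest.R, and the resampling scheme (drawing the two variables independently of each other to mimic the null hypothesis) may differ from the one presented in the lecture.

```r
# Bootstrap p-value for a test of independence of x and y.
# 'stat' is a stand-in for the statistic implemented in indTest.R.
bootPval <- function(x, y, stat = function(x, y) abs(cor(x, y)), B = 999) {
  n <- length(x)
  tObs <- stat(x, y)
  # Resample x and y independently of each other, which imposes H0
  tBoot <- replicate(B, stat(x[sample.int(n, n, replace = TRUE)],
                             y[sample.int(n, n, replace = TRUE)]))
  (1 + sum(tBoot >= tObs)) / (B + 1)
}

set.seed(12)
x <- rnorm(50)
bootPval(x, x + rnorm(50, sd = 0.5))   # strongly dependent pair
bootPval(x, rnorm(50))                 # independent pair
```

Adding 1 to both the numerator and the denominator keeps the P-value strictly positive, so the smallest attainable value is 1/(B + 1).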

Tutorial 13 (17/05/2018)

Topic: R grid graphics (lattice, ggplot2), R Markdown, R shiny

R script (lattice, ggplot2)      data (auta2004, RData)
R markdown files (tar.gz archive)
R shiny files (tar.gz archive)

lattice package description      Getting started with lattice graphics
ggplot2 package page      Short tutorial on ggplot2


The course credit will be awarded to the student who hands in a satisfactory solution to each assignment by the prescribed deadline. The nature of these requirements precludes any possibility of additional attempts to obtain the course credit.



