Computational Environment for Statistical Data Analysis (NMST440)

Arnošt Komárek


Summer semester 2013–14


Lectures: Thursday 14:00 in K11   
Exercises: Thursday 15:40 in K11   
  • Lectures and exercises will be taught in English if at least one student requires it. If nobody requires English as the teaching language, both lectures and exercises will be taught in Czech.


Matúš Čellár    Adéla Drabinová    Jaroslav Dufek    Karel Chuchel    Kateřina Janoušková    Dominik Matula    Jan Moravec   


  • Please install the SAS software on your laptop (if you have one) by the end of March. The software is available (for academic purposes only) to all MFF UK students within the SAS Academic Programme. Installation is provided by RNDr. Ing. Jaroslav Richter on the 3rd floor; please arrange an appointment either by telephone (221 913 206, or just extension 3206 from any phone in the Karlín building) or by e-mail (richter[AT]karlin.ETC-YOU-KNOW-WELL-WHAT).

Additional software we (you) may use:
PuTTY  putty-0.63-installer.exe
FileZilla  FileZilla_3.7.4.1_win32-setup.exe
GIMP  gimp-2.8.10-setup.exe
Ghostscript and GhostView  GPL Ghostscript 9.01  GSview release v5.0
XEmacs  XEmacs_Setup_21.4.22.exe
ESS plugin for XEmacs
Rtools  Rtools31.exe
R package devtools  


Lecture 1 (20/02/2014)
Topic: HTML and bibliographic information sources on the Internet.

    HTML tags:     Page at w3schools     Page at htmldog
    CSS Templates:     CSS Templates For Free     Andreas Viklund     Example from CSS Templates For Free
    Classification systems:     MSC     JEL
    Bibliographic databases:     Web of Science (WOS)     MathSciNet     Scopus     ZentralBlatt MATH     Google Scholar
    DOI number:     DOI at Wiki
    Articles databases:     JSTOR     Wiley Online Library     ScienceDirect     SpringerLink
    htpasswd:     .htaccess Example     .htaccess Example 2     On-line htpasswd generator

  1. Create your homepage on the Artax server and then send a link to this page to the lecturer via e-mail.
  2. Add to this webpage information about your Bachelor or Master thesis, including its MSC and/or JEL classification and keywords in both Czech/Slovak and English. Further, provide three references from your thesis, each with the following information: the DOI number (as an active link), the number of citations according to WOS and Scopus, and whether the full text of the publication is available from MFF UK IP addresses. If it is, include the link to the full text.

Lecture 2 (27/02/2014)
Topic: BibTeX, figures in LaTeX. Dynamic plots for web.  
nmst440-latex.tex  nmst440-latex.bib  akplainnat.bst  Makefile  nmst440-latex.pdf
AK_small.jpg  AK_small.eps  AK.jpg
nmst440-tdens.R  dt_1.eps  dt_all.pdf  dt_all.gif

  1. Use the LaTeX package custom-bib to prepare a bst file that produces a list of references as close as possible to the style requested by the journal Statistical Modelling, see here.
  2. Use one of the databases (WOS, MathSciNet, ...) introduced last week, or other resources, to find references related to the keywords from the previous assignment. Find at least five papers and at least one book.
  3. Create a bib file containing those references.
  4. Use LaTeX and the bst style file from assignment 1 to write a short text in which you use different types of referencing (direct, indirect) to works from your bib file (when working on this part, also try other standard bibliography styles like plain, unsrt, abbrv, ...).
  5. Use GIMP and ps2pdf to convert any jpg file (any photograph, printscreen, ...) into eps and pdf and include it in your LaTeX document.
  6. Prepare a series of plots illustrating the central limit theorem applied to the chi-squared distribution (do not forget that some standardization of the chi-squared density is needed) and include the plots in your LaTeX file. Prepare not only plots of the densities but also plots of the corresponding cumulative distribution functions (cdf's). Create a pdf document from your LaTeX file. Include a link to this pdf file on your webpage.
  7. Use convert to prepare dynamic gif files (for both the densities and the cdf's) from the plots prepared in the previous item. Include those gif files on your webpage.
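
One possible standardization for item 6, given here only as a sketch: a chi-squared variable X_k with k degrees of freedom is a sum of k independent chi-squared variables with one degree of freedom, with mean k and variance 2k, so the central limit theorem applies to the standardized variable Z_k below. Its density and cdf follow from a change of variable. The symbols X_k, Z_k, f and F are notation introduced for this illustration.

```latex
Z_k = \frac{X_k - k}{\sqrt{2k}}, \qquad
f_{Z_k}(z) = \sqrt{2k}\, f_{X_k}\bigl(k + \sqrt{2k}\, z\bigr), \qquad
F_{Z_k}(z) = F_{X_k}\bigl(k + \sqrt{2k}\, z\bigr) \;\longrightarrow\; \Phi(z)
\quad (k \to \infty).
```

Plotting f_{Z_k} (or F_{Z_k}) for increasing k against the standard normal density (or cdf Φ) then shows the convergence on a common scale.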

Lecture 3 (06/03/2014)
Topic: R graphics, reading data into R.
nmst440-graphics.R  pchShow.R  dmix2.R  
 Adobe symbols encoding  
nmst440-readData.R  auta2004.dat  auta2004.csv  
 cars.xls  cars.csv  

  1. Consider bivariate t-distributions with ν=5 and ν=50 degrees of freedom and a scale matrix with the values 1 and 4 on the diagonal and an off-diagonal value of 1. Draw a heat map of the densities of those t-distributions, supplemented by contour lines. Further, draw a 3D plot of those densities. Additionally, randomly sample 100 points from each of those distributions and add the sampled points to the heat maps. Include all plots in the LaTeX document from the previous assignment.
    Remark: The multivariate t-distribution (density, distribution function, random sampling) is implemented, e.g., in the R package mvtnorm.
  2. Take the data in the Excel sheet partners.xls related to this questionnaire and prepare an RData file containing a data frame with well-formatted data (no gross errors, categorical variables as factors, ...). At this stage, keep the two date columns DOB and DateInterv as class character. Additionally, create the following derived variables:
    1. NumPtnr: Actual number of reported partners.
    2. Vppnarg: Self-reported number of acts where the partner is a regular one (spouse or boy-/girlfriend). Set it to NA if there is no regular partner.
    3. Vppncns: Self-reported number of protected acts where the partner is other than the spouse. Set it to NA if the participant does not have any partner other than the spouse.
    4. Vppnamm: Self-reported number of acts where the participant and the partner are both male. Set it to NA if the participant is female.
    5. Vppagdf: Age difference between the participant and his/her most frequent sexual partner of the opposite sex. Set it to NA if the participant does not have any partner of the opposite sex.
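
For the bivariate t-distributions of assignment 1 above, the density and a sampler can be written down directly. The following is a minimal sketch in Python using only the standard library (the course itself works in R, where mvtnorm provides this). It assumes a zero location vector, which the assignment does not state, and the names SIGMA, dt2 and rt2 are hypothetical helpers introduced here.

```python
import math
import random

# Scale matrix from the assignment: 1 and 4 on the diagonal, 1 off the diagonal.
SIGMA = ((1.0, 1.0), (1.0, 4.0))
DET_SIGMA = SIGMA[0][0] * SIGMA[1][1] - SIGMA[0][1] * SIGMA[1][0]  # equals 3


def dt2(x1, x2, nu):
    """Density of the bivariate t-distribution with scale matrix SIGMA and
    nu degrees of freedom. ASSUMPTION: the location vector is (0, 0)."""
    # Quadratic form x' SIGMA^{-1} x, written out for the 2x2 case.
    q = (SIGMA[1][1] * x1 * x1 - 2.0 * SIGMA[0][1] * x1 * x2
         + SIGMA[0][0] * x2 * x2) / DET_SIGMA
    # In dimension 2, Gamma((nu + 2) / 2) / (Gamma(nu / 2) * nu * pi) = 1 / (2 * pi).
    const = 1.0 / (2.0 * math.pi * math.sqrt(DET_SIGMA))
    return const * (1.0 + q / nu) ** (-(nu + 2.0) / 2.0)


def rt2(n, nu, rng):
    """Draw n points as L z * sqrt(nu / w), where z is standard bivariate
    normal, L is the Cholesky factor of SIGMA and w is chi-squared with nu df."""
    l11 = math.sqrt(SIGMA[0][0])
    l21 = SIGMA[1][0] / l11
    l22 = math.sqrt(SIGMA[1][1] - l21 * l21)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        w = rng.gammavariate(nu / 2.0, 2.0)  # Gamma(shape nu/2, scale 2) = chi-squared(nu)
        s = math.sqrt(nu / w)
        points.append((s * l11 * z1, s * (l21 * z1 + l22 * z2)))
    return points


if __name__ == "__main__":
    rng = random.Random(440)
    for nu in (5, 50):
        sample = rt2(100, nu, rng)
        print(nu, dt2(0.0, 0.0, nu), sample[0])
```

Evaluating dt2 on a regular grid gives the values needed for a heat map with contour lines, and the points returned by rt2 can be overlaid on it.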

Further reading:
  • Paul Murrell (2011). R Graphics. Second Edition. Boca Raton: CRC Press. ISBN 978-1-4398-3176-2.

Lecture 4 (13/03/2014)
Topic: R: dates, formatted output, tables, Sweave (automatically created reports).
nmst440-partners.R  partners.csv    
nmst440-tables.R  p2string.R  cars.RData  
nmst440-Sweave.Rnw  nmst440-Sweave.bib  akplainnat.bst  
Sweave.sty  SweaveAK.sty  sweaveIt.R  

  1. Use Sweave to create a PDF report on the analysis of the partners data that tries to answer the following question: Does the value of Vppagdf depend on the gender and age of the participant?
    Examine both the marginal and the partial effect (i.e., adjusted for the effect of the other factor) of gender and age on Vppagdf. Present the results in a table similar to this table. Also include two plots suitable for evaluating the marginal relationship between Vppagdf and gender and between Vppagdf and age. On the plot of the relationship between Vppagdf and age, use different symbols/colors to distinguish male and female participants.

Lecture 5 (20/03/2014)
Topic: Simple simulations; calculation of standard errors, confidence intervals and critical values using the Monte Carlo method and the bootstrap.
nmst440-simul1.R  nmst440-simul2.R  nmst440-bootstrap.R  

  1. As you all (hopefully) know, the χ² distribution of the test statistic of the Pearson χ² test of independence in a contingency table is only asymptotic. It is traditionally claimed that the asymptotic χ² approximation works reasonably well when all expected counts (under independence) are higher than the magic number 5. Perform a simulation study to explore the true significance level of the χ² test of independence in a 2×2 table corresponding to a comparison of two independent binomial distributions. This is in fact a test comparing the proportions of a certain property ("success") in two independent populations. In the following, let p1 and p2 be the proportions of "success" in populations 1 and 2, respectively, and let n1 and n2 be the sample sizes in populations 1 and 2, respectively. Consider a χ² test of independence with a nominal significance level of 5% and use the continuity correction when calculating the value of the test statistic. Further, assume equal sample sizes in the two groups, i.e., n1 = n2 = n, and consider three scenarios (of independence):
    • p1 = p2 = p = 0.01;
    • p1 = p2 = p = 0.1;
    • p1 = p2 = p = 0.5.
    For each scenario, consider values of n (the sample size in each group) for which the lowest expected count (under the respective scenario) is, in turn, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 100. That is, you have 3 × 12 = 36 scenarios in total. Use a simulation length of at least M = 10000.

    Report the results (empirical significance levels) in the form of well-formatted table(s) included in a document prepared using LaTeX. Also use a suitable plot to visualize the results. Sweave can be used to prepare the report.

    Remark: Before you start the simulation, consider whether some scenarios might pose computational or theoretical problems.
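
The structure of such a simulation can be sketched as follows, in Python with the standard library only (the course scripts are in R). Several choices are assumptions made for the sketch: the expected count is computed from the true p, so n = e / min(p, 1 - p); tables with an empty column, where the statistic is undefined, are counted as non-rejections; and the names yates_statistic, sample_size and empirical_level are hypothetical.

```python
import random

# 0.95 quantile of the chi-squared distribution with 1 degree of freedom.
CRIT_95_DF1 = 3.841458820694124


def yates_statistic(x1, x2, n):
    """Continuity-corrected chi-squared statistic for the 2x2 table
    [[x1, n - x1], [x2, n - x2]]. Returns None if a column total is zero."""
    a, b, c, d = x1, n - x1, x2, n - x2
    col1, col2 = a + c, b + d
    if col1 == 0 or col2 == 0:
        return None  # some expected counts are zero, the statistic is undefined
    total = 2 * n
    num = max(0.0, abs(a * d - b * c) - total / 2.0)
    return total * num * num / (n * n * col1 * col2)


def sample_size(p, e):
    """Group size n for which the lowest expected count n * min(p, 1 - p) is e."""
    return round(e / min(p, 1.0 - p))


def empirical_level(p, n, M, rng):
    """Proportion of M simulated tables (p1 = p2 = p) in which the test rejects."""
    rejections = 0
    for _ in range(M):
        x1 = sum(rng.random() < p for _ in range(n))
        x2 = sum(rng.random() < p for _ in range(n))
        stat = yates_statistic(x1, x2, n)
        # ASSUMPTION: an undefined statistic is counted as a non-rejection.
        if stat is not None and stat > CRIT_95_DF1:
            rejections += 1
    return rejections / M


if __name__ == "__main__":
    rng = random.Random(2014)
    for p in (0.1, 0.5):
        for e in (1, 5):
            n = sample_size(p, e)
            print(p, e, n, empirical_level(p, n, 1000, rng))
```

The demonstration loop uses M = 1000 and only a few scenarios to keep the run short; the assignment itself asks for all 36 scenarios and at least M = 10000. How to treat undefined statistics is a design decision that should be reported together with the results.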

Lecture 6 (27/03/2014)
Topic: Computational improvement of simulations: basics of C/C++ programming, calculation on a cluster.
Test statistic of a certain independence test
nmst440-indTest.R  indTest.R  rMVN2.R  
nmst440-devel.R  indTest.c  Makefile  
nmst440-simIndTest.R  nmst440-prepareScripts.R  nmst440-resultIndTest.R  Sněhurka results

  1. Take the IQ data and use the test of independence implemented in nmst440-indTest.R (with a = 1) to evaluate, separately for boys and girls (variable fgender), whether IQ (variable iq) depends on the average grade from the 8th year of primary school (variable zn8). Use the bootstrap to calculate the P-values of the tests.

    Remark: An explanation of how to use the bootstrap to calculate the P-value of the considered test of independence will be provided during the lecture.
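
One generic way to obtain a bootstrap P-value for a test of independence is sketched below in Python (standard library only). It is not necessarily the procedure explained in the lecture, and the statistic from nmst440-indTest.R is not reproduced here: the absolute Pearson correlation is used purely as a stand-in. The names abs_correlation and bootstrap_pvalue are hypothetical. The idea is to resample the two variables independently of each other, which imposes the null hypothesis, and to compare the observed statistic with its bootstrap replicates.

```python
import math
import random


def abs_correlation(x, y):
    """Stand-in test statistic (an ASSUMPTION, not the statistic from the
    lecture): absolute value of the Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant resample carries no evidence of dependence
    return abs(sxy) / math.sqrt(sxx * syy)


def bootstrap_pvalue(x, y, statistic, B, rng):
    """Bootstrap P-value of an independence test that rejects for large
    values of statistic(x, y)."""
    observed = statistic(x, y)
    n = len(x)
    exceed = 0
    for _ in range(B):
        # x and y are resampled with replacement, independently of each
        # other, so the resampled pairs are independent by construction.
        xb = [x[rng.randrange(n)] for _ in range(n)]
        yb = [y[rng.randrange(n)] for _ in range(n)]
        if statistic(xb, yb) >= observed:
            exceed += 1
    # Adding 1 to numerator and denominator keeps the P-value above zero.
    return (exceed + 1) / (B + 1)


if __name__ == "__main__":
    rng = random.Random(6)
    x = [rng.gauss(0.0, 1.0) for _ in range(50)]
    y = [v + rng.gauss(0.0, 1.0) for v in x]  # y depends on x
    print(bootstrap_pvalue(x, y, abs_correlation, 999, rng))
```

For the assignment, the function passed as statistic would be replaced by the test statistic of the considered independence test, and the calculation would be run separately for boys and girls.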

Lectures 7–10 (03/04/2014, 10/04/2014, 17/04/2014, 24/04/2014)
Topic: SAS software (Jakub Chovanec, SAS Institute ČR).
Data (set 1)  Data (set 2)
Presentation 1 (PPTX)  Presentation 2 (PPTX)  Presentation 3 (PPTX)  Presentation 3b (PPTX)
Presentation 5 (PPTX)  Presentation 6 (PPTX)  Presentation 7 (PPTX)  Presentation 8 (PPTX)
Presentation 9 (PPTX)      

Lecture 11 (15/05/2014)
Topic: Non-linear mixed and generalized linear mixed models in R and SAS.
SAS/STAT Documentation SAS/STAT Procedures
SAS proc nlmixed SAS proc glimmix
nmst440-nlme.pdf  nmst440-nlme.R  argconc.txt

  1. See Section 2 of nmst440-nlme.pdf for details. Data for the assignments: toenail.txt.

Lecture 12 (22/05/2014)
Topic: Some commercial statistical packages, GUI extensions of R
S-plus (probably no longer available) -> TIBCO Spotfire
SPSS (IBM, Acrea in CZ)
Statistica (StatSoft/Dell)
nmst440-rpanel.R  rp_samples.R    

Further reading:
  • Michael Lawrence, John Verzani (2012). Programming Graphical User Interfaces in R. Boca Raton: CRC Press. ISBN 978-1-4398-5682-6.

