"G. d'Annunzio"
Basic knowledge of statistics
The course aims to give students the tools to extract relevant information from large amounts of data, with particular attention to statistical learning in both supervised (predictive) and unsupervised settings. The course also introduces students to text mining, a recent field of research whose development is closely linked to the growing volume of online text data and to advances in statistical methods and algorithms for information retrieval and automatic classification. Analyses are carried out in the statistical language R.
LEARNING OUTCOMES
1. Understand the nature of multivariate and textual data and the statistical techniques used to analyse them.
2. Understand and be able to explain the fundamentals of algorithms for extracting information from multivariate and textual databases.
3. Be able to apply the principles of statistical reasoning in the preparation and interpretation of company reports.
4. Be able to use the R software for statistical analysis.
Making judgements - Learn the logical and statistical concepts needed to work independently in searching for, selecting and processing data relevant to data mining and text mining.
Communication skills - Learn the terminology and statistical techniques needed to communicate and correctly discuss the results of analyses of company data relevant to data mining and text mining.
The following topics are core parts of the teaching programme for meeting the course objectives: Introduction to computing in R; Introduction to Statistical Learning; Data visualization; Regression and Classification; Unsupervised learning (principal component analysis, clustering); Introduction to Text data mining; Text preparation; Text Analytics; Visualization of textual data; Web scraping.
1. Introduction to R
2. Introduction to data mining and statistical learning
3. Data visualization techniques
4. Review of probability
5. The multivariate Normal distribution
6. Supervised learning models (regression, classification)
7. Unsupervised learning models (clustering, PCA)
8. Introduction to text mining
9. Preparation of texts (standardization or preprocessing, tokenization, stopwords, stemming, "bag of words" model)
10. Visualization of textual data
11. Statistical analysis of textual data
12. Automatic classification of texts
13. Topic models
14. Web scraping
Coursebooks:
Vardanega, Agnese. 2011–2021. «R per l’analisi dei dati. Una wiki per l’analisi dei dati con R». https://www.agnesevardanega.eu/wiki/r/start.
Vardanega, Agnese. 2022. «Strumenti per l’analisi testuale e il text mining con R». https://www.agnesevardanega.eu/books/analisi-testuale-2021/index.html
Further materials (e.g. slides) can be downloaded from https://fad.unich.it
Direct link: https://fad.unich.it/course/view.php?id=1342
English Textbooks: James, Witten, Hastie, Tibshirani (2013), An Introduction to Statistical Learning (with Applications in R), Springer-Verlag
Julia Silge, David Robinson, Text Mining with R: A Tidy Approach, O'Reilly Media (31 July 2017)
Lectures and practical exercises using the R software. Attendance is not compulsory but is strongly recommended.
The exam consists of a presentation of a project designed and developed during the course and an oral discussion on the same topics.
Non-attending students can find instructions for carrying out the project on the FAD website and are invited to contact the lecturer for any clarification.
E-mail: lara.fontanella@unich.it
Students are received after lectures. Appointments can be arranged by e-mail.