Announcement

  1. Post your self-introduction in the introduction discussion.

  2. We will do self-introductions this Friday via Zoom.

Course logistics

Course Project

Project ideas/Dataset resources

Brief Description components

  • Introduce the dataset (data type, origin, etc.). Explain why you chose the dataset. List some questions you want to explore with the dataset.

Mid-term report components

  • Include the brief description with modifications if needed

  • Give an abstract on your plan

    • Which analyses you want to perform to answer your questions
  • Current progress and future plan

Final report components

  • Introduce the dataset. Explain why you chose it. Explain what questions you want to ask and explore using the dataset.

  • Analysis. Explain the statistical methods that you use for analyzing the dataset. Explain what you have done to generate the results (make your analysis reproducible).

  • Results. Illustrate your results. Use figures and tables to improve readability.

  • Discussion. This is the place to share almost anything you want: difficulties you met in the analysis, what you learned from it, and possible future directions.

Comments from 2020 course evaluations

From “Additional comments about your experience in this course”:

From “Comment on the strongest aspects of this course”:


Giant’s shoulders

Statistics and data science

A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.

Big data in the 1990s

@Huber94HugeData; @Huber96MassiveData

Data Size   Bytes               Storage Mode
tiny        \(10^2\)            piece of paper
small       \(10^4\)            a few pieces of paper
medium      \(10^6\) (MB)       a floppy disk
large       \(10^8\)            hard disk
huge        \(10^9\) (GB)       hard disk(s)
massive     \(10^{12}\) (TB)    hard disk(s); RAID storage

Big data in the 21st century

Four V’s of big data:

Source: IBM.

A typical data scientist on LinkedIn

A random online cartoon about data scientists

Course description