Readings for 9/5/18
Greetings
If you haven't already gone through the welcome post, please do so now.
I've put up the readings for next week (see tab above). You will need your UoM email id/password to log in (e.g. foobar@memphis.edu would log in as foobar).
I've also created a refresher video for the Kaggle kernel lab we did yesterday. We will keep working with that kernel in our lab next week:
BTW here is an example comment; this is how you're supposed to do your post :) Also, the first couple of weeks we're going to have some background reading that is a bit more technical. Don't worry if you don't fully understand it now. It is meant to be something you refer back to for the rest of the semester.
P 15-52 Introduction to Statistical Learning
As I was reading through the Introduction to Statistical Learning document (p. 15-52), I saw many things that I did not understand and maybe a few that I did. Statistical Learning is a very new concept to me. Of course, I have had undergraduate statistics and I am currently in a graduate course of statistics. Many of the items I saw in this text were similar to the statistics that I am used to but there were also some noticeable differences. I am not sure if those differences are caused by my rudimentary knowledge of statistics or due to a difference in general statistics and statistical learning. Many of the names such as input and output variables, parametric and non-parametric are the same and overall questions of predictions, relationships, etc. are similar. Some of the variables used for the different equations are slightly different than in my current statistics course. For the most part, I have been exposed to linear models, regression, and other similar analysis. Many of the figures and graphs throughout the chapter are much more advanced than I have ever produced. Overall, I found the first reading overwhelming with new information (Thank you Dr. Olney for making me feel better about that in your post) but also intriguing that models this sophisticated exist. I would like to pose a question to those of you that are more familiar with this area that I believe will help me grasp a better understanding: How does statistical learning differ from a course in regular statistics? It seems so familiar, yet different to me right now. Is my gap in information coming from my basic knowledge of statistics or my lack of knowledge in statistical learning?
My favorite part of the chapter was the section about R. I had never used R (or really even heard of it that much) before our class on Wednesday. I downloaded it and, for the most part, followed along with the chapter pretty easily until it wanted specific data from the book. I think given the commands, I could handle inputting them but I would be lost if someone was not telling me what each command is and how to “word” it. Are there helpful commands that are useful to know? I found the up arrow bringing all my previous commands up for editing to be my favorite one so far.
Zuur 2010 Protocol
I found the Zuur (2010) article very relatable. The chapter that I am currently reading in my statistics course covered a similar topic for “avoiding common statistical problems.” From the introduction, it said “rubbish in, rubbish out” which pretty much summed up the article. The steps outlined in the article for questioning data gave a good example of questions that should be asked of data sets. In the discussion section of the article, Zuur comments that the steps are not necessarily linear and not all of them need to be asked for each data set. I think these steps are a good guideline for research in order to give accurate and representative conclusions to one’s research. I think sometimes, some researchers want the data to “fit” their idea. This set of steps can help researchers hold themselves accountable for providing accurate and useful information to the body of knowledge in their field.
Courtney Peters
Courtney, I agree completely on being overwhelmed by the new information presented in the Introduction to Statistical Learning reading. I have very little experience with statistics, and have never been fully comfortable with the subject. One of my favorite (and thankfully, one of the most self explanatory) functions available through R is the one that allows you to save R plots as various file types, such as pdfs or jpegs. It seems like R has the potential to be less complicated than SPSS can sometimes be, so I am very interested in learning more about what all it can do.
I also agree about the Zuur article being a good guideline for researchers and statisticians alike. I appreciate that they provided real and specific examples throughout the article, like the example data about bird species in step 8. I also like that it ended on an example described as "noisy" and "unpredictable," emphasizing that statistical analyses will rarely if ever be cut and dry.
-Rachel Moore
My interest is in the background knowledge, practices, and habits involved in statistically manipulating data. Using different statistical analyses requires assumptions to be made about the state of the world, about the data, about the relationships holding between the data—in addition to real-world considerations such as application. There is a great deal of institutional, situational, and procedural knowledge involved that is not mathematical per se, but that serves as the precondition for this mathematical knowledge. There is a “messiness” to scientific practice predicated upon the fact that science is an institutional practice composed of humans and nonhuman tools. Big data doesn’t escape that messiness, but does it reduce it, increase it, or perhaps even hide it?
Zak Neemeh
I liked learning how to visualize data as graphs because I think when describing statistics this is clearer than wordy explanations. For example, I hadn't heard of the Cleveland dotplot shown in Zuur. Building models using variables is something new to me and I like how this can be applied to data from many different fields. The formulas from the Statistical Learning book went over my head at times, and while still useful it felt like I would learn this quicker by building an actual model to test data. How old are most statistical tests that were discussed? Is data science using newer methods than other related areas of statistics, on average?
Davis
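For readers who have not met the Cleveland dotplot before: the basic idea is to plot each observation's value against its row number, so that a value sitting far from the rest stands out. Below is a minimal text-only sketch of that idea in Python (not R, and not the article's own code); the helper name `dotplot_rows` and the wing-length numbers are made up for illustration.

```python
def dotplot_rows(values, width=40):
    """Return one text row per observation: its row number, then a dot placed by value."""
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid dividing by zero when all values are equal
    rows = []
    for index, value in enumerate(values, start=1):
        # Scale the value to a column between 0 and width - 1.
        column = round((value - low) / span * (width - 1))
        rows.append(f"{index:>3} |" + " " * column + "o")
    return rows


# Made-up measurements; the ninth value is deliberately suspicious.
wing_lengths = [58, 59, 60, 57, 61, 59, 60, 58, 72, 59]

for row in dotplot_rows(wing_lengths):
    print(row)
```

In the printed rows, most dots cluster near the left edge while row 9 sits alone at the far right, which is the kind of pattern that would prompt a closer look at that observation.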
After reading, I understand why these pieces will be referred to throughout the semester.
I agree with others that the Zuur article is a good reference piece; it outlines real world examples and illustrates an adaptable process for researchers to use. This seems like it will be very helpful, not only to ecologists, but to researchers in various fields who want to check the statistical integrity of their work. My previous statistics experience hasn't gone into as much depth as all of the steps in this process; I've had to check for outliers (probably in a way that isn't correct according to these standards) and a normal distribution, and that is the current extent of my statistical problem avoidance. This article seems very practical, and I can see myself using a process like this one for future research analyses.
The Introduction to Statistical Learning contains a lot of information. For someone who has not done complicated statistics, it definitely seems different than the basics that I do know. The graphs throughout the reading provide good visuals that helped me understand what was being explained, but the information is still overwhelming. Since I lack experience with a lot of this kind of data, I'm sure a lot of things went over my head that I would understand better if I saw all the data in front of me. I am glad to have an introduction to R, since I am looking forward to learning and using R in our class labs. Similar to other languages I have some experience in, it's difficult for me to read about a language and grasp it until I've written it out myself. For the people in this class that have used R, do most people refer back to guides while using R, like most developers do while they write code?
Kailey Brumwell
I would say that a lot of the information presented in the Introduction to Statistical Learning reading was not familiar to me. Having only taken one prior stats class, I am not well versed in Statistics, and some of the material was foreign to me. It was helpful to have the multiple diagrams to be able to compare with the text though, and I was thankful for the practical variables used in the examples such as income and education. I really enjoyed reading the last section on R. Because of the prior lab in class last week introducing the programming, reading about it gave me a lot more understanding and clarification on some of the gaps that I had. It seems like a super cool tool to be able to use and a lot less complex than SPSS, so I agree with others that it will be interesting to learn more about.
Being a very type-A person, I liked the practicality of the Zuur article. It was broken down in a way that was simple to understand and interpret. Again, the visuals were helpful in following the text. I think whether a designer is well-experienced or just beginning, recognizing the common errors discussed in the article is advantageous. Some seemed like common sense and I found myself thinking "well I hope they would take these observations and analyses into consideration," such as the variable vs each covariate, or even outliers that may be present. It seemed as though a lot of the strategies to address problems centered around a small portion of research design, such as linear regression (which seemed to be heavily discussed and used as an example), but in the intro several other techniques were mentioned and it made me wonder if these same solutions could be applied to multiple other techniques. Overall, I found it to be a beneficial article.
Sam Choukalas
Although I have taken several stats courses (e.g., Mixed Model Regression), I always feel as though there is more to learn, even about the analyses that are viewed as simple/foundational (e.g., ANOVA), as well as learning which statistical tests are right. As an experimental psychologist, I have tended to live a life of t-tests and ANOVA; however, in the past year I have had to work with a couple atypical data sets that really drove home how much I do not know. For example, I attempted to “do something statistical” (ultimately, logistic regression) with a data set my fiancé and I created of features on some ceramic pots. Ultimately, this became a large list of binary categorical traits, as well as a few continuous measures, and I was struck by the fact that I do not know what to do when I do not have an outcome/dependent variable (I had to create one to make sense of it).
The readings for this week illustrate two important ideas that may be assumed, or may be just under-taught in statistics (in my opinion): 1) checking assumptions before proceeding with data analysis and 2) choosing the right statistical test. For example, for all of my statistical courses, I do not understand what “Mauchly’s Test of Sphericity” represents, apart from that you do not want a significant value in this box. Similarly, until my Mixed Model Regression course, I never understood homogeneity (and still don’t REALLY understand convergence).
To summarize this tangential and rambly post, I really enjoyed the Zuur article for breaking down these things that are discussed often, and rarely understood (by myself, at least). The only downside is that, like many statistical texts, I wish there were some supplemental programming content to show you how to test these assumptions. While the abstract mathematical formulae are useful in concept, I WOULD REALLY ENJOY SEEING HOW TO CONDUCT SOME OF THESE TESTS (I hope the hint is clear haha). In the past, I have run into so many problems trying to check assumptions as there is such variability in data and it can be tough to find a set that is similar to your own (e.g., anything with lots of categorical data).
I also thought the introductory chapters were phenomenally useful. In a small number of pages, they effectively covered appropriate tests to use, why we use the tests we do, the different types of error, overfitting, and a myriad of useful statistical ideas. While much of this is somewhat new to me, even the ideas I thought I understood were clarified. I am also excited to try to program along with the procedures in the back of the chapter.
I am not sure why it does not add my email as my name, but if this counts for participation, this is Alex Johnson
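On the request above to see how some of these checks are actually run, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the procedure from the article, which leans on graphical tools: the 1.5 × IQR boxplot fences are one common rule of thumb for flagging possible outliers, and the ratio of the largest to the smallest group variance is one rough numeric look at homogeneity of variance. The function names and all of the data are made up.

```python
import statistics


def iqr_outliers(values):
    """Return the values lying outside the 1.5 * IQR boxplot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def variance_ratio(groups):
    """Ratio of the largest to the smallest sample variance across groups."""
    variances = [statistics.variance(g) for g in groups]
    return max(variances) / min(variances)


# Made-up data for illustration only.
scores = [12, 14, 15, 15, 16, 17, 18, 19, 45]
group_a = [10, 11, 12, 13, 14]
group_b = [5, 10, 15, 20, 25]

print(iqr_outliers(scores))                 # [45]
print(variance_ratio([group_a, group_b]))   # 25.0
```

A flagged value is only a prompt to look closer (data entry error or real extreme?), and a large variance ratio is a hint that the equal-variance assumption deserves a plot before trusting an ANOVA or regression on these groups.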
I particularly appreciated the first two chapters of the textbook readings for the authors' effort to demonstrate the wide range of options that researchers have to analyze data. Specifically, I like how they describe schematic versions of certain techniques that I'm sure we'll talk about more deeply in later classes, such as least squares regression, bagging, and SVM, while not overdoing it with a mass of equations and hypothetical models. The implication that we have these options, with their various strengths and weaknesses, reinforces the idea that data science requires an element of careful thought, planning, and intuition based around the goals of the researchers. I think that many courses (and even jobs) become centered around completing an instructed task, and it's important to remember that, in many cases, you can justify a range of analytic approaches based on your goals. It's nice to see that at the beginning of a class that is probably more on the computational side than many students of social sciences are used to. The simple chart on page 25 underscores this point perfectly (I'd reproduce it here, but I'm not sure how to embed graphics at this point). It'll be great to get into exactly why, for example, a highly interpretable but inflexible model may be a more appropriate choice than one with opposite qualities. We touched on it briefly in the first class, but delving into these sorts of decisions may make data science more interesting and accessible to those who are still discovering the scope of what these tools can do for them.
The Zuur article hit on a lot of key points that I've encountered in previous statistics courses, and highlighted some interesting new ones. In the courses that I've had, we've typically been required to check for "the assumptions of null hypothesis testing," which can vary slightly based on the specific analyses involved. Many of these appear here, in what I would call a best practices article. Checking for qualities of the data like normality, equal variance, multicollinearity, and independence has been important to get full credit on previous stats assignments, but it was interesting to see why these features may degrade the quality of certain analyses on those data. However, one thing I want to mention is that we're often not taught exactly what to do when these qualities don't check out. I've usually been taught how to check for them and how to appraise the appropriateness of certain models given the situation, but I haven't really encountered a situation in which I've had to readjust my analyses completely due to data qualities, and I'd be interested to know more about what to do other than just calling it bad data. This article didn't quite get there with many of the topics it touched on, but I'm hopeful that we'll go over that in class at some point.
John Hollander
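As one concrete instance of the multicollinearity check mentioned above, the sketch below computes a variance inflation factor (VIF) in Python. It assumes the simplest case of exactly two predictors, where VIF reduces to 1 / (1 - r²) with r the Pearson correlation between them; with more predictors, r² would be replaced by the R² from regressing one predictor on all the others. The variable names and data are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for the two-predictor case: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Made-up predictors: arm span carries almost the same information as height.
height_cm = [150, 160, 170, 180, 190]
arm_span_cm = [151, 159, 172, 179, 191]
shoe_size = [39, 44, 38, 43, 41]

print(vif_two_predictors(height_cm, arm_span_cm))  # very large: nearly redundant
print(vif_two_predictors(height_cm, shoe_size))    # close to 1: little overlap
```

On what to do when the check fails: one commonly suggested response to a very high VIF is to drop one of the near-redundant predictors and recompute, rather than writing off the whole data set.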
Hi everyone -
The two themes I'm picking up on are:
1. What is the relationship between statistical learning and traditional statistics?
2. These things that go wrong with models: what do they really mean, and what can you do about it?
I've taken some notes of specifics for tomorrow; should be a good discussion.
In the meantime I suggest watching this humorous video. This isn't how statistical learning works in general (we can discuss why tomorrow), but it may help with some ideas:
https://youtu.be/R9OHn5ZF4Uo
Having taken introductory machine learning and data science courses, both the statistical learning chapter and the paper, while still difficult to grasp at times, have each provided some interesting connections with material I am familiar with or at least have been exposed to. The statistical learning chapter provided both a good review of concepts I learned in machine learning and AI such as supervised vs unsupervised learning, quality of fit, bias vs variance tradeoff, and Bayesian and kNN classifiers, and while it wasn’t quite as in depth, it provided a very intuitive overview of those concepts also suitable for anyone new to them. I have only very briefly used R for some web scraping and NLP, so the section on R was remarkably useful for familiarizing me with syntax and functionality I have not been exposed to. As for the paper, I have heard many advocates of data dredging as a valid method for finding patterns and truth in large data sets. However, I have also heard probably equally as many caution against it. In my opinion, for someone of my experience level, the paper provided what seems to be a very healthy balance of cautiously advocating for responsible data dredging in that it can be useful for hypothesis generation, but that those hypotheses should be tested independently with new data. Additionally, I had a general idea about the manner in which outliers are treated (omitting vs retaining), but this paper did a lot for me in learning what specific cases might justify omitting or retaining outlying observations. Additionally, while briefly discussing many well-known methods such as boxplot, Euclidean distance, etc., especially in regards to criticizing them, describing the utility of more complex methods that are not as readily taught was very welcome and only further elucidated the utility of this paper in enhancing better measurement and analysis method selection.
All in all, I found the paper much denser and more difficult to get through without a background in the subject matter, so while I would like to take a more critical approach in analyzing it, I was mostly just trying to absorb and grapple with the material.
I also did not figure out the naming, so this is John Britton
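The bias vs variance tradeoff and kNN mentioned above can be seen in a few lines. The sketch below is a toy k-nearest-neighbours regression in Python on made-up data (a straight line plus alternating ±0.5 "noise"), comparing error on the training points with error on held-out points for a very flexible fit (k = 1), a moderate one (k = 2), and a very inflexible one (k = 20, which always predicts the overall mean). The function names and data are assumptions for illustration, not anything from the chapter.

```python
def knn_predict(train_x, train_y, x, k):
    """Average the y-values of the k training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of k-NN predictions at xs against targets ys."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Made-up training data: the line y = x plus alternating +/-0.5 "noise".
train_x = [float(i) for i in range(20)]
train_y = [x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(train_x)]

# Held-out points sit between the training points, on the noise-free line.
test_x = [i + 0.25 for i in range(19)]
test_y = list(test_x)

for k in (1, 2, 20):
    train_error = mse(train_x, train_y, train_x, train_y, k)
    test_error = mse(train_x, train_y, test_x, test_y, k)
    print(k, train_error, test_error)
```

With k = 1 the training error is exactly zero because every point is its own nearest neighbour, yet the held-out error is worse than with k = 2, since the fit has memorised the noise. With k = 20 the model is too rigid to follow the line at all, so both errors are large. The best choice sits in between, which is the tradeoff in miniature.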
Introduction to statistical learning
I think this chapter is really great because it shows us a framework for statistical learning. There are many new concepts I need to remember and understand, e.g. supervised vs. unsupervised, regression versus classification, and the bias-variance trade-off. But I think it is a good start for learning how to analyse data and why we do so. After reading this chapter, I have a deeper understanding of my previous knowledge that linear regression is in the prediction paradigm and ANOVA is in the inference paradigm. And when selecting a method for prediction, we should make a trade-off between prediction accuracy and model interpretability. As for R, this chapter shows some simple functions that are useful.
Zuur2010-protocol
This paper is really helpful because it summarizes the problems and solutions that we need to pay attention to. Recently, while analyzing my experimental data, I screened out the outliers and ran into homogeneity problems, but I didn't use graphs to explore the data first. I think I will use the protocol on my data later.
Meng Cao