Readings for 10/3/18

Just put the readings up for next week; please post your comments by 10/2/18 at noon.

During the lab portion we'll keep working on the exoplanet data :)

Comments

  1. Ozdenerol, Bialkowska-Jelinska, & Taff (2008):

    I’m rather unfamiliar with GIS. I have a friend who uses GIS to support indigenous land rights in Suriname—so I know it has a wide variety of interesting applications (like the mosquito-busting in this paper). Can we talk a bit about what GIS is, what applications it has, and how it relates to the big data revolution? Is Google Maps a GIS application? How do business applications of GIS differ from scientific ones? What is a “raster-based model” (p. 2)?

    McNeish (2015):

    This is about how the Lasso leads to better statistical inferences than more commonly-used techniques in the behavioral sciences. I’m really interested in the quality of statistics in the cognitive sciences. What would be really helpful to me is to see a few exemplars of mediocre statistics in actual cogsci papers. McNeish gives one example derived from a test score dataset. It looks like the stepwise technique yielded a type I error for a couple of the predictors. What’s an example of e.g. large type I errors in a published paper that could be avoided with a more advanced statistical technique? Were questionable statistical inferences made? What consequences are there for the paper’s conclusions?

  2. Ozdenerol et al. (2008) stated that Shelby County is the only county in TN in which West Nile Virus has been present each year since its onset in 2001. I wonder why; maybe it is a lack of stats from the other counties in TN. McNeish (2015) stated that Lasso is not frequently used in behavioral sciences research. It also indicated that Lasso is identical to OLS when the penalty is zero. The drawback of Lasso is that it operates on all predictor variables simultaneously and cannot accommodate multi-parameter factors, like ethnicity.

  3. The Ozdenerol article displayed some really cool visualizations; I’m always jealous of GIS researchers and their cool mapping tools. The use of Mahalanobis Distance as the primary statistic of analysis is new to me. I’ve only ever used it to check the assumption of multivariate normality when trying to do a MANOVA, but it’s really cool to see statistics used in ways that I’ve never seen before; it reminds me that deciding how to tackle a broad problem with such a varied statistical toolkit is all still a bit of an art, which I think provides some perspective with regard to that “role of the researcher” discussion we seem to have every week. This article also demonstrated some pretty amazing computational power. From what I understand, each pixel they analyzed was 30 square meters, and some quick Googling tells me that Shelby County is 785 square miles, which is about 2.03 billion square meters, which means that there were roughly 68 million analyses run. My computer couldn’t do that. Anyway, what other kinds of statistics could the authors have used to a similar effect? Could they have used random forest decision trees (or anything else) to map the likelihood of infected mosquitoes just as (if not more) effectively?
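
    A quick check of the area arithmetic in the comment above. This is only a sketch: the 785 square miles and the 30 square meter pixel are the figures quoted in the comment, not values verified against the paper, and the second estimate assumes (hypothetically) that the pixels are 30 m on a side instead.

```python
# Rough pixel-count check. The inputs are the figures quoted in the comment,
# not values verified against the paper.
SQ_METERS_PER_SQ_MILE = 1609.344 ** 2  # one mile is exactly 1609.344 m

county_area_sq_miles = 785   # quoted area of Shelby County
pixel_area_sq_meters = 30    # assumption: 30 square meters per pixel

county_area_sq_meters = county_area_sq_miles * SQ_METERS_PER_SQ_MILE
pixel_count = county_area_sq_meters / pixel_area_sq_meters

# Alternative assumption: pixels are 30 m on a side (900 square meters each).
pixel_count_30m_side = county_area_sq_meters / (30 * 30)

print(f"{county_area_sq_meters:,.0f} square meters")
print(f"{pixel_count:,.0f} pixels at 30 square meters each")
print(f"{pixel_count_30m_side:,.0f} pixels at 30 m x 30 m each")
```

    Either way the count is in the millions, because square miles convert to square meters by the square of the mile-to-meter factor.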

    Formulas are usually a bit off-putting to many behavioral scientists (especially those outside of computational psych or behavior analysis), but I thought that the McNeish article did a wonderful job in highlighting why the case was being made at all for Lasso regression. Of course, it’s not entirely accessible to those who have never really talked about regression in-depth before, but I think the section on why OLS regression can be suboptimal was succinct. Most importantly, I think Table 2, the one comparing OLS and Lasso p values really summarized the article well, and that closer inspection of this particular table can help conceptualize the nature of the entire article. The table clearly shows that Lasso tends to be more conservative, with each p value being greater for Lasso than its counterpart derived from OLS. While I think more conservative tests in general are usually better practice, looking at the variables in this table makes it more clear how this distinction comes into play without the need for deeply understanding the formulas. Variables on the table that are conceptually (and numerically) closely related to other variables on the table tend to have larger differences between the Lasso and OLS p values. For example, the number of computers per student, total number of computers, and total enrollment each have fairly large p differences. These variables are all closely related; they are literally a ratio of each other, and as a result, it’s easy to see that they should be highly collinear. Referring back to the values on the table, it becomes clear that OLS regression overestimates the significance of each of these variables, but Lasso controls for their shared influence and provides a better estimate of their individual influences. I think that the most conspicuous question to ask pertains to why the behavioral sciences haven’t implemented Lasso methods, part of which I think is addressed in the discussion section. 
    McNeish mentions that Lasso can’t appropriately handle multiparameter factors, i.e. categorical variables in which more than two individual outcomes are possible. While certain caveats are introduced in order to adapt to certain conceptual implications of this, I was wondering again about hierarchical regression. Could you, for example, combine OLS and Lasso hierarchically, such that multiparameter factors are controlled for by OLS, after which Lasso regression is applied? Would this account for the most possible variance in cases where each model’s strengths can be maximized and weaknesses minimized? This may be completely unfeasible mathematically, or even conceptually; I’ve never really gotten into it this deeply or learned anything about mixed models before, but it sounds like Lasso is fairly extensible.

  4. Ozdenerol, Bialkowska-Jelinska, & Taff (2008)

    For three years, I taught a course called EAST. EAST, at the time, stood for Environmental and Spatial Technology. It was a student-led and student-driven class that used technology to improve communities. Students came up with projects to help solve problems in their communities. I know very little about the use of GIS/GPS; however, my students used software called ArcGIS. This software had many different versions, such as ArcMap, like what is mentioned in the paper. One of my favorite projects my students did with this was to work with the Arkansas Game & Fish Commission to map fish species and the popularity of fishing at area lakes and rivers, to determine future steps for AGFC and hopefully find areas that could be improved economically with more businesses, etc. This article seemed familiar in the sense that, although the overall goal was different, many of the processes were the same, and it would actually make for a very good, and feasible, EAST project. The findings of this experiment make sense, and I think this is a good use of technology that can help improve the well-being of many people. I think this model could be applied to many different scenarios. I would like to see more about how to address this problem or problems like it (this may exist; it just may not be in the context of this paper). My big question as I was reading was: how do you account for human and mosquito movement, and how much did that affect the results of this study?

    McNeish (2015)

    As I have said before, I need to grow in my statistical understanding. Like Zak mentioned, I understand that the purpose of this paper is to address why Lasso leads to better inferences than other techniques that are used more often. The paper calls into question why Lasso and all of its advantages are not being utilized, and compares and contrasts the processes. It makes me wonder if this might have to do with the processes that are taught to students when they are learning how to do research. Like someone said (I think last week), some students are taught “this is the way you do this and this is the best way.” Maybe this thought process is why we don’t see Lasso as often. Maybe new researchers are not being given the ability and foundational knowledge to make judgments, or maybe new researchers fear deviating from a set standard that has been the norm. I am not sure if that is true or not, but it’s a thought.

  5. Ozdenerol, Bialkowska-Jelinska, & Taff (2008)
    This is interesting research and the method is totally new to me. I am wondering why the researchers only used the data from August and September, since traps with infected mosquitoes were found from June to September.
    McNeish (2015)
    I think maybe in behavioral science researchers want to get significant results, so they ignore the overfitting issue. The writer said that Yuan and Lin (2006) addressed the problem of selecting sets of predictors (e.g., multiparameter factors, linear and quadratic components) with metric outcomes using the group Lasso, which allows sets of parameters to be selected together and also penalizes them together. What should we do with multiparameter factors (e.g. ethnicity)? Do we only need to use the grplasso R package?

  6. Ozdenerol 2008
    I didn't know what GIS was before I read this paper, other than that it was a geological system of some sort, and I was pretty sure things like Google Maps use it. I think that this was a good paper to introduce some of the important uses of GIS to me. Recording different health (and illness) patterns is an important thing to do to help ensure public health, and using GIS to map those trends creates a more easily understandable visual of the data you have. I was thinking about this in comparison to how we often plot other data, such as our statistics, in order to quickly see if there are any sort of relationships between variables. This is similar in the sense that it allows us to clearly see if there are clusters of data in any geographical location, rather than on an x-y coordinate plot. Another thing that I thought was pretty cool is that the county already had the data that was given to the researchers - it didn't cost anything extra to gather information about the mosquitos; the researchers just took the available info and mapped it in a way that let city officials see where they should target their mosquito-busting efforts.

    McNeish 2015 - The Lasso technique is not something I have personally had experience with. I thought the data towards the beginning of the paper was interesting - how articles are still being published using statistical techniques that have been shown to be inferior to others such as Lasso. I feel that the authors explained why this is the case pretty well, and being a social scientist, I think that it's important to realize that I might not always (or often) know the best statistical method to use on my particular data. I think it was smart of the researchers to intentionally publish this article in a journal that psychologists and similar researchers will read, since I'm sure most of us don't keep up with advanced mathematics and statistics journals. What I'm wondering is how we should be telling social scientists about better techniques to use. Should researchers follow suit and publish in journals that social scientists will read, or is there a better way to teach researchers about these methods?

  7. For the Ozdenerol paper, I thought using GIS data as features for a model to track the potential spread of a virus is a very interesting real-world, concrete example of implementing statistical learning. With some uses, it is more abstract or difficult to see the impact on the world, but this kind of study is a highly salient, practical use of these kinds of techniques. I had actually never heard of Mahalanobis distance and thought it a very cool way to measure similarity across a set of variables. I suppose something like Mahalanobis distance is necessary when the variables have very different variances and ranges of values. The study did mention that the technique is effective for determining the spread of vector-borne diseases. I’m curious what kinds of diseases this model would not be useful in investigating. I would think diseases that aren’t spread by insects or that don’t have as many environmental factors, but I’m curious how we could tweak their model for contagious diseases or diseases that spread by other means.
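
    To illustrate the point about variables with very different variances, here is a minimal sketch of Mahalanobis distance for two variables. The data are made up for illustration (a hypothetical elevation in meters and a 0-1 vegetation index) and are not taken from the paper; the function names are likewise just illustrative.

```python
import math
import random

random.seed(0)
# Hypothetical site data (not from the paper): two variables on very
# different scales, e.g. elevation in meters and a 0-1 vegetation index.
n = 200
elevation = [random.gauss(80, 15) for _ in range(n)]
greenness = [random.gauss(0.5, 0.1) for _ in range(n)]

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

m1, m2 = mean(elevation), mean(greenness)
s11 = covariance(elevation, elevation)
s22 = covariance(greenness, greenness)
s12 = covariance(elevation, greenness)

def mahalanobis(point):
    """Distance of a 2-D point from the sample mean, scaled by the covariance."""
    d1, d2 = point[0] - m1, point[1] - m2
    det = s11 * s22 - s12 ** 2
    # Quadratic form d^T S^-1 d, using the closed-form inverse of a 2x2 matrix.
    q = (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.sqrt(q)

def euclidean(point):
    """Plain distance from the sample mean, ignoring variances."""
    return math.hypot(point[0] - m1, point[1] - m2)

# Two sites, each one sample standard deviation from the mean on one variable.
site_a = (m1 + math.sqrt(s11), m2)
site_b = (m1, m2 + math.sqrt(s22))

for name, site in [("a", site_a), ("b", site_b)]:
    print(name, round(euclidean(site), 3), round(mahalanobis(site), 3))
```

    The Euclidean distances differ by orders of magnitude only because of the units, while the Mahalanobis distances treat the two sites as equally unusual.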

    The McNeish paper seems to somewhat get at the difference between a predictive model and actually inferring causality from the features used in its critical analysis of statistical methods. It probably has more to do with my current knowledge state and the authors of this paper, but I really like that they analyzed statistical techniques like OLS in terms of bias and variance. A long time ago when I first learned about methods like OLS, it was not at all framed in this way. But since then, having taken a machine learning course and reading papers for this class, I am much more familiar with concepts like the bias-variance tradeoff, which I think is an intuitive way to analyze these statistical techniques that was unavailable to me. While I have heard of Lasso in the past, I knew nothing about it. It seems superior to OLS in every way since it can actually be OLS when lambda is 0 but it is also far more flexible when variables need to be selected out as well as preserving the larger effects of seemingly more important variables. It is curious that Lasso is something seldom implemented in behavioral sciences. I suppose that is the purpose of this paper, to familiarize people outside theoretical statistics with Lasso, but I wonder if the reason for its scarcity in behavioral sciences is simply a matter of it not being disseminated properly among behavioral science researchers, or is it instead that many behavioral science researchers like to stick to the methods already familiar to them and seldom seek potentially better but less familiar methods.
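
    The point about lambda can be sketched with a small coordinate-descent Lasso. This is a textbook-style sketch on made-up data, not the code from McNeish's online supplement; it assumes centered predictors and no intercept, and the function names and the penalty values 0.3 and 5.0 are arbitrary choices for illustration.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize (1/2n) * sum of squared residuals + lam * sum(|beta_j|)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # y - X @ beta, with beta starting at zero
    for _ in range(n_sweeps):
        for j in range(p):
            column = [row[j] for row in X]
            z = sum(x * x for x in column) / n
            # Covariance of predictor j with the residual excluding its own fit.
            rho = sum(x * (r + x * beta[j]) for x, r in zip(column, residual)) / n
            new_beta = soft_threshold(rho, lam) / z
            # Keep the running residual in sync with the updated coefficient.
            delta = new_beta - beta[j]
            residual = [r - x * delta for x, r in zip(column, residual)]
            beta[j] = new_beta
    return beta

random.seed(1)
n = 500
# Hypothetical data: x1 has a large effect, x2 a small one, x3 none.
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n)]
y = [2.0 * x1 + 0.5 * x2 + random.gauss(0, 1) for x1, x2, x3 in X]

beta_ols_like = lasso_coordinate_descent(X, y, lam=0.0)  # no penalty: OLS fit
beta_lasso = lasso_coordinate_descent(X, y, lam=0.3)     # moderate penalty
beta_huge = lasso_coordinate_descent(X, y, lam=5.0)      # very large penalty

print("lambda = 0.0:", [round(b, 3) for b in beta_ols_like])
print("lambda = 0.3:", [round(b, 3) for b in beta_lasso])
print("lambda = 5.0:", beta_huge)
```

    With lambda at 0 the soft-threshold step does nothing, so the updates converge to the least-squares solution; a moderate penalty shrinks the large coefficient and drops the irrelevant predictor, and a very large penalty zeroes everything out.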

  8. Ozdenerol et al. (2008)
    I felt this article was a really fascinating use of GIS. I also really enjoyed that they go into so much detail without speaking in dense mathematical “quantitative-ese.” They explained what they did, why, and how, and reported results in a very matter-of-fact way that I found very refreshing. I also found some of the indices they used very interesting. For example, using the proportion of vacant houses, renter houses, and owner-occupied houses to represent another component of SES struck me as a unique approach (though I have little familiarity with geography).
    I feel the element of space is frequently omitted in psychology and I wonder what applications it may have, as there are some very different methodologies. For example, my fiancé does archaeological prospection (looking for archaeological sites without digging), and one approach they use is to put different layers on a map representing standardized z-scores of different readings, such as mapping levels of phosphate in the soil (which has archaeological significance), mapping magnetometer data (another interesting tool), and then plotting the residuals from these two methods to look for patterns that could be explained by some other phenomenon (e.g., sometimes the absence of signatures means people swept). Since hearing about this, I have been curious about applications of geographic systems in psychology.

    McNeish (2015)
    This article presented an interesting idea, though I could not quite follow completely. They unpacked terms down to the level that one needs an intuitive understanding of ordinary least squares and loss functions to really understand it, which I do not have, regrettably. I remember some discussion of Lasso from a talk Andrew gave a year or two ago, but I knew even less about regression, then. I feel another challenge here is the way stats are taught across disciplines. In psychology, most of my statistical understanding is tied to some representation of the variables and measures (e.g., you use ANOVA when you want to compare > 2 groups or levels, MANOVA if there is more than one DV, ANCOVA if you wish to control for a continuous variable/s), instead of a deeper understanding of the statistics underlying the procedure. I wonder what the case would be for choosing Lasso over traditional regression? Does it perform equally well on logistic regression problems (e.g., binary DVs)?
    I hope to re-read this article before class and try to apply it to data I understand to see what it does (I am very grateful for their multi-lingual code in the online supplement to use in SAS). It certainly seems like an interesting solution to the problem of stepwise regression. While I only know that it is problematic as it provides an avenue for “researcher degrees of freedom,” I do know that there has been a great deal of negative sentiment toward it.

  9. McNeish (2015):
    In comparing Lasso to ridge regression, it seems as though the biggest difference between the two is the regularization term: Lasso penalizes the absolute values of the coefficients, which can shrink some of them to exactly zero. So is Lasso seen as a step up from ridge regression, or is the choice between them determined solely by the data set and what the researcher wants to obtain from it? Basically, do researchers still use ridge regression, or has it been retired in favor of Lasso? It seems that with both models you’re making some sort of trade-off that can end up being pretty advantageous in the results. Also, given the major drawback of not being able to accommodate multi-parameter factors, what do researchers do in that case? How is this problem resolved? I agree with John that its use in behavioral science seems pretty small, because I’ve only ever heard of the method itself, not of it being used, and I wonder why that is.
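
    For reference, the two penalized objectives in their standard textbook form. The notation here (n observations, p predictors, tuning parameter λ) is an assumption for illustration and is not copied from McNeish; scaling conventions for the squared-error term vary between sources.

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\;
  \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  + \lambda \sum_{j=1}^{p} \beta_j^{2}
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\;
  \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert
```

    The only difference is the penalty: squared coefficients for ridge, absolute values for Lasso. The absolute-value penalty is what allows coefficients to be set exactly to zero, whereas ridge only shrinks them toward zero.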

    Ozdenerol (2008):
    Before reading this article I had never heard of Mahalanobis Distance. Is this a type of analysis that is predominantly used with geographical data, or could it be used alongside other methods? It is neat to be exposed to articles that apply statistical analyses to domains outside of psychology. But it makes me wonder whether all these analyses are spread across the different domains or whether specific methods are used for particular types of study. Additionally, I had never explicitly been aware of what GIS was; I’ve only ever heard vague references to it, and I thought the article did a thorough job of breaking it down.

  10. Ozdenerol 2008: This study tried to use environmental data to model areas where mosquitos with WNV were likely to be. In modeling habitat suitability it initially seemed odd to me that they'd use p-values of 0.5 for 'moderately suitable', but I guess they just wanted to be safe and were OK with a bunch of false positives as long as they didn't miss infected areas. They validated their model by comparing it to the locations of only 7 infected people in August; is that enough? This study was done in 2008, and WNV doesn't receive the media attention it once did. Does it infect as many people in Shelby County as it once did?
    McNeish 2015 - What happens when the Lasso penalty is too large? It's surprising to me that Lasso is so recent an introduction to science (1996). It also surprised me that Lasso wasn't more widely used in psychology and education. I didn't know that Lasso couldn't be used on all predictor variables at once.
  11. Ozdenerol, Bialkowska-Jelinska, & Taff (2008) - This is a very meaningful study. It is a good illustration of the factors influencing WNV in Shelby County. This was the first time I had seen the method used in the study. My question is whether it is reasonable to use income to explain the incidence of WNV. It is also possible that as individual incomes increase, people move out of densely populated areas to seek a larger, better living environment, resulting in individuals with lower incomes remaining in the area of possible infection.
    McNeish (2015)-For me, this is a new approach that I have never seen before. It is very instructive for future research.
