Readings for 11/7/18

Readings are up for next week 😀 In addition to the speaker's reading I've assigned some background reading on SVD and latent semantic analysis/indexing. These are closely related to PCA and are a kind of unsupervised learning.
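To make the SVD/LSA connection concrete, here is a minimal numpy sketch (the count matrix is a made-up toy, not from the readings): truncating the SVD to its top k singular values is exactly the dimensionality reduction LSA/LSI performs, analogous to keeping the top k components in PCA.

```python
import numpy as np

# Toy term-document count matrix (made up for illustration):
# rows = terms, columns = documents.
A = np.array([
    [1, 0, 1, 0],   # "human"
    [1, 1, 0, 0],   # "interface"
    [0, 1, 1, 0],   # "computer"
    [0, 1, 1, 1],   # "user"
    [0, 0, 0, 1],   # "tree"
], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# LSA/LSI keeps only the top-k singular values/vectors; this is the
# same kind of truncation PCA performs on its component spectrum.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each document now has coordinates in a k-dimensional latent space.
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T   # shape (n_docs, k)
print(doc_coords.shape)  # (4, 2)
```

Similarity between documents (or terms) is then computed in the k-dimensional latent space rather than on the raw counts.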

Comments

  1. With regard to the textbook reading, it was great to finally get a proper basic rundown of how LSI works; I was really surprised to see how closely it resembles a least-squares regression. At first glance, LSI analyses seem immediately applicable to things like search engines, and, with research design in mind, they may be able to tell us some interesting things in psycholinguistics and political science.

    I certainly didn’t expect it to turn up in biomedical research and bioinformatics, though it makes sense: there’s no reason that co-occurrence analysis couldn’t apply to any type of co-occurring units of information, including genetic sequences. The Xu et al. article outlined groups of genes based on their co-occurrent functionalities. However, I wasn’t sure whether this type of analysis can approximate the relative role and importance of each gene within each functional group. Say, for example, that one gene moderates the effect of another, or is always co-occurrent but actually inhibitory to a small degree. Would LSI analyses be able to establish these types of relationships?

    The Roy et al. article used similar analyses to derive a measure of functional cohesion within DNA microarray data in order to compare their quality against a simulated “ground truth.” This article made the (un)supervised learning distinction much more salient to me. I was especially reminded of something we talked about in class recently, about how the goal of science may resemble an asymptotic approach to truth; this sort of analysis seems like a microcosm of that. This isn’t specifically related to the DNA sequencing, but I was wondering whether some kind of collective misdirection or selective input could really skew the suggestions that coherence analysis implies. LSI coherence sounds pretty robust to a few inputs of misinformation, but if the distribution of results in a corpus were centered around an erroneous conclusion due to a common error in practice, wouldn’t this analysis count against the utility of a novel methodology that unveils a new direction in a scientific corpus?
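    The least-squares resemblance is in fact exact: by the Eckart–Young theorem, the rank-k truncated SVD that LSI uses is the best rank-k approximation of the term-document matrix in the least-squares (Frobenius) sense. A quick numpy check on an arbitrary matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))   # any matrix works here

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-k singular values/vectors.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the squared Frobenius error of the rank-k truncation
# equals the sum of the squared discarded singular values, and no
# rank-k matrix does better -- the "least squares" behind LSI.
err = np.linalg.norm(A - A_k, "fro") ** 2
print(np.isclose(err, np.sum(s[k:] ** 2)))  # True
```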

  2. Xu et al. (2011) & Roy et al.:

    These texts propose LPv as an improvement over gene set enrichment analysis (GSEA) using LSI. Unlike GSEA, it works across different categories, gene set sizes, etc. They are responding to one of the challenges that the contemporary sciences are facing: the massive production of data, research, and papers. The amount of research generated is too massive for individuals to sift through manually anymore to find trends and correlations. Gone are the days when someone like William James could have an encyclopedic knowledge of the state of the art of a particular science and its developments (that was 1890, when he published his monumental The Principles of Psychology). These authors are doing something like text mining of abstracts to try to detect trends in the functional cohesion of gene sets. This seems like it could be a powerful tool as science continues to increase its output. Larsen and von Ins (2010) estimate there were about 24,000 active refereed journals (out of 250,000) as of 2007, or 11 years ago. One estimate for the production of peer-reviewed articles in 2006 alone was about 1,350,000 (ibid.). The quality of these, of course, varies widely, and that isn’t even counting non-refereed articles. In a small discipline like philosophy, it is not possible to keep up with everything within one philosophical subdiscipline (e.g., the philosophy of cognitive science), but in smaller subdisciplines (again, phil. of cogsci) you can kind of keep up with the important material. Scaling up to significantly larger disciplines like biology, or just omics, I can’t imagine how lost one must be within that sea of research. It seems like machine learning is exactly what we need. What other methods, machine learning or otherwise, are there to keep up with new research and keep track of literature-wide trends?

    Manning & Schütze (2000):

    I’m wondering if something like this could have any application to humanities research. It probably already does, and I just don’t know about it.

  3. Textbook

    The discussions of vector similarity and LSI in the textbook were both new concepts to me. I found them interesting and would like to hear more about them. I do not think I have a complete understanding yet, but I felt that this was a good starting point, and it has piqued my interest for the future. The process did have some similarities to methods I am already familiar with, and this helped in understanding parts of the assigned articles.

    Xu et al 2011

    Both of these articles were also completely new concepts to me. As I read them, I kept thinking back to last week and the crowdsourcing article. The articles discussed how “numerous statistical approaches have been developed to identify differentially expressed genes” (Xu 2011, p. 1). I am not sure to what extent the information gathered from these processes is being used; however, it concerns me very much that the results might differ depending on the statistical technique used to conduct the test. The Xu article even says “there was a big discrepancy in gene sets produced by different algorithms” (p. 1). Do the authors of Xu et al. (2011) believe that their process works better than the others? They claim that it is a robust method. In particular, they point out that there are “numerous statistical methods” (p. 7) available for use, each with different assumptions. Is there a set way that these researchers could make this process better? As usual, I ask: what are the implications of this claim? How does it affect future research and implementation?

    Roy et al 2018

    Again, it is claimed that “literature cohesion analysis is useful for evaluating the quality of probes and microarray datasets” (p. 1). Both of these articles really push for a literature-based method. What are the other methods besides literature-based ones? Did I miss this in my reading, or are they all variations of different literature-based processes? Ultimately, Roy et al. (2018) claim that “lack of appropriate quality benchmarking will lead to false discovery and hinder biological interpretation of the data” (p. 9). What is the impact of this? What does this mean for this area of science? What are the “real-life” implications, and how does it affect human development and progress?

  4. Manning & Schütze (1999)
    Vector Space Model:
    I found this section very useful and comprehensible. The idea of representing documents as vectors in a multidimensional word space and measuring similarity based on term frequency made a lot of sense. I have come across the idea of TF-IDF scores before, in the context of sarcasm classification (identifying authors’ most salient terms using TF-IDF), but this is the first time it made sense. This book, in general, seems very useful.
    As an aside, if anyone else is interested in the supplemental content, the new book website (for those who feel this info may be useful someday) is: https://nlp.stanford.edu/fsnlp/
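    For anyone who wants to see TF-IDF in miniature, here is a sketch using the simplest tf and idf variants (toy counts; real systems often use log-scaled tf or smoothed idf):

```python
import numpy as np

# Toy term-document counts (illustrative, not from the chapter).
#                  d1 d2 d3
counts = np.array([
    [3, 0, 1],   # "cosmonaut"
    [0, 2, 0],   # "astronaut"
    [1, 1, 4],   # "moon"
], dtype=float)

n_docs = counts.shape[1]
df = (counts > 0).sum(axis=1)      # document frequency of each term
idf = np.log(n_docs / df)          # rarer terms get higher weight
tfidf = counts * idf[:, None]      # term frequency scaled by idf

# "moon" occurs in every document, so its idf is log(1) = 0 and it
# contributes nothing to distinguishing documents:
print(tfidf[2])  # [0. 0. 0.]
```

    Note how a term that appears in every document gets weight zero everywhere, which is exactly the point of the idf factor.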
    Latent Semantic Indexing:
    This section was also incredibly clear and informative (these authors do a great job breaking things down into layman’s terms). LSI seems like an interesting approach, too, in reducing dimensionality. While PCA, factor analysis, and other dimension reduction techniques make some intuitive sense with quantitative values, these pseudo-categorical dimension reduction techniques are very interesting.
    I wonder how this might relate to the relatively uncommon procedure of correspondence analysis (or, more related, multiple correspondence analysis). In MCA, categorical co-occurrences of features are used to create this similar dimension reduction, though I’ve always had some issue understanding the principles underlying it. Might this be one extension of a comparable idea?

    Xu et al. (2011)
    This tool for automatically scraping and analyzing MEDLINE abstracts and titles seemed like a very interesting application of LSI. While I do not know much about the microarray field, or the challenges faced by it, GCAT seems like a very useful tool. They also did a great job explaining the results and why certain patterns were observed (e.g., cosine values for well-studied genes being lower, given their roles in many functions). Without more knowledge about microarrays, I’m curious how susceptible this is to methodological issues and bias. This approach of looking for pseudo-semantic-similarity (i.e., with gene-term vectors instead) seems like it may overcome statistical limitations.
    It does seem like it may be susceptible to theoretical bias, though. That is, it seems great for summarizing what is there, but could not be used to get at other limitations that I’m sure affect microbiology (as they do all fields), such as failing to consider how genes may contribute to functions that aren’t yet mainstream (i.e., it may better represent biological dogma than actual gene function). That is not to say this isn’t tremendously useful, just pointing out that I do not see this as a replacement for traditional reviews.

    Roy et al. (2018)
    Using the same GCAT system, they produced some interesting findings regarding a more specific application of this to identify errors in “probe stuff” (honestly, no idea what is going on with these procedures). I thought it was very interesting that they were even able to identify a couple companies whose probes are targeting the wrong features, as well as one that seemed like it had messy activation/measurement. I thought the idea of GCAT was very interesting and the use of it to identify problematic methods and materials seems like an even more compelling application.

    Replies
    1. Interesting. This seems to be a good reference for MCA: https://www.utdallas.edu/~herve/Abdi-MCA2007-pretty.pdf

  5. Textbook:
    This reading was interesting, as it is really my first time being exposed to in-depth information about this topic. While I am not sure I fully understand everything they discussed, I liked the detailed explanation of term weighting; I probably would have been one of the ones to just use the count of a term in a document as the term weight, so it was interesting to see the other ways of doing that.

    Xu et al. (2011) and Roy et. al. (2018)
    I believe these articles/methods could be very useful, especially in a field like genetics, which I am sure has a lot of dense research. I am curious how easily the web tool they developed can handle lots of new information, as the medical field is quickly advancing, especially in areas like genetics. I also wonder what other fields this idea could be useful in. Also, while both articles praise the literature-based approach they use, I am not familiar with the other methods that could be used instead.

  6. Textbook:
    From the description of the vector space model, it seems to have super beneficial implications. It also seems like it would span several different domains; is this the case? I was also under the impression, from the reading, that it’s a pretty dynamic model in that it can look at a number of different frequencies, which again, I think could have a lot of advantages. As far as LSI—I’m not super familiar with it, but I wonder how complex it is. For a little while, the author mentions mappings, and I wonder if the automaticity of this could have drawbacks of some sort?

    Roy et al. (2018):
    A lot of the terminology of this article seemed a bit over my head, but the application of literature-based methodology did spark my interest, as I have never done this before or heard of it being done. It made me wonder why these methods were chosen as opposed to others. And what about them makes them so trustworthy in comparison to other techniques?

    Xu et al. (2011):
    The co-occurrence analysis, specifically in the field of biomedical research, seems like it would have a lot of limitations. Similar to my comment on the textbook readings, I wonder why, out of all the analyses available, LSI was chosen. Either way, it was interesting to see what results it rendered.
    In general, I wonder whether, if some of these statisticians got together and compared their different techniques and models, anything beneficial would come of it—that is, comparing the different types of models against the same data.

  7. Textbook:
    This was my first time seeing anything about vector space and LSI, and I felt like this text was a pretty concise introduction to these topics.

    Xu 2011:
    Right away when reading the abstract, the sentence about how “numerous statistical approaches have been developed…” to test the genes immediately reminded me of last week’s reading, where the 29 analytical teams all used different methods to analyze the same data. This research did seem like it was a good use of LSI, from what I understand of it, and I very much like looking at real world implications of research. It can still be helpful and cool when it is more theoretical, but it tends to feel more important to me when it deals with something that can be helpful to people or society, such as the relationships between gene sets which could be beneficial in a medical context.

    Roy 2018:
    Like some others have mentioned, I found the literature basis of this article to be interesting. Additionally, I greatly enjoyed scrolling through the numerous figures at the end of the article. Since I tend to learn better visually, it is very helpful for me to see those patterns in the data. Since the actual terminology is beyond my understanding, I don’t know if I entirely grasp the importance of this work, but I am sure the presentation this week will assist with that part.

  8. The Manning paper talks about Vector Space Modeling. I didn’t realize there was this much math that went into calculating the distances. Can LSI be visualized in a four-dimensional space? Is there an advantage to portraying LSI in a three-dimensional space versus a two-dimensional space?
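    On the dimensionality question: the cosine similarity these models use is computed the same way in 2, 3, or 200 dimensions, so 2-D or 3-D plots are only for human inspection, not for the analysis itself. A small sketch with made-up document vectors:

```python
import numpy as np

def cosine(u, v):
    # Cosine of the angle between two vectors: 1 = same direction,
    # 0 = orthogonal (no shared terms). Works in any dimensionality.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Three toy documents as term-count vectors over a 4-word vocabulary.
d1 = np.array([2.0, 1.0, 0.0, 0.0])
d2 = np.array([1.0, 2.0, 0.0, 0.0])
d3 = np.array([0.0, 0.0, 3.0, 1.0])

print(round(cosine(d1, d2), 3))  # 0.8 -- overlapping vocabulary
print(round(cosine(d1, d3), 3))  # 0.0 -- no terms in common
```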

    The Xu paper uses LSI to identify differentially expressed genes. They used LSI to build a model using thousands of mouse genes and millions of studies. At first it seemed odd to me that LSI could be applied to genes, but the paper explained it really well. It’s very creative that the authors are applying a method to genetics that was originally invented to analyze words and language. I wonder if this method will become more popular and widespread in the genetic and biological sciences.

    The Roy paper looked at Sirt3 genes coexpressed in the brain or liver of mice. It was pretty cool seeing how literature-based methods can be used to analyze gene expression. Why did the authors focus on Sirt3 and not other genes? Would this approach work with other genes too, or just Sirt3? The coexpression of the gene in tissues associated with Huntington’s disease, Parkinson’s disease, and Alzheimer’s disease shows that this is worth investigating. Are there other methods that are as good as this LSI approach?

  9. Textbook:
    In the vector space model, how do we deal with synonyms when considering term weighting? For example, automobile, vehicle, and car mean the same thing. It is said that LSI performs better than vector space search in many cases. Does this mean that the plain vector space model is useless?
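    The synonym problem is exactly what LSI is meant to address, which is why it outperforms plain vector space search without making it useless. A deliberately tiny sketch (made-up counts; k=1 collapses every term onto the single dominant topic direction):

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy counts: "car" and "automobile" never share a document, so the
# plain vector space model sees them as unrelated -- but they share
# context words ("engine", "wheel").
#                d1 d2 d3 d4
A = np.array([
    [1, 0, 1, 0],  # car
    [0, 1, 0, 1],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # wheel
], dtype=float)

print(cosine(A[0], A[1]))  # 0.0 in raw term space

# Truncated SVD projects terms into a latent space; with k=1, terms
# are placed along the single dominant "topic" direction.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1
terms_k = U[:, :k] * s[:k]

# Shared contexts now pull the synonyms together (cosine near 1).
print(cosine(terms_k[0], terms_k[1]) > 0.99)  # True
```

    In raw term space the synonyms are orthogonal; in the latent space their shared contexts make them nearly identical.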
    Xu 2011
    This is a good application of the LSI method in genetics. The authors said that a few genes had many abstracts, with the maximum number reaching 2,923. The textbook also said that the appropriateness of LSI depends on the document collection. How will this influence the results?
    Roy 2018
    This research also used the literature-derived cohesion p-value. I know little about genetic research, but I am wondering whether we could use literature-derived cohesion p-values in psychological research. What would that look like?

