tladeras’s Twitter Archive—№ 6,450

  1. I think bioinformaticians need to have a really strong shell scripting background. We underemphasize the importance of this (in terms of time spent learning) compared to R/Python. This is especially important when you are running cloud jobs and need to glue code together.
    1. …in reply to @tladeras
      Running on a remote worker is an indirect process, and I find that shell scripting is essential on a worker where you have to bring your software and dependencies to it. Both that and learning how to reproducibly manage software dependencies are real gaps.
      1. …in reply to @tladeras
        Yes, the technologies shift, but we should probably spend a couple of sessions just on reproducibly managing software dependencies. This would include container-based tech: both using existing containers and extending them.
        1. …in reply to @tladeras
          I often joke that bioinformatics is about transforming one file format to another. There's usually a lot of glue involved in doing that reproducibly, and I think we underestimate its importance in training bioinformaticians.
          1. …in reply to @tladeras
            I think part of this is the curse of knowledge: for a lot of us, our knowledge of Linux/shell scripting was self-taught and hard-learned. We underestimate how much time (and trial and error) it takes to learn these skills.
            1. …in reply to @tladeras
              For solid foundations, I highly recommend both the Carpentries materials (swcarpentry.github.io/shell-novice/) and The Missing Semester of Your CS Education (missing.csail.mit.edu/).
              1. …in reply to @tladeras
                But we also need applications courses for bioinformaticians that go well beyond this material. I don't know what that looks like, but I'd be willing to work on it.
                1. …in reply to @tladeras
                  Just a quick note: when I’m discussing the importance of shell scripting, I’m really mostly talking about secondary analysis: the running and batching of aligners/variant callers, etc.
                  1. …in reply to @tladeras
                    Running GATK on one file on your local computer is one thing, but automating a pipeline to run on 200K files on cloud/HPC is another thing entirely.
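
To make the gap between "one file" and "many files" concrete, here is a minimal sketch of the batching glue, assuming each input can be processed independently by one command. The function names and the `{input}` / `{stem}` placeholders are hypothetical choices for illustration, and the template is a stand-in, not real GATK syntax. A real cloud/HPC run would usually hand this job to a scheduler or workflow manager rather than a local thread pool.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(template, sample):
    """Fill the {input} and {stem} placeholders of a command template for one file."""
    # Split first, then substitute, so paths containing spaces stay a single argument.
    return [part.format(input=str(sample), stem=sample.stem)
            for part in shlex.split(template)]


def run_one(template, sample):
    """Run the command for one file and return (file name, exit code)."""
    result = subprocess.run(build_command(template, sample),
                            capture_output=True, text=True)
    return sample.name, result.returncode


def run_batch(template, samples, max_workers=4):
    """Run the template over every file, a few at a time, and report failures."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as the inputs.
        results = list(pool.map(lambda s: run_one(template, s), samples))
    failed = [name for name, code in results if code != 0]
    return results, failed
```

For example, `run_batch("echo {input}", sorted(Path("data").glob("*.bam")))` would run a placeholder `echo` once per file in a hypothetical `data` directory. Collecting the failed names matters more than the parallelism: at 200K files some jobs will fail, and you need to know which ones to rerun.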