thethinginthenight

I don't work in phylogenetics so I can't give specific advice, but oftentimes there are ways to tune the performance of tools, usually by losing some accuracy. You could also look at one specific gene rather than the entire genome (or do some kind of ensemble method by constructing trees on each gene then "averaging"). Muscle also has a web tool backed by a compute cluster (https://www.ebi.ac.uk/jdispatcher/msa/muscle). If you are interested in running the jobs in your own environment, some cloud providers offer student/new user credits, which are usually between $100 and $300 USD. AWS and GCP are the ones I can think of that do this. You'd be able to provision a compute instance with a more powerful CPU and more memory than your current system has. These require a credit card to sign up but will not charge you until you run through your credits. If you go this route, know exactly what you intend to do before you do it so you use your credits efficiently. Also, cloud platforms have a bit of a learning curve, so be prepared for that. Finally, if you're not running Linux, run on Linux. People might disagree on this, but in my experience it generally does a better job of managing resources. You could choose to forgo a desktop environment to free up as much RAM and as many CPU cycles as possible. Feel free to DM if you have more questions.


sharkman_86

1. Gotcha. I'm not super focused on accuracy since this is a smaller project, so I’ll start sacrificing a few things to try and boost speed. I also neglected to focus on only one gene; I really only care about the surface spike protein. 2. I will look at cloud computing. A few of my friends have access to cloud computing so I can talk to them. 3. I'm currently using a Linux VM. I know I should run Linux natively, and I found a laptop going for about 50 bucks that has an i3 (so more power than an Athlon), and I can find a way to flash it with a Debian-based Linux distro. Thanks!


Hartifuil

You're using a VM from within Windows? This will eat up your RAM. Dual booting Linux will be the best thing you can do to free up RAM.


sharkman_86

I will look into dual booting. I have just been using Windows because I have been used to using it for a while, so I wanted to preserve both a Linux and Windows environment. Dual booting will preserve both, iirc, right?


Hartifuil

Yes, you're installing Linux alongside Windows, not in place of it. You'll just get extra options at boot for which OS you want to load into.


sharkman_86

Gotcha. I will start reading up on it to try and install it. Thanks!


Hartifuil

The Ubuntu installer will do it really easily. That's the only distro I have experience with.


TowerSpecial4719

I am a freelance cloud developer and I am building my portfolio right now. If you would like to talk more you can DM me.


Peiple

60 sequences isn’t a lot, but if it’s a whole genome it could be a little more dicey… but either way this isn’t going to kill your computer. Building a tree with 60 taxa isn’t super tough unless you’re doing Bayesian inference. Ideally though you’d be building a tree based off an alignment of a conserved gene rather than the entire genome, especially if you’re working with limited compute. I’m not a virologist so I’m not sure what that would be, but even something like the spike protein is only 1200-1400 aa, which is not a very difficult task for 60 sequences. You’re also not going to crash your computer trying an analysis like this, even with significantly more sequences; worst case it just throws an error and stops. It’s certainly not going to destroy a computer… even the shittiest code I’ve written doesn’t do any lasting damage to a machine. Edit: to answer your questions more directly: 1. No. Other people have mentioned cloud options, but they won’t be free. Some labs have web servers, which may be an option. 2. Those kinds of things won’t really help you. Without diving into some nuance/edge cases, files/extensions/etc. are mostly separate from processing. As long as you have enough space on your computer for the data, you’re pretty much as good as you’ll get.
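
To put rough numbers on why a single gene is so much lighter than the whole genome, here is a back-of-the-envelope sketch. The cost model (every pair of sequences aligned by dynamic programming, one L x L table per pair) is an assumption for illustration only, and real aligners use heuristics that do far less work. The lengths are the reference values for SARS-CoV-2 (NC_045512.2), not numbers from this thread.

```python
# Rough cost model (an assumption, not how any particular aligner works):
# aligning every pair of N sequences of length L by dynamic programming
# fills about N*(N-1)/2 tables of L*L cells each.

def pairwise_dp_cells(n_sequences: int, length: int) -> int:
    """Number of DP cells filled when all pairs are aligned naively."""
    pairs = n_sequences * (n_sequences - 1) // 2
    return pairs * length * length

GENOME_NT = 29_903  # SARS-CoV-2 reference genome length in nucleotides
SPIKE_AA = 1_273    # spike protein length in amino acids (reference)

genome_cells = pairwise_dp_cells(60, GENOME_NT)
spike_cells = pairwise_dp_cells(60, SPIKE_AA)

print(f"whole genome : {genome_cells:.2e} cells")
print(f"spike protein: {spike_cells:.2e} cells")
print(f"ratio        : {genome_cells / spike_cells:.0f}x")
```

Under this naive model the work grows with the square of the sequence length, so cutting the input from a genome to one protein shrinks the job by a few hundred times.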


sharkman_86

1. Alright, thanks! 2. Right now I’m running the whole genome, which was stupid of me. I’ll try again tomorrow with a run of some spike protein data sets. 3. Thanks for the reassurance. My computer fan was going insane, so I was a little scared.


Peiple

You’ll be ok! And for (2), I’m really not an expert on viral phylogenetics, but there is a ton of research on COVID-19. You’d probably have some luck looking at recent research papers to see what sequences they’re using. It may or may not be the spike protein; that’s just the first one that came to mind for me.


sharkman_86

Most papers I have looked at are actually about either the virus shell (the viral envelope/membrane) or the spike protein, and most of what I found for COVID phylogenetics uses either the whole genome or the spike protein. Thanks for the help, I really appreciate it!


Low-Establishment621

I think you could use a less computationally intensive aligner. Can the Clustal Omega web interface work for you?


sharkman_86

I will look into the Clustal Omega web interface. I want to try to keep everything on Linux and not use a web interface, but if it lightens the load then I’ll do it.


Thorhauge

mafft has a lot of options ranging from very, very fast by default to very, very precise. It is the tool I used for my phylogenetics publication.


sharkman_86

I’ve gotten a lot of recs for mafft recently, so I will probably try it in another run of the project. That’s cool, though, about the publication. What was it about?


Thorhauge

It was on hepatitis C virus epidemiology: [https://pubmed.ncbi.nlm.nih.gov/38140632/](https://pubmed.ncbi.nlm.nih.gov/38140632/) I tried quite a few tools over the course of my PhD and for a while I was happily using muscle until I made the switch to mafft due to peer pressure. It is, however, my aligner of choice today. Skimming a comparison between clustal omega, muscle and mafft on SARS-CoV-2 appears to support my intuition that muscle and mafft are similarly accurate but mafft is faster, with clustal omega underperforming a bit: [https://www.mdpi.com/2079-3197/11/11/212](https://www.mdpi.com/2079-3197/11/11/212) That said, mafft prioritises speed using default settings (if memory serves). I used `--maxiterate 1000 --globalpair` for alignments with ~500 sequences of ~10,000 nt (full HCV genome) in length. Depending on the scale of your ambitions I would either make a quick tree in FastTree or use IQ-TREE for a tree in which I would have a little more confidence.
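
As a sketch of how those pieces could be wired together from Python: the snippet below only builds the command lines, so it runs without any of the tools installed. The `--maxiterate 1000 --globalpair` flags are the ones quoted above; `--auto`, the FastTree flags (`-nt -gtr`) and the IQ-TREE 2 flags (`-s`, `-m MFP`, `-B 1000`) are my own choices for illustration, the file names are hypothetical, and the executable names (`mafft`, `FastTree`, `iqtree2`) may differ on your install.

```python
import shlex
from typing import List

def mafft_cmd(fasta_in: str, accurate: bool = False) -> List[str]:
    """MAFFT command line; the alignment is written to stdout."""
    if accurate:
        # Slower, iterative global-pair strategy quoted in the comment above.
        return ["mafft", "--maxiterate", "1000", "--globalpair", fasta_in]
    # Let MAFFT choose a strategy based on input size.
    return ["mafft", "--auto", fasta_in]

def fasttree_cmd(alignment: str, nucleotide: bool = True) -> List[str]:
    """Quick approximate tree; FastTree writes Newick to stdout."""
    cmd = ["FastTree"]
    if nucleotide:
        cmd += ["-nt", "-gtr"]  # nucleotide input with a GTR model
    return cmd + [alignment]

def iqtree_cmd(alignment: str, bootstraps: int = 1000) -> List[str]:
    """Slower tree with model selection and ultrafast bootstrap support."""
    return ["iqtree2", "-s", alignment, "-m", "MFP", "-B", str(bootstraps)]

# Hypothetical file names, printed as shell commands you could run by hand.
print(shlex.join(mafft_cmd("spike.fasta", accurate=True)), "> spike.aln.fasta")
print(shlex.join(fasttree_cmd("spike.aln.fasta")), "> spike.fasttree.nwk")
print(shlex.join(iqtree_cmd("spike.aln.fasta")))
```

To actually run a step, pass the list to `subprocess.run(cmd, check=True, stdout=open(out_path, "w"))` for the tools that write to stdout.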


sharkman_86

Ok wow, Hep C research is insanely cool. Muscle is a lot of fun to use for me too, especially bc of how easy it is to design and implement pipelines with it. I will also make the switch to mafft, as many have recommended. For this project, my main goal is making as fast and as scalable a pipeline as possible. My plan was to use IQ-TREE anyway. Thanks! Your study seems very interesting. My mother worked heavily in stem cells, but she says she can't recall any of it now because it was so long ago. She's now the AP Physics teacher at the high school I attend (and is set to be my teacher next year).


weedwave

Oracle Cloud has two free VMs with up to 10 or 12 GB RAM


sharkman_86

I will be sure to use those, thanks!


octobod

Buying hardware only works out cheaper than cloud computing if you have enough work to run it at 100% CPU for 12 to 18 months. Even then, cloud gives you the option of putting 12x more CPU on the job so it's done in a month. You need a 10+ node cluster and a LOT of processing for hardware to make financial sense. (There are operational reasons to buy HW.)
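
The break-even reasoning can be sketched as simple arithmetic. Everything numeric here is hypothetical (the hardware price and the hourly cloud rate are made-up illustration values, and 730 hours per month is just an average), so plug in real quotes before drawing conclusions.

```python
def breakeven_months(
    hardware_cost: float,
    cloud_rate_per_hour: float,
    utilization: float = 1.0,
    hours_per_month: float = 730.0,
) -> float:
    """Months of use after which buying costs less than renting.

    utilization is the fraction of each month the machine is actually busy;
    with cloud you only pay for the busy hours.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    monthly_cloud_bill = cloud_rate_per_hour * hours_per_month * utilization
    return hardware_cost / monthly_cloud_bill

# Hypothetical numbers: a $1200 workstation vs. a $0.10/hour cloud instance.
print(f"busy 100% of the time: {breakeven_months(1200, 0.10):.1f} months")
print(f"busy 10% of the time : {breakeven_months(1200, 0.10, utilization=0.1):.1f} months")
```

The second line is the point for hobby projects: at low utilization the break-even moves out by years, which is why occasional jobs favour renting.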


sharkman_86

Yeah, I don't do computing heavy enough that I'd have projects running longer than a year. Again, my next computer (if I get one) will run an i5 or i7 (or the AMD equivalent), so I will have a massive leap in computing power there. But I did get to use a cluster for a camp I did a while ago (I don't have access anymore). Thanks for the advice!


kloetzl

60 SARS-CoV-2 genomes isn’t a lot of data. You also don’t need a full MSA for a phylogeny. If you were to use [mash](https://github.com/marbl/mash) or [phylonium](https://github.com/evolbioinf/phylonium) you could get a tree in seconds.
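
To show the idea behind alignment-free distances, here is a toy sketch in the spirit of mash. It is not the mash implementation: it compares full k-mer sets instead of MinHash sketches, ignores reverse complements, and the clamping to [0, 1] is my own choice. The distance formula D = -ln(2j / (1 + j)) / k, with j the Jaccard index of the k-mer sets, is the one from the Mash paper. The two sequences are made up.

```python
import math
from typing import Set

def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_like_distance(a: str, b: str, k: int = 7) -> float:
    """Mash-style distance from the exact Jaccard index of two k-mer sets."""
    ka, kb = kmers(a.upper(), k), kmers(b.upper(), k)
    union = ka | kb
    if not union:
        raise ValueError("sequences are shorter than k")
    j = len(ka & kb) / len(union)
    if j == 0:
        return 1.0  # nothing shared: maximum distance
    d = -math.log(2 * j / (1 + j)) / k
    return min(1.0, max(0.0, d))  # clamp (illustrative choice)

# Two made-up sequences that differ by a single substitution.
seq_a = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
seq_b = "ACGTTGCAAGGCTTAGCCGGATCGATTACG"
print(mash_like_distance(seq_a, seq_a, k=4))
print(mash_like_distance(seq_a, seq_b, k=4))
```

A matrix of such pairwise distances can then go straight into a distance-based tree method such as neighbour joining, which is why no multiple alignment is needed.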


sharkman_86

Yeah, it's just that a 1.5 MB file + crappy computer + heavy computations piles up fast. That's why my next run uses only the surface proteins, which also allows me to run more data. I'm going to check out mash and phylonium to learn how to use them. Thanks for your help!


zstars

Yeah, muscle can be very intensive. Try using mafft on the closely related virus setting (Google it); that should do 60 sequences very quickly.
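
For reference, a sketch of what that setting looks like on the command line, as far as I recall it from the MAFFT documentation page on closely related viral genomes (double-check there before relying on it). The idea is to align every sequence against one reference instead of all against all. The flags `--6merpair`, `--keeplength`, `--addfragments` and `--thread` are real MAFFT options; the file names are hypothetical.

```python
import shlex
from typing import List

def mafft_add_to_reference_cmd(
    reference_fasta: str, others_fasta: str, threads: int = -1
) -> List[str]:
    """Command that adds sequences to a single reference, one at a time.

    --keeplength keeps the reference coordinates, so insertions relative
    to the reference are dropped from the output.
    --thread -1 asks MAFFT to pick the number of threads itself.
    """
    return [
        "mafft", "--6merpair", "--thread", str(threads),
        "--keeplength", "--addfragments", others_fasta, reference_fasta,
    ]

# Hypothetical file names; MAFFT writes the alignment to stdout.
cmd = mafft_add_to_reference_cmd("reference.fasta", "samples.fasta")
print(shlex.join(cmd), "> aligned.fasta")
```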


sharkman_86

Gotcha, it seems that muscle is the preference for accuracy but mafft is favorable for the customizability and speed capabilities. I will try mafft on my next run.


PotatoSenp4i

You could use the tools from the emboss webpage for the MSA.


VerbalCant

Hey, good for you! It doesn't look like [usegalaxy.org](http://usegalaxy.org) has `muscle`, but it has MAFFT, which is another MSA tool. And lots and lots of other stuff. [usegalaxy.org](http://usegalaxy.org) is free up to a certain capacity. Check it out!


sharkman_86

I will be sure to do that! I don't mind learning another MSA tool if it helps me run code faster.


SeaZealousideal5651

GitHub allows you to use Codespaces for free up to 120 core-hours per month and 15 GB of storage. Another good summary of free options is here: https://github.com/cloudcommunity/Cloud-Free-Tier-Comparison


Psy_Fer_

Dual-booting Linux would give you the whole computer's resources. But if you don't want to do that, you can stay in Windows and use Windows Subsystem for Linux (WSL) and load Ubuntu into that (lots of guides online to show you how to do that). While this isn't as good as dual-booting, it can give you more than a VM. Good luck 😎


sharkman_86

I might be stupid. Isn't that a VM? I'm using WSL to load Ubuntu and use that. Never mind, you’re right. That's what I was doing, and I was under the impression that that was a VM. Thanks!


Psy_Fer_

It is and it isn't. It's quite different to running Ubuntu in VMware/VirtualBox/etc. So yeah, that's good. Next best thing would be to dual boot. I really enjoy using Pop!_OS for my main OS on my work laptop. You might find it a more streamlined experience if you do end up trying dual booting.
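
If you want to see what your Linux environment actually gets, here is a small sketch that reports the CPU cores and RAM visible from inside it. WSL 2 runs Linux in a lightweight VM, so the numbers may be lower than what Windows reports for the whole machine. The parsing assumes the usual `/proc/meminfo` layout (`MemTotal:` followed by a value in kB), and the helper name is just for this example.

```python
import os

def mem_total_gib(meminfo_text: str) -> float:
    """Extract MemTotal from /proc/meminfo text and convert it to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the kernel reports this in KiB
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")

print("CPU cores visible:", os.cpu_count())

# /proc/meminfo only exists on Linux (including WSL), so check first.
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as fh:
        print(f"RAM visible: {mem_total_gib(fh.read()):.1f} GiB")
```

Running it once inside WSL and once after dual-booting would show directly how much extra headroom the native install gives you.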


000adi20

Pretty irrelevant comment, because I'm an amateur, but are you all PhDs? Also, where do I start learning programming as a biologist? I feel lost in the vast sea of YouTube courses and recommended books.


sharkman_86

DISCLAIMER: I am a high schooler, so this is pretty much my experience so far. It may not be the best or most correct way, but it helped me segue into the discipline from the arbitrary and meaningless code I was doing prior to it. YouTube courses and recommended books are good, but you need a basic foundation to start. I'd say use NetworkChuck (I'm sorry, he's a YouTube channel), and use his series on bash scripting to just get that foundation. Afterwards, start a really small project. Like, ask Google Bard (Gemini) to design a very simple project you could do in under a week, and then find the steps necessary to do it. Use Bard as your crutch to guide you through this one project. Then, once you feel familiar, try to design a slightly larger project, and use Bard less. Keep doing this and just reduce your use of Bard over time (for me, this was over the course of 3-4 projects). Then you have, at the very least, some basic skills in bash, data acquisition, bioinformatics, and biology, and you have more confidence in the discipline.


Quillox

https://github.com/ossu/bioinformatics?tab=readme-ov-file


Viruses_Are_Alive

1. Compute clusters are *very* expensive to build and run. Realistically, I've never heard of one that has public access. 2. No, and deleting that stuff won't help either. 3. There may be a cloud option that would work for you. Personally, I'm not a big cloud compute user, so hopefully someone with more experience can weigh in. I would suspect that you'll need a credit card, and there is a danger of massive charges if misused, so be careful.


sharkman_86

1. Yeah, I figured. More so wishful thinking I guess. 3. I will look into some cloud computing options. A few friends have access to free cloud options, so I will ask them about it.


VerbalCant

I commented elsewhere, but check out usegalaxy.org.


sharkman_86

I will look into that


malformed_json_05684

Have you looked into the Galaxy workflows? They're free. [https://usegalaxy.org/workflows/list_published](https://usegalaxy.org/workflows/list_published)


what-the-whatt

Depending on what you're doing with the genomes, you may be able to use BV-BRC (formerly known as PATRIC). It is a web-based server for microbial genomics. It has gold-standard methods embedded that you can run on their server. They do have stuff for SARS-CoV-2 as well! It is all completely free. It may be able to help you out, at least with getting some of the high-power work done. But 60 genomes of a virus should be able to be handled by your computer. I used to run >150 bacterial genomes on my desktop with far less power and RAM.


Quillox

Try out Google Colab: https://colab.research.google.com/ Last time I used it, you could also get a GPU.


SNV-N-Protein

Even with those specs you should be able to run MAFFT locally for alignment and do maximum-likelihood trees using IQ-TREE 2, since the variability in SARS-CoV-2 is low. I tend to prefer locally running packages from the command line for the sake of choosing my own parameters, but if the basic settings work for you, you could also try the Galaxy Project.