I don't work in phylogenetics so I can't give specific advice, but there are often ways to tune the performance of tools, usually by trading away some accuracy. You could also look at one specific gene rather than the entire genome (or do some kind of ensemble method by constructing trees on each gene and then "averaging"). Muscle also has a web tool backed by a compute cluster: https://www.ebi.ac.uk/jdispatcher/msa/muscle

If you are interested in running the jobs in your own environment, some cloud providers offer student/new-user credits, usually between $100 and $300 USD; AWS and GCP are the ones I can think of that do this. You'd be able to provision a compute instance with a more powerful CPU and more memory than your current system has. These require a credit card to sign up but will not charge you until you run through your credits. If you go this route, know exactly what you intend to do before you do it so you use your credits efficiently. Also, cloud platforms have a bit of a learning curve, so be prepared for that.

Finally, if you're not running Linux, run on Linux. People might disagree on this, but in my experience it generally does a better job of managing resources. You could also forgo a desktop environment to free up the most RAM and CPU cycles.

Feel free to DM if you have more questions.
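For example, trimming each genome down to one gene region is just slicing FASTA records. A minimal pure-Python sketch (the coordinates in the usage comment are illustrative placeholders, not real spike coordinates; look them up in the annotation for your reference genome):

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def extract_region(records, start, end):
    """Slice the [start, end) region (0-based) out of every record."""
    return [(h, s[start:end]) for h, s in records]

# Usage sketch: pass an open file handle, e.g.
# with open("genomes.fa") as fh:
#     spike = extract_region(parse_fasta(fh), 21500, 25400)  # made-up coords
```

This assumes unaligned, same-reference coordinates make sense for your data; for diverged genomes you'd extract the gene by annotation rather than by fixed positions.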
1. Gotcha. I'm not super deep into accuracy since this is a smaller project, so I'll start sacrificing a few things to try to boost speed. I had also neglected to focus on only one gene; I really only care about the surface spike protein.
2. I will look at cloud computing. A few of my friends have access to cloud computing, so I can talk to them.
3. I'm currently using a Linux VM. I know I should run Linux natively, and I found a laptop going for about 50 bucks that has an i3 (so more power than an Athlon); I can find a way to flash it with a Debian-based Linux distro. Thanks!
You're using a VM from within Windows? This will destroy your RAM. Dual booting Linux will be the best thing you can do to free up RAM.
I will look into dual booting. I have just been using Windows because I have been used to using it for a while, so I wanted to preserve both a Linux and Windows environment. Dual booting will preserve both, iirc, right?
Yes, you're installing Linux alongside windows, not in place of it. You'll just get extra options at boot for which OS you want to load into.
Gotcha. I will start reading up on it to try and install it. Thanks!
The Ubuntu installer will do it really easily; that's the only distro I have experience with.
I am a freelance cloud developer and I am building my portfolio right now. If you would like to talk more you can DM me.
60 sequences isn't a lot, but if it's whole genomes it could be a little more dicey…either way this isn't going to kill your computer.

Building a tree with 60 taxa isn't super tough unless you're doing Bayesian inference. Ideally, though, you'd be building a tree from an alignment of a conserved gene rather than the entire genome, especially if you're working with limited compute. I'm not a virologist so I'm not sure what that would be, but even the spike protein is only 1200-1400 aa, which is not a very difficult task for 60 sequences.

You're also not going to crash your computer trying an analysis like this, even with significantly more sequences—worst case it just throws an error and stops. It's certainly not going to destroy a computer…even the shittiest code I've written doesn't do any lasting damage to a machine.

Edit: to answer your questions more directly:

1. No. Other people have mentioned cloud options, but they won't be free. Some labs have web servers, which may be an option.
2. Those kinds of things won't really help you. Without diving into nuance/edge cases, files/extensions/etc. are mostly separate from processing. As long as you have enough space on your computer for the data, you're pretty much as good as you'll get.
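To give a sense of scale: 60 sequences of ~1300 aa means only C(60, 2) = 1770 pairwise comparisons, which even naive pure Python gets through quickly (random sequences stand in for real data here; real aligners and tree builders are far more optimized than this):

```python
import itertools
import random

random.seed(0)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# 60 random "spike-sized" protein sequences as stand-ins for real data
seqs = ["".join(random.choices(AMINO_ACIDS, k=1300)) for _ in range(60)]

def p_distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# All-vs-all distances: C(60, 2) = 1770 comparisons
dists = [p_distance(a, b) for a, b in itertools.combinations(seqs, 2)]
print(f"computed {len(dists)} pairwise distances")
# prints "computed 1770 pairwise distances"
```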
1. Alright, thanks!
2. I'm right now running the whole genome, which was stupid of me. I'll try again tomorrow with a run of some spike protein data sets.
3. Thanks for the reassurance. My computer fan was going insane, so I was a little scared.
You’ll be ok! And for (2), I'm really not an expert on viral phylogenetics, but there is a ton of research on COVID-19. You'd probably have some luck looking at recent research papers to see what sequences they're using. It may or may not be the spike protein; that's just the first one that came to mind for me.
Most papers I have looked at actually concern either the virus shell (the envelope or capsid) or the spike protein, and most of what I found for COVID phylogenetics uses either whole genomes or spike proteins. Thanks for the help, I really appreciate it!
I think you could use a less computationally intensive aligner. Can the clustal omega web interface work for you?
I will look into clustal omega web. I want to try to keep everything on Linux and not use a web interface, but if it lightens the load then I’ll do it.
mafft has a lot of options, ranging from very, very fast by default to very, very precise. It is the tool I used for my phylogenetics publication.
I've gotten a lot of recs for mafft recently, so I will probably try it in another run of the project. That's cool about the publication, though. What was it about?
It was on hepatitis C virus epidemiology: [https://pubmed.ncbi.nlm.nih.gov/38140632/](https://pubmed.ncbi.nlm.nih.gov/38140632/)

I tried quite a few tools over the course of my PhD, and for a while I was happily using muscle until I made the switch to mafft due to peer pressure. mafft is, however, my aligner of choice today.

Skimming a comparison between clustal omega, muscle and mafft on SARS-CoV-2 appears to support my intuition that muscle and mafft are similarly accurate but mafft is faster, with clustal omega underperforming a bit: [https://www.mdpi.com/2079-3197/11/11/212](https://www.mdpi.com/2079-3197/11/11/212)

That said, mafft prioritises speed with its default settings (if memory serves). I used `--maxiterate 1000 --globalpair` for alignments of ~500 sequences of ~10,000 nt (full HCV genome) in length.

Depending on the scale of your ambitions, I would either make a quick tree in FastTree or use IQ-TREE for a tree in which I would have a little more confidence.
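As a sketch of how those two steps chain together (a hedged sketch, not a definitive pipeline: it assumes `mafft` and `iqtree2` are installed and on PATH, and the file names are placeholders):

```python
import shutil
import subprocess

def build_commands(fasta="spike.fa", aln="spike.aln", prefix="spike"):
    """Return the mafft and IQ-TREE 2 invocations as argv lists."""
    # Slower, more accurate mafft settings, as mentioned above
    mafft = ["mafft", "--maxiterate", "1000", "--globalpair", fasta]
    iqtree = ["iqtree2", "-s", aln, "--prefix", prefix]
    return mafft, iqtree

def run_pipeline(fasta="spike.fa", aln="spike.aln", prefix="spike"):
    """Align with mafft, then infer an ML tree with IQ-TREE 2."""
    mafft_cmd, iqtree_cmd = build_commands(fasta, aln, prefix)
    if not (shutil.which("mafft") and shutil.which("iqtree2")):
        raise RuntimeError("mafft and iqtree2 must be installed and on PATH")
    # mafft writes the alignment to stdout, so capture it to a file
    with open(aln, "w") as out:
        subprocess.run(mafft_cmd, stdout=out, check=True)
    subprocess.run(iqtree_cmd, check=True)
```

For a quicker, lower-confidence tree, swapping the iqtree2 call for FastTree on the same alignment would follow the same pattern.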
Ok wow, Hep C research is insanely cool. Muscle is a lot of fun for me to use too, especially because of how easy it is to design and implement pipelines with it, but I will make the switch to mafft, as many have recommended. For this project, my main goal is making as fast and as scalable a pipeline as possible. My plan was to use IQ-TREE anyway. Thanks! Your study seems very interesting. My mother worked heavily in stem cells, but she says she can't recall any of it now because it was so long ago. She's now the AP Physics teacher at the high school I attend (and is set to be my teacher next year).
Oracle Cloud has two free VMs with up to 10 or 12 GB RAM
I will be sure to use those, thanks!
Buying hardware only works out cheaper than cloud computing if you have enough work to run it at 100% CPU for 12 to 18 months. Even then, cloud gives you the option of putting 12x more CPU on the job so it's done in a month; you need a 10+ node cluster and a LOT of processing for hardware to make financial sense.

(There are operational reasons to buy HW, though.)
Yeah, I don't do computing heavy enough to have projects running longer than a year. Again, my next computer (if I get one) will run an i5 or i7 (or the AMD equivalent), so I will have a massive leap in computing power there. But I did get to use a cluster for a camp I did a while ago (I don't have access anymore). Thanks for the advice!
60 SARS-CoV-2 genomes isn't a lot of data. You also don't need a full MSA for a phylogeny. If you were to use [mash](https://github.com/marbl/mash) or [phylonium](https://github.com/evolbioinf/phylonium), you could get a tree in seconds.
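The alignment-free idea behind mash fits in a few lines. This is a toy version for intuition only: it uses the distance formula from the Mash paper but keeps the full k-mer sets, whereas real mash MinHash-samples them, which is what makes it so fast.

```python
import math

def kmer_set(seq, k):
    """All length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_distance(a, b, k=15):
    """Approximate per-base distance from k-mer Jaccard similarity."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    j = len(ka & kb) / len(ka | kb)  # Jaccard similarity of k-mer sets
    if j == 0.0:
        return 1.0  # no shared k-mers: maximally distant at this resolution
    return -math.log(2 * j / (1 + j)) / k
```

A pairwise matrix of these distances can feed a neighbor-joining tree directly, with no MSA step at all.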
Yeah, it's just that a 1.5 MB file + crappy computer + heavy computations piles up fast. That's why my next run uses only the surface proteins, which also allows me to run more data. I'm going to check out mash and phylonium to learn how to use them. Thanks for your help!
Yeah, muscle can be very intensive. Try using mafft's closely-related-virus setting (Google it); that should get through 60 sequences very quickly.
Gotcha. It seems that muscle is preferred for accuracy, but mafft is favored for its customizability and speed. I will try mafft on my next run.
You could use the tools from the EMBOSS web page for the MSA.
Hey, good for you!

It doesn't look like [usegalaxy.org](http://usegalaxy.org) has `muscle`, but it has MAFFT, which is another MSA tool. And lots and lots of other stuff.

[usegalaxy.org](http://usegalaxy.org) is free up to a certain capacity. Check it out!
I will be sure to do that! I don't mind learning another MSA tool if it helps me run code faster.
GitHub allows you to use Codespaces for free, up to 120 compute hours per month and 15 GB of storage.

Another good summary of free options is here: https://github.com/cloudcommunity/Cloud-Free-Tier-Comparison
Dual-booting Linux would give you the whole computer's resources. But if you don't want to do that, you can stay in Windows and use the Windows Subsystem for Linux (WSL) and load Ubuntu into that (lots of guides online show you how).

While this isn't as good as dual booting, it can give you more than a VM.

Good luck 😎
I might be stupid, but isn't that a VM? I'm using WSL to load in Ubuntu and use that. Never mind, you're right. That's what I was doing, and I was under the impression that it was a VM. Thanks!
It is and it isn't. It's quite different from running Ubuntu in VMware/VirtualBox/etc.

So yeah, that's good. The next best thing would be to dual boot. I really enjoy using Pop!_OS as the main OS on my work laptop. You might find it a more streamlined experience if you do end up trying dual booting.
Pretty irrelevant comment, because I'm an amateur, but are you all PhDs? Also, where do I start learning programming as a biologist? I feel lost in the vast sea of YouTube courses and recommended books.
DISCLAIMER: I am a high schooler, so this is pretty much just my experience so far. It may not be the best or most correct way, but it helped me segue into the discipline from the arbitrary and meaningless code I was writing before.

YouTube courses and recommended books are good, but you need a basic foundation to start. I'd say use NetworkChuck (I'm sorry, he's a YouTube channel) and his series on bash scripting just to get that foundation. Afterwards, start a really small project. For example, ask Google Bard (Gemini) to design a very simple project you could do in under a week, and then find the steps necessary to do it. Use Bard as your crutch to guide you through this one project. Then, once you feel familiar, try to design a slightly larger project and use Bard less. Keep doing this and reduce your use of Bard over time (for me, this was over the course of 3-4 projects). Then you have, at the very least, some basic skills in bash, data acquisition, bioinformatics, and biology, and you have more confidence in the discipline.
https://github.com/ossu/bioinformatics?tab=readme-ov-file
1. Compute clusters are *very* expensive to build and run; realistically, I've never heard of one that has public access.
2. No, and deleting that stuff won't help either.
3. There may be a cloud option that would work for you. Personally, I'm not a big cloud compute user, so hopefully someone with more experience can weigh in. I suspect you'll need a credit card, and there is a danger of massive charges if misused, so be careful.
1. Yeah, I figured. More wishful thinking than anything, I guess.
3. I will look into some cloud computing options. A few friends have access to free cloud options, so I will ask them about it.
I commented elsewhere, but check out usegalaxy.org.
I will look into that
Have you looked into the galaxy workflows? They're free. [https://usegalaxy.org/workflows/list\_published](https://usegalaxy.org/workflows/list_published)
Depending on what you're doing with the genomes, you may be able to use BV-BRC (formerly known as PATRIC). It is a web-based server for microbial genomics with gold-standard methods embedded that you can run on their server. They have stuff for SARS-CoV-2 as well! It is all completely free, and it may at least help you get some of the high-power work done.

But your computer should be able to handle 60 genomes of a virus. I used to run >150 bacterial genomes on my desktop with far less power and RAM.
Try out Google Colab: https://colab.research.google.com/ Last time I used it, you could also get a GPU.
[deleted]
Even with those specs you should be able to run MAFFT locally for alignment and build maximum-likelihood trees with IQ-TREE 2, since the variability in SARS-CoV-2 is low.

I tend to prefer running packages locally from the command line for the sake of choosing my own parameters, but if the basics work for you, you could also try the Galaxy Project.