I'm reading an article titled "Characterizing SARS-CoV-2 mutations in the United States" and got confused about the Jaccard distance. The article uses the Jaccard distance to measure the similarity between SNP variants and to compare the SNP variant profiles of SARS-CoV-2 genomes. First, the Jaccard similarity coefficient of two sets A and B is defined as the size of their intersection divided by the size of their union. The Jaccard distance of A and B is then one minus the Jaccard similarity coefficient, and it is a metric on the collection of all finite sets. This is easy to understand, since distance complements similarity. But defining distance and similarity this way ignores the order information underlying the sequence structures, right? Would this be sufficient? I mean, is this a good distance/similarity definition?
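For concreteness, the two definitions from the question can be sketched directly with Python sets (the empty-set convention below is my own assumption, not from the article):

```python
def jaccard_similarity(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| for two finite sets."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """Jaccard distance: 1 - similarity; a metric on finite sets."""
    return 1.0 - jaccard_similarity(a, b)

A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
print(jaccard_similarity(A, B))  # 2 shared / 6 total = 0.333...
print(jaccard_distance(A, B))    # 0.666...
```

Note that the inputs are unordered sets, which is exactly why the question about ordering arises: nothing in this formula looks at positions or sequence.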

Similarity measurements are generally useful for painting a quick picture of how divergent the sites in a set of sequences are. You are correct that they do not capture all of the useful information about a protein structure, but whether this presents a problem depends on the question you're asking of the data.

I think it is a reasonable metric. Just to be sure the details are clear: imagine you want to see how distant corona strain A is from corona strain B. The first thing you do is sequence both strains, align them to the reference corona sequence, and perform variant calling. Most of the sequence is identical (hence uninformative); only at certain sites do you observe variants in strain A with respect to the reference and variants in strain B with respect to the reference. This can be represented as two sets: set A contains the variants detected in strain A, while set B contains the variants detected in strain B.
The number of variants in each set is a measure of divergence with respect to the reference (how distant a strain is from the reference). But that is not the question here; the question is how distant set A is from set B. Hence, you can employ the Jaccard distance to measure the distance between set A and set B: a large intersection between sets A and B is good evidence for similarity between strains A and B. This strategy makes sense if the sequences are relatively similar to each other, i.e. not too many variants (as is the case for viral sequences belonging to the same species). It is a bit like measuring the length of a phylogenetic path from A to B passing through the reference.
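A small worked example of this strategy, using illustrative SNP identifiers (the specific variants below are chosen for illustration, not taken from the article):

```python
# Each strain is reduced to the set of variants it carries vs. the reference.
# SNP labels here are illustrative, in "ref-position-alt" notation.
strain_a = {"C241T", "C3037T", "A23403G"}
strain_b = {"C241T", "C3037T", "A23403G", "G25563T"}

shared = strain_a & strain_b    # variants called in both strains
union = strain_a | strain_b     # variants called in either strain
distance = 1 - len(shared) / len(union)
print(distance)  # 1 - 3/4 = 0.25
```

Note that the genome sequences themselves never enter the computation; only the variant calls do, which is why the approach works best when the strains are close to the reference.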

You mean it is not very precise, so it only applies when the sequences are relatively similar to each other, right? And sets A and B are not the genome sequences themselves, but rather the variants detected in strains A and B respectively. In another article, titled "Genotyping coronavirus SARS-CoV-2: methods and implications", the author mentions that "The Jaccard distance measure of SNP variants takes account of the ordering of SNP mutations." Why is that? Why is the ordering of SNP mutations being considered here? Is it somehow implicitly accounted for during the sequence alignment process?

I think @Ventrilocus provided a good argument, but I will add some caveats that may not have been covered.

If memory serves, Jaccard similarity is normally used to measure the overlap between datasets of the same size. It is used, for example, when comparing a known clustering solution with clusters obtained by various methods. Setting aside for a moment the requirement for the same size, the explanation by @Ventrilocus holds in that the Jaccard similarity between two related strains is most likely very close to 1 (so the distance is close to 0) if we choose to look at the whole genome. That said, the argument above about the subsets is valid only if the subsets somehow include all present and future mutants we'd be interested in studying, or if we are content to look only at mutations in the originally chosen subset. To me this is a potentially serious drawback.

Finally, the main reason I think Jaccard similarity is not a good measure is that it doesn't account for the effect of silent mutations. Changing the reference codon CGT into CGC in strain A will lower the Jaccard similarity, even though both codons code for arginine. In that sense, the Jaccard similarity/distance may be a reasonable way to account for absolute mutation rates, but it is deficient when it comes to the biological consequences of those mutations.
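To make the silent-mutation point concrete, here is a minimal sketch (only the two codons from the example are included; a full codon table would be needed in practice):

```python
# A synonymous (silent) change still alters the nucleotide-level variant set.
codon_to_aa = {"CGT": "Arg", "CGC": "Arg"}  # both codons encode arginine

ref_codon, alt_codon = "CGT", "CGC"
# The nucleotide-level change enters the variant set and lowers the
# Jaccard similarity...
variant = f"{ref_codon}>{alt_codon}"
# ...even though the encoded amino acid, and hence the protein, is unchanged:
print(codon_to_aa[ref_codon] == codon_to_aa[alt_codon])  # True
```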

1: Why do we have the "same size datasets" requirement? I did not see it in the formula definition.
2: So we can choose what we'd like the subsets to be in the definition? What do you mean by "the argument above about the subsets is valid only if the subsets somehow include all present and future mutants we'd be interested in studying..."?

I know for sure that the Jaccard function in sklearn (`sklearn.metrics.jaccard_score`) requires that the inputs being compared be of equal length, since it operates on two label vectors element by element rather than on two sets. I don't know off-hand the mathematical reason for that, but intuitively I think they have to be of equal size, or else one could always find many different unions between a smaller vector and a particular fraction of a larger vector. How would we decide which of those unions to report, especially since many of them would likely have the same Jaccard score?
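One way to reconcile the two views (this is my own sketch, not from either article): vector-based implementations expect equal-length binary vectors, and you get those from sets of different sizes by encoding each set over a fixed, shared universe of items. The set formula itself has no size requirement.

```python
def to_indicator(s, universe):
    """Encode a set as a 0/1 vector over an agreed, ordered universe."""
    return [1 if item in s else 0 for item in universe]

# Illustrative variant sets of different sizes:
A = {"C241T", "A23403G"}
B = {"C241T", "G25563T"}
universe = sorted(A | B)  # a shared ordering fixes the common vector length

va, vb = to_indicator(A, universe), to_indicator(B, universe)
both = sum(x & y for x, y in zip(va, vb))    # |A ∩ B|
either = sum(x | y for x, y in zip(va, vb))  # |A ∪ B|
print(1 - both / either)  # same result as the set formula: 1 - 1/3
```

So the equal-length constraint is an artifact of the vector encoding, not of the Jaccard definition itself.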
