Vol.19no.102003,pages1294–1295DOI:10.1093/bioinformatics/btg135
Longestbiasedintervalandlongestnon-negativesuminterval
LloydAllison
SchoolofComputerScienceandSoftwareEngineering,MonashUniversity,Clayton,Victoria,Australia3800
ReceivedonOctober16,2002;revisedonNovember17,2002;acceptedonJanuary22,2003
ABSTRACT
Summary:Describedisanalgorithmtofindthelongestintervalhavingatleastaspecifiedminimumbiasinasequenceofcharacters(bases,aminoacids),e.g.‘atleast0.95(A+T)-rich’.Itisbasedonanalgorithmtofindthelongestintervalhavinganon-negativesuminasequenceofpositiveandnegativenumbers.Inpractice,itrunsinlineartime;thiscanbeguaranteedifthebiasisrational.Availability:Javacodeofthealgorithmcanbefoundathttp://www.csse.monash.edu.au/∼lloyd/tildeProgLang/Java2/Biased/
Contact:lloyd@bruce.cs.monash.edu.au
Supplementaryinformation:ExamplesofapplicationstoPlasmodiumfalciparumgenomicDNAcanbefoundattheaboveURL.
INTRODUCTION
Regionsoflowinformationcontentinbiologicalse-quencescanbeofconsiderableinterest,forexampleinconnectionwiththecentromeresofPlasmodiumfalci-parum‘Elevenofthe14chromosomescontainedasingleregionof2–3kbwithextremelyhigh(A+T)content(0.97)’Gardneretal.(2002).
Here,analgorithmisdescribedtofindthelongestbiasedinterval(LBI),havingaspecifiedminimumbias,inasequenceofcharacters.Itreliesonanalgorithmwhich,givenasequenceofnumbers,findsthelongestnon-negativesuminterval.
Theproblemconsideredhereisrelatedtothemaximumsuminterval(MSI)problem,thatisgivenasequenceofnumbers,x[0...n−1],findaninterval,x[i...j],suchthatthesumofthenumbersintheintervalisaslargeaspossible.Bentley(1986)attributesthe2DversionofthatproblemtoU.Grenanderandthesolutiontothe1DproblemtoJ.Kadane.TheMSIproblemissolvedbyconsideringcumulativesumsinthesequencefromx[0]tox[k](solidline,Fig.1).AnMSIcorrespondstoamaximumincreaseinsumsforx[0...i−1]tox[0...j]whereij+1.Below,arelatedideaisusedtosolvethenewproblem.
1294
NON-NEGATIVESUMINTERVALS
Givenasequenceofnumbers,x[0...n−1],thelongestnon-negativesuminterval(LNNSI)problemistofindaninterval,x[i...j],that(a)sumstoanon-negativetotaland(b)isaslongaspossible.Ittoocanbesolvedbyconsideringcumulativesumsfromx[0]tox[k].Butinthiscase,anycandidateintervalcorrespondstoanon-decreaseinsumsforx[0...i−1]tox[0...j]whereij+1.Thecumulativesumiscalculatedinalefttorightscanofthesequence.Auxiliaryarraysareusedtostore:(a)firstoccurrencesofnewminimainthesuccessivecumulativesumsand(b)thepositionsofthosefirstoccurrences(dottedline,Fig.1);thedashedlineinthefigureindicatesthelastpointatwhichthesumisatoraboveeachvalue.Ateachscanposition,j,thealgorithmconsidersthebestcandidateinterval,x[i...j],overiin0...j+1.Thecumulativesumforx[0...i−1]mustbenomorethanthatforx[0...j],infactimustbethesmallestsuchindex;imaginejadvancingalongthesolidlineanditrackingbackandforthalongthedottedlineinthefigurewhilestayingatorjustbelowj’sheight.Thebestoutofallsuchcandidateintervals,x[i...j],isrememberedduringthescantoyieldthesolution.
BIASEDINTERVALS
Givenasequenceofcharacters,s[0...n−1],aselectionofpreferredcharacters,andaminimumproportion(bias),f,ofpreferredcharacters,thelongestbiasedinterval(LBI)problemistofindaninterval,s[i...j],suchthat:(a)theintervalcontainsatleastthespecifiedproportionofpreferredcharacters;and(b)theintervalisaslongaspossible.
Preferredcharactersare‘good’andaregivenapositivescore,1−f.Othercharactersof‘bad’andaregivenanegativescore,−f.Onlycandidateintervalshavenon-negativetotalscores:Foranintervalofggoodandbbadcharacters,g/(g+b)fisequivalenttog(g+b)·fandtog·(1−f)−b·f0,soLBIisequivalenttoanLNNSIproblem.
AsforLNNSI,thesequence(ofcharacters)isscannedfromlefttoright,computingcumulativesumsofscores,
cOxfordUniversityPress2003;allrightsreserved.Bioinformatics19(10)
LBIandLNNSI
Fig.1.Cumulativesums:solidline:cumulativesum;dottedline:firstbelow(newminima):dashedline:lastabove.
andcandidateintervals,s[i...j].Thealgorithmrunsinlinear-timeinpracticebecausewhenjisincreasedtoj+1,ialmostalwayschangesbyonlyasmallamountunderreasonableconditions:Ifthesumtoj+1increases,thenimay‘retreat’afewstepstoanearlierandhigherminimum(dottedline,Fig.1).Ifthesumtoj+1decreases,thenimay‘advance’byafewstepstoalaterandlowerminimum.Forpathologicalsequencesand‘suitable’choicesoff,thealgorithmcouldinprinciplerunmoreslowly.Inpractice,ascanofchromosome-14(3mb)ofP.falciparumtakeslessthan1s:Javacode,just-in-timecompileranda900MHzprocessor.Itisthuspracticaltoscanevenverylongsequencesformanydifferentvaluesofbias,f.
Thereisanalternativecodingofthealgorithmwhichisguaranteedtoruninlinear-timeiftheminimumbias,f,isarationalnumber:Integerscores,+pand−qsay,canthenbeusedandthelocationsofnewminimastoredinanarrayindexedbyvalue.Thefirstoccurrenceofagivencumulative-sumvaluecanthenbefoundusingrandomaccess,andthestartpositionofthecandidateintervaladjustedinO(1)-time.Worst-casespace-complexityislinearbutthemultiplieristhelargerofpandq.
RuzzoandTompa(1999)gavealinear-timealgorithmtofindallmaximumsumintervals(MSIs)ofasequence,i.e.everyintervalthatcannotbeextendedwithoutdecreasingitssum.Forourproblem,itispossibletohavetwooverlappingbiasedsequences,yetneithercanbeextendedwithoutfallingbelowtheminimumbias:e.g.Consider(A)-rich,biasf=0.6andsequence‘ATAT...ATAT’;anyinstanceof‘ATATA’hasthreeAsoutoffivecharactersandsatisfiestheconditionbutnoextensiondoes.Thedefinitionofalocal,ratherthanaglobal,biasconditionisthereforeproblematic.Howeveradivideandconquerapproachwouldusuallyfindregionsofinterestinpractice:FindaglobalLBI,andrepeatrecursivelyonthesectionsofthesequencetoitsleftanditsright—O(n·logn)-timeonaverageifthesplitsareoftenreasonablyeven,O(n2)-timeatworstifthesplitsarealmostalwaysunbalanced.
REFERENCES
Bentley,J.(1986)ProgrammingPearls.Addison-Wesley,Reading,MA.
Gardner,M.J.,Hall,N.,Fung,E.,White,O.,Berriman,M.,Hy-man,R.W.,Carlton,J.M.,Pain,A.,Nelson,K.E.,Bowman.,S.etal.(2002)GenomesequenceofthehumanmalariaparasitePlas-modiumfalciparum.Nature,419,498–511.
Ruzzo,W.L.andTompa,M.(1999)Alineartimealgorithmforfind-ingallmaximalscoringsequences.7thInt.Conf.onIntelligentSystemsforMolecularBiology(ISMB).Heidelberg,Germany,pp.234–241.
1295
因篇幅问题不能全部显示,请点此查看更多更全内容