[Misc] Contest 4: DNA [Archive] - Hardware Upgrade Forum

gugoXX

28-07-2008, 18:08

Given two very long strings of different lengths, containing only the characters C, G, A and T:

CGTGTCGCGATACATAGTACaggagtaaagagatGaggaTacgATAGCGATACGATACAGATGTCGGAGAATC
TTCCTACGCAGaggagtaaagagatCaggaCacgGAGCTGATCGGTCGACCATCATTATA

Question 1:
How long is the longest contiguous DNA subsequence (substring) common to the two strings, and at which index does it start in each string?
In the example the answer is (should be) 14, with indices 20 and 11, because the substring aggagtaaagagat, shown in lowercase in the example for convenience, should be the longest common one.

Question 2:
Transcription errors can occur while DNA polymerase is at work.
Let ERR be the maximum number of errors allowed: find the length of the longest DNA subsequence common to the two strings, and its starting index in each of them.

Errors must NOT be consecutive: two consecutive errors end the sequence in any case.
In addition, a subsequence cannot start or end with an error.

In the previous example, with ERR=1 the answer is (should be) 19, again with indices 20 and 11, because aggagtaaagagatGagga differs from aggagtaaagagatCagga in a single character.
With ERR=2 the answer is (should be) 23 (same reasoning), again with indices 20 and 11.
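
The rules of Question 2 can be sketched as a check along a single alignment. This is only an illustration in Python, not the contest's reference solution: the name `extend_with_errors` and its parameters are made up here, and it assumes both strings have been converted to the same case first.

```python
def extend_with_errors(s1, s2, i, j, max_err):
    """Length of the longest run starting at s1[i] / s2[j] that obeys the
    Question 2 rules: at most max_err mismatches, never two in a row,
    and no mismatch in the first or last position."""
    if i >= len(s1) or j >= len(s2) or s1[i] != s2[j]:
        return 0                      # a run may not start with an error
    best = 0                          # length up to the last matching character
    errors = 0
    prev_error = False
    k = 0
    while i + k < len(s1) and j + k < len(s2):
        if s1[i + k] == s2[j + k]:
            prev_error = False
            best = k + 1              # the run is allowed to end here
        else:
            if prev_error or errors == max_err:
                break                 # two errors in a row, or budget used up
            errors += 1
            prev_error = True
        k += 1
    return best


# The two example strings from the post, upper-cased.
S1 = "CGTGTCGCGATACATAGTACaggagtaaagagatGaggaTacgATAGCGATACGATACAGATGTCGGAGAATC".upper()
S2 = "TTCCTACGCAGaggagtaaagagatCaggaCacgGAGCTGATCGGTCGACCATCATTATA".upper()

for err in (0, 1, 2):
    print(err, extend_with_errors(S1, S2, 20, 11, err))
```

Starting from indices 20 and 11 this prints 14, 19 and 23 for ERR = 0, 1 and 2, the values quoted above. A full solution would still have to try every pair of starting indices, which is where the running time goes.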

The two TEST strings will arrive soon, and they are expected to be very, very long.

Here are the two TEST strings:
http://www.usaupload.net/d/s4b7z8fpakd
The first is about 100,000 nucleotides long and the second about 80,000.
So the two tests to run are:
1) Find the longest identical subsequence (prototype: Function(string1, string2))
2) Find the longest subsequence containing at most 2 errors (prototype: Function(string1, string2, ERR))

Let's try it this way; if the results don't satisfy us, we'll look for two new DNA sequences.
For the second exercise, let's assume that the number of errors will always be reasonably low.

cionci

28-07-2008, 18:50

Can you give us an idea of how long they are?

gugoXX

28-07-2008, 19:25

Can you give us an idea of how long they are?

I need to estimate the running times first.
I'd say 200,000 characters should be a good starting point, especially for the second algorithm.

shinya

29-07-2008, 09:35

I don't want to spoil the party... but unfortunately this is a well-known problem, already solved in the literature.

I wouldn't want it to turn into "my C version has a bug but it's blazing fast" vs. "I solved it in 3 lines of Blurp!"...

gugoXX

29-07-2008, 09:36

I don't want to spoil the party... but unfortunately this is a well-known problem, already solved in the literature.

I wouldn't want it to turn into "my C version has a bug but it's blazing fast" vs. "I solved it in 3 lines of Blurp!"...

I know that about the error-free version... but with errors too? :(

shinya

29-07-2008, 10:16

I know that about the error-free version... but with errors too? :(

I'm definitely no expert in the field... but a search on CiteSeer turns up countless papers on the subject (with and without errors)...
There are obviously lots of variants, so if we don't fixate on a single solution and race on speed/bugs, this could turn into an interesting thread :)

Giullo

29-07-2008, 13:20

As soon as I get home I'll take a serious look at it.

Oh, nice one gugo, by now your contests are one of the main reasons I visit this forum (for me, of course) :)

rеpne scasb

29-07-2008, 13:48

Given ...[CUT]

I see a new contest. I hope that doesn't mean you've dropped out of contest 3. :p

gugoXX

30-07-2008, 23:12

Come on, I'll place the first bet.
For the second exercise, I bet
a string 29 characters long, with 2 errors,
found in the shameful time of 174 seconds = about 3 minutes

sottovento

31-07-2008, 14:34

I don't want to spoil the party... but unfortunately this is a well-known problem, already solved in the literature.

I wouldn't want it to turn into "my C version has a bug but it's blazing fast" vs. "I solved it in 3 lines of Blurp!"...

You're right, but... why "unfortunately"? For once an excellent solution exists, and we can put it in our toolbox... ;)

The algorithm is LCS; it works very well and suits this kind of problem. What's more, it is easy to adapt to the next problem too, that is, finding the longest sequence containing at most a given number of errors (a single integer is enough).
LCS is the algorithm used by Unix diff, to be clear, and it is widely documented online.

Vincenzo1968

31-07-2008, 14:51

For question 1 I'm using the Knuth-Morris-Pratt algorithm (which is based on the concept of finite-state automata, which I love ;) ).

The idea is to start from a search string half as long as the shorter of the two. If it isn't found, I halve the length again, and so on until I find it (if instead I do find it, I don't halve but increase the length of the search string).
The search then continues on lengths greater (smaller) than this last one and smaller (greater) than that of the last unsuccessful search.
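
The bisection on the length works because the property is monotone: if a common substring of length L exists, one of every shorter length exists too. A hedged sketch in Python follows; it swaps Knuth-Morris-Pratt for a plain dictionary of substrings to stay short, and the names `common_at_length` and `longest_common_by_bisection` are invented for this illustration.

```python
def common_at_length(s1, s2, length):
    """Return (pos1, pos2) of some common substring of exactly this length,
    or None if there is none."""
    if length == 0:
        return (0, 0)
    first_seen = {}
    for i in range(len(s1) - length + 1):
        first_seen.setdefault(s1[i:i + length], i)   # keep the first occurrence
    for j in range(len(s2) - length + 1):
        i = first_seen.get(s2[j:j + length])
        if i is not None:
            return (i, j)
    return None


def longest_common_by_bisection(s1, s2):
    """Binary search on the length; returns (length, pos1, pos2)."""
    lo, hi = 0, min(len(s1), len(s2))    # a match of length lo is known to exist
    best = (0, 0)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        hit = common_at_length(s1, s2, mid)
        if hit is None:
            hi = mid - 1                 # too long: look at shorter lengths
        else:
            lo, best = mid, hit          # found: try something longer
    return lo, best[0], best[1]


print(longest_common_by_bisection("ACGTAC", "ACCTAC"))
```

Only about log2(min(n, m)) lengths are tested, instead of every length from the longest down.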

So far I've tried it on the example strings and it works:

http://www.guidealgoritmi.it/images/ImgForums/contest04_01.jpg

This is the code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* KMP failure table: needs one entry more than the pattern length */
static int *next;

void InitNext(char *p)
{
    int i, j;
    int M = strlen(p);

    next[0] = -1;

    for ( i = 0, j = -1; i < M; i++, j++, next[i] = j )
    {
        while ( (j >= 0) && (p[i] != p[j]) )
            j = next[j];
    }
}

/* Returns the position of p in a, or strlen(a) if p does not occur */
int KnuthMorrisPratt_search(char *p, char *a)
{
    int i, j;
    int M = strlen(p);
    int N = strlen(a);

    InitNext(p);

    for ( i = 0, j = 0; j < M && i < N; i++, j++ )
    {
        while ( (j >= 0) && (a[i] != p[j]) )
        {
            j = next[j];
        }
    }

    if ( j == M )
        return i - M;
    else
        return i;
}

int main()
{
    int p1_len, p2_len;
    char * strSearch;
    int len;

    int pos1, pos2;

    int bTrovato = 0;

    char * p1 = "CGTGTCGCGATACATAGTACaggagtaaagagatGaggaTacgATAGCGATACGATACAGATGTCGGAGAATC";
    char * p2 = "TTCCTACGCAGaggagtaaagagatCaggaCacgGAGCTGATCGGTCGACCATCATTATA";

    p1_len = strlen(p1);
    p2_len = strlen(p2);

    //len = p2_len/2;
    len = p2_len;

    /* one extra byte for the terminator, one extra table entry for next[M] */
    strSearch = (char*)malloc(sizeof(char)*(len + 1));
    next = (int*)malloc(sizeof(int)*(len + 1));
    if ( !strSearch || !next )
    {
        printf("Memoria insufficiente.\n");
        free(strSearch);
        free(next);
        return -1;
    }

    /* try every length from the longest down and stop at the first hit */
    pos1 = 0;
    while ( len > 0 && !bTrovato )
    {
        /* every window of p2 with this length, including the last one */
        for ( pos2 = 0; pos2 + len <= p2_len; pos2++ )
        {
            memcpy(strSearch, p2 + pos2, len);
            strSearch[len] = '\0';

            pos1 = KnuthMorrisPratt_search(strSearch, p1);
            if ( pos1 < p1_len )
            {
                bTrovato = 1;
                printf("\nStringa '%s' di lunghezza %d trovata nelle posizioni %d e %d\n", strSearch, len, pos1, pos2);
                break;
            }
        }
        if ( !bTrovato )
            len--;
    }

    free(strSearch);
    free(next);

    return 0;
}

VICIUS

31-07-2008, 14:58

Nice, this DNA problem :O
For now I'm trying to implement the algorithm found on Wikipedia, which unfortunately isn't the fastest. I tried it with 1000-character strings: it's rather slow and eats a truckload of RAM. I didn't even try the original files, because it would produce an array of 8 billion elements.
# l[i][j] is the length of the common run that ends at s[i-1] and t[j-1]
def initialize_lcs_matrix(s, t)
  l = Array.new(s.size + 1) { Array.new(t.size + 1, 0) }
  for i in 1..s.size
    for j in 1..t.size
      if s[i-1] == t[j-1]
        l[i][j] = l[i-1][j-1] + 1
      end
    end
  end
  return l
end

# Returns [length, start index in s, start index in t] of the longest run
def get_index_from_matrix(m)
  x = y = z = 0
  for i in 0..m.size-1
    for j in 0..m[i].size-1
      if m[i][j] > z
        z = m[i][j]
        x = i
        y = j
      end
    end
  end
  # m[x][y] marks the END of the run, so step back z cells to get the start
  return [z, x - z, y - z]
end

# Joins the e characters of a starting at index s
def get_string(a, s, e)
  string = ""
  for i in s...(s+e)
    string += a[i]
  end
  return string
end

# Reads a file into an array of single characters, skipping line breaks
def read_dna(file_name)
  dna = Array.new
  File.open(file_name, 'r') do |f|
    f.each do |s|
      s.scan(/./).each do |c|
        dna.push c
      end
    end
  end
  return dna
end

ary_a = read_dna('DNA1.txt')
ary_b = read_dna('DNA2.txt')

lcsm = initialize_lcs_matrix(ary_a, ary_b)
z, x, y = get_index_from_matrix(lcsm)
seq = get_string(ary_a, x, z)

puts "Trovata la sequenza #{seq}"
puts "Che è lunga #{z} caratteri"
puts "Indice nella stringa a #{x}"
puts "Indice nella stringa b #{y}"

And this is the output:
Trovata la sequenza ATAGCTAACACGC
Che è lunga 12 caratteri
Indice nella stringa a 832
Indice nella stringa b 767
Running time: about 6 seconds.

shinya

31-07-2008, 15:50

Nice, this DNA problem :O
For now I'm trying to implement the algorithm found on Wikipedia, which unfortunately isn't the fastest. I tried it with 1000-character strings: it's rather slow and eats a truckload of RAM.

You're using the O(N1 * N2) algorithm. To go fast you need suffix trees. :p
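
To give an idea of the linear-time route, here is a sketch that uses a suffix automaton, a structure related to the suffix tree but shorter to write down. It is not taken from the thread: the function name `longest_common_substring_sam` and the list-based layout are assumptions made for this illustration.

```python
def longest_common_substring_sam(s, t):
    """Longest common substring via a suffix automaton of s.
    Returns (length, start in s, start in t)."""
    nxt, link, length = [{}], [-1], [0]      # state 0 is the root
    last = 0
    for ch in s:                             # standard online construction
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:                            # split q by cloning it
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur

    # Feed t through the automaton, tracking the longest suffix of the
    # text read so far that is also a substring of s.
    state, cur_len, best, best_end = 0, 0, 0, 0
    for idx, ch in enumerate(t):
        while state != 0 and ch not in nxt[state]:
            state = link[state]
            cur_len = length[state]
        if ch in nxt[state]:
            state = nxt[state][ch]
            cur_len += 1
        else:
            cur_len = 0                      # not even a one-letter match
        if cur_len > best:
            best, best_end = cur_len, idx + 1
    start_t = best_end - best
    return best, s.find(t[start_t:best_end]), start_t


print(longest_common_substring_sam("ACGTAC", "ACCTAC"))
```

Construction and scan are both linear in the string lengths for a fixed alphabet, at the price of noticeably more memory per character than the plain strings.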

cionci

31-07-2008, 16:20

I'd put together a nice algorithm, but as soon as it goes above 8000 elements it runs out of memory. I already know how to complete the search, by splitting the problem up, but I think it'll take me quite a while to set everything up.
For 8000 elements:

time ./Contest4
8000 8000
First string position: 7782 Matched string: ttaggtggtggtaaag
Second string position: 6381 Matched string: ttaggtggtggtaaag

real 0m4.230s
user 0m1.828s
sys 0m1.072s

DanieleC88

31-07-2008, 17:14

I tried to solve the first part of the problem: on this archaic computer it takes a whopping 13 minutes... :D

Here's the program's code:
from time import perf_counter

def leggiFile(percorso):
    # Read the whole file, drop line breaks and normalise the case
    with open(percorso) as f:
        stringa = "".join(linea.strip() for linea in f)
    return stringa.lower()

def trovaSequenza(stringa1, stringa2):
    # Sliding window over stringa1: grow it while it still occurs in
    # stringa2, otherwise shift it one position to the right
    x = 0
    coordinate = ()
    lunghezza = 0
    sequenza = ""
    while x + lunghezza < len(stringa1):
        sequenza += stringa1[x + lunghezza]
        try:
            y = stringa2.index(sequenza)
            coordinate = (x, y)
            lunghezza += 1
        except ValueError:
            x += 1
            sequenza = sequenza[1:]
    return lunghezza, coordinate

def msec(t1, t2):
    return (t2-t1)*1000.0

origine = "DNA"
print("Lettura in corso...", end=" ")
t1 = perf_counter()
stringa1 = leggiFile(origine + "1.txt")
stringa2 = leggiFile(origine + "2.txt")
t2 = perf_counter()
print("completata in", msec(t1, t2), "millisecondi.")
print()

print("Cerco la stringa di lunghezza massima...")
t1 = perf_counter()
lunghezza, coordinate = trovaSequenza(stringa1, stringa2)
t2 = perf_counter()

x, y = coordinate
print(" ->", lunghezza, "caratteri alla posizione", coordinate)
print(" -> sottosequenza della stringa 1:", stringa1[x:(x+lunghezza)])
print(" -> sottosequenza della stringa 2:", stringa2[y:(y+lunghezza)])
print("completato in", msec(t1, t2), "millisecondi.")
And here's the output I get (without the two subsequences printed, because I only added that just now and I don't feel like waiting another 13 minutes :p):
Lettura in corso... completata in 252.805109036 millisecondi.

Cerco la stringa di lunghezza massima...
-> 21 caratteri alla posizione (10002, 40002)
completato in 777964.344022 millisecondi.

sottovento

31-07-2008, 17:42

I tried it too, limiting the strings to 8000 characters.

I used the LCS algorithm in Java, on my DELL Latitude D830 laptop. Result: 800 milliseconds.

I don't expect big differences in time for the second part of the contest.

grigor91

31-07-2008, 17:46

I tried to solve the first part of the problem: on this archaic computer it takes a whopping 13 minutes... :D

Here's the program's code:
code ...

I tried running it on my PC and these are the timings:
Lettura in corso... completata in 10.0537917529 millisecondi.

Cerco la stringa di lunghezza massima...
-> 21 caratteri alla posizione (10002, 40002)
-> sottosequenza della stringa 1: actgtcctgtcaacaaggagt
-> sottosequenza della stringa 2: actgtcctgtcaacaaggagt
completato in 38560.2209219 millisecondi.

gugoXX

31-07-2008, 18:03

I tried it too, limiting the strings to 8000 characters.

I used the LCS algorithm in Java, on my DELL Latitude D830 laptop. Result: 800 milliseconds.

I don't expect big differences in time for the second part of the contest.

The problem with LCS is that it requires a rectangular MxN matrix, and M and N for DNA sequences are sooooo long :D
Joking aside, how do you plan to reconcile LCS with the handling of the errors allowed in the strings?

cionci

31-07-2008, 18:04

First part only, with 200,000 characters in string one and 200,000 in string two:

~$ time Contest4
200000 200000
First string position: 96827 Matched string: aaaaggtagtttggtagtggatt
Second string position: 134120 Matched string: aaaaggtagtttggtagtggatt

real 0m21.040s
user 0m21.001s
sys 0m0.016s

Home-made algorithm based on a tree that exhaustively enumerates all possible strings.
I'll tidy up the code a bit and post it.

gugoXX

31-07-2008, 18:05

For question 1 I'm using the Knuth-Morris-Pratt algorithm (which is based on the concept of finite-state automata, which I love ;) ).

I think it's a very promising route.
I'm not taking it myself (and I'm paying for it), but I believe it's one of the best. Keep us posted...

cionci

31-07-2008, 18:06

gugoXX, I hadn't seen that you'd posted the files with the strings; I'll try them now.
Would you post the result so we can verify?

sottovento

31-07-2008, 18:07

The problem with LCS is that it requires a rectangular MxN matrix, and M and N for DNA sequences are sooooo long :D

Yes, that's the main problem. There are solutions in the literature, but I never bothered to learn them :D

Joking aside, how do you plan to reconcile LCS with the handling of the errors allowed in the strings?
A simple counter: if the characters in question are equal, no problem. Otherwise decrement the number of errors still allowed. That's it

sottovento

31-07-2008, 18:23

Here's the solution:

An Improved Longest Common Subsequence Algorithm for Reducing Memory Complexity in Global Alignment of DNA Sequences
Parvinnia, E.; Taheri, M.; Ziarati, K.
BioMedical Engineering and Informatics, 2008. BMEI 2008. International Conference on
Volume 1, 27-30 May 2008, Page(s): 57-61
Digital Object Identifier 10.1109/BMEI.2008.212

You can find it at
http://ieeexplore.ieee.org/iel5/4548614/4548615/04548635.pdf

cionci

31-07-2008, 18:23

gugoXX... I meant the result for point A ;) You've already posted the one for point B...
I've already found a solution of my own; I needed the values to verify it...

VICIUS

31-07-2008, 18:27

The problem with LCS is that it requires a rectangular MxN matrix, and M and N for DNA sequences are sooooo long :D
Joking aside, how do you plan to reconcile LCS with the handling of the errors allowed in the strings?

The wiki suggests keeping only the previous row and column in memory, and using a hash that holds only the non-zero cells. That way the memory required is roughly m+n, so it's no longer a problem. The algorithm is still frighteningly slow, though. :muro:
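
A possible reading of that suggestion, sketched in Python: keep only the previous row of the table, stored as a dictionary with the non-zero cells. The function name `longest_common_substring_sparse` and the per-letter position index are choices made for this example, not something taken from the wiki page.

```python
def longest_common_substring_sparse(s1, s2):
    """Row-by-row table for the longest common substring, keeping only the
    non-zero cells of the previous row. Returns (length, pos1, pos2)."""
    # Positions of each letter in s2, so only matching cells are visited.
    where = {}
    for j, ch in enumerate(s2):
        where.setdefault(ch, []).append(j)

    prev = {}                    # j -> length of the run ending at s1[i-1], s2[j]
    best = (0, 0, 0)
    for i, ch in enumerate(s1):
        cur = {}
        for j in where.get(ch, ()):
            run = prev.get(j - 1, 0) + 1      # extend the diagonal, or start at 1
            cur[j] = run
            if run > best[0]:
                best = (run, i - run + 1, j - run + 1)
        prev = cur               # the older row is no longer needed
    return best


print(longest_common_substring_sparse("ACGTAC", "ACCTAC"))
```

Memory drops to the size of one row, but with a four-letter alphabet roughly a quarter of all cells are still visited, so the running time stays proportional to n*m.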

gugoXX

31-07-2008, 18:28

gugoXX... I meant the result for point A ;) You've already posted the one for point B...
I've already found a solution of my own; I needed the values to verify it...

Sure, if you want I'll post the solutions (I have them at home).
But when you find them it's immediately obvious, since I manually injected the intruders at very precise, easily identifiable positions :)

cionci

31-07-2008, 18:32

Then I hope these match ;)

time Contest4/bin/Release/Contest4
80058 100058
First string position: 40002 Matched string: ACTGTCCTGTCAACAAGGAGT
Second string position: 10002 Matched string: ACTGTCCTGTCAACAAGGAGT

real 0m4.678s
user 0m4.672s
sys 0m0.000s

sottovento

31-07-2008, 19:10

The wiki suggests keeping only the previous row and column in memory, and using a hash that holds only the non-zero cells. That way the memory required is roughly m+n, so it's no longer a problem. The algorithm is still frighteningly slow, though. :muro:

Frighteningly slow? :mbe:
Right now, with a simple test in Java on 8000x8000, as I posted, I got a result in 800 milliseconds (0.8 seconds). That doesn't seem slow to me, what do you say? All the more so since I chose this size to compare against the results posted by someone else (tens of seconds).

If it were that slow, why has Unix been making wide use of it since 1970? And why do the authors of the paper I pointed to in the posts above (researchers in computer science/biotechnology who work specifically on this) describe it as the optimal solution?

VICIUS

31-07-2008, 19:12

Frighteningly slow? :mbe:
Right now, with a simple test in Java on 8000x8000, as I posted, I got a result in 800 milliseconds (0.8 seconds). That doesn't seem slow to me, what do you say? All the more so since I chose this size to compare against the results posted by someone else (tens of seconds).

If it were that slow, why has Unix been making wide use of it since 1970?

OK. My Ruby version is frighteningly slow :P

sottovento

31-07-2008, 19:15

OK. My Ruby version is frighteningly slow :P

I reread my post and got the impression it might come across as aggressive. Far from it!!! Anyway, I apologize if anyone was offended. I return the :P :D

Vincenzo1968

31-07-2008, 19:28

I think it's a very promising route.
I'm not taking it myself (and I'm paying for it), but I believe it's one of the best. Keep us posted...

Apparently not: it takes ages :( (maybe the idea is good but my implementation stinks).

I think shinya is right: to go fast you need suffix trees.
I don't have time to try that solution. I'm moving to the seaside (Cefalù), where unfortunately (or fortunately) I have no internet.
See you all in September :)

Bye everyone.

DanieleC88

31-07-2008, 19:29

I tried running it on my PC and these are the timings:
Lettura in corso... completata in 10.0537917529 millisecondi.

Cerco la stringa di lunghezza massima...
-> 21 caratteri alla posizione (10002, 40002)
-> sottosequenza della stringa 1: actgtcctgtcaacaaggagt
-> sottosequenza della stringa 2: actgtcctgtcaacaaggagt
completato in 38560.2209219 millisecondi.

Hmm... 38 seconds... still monstrously slow, but better than 13 minutes anyway. :D
Thanks :)

cionci

31-07-2008, 19:32

Ah OK, I hadn't seen the results posted by DanieleC88... same as mine ;)
I'm cleaning up the code; there are a couple of errors, though they shouldn't change the order of magnitude of the time it takes to find the result.

grigor91

31-07-2008, 19:35

Hmm... 38 seconds... still monstrously slow, but better than 13 minutes anyway. :D
Thanks :)

At this point, though, to have more precise reference points we also need to take each PC's performance into account. As we've clearly seen, processing time can vary noticeably from one PC to another.

cionci

31-07-2008, 20:29

Here's my solution to problem 1:

#include <iostream>
#include <string>
#include <map>
#include <fstream>
#include <strings.h> // bzero

using namespace std;

#define DICTIONARY_SIZE 250

class DictionaryElement
{
DictionaryElement * next[4];
int position;
public:
DictionaryElement(int position)
{
this->position = position;
bzero(next, sizeof(next));
};

DictionaryElement *addNext(int index, int position)
{
if (next[index])
return next[index];

return next[index] = new DictionaryElement(position);
};

DictionaryElement *getNext(int index)
{
return next[index];
};

int getPosition()
{
return position;
};

~DictionaryElement()
{
if(next[0])
delete next[0];
if(next[1])
delete next[1];
if(next[2])
delete next[2];
if(next[3])
delete next[3];
}
};

class Contest4
{
string str1, str2;
map<char,int> mapping;
public:
Contest4()
{

mapping['A'] = 0;
mapping['C'] = 1;
mapping['G'] = 2;
mapping['T'] = 3;

ifstream f1("file1.txt");
f1 >> str1;
f1.close();

ifstream f2("file2.txt");
f2 >> str2;
f2.close();

cout << str1.length() << " " << str2.length() << endl;
};

DictionaryElement *buildDictionary(int start)
{
DictionaryElement *current;
int len = (str1.length() < start + DICTIONARY_SIZE) ? str1.length() : start + DICTIONARY_SIZE;
DictionaryElement *base = new DictionaryElement(-1);
for (int i = start; i < len; ++i)
{
current = base;

for (int j = i; j < len; ++j)
{
current = current->addNext(mapping[str1.at(j)], i);
}
}
return base;
};

void searchLongestMatchingSubstring()
{
int maxPos1 = -1, maxPos2 = -1, maxLength = 0;

int j, k;
int len2 = str2.length();
int len1 = str1.length();
DictionaryElement *current, *next;
DictionaryElement *base;

for (int start = 0; start < len1; start += DICTIONARY_SIZE)
{
base = buildDictionary(start);

for (int i = 0; i < len2; ++i)
{
current = base;
for (j = i; j < len2; ++j)
{
next = current->getNext(mapping[str2.at(j)]);
if (!next)
{
k = current->getPosition() + j - i;
if (k == start + DICTIONARY_SIZE)
{
while(1)
{
if(j >= len2 || k >= len1)
break;
if(str2.at(j) != str1.at(k))
break;
++j;
++k;
}

}
if (j-i > maxLength)
{
maxLength = j - i;
maxPos1 = current->getPosition();
maxPos2 = i;
}
break;
}
current = next;
}

if (current == next && j-i > maxLength)
{
maxLength = j - i;
maxPos1 = current->getPosition();
maxPos2 = i;
}
}
delete base;
}
cout << "First string position: " << maxPos1 << " Matched string: " << str1.substr(maxPos1, maxLength) << endl;
cout << "Second string position: " << maxPos2 << " Matched string: " << str2.substr(maxPos2, maxLength) << endl;
};
};

int main()
{
Contest4 contest;
contest.searchLongestMatchingSubstring();
return 0;
}

It's still ugly, but it flies ;)
~$ time ./Contest4
80058 100058
First string position: 40002 Matched string: ACTGTCCTGTCAACAAGGAGT
Second string position: 10002 Matched string: ACTGTCCTGTCAACAAGGAGT

real 0m4.584s
user 0m4.580s
sys 0m0.008s

Now I should add the errors, let's see... it's not that straightforward.
Edit: found a big bug ;)

cionci

31-07-2008, 20:46

A few words on the solution: in practice it enumerates ALL the substrings by building a tree. Each node has four possible children, one per letter.
ONLY the position of the first substring with a given sequence is stored. That is the main trick.
In the string AACCTGAACCGTTT, the string AACC is stored only once, as are the strings AA, CC... in fact the path for AACC contains the string A, the string AA, the string AAC and the string AACC, storing only the first occurrence of each. Later occurrences do not insert or modify nodes in the tree.
So to search, you just walk the tree following the current element: if the current element is present, the walk advances. The walk ends when a null element is reached.
Tree construction is limited to 250 characters at a time, for empirical performance reasons and to limit memory use. If the string being checked straddles a 250-character boundary, a character-by-character comparison is used.
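
The description above can be sketched compactly in Python. This is a simplified illustration, not a port of the C++ program: it builds one tree over the whole first string and leaves out the 250-character chunking and the boundary comparison, and the names `build_substring_trie`, `longest_match` and `longest_common_substring_trie` are invented here.

```python
def build_substring_trie(s):
    """Tree of every substring of s. Each node remembers the start index
    of the FIRST substring that created it; later occurrences reuse it."""
    root = {"pos": -1, "next": {}}
    for i in range(len(s)):
        node = root
        for j in range(i, len(s)):
            # setdefault returns the existing child untouched if there is one
            node = node["next"].setdefault(s[j], {"pos": i, "next": {}})
    return root


def longest_match(root, t, j):
    """Walk the tree with t[j:], stopping at the first missing child.
    Returns (matched length, start index in the first string)."""
    node, length = root, 0
    while j + length < len(t):
        child = node["next"].get(t[j + length])
        if child is None:
            break
        node, length = child, length + 1
    return length, node["pos"]


def longest_common_substring_trie(s, t):
    """Try every start position in t; returns (length, pos in s, pos in t)."""
    root = build_substring_trie(s)
    best = (0, -1, -1)
    for j in range(len(t)):
        length, pos = longest_match(root, t, j)
        if length > best[0]:
            best = (length, pos, j)
    return best


root = build_substring_trie("AACCTGAACCGTTT")
print(longest_match(root, "AACC", 0))     # AACC is recorded at its first position
print(longest_common_substring_trie("AACCTGAACCGTTT", "TTGAACCGA"))
```

Without chunking the tree grows quadratically with the length of the first string, which is presumably why the real program rebuilds it for 250 characters at a time.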

gugoXX

31-07-2008, 21:48

A few words on the solution: in practice it enumerates ALL the substrings by building a tree. Each node has four possible children, one per letter.
ONLY the position of the first substring with a given sequence is stored. That is the main trick.
In the string AACCTGAACCGTTT, the string AACC is stored only once, as are the strings AA, CC... in fact the path for AACC contains the string A, the string AA, the string AAC and the string AACC, storing only the first occurrence of each. Later occurrences do not insert or modify nodes in the tree.
So to search, you just walk the tree following the current element: if the current element is present, the walk advances. The walk ends when a null element is reached.
Tree construction is limited to 250 characters at a time, for empirical performance reasons and to limit memory use. If the string being checked straddles a 250-character boundary, a character-by-character comparison is used.

Nice, and it looks pretty efficient too.

DanieleC88

31-07-2008, 22:00

Yeah, it's definitely a very nice solution... who knows if I'll ever be able to come up with something like that. :stordita: :cry: (I'm standing in for by_to_by :asd: )

gugoXX

01-08-2008, 00:57

For the solution to point B, 93 seconds.
I can't get any lower for now, and I even cheated (parallelism, with the resulting core-meltdown temperature).
I guess I really have to change everything.

Maximum common length 29
Maximum substring indexes 50002,70002
Substring1: ACTGTCCTGAAGATCGCTTGGCATCTCCG
Substring2: ACTGTCCTGCAGATCGCTTTGCATCTCCG
92795ms

cionci

01-08-2008, 07:59

In mine it's going to be a real problem to add the errors :muro:

DanieleC88

01-08-2008, 12:06

When's contest 5? :D

cionci

01-08-2008, 13:16

For the solution to point B, 93 seconds.
I can't get any lower for now, and I even cheated (parallelism, with the resulting core-meltdown temperature).
I guess I really have to change everything.
But how long does the error-free search alone take you?

gugoXX

01-08-2008, 13:24

But how long does the error-free search alone take you?

:) Skipped for now; somewhere I've written a Knuth-Morris-Pratt in C# that I could adapt.
Sure, I can use the error-tolerant algorithm with 0 errors, but obviously it would be just as pitiful and wouldn't take much less than 90 seconds :(
But I'm thinking about something else.

sottovento

01-08-2008, 13:43

Here's another solution. In pure C, not C++.

#include <stdlib.h>
#include <string.h>

/* Longest common substring with two rolling rows of the LCS-style table.
 * row[j] is the length of the common run ending at text1[i-1] and text2[j-1].
 * Returns 0 on success, -1 if memory cannot be allocated. */
int max_length (char *text1, char *text2, int *out_max, int *out_pos1, int *out_pos2)
{
    int l1, l2;
    int *row, *lastrow, *tmp;
    int i, j;
    int max, pos1, pos2;

    l1 = (int)strlen (text1);
    l2 = (int)strlen (text2);

    /* calloc zeroes both rows; column 0 is never written and stays 0 */
    row = (int *)calloc (l2 + 1, sizeof (int));
    if (!row) return -1;
    lastrow = (int *)calloc (l2 + 1, sizeof (int));
    if (!lastrow)
    {
        free (row);
        return -1;
    }

    max = 0;
    pos1 = -1;
    pos2 = -1;

    for (i = 1; i <= l1; i++)
    {
        for (j = 1; j <= l2; j++)
        {
            if (text1[i-1] == text2[j-1])
            {
                /* extend the run coming from the previous diagonal cell */
                row[j] = lastrow[j-1] + 1;
                if (row[j] > max)
                {
                    max = row[j];
                    pos1 = i - max;
                    pos2 = j - max;
                }
            }
            else
                row[j] = 0;
        }

        /* swap the rows instead of copying them */
        tmp = lastrow;
        lastrow = row;
        row = tmp;
    }

    *out_max = max;
    *out_pos1 = pos1;
    *out_pos2 = pos2;

    free (row);
    free (lastrow);
    return 0;
}

cionci

01-08-2008, 15:50

Wow, I wouldn't have thought... mine is an order of magnitude faster :eek:
What algorithm would the one above be?

cionci

01-08-2008, 15:54

I've made it even faster :eek:
~$ time Contest4
80058 100058
First string position: 40002 Matched string: ACTGTCCTGTCAACAAGGAGT
Second string position: 10002 Matched string: ACTGTCCTGTCAACAAGGAGT

real 0m1.929s
user 0m1.928s
sys 0m0.004s

#include <iostream>
#include <string>
#include <map>
#include <fstream>
#include <strings.h> // bzero

using namespace std;

#define DICTIONARY_SIZE 250
//#define DICTIONARY_SIZE 4

class DictionaryElement
{
DictionaryElement * next[4];
int position;
public:
DictionaryElement(int position)
{
this->position = position;
bzero(next, sizeof(next));
};

DictionaryElement *addNext(int index, int position)
{
if (next[index])
return next[index];

return next[index] = new DictionaryElement(position);
};

DictionaryElement *getNext(int index)
{
return next[index];
};

int getPosition()
{
return position;
};

~DictionaryElement()
{
if (next[0])
delete next[0];
if (next[1])
delete next[1];
if (next[2])
delete next[2];
if (next[3])
delete next[3];
}
};

class Contest4
{
int mapping[256];
int maxPos1, maxPos2, maxLength;
int len1, len2;
string str1, str2;
public:

Contest4()
{
maxPos1 = -1;
maxPos2 = -1;
maxLength = 0;

mapping['A'] = 0;
mapping['C'] = 1;
mapping['G'] = 2;
mapping['T'] = 3;

ifstream f1("/home/cionci/file1.txt");
f1 >> str1;
f1.close();

ifstream f2("/home/cionci/file2.txt");
f2 >> str2;
f2.close();

len1 = str1.length();
len2 = str2.length();
cout << str1.length() << " " << str2.length() << endl;

};

DictionaryElement *buildDictionary(int start)
{
DictionaryElement *current;
int len = (str1.length() < start + DICTIONARY_SIZE) ? str1.length() : start + DICTIONARY_SIZE;
DictionaryElement *base = new DictionaryElement(-1);
for (int i = start; i < len; ++i)
{
current = base;

for (int j = i; j < len; ++j)
{
current = current->addNext(mapping[str1.at(j)], i);
}
}
return base;
};

void searchLongestMatchingSubstring()
{
int maxPos1 = -1, maxPos2 = -1, maxLength = 0;

int j, k;
int len2 = str2.length();
int len1 = str1.length();
DictionaryElement *current, *next;
DictionaryElement *base;

for (int start = 0; start < len1; start += DICTIONARY_SIZE)
{
base = buildDictionary(start);

for (int i = 0; i < len2; ++i)
{
current = base;
for (j = i; j < len2; ++j)
{
next = current->getNext(mapping[str2.at(j)]);
if (!next)
{
k = current->getPosition() + j - i;
if (k == start + DICTIONARY_SIZE)
{
while (1)
{
if (j >= len2 || k >= len1)
break;
if (str2.at(j) != str1.at(k))
break;
++j;
++k;
}

}
if (j-i > maxLength)
{
maxLength = j - i;
maxPos1 = current->getPosition();
maxPos2 = i;
}
break;
}
current = next;
}

if (current == next && j-i > maxLength)
{
maxLength = j - i;
maxPos1 = current->getPosition();
maxPos2 = i;
}
}
delete base;
}
cout << "First string position: " << maxPos1 << " Matched string: " << str1.substr(maxPos1, maxLength) << endl;
cout << "Second string position: " << maxPos2 << " Matched string: " << str2.substr(maxPos2, maxLength) << endl;
};
};

sottovento

01-08-2008, 15:55

Wow, I wouldn't have thought... mine is an order of magnitude faster :eek:
What algorithm would the one above be?

A customized variant of LCS. I tried it alongside your algorithm on a few computers and... I realized I'm old!
I win on old computers and lose on new ones. I still haven't figured out why. Anyway, the difference is a few seconds on the computers I tried, not orders of magnitude.

And I still can't work out what makes the difference, given that in theory yours would be O(nmm) while "mine" is only O(nm)...

cionci

01-08-2008, 15:59

I dropped yours into the C++ code, that's probably why. Same compilation options. 47 seconds. With mine I'm now under 2 seconds, having removed the map used to convert from letter to integer and added an array instead.

cionci

01-08-2008, 16:03

Ah, obviously mine works best with the shorter string in str1 ;)
True, it's O(n*m*m), but as soon as the check on a substring fails it stops checking. So the actual number of iterations is very different, though hard to calculate.
I'll add a counter now and see how many it does.

cionci

01-08-2008, 16:10

The number of iterations is 156,904,592, so about 1/50 of n*m.
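
The ratio can be checked with a couple of lines of Python, assuming n and m are the two lengths printed by the program above (80058 and 100058):

```python
n, m = 80058, 100058          # string lengths printed by the program
full_table = n * m            # cells a complete n*m table would have
measured = 156904592          # iterations reported by the counter

ratio = full_table / measured
print(full_table)             # 8010443364
print(round(ratio, 1))        # 51.1
```

So the early exit visits roughly one cell in fifty of the full table, in line with the estimate.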

cionci

02-08-2008, 08:56

To write the solution to the second problem using the tree from point A, I had to resort to recursion. Result: 126 seconds.
Now I'll try making it iterative with a stack. Even as it is, it's reasonably fast for the error-free search (setting the error count to 0): about 7 seconds.

Ah sottovento... I used the compiler's O3 optimizations, maybe that's why.

cionci

02-08-2008, 11:00

Here is the recursive algorithm made iterative (82 seconds versus 115 seconds for the recursive version):
~$ time Contest4
80058 100058
First string position: 70002
Second string position: 50002
Matched substring in first string: ACTGTCCTGCAGATCGCTTTGCATCTCCG
Matched substring in second string: ACTGTCCTGAAGATCGCTTGGCATCTCCG

real 1m22.685s
user 1m21.377s
sys 0m0.064s

I'm attaching both the iterative and the recursive versions:

#include <iostream>
#include <string>
#include <map>
#include <fstream>
#include <strings.h> // bzero

using namespace std;

#define DICTIONARY_SIZE 1000

class DictionaryElement
{
DictionaryElement * next[4];
int position;
public:
DictionaryElement(int position)
{
this->position = position;
bzero(next, sizeof(next));
};

DictionaryElement *addNext(int index, int position)
{
if (next[index])
return next[index];

return next[index] = new DictionaryElement(position);
};

DictionaryElement *getNext(int index)
{
return next[index];
};

int getPosition()
{
return position;
};

~DictionaryElement()
{
if (next[0])
delete next[0];
if (next[1])
delete next[1];
if (next[2])
delete next[2];
if (next[3])
delete next[3];
}
};

struct MatchingPosition
{
int positionOne;
int positionTwo;
int length;
MatchingPosition()
{
positionOne = -1;
positionTwo = -1;
length = 0;
};
};

class Contest4
{
int mapping[256];
int maxPos1, maxPos2, maxLength;
int len1, len2;

DictionaryElement *buildDictionary(int start)
{
DictionaryElement *current;
int len = (str1.length() < start + DICTIONARY_SIZE) ? str1.length() : start + DICTIONARY_SIZE;
DictionaryElement *base = new DictionaryElement(-1);
for (int i = start; i < len; ++i)
{
current = base;

for (int j = i; j < len; ++j)
{
current = current->addNext(mapping[str1.at(j)], i);
}
}
return base;
}

void recursiveSearch(int errors, int j, DictionaryElement *current, MatchingPosition &position)
{
if (errors < 0 || j == len2)
return;

DictionaryElement *next;
DictionaryElement *rightChoice = current->getNext(mapping[str2.at(j)]);
bool checkLength = true;

for (int i = 0; i < 4; ++i)
{
next = current->getNext(i);
int localErrors = errors - ((rightChoice == next) ? 0 : 1);
if (next && localErrors >= 0)
{
recursiveSearch(localErrors, j + 1, next, position);
checkLength = false;
}
}

if (checkLength)
{
int length = j - position.positionTwo;
if (length == DICTIONARY_SIZE)
{
int k = current->getPosition() + length;
int localError = errors;
while (1)
{
if (j >= len2 || k >= len1)
break;
if (str2.at(j) != str1.at(k))
if (localError-- <= 0)
break;
++j;
++k;
}
}
if (length > position.length)
{
position.length = length;
position.positionOne = current->getPosition();
}
}
}

void iterativeSearch(int errors, int j, DictionaryElement *current, MatchingPosition &position)
{
struct StackElement
{
int j;
DictionaryElement *next;
int localErrors;
};

StackElement stack[DICTIONARY_SIZE * 4];
int stackTop = 0;

DictionaryElement *next, *rightChoice;

stack[0].j = j;
stack[0].localErrors = errors;
stack[0].next = current;
stackTop++;

while (stackTop > 0)
{
--stackTop;
current = stack[stackTop].next;
errors = stack[stackTop].localErrors;
j = stack[stackTop].j;
bool checkLength = true;

if(j < len2)
{
rightChoice = current->getNext(mapping[str2.at(j)]);

for (int i = 0; i < 4; ++i)
{
next = current->getNext(i);
int localErrors = errors - ((rightChoice == next) ? 0 : 1);
if (next && localErrors >= 0)
{
stack[stackTop].j = j + 1;
stack[stackTop].next = next;
stack[stackTop++].localErrors = localErrors;
checkLength = false;
}
}
}
if (checkLength)
{
int length = j - position.positionTwo;
if (length == DICTIONARY_SIZE)
{
int k = current->getPosition() + length;
int localError = errors;
while (1)
{
if (j >= len2 || k >= len1)
break;
if (str2.at(j) != str1.at(k))
if (localError-- <= 0)
break;
++j;
++k;
}
}
if (length > position.length)
{
position.length = length;
position.positionOne = current->getPosition();
}
}
}
}

public:
string str1, str2;

Contest4()
{
maxPos1 = -1;
maxPos2 = -1;
maxLength = 0;

mapping['A'] = 0;
mapping['C'] = 1;
mapping['G'] = 2;
mapping['T'] = 3;

ifstream f1("/home/cionci/file1.txt");
f1 >> str1;
f1.close();

ifstream f2("/home/cionci/file2.txt");
f2 >> str2;
f2.close();

len1 = str1.length();
len2 = str2.length();
cout << str1.length() << " " << str2.length() << endl;

}

void searchLongestMatchingSubstringWithErrors(int errors)
{
DictionaryElement *base;
MatchingPosition longestMatching;

for (int start = 0; start < len1; start += DICTIONARY_SIZE)
{
base = buildDictionary(start);

for (int i = 0; i < len2; ++i)
{
MatchingPosition matchingPosition;
matchingPosition.positionTwo = i;
iterativeSearch(errors, i, base, matchingPosition);
if (matchingPosition.length > longestMatching.length)
{
longestMatching = matchingPosition;
}
}
delete base;
}
cout << "First string position: " << longestMatching.positionOne << endl;
cout << "Second string position: " << longestMatching.positionTwo << endl;
cout << "Matched substring in first string: " << str1.substr(longestMatching.positionOne, longestMatching.length) << endl;
cout << "Matched substring in second string: " << str2.substr(longestMatching.positionTwo, longestMatching.length) << endl;
}
};

Vincenzo1968

02-08-2008, 22:13

... To go fast you need to use suffix trees. :p

True indeed ;)

http://www.guidealgoritmi.it/images/ImgForums/dna.jpg

For the suffix tree implementation I'm using the files I downloaded from here:

http://mila.cs.technion.ac.il/~yona/suffix_tree/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>
#include "suffix_tree.h"

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

#define BUFFER_SIZE 4096

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
int numread = 0;
int dimFile = 0;
int count = 0;
fpos_t pos;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

if( fgetpos(fp, &pos) != 0 )
{
fclose(fp);
return 0;
}

fclose(fp);

return (int)pos;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
int p1_len, p2_len;
char * strSearch;
size_t len;
int k;

char strTemp[1024] = "";
int bTrovato;

SUFFIX_TREE* tree;

clock_t c_start, c_end;

DBL_WORD pos1, pos2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

strSearch = (char*)malloc(sizeof(char)*p1_len);
if ( !strSearch )
{
printf("Memoria insufficiente.\n");
return -1;
}
strSearch[0] = '\0';

tree = ST_CreateTree(p1, p1_len);

len = p2_len/10;
pos2 = 0;
memcpy(strSearch, p2 + pos2, len);
*(strSearch + len) = '\0';

bTrovato = 0;
pos1 = 0;
pos2 = 0;
while ( len > 0 ) /* len is a size_t, so "len >= 0" was always true */
{
while ( p2_len - (pos2 + len) > 0 )
{
pos1 = ST_FindSubstring(tree, strSearch, len);
if ( pos1 != ST_ERROR )
{
bTrovato = 1;
k = 0;
len = 0;
while ( *(strSearch + k++) )
len++;
k = 0;
while ( *(p2 + pos2 + len + k) == *(p1 + (pos1 - 1) + len + k ) )
{
*(strSearch + len + k) = *(p2 + pos2 + len + k);
k++;
}
*(strSearch + len + k ) = '\0';
sprintf(strTemp, "\nStringa '%s' di lunghezza %d trovata nelle posizioni %d e %d\n", strSearch, len + k, pos1 - 1, pos2);
break;
}
else
{
if ( strTemp[0] != '\0' )
break;
pos2++;
memcpy(strSearch, p2 + pos2, len);
*(strSearch + len) = '\0';
}
}
if ( bTrovato )
break;
len /= 2;
pos1 = 0;
pos2 = 0;
}

if ( bTrovato )
printf("%s", strTemp);

c_end = clock();

printf( "Tempo impiegato -> %2.1f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

free(p1);
free(p2);
free(strSearch);
ST_DeleteTree(tree);

return 0;
}

cionci

03-08-2008, 08:36

Really fast... on my computer it's nine tenths of a second faster than mine. I had never seen suffix trees before; I was getting close to the idea :D

Vincenzo1968

03-08-2008, 13:49

Really fast... on my computer it's nine tenths of a second faster than mine. I had never seen suffix trees before; I was getting close to the idea :D

Hi cionci,

I had never seen them before either. The code I posted at the beginning, run on the two files, took 27 minutes.
After the holidays I'm off to buy Gusfield's book (http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198).

:)
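For readers who want to try a linear-time approach without the external library, here is a minimal sketch using a suffix automaton, a structure related to the suffix tree. It is a swapped-in technique, not the suffix tree code used above, and it assumes the input contains only A, C, G and T. All names (`LcsResult`, `longestCommonSubstring`, `State`) are hypothetical.

```cpp
#include <array>
#include <stdexcept>
#include <string>
#include <vector>

struct LcsResult { int length = 0; int pos1 = -1; int pos2 = -1; };

namespace {
int baseIndex(char c)
{
    switch (c) {
    case 'A': return 0; case 'C': return 1;
    case 'G': return 2; case 'T': return 3;
    default: throw std::invalid_argument("expected only A, C, G, T");
    }
}
struct State {
    int len = 0;       // length of the longest substring in this state
    int link = -1;     // suffix link
    int firstEnd = -1; // end index of the first occurrence in the first string
    std::array<int, 4> next{{-1, -1, -1, -1}};
};
}

// Builds the suffix automaton of `a`, then feeds `b` through it while
// tracking the longest suffix of the consumed prefix that occurs in `a`.
LcsResult longestCommonSubstring(const std::string &a, const std::string &b)
{
    std::vector<State> st(1);
    st.reserve(2 * a.size() + 1);
    int last = 0;
    for (int i = 0; i < static_cast<int>(a.size()); ++i) {
        const int c = baseIndex(a[i]);
        const int cur = static_cast<int>(st.size());
        st.push_back(State());
        st[cur].len = st[last].len + 1;
        st[cur].firstEnd = i;
        int p = last;
        while (p != -1 && st[p].next[c] == -1) { st[p].next[c] = cur; p = st[p].link; }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            const int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                const int clone = static_cast<int>(st.size());
                State copy = st[q]; // keeps q's transitions, link and firstEnd
                copy.len = st[p].len + 1;
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) { st[p].next[c] = clone; p = st[p].link; }
                st[q].link = clone;
                st[cur].link = clone;
            }
        }
        last = cur;
    }

    LcsResult best;
    int v = 0, l = 0; // current state and current match length
    for (int j = 0; j < static_cast<int>(b.size()); ++j) {
        const int c = baseIndex(b[j]);
        while (v != 0 && st[v].next[c] == -1) { v = st[v].link; l = st[v].len; }
        if (st[v].next[c] != -1) { v = st[v].next[c]; ++l; }
        if (l > best.length) {
            best.length = l;
            best.pos1 = st[v].firstEnd - l + 1;
            best.pos2 = j - l + 1;
        }
    }
    return best;
}
```

Construction is linear in the length of the first string and the scan is linear in the length of the second, which is consistent with the large gap between the brute-force timings and the suffix tree timings reported in the thread.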

cionci

04-08-2008, 08:56

Ah Vincenzo... the (int)pos cast goes against the C standard. fpos_t is not guaranteed to be a numeric value; it could just as well be a data structure.
In these cases use ftell ;)
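A minimal sketch of the ftell-based variant suggested here, assuming the file is small enough for its size to fit in a long and that the platform supports seeking to the end of a binary stream (true on common systems, though the C standard does not strictly require it). The function name `fileSize` is hypothetical.

```cpp
#include <cstdio>

// Returns the size of the file in bytes, or -1 on failure.
long fileSize(const char *fileName)
{
    std::FILE *fp = std::fopen(fileName, "rb");
    if (fp == nullptr)
        return -1;

    long size = -1;
    if (std::fseek(fp, 0, SEEK_END) == 0)
        size = std::ftell(fp); // ftell returns a long, or -1L on error

    std::fclose(fp);
    return size;
}
```

Unlike fpos_t, the value returned by ftell is a plain long, so no cast from an opaque type is needed.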

marco.r

04-08-2008, 13:54

I tried implementing a version in Common Lisp.
It's a bit chaotic because I dashed it off over the weekend, but at least it performs decently for subsequences without errors (about 5 and a half seconds on my machine, versus about one second for vincenzo's version).

(declaim (optimize (speed 3) (debug 0) (safety 0)))

(defclass substring ()
((str :initarg :str)
(start :initarg :start)
(end :initarg :end)))

(defmacro str (s)
`(the simple-base-string (slot-value ,s 'str)))
(defmacro start (s)
`(the fixnum (slot-value ,s 'start)))
(defmacro end (s)
`(the fixnum (slot-value ,s 'end)))
(defmacro slength (s)
`(the fixnum (- (end ,s) (start ,s))))

(defmethod ss-pos ((s substring) n)
(char (str s) (+ n (start s))))

(declaim (inline make-substring))
(defun make-substring (s start &optional end)
(make-instance 'substring
:str (str s)
:start (min (+ start (start s)) (length (str s)))
:end (if (null end)
(end s)
(min (length (str s)) (+ end (start s))))))

(defun string->substring (s start &optional end)
(make-instance 'substring
:str s
:start (min start (length s))
:end (if (null end)
(length s)
(min (length s) end))))

(defun substring->string (s)
(declare (type string s))
(subseq (str s) (start s) (end s)))

(defun range (min max)
(loop for n from min upto (- max 1) collecting n))

(defun max-by (fn sequence)
(let ((best (first sequence)))
(loop for x in sequence do (if (funcall fn x best)
(setf best x)))
best))

(defun foldl (initial-value fun list)
(if (null list)
initial-value
(foldl (funcall fun initial-value (car list)) fun (cdr list))))

(declaim (inline make-solution))
(defun make-solution (leaves-1 leaves-2 length)
(list leaves-1 leaves-2 length))

(defun first-result (solution)
(first solution))

(defun second-result (solution)
(second solution))

(defun compare-result (sol1 sol2)
(> (the fixnum (third sol1)) (the fixnum (third sol2))))

(defun find-by (fun1 fun2 tree)
(rec-find-by 0 fun1 fun2 tree))

(defun rec-find-by (length fun1 fun2 tree)
(let* ((children-results (mapcar #'(lambda (e)
(rec-find-by (+ length (slength (edge-label e)))
fun1
fun2
(edge-tree e)))
(edges tree)))
(good-results (remove-if-not #'(lambda (r) (and (first-result r)
(second-result r)))
children-results)))
(or (and good-results (max-by #'compare-result good-results))
(merge-results fun1 fun2 children-results (leaves tree) length))))

(defun merge-results (fun1 fun2 res lt length)
(declare (optimize (speed 3)))
(let ((leaves-1 (loop for r in res when (first r) collect (first r)))
(leaves-2 (loop for r in res when (second r) collect (second r)))
(sol1 nil)
(sol2 nil))
(setf leaves-1 (nconc leaves-1 (remove-if-not fun1 lt)))
(setf leaves-2 (nconc leaves-2 (remove-if-not fun2 lt)))
(if leaves-1 (setf sol1 (car leaves-1)))
(if leaves-2 (setf sol2 (car leaves-2)))
(make-solution sol1 sol2 length)))

(defun lcs (string1 string2)
(declare (type string string1)
(type string string2))
(let* ((split (+ 1 (length string1)))
(tree1 (make-tree 0 string1))
(tree2 (make-tree split string2))
(tree (merge-tree tree1 tree2)))
(format t "Tree constructed, starting search...~%")
(let ((result (find-by #'(lambda (x) (< x split)) #'(lambda (x) (>= x split)) tree)))
(setf (second result) (- (second result) split))
result)))

(defclass node ()
((leaves :initarg :leaves
:accessor leaves)
(edges :initarg :edges
:accessor edges)))

(declaim (inline make-node))
(defun make-node (leaves edges)
(make-instance 'node :leaves leaves :edges edges))

(declaim (inline make-edge))
(defun make-edge (char sub-tree)
(list char sub-tree))

(declaim (inline edge-label))
(defun edge-label (edge)
(first edge))

(declaim (inline edge-tree))
(defun edge-tree (edge)
(second edge))

(defparameter empty-tree (make-node () ()))

(defun build-string (start delta string)
(let ((marker (+ delta start)))
(if (eql start (length string))
(make-node (list marker) ())
(make-node () (list (make-edge (string->substring string start)
(make-node (list marker) ())))))))

(defun print-tree (tree)
(rec-print-tree 0 tree))

(defun spaces (ind)
(format nil "~{~a~}" (loop for n in (range 0 ind) collecting #\space)))

(defun rec-print-tree (ind tree)
(format t "~a~a~%"
(spaces ind)
(leaves tree))
(loop for e in (edges tree) do
(rec-print-edge (+ 4 ind) e)))

(defun rec-print-edge (ind edge)
(let ((s (edge-label edge)))
(format t "~a~a~%" (spaces ind) (subseq (str s) (start s) (end s)))
(rec-print-tree (+ 2 ind) (edge-tree edge))))

(defun join-edges (position e1 e2)
(declare (optimize (speed 3)))
(let* ((label (make-substring (edge-label e1) 0 position))
(l1 (make-substring (edge-label e1) position))
(l2 (make-substring (edge-label e2) position))
(t1 (edge-tree e1))
(t2 (edge-tree e2))
(new-edges1 nil)
(new-edges2 nil)
(new-leaves1 nil)
(new-leaves2 nil))
(if (= 0 (slength l1))
(setf new-leaves1 (leaves t1)
new-edges1 (edges t1))
(setf new-edges1 (list (make-edge l1 t1))))
(if (= 0 (slength l2))
(setf new-leaves2 (leaves t2)
new-edges2 (edges t2))
(setf new-edges2 (list (make-edge l2 t2))))
(make-edge label
(make-node (append new-leaves1 new-leaves2)
(merge-edges new-edges1 new-edges2)))))

(defun same-substring (s1 s2)
(declare (type substring s1)
(type substring s2))
(let ((str1 (str s1))
(str2 (str s2)))
(string= str1 str2
:start1 (start s1)
:end1 (end s1)
:start2 (start s2)
:end2 (end s2))))

(defun substring< (s1 s2)
(declare (type substring s1)
(type substring s2))
(let ((str1 (str s1))
(str2 (str s2)))
(string< str1 str2
:start1 (start s1)
:end1 (end s1)
:start2 (start s2)
:end2 (end s2))))

(defun longest-common-prefix (substring1 substring2)
(let ((l 0)
(s1 (str substring1))
(s2 (str substring2)))
(declare (type fixnum l))
(do
((p1 (start substring1) (incf p1))
(p2 (start substring2) (incf p2))
(e1 (end substring1))
(e2 (end substring2)))
((or (= p1 e1) (= p2 e2) (not (char= (char s1 p1) (char s2 p2)))))
(incf l))
l))

(defun merge-edges (edge-list1 edge-list2)
(rec-merge-edges () edge-list1 edge-list2))

(defun rec-merge-edges (acc l1 l2)
(declare (optimize (speed 3)))
(cond
((null l1) (nconc (nreverse acc) l2))
((null l2) (nconc (nreverse acc) l1))
(t (let* ((e1 (edge-label (first l1)))
(e2 (edge-label (first l2)))
(t1 (edge-tree (first l1)))
(t2 (edge-tree (first l2)))
(common (longest-common-prefix e1 e2)))
(cond
((and (= common (slength e1))
(= common (slength e2)))
(rec-merge-edges (cons (make-edge e1 (merge-tree t1 t2)) acc)
(cdr l1)
(cdr l2)))

((> common 0)
(let ((new-edge (join-edges common (first l1) (first l2))))
(rec-merge-edges (cons new-edge acc)
(cdr l1)
(cdr l2))))
((substring< e1 e2)
(rec-merge-edges (cons (car l1) acc)
(cdr l1)
l2))
(t
(rec-merge-edges (cons (car l2) acc)
l1
(cdr l2))))))))

(defun merge-tree (tree1 tree2)
(let ((new-leaves (concatenate 'list
(leaves tree1)
(leaves tree2)))
(new-edges (merge-edges (edges tree1) (edges tree2))))
(make-node new-leaves new-edges)))

(defun make-tree (depth string)
(foldl empty-tree #'merge-tree
(loop for n from (- (length string) 1) downto 0 collecting (build-string n depth string))))

(with-open-file (in "DNA/DNA1.txt")
(defparameter *dna1* (coerce (read-line in) 'simple-base-string)))
(with-open-file (in "DNA/DNA2.txt")
(defparameter *dna2* (coerce (read-line in) 'simple-base-string)))

(defparameter *dna3* (subseq *dna1* 0 50000))
(defparameter *dna4* (subseq *dna2* 0 50000))

Unfortunately that's a lot of lines, because you can't make "shared" substrings in CL, and I tried to minimize the space used by "merging" pieces of substring into a single edge, which incidentally is the messiest part of the code.

gugoXX

05-08-2008, 00:46

I've lost track.
Where do I stand?

Version with the 2 errors.

Maximum common length 29
Maximum substring indexes 50002,70002
Substring1: CCACTGTCCTGAAGATCGCTTGGCATCTCCGTT
Substring2: GGACTGTCCTGCAGATCGCTTTGCATCTCCGAA
35521ms

gugoXX

05-08-2008, 01:41

That's enough for tonight.
I'll try to get parallelism in here too.

Maximum common length 29
Maximum substring indexes 50002,70002
Substring1: CCACTGTCCTGAAGATCGCTTGGCATCTCCGTT
Substring2: GGACTGTCCTGCAGATCGCTTTGCATCTCCGAA
30852ms

cionci

05-08-2008, 08:07

Ehm... the strings don't match ;)

Still, it looks promising. What did you use?

gugoXX

05-08-2008, 08:51

Ehm... the strings don't match ;)

Still, it looks promising. What did you use?

:)
It's just that I printed the 2 characters that come before and the 2 characters that come after, to pin down the boundary clearly and to see whether I had made mistakes.

The idea is as follows.
I start with a search over a constant minimum length. I started with 4, but nothing stops you from starting lower, or from building an ad hoc algorithm in case the total length (errors included) is less than 4.

At each iteration of the algorithm I increase this value until I reach a satisfactory condition.

Suppose we have reached X=16.
The following part of the algorithm looks for all substrings that are "at least" 16 long. If there are too many of them, I loop: I raise X to 17 and start again.
1. For each substring exactly X long in the FIRST string (so O(N)), I analyze the occurrences of the individual bases. If, for example, the substring under examination were
TACGTGCAACGCGGTG, I would associate:
A = 3
C = 4
T = 3
G = 6
I insert this key into a hashtable, which holds, for each distribution, the possible "starts" in the first string.
The value of each key is therefore a list of offsets.
Any anagram of the substring above will therefore map to the same hashtable key.
This step is really fast.

2. I start looping over the SECOND string, analyzing each substring of length X in turn.
2a. I compute the distribution of occurrences.
2b. From it I derive all the hypothetical occurrence distributions, assuming a number of errors growing from 0 to ERR.
E.g.: given an occurrence distribution 3,4,3,6 and ERR = 1,
I have to check for the existence of the distributions
3,4,3,6 ( = zero errors)
2,4,3,7 - 2,4,4,6 - 2,5,3,6
3,3,3,7 - 3,3,4,6 - 4,3,3,6
3,4,2,7 - 3,5,2,6 - 4,4,2,6
3,4,4,5 - 3,5,3,5 - 4,4,3,5 ( = one error)

2c. For each of these distributions I look up in the hashtable all the possible offsets in the first string, and for each of them I check whether there really are at most ERR errors between the substring under examination and the one taken from the hashtable. Typically there are not, since anagrams are unlikely to be in just the right order, but if our target is in there we will certainly find it.

2d. I end up with a certain number of substrings of the second string, X long, that differ by at most ERR errors from their counterpart in the first string.

2e. If the list grows too long during the loop, I start again from step 1, incrementing X (moving to 17).

2f. For each of these I then go and find the actual final length, which can be longer than X.
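Steps 2a and 2b can be sketched as below. This is not gugoXX's C# code, only a minimal illustration under the assumption that one error moves a single unit from one base count to another. The names `Composition`, `countBases` and `compositionsWithinErrors` are hypothetical.

```cpp
#include <array>
#include <set>
#include <string>

// Counts in the order A, C, T, G used in the example above.
using Composition = std::array<int, 4>;

// Step 2a: distribution of occurrences for one window.
Composition countBases(const std::string &window)
{
    Composition c{{0, 0, 0, 0}};
    for (char ch : window) {
        switch (ch) {
        case 'A': ++c[0]; break;
        case 'C': ++c[1]; break;
        case 'T': ++c[2]; break;
        case 'G': ++c[3]; break;
        default: break; // anything else is ignored in this sketch
        }
    }
    return c;
}

// Step 2b: every distribution reachable with 0..maxErrors substitutions.
// One substitution removes one occurrence of a base and adds one of another.
std::set<Composition> compositionsWithinErrors(const Composition &start, int maxErrors)
{
    std::set<Composition> seen{start};
    std::set<Composition> frontier{start};
    for (int e = 0; e < maxErrors; ++e) {
        std::set<Composition> nextFrontier;
        for (const Composition &c : frontier) {
            for (int from = 0; from < 4; ++from) {
                if (c[from] == 0)
                    continue; // nothing to take away from this base
                for (int to = 0; to < 4; ++to) {
                    if (to == from)
                        continue;
                    Composition n = c;
                    --n[from];
                    ++n[to];
                    if (seen.insert(n).second)
                        nextFrontier.insert(n);
                }
            }
        }
        frontier.swap(nextFrontier);
    }
    return seen;
}
```

For the distribution 3,4,3,6 with ERR = 1 this produces 13 candidates: the original plus the 12 one-error variants listed in step 2b. Each candidate is then looked up in the hashtable and verified character by character, as in step 2c.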

Optimizations:
A - The hashtable key doesn't need to be built from all 4 values of the base distribution.
Given X=16, that is, when analyzing all the substrings of length 16,
at least one of the values of a distribution 3,4,3,6 is redundant. It is enough to build the key from just 3 values
3,4,3
B - By definition a match cannot start with an error. At least the first character must be the same in both strings. Storing this character in the key helps discriminate better in the second part of the algorithm.
E.g.: A,3,4,3
Indeed, if I were evaluating a window of the second string that has the same frequencies 3,4,3,6 but starts with T, it certainly would not be our target. T,3,4,3 is a different key and so it is never evaluated.
C - The hashtable key is written into a single 32-bit integer, splitting the bit space as follows:
2 bits to encode the starting character
9 bits for each of the frequencies of the 3 values (I chose CTG, leaving out A)
This means that in this version of the algorithm I cannot find substrings longer than 512 characters. If needed I will have to cover that case with another algorithm, perhaps one using int64.
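The key layout in point C, together with step 1, can be sketched as follows. The post gives only the field widths, so the exact bit positions, the 2-bit base codes and the names `baseCode`, `packKey` and `buildIndex` are assumptions made for illustration.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical 2-bit code for each base.
int baseCode(char c)
{
    switch (c) {
    case 'A': return 0; case 'C': return 1;
    case 'G': return 2; case 'T': return 3;
    default: throw std::invalid_argument("expected only A, C, G, T");
    }
}

// Assumed layout: bits 27-28 first base, 18-26 count of C,
// 9-17 count of T, 0-8 count of G. The count of A is implied by X.
std::uint32_t packKey(int firstBase, int countC, int countT, int countG)
{
    return (static_cast<std::uint32_t>(firstBase) << 27) |
           (static_cast<std::uint32_t>(countC) << 18) |
           (static_cast<std::uint32_t>(countT) << 9) |
            static_cast<std::uint32_t>(countG);
}

// Step 1: map the key of every window of length x in `s` to the list of
// its start offsets, updating the counts incrementally as the window slides.
std::unordered_map<std::uint32_t, std::vector<int>>
buildIndex(const std::string &s, int x)
{
    std::unordered_map<std::uint32_t, std::vector<int>> index;
    const int n = static_cast<int>(s.size());
    if (x <= 0 || x > 511 || n < x)
        return index; // each 9-bit field can hold counts up to 511

    int counts[4] = {0, 0, 0, 0}; // indexed by baseCode: A, C, G, T
    for (int i = 0; i < x; ++i)
        ++counts[baseCode(s[i])];

    for (int start = 0;; ++start) {
        const std::uint32_t key =
            packKey(baseCode(s[start]), counts[1], counts[3], counts[2]);
        index[key].push_back(start);
        if (start + x >= n)
            break;
        --counts[baseCode(s[start])];     // base leaving the window
        ++counts[baseCode(s[start + x])]; // base entering the window
    }
    return index;
}
```

With this layout `buildIndex("ACGACG", 3)` stores offsets 0 and 3 under the same key, while the windows CGA and GAC, which are anagrams starting with a different base, get separate keys, as described in point B.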

Describing it takes more lines than the C# code does (well, not quite, but it isn't a C-style scroll).

banryu79

05-08-2008, 19:11

Hmm, interesting.
Before reading about suffix trees, suggested by shinya and used by Vincenzo1968, I was thinking about this contest too in my own small way. Since I had just read something about Huffman's compression algorithm, I was wondering whether it would be a good idea to try to "condense" the information in the strings by pairing each character with its number of occurrences, hoping then to use a tree-like structure to organize the information about the superstrings somehow, so as to visit them more efficiently and speed up the comparisons.
But so far I haven't found the time to look into it.

These contests are absolutely tantalizing :D

DanieleC88

05-08-2008, 19:29

I had also developed an alternative version of my previous code: essentially the same, except that I tried to eliminate the searches as far as possible by using a dictionary for the characters ACGT holding the positions of each letter (for a fast lookup of the substrings). It just seems to be much slower, at least the way I implemented it, so much so that I never saw it finish; after quite a while I killed it and then never tried again. :fagiano:

I like these contests a lot too. :)
When's number 5? :D

gugoXX

05-08-2008, 19:30

I had also developed an alternative version of my previous code: essentially the same, except that I tried to eliminate the searches as far as possible by using a dictionary for the characters ACGT holding the positions of each letter (for a fast lookup of the substrings). It just seems to be much slower, at least the way I implemented it, so much so that I never saw it finish; after quite a while I killed it and then never tried again. :fagiano:

I like these contests a lot too. :)
When's number 5? :D

It's almost ready, I expect to post it over the weekend.

DanieleC88

05-08-2008, 21:08

Yes! :yeah:

Vincenzo1968

06-08-2008, 04:00

I tried implementing a version that uses a binary search tree. The keys hold the prefixes (fixed-length pieces of PREFIX_SIZE characters) of the longer string. Each node holds a linked list with the positions of that prefix in the string.

I read the shorter string from left to right and look up its prefixes in the tree. When I find one, I compare the strings until I hit a different character.
It seems to be faster than the suffix tree version: 0.2 seconds.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 10

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1]; /* +1 for the terminating '\0' written by strcpy */
Lista *pLista;
struct tagTree *father;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->father = NULL;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
node->left->father = node;
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
node->right->father = node;
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // key == node->data
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
int numread = 0;
int dimFile = 0;
int count = 0;
fpos_t pos;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

if( fgetpos(fp, &pos) != 0 )
{
fclose(fp);
return 0;
}

fclose(fp);

return (int)pos;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;

char strSearch1[1024] = "";
char strSearch2[1024] = "";

char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - PREFIX_SIZE )
{
memcpy(strSearch1, p1 + pos1, PREFIX_SIZE);
*(strSearch1 + PREFIX_SIZE) = '\0';

pTree = TreeInsertNode(pTree, strSearch1, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch1[0] = '\0';
strSearch2[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;

while ( pos2 < p2_len - PREFIX_SIZE )
{
memcpy(strSearch2, p2 + pos2, PREFIX_SIZE);
*(strSearch2 + PREFIX_SIZE) = '\0';
len = 0;

TreeSearch(pTree, &pNode, strSearch2);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
strcpy(strTemp1, strSearch2);
k = PREFIX_SIZE;
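while ( pos1 + k < p1_len && pos2 + k < p2_len && k < (int)sizeof(strTemp1) - 1 && *(p2 + pos2 + k) == *(p1 + pos1 + k ) ) /* stay inside both strings and strTemp1 */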
{
*(strTemp1 + k) = *(p2 + pos2 + k);
k++;
}
*(strTemp1 + k ) = '\0';
len = strlen(strTemp1);
if ( len > len_prec )
{
sprintf(strTemp2, "\nStringa '%s' di lunghezza %d trovata nelle posizioni %d e %d\n", strTemp1, k, pos1, pos2);
len_prec = len;
}
pLista = pLista->next;
}
}

pos2++;
//printf("pos2 -> %d\n", pos2);
}

printf("%s", strTemp2);

c_end = clock();

printf( "Tempo impiegato -> %2.1f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

free(p1);
free(p2);
/* strSearch1 and strSearch2 are stack arrays, so they must not be passed to free() */

return 0;
}

Vincenzo1968

07-08-2008, 02:39

Sorry for the post being repeated several times.

Yesterday, when I tried to post, I got an error. Today I see that the attempts had gone through after all :confused:

I'm posting a temporary solution for point B (and I say temporary because I don't like it: it takes too long, 2 minutes):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

Lista *pLastA = NULL;
Lista *pLastC = NULL;
Lista *pLastG = NULL;
Lista *pLastT = NULL;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val, char c);
void ListFree(Lista* first);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val, char c)
{
Lista *nuovo;

nuovo = ListNewNode(val);

if ( first == NULL )
{
switch ( c )
{
case 'A':
pLastA = nuovo;
break;
case 'C':
pLastC = nuovo;
break;
case 'G':
pLastG = nuovo;
break;
case 'T':
pLastT = nuovo;
break;
}

return nuovo;
}

switch ( c )
{
case 'A':
pLastA->next = nuovo;
pLastA = nuovo;
break;
case 'C':
pLastC->next = nuovo;
pLastC = nuovo;
break;
case 'G':
pLastG->next = nuovo;
pLastG = nuovo;
break;
case 'T':
pLastT->next = nuovo;
pLastT = nuovo;
break;
}

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
int numread = 0;
int dimFile = 0;
int count = 0;
fpos_t pos;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

if( fgetpos(fp, &pos) != 0 )
{
fclose(fp);
return 0;
}

fclose(fp);

return (int)pos;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
Lista *pLista;

Lista *pListaA;
Lista *pListaC;
Lista *pListaG;
Lista *pListaT;

int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;

char strSearch1[1024] = "";
char strSearch2[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

pListaA = NULL;
pListaC = NULL;
pListaG = NULL;
pListaT = NULL;

pos1 = 0;
while ( pos1 < p1_len )
{
switch ( *(p1 + pos1) )
{
case 'A':
pListaA = ListAppend(pListaA, pos1, *(p1 + pos1));
break;
case 'C':
pListaC = ListAppend(pListaC, pos1, *(p1 + pos1));
break;
case 'G':
pListaG = ListAppend(pListaG, pos1, *(p1 + pos1));
break;
case 'T':
pListaT = ListAppend(pListaT, pos1, *(p1 + pos1));
break;
}

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch1[0] = '\0';
strSearch2[0] = '\0';
pLista = NULL;
len = 0;
len_prec = 0;
numErrors = 0;

while ( pos2 < p2_len )
{
switch ( *(p2 + pos2) )
{
case 'A':
pLista = pListaA;
break;
case 'C':
pLista = pListaC;
break;
case 'G':
pLista = pListaG;
break;
case 'T':
pLista = pListaT;
break;
}

len = 0;

while ( pLista != NULL )
{
pos1 = pLista->pos;
k = 0;
numErrors = 0;
while ( (*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2) )
{
if (*(p2 + pos2 + k) != *(p1 + pos1 + k ))
numErrors++;
*(strTemp1 + k) = *(p2 + pos2 + k);
k++;
}
*(strTemp1 + k ) = '\0';
len = strlen(strTemp1);
if ( len > len_prec )
{
memcpy(strTemp2, p1 + pos1, k);
*(strTemp2 + k) = '\0';
sprintf(strRes, "\nStringhe\n%s\n%s\ndi lunghezza %d trovate nelle posizioni %d e %d\n", strTemp1, strTemp2, k, pos1, pos2);
len_prec = len;
}
pLista = pLista->next;
}

pos2++;
}

printf("%s", strRes); // never pass data as the format string

c_end = clock();

printf( "Tempo impiegato -> %2.1f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

free(p1);
free(p2);
ListFree(pListaA);
ListFree(pListaC);
ListFree(pListaG);
ListFree(pListaT);

return 0;
}

Vincenzo1968

07-08-2008, 03:00

Here is my final solution for point B:

Elapsed time -> 1.2 seconds

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 5

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1]; // +1 for the terminating '\0' written by strcpy
Lista *pLista;
struct tagTree *father;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

// empty chain
if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->father = NULL;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
node->left->father = node;
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
node->right->father = node;
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // key == node->data
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
int numread = 0;
int dimFile = 0;
int count = 0;
fpos_t pos;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

if( fgetpos(fp, &pos) != 0 )
{
fclose(fp);
return 0;
}

fclose(fp);

return (int)pos;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;

char strSearch1[1024] = "";
char strSearch2[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - PREFIX_SIZE )
{
memcpy(strSearch1, p1 + pos1, PREFIX_SIZE);
*(strSearch1 + PREFIX_SIZE) = '\0';

pTree = TreeInsertNode(pTree, strSearch1, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch1[0] = '\0';
strSearch2[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;

while ( pos2 < p2_len - PREFIX_SIZE )
{
memcpy(strSearch2, p2 + pos2, PREFIX_SIZE);
*(strSearch2 + PREFIX_SIZE) = '\0';
len = 0;

TreeSearch(pTree, &pNode, strSearch2);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
strcpy(strTemp1, strSearch2);
k = PREFIX_SIZE;
numErrors = 0;
while ( (*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
*(strTemp1 + k) = *(p2 + pos2 + k);
k++;
}
*(strTemp1 + k ) = '\0';
len = strlen(strTemp1);
if ( len > len_prec )
{
memcpy(strTemp2, p1 + pos1, k);
*(strTemp2 + k) = '\0';
sprintf(strRes, "\nStringhe\n%s\n%s\ndi lunghezza %d trovate nelle posizioni %d e %d\n", strTemp1, strTemp2, k, pos1, pos2);
len_prec = len;
}
pLista = pLista->next;
}
}

pos2++;
//printf("pos2 -> %d\n", pos2);
}

printf("%s", strRes); // never pass data as the format string

c_end = clock();

printf( "Tempo impiegato -> %2.1f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

free(p1);
free(p2);
TreeFree(pTree);

return 0;
}

Basically, I replaced the four linked lists with a binary search tree.

Vincenzo1968

07-08-2008, 03:08

:)
...
Describing it takes more lines than the C# version (well, not really, but it isn't a C-style papyrus)

With the papyrus, though, I solved it in just over a second ;)

Vincenzo1968

07-08-2008, 18:55

A small optimization: I moved the construction of the strings out of the loop (inside the loop I now record only the positions and the length).

Stringhe

ACTGTCCTGAAGATCGCTTGGCATCTCCG
ACTGTCCTGCAGATCGCTTTGCATCTCCG

di lunghezza 29 trovate alle posizioni 50002 e 70002

Tempo impiegato -> 0.98000 secondi

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 5

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1]; // +1 for the terminating '\0' written by strcpy
Lista *pLista;
struct tagTree *father;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

// empty chain
if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->father = NULL;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
node->left->father = node;
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
node->right->father = node;
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // key == node->data
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
int numread = 0;
int dimFile = 0;
int count = 0;
fpos_t pos;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

if( fgetpos(fp, &pos) != 0 )
{
fclose(fp);
return 0;
}

fclose(fp);

return (int)pos;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - PREFIX_SIZE )
{
memcpy(strSearch, p1 + pos1, PREFIX_SIZE);
*(strSearch + PREFIX_SIZE) = '\0';

pTree = TreeInsertNode(pTree, strSearch, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;

while ( pos2 < p2_len - PREFIX_SIZE )
{
memcpy(strSearch, p2 + pos2, PREFIX_SIZE);
*(strSearch + PREFIX_SIZE) = '\0';

TreeSearch(pTree, &pNode, strSearch);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
k = PREFIX_SIZE;
numErrors = 0;
while ( (*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;
}
pLista = pLista->next;
}
}

pos2++;
}

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes, "\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n", strTemp1, strTemp2, len, pos1x, pos2x);
printf("%s", strRes); // never pass data as the format string

c_end = clock();

printf("\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

free(p1);
free(p2);
TreeFree(pTree);

return 0;
}

cionci

07-08-2008, 19:06

I didn't think it could be this fast with a binary search tree :eek:

Vincenzo1968

07-08-2008, 19:13

I didn't think it could be this fast with a binary search tree :eek:

Neither did I. I also tried a heavily optimized hash table, but it turns out to be a few tenths of a second slower (maybe I'm the one who doesn't know how to implement hash tables properly).

:)

cionci

07-08-2008, 19:25

Careful, though: it doesn't work ;)
If I change the string at position 70002 of file 2 from

ACTGTCCTGCAGATCGCTTTGCATCTCCG

to

ACTTTCCTGAAGATCGCTTTGCATCTCCG

it no longer finds the longest string.

You are taking for granted that the errors never fall within the first PREFIX_SIZE positions.

Vincenzo1968

07-08-2008, 19:27

I'm posting the hash table solution, which takes 1.01400 seconds on my machine. Maybe someone can optimize the code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 5

//#define DIM_HASHTABLE 8191
//#define DIM_HASHTABLE 16381
//#define DIM_HASHTABLE 32749
//#define DIM_HASHTABLE 65521
#define DIM_HASHTABLE 131071

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagHashTable
{
char key[PREFIX_SIZE + 1]; // +1 for the terminating '\0' written by strcpy
Lista *pLista;
struct tagHashTable *next;
} HashTable;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

int HashU(char *v, int M);

HashTable* HashTableNewNode(char *Key, int pos)
{
HashTable *n;

n = (HashTable*)malloc(sizeof(HashTable));

if( n == NULL )
return NULL;

strcpy(n->key, Key);
n->pLista = NULL;
n->pLista = ListAppend(n->pLista, pos);
n->next = NULL;

return n;
}

int FindValue(HashTable **pHashTable, char *Key, int M)
{
int index = 0;
HashTable *t;
int a = 31415;
int b = 27183;
char *s = Key;

for ( index = 0; *s != '\0'; s++, a = a*b % (M - 1) )
index = (a*index + *s) % M;
if ( index < 0 )
index *= -1;

t = pHashTable[index];
while ( t != NULL )
{
if ( strcmp(t->key, Key) == 0 )
return index;
t = t->next;
}

return -1;
}

void InsertValue(HashTable **pHashTable, char *Key, int pos, int M)
{
int index = 0;
HashTable *t = NULL;
int a = 31415;
int b = 27183;
char *s = Key;

for ( index = 0; *s != '\0'; s++, a = a*b % (M - 1) )
index = (a*index + *s) % M;
if ( index < 0 )
index *= -1;

t = pHashTable[index];
if ( t == NULL )
{
pHashTable[index] = HashTableNewNode(Key, pos);
return;
}

while ( t != NULL )
{
if ( strcmp(t->key, Key) == 0 )
{
pHashTable[index]->pLista = ListAppend(pHashTable[index]->pLista, pos);
return;
}
if ( t->next == NULL )
{
t->next = HashTableNewNode(Key, pos);
t = t->next;
t->next = NULL;
return;
}
t = t->next;
}
}

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

void HashTableFree(HashTable* first)
{
HashTable *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
ListFree(n1->pLista);
free(n1);
n1 = n2;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
int numread = 0;
int dimFile = 0;
int count = 0;
fpos_t pos;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

if( fgetpos(fp, &pos) != 0 )
{
fclose(fp);
return 0;
}

fclose(fp);

return (int)pos;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
HashTable* pHT[DIM_HASHTABLE];
int index;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

for (k = 0; k < DIM_HASHTABLE; k++ )
pHT[k] = NULL;

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

pos1 = 0;
while ( pos1 < p1_len - PREFIX_SIZE )
{
memcpy(strSearch, p1 + pos1, PREFIX_SIZE);
*(strSearch + PREFIX_SIZE) = '\0';

InsertValue(pHT, strSearch, pos1, DIM_HASHTABLE);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
len = 0;
len_prec = 0;
numErrors = 0;

while ( pos2 < p2_len - PREFIX_SIZE )
{
memcpy(strSearch, p2 + pos2, PREFIX_SIZE);
*(strSearch + PREFIX_SIZE) = '\0';

index = FindValue(pHT, strSearch, DIM_HASHTABLE);

if ( index >= 0 )
{
pLista = pHT[index]->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
k = PREFIX_SIZE;
numErrors = 0;
while ( (*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;
}
pLista = pLista->next;
}
}

pos2++;
}

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes, "\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n", strTemp1, strTemp2, len, pos1x, pos2x);
printf("%s", strRes); // never pass data as the format string

c_end = clock();

printf("\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

free(p1);
free(p2);

for ( k = 0; k < DIM_HASHTABLE; k++ )
HashTableFree(pHT[k]);

return 0;
}

cionci

07-08-2008, 19:33

The hash table becomes unusable too: the hash cannot be matched correctly, because the errors can also fall inside the key.

fgetpos cannot be used that way: the standard does not say how fpos_t is made, and it may even be a struct. It's no accident that it doesn't compile with gcc. Use ftell instead of fgetpos.

gugoXX

07-08-2008, 21:05

With the papyrus, though, I solved it in just over a second ;)

Damn. Looks like I have to get to work. :cry:
I'll try to come up with something. :)

Vincenzo1968

07-08-2008, 22:24

Careful, though: it doesn't work ;)
If I change the string at position 70002 of file 2 from

ACTGTCCTGCAGATCGCTTTGCATCTCCG

to

ACTTTCCTGAAGATCGCTTTGCATCTCCG

it no longer finds the longest string.

You are taking for granted that the errors never fall within the first PREFIX_SIZE positions.

Solved!
It is a bit slower now (1.19 seconds), but it works ;)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

#define FILE1 "dna1.txt"
//#define FILE2 "dna2.txt"
#define FILE2 "dna2b.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 5

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1]; // +1 for the terminating '\0' written by strcpy
Lista *pLista;
struct tagTree *father;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

// empty chain
if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->father = NULL;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
node->left->father = node;
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
node->right->father = node;
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // key == node->data
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
long k;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

k = ftell(fp);

fclose(fp);

return (k < 0) ? 0 : (int)k; // ftell returns -1L on failure
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - PREFIX_SIZE )
{
memcpy(strSearch, p1 + pos1, PREFIX_SIZE);
*(strSearch + PREFIX_SIZE) = '\0';

pTree = TreeInsertNode(pTree, strSearch, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;

while ( pos2 < p2_len - PREFIX_SIZE )
{
memcpy(strSearch, p2 + pos2, PREFIX_SIZE);
*(strSearch + PREFIX_SIZE) = '\0';

TreeSearch(pTree, &pNode, strSearch);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
numErrors = 0;
k = PREFIX_SIZE;
post1 = pos1;
post2 = pos2;
if ( pos2 > PREFIX_SIZE && pos1 > PREFIX_SIZE )
{
k = 0;
while ( (*(p2 + pos2) == *(p1 + pos1)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1 += 1;
pos2 += 1;
numErrors = 0;
}
while ( (*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;
}
pos2 = post2;
pLista = pLista->next;
}
}

pos2++;
}

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes, "\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n", strTemp1, strTemp2, len, pos1x, pos2x);
printf("%s", strRes); // never pass data as the format string

c_end = clock();

printf("\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

free(p1);
free(p2);
TreeFree(pTree);

return 0;
}

Vincenzo1968

07-08-2008, 22:45

And if I use prefixes of length 8:

#define PREFIX_SIZE 8

the time drops to 0.2 seconds ;) ;) ;)

DanieleC88

08-08-2008, 00:03

:eekk:

cionci

08-08-2008, 02:00

There's still something that doesn't add up. Try feeding it just two simple strings.

ACTGCTGCATGC
ACTGCTG

Now check the result. Then try putting this in the first file:

ACTGGTGCATGC

If I'm not mistaken, your program currently assumes that the substring contains at least PREFIX_SIZE consecutive matching characters, which may not be true even for fairly long substrings (up to (PREFIX_SIZE - 1) * 2).

cionci

08-08-2008, 02:09

Actually, let me correct myself: (PREFIX_SIZE * 3) - 2

Vincenzo1968

08-08-2008, 02:15

There's still something that doesn't add up. Try feeding it just two simple strings.

ACTGCTGCATGC
ACTGCTG

Now check the result. Then try putting this in the first file:

ACTGGTGCATGC

If I'm not mistaken, your program currently assumes that the substring contains at least PREFIX_SIZE consecutive matching characters, which may not be true even for fairly long substrings (up to (PREFIX_SIZE - 1) * 2).

I modified the program so that it starts from an 8-character prefix. I loop, decreasing the prefix length by one each time (and rebuilding the tree every time), so as to check all the prefixes between 8 and 1.

Elapsed time: 1.19100 seconds

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

#define FILE1 "dna1.txt"
//#define FILE2 "dna2.txt"
#define FILE2 "dna2b.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 8

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1]; // +1 for the terminating '\0' written by strcpy
Lista *pLista;
struct tagTree *father;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

// empty chain
if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->father = NULL;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
node->left->father = node;
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
node->right->father = node;
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // key == node->data
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
long k;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

k = ftell(fp);

fclose(fp);

return (k < 0) ? 0 : (int)k; // ftell returns -1L on failure
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

for ( x = PREFIX_SIZE; x > 2; x--)
{
pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - PREFIX_SIZE )
{
memcpy(strSearch, p1 + pos1, PREFIX_SIZE);
*(strSearch + PREFIX_SIZE) = '\0';

pTree = TreeInsertNode(pTree, strSearch, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;

while ( pos2 < p2_len - x )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

TreeSearch(pTree, &pNode, strSearch);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
numErrors = 0;
k = x;
post1 = pos1;
post2 = pos2;
if ( pos2 > x && pos1 > x )
{
k = 0;
while ( (*(p2 + pos2) == *(p1 + pos1)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1 += 1;
pos2 += 1;
numErrors = 0;
}
while ( (*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);
}
pos2 = post2;
pLista = pLista->next;
}
}

pos2++;
}
TreeFree(pTree);
}

printf("%s", strRes);

c_end = clock();

printf("\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

free(p1);
free(p2);

return 0;
}

cionci

08-08-2008, 02:25

Hmm... with short strings, like the ones above, it still doesn't work: it prints nothing at all.

This for loop:

for ( x = PREFIX_SIZE; x > 2; x--)

With PREFIX_SIZE set to 5 it iterates over 5, 4, 3 and that's it... or am I missing something about how it works?

Vincenzo1968

08-08-2008, 03:05

Hmm... with short strings, like the ones above, it still doesn't work: it prints nothing at all.

This for loop:

for ( x = PREFIX_SIZE; x > 2; x--)

With PREFIX_SIZE set to 5 it iterates over 5, 4, 3 and that's it... or am I missing something about how it works?

Yes, my apologies. I posted the wrong code. This is the correct one:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

//#define FILE1 "dna1.txt"
//#define FILE2 "dna2.txt"

#define FILE1 "dna1b.txt"
#define FILE2 "dna2b.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 5

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1]; /* +1 for the terminating '\0' copied by strcpy */
Lista *pLista;
struct tagTree *father;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

// empty list
if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->father = NULL;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
node->left->father = node;
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
node->right->father = node;
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // key == node->data
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
long k;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

k = ftell(fp);

fclose(fp);

return k;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;
int bTrovato;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

for ( x = PREFIX_SIZE; x > 1; x--)
{
pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - x )
{
memcpy(strSearch, p1 + pos1, x);
*(strSearch + x) = '\0';

pTree = TreeInsertNode(pTree, strSearch, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;
bTrovato = 0;

while ( pos2 < p2_len - x )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

TreeSearch(pTree, &pNode, strSearch);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
numErrors = 0;
k = x;
post1 = pos1;
post2 = pos2;
if ( pos2 > x && pos1 > x )
{
k = 0;
while ( (*(p2 + pos2) == *(p1 + pos1)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1 += 1;
pos2 += 1;
numErrors = 0;
}
while ( (*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);
bTrovato = 1;
}
pos2 = post2;
pLista = pLista->next;
}
}

pos2++;
}
TreeFree(pTree);
if ( bTrovato )
break;
}

printf("%s", strRes);

c_end = clock();

printf("\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

free(p1);
free(p2);

return 0;
}

It also works with the short strings you suggested, unless I got something wrong. Could you send me the two files?

gugoXX

08-08-2008, 14:15

I've come up with an idea.
I'll write it up as soon as I find the time.
Hopefully it stays under 10 seconds...

cionci

08-08-2008, 14:50

Vincenzo, there is still a small problem with the string length.
Put this in file one:

CTGC

and this in file two:

CT

and it doesn't find the substrings of length 3, which is the minimum length with 2 errors (for example, put just the letter C in the second file).
In any case I still don't understand how it can be so fast, given that it degenerates to O(n*m) with the shortest prefix :confused:

Edit: there are other inconsistencies too; try these strings:
CCACTGCTGTGCGGATCTCTAAAAA
AACTCTCTCCGGGTCACATCCTCCA

The result should be:

First string position: 6
Second string position: 4
Substring length: 8
Matched substring in first string: CTGTGCGG
Matched substring in second string: CTCTCCGG

Instead it is:

Stringhe

TCTCTAA
TCTCTCC

di lunghezza 7 trovate alle posizioni 15 e 3

This could be related to the fact that it doesn't search for prefixes of length 1... indeed, there is only one character between the two errors.
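For small cases like the ones above, a brute-force reference can help decide which answer is right. What follows is only a sketch, assuming the contest rules as stated in the opening post (at most ERR errors, never two errors in a row, no error at the start or at the end of the match). The names Match and longest_match_bruteforce are made up for illustration and do not come from any poster's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: match length and start index in each string. */
typedef struct {
    int length;
    int index1;
    int index2;
} Match;

/* Tries every pair of start positions and extends along the diagonal.
   Far too slow for the 100,000-character test files; meant for checking
   short inputs only. */
Match longest_match_bruteforce(const char *s1, const char *s2, int max_err)
{
    Match best = {0, -1, -1};
    int n1, n2, i, j;

    assert(s1 != NULL && s2 != NULL && max_err >= 0);
    n1 = (int)strlen(s1);
    n2 = (int)strlen(s2);

    for (i = 0; i < n1; i++) {
        for (j = 0; j < n2; j++) {
            int k, errors = 0, prev_err = 0, len = 0;

            if (s1[i] != s2[j])
                continue;               /* a match cannot start with an error */

            for (k = 0; i + k < n1 && j + k < n2; k++) {
                if (s1[i + k] == s2[j + k]) {
                    len = k + 1;        /* a match may only end on equal characters */
                    prev_err = 0;
                } else if (errors < max_err && !prev_err) {
                    errors++;           /* isolated error, still within budget */
                    prev_err = 1;
                } else {
                    break;              /* budget exhausted or two errors in a row */
                }
            }
            if (len > best.length) {
                best.length = len;
                best.index1 = i;
                best.index2 = j;
            }
        }
    }
    return best;
}
```

Calling it with max_err = 2 on the two 25-character strings above gives a reference value to compare against the expected result (positions 6 and 4, length 8) and against the program's output.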

ST6Man

08-08-2008, 16:40

I like this kind of thing :D Pardon the intrusion, but in theory shouldn't this be solvable with dynamic programming in O(n1*n2) time?

(where n1 is the length of the first string and n2 the length of the second)

Can we do better?!

gugoXX

08-08-2008, 17:13

I like this kind of thing :D Pardon the intrusion, but in theory shouldn't this be solvable with dynamic programming in O(n1*n2) time?

(where n1 is the length of the first string and n2 the length of the second)

Can we do better?!

Certainly!!!
If we were still at O(n1*n2), with strings of 100,000 characters each, we would still be in the order of minutes.
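For reference, the O(n1*n2) dynamic-programming approach mentioned here can be sketched as follows for the exact case (question 1 only). It is the textbook longest-common-substring recurrence with a single rolling row, not code from any poster; LcsResult and lcs_dp are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type: match length and start index in each string. */
typedef struct {
    int length;
    int index1;
    int index2;
} LcsResult;

/* row[j + 1] holds the length of the longest common suffix of
   s1[0..i] and s2[0..j]; the best value seen is the answer. */
LcsResult lcs_dp(const char *s1, const char *s2)
{
    LcsResult best = {0, -1, -1};
    int n1, n2, i, j;
    int *row;

    assert(s1 != NULL && s2 != NULL);
    n1 = (int)strlen(s1);
    n2 = (int)strlen(s2);

    row = calloc((size_t)n2 + 1, sizeof *row);
    if (row == NULL)
        return best;                    /* allocation failed: report "no match" */

    for (i = 0; i < n1; i++) {
        /* Walk j downwards so that row[j] still holds the value
           computed for the previous character of s1. */
        for (j = n2 - 1; j >= 0; j--) {
            if (s1[i] == s2[j]) {
                row[j + 1] = row[j] + 1;
                if (row[j + 1] > best.length) {
                    best.length = row[j + 1];
                    best.index1 = i - best.length + 1;
                    best.index2 = j - best.length + 1;
                }
            } else {
                row[j + 1] = 0;
            }
        }
    }

    free(row);
    return best;
}
```

With inputs of roughly 100,000 and 80,000 characters the inner statement runs about 8 billion times, which is why the thread looks for something faster than this.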

Vincenzo1968

08-08-2008, 17:37

For now I'm posting the results for the two parts of the contest. For both I get a time of 0.24 seconds.

Part A:

Stringhe

ACTGTCCTGTCAACAAGGAGT
ACTGTCCTGTCAACAAGGAGT

di lunghezza 21 trovate alle posizioni 10002 e 40002

Tempo impiegato -> 0.23400 secondi

Part B:

Stringhe

ACTGTCCTGAAGATCGCTTGGCATCTCCG
ACTGTCCTGCAGATCGCTTTGCATCTCCG

di lunghezza 29 trovate alle posizioni 50002 e 70002

Tempo impiegato -> 0.23400 secondi

And here are the two lengthy listings in C:

Code for part A:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 8

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1]; /* +1 for the terminating '\0' copied by strcpy */
Lista *pLista;
struct tagTree *father;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->father = NULL;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
node->left->father = node;
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
node->right->father = node;
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // key == node->data
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
long k;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

k = ftell(fp);

fclose(fp);

return k;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
FILE *fp;
char strTempo[512];

Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int bTrovato;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

for ( x = PREFIX_SIZE; x > 1; x--)
{
pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - x )
{
memcpy(strSearch, p1 + pos1, x);
*(strSearch + x) = '\0';

pTree = TreeInsertNode(pTree, strSearch, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
bTrovato = 0;

while ( pos2 < p2_len - x )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

TreeSearch(pTree, &pNode, strSearch);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
k = x;
while ( *(p1 + pos1 + k) != '\0' && (*(p2 + pos2 + k) == *(p1 + pos1 + k)) ) /* stop at the end of the strings */
k++;
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

bTrovato = 1;
}
pLista = pLista->next;
}
}

pos2++;
}
TreeFree(pTree);
if ( bTrovato )
break;
}

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

c_end = clock();

printf("%s", strRes);

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

fp = fopen("Risultato.txt", "a");
if ( fp != NULL )
{
fwrite(strRes, strlen(strRes), 1, fp);
fwrite(strTempo, strlen(strTempo), 1, fp);
fclose(fp);
}

printf("%s", strTempo);

free(p1);
free(p2);

return 0;
}

Code for part B:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

//#define FILE1 "dna1b.txt"
//#define FILE2 "dna2b.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 8

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1]; /* +1 for the terminating '\0' copied by strcpy */
Lista *pLista;
struct tagTree *father;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->father = NULL;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
node->left->father = node;
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
node->right->father = node;
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // key == node->data
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
long k;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

k = ftell(fp);

fclose(fp);

return k;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
FILE *fp;
char strTempo[512];

Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;
int bTrovato;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

for ( x = PREFIX_SIZE; x > 1; x--)
{
pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - x )
{
memcpy(strSearch, p1 + pos1, x);
*(strSearch + x) = '\0';

pTree = TreeInsertNode(pTree, strSearch, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;
bTrovato = 0;

while ( pos2 < p2_len - x )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

TreeSearch(pTree, &pNode, strSearch);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
numErrors = 0;
k = x;
post1 = pos1;
post2 = pos2;
if ( pos2 > x && pos1 > x )
{
k = 0;
while ( (*(p2 + pos2) == *(p1 + pos1)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1 += 1;
pos2 += 1;
numErrors = 0;
}
while ( (*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);
bTrovato = 1;
}
pos2 = post2;
pLista = pLista->next;
}
}

pos2++;
}
TreeFree(pTree);
if ( bTrovato )
break;
}

c_end = clock();

printf("%s", strRes);

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

fp = fopen("Risultato.txt", "a");
if ( fp != NULL )
{
fwrite(strRes, strlen(strRes), 1, fp);
fwrite(strTempo, strlen(strTempo), 1, fp);
fclose(fp);
}

printf("%s", strTempo);

free(p1);
free(p2);

return 0;
}

I'll try the strings you suggested and, if needed, fix the code so that it handles prefixes of length 1.

DanieleC88

08-08-2008, 17:43

Certainly!!!
If we were still at O(n1*n2), with strings of 100,000 characters each, we would still be in the order of minutes.
That's more or less where I am... :D
I ported the old Python code to C, improved it a bit to eliminate useless searches, and managed to make it 3 times faster than before, or a little more (from 13 minutes on this PC to 4 minutes on the same machine). I believe grigor91 had managed to run my initial Python code in 38 seconds; by my calculations this version should take about 11 seconds on his PC. Somewhat improved, but still damn slow. :D

gugoXX

08-08-2008, 18:04

For now I'm posting the results for the two parts of the contest. For both I get a time of 0.24 seconds.

Two things, Vincenzo.
First, it would be good to have a clearly separate function that takes 2 strings and the number of errors as input, so that we all have more or less the same interface.
As for the output, since more than one value is required (2 offsets and the length), in C you can use a struct.

Second, I don't quite understand the meaning of that PREFIX_SIZE parameter.
That is, 8 seems a bit too specific to me, given that, as it happens, the result of the test case contains 3 substrings of length 9.
I wonder whether the algorithm as written keeps working with other inputs,
or whether this case is still covered when the parameter is set to 10.
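A pigeonhole argument puts a number on the PREFIX_SIZE question. A match of length L with e isolated errors has L - e matching characters split into at most e + 1 exact runs, so at least one run is ceil((L - e) / (e + 1)) characters long or longer. For the 29-character result with 2 errors this gives 9, which agrees with the three 9-character pieces noted above: an exact seed of up to 9 characters is guaranteed to fall inside that match, a seed of 10 is not. A small sketch of the bound follows; the function name min_exact_run is hypothetical.

```c
#include <assert.h>

/* Lower bound on the longest exact run inside a match of the given
   length that contains the given number of errors:
   ceil((length - errors) / (errors + 1)). */
int min_exact_run(int length, int errors)
{
    int matches, runs;

    assert(length >= 1 && errors >= 0 && errors < length);
    matches = length - errors;          /* characters that are equal */
    runs = errors + 1;                  /* errors split the match into this many runs at most */
    return (matches + runs - 1) / runs; /* integer ceiling */
}
```

The same formula applied to the example in the opening post (length 19 with 1 error) also gives 9.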

gugoXX

08-08-2008, 23:50

I raise with this one.
I don't know if I can go any lower.

The core is in these 2:
var Upper = new Dictionary<int, Dictionary<string, List<int>>>();
var Lower = new Dictionary<int, Dictionary<string, List<int>>>();

These are 2 Dictionary objects, each keyed by the length of an exact string that will be evaluated.
Each value is another dictionary, whose keys are all the strings of that length and whose values are the lists of offsets at which that specific string can be found.
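The same idea (a map from every substring of a fixed length to the list of offsets where it occurs) can also be sketched in C without a hash table, by packing each substring into 2 bits per base and grouping the offsets with a counting sort. This is only an illustration, assuming the input contains nothing but uppercase A, C, G and T; KmerIndex and its functions are hypothetical names, not a port of the C# code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int k;        /* substring (k-mer) length */
    int ncodes;   /* 4^k possible k-mers */
    int *start;   /* offsets of code c are pos[start[c] .. start[c + 1] - 1] */
    int *pos;     /* all offsets, grouped by k-mer code */
} KmerIndex;

/* Pack k bases into an integer, 2 bits per base; -1 on any other character. */
static int kmer_code(const char *s, int k)
{
    int code = 0, i;
    for (i = 0; i < k; i++) {
        int b;
        switch (s[i]) {
        case 'A': b = 0; break;
        case 'C': b = 1; break;
        case 'G': b = 2; break;
        case 'T': b = 3; break;
        default:  return -1;
        }
        code = code * 4 + b;
    }
    return code;
}

/* Build the index for s; returns 1 on success, 0 on allocation failure. */
int kmer_index_build(KmerIndex *ix, const char *s, int k)
{
    int n = (int)strlen(s);
    int ncodes = 1, i, c;
    int *fill;

    assert(k >= 1 && k <= 12);          /* keeps 4^k small enough for an int */
    for (i = 0; i < k; i++)
        ncodes *= 4;

    ix->k = k;
    ix->ncodes = ncodes;
    ix->start = calloc((size_t)ncodes + 1, sizeof *ix->start);
    ix->pos = malloc(((size_t)n + 1) * sizeof *ix->pos);
    fill = malloc((size_t)ncodes * sizeof *fill);
    if (!ix->start || !ix->pos || !fill) {
        free(ix->start); free(ix->pos); free(fill);
        ix->start = NULL; ix->pos = NULL;
        return 0;
    }

    for (i = 0; i + k <= n; i++)        /* pass 1: count each k-mer */
        if ((c = kmer_code(s + i, k)) >= 0)
            ix->start[c + 1]++;
    for (c = 0; c < ncodes; c++)        /* prefix sums give the bucket starts */
        ix->start[c + 1] += ix->start[c];

    memcpy(fill, ix->start, (size_t)ncodes * sizeof *fill);
    for (i = 0; i + k <= n; i++)        /* pass 2: drop offsets into their buckets */
        if ((c = kmer_code(s + i, k)) >= 0)
            ix->pos[fill[c]++] = i;

    free(fill);
    return 1;
}

/* Return the offsets of kmer in ascending order; their number goes to *count. */
const int *kmer_index_find(const KmerIndex *ix, const char *kmer, int *count)
{
    int c = kmer_code(kmer, ix->k);
    if (c < 0) { *count = 0; return NULL; }
    *count = ix->start[c + 1] - ix->start[c];
    return ix->pos + ix->start[c];
}

void kmer_index_free(KmerIndex *ix)
{
    free(ix->start);
    free(ix->pos);
    ix->start = NULL;
    ix->pos = NULL;
}
```

One index per string and per length plays the role of one inner dictionary; the table has 4^k entries, so this layout only makes sense for short lengths.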

Maximum common length 29
Maximum substring indexes 50003,70003
Substring1: CACTGTCCTGAAGATCGCTTGGCATCTCCGTTA
Substring2: GACTGTCCTGCAGATCGCTTTGCATCTCCGAAC
3330ms

static void Main(string[] args)
{
string input1 ="";
string input2 = "";

input1 = File.ReadAllText(@"C:\temp\DNA1.txt");
input2 = File.ReadAllText(@"C:\temp\DNA2.txt");

Stopwatch sw = new Stopwatch();
sw.Start();
Run Current = TryV4.Get(input1, input2, 2);
sw.Stop();

Console.WriteLine("Maximum common length {0}", Current.length);
Console.WriteLine("Maximum substring indexes {0},{1}", Current.index0, Current.index1);
Console.WriteLine("Substring1: {0}", input1.Substring(Current.index0-2, Current.length+4));
Console.WriteLine("Substring2: {0}", input2.Substring(Current.index1-2, Current.length+4));
Console.WriteLine("{0}ms",sw.ElapsedMilliseconds);
Console.ReadKey();
}

public static class TryV4
{
static string input1;
static string input2;
static int len1;
static int len2;
static int ERR;

static Dictionary<ulong, bool> Avoid = new Dictionary<ulong, bool>();

public static Run Get(string i_input1, string i_input2, int i_ERR)
{
input1 = i_input1;
input2 = i_input2;
if (input2.Length > input1.Length)
{
string tmp = input1;
input1 = input2;
input2 = tmp;
}

ERR = i_ERR;

len1 = input1.Length;
len2 = input2.Length;

var Upper = new Dictionary<int, Dictionary<string, List<int>>>();
var Lower = new Dictionary<int, Dictionary<string, List<int>>>();

// Determine the minimum string length

int hopelength = (int)(Math.Log(len1, 4) / 2);
int length=hopelength-1;
bool test;
do
{
length++;
var mup = Upper[length] = BuildDict(input1, length);
var mdw = Lower[length] = BuildDict(input2, length);

test = mdw.Any(t => mup.ContainsKey(t.Key));
} while (test);

if (length == hopelength) throw new Exception("ERR: ");

Run Winner = new Run();
int minlen = GetMinLen(length);
for(int ln=length-1;ln>=minlen;ln--)
{
var mup = Upper[ln];
var mdw = Lower[ln];

var tst = from m in mup
join n in mdw on m.Key equals n.Key
select new { uplist = m.Value, downlist = n.Value };

foreach (var totest in tst)
{
Run thrun=Test(totest.uplist, totest.downlist, ln);
if (thrun.length > Winner.length)
{
Winner = thrun;
minlen = GetMinLen(thrun.length);
if (minlen > ln) break;
}
}
}

return Winner;
}

public static int GetMinLen(int curmaxlen)
{
int minlen = (int)Math.Ceiling((float)(curmaxlen) / (float)(ERR+1));
return minlen;
}

private static Run Test(IEnumerable<int> uplist,IEnumerable<int> downlist,int sure)
{
Run winner = new Run();

var runner = from u in uplist
from d in downlist
let avoidkey = GetAvoidKey(u, d)
where !Avoid.ContainsKey(avoidkey)
select new { u = u, d = d, avoidkey = avoidkey };

foreach (var run in runner)
{
Run trn = Search(run.u, run.d, sure);
if (trn.length > winner.length) winner = trn;
Avoid[run.avoidkey] = true;
}

return winner;
}

private static ulong GetAvoidKey(int u, int d)
{
ulong uu = (ulong)u;
ulong ud = (ulong)d;
return uu | (ud << 32);
}

private static Run Search(int upoffset, int downoffset,int sure)
{
int[] Pre = new int[ERR];
int[] Post = new int[ERR];

//SearchPost

bool lasterr = true;
for(int erfnd=0,pch1=upoffset+sure+1,pch2=downoffset+sure+1; erfnd<ERR && pch1<len1 && pch2<len2 ;pch1++,pch2++)
{
char ch1 = input1[pch1];
char ch2 = input2[pch2];
if (ch1 == ch2) Post[erfnd]++;
else
{
if (lasterr) break;
erfnd++;
lasterr = true;
}
}

//SearchPre
lasterr = true;
for (int erfnd = 0, pch1=upoffset-2, pch2=downoffset-2; erfnd < ERR && pch1>=0 && pch2>=0; pch1--,pch2--)
{
char ch1 = input1[pch1];
char ch2 = input2[pch2];
if (ch1 == ch2) Pre[erfnd]++;
else
{
if (lasterr) break;
erfnd++;
lasterr = true;
}
}

int pr = 0;
int po = 0;
for (int t = 0; t < ERR; t++)
{
int ppd = Post[po];
int ppr = Pre[pr];
if ((ppd == 0) && (ppr == 0)) break;
if (ppd > ppr) po++;
else pr++;
}

int prim=0;
for (int t = 0; t < pr; t++)
{
prim += Pre[t];
}
int ofprim = prim + pr;

int dop = 0;
for (int t = 0; t < po; t++)
{
dop += Post[t];
}
int ofdop = dop + po;

int len = sure + ofprim + ofdop;
return new Run(len, upoffset - prim, downoffset - prim);
}

private static Dictionary<string, List<int>> BuildDict(string input, int len)
{
int fin = input.Length - len;
var ret = new Dictionary<string, List<int>>(fin);

for (int t = 0; t < fin; t++)
{
string str = input.Substring(t, len);
List<int> adder;
if (!ret.TryGetValue(str, out adder))
{
adder = ret[str] = new List<int>();
}
adder.Add(t);
}
return ret;
}

public class CRun
{
public int offset1;
public int offset2;
public int length=1;
public int err = 0;
public bool previouserr = false;
public CRun(int i_offset1,int i_offset2)
{
offset1 = i_offset1;
offset2 = i_offset2;
}
}

}

cionci

09-08-2008, 11:01

The core is in these 2:
var Upper = new Dictionary<int, Dictionary<string, List<int>>>();
var Lower = new Dictionary<int, Dictionary<string, List<int>>>();

These are 2 Dictionary objects, each keyed by the length of an exact string that will be evaluated.
I was thinking of doing it exactly that way too :D

marco.r

09-08-2008, 17:15

I've fixed the algorithm in my CL version. I'm posting it mostly for the record, since it's still unreadable :rolleyes: :(.

(defclass substring ()
((str :initarg :str)
(start :initarg :start)
(end :initarg :end)
(length :initarg :length)))

(defmacro str (s)
`(the simple-string (slot-value ,s 'str)))
(defmacro start (s)
`(the fixnum (slot-value ,s 'start)))
(defmacro end (s)
`(the fixnum (slot-value ,s 'end)))
(defmacro slength (s)
`(the fixnum (slot-value ,s 'length)))

(defclass leaf ()
((tag :initarg :tag)
(substr :initarg :substr)))

(defmacro tag (leaf)
`(the fixnum (slot-value ,leaf 'tag)))

(defmacro substr (leaf)
`(the substring (slot-value ,leaf 'substr)))

(declaim (inline make-leaf))
(defun make-leaf (substr tag)
(make-instance 'leaf :tag tag :substr substr))

(defclass node ()
((tags :initarg :tags)
(edges :initarg :edges)))

(defmacro tags (node)
`(slot-value ,node 'tags))

(defmacro edges (node)
`(slot-value ,node 'edges))

(defmacro ss-pos (substring pos)
`(char (str ,substring) (+ ,pos (start ,substring))))

(declaim (inline make-substring))
(defun make-substring (s start &optional end)
(declare (type fixnum start)
(type (or null fixnum) end))
(let ((real-start (min (+ start (start s)) (length (str s))))
(real-end (if (null end)
(end s)
(min (length (str s)) (+ end (start s))))))
(declare (type fixnum real-start)
(type fixnum real-end))
(make-instance 'substring
:str (str s)
:start real-start
:end real-end
:length (- real-end real-start))))

(defun string->substring (s start &optional end)
(declare (type fixnum start)
(type (or fixnum null) end))
(let ((real-start (min start (length s)))
(real-end (if (null end)
(length s)
(min (length s) end))))
(make-instance 'substring
:str s
:start real-start
:end real-end
:length (- real-end real-start))))

(defun substring->string (s)
(declare (type substring s))
(subseq (str s) (start s) (end s)))

(defun make-node (tags edges)
(make-instance 'node :tags tags :edges edges))

(defun make-edge (char sub-tree)
(list char sub-tree))

(defun edge-label (edge)
(first edge))

(defun edge-tree (edge)
(second edge))

(defun max-by (fn sequence)
(let ((best (first sequence)))
(loop for x in sequence do (if (funcall fn x best)
(setf best x)))
best))

(defun make-solution (tags-1 tags-2 length)
(list tags-1 tags-2 length))

(defun first-result (solution)
(first solution))

(defun second-result (solution)
(second solution))

(defun compare-result (sol1 sol2)
(> (third sol1) (third sol2)))

(defun find-by (fun1 fun2 tree)
(rec-find-by 0 fun1 fun2 tree))

(defgeneric rec-find-by (length fun1 fun2 tree))

(defmethod rec-find-by (length fun1 fun2 (tree leaf))
(let ((x (tag tree)))
(make-solution (and (funcall fun1 x) x)
(and (funcall fun2 x) x) length)))

(defmethod rec-find-by (length fun1 fun2 (tree node))
(labels ((good-enough (res) (and (first-result res) (second-result res)))
(rec-call (e) (rec-find-by (1+ length)
fun1
fun2
(edge-tree e))))
(let* ((children-results (mapcar #'rec-call (edges tree)))
(good-results (remove-if-not #'good-enough children-results)))
(or (and good-results (max-by #'compare-result good-results))
(merge-results fun1 fun2 children-results (tags tree) length)))))

(defun merge-results (fun1 fun2 res lt length)
(let ((tags-1 (loop for r in res when (first r) collect (first r)))
(tags-2 (loop for r in res when (second r) collect (second r)))
(sol1 nil)
(sol2 nil))
(setf tags-1 (nconc tags-1 (remove-if-not fun1 lt)))
(setf tags-2 (nconc tags-2 (remove-if-not fun2 lt)))
(if tags-1 (setf sol1 (car tags-1)))
(if tags-2 (setf sol2 (car tags-2)))
(make-solution sol1 sol2 length)))

(declaim (inline leaf->node))
(defun leaf->node (leaf)
(let* ((old-ss (substr leaf))
(tag (tag leaf)))
(if (= 0 (slength old-ss))
(make-node (list tag) ())
(make-node () (list (make-edge (ss-pos old-ss 0)
(make-leaf (make-substring old-ss 1) tag)))))))

(defgeneric merge-tree (tree1 tree2))

(defmethod merge-tree ((tree1 leaf) (tree2 node))
(merge-node (leaf->node tree1) tree2))

(defmethod merge-tree ((tree1 node) (tree2 leaf))
(merge-node tree1 (leaf->node tree2)))

(defmethod merge-tree ((tree1 leaf) (tree2 leaf))
(merge-node (leaf->node tree1) (leaf->node tree2)))

(defmethod merge-tree ((tree1 node) (tree2 node))
(merge-node tree1 tree2))

(defun merge-node (tree1 tree2)
(let ((new-tags (concatenate 'list
(tags tree1)
(tags tree2)))
(new-edges (merge-edges (edges tree1) (edges tree2))))
(make-node new-tags new-edges)))

(defun lcs (string1 string2)
(let* ((split (+ 1 (length string1)))
(tree1 (make-tree 0 string1))
(tree2 (make-tree split string2))
(tree (merge-tree tree1 tree2)))
(let ((result (find-by #'(lambda (x) (< x split)) #'(lambda (x) (>= x split)) tree)))
(setf (second result) (- (second result) split))
result)))

(defparameter empty-tree (make-node () ()))

(defun merge-edges (edge-list1 edge-list2)
(rec-merge-edges () edge-list1 edge-list2))

(defun rec-merge-edges (acc l1 l2)
(cond
((null l1) (concatenate 'list (reverse acc) l2))
((null l2) (concatenate 'list (reverse acc) l1))
((char= (the character (edge-label (first l1)))
(the character (edge-label (first l2))))
(let ((e1 (first l1))
(e2 (first l2)))
(rec-merge-edges (cons (make-edge (edge-label e1) (merge-tree (edge-tree e1)
(edge-tree e2)))
acc)
(cdr l1)
(cdr l2))))
((char< (edge-label (first l1)) (edge-label (first l2)))
(rec-merge-edges (cons (car l1) acc)
(cdr l1)
l2))
(t
(rec-merge-edges (cons (car l2) acc)
l1
(cdr l2)))))

(defun find-edge (char edges)
(declare (type character char)
(dynamic-extent char))
(if edges
(if (char= char (caar edges))
(car edges)
(find-edge char (cdr edges)))
nil))

(defgeneric add-tag (tree tag))

(defmethod add-tag ((tree leaf) tag)
(add-tag (leaf->node tree) tag))

(defmethod add-tag ((tree node) tag)
(setf (tags tree) (cons tag (tags tree)))
tree)

(defmacro update (fun place)
`(setf ,place (,fun ,place)))

(defun insert-edge (new-edge old-edges)
(if (null old-edges)
(list new-edge)
(let ((new-label (edge-label new-edge))
(old-label (edge-label (car old-edges))))
(if (char< new-label old-label)
(cons new-edge old-edges)
(cons (car old-edges) (insert-edge new-edge (cdr old-edges)))))))

(defgeneric update-edge (edge tree))

(defmethod update-edge (edge (tree leaf))
(setf (second edge) (leaf->node tree)))

(defmethod update-edge (edge (tree node))
tree)

(defun insert-string (tree string pos tag)
(declare (type simple-string string)
(type fixnum pos))
(let ((finished nil))
(do ((curpos pos (incf curpos))
(endpos (length string))
(curtree tree))
(finished)
(declare (type fixnum curpos endpos))
(if (= curpos endpos)
(progn (setf curtree (add-tag curtree tag))
(setf finished t))
(let ((edge (find-edge (char string curpos) (edges curtree))))
(if edge
(setf curtree (update-edge edge (second edge)))
(progn (setf (edges curtree)
(insert-edge (make-edge (char string curpos)
(make-leaf (string->substring string (1+ curpos)) tag))
(edges curtree)))
(setf finished t))))))
tree))

(defun make-tree (depth string)
(let ((tree (make-node () ())))
(loop for n from 0 upto (length string) do
(let ((tag (+ depth n)))
(setf tree (insert-string tree string n tag))))
tree))

Performance isn't bad now: on average under two seconds, against roughly one second for the C version. It could probably be trimmed a bit, since I use plain lists for each node's children and the program spends about half its time walking those lists and doing GC :mbe: :D , but I have neither the time nor the inclination :p.
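On the remark about plain lists of children: one possible alternative, sketched here in C rather than CL, is to give each node a fixed array of four child slots indexed directly by nucleotide, so that following an edge needs no list scan. This is only an illustration that assumes an uppercase ACGT alphabet; `TrieNode`, `nuc_index`, `trie_insert`, `trie_match_len` and `trie_free` are hypothetical names, and this is not the Lisp program above.

```c
#include <stdlib.h>

/* One node per prefix; children are indexed directly by nucleotide
   (A=0, C=1, G=2, T=3), so no list has to be scanned. */
typedef struct TrieNode {
    struct TrieNode *child[4];
} TrieNode;

static int nuc_index(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

TrieNode *trie_new(void)
{
    TrieNode *n = malloc(sizeof *n);
    if (n != NULL)
        for (int i = 0; i < 4; i++)
            n->child[i] = NULL;
    return n;
}

/* Inserts the first len characters of s. Returns 0 on success,
   -1 on an unexpected character or allocation failure. */
int trie_insert(TrieNode *root, const char *s, size_t len)
{
    TrieNode *cur = root;
    for (size_t i = 0; i < len; i++) {
        int c = nuc_index(s[i]);
        if (c < 0)
            return -1;
        if (cur->child[c] == NULL) {
            cur->child[c] = trie_new();
            if (cur->child[c] == NULL)
                return -1;
        }
        cur = cur->child[c];
    }
    return 0;
}

/* Length of the longest prefix of s (at most len characters)
   that spells a path starting at the root. */
size_t trie_match_len(const TrieNode *root, const char *s, size_t len)
{
    const TrieNode *cur = root;
    size_t i = 0;
    while (i < len) {
        int c = nuc_index(s[i]);
        if (c < 0 || cur->child[c] == NULL)
            break;
        cur = cur->child[c];
        i++;
    }
    return i;
}

void trie_free(TrieNode *node)
{
    if (node == NULL)
        return;
    for (int i = 0; i < 4; i++)
        trie_free(node->child[i]);
    free(node);
}
```

Inserting every full suffix of a 100,000-character string this way would need far too many nodes, so in practice the inserted length would have to be capped (or the single-character edges compressed, as a real suffix tree does).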

Vincenzo1968

09-08-2008, 17:23

Solved:

Stringhe

ACTGTCCTGAAGATCGCTTGGCATCTCCG
ACTGTCCTGCAGATCGCTTTGCATCTCCG

di lunghezza 29 trovate alle posizioni 50002 e 70002

Tempo impiegato -> 0.24900 secondi

Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

//#define FILE1 "dna1b.txt"
//#define FILE2 "dna2b.txt"

//#define FILE1 "dna1c.txt"
//#define FILE2 "dna2c.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 8

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1]; // +1 for the terminating NUL that strcpy writes
Lista *pLista;
struct tagTree *father;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

// empty list
if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
free(r); // do not leak the tree node if the list node cannot be allocated
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->father = NULL;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
node->left->father = node;
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
node->right->father = node;
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // key == node->data
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
long k;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

k = ftell(fp);

fclose(fp);

return k;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
FILE *fp;
char strTempo[512];

Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;
int bTrovato;
int MaxPrefix;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

if ( PREFIX_SIZE < p2_len )
MaxPrefix = PREFIX_SIZE;
else
MaxPrefix = p2_len;

if ( p1_len < PREFIX_SIZE )
{
printf("\n\nLa stringa piu' lunga risulta composta da %d caratteri.\nFai prima se ti apri i due file con blocco note e te le cerchi da solo.\nCiao ciao.\n\n", p1_len);

free(p1);
free(p2);

return 0;
}

for ( x = MaxPrefix; x > 0; x--)
{
pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - x )
{
memcpy(strSearch, p1 + pos1, x);
*(strSearch + x) = '\0';

pTree = TreeInsertNode(pTree, strSearch, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;
bTrovato = 0;

while ( pos2 < p2_len - x )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

TreeSearch(pTree, &pNode, strSearch);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
numErrors = 0;
k = x;
post1 = pos1;
post2 = pos2;
if ( pos2 > x && pos1 > x )
{
k = 0;
while ( (*(p2 + pos2) == *(p1 + pos1)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1 += 1;
pos2 += 1;
numErrors = 0;
}
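// stop at the end of either string instead of reading past the buffers
while ( pos1 + k < p1_len && pos2 + k < p2_len && ((*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2)) )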
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

bTrovato = 1;
}
pos2 = post2;
pLista = pLista->next;
}
}

pos2++;
}
TreeFree(pTree);
if ( bTrovato )
break;
}

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

x = len_prec + 1;

pos1 = 0;
pos2 = pos2x + 1;
strTemp1[0] = '\0';
strSearch[0] = '\0';
len_prec = len;
len = 0;
numErrors = 0;

memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

k = 0;
while ( k < p1_len && k < p2_len && *(p1 + k) != *(p2 + k) )
k++;
pos1 = k;

while ( pos1 < p1_len - x )
{
memcpy(strTemp1, p1 + pos1, x);
*(strTemp1 + x) = '\0';

k = 0;
while ( k < x )
{
if ( *(strSearch + k) != *(strTemp1 + k) )
numErrors++;
if ( numErrors > 2 )
break;
k++;
}
if ( k == x )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

break;
}

pos1++;
}

c_end = clock();

printf("%s", strRes);

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

fp = fopen("Risultato.txt", "a");
if ( fp != NULL ) // skip the log file if it cannot be opened
{
fwrite(strRes, strlen(strRes), 1, fp);
fwrite(strTempo, strlen(strTempo), 1, fp);
fclose(fp);
}

printf("%s", strTempo);

free(p1);
free(p2);

return 0;
}

Vincenzo1968

09-08-2008, 17:30

Two things, Vincenzo.
First, it would be good to have a clearly separate function that takes the 2 strings and the number of errors as input, so that we all have more or less the same interface.
As for the output, since more than one value is required (2 offsets and the length), in C you can use a struct.

Second, I didn't quite understand the meaning of that PREFIX_SIZE parameter.
That is, 8 looks a bit too specific to me, given that the result of the test case happens to consist of 3 substrings of length 9.
I wonder whether the algorithm as written keeps working with other inputs,
or whether this case is still covered if you set it to 10.

I tried the latest version I posted with strings of different lengths, and at different positions, and it works.
If anyone finds errors, please let me know.

I'll modify the code as you said (with the separate function). Unfortunately, these days I can't get online very often (I share the laptop with the whole family).

It also works with prefixes of length 5 (which I had chosen at first) or 3. I had no idea that the test case has 3 substrings of length 9.
I chose the numbers 3, 5 and 8 because they belong to the Fibonacci sequence (which fascinates me a great deal).

Wouldn't it be worth trying with larger files (say, ten times bigger than the current ones)?
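On the question of whether a prefix length of 10 would still cover the test case, a counting argument may help. If a match has length L and contains e mismatches (none at the ends, none adjacent), its L - e agreeing characters fall into exactly e + 1 exact runs, so the longest run has at least ceil((L - e) / (e + 1)) characters. The 29-character result posted above has its two mismatches at positions 9 and 19, which gives three exact runs of 9, so no exact seed longer than 9 lies inside it. The helper below only evaluates that bound; `min_longest_exact_run` is a hypothetical name, and how each posted program actually behaves with a different seed length is a separate matter.

```c
/* Smallest possible length of the longest exact run inside a match of
   `length` characters that contains `errors` isolated mismatches.
   Assumes errors >= 0 and length > errors. */
int min_longest_exact_run(int length, int errors)
{
    int matches = length - errors;      /* characters that agree */
    int runs = errors + 1;              /* mismatches split them into runs */
    return (matches + runs - 1) / runs; /* ceiling division */
}
```

For example, `min_longest_exact_run(29, 2)` gives 9 and `min_longest_exact_run(19, 1)` gives 9, while with no errors the bound is simply the match length.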

cionci

09-08-2008, 18:00

Vincenzo: I don't understand why, but simply pasting the code into a file and compiling it gives me a different result on the same two files :confused:

Stringhe

CCACTGTCCTGTCAACAAGGAGT
GGACTGTCCTGTCAACAAGGAGT

di lunghezza 23 trovate alle posizioni 10000 e 40000

Tempo impiegato -> 0.18000 secondi
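When two builds disagree like this, a slow but simple reference can help settle which answer is right on small inputs. The sketch below is one reading of the contest rules (at most `max_err` mismatches, never two in a row, a match must start and end on agreeing characters); `Match` and `longest_match_bruteforce` are hypothetical names and this is not any of the posted programs. It is far too slow for the 100,000-character files, so it is only meant for cross-checking short strings.

```c
#include <string.h>

typedef struct {
    int offset1; /* start of the match in the first string */
    int offset2; /* start of the match in the second string */
    int length;  /* number of aligned characters, mismatches included */
} Match;

Match longest_match_bruteforce(const char *s1, const char *s2, int max_err)
{
    Match best = {0, 0, 0};
    int n1 = (int)strlen(s1);
    int n2 = (int)strlen(s2);

    for (int i = 0; i < n1; i++) {
        for (int j = 0; j < n2; j++) {
            if (s1[i] != s2[j])
                continue; /* a match may not start with an error */

            int errors = 0;
            int prev_err = 0;
            for (int k = 0; i + k < n1 && j + k < n2; k++) {
                if (s1[i + k] == s2[j + k]) {
                    prev_err = 0;
                    /* Only record lengths that end on agreeing characters. */
                    if (k + 1 > best.length) {
                        best.offset1 = i;
                        best.offset2 = j;
                        best.length = k + 1;
                    }
                } else {
                    /* Two errors in a row, or too many errors: stop here. */
                    if (prev_err || errors == max_err)
                        break;
                    errors++;
                    prev_err = 1;
                }
            }
        }
    }
    return best;
}
```

With ties, this keeps the first match found while scanning the first string from left to right, which is an arbitrary choice.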

cdimauro

09-08-2008, 21:52

I've fixed the algorithm in my CL version. I'm posting it mostly for the record, since it's still unreadable :rolleyes: :(.
I love functional languages. :D
Performance isn't bad now: on average under two seconds, against roughly one second for the C version. It could probably be trimmed a bit, since I use plain lists for each node's children and the program spends about half its time walking those lists and doing GC :mbe: :D , but I have neither the time nor the inclination :p.
Can't you disable it before the computation and re-enable it right afterwards?

Vincenzo1968

09-08-2008, 21:54

Vincenzo: I don't understand why, but simply pasting the code into a file and compiling it gives me a different result on the same two files :confused:

Stringhe

CCACTGTCCTGTCAACAAGGAGT
GGACTGTCCTGTCAACAAGGAGT

di lunghezza 23 trovate alle posizioni 10000 e 40000

Tempo impiegato -> 0.18000 secondi

I've just re-downloaded the two files and it works :confused:
Maybe I posted the wrong code (I have three thousand versions by now :cry: )

Here is the code again:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

//#define FILE1 "dna1b.txt"
//#define FILE2 "dna2b.txt"

//#define FILE1 "dna1c.txt"
//#define FILE2 "dna2c.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 8

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1]; // +1 for the terminating NUL that strcpy writes
Lista *pLista;
struct tagTree *father;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

// empty list
if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
free(r); // do not leak the tree node if the list node cannot be allocated
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->father = NULL;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
node->left->father = node;
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
node->right->father = node;
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // key == node->data
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
long k;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

k = ftell(fp);

fclose(fp);

return k;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int main()
{
FILE *fp;
char strTempo[512];

Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;
int bTrovato;
int MaxPrefix;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(FILE1);
p2_len = LeggiDimensioniFile(FILE2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return -1;
}
p2[0] = '\0';

if ( !LeggiStringa(FILE1, p1, p1_len) )
return -1;

if ( !LeggiStringa(FILE2, p2, p2_len) )
return -1;

if ( PREFIX_SIZE < p2_len )
MaxPrefix = PREFIX_SIZE;
else
MaxPrefix = p2_len;

if ( p1_len < PREFIX_SIZE )
{
printf("\n\nLa stringa piu' lunga risulta composta da %d caratteri.\nFai prima se ti apri i due file con blocco note e te le cerchi da solo.\nCiao ciao.\n\n", p1_len);

free(p1);
free(p2);

return 0;
}

for ( x = MaxPrefix; x > 0; x--)
{
pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - x )
{
memcpy(strSearch, p1 + pos1, x);
*(strSearch + x) = '\0';

pTree = TreeInsertNode(pTree, strSearch, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;
bTrovato = 0;

while ( pos2 < p2_len - x )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

TreeSearch(pTree, &pNode, strSearch);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
numErrors = 0;
k = x;
post1 = pos1;
post2 = pos2;
if ( pos2 > x && pos1 > x )
{
k = 0;
while ( (*(p2 + pos2) == *(p1 + pos1)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1 += 1;
pos2 += 1;
numErrors = 0;
}
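// stop at the end of either string instead of reading past the buffers
while ( pos1 + k < p1_len && pos2 + k < p2_len && ((*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2)) )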
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

bTrovato = 1;
}
pos2 = post2;
pLista = pLista->next;
}
}

pos2++;
}
TreeFree(pTree);
if ( bTrovato )
break;
}

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

x = len_prec + 1;

pos1 = 0;
pos2 = pos2x + 1;
strTemp1[0] = '\0';
strSearch[0] = '\0';
len_prec = len;
len = 0;
numErrors = 0;

memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

k = 0;
while ( k < p1_len && k < p2_len && *(p1 + k) != *(p2 + k) )
k++;
pos1 = k;

while ( pos1 < p1_len - x )
{
memcpy(strTemp1, p1 + pos1, x);
*(strTemp1 + x) = '\0';

k = 0;
while ( k < x )
{
if ( *(strSearch + k) != *(strTemp1 + k) )
numErrors++;
if ( numErrors > 2 )
break;
k++;
}
if ( k == x )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

break;
}

pos1++;
}

c_end = clock();

printf("%s", strRes);

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

fp = fopen("Risultato.txt", "a");
if ( fp != NULL ) // skip the log file if it cannot be opened
{
fwrite(strRes, strlen(strRes), 1, fp);
fwrite(strTempo, strlen(strTempo), 1, fp);
fclose(fp);
}

printf("%s", strTempo);

free(p1);
free(p2);

return 0;
}

marco.r

09-08-2008, 22:06

I love functional languages. :D

It's unreadable because I wrote it badly and in a hurry. In this case the fault is mine, not the language's :p.

Can't you disable it before the computation and re-enable it right afterwards?
Sort of. I do a lot of allocations here and there in the code, so the collector has to step in. Or rather, for a small number of elements (like the 200000 of the test) it could probably be done, but once you get to the order of a million the memory used is too much not to collect some garbage. Things could be done with a bit more care, but as I said above I'm happy with the result I got :p, at least for now.

gugoXX

10-08-2008, 00:57

I'll raise again

Maximum common length 29
Maximum substring indexes 50002,70002
Substring1: ACTGTCCTGAAGATCGCTTGGCATCTCCG
Substring2: ACTGTCCTGCAGATCGCTTTGCATCTCCG
257ms

With this code.

static void Main(string[] args)
{
string input1 ="";
string input2 = "";

input1 = File.ReadAllText(@"C:\temp\DNA1.txt");
input2 = File.ReadAllText(@"C:\temp\DNA2.txt");

Stopwatch sw = new Stopwatch();
sw.Start();
Run Current = TryV5.Get(input1, input2, 2);
sw.Stop();

Console.WriteLine("Maximum common length {0}", Current.length);
Console.WriteLine("Maximum substring indexes {0},{1}", Current.index0, Current.index1);

string s0 = input1.Substring(Current.index0, Current.length);
string s1 = input2.Substring(Current.index1, Current.length);

Console.Write("Substring1: ");
for (int t = 0; t < Current.length; t++)
{
if (s0[t] == s1[t]) Console.ForegroundColor = ConsoleColor.White;
else Console.ForegroundColor = ConsoleColor.Red;
Console.Write(s0[t]);
}
Console.WriteLine();
Console.Write("Substring2: ");
for (int t = 0; t < Current.length; t++)
{
if (s0[t] == s1[t]) Console.ForegroundColor = ConsoleColor.White;
else Console.ForegroundColor = ConsoleColor.Red;
Console.Write(s1[t]);
}
Console.WriteLine();

Console.WriteLine("{0}ms",sw.ElapsedMilliseconds);
Console.ReadKey();
}

public static class TryV5
{
static string input1;
static string input2;
static int len1;
static int len2;
static int ERR;

public static Run Get(string i_input1, string i_input2, int i_ERR)
{
input1 = i_input1;
input2 = i_input2;
if (input2.Length > input1.Length)
{
string tmp = input1;
input1 = input2;
input2 = tmp;
}

ERR = i_ERR;

len1 = input1.Length;
len2 = input2.Length;

var Upper = new Dictionary<string, List<int>>[100];
var Lower = new Dictionary<string, List<int>>[100];

int length = (int)(Math.Log(len1, 4));
//int length = 6;
bool stay;
do
{
length++;
var mup = Upper[length] = BuildDict(input1, length);
var mdw = Lower[length] = BuildDict(input2, length);

var tst = from m in mup
join n in mdw on m.Key equals n.Key
select true;

int lntest = Math.Min(mup.Count, 1<<19);
int take = tst.Take(lntest).Count();
stay = (take == lntest);
if (take == 0) length -= 2;
} while (stay);

Run Winner = new Run();
int minlen = GetMinLen(length);

for(int ln=length;ln>=minlen;ln--)
{
var mup = Upper[ln];
if (mup == null) mup = BuildDict(input1, ln);
var mdw = Lower[ln];
if (mdw == null) mdw = BuildDict(input2, ln);

var tst = from m in mup
join n in mdw on m.Key equals n.Key
from u in m.Value
from d in n.Value
select Search(u, d, ln);

Run best = tst.Max();
if (best.length > Winner.length)
{
Winner = best;
minlen = GetMinLen(best.length);
if (minlen > ln) break;
}
}
return Winner;
}

public static int GetMinLen(int curmaxlen)
{
int minlen = (int)Math.Ceiling((float)(curmaxlen) / (float)(ERR+1));
return minlen;
}

private static Run Search(int upoffset, int downoffset,int sure)
{
int[] Pre = new int[ERR+1];
int[] Post = new int[ERR+2];

//SearchPost

bool lasterr = false;
for(int erfnd=0,pch1=upoffset+sure,pch2=downoffset+sure; erfnd<=ERR && pch1<len1 && pch2<len2 ;pch1++,pch2++)
{
char ch1 = input1[pch1];
char ch2 = input2[pch2];
if (ch1 == ch2) { Post[erfnd]++; lasterr = false; } // a match ends any run of errors
else
{
if (lasterr) break;
erfnd++;
lasterr = true;
}
}
sure += Post[0];

//SearchPre
lasterr = true;
for (int erfnd = 0, pch1=upoffset-2, pch2=downoffset-2; erfnd < ERR && pch1>=0 && pch2>=0; pch1--,pch2--)
{
char ch1 = input1[pch1];
char ch2 = input2[pch2];
if (ch1 == ch2) { Pre[erfnd]++; lasterr = false; } // a match ends any run of errors
else
{
if (lasterr) break;
erfnd++;
lasterr = true;
}
}

int pr = 0;
int po = 1;
for (int t = 0; t < ERR; t++)
{
int ppd = Post[po];
int ppr = Pre[pr];
if ((ppd == 0) && (ppr == 0)) break;
if (ppd > ppr) po++;
else pr++;
}

int ofprim = 0;
for (int t = 0; t < pr; t++)
{
ofprim += (Pre[t] + 1);
}

int ofdop = 0;
for (int t = 1; t < po; t++)
{
ofdop += (Post[t] + 1);
}

int len = sure + ofprim + ofdop;
return new Run(len, upoffset - ofprim, downoffset - ofprim);
}

private static Dictionary<string, List<int>> BuildDict(string input, int len)
{
int fin = input.Length - len;
var ret = new Dictionary<string, List<int>>(fin);

for (int t = 0; t < fin; t++)
{
string str = input.Substring(t, len);
List<int> adder;
if (!ret.TryGetValue(str, out adder))
{
adder = ret[str] = new List<int>();
}
adder.Add(t);
}
return ret;
}

public class CRun
{
public int offset1;
public int offset2;
public int length=1;
public int err = 0;
public bool previouserr = false;
public CRun(int i_offset1,int i_offset2)
{
offset1 = i_offset1;
offset2 = i_offset2;
}
}
}

Vincenzo1968

10-08-2008, 07:06

... it would be good to have a clearly separate function that takes the 2 strings and the number of errors as input, so that we all have more or less the same interface.
As for the output, since more than one value is required (2 offsets and the length), in C you can use a struct.

here it is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

//#define FILE1 "dna1b.txt"
//#define FILE2 "dna2b.txt"

//#define FILE1 "dna1c.txt"
//#define FILE2 "dna2c.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 8

typedef struct tagRisultato
{
int pos1;
int pos2;
char p1[1024];
char p2[1024];
int len;
double tempo;
} Risultato;

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1]; // +1 for the terminating NUL that strcpy writes
Lista *pLista;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
free(r); // do not leak the tree node if the list node cannot be allocated
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // pKey == node->key
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
long k;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

k = ftell(fp);

fclose(fp);

return k;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

void Trova(char *szNomeFile1, char *szNomeFile2, Risultato *pRisultato)
{
FILE *fp;
char strTempo[512];

Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;
int bTrovato;
int MaxPrefix;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(szNomeFile1);
p2_len = LeggiDimensioniFile(szNomeFile2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
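if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
free(p1); // release the first buffer before bailing out
return;
}
p2[0] = '\0';

if ( !LeggiStringa(szNomeFile1, p1, p1_len) )
{
free(p1);
free(p2);
return;
}

if ( !LeggiStringa(szNomeFile2, p2, p2_len) )
{
free(p1);
free(p2);
return;
}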

if ( PREFIX_SIZE < p2_len )
MaxPrefix = PREFIX_SIZE;
else
MaxPrefix = p2_len;

if ( p1_len < PREFIX_SIZE )
{
printf("\n\nLa stringa piu' lunga risulta composta da %d caratteri.\nFai prima se ti apri i due file con blocco note e te le cerchi da solo.\nCiao ciao.\n\n", p1_len);

pRisultato->len = 0; // len, pos1x and pos2x are still uninitialized at this point
pRisultato->pos1 = 0;
pRisultato->pos2 = 0;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);
pRisultato->tempo = 0;

free(p1);
free(p2);

return;
}

for ( x = MaxPrefix; x > 0; x--)
{
pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - x )
{
memcpy(strSearch, p1 + pos1, x);
*(strSearch + x) = '\0';

pTree = TreeInsertNode(pTree, strSearch, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;
bTrovato = 0;

while ( pos2 < p2_len - x )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

TreeSearch(pTree, &pNode, strSearch);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
numErrors = 0;
k = x;
post1 = pos1;
post2 = pos2;
if ( pos2 > x && pos1 > x )
{
k = 0;
while ( (*(p2 + pos2) == *(p1 + pos1)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1 += 1;
pos2 += 1;
numErrors = 0;
}
while ( (*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

bTrovato = 1;
}
pos2 = post2;
pLista = pLista->next;
}
}

pos2++;
}
TreeFree(pTree);
if ( bTrovato )
break;
}

pos1 = pos1x;
pos2 = pos2x;
len_prec = len;
numErrors = 0;
while ( (*(p1 + pos1) == *(p2 + pos2)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos2++;
pos1++;
if ( numErrors == 2 )
break;
}
len = 0;
numErrors = 0;
while ( (*(p1 + pos1 + len) == *(p2 + pos2 + len)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + len) != *(p1 + pos1 + len) )
numErrors++;
len++;
}
if ( len_prec < len )
{
pos1x = pos1;
pos2x = pos2;
}
else
{
len = len_prec;
}

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

x = len_prec + 1;

pos1 = 0;
pos2 = pos2x + 1;
strTemp1[0] = '\0';
strSearch[0] = '\0';
len_prec = len;
len = 0;
numErrors = 0;

memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

k = 0;
while ( *(p1 + k) != *(p2 + k) )
k++;
pos1 = k;

while ( pos1 < p1_len - x )
{
memcpy(strTemp1, p1 + pos1, x);
*(strTemp1 + x) = '\0';

k = 0;
while ( k < x )
{
if ( *(strSearch + k) != *(strTemp1 + k) )
numErrors++;
if ( numErrors > 2 )
break;
k++;
}
if ( k >= x && len_prec > len )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

break;
}

pos1++;
}

c_end = clock();

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

pRisultato->tempo = (double)(c_end - c_start) / CLOCKS_PER_SEC;

fp = fopen("Risultato.txt", "a");
if ( fp != NULL ) // skip the log if the file cannot be opened
{
fwrite(strRes, strlen(strRes), 1, fp);
fwrite(strTempo, strlen(strTempo), 1, fp);
fclose(fp);
}

free(p1);
free(p2);
}

int main()
{
Risultato ris = {0}; // zero-initialized: stays printable if Trova returns early
char strTempo[512];
char strRes[1024];

Trova(FILE1, FILE2, &ris);

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
ris.p1,
ris.p2,
ris.len,
ris.pos1,
ris.pos2);

printf("%s", strRes); // never pass data as the format string

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", ris.tempo);

printf("%s", strTempo);

return 0;
}

Vincenzo1968

10-08-2008, 07:12

I'm raising the stakes again

With this code.

static void Main(string[] args)
{
string input1 ="";
string input2 = "";

input1 = File.ReadAllText(@"C:\temp\DNA1.txt");
input2 = File.ReadAllText(@"C:\temp\DNA2.txt");

Stopwatch sw = new Stopwatch();
sw.Start();
Run Current = TryV5.Get(input1, input2, 2);
sw.Stop();

...

Hold on! In my final time I included the time spent reading the strings from the files.

I tried to compile your code, but I get this error:

Error 1 The type or namespace name 'Run' could not be found (are you missing a using directive or an assembly reference?) C:\Progetti CSharp\Gugo\Gugo\Gugo\Program.cs 17 23 Gugo

cionci

10-08-2008, 12:12

Here you go:

It keeps giving me strange solutions:
Stringhe

CCACTGTCCTGTCAACAAGGAGT
GGACTGTCCTGTCAACAAGGAGT

di lunghezza 23 trovate alle posizioni 10000 e 40000

Tempo impiegato -> 0.18000 secondi

I think there is a memory leak somewhere; otherwise it seems odd that it gives different solutions with different compilers.

Vincenzo1968

10-08-2008, 18:03

It keeps giving me strange solutions:
Stringhe

CCACTGTCCTGTCAACAAGGAGT
GGACTGTCCTGTCAACAAGGAGT

di lunghezza 23 trovate alle posizioni 10000 e 40000

Tempo impiegato -> 0.18000 secondi

I think there is a memory leak somewhere; otherwise it seems odd that it gives different solutions with different compilers.

That's true :confused:
With Visual C++ the result is correct, while with Watcom (which I downloaded and installed just for this) the result is wrong.

I give up. I'm spending the rest of my holidays away from the computer. For point A, my solution is the suffix tree one I posted at the beginning.

See you all in September.
Happy holidays, everyone
:)

Vincenzo1968

12-08-2008, 02:30

That's true :confused:
With Visual C++ the result is correct, while with Watcom (which I downloaded and installed just for this) the result is wrong.

I give up. I'm spending the rest of my holidays away from the computer. For point A, my solution is the suffix tree one I posted at the beginning.

See you all in September.
Happy holidays, everyone
:)

I found the bug.

This line (the key field of the binary tree node):

...
char key[PREFIX_SIZE];
...

must be replaced with this one:

...
char key[PREFIX_SIZE + 1];
...

It now works with both Visual Studio and Watcom.
Tempo impiegato -> 0.19400 secondi :)
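
Why the extra byte matters: the keys are copied into the node with strcpy, which writes the PREFIX_SIZE characters plus a terminating '\0'. With char key[PREFIX_SIZE] that terminator lands one byte past the array, so later strcmp calls may read into whatever follows the key. What follows depends on the compiler and the allocator, which would be consistent with Visual C++ and Watcom disagreeing. The sketch below only illustrates the size arithmetic; NodeTooSmall, NodeFixed, key_fits and copy_key are hypothetical names, not part of the posted code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PREFIX_SIZE 8

/* Illustrative only: same field order as the Tree struct in the posts. */
typedef struct {
    char key[PREFIX_SIZE];      /* too small: no room for the '\0' */
    void *pLista;
} NodeTooSmall;

typedef struct {
    char key[PREFIX_SIZE + 1];  /* PREFIX_SIZE characters plus the '\0' */
    void *pLista;
} NodeFixed;

/* Returns 1 if a key of key_len characters, plus its terminator,
   fits in a buffer of buf_size bytes; 0 otherwise. */
int key_fits(size_t buf_size, size_t key_len)
{
    return key_len + 1 <= buf_size;
}

/* Bounded alternative to strcpy: refuses keys that would overflow.
   Returns 0 on success, -1 if the key does not fit. */
int copy_key(char *dst, size_t dst_size, const char *src)
{
    size_t n;

    assert(dst != NULL && src != NULL);
    n = strlen(src);
    if (!key_fits(dst_size, n))
        return -1;
    memcpy(dst, src, n + 1);    /* n characters and the terminator */
    return 0;
}
```

A check such as copy_key(node->key, sizeof node->key, pKey) would have flagged the problem at insertion time instead of letting it surface as compiler-dependent results.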

Vincenzo1968

12-08-2008, 03:16

I adapted the code for point A from the code for point B. Here are the versions for both points:

Point A:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

//#define FILE1 "dna1b.txt"
//#define FILE2 "dna2b.txt"

//#define FILE1 "dna1c.txt"
//#define FILE2 "dna2c.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 8

typedef struct tagRisultato
{
int pos1;
int pos2;
char p1[1024];
char p2[1024];
int len;
double tempo;
} Risultato;

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1];
Lista *pLista;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
free(r); // do not leak the tree node when the list node cannot be allocated
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // pKey == node->key
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
long k;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

k = ftell(fp);

fclose(fp);

return k;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

void Trova(char *szNomeFile1, char *szNomeFile2, Risultato *pRisultato)
{
FILE *fp;
char strTempo[512];

Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int bTrovato;
int MaxPrefix;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(szNomeFile1);
p2_len = LeggiDimensioniFile(szNomeFile2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
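if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
free(p1); // release the first buffer before bailing out
return;
}
p2[0] = '\0';

if ( !LeggiStringa(szNomeFile1, p1, p1_len) )
{
free(p1);
free(p2);
return;
}

if ( !LeggiStringa(szNomeFile2, p2, p2_len) )
{
free(p1);
free(p2);
return;
}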

if ( PREFIX_SIZE < p2_len )
MaxPrefix = PREFIX_SIZE;
else
MaxPrefix = p2_len;

if ( p1_len < PREFIX_SIZE )
{
printf("\n\nLa stringa piu' lunga risulta composta da %d caratteri.\nFai prima se ti apri i due file con blocco note e te le cerchi da solo.\nCiao ciao.\n\n", p1_len);

pRisultato->len = 0;
pRisultato->pos1 = 0;
pRisultato->pos2 = 0;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);
pRisultato->tempo = 0;

free(p1);
free(p2);

return;
}

for ( x = MaxPrefix; x > 0; x--)
{
pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - x )
{
memcpy(strSearch, p1 + pos1, x);
*(strSearch + x) = '\0';

pTree = TreeInsertNode(pTree, strSearch, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
bTrovato = 0;

while ( pos2 < p2_len - x )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

TreeSearch(pTree, &pNode, strSearch);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
k = x;
post1 = pos1;
post2 = pos2;
if ( pos2 > x && pos1 > x )
{
k = 0;
while ( *(p2 + pos2) == *(p1 + pos1) )
{
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1 += 1;
pos2 += 1;
}
while ( *(p2 + pos2 + k) == *(p1 + pos1 + k) )
k++;
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;
bTrovato = 1;
}
pos2 = post2;
pLista = pLista->next;
}
}

pos2++;
}
TreeFree(pTree);
if ( bTrovato )
break;
}

pos1 = pos1x;
pos2 = pos2x;
len_prec = len;
while ( *(p1 + pos1) == *(p2 + pos2) )
{
pos2++;
pos1++;
}
len = 0;
while ( *(p1 + pos1 + len) == *(p2 + pos2 + len) )
len++;
if ( len_prec < len )
{
pos1x = pos1;
pos2x = pos2;
}
else
{
len = len_prec;
}

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

x = len_prec + 1;

pos1 = 0;
pos2 = pos2x + 1;
strTemp1[0] = '\0';
strSearch[0] = '\0';
len_prec = len;
len = 0;

memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

k = 0;
while ( *(p1 + k) != *(p2 + k) )
k++;
pos1 = k;

while ( pos1 < p1_len - x )
{
memcpy(strTemp1, p1 + pos1, x);
*(strTemp1 + x) = '\0';

k = 0;
/*
while ( k < x )
{
if ( *(strSearch + k) != *(strTemp1 + k) )
numErrors++;
if ( numErrors > 2 )
break;
k++;
}
*/
if ( k >= x && len_prec > len )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

break;
}

pos1++;
}

c_end = clock();

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

pRisultato->tempo = (double)(c_end - c_start) / CLOCKS_PER_SEC;

fp = fopen("Risultato.txt", "a");
if ( fp != NULL ) // skip the log if the file cannot be opened
{
fwrite(strRes, strlen(strRes), 1, fp);
fwrite(strTempo, strlen(strTempo), 1, fp);
fclose(fp);
}

free(p1);
free(p2);
}

int main()
{
Risultato ris = {0}; // zero-initialized: stays printable if Trova returns early
char strTempo[512];
char strRes[1024];

Trova(FILE1, FILE2, &ris);

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
ris.p1,
ris.p2,
ris.len,
ris.pos1,
ris.pos2);

printf("%s", strRes); // never pass data as the format string

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", ris.tempo);

printf("%s", strTempo);

return 0;
}

Point B:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

#define FILE1 "dna1.txt"
#define FILE2 "dna2.txt"

//#define FILE1 "dna1b.txt"
//#define FILE2 "dna2b.txt"

//#define FILE1 "dna1c.txt"
//#define FILE2 "dna2c.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 8

typedef struct tagRisultato
{
int pos1;
int pos2;
char p1[1024];
char p2[1024];
int len;
double tempo;
} Risultato;

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagTree
{
char key[PREFIX_SIZE + 1];
Lista *pLista;
struct tagTree *left;
struct tagTree *right;
} Tree;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

Tree *TreeNewNode(char *pKey, int pos);
Tree *TreeInsertNode(Tree *node, char * pKey, int pos);
void TreeSearch(Tree *head, Tree **result, char *pKey);
void TreeFree(Tree *head);

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

Tree *TreeNewNode(char *pKey, int pos)
{
Tree *r;
Lista *l;

r = (Tree *) malloc(sizeof(Tree));
if(!r)
{
printf("Memoria insufficiente.\n");
return NULL;
}

l = (Lista *) malloc(sizeof(Lista));
if(!l)
{
printf("Memoria insufficiente.\n");
free(r); // do not leak the tree node when the list node cannot be allocated
return NULL;
}
l->pos = pos;
l->next = NULL;

strcpy(r->key, pKey);
r->pLista = l;
r->left = NULL;
r->right = NULL;

return r;
}

Tree *TreeInsertNode(Tree *node, char * pKey, int pos)
{
int res;
Tree *pRadice = NULL;

if( !node )
{
node = TreeNewNode(pKey, pos);
return node;
}

pRadice = node;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
{
node->left = TreeNewNode(pKey, pos);
break;
}
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
{
node->right = TreeNewNode(pKey, pos);
break;
}
node = node->right;
}
else
{
node->pLista = ListAppend(node->pLista, pos);
break;
}

}

node = pRadice;

return node;
}

void TreeSearch(Tree *head, Tree **result, char *pKey)
{
int res;
Tree *node;

*result = NULL;
node = head;

if ( !head )
return;

while( 1 )
{
res = strcmp(pKey, node->key);
if ( res < 0 )
{
if ( !node->left )
break;
node = node->left;
}
else if ( res > 0 )
{
if ( !node->right )
break;
node = node->right;
}
else // pKey == node->key
{
*result = node;
break;
}
}
}

void TreeFree(Tree *head)
{
Tree *temp1, *temp2;

Tree *stack[MAX_STACK];
int top;

top = 0;

if ( !head )
return;

temp1 = temp2 = head;

while ( temp1 != NULL )
{
for(; temp1->left != NULL; temp1 = temp1->left)
stack[top++] = temp1;

while ( (temp1 != NULL) && (temp1->right == NULL || temp1->right == temp2) )
{
temp2 = temp1;
ListFree(temp2->pLista);
free(temp2);
if ( top == 0 )
return;
temp1 = stack[--top];
}

stack[top++] = temp1;
temp1 = temp1->right;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
long k;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

k = ftell(fp);

fclose(fp);

return k;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

void Trova(char *szNomeFile1, char *szNomeFile2, Risultato *pRisultato)
{
FILE *fp;
char strTempo[512];

Tree *pTree;
Tree *pNode;
Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;
int bTrovato;
int MaxPrefix;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(szNomeFile1);
p2_len = LeggiDimensioniFile(szNomeFile2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
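if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
free(p1); // release the first buffer before bailing out
return;
}
p2[0] = '\0';

if ( !LeggiStringa(szNomeFile1, p1, p1_len) )
{
free(p1);
free(p2);
return;
}

if ( !LeggiStringa(szNomeFile2, p2, p2_len) )
{
free(p1);
free(p2);
return;
}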

if ( PREFIX_SIZE < p2_len )
MaxPrefix = PREFIX_SIZE;
else
MaxPrefix = p2_len;

if ( p1_len < PREFIX_SIZE )
{
printf("\n\nLa stringa piu' lunga risulta composta da %d caratteri.\nFai prima se ti apri i due file con blocco note e te le cerchi da solo.\nCiao ciao.\n\n", p1_len);

pRisultato->len = 0;
pRisultato->pos1 = 0;
pRisultato->pos2 = 0;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);
pRisultato->tempo = 0;

free(p1);
free(p2);

return;
}

for ( x = MaxPrefix; x > 0; x--)
{
pTree = NULL;
pos1 = 0;
while ( pos1 < p1_len - x )
{
memcpy(strSearch, p1 + pos1, x);
*(strSearch + x) = '\0';

pTree = TreeInsertNode(pTree, strSearch, pos1);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;
bTrovato = 0;

while ( pos2 < p2_len - x )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

TreeSearch(pTree, &pNode, strSearch);
if ( pNode != NULL )
{
pLista = pNode->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
numErrors = 0;
k = x;
post1 = pos1;
post2 = pos2;
if ( pos2 > x && pos1 > x )
{
k = 0;
while ( (*(p2 + pos2) == *(p1 + pos1)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1 += 1;
pos2 += 1;
numErrors = 0;
}
while ( (*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

bTrovato = 1;
}
pos2 = post2;
pLista = pLista->next;
}
}

pos2++;
}
TreeFree(pTree);
if ( bTrovato )
break;
}

pos1 = pos1x;
pos2 = pos2x;
len_prec = len;
numErrors = 0;
while ( (*(p1 + pos1) == *(p2 + pos2)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos2++;
pos1++;
if ( numErrors == 2 )
break;
}
len = 0;
numErrors = 0;
while ( (*(p1 + pos1 + len) == *(p2 + pos2 + len)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + len) != *(p1 + pos1 + len) )
numErrors++;
len++;
}
if ( len_prec < len )
{
pos1x = pos1;
pos2x = pos2;
}
else
{
len = len_prec;
}

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

x = len_prec + 1;

pos1 = 0;
pos2 = pos2x + 1;
strTemp1[0] = '\0';
strSearch[0] = '\0';
len_prec = len;
len = 0;
numErrors = 0;

memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

k = 0;
while ( *(p1 + k) != *(p2 + k) )
k++;
pos1 = k;

while ( pos1 < p1_len - x )
{
memcpy(strTemp1, p1 + pos1, x);
*(strTemp1 + x) = '\0';

k = 0;
while ( k < x )
{
if ( *(strSearch + k) != *(strTemp1 + k) )
numErrors++;
if ( numErrors > 2 )
break;
k++;
}
if ( k >= x && len_prec > len )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

break;
}

pos1++;
}

c_end = clock();

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

pRisultato->tempo = (double)(c_end - c_start) / CLOCKS_PER_SEC;

fp = fopen("Risultato.txt", "a");
if ( fp != NULL ) // skip the log if the file cannot be opened
{
fwrite(strRes, strlen(strRes), 1, fp);
fwrite(strTempo, strlen(strTempo), 1, fp);
fclose(fp);
}

free(p1);
free(p2);
}

int main()
{
Risultato ris = {0}; // zero-initialized: stays printable if Trova returns early
char strTempo[512];
char strRes[1024];

Trova(FILE1, FILE2, &ris);

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
ris.p1,
ris.p2,
ris.len,
ris.pos1,
ris.pos2);

printf("%s", strRes); // never pass data as the format string

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", ris.tempo);

printf("%s", strTempo);

return 0;
}

In both cases, the time I get on my machine is about 0.2 seconds :)

Vincenzo1968

12-08-2008, 04:55

I ran quite a few tests with both compilers, using different strings in different positions, and everything seems to be in order.

For example, for point B:

http://www.guidealgoritmi.it/images/ImgForums/soluzionepuntob.jpg

Cionci, I'd be grateful if you could confirm the results with your compiler too (if I remember correctly, you use gcc on Unix/Linux?).

:)

gugoXX

12-08-2008, 09:56

Here is my complete code as well

Maximum common length 29
Maximum substring indexes 50002,70002
Substring1: ACTGTCCTGAAGATCGCTTGGCATCTCCG
Substring2: ACTGTCCTGCAGATCGCTTTGCATCTCCG
259ms

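using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

public class Program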
{
static void Main(string[] args)
{
string input1 = "";
string input2 = "";

input1 = File.ReadAllText(@"C:\temp\DNA1.txt");
input2 = File.ReadAllText(@"C:\temp\DNA2.txt");

//Randomizer rnd = new Randomizer();
//rnd.SetRandomTest();
//input1 = rnd.input1;
//input2 = rnd.input2;

Stopwatch sw = new Stopwatch();
sw.Start();
Run Current = TryV5.Get(input1, input2, 2);
sw.Stop();

Console.ForegroundColor = ConsoleColor.White;
Console.WriteLine("Maximum common length {0}", Current.length);
Console.WriteLine("Maximum substring indexes {0},{1}", Current.index0, Current.index1);

string s0 = input1.Substring(Current.index0, Current.length);
string s1 = input2.Substring(Current.index1, Current.length);

Console.Write("Substring1: ");
for (int t = 0; t < Current.length; t++)
{
if (s0[t] == s1[t]) Console.ForegroundColor = ConsoleColor.White;
else Console.ForegroundColor = ConsoleColor.Red;
Console.Write(s0[t]);
}
Console.WriteLine();
Console.Write("Substring2: ");
for (int t = 0; t < Current.length; t++)
{
if (s0[t] == s1[t]) Console.ForegroundColor = ConsoleColor.White;
else Console.ForegroundColor = ConsoleColor.Red;
Console.Write(s1[t]);
}
Console.WriteLine();

Console.WriteLine("{0}ms", sw.ElapsedMilliseconds);
Console.ReadKey();
}
}

public class TryV5
{
static string input1;
static string input2;
static int len1;
static int len2;
static int ERR;

public static Run Get(string i_input1, string i_input2, int i_ERR)
{
input1 = i_input1;
input2 = i_input2;
if (input2.Length > input1.Length)
{
string tmp = input1;
input1 = input2;
input2 = tmp;
}

ERR = i_ERR;

len1 = input1.Length;
len2 = input2.Length;

var Upper = new Dictionary<int, Dictionary<string, List<int>>>(100);
var Lower = new Dictionary<int, Dictionary<string, List<int>>>(100);

int length = (int)(Math.Log(len1, 4));
//int length = 6;
bool stay;
do
{
length++;

Dictionary<string, List<int>> mup, mdw;
Get2Dicts(length, out mup, out mdw);
Upper[length] = mup;
Lower[length] = mdw;

var tst = from m in mup
join n in mdw on m.Key equals n.Key
select true;

int lntest = Math.Min(mup.Count, 1 << 19);
int take = tst.Take(lntest).Count();
stay = (take == lntest);
if (take == 0) length -= 2;
} while (stay);

Run Winner = new Run();
int minlen = GetMinLen(length);

for (int ln = length; ln >= minlen; ln--)
{
Dictionary<string, List<int>> mup;
Dictionary<string, List<int>> mdw;
if (Upper.ContainsKey(ln))
{
mup = Upper[ln];
mdw = Lower[ln];
}
else
{
Get2Dicts(ln, out mup, out mdw);
}

// Parallelism could be added here
var tst = from m in mup
join n in mdw on m.Key equals n.Key
from u in m.Value
from d in n.Value
select Search(u, d, ln);

Run best = tst.Max(); // null when no seed of this length is shared
if (best != null && best.length > Winner.length)
{
Winner = best;
minlen = GetMinLen(best.length);
if (minlen > ln) break;
}
}
return Winner;
}

public static void Get2Dicts(int length, out Dictionary<string, List<int>> mup, out Dictionary<string, List<int>> mdw)
{
// Parallelism could be introduced here
mup = BuildDict(input1, length);
mdw = BuildDict(input2, length);
}

public static int GetMinLen(int curmaxlen)
{
int minlen = (int)Math.Ceiling((float)(curmaxlen) / (float)(ERR + 1));
return minlen;
}

private static Run Search(int upoffset, int downoffset, int sure)
{
int[] Pre = new int[ERR + 1];
int[] Post = new int[ERR + 2];

//SearchPost

bool lasterr = false;
for (int erfnd = 0, pch1 = upoffset + sure, pch2 = downoffset + sure; erfnd <= ERR && pch1 < len1 && pch2 < len2; pch1++, pch2++)
{
char ch1 = input1[pch1];
char ch2 = input2[pch2];
if (ch1 == ch2)
{
Post[erfnd]++;
lasterr = false;
}
else
{
if (lasterr) break;
erfnd++;
lasterr = true;
}
}
sure += Post[0];

//SearchPre
lasterr = true;
for (int erfnd = 0, pch1 = upoffset - 2, pch2 = downoffset - 2; erfnd < ERR && pch1 >= 0 && pch2 >= 0; pch1--, pch2--)
{
char ch1 = input1[pch1];
char ch2 = input2[pch2];
if (ch1 == ch2)
{
Pre[erfnd]++;
lasterr = false;
}
else
{
if (lasterr) break;
erfnd++;
lasterr = true;
}
}

int pr = 0;
int po = 1;
for (int t = 0; t < ERR; t++)
{
int ppd = Post[po];
int ppr = Pre[pr];
if ((ppd == 0) && (ppr == 0)) break;
if (ppd > ppr) po++;
else pr++;
}

int ofprim = 0;
for (int t = 0; t < pr; t++)
{
ofprim += (Pre[t] + 1);
}

int ofdop = 0;
for (int t = 1; t < po; t++)
{
ofdop += (Post[t] + 1);
}

int len = sure + ofprim + ofdop;
return new Run(len, upoffset - ofprim, downoffset - ofprim);
}

private static Dictionary<string, List<int>> BuildDict(string input, int len)
{
int fin = input.Length - len;
var ret = new Dictionary<string, List<int>>(fin);

for (int t = 0; t < fin; t++)
{
string str = input.Substring(t, len);
List<int> adder;
if (!ret.TryGetValue(str, out adder))
{
adder = ret[str] = new List<int>();
}
adder.Add(t);
}
return ret;
}
}

public class Run : IComparable<Run>
{
public int index0;
public int index1;
public int length;
public Run()
{
index0 = int.MinValue;
index1 = int.MinValue;
length = int.MinValue;
}

public Run(int len, int i0, int i1)
{
length = len;
index0 = i0;
index1 = i1;
}

#region IComparable<Run> Members

public int CompareTo(Run other)
{
// CompareTo avoids the overflow that a subtraction hits when a length is int.MinValue
return length.CompareTo(other.length);
}

#endregion
}
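
GetMinLen looks like a pigeonhole bound: a run of total length L with at most ERR mismatches contains at least L - ERR matching characters, split into at most ERR + 1 exact segments, so its longest exact segment has at least ceil((L - ERR) / (ERR + 1)) characters, which equals L / (ERR + 1) with integer division. A minimal sketch of that bound in C follows; the name min_seed_len is hypothetical, and the reading of GetMinLen's intent is an assumption, not something stated in the thread.

```c
/*
 * Shortest exact-match seed that must occur inside any run of total_len
 * characters containing at most max_err mismatches.
 * The run holds at least total_len - max_err matching characters, split
 * into at most max_err + 1 exact segments, so the longest segment has at
 * least ceil((total_len - max_err) / (max_err + 1)) characters. In integer
 * arithmetic that is simply total_len / (max_err + 1).
 */
int min_seed_len(int total_len, int max_err)
{
    if (total_len <= 0 || max_err < 0)
        return 0;
    return total_len / (max_err + 1);
}
```

For L = 32 and ERR = 2 this gives 10, whereas ceil(32 / 3) is 11. The floor form is never larger than the ceiling used in GetMinLen, so it is the more cautious of the two when deciding how short the seeds have to get.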

Vincenzo1968

12-08-2008, 17:08

Hi Gugo,

on my machine it is slightly slower:

http://www.guidealgoritmi.it/images/ImgForums/puntoBgugo.jpg

http://www.guidealgoritmi.it/images/ImgForums/puntoBgugo_3.jpg

and I found a small inconsistency. For example, if we put the strings suggested by cionci into the two files:

CCACTGCTGTGCGGATCTCTAAAAA

AACTCTCTCCGGGTCACATCCTCCA

the result should be the following pair, of length 8, found at positions 6 and 4:

Stringhe

CTGTGCGG
CTCTCCGG

di lunghezza 8 trovate alle posizioni 6 e 4

but instead it is:

http://www.guidealgoritmi.it/images/ImgForums/puntoBgugo_2.jpg

At this point, given the ridiculously short times we are getting, wouldn't it make sense to try larger files?
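
The pair above can be checked mechanically against the rules of question 2 (at most ERR mismatches, never two in a row, none at the first or last position). A minimal sketch, where is_valid_run is a hypothetical helper and not part of either posted program:

```c
#include <stddef.h>

/*
 * Returns 1 if a[0..len) and b[0..len) form a valid run under the contest
 * rules: at most max_err mismatches, no two mismatches in a row, and no
 * mismatch at the first or last position. Returns 0 otherwise.
 */
int is_valid_run(const char *a, const char *b, size_t len, int max_err)
{
    int errors = 0;
    int prev_err = 0;
    size_t i;

    if (len == 0)
        return 0;
    for (i = 0; i < len; i++) {
        if (a[i] != b[i]) {
            /* reject consecutive mismatches and mismatches at either end */
            if (prev_err || i == 0 || i == len - 1)
                return 0;
            if (++errors > max_err)
                return 0;
            prev_err = 1;
        } else {
            prev_err = 0;
        }
    }
    return 1;
}
```

Called on the two strings at offsets 6 and 4 with length 8 and ERR = 2, it accepts the pair: the mismatches sit at relative positions 2 and 4, so they are neither adjacent nor at an end.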

gugoXX

12-08-2008, 17:20

Hi Gugo,

...

the result should be the following pair, of length 8, found at positions 6 and 4:

Stringhe

CTGTGCGG
CTCTCCGG

di lunghezza 8 trovate alle posizioni 6 e 4

but instead it is:

http://www.guidealgoritmi.it/images/ImgForums/puntoBgugo_2.jpg

At this point, given the ridiculously short times we are getting, wouldn't it make sense to try larger files?

:) I think I simply found the other result first.
Both are valid, right? And both are exactly 8 long.
Happy to try longer inputs: if you like, tonight (or tomorrow, I think) I can put together a couple of files of about ten million characters each.
PS: Nice colored output, isn't it?

Vincenzo1968

12-08-2008, 17:32

:) I think I simply found the other result first.
Both are valid, right? And both are exactly 8 long.
Happy to try longer inputs: if you like, tonight (or tomorrow, I think) I can put together a couple of files of about ten million characters each.
PS: Nice colored output, isn't it?

Yes, both are valid, and yes, the colored output is beautiful ;)

Vincenzo1968

12-08-2008, 20:30

I ran some tests on files ten times larger. In practice I removed the two strings from the original files, copied and pasted each file's contents ten times, and finally reinserted the strings at two random positions.
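
The construction described above (repeat the base text, then reinsert the known string at a chosen offset) can be sketched as follows. build_test_input and its parameters are hypothetical names, and the offset is passed in rather than drawn at random so that the expected answer is known in advance.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Builds a new string made of `times` copies of `base`, then inserts
 * `motif` at `offset` (shifting the tail to the right). Returns NULL on
 * allocation failure or if the arguments are out of range. The caller
 * frees the result.
 */
char *build_test_input(const char *base, int times, const char *motif, size_t offset)
{
    size_t base_len = strlen(base);
    size_t motif_len = strlen(motif);
    size_t total;
    char *out;
    int i;

    if (times <= 0)
        return NULL;
    total = base_len * (size_t)times;
    if (offset > total)
        return NULL;

    out = malloc(total + motif_len + 1);
    if (out == NULL)
        return NULL;

    /* repeated copies of the base text */
    for (i = 0; i < times; i++)
        memcpy(out + (size_t)i * base_len, base, base_len);

    /* open a gap at offset and drop the motif into it */
    memmove(out + offset + motif_len, out + offset, total - offset);
    memcpy(out + offset, motif, motif_len);
    out[total + motif_len] = '\0';
    return out;
}
```

With a fixed offset per file, the positions a correct solver should report are known before the run, which makes a wrong answer easy to spot.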

These are the results:

C# version (GugoXX):
http://www.guidealgoritmi.it/images/ImgForums/dna_gugo.jpg

C version (mine):
http://www.guidealgoritmi.it/images/ImgForums/dna_mio.jpg

To run the tests I modified the main of the C# version so that it includes the time spent reading the files (as I do in the C version):

...
Stopwatch sw = new Stopwatch();
sw.Start();

input1 = File.ReadAllText("DNA1E.txt");
input2 = File.ReadAllText("DNA2E.txt");

//Randomizer rnd = new Randomizer();
//rnd.SetRandomTest();
//input1 = rnd.input1;
//input2 = rnd.input2;

Run Current = TryV5.Get(input1, input2, 2);
sw.Stop();
...

gugoXX

12-08-2008, 20:51

It is better not to include loading times, to avoid the effects of the system cache, which can change the results considerably.
Now that you have separated out the computation function, it should be feasible for you too to wrap only that with the timer.

Vincenzo1968

12-08-2008, 21:47

I managed to get below two seconds by using a hash table instead of the binary tree:

http://www.guidealgoritmi.it/images/ImgForums/SoluzionePuntoB_HT.jpg

This is the code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

//#define FILE1 "dna1.txt"
//#define FILE2 "dna2.txt"

//#define FILE1 "dna1b.txt"
//#define FILE2 "dna2b.txt"

//#define FILE1 "dna1c.txt"
//#define FILE2 "dna2c.txt"

//#define FILE1 "dna1d.txt"
//#define FILE2 "dna2d.txt"

#define FILE1 "dna1e.txt"
#define FILE2 "dna2e.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 8

#define DIM_HASHTABLE 131071

typedef struct tagRisultato
{
int pos1;
int pos2;
char p1[1024];
char p2[1024];
int len;
double tempo;
} Risultato;

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagHashTable
{
char key[PREFIX_SIZE + 1];
Lista *pLista;
struct tagHashTable *next;
} HashTable;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

int HashU(char *v, int M);

HashTable* HashTableNewNode(char *Key, int pos)
{
HashTable *n;

n = (HashTable*)malloc(sizeof(HashTable));

if( n == NULL )
return NULL;

strcpy(n->key, Key);
n->pLista = NULL;
n->pLista = ListAppend(n->pLista, pos);
n->next = NULL;

return n;
}

int FindValue(HashTable **pHashTable, char *Key, int M)
{
int index = 0;
HashTable *t;
int a = 31415;
int b = 27183;
char *s = Key;

//for ( index = 0; *s != '\0'; s++, a = a*b % (M - 1) )
// index = (a*index + *s) % M;
//if ( index < 0 )
// index *= -1;

for(; *s != '\0'; s++)
index = (a*index + *s) % M;
if ( index < 0 )
index *= -1;

//printf("%s -> %d\n", Key, index);

t = pHashTable[index];
while ( t != NULL )
{
if ( strcmp(t->key, Key) == 0 )
return index;
t = t->next;
}

return -1;
}

void InsertValue(HashTable **pHashTable, char *Key, int pos, int M)
{
int index = 0;
HashTable *t = NULL;
int a = 31415;
int b = 27183;
char *s = Key;

//for ( index = 0; *s != '\0'; s++, a = a*b % (M - 1) )
// index = (a*index + *s) % M;
//if ( index < 0 )
// index *= -1;

for(; *s != '\0'; s++)
index = (a*index + *s) % M;
if ( index < 0 )
index *= -1;

//printf("%s -> %d\n", Key, index);

t = pHashTable[index];
if ( t == NULL )
{
pHashTable[index] = HashTableNewNode(Key, pos);
return;
}

while ( t != NULL )
{
if ( strcmp(t->key, Key) == 0 )
{
pHashTable[index]->pLista = ListAppend(pHashTable[index]->pLista, pos);
return;
}
if ( t->next == NULL )
{
t->next = HashTableNewNode(Key, pos);
t = t->next;
t->next = NULL;
return;
}
t = t->next;
}
}

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

void HashTableFree(HashTable* first)
{
HashTable *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
ListFree(n1->pLista);
free(n1);
n1 = n2;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
long dimFile;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

/* ftell returns a long; fpos_t is opaque and cannot be portably cast to int */
dimFile = ftell(fp);

fclose(fp);

if ( dimFile < 0 )
return 0;

return (int)dimFile;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

void Trova(char *szNomeFile1, char *szNomeFile2, Risultato *pRisultato)
{
FILE *fp;
char strTempo[512];

//Tree *pTree;
//Tree *pNode;

HashTable *pHT[DIM_HASHTABLE];
Lista *pNode;

Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;
int bTrovato;
int MaxPrefix;
int index;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

c_start = clock();

p1_len = LeggiDimensioniFile(szNomeFile1);
p2_len = LeggiDimensioniFile(szNomeFile2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return;
}
p2[0] = '\0';

if ( !LeggiStringa(szNomeFile1, p1, p1_len) )
return;

if ( !LeggiStringa(szNomeFile2, p2, p2_len) )
return;

if ( PREFIX_SIZE < p2_len )
MaxPrefix = PREFIX_SIZE;
else
MaxPrefix = p2_len;

if ( p1_len < PREFIX_SIZE )
{
printf("\n\nLa stringa piu' lunga risulta composta da %d caratteri.\nFai prima se ti apri i due file con blocco note e te le cerchi da solo.\nCiao ciao.\n\n", p1_len);

pRisultato->len = 0;
pRisultato->pos1 = 0;
pRisultato->pos2 = 0;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);
pRisultato->tempo = 0;

free(p1);
free(p2);

return;
}

for (k = 0; k < DIM_HASHTABLE; k++ )
pHT[k] = NULL;

for ( x = MaxPrefix; x > 0; x--)
{
for ( k = 0; k < DIM_HASHTABLE; k++ )
{
if ( pHT[k] != NULL )
{
HashTableFree(pHT[k]);
pHT[k] = NULL; /* avoid dangling bucket pointers on the next pass */
}
}

pos1 = 0;
while ( pos1 < p1_len - PREFIX_SIZE )
{
memcpy(strSearch, p1 + pos1, PREFIX_SIZE);
*(strSearch + PREFIX_SIZE) = '\0';

InsertValue(pHT, strSearch, pos1, DIM_HASHTABLE);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;
bTrovato = 0;

while ( pos2 < p2_len - x )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

index = FindValue(pHT, strSearch, DIM_HASHTABLE);
if ( index >= 0 )
{
pLista = pHT[index]->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
numErrors = 0;
k = x;
post1 = pos1;
post2 = pos2;
if ( pos2 > x && pos1 > x )
{
k = 0;
while ( (*(p2 + pos2) == *(p1 + pos1)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1 += 1;
pos2 += 1;
numErrors = 0;
}
while ( (*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( k > len_prec )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

bTrovato = 1;
}
pos2 = post2;
pLista = pLista->next;
}
}

pos2++;
}
if ( bTrovato )
break;
}

pos1 = pos1x;
pos2 = pos2x;
len_prec = len;
numErrors = 0;
while ( (*(p1 + pos1) == *(p2 + pos2)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos2++;
pos1++;
if ( numErrors == 2 )
break;
}
len = 0;
numErrors = 0;
while ( (*(p1 + pos1 + len) == *(p2 + pos2 + len)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + len) != *(p1 + pos1 + len) )
numErrors++;
len++;
}
if ( len_prec < len )
{
pos1x = pos1;
pos2x = pos2;
}
else
{
len = len_prec;
}

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

x = len_prec + 1;

pos1 = 0;
pos2 = pos2x + 1;
strTemp1[0] = '\0';
strSearch[0] = '\0';
len_prec = len;
len = 0;
numErrors = 0;

memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

k = 0;
while ( *(p1 + k) != *(p2 + k) )
k++;
pos1 = k;

while ( pos1 < p1_len - x )
{
memcpy(strTemp1, p1 + pos1, x);
*(strTemp1 + x) = '\0';

k = 0;
while ( k < x )
{
if ( *(strSearch + k) != *(strTemp1 + k) )
numErrors++;
if ( numErrors > 2 )
break;
k++;
}
if ( k >= x && len_prec > len )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

break;
}

pos1++;
}

c_end = clock();

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

pRisultato->tempo = (double)(c_end - c_start) / CLOCKS_PER_SEC;

fp = fopen("Risultato.txt", "a");
fwrite(strRes, strlen(strRes), 1, fp);
fwrite(strTempo, strlen(strTempo), 1, fp);
fclose(fp);

free(p1);
free(p2);

for ( k = 0; k < DIM_HASHTABLE; k++ )
HashTableFree(pHT[k]);
}

int main()
{
Risultato ris;
char strTempo[512];
char strRes[1024];

Trova(FILE1, FILE2, &ris);

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
ris.p1,
ris.p2,
ris.len,
ris.pos1,
ris.pos2);

printf("%s", strRes); /* never pass data as the format string */

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", ris.tempo);
printf("%s", strTempo);

return 0;
}

:)
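
One caveat about the hash in FindValue and InsertValue: the product a*index is computed in int, and with index close to DIM_HASHTABLE = 131071 it can reach about 4.1e9, beyond what a 32-bit int holds, so the sign fix-up after the loop is probably compensating for overflow. A hedged alternative, assuming a 32-bit int and a compiler that provides <stdint.h>, computes the same polynomial hash in unsigned 64-bit arithmetic; hash_key is a hypothetical name.

```c
#include <stdint.h>

/*
 * Same polynomial hash as FindValue/InsertValue (multiplier 31415, modulus
 * M), but accumulated in 64-bit unsigned arithmetic so the product
 * a * index cannot overflow and the result is always in [0, M).
 * M must be positive.
 */
int hash_key(const char *key, int M)
{
    const uint64_t a = 31415u;
    uint64_t index = 0;
    const unsigned char *s = (const unsigned char *)key;

    for (; *s != '\0'; s++)
        index = (a * index + *s) % (uint64_t)M;
    return (int)index;
}
```

Bucket numbers will differ from the posted code whenever the original overflowed, which is harmless as long as insertion and lookup share the same function.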

Vincenzo1968

12-08-2008, 21:58

It is better not to include loading times, to avoid the effects of the system cache, which can change the results considerably.
Now that you have separated out the computation function, it should be feasible for you too to wrap only that with the timer.

In my opinion, for this contest, it is important to take the file reading times into account as well.
Suppose, for example, that a biologist has to examine a thousand pairs of files in sequence.

Vincenzo1968

12-08-2008, 22:17

A clarification.
In the second file I did move the string to a different position, but I forgot to do the copy and paste ten times.

The results therefore refer to a first file that is ten times larger than the original, while the second file is unchanged in size.

The two files can be downloaded from here (http://www.guidealgoritmi.it/public/dna1e.zip).

gugoXX

13-08-2008, 00:54

On my machine it takes

Maximum common length 29
Maximum substring indexes 974640,9235
Substring1: ACTGTCCTGAAGATCGCTTGGCATCTCCG
Substring2: ACTGTCCTGCAGATCGCTTTGCATCTCCG
1364ms

Even including the file reading.
And you will tell me: but we have to run the tests on the same machine...
and I will tell you: if your doctor has decided to include the file reading, mine has decided to use my machine. :)

Seriously, the file reading should be left out. If I run the program twice in a row, the first time it takes 1.8 s;
the second time, with the files already in the cache, it takes 1.364 s.
Obviously it is impossible to get reliable data that way.
I would try to focus on the timing of the algorithms, leaving out as much noise as possible, both for this and for other contests.
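
One common way to reduce that kind of noise, sketched here as an assumption rather than as what either program does, is to load the inputs first, run only the computation several times, and keep the smallest time. best_of and the bump workload are hypothetical names.

```c
#include <time.h>

/*
 * Calls work(arg) `runs` times and returns the smallest time reported by
 * clock() for a single call, in seconds, or -1.0 if runs <= 0.
 * File loading is expected to happen before this function is called, so
 * it never enters the measurement.
 */
double best_of(void (*work)(void *), void *arg, int runs)
{
    double best = -1.0;
    int i;

    for (i = 0; i < runs; i++) {
        clock_t start = clock();
        double elapsed;

        work(arg);
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Placeholder workload for illustration: counts how often it was called. */
void bump(void *arg)
{
    ++*(int *)arg;
}
```

In a real benchmark the workload would be a wrapper around the search function, with both strings already in memory.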

Let's try with this one
http://www.usaupload.net/d/1j5ruhh7wyb

Vincenzo1968

13-08-2008, 04:45

On my machine it takes

Even including the file reading.
And you will tell me: but we have to run the tests on the same machine...
and I will tell you: if your doctor has decided to include the file reading, mine has decided to use my machine. :)

Seriously, the file reading should be left out. If I run the program twice in a row, the first time it takes 1.8 s;
the second time, with the files already in the cache, it takes 1.364 s.
Obviously it is impossible to get reliable data that way.
I would try to focus on the timing of the algorithms, leaving out as much noise as possible, both for this and for other contests.

Let's try with this one
http://www.usaupload.net/d/1j5ruhh7wyb

Ok, I removed the file reading from both my program and yours.

This is my result:

http://www.guidealgoritmi.it/images/ImgForums/dna_mio2.jpg

With your program, after five minutes of waiting in vain, I had to restart Windows, which had slowed down enormously (and I tried several times to reboot and launch the application with no other window open).
Just how much memory does your program use? :)

This is my machine:
http://www.guidealgoritmi.it/images/ImgForums/matrice05.jpg

Vincenzo1968

13-08-2008, 05:59

I made one last attempt:

1) I rebooted the system
2) I launched your application
3) I got dressed, took the car, and went to eat warm croissants in Cefalù (about 3 km from where I live).
4) I came back and, unfortunately, did not have the pleasure of seeing the results on the console.

The code I used is this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.IO;

namespace Contest04BGugo
{
public class Program
{
static void Main(string[] args)
{
string input1 = "";
string input2 = "";

input1 = File.ReadAllText("DNA1F.txt");
input2 = File.ReadAllText("DNA2F.txt");

//Randomizer rnd = new Randomizer();
//rnd.SetRandomTest();
//input1 = rnd.input1;
//input2 = rnd.input2;

Stopwatch sw = new Stopwatch();
sw.Start();
Run Current = TryV5.Get(input1, input2, 2);
sw.Stop();

Console.ForegroundColor = ConsoleColor.White;
Console.WriteLine("Maximum common length {0}", Current.length);
Console.WriteLine("Maximum substring indexes {0},{1}", Current.index0, Current.index1);

string s0 = input1.Substring(Current.index0, Current.length);
string s1 = input2.Substring(Current.index1, Current.length);

Console.Write("Substring1: ");
for (int t = 0; t < Current.length; t++)
{
if (s0[t] == s1[t]) Console.ForegroundColor = ConsoleColor.White;
else Console.ForegroundColor = ConsoleColor.Red;
Console.Write(s0[t]);
}
Console.WriteLine();
Console.Write("Substring2: ");
for (int t = 0; t < Current.length; t++)
{
if (s0[t] == s1[t]) Console.ForegroundColor = ConsoleColor.White;
else Console.ForegroundColor = ConsoleColor.Red;
Console.Write(s1[t]);
}
Console.WriteLine();

Console.WriteLine("{0}ms", sw.ElapsedMilliseconds);
Console.ReadKey();
}
}

public class TryV5
{
static string input1;
static string input2;
static int len1;
static int len2;
static int ERR;

public static Run Get(string i_input1, string i_input2, int i_ERR)
{
input1 = i_input1;
input2 = i_input2;
if (input2.Length > input1.Length)
{
string tmp = input1;
input1 = input2;
input2 = tmp;
}

ERR = i_ERR;

len1 = input1.Length;
len2 = input2.Length;

var Upper = new Dictionary<int, Dictionary<string, List<int>>>(100);
var Lower = new Dictionary<int, Dictionary<string, List<int>>>(100);

int length = (int)(Math.Log(len1, 4));
//int length = 6;
bool stay;
do
{
length++;

Dictionary<string, List<int>> mup, mdw;
Get2Dicts(length, out mup, out mdw);
Upper[length] = mup;
Lower[length] = mdw;

var tst = from m in mup
join n in mdw on m.Key equals n.Key
select true;

int lntest = Math.Min(mup.Count, 1 << 19);
int take = tst.Take(lntest).Count();
stay = (take == lntest);
if (take == 0) length -= 2;
} while (stay);

Run Winner = new Run();
int minlen = GetMinLen(length);

for (int ln = length; ln >= minlen; ln--)
{
Dictionary<string, List<int>> mup;
Dictionary<string, List<int>> mdw;
if (Upper.ContainsKey(ln))
{
mup = Upper[ln];
mdw = Lower[ln];
}
else
{
Get2Dicts(ln, out mup, out mdw);
}

// Parallelism goes here
var tst = from m in mup
join n in mdw on m.Key equals n.Key
from u in m.Value
from d in n.Value
select Search(u, d, ln);

Run best = tst.Max();
if (best.length > Winner.length)
{
Winner = best;
minlen = GetMinLen(best.length);
if (minlen > ln) break;
}
}
return Winner;
}

public static void Get2Dicts(int length, out Dictionary<string, List<int>> mup, out Dictionary<string, List<int>> mdw)
{
// Parallelism can be inserted here
mup = BuildDict(input1, length);
mdw = BuildDict(input2, length);
}

public static int GetMinLen(int curmaxlen)
{
int minlen = (int)Math.Ceiling((float)(curmaxlen) / (float)(ERR + 1));
return minlen;
}

private static Run Search(int upoffset, int downoffset, int sure)
{
int[] Pre = new int[ERR + 1];
int[] Post = new int[ERR + 2];

//SearchPost

bool lasterr = false;
for (int erfnd = 0, pch1 = upoffset + sure, pch2 = downoffset + sure; erfnd <= ERR && pch1 < len1 && pch2 < len2; pch1++, pch2++)
{
char ch1 = input1[pch1];
char ch2 = input2[pch2];
if (ch1 == ch2)
{
Post[erfnd]++;
lasterr = false;
}
else
{
if (lasterr) break;
erfnd++;
lasterr = true;
}
}
sure += Post[0];

//SearchPre
lasterr = true;
for (int erfnd = 0, pch1 = upoffset - 2, pch2 = downoffset - 2; erfnd < ERR && pch1 >= 0 && pch2 >= 0; pch1--, pch2--)
{
char ch1 = input1[pch1];
char ch2 = input2[pch2];
if (ch1 == ch2)
{
Pre[erfnd]++;
lasterr = false;
}
else
{
if (lasterr) break;
erfnd++;
lasterr = true;
}
}

int pr = 0;
int po = 1;
for (int t = 0; t < ERR; t++)
{
int ppd = Post[po];
int ppr = Pre[pr];
if ((ppd == 0) && (ppr == 0)) break;
if (ppd > ppr) po++;
else pr++;
}

int ofprim = 0;
for (int t = 0; t < pr; t++)
{
ofprim += (Pre[t] + 1);
}

int ofdop = 0;
for (int t = 1; t < po; t++)
{
ofdop += (Post[t] + 1);
}

int len = sure + ofprim + ofdop;
return new Run(len, upoffset - ofprim, downoffset - ofprim);
}

private static Dictionary<string, List<int>> BuildDict(string input, int len)
{
int fin = Math.Max(0, input.Length - len + 1); // number of substrings of length len, including the last one
var ret = new Dictionary<string, List<int>>(fin);

for (int t = 0; t < fin; t++)
{
string str = input.Substring(t, len);
List<int> adder;
if (!ret.TryGetValue(str, out adder))
{
adder = ret[str] = new List<int>();
}
adder.Add(t);
}
return ret;
}
}

public class Run : IComparable<Run>
{
public int index0;
public int index1;
public int length;
public Run()
{
index0 = int.MinValue;
index1 = int.MinValue;
length = int.MinValue;
}

public Run(int len, int i0, int i1)
{
length = len;
index0 = i0;
index1 = i1;
}

#region IComparable<Run> Members

public int CompareTo(Run other)
{
// CompareTo avoids the overflow that a subtraction hits when a length is int.MinValue
return length.CompareTo(other.length);
}

#endregion
}
}

Vincenzo1968

13-08-2008, 06:08

On my machine it takes

1364ms

Even including the file reading.

I don't know what supercomputer you are using; the best result on my machine, without counting the file reading time, is 2499 ms:

http://www.guidealgoritmi.it/images/ImgForums/dna_gugo2.jpg

gugoXX

13-08-2008, 09:22

I don't know what to say.
I have an Intel Core2Duo at 3 GHz and 2 GB of RAM.

On my machine it takes

Maximum common length 30
Maximum substring indexes 150002,170002
Substring1: AATTCAGACACTACAGTGACATGTACATGT
Substring2: AATTAAGACACGACAGTGACATGTACATGT
23931ms

And with a bit of easy parallelism I get down to

Maximum common length 30
Maximum substring indexes 150002,170002
Substring1: AATTCAGACACTACAGTGACATGTACATGT
Substring2: AATTAAGACACGACAGTGACATGTACATGT
20973ms

Even in debug mode, the times go up by only about 1 second.

Vincenzo1968

13-08-2008, 09:49

I don't know what to say.
I have an Intel Core2Duo at 3 GHz and 2 GB of RAM.

On my machine it takes

And with a bit of easy parallelism I get down to

Even in debug mode, the times go up by only about 1 second.

So, in order to work, it needs at least 2 GB of RAM?

gugoXX

13-08-2008, 09:53

So, in order to work, it needs at least 2 GB of RAM?

I couldn't say.
I imagine it depends on the input.
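
For the C hash-table version, at least, an upper bound on the table's size follows from the struct sizes: one Lista node per indexed position, at most one HashTable node per distinct key, plus the bucket array. The sketch below is only an estimate (it ignores malloc's per-block overhead), estimate_table_bytes is a hypothetical name, and it says nothing about how much memory the C# Dictionary approach needs.

```c
#include <stddef.h>

#define PREFIX_SIZE 8

/* Same layouts as the structs in the posted C program. */
typedef struct tagLista { int pos; struct tagLista *next; } Lista;
typedef struct tagHashTable {
    char key[PREFIX_SIZE + 1];
    Lista *pLista;
    struct tagHashTable *next;
} HashTable;

/*
 * Upper bound, in bytes, for a table indexing every substring of length
 * key_len of an input of input_len characters over a 4-letter alphabet.
 */
size_t estimate_table_bytes(size_t input_len, size_t key_len, size_t buckets)
{
    size_t positions, distinct = 1, i;

    if (key_len == 0 || key_len > input_len)
        return buckets * sizeof(HashTable *);

    positions = input_len - key_len + 1;

    /* distinct keys cannot exceed min(positions, 4^key_len) */
    for (i = 0; i < key_len && distinct < positions; i++)
        distinct *= 4;
    if (distinct > positions)
        distinct = positions;

    return buckets * sizeof(HashTable *)
         + positions * sizeof(Lista)
         + distinct * sizeof(HashTable);
}
```

Because the number of positions grows with the input while the number of distinct keys is capped at 4^key_len, the position lists dominate the footprint for long inputs and short keys.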

Vincenzo1968

13-08-2008, 10:40

Look at the times I managed to get on my old clunker with a few small changes to the code (but without easy parallelism ;) ):

http://www.guidealgoritmi.it/images/ImgForums/dna_mio_definitivo.jpg

Before posting the code I want to test it thoroughly, because I still can't believe it :D

Vincenzo1968

13-08-2008, 10:42

I couldn't say.
I imagine it depends on the input.

Yes, I know. With the smaller files it works perfectly.

Vincenzo1968

13-08-2008, 11:23

For now I'm posting this version which, on my old clunker, takes 17.5 seconds:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

//#define FILE1 "dna1.txt"
//#define FILE2 "dna2.txt"

//#define FILE1 "dna1b.txt"
//#define FILE2 "dna2b.txt"

//#define FILE1 "dna1c.txt"
//#define FILE2 "dna2c.txt"

//#define FILE1 "dna1d.txt"
//#define FILE2 "dna2d.txt"

//#define FILE1 "dna1e.txt"
//#define FILE2 "dna2e.txt"

#define FILE1 "dna1f.txt"
#define FILE2 "dna2f.txt"

#define BUFFER_SIZE 4096
#define MAX_STACK 100
#define PREFIX_SIZE 8

#define DIM_HASHTABLE 2097143

typedef struct tagRisultato
{
int pos1;
int pos2;
char p1[1024];
char p2[1024];
int len;
double tempo;
} Risultato;

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagHashTable
{
char key[PREFIX_SIZE + 1];
Lista *pLista;
struct tagHashTable *next;
} HashTable;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

int HashU(char *v, int M);

HashTable* HashTableNewNode(char *Key, int pos)
{
HashTable *n;

n = (HashTable*)malloc(sizeof(HashTable));

if( n == NULL )
return NULL;

strcpy(n->key, Key);
n->pLista = NULL;
n->pLista = ListAppend(n->pLista, pos);
n->next = NULL;

return n;
}

int FindValue(HashTable **pHashTable, char *Key, int M)
{
int index = 0;
HashTable *t;
int a = 31415;
int b = 27183;
char *s = Key;

for(; *s != '\0'; s++)
index = (a*index + *s) % M;
if ( index < 0 )
index *= -1;

t = pHashTable[index];
while ( t != NULL )
{
if ( strcmp(t->key, Key) == 0 )
return index;
t = t->next;
}

return -1;
}

void InsertValue(HashTable **pHashTable, char *Key, int pos, int M)
{
int index = 0;
HashTable *t = NULL;
int a = 31415;
int b = 27183;
char *s = Key;

for(; *s != '\0'; s++)
index = (a*index + *s) % M;
if ( index < 0 )
index *= -1;

t = pHashTable[index];
if ( t == NULL )
{
pHashTable[index] = HashTableNewNode(Key, pos);
return;
}

while ( t != NULL )
{
if ( strcmp(t->key, Key) == 0 )
{
pHashTable[index]->pLista = ListAppend(pHashTable[index]->pLista, pos);
return;
}
if ( t->next == NULL )
{
t->next = HashTableNewNode(Key, pos);
t = t->next;
t->next = NULL;
return;
}
t = t->next;
}
}

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

void HashTableFree(HashTable* first)
{
HashTable *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
ListFree(n1->pLista);
free(n1);
n1 = n2;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
long dimFile;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

/* ftell returns a long; fpos_t is opaque and cannot be portably cast to int */
dimFile = ftell(fp);

fclose(fp);

if ( dimFile < 0 )
return 0;

return (int)dimFile;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

void Trova(char *szNomeFile1, char *szNomeFile2, Risultato *pRisultato)
{
FILE *fp;
char strTempo[512];

HashTable **pHT;
Lista *pNode;

Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;
int bTrovato;
int MaxPrefix;
int index;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x, pos2x;

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

p1_len = LeggiDimensioniFile(szNomeFile1);
p2_len = LeggiDimensioniFile(szNomeFile2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return;
}
p2[0] = '\0';

if ( !LeggiStringa(szNomeFile1, p1, p1_len) )
return;

if ( !LeggiStringa(szNomeFile2, p2, p2_len) )
return;

c_start = clock();

if ( PREFIX_SIZE < p2_len )
MaxPrefix = PREFIX_SIZE;
else
MaxPrefix = p2_len;

if ( p1_len < PREFIX_SIZE )
{
printf("\n\nLa stringa piu' lunga risulta composta da %d caratteri.\nFai prima se ti apri i due file con blocco note e te le cerchi da solo.\nCiao ciao.\n\n", p1_len);

pRisultato->len = 0;
pRisultato->pos1 = 0;
pRisultato->pos2 = 0;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);
pRisultato->tempo = 0;

free(p1);
free(p2);

return;
}

pHT = (HashTable**)malloc(sizeof(HashTable*) * DIM_HASHTABLE);
if ( pHT != NULL )
{
/* Buckets start empty; nodes are allocated by InsertValue. Allocating a node here and then overwriting the pointer with NULL would leak it. */
for ( x = 0; x < DIM_HASHTABLE; x++ )
pHT[x] = NULL;
}
else
{
printf("Memoria non sufficiente.\n");
return;
}

for ( x = MaxPrefix; x > 0; x--)
{
pos1 = 0;
while ( pos1 < p1_len - x + 1)
{
memcpy(strSearch, p1 + pos1, x);
*(strSearch + x) = '\0';

InsertValue(pHT, strSearch, pos1, DIM_HASHTABLE);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;
bTrovato = 0;

while ( pos2 < p2_len - x + 1 )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

index = FindValue(pHT, strSearch, DIM_HASHTABLE);
if ( index >= 0 )
{
pLista = pHT[index]->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
numErrors = 0;
k = x;
post1 = pos1;
post2 = pos2;
if ( pos2 > x && pos1 > x )
{
k = 0;
while ( (*(p2 + pos2) == *(p1 + pos1)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1++;
pos2++;
numErrors = 0;
}
while ( (*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( (k > len_prec) && (*(p2 + pos2) == *(p1 + pos1)) )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

bTrovato = 1;
}
pos2 = post2;
pLista = pLista->next;
}
}
pos2++;
}
if ( bTrovato )
break;
for (k = 0; k < DIM_HASHTABLE; k++ )
{
HashTableFree(pHT[k]); /* release the chain before emptying the bucket */
pHT[k] = NULL;
}
}

pos1 = pos1x;
pos2 = pos2x;
len_prec = len;
numErrors = 0;
while ( (*(p1 + pos1) == *(p2 + pos2)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos2++;
pos1++;
if ( numErrors == 2 )
break;
}
len = 0;
numErrors = 0;
while ( (*(p1 + pos1 + len) == *(p2 + pos2 + len)) || (numErrors < 2) )
{
if ( *(p2 + pos2 + len) != *(p1 + pos1 + len) )
numErrors++;
len++;
}
if ( len_prec < len && (*(p2 + pos2) == *(p1 + pos1)) )
{
pos1x = pos1;
pos2x = pos2;
}
else
{
len = len_prec;
}

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

x = len_prec + 1;

pos1 = 0;
pos2 = pos2x + 1;
strTemp1[0] = '\0';
strSearch[0] = '\0';
len_prec = len;
len = 0;
numErrors = 0;

memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

k = 0;
while ( *(p1 + k) != *(p2 + k) )
k++;
pos1 = k;

while ( pos1 < p1_len - x + 1 )
{
memcpy(strTemp1, p1 + pos1, x);
*(strTemp1 + x) = '\0';

k = 0;
while ( k < x )
{
if ( *(strSearch + k) != *(strTemp1 + k) )
numErrors++;
if ( numErrors > 2 )
break;
k++;
}
if ( k >= x && len_prec > len )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

break;
}

pos1++;
}

c_end = clock();

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

pRisultato->tempo = (double)(c_end - c_start) / CLOCKS_PER_SEC;

fp = fopen("Risultato.txt", "a");
fwrite(strRes, strlen(strRes), 1, fp);
fwrite(strTempo, strlen(strTempo), 1, fp);
fclose(fp);

free(p1);
free(p2);

for ( x = 0; x < DIM_HASHTABLE; x++ )
HashTableFree(pHT[x]); /* frees the whole chain and its position lists, not just the head node */
free(pHT);
}

int main()
{
Risultato ris;
char strTempo[512];
char strRes[1024];

Trova(FILE1, FILE2, &ris);

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
ris.p1,
ris.p2,
ris.len,
ris.pos1,
ris.pos2);

printf("%s", strRes); /* never pass data as the format string */

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", ris.tempo);
printf("%s", strTempo);

return 0;
}

As for the three-second version, I want to run a few more tests before posting it :)

gugoXX

13-08-2008, 15:38

Excellent.
Then explain the algorithm to us as soon as it is stable.

Vincenzo1968

14-08-2008, 07:59

These are my final solutions:

Results for part A:

ACTGTCCTGTCAACAAGGAGT
ACTGTCCTGTCAACAAGGAGT

di lunghezza 21 trovate alle posizioni 10002 e 40002

Tempo impiegato -> 0.11000 secondi

Stringhe

GTATTGTCGCTGTTTCCTCA
GTATTGTCGCTGTTTCCTCA

di lunghezza 20 trovate alle posizioni 237865 e 52364

Tempo impiegato -> 15.89000 secondi

Results for part B:

Stringhe

ACTGTCCTGAAGATCGCTTGGCATCTCCG
ACTGTCCTGCAGATCGCTTTGCATCTCCG

di lunghezza 29 trovate alle posizioni 50002 e 70002

Tempo impiegato -> 0.14000 secondi

Stringhe

AATTCAGACACTACAGTGACATGTACATGT
AATTAAGACACGACAGTGACATGTACATGT

di lunghezza 30 trovate alle posizioni 150002 e 170002

Tempo impiegato -> 17.45400 secondi
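
A quick way to sanity-check the part B pairs above is to count their mismatches against the contest rules (at most ERR errors, no two adjacent, none at either end). The sketch below is only an illustration: `check_pair` is a hypothetical helper, not part of the posted program.

```c
#include <string.h>

/*
 * Returns the number of mismatching positions between s and t if the pair
 * satisfies the contest rules (same length, at most max_err mismatches,
 * no two mismatches adjacent, none at the first or last position);
 * returns -1 otherwise.
 */
int check_pair(const char *s, const char *t, int max_err)
{
    size_t n = strlen(s), i;
    int errors = 0;
    int prev_mismatch = 0;

    if (n == 0 || n != strlen(t))
        return -1;
    for (i = 0; i < n; i++) {
        int mismatch = (s[i] != t[i]);
        if (mismatch) {
            /* an error may not open or close the match, nor follow another error */
            if (i == 0 || i == n - 1 || prev_mismatch)
                return -1;
            if (++errors > max_err)
                return -1;
        }
        prev_mismatch = mismatch;
    }
    return errors;
}
```

Called with either of the two pairs printed above and `max_err` set to 2, it should report two mismatches.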

Here are the sources:

Part A:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* <malloc.h> is non-standard; malloc and free are declared in <stdlib.h> */
#include <time.h>

//#define FILE1 "dna1.txt"
//#define FILE2 "dna2.txt"

//#define FILE1 "dna1b.txt"
//#define FILE2 "dna2b.txt"

//#define FILE1 "dna1c.txt"
//#define FILE2 "dna2c.txt"

//#define FILE1 "dna1d.txt"
//#define FILE2 "dna2d.txt"

//#define FILE1 "dna1e.txt"
//#define FILE2 "dna2e.txt"

#define FILE1 "dna1f.txt"
#define FILE2 "dna2f.txt"

#define PREFIX_SIZE 8

//#define DIM_HASHTABLE 8191
//#define DIM_HASHTABLE 16381
//#define DIM_HASHTABLE 32749
//#define DIM_HASHTABLE 65521
//#define DIM_HASHTABLE 131071
//#define DIM_HASHTABLE 262139
//#define DIM_HASHTABLE 524287
//#define DIM_HASHTABLE 1048573
//#define DIM_HASHTABLE 2097143
//#define DIM_HASHTABLE 4194301
//#define DIM_HASHTABLE 8388593
//#define DIM_HASHTABLE 16777213
//#define DIM_HASHTABLE 33554393
//#define DIM_HASHTABLE 67108859
//#define DIM_HASHTABLE 134217689
//#define DIM_HASHTABLE 268435399
//#define DIM_HASHTABLE 536870909
//#define DIM_HASHTABLE 1073741789

typedef struct tagRisultato
{
int pos1;
int pos2;
char p1[1024];
char p2[1024];
int len;
double tempo;
} Risultato;

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagHashTable
{
char key[PREFIX_SIZE + 1];
Lista *pLista;
struct tagHashTable *next;
} HashTable;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

int HashU(char *v, int M);

HashTable* HashTableNewNode(char *Key, int pos)
{
HashTable *n;

n = (HashTable*)malloc(sizeof(HashTable));

if( n == NULL )
return NULL;

strcpy(n->key, Key);
n->pLista = NULL;
n->pLista = ListAppend(n->pLista, pos);
n->next = NULL;

return n;
}

int FindValue(HashTable **pHashTable, char *Key, int M)
{
int index;
HashTable *t, *prev = NULL;
unsigned long long h = 0; /* wide unsigned accumulator: a*h cannot overflow */
unsigned long long a = 31415;
char *s = Key;

for(; *s != '\0'; s++)
h = (a*h + (unsigned char)*s) % (unsigned long long)M;
index = (int)h;

t = pHashTable[index];
while ( t != NULL )
{
if ( strcmp(t->key, Key) == 0 )
{
/* Trova reads pHashTable[index]->pLista, so move the matching node to the head of its chain */
if ( prev != NULL )
{
prev->next = t->next;
t->next = pHashTable[index];
pHashTable[index] = t;
}
return index;
}
prev = t;
t = t->next;
}

return -1;
}

void InsertValue(HashTable **pHashTable, char *Key, int pos, int M)
{
int index;
HashTable *t = NULL;
unsigned long long h = 0; /* must match the hash computed in FindValue */
unsigned long long a = 31415;
char *s = Key;

for(; *s != '\0'; s++)
h = (a*h + (unsigned char)*s) % (unsigned long long)M;
index = (int)h;

t = pHashTable[index];
if ( t == NULL )
{
pHashTable[index] = HashTableNewNode(Key, pos);
return;
}

while ( t != NULL )
{
if ( strcmp(t->key, Key) == 0 )
{
/* append to the list of the node that owns this key, not to the chain head */
t->pLista = ListAppend(t->pLista, pos);
return;
}
if ( t->next == NULL )
{
t->next = HashTableNewNode(Key, pos);
return;
}
t = t->next;
}
}

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

void HashTableFree(HashTable* first)
{
HashTable *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
ListFree(n1->pLista);
free(n1);
n1 = n2;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
int numread = 0;
int dimFile = 0;
int count = 0;
fpos_t pos;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

/* ftell reports the offset portably; fpos_t need not be an arithmetic type */
dimFile = (int)ftell(fp);

fclose(fp);

return dimFile < 0 ? 0 : dimFile;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int GetHashTableDim(int dimFile)
{
if ( dimFile < 8191 )
return 8191;
else if ( dimFile < 16381 )
return 16381;
else if ( dimFile < 32749 )
return 32749;
else if ( dimFile < 65521 )
return 65521;
else if ( dimFile < 131071 )
return 131071;
else if ( dimFile < 262139 )
return 262139;
else if ( dimFile < 524287 )
return 524287;
else if ( dimFile < 1048573 )
return 1048573;
else if ( dimFile < 2097143 )
return 2097143;
else if ( dimFile < 4194301 )
return 4194301;
else
return 8388593;
}

void Trova(char *szNomeFile1, char *szNomeFile2, Risultato *pRisultato)
{
FILE *fp;
char strTempo[512];

HashTable **pHT;
int dimHT;
Lista *pNode;

Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int bTrovato;
int MaxPrefix;
int index;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x = 0, pos2x = 0; /* defined even if no match is found */

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

p1_len = LeggiDimensioniFile(szNomeFile1);
p2_len = LeggiDimensioniFile(szNomeFile2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return;
}
p2[0] = '\0';

if ( !LeggiStringa(szNomeFile1, p1, p1_len) )
return;

if ( !LeggiStringa(szNomeFile2, p2, p2_len) )
return;

c_start = clock();

dimHT = GetHashTableDim(p1_len);

if ( PREFIX_SIZE < p2_len )
MaxPrefix = PREFIX_SIZE;
else
MaxPrefix = p2_len;

if ( p1_len < PREFIX_SIZE )
{
printf("\n\nLa stringa piu' lunga risulta composta da %d caratteri.\nFai prima se ti apri i due file con blocco note e te le cerchi da solo.\nCiao ciao.\n\n", p1_len);

pRisultato->len = 0;
pRisultato->pos1 = 0;
pRisultato->pos2 = 0;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);
pRisultato->tempo = 0;

free(p1);
free(p2);

return;
}

pHT = (HashTable**)malloc(sizeof(HashTable*) * dimHT);
if ( pHT != NULL )
{
/* buckets start empty; InsertValue allocates nodes on demand */
for ( x = 0; x < dimHT; x++ )
pHT[x] = NULL;
}
else
{
printf("Memoria non sufficiente.\n");
return;
}

for ( x = MaxPrefix; x > 0; x--)
{
pos1 = 0;
while ( pos1 < p1_len - x + 1)
{
memcpy(strSearch, p1 + pos1, x);
*(strSearch + x) = '\0';

InsertValue(pHT, strSearch, pos1, dimHT);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
bTrovato = 0;

while ( pos2 < p2_len - x + 1 )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

index = FindValue(pHT, strSearch, dimHT);
if ( index >= 0 )
{
pLista = pHT[index]->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
k = x;
post1 = pos1;
post2 = pos2;
if ( pos2 > x && pos1 > x )
{
k = 0;
while ( *(p2 + pos2) == *(p1 + pos1) )
{
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1++;
pos2++;
}
/* stop at the terminator so the scan cannot run past the buffers */
while ( *(p1 + pos1 + k) != '\0' && *(p2 + pos2 + k) == *(p1 + pos1 + k) )
k++;
if ( (k > len_prec) && (*(p2 + pos2) == *(p1 + pos1)) )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;
bTrovato = 1;
}
pos2 = post2;
pLista = pLista->next;
}
}
pos2++;
}
if ( bTrovato )
break;
for (k = 0; k < dimHT; k++ )
{
HashTableFree(pHT[k]); /* release the chains before rebuilding with a shorter prefix */
pHT[k] = NULL;
}
}

/*
pos1 = pos1x;
pos2 = pos2x;
len_prec = len;
while ( *(p1 + pos1) == *(p2 + pos2) )
{
pos2++;
pos1++;
}
len = 0;
while ( *(p1 + pos1 + len) == *(p2 + pos2 + len) )
len++;
if ( len_prec < len && (*(p2 + pos2) == *(p1 + pos1)) )
{
pos1x = pos1;
pos2x = pos2;
}
else
{
len = len_prec;
}
*/

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

/*
x = len_prec + 1;

pos1 = 0;
pos2 = pos2x + 1;
strTemp1[0] = '\0';
strSearch[0] = '\0';
len_prec = len;
len = 0;

memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

k = 0;
while ( *(p1 + k) != *(p2 + k) )
k++;
pos1 = k;

while ( pos1 < p1_len - x + 1 )
{
memcpy(strTemp1, p1 + pos1, x);
*(strTemp1 + x) = '\0';

k = 0;
while ( k < x )
k++;
if ( k >= x && len_prec > len )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

break;
}

pos1++;
}
*/

c_end = clock();

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

pRisultato->tempo = (double)(c_end - c_start) / CLOCKS_PER_SEC;

fp = fopen("Risultato.txt", "a");
if ( fp != NULL )
{
fwrite(strRes, strlen(strRes), 1, fp);
fwrite(strTempo, strlen(strTempo), 1, fp);
fclose(fp);
}

free(p1);
free(p2);

for ( x = 0; x < dimHT; x++ )
HashTableFree(pHT[x]); /* frees the whole chain and its position lists */
free(pHT);
}

int main()
{
Risultato ris;
char strTempo[512];
char strRes[1024];

Trova(FILE1, FILE2, &ris);

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
ris.p1,
ris.p2,
ris.len,
ris.pos1,
ris.pos2);

printf("%s", strRes);

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", ris.tempo);
printf("%s", strTempo);

return 0;
}

Part B:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* <malloc.h> is non-standard; malloc and free are declared in <stdlib.h> */
#include <time.h>

//#define FILE1 "dna1.txt"
//#define FILE2 "dna2.txt"

//#define FILE1 "dna1b.txt"
//#define FILE2 "dna2b.txt"

//#define FILE1 "dna1c.txt"
//#define FILE2 "dna2c.txt"

//#define FILE1 "dna1d.txt"
//#define FILE2 "dna2d.txt"

//#define FILE1 "dna1e.txt"
//#define FILE2 "dna2e.txt"

#define FILE1 "dna1f.txt"
#define FILE2 "dna2f.txt"

#define PREFIX_SIZE 8

//#define DIM_HASHTABLE 8191
//#define DIM_HASHTABLE 16381
//#define DIM_HASHTABLE 32749
//#define DIM_HASHTABLE 65521
//#define DIM_HASHTABLE 131071
//#define DIM_HASHTABLE 262139
//#define DIM_HASHTABLE 524287
//#define DIM_HASHTABLE 1048573
//#define DIM_HASHTABLE 2097143
//#define DIM_HASHTABLE 4194301
//#define DIM_HASHTABLE 8388593
//#define DIM_HASHTABLE 16777213
//#define DIM_HASHTABLE 33554393
//#define DIM_HASHTABLE 67108859
//#define DIM_HASHTABLE 134217689
//#define DIM_HASHTABLE 268435399
//#define DIM_HASHTABLE 536870909
//#define DIM_HASHTABLE 1073741789

typedef struct tagRisultato
{
int pos1;
int pos2;
char p1[1024];
char p2[1024];
int len;
double tempo;
} Risultato;

typedef struct tagLista
{
int pos;
struct tagLista* next;
} Lista;

typedef struct tagHashTable
{
char key[PREFIX_SIZE + 1];
Lista *pLista;
struct tagHashTable *next;
} HashTable;

Lista* ListNewNode(int val);
Lista* ListAppend(Lista* first, int val);
void ListFree(Lista* first);

int HashU(char *v, int M);

HashTable* HashTableNewNode(char *Key, int pos)
{
HashTable *n;

n = (HashTable*)malloc(sizeof(HashTable));

if( n == NULL )
return NULL;

strcpy(n->key, Key);
n->pLista = NULL;
n->pLista = ListAppend(n->pLista, pos);
n->next = NULL;

return n;
}

int FindValue(HashTable **pHashTable, char *Key, int M)
{
int index;
HashTable *t, *prev = NULL;
unsigned long long h = 0; /* wide unsigned accumulator: a*h cannot overflow */
unsigned long long a = 31415;
char *s = Key;

for(; *s != '\0'; s++)
h = (a*h + (unsigned char)*s) % (unsigned long long)M;
index = (int)h;

t = pHashTable[index];
while ( t != NULL )
{
if ( strcmp(t->key, Key) == 0 )
{
/* Trova reads pHashTable[index]->pLista, so move the matching node to the head of its chain */
if ( prev != NULL )
{
prev->next = t->next;
t->next = pHashTable[index];
pHashTable[index] = t;
}
return index;
}
prev = t;
t = t->next;
}

return -1;
}

void InsertValue(HashTable **pHashTable, char *Key, int pos, int M)
{
int index;
HashTable *t = NULL;
unsigned long long h = 0; /* must match the hash computed in FindValue */
unsigned long long a = 31415;
char *s = Key;

for(; *s != '\0'; s++)
h = (a*h + (unsigned char)*s) % (unsigned long long)M;
index = (int)h;

t = pHashTable[index];
if ( t == NULL )
{
pHashTable[index] = HashTableNewNode(Key, pos);
return;
}

while ( t != NULL )
{
if ( strcmp(t->key, Key) == 0 )
{
/* append to the list of the node that owns this key, not to the chain head */
t->pLista = ListAppend(t->pLista, pos);
return;
}
if ( t->next == NULL )
{
t->next = HashTableNewNode(Key, pos);
return;
}
t = t->next;
}
}

Lista* ListNewNode(int val)
{
Lista *n;

n = (Lista *)malloc(sizeof(Lista));

if( n == NULL )
return NULL;

n->pos = val;
n->next = NULL;

return n;
}

Lista* ListAppend(Lista* first, int val)
{
Lista *n = first, *nuovo;

if ( first == NULL )
return ListNewNode(val);

n = first;
while( n->next != NULL )
{
n = n->next;
}

nuovo = ListNewNode(val);
n->next = nuovo;

return first;
}

void ListFree(Lista* first)
{
Lista *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
free(n1);
n1 = n2;
}
}

void HashTableFree(HashTable* first)
{
HashTable *n1 = first, *n2;
while ( n1 != NULL )
{
n2 = n1->next;
ListFree(n1->pLista);
free(n1);
n1 = n2;
}
}

int LeggiDimensioniFile(char *szFileName)
{
FILE *fp;
int numread = 0;
int dimFile = 0;
int count = 0;
fpos_t pos;

fp = fopen(szFileName, "rb");

if ( fp == NULL )
return 0;

if ( fseek(fp, 0, SEEK_END) )
{
fclose(fp);
return 0;
}

/* ftell reports the offset portably; fpos_t need not be an arithmetic type */
dimFile = (int)ftell(fp);

fclose(fp);

return dimFile < 0 ? 0 : dimFile;
}

int LeggiStringa(char *szFileName, char *buffer, int dimFile)
{
FILE *fp;

fp = fopen(szFileName, "r");

if ( fp == NULL )
return 0;

if ( fgets(buffer, dimFile+1, fp) == NULL )
{
printf("\nErrore nella lettura del file %s\n", szFileName);
fclose(fp);
return 0;
}
*(buffer + dimFile) = '\0';

fclose(fp);

return dimFile;
}

int GetHashTableDim(int dimFile)
{
if ( dimFile < 8191 )
return 8191;
else if ( dimFile < 16381 )
return 16381;
else if ( dimFile < 32749 )
return 32749;
else if ( dimFile < 65521 )
return 65521;
else if ( dimFile < 131071 )
return 131071;
else if ( dimFile < 262139 )
return 262139;
else if ( dimFile < 524287 )
return 524287;
else if ( dimFile < 1048573 )
return 1048573;
else if ( dimFile < 2097143 )
return 2097143;
else if ( dimFile < 4194301 )
return 4194301;
else
return 8388593;
}

void Trova(char *szNomeFile1, char *szNomeFile2, Risultato *pRisultato)
{
FILE *fp;
char strTempo[512];

HashTable **pHT;
int dimHT;
Lista *pNode;

Lista *pLista;
int p1_len, p2_len;
int k;
int len, len_prec;
int numErrors;
int bTrovato;
int MaxPrefix;
int index;

int x;

char strSearch[1024] = "";

char strRes[1024] = "";
char strTemp1[1024] = "";
char strTemp2[1024] = "";

clock_t c_start, c_end;

int pos1, pos2;
int pos1x = 0, pos2x = 0; /* defined even if no match is found */

int post1, post2;

char *p1 = NULL;
char *p2 = NULL;

p1_len = LeggiDimensioniFile(szNomeFile1);
p2_len = LeggiDimensioniFile(szNomeFile2);

p1 = (char*)malloc(sizeof(char)*p1_len + 1);
if ( !p1 )
{
printf("Errore nell'allocazione della memoria.");
return;
}
p1[0] = '\0';

p2 = (char*)malloc(sizeof(char)*p2_len + 1);
if ( !p2 )
{
printf("Errore nell'allocazione della memoria.");
return;
}
p2[0] = '\0';

if ( !LeggiStringa(szNomeFile1, p1, p1_len) )
return;

if ( !LeggiStringa(szNomeFile2, p2, p2_len) )
return;

c_start = clock();

dimHT = GetHashTableDim(p1_len);

if ( PREFIX_SIZE < p2_len )
MaxPrefix = PREFIX_SIZE;
else
MaxPrefix = p2_len;

if ( p1_len < PREFIX_SIZE )
{
printf("\n\nLa stringa piu' lunga risulta composta da %d caratteri.\nFai prima se ti apri i due file con blocco note e te le cerchi da solo.\nCiao ciao.\n\n", p1_len);

pRisultato->len = 0;
pRisultato->pos1 = 0;
pRisultato->pos2 = 0;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);
pRisultato->tempo = 0;

free(p1);
free(p2);

return;
}

pHT = (HashTable**)malloc(sizeof(HashTable*) * dimHT);
if ( pHT != NULL )
{
/* buckets start empty; InsertValue allocates nodes on demand */
for ( x = 0; x < dimHT; x++ )
pHT[x] = NULL;
}
else
{
printf("Memoria non sufficiente.\n");
return;
}

for ( x = MaxPrefix; x > 0; x--)
{
pos1 = 0;
while ( pos1 < p1_len - x + 1)
{
memcpy(strSearch, p1 + pos1, x);
*(strSearch + x) = '\0';

InsertValue(pHT, strSearch, pos1, dimHT);

pos1++;
}

pos1 = 0;
pos2 = 0;
strTemp1[0] = '\0';
strTemp2[0] = '\0';
strSearch[0] = '\0';
pLista = NULL;
pNode = NULL;
len = 0;
len_prec = 0;
numErrors = 0;
bTrovato = 0;

while ( pos2 < p2_len - x + 1 )
{
memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

index = FindValue(pHT, strSearch, dimHT);
if ( index >= 0 )
{
pLista = pHT[index]->pLista;

while ( pLista != NULL )
{
pos1 = pLista->pos;
numErrors = 0;
k = x;
post1 = pos1;
post2 = pos2;
if ( pos2 > x && pos1 > x )
{
k = 0;
while ( (*(p2 + pos2) == *(p1 + pos1)) || (numErrors < 2) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos1--;
pos2--;
if ( pos2 < 0 || pos1 < 0 )
break;
}
pos1++;
pos2++;
numErrors = 0;
}
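/* stop at either terminator so the scan cannot run past the buffers */
while ( *(p1 + pos1 + k) != '\0' && *(p2 + pos2 + k) != '\0' && ((*(p2 + pos2 + k) == *(p1 + pos1 + k)) || (numErrors < 2)) )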
{
if ( *(p2 + pos2 + k) != *(p1 + pos1 + k) )
numErrors++;
k++;
}
if ( (k > len_prec) && (*(p2 + pos2) == *(p1 + pos1)) )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

bTrovato = 1;
}
pos2 = post2;
pLista = pLista->next;
}
}
pos2++;
}
if ( bTrovato )
break;
for (k = 0; k < dimHT; k++ )
{
HashTableFree(pHT[k]); /* release the chains before rebuilding with a shorter prefix */
pHT[k] = NULL;
}
}

pos1 = pos1x;
pos2 = pos2x;
len_prec = len;
numErrors = 0;
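/* both scans stop at either terminator so they cannot run past the buffers */
while ( *(p1 + pos1) != '\0' && *(p2 + pos2) != '\0' && ((*(p1 + pos1) == *(p2 + pos2)) || (numErrors < 2)) )
{
if ( *(p2 + pos2) != *(p1 + pos1) )
numErrors++;
pos2++;
pos1++;
if ( numErrors == 2 )
break;
}
len = 0;
numErrors = 0;
while ( *(p1 + pos1 + len) != '\0' && *(p2 + pos2 + len) != '\0' && ((*(p1 + pos1 + len) == *(p2 + pos2 + len)) || (numErrors < 2)) )
{
if ( *(p2 + pos2 + len) != *(p1 + pos1 + len) )
numErrors++;
len++;
}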
if ( len_prec < len && (*(p2 + pos2) == *(p1 + pos1)) )
{
pos1x = pos1;
pos2x = pos2;
}
else
{
len = len_prec;
}

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

x = len_prec + 1;

pos1 = 0;
pos2 = pos2x + 1;
strTemp1[0] = '\0';
strSearch[0] = '\0';
len_prec = len;
len = 0;
numErrors = 0;

memcpy(strSearch, p2 + pos2, x);
*(strSearch + x) = '\0';

k = 0;
while ( *(p1 + k) != *(p2 + k) )
k++;
pos1 = k;

while ( pos1 < p1_len - x + 1 )
{
memcpy(strTemp1, p1 + pos1, x);
*(strTemp1 + x) = '\0';

k = 0;
numErrors = 0; /* count mismatches per window, not cumulatively across windows */
while ( k < x )
{
if ( *(strSearch + k) != *(strTemp1 + k) )
numErrors++;
if ( numErrors > 2 )
break;
k++;
}
if ( k >= x && len_prec > len )
{
len = k;
pos1x = pos1;
pos2x = pos2;
len_prec = len;

memcpy(strTemp1, p1 + pos1x, len);
*(strTemp1 + len) = '\0';

memcpy(strTemp2, p2 + pos2x, len);
*(strTemp2 + len) = '\0';

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
strTemp1,
strTemp2,
len,
pos1x,
pos2x);

pRisultato->len = len;
pRisultato->pos1 = pos1x;
pRisultato->pos2 = pos2x;
strcpy(pRisultato->p1, strTemp1);
strcpy(pRisultato->p2, strTemp2);

break;
}

pos1++;
}

c_end = clock();

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", (double)(c_end - c_start) / CLOCKS_PER_SEC);

pRisultato->tempo = (double)(c_end - c_start) / CLOCKS_PER_SEC;

fp = fopen("Risultato.txt", "a");
if ( fp != NULL )
{
fwrite(strRes, strlen(strRes), 1, fp);
fwrite(strTempo, strlen(strTempo), 1, fp);
fclose(fp);
}

free(p1);
free(p2);

for ( x = 0; x < dimHT; x++ )
HashTableFree(pHT[x]); /* frees the whole chain and its position lists */
free(pHT);
}

int main()
{
Risultato ris;
char strTempo[512];
char strRes[1024];

Trova(FILE1, FILE2, &ris);

sprintf(strRes,
"\nStringhe\n\n%s\n%s\n\ndi lunghezza %d trovate alle posizioni %d e %d\n",
ris.p1,
ris.p2,
ris.len,
ris.pos1,
ris.pos2);

printf("%s", strRes);

sprintf(strTempo, "\nTempo impiegato -> %5.5f secondi\n", ris.tempo);
printf("%s", strTempo);

return 0;
}

Unfortunately I had to give up on the three-second solution: across several tests it turned out to be very fast in a few cases but slow in the average case.

Vincenzo1968

14-08-2008, 08:59

I found a way to compare the two versions, C# and C, on this system:

http://www.guidealgoritmi.it/images/ImgForums/ris_gugo3.jpg

and these are the results:

C#:
http://www.guidealgoritmi.it/images/ImgForums/dna_gugo3.jpg

C:
http://www.guidealgoritmi.it/images/ImgForums/ris_mio3.jpg

Vincenzo1968

14-08-2008, 18:50

...
Explain the algorithm to us once it's stable.

1) Read the longer string, p1, and insert its substrings of length PREFIX_SIZE into the hash table.
2) Read the shorter string, p2, and look up each of its substrings of length PREFIX_SIZE in the hash table.
3) If the lookup succeeds, compare the two strings, p1 and p2, character by character. The position in p2 is the current one; the position in p1 is read from the hash table.
4) The comparison runs both forwards and backwards (so each string stored in the hash table can be treated as either a prefix or a suffix).
5) If no match is found, decrease the prefix/suffix length by one and start again from step 1.
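
The five steps above can be sketched roughly as follows. This is not the posted code: it assumes upper-case A/C/G/T input, packs each seed into 2 bits per character, and uses a directly addressed table (the `head`/`next` arrays) in place of the chained hash table. The names `seed_and_extend` and `longest_common` are made up for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Map a nucleotide to 2 bits; returns -1 for anything else. */
static int code(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Pack the k characters at s into an integer key, or -1 if any is not ACGT. */
static long kmer_key(const char *s, int k)
{
    long key = 0;
    int i;
    for (i = 0; i < k; i++) {
        int c = code(s[i]);
        if (c < 0)
            return -1;
        key = key * 4 + c;
    }
    return key;
}

/*
 * Longest common substring of a and b among those at least k long
 * (1 <= k <= 12 in this sketch). Returns its length and stores the start
 * offsets in *pa and *pb; returns 0 if the strings share no seed of length k.
 */
int seed_and_extend(const char *a, const char *b, int k, int *pa, int *pb)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    long buckets;
    int *head, *next;
    int i, j, best = 0;

    if (k < 1 || k > 12 || la < k || lb < k)
        return 0;
    buckets = 1L << (2 * k);
    head = malloc(sizeof *head * buckets);
    next = malloc(sizeof *next * la);
    if (!head || !next) { free(head); free(next); return 0; }
    for (i = 0; i < buckets; i++)
        head[i] = -1;

    /* Step 1: index every seed of a; next[] chains the positions of equal seeds. */
    for (i = la - k; i >= 0; i--) {
        long key = kmer_key(a + i, k);
        if (key < 0) { next[i] = -1; continue; }
        next[i] = head[key];
        head[key] = i;
    }
    /* Steps 2-4: look up every seed of b and extend each hit in both directions. */
    for (j = 0; j + k <= lb; j++) {
        long key = kmer_key(b + j, k);
        if (key < 0)
            continue;
        for (i = head[key]; i >= 0; i = next[i]) {
            int s = 0, e = k;
            while (i - s > 0 && j - s > 0 && a[i - s - 1] == b[j - s - 1])
                s++;
            while (a[i + e] != '\0' && a[i + e] == b[j + e])
                e++;
            if (s + e > best) { best = s + e; *pa = i - s; *pb = j - s; }
        }
    }
    free(head);
    free(next);
    return best;
}

/* Step 5: if no seed of length k is shared, retry with a shorter one. */
int longest_common(const char *a, const char *b, int k, int *pa, int *pb)
{
    int len = 0;
    for (; k > 0 && len == 0; k--)
        len = seed_and_extend(a, b, k, pa, pb);
    return len;
}
```

Because every hit is extended in both directions, any common run of at least k characters is reached through one of its seeds; the outer loop shortens the seed only when nothing of length k is shared.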

cdimauro

14-08-2008, 22:26

Housekeeping note.
How many cabasisi of memory does your program use? :)
http://www.vigata.org/dizionario/camilleri_linguaggio.html
:cool:

P.S. Around Syracuse (and perhaps Catania) it's spelled with two b's. :D

banryu79

21-08-2008, 10:36

Hi, I know the contest is dead by now, but I found this interesting article by R. Sedgewick on a particular kind of red-black tree: left-leaning red-black trees (http://www.cs.princeton.edu/~rs/talks/LLRB/RedBlack.pdf).
I thought it might interest you (those who haven't already read it, of course) :)

gugoXX

21-08-2008, 11:07

Hi, I know the contest is dead by now, but I found this interesting article by R. Sedgewick on a particular kind of red-black tree: left-leaning red-black trees (http://www.cs.princeton.edu/~rs/talks/LLRB/RedBlack.pdf).
I thought it might interest you (those who haven't already read it, of course) :)

It's a really well-made document.

Vincenzo1968

30-11-2008, 15:36

up

:bimbo: