PDA

View Full Version : Find duplicates


Ninja40
11-16-2003, 05:53 AM
Ok, here is a new one.
Hope you'll find it interesting.

Here it goes :

1. Create an array of 10,000,000 digits chosen at random. In order for everyone to have the same sequence of digits, the pseudo-random generator will be defined as the following ( a 32-bit congruential generator by Knuth):

r = (1664525 * r + 1013904223) modulo 2^32


Each integer will be made of the last digit of each r.
The sequence will start at 7.
In order to be certain that everyone has the same sequence, it would be good to print the first 10 digits and the last 10 digits of the array.

2. Find all the occurences of all the repeats of 5, 6 and 7 successive integers in the random array, and print each of these patterns with the number of occurences found.

Example of output :

5-digit patterns
54871 found 8 times
47234 found 12 times
97862 found 2 times
11097 found 16 imes
....

6-digit patterns
658442 found 3 times
967456 found 2 times
...

Only part 2 will be timed.

Variants of this problem are :
a) find the number of sequences of 5 successive equal digits like 44444 or 88888.
b) find all the sequences of digits in in/decreasing order like (inc) 4567 or 789012 or (dec) 7654 or 210987.
Subsidiary question : find the longest sequence of digits which is repeated at least once.

Have fun ! :P

Ninja40
11-16-2003, 06:07 AM
Of course, for each submission, please give some indications of the hardware used (CPU, speed, RAM).

Ninja40
11-18-2003, 02:56 AM
Nobody interested ??

Yeah I know, it's hard. It's a challenge, isn't it ?

For this particular problem, where you don't need to tell where the copies of each duplicate re, I think there is a very simple method that runs in something like 2n + n log n time for a given length of patterns.

I haven't implemented it yet, though.

Ninja40
11-19-2003, 03:19 PM
Actually, for patterns smaller or equal to 7 digits (depending on available memory), there is a deceptively simple method that is O(n). Will you find it ? :plot:

DNAunion2000
11-19-2003, 10:03 PM
I could solve this easily if I could use VFP tables/cursors instead of arrays.

For the first part, I would just store the values in a table.

The second part is the key ("Only part 2 will be timed"). And it would be trivial for VFP. I'd simply issue a single SQL query such as the following:


SELECT digits, COUNT(*) AS occurs ;
FROM myTable ;
GROUP BY digits ;
HAVING COUNT(*) > 1 ;
ORDER BY occurs ;
INTO TABLE duplicates

Ninja40
11-20-2003, 03:25 AM
Aaaah. Finally an entry !;)

Well, I guess you can use a table instead of an array if you want. The way you store the digits isn't important to the problem, if it takes a reasonable amount of memory (by reasonable, I mean O(n)) or O(n log n), not O(n^2) or more.

Ok, I'm not sure what your program does, though, especially the first line. Can you explain your algorithm ?

Did you try timing it ?


ps : as for my subsidiary question, for now, I haven't got a clue...:confused:

DNAunion2000
11-21-2003, 12:16 AM
Well, I guess you can use a table instead of an array if you want. The way you store the digits isn't important to the problem, if it takes a reasonable amount of memory (by reasonable, I mean O(n)) or O(n log n), not O(n^2) or more.

The time and space needed to store digits would scale linearly, O(n), like an array. The difference (pertaining to the discussion) is that you can’t run an SQL command against an array, but you can against a table.

Ok, I'm not sure what your program does, though, especially the first line. Can you explain your algorithm ?

What I posted isn’t really an algorithm: it’s a single SQL command (SQL being Structured Query Language, an ANSI standard language for working with relational databases). I tell SQL what I want, not how to do it: SQL (well, the engine implementing SQL) figures out how best to do that: if anything formulates an algorithm, it would be the relational engine.

My plan would be to have a table with just one column: digits. A simple SQL command to create that would be as follows:


CREATE TABLE myTable ( ;
digits N(10))


I would then write an algorithm to generate random numbers and insert them into that table, along the lines of:


FOR lnLcv = 1 TO lnMax
INSERT INTO mytable (digits) VALUES (myFunction())
END FOR


That is phase one, which is not timed and apparently not that important to the challenge. The next part is the determining of duplicates…that’s where the single SQL command I originally posted would come in:


SELECT digits, COUNT(*) AS occurs ;
FROM myTable ;
GROUP BY digits ;
HAVING COUNT(*) > 1 ;
ORDER BY occurs ;
INTO TABLE duplicates


That tells the data engine that I want it to return the values stored in the digits column of myTable, grouping those values by their value and counting how members there are in each group, returning only groups that have more than one member, and to store the results in a new table – created on the fly – called duplicates. It then figures out how to do all of that.

Did you try timing it ?

No. SQL uses an optimizer which has to parse the query, validate it, analyze it, and determine the most efficient manner of then executing it, and only then can it start to execute the plan it formulated, which is generalizable instead of a custom fit. All of that overhead would surely clobber my time compared to someone who wrote a custom function tailored specifically for the task.

But, the time spent coding would be MUCH faster using SQL: just a single command that took me 10 seconds to figure out.

Ninja40
11-23-2003, 05:31 PM
Thanks, DNA. I was asking, because I am not sure SQL algorithms are able to tackle this kind of task efficiently.

BTW, this problem may have some connection with genome (DNA) sequencing, where one's goal is to find repeated sequences of acids (A, G, T, C).
Another use could be to find any repeated text in a large text (copy-pasted code in source code, for example).

Ok, since noone else has given an answer, I will give mines, which are very simple.
The general algorithm is this :
Let's call A the array of 10,000,000 random digits.

1. move a m-digit window over the array A by shifting it one digit at a timle. Collect each number formed in a new array B. For a n-digits array, there are n-m numbers. One can delete A progressively to save memory. This is O(n).
2. Sort the array B in increasing order using Quicksort. This is O(n log n).
3. Scan the sorted B and count the number of occurences of each collected number in a new array C. Discard the numbers that appear 0 or 1 time only. What is left is the array of repeated m-digits numbers with their number of occurences. This is O(n).

It is possible to keep in memory the index of each occurence in A of each number in an array step 1. The number itself belongs to a structure that has a pointer that points to this array, so that after the sort, the pointer still points to the array of indexes.

The final algorithm is therefore running in something like O(n)+O(n log n).

But wait, for small windows (say 6 or 7 digits), there is even better !

1. Create an array of size 10^m initialized to 0 (we shall call this array B). Complexity O(n)
2. Slide the m-digits window through array A, moving it one digit at a time. Let's call z the number formed by the 6 digits at one time. Increment B[z] by 1.
After A has been completely scanned, B[z] will contain the number of occurences of each number z. This is O(n).
3. Scan one third time to discard all occurences less than 2. O(n)

So for m small enough that 10^m fits in memory, this algorithm is very fast, O(n), and it is very simple to program. (I haven't done it 'cuz I'm lazy. I let you guys do it in whatever language you want. :P)

Now, if someone can find a solution to my subsidiary question with an algorithm that is better than O(n^2 log n)...:sick:

Ninja40
05-03-2004, 02:40 AM
Well, a better solution, which is a practical equivalent of solution 2 above, is to use a hash_map<string, int>.
Since with efficient implementations, the hash_map lookup is O(1), the solution is O(n).

kryptech.net
05-03-2004, 11:39 AM
man a thread that still has a post by DNA in it. ?!

Whiteknight
05-04-2004, 02:02 PM
what, did DNA leave?

kryptech.net
05-04-2004, 04:32 PM
he was banned...

Whiteknight
05-05-2004, 08:51 PM
banned? he was a little belligerant, thats true, but i would hardly ban him. i mean, he did add a little spice around here.

kryptech.net
05-06-2004, 10:47 AM
get in a fight with a mod about this one, I dont belive in that kind of stuff either.

Ninja40
05-10-2004, 06:12 PM
Here is a (not too efficient) implementation :

#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
#include <hash_map> // or hash_map.h, or ext/hash_map depending on compiler

using namespace std;

unsigned int N;
const unsigned long SEED = 17UL;


inline void alea(unsigned long &rand){
static unsigned long r = SEED;
rand = r = 1664525UL * r + 1013904223UL;
}


int remplit_v(vector<char> &v, const int N)
{
// For a vector of 5000000 figures.
// 40 first : 1042235241197381276230666142972883492185
// 40 last : 1675913308512872428215612116421983834693

unsigned long r;
char buf[20] = "";
vector<char> tmp;

v.reserve(N);

for (int i = 0; i < N; i+= 5)
{
do alea(r); while (r < 10000UL);
sprintf(buf, "%u", r);
tmp.assign(buf, &buf[5]);
copy(tmp.begin(), tmp.end(), back_inserter(v));
}

int num = 40;
cout << num << " first digits : ";
for (int k = 0; k < num; ++k) cout << v[k]; cout << endl;
cout << num << " last digits : ";
for (int j = N-num; j < N; ++j) cout << v[j]; cout << endl;

return v.size();
}


class FNV_hash{
public :
inline unsigned operator()(const string& x) const
{
register unsigned hash = 2166136261u;
const string::size_type x_size(x.size());

for(register string::size_type i = 0; i < x_size; hash = (hash * 16777619u)^x[i++]);

return hash;
}
};

typedef hash_map<const string, vector<int>, FNV_hash> HachageChaine;

int hache(HachageChaine &table_H, const vector<char> &vect, const int n)
{
string str;
HachageChaine::iterator it;
str.reserve(100);
table_H.resize(N-n+1);

for (unsigned int i = 0; i < N-n; ++i)
{
for(unsigned int j = i; j <= i+n; ++j) str += vect[j]; //grossly repetitive

it = table_H.find(str);
if(it == table_H.end())
{
vector<int> v;
v.push_back(i);
table_H.insert(hash_map<string, vector<int>, FNV_hash>::value_type(str, v));
}
else
{
it->second.push_back(i);
}

str.erase();
}
return table_H.size();
}



void affiche_doublons_h(const vector<char>& vect , const int lmin, const int lmax)
{
HachageChaine table;
typedef HachageChaine::iterator H_iter;

for (int l = lmin; l <= lmax; ++l)
{
cout << "\n " << l << " digits sequences:\n";
cout << "Hash : " << N - hache(table, vect, l) - l
<< " sequences found\n";

for(H_iter it = table.begin(); it != table.end(); ++it)
{
string clef = it->first;
int card = it->second.size();
if(card > 1)
{
cout << clef << " has " << card << " occurences[ ";
copy(it->second.begin(), it->second.end(),
ostream_iterator<int>(cout, " "));
cout << "]" << endl;
}
}
table.clear();
}
}

int main(int argc, char* argv[])
{
vector<char> vect;
int lmin, lmax;


if (argc != 4 ) {
cout << "Usage : doubles N lmin lmax" << endl;
return -1;
}


N = atoi(argv[1]);
lmin = atoi(argv[2]);
lmax = atoi(argv[3]);

cout << "Generate a " << N << " digits vector." << endl;
remplit_v(vect, N);
affiche_doublons_h(vect, lmin, lmax);

return 0;
}




It could be optimized but the idea is here. It runs in 6mn30s on a P4 @ 2.4GHz for 5000000 digits with number lengths from 7 to 15. It uses a whopping 500Mb to run, though. Actually, the largest numbers in this sequence are 12-digit long.

The best optimization I can think of, besides using char* instead of strings, is to do the search at iteration N+1 starting from iteration N. The algorithm is a bit more complex.

Ninja40
05-10-2004, 06:19 PM
Sorry for the french language used in the coding.

Ohara
05-31-2004, 06:25 AM
Language: Python 2.3
Chip: Celeron 1.3Ghz
OS: NT4
RAM: 256M

Time taken for 5million elements
* without printing: 62 seconds
* with printing: not measured, but expected to be about 16 minutes
[Printing python strings is slow slow slow. Given that what is being tested is how long it takes to count duplicates, and not I/O, I think this small omission is somewhat justified ;-)]

This uses a few tricks:
* Python dict access is O(1)
* A sequence of digits can be treated as a string
* Lists can't be used as dict keys, so a conversion to string has to happen somewhere, and the cheapest place to do it is converting the list to one big long string
* Python slices are quite fast
* adding spaces to the end is necessary for the slicing to not require any extraneous code
* This version is fully generalised - you could look for all sequences of 17 digits if you had the memory... If you wanted to shave 20% off the time, you could unroll the loops, and manually create and access dictionaries for each length...


import time

def run(minlen, maxlen):
t0 = time.time()

rarr = []
m = 2**32
rand = 17
for i in range(5000000):
rand = (1664525 * rand + 1013904223)%m
rarr.append(str(rand)[-1])
print ">>", time.time()-t0

t1 = time.time()
rarr="".join(rarr) + " "*(maxlen-minlen)

dicts = [0 for i in range(minlen)] + [{} for i in range(maxlen + 1 - minlen)]

r = range(minlen,maxlen+1)
for n in xrange(len(rarr)-maxlen):
for i in r:
curr = rarr[n:n+i]
dicts[i][curr] = dicts[i].get(curr,0)+1

for i in r:
try:
del dicts[i][" "*i]
except KeyError:
pass

del rarr

print ">>", time.time()-t1

t2 = time.time()

for d in dicts[minlen:]:
for key, val in d.items():
if val == 1:
del d[key]

print ">>", time.time()-t2

## for i in range(minlen, maxlen+1):
## print i, "digit patterns:"
## print "\n".join(["%s found %s times"%(k, v) for k,v in dicts[i].items()])

print ">>", time.time()-t1


if __name__ == "__main__":
run(5,7)

Ninja40
06-02-2004, 05:43 AM
Nice !

Here is a new implementation given by another french coder, on http://forums.hardware.fr, based on red-black trees. This one is very fast, about 35 seconds on a xeon@2,4GHz.


#include <iostream.h>
#include <cstdlib>

unsigned int N = 5000000;
const unsigned long SEED = 17L;

class Position **PositionsRedondantesG;
int NbPositionsG;

template <class T> class node
{
private:
node *fl, *fr; //left and right branches
int dq;//desequilibre du node
int Nb;//count the number of occurence of the added element
T Info;

public:
T &RInfo(){return Info;}
node<T> *Add(T &Info, node *&Link, int &L);
//node<T> *Add(node<T> &Info, node *&Link, int &L);
void RotationRight(node *&N);
void RotationLeft(node *&N);
void RotationLeftRight(node *&N);
void RotationRightLeft(node *&N);
void Reequilibrate(node *&Link, int des);
void Empty();


node(T &Info)
{
fl = fr = NULL;
node::Info = Info;
Info.SetParent(this);
Nb = 1;
dq = 0;
};
};

template <class T> class tree
{
private:
int Haut;
int NbElmt;
public:
node<T> *Root;
node<T> *Add(T &Info);
tree();
~tree();
};


class Position
{
private:
unsigned long long n;
int Nb;
unsigned int Positions[10];
void *Contenant;

public:
void SetPos(int Pos){Positions[0] = Pos;};
void SetN(unsigned long long N){n = N;};
unsigned long long GetN(){return n;};
int GetNb(){return Nb;};
int GetPos(int i){return Positions[i];};
void SetParent(void *C) {Contenant = C;};
void *GetParent(){return Contenant;};

Position() : Nb(0), n(0) {}

void Display()
{
cout << "Nb "<<n<<" -> "<< Nb <<" times ";
for (int i = 0; i < Nb; i++)
cout <<"["<<Positions[i]<<"] ";
cout <<endl;
}

inline Position &operator = (Position &p)
{
Positions[Nb++] = p.Positions[0];
n = p.n;
}

bool operator == (Position &p)
{
if (p.n == n)
{
if (Nb == 1) PositionsRedondantesG[NbPositionsG++] = this;
Positions[Nb++] = p.Positions[0];
return true;
}
return false;
}

inline bool operator < (Position &p){ return (n < p.n);}

inline bool operator > (Position &p){ return (n > p.n);}
};

template <class T> void node<T>::Empty()
{
if (fr) fr->Empty();
if (fl) fl->Empty();
if (Nb <= 1) delete this;
else Info.Display();
}

template <class T> void node<T>::Reequilibrate(node *&Link, int des)
{
node *FL=NULL, *FR = NULL;
bool DoubleRotation = false;

if (des > 0)
{
if (fl->dq >= 0)
{
RotationRight(Link);
if (Link->dq == 1)
Link->fr->dq = Link->dq = 0;
else
Link->dq = -1, Link->fr->dq = 1;
}
else
{
RotationLeftRight(Link);
DoubleRotation = true;
FL = Link->fl;
FR = Link->fr;
}
}
else
{
if (fr->dq <= 0)
{
RotationLeft(Link);
if (Link->dq == -1)
Link->fl->dq = Link->dq = 0;
else
Link->fl->dq = -1, Link->dq = 1;
}
else
{
RotationRightLeft(Link);
DoubleRotation = true;
FL = Link->fr;
FR = Link->fl;
}
}
if (DoubleRotation)
{
switch(Link->dq)
{
case 1: FL->dq = 0, FR->dq = -1;
break;
case -1:FL->dq = 1, FR->dq = 0;
break;
case 0: FL->dq = FR->dq = 0;
}
Link->dq = 0;
}
}

template <class T> node<T> *node<T>::Add(T &Info, node *&Link, int &L)
{
node<T> *R;

if (Info > node::Info)
{
if (fr) R = fr->Add(Info, fr, L);
else R = fr = new node(Info), L = 1;
if (L) dq--;
if (dq >= 0) L = 0;
if (dq == -2) Reequilibrate(Link, dq), L=0;
}
else if (Info < node::Info)
{
if (fl) R = fl->Add(Info, fl, L);
else R = fl = new node(Info), L = 1;
if (L) dq++;
if (dq <= 0) L = 0;
if (dq == 2) Reequilibrate(Link, dq), L=0;
}
else
{
node::Info==Info;
Nb++;
return NULL;
}

return R;
}


template <class T> void node<T>::RotationRight(node *&Q)
{
node *P = Q->fl;
Q->fl = P->fr;
P->fr = Q;
Q = P;
}

template <class T> void node<T>::RotationLeft(node *&P)
{
node *Q = P->fr;
P->fr = Q->fl;
Q->fl = P;
P = Q;
}

template <class T> void node<T>::RotationLeftRight(node *&R)
{
RotationLeft(R->fl);
RotationRight(R);
}

template <class T> void node<T>::RotationRightLeft(node *&R)
{
RotationRight(R->fr);
RotationLeft(R);
}

template <class T> node<T> *tree<T>::Add(T &Info)
{
int L = 0;
node<T> *N;

if (Root == NULL)
{
Root = new node<T>(Info);
NbElmt++;
return Root;
}
else
{
N = Root->Add(Info, Root, L);
if (N) NbElmt++;
return N;
}
}


template <class T> tree<T>::tree()
{
NbElmt = 0;
Root = NULL;
Haut = 0;
}

template <class T> tree<T>::~tree()
{
if (Root) Root->Empty();
}



inline void alea(unsigned long &rand){
static unsigned long r = SEED;
rand = r = 1664525UL * r + 1013904223UL;
}

int remplit_t(char *Buffer, const int N)
{

unsigned long r;
char buf[20] = "";

for (int i = 0; i < N; i+= 5)
{
do alea(r); while (r < 10000UL);
sprintf(&Buffer[i], "%u", r);
}
}

int lmin = 7, lmax = 15;



int main(int argc, char *argv[])
{
char *Buffer;
unsigned long long f;
char Tmp;
unsigned int i, j, k, t, v;
int Nb;
int PosTmp;
int NbPositionsTmp;
Position ** PositionsTmp;

PositionsRedondantesG = new Position *[N];
NbPositionsG = 0;


Buffer = new char[N];

remplit_t(Buffer, N);

tree<Position> *Tree = new tree<Position>;
Position Temp;

j = 7;

for (i = 0; i < N-j; i++)
{
Tmp = Buffer[i+j];
Buffer[i+j] = 0;
f = atoll(&Buffer[i]);
Buffer[i+j] = Tmp;
Temp.SetN(f);
Temp.SetPos(i);
Tree->Add(Temp);
}

for (t = 8; t <= 15; t++)
{
cout <<"number of "<<t-1<<" figures --------------" <<endl;
delete Tree;
Tree = new tree<Position>;

NbPositionsTmp = NbPositionsG;
PositionsTmp = PositionsRedondantesG;

if (NbPositionsTmp)
{
PositionsRedondantesG = new Position *[NbPositionsTmp];
NbPositionsG = 0;

for (i = 0; i < NbPositionsTmp; i++)
{
Nb = PositionsTmp[i]->GetNb();

for (j = 0; j < Nb; j++)
{
PosTmp = PositionsTmp[i]->GetPos(j);

if (N - PosTmp > t)
{
Tmp = Buffer[PosTmp+t];
Buffer[PosTmp+t] = 0;
f = atoll(&Buffer[PosTmp]);
Buffer[PosTmp+t] = Tmp;
Temp.SetN(f);
Temp.SetPos(PosTmp);
Tree->Add(Temp);
}
}
}
for (i = 0; i < NbPositionsTmp; i++)
delete (node<Position> *) PositionsTmp[i]->GetParent();
delete PositionsTmp;
}
}
cout <<"number of "<<t-1<<" figures --------------" <<endl;
delete Tree;
if (NbPositionsTmp)
{
for (i = 0; i < NbPositionsTmp; i++)
delete (node<Position> *) PositionsTmp[i]->GetParent();
delete PositionsTmp;
}

delete [] Buffer;

return EXIT_SUCCESS;
}