View Full Version : Dynamically allocated 2 dimensional array
recluse
10-30-2002, 05:56 PM
Is it possible to possible to use a dynamically allocated 2 dimensional array in C++. I'm having trouble finding info on how to do it. Or maybe I'm just using the wrong search terms.
Strike
10-30-2002, 06:11 PM
Sure, just use a vector of vectors. STL is gooooooood.
jamessan
10-30-2002, 09:13 PM
Or you can use the new operator and memset(). If you want to see an example, here's a link to a program I wrote that needed to do that. http://coderforums.net/showthread.php?s=&postid=10865#post10865
Pointy
11-14-2002, 04:07 PM
//char example
char** hello_jpg;
hello_jpg = new char*[x_dim];
for(int n=0;n<x_dim;n++)
{
hello_jpg[n] = new char[ y_dim ];
}
I'd use vector of vectors though stl is good
skidooer
11-14-2002, 11:13 PM
You can also use:
char *array = new char[x * y]; // equivalent to array[x][y]
char v = array[(x * i) + j)]; // equivalent to array[i][j]
DrPizza
11-18-2002, 09:33 AM
No. C++ doesn't have arrays with more than one dimension.
It does have arrays of array and arrays of pointer, which can both be used to fabricate structures roughly equivalent to multidimensional arrays.
Arrays of array are limited as all but one of the dimensions must be fixed at compile-time. They are, however, efficient. A one-dimensional array takes a multiply, an add, and a dereference to access an element; a two dimensional array takes two multiplies, two adds, and a dereference, and so on, and so forth. They also require only a single allocation to be made.
Arrays of pointer are expensive; they require multiple allocations (probably one per dimension, minimum); they require multiple dereferences.
The latter problem is experienced by vectors of vectors -- the use of vectors eliminates the possibility to perform certain useful efficiency optimizations, which makes vectors of vectors slow.
Better, in general, is to create a one-dimensional array and wrap it in a class with an overloaded operator() to access elements.
sedarious
11-18-2002, 11:23 AM
What benifits/loses (if any) are gained by coding a 2 dim. array in a single dimension?
char array1[x][y];
char array2[x*y];
The main benifit I see here is that array1 is easier to handle codewise (IMO). I would guess that the compiled code would be very similar for both cases as your code progessed further. Anyone know any specifics?
DrPizza
11-19-2002, 03:28 AM
Originally posted by sedarious
What benifits/loses (if any) are gained by coding a 2 dim. array in a single dimension?
When the array's size is fixed at compile time, none. Element access ought to be identical; the single dimension will require one to manually determine the offset, the array of array will do it automagically.
The difference is when the size is determined at runtime. You can't do something like:
int i, j;
// get i and j
int** myArrayOfArray = new int[i][j];
Arrays of pointers can sort of simulate it, but then you need a separate dereference for every dimension.
A single array allows one to use only a single dereference.
sedarious
11-19-2002, 09:03 AM
But a single array also requires an i,j multiplication - which isn't an speedy operation. You may have to wait a couple clocks for a reference return from memory or cache, but it takes many clocks for a multiplication. I would say on a cache efficient system, an array of ptrs would be faster than a 2-D array made 1-D.
bwkaz
11-19-2002, 09:13 AM
Originally posted by sedarious
it takes many clocks for a multiplication.
Not with a decent compiler, if your row length is a power of two. Heck, even if it isn't.
If your row length is 64, then you don't have to multiply by 64, you can just shift left 6 bits, which had better only take one clock.
If your row length is 96, you can shift left 6, and shift left 5, and add them together (x*96 = x*(64+32) = x*64+x*32 = x<<6+x<<5), which is three clocks.
DrPizza
11-19-2002, 01:13 PM
Originally posted by sedarious
But a single array also requires an i,j multiplication - which isn't an speedy operation. You may have to wait a couple clocks for a reference return from memory or cache, but it takes many clocks for a multiplication. I would say on a cache efficient system, an array of ptrs would be faster than a 2-D array made 1-D. [/B]
It's not faster on any system I've seen (in practice, there are no straight multiplies -- the multiply-adds are turned into leas) -- and waiting on memory takes more than "a couple clocks".
sedarious
11-19-2002, 01:33 PM
edit: This is all a big edit. I was a bit brief before and decided to explain it better...
DrPizza:
I am pretty sure LEA only works with constants for that sort of multiplication (and I think they have to be 2,4,8 and maybe 16 - I can't find the info in any tech docs right now). Since, i and j are runtime dependant, they can't be utilized in an LEA multiplication scheme.
bwkaz:
For bit shift multiplication subsitutions, you need to know the values at runtime in order for the compiler to optimize it. Something like char *array=new char[i*10] is fine and dandy. char *array=new char[i*j] on the other hand is not. You would have to have a multiplication lookup table for substitutions (which is a cool idea), but would most likely end up being more costly that a slow ass mul operation due to the jumps and condition branching required.
DrPizza
11-19-2002, 05:20 PM
Originally posted by sedarious
I am pretty sure LEA only works with constants for that sort of multiplication (and I think they have to be 2,4,8 and maybe 16 - I can't find the info in any tech docs right now). Since, i and j are runtime dependant, they can't be utilized in an LEA multiplication scheme.
Well, it depends how many degrees of freedom you have.
But, of course, you don't always need to perform the multiplication. To iterate through the array, for instance, requires only pointer increments (or additions if you're going in the other sense).
bwkaz
11-19-2002, 05:55 PM
I see. No, it wouldn't work very well if any more than one dimension could be varied...
However, it is still the case that the memory read (if the cache misses -- if the cache hits, then you're OK) will take much, much longer than the piddly little multiply. On the 686 core, according to my Computer Architecture, A Quantitative Approach textbook, a cache miss takes around 70 cycles to satisfy (now this obviously depends a bit on memory and processor clock speeds, but that seems pretty close). Now the processor can still get some work done in that time if the cache is non-blocking and hit under miss (or something like that), because it's superscalar and speculative, but probably not much.
By the same token, if the multiply stalls the processor for some reason (actually, it won't, it'll just get executed on a different execution unit, and only stall dependent machine language instructions), other instructions can still be executed in the meantime. I don't have any numbers for how long it takes to multiply, but I wouldn't think it could possibly be more than about 20... right?
Of course, a cache hit is 1 cycle.
sedarious
11-19-2002, 06:06 PM
LEA, regardless of the situation, will not multiply two registers together. ANd if you are talking about iteration, it doesn't matter which method is used (2-D or 1-D) since you can just increment the pointer by the size of the element (excluding row changes since you many not be guarenteed the next spot in memory).
This is what I break the calls down to for a random element:
Array of ptrs to arrays:
2 memory calls (99% chance of cache hit in this case I would say) and an LEA op.
Single dim array:
1 memory call (cache) added to I*J MUL (slow)
Another major factor to consider is the pipeline length. The major difference is that the second case will take more time (the MUL) as the size grows. The first method will always take the same amount of time.
edit:
About the multipy - this is from 386 documentation... its probably not too much better now...
8bit - 70-77
16bit - 118-133
32bit - ???
I would say 32bit would take near 200 clock cycles...
bwkaz
11-19-2002, 11:45 PM
Umm, the 386 is definitely not a good processor to pick. Sure it had protected mode, and it had a PMMU, but the similarities with current Intel-compatible processors end there.
It was single-issue, maybe pipelined but I'm not sure on that one, had in-order execution, had no software cache/TLB control support (INVLPG was a 486 instruction), which makes me want to say it had no cache -- though I could be wrong, as I don't remember that processor quite that well.
Where did you find those numbers, though? Might there be anything similar for the 586 or 686 core (the Pentium or PPro/2/3 cores)? I remember that my 80x86 assembly book (the one that shipped with Borland's Turbo Assembler) had clock cycles per instruction in it, but as far as I remember, it only went up to the 386 anyway, so that would be useless. :eh:
I'm not sure on that 99% cache hit number either. We ran some simulations of 8KB, direct-mapped, 64-bit line size caches for our architecture class, on a gcc tracefile, and I remember numbers closer to an 80-90% hit rate. Which isn't bad, but no one got 99%...
DrPizza
11-20-2002, 12:54 AM
About the multipy - this is from 386 documentation... its probably not too much better now...
The Pentium 4 takes approximately 14 clock cycles for a 32-bit multiply. So, yeah, it is much better now.
I say "approximately" because one can't make exact measurements given its OOO pipelined nature.
stuka
11-20-2002, 11:08 AM
<minor hijack> What does "OOO pipelined nature" mean - or at least the OOO part?
DrPizza
11-20-2002, 11:16 AM
Out Of Order.
That is, the processor reorders instructions (well, microops really) in an effort to keep itself as busy as possible and avoid stalls.
sedarious
11-20-2002, 11:18 AM
Well, I guess it did get better. P4 also have a monster ALU which is INTERNALLY double pumped. So is it 14 CPU cycles or 14 ALU cycles? I am guessing its cpu cycles, since that is the major reference. Assuming AMD chips and p3's use the same logic, you are looking at a 28 cycle multiply.
OOO - out of order execution...
In reference to the 99% - that was just a number i threw out there. Since the i and j variable are stack located, there is no reason that they should be dropped from cache since the stack is going to be hit all the time.
If someone has the time today, can they toss together 2 little app that creates a 1000x1000 element matrix using both methods and retrieves say 10,000,000 RANDOM elements from the matrix and report a time? I don't have a C/C++ compiler at work and I can't SSH/Telnet out. If not, I will do it when I get home.
edit: Where did you find the info on the 14 clock cycles???
stuka
11-20-2002, 11:22 AM
Ahhh - just another TLA. Thanks.
sedarious
11-20-2002, 11:25 AM
TLA???
DrPizza
11-20-2002, 11:34 AM
Three Letter Acronym.
The 14 cycles is from the P4 optimization guide from Intel. Parts of the ALU are double-pumped. Not all. I would have to check to see which execution unit is responsible for multiplies.
sedarious
11-20-2002, 11:40 AM
I know for sure adds go through the double pumped - but who cares about that - tis fast enough already. I would like the think that they double pumped the long ops like div and mul in order to attain faster code execution... wouldn't suprise me if they didn't though.
stuka
11-20-2002, 11:49 AM
Three Letter Acronym :D
sedarious
11-20-2002, 11:52 AM
Apparently imuls do not go through the ALU, but rather the FP_MUL. MUL also does not go through the ALU, though its execution unit is not specified...
edit: Athlon is faster - 5-9 for signed, 4-8 for unsigned...
bwkaz
11-20-2002, 12:38 PM
8) Athlons 8)
:P
I'll see if I can write that program sometime this afternoon. It'll be running on an Athlon XP1800+, not o/c'ed, with whatever they have for L1 cache (I think 8K? not positive though), and 256K L2 cache. Not that processor frequency really matters, because both versions of the array will be looked at on the same proc., but I figure I should probably get that out up front anyway.
bwkaz
11-20-2002, 01:29 PM
OK. Code is attached.
When it was compiled with gcc, flags of -O3 -march=athlon-xp only, the results were as shown here:
[bilbo@beta bench]$ ./bench
923861.540000 microseconds on average, for double indirection.
10086.680000 microseconds on average, for single indirection.
10010.130000 microseconds on average, for single indirection with shifts.
[bilbo@beta bench]$ ./bench
923768.570000 microseconds on average, for double indirection.
10048.570000 microseconds on average, for single indirection.
10039.960000 microseconds on average, for single indirection with shifts.
:o
When compiled with gcc, flags of -O0 only, results were as shown here:
[bilbo@beta bench]$ ./bench
1029677.070000 microseconds on average, for double indirection.
10404.360000 microseconds on average, for single indirection.
9983.650000 microseconds on average, for single indirection with shifts.
[bilbo@beta bench]$ ./bench
1029669.760000 microseconds on average, for double indirection.
10378.260000 microseconds on average, for single indirection.
9989.850000 microseconds on average, for single indirection with shifts.
Which tells me, among other things, that -O3 is actually making shifts worse with gcc... but either way you cut it, two indirections is taking ~100x longer with this setup.
Now I'll be the first to admit that this program could easily be screwing up somewhere; I threw it together in about a half hour. Take a look at the code, and see if you can find anything.
The first time I did it, it was using an int array1[1000][1000], and coming up with numbers very comparable to the single indirection case. Now, it's mallocing each element independently (and I know, it should be freeing it too, but I don't care, my OS can clean up after this program ;) ).
sedarious
11-20-2002, 01:41 PM
t += array2[i*1000 + j];
firstly, make your lines t= exp to avoid stalls
second, this should be in the form array2[i*x+j] where x is a variable with a value of 1000. Using a constant allows for compiler optimization and doesn't force intense runtime calculation which is what was important. THe premise is to find the faster one where neither the X or Y dimensions are known at compiletime. To make sure, do a cin>>x; so that the x dimension can't possibly be optimized...
edit: other than that, I can't see any other possible problems just yet...
bwkaz
11-20-2002, 03:20 PM
t = exp? Huh? Sorry, I'm not sure what you're talking about...
That other problem is valid. Of course, it invalidates the shifting version, but I'm not going to change that anyway. Results of changing just that, and only run with no optimization in the interest of time:
[bilbo@beta bench]$ ./bench
Enter '1000' into the system:
1000
998846.010000 microseconds on average, for double indirection.
10374.090000 microseconds on average, for single indirection.
10235.460000 microseconds on average, for single indirection with shifts.
Still comparable. Source is attached again.
sedarious
11-20-2002, 04:16 PM
LOL
compare the for lines... (they are in order)
for(counter=0; counter<1000000000; counter++) {
for(counter=0; counter<10000000; counter++) {
for(counter=0; counter<10000000; counter++) {
I wonder why we are getting ~100x difference???
edit: if we factor out the exta magnitude of 100, then the double-reference comes in at about 4% faster and the AMD chip. I would guess this difference would move to about 6%-8% on a P4 and it would keep growing the older the architecture...
bwkaz
11-20-2002, 05:53 PM
DOH!!!
Wow, do I feel stupid!
Yeah, they were all running ten million references each, but then I was seeing differences between runs, so I figured I'd run each test 100 times, and divide by 100. Seems I forgot to do all but the divide on the second two tests...
:redfaced:
OK, some better results here. With -O0:
[bilbo@beta bench]$ ./bench
Enter '1000' into the system:
1000
1006186.450000 microseconds on average, for double indirection.
1036096.710000 microseconds on average, for single indirection.
1023595.120000 microseconds on average, for single indirection with shifts.
So slightly slower on the multiply version. With -O3 -march=athlon-xp:
[bilbo@beta bench]$ ./bench
Enter '1000' into the system:
1000
919640.070000 microseconds on average, for double indirection.
927181.410000 microseconds on average, for single indirection.
919302.000000 microseconds on average, for single indirection with shifts.
So shifts are only slightly better, when you optimize everything else. But I'd guess that there aren't more than three significant digits in any of these numbers anyway...
sedarious
11-20-2002, 05:59 PM
if you swaped the *1000 back into the multiplication case, it would probably be faster than the shift...
anway, double reference wins the title match!
DrPizza
11-20-2002, 06:54 PM
There are other issues to consider -- this test is too simplistic and, I suspect, not representative of real-world usage patterns.
Not all access is random. An awful lot is sequential.
If I carve out a solid chunk of memory, I can walk the entire thing by iterating the pointer.
If I have array-of-pointer, I can't (at least, not without trickery).
If I have an array-of-pointer, I need to make at least one more memory allocation, and more complicated (i.e. slower) initialization.
I may have some fixed dimensions; I can take extra advantage of these with a single array (I can eliminate some multiplies); not so with an array of pointer (I can't eliminate dereferences).
In any case, with changes to work in Win32 (using QueryPerformanceCounter instead of gettimeofday) my results look like this on a PIII-1000:
1.305145 seconds on average, for double indirection.
1.137325 seconds on average, for single indirection.
1.088434 seconds on average, for single indirection with shifts.
Which is hardly what one would call a win for the double indirection. I moved array2 off the stack, as it was crashing on startup; it was easier to change it o heap allocated than it was to find the correct linker option to enlarge the stack.
recluse
11-20-2002, 07:03 PM
So what dojo did you guys study at to obtain your intricate knowlege of processors and the nitty gritties of c/c++?
sedarious
11-20-2002, 07:23 PM
Very interesting Dr. Watson. I am left pondering why they are now switched for performance. Any specific thoughts on why they are different now?
In regards to real world stuffs, you are right. Random access is not generally how things are done - it was mearly a way of putting the two methods on the same ground since iteration of the double indirection isn't as easy. The double inderection does have the distinct advantage of being able to have different sizes in each array, which might be useful at times.
I was mostly kidding about the "win the title match" thing. As we can see, there is a good deal of change between processors. Cache effeciency might be an issue, as well as some op execution times. I would say that they are fairly equivalent and both have thier place.
I would like to see how this runs on a P4 and a Mac chip (G4 would be ideal) just to get a feel for more processor differences. Also, an old school x86 comparison might be neet too (486 anyone?)
Originally posted by recluse
So what dojo did you guys study at to obtain your intricate knowlege of processors and the nitty gritties of c/c++?
Well, I know a decent amount of x86 asm and I can tell you a good deal about any x86 processor from the 386 up to the P4 as well as AMD's new beauty (Athlon 64). I have some knowledge of the G4 and IBM's cpu for the Mac line (R700 maybe?). Some of the info comes from reading sites like anandtech, tomshardware, hardocp, aces hardware and sharky extreme. If you can handle it, the technical papers from intel and amd are also really informative - but they are rather long and can easily make your eyes roll into the back of thier sockets...
bwkaz
11-20-2002, 08:04 PM
Interesting. Changing it back to *1000 came up with this (full optimization):
$ ./bench
911856.320000 microseconds on average, for double indirection.
895552.640000 microseconds on average, for single indirection.
924322.870000 microseconds on average, for single indirection with shifts.
[bilbo@beta bench]$ ./bench
911753.920000 microseconds on average, for double indirection.
900028.960000 microseconds on average, for single indirection.
923994.770000 microseconds on average, for single indirection with shifts.
Interesting.
Originally posted by recluse
[B]So what dojo did you guys study at to obtain your intricate knowlege of processors and the nitty gritties of c/c++?
Study at? I never really took C or C++ classes... MTU has a C++ class, but I knew the language well enough to not bother with it, when I came in. Most of the first-year stuff is done in Java, though. I've just been programming in C for quite some time on my own. Most of the comments from me about the architecture were based on a reading of the 386 documentation (of protected mode, etc.) both on Intel's website and in a couple of books that, surprisingly, the library in my hometown had. Some reading of the aforementioned Turbo Assembler manuals also helped a bit.
vBulletin® v3.7.0, Copyright ©2000-2008, Jelsoft Enterprises Ltd.