So in our previous lesson, we discussed
one possible way of
storing and representing a graph in
which
we used two list. One to store the
vertices and another to store the
edges. A record in vertex list here
is name of a node
and a record in edge list is an
object
containing references to the two endpoints
of an edge and also the weight of that edge
because this example graph that I am showing
you here is a
weighted graph. We called this kind of
representation
edge list representation but we realised
that this kind of storage is not very
efficient in terms of
time cost of most frequently performed
operations
like finding nodes adjacent to a given
node
or finding if two nodes are
connected are not.
To perform any of these operations, we
need to scan the whole
edge list. We need to perform a
linear search on the edge list.
So the time complexity is big oh of number
of
edges and we know that number of edges
in the graph
can be really really large. In worst case
it can be close to square of number of
vertices.
In a graph, anything running in order
of number of
edges is considered very costly. We
often want to keep the cost
in order of number of vertices. So we
should think of some other efficient
design.
We should think of something better than
this. One more possible design is that
we can store the edges in a
two-dimensional array
or matrix. We can have a
two-dimensional matrix
or array of size V*V
where V is number of vertices.
As you can see, I have drawn an 8*8
array here because number of vertices
in my sample graph here
is 8. Let's name this array A.
Now if we want to store a graph that is
unweighted. Let's just remove the weights
from this sample graph here
and now our graph is unweighted and if we
have
of value or index between 0 and V-1
for each vertex which we have here
if we are storing the vertices in a
vertex list
than we have an index between 0 and V-1
for each vertex. We can say that A
is zeroth node,
B is 1th node, C is
2th
node and so on. We are picking up
indices from vertex list. Okay
so if the graph
is unweighted and each vertex has an
index between 0 and
V-1, then in this matrix
or 2d array. We can set ith row
and jth column that is A[i][j]
as 1 or boolean value
true. if there is an edge from i to j
0 or false otherwise. If I have
to fill this matrix for this example
graph here then I'll go vertex by vertex.
Vertex 0 is connected to Vertex 1
2 and 3. Vertex 1
is connected to 0, 4 and 5.
This is an undirected graph so if we
have and edge from 0 to 1,
we also have an edge from 1 to 0
so
1th row and 0th column should also be
set as 1.
Now let's go to nodes 2, it's connected
to 0
and 6, 3 is connected to 0 and 7,
4 is connected to 1 and 7,
5 once again is connected to 1 and 7,
6 is connected to
2 and 7 and 7 is connected
to 3, 4, 5 and 6.
All the remaining positions in
this array should be set as 0.
Notice that this matrix
is symmetric. For an undirected graph,
this matrix would be symmetric
because A[i][j] would be equal to A[j][i].
We would have two positions filled for
each edge.
In fact to see all the edges in the graph,
we need to go through only one of these
two halves.
Now this would not be true for our
directed graph. Only one position will be
filled for each
edge and we will have to go through
the entire matrix
to see all the edges. Okay,
now this kind of representation of a
graph in which
edges or connections are stored in a
matrix
or 2D array is called adjacency matrix
representation. This particular matrix that
I have drawn here
is an adjacency matrix. Now with this
kind of storage or representation,
what do you think would be the time cost
of finding
all nodes adjacent to a given node. Let's say
given this vertex list
and adjacency matrix, we want to find
all nodes adjacent to node named F.
If we are given name of a node than
we first need to know it's
index and to know the index, we will have to
scan the vertex list.
There is no other way. Once we figured out
index
like for F index is 5 then
we can go to the row with that index
in the adjacency matrix
and we can scan this complete row to
find all the
adjacent nodes. Scanning the vertex
list
to figure it out the index in worst case
will cost us time proportional to the
number of vertices
because in worst case we may have to
scan the whole list,
and scanning a row
in the adjacency matrix would once again
cost us time proportional to number of
what vertices because
in a row we would have exactly
V columns where V is number of a
vertices.
So overall time cost of this operation
is big oh of V. Now most of the time
while performing operations,
we must pass indices to avoid
scanning the vertex list all the time.
If we know an index, we can figure out
the name in constant time,
because in an array we can access element at
any index in constant time but if we know
a name
want to figure out index then it will
cost us big oh of V.
We will have to scan the vertex list.
wWe will have to perform linear search
on it. Okay moving on.
Now what would be the time cost of
finding if 2 nodes
are connected or not. Now once again the
two nodes can be given to us
as indices or names. If the nodes
would be passed test as indices
then we simply need to look at value in
a particular row and
particular column. We simply need to look
at
A[I][J] for some values of I and J
and this will cost us constant time.
You can look at Value in any cell in
a two-dimensional array in constant time.
So if
indices are given time complexity of
this operation would be big oh of 1
which simply means that we will
take constant time
but if names are given then we also need
to do the scanning
to figure out the indices which will
cost us big oh of V.
Overall time complexity would be
Big oh of V.
The constant time access would not mean
anything.
The scanning of vertex list all the
time to figure it out
indices can be avoided. We can use
some extra memory to create
a hash table with names and indices
as key value pairs and then the time
cost of finding
index from name would also be big oh
of 1 that is constant. Hash table is
a data structure
and I have not talked about it in any of
my lessons so far.
If you do not know about hash table, just
search online for
a basic idea of it. Okay, so as you can
see
with adjacency matrix representation
our time cost of some of the most
frequently performed operations
is in order of number of vertices
and not in order of number of
edges which can be as high as square of
number of vertices.
Okay now if we want to store
a weighted graph in adjacency matrix
representation
then A[i][j] in the matrix can be set as
weight of an edge. For non-existent ages we
can have
a default value like a really large
or maximum possible integer value
that is never expected to be an edge
weight. I have just filled in infinity
have to mean that
we can choose the default as infinity
minus infinity
or any other value that would never
ever be a valid
edge weight. Okay, now for further
discussion
I'll come back to an unweighted graph.
Ajacency matrix
looks really good so should we not use it
always.
Well, with this design we have improved
on
time, but we have gone really high on
memory usage
instead of using memory units exactly
equal to the number of edges
what we're doing with
edge list kind of storage.
Here we're using exactly V square
units of memory.
We are using big oh of V square space.
We are not just storing the information
that these two
nodes are connected, we are also storing not
of it
that is these two nodes side not connected
which probably is
redundant information. If a graph is
dense,
if the number of edges is really close
to V square
then this is good but if the graph is
sparse
that is if number of edges is lot lessser
than V square
then we are wasting a lot of
memory in storing the zeros.
Like for this example graph that I have
drawn here, in the edge list we were
consuming
10 units of memory we had ten rows
consumed in the edge list
but here we are consuming 64 unit.
Most graphs with
really large number of vertices would
not be very dense,
would not have number of edges anywhere
close to V sqaure
like for example, Let's say we are modeling
a social network like Facebook as a
graph such that a user in the network
is a node
and there is an undetected edge if two
users are friends.
Facebook has a billion users but I'm
showing only a few in my example graph
here because I'm short of space.
Let's just assume that we have a billion
users in our network,
so number of vertices in a graph is
10 to the power 9
which is billion. Now do you think number
of connections
in our social network can ever be close
to square of number of users
that will mean everyone in the network
is a friend of
everyone else. A user of our social
network will not be friend to all other
billion users.
We can safely assume that a user
on an average would not have more than
a thousand friends
with this assumption we would have
10 to the power 12
edges in our graph. Actually, this is an
undirected graph
so we should do a divide by 2 here. So
that we do not
count an edge twice. So if
average number of friends is 1000 then total
number of connections in my graph is
5 * 10 to power 11. Now this
is lot lesser than a square of number
of vertices.
So basically if you would use an adjacency
matrix for this kind of a graph,
we would waste a hell lot of space
and moreover
even if we are not looking in relative
terms 10 to the power 18
units of memory, even in absolute
sense
is alot. 10 to the power 18 bytes
would be about a 1000 petabytes.
Now this really is a lot of space. This
much data would never ever fit on one
physical disk.
5 into 10 to the power 11 byts on the other
hand
it's just 0.5 terabytes. A typical
personal computer these days would have this
much of storage.
So as you can see for something like a
large
social graph adjacency matrix
representation is not very efficient.
Agency matrix is good when a graph is
dense
that is when the number of edges is
close to square of number of vertices
or sometimes when total number of
possible connection that is V square
is so less that wasted space would not
even matter
but most real-world graphs would be
sparse
and adjacency matrix would not be a good
fit.
Let's think about another example. Let's
think about
world wide web as are directed graph.
If you can think of web pages as nodes
in a graph
and hyperlinks as directed edges
then a webpage would not have linked to
all other pages
and once again number of webpages
would be in order of millions.
A webpage would have link to only
the
a few other pages, so the graph would be
sparse.
Most real world graphs would be sparse
and adjacency matrix. Even though it's
giving us good running time for most
frequently performed
operations would not be a good fit
because it's not very efficient in terms
of space
so what should we do. Well there's
another
representation that gives us similar
or maybe even better running time than
adjacency matrix and does not consume so
much space
It's called adjacency list
representation and we will talk about it
in our next lesson.
This is it for this lesson.
Thanks for watching