This is a very basic introduction to Gephi. It assumes no prior knowledge and explains how to import the network dataset you receive from ARCH and perform some basic transformations on it yourself. This introduction was written in June 2021 and may be out of date with the latest version of Gephi.
Table of contents:
- Importing your data
- Setting up the dates in your data
- Adding some labels
- Basic graph layouts
- Applying a statistical analysis
Importing your data
This tutorial explores what you can learn from the dataset file marked as "Domain Graph" in your collections page.
- The first step is to download and install Gephi, which you can find here.
- Open Gephi and start a “new project”. Then, under the “File” menu, select “Import Spreadsheet.” Select “domain-graph.csv”, and you will see the data ready for import, as below. (macOS users may need to use The Unarchiver to open the compressed file downloaded from ARCH.)
- Make sure you are importing it as an edges table, and that the separator is correctly set to comma.
- On the next page, select “intervals” and leave the remaining settings as they are. Click “finish” and you will see an overview of your graph. Gephi will report “Parallel edges” if your crawl spans different dates but contains overlapping links, that is, the same link captured on more than one date. Select “Don’t merge” as we will resolve those later; you may need to click “more options…” to reveal additional checkboxes and options. Click “OK”.
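If you are curious why Gephi reports parallel edges, the short Python sketch below groups the rows of an edges table by source and target and lists the pairs that appear on more than one crawl date. It is only an illustration: the sample rows are made up, and the column names other than crawl_date (src_domain, dest_domain, count) are assumptions, so check them against the header of your own file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for domain-graph.csv. The column names
# other than crawl_date are assumptions; check them against your own file.
SAMPLE_CSV = """crawl_date,src_domain,dest_domain,count
20210601,example.org,archive.org,12
20210602,example.org,archive.org,7
20210601,example.org,wikipedia.org,3
"""

def find_parallel_edges(csv_text):
    """Return {(source, target): [crawl dates]} for pairs seen on more than one date."""
    dates_by_pair = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        pair = (row["src_domain"], row["dest_domain"])
        dates_by_pair[pair].append(row["crawl_date"])
    # Keep only the pairs that occur more than once: these are the parallel edges.
    return {pair: dates for pair, dates in dates_by_pair.items() if len(dates) > 1}

parallel = find_parallel_edges(SAMPLE_CSV)
print(parallel)
```

Each pair in the result is a link that was captured on several dates, which is why merging is left for later: the separate dates are what make the graph dynamic.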
Setting up the dates in your data
You now need to go in and make sure Gephi can recognize the dates that are found in this file, so you can dynamically explore the web graph. This can be a bit complicated!
- First, click the “Data Laboratory” tab, then click “Edges.” Look at the arrows in the screenshot below for further direction:
- Now click “Merge Columns” located at the bottom of your screen. You want to merge “Interval” with “crawl_date” and then “Create time interval.”
- You then want to parse the dates as yyyymmdd (four-digit year, two-digit month, two-digit day). Use “crawl_date” as your start and end time columns.
- At the bottom of the screen, you will now see a timeline which you can click to enable. However, it might look unintelligible like this:
- Click on “time options” (appears as a cog icon) in the lower left, and select “select time format.” Select “datetime.”
You now have a dynamic graph!
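To see what the date step above is doing, here is a minimal Python sketch that parses an eight-digit crawl date and uses it as both the start and the end of an interval. The function names are hypothetical, and this only mirrors the idea behind the dialog; it is not how Gephi itself stores intervals.

```python
from datetime import datetime

def parse_crawl_date(value):
    """Parse an eight-digit yyyymmdd string into a date object."""
    return datetime.strptime(value, "%Y%m%d").date()

def to_interval(value):
    """Use the same crawl date as both the start and the end of the interval."""
    day = parse_crawl_date(value)
    return (day, day)

start, end = to_interval("20210615")
print(start.isoformat(), end.isoformat())
```

A value that does not match the pattern, such as a date with dashes, raises a ValueError, which is a quick way to spot rows that would not parse cleanly.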
Adding some labels
While we are still in the Data Laboratory tab, let us do a similar transformation to bring the domain names over to each of the nodes:
- Click on “Nodes” at the top of the spreadsheet, and you should see this.
- Click on “Merge Columns” and we will copy the “Id” data over to “Label” as well, so Gephi knows we might want to use it in our visualization.
- Click “Copy data to other column,” and select “Id.” Copy it to “Label.” The spreadsheet should then look like this:
You are now ready to begin the process of laying your network out!
Basic graph layouts
You'll now see the following basic layout in the Overview tab. Not too useful, is it? Let's begin by creating a new layout, which you'll see highlighted here below:
Select the layout tab at left, and select "Yifan Hu Proportional." Leave the values default, but you can begin to play with the figures and see what it does. To lay the graph out, click the "run" button.
The following image shows what this looks like after clicking "run" on the default visualization.
Let's add some labels so we can see what this all means. Click on the "T" button below the graph, which is highlighted below. You'll then see lots of labels. It is not too readable - don't worry, we will deal with that shortly.
The next step is to resize the nodes (domains) based on a characteristic. Let's make them bigger based on how many times they are linked to. This is called "in-degree" in Gephi.
This can sometimes be a bit challenging to find in the Gephi interface! In the "Appearance" window at left, click on the "size" icon, select "ranking," and then select "In-Degree" with a min size of 3 and a max size of 40. Then click "Apply."
If the above is confusing, look at the screenshot below and try to reproduce what you see there.
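If you would like to see the idea behind ranking by in-degree, the sketch below counts how many edges point at each node and then maps those counts onto the 3 to 40 size range used above. It assumes a plain linear interpolation between the smallest and largest in-degree, which may differ from what Gephi does internally, and the edge list is made up for illustration.

```python
from collections import Counter

def in_degrees(edges):
    """Count incoming edges per node; nodes with no incoming edges get 0."""
    nodes = {node for edge in edges for node in edge}
    counts = Counter(target for _, target in edges)
    return {node: counts.get(node, 0) for node in nodes}

def scale_sizes(values, min_size=3.0, max_size=40.0):
    """Linearly map values onto [min_size, max_size] (an assumed interpolation)."""
    lowest, highest = min(values.values()), max(values.values())
    if highest == lowest:
        # Every node has the same in-degree, so give them all the minimum size.
        return {node: min_size for node in values}
    span = max_size - min_size
    return {
        node: min_size + (value - lowest) * span / (highest - lowest)
        for node, value in values.items()
    }

# Hypothetical (source, target) pairs.
edges = [("a", "c"), ("b", "c"), ("a", "b"), ("c", "d")]
degrees = in_degrees(edges)
sizes = scale_sizes(degrees)
print(degrees)
print(sizes)
```

Here "c" is linked to most often, so it receives the maximum size, while "a" is never linked to and receives the minimum.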
Now let's do the same for label size: the bigger the label, the more it is linked to; the smaller the label, the less it is linked to. To do this, click on the "text size" icon, select "Ranking," and then select "In-Degree." Let's do a min size of 0.1 and a max size of 3. If this is confusing, again try to recreate what you see in the screenshot below.
Some of the labels now overlap, so let's run another simple "layout." This time, we select "Label Adjust" and press run.
We now have a decently laid out network!
Applying a statistical analysis
Now let's run a statistical analysis. We'll run a rudimentary community detection algorithm, which you can find in the "statistics" section on the right-hand side. Click the "run" button next to modularity, and click through the report that follows. The two following screenshots show you where to look.
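To give a feel for what modularity measures, the sketch below computes the modularity score of a partition that is already known, on a tiny made-up graph. It is a simplification under stated assumptions: it treats the graph as undirected and unweighted, and it only scores a given grouping. It does not search for communities the way Gephi's modularity statistic does.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Score a partition: for each community, the fraction of edges inside it
    minus the fraction expected if edges were placed at random."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Hypothetical graph: two triangles joined by a single edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

score_split = modularity(edges, two_groups)
score_single = modularity(edges, one_group)
print(score_split, score_single)
```

Splitting the graph into its two triangles scores higher than putting every node in one group, which is the intuition behind the "Modularity Class" values Gephi assigns: nodes in the same class are more densely linked to each other than to the rest of the graph.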
The final step is to apply the modularity categories to the graph. Let's colour the nodes based on the community that they appear in.
To do so, go back to appearance. This time click the painter's palette, select "Partition," and then apply "Modularity Class." As before, try to recreate what you see in the screenshot below if it is confusing.
At the end of this lesson, your graph should be looking similar to this:
Congratulations! You now have a nicely laid-out graph. Now, try experimenting with other features in Gephi.