{"id":2958,"date":"2025-07-09T11:37:13","date_gmt":"2025-07-09T11:37:13","guid":{"rendered":"https:\/\/mindfusion.eu\/blog\/?p=2958"},"modified":"2025-10-06T07:53:40","modified_gmt":"2025-10-06T07:53:40","slug":"building-a-web-crawler-and-web-graph-visualizer-in-javascript","status":"publish","type":"post","link":"https:\/\/mindfusion.dev\/blog\/building-a-web-crawler-and-web-graph-visualizer-in-javascript\/","title":{"rendered":"Building a Web Crawler and Web Graph Visualizer in JavaScript"},"content":{"rendered":"\n<p>In this post, we&#8217;ll walk through the process of creating a simple application that can crawl the web starting from a given URL, and visualize the hyperlinks it finds as an interactive web graph. We will accomplish this using pure JavaScript and the MindFusion.Diagramming library, which provides the powerful features needed for graph creation, layout, and interaction.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/mindfusion.dev\/blog\/wp-content\/uploads\/2025\/07\/web_graph.png\"><img loading=\"lazy\" decoding=\"async\" width=\"743\" height=\"422\" src=\"https:\/\/mindfusion.dev\/blog\/wp-content\/uploads\/2025\/07\/web_graph.png\" alt=\"\" class=\"wp-image-2959\" srcset=\"https:\/\/mindfusion.dev\/blog\/wp-content\/uploads\/2025\/07\/web_graph.png 743w, https:\/\/mindfusion.dev\/blog\/wp-content\/uploads\/2025\/07\/web_graph-300x170.png 300w, https:\/\/mindfusion.dev\/blog\/wp-content\/uploads\/2025\/07\/web_graph-500x284.png 500w\" sizes=\"auto, (max-width: 743px) 100vw, 743px\" \/><\/a><\/figure>\n\n\n\n<!--more-->\n\n\n\n<p>The full source code for this example is available here:<br><a href=\"https:\/\/mindfusion.dev\/_samples\/WebGraph.zip\">https:\/\/mindfusion.dev\/_samples\/WebGraph.zip<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Setting Up the Diagram<\/h2>\n\n\n\n<p>The first step is to set up the host page and initialize the core objects from the MindFusion API: Diagram and DiagramView. 
This is done within a DOMContentLoaded event listener, where we have access to the canvas element. We also perform some initial styling, such as setting the arrowhead sizes and creating a Theme object:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>var diagram = null;\nvar diagramView = null;\n\ndocument.addEventListener(\"DOMContentLoaded\", function ()\n{\n    diagram = new Diagram();\n\n    \/\/ DiagramView control renders the diagram\n    diagramView = DiagramView.create(document.getElementById(\"diagram\"));\n    diagramView.diagram = diagram;\n    diagramView.mouseWheelAction = MindFusion.Diagramming.MouseWheelAction.Zoom;\n\n    \/\/ tweak appearance\n    diagram.linkHeadShapeSize = 2;\n    diagram.showGrid = false;\n    diagram.backBrush = \"#e0e9e9\";\n\n    \/\/ style nodes using a theme\n    var theme = new Theme();\n    var shapeNodeStyle = new Style();\n    shapeNodeStyle.brush = \"white\";\n    shapeNodeStyle.stroke = \"#7F7F7F\";\n    shapeNodeStyle.textColor = \"#585A5C\";\n    theme.styles.set(\"std:ShapeNode\", shapeNodeStyle);\n    diagram.theme = theme;\n});\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">The Crawling Logic<\/h2>\n\n\n\n<p>The core of our application is the web crawler. We use <code>setInterval<\/code> to create a processing loop that runs every 100 milliseconds. This loop manages a queue of URLs to be fetched. To avoid fetching the same page repeatedly when pages link back to each other, we record processed URLs in a Set called fetchedPages.<\/p>\n\n\n\n<p>We also track the number of activeFetches to ensure the process doesn&#8217;t terminate while requests are still in flight. 
The crawling process stops when the queue is empty and no fetches are active, or when we reach the user-defined maximum number of pages:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>function crawlTick() {\n    \/\/ stop when queue is empty and all fetches are done,\n    \/\/ or if we hit the maxPages limit\n    if ((queue.length === 0 &amp;&amp; activeFetches === 0) || nodes.size &gt;= maxPages) {\n        clearInterval(crawlTimer);\n        arrangeDiagram();\n        return;\n    }\n\n    \/\/ if queue is empty, but fetches are still running, wait for them to complete\n    if (queue.length === 0) {\n        return;\n    }\n\n    \/\/ get next URL from queue\n    var currentLink = queue.shift();\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Fetching, Parsing, and Building the Graph<\/h2>\n\n\n\n<p>As we process a page, we create a ShapeNode for it by calling diagram.factory.createShapeNode. We store this node in a Map that associates the URL with its diagram node. That allows us to easily find existing nodes and create DiagramLink instances between them, visualizing our graph. 
<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ create a diagram node representing target page\nif (!nodes.has(currentLink.toPage))\n{\n\t\/\/ we'll arrange automatically later\n\tvar node = diagram.factory.createShapeNode(1000, 1000, 50, 20);\n\n\t\/\/ remember the target URL to open from click events\n\tnode.hyperLink = currentLink.toPage;\n\tnode.tooltip = currentLink.toPage;\n\n\t\/\/ map URL to diagram node\n\tnodes.set(currentLink.toPage, node);\n}\n\n\/\/ create a diagram link representing the hyperlink\nvar fromNode = nodes.get(currentLink.fromPage);\nvar toNode = nodes.get(currentLink.toPage);\nif (fromNode &amp;&amp; toNode) {\n\tvar link = diagram.factory.createDiagramLink(fromNode, toNode);\n\tlink.zIndex = 0;\n\tif (toNode.incomingLinks.length &gt; 1)\n\t\tlink.stroke = \"lightGray\";\n}<\/code><\/pre>\n\n\n\n<p>For each URL dequeued in our crawlTick, we perform several actions. First, we fetch the page&#8217;s HTML content. Note that due to browser CORS (Cross-Origin Resource Sharing) policies, fetching HTML from a different domain directly from client-side JavaScript is blocked unless the target server explicitly allows it. Our implementation uses a simple local Node.js proxy to work around this (run npm start to start the proxy, while WebGraph.html itself can be opened from the file system).<\/p>\n\n\n\n<p>Once we receive the page content, we use the browser&#8217;s built-in DOMParser to turn the string into a traversable DOM document. From this document, we can easily extract the page title and all hyperlink (<code>&lt;a&gt;<\/code>) tags. 
New, unique hyperlinks found on the page are added to the back of the queue for future processing.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ Request the page through server proxy\nfetch(`http:\/\/localhost:3000\/proxy?url=${encodeURIComponent(currentLink.toPage)}`)\n    .then(response =&gt; response.text())\n    .then(html =&gt; {\n        \/\/ parse the page to extract title\n        var parser = new DOMParser();\n        var doc = parser.parseFromString(html, \"text\/html\");\n\n        var pageNode = nodes.get(currentLink.toPage);\n        if (pageNode) {\n            var text = doc.title || \"untitled\";\n            if (text.length &gt; 80) {\n                text = text.substring(0, 80) + \"...\";\n            }\n            pageNode.text = text;\n        }\n\n        \/\/ ... and a list of hyperlinks\n        var links = doc.querySelectorAll('a');\n        links.forEach(link =&gt; {\n            var address = link.getAttribute('href');\n            \/\/ follow only absolute http(s) URLs\n            if (address &amp;&amp; (address.startsWith('http:\/\/') || address.startsWith('https:\/\/'))) {\n                \/\/ Add to queue to process on next timer tick\n                if (!fetchedPages.has(address) &amp;&amp; fetchedPages.size &lt; maxPages) {\n                    queue.push({fromPage: currentLink.toPage, toPage: address});\n                }\n            }\n        });\n    })\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Automatic Layout and Interactivity<\/h2>\n\n\n\n<p>After the crawler has finished, our diagram contains all the nodes and links, but they aren&#8217;t yet positioned. 
Let&#8217;s apply SpringLayout to automatically arrange the graph:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>function arrangeDiagram() {\n    \/\/ automatically arrange nodes and links\n    var layout = new MindFusion.Graphs.SpringLayout();\n    diagram.arrange(layout);\n    diagram.resizeToFitItems(5);\n    diagramView.bringIntoView(diagram.nodes&#91;0]);\n}<\/code><\/pre>\n\n\n\n<p>Finally, to make the graph interactive, we add a handler to the diagram&#8217;s nodeClicked event. In the handler, we retrieve the URL stored in the node&#8217;s hyperLink property and open it in a new browser tab:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>function onNodeClicked(sender, args) {\n    \/\/ open the hyperlink in a new tab\n    if (args.node.hyperLink) {\n        window.open(args.node.hyperLink, '_blank');\n    }\n}\n\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Using a timer-based processing loop and the features of the MindFusion.Diagramming library, we&#8217;ve created a functional web crawler and visualizer with a small amount of code. This example demonstrates how to represent and interact with complex network data, and can serve as a foundation for more advanced analysis and visualization applications.<\/p>\n\n\n\n<p>The code above demonstrates MindFusion\u2019s JavaScript diagramming library, but we support the same API (model and layout classes, events) in our .NET and Java libraries.<\/p>\n\n\n\n<p>Enjoy!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post, we&#8217;ll walk through the process of creating a simple application that can crawl the web starting from a given URL, and visualize the hyperlinks it finds as an interactive web graph. 
We will accomplish this using pure &hellip; <a href=\"https:\/\/mindfusion.dev\/blog\/building-a-web-crawler-and-web-graph-visualizer-in-javascript\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false},"version":2}},"categories":[95,513,74],"tags":[],"class_list":["post-2958","post","type-post","status-publish","format-standard","hentry","category-diagramming-2","category-javascript","category-sample-code"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p3RlKs-LI","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/mindfusion.dev\/blog\/wp-json\/wp\/v2\/posts\/2958","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mindfusion.dev\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mindfusion.dev\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mindfusion.dev\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mindfusion.dev\/blog\/wp-json\/wp\/v2\/comments?post=2958"}],"version-history":[{"count":7,"href":"https:\/\/mindfusion.dev\/blog\/wp-json\/wp\/v2\/posts\/2958\/revisions"}],"predecessor-version":[{"id":2968,"href":"https:\/\/mindfusion.dev\/blog\/wp-json\/wp\/v2\/posts\/2958\/revisions\/2968"}],"wp:attachment":[{"href":"https:\/\/mindfusion.dev\/blog\/wp-json\/
wp\/v2\/media?parent=2958"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mindfusion.dev\/blog\/wp-json\/wp\/v2\/categories?post=2958"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mindfusion.dev\/blog\/wp-json\/wp\/v2\/tags?post=2958"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}