I often mash and extract data using Excel. I needed a way to pull out the domain from a web address and compare it with domains from other sheets and activities. This then gives me insight into how often we do work on websites across all our projects.
Version Note: The formula below only works in versions of Excel that can read the xlsx format. Older versions of Excel (pre-2004) are limited to seven nested formulas. This equation has eight. The iserror formulas in the if statements checks for errors and displays whatever is in the A6 cell if an error occurs. If the cell is blank, it should stay blank. I simply used the blank check upfront to make the formula more efficient in processing.
Here is the formula:
=IF($A6="","",SUBSTITUTE(IF(ISERROR(FIND("https://",$A6)),SUBSTITUTE(IF(ISERROR(FIND("/",SUBSTITUTE($A6,"http://",""))),$A6,REPLACE($A6,FIND("/",SUBSTITUTE($A6,"http://",""))+7,LEN($A6)+1,"")),"http://",""),SUBSTITUTE(IF(ISERROR(FIND("/",SUBSTITUTE($A6,"https://",""))),$A6,REPLACE($A6,FIND("/",SUBSTITUTE($A6,"https://",""))+8,LEN($A6)+1,"")),"https://","")),"www.","")
A bit complex to explain, but it considers situations where a web address uses either the secure “https” protocol or just the regular “http” and always removes the “www” from the address. The rest of it is looking for the first slash after the “http://” or “https://” and deletes everything from that slash to the end.
Here are the details for each stage:
=IF($A6="","",
// Makes sure nothing happens if nothing is in the cell
SUBSTITUTE(
// Used to delete www from the result
IF(ISERROR(FIND("https://",$A6)),
// If it doesn’t find https, then execute the function to remove everything after the domain for sites with http:// — this is needed since the s adds characters prior to the slash (important for the replace function later)
SUBSTITUTE(
// If https:// was not found, this begins the formula used to delete everything after the first slash after the http://
IF(ISERROR(FIND("/", SUBSTITUTE($A6,"http://",""))),
// If it doesn’t find a slash after the domain, then do nothing and display what’s in the cell
$A6,
// Just display what was in the cell–which should be just a domain address since there is no slash after the http
REPLACE($A6,
// This is the core function that removes everything after the domain
FIND("/", SUBSTITUTE($A6,"http://",""))
// Finds the spot/character number where the first slash is after the http://
+ 7,
// Adds the number of spaces for http so that the starting point of the replacement is correct
LEN($A6)+1,
//This finds the end of the space where text is supposed to be replaced/deleted. Added 1 to cover bases and potential spacing issues.
""
// this is what it will be replaced with…nothing.
)
// Ends the Replace function.
),
// Ends the IsError function.
"http://",""
// Once the result is determined (either after the end of the url is cut off or if it’s just a domain), delete the http:// so that just the domain shows
),
// Ends the substitute function if http:// is found.
SUBSTITUTE(
// If https:// was not found, this begins the formula used to delete everything after the first slash after the http://
IF(ISERROR(FIND("/", SUBSTITUTE($A6,"https://",""))),
// If it doesn’t find a slash after the secure domain, then do nothing and display what’s in the cell
$A6,
// Just display what was in the cell–which should be just a domain address since there is no slash after the https://
REPLACE($A6,
// This is the core function that removes everything after the secure domain
FIND("/", SUBSTITUTE($A6,"https://",""))
// Finds the spot/character number where the first slash is after the https://
+ 8,
// Adds the number of spaces for https so that the starting point of the replacement is correct
LEN($A6)+1,
//This finds the end of the space where text is supposed to be replaced/deleted. Added 1 to cover bases and potential spacing issues.
""
// this is what it will be replaced with…nothing.
)
// Ends the Replace function.
),
// Ends the IsError function.
"https://",""
// Once the result is determined (either after the end of the url is cut off or if it’s just a domain), delete the https:// so that just the domain shows
),
// Ends the substitute function if https:// is found.
),
"www.",""
// so whatever the result is, it should now just be a domain, but sometimes links have www for a site and sometimes not. So, we’ll delete the www if it’s there.
)
There are many great applications for this in search engine optimization (SEO), online marketing, publicity, analytics, web development, and many more areas where reporting and comparing data is helpful! Combine this with a VLOOKUP or SUMIF formula and you will be able to total up visits, backlinks, status, errors, and much more.
This worked great – almost 🙂 The “curly quotes” gave Excel heartburn, and it found one error that it made a recommendation to fix. It worked, and here’s the end result:
=IF($A2=””,””,SUBSTITUTE(IF(ISERROR(FIND(“https://”,$A2)),SUBSTITUTE(IF(ISERROR(FIND(“/”,SUBSTITUTE($A2,”http://”,””))),$A2,REPLACE($A2,FIND(“/”,SUBSTITUTE($A2,”http://”,””))+7,LEN($A2)+1,””)),”http://”,””),SUBSTITUTE(IF(ISERROR(FIND(“/”,SUBSTITUTE($A2,”https://”,””))),$A2,REPLACE($A2,FIND(“/”,SUBSTITUTE($A2,”https://”,””))+8,LEN($A2)+1,””)),”https://”,””)),”www.”,””))
Great, Rich! I appreciate that catch. I originally pasted the formula into iWorks Pages and it must have turned them there. Thanks for the fix. I’ll fix the code in the post above soon! Thanks again for the help!