Removing Duplicate Domain URLs From a Text File Using Bash

2024/7/8 8:36:10

Text file

https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/

Expected Output:

https://www.google.com/1/
https://www.bing.com

What I Tried

awk -F'/' '!a[$3]++' "$file"

Output

https://www.google.com/1/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/

I have already tried several commands, and none of them works as expected. I just want to keep exactly one URL per domain from the list.

Please tell me how I can do this with a Bash script or Python.

PS: I want to keep and save the full URLs from the list, not only the root domains.

Answer

Use awk with / as the field separator, so that $3 holds the domain, and print a line only the first time its domain is seen:

awk -F '/' '!seen[$3]++' file

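If your file contains Windows line breaks (carriage returns), a line without a path, such as https://www.google.com, ends in a carriage return that becomes part of $3, so it no longer matches the www.google.com of the other lines. That is probably why your attempt printed extra URLs. In that case, I suggest stripping the carriage returns first: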

dos2unix < file | awk -F '/' '!seen[$3]++'

Output:

https://www.google.com/1/
https://www.bing.com
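
Since you also asked about Python, here is a minimal sketch of the same idea. It assumes every line is a full URL with a scheme such as https:// (otherwise urlsplit returns an empty domain). The function name first_url_per_domain and the sample list are my own, for illustration only; in practice you would pass an open file object instead of the list.

```python
from urllib.parse import urlsplit


def first_url_per_domain(lines):
    """Return the first URL seen for each domain, preserving input order."""
    seen = set()
    kept = []
    for line in lines:
        # strip() removes the newline and any Windows carriage return.
        url = line.strip()
        if not url:
            continue  # skip blank lines
        # Assumption: the URL has a scheme, so netloc is the domain part.
        domain = urlsplit(url).netloc.lower()
        if domain not in seen:
            seen.add(domain)
            kept.append(url)
    return kept


# Sample input from the question, with Windows line endings on purpose.
sample = [
    "https://www.google.com/1/\r\n",
    "https://www.google.com/2/\r\n",
    "https://www.google.com\r\n",
    "https://www.bing.com\r\n",
    "https://www.bing.com/2/\r\n",
    "https://www.bing.com/3/\r\n",
]

for url in first_url_per_domain(sample):
    print(url)
# prints:
# https://www.google.com/1/
# https://www.bing.com
```

Because each line is stripped before the domain is extracted, this version gives the same result whether the file uses Unix or Windows line endings.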