Bosse_B
Posts: 1184
Joined: Thu Jan 30, 2014 9:53 am

How to extract an url from a page source in a script

Wed Aug 04, 2021 8:29 pm

I have an RPi4 box running a few scripts which depend on a webpage for which the URL changes at irregular intervals.
The current URL for the webpage is contained on a different webpage as a link to the wanted page.
When the wanted page changes this is reflected in the second webpage.

So now I would like to script the extraction of that url so I can use the extracted url inside my main script.
The idea for this is:

Code: Select all

wget <URL of index page> -O checkpage.txt
someclevercommand_to_extract_URL from file checkpage.txt
echo $extractedurl > urlfile.txt
Then inside the script using the url I can just read the content of urlfile.txt:

Code: Select all

read actualurl < [full path to urlfile.txt]
Obviously someclevercommand_to_extract_URL needs to read the downloaded file from wget and find the position inside the file of the wanted url.
In fact the file contains this in one place only:

Code: Select all

 <a href="https://youtu.be/xxxxxxxxx">
Where xxxxxxxxx is what I want to extract for use.

Can someone please suggest a simple bash command to extract this URL?
Bo Berglund
Sweden

trejan
Posts: 3856
Joined: Tue Jul 02, 2019 2:28 pm

Re: How to extract an url from a page source in a script

Wed Aug 04, 2021 8:48 pm

Code: Select all

curl -s <URL of index page> | grep -Po '(?<=href="https://youtu.be/)[^"]*(?=")' > urlfile.txt
or

Code: Select all

curl -s <URL of index page> | grep -Po '(?<=href=")[^"]*(?=")' > urlfile.txt
As this appears to be a shell script then you can skip the temporary file entirely.

Code: Select all

actualurl=`curl -s <URL of index page> | grep -Po '(?<=href=")[^"]*(?=")'`
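
The first pattern can be checked offline before pointing it at the real page. This is only a sketch: the sample HTML is made up for illustration (the real index page is not shown in this thread), and `$(...)` is used in place of backticks, which behaves the same here.

```shell
#!/bin/bash
# Hypothetical sample standing in for the downloaded index page.
sample_html='<p><a href="https://example.com/other">other</a></p>
<p><a href="https://youtu.be/abc123XYZ">video</a></p>'

# The lookbehind (?<=...) must match immediately before the result, and the
# lookahead (?=") stops at the closing quote; -o prints only the matched part.
video_id=$(printf '%s\n' "$sample_html" | grep -Po '(?<=href="https://youtu.be/)[^"]*(?=")')

echo "$video_id"
```

With this sample only the second link is matched, because the lookbehind requires the youtu.be prefix.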

Bosse_B
Posts: 1184
Joined: Thu Jan 30, 2014 9:53 am

Re: How to extract an url from a page source in a script

Wed Aug 04, 2021 11:29 pm

Thanks for the suggestions.
The token to find in the file has to be

Code: Select all

https://youtu.be/
because this is the text preceding the item I need to extract.
There are lots of href= etc in the webpage but only the token above is followed by the string I need.
The link present in the webpage looks like this:

Code: Select all

<a href="https://youtu.be/********">
And I need to extract the ******** part.

I tried this (with xxxxxxxxx replaced by the real url):

Code: Select all

#!/bin/bash
baseurl=http://xxxxxxxxx.com/
token=https://youtu.be/
actualurl=`curl -s "$baseurl" | grep -Po '(?<=${token})[^"]*(?=")'`
echo "${token}${actualurl}"
But when I run this I get this response:

Code: Select all

$ ./testscript.sh
https://youtu.be/
So clearly I am missing something since it comes up empty-handed.

trejan
Posts: 3856
Joined: Tue Jul 02, 2019 2:28 pm

Re: How to extract an url from a page source in a script

Wed Aug 04, 2021 11:41 pm

If you want to use a token variable, close the single-quoted string just before the variable and reopen it just after. bash does no substitution inside single quotes, so your change made grep look for the literal text ${token} instead of its value.

Code: Select all

actualurl=`curl -s "$baseurl" | grep -Po '(?<='${token}')[^"]*(?=")'`
The alternative is to use double quotes, inside which bash does perform the substitution, but you then need to escape each " inside the string.

Code: Select all

actualurl=`curl -s "$baseurl" | grep -Po "(?<=${token})[^\"]*(?=\")"`
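
To see what grep actually receives in each case, the three quoting styles can be printed side by side. This is a small sketch, not part of the original script; the variable names `single`, `double` and `mixed` are made up for the comparison.

```shell
#!/bin/bash
token='https://youtu.be/'

# Single quotes: no expansion, so grep would see the literal text ${token}.
single='(?<=${token})[^"]*(?=")'

# Double quotes: ${token} is expanded; inner double quotes need a backslash.
double="(?<=${token})[^\"]*(?=\")"

# Closing and reopening the single quotes around the variable. Quoting the
# expansion as well guards against word splitting if the token had spaces.
mixed='(?<='"${token}"')[^"]*(?=")'

printf '%s\n' "$single" "$double" "$mixed"
```

The second and third lines of output are identical, which is why both suggested forms work.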

Bosse_B
Posts: 1184
Joined: Thu Jan 30, 2014 9:53 am

Re: How to extract an url from a page source in a script

Thu Aug 05, 2021 10:28 am

Thanks for your suggestions!
However, when I tried to check operations today with your changes nothing worked!
Turns out that overnight the reference webpage has disappeared!
Now waiting for it to re-emerge.

I will try to create a test page with the same content on my own website to test the script with.

LATER:
I have now created a test webpage from a downloaded copy of the actual, now-disappeared page that I happened to have saved.
I entered a bogus youtube url into this test page where the actual link would be.

This is the script I used:

Code: Select all

#!/bin/bash
baseurl=https://mydomain.com/extract_test.html
token=https://youtu.be/
#Using direct literals:
#actualurl=`curl -s "$baseurl" | grep -Po '(?<=href="https://youtu.be/)[^"]*(?=")'`
#Using token instead of direct literals:
actualurl=`curl -s "$baseurl" | grep -Po "(?<=${token})[^\"]*(?=\")"`
echo "${token}${actualurl}"
exit
Test execution yields:

Code: Select all

$ ./testscript.sh
https://youtu.be/xyzabcdef
The token usage as well as the direct literal both work.

Bosse_B
Posts: 1184
Joined: Thu Jan 30, 2014 9:53 am

Re: How to extract an url from a page source in a script

Sun Aug 08, 2021 5:13 am

FINISHING UP
Just to complete the thread after the reference webpage re-appeared:
I could now test the script on the real page, and the extraction command also worked on that page!
So here is the final script to get the wanted Youtube URL from the page:

Code: Select all

#!/bin/bash
baseurl=http://referencepage.com/
token=https://youtu.be/
actualurl=`curl -s "$baseurl" | grep -Po "(?<=${token})[^\"]*(?=\")"`
echo "${token}${actualurl}"
The test run yielded (URL obfuscated):

Code: Select all

$ ./testscript.sh
https://youtu.be/DM-I6Uuxxxxx
Again, thank you to @trejan

Bosse_B
Posts: 1184
Joined: Thu Jan 30, 2014 9:53 am

Re: How to extract an url from a page source in a script

Fri Sep 24, 2021 6:35 am

Update for future reference:
This script no longer works as written because the target page changed from http: to https:
So if anyone finds this in the future:
Be sure to use the correct start of the URL!
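
One possible way to make the script less sensitive to such a change is sketched below. It assumes the old http: address redirects to the new https: one, which this thread does not confirm. The function names `extract_id` and `fetch_video_url` are made up for illustration. The sketch also reports an error instead of printing a bare token when nothing is found.

```shell
#!/bin/bash
# Extract the video ID from HTML on stdin. \K drops the text matched so far
# from the result, and the escaped dot matches a literal "." only.
extract_id() {
    grep -Po 'href="https://youtu\.be/\K[^"]+' | head -n 1
}

# Fetch a page and print the full link, or return non-zero if none is found.
# -L follows redirects (such as http: to https:), -f fails on HTTP errors,
# -sS stays quiet but still shows error messages.
fetch_video_url() {
    local id
    id=$(curl -fsSL "$1" | extract_id)
    if [ -z "$id" ]; then
        echo "no youtu.be link found at $1" >&2
        return 1
    fi
    echo "https://youtu.be/${id}"
}
```

It could then be called as `actualurl=$(fetch_video_url "http://referencepage.com/") || exit 1`, so the main script stops rather than continuing with an empty URL.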
