tpyo kingg
Posts: 1194
Joined: Mon Apr 09, 2018 5:26 pm
Location: N. Finland

Quick Notes on Measuring Web Bloat

Sat Nov 14, 2020 12:02 pm

tl;dr: in a non-scientific survey, web pages appear to contain very little information relative to their volume, with JavaScript being one of the main culprits. Over 99% of the bytes transmitted for each web page's initial rendering seem to be waste. The "Additional Reading" section below may point to some better studies. I'll just throw my results over the fence here today, since I have no contacts in academia or defense.
  • Overview
  • Methods
    • Step 01: Spidering
    • Step 02: Extract All Links
    • Step 03: Cull non-text material
    • Step 05: Eliminate Redundant Links
    • Step 06: Fetch and Cache Pages
    • Step 07: Measure the Sizes in Bytes
    • Step 08: Calculation of the Summary
    • Step 09: Examining Samples
  • Conclusion
  • Administrivia
  • Additional Reading
---

This material might be relevant to the Raspberry Pi because leaner, more efficient sites are quite usable even on the older hardware models when those are used as desktops or kiosks; the bloated pages are not. As for defense, nearly everything is done online and, since March 2020, must be, so fast, efficient transfers are essential even in times of reduced network capacity or availability. Fast transfers are not possible with heavy pages. Some countries are rumored to require low-bandwidth alternatives for essential state and business services, though it is clear that if such a requirement exists anywhere it is roundly ignored.

This was done out of curiosity: last year I very informally and non-scientifically quantified a best-case view of web page bloat using a 3B+. I encourage follow-ups, as it would be an easy enough classroom project with any networked model of Raspberry Pi and, depending on the depth and strictness of the methods, suitable for any level of study from high school to graduate. There are certainly flaws in my method, and probably typos, but these rudimentary findings are strongly backed up by the links in the "Additional Reading" section below. The work was probably about 10 hours over a few widely scattered days, excluding this write-up.

Overview

Again, this was all done out of my own curiosity about the state of the sites on the current World Wide Web. I wanted to know how much of the web is content and how much is overhead, in the context of written articles and blog posts.

I sampled over 5,000 links to general-audience articles from a discussion forum, hoping that these would contain the highest ratio of content to overall volume. The links were mainly to text-intensive pages. I wrote some simple scripts to harvest the pages and compare the volume of each full page to the volume of the actual text it enclosed. The sample I chose should have been heavily biased towards showing a high content-to-overhead ratio; I worry that maybe it was.

However, going by the arithmetic means for the sizes of the pages and their contents, roughly 1% of the material transmitted for each page is information, that is to say human-readable text; the rest is overhead. Going by the median sizes, roughly 3% of the material transmitted for each page is information. By the trimmed mean, it is about 2%. That count includes only the first round of javascripts, not the additional scripts or other objects which the scripts might pull in to an arbitrary depth if allowed to activate.

The summary was generated with R, using the base "summary" function and the "psych" package's "describe" function. The numbers below are sizes in bytes:

Code: Select all

   full page            text-only
 Min.   :      8207   Min.   :    348
 1st Qu.:    125094   1st Qu.:  10461
 Median :    573211   Median :  16924
 Mean   :   2782628   Mean   :  29750
 3rd Qu.:   2208760   3rd Qu.:  28783
 Max.   :2398418030   Max.   :3264621

          vars    n       mean         sd median    trimmed      mad  min
full page    1 5367 2782628.46 34839688.1 573211 1121306.96 752425.4 8207
text-only    2 5367   29750.26    74237.9  16924   19779.49  11783.7  348

                 max      range  skew kurtosis        se
full page 2398418030 2398409823 61.93  4182.96 475563.44
text-only    3264621    3264273 24.40   887.09   1013.35
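As a sanity check, those headline percentages fall straight out of the table above by dividing the text-only figures by the full-page figures. A quick awk one-off, with the numbers copied from the summary rather than recomputed from the raw data:

Code: Select all

awk 'BEGIN {
        # ratio of text-only bytes to full-page bytes, from the table above
        printf("mean     %.1f%%\n", 100 * 29750.26 / 2782628.46);
        printf("median   %.1f%%\n", 100 * 16924    / 573211);
        printf("trimmed  %.1f%%\n", 100 * 19779.49 / 1121306.96);
}'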
Methods

It took eight steps to generate the above summary.

Step 01: Spidering

I used wget to harvest pages from the discussion forum Soylent News. The pages run from early February 2014 to mid-September 2019.

Code: Select all

#!/bin/sh

# 2019-08-13 
# Fetch all pages from SN to date
# thanks to TMB and martyb

PATH=/usr/local/bin:/usr/bin:/bin;

d="20140212";
delta=0;

while test $d -lt $(date +"%Y%m%d");
do
        d=$(date -d "20140212 + $delta days" +"%Y%m%d");
        wget \
                --continue \
                --directory-prefix='cache' \
                "https://soylentnews.org/index.pl?issue=$d";
        delta=$((delta+1));
        sleep 2;
done;

exit 0;
Because those are daily snapshots of the site and some posts may persist across the start of the new day, there will be some redundancies. Those will be removed later.

Step 02: Extract All Links

Then I ran a Perl script over the collected material to extract its links by means of XPath expressions. The result was a file with 98,484 links.

Code: Select all

#!/usr/bin/perl

# 2019-08-13
# Accept one or more file names and parse them for links, accepts globbing

use File::Glob ':bsd_glob';
use HTML::TreeBuilder::XPath;

use strict;
use warnings;

my $file;
$file = '/dev/stdin' if ($#ARGV < 0);

while ($file = shift) {
    my @filelist = bsd_glob($file);

    foreach my $infile ( @filelist ) {
	&get_links($infile);
    }
}
    
exit(0);

sub get_links {
    my ($infile) = @_;

    my $root = HTML::TreeBuilder::XPath->new;
    $root->parse_file($infile);
    $root->eof();

    # links appear both in the story intro and in the extended text
    my @xpaths = (
        '//div[@class="body"]/div[@class="intro"]/p/a/@href',
        '//div[@class="body"]/div[@class="story_more full"]/p/a/@href',
    );

    foreach my $xp (@xpaths) {
        foreach my $d ($root->findnodes($xp)) {
            if ( $d->isAttributeNode ) {
                print $d->string_value, qq(\n);
            } else {
                print $d->as_text, qq(\n);
            }
        }
    }

    $root->delete;
}
Step 03: Cull non-text material

Then I culled the links, keeping only HTTP and HTTPS while trying, probably imperfectly, to eliminate images and other non-textual material. I also skipped the discussion site's internal links. The first file was thus whittled down to 64,441 links.

Code: Select all

#!/usr/bin/perl

# 2019-08-13
# Accept one or more file names and parse them for links, accepts globbing
# choose only certain links: HTTP, HTTPS, no PDF

use File::Glob ':bsd_glob';
use URI;

use strict;
use warnings;

my $file;
$file = '/dev/stdin' if ($#ARGV < 0);

while ($file = shift) {
    my @filelist = bsd_glob($file);

    foreach my $infile ( @filelist ) {
	&select_links($infile);
    }
}
    
exit(0);

sub select_links {
    my ($infile) = (@_);
    
    open(my $in, '<', $infile)
	or die("Cannot open '$infile' for reading: $!\n");

    while (my $uri = <$in>) {
	chomp($uri);
	$uri = &get_uri($uri);
	print $uri,qq(\n) if ($uri);
    }
    
    close($in);
    
}

sub get_uri {
    my ($uri) = (@_);

    my $u = URI->new($uri);

    return (0) unless (defined($u->scheme) && $u->scheme=~/^http/);
    return (0) unless (defined($u->path));
    return (0) if ($u->path =~ m/\.pdf/i);
    return (0) unless (defined($u->host));
    return (0) if ($u->host =~ m/^reversethis-/);
    return (0) if ($u->host =~ m/\{at\}/);
    return (0) if ($u->host =~ m/soylentnews.org$/);
    $u->fragment(undef);
    
    return ($u->canonical);
}
Step 05: Eliminate Redundant Links

Since the original data was spidered more or less in chronological order and managed in text files, it was possible to pick the newest link from each domain if a site had multiple links in the list.

Code: Select all

#!/usr/bin/perl

# 2019-09-13
# Accept one or more file names and parse them for links, accepts globbing
# choose only the latest link from each site

use File::Glob ':bsd_glob';
use URI;

use strict;
use warnings;

my $file;
$file = '/dev/stdin' if ($#ARGV < 0);

while ($file = shift) {
    my @filelist = bsd_glob($file);

    foreach my $infile ( @filelist ) {
	&select_links($infile);
    }
}
    
exit(0);

sub select_links {
    my ($infile) = (@_);
    
    open(my $in, '<', $infile)
	or die("Cannot open '$infile' for reading: $!\n");

    my ($site, %fq, %art);
    while (my $uri = <$in>) {
	chomp($uri);
	($site, $uri) = &get_uri($uri);
	# print $site,qq(\t),$uri,qq(\n) if ($uri);

	next unless ($site);
	# track how many times that site is used
	$fq{$site}++;

	# remember only the latest article from each site
	$art{$site} = $uri;
    }
    
    close($in);

    my ($sum,$count);
    my @f;
    foreach my $k (sort {$fq{$b} <=> $fq{$a} || $a cmp $b} keys %fq) {
	printf (qq(% 4d\t%s\n), $fq{$k},$k);
	$sum += $fq{$k};
	$count++;
	push(@f,$fq{$k});
    }
    print qq(unique $count\n);
    print qq(sum total $sum\n);
    printf( qq(average %4.3f\n),$sum/$count);
    print qq(median ),$f[int($count/2+.5)],qq(\n);
}

sub get_uri {
    my ($uri) = (@_);

    my $u = URI->new($uri);

    return (0,0) unless (defined($u->scheme) && $u->scheme=~/^http/);
    return (0,0) unless (defined($u->path));
    return (0,0) if ($u->path =~ /\.pdf/i);
    $u->fragment(undef);
    
    return ($u->host,$u->canonical);
}
Then only unique links were kept, one per domain, just in case something slipped through. The redundancies and a known spam site were discarded and the result was a list of only 7,394 links.

Code: Select all

#!/usr/bin/perl

# 2019-09-18
# Accept one or more file names and parse them for links, accepts globbing
# return only one link per domain

use File::Glob ':bsd_glob';
use URI;

use strict;
use warnings;

my $file;
$file = '/dev/stdin' if ($#ARGV < 0);

our %URIs;

while ($file = shift) {
    my @filelist = bsd_glob($file);

    foreach my $infile ( @filelist ) {
	&select_links($infile);
    }
}

foreach my $key (sort keys %URIs) {
    print $URIs{$key},qq(\n);
}

exit(0);

sub select_links {
    my ($infile) = (@_);
    
    open(my $in, '<', $infile)
	or die("Cannot open '$infile' for reading: $!\n");

    while (my $uri = <$in>) {
	chomp($uri);
	# get_uri() records one canonical link per host in %URIs as a side effect
	&get_uri($uri);
    }
    
    close($in);
    
}

sub get_uri {
    my ($uri) = (@_);

    my $u = URI->new($uri);

    return (0) unless (defined($u->scheme) && $u->scheme=~/^http/);
    return (0) unless (defined($u->path));
    
    return (0) if ($u->path eq '/');
    return (0) if ($u->path =~ m/\.pdf/i);
    return (0) unless (defined($u->host));
    return (0) if ($u->host =~ m/^reversethis-/);
    return (0) if ($u->host =~ m/microsoft.com/);
    return (0) if ($u->host =~ m/soylentnews.org$/);
    $u->fragment(undef);

    $URIs{$u->host} = $u->canonical unless(defined($URIs{$u->host}));
    
    return ($u->canonical);
}
Step 06: Fetch and Cache Pages

A shell script read that new list and fetched the pages using lynx and wget, storing the results in two caches: one for the first level of objects and one for the plain, human-readable text. This could have been done massively in parallel, since each site contributed only one link. However, it was done in batches over most of a day. Bash's wait builtin has a -n option which would have been useful here if Bashisms had been acceptable; a sketch of that variant follows the script below.

Code: Select all

#!/bin/sh

# 2019-09-21
# accept URLs as input and download each URL twice, 
# once as text only and once as a full page, etc

# timeouts are important, some sites hang on lynx

fetch() {
	local site="$1";
	local url="$2";
	local c=0;	# count retries

	# exit function if lynx fails
	if ! timeout 45 lynx \
	    -connect_timeout=20 \
	    -read_timeout=20 \
	    -dump \
	    $url > $prefix/$site.txt 2>/dev/null; then
		rm $prefix/$site.txt;
		return 0;
	fi

	# if lynx worked, try wget
	while ! wget \
	    --page-requisites \
	    --convert-links \
	    --directory-prefix=$prefix/$site \
	    --timeout=10 \
	    --tries=2 \
	    --quiet \
	    $url; do
		c=$((c+1));
		if test 0$c -gt 3; then
			break;
		fi
	done;

	if test 0 -eq 0$c; then
		return 1;
	else
		return 0;
	fi;
}

# make a directory to keep the downloaded files in
prefix=$(date +'cache-%F');
test -d $prefix || mkdir -m 755 $prefix || exit 1;

s=0;	# count sites attempted

while read uri; do
	# URIs are more complicated than that but this works with this data
	site=$(echo $uri | sed -E 's|.*://+||; s|/.*$||');
	s=$((s+1));

	fetch "$site" "$uri" &

	# wait after every one hundredth query until they are done
	if test 0 -eq $((s%100)); then
		wait;
		sync;
	fi
done;

exit 0;
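For reference, here is an untested sketch of the wait -n variant mentioned above. It keeps a fixed number of fetches in flight instead of batching by hundreds, reuses the fetch() function and $prefix setup from the script above, and needs Bash 4.3 or newer rather than plain sh; the pool size of 8 is just a guess for a 3B+.

Code: Select all

#!/bin/bash

# keep at most $max_jobs fetches running at once instead of waiting
# after every one hundredth query; requires Bash >= 4.3 for 'wait -n'
# plus the fetch() function and $prefix setup from the script above

max_jobs=8;

while read -r uri; do
	site=$(echo "$uri" | sed -E 's|.*://+||; s|/.*$||');

	fetch "$site" "$uri" &

	# when the pool is full, wait for any one job to finish
	while test "$(jobs -rp | wc -l)" -ge "$max_jobs"; do
		wait -n;
	done;
done;

wait;	# let the last fetches finish before exiting

exit 0;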
Step 07: Measure the Sizes in Bytes

I used a shell script to check the sizes of the pages in the cache, comparing each full page's size to its text-only counterpart. This was done by checking the disk usage of each page's cache directory and of its text-only surrogate. The result was a tab-delimited table suitable for feeding into spreadsheets or, as in the last step, into the R script.

Code: Select all

#!/bin/sh

# 2019-11-26

printf "% 10s\t% 10s\t%s\n" "full page" "text-only" "site name";

file *.txt | awk '$2~/text/{print $1}' FS='.txt:'  \
| while read site; do
        test -d $site || continue;
        
        full=$(du -bs $site | awk '{print $1}');
        text=$(du -bs $site.txt | awk '{print $1}');
        
        printf "% 10s\t% 10s\t%s\n" $full $text $site;
done 
Then I manually reviewed the lower end of these figures, marking error pages, warnings, redirects, geoblocks, and similar responses as duds and excluding them from further analysis.
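That review was manual, but finding the candidates is easy to script. A minimal sketch, assuming the table from Step 07 was saved under the file name used in the R script below, 2019-sn-links.05.bytes.tab:

Code: Select all

# list the 50 smallest text-only dumps; error pages, redirects,
# and geoblock notices tend to cluster at this end of the range
tail -n +2 2019-sn-links.05.bytes.tab | sort -k 2,2n | head -n 50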

Step 08: Calculation of the Summary

An R script used the "psych" library to read and process the tab-delimited table. It produced both the table shown in the summary above and some scatter plots (not shown here).

Code: Select all

#!/usr/bin/Rscript

# thanks to VH (PhD) for statistical advice and modifications

# 2019-11-26
# read in a tab-delimited file and calculate the standard deviation
# for the first two columns
# file="~/SN/Round01/2019-sn-links.05.bytes.tab"

library("psych")
library("parallel")

args = commandArgs(trailingOnly=TRUE)
file = args[1]

no_of_cores = detectCores()

d<-read.csv(file,header=TRUE,sep="\t")

print ("Rows and columns")
nrow(d)
ncol(d)

d2<-subset(d,full.page<342000000)	# Restrictions to the data

d3<-d2[,c(1,2)]			# Taking only the needed variables
d3$proportion<-(d3$text.only/d3$full.page)*100	# Compute the proportion

d4<-subset(d3,proportion<=100)		# Restrict the proportion to 100 %

# Summary to check the distribution of proportion numerically
summary(d4$proportion)
describe(d4$proportion, na.rm = TRUE)

plot(d4$full.page,d4$text.only)	# Make a scatter plot

# Same, but using full in x and proportion in y
plot(d4$full.page,d4$proportion,cex.axis=3,cex.lab=3)

quit()
Step 09: Examining Samples

Then I skimmed through the markup of a few dozen of the cached pages to get an impression of their structure and components. The pages were chosen from the low end and the high end of the size range.
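The skimming was done by eye, but a rough tally of the moving parts in any single cached page can be made along these lines. The path below is only illustrative, not one from the actual data set:

Code: Select all

# count the lines containing script tags and list the external .js
# sources in one cached page; the path is just an example
page="cache-2019-09-25/www.example.com/index.html";

grep -c -i '<script' "$page";
grep -o -i 'src="[^"]*\.js[^"]*"' "$page" | sort | uniq -c | sort -rn;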

Conclusion

Again, the above was all non-scientific and informal, done out of curiosity in a quick-and-dirty fashion. Also, as mentioned above, the results probably represent a best-case scenario. The actual situation for the average web site would show a far worse ratio of content to overall size. A focus on social control media sites (Twitter, Facebook's WhatsApp, and so on) would show an even worse ratio, perhaps by several orders of magnitude, but that is just a guess for now. So the real ratio of content to overall size is almost certainly lower than 1%.

The solution is easy, if the will to fix things is present. No new skills are needed to make the improvements, only minor changes in attitudes, priorities, or workflow during design and implementation. Proper education in lean, efficient design for web sites would go a long way towards reducing load times and electricity use, along with the resulting climate impact. Further, smaller pages would offer greater resilience in times of reduced network capacity, such as during physical outages, conflicts between backbone services, or distributed denial-of-service attacks.


Administrivia

In order to have enough space on the 3B+ for the cache, it was necessary to add some USB storage. Since what I had free at that particular time were several small thumb drives, I used those as a RAID unit.

Code: Select all

$ sudo mdadm --brief /dev/md0
/dev/md0: 22.35GiB raid5 4 devices, 0 spares. Use mdadm --detail for more detail.

$ df -Ph /home/pi/SN/Round_01/cache-2019-09-25
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0         22G   15G  5.9G  72% /home/pi/SN/Round_01/cache-2019-09-25

Once the devices were wiped and partitioned for RAID, not shown here, the rest was easy.

Code: Select all

fdisk -l | grep dev/sd

fdisk /dev/sda
fdisk /dev/sdb
fdisk /dev/sdc
fdisk /dev/sdd

mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd

mdadm --detail /dev/md0

mkfs -t ext4 /dev/md0
After I was done with the data above, I could test several RAID5 recovery scenarios, but that is off the topic of this post and not included here.

The thumbdrives have long since been repurposed.


Additional Reading

The links below are mostly in the random order in which I found them while looking for other things. There is certainly more material, but there are no services filling the role of library catalogs capable of author, subject, or keyword searches. Full-text search engines are not up to snuff and, furthermore, seem to hide older pages even when they still exist.

"CO2 emissions on the web" Danny van Kooten. (2020)
https://dannyvankooten.com/website-carbon-emissions/

"How to speed up WordPress for a faster, greener and eco-friendly site" Marko Saric. (2020)
https://markosaric.com/speed-up-wordpress/

"To save the economy we must reduce video bandwidth use — now (updated)" Johnny Evans. (2020)
https://www.computerworld.com/article/3 ... dated.html

"Europe urges streamers to limit service amid network pressure" Jason Aycock. (2020)
https://seekingalpha.com/news/3553373-e ... k-pressure

"The Website Obesity Crisis" Maciej Cegłowski. (2015)
https://idlewords.com/talks/website_obesity.htm

"This website is killing the planet" Steve Messer. (2020)
https://visitmy.website/2020/07/13/this ... he-planet/

"Webwaste" Gerry McGovern. (2020)
https://alistapart.com/article/webwaste/

"Page Weight Matters" Chris Zacharias. (2012)
https://blog.chriszacharias.com/page-weight-matters

"Is it morally wrong to write inefficient code?" Tom Gamon. (2020)
https://tomgamon.com/posts/is-it-morall ... ient-code/

"Average Web Page Breaks 1600K" Website Optimization, LLC. (2014)
http://www.websiteoptimization.com/spee ... -web-page/

"Prioritizing Web Usability" Jakob Nielsen and Hoa Loranger. (2006)
ISBN 0-321-35031-6

Heater
Posts: 19722
Joined: Tue Jul 17, 2012 3:02 pm

Re: Quick Notes on Measuring Web Bloat

Sat Nov 14, 2020 3:55 pm

JavaScript always gets a bad rap for so-called "page bloat".

It's really not fair. JS is just a programming language, delivered as source code. One can cram a lot of functionality in JS into the same space as a small image download or a font file. Check the junk downloaded by this forum page, for example.

I will admit JS is abused. Pages will load huge libraries of JS, most of which is not actually used. Worse still, they will pull in JS from advertising businesses from God knows where.
Slava Ukrayini.

tpyo kingg
Posts: 1194
Joined: Mon Apr 09, 2018 5:26 pm
Location: N. Finland

Re: Quick Notes on Measuring Web Bloat

Sat Nov 14, 2020 4:19 pm

Heater wrote:
Sat Nov 14, 2020 3:55 pm
Pages will load huge libraries of JS, most of which is not actually used. Worse still, they will pull in JS from advertising businesses from God knows where.
Yes, I too suspect that very little of those libraries is actually used. I'd go further and say that they may often be for eye candy which can be avoided (for usability) or replaced with CSS (for accessibility), or for processing which usually has to be done server-side anyway for security. However, I did not spend any time on the scripts. In Step 06 I checked the first layer and the first layer only; I didn't process the scripts to see what else they would pull in, but I did include normal images. What I found was bad enough, especially since the target pages were collected in such a way as to lend a heavy bias towards a high information-to-volume ratio. Other people have examined the complete page, after all the scripts are pulled in, and found the average was already over 100 objects back in 2014.
You can also see a lot of that anecdotally. Load a page in the browser, press Ctrl-Shift-I, choose Network, and then reload the page. The browser will then show a breakdown of the various objects pulled in and the time they took to load. I feel for those on very slow or high-latency connections, or those with metered bandwidth or bandwidth caps.
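A rough command-line version of the same check is possible too, reusing the wget and du combination from steps 06 and 07; the URL below is just a placeholder:

Code: Select all

# fetch one page plus its first-level requisites, then report the
# total bytes on disk and the number of objects pulled in
url="https://www.example.com/";

wget --page-requisites --quiet --directory-prefix=/tmp/pagecheck "$url";
du -bs /tmp/pagecheck;
find /tmp/pagecheck -type f | wc -l;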

I guess my point is that not only is there no need for the extras, there is active harm in them.

As for the ads, in addition to slowing things down, third-party advertising networks are a vector for malware; even the New York Times has spread some via its advertising. But I didn't look at the ads either, just at the ratio of "content" to the total number of bytes downloaded.

Heater
Posts: 19722
Joined: Tue Jul 17, 2012 3:02 pm

Re: Quick Notes on Measuring Web Bloat

Sat Nov 14, 2020 4:31 pm

tpyo kingg wrote:
Sat Nov 14, 2020 4:19 pm
Load a page in the browser, press Ctrl-Shift-I, choose Network, and then reload the page. The browser will then show a breakdown of the various objects pulled in and the time they took to load. I feel for those on very slow or high-latency connections, or those with metered bandwidth or bandwidth caps.
I find myself playing with the browser dev tools like that a lot.

On sensible pages, even when they reasonably require JS to deliver what they are offering, I usually find the amount of JS code downloaded is comparable to the amount of images, fonts, CSS and so on.

On crazy, ad-ridden pages, give up hope. Try not to go back there.
Slava Ukrayini.
