World Wide Web

The browser (Firefox, Konqueror, ...) is an HTTP client that communicates with an HTTP server (Apache, ...).

Browser

Firefox

Firefox is released in different stages:

  • The nightly build and aurora stages have different icons that do not show the Firefox logo

  • Finally there are the beta and release stages, which have the familiar Firefox icons

To get Flash support emerge adobe-flash, or use https://www.youtube.com/html5 to get HTML5 instead.

Bookmarks are stored as JSON. Backups are in ~/.mozilla/firefox/<some random characters>.default/bookmarkbackups; they are plain backup files with the date in their filename. Firefox seems to be written in a way that bookmarks cannot be synchronized easily. It looks like a business decision to get user data off local computers and store it on a server; I do not want to guess at the motivation behind that.

HTTP

The Hypertext Transfer Protocol (HTTP) http://www.w3.org/Protocols/ is used to get the web pages to your computer.

A nice tutorial can be found under http://www.tutorialspoint.com/http/index.htm.

An HTTP server is usually a web server with a well known behavior, but it is also possible that another kind of server makes use of the HTTP protocol.

If a web server uses plain HTTP then telnet can be used to learn and test HTTP. Nowadays HTTPS has mostly replaced HTTP, so curl can be used instead.

Tests using telnet

Practical tests can be done on the console with telnet, which acts as a terminal taking the place of an HTTP client.

First open a connection to a host; you have to pass port 80 so telnet connects to the HTTP server:

telnet 127.1.1.0 80

then type in the request:

GET /index.html HTTP/1.0

and press return twice (officially CRLF, but just CR works as well) since HTTP wants a blank line. If everything works you get my HTML homepage back in ASCII. Note that the connection closes right after that; this is how HTTP works, it is called a connectionless protocol.
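
What comes back is a response consisting of a status line, some header fields, an empty line and the message body. As a rough sketch only, since the exact status line and header fields depend on the server:

HTTP/1.1 200 OK
Date: <date and time of the response>
Content-Type: text/html
Content-Length: <number of bytes in the message body>

<html> ... </html>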

Note

Header field names are not case sensitive; the request methods, however, are officially upper case only. Using upper case throughout also makes the commands easy to identify.

When more than one virtual host is on the server and a host other than the default one is desired, use:

telnet <IP address> 80

GET /index.html HTTP/1.1

Host:www.linurs.local

Looking closer at the protocol, the request consists of a header, an empty line and an optional message body. The header can be just a single line, as the GET line above, but can (or sometimes must) have other lines, called header fields. The same also works well for CGI scripts:

telnet <IP address> 80

GET /cgi-bin/getdate.cgi HTTP/1.1

There are different requests that can be sent to the HTTP server:

  1. GET is the most used request since it retrieves a message body that is usually a web page (note that you can test this way exactly what you get; with a full featured browser you are not so sure, since the browser might modify the data). GET has no message body, but can still pass data to the server packed into the URL: http://www.linurs.local/cgi-bin/script.cgi?name1=value1&name2=value2. On the server side this data can be found in the QUERY_STRING environment variable.

  2. HEAD is like GET but requests just the header and not the whole page; it is used to check links.

    telnet 127.1.1.0 80

    HEAD /index.html HTTP/1.0

  3. OPTIONS is used to see what the server supports. Usually the servers are quite restrictive and support just the common requests.

    telnet www.linurs.org 80

    OPTIONS / HTTP/1.0

  4. POST is used to call a script on the server and pass some data to it. The server must know what and how much it has to receive:

    telnet 127.1.1.0 80

    POST /cgi-bin/getdate.cgi HTTP/1.0

    Content-type:text/plain

    Content-length:10

    <CR>

    1234567890

    The data to be posted is in the message body, and the header contains the header field (line) Content-Length: <number of bytes in the message body> telling its size.

    Important

    To terminate successfully, the CGI script used above must return something like

    Content-Type: text/plain

    Ok

    A minimal sketch of such a script follows after this list.
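
The getdate.cgi script itself is not part of this document. Assuming a bash capable CGI setup on the server, a minimal sketch that would satisfy both the GET and the POST tests above could look like this:

#!/bin/bash
# Minimal CGI sketch: header field, empty line, then the message body
echo "Content-Type: text/plain"
echo ""
echo "Ok"
date
# Data passed with GET is found in the QUERY_STRING environment variable
echo "QUERY_STRING: $QUERY_STRING"
# Data passed with POST arrives on stdin; CONTENT_LENGTH tells how many bytes
if [ -n "$CONTENT_LENGTH" ]; then
    read -r -n "$CONTENT_LENGTH" body
    echo "Body: $body"
fi

The script typically has to be executable and placed in the cgi-bin directory of the web server.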

For sending name/value pairs GET is used; for sending files POST is used. If you issue an HTTP request the server responds and the browser (HTTP client) shows the response on the screen. If you just want to send data without updating anything on your screen you need something more, such as Ajax (Ajax still has to use HTTP, but it lets the browser update only certain things, or nothing at all, on the screen).

Testing using curl

curl (https://curl.se/) is a command line program used to communicate over the Internet; it is more advanced than using telnet.

curl has an online book: https://everything.curl.dev/

To get a web page: curl https://www.linurs.org

If the server responds that the web page has moved to another location, curl is able to follow the redirect to this new location: curl --location https://www.linurs.org

To save the web page to a file: curl --location https://www.linurs.org --output <page>.html

curl uses the HTTP method GET per default; to use HEAD: curl -I https://www.linurs.org

Use curl -v to be verbose and see the HTTPS certificate handling and the HTTP requests.

User authentication: curl --user <username>:<password> http://<website>. The password is not encrypted and can be captured with a sniffer program; HTTP is not safe, HTTPS is better.
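
curl can also build a POST request, similar to the telnet POST test above. A sketch reusing the getdate.cgi example and the example host from earlier:

curl --data "name1=value1&name2=value2" http://www.linurs.local/cgi-bin/getdate.cgi
curl --header "Content-Type: text/plain" --data "1234567890" http://www.linurs.local/cgi-bin/getdate.cgi

curl calculates the Content-Length header field automatically, so unlike with telnet it does not have to be counted by hand.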

HTML

See: https://www.w3schools.com/html/ and https://wiki.selfhtml.org/wiki/Startseite

The source code of a simple html page could look as follows:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> 
<html> 
  <head>
     <title>Sample HTML Page</title> 
  </head> 
  <body
    style="color: rgb(0, 0, 0);
    background-color: rgb(255, 255, 255);
    background-image: url(back.jpg);"
    alink="#ff0000" link="#0000ff" vlink="#ff00ff">
    <div style="text-align: left;">       
      <h1>Sample HTML Page</h1>      
      <ul>        
      <li>
        <a href="https://www.gentoo.org"> The Gentoo Home Page</a>
      </li>
      <li>
        <a href="https://www.linurs.org">My home page</a>
      </li>
      <br>
      </ul> 
   </div>    
   <b>
     <a href="../index.html">Home</a>
   </b>  
  </body>
</html>

The first line defines the page as HTML, including the HTML version used. The HTML content is then put between the opening tag <html> and the closing tag </html>. In simple words, HTML is about such tags; not every tag has both an opening and a closing tag, e.g. the <br> tag makes a line break on its own. Between <head> and </head> the HTML page identifies itself. Different tags are possible here, such as the <title> tag; those tags are not visible to the user but to the program reading the HTML page. The visible section is between <body> and </body>. The page above adds a background picture, aligns everything to the left, adds a visible title, a list with hyperlinks and a hyperlink to the top page of the web site. Inside the tags additional parameters (attributes) can be put:

<div style="text-align: left;"> how to align the text
<a href="https://www.gentoo.org"> the hyperlink </a>
</div>

This is how HTML was created; however, the inventor did not expect it to become such a success, so people used it for all kinds of things and wanted it to be better looking and more full featured. The result was a rather chaotic situation, which includes the documentation about HTML and web servers.

Making HTML nicer

<link rel="shortcut icon" href="https://www.gentoo.org/favicon.ico" type="image/x-icon">

creates an icon in the browser tab.

<link rel="search" ... .

is to support http://www.opensearch.org/Home so your web site is better found and analyzed by search engines.

<link rel="alternate" type="application/rss+xml" ... .

is to link to an RSS feed.

<title> appears in the window headline.

To have a nice layout, tables are often used and lists are avoided. Tables use the <table> tag; <tbody> is the table body, <thead> is the head row and <tfoot> is the foot row of the table.

<tr> is the table row

<td> is a cell in the row

The cells can span over rows and columns (merged cells); this is done with the rowspan and colspan attributes.

To make it a nightmare, cells can contain nested tables.

Tables with

<table border="0" ... .

look good but are a pain to troubleshoot, so set them temporarily to border="1".
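
As a minimal sketch (the cell contents are just made up), a table using these tags, with border="1" for troubleshooting and a cell spanning two columns:

<table border="1">
  <thead>
    <tr><td>Name</td><td>Price</td></tr>
  </thead>
  <tbody>
    <tr><td>Apples</td><td>2</td></tr>
    <tr><td>Pears</td><td>3</td></tr>
  </tbody>
  <tfoot>
    <tr><td colspan="2">Total: 5</td></tr>
  </tfoot>
</table>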

The meta tags

The goal of putting data onto the web is that others can read it, but first they must find it. Search engines such as http://www.google.com should find it. To let them know what your web page is about, add some meta tags between the <head> tags:

<meta name="description" content="<simple text describing the page>">
<meta name="keywords" content="<first keyword>, <second keyword>">

You can put your name on the page

<meta name="author" content="Urs Lindegger">

Or you can make the page automatically forward to another one:

<meta http-equiv="refresh" content="5; url=<newurl.html>">

Note

Since it is HTML and not XML, there is no </meta> close tag! This can cause trouble when HTML is produced out of XML. If a meta close tag is present, the page can be considered badly formatted and therefore its ranking can decrease.
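
For illustration, the plain HTML form versus the form that can come out of XML based tools:

<meta name="author" content="Urs Lindegger">        <!-- plain HTML, no close tag -->
<meta name="author" content="Urs Lindegger"></meta> <!-- produced from XML, to be avoided -->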

Working with HTML

  1. To create HTML, an HTML editor has to be used; this can be a dedicated editor or a simple text based editor.

  2. The result, the newly created web page, must be verified; this can be done in a regular browser (such as Firefox) or inside the HTML editor when this feature is available.

  3. The web page is probably created locally on a PC, and when finished it needs to be published to a server on the Internet; some HTML tools offer this feature, others use standalone tools to do the FTP copy.

  4. Finally a local web server can be used to test the server side tools (CGI scripts, ...).

There are different tools around to do all those things; however, most of them do not do everything. Tools that do everything are very complex and therefore not well suited to learning how it works.

HTML editors

Probably a couple of thousand books exist about HTML and how to create web pages.

Analyzing my problems with HTML editors, I found out that it is worth investing some time to understand the basic tags in HTML. Working with purely graphical HTML editors, a web page accumulates a lot of garbage tags over time that get neutralized by other tags, and this was what confused me. Less is more: creating simple web pages and understanding the HTML source inside is a guarantee for good quality pages compatible with most browsers, operating systems and hardware devices. This does not mean creating HTML pages with an ASCII editor, but it means taking a look at the HTML source while editing.

Many text editors have syntax highlighting and can therefore be used as HTML editors.

To not have to remember all HTML tags, an HTML editor is the way to go:

  1. http://bluefish.openoffice.nl/index.html has no integrated viewer, and upload to a web server is also missing.

  2. https://www.seamonkey-project.org/ (mozilla-application-suite): for Gentoo there is a binary package available, so it can be emerged quickly. If you start it, it is a browser that looks like Mozilla or Firefox. If you want to edit the file, just go to Edit and the Composer opens.

  3. http://www.bluegriffon.org/ (derived from nvu)

  4. http://www.screem.org/ uses an external viewer such as Firefox

  5. Quanta, just available under KDE

Uploading to the server

FTP is commonly used to upload the files to a server on the internet.

  1. Programs such as Quanta upload just what is necessary. Quanta has a project file that holds all files being considered.

  2. Seamonkey can upload HTML files opened for editing (in the Composer) on manual demand. Binary files such as images used in the HTML will not be uploaded.

  3. To have a tool that can upload, emerge gftp

  4. The Firefox extension FireFTP can be added; this plug-in works well. It is a file manager showing both sides (local and remote). Files can be copied from one side to the other, and there is also a synchronize command. Enable the timestamps feature to keep the files in sync; without that it just checks whether the files are there.

Managing your web page

You will not want to put just HTML on your page; you will also want to put other files there, such as images and sample code. Therefore create a directory structure that fits your needs. Don't create too many subdirectories for your files, since it is desirable that the links to those files do not change frequently. Therefore create directories such as image where you put all the pictures used on the web site. Your HTML browsing structure might change in the future and is therefore not the most maintenance friendly structure. Dealing with all files individually is a pain, but it is a good experience.

The expression content management system (CMS) is used for the programs that do this job. Examples of such programs are Joomla, MediaWiki and all kinds of other wikis. Joomla uses MySQL to store the contents of the web pages. Using templates and PHP, the MySQL data is converted to the HTML pages used by the web server. Everything that Joomla uses can be on the local hard disk, and when a web server is installed and configured it can be accessed as http://localhost/joomla

If you run KDE, you probably find Quanta, which comes with it. Quanta is not available as a separate ebuild and is therefore restricted to KDE users.

Quanta is more than just an HTML editor:

  1. Create a web project and add all the files to it

  2. Upload and synchronize all kinds of files to the Internet

  3. Integrated viewer

  4. KlinkStatus (plugin) verifies that all links are correct.

Validate Web pages

The extensions http://chrispederick.com/work/web-developer/ and http://getfirebug.com/ allow analyzing and verifying web pages. They are plug-ins for browsers such as Firefox.

Convert HTML to text

To convert HTML to text, a text based web browser such as lynx (or elinks, w3m) can be used:

lynx -dump myfile.html > myfile.txt
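
The other text browsers mentioned have a similar dump option (a sketch, assuming elinks and w3m are installed):

elinks -dump myfile.html > myfile.txt
w3m -dump myfile.html > myfile.txt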

Web analyzing tools

Having a web site is one thing, but there is also the demand that others will find the page. Many use https://www.google.com as search engine. Google has a tool for webmasters, https://www.google.com/webmasters/tools/, where you can get statistics but also configure things. To prove that you are really the owner of your web site, you must log in (e.g. with the gmail account that you use on an Android device). After that a dummy file is created that you need to download and then upload to your web site. Webmaster Tools has impression counters (how many times your website appeared in search results) and click counters (how many times somebody clicked on it). Crawling is what Google does to find your web pages; additionally you can create a sitemap (an XML file) to let Google know about your web pages.

Other tools are Google Analytics and Tag Manager, where some code on your web site is required. This allows going into more detail, such as where the visitors are coming from and what devices and browsers were used.

Finally, if you have a hosting service for your web site, don't miss the statistics there. They contain everything, not just what involved Google.

Making money with the web site

Finally, with Google AdSense (or https://wordpress.com/) you can rent out space on your websites and earn money. For that, HTML code needs to be inserted into your web page. There is a battle over who will put his advertisement on your web page; the one that pays the most wins. These pages got about one click every 1000 page views and an earning of 0.2 Euro per click. I told my friends that I'm testing AdSense and it seemed that one clicked too many times, so Google terminated my contract. Conclusion: do not talk to your friends.

Search robots

If somebody is looking for something, they use a search engine such as Google, and you probably would like them to find your website. Therefore search robots crawl your website to gather information that can then be used to see if its contents match the search request.

The titles of the web pages are used for that, but there are other things that improve it, such as adding meta data.

There is the robots.txt file that can be put into the website's top directory. Its purpose is to tell the search robot which directories should not be analyzed:

#This is a comment
User-agent: *
Disallow: /cgi-bin/
Disallow: /image/
Disallow: /overlay/
Disallow: /pdf/
Disallow: /sudokueditor/
Disallow: /*.css$
Disallow: /*.ico$
Disallow: /*.txt$
Disallow: /*.png$
Disallow: /*.xsl$

Sitemap: http://www.linurs.org/sitemap.xml

An additional way is listing all the files to be crawled. This can be done in a sitemap.xml file (therefore XML files must be allowed in robots.txt). The sitemap.xml file (https://www.sitemaps.org/protocol.html) follows an XML syntax such as:

<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.linurs.org/index.html</loc>
  </url>
  <url>
    <loc>http://www.linurs.org/aao/Batteries.html</loc>
  </url>
</urlset>

This sitemap file can be submitted to the search engine, e.g. via Google Webmaster Tools, but a link can also be added to robots.txt as shown above.

Privacy

It is a big business to spy on what persons do, also regular nice ones like (hopefully) you and me.

http://www.google.com is tracking your IP and all your search requests. If you do not want your IP to be tracked, then use https://startpage.com/eng/? which can be considered a Google front end that filters out all privacy related data before making the request to Google.

Cookie

Cookies are pieces of information stored on the web browser's computer. Cookies can be read and created by Javascript. Since HTTP is a stateless protocol, cookies are sent automatically back to the web server in the request header, allowing the web server to react to a previous data exchange. This means server side scripts can read the cookies.
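
This mechanism can also be observed with curl: the server sets a cookie with a Set-Cookie response header field and the client returns it in a Cookie request header field. A sketch (the website is just a placeholder):

curl --verbose --cookie-jar cookies.txt https://<website>
curl --verbose --cookie cookies.txt https://<website>

The first call stores the received cookies in cookies.txt, the second one sends them back; --verbose shows the header fields involved.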

In Firefox go to the website of interest, then Tools => Developer Tools => Storage to see the cookies.

