Hey,
sorry for being late.
About three months ago I posted the same question here and got no answers.
Later I solved the problem as follows (it is the same approach you have in mind):
1- First, check whether the domain name uses one of the Arabic country-code domains, which are
( .ps .sy .bh .dz .eg .iq .jo .kw .lb .ly .ma .om .qa .sa .sd .so .ae .tn .ye .mr)
If not, check whether the site contains a language tag indicator.
2- Second, check the meta tag charset:
if Arabic encoding then
    Arabic
else
    if UTF-8
        do some processing here *
    else
        not Arabic
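The decision in step 2 could be sketched roughly like this. The function name, the return labels, and the exact charset labels are my own assumptions for illustration, not part of the original code:

```php
<?php
// Hypothetical helper: classify a page by the charset declared in its HTML.
function classify_by_charset(string $html): string
{
    // Arabic-specific encodings; the exact labels are assumptions.
    $arabic_charsets = ['windows-1256', 'iso-8859-6', 'asmo-708', 'dos-720'];

    // Matches both <meta charset="..."> and content="text/html; charset=...".
    if (!preg_match('/charset\s*=\s*["\']?\s*([a-z0-9_-]+)/i', $html, $m)) {
        return 'not arabic'; // no charset declared at all
    }
    $charset = strtolower($m[1]);

    if (in_array($charset, $arabic_charsets, true)) {
        return 'arabic';
    }
    if ($charset === 'utf-8') {
        return 'needs content check'; // the "*" step: look at the text itself
    }
    return 'not arabic';
}
```

A UTF-8 page only gets the 'needs content check' answer here, because the charset alone says nothing about the language.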
* The whole idea here is the UTF-8 processing step.
As you know, every character has a unique Unicode code point, and therefore a unique byte sequence in UTF-8, so you can take the most frequent Arabic characters (alef and lam, for example) and do a quick search for them in the page content.
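As a minimal sketch of that search, assuming the page text is already available as a UTF-8 string (the function name and the threshold of 5 hits are made up for the example):

```php
<?php
// Hypothetical helper: does UTF-8 text contain enough common Arabic letters?
function looks_arabic_utf8(string $text, int $min_hits = 5): bool
{
    $alef = "\xD8\xA7"; // U+0627 ARABIC LETTER ALEF as UTF-8 bytes
    $lam  = "\xD9\x84"; // U+0644 ARABIC LETTER LAM as UTF-8 bytes

    // substr_count() compares raw bytes, which is safe here because a valid
    // UTF-8 stream cannot contain these byte pairs as part of another character.
    $hits = substr_count($text, $alef) + substr_count($text, $lam);
    return $hits >= $min_hits;
}
```

Keep in mind that other languages written in the Arabic script (Persian or Urdu, for example) use the same letters, so a match shows the script rather than the language.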
This algorithm worked fine on about 95% of the Arabic sites I tried it on.
I'll put the function here so you can see the process:
function check_url_status($url)
{
    $results_array=array();
    $ok=check_if_exists($url);
    if($ok==1)
    {
        $results_array[0]="Y"; // URL exists

        /* get the IP address of the site */
        $host=parse_url($url,PHP_URL_HOST); // host only: no scheme, port, or path
        if(!$host) // $url has no scheme, so take the part before the first slash
            $host=strtok($url,'/');
        $ip=gethostbyname($host);
        $results_array[2]=$ip;

        /* get the country of the site */
        $country=get_country_from_ip($ip);
        $results_array[3]=$country;

        /* check the site language */
        $site=$url;

        /* check if an Arabic lang tag is found in the site */
        $lang_flag1=$this->search_page($site,'lang=ar');
        $text=$this->text_to_search;
        $text=strtolower($text);
        $lang_flag2=$this->search_text($text,'lang="ar"');
        $lang_flag3=$this->search_text($text,"lang='ar'");
        /*
        the code to check if the lang=ar is in a tag and not in the body of the page goes here
        */

        /* get the encoding */
        $encoding=array();
        $encoding_array_values=array("utf-8","windows-1256","iso","asmo-708","dos"); // UTF-8 plus the Arabic encodings
        $encoding[0]=$this->search_text($text,"charset=utf-8");
        $encoding[1]=$this->search_text($text,"charset=windows-1256");
        $encoding[2]=$this->search_text($text,"charset=iso-8859-6");
        $encoding[3]=$this->search_text($text,"charset=asmo-708");
        $encoding[4]=$this->search_text($text,"charset=dos");

        /* record the first encoding that was found; stays null if none matched */
        $results_array[4]=null;
        for($i=0;$i<5;$i++)
        {
            if($encoding[$i])
            {
                $results_array[4]=$encoding_array_values[$i];
                break;
            }
        }

        $ar_domain=$this->check_if_arabic($site);
        if($ar_domain==true) // Arabic domain
        {
            $results_array[1]="Y";
        }
        elseif(($lang_flag1) || ($lang_flag2) || ($lang_flag3)) // Arabic lang tag found
        {
            $results_array[1]="Y";
        }
        else // Arabic lang tag not found, so check the encoding
        {
            if($encoding[0]) // UTF-8 encoding
            {
                $str="\xD8\xA7"; // alef character (U+0627) as UTF-8 bytes
                $p=$this->search_text($text,$str);
                if($p) // Arabic text found
                {
                    $results_array[1]="Y";
                }
                else // other language found
                {
                    $results_array[1]="N";
                }
            }
            elseif(($encoding[1]) || ($encoding[2]) || ($encoding[3]) || ($encoding[4])) // Arabic encoding
            {
                $results_array[1]="Y";
            }
            else
            {
                $results_array[4]="no arabic encoding";
                $results_array[1]="N";
            }
        }
    }
    else
    {
        $msg="URL does not exist";
        $results_array[0]="N";
    }
    return $results_array;
}
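The function calls check_if_arabic(), which is not shown above. As a rough idea of what such a helper might look like, here is a standalone sketch that tests the domain list from step 1. The name has_arabic_tld and the fallback for URLs without a scheme are assumptions of mine, not the original helper:

```php
<?php
// Hypothetical stand-in for check_if_arabic(): true when the host ends in one
// of the country-code domains listed in step 1.
function has_arabic_tld(string $url): bool
{
    $arabic_tlds = ['ps','sy','bh','dz','eg','iq','jo','kw','lb','ly','ma',
                    'om','qa','sa','sd','so','ae','tn','ye','mr'];

    $host = parse_url($url, PHP_URL_HOST);
    if (!$host) {
        // No scheme given: treat everything before the first slash as the host.
        $host = explode('/', $url)[0];
    }
    // Compare the last dot-separated label of the host, case-insensitively.
    $labels = explode('.', strtolower(rtrim($host, '.')));
    return in_array(end($labels), $arabic_tlds, true);
}
```

For example, has_arabic_tld('http://www.example.eg/news') is true, while has_arabic_tld('example.com/page') is false.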
I hope this helps.