
Is there a clean Wikipedia API just for retrieving a content summary?

big-blog 2020. 6. 25. 07:33


I just need to retrieve the first paragraph of a Wikipedia page. The content must be HTML-formatted and ready to display on my website (so no BBCode or Wikipedia-specific markup).


There is a way to get the entire "intro section" without any HTML parsing! Similar to AnthonyS's answer, with an additional explaintext parameter you can get the intro section text as plain text.

Query

Getting the Stack Overflow intro as plain text:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=Stack%20Overflow

JSON response

(warnings stripped)

{
    "query": {
        "pages": {
            "21721040": {
                "pageid": 21721040,
                "ns": 0,
                "title": "Stack Overflow",
                "extract": "Stack Overflow is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky, as a more open alternative to earlier Q&A sites such as Experts Exchange. The name for the website was chosen by voting in April 2008 by readers of Coding Horror, Atwood's popular programming blog.\nIt features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions, and, through membership and active participation, to vote questions and answers up or down and edit questions and answers in a fashion similar to a wiki or Digg. Users of Stack Overflow can earn reputation points and \"badges\"; for example, a person is awarded 10 reputation points for receiving an \"up\" vote on an answer given to a question, and can receive badges for their valued contributions, which represents a kind of gamification of the traditional Q&A site or forum. All user-generated content is licensed under a Creative Commons Attribute-ShareAlike license. Questions are closed in order to allow low quality questions to improve. Jeff Atwood stated in 2010 that duplicate questions are not seen as a problem but rather they constitute an advantage if such additional questions drive extra traffic to the site by multiplying relevant keyword hits in search engines.\nAs of April 2014, Stack Overflow has over 2,700,000 registered users and more than 7,100,000 questions. Based on the type of tags assigned to questions, the top eight most discussed topics on the site are: Java, JavaScript, C#, PHP, Android, jQuery, Python and HTML."
            }
        }
    }
}

Documentation: API: query/prop=extracts


Edit: Added &redirects=1 as recommended in the comments.
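The response above nests the extract under a page ID that is not known in advance. A minimal sketch of how the query could be built and the extract pulled out, assuming a runtime with global URLSearchParams and fetch; the function names buildIntroUrl, firstExtract and fetchIntro are illustrative, not part of any API:

```javascript
// Build the extracts query shown above for an arbitrary title.
function buildIntroUrl(title) {
  const params = new URLSearchParams({
    format: "json",
    action: "query",
    prop: "extracts",
    exintro: "1",     // flag parameter: only the intro (section 0)
    explaintext: "1", // flag parameter: plain text instead of HTML
    redirects: "1",
    titles: title,
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}

// "pages" is keyed by page ID, so take the first value
// instead of hard-coding the ID.
function firstExtract(response) {
  const pages = (response.query && response.query.pages) || {};
  const first = Object.values(pages)[0];
  return first && typeof first.extract === "string" ? first.extract : null;
}

// Hypothetical wrapper (assumption: global fetch is available).
async function fetchIntro(title) {
  const res = await fetch(buildIntroUrl(title));
  return firstExtract(await res.json());
}
```

firstExtract returns null when the page is missing or the response has no extract, so callers can fall back to their own placeholder text.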


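There is actually a very nice prop called extracts that can be used with queries and is designed specifically for this purpose. Extracts let you get article extracts (truncated article text). There is a parameter called exintro that can be used to retrieve the text of the zeroth section (without additional assets such as images or infoboxes). You can also retrieve extracts at a finer granularity, such as a certain number of characters (exchars) or a certain number of sentences (exsentences).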

Here is a sample query, http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow, and the API sandbox, http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow, for experimenting further with this query.

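Note that if you specifically want the first paragraph, you still need to do some additional parsing, as suggested in the chosen answer. The difference is that the response returned by this query is shorter than those of some of the other suggested API queries, because there are no additional assets such as images in the API response to parse.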
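The exchars and exsentences limits described above can be wired into a small URL builder. This is only a sketch: the function name buildExtractUrl and its options object are assumptions for illustration, and treating the two limits as alternatives is a choice made here rather than something stated in the answer.

```javascript
// Build an extracts query with an optional size limit.
// options.sentences -> exsentences, options.chars -> exchars (hypothetical option names).
function buildExtractUrl(title, options = {}) {
  const params = new URLSearchParams({
    action: "query",
    prop: "extracts",
    format: "json",
    exintro: "1",
    titles: title,
  });
  // This sketch treats the two limits as alternatives and refuses both at once.
  if (options.sentences !== undefined && options.chars !== undefined) {
    throw new Error("Pass either sentences or chars, not both");
  }
  if (options.sentences !== undefined) {
    params.set("exsentences", String(options.sentences));
  }
  if (options.chars !== undefined) {
    params.set("exchars", String(options.chars));
  }
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}
```

For example, buildExtractUrl("Stack Overflow", { sentences: 2 }) produces a query that asks for the first two sentences of the intro.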


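Since 2017, Wikipedia has provided a REST API with better caching. In the documentation you can find the following API, which fits your use case perfectly (it is used by the new Page Previews feature).

https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow returns the following data, which can be used to display a summary with a small thumbnail: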

{
  "type": "standard",
  "title": "Stack Overflow",
  "displaytitle": "Stack Overflow",
  "extract": "Stack Overflow is a question and answer site for professional and enthusiast programmers. It is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer sites such as Experts-Exchange. The name for the website was chosen by voting in April 2008 by readers of Coding Horror, Atwood's popular programming blog.",
  "extract_html": "<p><b>Stack Overflow</b> is a question and answer site for professional and enthusiast programmers. It is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer sites such as Experts-Exchange. The name for the website was chosen by voting in April 2008 by readers of <i>Coding Horror</i>, Atwood's popular programming blog.</p>",
  "namespace": {
    "id": 0,
    "text": ""
  },
  "wikibase_item": "Q549037",
  "titles": {
    "canonical": "Stack_Overflow",
    "normalized": "Stack Overflow",
    "display": "Stack Overflow"
  },
  "pageid": 21721040,
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/en/thumb/f/fa/Stack_Overflow_homepage%2C_Feb_2017.png/320px-Stack_Overflow_homepage%2C_Feb_2017.png",
    "width": 320,
    "height": 149
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/en/f/fa/Stack_Overflow_homepage%2C_Feb_2017.png",
    "width": 462,
    "height": 215
  },
  "lang": "en",
  "dir": "ltr",
  "revision": "902900099",
  "tid": "1a9cdbc0-949b-11e9-bf92-7cc0de1b4f72",
  "timestamp": "2019-06-22T03:09:01Z",
  "description": "website hosting questions and answers on a wide range of topics in computer programming",
  "content_urls": {
    "desktop": {
      "page": "https://en.wikipedia.org/wiki/Stack_Overflow",
      "revisions": "https://en.wikipedia.org/wiki/Stack_Overflow?action=history",
      "edit": "https://en.wikipedia.org/wiki/Stack_Overflow?action=edit",
      "talk": "https://en.wikipedia.org/wiki/Talk:Stack_Overflow"
    },
    "mobile": {
      "page": "https://en.m.wikipedia.org/wiki/Stack_Overflow",
      "revisions": "https://en.m.wikipedia.org/wiki/Special:History/Stack_Overflow",
      "edit": "https://en.m.wikipedia.org/wiki/Stack_Overflow?action=edit",
      "talk": "https://en.m.wikipedia.org/wiki/Talk:Stack_Overflow"
    }
  },
  "api_urls": {
    "summary": "https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow",
    "metadata": "https://en.wikipedia.org/api/rest_v1/page/metadata/Stack_Overflow",
    "references": "https://en.wikipedia.org/api/rest_v1/page/references/Stack_Overflow",
    "media": "https://en.wikipedia.org/api/rest_v1/page/media/Stack_Overflow",
    "edit_html": "https://en.wikipedia.org/api/rest_v1/page/html/Stack_Overflow",
    "talk_page_html": "https://en.wikipedia.org/api/rest_v1/page/html/Talk:Stack_Overflow"
  }
}

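By default it follows redirects (so /api/rest_v1/page/summary/StackOverflow also works), but this can be disabled with ?redirect=false

If you need to access the API from another domain, you can set the CORS header with &origin= (for example &origin=*)

2019 update: The API seems to return more useful information about the page.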
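As a rough sketch of how the summary endpoint above could be used to show a summary with a small thumbnail: the helper names summaryUrl, escapeHtml and renderSummary, as well as the wiki-summary class name, are made up for this example, and only fields visible in the sample response are read.

```javascript
// Build the REST summary URL for a title; optionally disable redirect following.
function summaryUrl(title, followRedirects = true) {
  // The path uses underscores for spaces; encodeURIComponent escapes
  // characters such as "/" or "?" that would otherwise change the path.
  const path = encodeURIComponent(title.trim().replace(/ /g, "_"));
  const url = "https://en.wikipedia.org/api/rest_v1/page/summary/" + path;
  return followRedirects ? url : url + "?redirect=false";
}

function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Turn a summary response into an HTML snippet.
function renderSummary(summary) {
  const t = summary.thumbnail;
  const thumb = t
    ? `<img src="${escapeHtml(t.source)}" width="${Number(t.width)}" height="${Number(t.height)}" alt="">`
    : "";
  // extract_html is HTML produced by the API, so it is inserted as-is.
  return `<div class="wiki-summary">${thumb}<h3>${escapeHtml(summary.title)}</h3>${summary.extract_html}</div>`;
}
```

Pages without an image have no thumbnail field, which is why renderSummary treats it as optional.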


This code allows you to retrieve the content of the first paragraph of the page as plain text.

Parts of this answer come from here and thus here. See MediaWiki API documentation for more information.

// action=parse: get parsed text
// page=Baseball: from the page Baseball
// format=json: in json format
// prop=text: send the text content of the article
// section=0: top content of the page

$url = 'http://en.wikipedia.org/w/api.php?format=json&action=parse&page=Baseball&prop=text&section=0';
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_USERAGENT, "TestScript"); // required by wikipedia.org server; use YOUR user agent with YOUR contact information. (otherwise your IP might get blocked)
$c = curl_exec($ch);

$json = json_decode($c);

$content = $json->{'parse'}->{'text'}->{'*'}; // get the main text content of the query (it's parsed HTML)

// pattern for first match of a paragraph
$pattern = '#<p>(.*)</p>#Us'; // http://www.phpbuilder.com/board/showthread.php?t=10352690
if(preg_match($pattern, $content, $matches))
{
    // print $matches[0]; // content of the first paragraph (including wrapping <p> tag)
    print strip_tags($matches[1]); // Content of the first paragraph without the HTML tags.
}
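For readers working in JavaScript rather than PHP, the same "first paragraph" step can be sketched as below. This is an illustrative equivalent, not a port of the exact code above: the function name firstParagraphText is made up, and the pattern is slightly looser than the PHP one because it also accepts attributes on the <p> tag.

```javascript
// Return the text of the first <p>...</p> in an HTML string, or null if none.
function firstParagraphText(html) {
  // Non-greedy match; [\s\S] lets the paragraph span several lines.
  const match = /<p\b[^>]*>([\s\S]*?)<\/p>/i.exec(html);
  if (!match) return null;
  // Drop the remaining tags, similar in spirit to PHP's strip_tags.
  return match[1].replace(/<[^>]*>/g, "").trim();
}
```

Like the PHP version, this is regex-based and does not decode HTML entities; a real HTML parser would be more robust for anything beyond simple markup.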

Yes, there is. For example, if you wanted to get the content of the first section of the article Stack Overflow, use a query like this:

http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=revisions&titles=Stack%20Overflow&rvprop=content&rvsection=0&rvparse

The parts mean this:

  • format=xml: Return the result formatted as XML. Other options (like JSON) are available. This does not affect the format of the page content itself, only the enclosing data format.

  • action=query&prop=revisions: Get information about the revisions of the page. Since we don't specify which revision, the latest one is used.

  • titles=Stack%20Overflow: Get information about the page Stack Overflow. It's possible to get the text of more pages in one go, if you separate their names by |.

  • rvprop=content: Return the content (or text) of the revision.

  • rvsection=0: Return only content from section 0.

  • rvparse: Return the content parsed as HTML.

Keep in mind that this returns the whole first section including things like hatnotes (“For other uses …”), infoboxes or images.
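The parameters listed above can be assembled programmatically, including the "|" separator for several titles. A minimal sketch, assuming a runtime with a global URLSearchParams; the function name sectionZeroUrl is hypothetical:

```javascript
// Build the revisions query described above for one or more page titles.
function sectionZeroUrl(titles, format = "xml") {
  const params = new URLSearchParams({
    format: format,            // enclosing data format (xml, json, ...)
    action: "query",
    prop: "revisions",
    titles: titles.join("|"),  // several pages are separated by |
    rvprop: "content",
    rvsection: "0",
    rvparse: "1",              // flag parameter: return parsed HTML
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}
```

URLSearchParams takes care of escaping, so titles containing spaces or the separator character end up correctly encoded in the query string.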

There are several libraries available for various languages that make working with the API easier; you may be better off using one of them.


This is the code I'm using right now for a website I'm making that needs to get the leading paragraphs / summary / section 0 off Wikipedia articles, and it's all done within the browser (client-side JavaScript) thanks to the magic of JSONP! --> http://jsfiddle.net/gautamadude/HMJJg/1/

It uses the Wikipedia API to get the leading paragraphs (called section 0) in HTML like so: http://en.wikipedia.org/w/api.php?format=json&action=parse&page=Stack_Overflow&prop=text&section=0&callback=?

It then strips the HTML and other undesired data, giving you a clean string of an article summary. With a little tweaking you can get a "p" HTML tag around the leading paragraphs, but right now there is just a newline character between them.

Code:

var url = "http://en.wikipedia.org/wiki/Stack_Overflow";
var title = url.split("/").slice(4).join("/");

//Get Leading paragraphs (section 0)
$.getJSON("http://en.wikipedia.org/w/api.php?format=json&action=parse&page=" + title + "&prop=text&section=0&callback=?", function (data) {
    for (var key in data.parse.text) { // declare the loop key instead of reusing "text" as an implicit global
        var text = data.parse.text[key].split("<p>");
        var pText = "";

        for (var p in text) {
            //Remove html comment
            text[p] = text[p].split("<!--");
            if (text[p].length > 1) {
                text[p][0] = text[p][0].split(/\r\n|\r|\n/);
                text[p][0] = text[p][0][0];
                text[p][0] += "</p> ";
            }
            text[p] = text[p][0];

            //Construct a string from paragraphs
            if (text[p].indexOf("</p>") == text[p].length - 5) {
                var htmlStrip = text[p].replace(/<(?:.|\n)*?>/gm, '') //Remove HTML
                var splitNewline = htmlStrip.split(/\r\n|\r|\n/); //Split on newlines
                for (var newline in splitNewline) {
                    if (splitNewline[newline].substring(0, 11) != "Cite error:") {
                        pText += splitNewline[newline];
                        pText += "\n";
                    }
                }
            }
        }
        pText = pText.substring(0, pText.length - 2); //Remove extra newline
        pText = pText.replace(/\[\d+\]/g, ""); //Remove reference tags (e.x. [1], [4], etc)
        document.getElementById('textarea').value = pText
        document.getElementById('div_text').textContent = pText
    }
});

This URL will return a summary in XML format.

http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=Agra&MaxHits=1

I have created a function to fetch the description of a keyword from Wikipedia.

function getDescription($keyword){
    $url='http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString='.urlencode($keyword).'&MaxHits=1';
    $xml=simplexml_load_file($url);
    return $xml->Result->Description;
}
echo getDescription('agra');

You can also get content such as the first paragraph via DBPedia, which takes Wikipedia content, creates structured information from it (RDF), and makes this available via an API. The DBPedia API is a SPARQL one (RDF-based), but it outputs JSON and is pretty easy to wrap.

As an example here's a super simple JS library named WikipediaJS that can extract structured content including a summary first paragraph: http://okfnlabs.org/wikipediajs/

You can read more about it in this blog post: http://okfnlabs.org/blog/2012/09/10/wikipediajs-a-javascript-library-for-accessing-wikipedia-article-information.html

The JS library code can be found here: https://github.com/okfn/wikipediajs/blob/master/wikipedia.js


The abstract.xml.gz dump sounds like the one you want.


My approach was as follows (in PHP):

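$url = "whatever_you_need"; // the search term

// urlencode() keeps spaces and special characters from breaking the query string
$html = file_get_contents('https://en.wikipedia.org/w/api.php?action=opensearch&search='.urlencode($url));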
$utf8html = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $html), ENT_NOQUOTES, 'UTF-8');

$utf8html might need further cleaning, but that's basically it.


I tried @Michael Rapadas' and @Krinkle's solutions, but in my case I had trouble finding some articles depending on the capitalization. For example:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&exsentences=1&explaintext=&titles=Led%20zeppelin

Note I truncated the response with exsentences=1

Apparently "title normalization" was not working correctly:

Title normalization converts page titles to their canonical form. This means capitalizing the first character, replacing underscores with spaces, and changing namespace to the localized form defined for that wiki. Title normalization is done automatically, regardless of which query modules are used. However, any trailing line breaks in page titles (\n) will cause odd behavior and they should be stripped out first.

I know I could have sorted out the capitalization issue easily, but there was also the inconvenience of having to cast the object to an array.

So, because I really only wanted the very first paragraph of a well-known, well-defined search (with no risk of fetching info from other articles), I did it like this:

https://en.wikipedia.org/w/api.php?action=opensearch&search=led%20zeppelin&limit=1&format=json

Note in this case I did the truncation with limit=1

This way:

  1. I can access the response data very easily.
  2. The response is quite small.

But we still have to be careful with the capitalization of our search.

More info: https://www.mediawiki.org/wiki/API:Opensearch
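The opensearch response is a plain array of the form [search term, titles, descriptions, URLs], which is what makes it easy to access. A small sketch of unpacking it follows; the function name parseOpenSearch and the sample data are illustrative only, and the descriptions array may contain empty strings depending on the wiki's configuration.

```javascript
// Convert an opensearch response into a list of { title, description, url } objects.
function parseOpenSearch(response) {
  // Element 0 is the search term itself and is skipped here.
  const [, titles = [], descriptions = [], urls = []] = response;
  return titles.map((title, i) => ({
    title: title,
    description: descriptions[i] || "",
    url: urls[i] || "",
  }));
}

// Illustrative data shaped like a limit=1 response (not a captured response).
const sampleResponse = [
  "led zeppelin",
  ["Led Zeppelin"],
  ["Example description text."],
  ["https://en.wikipedia.org/wiki/Led_Zeppelin"],
];
const results = parseOpenSearch(sampleResponse);
```

With limit=1 the result list has at most one entry, so results[0] is either the match or undefined when nothing was found.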


If you are just looking for the text, which you can then split up, but don't want to use the API, take a look at en.wikipedia.org/w/index.php?title=Elephant&action=raw

Reference URL: https://stackoverflow.com/questions/8555320/is-there-a-clean-wikipedia-api-just-for-retrieve-content-summary
